In [None]:
!pip install -U pandas polars duckdb pyarrow

# Pandas 2

New features of Pandas 2 include:

- Pyarrow integration
  - Speed up IO and calculations
  - Support for missing values
- Copy-on-write

Overlooked features of the Pandas API:

- Ponder (Snowflake) scales out Pandas code
- CuDF scales out to GPUs


## DataFrame Methods in Pandas 2

### 1. Data Manipulation and Transformation:
- **Arithmetic Operations**: `add`, `sub`, `mul`, `div`, `floordiv`, `mod`, `pow`, `dot`, `cummax`, `cummin`, `cumprod`, `cumsum`
- **Data Transformation**: `apply`, `applymap`, `transform`, `map`, `agg`, `aggregate`
- **Type Conversion**: `astype`, `convert_dtypes`, `infer_objects`
- **Data Reshaping**: `melt`, `pivot`, `pivot_table`, `stack`, `unstack`, `explode`, `get_dummies`
- **String Manipulation**: `add_prefix`, `add_suffix`
- **Data Combining**: `combine`, `combine_first`, `merge`, `join`
- **Data Cleaning**: `replace`, `drop`, `drop_duplicates`, `filter`, `clip`, `mask`, `where`, `truncate`
- **Data Sampling and Randomization**: `sample`
- **Data Sorting and Ordering**: `sort_values`, `sort_index`, `nsmallest`, `nlargest`

### 2. Data Retrieval and Indexing:
- **Selection and Indexing**: `loc`, `iloc`, `iat`, `at`, `xs`, `get`, `item`, `pop`, `query`
- **Data Retrieval**: `head`, `tail`

### 3. Aggregation and Descriptive Statistics:
- `sum`, `mean`, `median`, `min`, `max`, `mode`, `std`, `var`, `skew`, `kurt`, `kurtosis`, `quantile`, `count`, `corr`, `cov`, `rank`, `pct_change`, `sem`, `all`, `any`, `first`, `last`

### 4. Grouping and Aggregation:
- `groupby`, `resample`, `expanding`, `ewm`, `rolling`

### 5. Handling Missing Values:
- `isna`, `isnull`, `notna`, `notnull`, `dropna`, `fillna`, `interpolate`

### 6. Data Inspection and Information:
- `info`, `describe`, `shape`, `size`, `ndim`, `empty`, `memory_usage`, `dtypes`, `axes`

### 7. File I/O and Serialization:
- **To File**: `to_csv`, `to_excel`, `to_json`, `to_html`, `to_latex`, `to_markdown`, `to_string`, `to_clipboard`, `to_feather`, `to_parquet`, `to_stata`, `to_gbq`, `to_orc`, `to_pickle`, `to_sql`, `to_records`, `to_dict`, `to_xarray`, `to_xml`
- **From File/Records**: `from_dict`, `from_records`

### 8. Time Series and Date Handling:
- `asfreq`, `asof`, `to_period`, `to_timestamp`, `tz_convert`, `tz_localize`, `at_time`, `between_time`, `resample`

### 9. Data Alignment and Missing Value Handling:
- `align`, `backfill`, `bfill`, `ffill`, `pad`, `reindex`, `reindex_like`

### 10. Data Styling and Visualization:
- `style`, `plot`, `boxplot`, `hist`

### 11. Utility and Miscellaneous:
- `copy`, `equals`, `isin`, `duplicated`, `idxmax`, `idxmin`, `first_valid_index`, `last_valid_index`, `keys`, `items`, `iterrows`, `itertuples`, `set_axis`, `set_flags`, `set_index`, `reset_index`, `reorder_levels`, `swapaxes`, `swaplevel`, `update`, `pipe`, `squeeze`, `compare`, `value_counts`
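
To make a few of the methods listed above concrete, here is a minimal sketch on a tiny hand-made frame. The `cars` data is hypothetical and only mimics the column names of the vehicles dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the vehicles dataset.
cars = pd.DataFrame({
    'make': ['Ford', 'Toyota', 'Tesla'],
    'city08': [18, 28, 120],
    'highway08': [25, 36, 110],
})

# melt: wide -> long, one row per (make, metric) pair
long = cars.melt(id_vars='make', var_name='metric', value_name='mpg')

# pivot: long -> wide again, indexed by make
wide = long.pivot(index='make', columns='metric', values='mpg')

# nlargest: the two rows with the highest city mileage
top2 = cars.nlargest(2, 'city08')

print(long)
print(wide)
print(top2)
```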



## Pandas Functions
Features in the `pd` namespace:

### 1. Data Manipulation and Transformation:
- **Data Reshaping**: `lreshape`, `wide_to_long`, `crosstab`
- **Combining and Merging**: `concat`
- **Conversion and Casting**: `to_datetime`, `to_timedelta`, `to_numeric`
- **Factorization and Binning**: `factorize`, `cut`, `qcut`
- **Dummy Variable Encoding**: `get_dummies`, `from_dummies`

### 2. Data Loading and IO Operations:
- **Reading Data**: `read_csv`, `read_excel`, `read_json`, `read_html`, `read_sql`, `read_sql_query`, `read_sql_table`, `read_parquet`, `read_pickle`, `read_hdf`, `read_feather`, `read_stata`, `read_sas`, `read_spss`, `read_gbq`, `read_orc`, `read_fwf`, `read_clipboard`, `read_table`, `read_xml`
- **Saving Data**: `to_pickle`

### 3. Options and Configuration:
- `set_option`, `get_option`, `reset_option`, `describe_option`, `option_context`, `set_eng_float_format`

### 4. Data Inspection and Information:
- `show_versions`, `testing`, `test`


### 5. Time Series and Date Handling:
- `date_range`, `bdate_range`, `period_range`, `timedelta_range`, `infer_freq`, `offsets`
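
A short sketch of three of the `pd` namespace functions listed above (`cut`, `to_numeric`, `date_range`). The input values and bin edges are invented for illustration.

```python
import pandas as pd

# cut: bin numeric values into labeled, right-closed intervals
mpg = pd.Series([12, 18, 25, 40])
bins = pd.cut(mpg, bins=[0, 15, 30, 100], labels=['low', 'mid', 'high'])

# to_numeric: convert strings to numbers, turning bad values into NaN
nums = pd.to_numeric(pd.Series(['1', '2', 'x']), errors='coerce')

# date_range: a fixed-frequency DatetimeIndex
days = pd.date_range('2024-01-01', periods=3, freq='D')

print(bins.tolist())
print(nums.tolist())
print(days)
```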

# Load Data

In [None]:
import numpy as np
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df_pd = pd.read_csv(url,
                 engine='pyarrow', dtype_backend='pyarrow')

## Exercise API

In [1]:
def make_to_origin(make):
    """
    Convert car make to country of origin.
    
    Args:
        make (str): Car make.
        
    Returns:
        str: Country of origin.
    """
    # Dictionary mapping car makes to countries of origin
    origin_dict = {
        'Chevrolet': 'USA',
        'Ford': 'USA',
        'Dodge': 'USA',
        'GMC': 'USA',
        'Toyota': 'Japan',
        'BMW': 'Germany',
        'Mercedes-Benz': 'Germany',
        'Nissan': 'Japan',
        'Volkswagen': 'Germany',
        'Mitsubishi': 'Japan',
        'Porsche': 'Germany',
        'Mazda': 'Japan',
        'Audi': 'Germany',
        'Honda': 'Japan',
        'Jeep': 'USA',
        'Pontiac': 'USA',
        'Subaru': 'Japan',
        'Volvo': 'Sweden',
        'Hyundai': 'South Korea',
        'Chrysler': 'USA',
        'Tesla': 'USA'
    }
    
    return origin_dict.get(make, "Unknown")

(df_pd
 .assign(origin=lambda df: df['make'].apply(make_to_origin),
         # replace EST and EDT with their UTC offsets in createdOn
         createdOn=lambda df: (df['createdOn']
                               .str.replace('EDT', '-04:00')
                               .str.replace('EST', '-05:00')))
 .assign(
     # parse strings such as "Tue Jan 01 00:00:00 -05:00 2013" with a
     # strptime-style format; the offsets are mixed, so utc=True is needed
     createdOn=lambda df: pd.to_datetime(df['createdOn'],
                                         format='%a %b %d %H:%M:%S %z %Y',
                                         utc=True))
 .query('origin != "Unknown" and year < 2020')
 .loc[:, ['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn']]
 .groupby(['origin', 'year'])
 .city08
 .mean()
 .unstack('origin')
 .plot(title='Average Mileage by Year and Country of Origin')
)
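
The `createdOn` handling in the chain above can be tried in isolation. The sketch below uses two hand-written timestamps in the same layout as the dataset (the values themselves are illustrative) and shows why the abbreviations are replaced before parsing.

```python
import pandas as pd

# Illustrative strings in the same layout as the createdOn column.
raw = pd.Series([
    'Tue Jan 01 00:00:00 EST 2013',
    'Mon Jul 01 00:00:00 EDT 2013',
])

# %z expects a numeric offset, so swap the abbreviations for offsets first.
with_offsets = (raw
    .str.replace('EDT', '-04:00')
    .str.replace('EST', '-05:00')
)

# The rows carry different offsets; utc=True normalizes them to one
# timezone-aware dtype.
parsed = pd.to_datetime(with_offsets,
                        format='%a %b %d %H:%M:%S %z %Y',
                        utc=True)

# Convert back to local wall-clock time if needed.
new_york = parsed.dt.tz_convert('America/New_York')

print(parsed)
print(new_york)
```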


In [None]:
df_pd.to_csv('vehicles-pd.csv', index=False)

## Exponential Growth

In [None]:
# I invest $1 and it grows by 1% every day for 10,000 days.
# How would I calculate the value with pandas?

def compound_growth(start, rate, periods):
    result = pd.Series([start]*periods, dtype='float64[pyarrow]')
    for i in range(1, periods):
        result[i] = result[i-1] * (1+rate)
    return result

compound_growth(1, 0.01, 10_000)

In [None]:
%%timeit
compound_growth(1, 0.01, 10_000)

In [None]:
# numpy version
def compound_growth_np(start, rate, periods):
    result = np.empty(periods, dtype='float64')
    result[0] = start
    for i in range(1, periods):
        result[i] = result[i-1] * (1+rate)
    return result


In [None]:
%%timeit
compound_growth_np(1, 0.01, 10_000)

In [None]:
# convert compound_growth to a numba function with types
import numba
import numpy as np

@numba.njit('float64[:](float64, float64, int64)')
def compound_growth_nb(start, rate, periods):
    result = np.empty(periods, dtype='float64')
    result[0] = start
    for i in range(1, periods):
        result[i] = result[i-1] * (1+rate)
    return result


In [None]:
%%timeit
compound_growth_nb(1, 0.01, 10_000)

In [None]:
import cython
cython.__version__

In [None]:
%load_ext cython

In [None]:
%%cython

import numpy as np
cimport numpy as cnp 
cnp.import_array()

DTYPE = np.float64
ctypedef cnp.float64_t DTYPE_t

cimport cython

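# Use C doubles so the arguments match NumPy float64; a C float would
# round the rate to 32 bits and the results would drift from the other versions.
def compound_growth_cy(double start, double rate, int periods):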
    # https://cython.readthedocs.io/en/latest/src/tutorial/numpy.html#:~:text=Efficient%20indexing-,%C2%B6,-There%E2%80%99s%20still%20a
    cdef cnp.ndarray[DTYPE_t, ndim=1] result = np.zeros(periods, dtype=DTYPE)
    #cdef cnp.ndarray result = np.zeros(periods, dtype=DTYPE)
    result[0] = start
    cdef int i
    cdef DTYPE_t value 
    for i in range(1, periods):
        value = result[i-1] * (1+rate)
        result[i] = value
    return result


In [None]:
%%timeit
compound_growth_cy(1, 0.01, 10_000)

In [None]:
pd.Series(compound_growth_cy(1, 0.01, 10_000))

In [None]:
pd.Series(compound_growth_nb(1, 0.01, 10_000))
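
The loops above can also be avoided entirely, because the value in period `i` has the closed form `start * (1 + rate) ** i`. The function below is a sketch of that alternative (the name `compound_growth_vec` is introduced here for illustration); it is not one of the versions timed above.

```python
import numpy as np
import pandas as pd

def compound_growth_vec(start, rate, periods):
    """Vectorized compound growth: element i equals start * (1 + rate) ** i."""
    # np.arange supplies the exponents 0, 1, ..., periods - 1 in one array,
    # so NumPy evaluates the whole power expression without a Python loop.
    return pd.Series(start * (1 + rate) ** np.arange(periods))

values = compound_growth_vec(1, 0.01, 10_000)
print(values.head())
```

Results can differ from the loop versions in the last few floating-point digits, since the loop accumulates rounding error across repeated multiplications.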

## Copy on Write

In [None]:
pd.options.mode.copy_on_write = True
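
A minimal sketch of what the option changes, assuming pandas 2.x where copy-on-write is opt-in: a column selected from a DataFrame shares memory with its parent until one of them is written to, and the write then affects only the object being modified.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({'a': [1, 2, 3]})

col = df['a']       # shares data with df until one of them is modified
col.iloc[0] = 100   # the write triggers a copy; df is left untouched

print(df)
print(col)
```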

In [None]:
!pip install psutil

In [None]:
import psutil

import os

class MemoryTracker:
    def __init__(self):
        self.previous_memory = self._get_process_memory()

    def _get_process_memory(self):
        process = psutil.Process(os.getpid())
        memory_info = process.memory_info()
        return memory_info.rss / (1024 ** 2)  # Convert bytes to megabytes

    def __call__(self, df, txt=''):
        current_memory = self._get_process_memory()
        memory_growth = current_memory - self.previous_memory
        print(f"{txt} Process memory usage: {current_memory:.2f} MB (growth: {memory_growth:.2f} MB)")
        self.previous_memory = current_memory
        return df


In [None]:
mt = MemoryTracker()


(df_pd
 .pipe(mt, txt='Start')
 .assign(origin=lambda df: df['make'].apply(make_to_origin),
         # replace EST and EDT with offset in createdOn
        createdOn=lambda df: df['createdOn'].str.replace('EDT', '-04:00').str.replace('EST', '-05:00')
 )
 .pipe(mt, txt='assign')
 .assign(
        # parse strings such as "Tue Jan 01 00:00:00 -05:00 2013" with a strptime-style format
        # the offsets are mixed, so utc=True is needed
        createdOn=lambda df: pd.to_datetime(df['createdOn'], format='%a %b %d %H:%M:%S %z %Y', utc=True),
        )
 .pipe(mt, txt='assign2')
 .query('origin != "Unknown" and year < 2020')
 .pipe(mt, txt='query')
 .loc[:, ['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn']]
 .pipe(mt, txt='loc')
 .groupby(['origin', 'year'])
 .city08
 .mean()
 .pipe(mt, txt='grouping')
 .unstack('origin')
 .pipe(mt, txt='unstack')
  #.plot(title='Average Mileage by Year and Country of Origin')
 )

In [None]:
pd.options.mode.copy_on_write = False
mt = MemoryTracker()


(df_pd
 .pipe(mt, txt='Start')
 .assign(origin=lambda df: df['make'].apply(make_to_origin),
         # replace EST and EDT with offset in createdOn
        createdOn=lambda df: df['createdOn'].str.replace('EDT', '-04:00').str.replace('EST', '-05:00')
 )
 .pipe(mt, txt='assign')
 .assign(
        # parse strings such as "Tue Jan 01 00:00:00 -05:00 2013" with a strptime-style format
        # the offsets are mixed, so utc=True is needed
        createdOn=lambda df: pd.to_datetime(df['createdOn'], format='%a %b %d %H:%M:%S %z %Y', utc=True),
        )
 .pipe(mt, txt='assign2')
 .query('origin != "Unknown" and year < 2020')
 .pipe(mt, txt='query')
 .loc[:, ['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn']]
 .pipe(mt, txt='loc')
 .groupby(['origin', 'year'])
 .city08
 .mean()
 .pipe(mt, txt='grouping')
 .unstack('origin')
 .pipe(mt, txt='unstack')
  #.plot(title='Average Mileage by Year and Country of Origin')
 )

# Pandas Exercises

1. **Basic DataFrame Operations**
   - Show the shape of the `df_pd` DataFrame.
   - Print the first 5 rows of the `df_pd` DataFrame.
   - Print the last 5 rows of the `df_pd` DataFrame.
   - Print the list of columns in the `df_pd` DataFrame.

2. **Data Exploration**
   - Print the number of unique values in each column of the `df_pd` DataFrame.
   - Print the number of null values in each column of the `df_pd` DataFrame.
   - Print the mean and standard deviation of the 'city08' column of the `df_pd` DataFrame.
   - Print the median and 75th percentile of the 'city08' column of the `df_pd` DataFrame.

3. **String Manipulation**
   - Convert the 'make' column of the `df_pd` DataFrame to upper case.
   - Combine the 'year' and 'make' columns of the `df_pd` DataFrame into a new column called 'year_make'.

4. **Datetime Conversion**
   - Convert the 'createdOn' column to the New York timezone.

5. **Data Filtering**
   - Filter the `df_pd` DataFrame to only include rows where the 'make' column is 'Ford'.
   - Filter the data to only include rows where the 'model' column is a single word.
   - Filter the rows where the city mileage is above the 75th percentile of the city mileage values.

6. **Grouping and Aggregation**
   - Find the average mileage for Ford, Tesla, and Toyota vehicles.
   - Find the average mileage by year and make.
