In [None]:
!pip install -U polars duckdb pyarrow

## Polars

- Addresses "10 Things I Hate About pandas" - https://wesmckinney.com/blog/apache-arrow-pandas-internals/
- Has Series and DataFrame.
- Core in Rust
- Supports multicore
- Supports streaming (larger than RAM)
- Lazy (query planning)

In [None]:
import polars as pl

In [None]:
pl.__version__

In [None]:
ser1 = pl.Series(['matt', 'fred', 'suzy'])
ser2 = pl.Series([42, 43, 44])
df1 = pl.DataFrame({'name': ser1, 'age': ser2,
                    'pet': ['cat', 'dog', 'bird']
                    })

df1

In [None]:
df1_pd = df1.to_pandas()
df1_pd

In [None]:
# common methods
print(sorted(set(dir(df1_pd)) & set(dir(df1))))

## Contexts and Expressions

Polars contexts are used to evaluate expressions. 

### Contexts

Several types of contexts are available:

- Selection Contexts - used to select columns:
  - `df.select('column_name')` - only has the selected column
  - `df.with_columns(['column_name_1', 'column_name_2'])` - adds the selected columns to the DataFrame
- Filter Contexts - used to filter rows:
    - `df.filter(pl.col('column_name') > 0)` - filters the DataFrame to only include rows where the given expression is true
- Aggregation Contexts - used to aggregate data:
  - `df.group_by('column_name').agg(...)` - groups the rows by the given column and aggregates each group with the given expressions (`groupby` is the older, deprecated spelling)
- Join Contexts - used to join data:
    - `df.join(df2, left_on='column_name', right_on='column_name')` - joins the two DataFrames on the given column names
    
## DataFrame Methods    

### 1. Data Manipulation and Transformation:
- **Apply Functions**: `apply`, `map_rows`
- **Aggregation**: `max`, `min`, `sum`, `mean`, `median`, `std`, `var`, `product`, `n_unique`, `null_count`, `count`
- **Sorting and Ordering**: `sort`, `top_k`, `bottom_k`, `set_sorted`
- **Filtering and Selection**: `filter`, `select`, `select_seq`, `drop`, `take_every`
- **Missing Value Handling**: `null` and `NaN` are different (float can have both). `fill_nan`, `fill_null`, `drop_nulls`, `interpolate`
- **Type Casting**: `cast`, `to_series`, `to_numpy`, `to_pandas`, `to_dict`, `to_dicts`, `to_arrow`, `to_struct`
- **Data Reshaping**: `melt`, `pivot`, `transpose`, `unnest`, `unstack`, `explode`, `with_row_count`, `with_columns`, `with_columns_seq`, `insert_at_idx`, `extend`, `to_dummies`
- **Combining and Merging**: `join`, `join_asof`, `hstack`, `vstack`, `concat`

### 2. Data Inspection and Exploration:
- **Basic Information**: `shape`, `width`, `height`, `columns`, `dtypes`, `flags`, `schema`
- **Data Retrieval**: `head`, `tail`, `row`, `rows`, `get_column`, `get_columns`, `find_idx_by_name`, `item`
- **Data Summary**: `describe`, `glimpse`, `hash_rows`, `corr`
- **Data Sampling**: `sample`

### 3. Grouping and Aggregation:
- `group_by`, `groupby`, `group_by_dynamic`, `groupby_dynamic`, `group_by_rolling`, `groupby_rolling`, `partition_by`
- `upsample`

### 4. Rolling and Window Functions:
- `rolling`

### 5. Utility and Miscellaneous:
- `clone`, `copy`, `clear`, `shrink_to_fit`, `to_init_repr`, `fold`, `pipe`, `lazy`, `apply`

### 6. Data Cleaning and Preprocessing:
- `replace`, `update`, `drop_in_place`

### 7. Handling Duplicates:
- `is_duplicated`, `is_unique`, `unique`

### 8. Handling Missing Values:
- `drop_nulls`

### 9. Data Export and Serialization:
- **To File**: `write_csv`, `write_json`, `write_parquet`, `write_excel`, `write_avro`, `write_ipc`, `write_ipc_stream`, `write_ndjson`, `write_database`, `write_delta`
- **To Database**: `write_database`

### 10. Utility and Miscellaneous:
- **Indexing and Slicing**: `slice`, `select_at_idx`, `replace_at_idx`, `shift`, `shift_and_fill`
- **Utility**: `reverse`, `n_chunks`, `estimated_size`, `iter_rows`, `iter_slices`, `rows_by_key`, `map_rows`, `merge_sorted`, `rechunk`, `rename`, `is_empty`

### 11. Advanced Features:
- **Lazy Evaluation**: `lazy`, `collect`

### 12. Debugging and Inspection:
- `frame_equal`, `glimpse`



### Expressions

Expressions are used to define the operations to be performed in the context. Polars can optimize the execution of expressions to improve performance. 
Expressions are built from functions in the `pl` namespace, most often starting from `pl.col('column_name')`. Some examples include:

- `pl.col('column_name')` - references a column
- `pl.col(r'^.*_editor$')` - selects all columns that end with '_editor' (a string is treated as a regex only when it starts with `^` and ends with `$`)
- `(pl.col('sale') * .3).alias('finders_fee')` - use math to create a new column with the given expression
- `(pl.col('age') > 18).alias('is_adult')` - use a boolean expression to create a new column
- `pl.all()` - selects all columns
- `pl.all().exclude('col1', 'col2')` - selects all columns except the given columns
- `pl.col('birth_date').dt.year()` - gets the year from a date column
- `pl.col(pl.Float64, pl.Boolean)` - select all float64 and boolean columns
- `cs.all() - cs.by_dtype(pl.Float64)` - select all columns except float64 columns (set operators work on column selectors)
  - `-` (set difference), `&` (set intersection), `|` (set union)
- `cs.float()` - select all float columns (using column selectors)
- `cs.contains('_editor')` - select all columns that contain '_editor' (using column selectors)
- `cs.matches(r'_editor$')` - select all columns that end with '_editor' (using column selectors)
- `pl.sum('column_name')` - sum a column (shorthand for `pl.col('column_name').sum()`)
- `pl.lit(1)` - creates a literal value

On selectors, the operators `-`, `&`, and `|` act on the set of selected columns. To apply them as expression operators to the column values instead, convert the selector with `.as_expr()`.

### Expression Functions

### 1. Arithmetic Operations:
- **Basic Arithmetic**: `add`, `sub`, `mul`, `div`, `floordiv`, `mod`, `pow`
- **Increment/Decrement**: `cumsum`, `cumprod`, `diff`
- **Aggregations**: `sum`, `product`, `mean`, `median`, `min`, `max`, `count`, `n_unique`, `nan_min`, `nan_max`, `len`, `any`, `all`, `null_count`

### 2. Logical and Comparison Operations:
- **Comparison**: `eq`, `ne`, `lt`, `le`, `gt`, `ge`
- **Logical**: `and_`, `or_`, `not_`, `xor`
- **If/Then/Else**: `pl.when(conditional).then(then_expr).otherwise(else_expr)`

### 3. Missing Value Handling:
- **Detection**: `is_nan`, `is_not_nan`, `is_null`, `is_not_null`, `is_finite`, `is_infinite`
- **Imputation**: `fill_nan`, `fill_null`, `backward_fill`, `forward_fill`, `drop_nans`, `drop_nulls`

### 4. String Operations:
- `cat`, `str`

### 5. List Operations:
- **Manipulation**: `arr`, `list`, `flatten`, `explode`, `take`, `append`, `extend_constant`
- **Aggregation**: `first`, `last`

### 6. Date and Time Operations:
- `dt`

### 7. Mathematical Functions:
- **Trigonometric**: `sin`, `cos`, `tan`, `cot`, `asin`, `acos`, `atan`, `arcsin`, `arccos`, `arctan`, `sinh`, `cosh`, `tanh`, `arcsinh`, `arccosh`, `arctanh`
- **Exponential and Logarithmic**: `exp`, `log`, `log10`, `log1p`, `sqrt`, `cbrt`
- **Rounding**: `round`, `floor`, `ceil`
- **Other**: `abs`, `clip`, `clip_min`, `clip_max`, `degrees`, `radians`

### 8. Statistical Functions:
- `std`, `var`, `skew`, `kurtosis`, `quantile`, `mode`, `value_counts`

### 9. Sorting and Ordering:
- `sort`, `sort_by`, `arg_sort`, `arg_min`, `arg_max`, `rank`, `top_k`, `bottom_k`, `search_sorted`, `set_sorted`

### 10. Filtering and Searching:
- `filter`, `where`, `is_in`, `is_not`, `is_unique`, `is_duplicated`, `is_first`, `is_last`, `is_first_distinct`, `is_last_distinct`, `unique`, `unique_counts`

### 11. Window Functions:
- `rolling`, `ewm_mean`, `ewm_std`, `ewm_var`, `rolling_apply`, `rolling_max`, `rolling_min`, `rolling_mean`, `rolling_median`, `rolling_std`, `rolling_var`, `rolling_sum`, `rolling_skew`, `rolling_quantile`

### 12. Grouping and Aggregation:
- `agg_groups`, `groupby`, `over`, `cummax`, `cummin`, `cumcount`, `peak_max`, `peak_min`

### 13. Type Casting and Conversion:
- `cast`, `to_physical`, `reinterpret`, `shrink_dtype`

### 14. JSON and Struct Operations:
- `from_json`, `struct`

### 15. Sampling and Randomization:
- `sample`, `shuffle`

### 16. Naming:
- `alias`, `suffix`, `prefix`

### 17. Function Application:
- `apply`, `pipe`, `map`, `map_batches`, `map_elements`, `map_alias`, `map_dict`

### 18. Debugging:
- `inspect`, `meta`

### 19. Utility and Miscellaneous:
- `keep_name`,  `hash`, `repeat_by`, `reshape`, `reverse`, `slice`, `take_every`, `limit`, `head`, `tail`, `cut`, `qcut`, `cumulative_eval`, `pct_change`, `cache`, `rechunk`, `rle`, `rle_id`, `product`, `sign`, `dot`, `exclude`, `implode`, `interpolate`, `shift`, `shift_and_fill`, `select`, `rename`, `lower_bound`, `upper_bound`



In [None]:
# debug selector
from polars.selectors import is_selector, expand_selector
import polars.selectors as cs

eds = cs.matches(r'(name|age)')
is_selector(eds), expand_selector(df1, eds)

## Lazy Evaluation

Polars expressions enable lazy evaluation: an expression is not evaluated until its result is needed, so Polars can optimize the whole query first. This can improve performance significantly. Polars supports:

- Predicate Pushdown - filters are applied as early as possible
- Projection Pushdown - only use/read the columns needed

The typical flow is to call `pl.scan_csv` to create a lazy DataFrame (a `LazyFrame`) from a file, or `df.lazy()` to create one from an existing DataFrame.

To materialize the result, call `.collect()` on the lazy DataFrame to execute the plan and return a DataFrame. Use `.collect(streaming=True)` to execute the plan in batches. (Not supported by all operations.)



In [None]:
df1.lazy()

## Load Data

In [None]:
import polars as pl
import pyarrow as pa

pl.__version__, pa.__version__

In [None]:
import polars as pl

import urllib.request
from zipfile import ZipFile

url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

# download and unzip the url
dest_file = 'vehicles.csv.zip'
urllib.request.urlretrieve(url, dest_file)
# extract vehicles.csv into the current directory
with ZipFile(dest_file) as z:
    z.extract('vehicles.csv')

In [None]:
import polars as pl

def make_to_origin(make_col):
    # Dictionary mapping car makes to countries of origin
    origin_dict = {
        'Chevrolet': 'USA',
        'Ford': 'USA',
        'Dodge': 'USA',
        'GMC': 'USA',
        'Toyota': 'Japan',
        'BMW': 'Germany',
        'Mercedes-Benz': 'Germany',
        'Nissan': 'Japan',
        'Volkswagen': 'Germany',
        'Mitsubishi': 'Japan',
        'Porsche': 'Germany',
        'Mazda': 'Japan',
        'Audi': 'Germany',
        'Honda': 'Japan',
        'Jeep': 'USA',
        'Pontiac': 'USA',
        'Subaru': 'Japan',
        'Volvo': 'Sweden',
        'Hyundai': 'South Korea',
        'Chrysler': 'USA',
        'Tesla': 'USA'
    }
    
    return origin_dict.get(make_col, "Unknown")

df_pl = pl.read_csv('vehicles-pd.csv')

result = (df_pl
          .with_columns(pl.col('make').map_elements(make_to_origin).alias("origin"),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          .pivot(index='year', columns='origin', values='avg_city08')
          .sort('year')
)

(result
  # leverages Arrow transfer (use pl.from_pandas to go other way)
 .to_pandas(use_pyarrow_extension_array=True)
 .set_index('year')
 .plot(title='Average Mileage by Year and Country of Origin')
)

In [None]:
import polars as pl

# convert make to origin to expr
def make_to_origin_expr(make_col):
    # Dictionary mapping car makes to countries of origin
    origin_dict = {
        'Chevrolet': 'USA',
        'Ford': 'USA',
        'Dodge': 'USA',
        'GMC': 'USA',
        'Toyota': 'Japan',
        'BMW': 'Germany',
        'Mercedes-Benz': 'Germany',
        'Nissan': 'Japan',
        'Volkswagen': 'Germany',
        'Mitsubishi': 'Japan',
        'Porsche': 'Germany',
        'Mazda': 'Japan',
        'Audi': 'Germany',
        'Honda': 'Japan',
        'Jeep': 'USA',
        'Pontiac': 'USA',
        'Subaru': 'Japan',
        'Volvo': 'Sweden',
        'Hyundai': 'South Korea',
        'Chrysler': 'USA',
        'Tesla': 'USA'
    }
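    expr = None
    col = pl.col(make_col)
    for k, v in origin_dict.items():
        if expr is None:
            # the first branch starts the when/then chain
            expr = pl.when(col == k).then(pl.lit(v))
            continue
        # later branches extend the chain; a fresh pl.when would discard it
        expr = expr.when(col == k).then(pl.lit(v))
    expr = expr.otherwise(pl.lit('Unknown'))
    return expr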

df_pl = pl.read_csv('vehicles-pd.csv')

result = (df_pl
          .with_columns(#pl.col('make').map_elements(make_to_origin).alias("origin"),
              make_to_origin_expr('make').alias('origin'),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          .pivot(index='year', columns='origin', values='avg_city08')
          .sort('year')
)

(result
 .to_pandas(use_pyarrow_extension_array=True)
 .set_index('year')
 .plot(title='Average Mileage by Year and Country of Origin')
)

In [None]:
%%timeit
(df_pl
  .with_columns(#
      pl.col('make').map_elements(make_to_origin).alias("origin")
  )
)
           

In [None]:
%%timeit
(df_pl
  .with_columns(#
     make_to_origin_expr('make').alias('origin'),
  )
)
           

### Exponential Growth

In [None]:
def exponential_growth_pl(start, rate, periods):
    result = pl.Series([start]*periods)#, dtype='Float64')
    for i in range(1, periods):
        result[i] = result[i-1] * (1+rate)
    return result

exponential_growth_pl(1., 0.01, 10_000)

In [None]:
%%timeit

exponential_growth_pl(1., 0.01, 10_000) 

### New Rust UDF
https://pola-rs.github.io/polars/user-guide/expressions/plugins/#writing-the-expression

In [None]:
%%bash
mkdir -p rustfn

In [None]:
%%writefile rustfn/Cargo.toml
[package]
name = "expression_lib"
version = "0.1.0"
edition = "2021"

[lib]
name = "expression_lib"
crate-type = ["cdylib"]

[dependencies]
polars = { version = "*" }
pyo3 = { version = "0.20.0", features = ["extension-module"] }
pyo3-polars = { version = "*", features = ["derive"] }
serde = { version = "1", features = ["derive"] }

In [None]:
%%writefile rustfn/exponential_growth.rs
// exponential growth written as a Rust function
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;
use std::fmt::Write;

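// Plain helper: builds the growth values in a Vec, then wraps them in a Series.
// It is not yet exported to Python with #[polars_expr].
fn exponential_growth(start: f64, rate: f64, periods: usize) -> PolarsResult<Series> {
    let mut values: Vec<f64> = Vec::with_capacity(periods);
    let mut current = start;
    for _ in 0..periods {
        values.push(current);
        current *= 1.0 + rate;
    }
    Ok(Series::new("growth".into(), values))
}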

In [None]:
import polars as pl
from polars.type_aliases import IntoExpr
from polars.utils.udfs import _get_shared_lib_location

lib = _get_shared_lib_location(__file__)

# the class wraps a pl.Expr, so register it as an expression namespace
@pl.api.register_expr_namespace('exponential_growth')
class ExponentialGrowth:
    def __init__(self, expr: pl.Expr) -> None:
        self._expr = expr
    
    def exponential_growth(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib,
            symbol='exponential_growth',
            is_elementwise=False,
        )

In [None]:
!pip install maturin

In [None]:
pl.__version__

## Laziness

In [None]:
import polars as pl

# convert make to origin to expr
def make_to_origin_expr(make_col):
    # Dictionary mapping car makes to countries of origin
    origin_dict = {
        'Chevrolet': 'USA',
        'Ford': 'USA',
        'Dodge': 'USA',
        'GMC': 'USA',
        'Toyota': 'Japan',
        'BMW': 'Germany',
        'Mercedes-Benz': 'Germany',
        'Nissan': 'Japan',
        'Volkswagen': 'Germany',
        'Mitsubishi': 'Japan',
        'Porsche': 'Germany',
        'Mazda': 'Japan',
        'Audi': 'Germany',
        'Honda': 'Japan',
        'Jeep': 'USA',
        'Pontiac': 'USA',
        'Subaru': 'Japan',
        'Volvo': 'Sweden',
        'Hyundai': 'South Korea',
        'Chrysler': 'USA',
        'Tesla': 'USA'
    }
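    expr = None
    col = pl.col(make_col)
    for k, v in origin_dict.items():
        if expr is None:
            # the first branch starts the when/then chain
            expr = pl.when(col == k).then(pl.lit(v))
            continue
        # later branches extend the chain; a fresh pl.when would discard it
        expr = expr.when(col == k).then(pl.lit(v))
    expr = expr.otherwise(pl.lit('Unknown'))
    return expr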

df_pl = pl.scan_csv('vehicles-pd.csv')

result = (df_pl
          .with_columns(#pl.col('make').map_elements(make_to_origin).alias("origin"),
              make_to_origin_expr('make').alias('origin'),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          .pivot(index='year', columns='origin', values='avg_city08')
          .sort('year')
)

result

In [None]:
result = (df_pl
          .with_columns(#pl.col('make').map_elements(make_to_origin).alias("origin"),
              make_to_origin_expr('make').alias('origin'),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          #.pivot(index='year', columns='origin', values='avg_city08')
          #.sort('year')
)

result

In [None]:
%%timeit
df_pl = pl.read_csv('vehicles-pd.csv')

result = (df_pl
          .with_columns(#pl.col('make').map_elements(make_to_origin).alias("origin"),
              make_to_origin_expr('make').alias('origin'),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          #.pivot(index='year', columns='origin', values='avg_city08')
          #.sort('year')
)


In [None]:
%%timeit
df_pl = pl.scan_csv('vehicles-pd.csv')

result = (df_pl
          .with_columns(#pl.col('make').map_elements(make_to_origin).alias("origin"),
              make_to_origin_expr('make').alias('origin'),
                pl.col('createdOn').str.to_datetime('%a %b %d %H:%M:%S %Z %Y'))
          .filter((pl.col("origin") != "Unknown") & (pl.col("year") < 2020))
          .select(['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn'])
          .group_by(['origin', 'year'])
          .agg(pl.col("city08").mean().alias("avg_city08"))
          #.pivot(index='year', columns='origin', values='avg_city08')
          #.sort('year')
          .collect()
)


In [None]:
# compare to pandas
import pandas as pd
def make_to_origin(make):
    """
    Convert car make to country of origin.
    
    Args:
        make (str): Car make.
        
    Returns:
        str: Country of origin.
    """
    # Dictionary mapping car makes to countries of origin
    origin_dict = {
        'Chevrolet': 'USA',
        'Ford': 'USA',
        'Dodge': 'USA',
        'GMC': 'USA',
        'Toyota': 'Japan',
        'BMW': 'Germany',
        'Mercedes-Benz': 'Germany',
        'Nissan': 'Japan',
        'Volkswagen': 'Germany',
        'Mitsubishi': 'Japan',
        'Porsche': 'Germany',
        'Mazda': 'Japan',
        'Audi': 'Germany',
        'Honda': 'Japan',
        'Jeep': 'USA',
        'Pontiac': 'USA',
        'Subaru': 'Japan',
        'Volvo': 'Sweden',
        'Hyundai': 'South Korea',
        'Chrysler': 'USA',
        'Tesla': 'USA'
    }
    
    return origin_dict.get(make, "Unknown")
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'




In [None]:
%%timeit
df_pd = pd.read_csv(url,
                 engine='pyarrow', dtype_backend='pyarrow')
(df_pd
 .assign(origin=lambda df: df['make'].apply(make_to_origin),
         # replace EST and EDT with offset in createdOn
        createdOn=lambda df: df['createdOn'].str.replace('EDT', '-04:00').str.replace('EST', '-05:00')
 )
 .assign(
        # convert createdOn to datetime using strftime for  Tue Jan 01 00:00:00 -05:00 2013
        # has mixed timezones so we need to use utc=True
        createdOn=lambda df: pd.to_datetime(df['createdOn'], format='%a %b %d %H:%M:%S %z %Y', utc=True),
        )
 .query('origin != "Unknown" and year < 2020')
 .loc[:, ['make', 'model', 'year', 'city08', 'highway08', 'origin', 'createdOn']]
    .groupby(['origin', 'year'])
    .city08
    .mean()
  #  .unstack('origin')
  #.plot(title='Average Mileage by Year and Country of Origin')
 )

## Polars Exercises

1. **Basic DataFrame Operations**

   - Show the shape of the `df_pl` DataFrame.
   - Print the first 5 rows of the `df_pl` DataFrame.
   - Print the last 5 rows of the `df_pl` DataFrame.
   - Print the list of columns in the `df_pl` DataFrame.
   
2. **Data Exploration**
   - Print the number of unique values in each column of the `df_pl` DataFrame.
   - Print the number of null values in each column of the `df_pl` DataFrame. (Hint: see `df.select` and `pl.all`)
   - Print the mean and standard deviation of the 'city08' column of the `df_pl` DataFrame.
   - Print the median and 75th percentile of the 'city08' column of the `df_pl` DataFrame.

3. **String Manipulation**
   - Upper case the 'make' column of the `df_pl` DataFrame. (Hint: see `dir(pl.col('make').str)` )
   - Combine the 'year' and 'make' columns of the `df_pl` DataFrame into a new column called 'year_make'. (Hint: see `pl.Expr.cast`)

4. **Datetime Conversion**
   - Convert the 'createdOn' column to the New York timezone. (Hint: see `pl.Expr.dt.convert_time_zone` and `pl.Expr.dt.replace_time_zone`)

5. **Data Filtering**
   - Filter the `df_pl` DataFrame to only include rows where the 'make' column is 'Ford'. (Hint: see `df.filter`)
   - Filter the data to only include rows where the 'model' column is a single word. (Hint: see `pl.Expr.str.split` and `pl.Expr.list.len`, named `list.lengths` in older releases)
   - Filter the rows where the city mileage is greater than the 75th percentile of the city mileage values.

6. **Grouping and Aggregation (Moderate)**
   - Find the average mileage for Ford, Tesla, and Toyota vehicles.
   - Find the average mileage by year and make.
