**Table of contents**<a id='toc0_'></a>    
- [About](#toc1_)    
- [Data importation](#toc2_)    
- [Data conversion](#toc3_)    
- [Exploratory Data Analysis](#toc4_)    
- [Drop columns](#toc5_)    
- [Remove features](#toc6_)    
  - [Features with too many nulls](#toc6_1_)    
  - [Features with near zero variance](#toc6_2_)    
- [Create features](#toc7_)    
  - [Scale data](#toc7_1_)    
  - [Create time features](#toc7_2_)    
  - [Check if monotonic](#toc7_3_)    
  - [Add seasons](#toc7_4_)    
- [Train test split](#toc8_)    
- [One Hot Encoding](#toc9_)    
- [Train & evaluate model](#toc10_)    
  - [Quantitative evaluation](#toc10_1_)    
  - [Qualitative evaluation](#toc10_2_)    
- [Make predictions](#toc11_)    
- [Export data](#toc12_)    
- [Charts & Graphs](#toc13_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [116]:
import polars as pl
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import make_scorer, root_mean_squared_error
import pandas as pd
import numpy as np
from datetime import datetime
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, HoltWinters

# <a id='toc1_'></a>[About](#toc0_)
This notebook illustrates how to perform time-series modelling, using Craigslist vehicle sales as sample data.

In [117]:
pd.options.display.max_columns = None
N_ROWS = None

In [118]:
def display_pl(df: pl.LazyFrame):
    if N_ROWS is not None:
        display(df.head().collect())

# <a id='toc2_'></a>[Data importation](#toc0_)
Import every column as a string, skipping schema inference (`infer_schema_length=0`), to make the import faster. Columns are converted to other data types on demand.

In [119]:
craigslist_vehicles = pl.scan_csv("./data/craigslist_vehicles.csv", n_rows=N_ROWS, infer_schema_length=0)
display_pl(craigslist_vehicles)

In [120]:
# Confirming the date format
if N_ROWS is not None:
    craigslist_vehicles.collect().sample(20).select(pl.col("removal_date")).to_series().to_list()

# <a id='toc3_'></a>[Data conversion](#toc0_)
Converting specific columns to different data types.

In [121]:
numeric_cols = ["price", "odometer"]
date_cols = ["posting_date", "removal_date"]

In [122]:
def convert_data(data: pl.LazyFrame, date_columns: list = date_cols, numeric_columns: list = numeric_cols) -> pl.LazyFrame:
    
    for d in date_columns:
        data = data.with_columns(pl.col(d).str.to_datetime(format="%Y-%m-%d %H:%M:%S%z"))
        
    for n in numeric_columns:
        data = data.with_columns(pl.col(n).cast(pl.Float32()))
    
    data = data.sort(by="removal_date", descending=False)
    data = data.drop_nulls(subset="removal_date")
    return data

craigslist_vehicles = convert_data(craigslist_vehicles)
display_pl(craigslist_vehicles)
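
For readers unfamiliar with the format string passed to `str.to_datetime`, the sketch below parses a timestamp with the standard library's `datetime.strptime`, which understands the same directives. The sample value is hypothetical and chosen only for illustration; the strings in the dataset may look different, and Polars uses its own parser, so edge cases may not behave identically.

```python
from datetime import datetime, timezone

# Same format string as in convert_data; the sample value is hypothetical.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S%z"
sample = "2021-05-04 12:31:18-0500"

parsed = datetime.strptime(sample, DATE_FORMAT)   # timezone-aware datetime
as_utc = parsed.astimezone(timezone.utc)          # same instant, expressed in UTC

offset_hours = parsed.utcoffset().total_seconds() / 3600
print(offset_hours)
print(as_utc.isoformat())
```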

In [123]:
min_sale_date = craigslist_vehicles.select("removal_date").min().collect().to_series()[0]

# <a id='toc4_'></a>[Exploratory Data Analysis](#toc0_)
Perform high level Exploratory Data Analysis (EDA).

In [124]:
(craigslist_vehicles
 .collect()
 .describe()
 .transpose(column_names="describe", include_header=True)
)

column,count,null_count,mean,std,min,25%,50%,75%,max
str,str,str,str,str,str,str,str,str,str
"""""","""426812""","""0""",,,"""100""",,,,"""99999"""
"""id""","""426812""","""0""",,,"""7301583321""",,,,"""7317101084"""
"""url""","""426812""","""0""",,,"""https://abilen…",,,,"""https://zanesv…"
"""region""","""426812""","""0""",,,"""SF bay area""",,,,"""zanesville / c…"
"""region_url""","""426812""","""0""",,,"""https://abilen…",,,,"""https://zanesv…"
"""price""","""426812.0""","""0.0""","""75209.2734375""","""12183253.0""","""0.0""","""5900.0""","""13950.0""","""26489.0""","""3736928768.0"""
"""year""","""425675""","""1137""",,,"""1900.0""",,,,"""2022.0"""
"""manufacturer""","""409234""","""17578""",,,"""acura""",,,,"""volvo"""
"""model""","""421603""","""5209""",,,"""""t""""",,,,"""🔥GMC Sierra 15…"
"""condition""","""252776""","""174036""",,,"""excellent""",,,,"""salvage"""


# <a id='toc5_'></a>[Drop columns](#toc0_)
Drop columns that cannot be used as features, such as identifiers, URLs, and free-text descriptions.

In [125]:
id_cols = ['', "id", "url", "region_url", "VIN", "image_url", "description", "lat", "long", "year"]
craigslist_vehicles = craigslist_vehicles.drop(id_cols)

# <a id='toc6_'></a>[Remove features](#toc0_)
Remove additional columns that are unsuitable as features: columns with little or no variance, columns with too many null values, and columns with too many unique values to consolidate into fewer categories.

## <a id='toc6_1_'></a>[Features with too many nulls](#toc0_)
Identify columns that have greater than a certain threshold of null values.

In [126]:
def find_excess_nulls(data: pl.LazyFrame, thr: float = 0.2) -> list:
    """Return the columns whose proportion of null values exceeds `thr`."""
    # Count the rows of `data` itself rather than of the global frame
    n_obs = len(data.collect())
    df = (data
        .null_count()
        .collect()
        .transpose(include_header=True, column_names=["null_count"])
        .with_columns(pl.lit(value=n_obs).alias("obs"))
        .with_columns((pl.col("null_count") / pl.col("obs")).alias("prop"))
        .with_columns((pl.col("prop") > thr).alias("is_excess_nulls"))
        .filter(pl.col("is_excess_nulls"))
    )
    
    print(df)
        
    excess_nulls = (df
        .select("column")
        .to_series().to_list()
    )
    
    return excess_nulls

excess_null_cols = find_excess_nulls(craigslist_vehicles)
excess_null_cols

shape: (7, 5)
┌─────────────┬────────────┬────────┬──────────┬─────────────────┐
│ column      ┆ null_count ┆ obs    ┆ prop     ┆ is_excess_nulls │
│ ---         ┆ ---        ┆ ---    ┆ ---      ┆ ---             │
│ str         ┆ u32        ┆ i32    ┆ f64      ┆ bool            │
╞═════════════╪════════════╪════════╪══════════╪═════════════════╡
│ condition   ┆ 174036     ┆ 426812 ┆ 0.407758 ┆ true            │
│ cylinders   ┆ 177610     ┆ 426812 ┆ 0.416132 ┆ true            │
│ drive       ┆ 130499     ┆ 426812 ┆ 0.305753 ┆ true            │
│ size        ┆ 306293     ┆ 426812 ┆ 0.71763  ┆ true            │
│ type        ┆ 92790      ┆ 426812 ┆ 0.217403 ┆ true            │
│ paint_color ┆ 130135     ┆ 426812 ┆ 0.3049   ┆ true            │
│ county      ┆ 426812     ┆ 426812 ┆ 1.0      ┆ true            │
└─────────────┴────────────┴────────┴──────────┴─────────────────┘


['condition', 'cylinders', 'drive', 'size', 'type', 'paint_color', 'county']
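
The threshold rule can be checked by hand. The sketch below uses plain Python rather than Polars, with null counts and the row total copied from the outputs above; the helper name `excess_null_columns` is introduced here only for illustration.

```python
# Null counts and row total copied from the outputs above.
N_OBS = 426_812
null_counts = {"manufacturer": 17_578, "type": 92_790, "size": 306_293}

def excess_null_columns(counts: dict, n_obs: int, thr: float = 0.2) -> list:
    """Return the columns whose share of nulls is strictly greater than thr."""
    return [col for col, n_null in counts.items() if n_null / n_obs > thr]

flagged = excess_null_columns(null_counts, N_OBS)
print(flagged)
```

With the default threshold of 0.2, `type` (about 21.7% nulls) is flagged while `manufacturer` (about 4.1% nulls) is kept.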

## <a id='toc6_2_'></a>[Features with near zero variance](#toc0_)
Identifies columns with little to no variation in their values.

In [127]:
def find_nzv_categorical(data: pl.LazyFrame, thr: float = 0.8) -> list:
    """Identifies categorical columns whose most common value accounts for more than `thr` of the rows."""
    
    cols = data.select(pl.col(pl.Utf8)).columns
    
    categorical_cols = [c 
        for c in cols 
        if c not in 
        numeric_cols + excess_null_cols + id_cols
        ]
    
    df = (data
    .select(categorical_cols)
    .melt(variable_name="column")
    .group_by(pl.all())
    .len()
    # NOTE: despite its name, "null_count" holds the number of rows per (column, value) pair
    .rename({"len": "null_count"})
    .with_columns(pl.col("null_count").sum().over("column").alias("total"))
    .with_columns((pl.col("null_count") / pl.col("total")).alias("prop"))
    .with_columns((pl.col("prop") > thr).alias("is_nzv"))
    .sort(by="column")
    .filter(pl.col("is_nzv"))
    .collect()
    )
    
    print(df)
    
    is_nzv = (df
    .select("column")
    .to_series().to_list()
    )
    
    return is_nzv

is_nzv_categorical = find_nzv_categorical(craigslist_vehicles)
is_nzv_categorical

shape: (2, 6)
┌──────────────┬───────┬────────────┬────────┬──────────┬────────┐
│ column       ┆ value ┆ null_count ┆ total  ┆ prop     ┆ is_nzv │
│ ---          ┆ ---   ┆ ---        ┆ ---    ┆ ---      ┆ ---    │
│ str          ┆ str   ┆ u32        ┆ u32    ┆ f64      ┆ bool   │
╞══════════════╪═══════╪════════════╪════════╪══════════╪════════╡
│ fuel         ┆ gas   ┆ 356209     ┆ 426812 ┆ 0.834581 ┆ true   │
│ title_status ┆ clean ┆ 405117     ┆ 426812 ┆ 0.94917  ┆ true   │
└──────────────┴───────┴────────────┴────────┴──────────┴────────┘


['fuel', 'title_status']
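
As a minimal illustration of the dominant-value check, the sketch below computes the share of the most common value in a single column using only the standard library. The toy column and the helper name `dominant_share` are assumptions made for the example; they are not part of the notebook.

```python
from collections import Counter

def dominant_share(values: list) -> tuple:
    """Return the most frequent value and its share of all observations."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Hypothetical toy column: 9 of 10 rows share one value.
title_status = ["clean"] * 9 + ["rebuilt"]
value, share = dominant_share(title_status)
is_nzv = share > 0.8  # same default threshold as find_nzv_categorical
print(value, share, is_nzv)
```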

In [128]:
def find_nzv_numeric(data: pl.LazyFrame, num_cols: list = numeric_cols, thr: float = 0.8) -> list:
    """Identifies numeric columns with little to no variance in their values."""
    
    numeric_data = data.select(num_cols).with_columns(pl.all().cast(pl.Float32()))
    # thr * (1 - thr) is the variance of a 0/1 feature that takes one value in a share `thr` of the rows
    nzv = VarianceThreshold(thr * (1 - thr))
    nzv.fit(numeric_data.collect())
    # Boolean mask over the input columns: True means the column is retained.
    # get_feature_names_out() already returns only the retained names, so it must not be masked again.
    retained = nzv.get_support(indices=False)
    return [f for f, keep in zip(num_cols, retained) if not keep]

is_nzv_numeric = find_nzv_numeric(craigslist_vehicles)
is_nzv_numeric

[]
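
The cutoff `thr * (1 - thr)` passed to `VarianceThreshold` is the variance of a binary (0/1) column in which one value occurs in a share `thr` of the rows. The sketch below checks this numerically with the standard library; the toy column is hypothetical. For continuous columns such as `price`, the same number acts as an absolute variance cutoff in the column's own units, so it is a much weaker filter there.

```python
from statistics import pvariance

thr = 0.8
# Hypothetical 0/1 column in which 80% of the rows take the value 1.
binary_column = [1] * 8 + [0] * 2

variance = pvariance(binary_column)  # population variance of the column
cutoff = thr * (1 - thr)             # value passed to VarianceThreshold

print(variance, cutoff)
```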

In [129]:
# Consolidate all the columns to remove
cols_to_drop = set(id_cols + excess_null_cols + is_nzv_categorical + is_nzv_numeric)
craigslist_vehicles = craigslist_vehicles.drop(cols_to_drop)

display_pl(craigslist_vehicles)

# <a id='toc7_'></a>[Create features](#toc0_)
Create specific features of interest.

## <a id='toc7_1_'></a>[Scale data](#toc0_)
Standardise the numeric features (`price` and `odometer`) with scikit-learn's `StandardScaler`. Note that the scaler is fitted on the full dataset, before the train/test split.

In [130]:
def create_scaled_features(data: pl.LazyFrame, scaler = StandardScaler()) -> pl.LazyFrame:
    """Scale numeric features and return the original dataframe with the scaled features."""
    
    numeric_feat = data.select(pl.col(pl.NUMERIC_DTYPES))
    cols = numeric_feat.columns
    categorical_feat = data.drop(cols)
    print("Numeric features:", cols)
    
    scaled_data = scaler.fit_transform(numeric_feat.collect())
    pd_scaled = pd.DataFrame(scaled_data, columns=cols)
    pl_scaled = pl.LazyFrame(pd_scaled)
    
    return pl.concat(items=[categorical_feat, pl_scaled], how="horizontal")

craigslist_vehicles_scaled = create_scaled_features(craigslist_vehicles)
display_pl(craigslist_vehicles_scaled)

Numeric features: ['price', 'odometer']
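
For reference, standardisation subtracts the column mean and divides by the population standard deviation (`ddof=0`), which is what `StandardScaler` computes. The sketch below reproduces that arithmetic with the standard library; the odometer readings and the helper name `standardise` are hypothetical.

```python
from statistics import fmean, pstdev

def standardise(values: list) -> list:
    """Centre values on zero and scale them to unit population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0)
    return [(v - mu) / sigma for v in values]

# Hypothetical odometer readings.
odometer = [20_000.0, 60_000.0, 100_000.0]
scaled = standardise(odometer)
print(scaled)
```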


## <a id='toc7_2_'></a>[Create time features](#toc0_)
Extract the year, month, and day of sale from `removal_date`, then count the number of listings in each group.

In [131]:
group_by_cols = craigslist_vehicles_scaled.columns + ["year_sold", "month_sold", "day_sold"]
group_by_cols = [c for c in group_by_cols if not c.endswith("date")]

def create_time_features(data: pl.LazyFrame, time_col: str = "removal_date", group_by: list = group_by_cols) -> pl.LazyFrame:
    """Extract the year, month, and day from a time column, then count the rows in each group."""
    
    res = (data
        .with_columns(
            pl.col(time_col).dt.year().alias("cal_year"),
            pl.col(time_col).dt.month().alias("month_sold"),
            pl.col(time_col).dt.day().alias("day_sold")
        )
        # Years elapsed since the earliest sale year (0 for the first year)
        .with_columns((pl.col("cal_year") - pl.col("cal_year").min()).alias("year_sold"))
        .group_by(pl.col(group_by), maintain_order=True)
        .len()
        .rename({"len": "count"})
    )
    
    return res

model_data = create_time_features(craigslist_vehicles_scaled)
display_pl(model_data)
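
The grouping step amounts to counting sales per (year offset, month, day) key. The sketch below shows that idea with the standard library on a few hypothetical timestamps. The notebook also groups by every remaining feature column, so most of its groups appear to contain a single listing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical removal timestamps.
removals = [
    datetime(2021, 4, 4), datetime(2021, 4, 4),
    datetime(2021, 4, 5), datetime(2021, 5, 9),
]

first_year = min(d.year for d in removals)

# Key each sale by (years since the first sale year, month, day) and count occurrences.
sales_per_day = Counter((d.year - first_year, d.month, d.day) for d in removals)
print(sales_per_day)
```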

## <a id='toc7_3_'></a>[Check if monotonic](#toc0_)
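Verify that the reconstructed removal dates increase monotonically, i.e. that the rows are in chronological order, which the time-series split relies on.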

In [132]:
def create_removal_date(data: pd.DataFrame | pl.LazyFrame) -> pd.DataFrame:
    """Rebuild `removal_date` from the year offset, month, and day columns."""
    if isinstance(data, pl.LazyFrame):
        data = data.collect().to_pandas()
    
    year_sold = data.year_sold.astype("Int8")
    
    removal_year = [c + min_sale_date.year for c in year_sold]
    removal_month = data.month_sold.astype("Int8")
    removal_day = data.day_sold.astype("Int8")
    
    removal_date = []
    for y, m, d in zip(removal_year, removal_month, removal_day):
        try:
            r = datetime(y, m, d)
            removal_date.append(r)
        except (TypeError, ValueError):
            # Missing or invalid date components
            removal_date.append(None)
    
    return data.assign(removal_date = removal_date)

In [133]:
assert create_removal_date(model_data).removal_date.is_monotonic_increasing, "The dataset should increase monotonically"

## <a id='toc7_4_'></a>[Add seasons](#toc0_)
Map the month of sale to a season. This may not be necessary for the model but is useful for visualisation.

In [134]:
def map_month_to_season(month):
    seasons = {
        "spring": [3, 4, 5],
        "summer": [6, 7, 8],
        "autumn": [9, 10, 11],
        "winter": [12, 1, 2]
    }

    for season, months in seasons.items():
        if month in months:
            return season
    # Only reached when the month matches no season (e.g. a null month)
    return "unknown"

def create_seasons(data: pl.LazyFrame) -> pl.LazyFrame:
    res = data.with_columns(
        pl.col("month_sold").map_elements(function=map_month_to_season, skip_nulls=False, return_dtype=pl.Utf8).alias("season")
    )
    return res

model_data = create_seasons(model_data)
display_pl(model_data)

# <a id='toc8_'></a>[Train test split](#toc0_)
Split the data into train/test. Being time series data, we will use scikit-learn's `TimeSeriesSplit`.

In [135]:
y_data = model_data.select("count").collect().to_series()
y = pd.Series(y_data)

X = model_data.drop("count").collect().to_pandas()
print(X.shape)

(424031, 11)


In [136]:
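# We want the forecast for next week. Note that `gap` counts rows, not days:
# gap=7 drops the 7 rows immediately before each test fold.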
ts_cv = TimeSeriesSplit(n_splits=5, gap=7)
splits = list(ts_cv.split(X, y))

In [137]:
for i, (tr, te) in enumerate(splits):
    tr_len, te_len = len(tr), len(te)
    print(f"split {i} has {tr_len} training indices and {te_len} testing indices: test prop > {te_len/tr_len}")

split 0 has 70669 training indices and 70671 testing indices: test prop > 1.000028300952327
split 1 has 141340 training indices and 70671 testing indices: test prop > 0.5000070751379652
split 2 has 212011 training indices and 70671 testing indices: test prop > 0.33333647782426384
split 3 has 282682 training indices and 70671 testing indices: test prop > 0.250001768771977
split 4 has 353353 training indices and 70671 testing indices: test prop > 0.20000113201246345
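
The fold sizes printed above follow from simple index arithmetic. The sketch below is a plain-Python reading of how an expanding-window split with a gap assigns sizes; it is an illustration written for this notebook, not scikit-learn's own code, and the helper name `time_series_split_sizes` is hypothetical.

```python
def time_series_split_sizes(n_samples: int, n_splits: int = 5, gap: int = 0) -> list:
    """Return (train_size, test_size) per fold for an expanding-window split."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    sizes = []
    for fold in range(n_splits):
        test_start = first_test_start + fold * test_size
        train_size = test_start - gap  # the last `gap` rows before the test fold are dropped
        sizes.append((train_size, test_size))
    return sizes

sizes = time_series_split_sizes(n_samples=424_031, n_splits=5, gap=7)
for fold, (n_train, n_test) in enumerate(sizes):
    print(fold, n_train, n_test)
```

Because the gap is measured in rows, the last training row and the first test row can still fall on the same calendar day.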


In [138]:
# Split whose test set is about a third the size of its training set (test/train ratio of 0.33)
SPLIT_30_PERC = 2

train_idx, test_idx = splits[SPLIT_30_PERC]
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

In [139]:
# The max date of an obs in train should be no later than the min date of an obs in test
print("Max removal date from train")
max_removal_date_train = create_removal_date(X_train).removal_date.max()
print(max_removal_date_train)

print("\nMin removal date from test")
min_removal_date_test = create_removal_date(X_test).removal_date.min()
print(min_removal_date_test)

diff_days = (min_removal_date_test - max_removal_date_train).days

assert create_removal_date(X).removal_date.is_monotonic_increasing, "Data should increase monotonically"

Max removal date from train
2021-05-09 00:00:00

Min removal date from test
2021-05-09 00:00:00


# <a id='toc9_'></a>[One Hot Encoding](#toc0_)
Perform One Hot Encoding of the dataset to convert categorical features to numeric.

In [140]:
ohe = OneHotEncoder(drop="if_binary", max_categories=5, dtype=np.int8)
ohe.fit(X)

In [141]:
def ohe_dataframe(data: pd.DataFrame, encoder: OneHotEncoder = ohe) -> pd.DataFrame:
    """One hot encodes the dataframe and labels the columns of the resulting dataframe."""
    
    X = encoder.transform(data)
    # Take the column names from the encoder that was passed in, not from the global `ohe`
    X_df = pd.DataFrame(X.toarray(), columns=encoder.get_feature_names_out())
    print("The infrequent categories are:", len(encoder.infrequent_categories_))
    return X_df

X_train_encoded = ohe_dataframe(X_train)
display(X_train_encoded)

The infrequent categories are: 11


Unnamed: 0,region_columbus,region_jacksonville,region_portland,region_spokane / coeur d'alene,region_infrequent_sklearn,manufacturer_chevrolet,manufacturer_ford,manufacturer_honda,manufacturer_toyota,manufacturer_infrequent_sklearn,model_1500,model_f-150,model_silverado 1500,model_None,model_infrequent_sklearn,transmission_automatic,transmission_manual,transmission_other,transmission_None,state_ca,state_fl,state_ny,state_tx,state_infrequent_sklearn,price_-0.006173175956192352,price_-0.005599026480337212,price_-0.0055169464980848685,price_-0.005352786533580183,price_infrequent_sklearn,odometer_-0.4584007831926601,odometer_-0.458396107700952,odometer_0.009148387614118052,odometer_nan,odometer_infrequent_sklearn,year_sold_0,month_sold_4,month_sold_5,month_sold_6,day_sold_9,day_sold_10,day_sold_13,day_sold_14,day_sold_infrequent_sklearn,season_unknown
0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0
1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0
2,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0
3,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0
4,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212006,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0
212007,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0
212008,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0
212009,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0
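
The `*_infrequent_sklearn` columns above come from capping each feature at `max_categories=5` output categories, one of which is the catch-all. The sketch below illustrates that grouping idea in plain Python on hypothetical state codes. It is a simplification: it ignores ties and the case where a column has no more than `max_categories` distinct values, so it should not be read as the encoder's exact behaviour.

```python
from collections import Counter

def group_infrequent(values: list, max_categories: int = 5, label: str = "infrequent") -> list:
    """Keep the most frequent categories and collapse the rest into one label."""
    # One output slot is reserved for the catch-all label.
    keep = {v for v, _ in Counter(values).most_common(max_categories - 1)}
    return [v if v in keep else label for v in values]

# Hypothetical state codes.
states = ["ca"] * 5 + ["fl"] * 4 + ["tx"] * 3 + ["ny"] * 2 + ["pa", "wa"]
grouped = group_infrequent(states)
print(sorted(set(grouped)))
```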


# <a id='toc10_'></a>[Train & evaluate model](#toc0_)

## <a id='toc10_1_'></a>[Quantitative evaluation](#toc0_)
Train the model and evaluate it with the root mean squared error (RMSE), both on the held-out test split and with time-series cross-validation.

In [142]:
def train_model(X_train, X_test, y_train, y_test, cv = ts_cv, regressor = HistGradientBoostingRegressor()):
    model = regressor.fit(X_train, y_train)
    
    X_test_encoded = ohe_dataframe(data=X_test)
    y_pred = model.predict(X_test_encoded)
    rmse = root_mean_squared_error(y_true=y_test, y_pred=y_pred)
    
    print("TESTING ERROR:")
    print("rmse:", rmse)
    print("\n")
    
    # Cross-validated RMSE over the full dataset; X and y here are the notebook-level globals
    X_encoded = ohe_dataframe(X)
    scorer = make_scorer(score_func=root_mean_squared_error)
    cv_scores = cross_val_score(estimator=model, X=X_encoded, y=y, scoring=scorer, cv=cv)
    
    print("TESTING ERRORS:")
    print("cv_scores:", cv_scores)
    print("rmse:", cv_scores.mean())
    print("std:", cv_scores.std())

    return model

hgbr_model = train_model(X_train=X_train_encoded, X_test=X_test, y_train=y_train, y_test=y_test)

The infrequent categories are: 11
TESTING ERROR:
rmse: 0.09476576720099102


The infrequent categories are: 11
TESTING ERRORS:
cv_scores: [0.09868755 0.10223465 0.09477364 0.08347318 0.0543154 ]
rmse: 0.08669688394941966
std: 0.0173746066728448


In [143]:
craigslist_vehicles.collect().head(20)

region,price,manufacturer,model,odometer,transmission,state,posting_date,removal_date
str,f32,str,str,f32,str,str,"datetime[μs, UTC]","datetime[μs, UTC]"
"""lehigh valley""",3800.0,"""honda""","""crv""",170000.0,"""automatic""","""pa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""augusta""",41990.0,"""lincoln""","""continental re…",9345.0,"""other""","""ga""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""battle creek""",21900.0,"""toyota""","""prius four""",19000.0,"""automatic""","""mi""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""bellingham""",12999.0,"""jeep""","""compass sport …",65000.0,"""automatic""","""wa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""bellingham""",49999.0,"""chevrolet""","""silverado 2500…",24148.0,"""automatic""","""wa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""bellingham""",49999.0,"""chevrolet""","""silverado 3500…",82144.0,"""automatic""","""wa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""bellingham""",44999.0,"""jeep""","""wrangler unlim…",34695.0,"""automatic""","""wa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""bellingham""",38999.0,"""jeep""","""grand cherokee…",30796.0,"""automatic""","""wa""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""charleston""",24990.0,"""alfa-romeo""","""romeo giulia s…",51087.0,"""other""","""sc""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC
"""charleston""",23990.0,"""acura""","""rdx sport util…",49760.0,"""automatic""","""sc""",2021-04-04 00:00:00 UTC,2021-04-04 00:00:00 UTC


In [144]:
# Does each observation need to represent a daily aggregate?
sf_models = StatsForecast(models=[
    AutoARIMA(),
    HoltWinters()
], freq="W", n_jobs=-1)

#sf_models.fit()

## <a id='toc10_2_'></a>[Qualitative evaluation](#toc0_)
Qualitative evaluation of the performance of the ML model.

In [145]:
predicted_sales = hgbr_model.predict(ohe_dataframe(X_test))
print("obs:", len(y_test))
print("total sales (predicted):", sum(predicted_sales))
print("total sales (actual):", sum(y_test))
print("difference:", sum(predicted_sales) - sum(y_test))

The infrequent categories are: 11
obs: 70671
total sales (predicted): 71275.64179187351
total sales (actual): 71208
difference: 67.64179187350965


# <a id='toc11_'></a>[Make predictions](#toc0_)
Prepare data for the dashboard.

In [146]:
predictions = pd.concat(objs=[
    X_test.reset_index(drop=True), 
    pd.Series(y_test, name="actual_sales").reset_index(drop=True),
    pd.Series(predicted_sales, name="predicted_sales"),
    ], axis=1)

predictions

Unnamed: 0,region,manufacturer,model,transmission,state,price,odometer,year_sold,month_sold,day_sold,season,actual_sales,predicted_sales
0,northern michigan,jeep,wrangler,automatic,mi,-0.004573,0.037201,0,5,9,spring,1,1.008997
1,northern michigan,,oldsmobile cutlass,automatic,mi,-0.005681,0.046552,0,5,9,spring,1,1.008997
2,norfolk / hampton roads,acura,mdx,automatic,va,-0.005235,0.150470,0,5,9,spring,1,1.008997
3,norfolk / hampton roads,ford,f-250 super duty,automatic,va,-0.006173,-0.312049,0,5,9,spring,1,1.014720
4,norfolk / hampton roads,lincoln,town car,automatic,va,-0.005845,0.415785,0,5,9,spring,1,1.008997
...,...,...,...,...,...,...,...,...,...,...,...,...,...
70666,chicago,nissan,altima 3.5sl,automatic,il,-0.005863,0.249296,0,5,15,spring,1,1.008997
70667,chicago,ford,f-150 lariat,automatic,il,-0.005354,0.179299,0,5,15,spring,1,1.007304
70668,chicago,chevrolet,express 2500,automatic,il,-0.004698,0.140623,0,5,15,spring,1,1.008997
70669,chicago,nissan,altima 2.5,other,il,-0.005499,0.093934,0,5,15,spring,1,1.001651


# <a id='toc12_'></a>[Export data](#toc0_)

In [147]:
def extract_band_values(data: pl.LazyFrame, cols: list):
    res = {}
    for c in cols:
        r = (data
            .unique(subset=f"{c}_band")
            .select(pl.col([f"{c}_band", f"{c}_band_values"]))
            .filter(pl.col(f"{c}_band").is_not_null())
            .collect()
            .sort(by=f"{c}_band")
        )
        
        res[c]=r
        
    return res

In [None]:
with pd.ExcelWriter(path="./data/predictions.xlsx", mode="w") as writer:
    predictions.to_excel(excel_writer=writer, index=False, sheet_name="predictions")
    # band_values_dict = extract_band_values(band_values, band_cols)
    # for k,v in band_values_dict.items():
    #     v.to_pandas().to_excel(excel_writer=writer, index=False, sheet_name=k)

# <a id='toc13_'></a>[Charts & Graphs](#toc0_)

* You can explore trends and insights from the model over different time spans using [this interactive dashboard](https://lookerstudio.google.com/reporting/2803f46f-1fdf-48d0-8bf7-5c6d6a665bd1/page/xEeoD).

* The predictions can be found in [this Google Sheet](https://docs.google.com/spreadsheets/d/1gfdVHUMXRjXx1QRTUdIMxKd-FGWmEJnt9DMn1xrgat8/edit#gid=1316071412).


![Craigslist vehicle sales dashboard](./dashboard_ss.png "Screenshot of Craigslist Vehicle Sales Dashboard")