# HW30: Car Model Prediction

In this notebook we:
- build a prediction model using `Gender` and `Color` as input features
- predict car model as output
- compute model accuracy
- persist the trained model
- perform several pandas data analysis tasks
---

## Model Training and Evaluation

### Import required libraries

We import pandas for data processing, scikit-learn for machine learning, our custom converter module for categorical encoding, and joblib for model persistence.

In [None]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from src import converter

import joblib

### Load dataset

We load the cleaned car sales dataset from a CSV file into a pandas `DataFrame`.

In [None]:
df = pd.read_csv("car_sales_cleaned1.csv")
df.head()

### Data cleaning

We remove duplicate rows and rows with missing values to improve data quality for machine learning.

In [None]:
df = df.drop_duplicates()
df = df.dropna()
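
To see how much each cleaning step removes, the row counts can be compared before and after. The sketch below uses a small made-up frame; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", None],
        "Color": ["Red", "Red", "Black", "Black"],
    }
)

rows_before = len(toy)
deduplicated = toy.drop_duplicates()  # drops the repeated ("Male", "Red") row
cleaned = deduplicated.dropna()       # drops the row with a missing Gender

removed_duplicates = rows_before - len(deduplicated)
removed_missing = len(deduplicated) - len(cleaned)
print(f"duplicates removed: {removed_duplicates}, rows with NaN removed: {removed_missing}")
```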

### Create machine learning model

We use `DecisionTreeClassifier` as a simple and interpretable classification model.


In [None]:
model = DecisionTreeClassifier()

### Define target variable

The target variable is the car `Model` that we want to predict.

In [None]:
# Select a 1-D Series; scikit-learn expects a 1-D target and warns on a column vector.
Y = df["Model"]

### Encode categorical features

We use the `converter` module from `HW29` to transform categorical string values into numerical values.

In [None]:
mapper = converter.columns_mapper(
    ["Gender", "Company", "Model", "Transmission", "Color"],
    df
)

dfConverted = converter.convert_x(df, mapper)
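
The `converter` module itself is not shown in this notebook. As a rough sketch of what such an encoding step might look like, the hypothetical helpers `build_mapper` and `apply_mapper` below build a dict of dicts (column name, then string value, then integer code), which is the shape implied by lookups such as `mapper["Gender"]["Female"]`. The actual HW29 implementation may differ.

```python
import pandas as pd


def build_mapper(columns, frame):
    """Hypothetical stand-in for converter.columns_mapper: map each distinct
    string in the given columns to an integer code."""
    return {
        column: {value: code for code, value in enumerate(sorted(frame[column].unique()))}
        for column in columns
    }


def apply_mapper(frame, mapper):
    """Hypothetical stand-in for converter.convert_x: return a copy with the
    mapped columns replaced by their integer codes."""
    converted = frame.copy()
    for column, mapping in mapper.items():
        converted[column] = converted[column].map(mapping)
    return converted


# Made-up rows for illustration.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Color": ["Red", "Black", "Red"]})
toy_mapper = build_mapper(["Gender", "Color"], toy)
toy_converted = apply_mapper(toy, toy_mapper)
print(toy_mapper)
print(toy_converted)
```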

### Define input features

We use only `Gender` and `Color` as input features (`X`) according to the task requirements.

In [None]:
X = dfConverted[["Gender", "Color"]]

### Split dataset into training and testing sets

We hold out 20% of the rows as a test set so that model performance is measured on data the model has not seen during training.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2
)
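
The split is random, so the accuracy reported later can change from run to run. One way to make it repeatable is to pass a fixed `random_state`. The sketch below uses made-up data, and the seed `42` is an arbitrary choice for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy encoded features and labels (made-up values).
toy_X = pd.DataFrame(
    {
        "Gender": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        "Color": [0, 0, 1, 1, 2, 2, 0, 1, 2, 0],
    }
)
toy_y = pd.Series(list("ABABABABAB"), name="Model")

# Each call returns (X_train, X_test, y_train, y_test).
first = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
second = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Same seed, same rows in the test set.
same_split = first[1].index.equals(second[1].index)
print(len(first[0]), len(first[1]), same_split)
```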

### Train the model

The model is trained using the training subset.

In [None]:
model.fit(x_train, y_train)

### Evaluate model accuracy

We compute the accuracy score, the share of test rows whose predicted model matches the true one.

In [None]:
predictions = model.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
accuracy
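
For reference, accuracy is simply the number of correct predictions divided by the number of predictions. The sketch below checks this against `accuracy_score` on made-up labels.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
true_models = ["Corolla", "Civic", "Corolla", "Accent"]
predicted_models = ["Corolla", "Corolla", "Corolla", "Accent"]

# Share of positions where the prediction equals the true label.
matches = sum(t == p for t, p in zip(true_models, predicted_models))
manual_accuracy = matches / len(true_models)
library_accuracy = accuracy_score(true_models, predicted_models)
print(manual_accuracy, library_accuracy)
```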

### Retrain model on full dataset

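We retrain the model on the full dataset before saving it. Passing `X.values` (a plain NumPy array) fits the model without feature names, so later predictions on plain lists do not trigger scikit-learn's feature-name warning.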

In [None]:
model.fit(X.values, Y)

### Make a single prediction example

We predict the car model for a `Female` customer who chooses the `Pale White` color.

In [None]:
GENDER = mapper["Gender"]["Female"]
COLOR = mapper["Color"]["Pale White"]

# Feature order must match the training columns: Gender, then Color.
model.predict([[GENDER, COLOR]])

### Save trained model

The trained model is persisted using `joblib`.

In [None]:
joblib.dump(model, "gender-color-model.joblib")
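
A saved model is only useful if new inputs can be encoded the same way as the training data. One possible approach, sketched below with made-up data, is to store the mapper alongside the model and load both back with `joblib.load`. The file name, the bundle layout, and the toy mapper are assumptions for this example.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy encoded data: [gender_code, color_code] -> model name (made-up values).
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_y = ["A", "B", "C", "D"]
toy_mapper = {"Gender": {"Female": 0, "Male": 1}, "Color": {"Black": 0, "Red": 1}}

toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy-model.joblib")
    # Store the model together with the mapper needed to encode new inputs.
    joblib.dump({"model": toy_model, "mapper": toy_mapper}, path)
    bundle = joblib.load(path)

# Encode a new input with the restored mapper, then predict with the restored model.
row = [[bundle["mapper"]["Gender"]["Female"], bundle["mapper"]["Color"]["Red"]]]
restored_prediction = bundle["model"].predict(row)
print(restored_prediction)
```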

---

## Pandas data analysis tasks

### Filter `Hyundai` cars

In [None]:
# drop=True discards the old row labels instead of adding them as an "index" column.
dfHyundai = df[df["Company"] == "Hyundai"].reset_index(drop=True)
dfHyundai

### `Toyota` cars with price greater than `40000`

In [None]:
dfToyotaExp = df.query(
    'Company == "Toyota" and `Price ($)` > 40000'
).reset_index()

dfToyotaExp
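
Column names that are not valid Python identifiers, such as `Price ($)`, must be wrapped in backticks inside `query`. The same filter can also be written with a boolean mask, as the sketch below shows on made-up rows.

```python
import pandas as pd

# Made-up rows; only the column names mirror the ones used above.
toy = pd.DataFrame(
    {
        "Company": ["Toyota", "Toyota", "Hyundai"],
        "Price ($)": [45000, 30000, 50000],
    }
)

with_query = toy.query('Company == "Toyota" and `Price ($)` > 40000')
with_mask = toy[(toy["Company"] == "Toyota") & (toy["Price ($)"] > 40000)]

# Both approaches select the same rows.
print(with_query.equals(with_mask))
```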

### Three most popular car models

In [None]:
mostPopularCars = (
    df[["Model", "Company"]]
    .value_counts()
    .head(3)
    .reset_index(name="count")  # explicit name for the count column
)

mostPopularCars
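
`value_counts` on several columns counts each distinct combination and returns the results sorted by count in descending order, so `head` picks the most frequent ones. The sketch below uses made-up rows. Passing `name="count"` to `reset_index` is a choice made here so that the count column has the same name regardless of pandas version.

```python
import pandas as pd

# Made-up sales rows for illustration.
toy = pd.DataFrame(
    {
        "Model": ["Corolla", "Corolla", "Accent", "Corolla", "Accent", "Civic"],
        "Company": ["Toyota", "Toyota", "Hyundai", "Toyota", "Hyundai", "Honda"],
    }
)

# Count each (Model, Company) pair and keep the two most frequent.
top_two = (
    toy[["Model", "Company"]]
    .value_counts()
    .head(2)
    .reset_index(name="count")
)
print(top_two)
```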