# Machine Learning Project

## Table of Contents
- [Import Data](#import-data)
  - [Import Data Summary](#import-data-summary)
- [Data Exploration](#data-exploration)
  - [Boolean Features](#boolean-features)
    - [Boolean Features Analysis](#boolean-features-analysis)
  - [Categorical Features](#categorical-features)
    - [Check Categorical Features Consistency](#check-categorical-features-consistency)
    - [Categorical Features Summary](#categorical-features-summary)
  - [Numerical Features](#numerical-features)
    - [Numerical Plots](#plots)
    - [Analysis of Numerical Distributions](#analysis-of-numerical-distributions)
- [Pre-processing](#pre-processing)
  - [Functions](#functions)
  - [Split Data](#split-data)
  - [Outliers and Missing Values](#outliers-and-missing-values)
    - [Categorical Outliers and Missing Values](#categorical-outliers-and-missing-values)
    - [Numerical Outliers](#numerical-outliers)
    - [Numerical Missing Values](#numerical-missing-values)
- [Feature Engineering](#feature-engineering)

<a id="import-data"></a>
## Import Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# to calculate distance between strings
import nltk

In [None]:
df = pd.read_csv('data/train.csv').set_index('carID')
df.head()

In [None]:
df.info()

In [None]:
num_duplicated_ids = df.index.duplicated().sum()
print(f'Number of duplicated carIDs: {num_duplicated_ids}')

<a id="import-data-summary"></a>
#### Import Data Summary
- Dataset loaded successfully with `carID` as the index
- There are no duplicate entries in carID
- The dataset contains information about cars including both numerical features (price, mileage, tax, etc.) and categorical features (brand, model, transmission, etc.)
- Initial inspection shows multiple features that will require preprocessing:
  - Numerical features that need cleaning (negative values, outliers)
  - Categorical features that need standardization
  - Presence of missing values in several columns
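
The missing-value observation above can be quantified with a per-column count. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car data; values are illustrative only
toy = pd.DataFrame({
    "mileage": [12000.0, np.nan, 45000.0, 30000.0],
    "tax": [145.0, 20.0, np.nan, np.nan],
    "fuelType": ["petrol", None, "diesel", "petrol"],
})

# Count and share of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(missing)
```

The same two expressions should work on `df` directly, since they only rely on `DataFrame.isna`.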

<a id="data-exploration"></a>
## Data Exploration

<a id="boolean-features"></a>
### Boolean Features

In [None]:
df['hasDamage'].value_counts(dropna=False)

<a id="boolean-features-analysis"></a>
#### Boolean Features Analysis

Key observations about `hasDamage` feature:
- Contains only the value 0 and NaN
- No instances of value 1 found, suggesting potential data collection issues
- May indicate:
  - Cars with damage not being listed
  - System default setting of 0 for non-damaged cars
  - Incomplete damage assessment process
- Requires special handling in preprocessing:
  - Consider treating NaN as a separate category
  - Validate if 0 truly represents "no damage"
  - May need to be treated as a categorical rather than boolean feature
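
The two handling options listed above could look like the following. This is a sketch on a toy column, and the labels `no_damage`, `damage`, and `unknown` are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy column mirroring the pattern described above: only 0 and NaN
has_damage = pd.Series([0.0, np.nan, 0.0, 0.0, np.nan], name="hasDamage")

# Option A: keep missingness as its own binary flag
damage_missing = has_damage.isna().astype(int)

# Option B: treat the column as categorical with an explicit 'unknown' level
# (label names are hypothetical)
damage_cat = has_damage.map({0.0: "no_damage", 1.0: "damage"}).fillna("unknown")

print(damage_missing.tolist())
print(damage_cat.value_counts().to_dict())
```

Either option keeps the information that the value was missing, which a plain fill with a constant would discard.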

<a id="categorical-features"></a>
### Categorical Features

<a id="check-categorical-features-consistency"></a>
#### Check Categorical Features Consistency

In [None]:
# List of categorical features
cat_cols = ['Brand', 'model', 'fuelType', 'transmission']

# Identify outlier examples in categorical features
cat_outliers_examples = {col: df[col].value_counts().tail(10).index for col in cat_cols}

# Display the outlier examples
pd.DataFrame(cat_outliers_examples)

<a id="categorical-features-summary"></a>
#### Categorical Features Summary
- Initial analysis reveals significant data quality issues across all categorical columns
- No standardization in categorical features, with multiple variations of the same values (different spellings, capitalizations)
- Solution: We will implement string distance-based standardization using edit distance (`nltk.edit_distance`) to clean and standardize these features
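
To make the idea of string distance concrete, here is a small pure-Python Levenshtein distance. It is an illustrative stand-in, not the library routine used in the notebook, and the example spellings are hypothetical rather than taken from the dataset:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (illustrative stand-in)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical spellings compared against one reference category
for raw in ["toyota", "toyot", "tyota", "bmw"]:
    print(raw, edit_distance(raw, "toyota"))
```

With a threshold of 2, the first three spellings would map to `toyota`, while `bmw` would not.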

<a id="numerical-features"></a>
### Numerical Features

In [None]:
# Dict of numerical features
num_cols = {
    'price': 'continuous',
    'mileage': 'continuous',
    'tax': 'continuous',
    'mpg': 'continuous',
    'paintQuality%': 'continuous',
    'engineSize': 'continuous',
    'year': 'discrete',
    'previousOwners': 'discrete'
}

<a id="plots"></a>
#### Numerical Plots

In [None]:
# Plot figures for numerical features and the target variable (price)
plt.figure(figsize=(16, 10))
for i, (col, var_type) in enumerate(num_cols.items(), 1):
    plt.subplot(4, 2, i)

    # Plot based on variable type
    if var_type == 'continuous':
        sns.histplot(data=df, x=col, kde=True, color="lightcoral", bins=30)
        plt.title(f"Distribution of {col}", fontsize=11)
    elif var_type == 'discrete':
        sns.countplot(data=df, x=col, color="lightcoral")
        plt.title(f"Distribution of {col}", fontsize=11)
        plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

In [None]:
# Boxplots for continuous numerical features and the target variable (price)
continuous_cols = [col for col, var_type in num_cols.items() if var_type == 'continuous']
plt.figure(figsize=(16, 10))
for i, col in enumerate(continuous_cols, 1):
    plt.subplot(3, 2, i)
    sns.boxplot(data=df, x=col, color="lightblue")
    plt.title(f"Distribution of {col}", fontsize=11)

plt.tight_layout()
plt.show()

<a id="analysis-of-numerical-distributions"></a>
#### Analysis of Numerical Distributions

Key observations from the plots:
- **Target Variable (Price)**:
  - Highly right-skewed distribution
  - Contains a significant number of outliers in the upper range
  - Most cars are concentrated in the lower price range

- **Mileage**:
  - Right-skewed distribution
  - Large range from nearly new cars to high-mileage vehicles
  - Some outliers in upper range suggesting possible data entry errors

- **Tax**:
  - Multiple peaks suggesting different tax bands
  - Contains negative values which require investigation (possible tax benefits/rebates)
  - Large number of outliers on both ends of the distribution

- **MPG (Miles Per Gallon)**:
  - Approximately normal distribution with slight right skew
  - Some unrealistic extreme values that need cleaning
  - Reasonable median around typical car efficiency ranges

- **Paint Quality %**:
  - Contains values above 100% which are logically impossible
  - Left-skewed distribution suggesting optimistic ratings
  - Requires standardization to 0-100 range

- **Engine Size**:
  - Some engine sizes are zero, which is unrealistic for combustion engines (might indicate electric vehicles)
  - Some unusual patterns that need investigation
  - Contains outliers that may represent specialty vehicles

- **Year**:
  - Should be discrete but contains decimal values

- **Previous Owners**:
  - Should be integer but contains float values
  - Right-skewed distribution as expected
  - Maximum values need validation (unusually high number of previous owners)
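
The logical problems listed above can be counted directly rather than read off the plots. A sketch on made-up rows, assuming the same column names as the dataset:

```python
import pandas as pd

# Toy rows reproducing the kinds of problems listed above (values are made up)
toy = pd.DataFrame({
    "tax": [145.0, -20.0, 30.0, 150.0],
    "paintQuality%": [80.0, 105.0, 60.0, 99.0],
    "year": [2017.0, 2015.5, 2019.0, 2012.0],
    "engineSize": [1.6, 0.0, 2.0, 1.2],
})

# Each check counts the rows that violate one domain constraint
checks = {
    "negative tax": int((toy["tax"] < 0).sum()),
    "paintQuality% outside 0-100": int((~toy["paintQuality%"].between(0, 100)).sum()),
    "non-integer year": int((toy["year"] % 1 != 0).sum()),
    "zero engineSize": int((toy["engineSize"] == 0).sum()),
}
for name, count in checks.items():
    print(f"{name}: {count}")
```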

<a id="pre-processing"></a>
## Pre-processing

<a id="functions"></a>
### Functions

In [None]:
def general_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Perform general data cleaning on the DataFrame.
    
    This function handles logical inconsistencies and data quality issues that
    don't require statistical calculations (mean, median, etc.) to prevent data
    leakage between training and validation sets.
    
    Args:
        df (pd.DataFrame): The input DataFrame containing car data with columns:
            Brand, model, year, transmission, fuelType, mileage, tax, mpg, 
            engineSize, paintQuality%, previousOwners, hasDamage.
        
    Returns:
        pd.DataFrame: The cleaned DataFrame with logical inconsistencies resolved.
    """

    df = df.copy()

    # Set negative values to NaN for features that shouldn't be negative
    for col in ['previousOwners', 'mileage', 'tax', 'mpg', 'engineSize']:
        df.loc[df[col] < 0, col] = np.nan

    for col in ['Brand', 'model', 'transmission', 'fuelType']:
        df[col] = df[col].str.lower()
        df[col] = df[col].replace('', np.nan)

    # Remove decimal part from 'year'
    df['year'] = np.floor(df['year']).astype('Int64')

    # Remove decimal part from 'previousOwners'
    df['previousOwners'] = np.floor(df['previousOwners']).astype('Int64')

    # Ensure 'paintQuality%' is within 0-100
    df.loc[(df['paintQuality%'] < 0) | (df['paintQuality%'] > 100), 'paintQuality%'] = np.nan

    # Fill missing 'hasDamage' with 1
    df['hasDamage'] = df['hasDamage'].fillna(1).astype('Int64')

    return df

In [None]:
def standardize_categorical_col(series: pd.Series, 
                                standardised_cats: list[str], 
                                distance_threshold: int = 2) -> pd.Series:
    """Standardizes a categorical column using edit distance with a threshold.

    1. Maps values to a standard category if they are a likely typo
       (i.e., within the edit distance_threshold).
    2. Keeps values that are already in the standard list.
    3. Groups all other values that don't match into an 'other' bin.
    
    Args:
        series (pd.Series): The categorical column to standardize.
        standardised_cats (list[str]): The list of "good" categories to match against.
        distance_threshold (int): The max edit distance to consider something a typo.
                                  A value of 1 or 2 is recommended.
                                  
    Returns:
        pd.Series: The standardized categorical column.
    """
    
    # 1. Get all unique non-null values from the series
    unique_values = series.dropna().unique()
    
    # 2. Build the mapping dictionary
    mapping = {}
    
    for x in unique_values:
        x_str = str(x)
        
        # Check if it's already a perfect match
        if x_str in standardised_cats:
            mapping[x] = x_str
            continue

        # Find the closest match and its distance
        distances = [nltk.edit_distance(x_str, cat) for cat in standardised_cats]
        min_distance = np.min(distances)
        
        if min_distance <= distance_threshold:
            closest_cat = standardised_cats[np.argmin(distances)]
            mapping[x] = closest_cat
        else:
            mapping[x] = 'other'
            
    return series.map(mapping)

In [None]:
def get_categories_high_freq(series: pd.Series, percent_threshold: float = 0.02) -> list[str]:
    """Get categories that appear more than a dynamic percentage threshold.
    
    Args:
        series (pd.Series): The categorical series to analyze.
        percent_threshold (float): The minimum percentage of total rows a category
                                   must have to be included (e.g., 0.01 for 1%).
                                   
    Returns:
        list[str]: List of categories with frequency above the dynamic threshold.
    """
    
    # Calculate the dynamic count threshold based on the percentage
    dynamic_count_threshold = len(series) * percent_threshold
    
    value_counts = series.value_counts()
    
    # Keep only the categories whose count exceeds the dynamic threshold
    high_freq_cats = value_counts[value_counts > dynamic_count_threshold].index.tolist()
    
    return high_freq_cats
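
A worked example of the threshold arithmetic, using a toy brand column with hypothetical misspellings:

```python
import pandas as pd

# 50 toy rows: two common brands and two rare (possibly misspelled) ones
brands = pd.Series(["ford"] * 30 + ["bmw"] * 18 + ["frd"] + ["bmww"])

percent_threshold = 0.02
count_threshold = len(brands) * percent_threshold  # 50 * 0.02 = 1 row

# Categories must appear strictly more often than the threshold to be kept
counts = brands.value_counts()
high_freq = counts[counts > count_threshold].index.tolist()
print(count_threshold, high_freq)
```

Because the comparison is strict, the two categories that appear exactly once are dropped and only `ford` and `bmw` remain as reference categories.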

In [None]:
def calculate_upper_bound(series: pd.Series) -> float:
    """Calculates the upper outlier bound for a pandas Series."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return Q3 + (1.5 * IQR)
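
As a worked example of the IQR rule used above, consider a small series with one obvious extreme value:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])

q1 = s.quantile(0.25)   # 2.0
q3 = s.quantile(0.75)   # 4.0
iqr = q3 - q1           # 2.0
upper = q3 + 1.5 * iqr  # 4.0 + 3.0 = 7.0

# Values above the bound are flagged as outliers
print(upper, s[s > upper].tolist())
```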

In [None]:
def clean_outliers(series: pd.Series, 
                   upper_bound: float,
                   lower_bound: float = 0.0, 
                   return_missing: bool = True) -> pd.Series:
    """Clean outliers in the Series based on specified bounds.

    This function clips values outside the specified bounds or sets them to NaN.
    
    Args:
        series (pd.Series): The input Series containing numerical data.
        lower_bound (float): The lower bound for valid values.
        upper_bound (float): The upper bound for valid values.
        return_missing (bool): If True, set out-of-bound values to NaN.
                              If False, clip values to the bounds.
    
    Returns:
        pd.Series: The cleaned Series with outliers handled.
    """
    cleaned = series.copy()
    
    if return_missing:
        # Set out-of-bound values to NaN
        cleaned[(cleaned < lower_bound) | (cleaned > upper_bound)] = np.nan
    else:
        # Clip values to the specified bounds
        cleaned = cleaned.clip(lower=lower_bound, upper=upper_bound)
    
    return cleaned
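
The difference between the two modes can be seen on a toy series. This sketch uses plain pandas calls that mirror the two branches rather than calling `clean_outliers` itself:

```python
import pandas as pd

s = pd.Series([-5.0, 10.0, 50.0, 200.0])
lower, upper = 0.0, 100.0

# Mirrors return_missing=True: out-of-bound values become NaN
as_missing = s.where(s.between(lower, upper))

# Mirrors return_missing=False: out-of-bound values are pulled to the nearest bound
clipped = s.clip(lower=lower, upper=upper)

print(as_missing.tolist())
print(clipped.tolist())
```

Setting values to NaN defers the decision to the imputation step, while clipping keeps the row's information that the value was extreme.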


<a id="split-data"></a>
### Split Data

In [None]:
# Apply general cleaning to the entire dataset, before splitting
df_cleaned = general_cleaning(df)

# Split the cleaned data into training and validation sets
X = df_cleaned.drop(columns=["price"])   
y = df_cleaned["price"]
del num_cols['price']
continuous_cols.remove('price')

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Validation set size: {X_val.shape}")


<a id="outliers-and-missing-values"></a>
### Outliers and Missing Values

<a id="categorical-outliers-and-missing-values"></a>
#### Categorical Outliers and Missing Values

In [None]:
# map of categorical columns to their high frequency categories
# Using only the categories in train
high_freq_categories_by_col = {col: get_categories_high_freq(X_train[col]) for col in cat_cols}

for col in cat_cols:
    standardised_cats = high_freq_categories_by_col[col]
    # process training data
    X_train[col] = standardize_categorical_col(X_train[col], standardised_cats)
    X_train[col] = X_train[col].fillna('other')
    
    # process validation data
    X_val[col] = standardize_categorical_col(X_val[col], standardised_cats)
    X_val[col] = X_val[col].fillna('other')

<a id="numerical-outliers"></a>
#### Numerical Outliers

In [None]:
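# Mileage is right-skewed and extreme values may exist, so cap it at the
# 95th percentile of the training set
mileage_upper_bound = X_train['mileage'].quantile(0.95)

# Calculate IQR-based upper bounds for each continuous feature (training set only)
upper_bounds_num_cols = {col: calculate_upper_bound(X_train[col]) for col in continuous_cols}

# Winsorization at the 95th percentile, using the training bound for both sets
X_train['mileage'] = clean_outliers(X_train['mileage'],
                                     lower_bound=0,
                                     upper_bound=mileage_upper_bound,
                                     return_missing=False)
X_val['mileage'] = clean_outliers(X_val['mileage'],
                                   lower_bound=0,
                                   upper_bound=mileage_upper_bound,
                                   return_missing=False)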


# Turn outliers for remaining columns into np.nan
for col in continuous_cols:
    if col == 'mileage':
        continue
    upper_bound = upper_bounds_num_cols[col]
    
    X_train[col] = clean_outliers(X_train[col], upper_bound)
    X_val[col] = clean_outliers(X_val[col], upper_bound)

<a id="numerical-missing-values"></a>
#### Numerical Missing Values

In [None]:
median_num_cols = {col: X_train[col].median() for col in num_cols}

for col in num_cols:
    calc_median = median_num_cols[col]
    X_train[col] = X_train[col].fillna(calc_median)
    X_val[col] = X_val[col].fillna(calc_median)
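
An alternative to the manual loop is scikit-learn's `SimpleImputer`, which is imported at the top of the notebook. The sketch below uses toy data and assumes the same principle as above: medians are learned on the training set only and reused for validation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/validation frames (values are made up)
train = pd.DataFrame({"mpg": [40.0, 50.0, np.nan, 60.0],
                      "tax": [100.0, np.nan, 150.0, 200.0]})
val = pd.DataFrame({"mpg": [np.nan, 45.0],
                    "tax": [np.nan, 30.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                    # medians come from the training set only
filled_val = imputer.transform(val)   # NumPy array with NaN replaced

print(imputer.statistics_)
print(filled_val)
```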


<a id="feature-engineering"></a>
## Feature Engineering

Transform categorical variables into numeric ones using One-Hot Encoding.
One-Hot Encoding creates a separate binary column for each category, avoiding any unintended ordinal relationship among categorical values and allowing the model to interpret each category independently.

In [None]:
categorical_features = cat_cols
numeric_features = list(num_cols.keys()) 

The parameter `handle_unknown="ignore"` ensures that categories not seen during fitting are encoded as all zeros instead of raising an error when the validation set is transformed.

In [None]:
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe_train = pd.DataFrame(
    ohe.fit_transform(X_train[categorical_features]),
    columns=ohe.get_feature_names_out(categorical_features),
    index=X_train.index
)

In [None]:
ohe_val = pd.DataFrame(
    ohe.transform(X_val[categorical_features]),
    columns=ohe.get_feature_names_out(categorical_features),
    index=X_val.index
)

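The numeric variables are meant to be standardized using Z-score normalization through `StandardScaler`, so that each numeric feature has a mean of 0 and a standard deviation of 1 on the training set and all features sit on a comparable scale.
The cell below only concatenates the one-hot encoded columns with the remaining features; the scaling itself is not applied in it.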

In [None]:
X_train = pd.concat([X_train.drop(columns=categorical_features), ohe_train], axis=1)
X_val   = pd.concat([X_val.drop(columns=categorical_features),   ohe_val],   axis=1)
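
The Z-score standardization mentioned above is not applied in the cell itself. One way it could be done is sketched below on toy frames, assuming the scaler is fitted on the training data only and then reused for validation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for the numeric part of X_train / X_val
train = pd.DataFrame({"mileage": [10000.0, 20000.0, 30000.0],
                      "mpg": [40.0, 50.0, 60.0]})
val = pd.DataFrame({"mileage": [20000.0], "mpg": [70.0]})
numeric_features = ["mileage", "mpg"]

scaler = StandardScaler()
train_scaled = train.copy()
val_scaled = val.copy()

# Learn mean and standard deviation on train, then apply the same transform to val
train_scaled[numeric_features] = scaler.fit_transform(train[numeric_features])
val_scaled[numeric_features] = scaler.transform(val[numeric_features])

print(train_scaled.mean().round(6).tolist())
print(val_scaled)
```

Validation values are expressed in training-set standard deviations, so they are not guaranteed to have mean 0 themselves.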

### Correlation Matrix
The correlation matrix below is used to guide both feature selection and feature engineering.

In [None]:
plt.figure(figsize=(10, 8))

# Pearson correlation between the numeric features of the training set
corr = X_train[list(num_cols.keys())].corr(method="pearson")

sns.heatmap(data=corr, annot=True)
plt.show()

## Observations
- Most correlations are weak, meaning the variables share little linear relationship (a weak correlation does not by itself imply independence)

---

## Insights

### **Mileage vs Year (-0.68)**
- **Strong negative correlation.**
- Newer cars (higher `year`) tend to have lower mileage.
- These two features describe similar aspects of car age/usage.
- **Potential issue:** Including both in models may cause multicollinearity.

---

### **Tax vs mpg (-0.51)**
- **Moderate negative correlation.**
- Cars with better fuel efficiency (higher mpg) tend to have lower taxes, likely due to emissions-based tax systems.

---

### **Tax vs Year (0.35)**
- **Weak positive correlation.**
- Newer cars may have slightly higher taxes, possibly reflecting newer model valuations or updated emission standards.

---

### **Mileage vs mpg (0.25)**
- **Weak positive correlation.**
- Slightly counterintuitive; it could suggest that cars with better mpg are driven more (for example, as daily vehicles).

---

### **Other variables**
- `paintQuality%`, `engineSize`, and `previousOwners` show very low correlations (~0) with other variables.
- These likely capture unique information and should be kept for modeling.

---

## Feature Selection Opinion

- Keep `engineSize`, `paintQuality%`, `previousOwners`: low correlation with the other features, so they likely carry independent information

---

## Feature Engineering Opinion

- **CarAge** (`2020 - year`): Converts `year` into a more interpretable, continuous measure of vehicle age.
- **UsageRate** (`mileage / CarAge`): Represents how much the car is driven per year, a better proxy for wear and tear. Undefined when `CarAge` is 0.
- **EfficiencyScore** (`mpg / tax`): Combines efficiency and cost factors; the hypothesis is that higher efficiency and lower taxes lead to a higher price. Undefined when `tax` is 0.
- **OwnerTurnover** (`previousOwners / CarAge`): Indicates how frequently ownership changes, which may relate to reliability or desirability. Undefined when `CarAge` is 0.
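
The proposed features could be computed as sketched below. The toy values are made up, `REFERENCE_YEAR` is a hypothetical constant name for the `2020` in the formula above, and replacing zero denominators with NaN is an assumption about how to handle division by zero:

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2020  # hypothetical name for the constant in `2020 - year`

toy = pd.DataFrame({
    "year": [2020, 2015, 2010],
    "mileage": [500.0, 50000.0, 120000.0],
    "mpg": [55.0, 48.0, 35.0],
    "tax": [0.0, 125.0, 250.0],
    "previousOwners": [0, 1, 3],
})

toy["CarAge"] = REFERENCE_YEAR - toy["year"]

# Replace zero denominators with NaN so the ratios do not become inf
age = toy["CarAge"].replace(0, np.nan)
tax = toy["tax"].replace(0, np.nan)

toy["UsageRate"] = toy["mileage"] / age
toy["EfficiencyScore"] = toy["mpg"] / tax
toy["OwnerTurnover"] = toy["previousOwners"] / age

print(toy[["CarAge", "UsageRate", "EfficiencyScore", "OwnerTurnover"]])
```

The resulting NaN values would then need the same train-only imputation as the other numeric columns.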
