# Feature Engineering & Preprocessing

Engineer features from `MachineLearningRating_v3.csv` and preprocess the data for machine learning tasks.

**Steps**:
- Engineer features (e.g., `LossRatio`, time-based features).
- Encode categorical variables.
- Handle missing values and outliers.
- Visualize engineered features.
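
The engineered columns referenced later in this notebook (`LossRatio`, `TransactionYear`, `HasClaim`, `AvgClaimsByMake`) are produced inside `load_and_preprocess`, whose source is not shown here. The block below is a minimal sketch of how such features could be derived on a toy frame. It is not the project's actual implementation: the `engineer_features` helper is hypothetical, and the `TotalPremium` column and the formula `LossRatio = TotalClaims / TotalPremium` are assumptions.

```python
import pandas as pd

# Toy records. `TotalPremium` is an assumed column name; the others appear in this notebook.
raw = pd.DataFrame({
    "TransactionMonth": ["2015-01-01", "2015-02-01", "2015-02-01", "2015-03-01"],
    "make": ["TOYOTA", "TOYOTA", "BMW", "BMW"],
    "TotalPremium": [100.0, 0.0, 200.0, 50.0],
    "TotalClaims": [50.0, 0.0, 0.0, 100.0],
})


def engineer_features(df):
    """Hypothetical helper: add time-based, claim-flag, ratio, and group-mean features."""
    out = df.copy()

    # Time-based feature: calendar year of the transaction.
    out["TransactionYear"] = pd.to_datetime(out["TransactionMonth"]).dt.year

    # Binary flag: 1 if the row has a positive claim amount, else 0.
    out["HasClaim"] = (out["TotalClaims"] > 0).astype(int)

    # Loss ratio (assumed definition). Zero premiums become NaN to avoid dividing by zero.
    premium = out["TotalPremium"].where(out["TotalPremium"] != 0)
    out["LossRatio"] = out["TotalClaims"] / premium

    # Group-level feature: mean claim amount per vehicle make.
    out["AvgClaimsByMake"] = out.groupby("make")["TotalClaims"].transform("mean")
    return out


features = engineer_features(raw)
print(features[["LossRatio", "TransactionYear", "HasClaim", "AvgClaimsByMake"]])
```

Mapping zero premiums to `NaN` leaves the decision about those rows to the later imputation step instead of producing infinite ratios.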

In [1]:
import os
import sys
sys.path.append(os.path.abspath('..'))  # Add the project root so the `src` package is importable


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_preprocessor import load_and_preprocess
from src.utils import save_plot

%matplotlib inline
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

# Load and preprocess data
df, scaler = load_and_preprocess('../data/MachineLearningRating_v3.csv')

## Feature Engineering Summary

In [2]:
print("Data Shape after Preprocessing:", df.shape)
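# Read only the header row (nrows=0) once, instead of re-reading the full CSV for every column
raw_columns = pd.read_csv('../data/MachineLearningRating_v3.csv', nrows=0).columns
print("\nNew Features:\n", [col for col in df.columns if col not in raw_columns])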
print("\nDescriptive Statistics of Engineered Features:\n", df[['LossRatio', 'TransactionYear', 'HasClaim', 'AvgClaimsByMake']].describe())

Data Shape after Preprocessing: (1000098, 71)


KeyboardInterrupt: 

## Visualization

In [3]:
# Histogram: LossRatio
plt.figure(figsize=(10, 6))
sns.histplot(df['LossRatio'].dropna(), bins=50, color='teal')
plt.title('Distribution of Loss Ratio')
plt.xlabel('Loss Ratio')
plt.ylabel('Frequency')
save_plot('loss_ratio_hist.png')

# Box plot: TotalClaims by HasClaim
plt.figure(figsize=(10, 6))
# Assign `hue` and disable the legend: passing `palette` without `hue` is deprecated in seaborn
sns.boxplot(data=df, x='HasClaim', y='TotalClaims', hue='HasClaim', palette='Pastel1', legend=False)
plt.title('Total Claims by Claim Status')
save_plot('total_claims_by_hasclaim.png')


## Data Quality Check

In [4]:
# 1. Compute the missing-value summary once, then print the full table and the top 10
missing_summary = df.isnull().sum()
print("\nMissing Values after Preprocessing:\n", missing_summary)
print("Missing values per column (top 10):\n", missing_summary[missing_summary > 0].sort_values(ascending=False).head(10))


Missing Values after Preprocessing:
 UnderwrittenCoverID                    0
PolicyID                               0
TransactionMonth                       0
IsVATRegistered                        0
Citizenship                            0
LegalType                              0
Title                                  0
Language                               0
Bank                              145961
AccountType                        40232
MaritalStatus                     994467
Country                                0
PostalCode                             0
MainCrestaZone                         0
SubCrestaZone                          0
ItemType                               0
mmcode                               552
RegistrationYear                       0
make                                 552
Model                                552
Cylinders                            552
cubiccapacity                        552
kilowatts                            552
bodytype           

In [5]:
# 2. Drop columns with >95% missing values (adjust the threshold as needed)
high_missing = missing_summary[missing_summary > 0]
drop_cols = high_missing[high_missing > 0.95 * len(df)].index.tolist()
if drop_cols:
    print(f"Dropping columns with >95% missing values: {drop_cols}")
    df = df.drop(columns=drop_cols)

# 3. Impute remaining missing values
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna('Unknown')
    else:
        df[col] = df[col].fillna(df[col].median())

# 4. Encode categorical variables (if not already encoded)
cat_cols = df.select_dtypes(include=['object', 'category']).columns
if len(cat_cols) > 0:
    print(f"Encoding categorical columns: {list(cat_cols)}")
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# 5. Final check: ensure no missing values remain
assert df.isnull().sum().sum() == 0, "There are still missing values in the data!"

# 6. Save processed data for modeling
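processed_path = '../data/processed/processed_data.csv'
os.makedirs(os.path.dirname(processed_path), exist_ok=True)  # Create the output folder if it does not exist
df.to_csv(processed_path, index=False)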
print(f"Processed data saved to {processed_path}")

# 7. Print final shape and column count
print(f"Final processed data shape: {df.shape}")
print(f"Number of features: {len(df.columns)}")

Dropping columns with >95% missing values: ['MaritalStatus', 'NumberOfVehiclesInFleet']
Encoding categorical columns: ['Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'Country', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'make', 'Model', 'bodytype', 'VehicleIntroDate', 'AlarmImmobiliser', 'TrackingDevice', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'TermFrequency', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass', 'StatutoryRiskType']
Processed data saved to ../data/processed/processed_data.csv
Final processed data shape: (1000098, 860)
Number of features: 860
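
To see what the drop, impute, and encode steps do without running them on a million rows, they can be tried on a toy frame. The block below is a sketch under stated assumptions: `drop_impute_encode` and its `max_missing_frac` parameter are hypothetical names, the toy values are invented, and every non-numeric column is treated as categorical (which would not suit datetime columns).

```python
import numpy as np
import pandas as pd


def drop_impute_encode(df, max_missing_frac=0.95):
    """Hypothetical helper mirroring steps 2-4 above on an arbitrary frame."""
    out = df.copy()

    # Step 2: drop columns whose fraction of missing values exceeds the threshold.
    missing_frac = out.isnull().mean()
    out = out.drop(columns=missing_frac[missing_frac > max_missing_frac].index)

    # Step 3: median for numeric columns, the label 'Unknown' for everything else.
    cat_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    for col in out.columns:
        if col in cat_cols:
            out[col] = out[col].fillna("Unknown")
        else:
            out[col] = out[col].fillna(out[col].median())

    # Step 4: one-hot encode, dropping the first level of each column.
    return pd.get_dummies(out, columns=cat_cols, drop_first=True)


# Invented toy data; a low threshold is used so the sparse column is dropped.
toy = pd.DataFrame({
    "MaritalStatus": [None, None, None, "Single"],  # 75% missing
    "Bank": ["A", None, "B", "A"],
    "Cylinders": [4.0, np.nan, 6.0, 4.0],
})

processed = drop_impute_encode(toy, max_missing_frac=0.5)
print(processed)
```

Because each categorical column contributes one output column per level (minus one), high-cardinality columns such as `Model` account for most of the growth from 71 to 860 columns reported above.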
