# Feature Engineering & Preprocessing

Engineer features from `MachineLearningRating_v3.csv` and preprocess the data for machine learning tasks.

**Steps**:
- Engineer features (e.g., `LossRatio`, time-based features).
- Encode categorical variables.
- Handle missing values and outliers.
- Visualize engineered features.
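The feature-engineering step above can be sketched as follows. This is a minimal illustration, not the code in `src.data_preprocessor`, which this notebook does not show. The helper name `engineer_features`, the `TotalPremium` column, and the exact definitions (loss ratio as claims divided by premium, `HasClaim` as `TotalClaims > 0`, `AvgClaimsByMake` as the per-make mean of `TotalClaims`) are assumptions.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: adds time-based and claim-based features."""
    out = df.copy()

    # Time-based features derived from the transaction date.
    month = pd.to_datetime(out["TransactionMonth"], errors="coerce")
    out["TransactionYear"] = month.dt.year
    out["TransactionMonthNum"] = month.dt.month
    out["TransactionQuarter"] = month.dt.quarter

    # Assumed definition: loss ratio = claims / premium.
    # A zero premium is mapped to NaN so the ratio is NaN instead of inf.
    premium = out["TotalPremium"].replace(0, np.nan)
    out["LossRatio"] = out["TotalClaims"] / premium

    # Assumed definition: 1 if the row carries a positive claim amount.
    out["HasClaim"] = (out["TotalClaims"] > 0).astype(int)

    # Assumed definition: mean claim amount across all rows of the same make.
    out["AvgClaimsByMake"] = out.groupby("make")["TotalClaims"].transform("mean")
    return out


# Small made-up frame, for illustration only.
demo = pd.DataFrame({
    "TransactionMonth": ["2015-03-01", "2015-08-01", "2014-11-01"],
    "TotalPremium": [100.0, 0.0, 50.0],
    "TotalClaims": [50.0, 0.0, 0.0],
    "make": ["TOYOTA", "TOYOTA", "BMW"],
})
features = engineer_features(demo)
print(features[["LossRatio", "TransactionQuarter", "HasClaim", "AvgClaimsByMake"]])
```

Mapping zero premiums to NaN would be one way to end up with fewer non-null `LossRatio` values than rows, but whether the real preprocessor does this is not confirmed here.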

In [1]:
import os
import sys
sys.path.append(os.path.abspath('..'))  # Add the project root to sys.path so the `src` package can be imported


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_preprocessor import load_and_preprocess
from src.utils import save_plot

%matplotlib inline
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

# Load and preprocess data
df, scaler = load_and_preprocess('../data/MachineLearningRating_v3.csv')

## Feature Engineering Summary

In [2]:
# Read only the header row of the raw file; re-reading the full CSV inside the comprehension would reload it once per column
raw_columns = pd.read_csv('../data/MachineLearningRating_v3.csv', nrows=0).columns
print("Data Shape after Preprocessing:", df.shape)
print("\nNew Features:\n", [col for col in df.columns if col not in raw_columns])
print("\nDescriptive Statistics of Engineered Features:\n", df[['LossRatio', 'TransactionYear', 'HasClaim', 'AvgClaimsByMake']].describe())

Data Shape after Preprocessing: (1000098, 69)

New Features:
 ['TransactionYear', 'LossRatio', 'TransactionMonthNum', 'TransactionQuarter', 'HasClaim', 'AvgClaimsByMake', 'Province_Free State', 'Province_Gauteng', 'Province_KwaZulu-Natal', 'Province_Limpopo', 'Province_Mpumalanga', 'Province_North West', 'Province_Northern Cape', 'Province_Western Cape', 'Gender_Male', 'VehicleType_Heavy Commercial', 'VehicleType_Light Commercial', 'VehicleType_Medium Commercial', 'VehicleType_Passenger Vehicle', 'make_freq', 'Model_freq']

Descriptive Statistics of Engineered Features:
            LossRatio  TransactionYear      HasClaim  AvgClaimsByMake
count  618464.000000     1.000098e+06  1.000098e+06    999546.000000
mean        0.349885     2.014754e+03  2.787727e-03        64.340071
std         9.286479     4.370288e-01  5.272531e-02        16.137360
min       -18.700121     2.013000e+03  0.000000e+00         0.000000
25%         0.000000     2.015000e+03  0.000000e+00        63.626435
50%     
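The `Province_*`, `Gender_Male`, `VehicleType_*`, `make_freq`, and `Model_freq` columns in the new-feature list suggest a mix of one-hot and frequency encoding. The sketch below shows one way to produce columns of that shape. It is an assumption about the approach, not the actual `load_and_preprocess` implementation; the helper name `encode_categoricals` and the choice of relative frequencies (rather than raw counts) are hypothetical.

```python
import pandas as pd


def encode_categoricals(df, one_hot_cols, freq_cols):
    """Hypothetical helper: frequency-encode some columns, one-hot the rest."""
    out = df.copy()

    # Frequency encoding: add a `<col>_freq` column holding each category's
    # share of rows. The original column is kept; missing values stay NaN.
    for col in freq_cols:
        freq = out[col].value_counts(normalize=True)
        out[f"{col}_freq"] = out[col].map(freq)

    # One-hot encoding: the original columns are replaced by indicator
    # columns, and drop_first=True drops the first (sorted) level of each.
    out = pd.get_dummies(out, columns=one_hot_cols, drop_first=True, dtype=int)
    return out


# Small made-up frame, for illustration only.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "make": ["TOYOTA", "TOYOTA", "BMW"],
})
encoded = encode_categoricals(demo, one_hot_cols=["Gender"], freq_cols=["make"])
print(encoded)
```

Dropping the first level avoids perfectly collinear indicator columns in linear models. The new-feature list contains `Gender_Male` but no `Gender_Female`, which is consistent with this choice, although the notebook does not confirm it.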

## Visualization

In [3]:
# Histogram: LossRatio
plt.figure(figsize=(10, 6))
sns.histplot(df['LossRatio'].dropna(), bins=50, color='teal')
plt.title('Distribution of Loss Ratio')
plt.xlabel('Loss Ratio')
plt.ylabel('Frequency')
save_plot('loss_ratio_hist.png')

# Box plot: TotalClaims by HasClaim
plt.figure(figsize=(10, 6))
# Assign `hue` explicitly: passing `palette` without `hue` is deprecated in seaborn and will be removed in v0.14.0
sns.boxplot(data=df, x='HasClaim', y='TotalClaims', hue='HasClaim', palette='Pastel1', legend=False)
plt.title('Total Claims by Claim Status')
save_plot('total_claims_by_hasclaim.png')


## Data Quality Check

In [4]:
print("\nMissing Values after Preprocessing:\n", df.isnull().sum())


Missing Values after Preprocessing:
 UnderwrittenCoverID                    0
PolicyID                               0
TransactionMonth                       0
IsVATRegistered                        0
Citizenship                            0
LegalType                              0
Title                                  0
Language                               0
Bank                              145961
AccountType                        40232
MaritalStatus                     994467
Country                                0
PostalCode                             0
MainCrestaZone                         0
SubCrestaZone                          0
ItemType                               0
mmcode                               552
RegistrationYear                       0
make                                 552
Model                                552
Cylinders                            552
cubiccapacity                        552
kilowatts                            552
bodytype           
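Raw counts from `df.isnull().sum()` are hard to scan across dozens of columns. A sorted report with percentages makes it easier to spot columns such as `MaritalStatus`, which the output above shows as missing in 994467 of 1000098 rows (about 99.4%). The helper below is a minimal sketch; the name `missing_report` is hypothetical and it is not part of `src`.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": 100 * counts / len(df),
    })
    # Keep only columns that have at least one missing value.
    report = report[report["missing"] > 0]
    return report.sort_values("missing", ascending=False)


# Small made-up frame, for illustration only.
demo = pd.DataFrame({
    "a": [1.0, None, None, 4.0],
    "b": [1, 2, 3, 4],
    "c": [None, "x", "y", "z"],
})
report = missing_report(demo)
print(report)
```

On the real data this would be called as `missing_report(df)`. Whether a mostly empty column should be dropped or imputed depends on the modelling task and is not decided here.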