# Feature Engineering & Preprocessing

Engineer features from `MachineLearningRating_v3.csv` and preprocess the result for machine learning tasks.

**Steps**:
- Engineer features (e.g., `LossRatio`, time-based features).
- Encode categorical variables.
- Handle missing values and outliers.
- Visualize engineered features.
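
The engineered features referenced later in this notebook can be illustrated on a toy frame. This is a minimal sketch, not the code in `src.data_preprocessor`. The input columns `TotalPremium`, `TransactionMonth`, and `make`, the function name `engineer_features`, and the formulas themselves (loss ratio as claims over premium, `HasClaim` as claims above zero, `AvgClaimsByMake` as a per-make mean) are all assumptions.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature step; column names and formulas are assumptions."""
    out = df.copy()

    # LossRatio: claims per unit of premium. A zero premium is mapped to NaN
    # first so the division never produces inf.
    premium = out["TotalPremium"].replace(0, np.nan)
    out["LossRatio"] = out["TotalClaims"] / premium

    # Time-based feature: calendar year of the transaction.
    out["TransactionMonth"] = pd.to_datetime(out["TransactionMonth"])
    out["TransactionYear"] = out["TransactionMonth"].dt.year

    # Binary flag: 1 if the policy row has any claim amount, else 0.
    out["HasClaim"] = (out["TotalClaims"] > 0).astype(int)

    # Group-level feature: mean claim amount for each vehicle make,
    # broadcast back to every row of that make.
    out["AvgClaimsByMake"] = out.groupby("make")["TotalClaims"].transform("mean")
    return out


# Toy data standing in for the real CSV (values are made up).
sample = pd.DataFrame({
    "TransactionMonth": ["2014-03-01", "2015-08-01", "2015-08-01", "2014-11-01"],
    "make": ["TOYOTA", "TOYOTA", "BMW", "BMW"],
    "TotalPremium": [100.0, 0.0, 250.0, 50.0],
    "TotalClaims": [50.0, 0.0, 0.0, 200.0],
})
features = engineer_features(sample)
print(features[["LossRatio", "TransactionYear", "HasClaim", "AvgClaimsByMake"]])
```

Mapping zero premiums to `NaN` is one possible choice. It keeps `LossRatio` finite, but it also creates missing values that the later missing-value step has to handle.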

In [1]:
import os
import sys
sys.path.append(os.path.abspath('..'))  # Add the project root so the src package is importable


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.data_preprocessor import load_and_preprocess
from src.utils import save_plot

%matplotlib inline
plt.style.use('seaborn-v0_8')
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

# Load and preprocess data
df, scaler = load_and_preprocess('../data/MachineLearningRating_v3.csv')


## Feature Engineering Summary

In [None]:
print("Data Shape after Preprocessing:", df.shape)
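# Read only the header row of the raw file to recover the original column names
raw_columns = pd.read_csv('../data/MachineLearningRating_v3.csv', nrows=0).columns
print("\nNew Features:\n", [col for col in df.columns if col not in raw_columns])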
print("\nDescriptive Statistics of Engineered Features:\n", df[['LossRatio', 'TransactionYear', 'HasClaim', 'AvgClaimsByMake']].describe())

## Visualization

In [None]:
# Histogram: LossRatio
plt.figure(figsize=(10, 6))
sns.histplot(df['LossRatio'].dropna(), bins=50, color='teal')
plt.title('Distribution of Loss Ratio')
plt.xlabel('Loss Ratio')
plt.ylabel('Frequency')
save_plot('loss_ratio_hist.png')

# Box plot: TotalClaims by HasClaim
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='HasClaim', y='TotalClaims', palette='Pastel1')
plt.title('Total Claims by Claim Status')
save_plot('total_claims_by_hasclaim.png')

## Data Quality Check

In [None]:
print("\nMissing Values after Preprocessing:\n", df.isnull().sum())
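
The missing-value count above can be extended to cover infinite values and outliers. The sketch below is a minimal illustration on made-up data. The helper name `quality_report` and the 1.5 × IQR rule are assumptions for this example, not the checks implemented in `src.data_preprocessor`.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: per-column missing, infinite, and IQR-outlier counts."""
    numeric = df[columns].astype(float)

    # Exclude inf from the quantile computation so it cannot distort the fences.
    finite = numeric.replace([np.inf, -np.inf], np.nan)
    q1 = finite.quantile(0.25)
    q3 = finite.quantile(0.75)
    iqr = q3 - q1

    # Flag finite values outside the 1.5 * IQR fences (NaN compares as False).
    outliers = (finite < q1 - 1.5 * iqr) | (finite > q3 + 1.5 * iqr)

    return pd.DataFrame({
        "missing": numeric.isnull().sum(),
        "infinite": np.isinf(numeric).sum(),
        "iqr_outliers": outliers.sum(),
    })


# Toy data standing in for the preprocessed frame (values are made up).
toy = pd.DataFrame({
    "LossRatio": [0.0, 0.2, 0.1, 0.3, 25.0, np.nan, np.inf],
    "TotalClaims": [0.0, 0.0, 0.0, 0.0, 5000.0, 0.0, 0.0],
})
report = quality_report(toy, ["LossRatio", "TotalClaims"])
print(report)
```

When a column is mostly zeros, as claim amounts often are, the IQR collapses to zero and every non-zero value is flagged. An IQR rule may therefore be too aggressive for such columns.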