## Multi-Model Training & Comparison (Random Forest, Logistic Regression, SVM)

In this section, we train and compare multiple machine learning models to classify individuals into different **Risk Profiles**. The models used are:

- **Random Forest Classifier**
- **Logistic Regression** (Multinomial)
- **Support Vector Machine (SVM)**

### 🔧 Data Preparation & Preprocessing

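- Column names were **cleaned**: surrounding whitespace is stripped, and spaces and hyphens are replaced with underscores for consistency.
- The data comes pre-split in `train.csv` and `test.csv`; the features and the `Risk_Profile` target were separated into:
  - `X_train`, `y_train` for training
  - `X_test`, `y_test` for evaluation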

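- **Categorical columns** were one-hot encoded using `OneHotEncoder` with `handle_unknown='ignore'`, so categories that appear only in the test set are encoded as all zeros instead of raising an error
- **Numerical columns** were standardized using `StandardScaler`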

A `ColumnTransformer` was used to handle preprocessing separately for categorical and numerical features.

### 🤖 Model Pipelines

Each model was embedded into a **Pipeline** that includes:
1. Preprocessing (scaling + encoding)
2. Classifier (RandomForest, LogisticRegression, or SVM)

This ensures that the preprocessing is fitted on the training data only and that exactly the same transformation is applied at prediction time.
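
A small sketch of how such a pipeline behaves, using synthetic data and a single scaler in place of the full preprocessor (the arrays and the `toy_` names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: two well-separated classes on one feature
X_toy_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy_train = np.array([0, 0, 0, 1, 1, 1])
X_toy_test = np.array([[0.0], [13.0]])

toy_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(X_toy_train, y_toy_train)

# The scaler's statistics come from the training rows only
learned_mean = toy_pipeline.named_steps["preprocessor"].mean_[0]
print(learned_mean)

# predict() applies the already-fitted scaler, then the classifier
toy_predictions = toy_pipeline.predict(X_toy_test)
print(toy_predictions)
```

Because the scaler lives inside the pipeline, there is no separate step in which test data could accidentally be used to fit it.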

### 🏋️ Model Training & Evaluation

Each model was trained using `.fit()` on the training set and evaluated on the test set using:

- **Accuracy Score**
- **Classification Report** (Precision, Recall, F1-score)
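
To make the two metrics concrete, here is a short sketch on hand-written labels. The class names and predictions are hypothetical and chosen only so the numbers are easy to check by hand.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels, for illustration only
y_true = ["Low", "Low", "High", "High", "Medium", "Medium"]
y_hat = ["Low", "Low", "High", "Medium", "Medium", "Medium"]

toy_accuracy = accuracy_score(y_true, y_hat)  # fraction of exact matches
toy_report = classification_report(y_true, y_hat, output_dict=True)

print(f"Accuracy: {toy_accuracy:.4f}")
print(f"Recall for 'High': {toy_report['High']['recall']:.2f}")
print(f"Precision for 'Medium': {toy_report['Medium']['precision']:.2f}")
```

Accuracy summarizes all classes in one number, while the per-class precision and recall show which classes the errors fall on.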

### ✅ Results Summary

After all models are trained and evaluated, the one with the highest test-set **accuracy** is reported as the best model.
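
The selection rule can be sketched in a few lines. The accuracies below are made-up placeholders, not results from this notebook, and `toy_results` is a hypothetical stand-in for the results dictionary built during training.

```python
# Hypothetical accuracies, for illustration only (not results from this notebook)
toy_results = {
    "RandomForest": {"accuracy": 0.61},
    "LogisticRegression": {"accuracy": 0.58},
    "SVM": {"accuracy": 0.61},
}

# max() returns the first key with the highest accuracy, so ties go to the
# model that appears first in the dictionary
toy_best = max(toy_results, key=lambda name: toy_results[name]["accuracy"])
print(toy_best, toy_results[toy_best]["accuracy"])
```

This tie-breaking matches a loop that only replaces the current best on a strictly greater accuracy.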


In [1]:
# Basic imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Warnings and display
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
# Load the pre-split training and test sets
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")


In [None]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import pandas as pd
import numpy as np



# Clean column names
train.columns = train.columns.str.strip().str.replace(' ', '_').str.replace('-', '_')
test.columns = test.columns.str.strip().str.replace(' ', '_').str.replace('-', '_')

# Verify cleaned columns
print("Train columns:", train.columns.tolist())
print("Test columns:", test.columns.tolist())

# Define features and target USING CLEANED NAMES
X_train = train.drop('Risk_Profile', axis=1)
y_train = train['Risk_Profile']
X_test = test.drop('Risk_Profile', axis=1)
y_test = test['Risk_Profile']

# Identify categorical columns - USING ACTUAL CLEANED NAMES FROM THE DATA
categorical_cols = [
    'Gender', 'Marital_Status', 'Education_Level', 'Occupation', 'Housing_Status',
    'City_or_Region_of_Residence', 'Previous_Bankruptcy_Status', 'Health_Condition',
    'Family_Health_History', 'Residency_Stability', 'Financial_Stability_of_Parents',
    'Tax_Filing_History', 'Utility_Bills_Payment_History', 'Job_Loss', 'Divorce_History',
    'Major_Medical_Emergency', 'Adoption_History', 'Bankruptcy_History', 'Health_related_Legal_Claims',
    'Domestic_or_International_Relocation', 'Economic_Sentiment', 'Financial_Planner_Involvement',
    'Life_Insurance_Adequacy', 'Long_term_Financial_Goals'
]

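# Numerical columns: every remaining feature (the target was already dropped from X_train).
# This assumes all columns not listed above are numeric; StandardScaler raises on string columns.
numerical_cols = [col for col in X_train.columns
                  if col not in categorical_cols]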

# Verifying missing columns
missing_cat = [col for col in categorical_cols if col not in X_train.columns]
if missing_cat:
    print(f"⚠️ Missing categorical columns: {missing_cat}")
    categorical_cols = [col for col in categorical_cols if col in X_train.columns]

print("\nUsing categorical columns:", categorical_cols)
print("Using numerical columns:", numerical_cols)

# Preprocessing
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
])

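# Defining models
models = {
    'RandomForest': RandomForestClassifier(random_state=42, n_jobs=-1),
    # With the default lbfgs solver, LogisticRegression already fits a multinomial
    # model for multiclass targets; the explicit `multi_class` argument is deprecated
    # in recent scikit-learn releases, so it is omitted here.
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1),
    'SVM': SVC(probability=True, random_state=42)
}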

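# Train and evaluate models
from sklearn.base import clone  # used to give each pipeline its own preprocessor

results = {}
best_accuracy = 0
best_model = None

for name, model in models.items():
    try:
        print(f"\nTraining {name}...")
        # Clone the preprocessor so the pipelines stored in `results`
        # do not all share (and refit) the same transformer object
        pipeline = Pipeline([
            ('preprocessor', clone(preprocessor)),
            ('classifier', model)
        ])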
        pipeline.fit(X_train, y_train)
        
        y_pred = pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = {
            'model': pipeline,
            'accuracy': accuracy,
            'report': classification_report(y_test, y_pred)
        }
        
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_model = name
            
        print(f"{name} trained successfully | Accuracy: {accuracy:.4f}")
    except Exception as e:
        print(f"❌ {name} failed: {str(e)}")

# Display results
print("\n" + "="*50)
for name, res in results.items():
    print(f"\n{name} Performance:")
    print(f"Accuracy: {res['accuracy']:.4f}")
    print("Classification Report:")
    print(res['report'])

if best_model:
    print("\n" + "="*50)
    print(f"🏆 BEST MODEL: {best_model} (Accuracy: {best_accuracy:.4f})")
    print("="*50)
else:
    print("\nNo models trained successfully")


Train columns: ['Age', 'Gender', 'Marital_Status', 'Number_of_Dependents', 'Household_Size', 'Education_Level', 'Occupation', 'Years_in_Current_Job', 'Income_Level', 'Credit_Score', 'Number_of_Credit_Inquiries', 'Housing_Status', 'City_or_Region_of_Residence', 'Previous_Bankruptcy_Status', 'Health_Condition', 'Family_Health_History', 'Marital_History', 'Residency_Stability', 'Financial_Stability_of_Parents', 'Average_Monthly_Expenses', 'Credit_Card_Usage', 'Savings_Rate', 'Number_of_Loans_Taken', 'Mortgage_Information', 'Investment_Accounts', 'Emergency_Fund_Status', 'Loan_Delinquencies_History', 'Bank_Account_Activity', 'Tax_Filing_History', 'Utility_Bills_Payment_History', 'Number_of_Credit_Cards_Held', 'Job_Loss', 'Divorce_History', 'Major_Medical_Emergency', 'Adoption_History', 'Bankruptcy_History', 'Health_related_Legal_Claims', 'Domestic_or_International_Relocation', 'Local_Unemployment_Rate', 'Inflation_Rate', 'Interest_Rates', 'Economic_Sentiment', 'Risk_Tolerance', 'Financial_