# Advanced Research & Dash App


# Dash App Creation

## 💡 Feature: Guided User Input with Validation Hints

To improve user experience and reduce input errors in the loan default prediction app, we implemented the following enhancements:

### 1. Data Type Detection
- For each input feature, we determine its **expected data type** from the training dataset.
- This allows us to set suitable **placeholder values** and define meaningful **tooltip hints**.

### 2. Placeholder Hints
- Each input field displays `"Enter Value (e.g., 15000)"` as a placeholder.
- The example is dynamically generated based on the actual data type or common value range.

### 3. Mouse Hover Tooltip
- When a user hovers over a field, a tooltip appears showing a **detailed description or example** for the input.
- This helps prevent format errors (e.g., entering strings where integers are expected).
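
A minimal sketch of steps 1 to 3, assuming the training data is available as a pandas DataFrame. The helper name `build_input_hints` and the sample values are hypothetical, not taken from the app's code:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def build_input_hints(train_df: pd.DataFrame) -> dict:
    """Derive an input type, placeholder, and tooltip for each column."""
    hints = {}
    for col in train_df.columns:
        non_null = train_df[col].dropna()
        # Use the first observed value as the example shown to the user.
        example = non_null.iloc[0] if not non_null.empty else ""
        input_type = "number" if is_numeric_dtype(train_df[col]) else "text"
        hints[col] = {
            "type": input_type,
            "placeholder": f"Enter Value (e.g., {example})",
            "tooltip": f"Example: {example}",
        }
    return hints


# Made-up sample rows standing in for the training dataset.
sample = pd.DataFrame({
    "Loan Amount": [12000, 15000],
    "Verification Status": ["Verified", "Not Verified"],
})
hints = build_input_hints(sample)
```

Each entry in `hints` can then be used when the corresponding input field is created.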

### 4. Optional Future Enhancements
- Input validation that restricts typing to allowed formats.
- Auto-suggestion or drop-downs for categorical fields.

This combination ensures a cleaner UI and smarter form input handling for model deployment.

# 🎛️ Input UI Design for Dash App

## 🎯 Objective

To enhance user experience and ensure data integrity, we aim to create a user-friendly input form where each input field:
- Accepts only valid values based on the feature's data type.
- Provides a default prompt like "Enter Value".
- Shows an example tooltip on hover for guidance.

## 🧠 How It Works

1. **7x5 Grid Layout**:
   - Inputs are arranged in a grid layout of 7 rows and 5 columns.
   - This ensures compactness and better use of screen space.

2. **Dynamic Tooltips**:
   - Each input field has a tooltip that displays an example value extracted from the training dataset.
   - Hovering over the input field will show guidance like:  
     _"Example: 12000"_ or _"Example: Verified"_.

3. **Input Type Enforcement**:
   - Numerical fields use `type='number'` to prevent invalid entries.
   - Categorical fields use dropdowns or text with validation logic in the backend.
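
The grid arrangement can be sketched without any Dash code, assuming 35 input features (7 × 5). The helper `to_grid` and the placeholder feature names are illustrative only:

```python
def to_grid(feature_names, n_cols=5):
    """Split a flat list of feature names into rows of at most n_cols items."""
    return [
        feature_names[i:i + n_cols]
        for i in range(0, len(feature_names), n_cols)
    ]


# Hypothetical feature names; the real ones come from the training data.
features = [f"feature_{i}" for i in range(1, 36)]
grid = to_grid(features)
```

Each inner list corresponds to one row of the form, so a layout function can loop over `grid` and emit one row container per list.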

## ✅ Benefits

- Catches many input errors at the UI level, before they reach the model.
- Reduces backend validation complexity.
- Should improve form completion and usability.

## 🔍 Example Input Field (in Dash)
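```python
from dash import dcc, html

# `title` is a standard HTML attribute; wrapping the input in html.Div
# attaches the hover tooltip without relying on dcc.Input supporting it.
html.Div(
    dcc.Input(
        id='loan_amount',
        type='number',
        placeholder='Enter Loan Amount',
        debounce=True,
    ),
    title='Example: 12000',
)
```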

# 🧠 Why the Dash App Fails Without Preprocessing

When we train a machine learning model with **scikit-learn**, we typically convert all categorical (text-based) features into **numerical representations**, because estimators such as `SVC` and `RandomForestClassifier` (and `XGBoost` in its default configuration) **cannot operate on raw string data**.

This preprocessing, often grouped under **feature engineering**, typically includes:

- **One-hot encoding** (e.g., `Home Ownership = RENT` → column `Home Ownership_RENT = 1`)
- **Label encoding**
- **Scaling** (e.g., MinMaxScaler or StandardScaler)
- **Handling missing values**
- **Date feature extraction** (e.g., month, year)

During training, these transformations were applied **before the model saw the data**, so the model only learned from the transformed features, which are **all numeric**.

---

### ⚠️ Problem in the Current Dash App

Right now, the app takes user input as raw values like:
```python
input_data = {
    "Term": "Short Term",       # string
    "Home Ownership": "RENT",   # string
}
```

But then it sends these **raw strings** to the trained model:
```python
model.predict(input_data)
```

The model expects preprocessed numeric input like:
```python
expected_input = {
    "Term_Short Term": 1,
    "Home Ownership_RENT": 1,
    "Home Ownership_OWN": 0,
}
```

Hence, you're getting errors like:
```plaintext
could not convert string to float: 'Short Term'
```
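
To make the mismatch concrete, the following sketch one-hot encodes a raw input by hand against a fixed column list and reproduces the error message with plain Python. The column list and the helper `one_hot_row` are illustrative; the real columns come from the training data:

```python
# Columns the model was trained on (illustrative subset).
TRAINING_COLUMNS = ["Term_Short Term", "Home Ownership_RENT", "Home Ownership_OWN"]


def one_hot_row(raw, columns):
    """Return 0/1 values for `columns` given a dict of raw categorical inputs."""
    active = {f"{feature}_{value}" for feature, value in raw.items()}
    return [1 if col in active else 0 for col in columns]


raw_input = {"Term": "Short Term", "Home Ownership": "RENT"}
encoded = one_hot_row(raw_input, TRAINING_COLUMNS)

# Converting the raw string directly to a number fails, which is what
# happens inside the estimator when it receives unencoded text.
try:
    float(raw_input["Term"])
except ValueError as exc:
    error_message = str(exc)
```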

---

### ✅ Correct Approach for Deployment

1. **Recreate your exact preprocessing pipeline** used during training.
2. **Save this pipeline** as a `Pipeline` object that combines both preprocessing and the model.
3. In the Dash app:
   - Accept raw user input
   - Pass it through the same pipeline
   - Then send it to `.predict()` or `.predict_proba()`

This ensures your input format **matches** the model’s training data structure.

---

### 🛠️ What Should You Do Next?

You need to **retrain or refit** your model inside a `Pipeline`, like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC

# Step 1: Define preprocessing
categorical_cols = ["Term", "Home Ownership", ...]
numerical_cols = ["Loan Amount", "Interest Rate", ...]

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", "passthrough", numerical_cols)
])

# Step 2: Combine preprocessing + model
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", SVC(probability=True))
])

# Step 3: Fit pipeline
pipeline.fit(X_raw, y)  # X_raw = original unprocessed DataFrame

# Step 4: Save pipeline
import joblib
joblib.dump(pipeline, "svc_pipeline_model.pkl")
```

Then in your Dash app, simply load and use:

```python
import joblib

model = joblib.load("svc_pipeline_model.pkl")
model.predict(user_input_df)  # user_input_df: one-row DataFrame with the raw column names
```
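
As a rough end-to-end illustration, the sketch below fits a small pipeline on made-up rows and scores one raw form submission. `LogisticRegression` is used here only as a stand-in classifier so the toy data can stay tiny, and the helper `predict_from_form` is hypothetical; the real app would load the saved pipeline instead:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the raw training data.
X_raw = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term", "Long Term"],
    "Home Ownership": ["RENT", "OWN", "OWN", "RENT"],
    "Loan Amount": [12000, 30000, 8000, 25000],
})
y = [0, 1, 0, 1]

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Term", "Home Ownership"]),
        ("num", "passthrough", ["Loan Amount"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),  # stand-in for the real model
])
pipeline.fit(X_raw, y)


def predict_from_form(form_values: dict) -> float:
    """Build a one-row DataFrame from raw form values and return P(class 1)."""
    row = pd.DataFrame([form_values], columns=X_raw.columns)
    return float(pipeline.predict_proba(row)[0, 1])


prob = predict_from_form(
    {"Term": "Short Term", "Home Ownership": "RENT", "Loan Amount": 10000}
)
```

The key point is that the callback passes raw strings and numbers; all encoding happens inside the fitted pipeline.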

---

This ensures the deployed app applies the **same preprocessing logic** that was used during training, which removes the most common source of train/serve mismatches.

## 🔁 Why Retrain Inside a Pipeline?
When you trained your model earlier, you:

- Cleaned and encoded your DataFrame using pandas (for example, one-hot encoding)
- Dropped or transformed columns manually
- Fed the final numerical DataFrame into the model

That was fine for offline evaluation, but now in your Dash app:

- The user enters raw values (like "RENT" or "Short Term")
- The model expects preprocessed numbers

So you need to bundle the preprocessing and the model together into a `Pipeline` that handles everything during prediction.

## 🧠 Next Step
Now that you're ready, let’s:

- Rebuild your preprocessing logic using `ColumnTransformer`
- Combine it with `SVC(probability=True)` inside a `Pipeline`
- Fit the pipeline on raw training data (train.csv)
- Save the pipeline as `svc_pipeline_model.pkl`

## Use it in Dash 🎯

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder
import pandas as pd

class BankLoanPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
        self.label_cols = ["Sub Grade", "Batch Enrolled"]
        self.label_encoders = {}
        self.freq_map = None
        self.one_hot_cols = ["Initial List Status", "Employment Duration", "Verification Status"]
        self.one_hot_columns_fitted = None  # to align test data with train

    def fit(self, X, y=None):
        # Fit Label Encoders
        for col in self.label_cols:
            le = LabelEncoder()
            le.fit(X[col].astype(str))
            self.label_encoders[col] = le

        # Fit Frequency Encoding
        self.freq_map = X["Loan Title"].value_counts().to_dict()

        # Fit One-Hot Columns (drop_first removes one reference category per feature)
        dummies = pd.get_dummies(X[self.one_hot_cols], drop_first=True)
        self.one_hot_columns_fitted = dummies.columns.tolist()

        return self

    def transform(self, X):
        X = X.copy()

        # Grade Ordinal Mapping
        X["Grade"] = X["Grade"].map(self.grade_order)

        # Label Encoding (raises ValueError on labels not seen during fit)
        for col in self.label_cols:
            X[col] = self.label_encoders[col].transform(X[col].astype(str))

        # Frequency Encoding
        X["Loan Title"] = X["Loan Title"].map(self.freq_map).fillna(0)

        # One-Hot Encoding (align columns).
        # drop_first is applied only in fit(); using it here would drop the
        # only category present in a single-row input from the app.
        dummies = pd.get_dummies(X[self.one_hot_cols])
        dummies = dummies.reindex(columns=self.one_hot_columns_fitted, fill_value=0)
        X = X.drop(columns=self.one_hot_cols)
        X = pd.concat([X, dummies], axis=1)

        return X
```

```python
# --- 1. Imports ---
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Custom transformer from previous step
#from bank_preprocessor import BankLoanPreprocessor  # or copy-paste the class if in same notebook

# --- 2. Prepare Data ---
# 1. Split back into raw training and test data
df_train_raw = df_combined.loc["train"].copy()
df_test_raw = df_combined.loc["test"].copy()

# 2. Define features and target
X_raw = df_train_raw.drop(columns=["Loan Status"])
y_raw = df_train_raw["Loan Status"]

X_kaggle_test = df_test_raw.drop(columns=["Loan Status"])
y_kaggle_test = df_test_raw["Loan Status"]


# Split for evaluation purposes
#X_train_raw, X_val_raw, y_train, y_val = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)

# --- 3. Define Pipeline ---
pipeline = ImbPipeline([
    ("preprocessor", BankLoanPreprocessor()),
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(random_state=42))
])

# --- 4. Fit the Pipeline ---
pipeline.fit(X_raw, y_raw)

# --- 5. Evaluate ---
y_pred = pipeline.predict(X_kaggle_test)
print("📊 Classification Report:\n", classification_report(y_kaggle_test, y_pred))
print("🧱 Confusion Matrix:\n", confusion_matrix(y_kaggle_test, y_pred))

# --- 6. Save the model ---
joblib.dump(pipeline, "rf_pipeline_model.pkl")
print("✅ Model saved as 'rf_pipeline_model.pkl'")
```

```plaintext
ValueError: could not convert string to float: 'n'
```

```python
# Predict on Kaggle test set (raw)
y_pred_kaggle = pipeline.predict(df_test)
print(classification_report(df_target["Loan Status"], y_pred_kaggle))
```

```plaintext
              precision    recall  f1-score   support

           0       0.53      0.98      0.69     15300
           1       0.56      0.02      0.05     13613

    accuracy                           0.53     28913
   macro avg       0.55      0.50      0.37     28913
weighted avg       0.55      0.53      0.39     28913
```
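
As a sanity check on the report above, overall accuracy can be approximated from the per-class recall and support. Because the report rounds recall to two decimals, the result is only approximate:

```python
# Values copied from the classification report (recall is rounded there).
recall = {0: 0.98, 1: 0.02}
support = {0: 15300, 1: 13613}

# Correct predictions per class = recall * number of true samples in that class.
correct = sum(recall[label] * support[label] for label in support)
accuracy = correct / sum(support.values())

# Share of rows that truly belong to class 0, for comparison.
class0_share = support[0] / sum(support.values())
```

The near-zero recall for class 1 means the model predicts class 0 for almost every row, so the accuracy ends up close to the class 0 share of the data.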



# 🧠 Why the Pipeline Fails on Train and Test Separately

In the original workflow, the encoding steps described above were applied **after merging** `df_train` and `df_test` into `df_combined`. This ensured **consistent encoding across all values**, even if a value appeared only in the test set (e.g., `'n'` or `'BAT2522922'` in `Batch Enrolled`).

---

### ⚠️ Problem in the Current Pipeline

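Now, when we try to train or deploy a new model from `df_train` or `df_test` separately, these inconsistencies **resurface**, because values that were handled only in the combined frame are no longer accounted for. The error below means that a raw string (here `'n'`) reached the estimator without being encoded: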

```plaintext
ValueError: could not convert string to float: 'n'
```

---

### ✅ Solution: Reuse Consistent Transformation

To ensure consistent encoding, we need to:

1. Combine the original cleaned `df_train` and `df_test` again into `df_combined`:
```python
combined = pd.concat([df_train, df_test], axis=0, keys=["train", "test"])
```

2. Apply encoding to the entire `df_combined`:
```python
# Grade ordinal encoding
combined['Grade'] = combined['Grade'].map(grade_order)

# Label encoding
for col in label_cols:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))

# Frequency encoding for Loan Title
combined['Loan Title'] = combined['Loan Title'].map(freq_map).fillna(0)

# One-hot encoding
combined = pd.get_dummies(combined, columns=one_hot_cols, drop_first=True)
```

3. Split them back:
```python
X_train_final = combined.loc['train'].drop(columns=['Loan Status'])
y_train_final = combined.loc['train']['Loan Status']
X_test_final = combined.loc['test'].drop(columns=['Loan Status'])
y_test_final = combined.loc['test']['Loan Status']
```

4. Use these for fitting your final pipeline.

---

### 🧱 Outcome

By encoding both datasets together upfront, the training and test frames share the same columns and category codes, which avoids mismatches at prediction time. Any app that reuses the model must apply the same fitted mappings to its input.
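
Merging covers a fixed test file, but a deployed app can still receive a category that appears in neither dataset. One possible alternative, shown only as a sketch and not what this notebook does, is scikit-learn's `OrdinalEncoder` with an explicit code for unknown values. The training batch IDs below are made up:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training values for a column such as "Batch Enrolled".
train_batches = np.array([["BAT_A"], ["BAT_B"], ["BAT_A"]], dtype=object)

# Categories not seen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_batches)

new_values = np.array([["BAT_A"], ["BAT2522922"]], dtype=object)
encoded = encoder.transform(new_values)
```

Because the encoder is fitted on training data only, it can sit inside the pipeline and be saved together with the model.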



```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Combine cleaned train and test sets
combined = pd.concat([df_train, df_test], axis=0, keys=["train", "test"])

# Grade ordinal encoding
grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
combined['Grade'] = combined['Grade'].map(grade_order)

# Label encoding
label_cols = ["Sub Grade", "Batch Enrolled"]
for col in label_cols:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))

# Frequency encoding for Loan Title
freq_map = combined["Loan Title"].value_counts().to_dict()
combined["Loan Title"] = combined["Loan Title"].map(freq_map).fillna(0)

# One-hot encoding
one_hot_cols = ["Initial List Status", "Employment Duration", "Verification Status"]
combined = pd.get_dummies(combined, columns=one_hot_cols, drop_first=True)

# Final train-test split
X_train_final = combined.loc['train'].drop(columns=['Loan Status'])
y_train_final = combined.loc['train']['Loan Status']
X_test_final = combined.loc['test'].drop(columns=['Loan Status'])
y_test_final = combined.loc['test']['Loan Status']

# Define pipeline
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train_final, y_train_final)

# Evaluate
y_pred = pipeline.predict(X_test_final)
print("📊 Classification Report:\n", classification_report(y_test_final, y_pred))
print("🧱 Confusion Matrix:\n", confusion_matrix(y_test_final, y_pred))

# Save model
import joblib
joblib.dump(pipeline, "final_rf_model.pkl")
print("✅ Model saved as 'final_rf_model.pkl'")
```

```plaintext
ValueError: Input y_true contains NaN.
```
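
A likely cause of this error, assuming `df_test` carries no `Loan Status` labels (as is typical for a Kaggle test file): `pd.concat` fills the missing target with `NaN`, so `y_test_final` cannot be passed to `classification_report`. A toy reproduction with made-up rows:

```python
import pandas as pd

# Made-up frames: the test frame has no target column.
df_train = pd.DataFrame({"Grade": ["A", "B"], "Loan Status": [0, 1]})
df_test = pd.DataFrame({"Grade": ["C"]})

combined = pd.concat([df_train, df_test], axis=0, keys=["train", "test"])

# The target exists for the train rows but is NaN for every test row.
y_train = combined.loc["train"]["Loan Status"]
y_test = combined.loc["test"]["Loan Status"]
all_test_targets_missing = bool(y_test.isna().all())
```

With no true labels for the test rows, the evaluation step has nothing to compare against, so the remaining option is to predict and write a submission file.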

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib

# Combine cleaned train and test sets
combined = pd.concat([df_train, df_test], axis=0, keys=["train", "test"])

# Grade ordinal encoding
grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
combined['Grade'] = combined['Grade'].map(grade_order)

# Label encoding
label_cols = ["Sub Grade", "Batch Enrolled"]
for col in label_cols:
    le = LabelEncoder()
    combined[col] = le.fit_transform(combined[col].astype(str))

# Frequency encoding for Loan Title
freq_map = combined["Loan Title"].value_counts().to_dict()
combined["Loan Title"] = combined["Loan Title"].map(freq_map).fillna(0)

# One-hot encoding
one_hot_cols = ["Initial List Status", "Employment Duration", "Verification Status"]
combined = pd.get_dummies(combined, columns=one_hot_cols, drop_first=True)

# Final train-test split
X_train_final = combined.loc['train'].drop(columns=['Loan Status'])
y_train_final = combined.loc['train']['Loan Status']
X_test_final = combined.loc['test'].drop(columns=['Loan Status'], errors='ignore')

# Define pipeline
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("classifier", RandomForestClassifier(random_state=42))
])

# Train pipeline
pipeline.fit(X_train_final, y_train_final)

# Save model
joblib.dump(pipeline, "final_rf_model.pkl")
print("✅ Model saved as 'final_rf_model.pkl'")

# Create final submission
preds = pipeline.predict(X_test_final)
submission = pd.DataFrame({
    "ID": original_test["ID"],
    "Loan Status": preds
})
submission.to_csv("final_submission.csv", index=False)
print("📁 'final_submission.csv' created successfully!")
```

```plaintext
✅ Model saved as 'final_rf_model.pkl'
📁 'final_submission.csv' created successfully!
```
