# CIS 5450 Project: Difficulty Topics
**Team Members:** Prithvi Seshadri, Vamsi Krishna, Anaya Choudhari
## Video Link: [Project Presentation Video](https://drive.google.com/file/d/1Wq_g_4Nbs5JRGb5VOnJ73Qawm4WYkICU/view?usp=sharing)



---

## Topic 1: **Imbalanced Data**

**Hyperlink:** [lifestyle_project_v4.ipynb](lifestyle_project_v4.ipynb#4.4-Handling-Class-Imbalance-with-SMOTE)


### **Implementation & Rationale**

#### **Rationale**
Our dataset's target variable, `disease_risk`, was significantly imbalanced, with far fewer high-risk cases than low-risk ones (e.g., a 90/10 split). Training standard models on such skewed data often biases them toward the majority class (healthy), resulting in poor sensitivity (recall) for the minority class (high risk). In a medical context, missing a high-risk patient (False Negative) is far more costly than flagging a healthy one (False Positive).

#### **Implementation Strategy**
We implemented **SMOTE (Synthetic Minority Over-sampling Technique)** using the `imblearn` library. Crucially, we applied SMOTE **only to the training split** to prevent data leakage: resampling before the split would let synthetic points derived from test samples into training and artificially inflate scores. SMOTE synthesizes new minority-class examples by interpolating between existing ones, balancing the class distribution for training.

#### **Pseudocode / Steps**
```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# 1. Split the data first to prevent leakage (X: features, y: target)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 2. Initialize SMOTE
smote = SMOTE(random_state=42)

# 3. Resample ONLY the training data (never touch the test data!)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# 4. Check the new class distribution (assumes y is a pandas Series)
print(y_train_resampled.value_counts())  # Should be balanced 50/50

# 5. Train the model (any scikit-learn estimator) on the resampled data
model.fit(X_train_resampled, y_train_resampled)
```
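
To make the interpolation step concrete, here is a minimal pure-Python sketch of how a SMOTE-style synthetic point can be built. It is a toy illustration, not the `imblearn` implementation: the helper `smote_like_sample`, the four `minority_points`, and the choice of `k=2` are all assumptions made for the example.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Build one synthetic point between a random minority point and
    one of its k nearest minority neighbours (toy version)."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation weight in [0, 1)
    # The new point lies on the line segment from `base` toward `neighbour`.
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class samples with two features each.
minority_points = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point is a convex combination of two real minority points, it always falls inside the region the minority class already occupies. This is also why resampling must happen after the split: a synthetic point built from a test sample would carry that sample's information into training.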


---

## Topic 2: **Feature Selection**

**Hyperlink:** [lifestyle_project_v4.ipynb](lifestyle_project_v4.ipynb#4.4-Feature-Importance-Analysis-(Mutual-Information))

### **Implementation & Rationale**

#### **Rationale**
Our dataset contained multiple lifestyle and physiological features, some of which could be redundant (multicollinearity) or noisy. Including irrelevant features increases model complexity ("curse of dimensionality"), risks overfitting, and reduces interpretability. We needed to identify which specific factors (e.g., Steps vs. Sleep) truly drive disease risk to simplify the model and explain it to clinicians.

#### **Implementation Strategy**
We utilized **Mutual Information (MI)** from `sklearn.feature_selection`. Unlike simple correlation (which only captures linear relationships like `y = mx + b`), MI captures non-linear dependencies between features and the target. For example, BMI might only be risky above a certain threshold, which correlation might miss but MI captures. We calculated MI scores on the balanced training set to ensure the importance of minority-class predictors was not drowned out.

#### **Pseudocode / Steps**
```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# 1. Calculate Mutual Information scores (seeded, since the estimator is randomized)
mi_scores = mutual_info_classif(X_train_resampled, y_train_resampled, random_state=42)

# 2. Create a dataframe for visualization
mi_df = pd.DataFrame({'Feature': X_train_resampled.columns, 'Score': mi_scores})
mi_df = mi_df.sort_values(by='Score', ascending=False)

# 3. Visualize to decide importance
plt.barh(mi_df['Feature'], mi_df['Score'])
plt.gca().invert_yaxis()  # Put the highest-scoring feature at the top
plt.title("Mutual Information Scores")
plt.show()
```
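
The contrast between correlation and MI can be shown on a toy example. The sketch below assumes a hypothetical U-shaped relationship (risk is high at both extremes of a feature), which is not taken from our data. It uses a simple plug-in MI estimate for discrete values, whereas `mutual_info_classif` uses a nearest-neighbour estimator for continuous features, so the numbers are illustrative only.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) for discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical feature centred on 0; risk is 1 only at the extremes.
feature = [-2, -1, 0, 1, 2] * 20
risk = [1 if abs(v) >= 2 else 0 for v in feature]

r = pearson(feature, risk)
mi = mutual_information(feature, risk)
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} nats")
```

Here the feature fully determines the risk label, yet the linear correlation is essentially zero because the two extremes cancel out. MI stays clearly positive, which is the behaviour that motivated using it for feature ranking.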


---

## Topic 3: **Ensemble Models**

**Hyperlink:** [lifestyle_project_v4.ipynb](lifestyle_project_v4.ipynb#5.2-Challenger-Model-Tuned-Random-Forest)

### **Implementation & Rationale**

#### **Rationale**
Single estimators (like a Decision Tree) often suffer from high variance (overfitting specific data points). We chose **Random Forest**, an ensemble of decision trees, for its robustness and predictive power. Medical risk factors often behave as "step-functions" (e.g., risk doesn't increase linearly with BP, but spikes after 140/90), which trees handle well. By averaging multiple trees (Bagging), Random Forest reduces the variance of individual trees and provides more stable probability estimates.
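
The variance-reduction effect of averaging can be sketched with a small simulation. This is a simplified illustration under stated assumptions: each "tree" is modelled as the true risk plus independent Gaussian noise, and the values `TRUE_RISK`, the noise level 0.2, and the ensemble size 25 are made up for the example. Real trees in a forest are correlated, so the actual reduction is smaller than in this idealized case.

```python
import random
import statistics

rng = random.Random(0)
TRUE_RISK = 0.3  # hypothetical true probability for one patient

def noisy_estimate():
    """One high-variance estimator: the truth plus independent noise."""
    return TRUE_RISK + rng.gauss(0, 0.2)

def bagged_estimate(n_estimators):
    """Average of several independent noisy estimators."""
    return statistics.fmean([noisy_estimate() for _ in range(n_estimators)])

# Repeat each experiment many times to measure how much predictions vary.
single = [noisy_estimate() for _ in range(2000)]
bagged = [bagged_estimate(25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"single-estimator variance: {var_single:.4f}")
print(f"25-estimator average variance: {var_bagged:.4f}")
```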

#### **Implementation Strategy**
We implemented a **Hyperparameter-Tuned Random Forest**. Rather than using default settings, we used `RandomizedSearchCV` to sample configurations from a grid of parameters (`n_estimators`, `max_depth`, `min_samples_split`). This allowed us to find a model configuration that balanced complexity (depth) and generalization (estimators), ensuring we didn't just memorize the training data.

#### **Pseudocode / Steps**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# 1. Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# 2. Initialize Randomized Search (seeded so the sampled configurations are reproducible)
rf_random = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    cv=3,
    scoring='recall',  # Optimizing for sensitivity
    random_state=42
)

# 3. Fit to resampled training data
# Note: the CV folds contain synthetic samples, so CV recall may be optimistic
rf_random.fit(X_train_resampled, y_train_resampled)

# 4. Inspect the best parameters and keep the refit best model
print(rf_random.best_params_)
best_model = rf_random.best_estimator_
```
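
To show what `n_iter=10` means for this grid, the sketch below enumerates every combination and draws ten of them at random. It only mimics the sampling idea in plain Python and does not train anything. The variable names `all_combos` and `sampled` are illustrative, and the specific configurations drawn will differ from those scikit-learn selects.

```python
import itertools
import random

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Enumerate the full grid: 3 * 3 * 3 = 27 candidate configurations.
names = list(param_grid)
all_combos = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]

# Draw 10 distinct configurations, as a randomized search over lists would.
rng = random.Random(42)
sampled = rng.sample(all_combos, k=10)

print(f"{len(sampled)} of {len(all_combos)} configurations evaluated")
for combo in sampled[:3]:
    print(combo)
```

With `cv=3`, evaluating 10 configurations costs 30 model fits instead of the 81 an exhaustive grid search would need, at the price of possibly missing the single best combination.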


---

## Topic 4: **Threshold Tuning & Business Logic**

**Hyperlink:** [lifestyle_project_v4.ipynb](lifestyle_project_v4.ipynb#5.5-Strategic-Fix-Threshold-Tuning-for-Recall)

### **Implementation & Rationale**

#### **Rationale**
Standard classifiers use a default probability threshold of 0.50 to classify a positive case. In medicine, however, the cost of a False Negative (missing a disease case) is often higher than that of a False Positive. Simply maximizing recall, on the other hand, can lead to "alarm fatigue" where too many healthy people are flagged. We needed to find a custom threshold that balanced these trade-offs according to our specific business goal: creating a **High-Credibility Screening Tool**.

#### **Implementation Strategy**
We analyzed the precision-recall trade-off by varying the decision threshold from 0.0 to 1.0. Lowering the threshold to 0.15 maximized sensitivity but flagged 97% of healthy people as sick, i.e., very low specificity and, as a result, low precision. We made a strategic decision to keep the threshold near 0.50 to prioritize **High Precision** (and, with it, high specificity). Note that the two metrics are distinct: precision is the share of flagged patients who are truly high-risk, while specificity is the share of healthy patients who are correctly left unflagged. This means the model acts as a "High Certainty" filter: it may not catch every case, but when it *does* flag someone, clinicians can trust the alert.

#### **Pseudocode / Steps**
```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, precision_recall_curve

# 1. Get probability scores instead of hard classes
y_scores = best_model.predict_proba(X_test)[:, 1]

# 2. Calculate Precision & Recall for all thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

# 3. Analyze Trade-offs (Visual Inspection)
def plot_pr_curve(precisions, recalls):
    plt.plot(recalls, precisions)
    plt.xlabel('Recall (Sensitivity)')
    plt.ylabel('Precision (Credibility)')
    plt.show()

plot_pr_curve(precisions, recalls)

# 4. Select Custom Threshold (e.g., 0.50 for high credibility)
custom_threshold = 0.50
y_custom_pred = (y_scores >= custom_threshold).astype(int)

# 5. Evaluate final business metrics
print(classification_report(y_test, y_custom_pred))
```
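
The effect of moving the threshold can be checked by hand on a tiny example. The sketch below uses ten made-up labels and scores (not our model's output) and a hypothetical helper `metrics_at` to compare precision, recall, and specificity at the two thresholds discussed above.

```python
def metrics_at(y_true, y_scores, threshold):
    """Precision, recall, and specificity for one decision threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Toy data: 8 healthy (0) and 2 high-risk (1) patients with made-up scores.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.20, 0.30, 0.35, 0.40, 0.60, 0.45, 0.80]

low = metrics_at(toy_labels, toy_scores, threshold=0.15)
default = metrics_at(toy_labels, toy_scores, threshold=0.50)
print("threshold 0.15:", low)
print("threshold 0.50:", default)
```

On this toy data the low threshold catches both high-risk patients but flags six of the eight healthy ones, while the 0.50 threshold misses one high-risk patient and raises only a single false alarm. The same kind of trade-off, at a much larger scale, drove our decision above.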
