# HW04

In this dataset, the target for the classification task is the `converted` variable: whether the client signed up for the platform or not.

### Data preparation
- Check whether the features contain missing values.
- If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0

Split the data into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

In [4]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
)

In [None]:
# Standardize column names (following notebook style)
df.columns = df.columns.str.lower().str.replace(" ", "_")

The target variable `converted` is already in 0/1 format.

In [6]:
df.head()

| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [10]:
# check dtypes
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [11]:
# Identify categorical and numerical features
numerical = [
    "number_of_courses_viewed",
    "annual_income",
    "interaction_count",
    "lead_score",
]
categorical = ["lead_source", "industry", "employment_status", "location"]

# Data Preparation: Handle missing values
for col in categorical:
    # For categorical features, replace missing values with 'NA'
    df[col] = df[col].fillna("NA")

for col in numerical:
    # For numerical features, replace missing values with 0.0
    df[col] = df[col].fillna(0.0)
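
One way to confirm the imputation worked is to count missing values before and after filling. The sketch below does this on a small made-up frame; `toy`, `cat_cols`, and `num_cols` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lead-scoring data (values are made up).
toy = pd.DataFrame(
    {
        "industry": ["retail", None, "education"],
        "annual_income": [46992.0, np.nan, 85012.0],
        "converted": [0, 1, 1],
    }
)

# Hypothetical column groups for the toy frame.
cat_cols = ["industry"]
num_cols = ["annual_income"]

# Count missing values per column before filling.
missing_before = toy.isnull().sum()

# Apply the same rules as the homework: 'NA' for categorical, 0.0 for numerical.
toy[cat_cols] = toy[cat_cols].fillna("NA")
toy[num_cols] = toy[num_cols].fillna(0.0)

# Count again; every column should now report zero missing values.
missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
```

Running `df.isnull().sum()` on the real data before and after the two loops gives the same kind of check.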

### Splitting the data into three parts: train/validation/test

In [12]:
# Check the number of rows
n = len(df)
n

1462

In [13]:
# Split the data into 3 parts: train/validation/test with 60%/20%/20% distribution.
# First split: full_train (80%) and test (20%)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

# Second split: train (60%) and validation (20%) from full_train (80% * 0.25 = 20%)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
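
As a quick sanity check on the two-step split, the sketch below applies the same two `train_test_split` calls to a plain index array of 1462 elements (the row count reported above) and computes the resulting fractions. The names `idx` and `fractions` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1462  # row count reported in the notebook
idx = np.arange(n_rows)

# Same two-step split as the notebook, applied to row indices only.
full_train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=1)
train_idx, val_idx = train_test_split(
    full_train_idx, test_size=0.25, random_state=1
)

# Fraction of the full dataset that ends up in each part.
fractions = {
    "train": len(train_idx) / n_rows,
    "val": len(val_idx) / n_rows,
    "test": len(test_idx) / n_rows,
}
print(fractions)
```

The second split uses `test_size=0.25` because 25% of the remaining 80% is 20% of the full dataset.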

In [14]:
# Reset index
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_full_train = df_full_train.reset_index(drop=True)  # Used for Q5/Q6 CV

In [15]:
# Extract the target variables
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values
y_full_train = df_full_train.converted.values

In [16]:
# Delete the target variable from the feature dataframes
del df_train["converted"]
del df_val["converted"]
del df_test["converted"]
del df_full_train["converted"]

### Question 1: ROC AUC feature importance

ROC AUC can also be used to evaluate the feature importance of numerical variables.

Let's do that:

* For each numerical variable, use it as the score (aka prediction) and compute the AUC with the `y` variable as ground truth.
* Use the training dataset for that.


If your AUC is < 0.5, invert this variable by putting "-" in front

(e.g. `-df_train['balance']`)

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive.
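
To illustrate the negation trick, the following sketch builds a synthetic feature that is negatively related to a synthetic target and compares the AUC before and after negation. The data is random and made up for illustration; `x` and `y` are not columns of the lead-scoring dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic binary target and a feature that decreases when the target is 1.
y = rng.integers(0, 2, size=200)
x = -y + rng.normal(scale=1.0, size=200)

auc_raw = roc_auc_score(y, x)   # below 0.5: higher x means the positive class is less likely
auc_neg = roc_auc_score(y, -x)  # negating the score flips the ranking

print(f"raw: {auc_raw:.3f}, negated: {auc_neg:.3f}, sum: {auc_raw + auc_neg:.3f}")
```

Because negation reverses the ranking of the scores, the two values add up to 1 when there are no tied scores, so an AUC of 0.3 carries as much ranking information as an AUC of 0.7.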

Which numerical variable (among the following 4) has the highest AUC?

- `lead_score`
- `number_of_courses_viewed`
- `interaction_count`
- `annual_income`

In [17]:
# Question 1: ROC AUC feature importance
print("--- Question 1: ROC AUC Feature Importance ---")

scores = {}

for col in numerical:
    # Use the numerical feature as the score (prediction)
    auc = roc_auc_score(y_train, df_train[col])

    # If AUC < 0.5, invert the variable and recalculate AUC
    if auc < 0.5:
        auc = roc_auc_score(y_train, -df_train[col])
        print(f"Feature {col} was inverted.")

    scores[col] = auc
    print(f"{col:25} {auc:.4f}")

# Find the feature with the highest AUC
best_feature_q1 = max(scores, key=scores.get)

print(
    f"\nThe numerical variable with the highest AUC is: {best_feature_q1} ({scores[best_feature_q1]:.4f})"
)

--- Question 1: ROC AUC Feature Importance ---
number_of_courses_viewed  0.7636
annual_income             0.5520
interaction_count         0.7383
lead_score                0.6145

The numerical variable with the highest AUC is: number_of_courses_viewed (0.7636)


### Q1 answer: `number_of_courses_viewed`

### Question 2: Training the model

Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:

```python
LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
```

What's the AUC of this model on the validation dataset? (round to 3 digits)

- 0.32
- 0.52
- 0.72
- 0.92


In [18]:
# --- Feature Engineering ---
# Combine all feature names
all_features = categorical + numerical

In [19]:
# One-hot encoding using DictVectorizer
dv = DictVectorizer(sparse=False)

In [20]:
# Fit on train data and transform
train_dict = df_train[all_features].to_dict(orient="records")
X_train = dv.fit_transform(train_dict)

# Transform validation data
val_dict = df_val[all_features].to_dict(orient="records")
X_val = dv.transform(val_dict)

In [22]:
# --- Model Training and Evaluation ---
# Train Logistic Regression model
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=1)
model.fit(X_train, y_train)

# Predict probabilities on the validation set
y_pred = model.predict_proba(X_val)[:, 1]

# Calculate AUC on the validation set
auc_val = roc_auc_score(y_val, y_pred)
auc_val_rounded = round(auc_val, 3)

print("--- Question 2: Training the model (AUC on Validation Set) ---")
print(f"Model AUC on validation set: {auc_val:.4f}")
print(f"Rounded AUC: {auc_val_rounded}")

--- Question 2: Training the model (AUC on Validation Set) ---
Model AUC on validation set: 0.8171
Rounded AUC: 0.817


### Q2 answer: 0.72 (the option closest to 0.817)

### Question 3: Precision and Recall

Now let's compute precision and recall for our model.

* Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
* For each threshold, compute precision and recall
* Plot them

At which threshold do the precision and recall curves intersect?

* 0.145
* 0.345
* 0.545
* 0.745

In [24]:
print("\n--- Question 3: Precision and Recall Intersection ---")

thresholds = np.linspace(0.0, 1.0, 101)
scores_q3 = []

for t in thresholds:
    # Apply threshold to the predicted probabilities
    predicted_positive = y_pred >= t

    # Calculate True Positives, False Positives, False Negatives
    tp = (predicted_positive & (y_val == 1)).sum()
    fp = (predicted_positive & (y_val == 0)).sum()
    fn = (~predicted_positive & (y_val == 1)).sum()

    # Calculate Precision and Recall
    # Handle the case where the denominator is zero
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    scores_q3.append((t, precision, recall))

df_scores = pd.DataFrame(scores_q3, columns=["threshold", "precision", "recall"])

# Find the threshold where precision and recall curves intersect (minimal absolute difference)
df_scores["diff"] = np.abs(df_scores["precision"] - df_scores["recall"])
intersection_point = df_scores.sort_values(by="diff").iloc[0]
intersect_threshold = intersection_point.threshold

print(f"Intersection Threshold: {intersect_threshold:.3f}")


--- Question 3: Precision and Recall Intersection ---
Intersection Threshold: 0.640
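
The manual TP/FP/FN counts can be cross-checked against `precision_score` and `recall_score` from Scikit-Learn (both are imported at the top of the notebook). The sketch below does this for a single threshold on made-up labels and probabilities; `y_true`, `y_prob`, and `y_hat` are illustrative names.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.7, 0.9, 0.2])

threshold = 0.5
y_hat = (y_prob >= threshold).astype(int)

# Manual confusion-matrix counts, as in the loop above.
tp = int(((y_hat == 1) & (y_true == 1)).sum())
fp = int(((y_hat == 1) & (y_true == 0)).sum())
fn = int(((y_hat == 0) & (y_true == 1)).sum())

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

# zero_division=1 mirrors the notebook's choice of precision = 1.0
# when there are no positive predictions.
sk_precision = precision_score(y_true, y_hat, zero_division=1)
sk_recall = recall_score(y_true, y_hat, zero_division=0)

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
```

On this toy data both approaches give a precision of 0.75 and a recall of 0.75 (3 true positives, 1 false positive, 1 false negative).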


### Q3 answer: 0.545 (the option closest to 0.64)

### Question 4: F1 score

Precision and recall are in tension: when one grows, the other tends to go down. That's why they are often combined into the F1 score, a metric that takes both into account.

This is the formula for computing F1:

$$F_1 = 2 \cdot \cfrac{P \cdot R}{P + R}$$

Where $P$ is precision and $R$ is recall.
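
As a small worked example of the formula, the sketch below computes F1 by hand from made-up predictions and compares it with Scikit-Learn's `f1_score`. The arrays `y_true` and `y_hat` are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels and hard predictions: 2 TP, 1 FP, 2 FN.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0])

tp = int(((y_hat == 1) & (y_true == 1)).sum())
fp = int(((y_hat == 1) & (y_true == 0)).sum())
fn = int(((y_hat == 0) & (y_true == 1)).sum())

p = tp / (tp + fp)  # precision = 2/3
r = tp / (tp + fn)  # recall = 1/2

# F1 = 2 * P * R / (P + R) = 4/7
f1_manual = 2 * p * r / (p + r)
f1_sklearn = f1_score(y_true, y_hat)

print(f"manual: {f1_manual:.4f}, sklearn: {f1_sklearn:.4f}")
```

Here $P = 2/3$ and $R = 1/2$, so $F_1 = 4/7 \approx 0.571$, which sits below the arithmetic mean of the two because the harmonic mean penalizes the lower value.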

Let's compute F1 for all thresholds from 0.0 to 1.0 with an increment of 0.01.

At which threshold is F1 maximal?

- 0.14
- 0.34
- 0.54
- 0.74

In [25]:
print("\n--- Question 4: F1 Score Maximal Threshold ---")

# Compute F1 score using the formula
df_scores["f1"] = (
    2
    * (df_scores["precision"] * df_scores["recall"])
    / (df_scores["precision"] + df_scores["recall"])
)
# Handle the case where P+R=0
df_scores.loc[df_scores["precision"] + df_scores["recall"] == 0, "f1"] = 0.0

# Find the threshold with maximal F1 score
max_f1_point = df_scores.sort_values(by="f1", ascending=False).iloc[0]
max_f1_threshold = max_f1_point.threshold

print(f"Max F1 Score Threshold: {max_f1_threshold:.2f}")


--- Question 4: F1 Score Maximal Threshold ---
Max F1 Score Threshold: 0.57


### Question 5: 5-Fold CV


Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

```python
KFold(n_splits=5, shuffle=True, random_state=1)
```

* Iterate over different folds of `df_full_train`
* Split the data into train and validation
* Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
* Use AUC to evaluate the model on validation

How large is the standard deviation of the scores across different folds?

- 0.0001
- 0.006
- 0.06
- 0.36

In [27]:
print("\n--- Question 5: 5-Fold CV Standard Deviation ---")

# --- K-Fold CV Setup (using X_full_train) ---
# Prepare data for K-Fold CV
dv_cv = DictVectorizer(sparse=False)
full_train_dict = df_full_train[all_features].to_dict(orient="records")
X_full_train = dv_cv.fit_transform(full_train_dict)

# Initialize KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# --- CV Calculation ---
auc_scores_q5 = []
model_q5 = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=1)

for train_idx, val_idx in kfold.split(X_full_train):
    X_train_cv = X_full_train[train_idx]
    y_train_cv = y_full_train[train_idx]
    X_val_cv = X_full_train[val_idx]

    # Train the model
    model_q5.fit(X_train_cv, y_train_cv)

    # Evaluate the model
    y_pred_cv = model_q5.predict_proba(X_val_cv)[:, 1]
    auc_scores_q5.append(roc_auc_score(y_full_train[val_idx], y_pred_cv))

std_auc = np.std(auc_scores_q5)
print(f"Standard Deviation of AUC scores: {std_auc:.4f}")


--- Question 5: 5-Fold CV Standard Deviation ---
Standard Deviation of AUC scores: 0.0358
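
The cell above fits one `DictVectorizer` on all of `df_full_train` before splitting into folds. A common alternative is to fit the vectorizer inside each fold, so that the validation fold is treated as unseen at every step. The sketch below shows that variant on synthetic data; the frame `demo`, the target `y_demo`, and the relationship between them are made up, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n_demo = 300

# Synthetic stand-in for df_full_train (made-up values).
demo = pd.DataFrame(
    {
        "lead_source": rng.choice(["paid_ads", "referral", "events"], size=n_demo),
        "interaction_count": rng.integers(0, 10, size=n_demo),
    }
)
# Synthetic target loosely driven by interaction_count plus noise.
noise = rng.normal(scale=3.0, size=n_demo)
y_demo = (demo["interaction_count"].to_numpy() + noise > 4.5).astype(int)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=1)
fold_aucs = []

for tr_idx, va_idx in kfold_demo.split(demo):
    # Fit the vectorizer on the training fold only.
    dv_fold = DictVectorizer(sparse=False)
    X_tr = dv_fold.fit_transform(demo.iloc[tr_idx].to_dict(orient="records"))
    X_va = dv_fold.transform(demo.iloc[va_idx].to_dict(orient="records"))

    # Fresh model per fold, same parameters as the homework.
    model_fold = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000)
    model_fold.fit(X_tr, y_demo[tr_idx])

    y_pred_fold = model_fold.predict_proba(X_va)[:, 1]
    fold_aucs.append(roc_auc_score(y_demo[va_idx], y_pred_fold))

print(f"mean = {np.mean(fold_aucs):.3f}, std = {np.std(fold_aucs):.3f}")
```

For one-hot encoding the two approaches usually give very similar scores, since the vectorizer only learns the set of column names and never sees the target.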


### Q5 answer: 0.06 (the option closest to 0.0358)

### Question 6: Hyperparameter Tuning

Now let's use 5-Fold cross-validation to find the best parameter `C`

* Iterate over the following `C` values: `[0.000001, 0.001, 1]`
* Initialize `KFold` with the same parameters as previously
* Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
* Compute the mean score as well as the std (round the mean and std to 3 decimal digits)

Which `C` leads to the best mean score?

- 0.000001
- 0.001
- 1

If you have ties, select the score with the lowest std. If you still have ties, select the smallest `C`.
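
The tie-breaking rule only behaves as described if the mean and std are rounded before they are compared. The sketch below uses hypothetical CV results, invented so that two `C` values tie after rounding, and applies the rule with a three-key sort. The numbers in `demo_results` are made up and are not the results of this homework.

```python
import pandas as pd

# Hypothetical CV results, chosen so that C=0.001 and C=1 tie after rounding.
demo_results = pd.DataFrame(
    {
        "C": [0.000001, 0.001, 1],
        "mean_auc": [0.5601, 0.8196, 0.8204],
        "std_auc": [0.0240, 0.0358, 0.0361],
    }
)

sort_keys = ["mean_auc", "std_auc", "C"]
sort_order = [False, True, True]  # highest mean, then lowest std, then smallest C

# Without rounding, the tiny difference in mean AUC decides the winner.
unrounded_best = demo_results.sort_values(by=sort_keys, ascending=sort_order).iloc[0]

# With rounding to 3 digits, mean and std tie and the smallest C wins.
rounded = demo_results.assign(
    mean_auc=demo_results["mean_auc"].round(3),
    std_auc=demo_results["std_auc"].round(3),
)
rounded_best = rounded.sort_values(by=sort_keys, ascending=sort_order).iloc[0]

print("unrounded best C:", unrounded_best["C"])
print("rounded best C:  ", rounded_best["C"])
```

In this made-up case the unrounded comparison picks `C=1`, while the rounded comparison picks `C=0.001`.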

In [28]:
print("\n--- Question 6: Hyperparameter Tuning ---")

C_values = [0.000001, 0.001, 1]
results = []

for C in C_values:
    auc_scores_c = []
    model_params = LogisticRegression(
        solver="liblinear", C=C, max_iter=1000, random_state=1
    )

    # Perform K-Fold CV for the current C value
    for train_idx, val_idx in kfold.split(X_full_train):
        X_train_cv = X_full_train[train_idx]
        y_train_cv = y_full_train[train_idx]
        X_val_cv = X_full_train[val_idx]

        # Train the model
        model_params.fit(X_train_cv, y_train_cv)

        # Evaluate the model
        y_pred_cv = model_params.predict_proba(X_val_cv)[:, 1]
        auc_scores_c.append(roc_auc_score(y_full_train[val_idx], y_pred_cv))

    mean_auc = np.mean(auc_scores_c)
    std_auc = np.std(auc_scores_c)

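    # Round before storing so the tie-breaking sort below compares
    # the 3-digit values, as the question asks
    results.append(
        {"C": C, "mean_auc": round(mean_auc, 3), "std_auc": round(std_auc, 3)}
    )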

    print(f"C={C:<10}: Mean AUC = {mean_auc:.3f}, Std AUC = {std_auc:.3f}")

# Find the best C based on criteria
best_c_result = (
    pd.DataFrame(results)
    .sort_values(by=["mean_auc", "std_auc", "C"], ascending=[False, True, True])
    .iloc[0]
)

best_C = best_c_result["C"]
print(f"\nThe best C value is: {best_C}")


--- Question 6: Hyperparameter Tuning ---
C=1e-06     : Mean AUC = 0.560, Std AUC = 0.024
C=0.001     : Mean AUC = 0.867, Std AUC = 0.029
C=1         : Mean AUC = 0.822, Std AUC = 0.036

The best C value is: 0.001
