# Homework
> Note: sometimes your answer doesn't match one of the options exactly.
> That's fine.
> Select the option that's closest to your solution.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns

## Dataset
In this homework, we will use the [lead scoring dataset](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

You can download it with `wget`:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

In [3]:
csv_file_name = "course_lead_scoring.csv"
!wget -O {csv_file_name} https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

--2025-10-15 14:30:18--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8003::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv’


2025-10-15 14:30:18 (2,48 MB/s) - ‘course_lead_scoring.csv’ saved [80876/80876]



## Data Preparation
- Check whether the features contain missing values.
- If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0
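
As a minimal sketch of the check-then-fill steps above, the snippet below uses a small made-up frame (the values are hypothetical, not taken from the dataset). It counts missing values per column first, then fills non-numerical columns with `'NA'` and numerical columns with `0.0`.

```python
import pandas as pd

# Toy frame standing in for the lead scoring data (values are made up).
toy = pd.DataFrame(
    {
        "industry": ["retail", None, "finance"],
        "annual_income": [50000.0, None, 72000.0],
    }
)

# Step 1: check how many values are missing in each column.
missing_before = toy.isnull().sum()

# Step 2: fill non-numerical columns with "NA" and numerical ones with 0.0.
cat_cols = toy.select_dtypes(exclude=["number"]).columns
num_cols = toy.select_dtypes(include=["number"]).columns
toy[cat_cols] = toy[cat_cols].fillna("NA")
toy[num_cols] = toy[num_cols].fillna(0.0)

# Step 3: confirm nothing is missing anymore.
missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Running `df.isnull().sum()` on the real dataframe before filling shows which columns are actually affected.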

In [4]:
df = pd.read_csv(csv_file_name)
target = "converted"

# create list of categorical and numerical features
categorical_features = df.select_dtypes(
    include=["object", "category", "bool"]
).columns.tolist()
numerical_features = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
if target in categorical_features:
    categorical_features.remove(target)
elif target in numerical_features:
    numerical_features.remove(target)

df[categorical_features] = df[categorical_features].fillna("NA")
df[numerical_features] = df[numerical_features].fillna(0.0)

# sanity check: every column is the target, a categorical, or a numerical feature
if len(df.columns) - len(categorical_features) - len(numerical_features) - 1 != 0:
    raise RuntimeError("Something is off with the number of columns")

## Question 1
What is the most frequent observation (mode) for the column `industry`?

In [5]:
df["industry"].mode()

0    retail
Name: industry, dtype: object

- `NA`
- `technology`
- `healthcare`
- **`retail`**

## Question 2
Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features.
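
To illustrate what "every pair of features" means, here is a small sketch on made-up columns `a`, `b`, and `c` (hypothetical names and values). It builds the matrix with `DataFrame.corr()` and then walks over each unordered pair once to find the one with the largest absolute correlation.

```python
from itertools import combinations

import pandas as pd

# Made-up numerical columns, for illustration only.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 4.0, 6.0, 9.0],  # grows almost linearly with "a"
        "c": [1.0, 0.0, 1.0, 0.0],  # only weakly related to "a" and "b"
    }
)

# Pearson correlation between every pair of columns (symmetric, diagonal = 1).
corr = toy.corr()

# Visit each unordered pair once and keep its coefficient.
pair_corr = {
    (c1, c2): corr.loc[c1, c2] for c1, c2 in combinations(corr.columns, 2)
}

# The strongest relationship is the one with the largest absolute value.
strongest = max(pair_corr, key=lambda pair: abs(pair_corr[pair]))
print(strongest, round(pair_corr[strongest], 3))
```

Comparing absolute values matters because a strong negative correlation is as informative as a strong positive one.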

In [6]:
correlation_matrix = df.corr(numeric_only=True)

What are the two features that have the biggest correlation?
- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- **`annual_income` and `interaction_count`**

Only consider the pairs above when answering this question.

In [7]:
column_pairs = [
    ("interaction_count", "lead_score"),
    ("number_of_courses_viewed", "lead_score"),
    ("number_of_courses_viewed", "interaction_count"),
    ("annual_income", "interaction_count"),
]

correlations = {(c1, c2): correlation_matrix.loc[c1, c2] for c1, c2 in column_pairs}

sorted_correlations = sorted(
    correlations.items(), key=lambda item: abs(item[1]), reverse=True
)

print(sorted_correlations)
print(f"The pair with the highest absolute correlation is {sorted_correlations[0][0]}")

[(('annual_income', 'interaction_count'), np.float64(0.02703647240481443)), (('number_of_courses_viewed', 'interaction_count'), np.float64(-0.023565222882888037)), (('interaction_count', 'lead_score'), np.float64(0.009888182496913131)), (('number_of_courses_viewed', 'lead_score'), np.float64(-0.004878998354681276))]
The pair with the highest absolute correlation is ('annual_income', 'interaction_count')


## Split the Data
- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
- Make sure that the target value `y` is not in your dataframe.
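
Because `train_test_split` only produces two parts, a 60%/20%/20% split takes two calls. The sketch below, on a plain array of 1000 made-up row indices, shows the arithmetic: after holding out 20% for test, the validation set must be 0.2 / 0.8 = 25% of what remains.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 made-up row indices, standing in for the rows of a dataframe.
rows = np.arange(1000)

# First call: hold out 20% of all rows as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second call: 20% of the original data is 0.2 / 0.8 = 25% of the remaining 80%.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
```

Passing `test_size=0.2` in the second call as well would give a 64%/16%/20% split instead.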

In [8]:
def my_split(df, target, test_size, val_size, seed=42):
    # Fraction of the non-test rows that goes to validation,
    # e.g. 0.2 / (1 - 0.2) = 0.25 for a 60/20/20 split.
    val_size_of_full_train = val_size / (1 - test_size)

    df_full_train, df_test = train_test_split(
        df, test_size=test_size, random_state=seed
    )
    df_train, df_val = train_test_split(
        df_full_train, test_size=val_size_of_full_train, random_state=seed
    )
    # train, val, test
    return (
        df_train.drop(columns=[target]),
        df_train[target].values,
        df_val.drop(columns=[target]),
        df_val[target].values,
        df_test.drop(columns=[target]),
        df_test[target].values,
    )


df_train, y_train, df_val, y_val, df_test, y_test = my_split(df, target, 0.2, 0.2)

# sanity check
print(
    f"relative sizes train, val, test: {np.array([len(df_train), len(df_val), len(df_test)]) / (len(df_train) + len(df_val) + len(df_test))}"
)

relative sizes train, val, test: [0.59917921 0.2004104  0.2004104 ]


## Question 3
- Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.
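
For intuition about what the score measures, here is a tiny sketch with made-up labels: one variable that determines the target completely and one that is independent of it. `mutual_info_score` reports the result in nats (natural logarithm).

```python
from sklearn.metrics import mutual_info_score

# Made-up binary target and two made-up categorical variables.
y = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # knowing this value tells us y exactly
uninformative = ["a", "b", "a", "b"]  # this value says nothing about y

mi_informative = mutual_info_score(y, informative)
mi_uninformative = mutual_info_score(y, uninformative)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

The informative variable scores ln 2 (about 0.69), the maximum for a balanced binary target, while the independent one scores 0.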

In [9]:
mutual_information_scores = {
    variable: mutual_info_score(y_train, df_train[variable])
    for variable in categorical_features
}
mutual_information_scores_sorted = sorted(
    mutual_information_scores.items(), key=lambda item: item[1], reverse=True
)
print(mutual_information_scores_sorted)
print(
    f"{mutual_information_scores_sorted[0][0]} has the highest mutual information score."
)

[('lead_source', 0.03539624379726594), ('employment_status', 0.012937677269442782), ('industry', 0.011574521435657112), ('location', 0.004464157884038034)]
lead_source has the highest mutual information score.


Which of these variables has the biggest mutual information score?
- `industry`
- `location`
- **`lead_source`**
- `employment_status`

## Question 4
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [10]:
dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient="records")
X_train_encoded = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient="records")
X_val_encoded = dv.transform(val_dict)


model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_encoded, y_train)

y_pred = model.predict(X_val_encoded)
baseline_accuracy = (y_val == y_pred).mean()
round(baseline_accuracy, 2)

np.float64(0.7)

What accuracy did you get?
- 0.64
- **0.74**
- 0.84
- 0.94

## Question 5
- Let's find the least useful feature using the *feature elimination* technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.


In [11]:
features_to_drop = ["industry", "employment_status", "lead_score"]

feature_impacts = {}
for col in features_to_drop:
    df_train_dropped_col = df_train.drop(columns=[col])
    df_val_dropped_col = df_val.drop(columns=[col])

    dv = DictVectorizer(sparse=False)
    train_dict = df_train_dropped_col.to_dict(orient="records")
    X_train_encoded = dv.fit_transform(train_dict)

    val_dict = df_val_dropped_col.to_dict(orient="records")
    X_val_encoded = dv.transform(val_dict)

    model = LogisticRegression(
        solver="liblinear", C=1.0, max_iter=1000, random_state=42
    )
    model.fit(X_train_encoded, y_train)

    y_pred = model.predict(X_val_encoded)
    accuracy = (y_val == y_pred).mean()
    feature_impacts[col] = baseline_accuracy - accuracy

sorted_feature_impacts = sorted(
    feature_impacts.items(), key=lambda item: abs(item[1]), reverse=False
)
print(sorted_feature_impacts)
print(f"{sorted_feature_impacts[0][0]} has the smallest impact")

[('industry', np.float64(0.0)), ('employment_status', np.float64(0.0034129692832763903)), ('lead_score', np.float64(-0.0068259385665528916))]
industry has the smallest impact



Which of the following features has the smallest difference?
- **`'industry'`**
- `'employment_status'`
- `'lead_score'`
> **Note**: The difference doesn't have to be positive.

## Question 6
- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
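
In Scikit-Learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for illustration, not the lead scoring dataset) by comparing the size of the learned weights for different `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the lead scoring dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

coef_norms = {}
for c in [0.01, 1, 100]:
    model = LogisticRegression(
        solver="liblinear", C=c, max_iter=1000, random_state=42
    )
    model.fit(X, y)
    # Smaller C means a stronger penalty, which pulls the weights toward zero.
    coef_norms[c] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

When several values of `C` give the same validation accuracy, the smallest one corresponds to the most strongly regularized model among them.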

In [19]:
c_values = [0.01, 0.1, 1, 10, 100]

dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient="records")
X_train_encoded = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient="records")
X_val_encoded = dv.transform(val_dict)

c_to_accuracy = {}
for c in c_values:
    model = LogisticRegression(solver="liblinear", C=c, max_iter=1000, random_state=42)
    model.fit(X_train_encoded, y_train)

    y_pred = model.predict(X_val_encoded)
    acc = (y_val == y_pred).mean()
    c_to_accuracy[c] = acc.round(3)

sorted_accuracies = sorted(
    c_to_accuracy.items(), key=lambda item: item[1], reverse=True
)
print(sorted_accuracies)

[(0.01, np.float64(0.7)), (0.1, np.float64(0.7)), (1, np.float64(0.7)), (10, np.float64(0.7)), (100, np.float64(0.7))]


Which of these `C` values leads to the best accuracy on the validation set?
- **0.01**
- 0.1
- 1
- 10
- 100
> **Note**: If there are multiple options, select the smallest `C`.

## Submit the Results
- Submit your results [here](https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03)
- If your answer doesn't match options exactly, select the closest one.