## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.




### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.



In [None]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

--2025-10-13 15:43:12--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv.1’


2025-10-13 15:43:12 (4.07 MB/s) - ‘course_lead_scoring.csv.1’ saved [80876/80876]



### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0
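
The rule above can be sketched on a toy DataFrame. The helper name `fill_missing` and the sample values are assumptions made for illustration. The check uses `pd.api.types.is_numeric_dtype` rather than a comparison with `'object'`, so it does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

def fill_missing(df):
    """Return a copy with 'NA' in non-numeric columns and 0.0 in numeric ones."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(0.0)   # numerical feature
        else:
            out[col] = out[col].fillna('NA')  # categorical feature
    return out

# Hypothetical two-row sample with one gap per column
toy = pd.DataFrame({
    'industry': ['retail', None],
    'annual_income': [100.0, None],
})
toy_filled = fill_missing(toy)
```

Note that `fillna` returns a new object, so the result has to be assigned back to the column.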



In [23]:
import pandas as pd
import numpy as np
import plotly.express as px

In [6]:
df = pd.read_csv('course_lead_scoring.csv')
df.head()

|   | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [13]:
df.columns[df.isnull().sum()>0]

Index(['lead_source', 'industry', 'annual_income', 'employment_status',
       'location'],
      dtype='object')

In [None]:
df[['lead_source', 'industry', 'annual_income', 'employment_status',
    'location']].isnull().sum()

lead_source          128
industry             134
annual_income        181
employment_status    100
location              63
dtype: int64

In [15]:
df[['lead_source', 'industry', 'annual_income', 'employment_status','location']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1462 entries, 0 to 1461
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   lead_source        1334 non-null   object 
 1   industry           1328 non-null   object 
 2   annual_income      1281 non-null   float64
 3   employment_status  1362 non-null   object 
 4   location           1399 non-null   object 
dtypes: float64(1), object(4)
memory usage: 57.2+ KB


In [21]:
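# Fill missing values column by column; fillna returns a new Series,
# so the result has to be assigned back to the DataFrame
for col in df.columns[df.isnull().sum()>0]:
    if df[col].dtype == 'object':  # Categorical
        df[col] = df[col].fillna('NA')
    else:  # Numerical
        df[col] = df[col].fillna(0.0)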

In [22]:
df.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- **`retail`**




In [26]:
px.histogram(df,x='industry',text_auto=True,width=500,height=400).update_layout(template='simple_white',xaxis={'categoryorder':'total descending'})

In [28]:
df.industry.describe().T

count       1462
unique         8
top       retail
freq         203
Name: industry, dtype: object

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.



In [33]:
corr_cols = ['interaction_count','lead_score','number_of_courses_viewed','annual_income']
corr_matrix = df[corr_cols].corr().round(4)
fig = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale='RdBu_r',
    title='Correlation Matrix', width=1000, height=700
)
fig.update_layout(template='simple_white')
fig.show()

In [34]:
corr_matrix

|   | interaction_count | lead_score | number_of_courses_viewed | annual_income |
|---|---|---|---|---|
| interaction_count | 1.0 | 0.0099 | -0.0236 | 0.027 |
| lead_score | 0.0099 | 1.0 | -0.0049 | 0.0156 |
| number_of_courses_viewed | -0.0236 | -0.0049 | 1.0 | 0.0098 |
| annual_income | 0.027 | 0.0156 | 0.0098 | 1.0 |


- `interaction_count` and `lead_score`: 0.0099
- `number_of_courses_viewed` and `lead_score`: -0.0049
- `number_of_courses_viewed` and `interaction_count`: **-0.0236**
- `annual_income` and `interaction_count`: **0.027**
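
As a quick check, the four candidate pairs can be ranked directly. This is a minimal sketch that copies the rounded coefficients from the matrix printed above; the variable names are illustrative. It ranks the pairs both by signed value and by absolute value, since "biggest" could be read either way.

```python
# Coefficients copied from the rounded correlation matrix in this notebook
candidate_pairs = {
    ("interaction_count", "lead_score"): 0.0099,
    ("number_of_courses_viewed", "lead_score"): -0.0049,
    ("number_of_courses_viewed", "interaction_count"): -0.0236,
    ("annual_income", "interaction_count"): 0.027,
}

# Largest coefficient taking the sign into account
largest_signed = max(candidate_pairs, key=candidate_pairs.get)

# Largest coefficient by magnitude, ignoring the sign
largest_absolute = max(candidate_pairs, key=lambda pair: abs(candidate_pairs[pair]))

print(largest_signed, largest_absolute)
```

Both readings point to the same pair here, `annual_income` and `interaction_count`.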

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.



In [73]:
from sklearn.model_selection import train_test_split

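df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
# Note: test_size=0.2 of the remaining 80% yields a 64%/16%/20% split;
# test_size=0.25 would give the requested 60%/20%/20%
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=42)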

len(df_train), len(df_val), len(df_test)

(935, 234, 293)
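
The sizes above correspond to roughly 64%/16%/20% of the 1462 rows, because the second split takes 20% of the remaining 80%. The arithmetic can be sketched as follows. The helper `split_sizes` is hypothetical, and it assumes that the held-out part of each split is rounded up, which is consistent with the sizes printed in this notebook.

```python
from math import ceil

def split_sizes(n_rows, test_fraction, val_fraction_of_rest):
    """Sizes of train/val/test after two consecutive splits (held-out part rounded up)."""
    n_test = ceil(test_fraction * n_rows)
    n_rest = n_rows - n_test
    n_val = ceil(val_fraction_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

# Second split with 0.2, as in the cell above
sizes_as_run = split_sizes(1462, 0.2, 0.2)

# Second split with 0.25, i.e. 20% of the full dataset taken out of the remaining 80%
sizes_60_20_20 = split_sizes(1462, 0.2, 0.25)

print(sizes_as_run, sizes_60_20_20)
```

Under these assumptions, a second split with `test_size=0.25` would produce a validation set of the same size as the test set.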

In [74]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [75]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- **`lead_source`**
- `employment_status`
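
For intuition, mutual information between two discrete variables can be computed by hand from counts. The function below is a minimal pure-Python sketch of the standard formula, using the natural logarithm; it is meant for illustration on toy data, not as a replacement for `mutual_info_score`. The name `mutual_information` and the sample values are assumptions.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written with counts to keep the arithmetic simple
        mi += p_xy * log(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi

# A feature that fully determines the target: MI equals log(2) for a balanced binary target
mi_dependent = mutual_information(['a', 'a', 'b', 'b'], [1, 1, 0, 0])

# A feature that is independent of the target: MI is zero
mi_independent = mutual_information(['a', 'a', 'b', 'b'], [0, 1, 0, 1])
```

A higher score means that knowing the feature value tells us more about the target.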




In [76]:
from sklearn.metrics import mutual_info_score

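# Note: these calls use df_full_train (train + validation), which still contains
# `converted`; the task statement asks for the training set only
display(mutual_info_score(df_full_train.converted, df_full_train.industry))
# annual_income is continuous, so this score is not comparable to the categorical ones
display(mutual_info_score(df_full_train.annual_income, df_full_train.converted))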

np.float64(0.011684562750165564)

np.float64(0.5816088091455935)

In [77]:
def mutual_info_converted_score(series):
    return mutual_info_score(series, df_full_train.converted)

In [78]:
df_full_train.select_dtypes(include='object').columns.to_list()

['lead_source', 'industry', 'employment_status', 'location']

In [79]:
categorical = df_full_train.select_dtypes(include='object').columns.to_list()
mi = df_full_train[categorical].apply(mutual_info_converted_score)
mi.sort_values(ascending=False)*100

lead_source          2.566537
employment_status    1.325850
industry             1.168456
location             0.225304
dtype: float64

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- **0.74**
- 0.84
- 0.94
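
The one-hot encoding step can be illustrated without any library. The function below is a hand-written stand-in, not the `DictVectorizer` implementation: string values become `column=value` indicator features and numeric values are passed through unchanged. The name `one_hot_records` and the two sample records are assumptions.

```python
def one_hot_records(records):
    """Encode a list of dicts: strings become indicator columns, numbers pass through."""
    names = set()
    for row in records:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    matrix = []
    for row in records:
        vec = [0.0] * len(feature_names)
        for key, value in row.items():
            if isinstance(value, str):
                vec[index[f"{key}={value}"]] = 1.0  # indicator for this category
            else:
                vec[index[key]] = float(value)      # numeric feature kept as-is
        matrix.append(vec)
    return feature_names, matrix

# Hypothetical records with one categorical and one numerical feature
records = [
    {"industry": "retail", "lead_score": 0.8},
    {"industry": "healthcare", "lead_score": 0.69},
]
feature_names, X = one_hot_records(records)
print(feature_names)
print(X)
```

Each category gets its own column, so the model learns a separate weight per category value.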




In [80]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [91]:
# Identify categorical & numerical columns
categorical = df_train.select_dtypes(include='object').columns.to_list()
numerical = df_train.select_dtypes(include='number').columns.to_list()

# Convert training data to dict
train_dict = df_train[categorical + numerical].to_dict(orient='records')
val_dict = df_val[categorical + numerical].to_dict(orient='records')

# One-hot encode using DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

# Initialize and train the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities
y_pred = model.predict_proba(X_val)[:, 1]

# Convert probabilities to binary predictions
y_pred_binary = (y_pred >= 0.5)

# Calculate accuracy
accuracy = (y_pred_binary == y_val).mean()
print("Validation accuracy:", round(accuracy, 2))

Validation accuracy: 0.71


In [None]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = (y_pred >= 0.5).astype(int)
df_pred['actual'] = y_val

df_pred['correct'] = df_pred.prediction == df_pred.actual
display(df_pred)
round(df_pred.correct.mean()*100,2)

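|   | probability | prediction | actual | correct |
|---|---|---|---|---|
| 0 | 0.612835 | 1 | 0 | False |
| 1 | 0.794857 | 1 | 1 | True |
| 2 | 0.523759 | 1 | 0 | False |
| 3 | 0.470296 | 0 | 0 | True |
| 4 | 0.566273 | 1 | 0 | False |
| ... | ... | ... | ... | ... |
| 229 | 0.469111 | 0 | 0 | True |
| 230 | 0.532373 | 1 | 0 | False |
| 231 | 0.856832 | 1 | 1 | True |
| 232 | 0.374891 | 0 | 1 | False |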


np.float64(70.51)

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- **`'industry'`**
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.




In [93]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Get all features (categorical + numerical)
features = categorical + numerical

# Baseline model
dv = DictVectorizer(sparse=False)
train_dict = df_train[features].to_dict(orient='records')
val_dict = df_val[features].to_dict(orient='records')

X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_val)[:, 1]
y_pred_binary = (y_pred >= 0.5)
baseline_accuracy = (y_pred_binary == y_val).mean()
print("Baseline accuracy:", baseline_accuracy)

# Feature elimination
differences = {}

for f in features:
    selected = [col for col in features if col != f]

    dv_temp = DictVectorizer(sparse=False)
    train_dict_temp = df_train[selected].to_dict(orient='records')
    val_dict_temp = df_val[selected].to_dict(orient='records')

    X_train_temp = dv_temp.fit_transform(train_dict_temp)
    X_val_temp = dv_temp.transform(val_dict_temp)

    model_temp = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)
    y_pred_temp = model_temp.predict_proba(X_val_temp)[:, 1]
    y_pred_binary_temp = (y_pred_temp >= 0.5)
    accuracy_temp = (y_pred_binary_temp == y_val).mean()

    differences[f] = baseline_accuracy - accuracy_temp

# Sort features by smallest impact
sorted_diff = sorted(differences.items(), key=lambda x: x[1])
print("Feature impact (smallest first):")
for f, d in sorted_diff:
    print(f"{f}: {d:.5f}")


Baseline accuracy: 0.7051282051282052
Feature impact (smallest first):
annual_income: -0.16667
location: -0.00427
industry: 0.00000
lead_score: 0.00000
lead_source: 0.00427
employment_status: 0.01282
interaction_count: 0.02564
number_of_courses_viewed: 0.14530
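
To compare only the three answer options, the printed differences can be filtered and ranked. This is a small sketch that copies the values from the output above and ranks them by absolute difference; the variable names are illustrative.

```python
# Differences copied from the output above, restricted to the answer options
option_differences = {
    "industry": 0.0,
    "employment_status": 0.01282,
    "lead_score": 0.0,
}

# Smallest difference by magnitude, keeping every feature that ties for it
smallest = min(abs(d) for d in option_differences.values())
least_useful = [f for f, d in option_differences.items() if abs(d) == smallest]

print(least_useful)
```

In this run `industry` and `lead_score` tie at a difference of zero, so removing either one leaves the validation accuracy unchanged.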


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- **0.01**
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.



In [96]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

# Prepare data (same as Q4)
features = categorical + numerical
train_dict = df_train[features].to_dict(orient='records')
val_dict = df_val[features].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

C_values = [0.01, 0.1, 1, 10, 100]
accuracies = {}

for c in C_values:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    y_pred_binary = (y_pred >= 0.5)
    acc = (y_pred_binary == y_val).mean()
    accuracies[c] = acc

display(accuracies)


{0.01: np.float64(0.7051282051282052),
 0.1: np.float64(0.7051282051282052),
 1: np.float64(0.7051282051282052),
 10: np.float64(0.7051282051282052),
 100: np.float64(0.7051282051282052)}
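
The selection rule from the task (best rounded accuracy, smallest `C` on ties) can be sketched as below. The accuracies are copied from the output above as plain floats; the variable names are illustrative.

```python
# Validation accuracies copied from the output above
accuracies = {
    0.01: 0.7051282051282052,
    0.1: 0.7051282051282052,
    1: 0.7051282051282052,
    10: 0.7051282051282052,
    100: 0.7051282051282052,
}

# Round to 3 decimal digits, as the task asks
rounded = {c: round(acc, 3) for c, acc in accuracies.items()}

# Pick the best accuracy, then the smallest C among the values that reach it
best_accuracy = max(rounded.values())
best_c = min(c for c, acc in rounded.items() if acc == best_accuracy)

print(best_c, best_accuracy)
```

All five values of `C` give the same accuracy here, so the tie-break rule decides the answer.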

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03
* If your answer doesn't match options exactly, select the closest one
