<a href="https://colab.research.google.com/github/oleksiyo/machine-learning-zoomcamp/blob/master/cohorts/2025/03-classification/Homework_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning for Classification

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

--2025-10-16 08:37:33--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv’


2025-10-16 08:37:33 (53.3 MB/s) - ‘course_lead_scoring.csv’ saved [80876/80876]



In [3]:
df_full = pd.read_csv('course_lead_scoring.csv')

In [4]:
df_full.head().T

Unnamed: 0,0,1,2,3,4
lead_source,paid_ads,social_media,events,paid_ads,referral
industry,,retail,healthcare,retail,education
number_of_courses_viewed,1,1,5,2,3
annual_income,79450.0,46992.0,78796.0,83843.0,85012.0
employment_status,unemployed,employed,unemployed,,self_employed
location,south_america,south_america,australia,australia,europe
interaction_count,4,1,3,1,3
lead_score,0.94,0.8,0.69,0.87,0.62
converted,1,0,1,0,1


In [5]:
df_full.dtypes

Unnamed: 0,0
lead_source,object
industry,object
number_of_courses_viewed,int64
annual_income,float64
employment_status,object
location,object
interaction_count,int64
lead_score,float64
converted,int64


## Data preparation
* Check if the missing values are presented in the features.
* If there are missing values:
* For caterogiral features, replace them with 'NA'
* For numerical features, replace with with 0.0

In [6]:
def get_missing_values(df):
  missing_summary = (
      df_full.isnull()
        .sum()
        .reset_index()
        .rename(columns={'index': 'column_name', 0: 'missing_values'})
  )

  # Add column type
  missing_summary['dtype'] = missing_summary['column_name'].apply(lambda x: df_full[x].dtype)

  # Keep only columns that actually have missing values
  missing_summary = missing_summary[missing_summary['missing_values'] > 0]
  return missing_summary


missing_summary = get_missing_values(df_full)

# Display the result
print("Missing values summary before filling:")
display(missing_summary)


Missing values summary before filling:


Unnamed: 0,column_name,missing_values,dtype
0,lead_source,128,object
1,industry,134,object
3,annual_income,181,float64
4,employment_status,100,object
5,location,63,object


In [7]:
# Separate column types
categorical = df_full.select_dtypes(include=['object']).columns
numerical = df_full.select_dtypes(include=[np.number]).columns

# # Replace missing values
df_full[categorical] = df_full[categorical].fillna('NA')
df_full[numerical] = df_full[numerical].fillna(0.0)
print(categorical)
print(numerical)

Index(['lead_source', 'industry', 'employment_status', 'location'], dtype='object')
Index(['number_of_courses_viewed', 'annual_income', 'interaction_count',
       'lead_score', 'converted'],
      dtype='object')


In [8]:
missing_summary = get_missing_values(df_full)

# Display the result
print("Missing values summary before filling:")
display(missing_summary)

Missing values summary before filling:


Unnamed: 0,column_name,missing_values,dtype


# Question 1
What is the most frequent observation (mode) for the column **industry**?

In [9]:
df_full.industry.value_counts()

Unnamed: 0_level_0,count
industry,Unnamed: 1_level_1
retail,203
finance,200
other,198
healthcare,187
education,187
technology,179
manufacturing,174
,134


Answer: The most frequent observation (mode) for the column **industry** is **retail**

# Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

In [10]:
X_num = df_full.select_dtypes(include=[np.number])
X_cat = df_full.select_dtypes(include=[object])

# check for for the two highest correlated features
corr = X_num.corr()

interaction_count and lead_score
number_of_courses_viewed and lead_score
number_of_courses_viewed and interaction_count
annual_income and interaction_count

In [11]:
# all correlation pairs in a dataframe sorted by absolute correlation without duplicates
corr_pairs = corr.abs().unstack().sort_values(ascending=False).drop_duplicates()
corr_pairs = corr_pairs[corr_pairs != 1]
pd.DataFrame(corr_pairs, columns=["correlation"]).style.background_gradient(cmap='coolwarm')

Unnamed: 0,Unnamed: 1,correlation
converted,number_of_courses_viewed,0.435914
interaction_count,converted,0.374573
lead_score,converted,0.193673
converted,annual_income,0.053131
interaction_count,annual_income,0.027036
number_of_courses_viewed,interaction_count,0.023565
lead_score,annual_income,0.01561
lead_score,interaction_count,0.009888
number_of_courses_viewed,annual_income,0.00977
number_of_courses_viewed,lead_score,0.004879


**Answer**: **interaction_count** and **annual_income**

## Split the data
* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value y is not in your dataframe.

In [12]:
from sklearn.model_selection import train_test_split

# split data into train/val/test with 60%/20%/20% ratio
SEED = 42

df_full_train, df_test = train_test_split(df_full, test_size=0.2, random_state=SEED)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=SEED)

assert len(df_full) == (len(df_train) + len(df_val) + len(df_test))

In [13]:
# Reset_index for y_train, y_val and y_test.
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
# Define y_train, y_val and y_test.
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values
# # Drop y from y_train, y_val and y_test.
# df_train = df_train.drop('converted', axis=1)
# df_val = df_val.drop('converted', axis=1)
# df_test = df_test.drop('converted', axis=1)

del df_train["converted"]
del df_test["converted"]
del df_val["converted"]

# Question 3
* Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).

In [14]:
from sklearn.metrics import mutual_info_score
def mutual_info(series):
    return mutual_info_score(series, y_train)


In [15]:
# List the categorical columns.
df_cat = df_full.copy().select_dtypes(exclude='number').columns
df_cat

Index(['lead_source', 'industry', 'employment_status', 'location'], dtype='object')

In [16]:
cat_features = ['lead_source', 'industry', 'employment_status', 'location']
# Calculate MI.
df_mi = df_train[cat_features].apply(mutual_info).round(2)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='mi')
df_mi

Unnamed: 0,mi
lead_source,0.04
industry,0.01
employment_status,0.01
location,0.0


In [32]:
max_mi_feature = df_mi['mi'].idxmax()
max_mi_value = df_mi['mi'].max()

print(f"Answer: {max_mi_feature} has the biggest mutual information score {max_mi_value}")

Answer: lead_source has the biggest mutual information score 0.04


# Question 4
* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.

1.   To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
2.   model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [18]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, accuracy_score

dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [19]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [20]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

In [33]:
y_pred = model.predict(X_val)
accuracy = np.round(accuracy_score(y_val, y_pred), 2)
print(f'Answer: Accuracy = {accuracy}')

Answer: Accuracy = 0.71


# Question 5
* Let's find the least useful feature using the feature elimination technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.


In [22]:
# List the featurea.
features = df_train.columns.to_list()
features

['lead_source',
 'industry',
 'number_of_courses_viewed',
 'annual_income',
 'employment_status',
 'location',
 'interaction_count',
 'lead_score']

In [23]:
# Store the results in a dictionary
accuracy_differences = {}

# Apply the feature elimination technique.
original_score = accuracy
scores = pd.DataFrame(columns=['eliminated_feature', 'accuracy', 'difference'])
for feature in features:
    subset = features.copy()
    subset.remove(feature)

    dv = DictVectorizer(sparse=False)
    train_dict = df_train[subset].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    val_dict = df_val[subset].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    y_pred = model.predict(X_val)
    score = accuracy_score(y_val, y_pred)

    scores.loc[len(scores)] = [feature, score, original_score - score]

In [24]:
scores['difference'] = [abs(x) for x in scores['difference']]
scores

Unnamed: 0,eliminated_feature,accuracy,difference
0,lead_source,0.703072,0.003072
1,industry,0.699659,0.000341
2,number_of_courses_viewed,0.556314,0.143686
3,annual_income,0.853242,0.153242
4,employment_status,0.696246,0.003754
5,location,0.709898,0.009898
6,interaction_count,0.556314,0.143686
7,lead_score,0.706485,0.006485


**Answer**: industry

# Question 6
* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

In [34]:
# Regularization values
reg_values = [0.01, 0.1, 1, 10, 100]

accuracy_reg_values = {}

for C in reg_values:
    # Logistic Regression
    model_reg = LogisticRegression(
        solver="liblinear", C=C, max_iter=1_000, random_state=42
    )
    # Train the model
    model_reg.fit(X_train, y_train)

    # Calculate predictions
    y_pred_reg = model_reg.predict_proba(X_val)[:, 1]
    decision_reg = y_pred_reg >= 0.5
    reg_accuracy = (decision_reg == y_val).mean()

    # Fill the accuracy_reg_values dictionary
    accuracy_reg_values[C] = reg_accuracy

    print(f"Regularization parameter: {C}  accuracy = {reg_accuracy}")

Regularization parameter: 0.01  accuracy = 0.6962457337883959
Regularization parameter: 0.1  accuracy = 0.6996587030716723
Regularization parameter: 1  accuracy = 0.7064846416382252
Regularization parameter: 10  accuracy = 0.7064846416382252
Regularization parameter: 100  accuracy = 0.7064846416382252


**Answer**: 0.01