# Dataset
In this homework, we will use the lead scoring dataset (https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ensure that plots are displayed inline in the Jupyter notebook's cells
%matplotlib inline

In [2]:
df_ori = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')
df_ori.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


In [13]:
df_ori.shape

(1462, 9)

# Data preparation
* Check whether the features contain missing values.

* If there are missing values:

    * For categorical features, replace them with 'NA'

    * For numerical features, replace them with 0.0

In [3]:
df_ori.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [4]:
df_ori.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [5]:
categorical = df_ori.select_dtypes(include=['object']).columns
categorical

Index(['lead_source', 'industry', 'employment_status', 'location'], dtype='object')

In [6]:
numerical = df_ori.select_dtypes(include=['int64', 'float64']).columns
numerical

Index(['number_of_courses_viewed', 'annual_income', 'interaction_count',
       'lead_score', 'converted'],
      dtype='object')

In [7]:
df = df_ori.copy()
df[categorical] = df[categorical].fillna('NA')
df[numerical] = df[numerical].fillna(0)
df.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,NA,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,NA,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


Check again to make sure every missing value has been filled.

In [8]:
df.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

# Question 1
What is the most frequent observation (mode) for the column `industry`?

In [9]:
df['industry'].mode()[0]

'retail'

# Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

In [10]:
df[['interaction_count', 'number_of_courses_viewed']].corrwith(df['lead_score']).to_frame('correlation to lead_score')


Unnamed: 0,correlation to lead_score
interaction_count,0.009888
number_of_courses_viewed,-0.004879


In [11]:
df[['number_of_courses_viewed', 'annual_income']].corrwith(df['interaction_count']).to_frame('correlation to interaction_count')


Unnamed: 0,correlation to interaction_count
number_of_courses_viewed,-0.023565
annual_income,0.027036


Among the pairs checked, `annual_income` and `interaction_count` have the biggest correlation (about 0.027).
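The cells above compare a few hand-picked pairs. As a complementary sketch, the full correlation matrix can be scanned for its strongest off-diagonal pair. The data below is synthetic and the column names `a`, `b`, `c` are made up for illustration; on the real data one would presumably pass the numerical feature columns instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numerical columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'a': rng.normal(size=200),
    'b': rng.normal(size=200),
})
# 'c' is built from 'a', so the (a, c) pair should be strongly correlated.
toy['c'] = 0.8 * toy['a'] + rng.normal(scale=0.3, size=200)

corr = toy.corr()
cols = corr.columns

# Indices of the upper triangle without the diagonal: each pair appears once.
i, j = np.triu_indices(len(cols), k=1)
vals = corr.to_numpy()[i, j]

# Rank by absolute value so strong negative correlations are not missed.
best = np.abs(vals).argmax()
top_pair = (cols[i[best]], cols[j[best]])
print(top_pair, round(vals[best], 3))
```

Ranking by absolute value is a design choice here: a correlation of -0.9 is stronger than one of 0.1, even though it is numerically smaller.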

# Split the data

Split your data in train/val/test sets with 60%/20%/20% distribution.

Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.

Make sure that the target value y is not in your dataframe.

In [29]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

y_train = df_train['converted']
y_val = df_val['converted']
y_test = df_test['converted']

del df_train['converted']
del df_val['converted']
del df_test['converted']

df_train.shape, df_val.shape, df_test.shape, y_train.shape, y_val.shape, y_test.shape

((876, 8), (293, 8), (293, 8), (876,), (293,), (293,))
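The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data, and 25% of 80% is 20% of the original. A minimal check of that arithmetic, using a plain index array with the same number of rows as the dataset instead of the dataframe itself:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataframe: one entry per row of the dataset.
rows = np.arange(1462)

# First hold out 20% for test, then 25% of the remainder for validation.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
print(round(len(train) / len(rows), 2),
      round(len(val) / len(rows), 2),
      round(len(test) / len(rows), 2))
```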

# Question 3
Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.

Round the scores to 2 decimals using round(score, 2).

Which of these variables has the biggest mutual information score?

In [32]:
from sklearn.metrics import mutual_info_score
for c in categorical:
    score = mutual_info_score(df_train[c], y_train)
    print(f'mutual_info_score between {c} and converted: {round(score, 2)}')

mutual_info_score between lead_source and converted: 0.04
mutual_info_score between industry and converted: 0.01
mutual_info_score between employment_status and converted: 0.01
mutual_info_score between location and converted: 0.0


`lead_source` has the largest mutual information score with y (0.04).
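To get a feel for the scale of these scores, here is a small sketch on made-up labels. It contrasts a variable that fully determines the target with one that is independent of it. The lists `informative` and `uninformative` are hypothetical and not taken from the dataset.

```python
import math
from sklearn.metrics import mutual_info_score

# Made-up binary target, balanced between 0 and 1.
y = [0, 0, 1, 1, 0, 0, 1, 1]

# 'a' always goes with y=0 and 'b' with y=1, so this variable determines y.
informative = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']

# Every (category, y) combination occurs equally often: independent of y.
uninformative = ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y']

mi_informative = mutual_info_score(informative, y)
mi_uninformative = mutual_info_score(uninformative, y)

print(round(mi_informative, 3), round(mi_uninformative, 3))
```

`mutual_info_score` uses the natural logarithm, so a variable that fully determines a balanced binary target scores ln 2 (about 0.693). Scores such as 0.04 or 0.01 are therefore small in absolute terms, and mainly useful for ranking the variables against each other.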

# Question 4

Now let's train a logistic regression.

Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

Fit the model on the training dataset.

To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

train_dict = df_train.to_dict(orient='records')
train_dict[0]

{'lead_source': 'paid_ads',
 'industry': 'retail',
 'number_of_courses_viewed': 0,
 'annual_income': 58472.0,
 'employment_status': 'student',
 'location': 'middle_east',
 'interaction_count': 5,
 'lead_score': 0.03}

In [40]:
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train = dv.transform(train_dict)
list(X_train[0])

[np.float64(58472.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(1.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(1.0),
 np.float64(0.0),
 np.float64(5.0),
 np.float64(0.03),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(1.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(1.0),
 np.float64(0.0),
 np.float64(0.0),
 np.float64(0.0)]

In [38]:
list(dv.get_feature_names_out())

['annual_income',
 'employment_status=NA',
 'employment_status=employed',
 'employment_status=self_employed',
 'employment_status=student',
 'employment_status=unemployed',
 'industry=NA',
 'industry=education',
 'industry=finance',
 'industry=healthcare',
 'industry=manufacturing',
 'industry=other',
 'industry=retail',
 'industry=technology',
 'interaction_count',
 'lead_score',
 'lead_source=NA',
 'lead_source=events',
 'lead_source=organic_search',
 'lead_source=paid_ads',
 'lead_source=referral',
 'lead_source=social_media',
 'location=NA',
 'location=africa',
 'location=asia',
 'location=australia',
 'location=europe',
 'location=middle_east',
 'location=north_america',
 'location=south_america',
 'number_of_courses_viewed']

In [None]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)
y_val_pred_prob = model.predict_proba(X_val)[:, 1]
# keep the unrounded value for Q5; round only for display
original_accuracy = ((y_val_pred_prob > 0.5) == y_val).mean()
round(original_accuracy, 2)

np.float64(0.7)
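The accuracy above is computed by hand: threshold the predicted probability at 0.5 and take the share of matching labels. As a cross-check, the same number can be obtained with `accuracy_score` from Scikit-Learn. The arrays below are made-up toy values, not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([0.9, 0.4, 0.35, 0.6, 0.7])

# Hard predictions at the 0.5 threshold, as in the cell above.
y_pred = (proba > 0.5).astype(int)

# Manual accuracy: the mean of a boolean array is the fraction of True values.
manual = (y_pred == y_true).mean()
library = accuracy_score(y_true, y_pred)

print(manual, library)
```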

# Question 5

Let's find the least useful feature using the feature elimination technique.

Train a model using the same features and parameters as in Q4 (without rounding).

Now exclude each feature from this set and train a model without it. Record the accuracy for each model.

For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

In [77]:
excluded_features = ['industry', 'employment_status', 'lead_score']
features = list(dv.get_feature_names_out())
for e in excluded_features:
    # keep every encoded column that does not belong to the excluded feature
    feature_indices = [idx for idx, name in enumerate(features) if not name.startswith(e)]
    model.fit(X_train[:, feature_indices], y_train)
    # X_val was already built in Q4, so only the column subset changes
    y_val_pred_prob = model.predict_proba(X_val[:, feature_indices])[:, 1]
    new_accuracy = ((y_val_pred_prob > 0.5) == y_val).mean()
    print(f'Accuracy abs diff: {abs(original_accuracy - new_accuracy)} when excluding {e}')

Accuracy abs diff: 0.0003412969283276279 when excluding industry
Accuracy abs diff: 0.003754266211604018 when excluding employment_status
Accuracy abs diff: 0.006484641638225264 when excluding lead_score


`industry` has the smallest accuracy difference.
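Selecting columns with a plain `startswith` check works for the three features tested here, but it would also drop unrelated columns if one feature name happened to be a prefix of another. A slightly stricter sketch is shown below. The helper `columns_without` is hypothetical, and it assumes the `name=value` naming that `DictVectorizer` produces for categorical features.

```python
def columns_without(feature_names, feature):
    """Return indices of encoded columns that do not come from `feature`.

    A column belongs to `feature` if it is the feature itself (numerical)
    or has the form 'feature=value' (one-hot encoded categorical).
    """
    return [
        idx for idx, name in enumerate(feature_names)
        if name != feature and not name.startswith(feature + '=')
    ]


# A shortened version of the encoded feature names, for illustration.
names = [
    'annual_income',
    'industry=NA',
    'industry=retail',
    'lead_score',
    'lead_source=NA',
    'lead_source=events',
]

without_industry = columns_without(names, 'industry')
without_lead_score = columns_without(names, 'lead_score')
print(without_industry)
print(without_lead_score)
```

The returned index lists can be used in the same way as `feature_indices` in the loop above, for example `X_train[:, without_industry]`.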

# Question 6

Now let's train a regularized logistic regression.

Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].

Train models using all the features as in Q4.

Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these C leads to the best accuracy on the validation set?

In [86]:
C_val = [0.01, 0.1, 1, 10, 100]
for c in C_val:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    # X_val from Q4 is reused, so there is no need to rebuild it for every C
    y_val_pred_prob = model.predict_proba(X_val)[:, 1]
    accuracy_with_c = ((y_val_pred_prob > 0.5) == y_val).mean()
    print(f'Accuracy with C:{c}: {accuracy_with_c}')

Accuracy with C:0.01: 0.6996587030716723
Accuracy with C:0.1: 0.6996587030716723
Accuracy with C:1: 0.6996587030716723
Accuracy with C:10: 0.6996587030716723
Accuracy with C:100: 0.6996587030716723


All five values of C give the same validation accuracy (0.7 after rounding to 3 decimal digits), so we pick the smallest one: C=0.01.
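The tie-breaking step can be made explicit in code. The sketch below takes the accuracies printed above, rounds them to 3 decimal digits, and picks the smallest C among those with the best rounded accuracy. Preferring the smallest C on a tie is the convention assumed here.

```python
# Validation accuracies copied from the output above.
accuracies = {
    0.01: 0.6996587030716723,
    0.1: 0.6996587030716723,
    1: 0.6996587030716723,
    10: 0.6996587030716723,
    100: 0.6996587030716723,
}

# Round first, so that differences below 3 decimal digits count as ties.
rounded = {c: round(acc, 3) for c, acc in accuracies.items()}
best_accuracy = max(rounded.values())

# Among the tied values, choose the smallest C (strongest regularization).
best_c = min(c for c, acc in rounded.items() if acc == best_accuracy)

print(best_c, best_accuracy)
```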