# HOMEWORK 3 - ML FOR CLASSIFICATION
## Using the course lead scoring dataset

## Set up environment

In [260]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

## Dataset

In this homework, we will use the course lead scoring dataset.

In [261]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

## Data preparation

* Check whether any features contain missing values.
* If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0

## Question 1

What is the most frequent observation (mode) for the column `industry`?

* NA
* technology
* healthcare
* retail

In [262]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1462 entries, 0 to 1461
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lead_source               1334 non-null   object 
 1   industry                  1328 non-null   object 
 2   number_of_courses_viewed  1462 non-null   int64  
 3   annual_income             1281 non-null   float64
 4   employment_status         1362 non-null   object 
 5   location                  1399 non-null   object 
 6   interaction_count         1462 non-null   int64  
 7   lead_score                1462 non-null   float64
 8   converted                 1462 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 102.9+ KB


In [263]:
# check if there are missing values in the df
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [264]:
# Replacing missing values

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna('NA')
    elif np.issubdtype(df[col].dtype, np.number):
        df[col] = df[col].fillna(0)
    else:
        print('Type unexpected')

df.isnull().sum()


lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64
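
The same fill can be sketched without an explicit loop by selecting columns by dtype. This is a minimal sketch on a made-up frame: `toy` and its values are assumptions for illustration, not rows from the dataset. It treats every non-numerical column as categorical, so it does not depend on the exact name of the string dtype.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in one categorical and one numerical column
toy = pd.DataFrame({
    'industry': ['retail', None, 'finance'],
    'annual_income': [50000.0, np.nan, 72000.0],
    'converted': [1, 0, 1],
})

# Numerical columns get 0.0; every other column is treated as categorical and gets 'NA'
num_cols = toy.select_dtypes(include='number').columns
cat_cols = toy.columns.difference(num_cols)

toy[num_cols] = toy[num_cols].fillna(0.0)
toy[cat_cols] = toy[cat_cols].fillna('NA')

# Total number of missing values left in the frame
print(toy.isnull().sum().sum())
```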

In [265]:
df.industry.value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

In [266]:
mode_industry = df.industry.mode()
print(f'The most frequent observation for "industry" column\n is: {mode_industry[0]}')

The most frequent observation for "industry" column
 is: retail


## Question 2

Create the _correlation matrix_ for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

* `interaction_count` and `lead_score`
* `number_of_courses_viewed` and `lead_score`
* `number_of_courses_viewed` and `interaction_count`
* `annual_income` and `interaction_count`
  
Only consider the pairs above when answering this question.

## Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.

In [267]:
# Correlation matrix

df_numeric = df.select_dtypes(include='number')
del df_numeric['converted']

corr_matrix = df_numeric.corr()

In [268]:
corr_matrix

                          number_of_courses_viewed  annual_income  interaction_count  lead_score
number_of_courses_viewed                  1.000000       0.009770          -0.023565   -0.004879
annual_income                             0.009770       1.000000           0.027036    0.015610
interaction_count                        -0.023565       0.027036           1.000000    0.009888
lead_score                               -0.004879       0.015610           0.009888    1.000000


In [269]:
stacked_corr = corr_matrix.stack()
stacked_corr

number_of_courses_viewed  number_of_courses_viewed    1.000000
                          annual_income               0.009770
                          interaction_count          -0.023565
                          lead_score                 -0.004879
annual_income             number_of_courses_viewed    0.009770
                          annual_income               1.000000
                          interaction_count           0.027036
                          lead_score                  0.015610
interaction_count         number_of_courses_viewed   -0.023565
                          annual_income               0.027036
                          interaction_count           1.000000
                          lead_score                  0.009888
lead_score                number_of_courses_viewed   -0.004879
                          annual_income               0.015610
                          interaction_count           0.009888
                          lead_score                  1.000000
dtype: float64

In [270]:
stacked_corr = stacked_corr[stacked_corr.index.get_level_values(0) != stacked_corr.index.get_level_values(1)]

In [271]:
features_max_corr = stacked_corr.idxmax()
max_corr = stacked_corr.max()

print(f"\nThe biggest correlation value is : **{max_corr:.4f}**")
print(f"... between the two features: **{features_max_corr[0]}** and **{features_max_corr[1]}**")


The biggest correlation value is : **0.0270**
... between the two features: **annual_income** and **interaction_count**
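
The question restricts the answer to four candidate pairs, while the lookup above searches every off-diagonal pair. A possible sketch of the restricted lookup follows. The matrix is rebuilt from the values printed earlier so the snippet runs on its own, and `candidate_pairs`, `pair_corr` and `best_pair` are names introduced here for illustration.

```python
import pandas as pd

cols = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

# Correlation values copied from the matrix printed earlier in this notebook
corr_matrix = pd.DataFrame(
    [[1.0, 0.00977, -0.023565, -0.004879],
     [0.00977, 1.0, 0.027036, 0.01561],
     [-0.023565, 0.027036, 1.0, 0.009888],
     [-0.004879, 0.01561, 0.009888, 1.0]],
    index=cols, columns=cols,
)

# The four candidate pairs listed in the question
candidate_pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count'),
]

# Look up each pair and keep the one with the largest coefficient
pair_corr = {pair: corr_matrix.loc[pair[0], pair[1]] for pair in candidate_pairs}
best_pair = max(pair_corr, key=pair_corr.get)
print(best_pair, pair_corr[best_pair])
```

With these values, ranking by absolute correlation instead of signed correlation selects the same pair, since the largest negative coefficient among the candidates is smaller in magnitude.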


In [272]:
# Splitting the data

from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
len(df_train), len(df_val), len(df_test)

(876, 293, 293)
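
The second split uses `test_size=0.25` because the validation set should be 20% of the whole dataset, which is 0.2 / 0.8 = 0.25 of the 80% left after the test set is removed. The arithmetic can be sketched as below, under the assumption that the held-out part is rounded up, which is consistent with the sizes printed above.

```python
import math

n_total = 1462  # number of rows reported by df.info()

# First split: 20% held out for test (assumed to be rounded up)
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Second split: 20% of the total is 0.2 / 0.8 = 0.25 of what remains
val_fraction = 0.2 / 0.8
n_val = math.ceil(val_fraction * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
```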

In [273]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [274]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']

## Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.
  
Which of these variables has the biggest mutual information score?

* `industry`
* `location`
* `lead_source`
* `employment_status`

In [275]:
from sklearn.metrics import mutual_info_score

df_cat_train = df_train.select_dtypes(include='object')

In [276]:
def mutual_info_converted_score(series):
    return mutual_info_score(series, y_train)

In [277]:
mi = df_cat_train.apply(mutual_info_converted_score)
mi.sort_values(ascending=False)

lead_source          0.035396
employment_status    0.012938
industry             0.011575
location             0.004464
dtype: float64

In [278]:
print(f'The variable with the biggest mutual information score is **{mi.idxmax()}** = {round(mi.max(),2)}')

The variable with the biggest mutual information score is **lead_source** = 0.04
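
To make the score easier to interpret, here is a hand-rolled sketch of mutual information between two label sequences. It is an illustration of the formula, not scikit-learn's implementation, and the toy labels are made up. The natural logarithm is used, so the result is in nats, the same unit `mutual_info_score` reports.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Each observed cell contributes p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical toy labels, not taken from the dataset
source = ['ads', 'ads', 'referral', 'referral']
perfectly_informative = [0, 0, 1, 1]   # the target is fully determined by the source
uninformative = [0, 1, 0, 1]           # the target is independent of the source

mi_high = mutual_information(source, perfectly_informative)
mi_zero = mutual_information(source, uninformative)
print(mi_high, mi_zero)
```

In this toy case a fully informative variable scores ln 2 (about 0.69) and an independent one scores 0, which puts the values around 0.01 to 0.04 observed above into perspective: each categorical variable on its own carries little information about `converted`.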


## Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
  
What accuracy did you get?

* 0.64
* 0.74
* 0.84
* 0.94

In [291]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [280]:
df_train.info()
name_num_col = df_numeric.columns.values
name_cat_col = df_cat_train.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 876 entries, 0 to 875
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lead_source               876 non-null    object 
 1   industry                  876 non-null    object 
 2   number_of_courses_viewed  876 non-null    int64  
 3   annual_income             876 non-null    float64
 4   employment_status         876 non-null    object 
 5   location                  876 non-null    object 
 6   interaction_count         876 non-null    int64  
 7   lead_score                876 non-null    float64
dtypes: float64(2), int64(2), object(4)
memory usage: 54.9+ KB


In [281]:
# Feature Scaling - Preparing the numerical features
X_train_num = df_train[name_num_col].values

scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train_num)

In [282]:
# One Hot Encoding - Preparing the categorical features
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_cat = ohe.fit_transform(df_train[name_cat_col].values)

In [283]:
ohe.get_feature_names_out()

array(['x0_NA', 'x0_events', 'x0_organic_search', 'x0_paid_ads',
       'x0_referral', 'x0_social_media', 'x1_NA', 'x1_education',
       'x1_finance', 'x1_healthcare', 'x1_manufacturing', 'x1_other',
       'x1_retail', 'x1_technology', 'x2_NA', 'x2_employed',
       'x2_self_employed', 'x2_student', 'x2_unemployed', 'x3_NA',
       'x3_africa', 'x3_asia', 'x3_australia', 'x3_europe',
       'x3_middle_east', 'x3_north_america', 'x3_south_america'],
      dtype=object)
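
The `x0_` to `x3_` prefixes appear because the encoder was given a plain array, so it only knows the column positions. The effect of `handle_unknown='ignore'` can be sketched by hand as follows. This is an illustration of the behaviour, not scikit-learn's code, and `fit_categories`, `one_hot` and the toy values are hypothetical.

```python
import numpy as np

def fit_categories(values):
    """Learn the sorted list of categories seen during fitting."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode values as 0/1 columns; unseen categories become all-zero rows."""
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded

# Hypothetical toy values
train_values = ['europe', 'asia', 'NA', 'asia']
val_values = ['asia', 'antarctica']  # 'antarctica' was never seen in training

categories = fit_categories(train_values)
val_encoded = one_hot(val_values, categories)
print(categories)
print(val_encoded)
```

A category that only shows up in the validation data is encoded as a row of zeros instead of raising an error, so the model treats it as "none of the known categories".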

In [284]:
# Combine the two matrices into one
X_train = np.column_stack([X_train_num, X_train_cat])

In [285]:
# Train the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)


LogisticRegression parameters:
penalty            'l2'
dual               False
tol                0.0001
C                  1.0
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       42
solver             'liblinear'
max_iter           1000


In [287]:
# Now let's check its accuracy
X_val_num = df_val[name_num_col].values
X_val_num = scaler.transform(X_val_num)

X_val_cat = ohe.transform(df_val[name_cat_col].values)

# Combine the two matrices into one
X_val = np.column_stack([X_val_num, X_val_cat])

In [292]:
y_pred = model.predict_proba(X_val)[:, 1]
accuracy_score(y_val, y_pred >= 0.5)

0.8532423208191127

In [298]:
# Same computation without Scikit-Learn's accuracy_score
converted_decision = (y_pred >= 0.5)
val_accuracy = (y_val == converted_decision).mean()
print(f'Validation dataset accuracy for *Converted* decision = {round(val_accuracy,2)}')

Validation dataset accuracy for *Converted* decision = 0.85


## Question 5

* Let's find the least useful feature using the _feature elimination_ technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

* `'industry'`
* `'employment_status'`
* `'lead_score'`
  
*Note:* The difference doesn't have to be positive.

In [316]:
name_cols=np.concatenate([name_num_col, name_cat_col])

In [342]:
diff = {}
for f in name_cols:
    subset = [col for col in name_cols if col != f]
    
    #Identifying numerical and categorical columns
    df_train_ss = df_train[subset]
    name_num_col_ss = (df_train_ss.select_dtypes(include='number')).columns.values
    name_cat_col_ss = (df_train_ss.select_dtypes(include='object')).columns.values

    # Feature Scaling - Preparing the numerical features
    X_train_num_ss = df_train_ss[name_num_col_ss].values
    X_train_num_ss = scaler.fit_transform(X_train_num_ss)

    # One Hot Encoding - Preparing the categorical features
    X_train_cat_ss = ohe.fit_transform(df_train_ss[name_cat_col_ss].values)

    # Combine the two matrices into one
    X_train_ss = np.column_stack([X_train_num_ss, X_train_cat_ss])

    # Train the model
    model.fit(X_train_ss, y_train)

    # Now let's check its accuracy
    df_val_ss = df_val[subset]
    X_val_num_ss = df_val_ss[name_num_col_ss].values
    X_val_num_ss = scaler.transform(X_val_num_ss)
    
    X_val_cat_ss = ohe.transform(df_val_ss[name_cat_col_ss].values)
    
    # Combine the two matrices into one
    X_val_ss = np.column_stack([X_val_num_ss, X_val_cat_ss])

    # Accuracy score
    y_pred_ss = model.predict_proba(X_val_ss)[:, 1]
    val_acc_ss = accuracy_score(y_val, y_pred_ss >= 0.5)

    diff[f] = val_accuracy - val_acc_ss
    print(f'\nAccuracy for **{f}** = {round(val_acc_ss,4)} \n--> Diff = {round(diff[f],4)}')
    


Accuracy for **number_of_courses_viewed** = 0.7372 
--> Diff = 0.116

Accuracy for **annual_income** = 0.8532 
--> Diff = 0.0

Accuracy for **interaction_count** = 0.7713 
--> Diff = 0.0819

Accuracy for **lead_score** = 0.8191 
--> Diff = 0.0341

Accuracy for **lead_source** = 0.8498 
--> Diff = 0.0034

Accuracy for **industry** = 0.8464 
--> Diff = 0.0068

Accuracy for **employment_status** = 0.8362 
--> Diff = 0.0171

Accuracy for **location** = 0.8498 
--> Diff = 0.0034


In [362]:
key_subset = ['industry','employment_status','lead_score']
subdir = {
    k:v
    for k, v in diff.items()
    if k in key_subset
}
subdir

{'lead_score': np.float64(0.03412969283276457),
 'industry': np.float64(0.0068259385665528916),
 'employment_status': np.float64(0.017064846416382284)}

In [382]:
min_k = min(subdir, key=subdir.get)
print(f'The feature with the smallest difference is\n **{min_k} = {subdir[min_k]}**')

The feature with the smallest difference is
 **industry = 0.0068259385665528916**


## Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

* 0.01
* 0.1
* 1
* 10
* 100
  
*Note:* If there are multiple options, select the smallest `C`.

In [389]:
C_values = [0.01, 0.1, 1, 10, 100]
V_acc = []
for c in C_values:
    # Train the model
    model6 = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model6.fit(X_train, y_train)
    y_pred = model6.predict_proba(X_val)[:, 1]
    V_acc.append(accuracy_score(y_val, y_pred >= 0.5))

In [392]:
for c, a in zip(C_values, V_acc):
    print(f'For C = {c}, Accuracy = {round(a, 3)}')

For C = 0.01, Accuracy = 0.84
For C = 0.1, Accuracy = 0.857
For C = 1, Accuracy = 0.853
For C = 10, Accuracy = 0.853
For C = 100, Accuracy = 0.853


In [404]:
max_acc = max(V_acc)
ind_max = V_acc.index(max_acc)
print(f'The best accuracy = {round(max_acc, 3)} was obtained\n with **C = {C_values[ind_max]}**')

The best accuracy = 0.857 was obtained
 with **C = 0.1**
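
The question asks for accuracies rounded to 3 decimals and, on a tie, the smallest `C`. A small sketch of that tie-breaking rule follows, using the rounded accuracies printed above. `rounded_acc`, `best_c` and `tie_c` are names introduced here for illustration.

```python
C_values = [0.01, 0.1, 1, 10, 100]
# Rounded accuracies as printed above
rounded_acc = [0.84, 0.857, 0.853, 0.853, 0.853]

# max() returns the first maximal element, so with C_values in ascending order
# a tie is resolved in favour of the smallest C
best_acc, best_c = max(zip(rounded_acc, C_values), key=lambda pair: pair[0])
print(best_c, best_acc)

# Illustration of a tie: among C = 1, 10, 100 (all 0.853) the smallest C is kept
tie_acc, tie_c = max(zip(rounded_acc[2:], C_values[2:]), key=lambda pair: pair[0])
print(tie_c, tie_acc)
```

Comparing the rounded values matters only when two unrounded accuracies differ by less than the rounding step; here `C = 0.1` is the single best value either way.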
