# HOMEWORK 3 - ML FOR CLASSIFICATION
## Using the course lead scoring dataset

## Set up environment

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

## Dataset

In this homework, we will use the course lead scoring dataset.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv')

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

## Data preparation

* Check whether any features contain missing values.
* If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0

## Question 1

What is the most frequent observation (mode) for the column `industry`?

* NA
* technology
* healthcare
* retail

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1462 entries, 0 to 1461
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   lead_source               1334 non-null   object 
 1   industry                  1328 non-null   object 
 2   number_of_courses_viewed  1462 non-null   int64  
 3   annual_income             1281 non-null   float64
 4   employment_status         1362 non-null   object 
 5   location                  1399 non-null   object 
 6   interaction_count         1462 non-null   int64  
 7   lead_score                1462 non-null   float64
 8   converted                 1462 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 102.9+ KB


In [53]:
# check if there are missing values in the df
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [55]:
# Replacing missing values

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna('NA')
    elif np.issubdtype(df[col].dtype, np.number):
        df[col] = df[col].fillna(0)
    else:
        print('Type unexpected')

df.isnull().sum()


lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

In [81]:
df.industry.value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

In [82]:
mode_industry = df.industry.mode()
print(f'The most frequent observation for "industry" column\n is: {mode_industry[0]}')

The most frequent observation for "industry" column
 is: retail


## Question 2

Create the _correlation matrix_ for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

* `interaction_count` and `lead_score`
* `number_of_courses_viewed` and `lead_score`
* `number_of_courses_viewed` and `interaction_count`
* `annual_income` and `interaction_count`
  
Only consider the pairs above when answering this question.
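One possible way to compare the four candidate pairs is sketched below. The helper name `pair_correlations` and the constant `CANDIDATE_PAIRS` are made up for this illustration, and ranking by absolute value is an assumption about what "biggest" means here (the signed values are kept in the result so either reading can be checked).

```python
import pandas as pd

# The four pairs listed in the question.
CANDIDATE_PAIRS = [
    ("interaction_count", "lead_score"),
    ("number_of_courses_viewed", "lead_score"),
    ("number_of_courses_viewed", "interaction_count"),
    ("annual_income", "interaction_count"),
]


def pair_correlations(frame: pd.DataFrame, pairs=CANDIDATE_PAIRS) -> pd.Series:
    """Return the Pearson correlation of each pair, strongest (by |r|) first."""
    # Correlation matrix over the numerical columns only.
    corr = frame.select_dtypes(include="number").corr()
    values = {f"{a} ~ {b}": corr.loc[a, b] for a, b in pairs}
    # Sort by absolute value but keep the signed coefficients.
    return pd.Series(values).sort_values(key=lambda s: s.abs(), ascending=False)
```

In the notebook this would be called as `pair_correlations(df)`; the first entry of the returned Series is the strongest pair under the absolute-value assumption.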

## Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
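The split described above could be done in two steps, roughly as follows. This is a sketch rather than the reference solution: the function name `split_60_20_20` is hypothetical, and taking 25% of the remaining 80% to obtain the 20% validation share is one common way to reach the 60/20/20 proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(frame: pd.DataFrame, target: str = "converted", seed: int = 42):
    """Return [(X_train, y_train), (X_val, y_val), (X_test, y_test)]."""
    # First hold out 20% of all rows as the test set.
    full_train, test = train_test_split(frame, test_size=0.2, random_state=seed)
    # 25% of the remaining 80% equals 20% of the original data.
    train, val = train_test_split(full_train, test_size=0.25, random_state=seed)

    parts = []
    for part in (train, val, test):
        part = part.reset_index(drop=True)
        y = part[target].to_numpy()
        # Drop the target so it cannot leak into the features.
        X = part.drop(columns=[target])
        parts.append((X, y))
    return parts
```

In the notebook, `(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_60_20_20(df)` would unpack the three sets.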

## Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.
  
Which of these variables has the biggest mutual information score?

* `industry`
* `location`
* `lead_source`
* `employment_status`
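A minimal sketch of the mutual information step, assuming `mutual_info_score` from `sklearn.metrics` is the intended scorer. The helper name `mutual_info_by_column` is invented for this example.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score


def mutual_info_by_column(X_train: pd.DataFrame, y_train, columns) -> pd.Series:
    """Mutual information between y and each categorical column, rounded to 2 decimals."""
    scores = {
        col: round(mutual_info_score(X_train[col], y_train), 2)
        for col in columns
    }
    # Highest score first.
    return pd.Series(scores).sort_values(ascending=False)
```

With the training set from the split, the call would look like `mutual_info_by_column(X_train, y_train, ['industry', 'location', 'lead_source', 'employment_status'])`.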

## Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
  
What accuracy did you get?

* 0.64
* 0.74
* 0.84
* 0.94
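The training step might look like the sketch below. Using `DictVectorizer` for the one-hot encoding is an assumption (the question only asks for one-hot encoding, and `OneHotEncoder` would also work), and the function name `train_and_score` is hypothetical. The model parameters are the ones given in the question.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def train_and_score(X_train: pd.DataFrame, y_train,
                    X_val: pd.DataFrame, y_val, C: float = 1.0) -> float:
    """Fit on the training set and return the unrounded validation accuracy."""
    # DictVectorizer one-hot encodes string values and passes numbers through.
    dv = DictVectorizer(sparse=False)
    train_matrix = dv.fit_transform(X_train.to_dict(orient="records"))
    # Only transform the validation set so it reuses the training encoding.
    val_matrix = dv.transform(X_val.to_dict(orient="records"))

    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(train_matrix, y_train)
    return accuracy_score(y_val, model.predict(val_matrix))
```

The answer to the question would then be `round(train_and_score(X_train, y_train, X_val, y_val), 2)`.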

## Question 5

* Let's find the least useful feature using the _feature elimination_ technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

* `'industry'`
* `'employment_status'`
* `'lead_score'`
  
*Note:* The difference doesn't have to be positive.
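The elimination loop can be sketched as follows, again assuming `DictVectorizer` for the encoding. The names `_accuracy` and `elimination_differences` are made up for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def _accuracy(X_train, y_train, X_val, y_val, features) -> float:
    """Validation accuracy of the Q4 model restricted to `features`."""
    dv = DictVectorizer(sparse=False)
    train_matrix = dv.fit_transform(X_train[features].to_dict(orient="records"))
    val_matrix = dv.transform(X_val[features].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(train_matrix, y_train)
    return accuracy_score(y_val, model.predict(val_matrix))


def elimination_differences(X_train: pd.DataFrame, y_train,
                            X_val: pd.DataFrame, y_val, features):
    """Return (baseline accuracy, {feature: baseline - accuracy without it})."""
    features = list(features)
    baseline = _accuracy(X_train, y_train, X_val, y_val, features)
    diffs = {}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        # Retrain from scratch without the dropped feature.
        diffs[dropped] = baseline - _accuracy(X_train, y_train, X_val, y_val, kept)
    return baseline, diffs
```

Passing `list(X_train.columns)` as `features` covers every feature. The differences are signed, so whether "smallest" is read as the lowest value or the one closest to zero is left to the reader.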

## Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these values of `C` leads to the best accuracy on the validation set?

* 0.01
* 0.1
* 1
* 10
* 100
  
*Note:* If there are multiple options, select the smallest `C`.
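A sketch of the search over `C`, including the tie-break from the note above. The function name `best_c` is hypothetical, and `DictVectorizer` is again an assumed choice of encoder.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def best_c(X_train: pd.DataFrame, y_train, X_val: pd.DataFrame, y_val,
           c_values=(0.01, 0.1, 1, 10, 100)):
    """Return (best C, {C: validation accuracy rounded to 3 decimals})."""
    # The encoding does not depend on C, so it is computed once.
    dv = DictVectorizer(sparse=False)
    train_matrix = dv.fit_transform(X_train.to_dict(orient="records"))
    val_matrix = dv.transform(X_val.to_dict(orient="records"))

    scores = {}
    for c in c_values:
        model = LogisticRegression(solver="liblinear", C=c, max_iter=1000, random_state=42)
        model.fit(train_matrix, y_train)
        scores[c] = round(accuracy_score(y_val, model.predict(val_matrix)), 3)

    top = max(scores.values())
    # On ties, pick the smallest C among the best-scoring values.
    best = min(c for c, score in scores.items() if score == top)
    return best, scores
```

Comparing the rounded scores matters here: values of `C` that differ only beyond the third decimal count as tied, and the smallest one is selected.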