### Data preparation

* Check if the missing values are presented in the features.
* If there are missing values:
    * For caterogiral features, replace them with 'NA'
    * For numerical features, replace with with 0.0 

In [131]:
import pandas as pd
import numpy as np
df = pd.read_csv("course_lead_scoring.csv")

In [132]:
df

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.80,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1
...,...,...,...,...,...,...,...,...,...
1457,referral,manufacturing,1,,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,,5,71016.0,self_employed,north_america,0,0.25,1


In [133]:
df.isnull().sum()


lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [134]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [135]:
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
numerical_columns=["number_of_courses_viewed","annual_income","interaction_count","lead_score","converted" ]

for c in categorical_columns:
    
    df[c] = df[c].fillna('NA')

len(numerical_columns)+len(categorical_columns)

9

In [136]:
df.isnull().sum()


lead_source                   0
industry                      0
number_of_courses_viewed      0
annual_income               181
employment_status             0
location                      0
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [137]:
df['annual_income'] = df['annual_income'].fillna(0.0)
df.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

In [138]:
dfnew=df.copy()

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`


In [139]:
df.industry.mode()

0    retail
Name: industry, dtype: object

### Answer Question 1: retail

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.

In [140]:
numerical_columns=["number_of_courses_viewed","annual_income","interaction_count","lead_score","converted" ]

# Select numerical features from the dataset
numerical_features = df[numerical_columns]

# Compute the correlation matrix
correlation_matrix = numerical_features.corr()

# Display the correlation matrix
print("Correlation matrix:")
correlation_matrix

Correlation matrix:


Unnamed: 0,number_of_courses_viewed,annual_income,interaction_count,lead_score,converted
number_of_courses_viewed,1.0,0.00977,-0.023565,-0.004879,0.435914
annual_income,0.00977,1.0,0.027036,0.01561,0.053131
interaction_count,-0.023565,0.027036,1.0,0.009888,0.374573
lead_score,-0.004879,0.01561,0.009888,1.0,0.193673
converted,0.435914,0.053131,0.374573,0.193673,1.0


In [141]:
print(correlation_matrix.loc["interaction_count","lead_score"])
print(correlation_matrix.loc ["number_of_courses_viewed","lead_score"])
print(correlation_matrix.loc["number_of_courses_viewed","interaction_count"])
print(correlation_matrix.loc["annual_income","interaction_count"])

   
    


0.009888182496913131
-0.004878998354681276
-0.023565222882888037
0.02703647240481443


### Answer Question 2: annual_income and interaction_count

In [142]:
from sklearn.model_selection import train_test_split
df = df.rename(columns={"converted": "y"})

In [143]:
# Separate the features (X) from the target (y)
y = df['y']

X = df.drop(columns=['y'])

# First, split into 60% training and 40% remaining data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, test_size=0.4, random_state=42)

# Then, split the remaining 40% into 20% validation and 20% test data
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=42)

# Display the shape of each split to confirm
print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
print("Test set shape:", X_test.shape, y_test.shape)


Training set shape: (877, 8) (877,)
Validation set shape: (292, 8) (292,)
Test set shape: (293, 8) (293,)


### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- `lead_source`
- `employment_status`

In [144]:
from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Select categorical features from the training set
categorical_features = ['industry', 'location', 'lead_score', 'employment_status']
X_train_categorical = X_train[categorical_features]

# One-hot encode categorical variables
X_train_encoded = pd.get_dummies(X_train_categorical, drop_first=True)
X_train_encoded
# Calculate the mutual information scores

Unnamed: 0,lead_score,industry_education,industry_finance,industry_healthcare,industry_manufacturing,industry_other,industry_retail,industry_technology,location_africa,location_asia,location_australia,location_europe,location_middle_east,location_north_america,location_south_america,employment_status_employed,employment_status_self_employed,employment_status_student,employment_status_unemployed
442,0.65,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True
319,0.09,True,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False
767,0.61,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,True,False,False
756,0.84,False,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False,False,False
424,0.70,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1130,0.30,False,False,False,True,False,False,False,False,False,False,False,False,False,True,True,False,False,False
1294,0.44,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False
860,0.02,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
1459,0.02,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,False


In [145]:

mi_scores = mutual_info_classif(X_train_encoded, y_train, random_state=42)


In [146]:
# Create a DataFrame to display the scores alongside the features
mi_scores_df = pd.DataFrame({
    'Feature': X_train_encoded.columns,
    'Mutual Information Score': mi_scores
})

# Extract the original categorical variable names by splitting on the first underscore
mi_scores_df['Variable'] = mi_scores_df['Feature'].str.split('_').str[0]

# Group by the original categorical variables and sum their MI scores
mi_scores_summed = mi_scores_df.groupby('Variable')['Mutual Information Score'].sum()

# Round the scores to 2 decimals
mi_scores_summed = mi_scores_summed.round(2)

# Sort the scores in descending order to find the variable with the highest MI score
mi_scores_summed_sorted = mi_scores_summed.sort_values(ascending=False)

# Display the mutual information scores
print("Mutual Information Scores:")
print(mi_scores_summed_sorted)

# Identify the variable with the highest mutual information score
highest_mi_variable = mi_scores_summed_sorted.idxmax()
highest_mi_score = mi_scores_summed_sorted.max()

print(f"\nThe variable with the highest mutual information score is '{highest_mi_variable}' with a score of {highest_mi_score}.")

Mutual Information Scores:
Variable
industry      0.07
lead          0.03
location      0.03
employment    0.02
Name: Mutual Information Score, dtype: float64

The variable with the highest mutual information score is 'industry' with a score of 0.07.


### Answer Question 3: Industry

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [147]:
dfnew

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.80,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1
...,...,...,...,...,...,...,...,...,...
1457,referral,manufacturing,1,0.0,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,,5,71016.0,self_employed,north_america,0,0.25,1


In [148]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(dfnew, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val), len(df_test)

(876, 293, 293)

In [149]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
y_train = df_train.converted.values
y_test = df_test.converted.values
y_val = df_test.converted.values

del df_train['converted']
del df_test['converted']
del df_val['converted']
df_full_train.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

In [150]:
df_full_train.converted.value_counts()

converted
1    715
0    454
Name: count, dtype: int64

In [151]:
df_full_train.converted.value_counts(normalize=True)

converted
1    0.611634
0    0.388366
Name: proportion, dtype: float64

In [152]:
df_full_train.converted.mean()

np.float64(0.611633875106929)

In [153]:
df_full_train.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [154]:
df_full_train

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
98,referral,,1,56659.0,employed,asia,4,0.75,1
1188,social_media,education,2,66171.0,unemployed,north_america,2,0.66,0
1407,events,finance,1,66523.0,self_employed,europe,3,0.64,1
1083,social_media,finance,1,56746.0,student,north_america,3,0.98,0
404,referral,,0,55449.0,student,australia,4,0.47,0
...,...,...,...,...,...,...,...,...,...
715,referral,,2,35103.0,unemployed,africa,0,0.88,0
905,social_media,other,1,66006.0,employed,south_america,5,0.64,1
1096,events,finance,2,73688.0,self_employed,asia,2,0.07,0
235,referral,technology,2,76723.0,employed,north_america,3,0.49,1


In [155]:
categorical = ['lead_source', 'industry', 'employment_status', 'location']
df_full_train[categorical].nunique()

lead_source          6
industry             8
employment_status    5
location             8
dtype: int64

In [156]:
numerical=["number_of_courses_viewed","annual_income","interaction_count","lead_score" ]


In [157]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
categorical = ['lead_source', 'industry', 'employment_status', 'location']
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [158]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
# solver='lbfgs' is the default solver in newer version of sklearn
# for older versions, you need to specify it explicitly
model.fit(X_train, y_train)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,42
,solver,'liblinear'
,max_iter,1000


In [159]:
model.intercept_[0]

np.float64(-0.10464329280767921)

In [160]:
model.coef_[0].round(3)

array([-0.   , -0.027,  0.035, -0.01 ,  0.014, -0.117, -0.022,  0.033,
       -0.008, -0.023, -0.007, -0.033, -0.027, -0.018,  0.326,  0.032,
        0.   , -0.005, -0.024, -0.112,  0.07 , -0.034,  0.005, -0.012,
       -0.01 , -0.028, -0.013, -0.019, -0.023, -0.005,  0.453])

In [161]:
y_pred = model.predict_proba(X_val)[:, 1]
output_model=(y_pred >= 0.5)
(y_val == output_model).mean()

np.float64(0.621160409556314)

### Answer Question 4: 0.64

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of following feature has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.



In [162]:
original_accuracy=0.621160409556314
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)

for feature  in ['industry', 'employment_status', 'lead_source']:
    categorical = ['lead_source', 'industry', 'employment_status', 'location']
    categorical.remove(feature)
    train_dict = df_train[categorical + numerical].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    val_dict = df_val[categorical + numerical].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    output_model=(y_pred >= 0.5)
    print(feature, f"--->Difference={original_accuracy-(y_val == output_model).mean()}")
   

industry --->Difference=0.0
employment_status --->Difference=0.0034129692832765013
lead_source --->Difference=0.010238907849829393


### Answer Question 5: Industry

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.


In [163]:
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    output_model=(y_pred >= 0.5)
    print(f"C={C}--->Accuracy={(y_val == output_model).mean().round(3)}")
    
    

C=0.01--->Accuracy=0.611
C=0.1--->Accuracy=0.611
C=1--->Accuracy=0.611
C=10--->Accuracy=0.608
C=100--->Accuracy=0.608


### Answer Question 6: C=0.1