<a href="https://colab.research.google.com/github/milan-banura/LeadScoring_CaseStudy_ML1/blob/main/Leads_Scoring_CaseStudy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lead Score - Case Study

<b><font color = maroon>Problem Statement</font></b><br>

<p align="justify">X Education is an education company that offers online courses for industry professionals. The company attracts many visitors to its website through various marketing channels. The company faces a problem: its lead conversion rate is very low. Out of 100 leads, only 30 become customers on average.</p>

<p align="justify">To solve this problem, X Education wants to identify the most potential leads, also known as ‘Hot Leads’. The company has hired you to help them with this task. Your job is to build a model that can assign a lead score to each lead based on various factors, such as their demographics, behavior, preferences, etc. The higher the lead score, the more likely the lead is to convert. The lower the lead score, the less likely the lead is to convert. The company’s CEO has set a target of achieving an 80% lead conversion rate with this model.</p>

<b><font color = maroon>Goals and Objective</font></b><br>
- <p align="justify">Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.</p>
- <p align="justify">There are some more problems presented by the company which your model should be able to adjust to if the company's requirement changes in the future so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it based on the logistic regression model you got in the first step. Also, make sure you include this in your final PPT where you'll make recommendations.</p>

#### Steps Followed  
- Reading Data
- Cleaning Data
- Data Visualization
- Data Preparation
- Model Building
- ROC Curve
- Model Evaluations
- Prediction on test set
- Conclusion

## Step 1: Reading and Understanding the Data

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

#Importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("ggplot")

In [None]:
url = "https://raw.githubusercontent.com/milan-banura/LeadScoring_CaseStudy_ML1/main/Leads.csv"
lead_df = pd.read_csv(url)
# Reading leads dataframe as lead_df

lead_df_original = lead_df.copy()
lead_df.head()

In [None]:
lead_df.describe()

In [None]:
# Inspect the various aspects of the dataframe
lead_df.info()

## Step 2: Data Cleaning

In [None]:
# To check for duplicates
lead_df.loc[lead_df.duplicated()]

#### No duplicates in the data!

In [None]:
# To check for duplicates in columns
print(sum(lead_df.duplicated(subset = 'Lead Number')))
print(sum(lead_df.duplicated(subset = 'Prospect ID')))

#### As the values in these columns are different for each entry/row, they are just identifiers and are not important from an analysis point of view. Hence, they can be dropped.

In [None]:
lead_df = lead_df.drop(['Lead Number','Prospect ID'],axis = 1)

#### 'Select' seems to be the default value stored in the backend for optional columns where the prospective lead has not selected any of the options available in the dropdown menu.

In [None]:
# To convert 'Select' values to NaN
lead_df = lead_df.replace('Select', np.nan)

In [None]:
# To get percentage of null values in each column
round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)

#### We'll drop columns with more than 50% of missing values as it does not make sense to impute these many values. But the variable 'Lead Quality', which has 51.6% missing values seems promising. So we'll keep it for now.
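The rule described above (drop columns above a missing-value threshold, but exempt a column that looks promising) can be sketched on a toy frame. This is only an illustration: the helper name `drop_sparse_columns`, its parameters, and the toy data are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=50.0, keep=()):
    """Drop columns whose share of missing values exceeds threshold_pct,
    except the columns explicitly listed in `keep` (hypothetical helper)."""
    missing_pct = df.isnull().mean() * 100
    to_drop = [c for c in df.columns
               if missing_pct[c] > threshold_pct and c not in keep]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                     # 0% missing
    "b": [np.nan, np.nan, np.nan, 1.0],    # 75% missing, dropped
    "c": [np.nan, np.nan, np.nan, "x"],    # 75% missing, but exempted
})
cleaned = drop_sparse_columns(toy, threshold_pct=50.0, keep=("c",))
print(list(cleaned.columns))
```

Naming the exempted column explicitly keeps the intent visible, compared with nudging the numeric cutoff upward.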

In [None]:
# To drop columns with more than 50% missing values; the cutoff is set to 52
# so that 'Lead Quality' (51.6% missing) is retained
missing_pct = round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)
lead_df = lead_df.drop(missing_pct[missing_pct > 52].index, axis = 1)

In [None]:
round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)


#### For the other columns, we have to work on a column-by-column basis.  
- For categorical variables, we'll analyse the count/percentage plots.
- For numerical variables, we'll describe the variable and analyse the box plots.

In [None]:
# Function for percentage plots
def percent_plot(var):
    values = (lead_df[var].value_counts(normalize=True)*100)
    plt_p = values.plot.bar(color=sns.color_palette('deep'))
    plt_p.set(xlabel = var, ylabel = '% in dataset')

In [None]:
# For Lead Quality
percent_plot('Lead Quality')

### Null values in the 'Lead Quality' column can be imputed with the value 'Not Sure' as we can assume that not filling in a column means the employee does not know or is not sure about the option.

In [None]:
lead_df['Lead Quality'] = lead_df['Lead Quality'].replace(np.nan, 'Not Sure')

In [None]:
# For 'Asymmetrique Activity Index', 'Asymmetrique Profile Index', 'Asymmetrique Activity Score', 'Asymmetrique Profile Score'
asym_list = ['Asymmetrique Activity Index', 'Asymmetrique Profile Index', 'Asymmetrique Activity Score', 'Asymmetrique Profile Score']
plt.figure(figsize=(10, 7))
for var in asym_list:
    plt.subplot(2,2,asym_list.index(var)+1)
    if 'Index' in var:
        sns.countplot(data=lead_df, x=var)
    else:
        sns.boxplot(data=lead_df, x=var)
plt.show()

In [None]:
# To describe numerical variables
lead_df[asym_list].describe()

#### These four variables have more than 45% missing values and it can be seen from the plots that there is a lot of variation in them. So, it's not a good idea to impute 45% of the data. Even if we impute with mean/median for numerical variables, these values will not have any significant importance in the model. We'll have to drop these variables.

In [None]:
lead_df = lead_df.drop(asym_list,axis = 1)

In [None]:
# To see percentage of null values in each column
round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)

In [None]:
# For 'City'
percent_plot('City')

#### Around 60% of the City values are Mumbai. We can impute 'Mumbai' in the missing values.

In [None]:
lead_df['City'] = lead_df['City'].replace(np.nan, 'Mumbai')

In [None]:
# For 'Specialization'
percent_plot('Specialization')

#### There are a lot of different specializations, and it's not accurate to directly impute with the most frequent value. It is possible that the person does not have a specialization or his/her specialization is not in the options. We can create a new category, 'Others', for these cases.

In [None]:
lead_df['Specialization'] = lead_df['Specialization'].replace(np.nan, 'Others')

In [None]:
# For 'Tags', 'What matters most to you in choosing a course', 'What is your current occupation' and 'Country'
var_list = ['Tags', 'What matters most to you in choosing a course', 'What is your current occupation', 'Country']
plt.figure(figsize=(15, 7))
for var in var_list:
    plt.subplot(2,2,var_list.index(var)+1)
    percent_plot(var)

#### In all these categorical variables, one value is clearly more frequent than all others. So it makes sense to impute with the most frequent values.
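As a minimal sketch of most-frequent-value imputation, the toy column below is filled with its mode. The column name and values are made up for illustration; `Series.mode()` ignores missing values by default.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"occupation": ["Unemployed", "Unemployed", np.nan, "Student", np.nan]}
)

# mode() returns the most frequent non-null value(s); take the first one
most_frequent = toy["occupation"].mode()[0]
toy["occupation"] = toy["occupation"].fillna(most_frequent)
print(toy["occupation"].value_counts())
```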

In [None]:
# To impute with the most frequent value
for var in var_list:
    top_frequent = lead_df[var].describe()['top']
    lead_df[var] = lead_df[var].replace(np.nan, top_frequent)

In [None]:
# Let's see percentage of null values in each column
round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)

In [None]:
# For 'TotalVisits' and 'Page Views Per Visit'
visit_list = ['TotalVisits', 'Page Views Per Visit']
plt.figure(figsize=(15, 5))
for var in visit_list:
    plt.subplot(1,2,visit_list.index(var)+1)
    sns.boxplot(data=lead_df, x=var)
plt.show()

lead_df[visit_list].describe()

#### From the above analysis, it can be seen that there is a lot of variation in both of the variables. As the percentage of missing values for each of them is less than 2%, it is better to drop the rows containing missing values.

In [None]:
# For 'Lead Source' and 'Last Activity'
var_list = ['Lead Source', 'Last Activity']

for var in var_list:
    percent_plot(var)
    plt.show()

#### In these categorical variables, imputing with the most frequent value is not accurate as the next most frequent value has a similar frequency. Also, as these variables have very few missing values, it is better to drop the rows containing them. Hence, we'll drop the rows containing any missing values for the above four variables.

In [None]:
# To drop the rows containing missing values
lead_df.dropna(inplace = True)

In [None]:
# Let's see percentage of null values in each column
round(100*(lead_df.isnull().sum()/len(lead_df.index)), 2)

#### Great! No more missing values.

## Step 3: Data Visualization

In [None]:
# For the target variable 'Converted'
percent_plot('Converted')

In [None]:
(sum(lead_df['Converted'])/len(lead_df['Converted'].index))*100

#### 37.8% of the 'Converted' values are 1, i.e. 37.8% of the leads are converted. This means we have enough data on converted leads for modelling.

### Visualising Numerical Variables and Outlier Treatment

In [None]:
# Boxplots
num_var = ['TotalVisits','Total Time Spent on Website','Page Views Per Visit']
plt.figure(figsize=(15, 10))
for var in num_var:
    plt.subplot(3,1,num_var.index(var)+1)
    sns.boxplot(data=lead_df, x=var)
plt.show()

In [None]:
lead_df[num_var].describe([0.05,.25, .5, .75, .90, .95])

#### From the boxplots, we can see that there are outliers present in the variables.
- For 'TotalVisits', the 95% quantile is 10 whereas the maximum value is 251. Hence, we should cap these outliers at 95% value.
- There are no significant outliers in 'Total Time Spent on Website'
- For 'Page Views Per Visit', similar to 'TotalVisits', we should cap outliers at 95% value.
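Capping at the 95th percentile can be sketched with `Series.clip`. The values below are invented for illustration and only mimic a column with one extreme value; they are not taken from the dataset.

```python
import pandas as pd

# Toy data: nineteen ordinary values and one extreme value
visits = pd.Series(list(range(1, 20)) + [251], name="visits")

cap = visits.quantile(0.95)        # 95th percentile of the toy column
capped = visits.clip(upper=cap)    # values above the cap are set to the cap

print("cap:", cap)
print("max before:", visits.max(), "max after:", capped.max())
```

Values at or below the cap are left unchanged, so only the tail of the distribution is affected.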

In [None]:
# Outlier treatment: cap values at the 95th percentile
# (clip avoids chained assignment, which pandas may not apply to the original frame)
for var in ['TotalVisits', 'Page Views Per Visit']:
    cap = lead_df[var].quantile(0.95)
    lead_df[var] = lead_df[var].clip(upper=cap)

In [None]:
# Plot Boxplots to verify
plt.figure(figsize=(15, 10))
for var in num_var:
    plt.subplot(3,1,num_var.index(var)+1)
    sns.boxplot(data=lead_df, x=var)
plt.show()

In [None]:
# To plot numerical variables against target variable to analyse relations
plt.figure(figsize=(15, 5))
for var in num_var:
    plt.subplot(1,3,num_var.index(var)+1)
    sns.boxplot(y = var , x = 'Converted', data = lead_df)
plt.show()


#### **Observations:**  
- 'TotalVisits' has the same median for converted and non-converted leads. No conclusion can be drawn from this.
- People spending more time on the website are more likely to be converted. This is also aligned with our general knowledge.
- 'Page Views Per Visit' also has the same median for converted and non-converted leads. Hence, inconclusive.

### Visualising Categorical Variables

In [None]:
# Categorical variables
cat_var = list(lead_df.columns[lead_df.dtypes == 'object'])
cat_var

#### We saw percentage plots for categorical variables while cleaning the data. Here, we'll see these plots with respect to target variable 'Converted'

In [None]:
# Functions to plot countplots for categorical variables with target variable

# For single plot
def plot_cat_var(var):
    plt.figure(figsize=(20, 7))
    sns.countplot(x = var, hue = "Converted", data = lead_df)
    plt.xticks(rotation = 90)
    plt.show()

# For multiple plots
def plot_cat_vars(lst):
    l = (len(lst) + 1) // 2    # number of subplot rows, rounded up for odd-length lists
    plt.figure(figsize=(20, l*7))
    for var in lst:
        plt.subplot(l,2,lst.index(var)+1)
        sns.countplot(x = var, hue = "Converted", data = lead_df)
        plt.xticks(rotation = 90)
    plt.show()

In [None]:
plot_cat_var(cat_var[0])

### **Observations for Lead Origin:**  
'API' and 'Landing Page Submission' generate the most leads but have lower conversion rates of around 30%, whereas 'Lead Add Form' generates fewer leads but has a very high conversion rate. **We should try to increase the conversion rate for 'API' and 'Landing Page Submission', and increase lead generation using 'Lead Add Form'**. 'Lead Import' does not seem very significant.

In [None]:
plot_cat_var(cat_var[1])

### **Observations for `Lead Source`:**
- Spelling error: 'google' has to be changed to 'Google'.
- As can be seen from the graph, the number of leads generated by many of the sources is negligible. There are sufficient numbers up to 'Facebook'. We can combine all the others into a single category, 'Others'.
- 'Direct Traffic' and 'Google' generate the maximum number of leads, while the maximum conversion rate is achieved through 'Reference' and 'Welingak Website'.
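Combining rare categories can also be done by frequency rather than by position in `unique()`, whose order depends on first appearance in the data. The sketch below is an alternative illustration, not the approach used in this notebook; the helper `lump_rare`, the `min_count` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def lump_rare(series, min_count):
    """Replace categories that occur fewer than min_count times with 'Others'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values for which the condition is True, else uses 'Others'
    return series.where(~series.isin(rare), "Others")

source = pd.Series(
    ["Google"] * 5 + ["Direct Traffic"] * 4 + ["bing", "blog", "Press_Release"]
)
lumped = lump_rare(source, min_count=3)
print(lumped.value_counts())
```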

In [None]:
# To correct spelling error
lead_df['Lead Source'] = lead_df['Lead Source'].replace(['google'], 'Google')

In [None]:
categories = lead_df['Lead Source'].unique()
categories

In [None]:
# To reduce categories
lead_df['Lead Source'] = lead_df['Lead Source'].replace(categories[8:], 'Others')

In [None]:
# To plot new categories
plot_cat_var(cat_var[1])

In [None]:
plot_cat_vars([cat_var[2],cat_var[3]])

### **Observations for `Do Not Email` and `Do Not Call`:**  
As one can expect, most of the responses are 'No' for both the variables which generated most of the leads.

In [None]:
plot_cat_var(cat_var[4])

### **Observations for `Last Activity`:**  
- The highest number of leads is generated where the last activity is 'Email Opened', while the maximum conversion rate is for 'SMS Sent'. Its conversion rate is significantly high.
- Categories after 'SMS Sent' have an almost negligible effect. We can aggregate them all into one single category.

In [None]:
categories = lead_df['Last Activity'].unique()
categories

#### We can see that we do not require the last five categories.

In [None]:
# To reduce categories
lead_df['Last Activity'] = lead_df['Last Activity'].replace(categories[-5:], 'Others')

In [None]:
# To plot new categories
plot_cat_var(cat_var[4])

In [None]:
plot_cat_var(cat_var[5])

### **Observations for `Country`:**  
Most of the responses are for India. Others are not significant.

In [None]:
plot_cat_var(cat_var[6])

### **Observations for `Specialization`:**  
Conversion rates are mostly similar across different specializations.

In [None]:
plot_cat_vars([cat_var[7],cat_var[8]])

### **Observations for `What is your current occupation` and `What matters most to you in choosing a course`:**  
- The highest conversion rate is for 'Working Professional'. High number of leads are generated for 'Unemployed' but conversion rate is low.
- Variable 'What matters most to you in choosing a course' has only one category with significant count.

In [None]:
plot_cat_vars(cat_var[9:17])

### **Observations for `Search`, `Magazine`, `Newspaper Article`, `X Education Forums`, `Newspaper`, `Digital Advertisement`, `Through Recommendations`, and `Receive More Updates About Our Courses`:**  
As all the above variables have most of the values as no, nothing significant can be inferred from these plots.

In [None]:
plot_cat_vars([cat_var[17],cat_var[18]])

### **Observations for `Tags` and `Lead Quality`:**  
- In Tags, categories after 'Interested in full time MBA' have very few leads generated, so we can combine them into one single category.
- Most leads generated and the highest conversion rate are both attributed to the tag 'Will revert after reading the email'.
- In 'Lead Quality', as expected, 'Might be' has the highest conversion rate while 'Worst' has the lowest.

In [None]:
categories = lead_df['Tags'].unique()
categories

#### We can combine the last eight categories.

In [None]:
# To reduce categories
lead_df['Tags'] = lead_df['Tags'].replace(categories[-8:], 'Others')

In [None]:
# To plot new categories
plot_cat_var(cat_var[17])

In [None]:
plot_cat_vars(cat_var[19:25])

### **Observations for `Update me on Supply Chain Content`, `Get updates on DM Content`, `City`, `I agree to pay the amount through cheque`, `A free copy of Mastering The Interview`, and `Last Notable Activity`:**

- Most of these variables are insignificant in analysis as many of them only have one significant category, 'No'.
- In 'City', most of the leads are generated for 'Mumbai'.
- In 'A free copy of Mastering The Interview', both categories have similar conversion rates.
- In 'Last Notable Activity', we can combine categories after 'SMS Sent', similar to the variable 'Last Activity'.
- 'Last Notable Activity' generates the most leads for the category 'Modified', while the highest conversion rate is for the 'SMS Sent' activity.

In [None]:
categories = lead_df['Last Notable Activity'].unique()
categories

#### We can see that we do not require the last six categories.

In [None]:
# To reduce categories
lead_df['Last Notable Activity'] = lead_df['Last Notable Activity'].replace(categories[-6:], 'Others')

In [None]:
# To plot new categories
plot_cat_var(cat_var[24])

#### Based on the data visualization, we can drop the variables which are not significant for analysis and will not add any information to the model.

In [None]:
lead_df = lead_df.drop(['Do Not Call','Country','What matters most to you in choosing a course','Search','Magazine','Newspaper Article',
                          'X Education Forums','Newspaper','Digital Advertisement','Through Recommendations',
                          'Receive More Updates About Our Courses','Update me on Supply Chain Content',
                          'Get updates on DM Content','I agree to pay the amount through cheque',
                          'A free copy of Mastering The Interview'],axis = 1)


In [None]:
# Final DataFrame
lead_df.head()

In [None]:
lead_df.info()

In [None]:
lead_df.describe()

## Step 4: Data Preparation

In [None]:
# To convert binary variable (Yes/No) to 0/1
lead_df['Do Not Email'] = lead_df['Do Not Email'].map({'Yes': 1, 'No': 0})

### Dummy Variable Creation

#### For categorical variables with multiple levels, we create dummy features (one-hot encoded).
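A small sketch of dummy-variable creation with `drop_first=True`, which encodes k levels with k-1 columns and treats the first level as the baseline. The toy column is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Mumbai", "Thane", "Pune", "Mumbai"]})

# Levels are sorted alphabetically; the first one ('Mumbai') becomes the baseline
dummies = pd.get_dummies(toy["city"], prefix="city", drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in every dummy column therefore belongs to the baseline level.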

In [None]:
# Categorical variables
cat_var = list(lead_df.columns[lead_df.dtypes == 'object'])
cat_var

In [None]:
# Step 1: Convert boolean columns to integers (0 & 1)
lead_df = lead_df.astype({col: 'int' for col in lead_df.select_dtypes(include=['bool']).columns})

# Step 2: Identify categorical columns correctly
cat_var = list(lead_df.select_dtypes(include=['object', 'category']).columns)

# Step 3: Apply One-Hot Encoding only if categorical variables exist
if cat_var:
    dummy = pd.get_dummies(lead_df[cat_var], drop_first=True)
    lead_df = pd.concat([lead_df, dummy], axis=1)
    lead_df = lead_df.drop(cat_var, axis=1)  # Drop original categorical columns

In [None]:
lead_df.head()

#### Train-Test Split

In [None]:
# Importing required package
from sklearn.model_selection import train_test_split

In [None]:
# To put feature variable to X
X = lead_df.drop(['Converted'],axis=1)
y = lead_df['Converted']

In [None]:
# To split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

#### Feature Scaling

In [None]:
# Importing required package
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
# Numerical variables
num_var

In [None]:
#Applying scaler to all numerical columns
X_train[num_var] = scaler.fit_transform(X_train[num_var])

X_train.head()

## Step 5: Model Building

#### Feature Selection Using RFE

In [None]:
# To create an instance of Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)             # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Features selected
col = X_train.columns[rfe.support_]
col

In [None]:
# Features eliminated
X_train.columns[~rfe.support_]

#### Assessing the Model with StatsModels

In [None]:
import statsmodels.api as sm

# Function for building the model
def build_model(X,y):
    X_sm = sm.add_constant(X)    # To add a constant
    logm = sm.GLM(y, X_sm, family = sm.families.Binomial()).fit()    # To fit the model
    print(logm.summary())    # Summary of the model
    return X_sm, logm

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Function to calculate Variance Inflation Factor (VIF)
def check_VIF(X_in):
    X = X_in.drop('const', axis=1)  # Drop the constant column before computing VIFs
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

In [None]:
# Function to get predicted values on train set

def get_pred(X,logm):
    y_train_pred = logm.predict(X)
    y_train_pred = y_train_pred.values.reshape(-1)
    # To create a dataframe to store original and predicted values
    y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_prob':y_train_pred})
    y_train_pred_final['Lead ID'] = y_train.index
    # Using default threshold of 0.5 for now
    y_train_pred_final['predicted'] = y_train_pred_final.Converted_prob.map(lambda x: 1 if x > 0.5 else 0)
    return y_train_pred_final

In [None]:
from sklearn import metrics

# Function to get confusion matrix and accuracy
def conf_mat(Converted,predicted):
    confusion = metrics.confusion_matrix(Converted, predicted )
    print("Confusion Matrix:")
    print(confusion)
    print("Training Accuracy: ", metrics.accuracy_score(Converted, predicted))
    return confusion

In [None]:
# Function for calculating metric beyond accuracy
def other_metrics(confusion):
    TP = confusion[1,1]    # True positives
    TN = confusion[0,0]    # True negatives
    FP = confusion[0,1]    # False positives
    FN = confusion[1,0]    # False negatives
    print("Sensitivity: ", TP / float(TP+FN))
    print("Specificity: ", TN / float(TN+FP))
    print("False positive rate: ", FP/ float(TN+FP))
    print("Positive predictive value: ", TP / float(TP+FP))
    print("Negative predictive value: ", TN / float(TN+FN))

**Model 1**  
Running the first model by using the features selected by RFE

`Tags_invalid number` has a very high p-value > 0.05. Hence, it is insignificant and can be dropped.

In [None]:
# Convert all boolean columns to int (0,1)
X_train = X_train.astype(int)

# Ensure all columns are numeric (int or float)
X_train = X_train.apply(pd.to_numeric, errors='coerce')

# Drop any remaining NaN values
X_train = X_train.dropna()

# Now, call the build_model function
X1, logm1 = build_model(X_train[col], y_train)

**Model 2**

In [None]:
# Ensure all data is numeric (convert bool to int and object columns to numeric)
X_train = X_train.astype(int)

# Convert all columns to numeric (handle any remaining object data)
X_train = X_train.apply(pd.to_numeric, errors='coerce')

# Drop any remaining NaN values
X_train = X_train.dropna()

col1 = col.drop(['Tags_invalid number'], errors='ignore')
# Now, rebuild the model
X2, logm2 = build_model(X_train[col1], y_train)

`Tags_number not provided` has a very high p-value > 0.05. Hence, it is insignificant and can be dropped.

**Model 3**

In [None]:
col2 = col1.drop(['Tags_number not provided'], errors='ignore')

# To rebuild the model
X3, logm3 = build_model(X_train[col2],y_train)

`Tags_wrong number given` has a very high p-value > 0.05. Hence, it is insignificant and can be dropped.

**Model 4**

In [None]:
col3 = col2.drop(['Tags_wrong number given'],errors='ignore')

# To rebuild the model
X4, logm4 = build_model(X_train[col3],y_train)

### All of the features have p-value close to zero i.e. they all seem significant.

We also have to check VIFs (Variance Inflation Factors) of features to see if there's any multicollinearity present.
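For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The NumPy sketch below computes this directly on synthetic data; the function name `vif` and the data are assumptions made for illustration, and it is not the statsmodels implementation used in this notebook.

```python
import numpy as np

def vif(X):
    """VIF per column of X, using an auxiliary least-squares regression
    (with intercept) of each column on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()   # R^2 of the auxiliary regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly collinear with a
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this synthetic example the two nearly collinear columns get large VIFs while the independent column stays close to 1.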

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

def check_VIF(X):
    vif_data = pd.DataFrame()
    vif_data["Features"] = X.columns
    vif_data["VIF"] = [round(variance_inflation_factor(X.values, i), 2) for i in range(X.shape[1])]

    # Sort values by VIF in descending order
    vif_data = vif_data.sort_values(by="VIF", ascending=False).reset_index(drop=True)

    # Print the output
    print(vif_data.to_string(index=False))

# Call the function with your dataset X4
check_VIF(X4)

In [None]:
# To plot correlations
plt.figure(figsize = (20,10))
sns.heatmap(X4.corr(),annot = True)

#### From VIF values and heat maps, we can see that there is not much multicollinearity present. All variables have a good value of VIF. These features seem important from the business aspect as well. So we need not drop any more variables and we can proceed with making predictions using this model only.

In [None]:
# Get the features that were used to train logm4
expected_features = logm4.params.index.tolist()

# Reorder X4 columns to match the trained model and add missing columns (if any)
for feature in expected_features:    # 'feature' avoids overwriting the RFE column index 'col'
    if feature not in X4.columns:
        X4[feature] = 0  # Add missing columns with default value 0

# Ensure the order matches the model's expectations
X4 = X4[expected_features]

In [None]:
# To get predicted values on train set
y_train_pred_final = get_pred(X4, logm4)
y_train_pred_final.head()

In [None]:
# Confusion Matrix and accuracy
confusion = conf_mat(y_train_pred_final.Converted,y_train_pred_final.predicted)

| Actual (rows) / Predicted (columns) | Not converted Leads | Converted Leads |
| --- | --- | --- |
| Not converted Leads | 3753 | 152 |
| Converted Leads | 567 | 1879 |
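As a worked check, the headline metrics can be recomputed by hand from the four counts in the table. This assumes the layout returned by scikit-learn's `confusion_matrix`, where rows are actual classes and columns are predicted classes.

```python
# Counts from the training confusion matrix above
# (assumed layout: rows = actual, columns = predicted)
TN, FP = 3753, 152
FN, TP = 567, 1879

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)    # recall / true positive rate
specificity = TN / (TN + FP)    # true negative rate
precision = TP / (TP + FP)      # positive predictive value

print(f"accuracy={accuracy:.4f}  sensitivity={sensitivity:.4f}")
print(f"specificity={specificity:.4f}  precision={precision:.4f}")
```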

In [None]:
other_metrics(confusion)

This is our **final model**:

1.  All p-values are very close to zero.
2.  VIFs for all features are very low. There is hardly any multicollinearity present.
3.  Training accuracy of 88.67% at a probability threshold of 0.5 is also very good.

## Step 6: Model Evaluation

### Plotting the ROC Curve

### An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
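Each point on an ROC curve is a (false positive rate, true positive rate) pair at one threshold. The sketch below computes two such points by hand on made-up labels and probabilities; the helper `roc_point` is hypothetical and only illustrates the tradeoff described above.

```python
import numpy as np

def roc_point(y_true, y_prob, threshold):
    """Return (false positive rate, true positive rate) at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

low = roc_point(y_true, y_prob, 0.2)    # lenient threshold
high = roc_point(y_true, y_prob, 0.8)   # strict threshold
print("threshold 0.2 -> FPR, TPR:", low)
print("threshold 0.8 -> FPR, TPR:", high)
```

Lowering the threshold raises the true positive rate (sensitivity) but also the false positive rate, i.e. it lowers specificity.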

In [None]:
# Function to plot ROC
def plot_roc(actual,probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate = False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_prob, drop_intermediate = False)

In [None]:
# To plot ROC
plot_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_prob)

In [None]:
print("Area under curve: ", metrics.roc_auc_score(y_train_pred_final.Converted, y_train_pred_final.Converted_prob))

### Area under curve (auc) is approximately 0.94 which is very close to ideal auc of 1.

#### Finding Optimal Cutoff Point

The optimal cutoff probability is the probability at which sensitivity and specificity are balanced.
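One way to make "balanced" concrete is to pick, from a grid of candidate cutoffs, the one with the smallest gap between sensitivity and specificity. The sketch below does this on toy data; the helper `balanced_cutoff` is an assumption for illustration, whereas the notebook reads the cutoff off a plot.

```python
import numpy as np

def balanced_cutoff(y_true, y_prob, cutoffs):
    """Return the first cutoff minimising |sensitivity - specificity|."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best, best_gap = None, np.inf
    for c in cutoffs:
        y_pred = y_prob > c
        sensi = np.mean(y_pred[y_true == 1])     # share of positives caught
        speci = np.mean(~y_pred[y_true == 0])    # share of negatives rejected
        gap = abs(sensi - speci)
        if gap < best_gap:
            best, best_gap = c, gap
    return best

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.55, 0.35, 0.45, 0.75, 0.95]
cutoff = balanced_cutoff(y_true, y_prob, [i / 10 for i in range(10)])
print("balanced cutoff:", cutoff)
```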

In [None]:
# To create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# To calculate accuracy, sensitivity and specificity for various probability cutoffs
cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])

# TP = confusion[1,1]    # True positive
# TN = confusion[0,0]    # True negatives
# FP = confusion[0,1]    # False positives
# FN = confusion[1,0]    # False negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1

    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# To plot accuracy, sensitivity and specificity for various probabilities
sns.set_style('white')
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

### From the curve above, **0.2 is the optimum point to take as a cutoff probability**.

In [None]:
# Using 0.2 threshold for predictions
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_prob.map(lambda x: 1 if x > 0.2 else 0)

y_train_pred_final.head()

In [None]:
# Confusion matrix and Overall Accuracy
confusion2 = conf_mat(y_train_pred_final.Converted,y_train_pred_final.final_predicted)

In [None]:
# To plot confusion matrix
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=confusion2)
plt.show()

In [None]:
# Other metrics
other_metrics(confusion2)

#### Classification Report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_train_pred_final.Converted, y_train_pred_final.final_predicted))

## Step 7: Precision and Recall

### Precision = TP / (TP + FP)

In [None]:
confusion[1,1]/(confusion[0,1]+confusion[1,1])

### Recall = TP / (TP + FN)

In [None]:
confusion[1,1]/(confusion[1,0]+confusion[1,1])

### Using sklearn utilities for the same:

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

### Precision and Recall Tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_prob)

In [None]:
# To plot precision vs recall for different thresholds
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

### From the curve above, 0.25 is the optimum point to take as a cutoff probability using Precision-Recall. We can check our accuracy using this cutoff too.

In [None]:
# Using 0.25 threshold for predictions
y_train_pred_final['final_predicted_pr'] = y_train_pred_final.Converted_prob.map(lambda x: 1 if x > 0.25 else 0)

y_train_pred_final.head()

In [None]:
# Confusion matrix and overall accuracy
confusion3 = conf_mat(y_train_pred_final.Converted,y_train_pred_final.final_predicted_pr)

In [None]:
# Other metrics
other_metrics(confusion3)

### Accuracy and other metrics yield similar values for both the cutoffs. We'll use the cutoff of 0.25 as derived earlier for predictions on the test set.

## Step 8: Prediction on test set

In [None]:
# Feature transform on Test set
X_test = X_test.astype(int)

# Apply feature scaling only on numeric variables
X_test[num_var] = scaler.transform(X_test[num_var])

X_test.head()

In [None]:
# To get final features
X_test_sm = X_test[col3]

In [None]:
# Select relevant columns, remove duplicates, and add constant
X_test_sm = sm.add_constant(X_test.loc[:, ~X_test.columns.duplicated()][logm4.params.index[1:]], has_constant="add")

# Convert everything to float (fixes 'exp' error)
X_test_sm = X_test_sm.astype(float)

# Check for NaNs and fill with 0 (or other appropriate value)
X_test_sm = X_test_sm.fillna(0)

# Predict
y_test_pred = logm4.predict(X_test_sm)

In [None]:
# To convert the predicted probabilities to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

y_pred_1.head()

In [None]:
# To convert y_test to dataframe
y_test_df = pd.DataFrame(y_test)

# To copy the index into a 'Lead ID' column
y_test_df['Lead ID'] = y_test_df.index

In [None]:
# To remove index for both dataframes to append them side by side
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

# To append y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

# To Rename the column
y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_prob'})

y_pred_final.head()

In [None]:
# To apply the threshold of 0.25 as derived
y_pred_final['final_predicted'] = y_pred_final.Converted_prob.map(lambda x: 1 if x > 0.25 else 0)

y_pred_final.head()

In [None]:
print("Area under curve: ", metrics.roc_auc_score(y_pred_final.Converted, y_pred_final.Converted_prob))

In [None]:
# Confusion matrix and overall accuracy
confusion_test = conf_mat(y_pred_final.Converted,y_pred_final.final_predicted)

In [None]:
# To plot confusion matrix
plot_confusion_matrix(conf_mat=confusion_test)
plt.show()

| Predicted/Actual | Not converted Leads | Converted Leads |
| --- | --- | --- |
| Not converted Leads | 1635 | 95 |
| Converted Leads | 158 | 831 |

In [None]:
# Other metrics
other_metrics(confusion_test)
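
`other_metrics` is a helper defined earlier in the notebook. As a rough sketch of the kind of quantities such a helper reports, the snippet below derives the usual rates from the four cells of the test confusion matrix shown above. It assumes rows are actual classes and columns are predicted classes (the scikit-learn `confusion_matrix` convention); `metrics_from_confusion` is a hypothetical name, not the notebook's function.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Standard rates from a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),                # recall / true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "false_positive_rate": fp / (fp + tn),
        "positive_predictive_value": tp / (tp + fp),  # precision
        "negative_predictive_value": tn / (tn + fn),
    }


# Cell values copied from the test-set confusion matrix above
test_metrics = metrics_from_confusion(tn=1635, fp=95, fn=158, tp=831)
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

If the matrix is oriented the other way round (rows = predicted), accuracy is unchanged but sensitivity and positive predictive value swap, as do specificity and negative predictive value.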

### Classification Report

In [None]:
print(classification_report(y_pred_final.Converted, y_pred_final.final_predicted))

## Step 9: Determining Feature Importance

### Assigning Lead Score

#### Lead Score = 100 * ConversionProbability
#### This needs to be calculated for all the leads from the original dataset (train + test).
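
As a small illustration of the formula above, the sketch below converts a probability into an integer score in plain Python. `lead_score` is a hypothetical helper, and the range check is an added assumption. One detail worth knowing is that Python's built-in `round` sends exact ties to the nearest even integer.

```python
def lead_score(conversion_prob):
    """Lead Score = 100 * conversion probability, rounded to an integer in [0, 100]."""
    if not 0.0 <= conversion_prob <= 1.0:
        raise ValueError("conversion probability must lie in [0, 1]")
    return round(conversion_prob * 100)


# Probabilities chosen for illustration only
for prob in (0.0, 0.125, 0.25, 0.375, 1.0):
    print(prob, lead_score(prob))
```

Here 12.5 rounds to 12 while 37.5 rounds to 38; with real model probabilities exact ties are rare, so the effect on scores is negligible.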

In [None]:
# To select test set
leads_test_pred = y_pred_final.copy()
leads_test_pred.head()

In [None]:
# To select train set
leads_train_pred = y_train_pred_final.copy()
leads_train_pred.head()

In [None]:
# To drop unnecessary columns from train set
leads_train_pred = leads_train_pred[['Lead ID','Converted','Converted_prob','final_predicted']]
leads_train_pred.head()

In [None]:
# To concatenate the train and test predictions
lead_full_pred = pd.concat([leads_train_pred, leads_test_pred])
lead_full_pred.head()

In [None]:
# To inspect the shape of the final dataset
print(leads_train_pred.shape)
print(leads_test_pred.shape)
print(lead_full_pred.shape)

In [None]:
# To ensure uniqueness of Lead IDs
len(lead_full_pred['Lead ID'].unique().tolist())

In [None]:
# To calculate the Lead Score
lead_full_pred['Lead_Score'] = lead_full_pred['Converted_prob'].apply(lambda x : round(x*100))
lead_full_pred.head()

In [None]:
# To make the Lead ID column as index
lead_full_pred = lead_full_pred.set_index('Lead ID').sort_index(axis = 0, ascending = True)
lead_full_pred.head()

In [None]:
# To get Lead Number column from original data
leads_original = lead_df_original[['Lead Number']]
leads_original.head()

In [None]:
# To concatenate the 2 dataframes based on index
leads_with_score = pd.concat([leads_original, lead_full_pred], axis=1)
leads_with_score.head()

#### We have a new data frame consisting of Lead Number and Lead Score. Lead Number will help in easy referencing with the original data.

#### Determining Feature Importance

In [None]:
# To display features with corresponding coefficients in final model
pd.options.display.float_format = '{:.2f}'.format
new_params = logm4.params[1:]
new_params

In [None]:
# Relative feature importance
feature_importance = new_params
feature_importance = 100.0 * (feature_importance / feature_importance.max())
feature_importance

In [None]:
# To sort features by importance (positions in ascending order of relative importance)
sorted_idx = np.argsort(feature_importance.to_numpy())
sorted_idx

In [None]:
# To plot features with their relative importance
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
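order = np.asarray(sorted_idx)  # positional order as a plain integer array
pos = np.arange(order.shape[0])
ax.barh(pos, feature_importance.iloc[order])
ax.set_yticks(pos)
# Label bars from the model's own parameter index so names match the coefficients
ax.set_yticklabels(feature_importance.index[order], fontsize=12)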
ax.set_xlabel('Relative Feature Importance', fontsize=12)
plt.show()
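
Scaling by the largest coefficient leaves negative coefficients with a negative relative importance, so the sorted bar chart runs from the most negative feature to the most positive one. The plain-Python sketch below shows the same scaling, plus a split into positive and negative groups ordered by magnitude. The coefficient values and the names `feat_a` to `feat_d` are invented placeholders standing in for `logm4.params[1:]`; `relative_importance` and `split_by_sign` are hypothetical helpers, and the scaling assumes the largest coefficient is positive.

```python
def relative_importance(coefs):
    """Scale each coefficient to a percentage of the largest coefficient."""
    largest = max(coefs.values())
    return {name: 100.0 * value / largest for name, value in coefs.items()}


def split_by_sign(coefs):
    """Return (positive, negative) feature names, each in decreasing order of magnitude."""
    positive = sorted((n for n, v in coefs.items() if v > 0), key=lambda n: -coefs[n])
    negative = sorted((n for n, v in coefs.items() if v < 0), key=lambda n: coefs[n])
    return positive, negative


# Invented coefficients for illustration only (not the fitted model's values)
coefs = {"feat_a": 4.0, "feat_b": -2.0, "feat_c": 1.0, "feat_d": -3.0}

importance = relative_importance(coefs)
positive, negative = split_by_sign(coefs)
print(importance)
print(positive, negative)
```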

## Conclusions

### After trying out several models, our final model has the following characteristics:  
1. All p-values are very close to zero.
2. VIFs for all features are very low. There is hardly any multicollinearity present.
3. The overall test accuracy of 90.67% at a probability threshold of 0.25 is also very good.

| Dataset | Accuracy | Sensitivity | Specificity | False Positive Rate | Positive Predictive Value | Negative Predictive Value | AUC |
| ------- | -------- | ----------- | ----------- | ------------------- | ------------------------- | ------------------------- | ------ |
| Train   | 0.9104   | 0.8597      | 0.9421      | 0.0578              | 0.9829                    | 0.9147                    | 0.9393 |
| Test    | 0.9067   | 0.8432      | 0.9429      | 0.0570              | 0.8938                    | 0.9134                    | 0.9454 |

The **optimal threshold** for the model is **0.25**, chosen based on the tradeoff between sensitivity, specificity and accuracy. Depending on business needs, this threshold can be raised or lowered to favor a specific metric.  


High sensitivity ensures that most of the leads who are likely to convert are correctly predicted, while high specificity ensures that most of the leads who are not likely to convert are correctly predicted.  


**Twelve features** were selected as the most significant in predicting the conversion:  
- Features having positive impact on conversion probability in decreasing order of impact:  


|**Features with Positive Coefficient Values**|
|-|
|Tags_Lost to EINS|
|Tags_Closed by Horizzon|
|Tags_Will revert after reading the email|
|Tags_Busy|
|Lead Source_Welingak Website|
|Last Notable Activity_SMS Sent|
|Lead Origin_Lead Add Form|


- Features having negative impact on conversion probability in decreasing order of impact:  

|**Features with Negative Coefficient Values**|
|-|
|Lead Quality_Worst|
|Lead Quality_Not Sure|
|Tags_switched off|
|Tags_Ringing|
|Do Not Email|