# Lead Scoring Case Study


Steps followed to build the model:
1. [Importing Libraries and Data](#1)
2. [Data Understanding and Inspection](#2)
3. [Data Cleaning](#3)
4. [Data Analysis (EDA)](#4)
5. [Data Preparation](#5)
6. [Test-Train Split](#6)
7. [Feature Scaling](#7)
8. [Feature Selection](#8)
9. [Model Building](#9)
10. [Model Evaluation](#10)
11. [Predictions on Test Set](#11)
12. [Conclusion](#12)

## <p id="1">1. Importing Libraries and Data</p>

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'iframe'
pio.templates.default = "plotly_dark"

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.metrics import precision_recall_curve


In [None]:
# reading the leads dataset
leads_df = pd.read_csv('Leads.csv')

## <p id="2">2. Data Understanding and Inspection</p>

In [None]:
leads_df.shape

In [None]:
leads_df.describe()

Inspect the column dtypes and non-null counts of the dataframe.

In [None]:
# display the info of the dataframe
leads_df.info()

In [None]:
# check for columnwise null count
leads_df.isnull().sum()


In [None]:
#columnwise null values count in terms of percentages sorted in descending order
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2).sort_values(ascending=False)

<strong><span style="color:blue">Observation:</span></strong>  There are 13 columns with a missing-value rate above 15%.
Columns with a high missing-value rate are candidates for dropping.
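
The count of columns above a missing-value threshold can also be read off programmatically instead of by eye. The following is a minimal sketch on a toy dataframe; the column names `a`, `b`, `c` and their values are hypothetical stand-ins for `leads_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for leads_df (hypothetical values).
toy = pd.DataFrame({
    "a": [1, 2, np.nan, 4, 5],               # 1 of 5 missing
    "b": [np.nan, np.nan, np.nan, "x", "y"],  # 3 of 5 missing
    "c": ["p", "q", "r", "s", "t"],           # nothing missing
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean().mul(100).round(2)

# Columns whose missing rate exceeds the 15% mark used in the observation.
high_missing = missing_pct[missing_pct > 15].sort_values(ascending=False)
print(high_missing)
print("columns above 15%:", len(high_missing))
```

Applied to `leads_df`, `len(high_missing)` should reproduce the column count quoted in the observation.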

In [None]:
leads_df.columns

In [None]:
# unique values count in each column
leads_df.nunique().sort_values(ascending=False)

In [None]:
# check duplicate rows
leads_df.duplicated().sum()

<strong><span style="color:blue">Observation:</span></strong>  No duplicate rows found in the dataframe

## <p id="3">3. Data Cleaning</p>

#### Handling Missing Values

In [None]:
#columnwise null values count in terms of percentages sorted in descending order
round(100*(leads_df.isna().mean()), 2).sort_values(ascending=False)

#### Replacing 'Select' with NaN

Problem statement states that "Many of the categorical variables have a level called 'Select' which needs to be handled because it is as good as a null value"

Considering the above statement, we will replace the 'Select' values with NaN.

In [None]:
# print the names of all columns that contain the value 'Select'
def find_cols_with_select_val(df):
    for col in df.columns:
        if df[col].dtype == 'object':
            # exact match, consistent with the replace('Select', np.nan) step
            if df[col].eq('Select').any():
                print(col)

find_cols_with_select_val(leads_df)

<strong><span style="color:blue">Observation:</span></strong>  There are 4 columns containing 'Select' as a value. We can replace them with NaN as they are not useful for our analysis.

In [None]:
# replace 'Select' with NaN
leads_df = leads_df.replace('Select', np.nan)

In [None]:
find_cols_with_select_val(leads_df)

<strong><span style="color:blue">Observation:</span></strong>  Select values are now replaced with NaN values.

In [None]:
# lets check columnwise null ratio again
round(100*(leads_df.isna().mean()), 2).sort_values(ascending=False)

#### Let's use 40% as the cut-off for null values: any column with more than 40% null values is dropped.

In [None]:
# drop all the columns with more than 40% missing values (keep columns that are at least 60% non-null)

leads_df = leads_df.dropna(thresh=0.6*len(leads_df), axis=1)

In [None]:
leads_df.shape

<strong><span style="color:blue">Observation:</span></strong>  The number of columns in the dataset is now reduced from 37 to 30.
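
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-null values a column needs in order to be kept, not a maximum number of nulls. The sketch below shows this on a toy dataframe with hypothetical column names; note that a column with exactly 40% missing values survives.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, 4, np.nan],                  # 20% missing
    "exactly_40": [1, 2, 3, np.nan, np.nan],              # 40% missing
    "mostly_empty": [1, np.nan, np.nan, np.nan, np.nan],  # 80% missing
})

# A column is kept only if it has at least this many non-null values.
min_non_null = int(round(0.6 * len(toy)))  # 3 of 5 rows

kept = toy.dropna(thresh=min_non_null, axis=1)
print(kept.columns.tolist())
```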

In [None]:
# columns with categorical data

leads_df_cat = leads_df.select_dtypes(include=['object']).columns
print('Number of Categorical Columns: ', len(leads_df_cat))
print('Categorical Columns: ', leads_df_cat)

### Imputing missing values

In [None]:
# City column has 39.71% missing values.
#lets check the value counts and decide what to do with it
leads_df['City'].value_counts(normalize=True)*100

<strong><span style="color:blue">Observation:</span></strong>  Data is not uniformly distributed. Mumbai has the maximum number of leads. Lets drop the city column as it is skewed towards Mumbai.

In [None]:
# drop city column
leads_df.drop('City', axis=1, inplace=True)
leads_df.shape

In [None]:
# Specialization column has 36.58% missing values.
# lets check the value counts of the column
leads_df['Specialization'].value_counts(normalize=True)*100

<strong><span style="color:blue">Observation:</span></strong>  Data is uniformly distributed. No outliers are present. Let's assign the missing values to a new category called 'Others'.

In [None]:
# create a new category "Others" for the variable "Specialization" with all null values
# leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan, 'Others')
leads_df['Specialization'] = leads_df['Specialization'].fillna('Others')

In [None]:
# Tags column has 36.29% missing values.
# lets check the value counts of the column
leads_df['Tags'].value_counts(normalize=True)*100

In [None]:
# The Tags and Country columns are irrelevant for the model, so drop them.
leads_df.drop(['Tags', 'Country'], axis=1, inplace=True)

In [None]:
# 'What matters most to you in choosing a course' column has 29.32% missing values.
# lets check the value counts of the column
leads_df['What matters most to you in choosing a course'].value_counts(normalize=True)*100

In [None]:
# 'What matters most to you in choosing a course' column data is highly skewed.
# So we are dropping this column.
leads_df.drop('What matters most to you in choosing a course', axis=1, inplace=True)

In [None]:
# 'What is your current occupation' has 29.11% missing values
# lets check the value counts
leads_df['What is your current occupation'].value_counts(normalize=True)*100

In [None]:
# lets impute the missing values in 'What is your current occupation' with 'Unemployed'
leads_df['What is your current occupation'] = leads_df['What is your current occupation'].fillna('Unemployed')

In [None]:
leads_df['TotalVisits'].value_counts(normalize=True)*100

In [None]:
# Impute TotalVisits with mode
leads_df['TotalVisits'] = leads_df['TotalVisits'].fillna(leads_df['TotalVisits'].mode()[0])

In [None]:
leads_df['Page Views Per Visit'].value_counts(normalize=True)*100

In [None]:
# Impute Page Views Per Visit with mode value
leads_df['Page Views Per Visit'] = leads_df['Page Views Per Visit'].fillna(leads_df['Page Views Per Visit'].mode()[0])

In [None]:
leads_df['Lead Source'].value_counts(normalize=True)*100

In [None]:
# Imputing the Lead Source column with the mode value i.e. Google
leads_df['Lead Source'] = leads_df['Lead Source'].fillna('Google')

In [None]:
leads_df['Last Activity'].value_counts(normalize=True)*100

In [None]:
# Imputing the Last Activity column with the mode value i.e. Email Opened
leads_df['Last Activity'] = leads_df['Last Activity'].fillna('Email Opened')

In [None]:
# lets check the unique values in each column
leads_df.nunique()

In [None]:
# assign column names with 1 unique value to a list
cols_with_one_unique_val = [col for col in leads_df.columns if leads_df[col].nunique() == 1]
cols_with_one_unique_val

In [None]:
# columns with a single unique value do not contribute to model building,
# so let's drop them
leads_df.drop(cols_with_one_unique_val, axis=1, inplace=True)

In [None]:
# dropping Prospect ID, Lead Number, Last Notable Activity as they do not contribute to the model
leads_df.drop(['Prospect ID','Lead Number','Last Notable Activity'],axis=1,inplace=True)

In [None]:
leads_df.shape

In [None]:
# function to plot count plots for categorical variables
def plot_count_plots(dataframe, cols):
    plt.figure(figsize=(15, 40))  # Adjust the figure size if needed

    for col in cols:
        plt.subplot(8, 2, cols.index(col) + 1)
        sns.countplot(data=dataframe, x=col)
        plt.title(col, fontsize=12)
        plt.xticks(rotation=90)
    
    plt.tight_layout()
    plt.show()

categorical_col = leads_df.select_dtypes(include=['category', 'object']).columns.tolist()
plot_count_plots(leads_df, categorical_col)

<strong><span style="color:blue">Observation:</span></strong> 
The following columns are highly skewed:
- Through Recommendations
- Newspaper
- Newspaper Article
- Digital Advertisement
- X Education Forums
- Search
- Do Not Call

Since these columns are highly skewed, we can drop these columns as they will not add any value to our analysis.
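
Skew in a categorical column can be quantified as the share taken by its most frequent level, which makes the drop decision reproducible rather than visual. This is a minimal sketch on toy data; the 0.95 cut-off is an illustrative assumption, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Toy data: one near-constant column and one reasonably spread column.
toy = pd.DataFrame({
    "Search": ["No"] * 98 + ["Yes"] * 2,
    "Lead Origin": ["API"] * 40 + ["Landing Page Submission"] * 50 + ["Lead Add Form"] * 10,
})

# Share of the most frequent level in each column.
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# 0.95 is an illustrative cut-off (assumption), not taken from the notebook.
skewed_cols = top_share[top_share > 0.95].index.tolist()
print(top_share)
print(skewed_cols)
```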


In [None]:

leads_df.drop(['Do Not Call','Search','Newspaper Article','X Education Forums','Newspaper','Digital Advertisement','Through Recommendations'],axis=1,inplace=True)
print(leads_df.shape)

In [None]:
# Mapping binary categorical variables (Yes/No to 1/0) 
leads_df['Do Not Email'] = leads_df['Do Not Email'].apply(lambda x: 1 if x =='Yes' else 0)

leads_df['A free copy of Mastering The Interview'] = leads_df['A free copy of Mastering The Interview'].apply(lambda x: 1 if x =='Yes' else 0)

#### Outlier Analysis

In [None]:
# numeric columns
numerical_cols = leads_df.select_dtypes(exclude=['category', 'object']).columns.tolist()

In [None]:
for col in numerical_cols:
    plt.figure(figsize=(6, 4))  # Adjust the figure size if needed

    sns.boxplot(data=leads_df[col])

    plt.title(f'Box Plot of {col}', fontsize=14)
    plt.ylabel('Values')
    plt.xlabel(col)

    plt.tight_layout()
    plt.show()

#### Outlier Treatment

In [None]:
def perform_outlier_treatment(dataframe, column):
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    dataframe[column] = np.where(dataframe[column] > upper_bound, upper_bound, dataframe[column])
    dataframe[column] = np.where(dataframe[column] < lower_bound, lower_bound, dataframe[column])

columns_to_treat = ['TotalVisits', 'Page Views Per Visit']

for col in columns_to_treat:
    perform_outlier_treatment(leads_df, col)
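
The capping performed by `perform_outlier_treatment` can be checked on a small series. The sketch below uses `Series.clip`, which should give the same result as the two `np.where` calls; the sample values are made up for illustration.

```python
import pandas as pd

# Toy series with one extreme value (hypothetical data).
s = pd.Series([0, 1, 2, 3, 4, 5, 100], dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are pulled back to the nearest bound;
# no rows are removed, so the shape of the data is unchanged.
capped = s.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Because the values are capped rather than dropped, `leads_df.shape` stays the same after the treatment.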

In [None]:
leads_df.shape

In [None]:
leads_df.describe(percentiles=[.10,.25,.50,.75,.95])

In [None]:
leads_df['Lead Source'].value_counts(normalize=True)*100

In [None]:
# Changing google to Google
leads_df['Lead Source'] = leads_df['Lead Source'].replace("google","Google")

# Group the values of the column 'Lead Source' into a new value 'Others' if the value count is less than 10 in the column 'Lead Source'
category_counts = leads_df['Lead Source'].value_counts()
category_names_less_than_10 = category_counts[category_counts < 10].index.tolist()
print(category_names_less_than_10)
leads_df.loc[leads_df['Lead Source'].isin(category_names_less_than_10), 'Lead Source'] = 'Others'

leads_df['Lead Source'].value_counts(normalize=True)*100

In [None]:
# Group the values of the variable 'Last Activity' into a new category called 'Others' if the value count is less than 100

category_counts = leads_df['Last Activity'].value_counts()
category_names_less_than_100 = category_counts[category_counts < 100].index.tolist()

leads_df.loc[leads_df['Last Activity'].isin(category_names_less_than_100), 'Last Activity'] = 'Others'

leads_df['Last Activity'].value_counts(normalize=True)*100

In [None]:
print(leads_df.select_dtypes(include=['category', 'object']).columns.tolist())
print(leads_df.select_dtypes(exclude=['category', 'object']).columns.tolist())

In [None]:
leads_df.info()

## <p id="4">4. Data Analysis (EDA)</p>

#### Univariate Analysis

In [None]:
# List of columns for which to create count plots
cols = [
    'Lead Origin', 'Lead Source', 'Last Activity',
    'What is your current occupation', 'Do Not Email',
    'Converted', 'Specialization',
    'A free copy of Mastering The Interview'
]


sns.set_theme(style="dark")


for col in cols:
    plt.figure(figsize=(10, 6))
    
    # Create a count plot
    sns.countplot(data=leads_df, x=col)
    plt.title(f'Count Plot of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    
    # Calculate and display percentages on top of the bars
    total_counts = leads_df[col].value_counts()
    for patch in plt.gca().patches:
        x = patch.get_x() + patch.get_width() / 2
        y = patch.get_height()
        percentage = y / len(leads_df)  # Calculate percentage based on the total number of entries
        plt.annotate(f'{percentage:.2%}', (x, y), ha='center', va='bottom')
    
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()






<strong><span style="color:blue">Observation:</span></strong>  

**Dominant categories of each variable (converted and not-converted leads combined):**

- **Lead Origin:** "Landing Page Submission" accounts for 53% of leads and "API" for 39%.
- **Lead Source:** Google and Direct Traffic together account for 58% of leads.
- **Last Activity:** SMS Sent and Email Opened together account for 68% of last activities.
- **Current occupation:** 90% of leads are Unemployed.
- **Do Not Email:** About 92% of leads have `Do Not Email` set to 0 (No), i.e. they have not opted out of emails about the course.


#### Bivariate Analysis

In [None]:
def plot_bivariate_count(data, x_col, y_col):
    # Create a cross-tabulation (crosstab) of the two columns
    crosstab = pd.crosstab(data[x_col], data[y_col], normalize='index') * 100
    count_crosstab = pd.crosstab(data[x_col], data[y_col])

    plt.figure(figsize=(10, 6))
    
    # Define a custom color palette for the count plot bars
    custom_palette = sns.color_palette(['#FF8080', '#80FF80'])

    
    ax = sns.countplot(data=data, x=x_col, hue=y_col, palette=custom_palette)
    plt.title(f'Lead Conversion Rate: {x_col}')
    plt.xlabel(x_col)
    plt.ylabel('Count')
    
    plt.xticks(rotation=90)  # Rotate x-axis labels by 90 degrees
    plt.legend(title=y_col, loc='upper right', labels=['No', 'Yes'])  # Place legend at top right
    
#     total=len(leads_df[x_col])

#     for p in ax.patches:
#         text = '{:.1f}%'.format(100*p.get_height()/total)
#         x = p.get_x() + p.get_width() / 2.
#         y = p.get_height()

#         ax.annotate(text, (x,y), ha = 'center', va = 'center', xytext = (0, 5), textcoords = 'offset points')
    
    # Add percentage labels on top of the bars
    all_heights = [[p.get_height() if not pd.isna(p.get_height()) else 0 for p in bars] for bars in ax.containers]

    for bars in ax.containers:
        for i, p in enumerate(bars):
            total = sum(xgroup[i] for xgroup in all_heights)
            percentage = f'{(100 * p.get_height() / total) :.1f}%'
            ax.annotate(percentage, (p.get_x() + p.get_width() / 2, p.get_height()), size=11, ha='center', va='bottom')

    plt.show()

cols = [
    'Lead Origin', 'Lead Source', 'Last Activity',
    'What is your current occupation', 'Do Not Email', 'Specialization',
    'A free copy of Mastering The Interview'
]

for col in cols:
    plot_bivariate_count(leads_df, col, 'Converted')

<strong><span style="color:blue">Observation:</span></strong>  

**Lead Origin:**
- About 53% of leads stem from "Landing Page Submission," boasting a conversion rate of 36%.
- The "API" accounts for approximately 39% of customers, showing a conversion rate of 31%.

**Current Occupation:**
- Approximately 90% of customers fall under the "Unemployed" category, with a conversion rate of 34%.
- Despite constituting only 7.6% of the total customer base, "Working Professionals" exhibit an impressive 92% conversion rate.

**Do Not Email:**
- About 92% of leads have `Do Not Email` set to 0 (No), i.e. they have not opted out of email communications regarding the course.

**Lead Source:**
- "Google" accounts for 31% of leads, with a conversion rate of 40%.
- "Direct Traffic" accounts for 27% of leads, with a lower conversion rate of 32%.
- "Organic Search" has a conversion rate of 37.8% but represents only 12.5% of leads.
- "Reference" shows a remarkable conversion rate of 91%, yet it makes up only around 6% of leads.

**Last Activity:**
- "SMS Sent" shows a notably high conversion rate of 63% and accounts for 30% of last activities.
- "Email Opened" accounts for 38% of last activities, with a conversion rate of 37%.

**Specialization:**
- "Marketing Management," "HR Management," and "Finance Management" emerge as prominent contributors.


In [None]:
num_cols = ['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data=leads_df, vars=num_cols, hue="Converted")
plt.show()

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1,3,1)
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads_df)
plt.subplot(1,3,2)
sns.boxplot(y = 'Page Views Per Visit', x = 'Converted', data = leads_df)
plt.subplot(1,3,3)
sns.boxplot(y = 'Total Time Spent on Website', x = 'Converted', data = leads_df)
plt.show()

<strong><span style="color:blue">Observation:</span></strong>
#### Leads who spent more time on the website had a higher conversion rate

## <p id="5">5. Data Preparation</p>

In [None]:
leads_df.head()

#### Before creating dummy variables, let's shorten the long column names

In [None]:
# rename 'A free copy of Mastering The Interview' to 'Free_copy' and 'What is your current occupation' to 'Occupation'
leads_df.rename(columns={'A free copy of Mastering The Interview':'Free_copy','What is your current occupation':'Occupation'},inplace=True)

In [None]:
leads_df.shape

In [None]:
x = ["Lead Source", "Lead Origin","Last Activity","Specialization","Occupation"]

for i in x:
    print(i, len(leads_df[i].value_counts()))

In [None]:
# df = pd.get_dummies(leads_df[["Lead Origin","Lead Source","Last Activity","Specialization","Occupation"]], drop_first=True)
# print(df.shape)

# create dummy variables for categorical variables
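leads_df = pd.get_dummies(data=leads_df, columns=["Lead Source", "Lead Origin","Last Activity","Specialization","Occupation"], drop_first=True, dtype=int)
# drop_first=True drops the first level of each variable, as k-1 dummies can explain k categories
# dtype=int keeps the dummies numeric for statsmodels (newer pandas versions default to bool)
print(leads_df.shape)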
# leads_df = pd.concat([leads_df, df], axis=1)
# print(leads_df.shape)
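
To see why k-1 dummies are enough for k categories, here is a minimal sketch on a toy `Occupation` column (the rows are hypothetical). With `drop_first=True` the alphabetically first level is dropped and becomes the baseline, encoded as all zeros.

```python
import pandas as pd

toy = pd.DataFrame({
    "Occupation": ["Student", "Unemployed", "Working Professional", "Unemployed"]
})

# Three levels produce two dummy columns; "Student" is the dropped baseline.
dummies = pd.get_dummies(toy, columns=["Occupation"], drop_first=True, dtype=int)
print(dummies)
```

The first row ("Student") has 0 in both dummy columns, so no information is lost by dropping its column.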

In [None]:
leads_df.head()

In [None]:
leads_df.shape

## <p id="6">6. Test-Train Split</p>

In [None]:
# 'Converted' is the dependent variable
y = leads_df.pop('Converted')

# All remaining variable are independent variables
X = leads_df

print('Before split:',X.shape, y.shape)

# Train Test split with 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)

print('After split X data', X_train.shape, X_test.shape)
print('After split y data', y_train.shape, y_test.shape)

## <p id="7">7. Feature Scaling</p>

In [None]:
num_cols=X_train.select_dtypes(include=['int64','float64']).columns

# Use min-max scaling to bring the numeric columns into the [0, 1] range
scaler = MinMaxScaler()

# Fit on the training set only, so no information from the test set leaks into the scaler
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
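
When the test set is scaled later, the scaler fitted on the training data should be reused with `transform`, not refitted. The sketch below illustrates this on toy values (hypothetical numbers); a test value above the training maximum scales to more than 1, which is expected.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test split (hypothetical values).
train = pd.DataFrame({"TotalVisits": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"TotalVisits": [2.5, 12.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from train only
test_scaled = scaler.transform(test)        # reuses the train min/max, no refit

print(train_scaled.ravel())
print(test_scaled.ravel())
```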


In [None]:
X_train[num_cols].describe()

In [None]:
# analyse correlation matrix
plt.figure(figsize = (50,15))        
sns.heatmap(leads_df.corr(),linewidths=0.01,cmap="GnBu",annot=True)
plt.show()

In [None]:
plt.figure(figsize = (5,5))        
sns.heatmap(leads_df[["Lead Source_Facebook","Lead Origin_Lead Import","Lead Origin_Lead Add Form","Lead Source_Reference"]].corr(),linewidths=0.01,cmap="crest",annot=True)
plt.show()

<strong><span style="color:blue">Observation:</span></strong>
The heatmap shows two pairs of highly correlated predictors (correlations of 0.98 and 0.85): `Lead Origin_Lead Import` with `Lead Source_Facebook`, and `Lead Origin_Lead Add Form` with `Lead Source_Reference`. Keeping both variables of a pair adds little value to the model, so we drop one variable from each pair: `'Lead Origin_Lead Import'` and `'Lead Origin_Lead Add Form'`.
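
With many dummy columns the full heatmap is hard to read, so highly correlated pairs can also be listed directly. This is a sketch on synthetic data; the column names and the 0.8 cut-off are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                       # independent of "a"
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# 0.8 is an illustrative cut-off (assumption).
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```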

In [None]:
# drop 'Lead Origin_Lead Import' and 'Lead Origin_Lead Add Form' columns as they are highly correlated with 'Lead Source_Facebook' and 'Lead Source_Reference' respectively
X_train.drop(['Lead Origin_Lead Import', 'Lead Origin_Lead Add Form'], axis=1, inplace=True)
X_test.drop(['Lead Origin_Lead Import', 'Lead Origin_Lead Add Form'], axis=1, inplace=True)

## <p id="8">8. Feature Selection</p>

In [None]:
len(X_train.columns)

##### Using an automated approach to cut down the features
Feature ranking with recursive feature elimination (RFE).
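
RFE repeatedly fits the estimator and removes the weakest feature until the requested number remains. The following minimal sketch runs it on synthetic data in which only the first two columns drive the target; the data and the choice of two features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))

# Only the first two columns influence the synthetic target.
signal = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
y = (signal > 0).astype(int)

rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)

print(rfe.support_)  # boolean mask of the kept columns
print(rfe.ranking_)  # 1 = selected; larger ranks were eliminated earlier
```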

In [None]:
# Lets use RFE to reduce variables 
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=15)            
rfe = rfe.fit(X_train, y_train)

In [None]:
# all columns
X_train.columns

In [None]:
# 15 features selected by RFE
rfe_cols = X_train.columns[rfe.support_].values.tolist()
print(rfe_cols)
print(X_train.columns[rfe.support_])

In [None]:
# Features not selected by RFE
X_train.columns[~rfe.support_]

##### Manual Elimination

In [None]:
#Function to build a model using statsmodel api - Takes the columns to be selected for model as a parameter
def build_model(cols):
    X_train_sm = sm.add_constant(X_train[cols])
    lm = sm.GLM(y_train,X_train_sm,family = sm.families.Binomial()).fit()  
    print(lm.summary())
    return lm

In [None]:
#Function to calculate VIFs and print them -Takes the columns for which VIF to be calculated as a parameter
def get_vif(cols):
    df1 = X_train[cols]
    vif = pd.DataFrame()
    vif['Features'] = df1.columns
    vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
    vif['VIF'] = round(vif['VIF'],2)
    print(vif.sort_values(by='VIF',ascending=False))

## <p id="9">9. Model Building</p>

Model evaluation criteria
- p-value < 0.05
- VIF < 5
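
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand with NumPy on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch: statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the numbers can differ from those printed by `get_vif`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # nearly a linear combination of x1 and x2
x4 = rng.normal(size=n)                       # independent of the rest
X = np.column_stack([x1, x2, x3, x4])

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, i) for i in range(X.shape[1])]
print(np.round(vifs, 2))
```

The collinear columns come out far above the threshold of 5, while the independent column stays close to 1.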

#### Model-1

In [None]:
#Selected columns for Model 1 - all columns selected by RFE
build_model(rfe_cols)
get_vif(rfe_cols)

<strong><span style="color:Blue">NOTE : </span></strong> "Occupation_Housewife" column will be removed from model due to high p-value of 0.999, which is above the accepted threshold of 0.05 for statistical significance.

In [None]:
def remove_and_return_element(arr, element):
    updated_array = [x for x in arr if x != element]
    return updated_array

In [None]:
rfe_cols = remove_and_return_element(rfe_cols, 'Occupation_Housewife')

#### Model-2

In [None]:
build_model(rfe_cols)
get_vif(rfe_cols)

<strong><span style="color:Blue">NOTE:</span></strong> "Lead Source_Others" column will be removed from model due to high p-value of 0.095,  which is above the accepted threshold of 0.05 for statistical significance.

In [None]:
rfe_cols = remove_and_return_element(rfe_cols, 'Lead Source_Others')

#### Model-3

In [None]:
build_model(rfe_cols)
get_vif(rfe_cols)

<strong><span style="color:Blue">NOTE:</span></strong> "Page Views Per Visit" column will be removed from model due to high VIF value of 6.34, which is greater than the accepted threshold of 5

In [None]:
rfe_cols = remove_and_return_element(rfe_cols, 'Page Views Per Visit')

#### Model-4

In [None]:
build_model(rfe_cols)
get_vif(rfe_cols)

<strong><span style="color:Blue">NOTE:</span></strong> "TotalVisits" column will be removed from model due to high p-value of 0.086, which is above the accepted threshold of 0.05 for statistical significance.

In [None]:
rfe_cols = remove_and_return_element(rfe_cols, 'TotalVisits')

#### Model-5

In [None]:
logm = build_model(rfe_cols)
get_vif(rfe_cols)

<strong><span style="color:Blue">NOTE:</span></strong> No variable needs to be dropped: every p-value is below the 0.05 threshold and every VIF is below 5.
- The model looks acceptable on both criteria (p-values and VIFs).
- So we will finalize Model 5 for `Model Evaluation`.

## <p id="10">10. Model Evaluation</p>

In [None]:
X_train_sm5 = sm.add_constant(X_train[rfe_cols])

# Getting the predicted values on the train set
y_train_pred = logm.predict(X_train_sm5)         # giving prob. of getting 1

y_train_pred[:10]

In [None]:
# for array
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
# Creating a dataframe with the actual converted flag and the predicted probabilities

y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final.head()

# y_train.values: actual Converted values from the leads_df dataset
# y_train_pred: probability of conversion predicted by the model

##### Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 0


In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final["Converted_Prob"].map(lambda x: 1 if x > 0.5 else 0)

# checking head
y_train_pred_final.head()

#### Confusion matrix 


In [None]:
confusion = metrics.confusion_matrix(y_train_pred_final["Converted"], y_train_pred_final["Predicted"])

print(confusion)

In [None]:
# Predicted     not_converted    converted
# Actual
# not_converted        3572      430
# converted            844       1622  

#### Accuracy

In [None]:
print(metrics.accuracy_score(y_train_pred_final["Converted"], y_train_pred_final["Predicted"]))

#### Metrics beyond simply accuracy
- Sensitivity and Specificity
- When we have Predicted at threshold 0.5 probability

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
print("Sensitivity :",TP / float(TP+FN))

In [None]:
# Let us calculate specificity
print("Specificity :",TN / float(TN+FP))


In [None]:
# Calculate false positive rate - predicting conversion for a lead that did not convert
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))
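
As a worked example, the metrics above can be recomputed from the confusion matrix noted in the comment cell earlier (3572, 430, 844, 1622). This sketch assumes those four counts are the train-set matrix at the 0.5 threshold.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes.
confusion = np.array([[3572, 430],
                      [844, 1622]])

TN, FP = confusion[0]
FN, TP = confusion[1]

accuracy = (TP + TN) / confusion.sum()
sensitivity = TP / (TP + FN)   # share of actual conversions that were caught
specificity = TN / (TN + FP)   # share of non-conversions correctly rejected
precision = TP / (TP + FP)     # share of predicted conversions that were right
false_positive_rate = FP / (FP + TN)

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3), round(precision, 3))
```

At this threshold the model is much better at rejecting non-converters than at catching converters, which is why a lower cut-off is worth exploring.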

#### Plotting the ROC Curve

An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

In [None]:
# UDF to draw ROC curve 
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final["Converted"], y_train_pred_final["Converted_Prob"], drop_intermediate = False )

In [None]:
# Drawing ROC curve for Train Set
draw_roc(y_train_pred_final["Converted"], y_train_pred_final["Converted_Prob"])

<strong><span style="color:Blue">NOTE:</span></strong> Area under ROC curve is 0.88 out of 1 which indicates a good predictive model

#### Finding Optimal Cutoff Point/ Probability
- It is that probability where we get `balanced sensitivity and specificity`
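
One way to locate such a balance point without a fixed grid of cutoffs is to take the thresholds returned by `roc_curve` and pick the one where sensitivity and specificity are closest. This is a sketch on synthetic scores, not the notebook's method; the data and variable names are assumptions. Note that `roc_curve` treats a score equal to the threshold as positive.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)

# Synthetic scores: positives tend to score higher than negatives.
scores = np.clip(0.35 + 0.3 * y_true + rng.normal(scale=0.2, size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Sensitivity is tpr and specificity is 1 - fpr; find where they are closest.
gap = np.abs(tpr - (1 - fpr))
best = np.argmin(gap)
cutoff = thresholds[best]

print("cutoff:", round(float(cutoff), 3))
print("sensitivity:", round(float(tpr[best]), 3), "specificity:", round(float(1 - fpr[best]), 3))
```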

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final['Converted_Prob'].map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [float(x)/10 for x in range(10)]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final["Converted"], y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

from scipy.interpolate import interp1d
from scipy.optimize import fsolve

# Finding the intersection points of the sensitivity and accuracy curves
sensi_interp = interp1d(cutoff_df['prob'], cutoff_df['sensi'], kind='linear')
acc_interp = interp1d(cutoff_df['prob'], cutoff_df['accuracy'], kind='linear')
# fsolve returns a 1-element array, so take element [0] before rounding
intersection_1 = np.round(fsolve(lambda x : sensi_interp(x) - acc_interp(x), 0.5)[0], 3)

# Find the intersection points of the specificity and accuracy curves
speci_interp = interp1d(cutoff_df['prob'], cutoff_df['speci'], kind='linear')
intersection_2 = np.round(fsolve(lambda x : speci_interp(x) - acc_interp(x), 0.5)[0], 3)

# Calculate the average of the two intersection points
intersection_x = (intersection_1 + intersection_2) / 2

# Interpolate the accuracy, sensitivity, and specificity at the intersection point
accuracy_at_intersection = np.round(float(acc_interp(intersection_x)), 2)
sensitivity_at_intersection = np.round(float(sensi_interp(intersection_x)), 2)
specificity_at_intersection = np.round(float(speci_interp(intersection_x)), 2)

# Plot the three curves and add vertical and horizontal lines at intersection point
cutoff_df.plot.line(x='prob', y=['accuracy', 'sensi', 'speci'])
plt.axvline(x=intersection_x, color='grey',linewidth=0.55, linestyle='--')
plt.axhline(y=accuracy_at_intersection, color='grey',linewidth=0.55, linestyle='--')

# Adding annotation to display the (x,y) intersection point coordinates 
plt.annotate(f'({intersection_x} , {accuracy_at_intersection})',
             xy=(intersection_x, accuracy_at_intersection),
             xytext=(0,20),
             textcoords='offset points',
             ha='center',
             fontsize=9)

# Displaying the plot
plt.show()


<strong><span style="color:Blue">NOTE:</span></strong> The curves meet at approximately 0.351, so we take 0.351 as the `Optimal cutoff point` for the probability threshold.
- Let's map the probabilities to predictions again using the optimal cutoff point.

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final['Converted_Prob'].map( lambda x: 1 if x > 0.351 else 0)

# deleting the unwanted columns from dataframe
y_train_pred_final.drop([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,"Predicted"],axis = 1, inplace = True) 
y_train_pred_final.head()

#### Calculating all metrics using confusion matrix for Train

In [None]:
# Checking the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final["Converted"], y_train_pred_final["final_predicted"]))

# Accuracy can also be derived from the confusion matrix; a UDF computes all the metrics in one go

In [None]:
# UDF for all Logistic Regression Metrics
def logreg_all_metrics(confusion_matrix):
    TN =confusion_matrix[0,0]
    TP =confusion_matrix[1,1]
    FP =confusion_matrix[0,1]
    FN =confusion_matrix[1,0]
    
    accuracy = (TN+TP)/(TN+TP+FN+FP)
    sensi = TP/(TP+FN)
    speci = TN/(TN+FP)
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    TPR = TP/(TP + FN)
    TNR = TN/(TN + FP)
    
    # False positive rate - predicting conversion for a lead that did not convert
    FPR = FP/(FP + TN)     
    FNR = FN/(FN +TP)
    
    print ("True Negative                    : ", TN)
    print ("True Positive                    : ", TP)
    print ("False Negative                   : ", FN)
    print ("False Positive                   : ", FP)
    
    print ("Model Accuracy                   : ", round(accuracy,4))
    print ("Model Sensitivity                : ", round(sensi,4))
    print ("Model Specificity                : ", round(speci,4))
    print ("Model Precision                  : ", round(precision,4))
    print ("Model Recall                     : ", round(recall,4))
    print ("Model True Positive Rate (TPR)   : ", round(TPR,4))
    print ("Model False Positive Rate (FPR)  : ", round(FPR,4))
    
    

In [None]:
# Computing the confusion matrix for the 'y_train_pred_final' df
confusion_matrix = metrics.confusion_matrix(y_train_pred_final['Converted'], y_train_pred_final['final_predicted'])
print("*"*50,"\n")

# Displaying the confusion matrix
print("Confusion Matrix")
print(confusion_matrix,"\n")

print("*"*50,"\n")

# Using UDF to calculate all metrics of logistic regression
logreg_all_metrics(confusion_matrix)

print("\n")
print("*"*50,"\n")

#### Precision and recall tradeoff
- Let's compare the metrics from the Precision-Recall view with those from the Sensitivity-Specificity view, and pick the probability threshold better suited to boosting the conversion rate to 80%, as asked by the CEO.

In [None]:
# Computing precision and recall over all thresholds for the tradeoff curve
p, r, thresholds = precision_recall_curve(y_train_pred_final['Converted'], y_train_pred_final['Converted_Prob'])

In [None]:
# plot precision-recall tradeoff curve
plt.plot(thresholds, p[:-1], "g-", label="Precision")
plt.plot(thresholds, r[:-1], "r-", label="Recall")

# mark the approximate intersection threshold
plt.axvline(x=0.39, color='teal',linewidth = 0.55, linestyle='--')

# add legend and axis labels
plt.legend(loc='lower left')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')

plt.show()

<strong><span style="color:Blue">NOTE:</span></strong> The point where the two curves intersect is the threshold at which the model balances precision and recall. It can be used to tune the model's performance to the business requirement. From the curve above, this probability threshold is approximately 0.39.
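
As a hedged sketch, the balance point can also be found from the arrays returned by `precision_recall_curve` rather than by eye. The data below is synthetic and the names `balance_idx` and `balance_threshold` are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and predicted probabilities (illustration only)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.35 * y_true + rng.normal(0.35, 0.18, size=2000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more element than thresholds, so drop the last
gap = np.abs(precision[:-1] - recall[:-1])
balance_idx = int(np.argmin(gap))
balance_threshold = thresholds[balance_idx]

print(round(float(balance_threshold), 3),
      round(float(precision[balance_idx]), 3),
      round(float(recall[balance_idx]), 3))
```

Precision equals recall exactly when the number of false positives equals the number of false negatives, which is why the two curves are expected to cross at a single threshold.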

In [None]:
# copying df to test model evaluation with precision recall threshold of 0.39
y_train_precision_recall = y_train_pred_final.copy()

In [None]:
# assigning a feature for the 0.39 cutoff from the precision-recall curve to see which view is best (sensi-speci or precision-recall)
y_train_precision_recall['precision_recall_prediction'] = y_train_precision_recall['Converted_Prob'].map( lambda x: 1 if x > 0.39 else 0)
y_train_precision_recall.head()

In [None]:
# Let's see all metrics at the 0.39 cutoff from the precision-recall view and compare them with the 0.351 cutoff from the sensi-speci view

# Computing the confusion matrix for the 'y_train_precision_recall' df
confusion_matrix = metrics.confusion_matrix(y_train_precision_recall['Converted'], y_train_precision_recall['precision_recall_prediction'])
print("*"*50,"\n")

# Displaying the confusion matrix
print("Confusion Matrix")
print(confusion_matrix,"\n")

print("*"*50,"\n")

# Using UDF to calculate all metrics of logistic regression
logreg_all_metrics(confusion_matrix)

print("\n")
print("*"*50,"\n")


<strong><span style="color:Blue">NOTE:</span></strong> 
- As the metrics above show, with the precision-recall cut-off of 0.39 the True Positive Rate (Sensitivity/Recall) drops to around 73%, whereas the business objective requires it to be close to 80%.
- With the sensitivity-specificity cut-off of 0.351 the metric values are close to 80%. So, we will go with the sensitivity-specificity view for the optimal cut-off for final predictions.


### <strong><span style="color:purple"> Adding `Lead Score` Feature to Training dataframe </span></strong> 
- A higher score would mean that the lead is hot, i.e. is most likely to convert 
- Whereas a lower score would mean that the lead is cold and will mostly not get converted.
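
A minimal sketch of the idea, assuming the score is simply the predicted conversion probability scaled to 0-100. The probabilities below are hypothetical.

```python
import pandas as pd

# Hypothetical predicted conversion probabilities
scores = pd.DataFrame({"Converted_Prob": [0.03, 0.351, 0.62, 0.987]})

# Scale to 0-100 and round to the nearest integer
scores["Lead_Score"] = (scores["Converted_Prob"] * 100).round().astype(int)

# Sort so the hottest leads come first
print(scores.sort_values("Lead_Score", ascending=False))
```

Because the score is a monotonic rescaling of the probability, ranking leads by `Lead_Score` gives the same order as ranking by probability, up to ties introduced by rounding.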

In [None]:
# Let's add the Lead Score (conversion probability scaled to 0-100)

y_train_pred_final['Lead_Score'] = y_train_pred_final['Converted_Prob'].map( lambda x: round(x*100))
y_train_pred_final.head()

## <p id="11">11. Predictions on Test Set</p>

#### Scaling Test dataset

In [None]:
# fetching int64 and float64 dtype columns from dataframe for scaling
num_cols=X_test.select_dtypes(include=['int64','float64']).columns

# scaling columns
X_test[num_cols] = scaler.transform(X_test[num_cols])

X_test = X_test[rfe_cols]
X_test.head()
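
The test set is scaled with `transform` only, so it reuses the minimum and maximum learned from the training data. The toy sketch below shows the effect, assuming a `MinMaxScaler` fitted on train; the column name `feature` and its values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with a hypothetical numeric column
train = pd.DataFrame({"feature": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"feature": [2.0, 12.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=0, max=10 from train
test_scaled = scaler.transform(test)         # reuses the training min/max

print(test_scaled.ravel())
```

A test value above the training maximum scales to more than 1 (12 becomes 1.2 here). That is expected behaviour; refitting the scaler on the test set would leak test information into the preprocessing.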

#### Prediction on Test Dataset using final model 

In [None]:
# Adding constant value
X_test_sm = sm.add_constant(X_test)
X_test_sm.shape

In [None]:
# making prediction using model 6 (final model)
y_test_pred = logm.predict(X_test_sm)

In [None]:
# Changing to dataframe of predicted probability
y_test_pred = pd.DataFrame(y_test_pred)
y_test_pred.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)
y_test_df.head()

In [None]:
# Copying the index into a 'Prospect ID' column
y_test_df['Prospect ID'] = y_test_df.index

# Removing index for both dataframes to append them side by side 
y_test_pred.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

# Appending y_test_df and y_test_pred
y_pred_final = pd.concat([y_test_df, y_test_pred],axis=1)
y_pred_final.head()
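
The indexes are reset because `pd.concat(..., axis=1)` aligns rows on index labels, not on position. The toy example below, with made-up values, illustrates what can happen when two objects carry different indexes.

```python
import pandas as pd

# Actual labels keep their original row labels; predictions have a fresh 0..n-1 index
actual = pd.Series([1, 0, 1], index=[101, 205, 309], name="Converted")
predicted = pd.Series([0.9, 0.2, 0.6], name="Converted_Prob")

# Without resetting, concat aligns on index labels and produces NaN-filled rows
misaligned = pd.concat([actual, predicted], axis=1)

# Resetting both indexes lines the rows up by position
aligned = pd.concat(
    [actual.reset_index(drop=True), predicted.reset_index(drop=True)], axis=1
)

print(misaligned.shape, aligned.shape)
```

Resetting is only safe when both objects are already in the same row order, which is the assumption made here.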

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

# Rearranging the columns
y_pred_final = y_pred_final.reindex(['Prospect ID','Converted','Converted_Prob'], axis=1)

y_pred_final.head()

In [None]:
# applying the 0.351 probability cutoff chosen with the sensitivity-specificity method during training
y_pred_final['final_predicted'] = y_pred_final['Converted_Prob'].map(lambda x: 1 if x > 0.351 else 0)
y_pred_final.head()

In [None]:
# Drawing ROC curve for Test Set
fpr, tpr, thresholds = metrics.roc_curve(y_pred_final["Converted"], y_pred_final["Converted_Prob"], drop_intermediate = False )

draw_roc(y_pred_final["Converted"], y_pred_final["Converted_Prob"])

<strong><span style="color:Blue">NOTE:</span></strong> The area under the ROC curve is 0.87 (out of a maximum of 1), which indicates a good predictive model.
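
The AUC can be read as the probability that a randomly chosen converted lead receives a higher predicted probability than a randomly chosen non-converted lead. A tiny sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.75
```

Here three of the four (converted, non-converted) pairs are ranked correctly, which gives 3/4 = 0.75.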

<strong><span style="color:Blue">NOTE:</span></strong> 
- Now that the final predictions have been made, the next step would be to evaluate the performance of the predictive model on a test set. 
- We will do this by comparing the predicted labels (final_predicted) to the actual labels (Converted) to compute various performance metrics such as accuracy, precision, recall, etc.

#### Test set Model Evaluation
- Calculating all metrics using confusion matrix for Test set

In [None]:
# Computing the confusion matrix for the 'y_pred_final' df (test set)
confusion_matrix = metrics.confusion_matrix(y_pred_final['Converted'], y_pred_final['final_predicted'])
print("*"*50,"\n")

# Displaying the confusion matrix
print("Confusion Matrix")
print(confusion_matrix,"\n")

print("*"*50,"\n")

# Using UDF to calculate all metrics of logistic regression
logreg_all_metrics(confusion_matrix)

print("\n")
print("*"*50,"\n")

In [None]:
# features and their coefficients from the final model
parameters=logm.params.sort_values(ascending=False)
parameters
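
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which can be easier to interpret. The coefficients below are hypothetical; the real values come from `logm.params`.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients (log-odds); not the notebook's fitted values
params = pd.Series({"const": -1.2, "feature_a": 2.0, "feature_b": -0.7})

# Odds ratio = exp(coefficient); the intercept is dropped as it is not a feature effect
odds_ratios = np.exp(params.drop("const")).sort_values(ascending=False)
print(odds_ratios.round(3))
```

An odds ratio above 1 means the feature raises the odds of conversion and a value below 1 lowers them. For scaled numeric features the ratio refers to a one-unit change on the scaled axis.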

### <strong><span style="color:purple"> Adding `Lead Score` Feature to Test dataframe </span></strong> 
- A higher score would mean that the lead is hot, i.e. is most likely to convert 
- Whereas a lower score would mean that the lead is cold and will mostly not get converted.

In [None]:
# Let's add the Lead Score (conversion probability scaled to 0-100)

y_pred_final['Lead_Score'] = y_pred_final['Converted_Prob'].map( lambda x: round(x*100))
y_pred_final.head()

<strong><span style="color:purple">Lead Score: </span></strong> Lead Score is assigned to the customers
- The customers with a higher lead score have a higher conversion chance 
- The customers with a lower lead score have a lower conversion chance.

## <p id="12">12. Conclusion</p>

## 📌 Train - Test
### <strong><span style="color:purple">Train Data Set:</span></strong>   

- <strong><span style="color:Green">Accuracy:</span></strong> 78.79%

- <strong><span style="color:Green">Sensitivity:</span></strong> 77.9%

- <strong><span style="color:Green">Specificity:</span></strong> 79.31%

### <strong><span style="color:purple">Test Data Set:</span></strong> 

- <strong><span style="color:Green">Accuracy:</span></strong> 78.72%

- <strong><span style="color:Green">Sensitivity:</span></strong> 76.89%

- <strong><span style="color:Green">Specificity:</span></strong> 79.99%
 

<strong><span style="color:Blue">NOTE:</span></strong> The train and test metrics are within about one percentage point of each other, which indicates that the model performs consistently on both the train and test datasets.

- The model achieved a `sensitivity of 77.9%` on the train set and 76.89% on the test set, using a cut-off value of 0.351.
- Sensitivity here is the share of leads that actually convert which the model correctly identifies as converting.
- `The CEO of X Education had set a target sensitivity of around 80%.`
- The model also achieved an accuracy of about 78.7% on both sets, which is in line with the study's objectives.
<hr/>



## 📌 Model parameters
- The final Logistic Regression Model has 11 features

### <strong><span style="color:purple">`Top 3 features` that contribute `positively` to predicting hot leads in the model are:</span></strong> 
- <strong><span style="color:Green">Lead Source_Welingak Website</span></strong>

- <strong><span style="color:Green">Total Time Spent on Website</span></strong> 

- <strong><span style="color:Green">Occupation_Working Professional</span></strong> 

<strong><span style="color:Blue">NOTE: </span></strong> The optimal cutoff probability is 0.351. A lead with a conversion probability greater than 0.351 is predicted as Converted (Hot lead), and a lead with a probability of 0.351 or less is predicted as not Converted (Cold lead).
<hr/>

# ✅<strong><span style="color:brown">Recommendations </span></strong> 

### <strong><span style="color:purple">To increase our Lead Conversion Rates: </span></strong>  

- Focus on features with positive coefficients for targeted marketing strategies.
- Develop strategies to attract high-quality leads from top-performing lead sources.
- Engage working professionals with tailored messaging.
- Optimize communication channels based on lead engagement impact.
- Spend more of the budget on the Welingak Website, e.g. on advertising.
- Offer incentives/discounts for references that convert into leads, to encourage more references.
- Target working professionals aggressively, as they have a high conversion rate and are likely to be in a better financial position to pay higher fees.


### <strong><span style="color:purple">To identify areas of improvement: </span></strong>  

- Analyze negative coefficients in specialization offerings.
- Review landing page submission process for areas of improvement.



