# STEP-1  -->  LOAD DATASET & EXPLORE #

In [1]:
# Import necessary libraries #

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the dataset
df = pd.read_csv('Leads.csv')

# Display the first few rows of the dataset
df.head()

# Summary of the dataset (to understand data types and missing values)
df.info()

# Basic descriptive statistics of numerical columns
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

| | Lead Number | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Asymmetrique Activity Score | Asymmetrique Profile Score |
|---|---|---|---|---|---|---|---|
| count | 9240.0 | 9240.0 | 9103.0 | 9240.0 | 9103.0 | 5022.0 | 5022.0 |
| mean | 617188.435606 | 0.38539 | 3.445238 | 487.698268 | 2.36282 | 14.306252 | 16.344883 |
| std | 23405.995698 | 0.486714 | 4.854853 | 548.021466 | 2.161418 | 1.386694 | 1.811395 |
| min | 579533.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 11.0 |
| 25% | 596484.5 | 0.0 | 1.0 | 12.0 | 1.0 | 14.0 | 15.0 |
| 50% | 615479.0 | 0.0 | 3.0 | 248.0 | 2.0 | 14.0 | 16.0 |
| 75% | 637387.25 | 1.0 | 5.0 | 936.0 | 3.0 | 15.0 | 18.0 |
| max | 660737.0 | 1.0 | 251.0 | 2272.0 | 55.0 | 18.0 | 20.0 |


### KEY OBSERVATIONS ###

1. Dataset Size: 9240 rows and 37 columns.

2. Target Variable: Converted is the target variable. It's binary, indicating whether a lead was converted (1) or not (0).

3. Missing Values: Some columns have missing values, such as:

    --> Lead Source: 36 missing values.

    --> TotalVisits, Page Views Per Visit: 137 missing values each

    --> Country, Specialization, How did you hear about X Education, etc. have significant missing values.

    --> Some columns, like Lead Quality, have over 50% missing data, which might require either imputation or dropping.

4. Categorical Variables: Many columns are categorical and would need encoding.

5. Select Placeholder: We need to check if any categorical columns have a "Select" option that acts as a missing value.
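
The "Select" check in point 5 can be sketched as follows. This is a minimal illustration on a made-up frame (`toy` and its values are assumptions, not rows from Leads.csv); on the real data the same two lines would run against `df`.

```python
import pandas as pd

# Toy frame standing in for Leads.csv (values are illustrative only)
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance", "Select", None],
    "City": ["Mumbai", "Select", "Pune", "Mumbai"],
    "Converted": [1, 0, 0, 1],
})

# Only non-numeric columns can hold the "Select" placeholder
categorical = toy.select_dtypes(exclude="number")

# Count the placeholder per column and keep only columns where it occurs
select_counts = (categorical == "Select").sum()
select_counts = select_counts[select_counts > 0].sort_values(ascending=False)
print(select_counts)
```

Any column that shows up in `select_counts` has form-default values that should be treated as missing before the missing-value percentages are computed.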

# STEP-2  -->  DATA PREPROCESSING #

In this step, we'll:
1. Handle missing values.
2. Remove irrelevant columns.
3. Replace "Select" in categorical variables with NaN.
4. Handle categorical variables.

In [9]:
# Check missing values

missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0]
missing_data.sort_values(by='Percentage', ascending=False)

| | Missing Values | Percentage |
|---|---|---|
| How did you hear about X Education | 7250 | 78.463203 |
| Lead Profile | 6855 | 74.188312 |
| Asymmetrique Activity Score | 4218 | 45.649351 |
| Asymmetrique Profile Index | 4218 | 45.649351 |
| Asymmetrique Activity Index | 4218 | 45.649351 |
| Asymmetrique Profile Score | 4218 | 45.649351 |
| City | 3669 | 39.707792 |
| Specialization | 3380 | 36.580087 |
| Tags | 3353 | 36.287879 |
| What matters most to you in choosing a course | 2709 | 29.318182 |


### KEY OBSERVATIONS ###

#### Based on the table, here's what we can conclude and decide about handling the missing values:

1. High Missing Values: 
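--> Columns with a large share of missing values were dropped to avoid introducing noise into the model: Lead Quality (over 50% missing, as noted in Step 1) and the four Asymmetrique score/index columns (about 45.6% each in the table above, so slightly under the 50% mark).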

2. Moderate Missing Values:
--> Columns like Tags, Country, Occupation, etc., had roughly 20-40% missing data (Tags, Specialization, and City are at 36-40% in the table above). For these, we imputed missing values with 'Unknown' or 'Other' to retain useful information.

3. Low Missing Values:
--> Columns with less than 2% missing values (e.g., Page Views Per Visit, TotalVisits, Last Activity) were imputed using the median for numerical columns and the most frequent value for categorical columns.
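
The three tiers above can be expressed as a small helper. This is only a sketch: `missing_value_plan`, its cut-offs (`drop_above=45.0`, `low_below=2.0`), and the toy frame are assumptions chosen for illustration, and the notebook itself makes some per-column exceptions to any such rule.

```python
import numpy as np
import pandas as pd

def missing_value_plan(frame, drop_above=45.0, low_below=2.0):
    """Label each column that has missing data with a handling strategy.

    The cut-offs are illustrative assumptions, not values fixed by the notebook.
    """
    pct = frame.isnull().mean() * 100          # percentage missing per column
    pct = pct[pct > 0]                         # ignore complete columns
    plan = pd.Series("impute-constant", index=pct.index)
    plan[pct >= drop_above] = "drop"           # high missing share
    plan[pct < low_below] = "impute-statistic" # median / mode imputation
    return pd.DataFrame({"Percentage": pct, "Plan": plan})

# Toy frame with 100 rows and known missing shares
toy = pd.DataFrame({
    "mostly_empty": [np.nan] * 60 + [1.0] * 40,   # 60% missing
    "partly_empty": [None] * 30 + ["a"] * 70,     # 30% missing
    "nearly_full": [np.nan] * 1 + [2.0] * 99,     # 1% missing
    "complete": range(100),                       # no missing values
})

plan = missing_value_plan(toy)
print(plan)
```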

In [10]:
# Replace 'Select' with NaN

df.replace('Select', np.nan, inplace=True)

In [12]:
# Drop the Asymmetrique score/index columns (about 45.6% missing each)

df.drop(['Asymmetrique Profile Score', 'Asymmetrique Activity Score', 'Asymmetrique Profile Index', 'Asymmetrique Activity Index'], axis=1, inplace=True)

In [13]:
# Impute missing categorical values with 'Unknown'

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which raises a FutureWarning and stops working in pandas 3.0
df['Tags'] = df['Tags'].fillna('Unknown')
df['What matters most to you in choosing a course'] = df['What matters most to you in choosing a course'].fillna('Unknown')
df['What is your current occupation'] = df['What is your current occupation'].fillna('Other')
df['Country'] = df['Country'].fillna('Other')
df['Specialization'] = df['Specialization'].fillna('Other')
df['City'] = df['City'].fillna('Other')

In [14]:
# Impute missing numerical values with median
# (assigned back rather than using inplace=True, which will not work in pandas 3.0)
df['Page Views Per Visit'] = df['Page Views Per Visit'].fillna(df['Page Views Per Visit'].median())
df['TotalVisits'] = df['TotalVisits'].fillna(df['TotalVisits'].median())

# Impute missing categorical values with the most frequent value
df['Last Activity'] = df['Last Activity'].fillna(df['Last Activity'].mode()[0])
df['Lead Source'] = df['Lead Source'].fillna(df['Lead Source'].mode()[0])

In [15]:
# Check if there are still missing values

df.isnull().sum()


Lead Origin                                         0
Lead Source                                         0
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                         0
Total Time Spent on Website                         0
Page Views Per Visit                                0
Last Activity                                       0
Country                                             0
Specialization                                      0
How did you hear about X Education               7250
What is your current occupation                     0
What matters most to you in choosing a course       0
Search                                              0
Magazine                                            0
Newspaper Article                                   0
X Education Forums                                  0
Newspaper                   

### Updated Analysis of Missing Values:

Columns Still Containing Missing Values:
1. How did you hear about X Education - about 78% missing values (7250 of 9240 rows)

---> We can either impute with "Unknown" or drop this column, depending on its importance. Since it's an informational column, we impute with "Unknown" here to retain the column.

2. Lead Profile - 74% missing

---> With 74% missing data, this column might not provide significant value, so it's better to drop it.


In [16]:
# Impute "Unknown"

df['How did you hear about X Education'] = df['How did you hear about X Education'].fillna('Unknown')

# Drop Lead Profile Column

df.drop('Lead Profile', axis=1, inplace=True)


In [17]:
# Check if there are still missing values

df.isnull().sum()

Lead Origin                                      0
Lead Source                                      0
Do Not Email                                     0
Do Not Call                                      0
Converted                                        0
TotalVisits                                      0
Total Time Spent on Website                      0
Page Views Per Visit                             0
Last Activity                                    0
Country                                          0
Specialization                                   0
How did you hear about X Education               0
What is your current occupation                  0
What matters most to you in choosing a course    0
Search                                           0
Magazine                                         0
Newspaper Article                                0
X Education Forums                               0
Newspaper                                        0
Digital Advertisement          

As the output shows, no missing values remain in the listed columns.

# STEP-3 ---> ENCODING CATEGORICAL VARIABLES

Logistic regression models require numerical inputs, so we need to convert categorical variables into a numerical format.

We can handle it in 2 ways:
1. Binary/Boolean Variables: For variables like 'Do Not Email' and 'Do Not Call', which have only two values, we can convert them into '0' and '1'.

2. One-Hot Encoding: For categorical variables with more than two categories (e.g., Lead Source, Last Activity), we can apply one-hot encoding to convert them into dummy variables.
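
Both encodings can be seen side by side on a tiny made-up frame. This is a sketch only: `toy` and its three rows are assumptions for illustration, and `dtype=int` is used so the dummy columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with one binary and one multi-category column (illustrative values)
toy = pd.DataFrame({
    "Do Not Email": ["Yes", "No", "No"],
    "Lead Source": ["Google", "Direct Traffic", "Olark Chat"],
})

# Binary column: map the two labels straight to 1/0
toy["Do Not Email"] = toy["Do Not Email"].map({"Yes": 1, "No": 0})

# Multi-category column: one dummy per category, minus the first one in sorted order
encoded = pd.get_dummies(toy, columns=["Lead Source"], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, the first category in sorted order ("Direct Traffic" here) becomes the baseline and is represented by all dummy columns being 0, which avoids perfectly collinear features in the regression.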

In [18]:
# Convert 'Yes'/'No' to 1/0 for binary variables

df['Do Not Email'] = df['Do Not Email'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Do Not Call'] = df['Do Not Call'].apply(lambda x: 1 if x == 'Yes' else 0)

In [19]:
# Apply one-hot encoding to categorical variables

df = pd.get_dummies(df, drop_first=True)

# STEP-4 ---> MODEL BUILDING WITH LOGISTIC REGRESSION

We'll split the dataset into training and testing sets, build the logistic regression model, and evaluate its performance.

In [21]:
# Split the Data: We'll divide the data into training and testing sets, typically using an 80-20 or 70-30 split.

from sklearn.model_selection import train_test_split

# Define the feature set (X) and target variable (y)
X = df.drop('Converted', axis=1)
y = df['Converted']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model

log_reg = LogisticRegression(max_iter=1000)

# Fit the model on the training data

log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
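
The warning above recommends scaling the data. One possible way to do that is to wrap the model in a scikit-learn pipeline, sketched below under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train`, and `scaled_log_reg` is a hypothetical name not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data: two features on very different scales
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(500, 550, size=400),   # large-scale feature, e.g. seconds on site
    rng.integers(0, 2, size=400),     # 0/1 feature, e.g. a dummy variable
])
y_demo = (X_demo[:, 0] + rng.normal(0, 200, size=400) > 500).astype(int)

# Standardize each feature, then fit the classifier.
# Inside a pipeline the scaler is fit only on the data passed to fit().
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_demo, y_demo)

# Number of solver iterations actually used
print(scaled_log_reg.named_steps["logisticregression"].n_iter_)
```

On the real data the same pipeline would be fit with `scaled_log_reg.fit(X_train, y_train)`; whether it removes the warning there would need to be checked by running it.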


In [23]:
# Predict the target variable on the test data

y_pred = log_reg.predict(X_test)

In [24]:
# Evaluate the Model: Assess the model's performance using metrics like accuracy, precision, recall, F1 score, and ROC-AUC.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Evaluate performance

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the evaluation metrics

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC-AUC Score: {roc_auc}")

Accuracy: 0.9366883116883117
Precision: 0.9382022471910112
Recall: 0.9014844804318488
F1 Score: 0.9194769442532691
ROC-AUC Score: 0.9308687081472705


### Model Performance Analysis:

1. Accuracy (93.67%):
   --> The model correctly classified 93.67% of the leads in the test set, indicating strong overall performance.

2. Precision (93.82%):
   --> Of the leads that the model predicted as "converted," 93.82% were actually converted. The model therefore makes few false positive predictions, so little sales effort would be spent on leads that do not convert.

3. Recall (90.15%):
   --> The recall score of 90.15% means that the model identified 90.15% of all the actual conversions. This shows that the model is able to detect most of the leads that will convert, though there is still a small portion of actual conversions that it misses.

4. F1 Score (91.95%):
   --> The F1 score, which is the harmonic mean of precision and recall, is quite high at 91.95%. This shows a strong balance between precision and recall.

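5. ROC-AUC Score (93.09%):
   --> This value was computed from the hard 0/1 predictions (y_pred), not from predicted probabilities, so it equals the balanced accuracy at the default 0.5 cut-off (the mean of recall and specificity) rather than a threshold-free AUC. At 93.09% it still indicates that the model separates converted from non-converted leads well at that cut-off.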
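
The ROC-AUC above is computed from `y_pred`, which holds hard 0/1 labels. A minimal sketch on made-up numbers (the arrays below are assumptions, not values from Leads.csv) shows how that differs from passing predicted probabilities, which is the usual input for this metric:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities for six leads
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])
labels = (proba >= 0.5).astype(int)   # hard predictions at the 0.5 cut-off

auc_from_labels = roc_auc_score(y_true, labels)   # what passing y_pred computes
auc_from_proba = roc_auc_score(y_true, proba)     # threshold-free AUC
bal_acc = balanced_accuracy_score(y_true, labels)

print(auc_from_labels, bal_acc, auc_from_proba)
```

With hard labels the score coincides with balanced accuracy, while the probability-based score reflects how well the model ranks leads across all thresholds. In the notebook, the probability-based version would be `roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])`.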

### Interpretation:

--> The model is quite effective at predicting lead conversions. High precision means that most of the leads flagged as "hot" are indeed likely to convert, and high recall means that the model also identifies most of the actual conversions.

--> The company can use this model to target leads with high scores and allocate sales resources more efficiently.

#### Let's proceed with assigning lead scores to each lead based on the logistic regression model's predictions.


# STEP-5 ---> Assign Lead Scores (0 to 100)

We’ll use the predicted probabilities from the logistic regression model to assign a lead score to each lead. The higher the probability that a lead will convert, the higher the lead score. These scores will range from 0 to 100.

In [26]:
# Get the Predicted Probabilities: 
# Instead of just predicting 0 or 1 (converted or not), we will get the probability that each lead will convert.

y_pred_proba = log_reg.predict_proba(X_test)[:, 1]  # Probability for the positive class (Converted)

In [27]:
# Convert Probabilities to Lead Scores: 
# Multiply the predicted probabilities by 100 to get a score between 0 and 100.

lead_scores = y_pred_proba * 100

In [28]:
# Assign Lead Scores to the Test Data

X_test['Lead Score'] = lead_scores
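
# Note: adding 'Lead Score' to X_test changes the feature matrix, so calling
# log_reg.predict(X_test) again afterwards would fail because the columns no longer
# match the ones used in training. An alternative is to keep the scores in a separate
# frame that shares X_test's index. The sketch below uses made-up stand-ins
# (`X_demo`, `proba_demo`, and `results` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_test and y_pred_proba
X_demo = pd.DataFrame({"TotalVisits": [3.0, 0.0, 5.0]}, index=[4608, 7935, 4043])
proba_demo = np.array([0.94, 0.02, 0.45])

# Keep scores in their own frame, aligned to the feature rows by index
results = pd.DataFrame(
    {"Lead Score": (proba_demo * 100).round(2)},
    index=X_demo.index,
)

print(results)
print(X_demo.columns.tolist())  # feature matrix is unchanged
```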

In [29]:
# View the Lead Scores

X_test[['Lead Score']].head()

| | Lead Score |
|---|---|
| 4608 | 93.792583 |
| 7935 | 1.790261 |
| 4043 | 0.319903 |
| 7821 | 0.340177 |
| 856 | 4.834153 |


### Interpretation:

High Scores (e.g., 93.79):

  --> Leads with high scores are predicted to have a high probability of conversion.

  --> These are your "Hot Leads", and your sales team should prioritize these leads for follow-up.

Low Scores (e.g., 0.32):

  --> Leads with low scores are predicted to have a low probability of conversion.  
  
  --> These leads are less likely to convert and may not warrant immediate attention.

# STEP-6 ---> SETTING A THRESHOLD FOR LEAD SCORE

Setting a threshold helps categorize leads into different levels based on their likelihood to convert. This will allow you to prioritize which leads to focus on.

Let's assume the following thresholds:

Hot Leads: Score ≥ 70

Warm Leads: 30 ≤ Score < 70

Cold Leads: Score < 30

In [31]:
# Define thresholds
def categorize_lead(score):
    if score >= 70:
        return 'Hot Lead'
    elif score >= 30:
        return 'Warm Lead'
    else:
        return 'Cold Lead'

# Apply thresholds to create a new column for lead categories
X_test['Lead Category'] = X_test['Lead Score'].apply(categorize_lead)
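
# An equivalent, vectorized way to apply the same thresholds is pandas' `pd.cut`.
# The sketch below uses made-up scores chosen to sit on the bin edges; `scores` and
# `categories` are hypothetical names, and `right=False` makes each bin include its
# lower edge, matching the >= comparisons in categorize_lead.

```python
import numpy as np
import pandas as pd

# Illustrative scores, including values exactly on the 30 and 70 boundaries
scores = pd.Series([0.0, 29.99, 30.0, 69.99, 70.0, 100.0])

# Bins are [-inf, 30), [30, 70), [70, inf)
categories = pd.cut(
    scores,
    bins=[-np.inf, 30, 70, np.inf],
    right=False,
    labels=["Cold Lead", "Warm Lead", "Hot Lead"],
)

print(categories.tolist())
```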

In [32]:
# Check the distribution of leads across different categories.

lead_category_distribution = X_test['Lead Category'].value_counts()
print(lead_category_distribution)

Lead Category
Cold Lead    1077
Hot Lead      665
Warm Lead     106
Name: count, dtype: int64


### Interpretation:

1. Cold Leads (1077):

   --> These leads have a low probability of conversion. 
 
   --> They should be deprioritized or monitored with minimal effort, as they are less likely to convert into customers.

2. Hot Leads (665):

   --> These leads are predicted to have a high probability of conversion. 
 
   --> They should be prioritized for immediate follow-up by the sales team, as they are the most promising leads.

3. Warm Leads (106):

   --> These leads fall in between. 
 
   --> They may require further nurturing or follow-up to determine if they can be converted into customers.
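
One way to sanity-check the chosen thresholds is to compare each category with the observed conversion rate. The sketch below uses a made-up six-row frame (`demo` and `summary` are hypothetical names, and the values are not from the test set); on the real data the `Converted` column would come from `y_test`.

```python
import pandas as pd

# Made-up categories and outcomes for six leads
demo = pd.DataFrame({
    "Lead Category": ["Hot Lead", "Hot Lead", "Warm Lead", "Warm Lead", "Cold Lead", "Cold Lead"],
    "Converted": [1, 1, 1, 0, 0, 0],
})

# Number of leads and share that actually converted, per category
summary = demo.groupby("Lead Category")["Converted"].agg(
    leads="count",
    conversion_rate="mean",
)

print(summary)
```

If the thresholds are sensible, the conversion rate should rise from Cold to Warm to Hot; a Warm rate close to the Hot rate would suggest lowering the Hot cut-off.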