<a href="https://colab.research.google.com/github/gomezphd/CAP4767-Data-Mining/blob/main/projects/01_lead_scoring/notebooks/Lead_Scoring_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🎯 Lead Scoring: Predicting Customer Conversion with Machine Learning

## The Challenge: Lead Scoring in Marketing and Sales

In the competitive world of marketing and sales, identifying which potential customers (leads) are most likely to convert into paying customers is a critical challenge. This process, known as **lead scoring**, empowers sales teams to focus their efforts on the most promising leads, improving efficiency and boosting revenue.

---

### 🔍 The Role of Machine Learning

Machine learning addresses the lead scoring problem by learning, from historical leads whose outcomes are already known, which patterns are associated with conversion. By leveraging classification models, businesses can:
- Automatically rank leads based on their likelihood of conversion.
- Gain actionable insights into the factors driving conversions.
- Improve predictions by retraining as more leads with known outcomes become available.

---

### 🤔 Lead Scoring vs. Customer Churn Prediction

While lead scoring shares similarities with customer churn prediction, the goals are fundamentally different:
- **In lead scoring**: We predict whether a potential customer will convert (e.g., sign up or make a purchase).
- **In churn prediction**: We predict whether an existing customer will leave (e.g., cancel their subscription).

Despite their opposite objectives—**acquiring new customers** versus **retaining existing ones**—both problems use similar machine learning techniques to address critical business questions.

---

### 🛠️ Let's Get Started!

In this project, we’ll work with a real-world dataset of leads to develop a machine learning model that predicts customer conversion. We’ll:
1. Explore the dataset to understand its structure and key features.
2. Preprocess the data to prepare it for modeling.
3. Train and evaluate classification models to predict lead conversion.
4. Build a scoring system that sales teams can use to prioritize leads effectively.

Ready to dive in? Let’s take a closer look at our dataset and set the stage for the project!


In [18]:
# Essential Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import xgboost as xgb



# 📊 Our Dataset and Goal

We'll be working with a **real-world leads dataset** to predict customer conversion. This dataset contains **9,240 leads described by 37 columns**, including their demographics, interactions with marketing campaigns, and other behavioral data. Our objective is to identify patterns and features that help predict whether a lead will convert into a customer.

---

### 🔑 Key Features of the Dataset:

| **Feature**                     | **Description**                                                                 |
|----------------------------------|---------------------------------------------------------------------------------|
| **Prospect ID**                 | A unique identifier for each lead.                                             |
| **Lead Origin**                  | How the lead entered the system (e.g., API, Landing Page Submission).           |
| **Lead Source**                  | The specific channel (e.g., Google, Organic Search) from which the lead was acquired. |
| **Lead Score**                   | A numerical value representing the likelihood of conversion.                   |
| **Country**                      | Geographic location of the lead.                                               |
| **TotalVisits**                  | Total number of visits to the website, indicating engagement level.            |
| **Total Time Spent on Website**  | The total time a lead spent browsing the website, reflecting interest.         |
| **Page Views Per Visit**         | Average number of pages viewed during a visit, indicating interaction depth.   |
| **Last Activity**                | The most recent action taken by the lead, such as opening an email or clicking an ad. |
| **Specialization**               | The industry or domain of the lead, potentially influencing their likelihood of conversion. |
| **What is your current occupation** | Indicates whether the lead is a student, professional, or other occupation.    |
| **Activity Level**               | Behavioral metrics, e.g., website visits, email engagement, etc.               |
| **Profile Completeness**         | Percentage of the profile completed by the lead.                               |
| **Converted** (Target)           | The target variable: `1` for converted leads, `0` otherwise.                   |

---

### 🎯 Dataset Goals:

Our goal is to leverage machine learning to:
1. **Prioritize leads** by identifying those most likely to convert.
2. **Discover actionable insights** into what drives conversions.

By achieving these, businesses can focus their marketing and sales efforts where they matter most, improving efficiency and ROI.
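
As a first look at what drives conversions, note that `Converted` is coded as 0/1, so its mean is the conversion rate, and grouping by a categorical feature shows which segments convert more often. The sketch below runs on a tiny made-up frame (the rows are illustrative, not taken from the dataset); only the column names `Lead Source` and `Converted` come from the data.

```python
import pandas as pd

# Made-up miniature of the leads table; values are illustrative only.
sample = pd.DataFrame({
    "Lead Source": ["Google", "Google", "Olark Chat",
                    "Direct Traffic", "Direct Traffic", "Olark Chat"],
    "Converted": [1, 0, 0, 1, 1, 0],
})

# Converted is 0/1, so its mean is the share of leads that converted.
overall_rate = sample["Converted"].mean()

# Conversion rate and number of leads per source, highest rate first.
rate_by_source = (
    sample.groupby("Lead Source")["Converted"]
    .agg(conversion_rate="mean", leads="count")
    .sort_values("conversion_rate", ascending=False)
)

print(f"Overall conversion rate: {overall_rate:.2f}")
print(rate_by_source)
```

On the real data the same pattern would apply to `df` once it is loaded. Segments with very few leads can show extreme rates by chance, which is why the count is kept next to the rate.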

---

### 🛠️ First Steps:

Let's get started by loading the dataset directly from its GitHub URL and taking a quick look at its structure:


In [20]:
# Read the CSV directly from GitHub raw URL
url = "https://raw.githubusercontent.com/gomezphd/CAP4767-Data-Mining/main/datasets/01_lead_scoring/data/raw/Leads.csv"
df = pd.read_csv(url)

# Display the first few rows and basic information
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
df.head()

Dataset Shape: (9240, 37)

First few rows:


| Unnamed: 0 | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.0 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.5 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.0 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No | No | 1 | 2.0 | 1428 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |
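
The preview shows the value `Select` in `Lead Profile` and `City`. One plausible reading, which is an assumption here and should be checked against the data dictionary, is that it is a form default left unchanged and therefore carries no information. Under that assumption, a minimal sketch on made-up rows converts it to a missing value and reports the share of missing data per column:

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the preview; not the real data.
preview = pd.DataFrame({
    "Lead Profile": ["Select", "Select", "Potential Lead", "Select"],
    "City": ["Select", "Select", "Mumbai", "Mumbai"],
    "TotalVisits": [0.0, 5.0, np.nan, 1.0],
})

# Assumption: "Select" means no option was chosen, so treat it as missing.
cleaned = preview.replace("Select", np.nan)

# Percentage of missing values per column, largest first.
missing_pct = (cleaned.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Counting missing values before this replacement would understate how sparse such columns are, since `isna()` does not treat a placeholder string as missing.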


## 🛠️ Next Steps: Building a Lead Scoring Workflow

To successfully predict customer conversion, our workflow will mirror approaches proven in churn prediction:

1. **Clean and prepare the leads data**: Handle missing values, normalize data, and encode categorical variables.
2. **Analyze characteristics associated with conversion**: Identify trends, correlations, and key predictive features.
3. **Build a classification model**: Train a machine learning model to predict which leads are likely to convert.
4. **Create a scoring system**: Develop an actionable lead scoring system that sales teams can use to prioritize efforts.
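
The first three steps can be sketched as a single scikit-learn pipeline. This is one possible arrangement, not necessarily the one this project ends up using: the data is randomly generated, the choice of columns is illustrative, and `OneHotEncoder` is used here for the categorical column in place of the `LabelEncoder` imported earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the leads data (randomly generated values).
rng = np.random.default_rng(0)
n = 200
leads = pd.DataFrame({
    "TotalVisits": rng.integers(0, 20, n).astype(float),
    "Total Time Spent on Website": rng.integers(0, 2000, n).astype(float),
    "Lead Source": rng.choice(["Google", "Olark Chat", "Direct Traffic"], n),
})
leads.loc[::10, "TotalVisits"] = np.nan  # simulate missing values
# Synthetic target: more time on the site makes conversion more likely.
leads["Converted"] = (
    leads["Total Time Spent on Website"] + rng.normal(0, 300, n) > 1000
).astype(int)

numeric_cols = ["TotalVisits", "Total Time Spent on Website"]
categorical_cols = ["Lead Source"]

# Step 1: impute and scale numeric columns; impute and encode categoricals.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Step 3: chain preprocessing and the classifier into one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X = leads.drop(columns="Converted")
y = leads["Converted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
print(f"Hold-out accuracy: {model.score(X_test, y_test):.2f}")
```

Keeping preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training split only, which avoids leaking information from the test split.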

---

### 🤔 Why Classification Models?

Classification models are particularly well-suited for lead scoring because they:

- **Handle diverse data types**: Work with both numerical and categorical data about leads, once the categorical columns have been encoded as numbers.
- **Capture complex patterns**: Learn intricate relationships in customer behavior and interactions.
- **Provide actionable insights**: Output probability scores that sales teams can use to rank and prioritize leads.
- **Adapt to new data**: Can be retrained as new leads and their outcomes accumulate, so predictions keep pace with changing behavior.
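
To illustrate the probability output, the sketch below fits a random forest on randomly generated data, reads the probability of class `1` from `predict_proba`, and rescales it to a 0 to 100 score. The `lead_score` and `priority` column names and the 40 and 70 cut-offs are hypothetical choices for illustration; in practice the thresholds would be set with the sales team.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data (randomly generated, not the real leads).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "TotalVisits": rng.integers(0, 20, n),
    "Total Time Spent on Website": rng.integers(0, 2000, n),
})
y = (X["Total Time Spent on Website"] + rng.normal(0, 300, n) > 1000).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Three made-up leads to score.
new_leads = pd.DataFrame({
    "TotalVisits": [1, 4, 9],
    "Total Time Spent on Website": [50, 1000, 1900],
})

# Column 1 of predict_proba is the estimated probability of class 1 (converted).
proba = clf.predict_proba(new_leads)[:, 1]

# Rescale to an integer score between 0 and 100.
scored = new_leads.assign(lead_score=np.round(proba * 100).astype(int))

# Hypothetical bands: 0-39 cold, 40-69 warm, 70-100 hot.
scored["priority"] = pd.cut(
    scored["lead_score"], bins=[-1, 39, 69, 100], labels=["cold", "warm", "hot"]
)
print(scored.sort_values("lead_score", ascending=False))
```

Sorting by the score gives sales teams a ranked call list, while the bands offer a coarser view for reporting.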

---

### 🚦 Ready to Start?

Let's begin by **exploring our data** and preparing it for modeling. This will include:
- Understanding the dataset structure.
- Checking for missing values and outliers.
- Encoding categorical variables for analysis.
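
Two of these checks can be sketched on a small made-up frame: flagging outliers with Tukey's 1.5 × IQR rule, and one-hot encoding a categorical column with `pd.get_dummies`. Both are common choices rather than the only options, and the values below are illustrative.

```python
import pandas as pd

# Made-up values; the last visit count is deliberately extreme.
visits = pd.DataFrame({
    "TotalVisits": [0, 2, 3, 4, 5, 5, 6, 7, 8, 60],
    "Lead Source": ["Google", "Olark Chat", "Google", "Direct Traffic", "Google",
                    "Olark Chat", "Google", "Direct Traffic", "Google", "Google"],
})

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = visits["TotalVisits"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (visits["TotalVisits"] < lower) | (visits["TotalVisits"] > upper)
outliers = visits[is_outlier]
print(f"Bounds: [{lower}, {upper}]")
print(outliers)

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(visits, columns=["Lead Source"])
print(encoded.columns.tolist())
```

Whether a flagged value is an error or a genuinely engaged lead is a judgment call, so it is worth inspecting outliers before capping or dropping them.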

Stay tuned as we dive into the data and transform it into insights!
