<h1>Step 1: Data Collection</h1>

* Load the Dataset

In [41]:
import pandas as pd

# Read the leads.csv file
df = pd.read_csv('Files/leads.csv')


* Explore the Dataset

In [42]:
# Display the first few rows of the dataset
print(df.head())

# Check the number of rows and columns in the dataset
print("Shape of the dataset:", df.shape)

# Review the column names
print("Column names:", df.columns)

# Get summary information about the dataset
df.info()  # info() prints its report directly and returns None, so no print() is needed

# Calculate descriptive statistics for numeric variables
print(df.describe())


                            Prospect ID  Lead Number              Lead Origin  \
0  7927b2df-8bba-4d29-b9a2-b6e0beafe620       660737                      API   
1  2a272436-5132-4136-86fa-dcc88c88f482       660728                      API   
2  8cc8c611-a219-4f35-ad23-fdfd2656bd8a       660727  Landing Page Submission   
3  0cc2df48-7cf4-4e39-9de9-19797f9b38cc       660719  Landing Page Submission   
4  3256f628-e534-4826-9d63-4a8b88782852       660681  Landing Page Submission   

      Lead Source Do Not Email Do Not Call  Converted  TotalVisits  \
0      Olark Chat           No          No          0          0.0   
1  Organic Search           No          No          0          5.0   
2  Direct Traffic           No          No          1          2.0   
3  Direct Traffic           No          No          0          1.0   
4          Google           No          No          1          2.0   

   Total Time Spent on Website  Page Views Per Visit  ...  \
0                            0 

* Identify the Target Variable

In [43]:
# "Converted" is the target variable
target_variable = 'Converted'
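
Once the target is named, it is usually worth checking how balanced it is. The sketch below does this on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from leads.csv); the same two lines could be run against `df`.

```python
import pandas as pd

# Toy stand-in for the leads data; the values are invented for illustration.
toy = pd.DataFrame({"Converted": [0, 0, 1, 0, 1, 0, 0, 1]})
target_variable = "Converted"

# Share of each class in the target (0 = not converted, 1 = converted)
class_share = toy[target_variable].value_counts(normalize=True).sort_index()

# Because the target is coded 0/1, its mean is the conversion rate
conversion_rate = toy[target_variable].mean()

print(class_share)
print(f"Conversion rate: {conversion_rate:.1%}")
```

A strongly skewed share would suggest looking beyond plain accuracy when a model is evaluated later.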

* Review Data Types

In [44]:
# Check the data types of each variable
print(df.dtypes)

Prospect ID                                       object
Lead Number                                        int64
Lead Origin                                       object
Lead Source                                       object
Do Not Email                                      object
Do Not Call                                       object
Converted                                          int64
TotalVisits                                      float64
Total Time Spent on Website                        int64
Page Views Per Visit                             float64
Last Activity                                     object
Country                                           object
Specialization                                    object
How did you hear about X Education                object
What is your current occupation                   object
What matters most to you in choosing a course     object
Search                                            object
Magazine                       
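
The dtype listing mixes numeric and text columns, and later steps often treat the two groups differently. As a minimal sketch, assuming a small invented frame named `toy`, the columns can be split with `select_dtypes`:

```python
import pandas as pd

# Toy frame reusing three column names from the listing; values are invented.
toy = pd.DataFrame({
    "Lead Origin": ["API", "Landing Page Submission", "API"],
    "Converted": [0, 1, 0],
    "TotalVisits": [0.0, 5.0, 2.0],
})

# Numeric columns (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical/text here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Note that identifier-like columns such as `Lead Number` are numeric by dtype but carry no quantitative meaning, so the split is a starting point rather than a final classification.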

* Assess Data Quality

In [45]:
# Check for missing values
print(df.isnull().sum())

# Check for duplicates
print("Number of duplicates:", df.duplicated().sum())


Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           
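
Raw counts are hard to compare without the row total. A possible way to summarize them, sketched on an invented frame (`toy` and `missing_summary` are illustrative names, not part of the original notebook), is to put counts and percentages side by side and keep only the columns that actually have gaps:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
    "TotalVisits": [0.0, 5.0, np.nan, np.nan],
    "Converted": [0, 1, 0, 1],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

# Keep only columns that have at least one missing value
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```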

<h1>Step 2: Data Preprocessing</h1>


* Data Cleaning

Identify columns with missing values:

In [46]:
print(df.isnull().sum())

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

The counts above show that several columns have missing values. How to handle them depends on what each column represents and how much of it is missing.
Common strategies for handling missing values include:
* Dropping columns with a high percentage of missing values.
* Dropping rows with missing values, especially if the number of missing values is relatively small compared to the total dataset size.
* Imputing missing values using methods such as mean, median, mode, or regression.
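
As a rough illustration of the imputation strategy, the sketch below fills a numeric column with its median and a categorical column with its mode. The frame `toy` and its values are invented; the median is used here instead of the mean because it is less affected by the outlier in the example.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "TotalVisits": [1.0, np.nan, 3.0, 100.0],
    "Lead Source": ["Google", "Google", None, "Olark Chat"],
})

imputed = toy.copy()

# Numeric column: fill with the median of the observed values
imputed["TotalVisits"] = imputed["TotalVisits"].fillna(imputed["TotalVisits"].median())

# Categorical column: fill with the most frequent value (mode)
imputed["Lead Source"] = imputed["Lead Source"].fillna(imputed["Lead Source"].mode()[0])

print(imputed)
```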

To drop columns with a high percentage of missing values, we can set a threshold and drop every column whose share of missing values exceeds it. The code below uses a threshold of 30%.

In [47]:
threshold = 0.3  # Set the threshold to 30% missing values

# Calculate the percentage of missing values in each column
missing_percentage = df.isnull().mean()

# Get the columns to drop based on the threshold
columns_to_drop = missing_percentage[missing_percentage > threshold].index

# Drop the columns from the DataFrame
df_dropped_columns = df.drop(columns=columns_to_drop)

# Print the updated DataFrame
print(df_dropped_columns.head())


                            Prospect ID  Lead Number              Lead Origin  \
0  7927b2df-8bba-4d29-b9a2-b6e0beafe620       660737                      API   
1  2a272436-5132-4136-86fa-dcc88c88f482       660728                      API   
2  8cc8c611-a219-4f35-ad23-fdfd2656bd8a       660727  Landing Page Submission   
3  0cc2df48-7cf4-4e39-9de9-19797f9b38cc       660719  Landing Page Submission   
4  3256f628-e534-4826-9d63-4a8b88782852       660681  Landing Page Submission   

      Lead Source Do Not Email Do Not Call  Converted  TotalVisits  \
0      Olark Chat           No          No          0          0.0   
1  Organic Search           No          No          0          5.0   
2  Direct Traffic           No          No          1          2.0   
3  Direct Traffic           No          No          0          1.0   
4          Google           No          No          1          2.0   

   Total Time Spent on Website  Page Views Per Visit  ...  \
0                            0 
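
The same thresholding step can be wrapped in a small helper so the cutoff is easy to vary. This is only a sketch: `drop_sparse_columns`, the frame `toy`, and the column name `Mostly Empty` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose missing share exceeds `threshold`."""
    missing_share = frame.isnull().mean()
    # Keep columns at or below the threshold (same rule as dropping those above it)
    keep = missing_share[missing_share <= threshold].index
    return frame.loc[:, keep].copy()

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Converted": [0, 1, 0, 1],
    "Mostly Empty": [np.nan, np.nan, "High", np.nan],  # 75% missing
    "Country": ["India", None, "India", "India"],      # 25% missing
})

cleaned = drop_sparse_columns(toy, threshold=0.3)
print(cleaned.columns.tolist())
```

Because the comparison is strict, a column sitting just under the cutoff is kept and still needs individual handling.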

In [48]:
# Check the percentage of null values in each column after dropping columns with more than 30% nulls

round(100*(df_dropped_columns.isnull().sum()/len(df_dropped_columns.index)), 2)

Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper 

The columns below still have a high share of null values, so let's check and handle them individually:
* Country   
* Specialization        
* How did you hear about X Education           
* What is your current occupation       
* What matters most to you in choosing a course    
* Lead Profile   
* City  

* Country: To impute missing values in the 'Country' column, we can replace them with the mode (most frequent value) since it's a categorical variable.

In [49]:
df_dropped_columns['Country'] = df_dropped_columns['Country'].fillna(df_dropped_columns['Country'].mode()[0])


* Specialization:
For the 'Specialization' column, we can replace missing values with the string "Not Specified" to indicate that the information was not provided.


In [50]:
df_dropped_columns['Specialization'] = df_dropped_columns['Specialization'].fillna('Not Specified')
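
The choice between the mode and a placeholder such as "Not Specified" changes the resulting category counts. The sketch below compares the two on an invented series (`spec` and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy series; values are invented for illustration.
spec = pd.Series(
    ["Finance", "Marketing", None, None, "Finance", None],
    name="Specialization",
)

# Option 1: fill gaps with the most frequent category
mode_filled = spec.fillna(spec.mode()[0])

# Option 2: fill gaps with an explicit placeholder category
placeholder_filled = spec.fillna("Not Specified")

mode_counts = mode_filled.value_counts()
placeholder_counts = placeholder_filled.value_counts()

print(mode_counts)
print(placeholder_counts)
```

Mode imputation inflates the top category, while the placeholder keeps the fact that the value was missing visible as its own category.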

* How did you hear about X Education:
Similarly, for the 'How did you hear about X Education' column, we can replace missing values with the string "Not Specified".

In [51]:
df_dropped_columns['How did you hear about X Education'] = df_dropped_columns['How did you hear about X Education'].fillna('Not Specified')

* What is your current occupation:
For the 'What is your current occupation' column, we can replace missing values with the mode (most frequent value) since it's a categorical variable.

In [52]:
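# Fill with the mode, as described above, instead of a hard-coded category
df_dropped_columns['What is your current occupation'] = df_dropped_columns['What is your current occupation'].fillna(df_dropped_columns['What is your current occupation'].mode()[0])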

* What matters most to you in choosing a course:
About 29% of the values in the 'What matters most to you in choosing a course' column are missing, just under the 30% threshold, so it might be better to drop this column than to impute it.

In [53]:
df_dropped_columns.drop('What matters most to you in choosing a course', axis=1, inplace=True)

* Lead Profile:
For the 'Lead Profile' column, we can replace missing values with the string "Not Specified".

In [54]:
df_dropped_columns['Lead Profile'] = df_dropped_columns['Lead Profile'].fillna('Not Specified')

* City: For the 'City' column, we can replace missing values with the mode (most frequent value) since it's a categorical variable.

In [55]:
df_dropped_columns['City'] = df_dropped_columns['City'].fillna(df_dropped_columns['City'].mode()[0])


In [59]:
# Check null percentages
round(100*(df_dropped_columns.isnull().sum()/len(df_dropped_columns.index)), 2)

Prospect ID                                 0.0
Lead Number                                 0.0
Lead Origin                                 0.0
Lead Source                                 0.0
Do Not Email                                0.0
Do Not Call                                 0.0
Converted                                   0.0
TotalVisits                                 0.0
Total Time Spent on Website                 0.0
Page Views Per Visit                        0.0
Last Activity                               0.0
Country                                     0.0
Specialization                              0.0
How did you hear about X Education          0.0
What is your current occupation             0.0
Search                                      0.0
Magazine                                    0.0
Newspaper Article                           0.0
X Education Forums                          0.0
Newspaper                                   0.0
Digital Advertisement                   

In [60]:
# The remaining missing values affect only a small share of rows, so drop those rows
df_dropped_columns.dropna(inplace=True)
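
Before dropping rows, it can help to measure how many are lost. A minimal sketch on an invented frame (the names `toy`, `rows_before`, `rows_after`, and `rows_lost_pct` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Lead Source": ["Google", None, "Olark Chat", "Google", "Google"],
    "TotalVisits": [0.0, 5.0, np.nan, 2.0, 1.0],
    "Converted": [0, 1, 0, 1, 0],
})

rows_before = len(toy)

# Drop every row that has at least one missing value
toy_complete = toy.dropna()

rows_after = len(toy_complete)
rows_lost_pct = 100 * (rows_before - rows_after) / rows_before

print(f"Dropped {rows_before - rows_after} of {rows_before} rows ({rows_lost_pct:.1f}%)")
```

If the lost share turns out to be large, imputing the affected columns may be preferable to dropping rows.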

In [61]:
# Check null percentages
round(100*(df_dropped_columns.isnull().sum()/len(df_dropped_columns.index)), 2)

Prospect ID                                 0.0
Lead Number                                 0.0
Lead Origin                                 0.0
Lead Source                                 0.0
Do Not Email                                0.0
Do Not Call                                 0.0
Converted                                   0.0
TotalVisits                                 0.0
Total Time Spent on Website                 0.0
Page Views Per Visit                        0.0
Last Activity                               0.0
Country                                     0.0
Specialization                              0.0
How did you hear about X Education          0.0
What is your current occupation             0.0
Search                                      0.0
Magazine                                    0.0
Newspaper Article                           0.0
X Education Forums                          0.0
Newspaper                                   0.0
Digital Advertisement                   

<h1>Step 3: Exploratory Data Analysis</h1>


* Univariate Analysis: