# Introduction
Joni is the CRM manager at Madugital, a company that sells honey online. Joni has a collection of leads gathered from a questionnaire that customers filled in when they visited the Madugital website. Joni wants to know which types of customers are likely to buy Madugital products and how best to approach them. Let's help Joni by modelling the dataset!

## 1. Data Preparation
First, let's import some important packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Data Overview

In [2]:
df = pd.read_csv('lead_scoring.csv')
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Jakarta,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Jakarta,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Jakarta,02.Medium,01.High,15.0,18.0,No,No,Modified


In [3]:
df.shape

(9240, 37)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Prospect ID                                     9240 non-null   object 
 1   Lead Number                                     9240 non-null   int64  
 2   Lead Origin                                     9240 non-null   object 
 3   Lead Source                                     9204 non-null   object 
 4   Do Not Email                                    9240 non-null   object 
 5   Do Not Call                                     9240 non-null   object 
 6   Converted                                       9240 non-null   int64  
 7   TotalVisits                                     9103 non-null   float64
 8   Total Time Spent on Website                     9240 non-null   int64  
 9   Page Views Per Visit                     

### Handling Missing Values

In [5]:
# check whether there are any missing values
df.isnull().sum()

Prospect ID                                          0
Lead Number                                          0
Lead Origin                                          0
Lead Source                                         36
Do Not Email                                         0
Do Not Call                                          0
Converted                                            0
TotalVisits                                        137
Total Time Spent on Website                          0
Page Views Per Visit                               137
Last Activity                                      103
Country                                           2461
Specialization                                    1438
How did you hear about Madugital                  2207
What is your current occupation                   2690
What matters most to you in choosing a product    2709
Search                                               0
Magazine                                             0
Newspaper 

In [6]:
# percentage of missing values in each column
df.isna().mean() * 100

Prospect ID                                        0.000000
Lead Number                                        0.000000
Lead Origin                                        0.000000
Lead Source                                        0.389610
Do Not Email                                       0.000000
Do Not Call                                        0.000000
Converted                                          0.000000
TotalVisits                                        1.482684
Total Time Spent on Website                        0.000000
Page Views Per Visit                               1.482684
Last Activity                                      1.114719
Country                                           26.634199
Specialization                                    15.562771
How did you hear about Madugital                  23.885281
What is your current occupation                   29.112554
What matters most to you in choosing a product    29.318182
Search                                  

In [7]:
df_dropped = df.copy()

In [8]:
# check values in 'How did you hear about Madugital' column
df['How did you hear about Madugital'].unique()

array(['Select', 'Word Of Mouth', 'Other', nan, 'Online Search',
       'Multiple Sources', 'Advertisements', 'Student of SomeSchool',
       'Email', 'Social Media', 'SMS'], dtype=object)

In [9]:
# fill missing records in 'How did you hear about Madugital' column
df_dropped['How did you hear about Madugital'] = df_dropped['How did you hear about Madugital'].fillna('Other')

In [10]:
# check values in 'What is your current occupation' column
df['What is your current occupation'].unique()

array(['Unemployed', 'Student', nan, 'Working Professional',
       'Businessman', 'Other', 'Housewife'], dtype=object)

In [11]:
# fill missing records in 'What is your current occupation' column
df_dropped['What is your current occupation'] = df_dropped['What is your current occupation'].fillna('Other')

In [12]:
# percentage of missing values in each column
df_dropped.isna().mean() * 100

Prospect ID                                        0.000000
Lead Number                                        0.000000
Lead Origin                                        0.000000
Lead Source                                        0.389610
Do Not Email                                       0.000000
Do Not Call                                        0.000000
Converted                                          0.000000
TotalVisits                                        1.482684
Total Time Spent on Website                        0.000000
Page Views Per Visit                               1.482684
Last Activity                                      1.114719
Country                                           26.634199
Specialization                                    15.562771
How did you hear about Madugital                   0.000000
What is your current occupation                    0.000000
What matters most to you in choosing a product    29.318182
Search                                  

We can see that the two columns we filled, 'How did you hear about Madugital' and 'What is your current occupation', no longer have missing values in the new dataframe.

In [13]:
df.shape, df_dropped.shape

((9240, 37), (9240, 37))
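
The two fills above can also be written as a single `fillna` call with a column-to-value mapping. The sketch below uses a small hypothetical frame (`toy`, with shortened column names `source`, `occupation` and `country` that are stand-ins, not the real columns) to show the pattern and to confirm that columns outside the mapping keep their missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are shortened stand-ins, not the real ones.
toy = pd.DataFrame({
    "source": ["Select", np.nan, "Email", np.nan],
    "occupation": ["Student", "Unemployed", np.nan, np.nan],
    "country": ["Indonesia", np.nan, np.nan, "Indonesia"],
})

# Fill only the listed columns; 'country' is deliberately left untouched.
toy_filled = toy.fillna({"source": "Other", "occupation": "Other"})

# Share of missing values per column, in percent, after the fill.
missing_pct = toy_filled.isna().mean() * 100
print(missing_pct)
```

Because `fillna` returns a new frame here, the original `toy` still holds its missing values, much like `df` and `df_dropped` in the cells above.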

### Check Duplicate Values

In [14]:
df.duplicated().sum()

0
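
Since `Prospect ID` and `Lead Number` look like per-lead identifiers, a full-row check may report 0 even when the same answers appear more than once. A minimal sketch on hypothetical data, assuming the identifier column (called `lead_id` here) should be excluded from the comparison:

```python
import pandas as pd

# Hypothetical toy leads: rows 0 and 2 differ only in their identifier.
toy = pd.DataFrame({
    "lead_id": [101, 102, 103],
    "source": ["Email", "SMS", "Email"],
    "converted": [1, 0, 1],
})

# Full-row check: the unique identifier makes every row distinct.
full_dupes = toy.duplicated().sum()

# Check on the non-identifier columns only.
content_cols = [c for c in toy.columns if c != "lead_id"]
content_dupes = toy.duplicated(subset=content_cols).sum()

print(full_dupes, content_dupes)
```

Whether such rows are true duplicates or simply different leads with identical answers is a judgment call that depends on how the questionnaire data was collected.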

### Finding Data Insights

In [15]:
# cross-tabulate the two variables; with normalize = True each cell is a share of all leads
pd.crosstab(df_dropped['How did you hear about Madugital'], df_dropped['Converted'], normalize = True)

How did you hear about Madugital \ Converted,0,1
Advertisements,0.004113,0.003463
Email,0.001407,0.001407
Multiple Sources,0.01039,0.006061
Online Search,0.050325,0.037121
Other,0.220238,0.038745
SMS,0.001948,0.000541
Select,0.282684,0.263095
Social Media,0.004221,0.00303
Student of SomeSchool,0.018074,0.015476
Word Of Mouth,0.021212,0.01645
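
With `normalize = True`, each cell is a share of all leads, so the table mostly reflects how common each answer is. To compare conversion rates between answers, normalizing by row is often easier to read. A sketch on hypothetical data (the `toy` frame and its short column names are assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data: 4 'Other' leads (1 converted), 2 'Email' leads (1 converted).
toy = pd.DataFrame({
    "source": ["Other", "Other", "Other", "Other", "Email", "Email"],
    "converted": [0, 0, 0, 1, 0, 1],
})

# Joint proportions: every cell is divided by the total number of rows.
joint = pd.crosstab(toy["source"], toy["converted"], normalize=True)

# Row proportions: each row sums to 1, so column 1 is the conversion rate.
by_row = pd.crosstab(toy["source"], toy["converted"], normalize="index")

print(by_row)
```

Applied to the table above, the same idea puts the 'Other' row at roughly 0.039 / (0.220 + 0.039) ≈ 15% converted and the 'Select' row at roughly 0.263 / (0.283 + 0.263) ≈ 48%.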


In [16]:
pd.crosstab(df_dropped['How did you hear about Madugital'], df_dropped['What is your current occupation'], normalize = True)

How did you hear about Madugital \ What is your current occupation,Businessman,Housewife,Other,Student,Unemployed,Working Professional
Advertisements,0.0,0.0,0.002165,0.000216,0.004762,0.000433
Email,0.0,0.0,0.000433,0.0,0.002165,0.000216
Multiple Sources,0.0,0.0,0.004221,0.000216,0.010606,0.001407
Online Search,0.000108,0.000108,0.022186,0.000758,0.058658,0.005628
Other,0.000108,0.000108,0.241234,0.001732,0.014069,0.001732
SMS,0.0,0.0,0.00119,0.0,0.00119,0.000108
Select,0.000649,0.000866,0.001515,0.01829,0.46461,0.059848
Social Media,0.0,0.0,0.001515,0.000108,0.005303,0.000325
Student of SomeSchool,0.0,0.0,0.007792,0.000866,0.021212,0.00368
Word Of Mouth,0.0,0.0,0.010606,0.000541,0.023485,0.00303


In [17]:
# one-hot encode the two categorical columns; drop_first removes one redundant column per variable
df_dummies = pd.get_dummies(df_dropped[['How did you hear about Madugital', 'What is your current occupation']], drop_first = True)
df_modelling = df_dropped.join(df_dummies)
df_dummies.head()

Unnamed: 0,How did you hear about Madugital_Email,How did you hear about Madugital_Multiple Sources,How did you hear about Madugital_Online Search,How did you hear about Madugital_Other,How did you hear about Madugital_SMS,How did you hear about Madugital_Select,How did you hear about Madugital_Social Media,How did you hear about Madugital_Student of SomeSchool,How did you hear about Madugital_Word Of Mouth,What is your current occupation_Housewife,What is your current occupation_Other,What is your current occupation_Student,What is your current occupation_Unemployed,What is your current occupation_Working Professional
0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,0,0,0,0,0,1,0
2,0,0,0,0,0,1,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,1,0
4,0,0,0,1,0,0,0,0,0,0,0,0,1,0
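
In the output above, 'Advertisements' and 'Businessman' have no column of their own: `drop_first = True` removes the first category of each variable, and a row with all zeros for that variable stands for the dropped category. A small sketch of this behaviour on a hypothetical column:

```python
import pandas as pd

# Hypothetical toy column with three categories.
toy = pd.DataFrame({"occupation": ["Student", "Businessman", "Unemployed", "Student"]})

# dtype=int keeps the 0/1 display; newer pandas versions default to booleans.
full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# The 'Businessman' row (index 1) is encoded as all zeros in the reduced frame.
print(reduced.iloc[1].tolist())
```

For a decision tree the dropped column is not strictly required, so keeping all categories would also be a reasonable choice; `drop_first` matters more for linear models.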


In [18]:
# check the columns of the modelling dataframe
df_modelling.columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about Madugital',
       'What is your current occupation',
       'What matters most to you in choosing a product', 'Search', 'Magazine',
       'Newspaper Article', 'Madugital Telegram', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Products', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score',
       'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity',
       'How did you hear about Mad

In [19]:
# keep only the target and the dummy columns; every other column is dropped
df_modelling = df_modelling[['Converted'] + list(df_dummies.columns)]

### Modelling Preparation

In [20]:
train, test = train_test_split(df_modelling, test_size = 0.3, random_state = 10)

In [21]:
train.shape, test.shape

((6468, 15), (2772, 15))
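
The split above is purely random. If the share of converted leads should stay the same in both parts, `train_test_split` accepts a `stratify` argument. A sketch on hypothetical data, reusing the same `test_size` and `random_state` (the `toy` frame and its `feature` column are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 leads, 40 of them converted.
toy = pd.DataFrame({
    "feature": range(100),
    "Converted": [1] * 40 + [0] * 60,
})

# stratify keeps the class ratio (approximately) equal in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=10, stratify=toy["Converted"]
)

print(toy_train.shape, toy_test.shape)
print(toy_train["Converted"].mean(), toy_test["Converted"].mean())
```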

In [22]:
train.columns

Index(['Converted', 'How did you hear about Madugital_Email',
       'How did you hear about Madugital_Multiple Sources',
       'How did you hear about Madugital_Online Search',
       'How did you hear about Madugital_Other',
       'How did you hear about Madugital_SMS',
       'How did you hear about Madugital_Select',
       'How did you hear about Madugital_Social Media',
       'How did you hear about Madugital_Student of SomeSchool',
       'How did you hear about Madugital_Word Of Mouth',
       'What is your current occupation_Housewife',
       'What is your current occupation_Other',
       'What is your current occupation_Student',
       'What is your current occupation_Unemployed',
       'What is your current occupation_Working Professional'],
      dtype='object')

### Modelling

In [23]:
# Decision Tree Classifier: separate the features from the target
X_train = train.drop(columns=['Converted'])
y_train = train['Converted']

X_test = test.drop(columns=['Converted'])
y_test = test['Converted']

dt = DecisionTreeClassifier(random_state = 10)
dt.fit(X_train, y_train)

DecisionTreeClassifier(random_state=10)

In [24]:
dt.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [25]:
y_test

567     0
2303    0
1523    0
6923    0
7841    0
       ..
9124    0
1049    0
7778    0
6432    0
924     0
Name: Converted, Length: 2772, dtype: int64

### Modelling Evaluation

In [26]:
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)

0.6778499278499278 0.8598130841121495 0.17574021012416427
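
An accuracy of about 0.68 combined with a precision of about 0.86 and a recall of about 0.18 suggests the tree predicts "not converted" for most leads: the leads it does flag usually convert, but it misses most of the converted ones. Two checks can make such numbers easier to interpret: a confusion matrix, and a baseline that always predicts the majority class. The sketch below uses hypothetical labels (`y_true`, `y_hat`) rather than the notebook's results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 10 leads, 4 converted; the model flags only one of them.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

toy_precision = tp / (tp + fp)  # share of flagged leads that really converted
toy_recall = tp / (tp + fn)     # share of converted leads that were flagged

# Baseline: always predict the majority class (0 here).
baseline_acc = accuracy_score(y_true, np.zeros_like(y_true))

print(toy_precision, toy_recall, baseline_acc)
```

Running the same comparison on `y_test` and `y_pred` would show how much of the reported accuracy comes from the class balance alone.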
