# Importing Libraries

In [1]:
import numpy as np
import pandas as pd

# For Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# To Scale our data
from sklearn.preprocessing import scale

# Jupyter Settings

In [2]:
# Show every column (max_columns=None) so that pandas doesn't truncate wide output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 40)

# Data Read

In [3]:
df = pd.read_csv('Leads.csv')

# Data Understanding

In [4]:
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Select,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Select,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [5]:
df.shape

(9240, 37)

There are 9240 records and 37 columns.

In [6]:
df.dtypes

Prospect ID                                       object
Lead Number                                        int64
Lead Origin                                       object
Lead Source                                       object
Do Not Email                                      object
Do Not Call                                       object
Converted                                          int64
TotalVisits                                      float64
Total Time Spent on Website                        int64
Page Views Per Visit                             float64
Last Activity                                     object
Country                                           object
Specialization                                    object
How did you hear about X Education                object
What is your current occupation                   object
What matters most to you in choosing a course     object
Search                                            object
Magazine                       

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

## Variable Identification

In [8]:
# Checking the Converted Rate

converted = (sum(df['Converted'])/len(df['Converted'].index))*100

converted

38.53896103896104

About 38.5% of the leads in the given data set converted.
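Because `Converted` is a 0/1 column, its mean is the fraction of converted leads, so the same rate can be read off with `.mean()`. The snippet below is a minimal sketch on a made-up `toy` frame (the data and the `conversion_rate` name are assumptions for illustration, not values from Leads.csv).

```python
import pandas as pd

# Toy stand-in for the 'Converted' column (1 = converted, 0 = not converted)
toy = pd.DataFrame({'Converted': [1, 0, 0, 1, 0]})

# The mean of a 0/1 column is the fraction of ones; multiply by 100 for a percentage
conversion_rate = toy['Converted'].mean() * 100

print(round(conversion_rate, 2))
```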

## Describe Numeric

In [9]:
df.describe()

Unnamed: 0,Lead Number,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Asymmetrique Activity Score,Asymmetrique Profile Score
count,9240.0,9240.0,9103.0,9240.0,9103.0,5022.0,5022.0
mean,617188.435606,0.38539,3.445238,487.698268,2.36282,14.306252,16.344883
std,23405.995698,0.486714,4.854853,548.021466,2.161418,1.386694,1.811395
min,579533.0,0.0,0.0,0.0,0.0,7.0,11.0
25%,596484.5,0.0,1.0,12.0,1.0,14.0,15.0
50%,615479.0,0.0,3.0,248.0,2.0,14.0,16.0
75%,637387.25,1.0,5.0,936.0,3.0,15.0,18.0
max,660737.0,1.0,251.0,2272.0,55.0,18.0,20.0


## Describe Categorical

In [10]:
cat_var = [cname for cname in df.columns if df[cname].dtype == "object"]

df[cat_var].describe()

Unnamed: 0,Prospect ID,Lead Origin,Lead Source,Do Not Email,Do Not Call,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
count,9240,9240,9204,9240,9240,9137,6779,7802,7033,6550,6531,9240,9240,9240,9240,9240,9240,9240,9240,5887,4473,9240,9240,6531,7820,5022,5022,9240,9240,9240
unique,9240,5,21,2,2,17,38,19,10,6,3,2,1,2,2,2,2,2,1,26,5,1,1,6,7,3,3,1,2,16
top,6c2a3e3f-de1a-49bc-9de5-b6a4c8a44cda,Landing Page Submission,Google,No,No,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,02.Medium,No,No,Modified
freq,1,4886,2868,8506,9238,3437,6492,1942,5043,5600,6528,9226,9240,9238,9239,9239,9236,9233,9240,2072,1560,9240,9240,4146,3222,3839,2788,9240,6352,3407


## Unique value counts

In [11]:
unique_values = pd.DataFrame(df.apply(lambda x: len(x.value_counts(dropna=False)), axis=0), columns=['Unique Value Count']).sort_values(by='Unique Value Count', ascending=True)

unique_values['dtype'] = pd.DataFrame(df.dtypes)

unique_values

Unnamed: 0,Unique Value Count,dtype
Get updates on DM Content,1,object
I agree to pay the amount through cheque,1,object
Receive More Updates About Our Courses,1,object
Magazine,1,object
Update me on Supply Chain Content,1,object
Through Recommendations,2,object
Digital Advertisement,2,object
Newspaper,2,object
X Education Forums,2,object
A free copy of Mastering The Interview,2,object


The unique value count is an important indicator of the variance in each column. It shows which columns hold the same value in every row and which hold a different value in every row, and both kinds are candidates for dropping.
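The two kinds of columns can also be picked out programmatically rather than by reading the table. The sketch below assumes a small made-up frame (`toy`, with hypothetical column names) and uses `DataFrame.nunique(dropna=False)` so that NaN counts as its own value, matching the `value_counts(dropna=False)` count above.

```python
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'c3', 'd4'],                # every value different
    'magazine': ['No', 'No', 'No', 'No'],          # every value the same
    'source': ['Google', 'Bing', 'Google', None],  # somewhere in between
})

# Count distinct values per column, treating NaN as its own value
n_unique = toy.nunique(dropna=False)

# Columns with a single value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers
all_distinct_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, all_distinct_cols)
```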

## Frequency Distribution of Unique Values of all Columns

In [12]:
cols = list(df.columns)
cols.remove('Prospect ID')
cols.remove('Lead Number')
cols.remove('Total Time Spent on Website')
cols.remove('Page Views Per Visit')

for col in cols:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))


############################################
Unique value distribution of Lead Origin
############################################
Landing Page Submission    4886
API                        3580
Lead Add Form               718
Lead Import                  55
Quick Add Form                1
Name: Lead Origin, dtype: int64

############################################
Unique value distribution of Lead Source
############################################
Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
NaN                    36
bing                    6
google                  5
Click2call              4
Press_Release           2
Live Chat               2
Social Media            2
WeLearn                 1
youtubechannel          1
NC_EDM                  1
blog                    1
welearnblog_Home        1
Pay per Click Ads       

## Min Max Values of Numeric Columns

In [13]:
# Check if all the values of the variables are in the same scale

numeric_cols = [cname for cname in df.columns if 
                                df[cname].dtype in ['int64', 'float64']]

Max = pd.DataFrame(df[numeric_cols].max().rename('Max'))
Min = pd.DataFrame(df[numeric_cols].min().rename('Min'))

pd.concat([Max, Min], axis=1)

Unnamed: 0,Max,Min
Lead Number,660737.0,579533.0
Converted,1.0,0.0
TotalVisits,251.0,0.0
Total Time Spent on Website,2272.0,0.0
Page Views Per Visit,55.0,0.0
Asymmetrique Activity Score,18.0,7.0
Asymmetrique Profile Score,20.0,11.0
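The same min/max comparison can be produced in a single call with `DataFrame.agg`. This is only a sketch on made-up numbers (`toy` and its column names are assumptions); the extra `range` column makes the difference in scale between columns easier to see.

```python
import pandas as pd

# Toy numeric data with hypothetical column names
toy = pd.DataFrame({
    'visits': [0.0, 3.0, 251.0],
    'seconds': [0, 248, 2272],
})

# One row per column, with its minimum and maximum side by side
ranges = toy.agg(['min', 'max']).T

# The spread of each column hints at whether scaling will be needed
ranges['range'] = ranges['max'] - ranges['min']

print(ranges)
```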


# Data Cleaning

## Dropping Columns with Single/All Different Values

In [14]:
drop_col = ['Prospect ID', 'Lead Number', 'Get updates on DM Content', 'I agree to pay the amount through cheque', 
            'Receive More Updates About Our Courses', 'Magazine', 'Update me on Supply Chain Content']

df.drop(drop_col, axis=1, inplace=True)

## Dropping Columns with Very High Imbalance

The columns below are dominated by a single value, so they carry almost no information and are not useful for the analysis.

In [15]:
for col in ['Through Recommendations', 'Digital Advertisement', 'Newspaper', 'X Education Forums', 'Search', 
            'Newspaper Article', 'What matters most to you in choosing a course', 'Do Not Call']:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))


############################################
Unique value distribution of Through Recommendations
############################################
No     9233
Yes       7
Name: Through Recommendations, dtype: int64

############################################
Unique value distribution of Digital Advertisement
############################################
No     9236
Yes       4
Name: Digital Advertisement, dtype: int64

############################################
Unique value distribution of Newspaper
############################################
No     9239
Yes       1
Name: Newspaper, dtype: int64

############################################
Unique value distribution of X Education Forums
############################################
No     9239
Yes       1
Name: X Education Forums, dtype: int64

############################################
Unique value distribution of Search
############################################
No     9226
Yes      14
Name: Search, dtype: int64

###############

In [16]:
drop_col = ['Through Recommendations', 'Digital Advertisement', 'Newspaper', 'X Education Forums', 'Search', 
            'Newspaper Article', 'What matters most to you in choosing a course', 'Do Not Call']

df.drop(drop_col, axis=1, inplace=True)
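Instead of listing the imbalanced columns by hand, the share of the most frequent value can be computed for each column and compared with a cut-off. The sketch below uses made-up data, and the 0.95 threshold is an assumption for illustration; the notebook itself does not state a numeric cut-off.

```python
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    'newspaper': ['no'] * 99 + ['yes'],               # 99% a single value
    'origin': ['api'] * 40 + ['landing page'] * 60,   # reasonably balanced
})

threshold = 0.95  # assumed cut-off, not taken from the notebook

# Share of the most frequent value in each column (NaN counted as a value)
top_share = toy.apply(
    lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
)

imbalanced_cols = top_share[top_share >= threshold].index.tolist()

print(imbalanced_cols)
```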

## Dropping Insignificant Columns

The values of the City column point to locations in India, and India also accounts for 6492 of the 6779 non-null Country values, which makes the Country column not useful for the analysis.

In [17]:
df.drop(['Country'], axis=1, inplace=True)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 21 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Lead Origin                             9240 non-null   object 
 1   Lead Source                             9204 non-null   object 
 2   Do Not Email                            9240 non-null   object 
 3   Converted                               9240 non-null   int64  
 4   TotalVisits                             9103 non-null   float64
 5   Total Time Spent on Website             9240 non-null   int64  
 6   Page Views Per Visit                    9103 non-null   float64
 7   Last Activity                           9137 non-null   object 
 8   Specialization                          7802 non-null   object 
 9   How did you hear about X Education      7033 non-null   object 
 10  What is your current occupation         6550 non-null   obje

## Rename Columns

In [19]:
df.rename(columns=
            {
             "Lead Origin":"Lead_Origin",
             "Lead Source":"Lead_Source",
             "Do Not Email":"No_Email",
             "TotalVisits":"Total_Visits",
             "Total Time Spent on Website":"Time_On_Website",
             "Page Views Per Visit":"Page_Views",
             "Last Activity":"Last_Activity",
             "How did you hear about X Education":"Hear",
             "What is your current occupation":"Occupation",
             "Lead Quality":"Lead_Quality",
             "Lead Profile":"Lead_Profile",
             "Asymmetrique Activity Index":"Activity_Index",
             "Asymmetrique Profile Index":"Profile_Index",
             "Asymmetrique Activity Score":"Activity_Score",
             "Asymmetrique Profile Score":"Profile_Score",
             "A free copy of Mastering The Interview":"Free_Copy",
             "Last Notable Activity":"Last_Notable_Activity"
            }, 
            inplace=True)

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Lead_Origin            9240 non-null   object 
 1   Lead_Source            9204 non-null   object 
 2   No_Email               9240 non-null   object 
 3   Converted              9240 non-null   int64  
 4   Total_Visits           9103 non-null   float64
 5   Time_On_Website        9240 non-null   int64  
 6   Page_Views             9103 non-null   float64
 7   Last_Activity          9137 non-null   object 
 8   Specialization         7802 non-null   object 
 9   Hear                   7033 non-null   object 
 10  Occupation             6550 non-null   object 
 11  Tags                   5887 non-null   object 
 12  Lead_Quality           4473 non-null   object 
 13  Lead_Profile           6531 non-null   object 
 14  City                   7820 non-null   object 
 15  Acti

In [21]:
cols = list(df.columns)
cols.remove('Time_On_Website')
cols.remove('Page_Views')
cols.remove('Total_Visits')
cols.remove('Profile_Index')
cols.remove('Activity_Score')
cols.remove('Profile_Score')
cols.remove('Converted')

for col in cols:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))


############################################
Unique value distribution of Lead_Origin
############################################
Landing Page Submission    4886
API                        3580
Lead Add Form               718
Lead Import                  55
Quick Add Form                1
Name: Lead_Origin, dtype: int64

############################################
Unique value distribution of Lead_Source
############################################
Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
NaN                    36
bing                    6
google                  5
Click2call              4
Press_Release           2
Live Chat               2
Social Media            2
WeLearn                 1
youtubechannel          1
NC_EDM                  1
blog                    1
welearnblog_Home        1
Pay per Click Ads       

## Standardizing Values

In [22]:
cat_cols = [cname for cname in df.columns if df[cname].dtype == "object"]

for col in cat_cols:
    df[col] = df[col].str.lower()

Converting all text to lowercase helps identify values that are duplicated only because of differences in case.
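As a small illustration of the effect, the sketch below counts distinct values before and after lowercasing a made-up Series (the values are assumptions, loosely modelled on the `Google`/`google` pair seen in Lead Source).

```python
import pandas as pd

# Toy values where the same source appears with different casing
s = pd.Series(['Google', 'google', 'Bing', 'bing', None])

# Distinct non-null values before and after lowercasing
before = s.nunique()
after = s.str.lower().nunique()

print(before, after)
```

Missing values stay missing under `.str.lower()`, so this step does not interfere with the null handling done later.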

In [23]:
cols = list(df.columns)
cols.remove('Time_On_Website')
cols.remove('Page_Views')
cols.remove('Total_Visits')
cols.remove('Profile_Index')
cols.remove('Activity_Score')
cols.remove('Profile_Score')
cols.remove('Converted')

for col in cols:
    print('\n################################')
    print('Unique values of ' + str(col))
    print('################################')
    print(pd.Series(df[col].unique()).sort_values(ascending=True))


################################
Unique values of Lead_Origin
################################
0                        api
1    landing page submission
2              lead add form
3                lead import
4             quick add form
dtype: object

################################
Unique values of Lead_Source
################################
11                 bing
9                  blog
14           click2call
2        direct traffic
7              facebook
3                google
15            live chat
20               nc_edm
0            olark chat
1        organic search
10    pay per click ads
19        press_release
6             reference
4        referral sites
12         social media
18              testone
13              welearn
16     welearnblog_home
5      welingak website
17       youtubechannel
8                   NaN
dtype: object

################################
Unique values of No_Email
################################
0     no
1    yes
dtype: object

#####

## Duplicates

In [24]:
df.duplicated().value_counts()

False    7958
True     1282
dtype: int64

There are 1282 duplicate rows in the data set. We can drop them.

In [25]:
df.loc[df.duplicated()]

Unnamed: 0,Lead_Origin,Lead_Source,No_Email,Converted,Total_Visits,Time_On_Website,Page_Views,Last_Activity,Specialization,Hear,Occupation,Tags,Lead_Quality,Lead_Profile,City,Activity_Index,Profile_Index,Activity_Score,Profile_Score,Free_Copy,Last_Notable_Activity
16,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
47,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
49,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
83,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
190,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9136,api,olark chat,no,0,0.0,0,0.0,sms sent,,,,,,,,02.medium,02.medium,14.0,15.0,no,modified
9137,api,olark chat,no,0,0.0,0,0.0,olark chat conversation,,,,,,,,01.high,02.medium,17.0,15.0,no,modified
9165,api,olark chat,no,1,0.0,0,0.0,sms sent,,,,,,,,01.high,02.medium,16.0,15.0,no,modified
9170,api,olark chat,no,0,0.0,0,0.0,email opened,select,select,unemployed,already a student,,potential lead,select,01.high,02.medium,17.0,16.0,no,email opened


In [26]:
df.drop_duplicates(keep='first', inplace=True)
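With `keep='first'`, the first occurrence of each repeated row is retained and only the later copies are removed; the surviving rows keep their original index labels. A minimal sketch on made-up data (`toy` is an assumption):

```python
import pandas as pd

# Toy data: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    'origin': ['api', 'api', 'form', 'api'],
    'visits': [0, 0, 2, 0],
})

# duplicated() marks every repeat after the first occurrence
dup_mask = toy.duplicated()

# drop_duplicates(keep='first') removes exactly the marked rows
deduped = toy.drop_duplicates(keep='first')

print(dup_mask.sum(), list(deduped.index))
```

If a contiguous index is preferred after dropping, `reset_index(drop=True)` can be chained onto the result.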

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7958 entries, 0 to 9239
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Lead_Origin            7958 non-null   object 
 1   Lead_Source            7925 non-null   object 
 2   No_Email               7958 non-null   object 
 3   Converted              7958 non-null   int64  
 4   Total_Visits           7821 non-null   float64
 5   Time_On_Website        7958 non-null   int64  
 6   Page_Views             7821 non-null   float64
 7   Last_Activity          7855 non-null   object 
 8   Specialization         7259 non-null   object 
 9   Hear                   6490 non-null   object 
 10  Occupation             6007 non-null   object 
 11  Tags                   5556 non-null   object 
 12  Lead_Quality           4253 non-null   object 
 13  Lead_Profile           5988 non-null   object 
 14  City                   7276 non-null   object 
 15  Acti

## Missing Values

In [28]:
null_series = pd.Series(round(100*(df.isnull().sum(axis=0)/len(df.index)), 2).sort_values(ascending = False))
null_series.loc[null_series.values > 0]

Lead_Quality      46.56
Profile_Score     44.16
Activity_Score    44.16
Profile_Index     44.16
Activity_Index    44.16
Tags              30.18
Lead_Profile      24.75
Occupation        24.52
Hear              18.45
Specialization     8.78
City               8.57
Page_Views         1.72
Total_Visits       1.72
Last_Activity      1.29
Lead_Source        0.41
dtype: float64

Dropping columns whose missing value % is >= **30%** as they won't help in the analysis and imputing them would only add more bias.

In [29]:
drop_col = ['Lead_Quality','Profile_Score','Activity_Score','Profile_Index','Activity_Index','Tags']

df.drop(drop_col, axis=1, inplace=True)
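The list of columns to drop can also be derived from the missing-value percentages instead of being typed out. The following is a sketch on a made-up frame (`toy` and its columns are assumptions) that applies the same >= 30% rule.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    'quality': ['high', np.nan, np.nan, 'low', np.nan],      # 60% missing
    'city': ['mumbai', 'pune', np.nan, 'thane', 'mumbai'],   # 20% missing
    'visits': [1, 2, 3, 4, 5],                               # complete
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Columns at or above the 30% threshold
cols_to_drop = missing_pct[missing_pct >= 30].index.tolist()

reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop, list(reduced.columns))
```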

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7958 entries, 0 to 9239
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Lead_Origin            7958 non-null   object 
 1   Lead_Source            7925 non-null   object 
 2   No_Email               7958 non-null   object 
 3   Converted              7958 non-null   int64  
 4   Total_Visits           7821 non-null   float64
 5   Time_On_Website        7958 non-null   int64  
 6   Page_Views             7821 non-null   float64
 7   Last_Activity          7855 non-null   object 
 8   Specialization         7259 non-null   object 
 9   Hear                   6490 non-null   object 
 10  Occupation             6007 non-null   object 
 11  Lead_Profile           5988 non-null   object 
 12  City                   7276 non-null   object 
 13  Free_Copy              7958 non-null   object 
 14  Last_Notable_Activity  7958 non-null   object 
dtypes: f

In [31]:
null_series = pd.Series(round(100*(df.isnull().sum(axis=0)/len(df.index)), 2).sort_values(ascending = False))
null_series.loc[null_series.values > 0]

Lead_Profile      24.75
Occupation        24.52
Hear              18.45
Specialization     8.78
City               8.57
Page_Views         1.72
Total_Visits       1.72
Last_Activity      1.29
Lead_Source        0.41
dtype: float64

Finding the different unique values of the missing value columns to see how we can impute them.

In [32]:
for col in ['Lead_Profile','Occupation','Hear','Specialization','City','Page_Views','Total_Visits',
            'Last_Activity','Lead_Source']:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))


############################################
Unique value distribution of Lead_Profile
############################################
select                         3751
NaN                            1970
potential lead                 1508
other leads                     484
student of someschool           201
lateral student                  24
dual specialization student      20
Name: Lead_Profile, dtype: int64

############################################
Unique value distribution of Occupation
############################################
unemployed              5150
NaN                     1951
working professional     638
student                  185
other                     16
housewife                 10
businessman                8
Name: Occupation, dtype: int64

############################################
Unique value distribution of Hear
############################################
select                   4500
NaN                      1468
online search             808
wo

A general strategy is employed here to impute nulls: they are replaced with the 'select' value, which is equivalent to null. At a later stage, dummy variables are created and the 'select' dummy variable is dropped.

This takes care of the nulls while preserving the information in the remaining values.

The same applies to the other columns, where the nulls are imputed with 'unspecified' and that dummy variable is dropped later.

In [33]:
# See the imputation strategy described above

# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about
df['Specialization'] = df['Specialization'].fillna('select')
df['Hear'] = df['Hear'].fillna('select')
df['Lead_Profile'] = df['Lead_Profile'].fillna('select')
df['City'] = df['City'].fillna('select')

# Imputing with the 'unspecified' placeholder

df['Occupation'] = df['Occupation'].fillna('unspecified')
df['Last_Activity'] = df['Last_Activity'].fillna('unspecified')
df['Lead_Source'] = df['Lead_Source'].fillna('unspecified')

# Imputing with the mode value

df['Total_Visits'] = df['Total_Visits'].fillna(2.0)
df['Page_Views'] = df['Page_Views'].fillna(2.0)
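The mode used for the numeric columns can be computed from the data rather than hard-coded, which keeps the imputation correct if the data changes. A minimal sketch, assuming a made-up `toy` frame (the values are not from Leads.csv):

```python
import numpy as np
import pandas as pd

# Toy data with one missing value
toy = pd.DataFrame({'Total_Visits': [2.0, 2.0, 5.0, np.nan, 0.0]})

# mode() returns a Series (there can be ties); take the first entry
mode_value = toy['Total_Visits'].mode().iloc[0]

# Fill the missing entries with the most frequent value
toy['Total_Visits'] = toy['Total_Visits'].fillna(mode_value)

print(mode_value, toy['Total_Visits'].isnull().sum())
```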

In [34]:
df.isnull().values.any()

False

In [35]:
for col in ['Lead_Profile','Occupation','Hear','Specialization','City','Page_Views','Total_Visits',
            'Last_Activity','Lead_Source']:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))


############################################
Unique value distribution of Lead_Profile
############################################
select                         5721
potential lead                 1508
other leads                     484
student of someschool           201
lateral student                  24
dual specialization student      20
Name: Lead_Profile, dtype: int64

############################################
Unique value distribution of Occupation
############################################
unemployed              5150
unspecified             1951
working professional     638
student                  185
other                     16
housewife                 10
businessman                8
Name: Occupation, dtype: int64

############################################
Unique value distribution of Hear
############################################
select                   5968
online search             808
word of mouth             348
student of someschool     310
other   

# EDA

## Univariate Analysis

In [36]:
from tabulate import tabulate

def univariate_categorical(df):
    """Plot count and percentage bar charts for every categorical column."""
    cat_cols = [cname for cname in df.columns if df[cname].dtype == "object"]

    for col in cat_cols:
        len_cat = len(df[col].unique())
        print('\n############' + col + '############')
        print('\nNumber of unique values => ' + str(len_cat) + '\n\n')

        # Use a larger figure for columns with many categories
        if len_cat > 10:
            plt.figure(figsize=(15, 12))
        else:
            plt.figure(figsize=(10, 8))

        # Left panel: raw counts per category
        y = "count"
        plt.subplot(1, 2, 1)
        sns.countplot(x=df[col])
        count_df = df[col].value_counts().rename_axis(col).reset_index(name=y)

        # Right panel: percentage share per category
        y = "percent(%)"
        percent_df = (df[col].value_counts(normalize=True)
                      .apply(lambda x: round(x*100, 2))
                      .rename_axis(col).reset_index(name=y))
        plt.subplot(1, 2, 2)
        sns.barplot(x=col, y=y, data=percent_df)

        plt.show()
        print(tabulate(pd.merge(percent_df, count_df, how='inner'), headers='keys', tablefmt='fancy_grid'))

ModuleNotFoundError: No module named 'tabulate'

In [None]:
univariate_categorical(df)

# Feature Encoding

In [None]:
for col in df:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))

Since the 'No_Email' and 'Free_Copy' columns have only two values each, we encode them as binary (1/0) using the map function.

In [None]:
# List of variables to map
varlist =  ['No_Email', 'Free_Copy']

# Defining the map function
def binary_map(x):
    return x.map({'yes': 1, "no": 0})

# Applying the function to the columns in varlist
df[varlist] = df[varlist].apply(binary_map)
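One thing to keep in mind with `map` is that any value missing from the dictionary becomes NaN, so the mapping only works cleanly after the text has been lowercased. The sketch below shows the encoding plus a quick check on a made-up frame (`toy`, `mapping`, and `unmapped` are illustrative names).

```python
import pandas as pd

# Toy yes/no columns, already lowercased
toy = pd.DataFrame({
    'No_Email': ['no', 'yes', 'no'],
    'Free_Copy': ['yes', 'yes', 'no'],
})

mapping = {'yes': 1, 'no': 0}

# Apply the same mapping to every column
encoded = toy.apply(lambda col: col.map(mapping))

# Values outside the mapping would turn into NaN, so check for that
unmapped = encoded.isnull().values.any()

print(encoded)
print(unmapped)
```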

In [None]:
for col in ['No_Email', 'Free_Copy']:
    print('\n############################################')
    print('Unique value distribution of ' + str(col))
    print('############################################')
    print(df[col].value_counts(dropna=False).sort_values(ascending = False))

In [None]:
df.info()

Going with plain old dummy variable creation for 'Lead_Origin', 'Last_Notable_Activity' columns.

In [None]:
# Creating a dummy variable for some of the categorical variables and dropping the first one.
dummy = pd.get_dummies(df[['Lead_Origin', 'Last_Notable_Activity']], drop_first=True)

# Adding the results to the master dataframe
df = pd.concat([df, dummy], axis=1)

In [None]:
# We have created dummies for the below variables, so we can drop them

df = df.drop(['Lead_Origin', 'Last_Notable_Activity'], axis=1)
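To make the effect of `drop_first=True` concrete, the sketch below encodes a made-up column with and without it (the `toy` frame is an assumption). With `drop_first=True`, the first level in sorted order is left out and becomes the implicit baseline: a row with zeros in all remaining dummies belongs to that level.

```python
import pandas as pd

# Toy column with three levels
toy = pd.DataFrame({
    'Lead_Origin': ['api', 'landing page submission', 'lead add form', 'api']
})

# One dummy per level
full = pd.get_dummies(toy['Lead_Origin'], prefix='Lead_Origin')

# One dummy fewer: the first level ('api') is the baseline
reduced = pd.get_dummies(toy['Lead_Origin'], prefix='Lead_Origin', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```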

In [None]:
# Creating dummy variables for the remaining categorical variables and dropping the level which was created to impute nulls

############################################################################################################

# Creating dummy variables for the variable 'Specialization'
specialization = pd.get_dummies(df['Specialization'], prefix='Specialization')

# Dropping Specialization_select column
specialization = specialization.drop(['Specialization_select'], 1)

# Adding the results to the master dataframe
df = pd.concat([df, specialization], axis=1)

# Dropping the redundant column
df = df.drop(['Specialization'], 1)

############################################################################################################

# Creating dummy variables for the variable 'Hear'
hear = pd.get_dummies(df['Hear'], prefix='Hear')

# Dropping Hear_select column
hear = hear.drop(['Hear_select'], 1)

# Adding the results to the master dataframe
df = pd.concat([df, hear], axis=1)

# Dropping the redundant column
df = df.drop(['Hear'], 1)

############################################################################################################

# Creating dummy variables for the variable 'Lead_Profile'
lead_profile = pd.get_dummies(df['Lead_Profile'], prefix='Lead_Profile')

# Dropping Lead_Profile_select column
lead_profile = lead_profile.drop(['Lead_Profile_select'], 1)

# Adding the results to the master dataframe
df = pd.concat([df, lead_profile], axis=1)

# Dropping the redundant column
df = df.drop(['Lead_Profile'], 1)

############################################################################################################

# Creating dummy variables for the variable 'City'
city = pd.get_dummies(df['City'], prefix='City')

# Dropping City_select column
city = city.drop(['City_select'], axis=1)

# Adding the results to the master dataframe
df = pd.concat([df, city], axis=1)

# Dropping the redundant column
df = df.drop(['City'], axis=1)

############################################################################################################

# Creating dummy variables for the variable 'Occupation'
occupation = pd.get_dummies(df['Occupation'], prefix='Occupation')

# Dropping Occupation_unspecified column
occupation = occupation.drop(['Occupation_unspecified'], axis=1)

# Adding the results to the master dataframe
df = pd.concat([df, occupation], axis=1)

# Dropping the redundant column
df = df.drop(['Occupation'], axis=1)

############################################################################################################

# Creating dummy variables for the variable 'Last_Activity'
last_activity = pd.get_dummies(df['Last_Activity'], prefix='Last_Activity')

# Dropping Last_Activity_unspecified column
last_activity = last_activity.drop(['Last_Activity_unspecified'], axis=1)

# Adding the results to the master dataframe
df = pd.concat([df, last_activity], axis=1)

# Dropping the redundant column
df = df.drop(['Last_Activity'], axis=1)

############################################################################################################

# Creating dummy variables for the variable 'Lead_Source'
lead_source = pd.get_dummies(df['Lead_Source'], prefix='Lead_Source')

# Dropping Lead_Source_unspecified column
lead_source = lead_source.drop(['Lead_Source_unspecified'], axis=1)

# Adding the results to the master dataframe
df = pd.concat([df, lead_source], axis=1)

# Dropping the redundant column
df = df.drop(['Lead_Source'], axis=1)

In [None]:
len(df.columns)

We are finally left with 106 columns for our analysis.
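The pattern used above for each categorical column (one-hot encode, then drop the placeholder level that stood in for missing values) can be sketched on a toy frame. The city values below are made up for illustration; only the `select` placeholder mirrors the notebook, and `dtype=int` is an optional choice to get 0/1 columns rather than booleans.

```python
import pandas as pd

# Toy data: 'select' plays the role of the placeholder level used for nulls
toy = pd.DataFrame({'City': ['mumbai', 'select', 'thane', 'mumbai']})

# One-hot encode, then drop the placeholder column explicitly
city = pd.get_dummies(toy['City'], prefix='City', dtype=int)
city = city.drop(columns=['City_select'])

# Attach the dummies and remove the original categorical column
toy = pd.concat([toy, city], axis=1).drop(columns=['City'])
print(toy)
```

A row that held the placeholder ends up with zeros in every dummy column, so the placeholder acts as the reference level.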

In [None]:
df.info()

In [None]:
list(df.columns)

# Outliers

In [None]:
def univariate_continuos(df):
    
    numeric_cols = ['Total_Visits','Time_On_Website','Page_Views']
    
    for col in numeric_cols:
    
        sns.boxplot(x=df[col])
        plt.title(col)
        plt.show()

In [None]:
univariate_continuos(df)

The 'Total_Visits' and 'Page_Views' columns contain outliers. We will treat them with the common capping method: every value greater than Q3 + 1.5*IQR is replaced with Q3 + 1.5*IQR.
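As a minimal sketch of the capping rule, the snippet below applies it to a made-up series using `Series.clip`. The helper name `cap_upper_outliers` and the sample values are hypothetical and are not part of the notebook.

```python
import pandas as pd

def cap_upper_outliers(values: pd.Series) -> pd.Series:
    """Cap values above Q3 + 1.5*IQR at that fence (upper side only)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    fence_high = q3 + 1.5 * (q3 - q1)
    return values.clip(upper=fence_high)

# Hypothetical visit counts with one extreme value
visits = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
capped = cap_upper_outliers(visits)
print(capped.tolist())
```

Only the upper fence is applied here, matching the description above; values below Q1 - 1.5*IQR are left untouched.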

In [None]:
col = 'Total_Visits'

print('\nFrequency distribution of unique values => \n\n'+ str(df[col].value_counts(dropna=False).sort_index(ascending = False)))

In [None]:
col = 'Page_Views'

print('\nFrequency distribution of unique values => \n\n'+ str(df[col].value_counts(dropna=False).sort_index(ascending = False)))

In [None]:
def outlier_treatment(df, col):

    print('#########################')
    print(col)
    print('#########################')
    
    Q1 = df[col].quantile(0.25)
    print('Q1 is => ' + str(Q1))

    Q3 = df[col].quantile(0.75)
    print('Q3 is => ' + str(Q3))

    IQR = Q3 - Q1

    fence_high = Q3+1.5*IQR
    print('Fence High is => ' + str(fence_high))
    
    print('Imputing values greater than ' + str(fence_high) + ' with the same value')

    df.loc[(df[col] > fence_high), col] = fence_high
    
    print('\n')
    
    return df

In [None]:
df = outlier_treatment(df, 'Total_Visits')
df = outlier_treatment(df, 'Page_Views')

In [None]:
# Visit counts are whole numbers, so the capped value of 9.5 is rounded up to 10
df.loc[df['Total_Visits'] == 9.5, 'Total_Visits'] = 10.0

In [None]:
univariate_continuos(df)

Now, the outliers are treated.

In [None]:
col = 'Total_Visits'

print('\nFrequency distribution of unique values => \n\n'+ str(df[col].value_counts(dropna=False).sort_index(ascending = False)))

In [None]:
col = 'Page_Views'

print('\nFrequency distribution of unique values => \n\n'+ str(df[col].value_counts(dropna=False).sort_index(ascending = False)))

**We will now build three models, using RFE to select 15, 10 and 20 features respectively, and evaluate each of them on various metrics.**

# Model 1 (RFE 15)

## Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X

X = df.drop(['Converted'], axis=1)

X.head()

In [None]:
# Putting response variable to y

y = df['Converted']

y.head()

In [None]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## Feature Scaling

In [None]:
df.info()

In [None]:
from sklearn.preprocessing import StandardScaler

We apply the standard scaler only to the non-dummy (continuous) variables.
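Standardisation subtracts the column mean and divides by the column standard deviation. To avoid leaking information from the test set, these statistics are learned from the training rows only and then reused for the test rows. The sketch below shows the same idea by hand with NumPy on made-up numbers; `StandardScaler` uses the population standard deviation (`ddof=0`), which is what the sketch assumes.

```python
import numpy as np

# Hypothetical train/test values for one continuous feature
train = np.array([2.0, 4.0, 6.0, 8.0])
test = np.array([5.0, 10.0])

# Statistics are learned from the training data only
mean = train.mean()
std = train.std(ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the train statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1, while the test column generally does not, which is expected.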

In [None]:
scaler = StandardScaler()

X_train[['Total_Visits','Time_On_Website','Page_Views']] = scaler.fit_transform(X_train[['Total_Visits','Time_On_Website','Page_Views']])

X_train.head()

In [None]:
X_train[['Total_Visits','Time_On_Website','Page_Views']].describe()

## Multi Collinearity

In [None]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Let's see the correlation matrix 

plt.figure(figsize = (20,10))
sns.heatmap(df.corr(),annot = True)
plt.show()

Since the heat map is not readable with this many variables, we will let RFE deal with dropping the highly collinear variables and then use manual elimination based on VIF and p-values.
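When a heat map has too many cells to read, one alternative is to list only the most strongly correlated pairs. The sketch below is an illustration on made-up data; the helper name `top_correlated_pairs` is hypothetical and it is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Made-up data: 'b' is almost a copy of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
top = top_correlated_pairs(demo, n=3)
print(top)
```

On the real data this could be called as `top_correlated_pairs(X_train, n=20)`, assuming all columns are numeric.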

## Feature Selection Using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15) # Running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

col

In [None]:
X_train.columns[~rfe.support_]

## Model Building

In [None]:
import statsmodels.api as sm
from sklearn import metrics

# Check for the VIF values of the feature variables
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
def model_accuracy(res, X_train_sm, y_train):
    # Getting the predicted values on the train set

    y_train_pred = res.predict(X_train_sm)
    y_train_pred = y_train_pred.values.reshape(-1)

    y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
    y_train_pred_final['CustID'] = y_train.index

    y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

    from sklearn import metrics

    # Confusion matrix 
    confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
    print(confusion)

    # Let's check the overall accuracy.
    print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The VIF values of all the variables are within an acceptable range. However, some variables have p-values > 0.05.
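For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The sketch below computes it by hand with NumPy on made-up data. It adds an intercept to each regression, which is an assumption of this sketch; the statsmodels function regresses on exactly the columns it is given, so its numbers can differ when no constant column is passed.

```python
import numpy as np

def vif_by_hand(X: np.ndarray) -> list:
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the rest plus an intercept."""
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Made-up features: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.2, size=500)
x3 = rng.normal(size=500)
vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get large VIFs, while the independent column stays close to 1.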

We will drop the 'Occupation_housewife' variable since it has a high p-value of 0.999 and a higher VIF than the other variables with a similar p-value.

In [None]:
col = col.drop('Occupation_housewife')
col

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Lead_Profile_dual specialization student' variable since it has a high p-value of 0.999 and a higher VIF than the other variables with a similar p-value.

In [None]:
col = col.drop('Lead_Profile_dual specialization student')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Last_Activity_approached upfront' variable since it has a high p-value of 0.999 and a higher VIF than the other variables with a similar p-value.

In [None]:
col = col.drop('Last_Activity_approached upfront')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Last_Notable_Activity_had a phone conversation' variable since it has a relatively high p-value compared to the other variables, and check whether the accuracy of the model drops as a result.

In [None]:
col = col.drop('Last_Notable_Activity_had a phone conversation')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Occupation_other' variable since it has a relatively high p-value compared to the other variables, and check whether the accuracy of the model drops as a result.

In [None]:
col = col.drop('Occupation_other')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Lead_Source_welingak website' variable since it has a relatively high p-value compared to the other variables, and check whether the accuracy of the model drops as a result.

In [None]:
col = col.drop('Lead_Source_welingak website')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Lead_Profile_lateral student' variable since it has a relatively high p-value compared to the other variables, and check whether the accuracy of the model drops as a result.

In [None]:
col = col.drop('Lead_Profile_lateral student')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We will drop the 'Last_Notable_Activity_unreachable' variable since it has a relatively high p-value compared to the other variables, and check whether the accuracy of the model drops as a result.

In [None]:
col = col.drop('Last_Notable_Activity_unreachable')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Getting the predicted values on the train set

y_train_pred = res.predict(X_train_sm)
y_train_pred = y_train_pred.values.reshape(-1)

y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index

y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

from sklearn import metrics

# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

The final accuracy of the model on the training set with 0.5 threshold is 80%.

## Plotting the ROC Curve

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

The AUC is 0.87, which is decent, and the curve is not close to the diagonal.
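One way to read the AUC is as the probability that a randomly chosen converted lead receives a higher predicted probability than a randomly chosen non-converted lead. The sketch below computes it that way on made-up data; the function name `auc_by_ranking` is hypothetical, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def auc_by_ranking(actual, probs) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    pos = probs[actual == 1]
    neg = probs[actual == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(labels, scores))
```

A value of 0.5 corresponds to the diagonal of the ROC plot (random ranking), and 1.0 to a perfect ranking.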

## Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

**From the plot above, 0.35 is the optimum point to take as the cutoff probability.**
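Reading the crossing point off a plot is approximate, since the table only evaluates ten cutoffs. As a sketch, a finer grid can be searched for the cutoff where sensitivity and specificity are closest. The function name `balanced_cutoff` and the toy data are hypothetical, and the function assumes both classes are present.

```python
import numpy as np

def balanced_cutoff(actual, probs, grid=None) -> float:
    """Return the cutoff on the grid where |sensitivity - specificity| is smallest."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    if grid is None:
        grid = np.round(np.arange(0.0, 1.0, 0.01), 2)
    best_cutoff, best_gap = None, np.inf
    for cutoff in grid:
        predicted = (probs > cutoff).astype(int)
        tp = np.sum((predicted == 1) & (actual == 1))
        tn = np.sum((predicted == 0) & (actual == 0))
        fp = np.sum((predicted == 1) & (actual == 0))
        fn = np.sum((predicted == 0) & (actual == 1))
        sensi = tp / (tp + fn)
        speci = tn / (tn + fp)
        gap = abs(sensi - speci)
        if gap < best_gap:
            best_cutoff, best_gap = float(cutoff), gap
    return best_cutoff

# Hypothetical labels and probabilities
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
cutoff = balanced_cutoff(labels, scores)
print(cutoff)
```

With the notebook's data this could be called as `balanced_cutoff(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)`.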

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.35 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Calculate the confusion matrix

confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicting conversion when the lead has not converted

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Precision and Recall Tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

**From the curve above, 0.42 is the optimum point to take as the cutoff probability.**
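The crossing point of the precision and recall curves can also be located numerically. Precision equals recall when the number of predicted positives equals the number of actual positives, so the sketch below searches a grid for the cutoff where the two are closest. The function name `precision_recall_crossover` and the toy data are hypothetical; this recomputes the metrics on a fixed grid rather than using the thresholds returned by `precision_recall_curve`.

```python
import numpy as np

def precision_recall_crossover(actual, probs, grid=None) -> float:
    """Return the grid cutoff where precision and recall are closest."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    if grid is None:
        grid = np.round(np.arange(0.01, 1.0, 0.01), 2)
    best_cutoff, best_gap = None, np.inf
    for cutoff in grid:
        predicted = probs > cutoff
        if predicted.sum() == 0:
            continue  # precision is undefined when nothing is predicted positive
        tp = np.sum(predicted & (actual == 1))
        precision = tp / predicted.sum()
        recall = tp / np.sum(actual == 1)
        gap = abs(precision - recall)
        if gap < best_gap:
            best_cutoff, best_gap = float(cutoff), gap
    return best_cutoff

# Hypothetical labels and probabilities
labels = [0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.55, 0.7, 0.5, 0.8, 0.9]
crossover = precision_recall_crossover(labels, scores)
print(crossover)
```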

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.42 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion3 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion3

In [None]:
TP = confusion3[1,1] # true positive 
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicting conversion when the lead has not converted

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Making predictions on the test set

In [None]:
X_test[['Total_Visits','Time_On_Website','Page_Views']] = scaler.transform(X_test[['Total_Visits','Time_On_Website','Page_Views']])

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting the predicted probabilities (y_test_pred) to a dataframe

y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head

y_pred_1.head()

In [None]:
# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

In [None]:
# Storing the index in a CustID column

y_test_df['CustID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 

y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

In [None]:
# Rearranging the columns

y_pred_final = y_pred_final.reindex(columns=['CustID','Converted','Converted_Prob'])

In [None]:
# Let's see the head of y_pred_final

y_pred_final.head()

We will go with the threshold of 0.42, which yielded better metrics on the train set.
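Applying a cutoff turns each predicted probability into a 0/1 prediction, and multiplying the probability by 100 gives a lead score. The sketch below shows this on made-up probabilities using a vectorised comparison. The helper name `score_leads` is hypothetical, and rounding the score to a whole number is an assumption of this sketch; the notebook keeps the unrounded value.

```python
import pandas as pd

def score_leads(probabilities: pd.Series, cutoff: float) -> pd.DataFrame:
    """Attach a 0/1 prediction and a 0-100 lead score to predicted probabilities."""
    scored = pd.DataFrame({'Converted_Prob': probabilities})
    scored['final_predicted'] = (scored['Converted_Prob'] > cutoff).astype(int)
    scored['lead_score'] = (scored['Converted_Prob'] * 100).round().astype(int)
    return scored

# Hypothetical probabilities for three leads
demo = score_leads(pd.Series([0.10, 0.42, 0.91]), cutoff=0.42)
print(demo)
```

Because the comparison is strict, a lead whose probability is exactly equal to the cutoff is predicted as not converted.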

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.42 else 0)
y_pred_final['lead_score'] = y_pred_final.Converted_Prob * 100

In [None]:
y_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
confusion4 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )

confusion4

In [None]:
TP = confusion4[1,1] # true positive 
TN = confusion4[0,0] # true negatives
FP = confusion4[0,1] # false positives
FN = confusion4[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicting conversion when the lead has not converted

print(FP/ float(TN+FP))

In [None]:
# Precision

confusion4[1,1]/(confusion4[0,1]+confusion4[1,1])

In [None]:
# Recall

confusion4[1,1]/(confusion4[1,0]+confusion4[1,1])

**Detailed comments and observations on the metrics are provided at the bottom of this notebook and are used for model comparison.**

**The approach for splitting the data, scaling, RFE, model building and variable elimination in the following models is the same as in the previous one, so the explanations are not repeated.**
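Since the same metric cells are repeated for every model and threshold, they could be wrapped in a single helper. The sketch below is one possible shape for it, written with NumPy only; the function name `classification_summary` and the sample labels are hypothetical, and the function assumes both classes and at least one positive prediction are present.

```python
import numpy as np

def classification_summary(actual, predicted) -> dict:
    """Compute the metrics repeated throughout this notebook from 0/1 labels."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    tp = int(np.sum((actual == 1) & (predicted == 1)))
    tn = int(np.sum((actual == 0) & (predicted == 0)))
    fp = int(np.sum((actual == 0) & (predicted == 1)))
    fn = int(np.sum((actual == 1) & (predicted == 0)))
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),  # same as recall
        'specificity': tn / (tn + fp),
        'false_positive_rate': fp / (tn + fp),
        'precision': tp / (tp + fp),
    }

# Hypothetical actual and predicted labels
summary = classification_summary([1, 1, 1, 0, 0, 0, 0, 0],
                                 [1, 1, 0, 1, 0, 0, 0, 0])
print(summary)
```

It could then be called as `classification_summary(y_pred_final.Converted, y_pred_final.final_predicted)` after each thresholding step.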

# Model 2 (RFE 10)

## Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X

X = df.drop(['Converted'], axis=1)

X.head()

In [None]:
# Putting response variable to y

y = df['Converted']

y.head()

In [None]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## Feature Scaling

In [None]:
list(df.columns)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

X_train[['Total_Visits','Time_On_Website','Page_Views']] = scaler.fit_transform(X_train[['Total_Visits','Time_On_Website','Page_Views']])

X_train.head()

## Feature Selection Using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=10) # running RFE with 10 variables as output
rfe = rfe.fit(X_train, y_train)

In [37]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

col

In [None]:
X_train.columns[~rfe.support_]

## Model Building

In [None]:
import statsmodels.api as sm
from sklearn import metrics

# Check for the VIF values of the feature variables
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Occupation_housewife')
col

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Lead_Profile_dual specialization student')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Last_Activity_approached upfront')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Getting the predicted values on the train set

y_train_pred = res.predict(X_train_sm)
y_train_pred = y_train_pred.values.reshape(-1)

y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index

y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

from sklearn import metrics

# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

The final accuracy of the model on the training set with 0.5 threshold is 75.7%.

## Plotting the ROC Curve

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Converted_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

The AUC is 0.80, which is decent, and the curve is not very close to the diagonal.

## Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

**From the plot above, 0.42 is the optimum point to take as the cutoff probability.**

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.42 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion2

In [None]:
TP = confusion2[1,1] # true positives
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [38]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicting conversion when the lead has not converted

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Precision and Recall Tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

**From the plot above, 0.52 is the optimum point to take as the cutoff probability.**

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.52 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion3 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion3

In [None]:
TP = confusion3[1,1] # true positives
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicting conversion when the lead has not converted

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [39]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Making predictions on the test set

In [None]:
X_test[['Total_Visits','Time_On_Website','Page_Views']] = scaler.transform(X_test[['Total_Visits','Time_On_Website','Page_Views']])

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (the predicted probabilities) to a dataframe

y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head

y_pred_1.head()

In [None]:
# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

In [None]:
# Copying the index into a CustID column

y_test_df['CustID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 

y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

In [None]:
# Rearranging the columns
# (reindex_axis was removed in pandas 1.0; reindex(columns=...) replaces it)

y_pred_final = y_pred_final.reindex(columns=['CustID','Converted','Converted_Prob'])

In [None]:
# Let's see the head of y_pred_final

y_pred_final.head()

We will go with the 0.52 threshold, which yielded better metrics on the train set.

In [40]:
y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.52 else 0)
y_pred_final['lead_score'] = y_pred_final.Converted_Prob * 100

In [None]:
y_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
confusion4 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )

confusion4

In [None]:
TP = confusion4[1,1] # true positive 
TN = confusion4[0,0] # true negatives
FP = confusion4[0,1] # false positives
FN = confusion4[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion for a lead that did not convert

print(FP/ float(TN+FP))

In [None]:
# Precision

confusion4[1,1]/(confusion4[0,1]+confusion4[1,1])

In [None]:
# Recall

confusion4[1,1]/(confusion4[1,0]+confusion4[1,1])

**Detailed comments and observations on these metrics are provided at the bottom of this notebook and are used for model comparison.**
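
The metric cells above repeat the same arithmetic on the four confusion-matrix counts for every threshold and every model. As a minimal sketch, that arithmetic could be gathered into one helper; `classification_summary` is a hypothetical name, and the counts passed in below are made up for illustration, not results from this notebook.

```python
def classification_summary(tn, fp, fn, tp):
    """Derive the metrics used in this notebook from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # same quantity as recall
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
summary = classification_summary(tn=80, fp=20, fn=10, tp=90)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With scikit-learn, the four counts for a binary problem can be unpacked with `tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()`, which matches the `[0,0]`, `[0,1]`, `[1,0]`, `[1,1]` indexing used in the cells above.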

# Model 3 (RFE 20)

## Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X

X = df.drop(['Converted'], axis=1)

X.head()

In [None]:
# Putting response variable to y

y = df['Converted']

y.head()

In [None]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [41]:
scaler = StandardScaler()

X_train[['Total_Visits','Time_On_Website','Page_Views']] = scaler.fit_transform(X_train[['Total_Visits','Time_On_Website','Page_Views']])

X_train.head()

In [None]:
X_train[['Total_Visits','Time_On_Website','Page_Views']].describe()

## Feature Selection Using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [42]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=20) # running RFE with 20 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

col

In [None]:
X_train.columns[~rfe.support_]

## Model Building

In [None]:
import statsmodels.api as sm
from sklearn import metrics

# Check for the VIF values of the feature variables
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
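
For intuition about what the VIF column measures: the textbook definition is `VIF_i = 1 / (1 - R_i^2)`, where `R_i^2` comes from regressing feature `i` on the other features. With only two predictors and an intercept, `R_i^2` is the squared Pearson correlation between them, which makes a dependency-free sketch possible. The function names and the toy columns below are hypothetical, and values from `variance_inflation_factor` on a matrix without a constant column may differ from this textbook form.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Made-up columns for illustration; they are not taken from the Leads data.
toy_visits = [1, 2, 3, 4, 5, 6]
toy_page_views = [2, 1, 4, 3, 6, 5]

vif_value = vif_two_predictors(toy_visits, toy_page_views)
print(round(vif_value, 2))
```

A VIF of 1 means the feature is uncorrelated with the others; the more a feature can be predicted from the rest, the larger its VIF becomes.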

In [None]:
col = col.drop('Lead_Profile_dual specialization student')
col

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Occupation_housewife')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Last_Activity_approached upfront')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Lead_Source_facebook')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Last_Notable_Activity_had a phone conversation')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Occupation_other')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Last_Notable_Activity_unreachable')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Lead_Source_welingak website')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col = col.drop('Lead_Profile_lateral student')

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logmodel = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logmodel.fit()
res.summary()

In [None]:
model_accuracy(res, X_train_sm, y_train)

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Getting the predicted values on the train set

y_train_pred = res.predict(X_train_sm)
y_train_pred = y_train_pred.values.reshape(-1)

y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['CustID'] = y_train.index

y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

from sklearn import metrics

# Confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

The final accuracy of the model on the training set with 0.5 threshold is 81%.

## Plotting the ROC Curve

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

The AUC is 0.89, which is decent, and the curve is not close to the diagonal.
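
To make the AUC figure more concrete: it can be read as the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead, with ties counted as half. The sketch below computes it that way by brute force on four made-up points; `pairwise_auc` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def pairwise_auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Four made-up observations for illustration.
auc_true = [0, 0, 1, 1]
auc_prob = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(auc_true, auc_prob)
print(toy_auc)
```

A curve hugging the diagonal corresponds to a value near 0.5, meaning the scores rank converted and non-converted leads no better than chance.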

## Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

**From the plot above, 0.35 is the optimal cutoff probability.**
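
A frequent way to read the accuracy, sensitivity and specificity plot is to pick the cutoff where sensitivity and specificity are closest to each other. It is an assumption that 0.35 was chosen on that basis; the sketch below shows the rule on made-up data over the same 0.0 to 0.9 grid used above. Both helper names are hypothetical.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) when scores above `cutoff` are labelled 1."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(y_true, y_prob, cutoffs):
    """Pick the cutoff where sensitivity and specificity are closest."""
    def gap(c):
        sensi, speci = sensitivity_specificity(y_true, y_prob, c)
        return abs(sensi - speci)
    return min(cutoffs, key=gap)


# Made-up labels and scores for illustration (not the Leads data).
cut_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
cut_prob = [0.05, 0.12, 0.18, 0.22, 0.25, 0.40, 0.45, 0.55, 0.70, 0.90]
cut_grid = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9

chosen = balanced_cutoff(cut_true, cut_prob, cut_grid)
print(chosen, sensitivity_specificity(cut_true, cut_prob, chosen))
```

A finer grid, for example steps of 0.01, would be needed to land on a value such as 0.35, since the 0.1 grid can only return multiples of 0.1.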

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.35 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion for a lead that did not convert

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Precision and Recall Tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

**From the plot above, 0.42 is the optimal cutoff probability.**

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.42 else 0)
y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob * 100
y_train_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion3 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )

confusion3

In [None]:
TP = confusion3[1,1] # true positive 
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion for a lead that did not convert

print(FP/ float(TN+FP))

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Making predictions on the test set

In [None]:
X_test[['Total_Visits','Time_On_Website','Page_Views']] = scaler.transform(X_test[['Total_Visits','Time_On_Website','Page_Views']])

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (the predicted probabilities) to a dataframe

y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head

y_pred_1.head()

In [None]:
# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

In [None]:
# Copying the index into a CustID column

y_test_df['CustID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 

y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

In [None]:
# Rearranging the columns
# (reindex_axis was removed in pandas 1.0; reindex(columns=...) replaces it)

y_pred_final = y_pred_final.reindex(columns=['CustID','Converted','Converted_Prob'])

In [None]:
# Let's see the head of y_pred_final

y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.42 else 0)
y_pred_final['lead_score'] = y_pred_final.Converted_Prob * 100

In [None]:
y_pred_final.head()

### Metrics

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
confusion4 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )

confusion4

In [None]:
TP = confusion4[1,1] # true positive 
TN = confusion4[0,0] # true negatives
FP = confusion4[0,1] # false positives
FN = confusion4[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion for a lead that did not convert

print(FP/ float(TN+FP))

In [None]:
# Precision

confusion4[1,1]/(confusion4[0,1]+confusion4[1,1])

In [None]:
# Recall

confusion4[1,1]/(confusion4[1,0]+confusion4[1,1])

Detailed comments and observations on these metrics are provided at the bottom of this notebook and are used for model comparison.

# Model Comparison

<img src="Metrics1.png" width="1000" height="1000">

**Model 1**

The AUC is decent, and the model uses few variables that are easy to interpret. Accuracy, sensitivity and precision are comparable with Model 3.

Precision is ~80%, which is the target set by the CEO.

**Model 2**

The AUC is lower than for Models 1 and 3. The model uses few variables that are easy to interpret, but its accuracy and sensitivity are significantly lower than those of Models 1 and 3.

Precision is ~80%, which is the target set by the CEO.

**Model 3**

The AUC is good and comparable to Model 1, but the model uses more variables than Models 1 and 2. Accuracy, sensitivity and precision are comparable with Model 1.

Precision is ~80%, which is the target set by the CEO.

# Verdict

Model 1 is the winner: it is simple, uses fewer variables, and has a decent AUC and evaluation metrics, with precision at ~80%.