## Lead Scoring with Logistic Regression
##### Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.metrics import classification_report
%matplotlib inline
pd.set_option('display.max_columns',500)

## [Step 1. Reading and Understanding the Data](#step1)

## [Step 2. Data Cleaning and EDA](#step2)
[2.1 Missing value check](#step2.1)<br>
[2.2 Cleaning and Visualizing categorical variables](#step2.2)<br>
[2.3 Cleaning and Visualizing numerical variables](#step2.3)<br>
[2.4 Outlier Treatment](#step2.4)<br>
[2.5 Check for data type conversion](#step2.5)<br>

## [Step 3. Preprocessing and Data Preparation](#step3)
[3.1 Categorizing variables](#step3.1)<br>
[3.2 Creating dummy variables](#step3.2)<br>
[3.3 Train test split](#step3.3)<br>
[3.4 Scaling data](#step3.4)<br>

## [Step 4. Model Building](#step4)
[Step 4.1 Build Logistic Model](#step4.1)<br>
[Step 4.2 Prediction and evaluation on Training Set](#step4.2)<br>
[Step 4.3 Prediction and evaluation on Testing Set](#step4.3)<br>

## [Step 5. Final Analysis](#Step5)

The above steps will be covered to build the model. 

## Step 1. Reading and Understanding the Data<a id='step1'></a>

In [None]:
lead_data = pd.read_csv('Leads.csv')
lead_data.head()

In [None]:
lead_data.info()

In [None]:
lead_data.shape

In [None]:
lead_data.describe().T

Out of 37 columns, the bottom 5 from the above list are numeric columns.

In [None]:
#Checking data imbalance
lead_data['Converted'].mean()

#### Only 38% of the leads were converted

In [None]:
#lead_data.columns

In [None]:
#renaming lengthy names of columns
lead_data = lead_data.rename({'Total Time Spent on Website':'Website Time','What is your current occupation': 'Occupation','What matters most to you in choosing a course':'Reason','A free copy of Mastering The Interview':'Free copy required'}, axis=1)
lead_data.columns

## Step 2. Data Cleaning and EDA<a id='step2'></a>

#### 2.1 Missing value check<a id=step2.1></a>


In [None]:
# sum it up to check how many rows have all missing values
lead_data.isnull().all(axis=1).sum()

In [None]:
len(lead_data[lead_data.isnull().sum(axis=1)>5].index)


In [None]:
# % of the missing values (column-wise)
round(100*(lead_data.isnull().sum()/len(lead_data.index)), 2)


#### We do see a lot of missing values, and most columns have more than 20% missing values. Let's divide the columns into categorical and numerical and handle them one by one.

#### 2.2 Cleaning and Visualizing categorical variables<a id='step2.2'></a>

In [None]:
## Dropping columns having > 45% missing values
lead_data.drop(['Asymmetrique Activity Index','Asymmetrique Profile Index','Asymmetrique Activity Score','Asymmetrique Profile Score','Lead Quality'],axis=1,inplace=True)

In [None]:
lead_data.info()

In [None]:
cat_cols = ['Lead Origin', 'Lead Source','Do Not Email', 'Do Not Call', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about X Education',
       'Occupation',
       'Reason', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City','I agree to pay the amount through cheque',
       'Free copy required', 'Last Notable Activity']

In [None]:
len(cat_cols)

In [None]:
#Function to check total counts, skewness and importance of a categorical column based on conversion rate. 
def check_count_conversion_rate(X):
    #checking counts of col
    col_counts = pd.DataFrame(lead_data[X].value_counts()).reset_index()
    col_counts.columns = [X,'Counts']
    col_counts['Total%'] = col_counts['Counts']/len(lead_data.index)
    #checking conversion rate by col
    groupby_col = pd.DataFrame(lead_data.groupby(X)['Converted'].mean()).reset_index()

    col_counts_percentage = col_counts.merge(groupby_col,how='inner',on=X)
    return col_counts_percentage

In [None]:
def check_2_col_count_conversion_rate(X):
    # counts of each combination of the columns in X
    col_counts = pd.DataFrame(lead_data.groupby(X)[X[1]].count())
    col_counts.columns = ['Counts']
    col_counts = col_counts.reset_index()
    # conversion rate for each combination
    groupby_col = pd.DataFrame(lead_data.groupby(X)['Converted'].mean()).reset_index()
    col_counts_percentage = pd.merge(col_counts, groupby_col, how='left', on=X)
    return col_counts_percentage

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)


The value `Select` in categorical columns is treated as null. Therefore, we will convert `Select` to `NaN` before checking the actual missing-value percentage of a particular column.
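A minimal sketch of that conversion on a small made-up frame (the `toy` frame and its values are illustrative, not taken from `Leads.csv`): replace the placeholder with `NaN` first, then recompute the column-wise missing percentage.

```python
import numpy as np
import pandas as pd

# Toy data standing in for lead_data; the values are invented for illustration.
toy = pd.DataFrame({
    'Specialization': ['Select', 'Finance', None, 'Select'],
    'City': ['Mumbai', 'Select', 'Mumbai', 'Thane'],
})

# Treat the placeholder 'Select' as a genuine missing value.
toy_clean = toy.replace('Select', np.nan)

# Column-wise missing percentage after the conversion.
missing_pct = (100 * toy_clean.isnull().mean()).round(2)
print(missing_pct)
```

Doing the replacement once on the whole frame avoids repeating the same step column by column, at the cost of also touching any column where `Select` might be a legitimate value.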

#### Checking skewed categorical Variable

In [None]:
def check_skewness(X):
    count_of_values = lead_data[X].value_counts().values
    count_of_values = count_of_values/len(lead_data)
    col_of_values = list(lead_data[X].value_counts().index)
    dict_data = {}
    for i in range(0,len(col_of_values)):
        dict_data[col_of_values[i]] = count_of_values[i]
    df = pd.DataFrame(data=dict_data,index=[X])
    return df

In [None]:
plt.figure(figsize=[20,20])
skewness = []
for i in range(0,len(cat_cols)):
    ax = plt.subplot(6,5,i+1)
    skewness.append(check_skewness(cat_cols[i]))
    skewness[i].plot(kind='bar',stacked=True,ax = ax)
    ax.legend().set_visible(False)
    plt.xticks(rotation=0)
    ax.set_ylim([0.0,1.0])
plt.show()

#check_skewness(cat_cols[4]).plot.bar(stacked=True)

#### Inference:
- We see a lot of columns that are skewed, with a single value appearing more than 90% of the time
- Some bars do not reach 100% because of the presence of null values.
- It's better to delete highly skewed columns as they will not be helpful in predictions.
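The skewed columns can also be picked out programmatically rather than read off the plot. The sketch below is one possible way to do it; the helper `dominant_share`, the toy frame, and the 0.90 threshold are assumptions made for illustration.

```python
import pandas as pd

def dominant_share(series):
    """Share of rows taken by the most frequent value (nulls stay in the denominator)."""
    counts = series.value_counts()
    return counts.iloc[0] / len(series) if len(counts) else 0.0

# Toy frame; column names echo the notebook but the values are invented.
toy = pd.DataFrame({
    'Do Not Call': ['No'] * 19 + ['Yes'],
    'Lead Origin': ['API'] * 10 + ['Landing Page Submission'] * 10,
})

threshold = 0.90  # hypothetical cut-off matching the "90% of the time" rule
flagged = [c for c in toy.columns if dominant_share(toy[c]) > threshold]
print(flagged)
```

Keeping nulls in the denominator matches the bar chart above, where bars fall short of 100% when values are missing.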

In [None]:
#Deleting cols where 90% of the values are same
skewed_cols = ['Do Not Email', 'Do Not Call',
       'Reason', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'I agree to pay the amount through cheque']
lead_data.drop(skewed_cols,axis=1,inplace=True)
lead_data.shape

In [None]:
for i in skewed_cols:
    cat_cols.remove(i)
cat_cols

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)


In [None]:
#deleting cols where missing % is greater than 45%
# missing_45 = ['Lead Quality','Asymmetrique Activity Index','Asymmetrique Profile Index']
# lead_data.drop(missing_45,axis=1,inplace=True)
# print("final shape = {}".format(lead_data.shape))
# for i in missing_45:
#     cat_cols.remove(i)
# cat_cols

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)


In [None]:
100*(len(lead_data[lead_data[cat_cols].isnull().sum(axis=1)>6].index)/len(lead_data.index))
len(lead_data[lead_data[cat_cols].isnull().sum(axis=1)>4].index)


#### Handling Lead Source

In [None]:
#checking number of missing values
lead_data['Lead Source'].isnull().sum()

In [None]:
check_count_conversion_rate('Lead Source').sort_values(by='Total%',ascending = False)

#### Inferences:
- Google is written as google in 5 values.
- There are 21 different values in total for the column
- A few of them were used only once. We will combine them into `Others` later.

In [None]:
#google to Google
lead_data.loc[(lead_data['Lead Source'] == 'google'),['Lead Source']] = 'Google'
lead_data['Lead Source'].value_counts()


In [None]:
check_count_conversion_rate('Lead Source').sort_values(by='Converted',ascending = False)

#### Inferences:
- Welingak Website has the maximum conversion rate followed by Reference
- Google and Direct Traffic have the highest counts and a good conversion rate of ~35%

Let's check the relation between `Lead Origin` and `Lead Source`.

In [None]:
check_count_conversion_rate('Lead Origin')

Checking the counts of `Lead Source` based on `Lead Origin`

In [None]:
check_2_col_count_conversion_rate(['Lead Origin','Lead Source'])

#### We notice that Lead Source depends strongly on Lead Origin. Therefore, we replace the missing values of Lead Source with the mode of Lead Source for the corresponding Lead Origin

    Lead Origin                 Mode of Lead Source
    - API                       Olark Chat
    - Landing Page Submission   Direct Traffic
    - Lead Add Form             Reference
    - Lead Import               Facebook
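The group-wise mode can also be derived from the data instead of being typed in by hand. This is a sketch on invented rows, not the notebook's own method; `mode_by_origin` and the toy frame are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Lead Origin': ['API', 'API', 'API', 'Lead Add Form', 'Lead Add Form', 'Lead Add Form'],
    'Lead Source': ['Olark Chat', 'Olark Chat', None, 'Reference', None, 'Reference'],
})

# Most frequent Lead Source within each Lead Origin (value_counts ignores nulls).
mode_by_origin = toy.groupby('Lead Origin')['Lead Source'].agg(
    lambda s: s.value_counts().index[0]
)

# Fill only the missing entries, each with the mode of its own group.
filled = toy['Lead Source'].fillna(toy['Lead Origin'].map(mode_by_origin))
print(filled.tolist())
```

A group whose values are all missing has no mode and would make the lambda fail, so such groups need separate handling.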

In [None]:
lead_data.loc[(pd.isnull(lead_data['Lead Source'])),['Lead Origin','Lead Source']]

In [None]:
# Mode of Lead Source per Lead Origin; 'Quick Add Form' stays missing and that row is dropped in the next cell
origin_source_mode_dict = {'API': 'Olark Chat', 'Landing Page Submission': 'Direct Traffic', 'Lead Add Form': 'Reference', 'Lead Import': 'Facebook', 'Quick Add Form': np.nan}
missing_source = lead_data['Lead Source'].isnull()
lead_data.loc[missing_source, 'Lead Source'] = lead_data.loc[missing_source, 'Lead Origin'].map(origin_source_mode_dict)
check_2_col_count_conversion_rate(['Lead Origin','Lead Source'])

In [None]:
#Deleting one row where Lead Origin is Quick Add form and Lead Source is Nan
lead_data = lead_data.loc[lead_data['Lead Origin'] != 'Quick Add Form']
lead_data.shape

#### Handling Last Activity

In [None]:
#checking number of missing values
lead_data['Last Activity'].isnull().sum()

In [None]:
lastActivity = check_skewness('Last Activity')
#print(df)
ax = plt.subplot(1,1,1)
lastActivity.plot(kind='bar',stacked=True,ax = ax)
plt.xticks(rotation=0)
ax.set_ylim([0.0,1.0])
ax.legend(loc='upper center', bbox_to_anchor=(1.45, 1.1), shadow=True, ncol=1)

In [None]:
lastActivity.T

- 37% of rows have `Email Opened` as the value of `Last Activity`
- Replacing 103 missing values of `Last Activity` with `Email Opened`

In [None]:
lead_data.loc[lead_data['Last Activity'].isnull(),['Last Activity']] = 'Email Opened'
check_skewness('Last Activity').T

In [None]:
check_count_conversion_rate('Last Activity')

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)

#### Handling Country

In [None]:
check_skewness('Country').T

In [None]:
print('Missing country values: {}'.format(lead_data['Country'].isnull().sum()))
print('Rows where country and city both are missing: {}'.format(len(lead_data.loc[(lead_data['Country'].isnull() & lead_data['City'].isnull())])))

In [None]:
check_count_conversion_rate('Country')

#### Country seems to be an important column, as countries like India, US and Singapore have conversion rates greater than 30%.

In [None]:
check_count_conversion_rate('City')

Replacing missing values of `Country` with `unknown` where the `City` value is also missing

In [None]:
lead_data.loc[(lead_data['Country'].isnull() & lead_data['City'].isnull()),['Country']] = 'unknown'

In [None]:
len(lead_data.loc[(lead_data['Country'].isnull() & lead_data['City'].isnull())])

In [None]:
def handle_country(x):
    if x['City'] in ['Mumbai' ,'Thane & Outskirts','Other Cities of Maharashtra']:
        return 'India'
    else:
        return 'unknown'
    
lead_data.loc[lead_data['Country'].isnull(),['Country']] = lead_data.loc[lead_data['Country'].isnull(),['Country','City']].apply(lambda x: handle_country(x),axis=1)

In [None]:
check_count_conversion_rate('Country')

`unknown` is also counted as a missing value.

#### Out of 2461 missing values of Country, 2131 are still unknown; the rest are substituted with 'India', as the City column for these rows has Indian references

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)

#### Handling Specialization

In [None]:
check_count_conversion_rate('Specialization')

In [None]:
#Value `Select` acts as a missing value. First let us replace missing values with Select
lead_data.loc[pd.isnull(lead_data['Specialization']),['Specialization']] = 'Select'
check_count_conversion_rate('Specialization').sort_values(by='Total%',ascending=False)

#### From the above table we can see that the actual missing % of Specialization = 36.58

In [None]:
specialization_dist = check_skewness('Specialization')
ax = plt.subplot(1,1,1)
specialization_dist.plot(kind='bar',stacked=True,ax = ax)
plt.xticks(rotation=0)
ax.set_ylim([0.0,1.0])
ax.legend(loc='upper center', bbox_to_anchor=(1.45, 1.1), shadow=True, ncol=1)

#### The largest share of values is still missing (recorded as `Select`). As the percentage is large, we will replace these missing values with `Others`

In [None]:
#### replacing select with Others

lead_data.loc[lead_data['Specialization']=='Select','Specialization']='Others'
check_skewness('Specialization').T

#### Handling `How did you hear about X Education`

In [None]:
check_count_conversion_rate('How did you hear about X Education')

In [None]:
lead_data.loc[pd.isnull(lead_data['How did you hear about X Education']),['How did you hear about X Education']] = 'Select'
check_count_conversion_rate('How did you hear about X Education') .sort_values(by='Total%',ascending=False)    

#### 78% of the values for `How did you hear about X Education` are missing. Therefore deleting this column.

In [None]:
lead_data.drop(['How did you hear about X Education'],axis=1,inplace=True)
print(lead_data.shape)

cat_cols.remove('How did you hear about X Education')
cat_cols

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)


#### Handling Occupation               

In [None]:
lead_data['Occupation'].isnull().sum()

In [None]:
check_count_conversion_rate('Occupation') .sort_values(by='Total%',ascending=False)

#### Inferences:
- 60% of leads are Unemployed and they have a good conversion rate of 43%
- Though the number of working professionals is small, their conversion rate is higher at 91%
- The missing percentage is 29%. Replacing with the mode might skew the data.
- Therefore, replacing null values in Occupation with the value `Other`

In [None]:
lead_data.loc[pd.isnull(lead_data['Occupation']),['Occupation']] = 'Other'
check_count_conversion_rate('Occupation') .sort_values(by='Total%',ascending=False)

#### Handling Tags                     

In [None]:
lead_data['Tags'].isnull().sum()

In [None]:
check_count_conversion_rate('Tags').sort_values(by='Total%',ascending=False)

In [None]:
lead_data.loc[pd.isnull(lead_data['Tags']),['Tags']] = 'Others'
check_count_conversion_rate('Tags') .sort_values(by='Total%',ascending=False)

In [None]:
#exp=check_count_conversion_rate('Tags') .sort_values(by='Total%',ascending=False)
#exp.to_csv('analysis.csv')

#### Handling Lead Profile 

In [None]:
lead_data['Lead Profile'].isnull().sum()

In [None]:
check_count_conversion_rate('Lead Profile') .sort_values(by='Total%',ascending=False)

In [None]:
#replacing nans with select to get total missing percentage
lead_data.loc[pd.isnull(lead_data['Lead Profile']),['Lead Profile']] = 'Select'
check_count_conversion_rate('Lead Profile') .sort_values(by='Total%',ascending=False)

#### 74% of the data is missing for column `Lead Profile`, therefore, deleting the column

In [None]:
lead_data.drop(['Lead Profile'],axis=1,inplace=True)
print(lead_data.shape)

cat_cols.remove('Lead Profile')
cat_cols

#### Handling City                      

In [None]:
lead_data['City'].isnull().sum()

In [None]:
check_count_conversion_rate('City') .sort_values(by='Total%',ascending=False)

In [None]:
#replacing nans with select to get total count
lead_data.loc[pd.isnull(lead_data['City']),['City']] = 'Select'
check_count_conversion_rate('City') .sort_values(by='Total%',ascending=False)

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x='City',hue='Converted',data=lead_data)

#### Inference
- Close to 40% of the values are missing
- Each value has a nearly equal probability of conversion (~40%), which might not help in prediction
- As the City column is not adding value, deleting the column

In [None]:
lead_data.drop(['City'],axis=1,inplace=True)
print(lead_data.shape)

cat_cols.remove('City')
cat_cols

#### `Last Notable Activity` is an intermediate column that is updated while the sales representative is in touch with the lead. `Last Activity` records the last activity when the lead was closed from the sales team's side. Therefore, using only the `Last Activity` column, as it reflects the final state when the lead was closed

In [None]:
lead_data.drop('Last Notable Activity',axis=1,inplace=True)
cat_cols.remove('Last Notable Activity')

In [None]:
round(100*(lead_data[cat_cols].isnull().sum()/len(lead_data.index)), 2)


In [None]:
lead_data.shape

#### Till now 
- only 1 row deleted
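- 23 columns deleted (14 skewed, 5 with more than 45% missing values, 2 with more than 70% missing once `Select` is counted, 1 uninformative (`City`), 1 redundant (`Last Notable Activity`))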

#### 2.3 Cleaning and Visualizing numerical variables<a id='step2.3'></a>

In [None]:
lead_data['TotalVisits'].isnull().sum()

In [None]:
num_cols = ['TotalVisits','Website Time','Page Views Per Visit']
num_cols

#### Total Visits 

In [None]:
### Visualising Total visits data 
lead_data['TotalVisits'].value_counts()
#lead_data['TotalVisits'].describe()

In [None]:
### Analysing Total visits data 
lead_data['TotalVisits'].median() ##3.0
lead_data['TotalVisits'].mean() ##3.45
## Since the Total visits cannot be fractional,replacing with 3 considering both mean and median values
lead_data.loc[pd.isnull(lead_data['TotalVisits']),'TotalVisits']=lead_data['TotalVisits'].median()

In [None]:
lead_data['TotalVisits'].isnull().sum()

#### Page Views Per Visit

In [None]:
lead_data['Page Views Per Visit'].isnull().sum()

In [None]:
lead_data['Page Views Per Visit'].value_counts()

In [None]:
## Analysing Page views per visit data
lead_data['Page Views Per Visit'].median() ## 2.0
lead_data['Page Views Per Visit'].mean() ## 2.36
## TotalVisits was already imputed above, so only Page Views Per Visit is filled here (with the median)
lead_data.loc[pd.isnull(lead_data['Page Views Per Visit']),'Page Views Per Visit']=lead_data['Page Views Per Visit'].median()

In [None]:
lead_data['Page Views Per Visit'].isnull().sum()

In [None]:
lead_data.info()

In [None]:
lead_data['Lead Source'].value_counts()

#### 2.4 Outlier Treatment<a id='step2.4'></a>

In [None]:
plt.figure(figsize=(15,15))
for col in num_cols:
    plt.subplot(2,3,num_cols.index(col)+1)
    sns.boxplot(y=col,data=lead_data,palette='rainbow')

#### Observation:
Total Visits and Page views per visit have Outliers
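One common way to treat such outliers is to cap values at a high percentile. The sketch below shows the idea on invented numbers; it derives the cap with `quantile` and applies it with `clip`, which is an assumption about how the cap could be computed rather than a copy of the cells that follow.

```python
import pandas as pd

# Invented visit counts with one extreme value.
visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 250], name='TotalVisits')

cap = visits.quantile(0.99)       # 99th percentile of the observed values
capped = visits.clip(upper=cap)   # values above the cap are pulled down to it

print(cap, capped.max())
```

With only ten rows the 99th percentile is still dominated by the extreme value; on a dataset of several thousand rows the cap sits much closer to the bulk of the data.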

In [None]:
lead_data[['TotalVisits','Page Views Per Visit']].describe(percentiles=(0.95,0.99))

In [None]:
lead_data['TotalVisits'].value_counts(ascending=False)

## capping at 99% value - 17

lead_data.loc[lead_data['TotalVisits']>17,['TotalVisits']]=17

In [None]:
lead_data['TotalVisits'].value_counts(ascending=False)

In [None]:
## Capping at 99% value - 9

lead_data.loc[lead_data['Page Views Per Visit']>9,'Page Views Per Visit']=9
lead_data['Page Views Per Visit'].value_counts()

In [None]:
## Replotting the box plots

plt.figure(figsize=(15,15))
for col in num_cols:
    plt.subplot(2,3,num_cols.index(col)+1)
    sns.boxplot(y=col,data=lead_data)


#### 2.5 Check for data type conversion<a id='step2.5'></a>

In [None]:
lead_data.info()

#### Observation : 
No datatype conversion required

## Step 3. Preprocessing and Data Preparation<a id='step3'></a>

#### 3.1 Categorizing variables<a id='step3.1'></a>

In [None]:
#Lead source grouping into others category
values=lead_data['Lead Source'].value_counts()
#type(values)
val=values[values.lt(7)].index
#val
lead_data.loc[lead_data['Lead Source'].isin(val),'Lead Source']='Others'
lead_data['Lead Source'].value_counts()

In [None]:
#country categorization
check_count_conversion_rate('Country')

There are 37 different countries and many have very few rows. Clubbing them together based on continent.

In [None]:
asia=['Singapore',
'Saudi Arabia',
'Qatar',
'Bahrain', 
'Hong Kong', 
'Oman', 
'Kuwait',
'Philippines', 
'Bangladesh', 
'Asia/Pacific Region',
'China', 
'Malaysia',  
'Russia',
'Vietnam',
'Indonesia',
'Sri Lanka',
'United Arab Emirates']

africa= ['Kenya',
'South Africa',
'Nigeria',
'Uganda',
'Ghana',
'Tanzania',
'Liberia']

europe = ['United Kingdom',
'France',
'Germany',
'Sweden',
'Netherlands',
'Belgium',
'Italy',
'Switzerland',
'Denmark']

north_america = ['United States','Canada']

def categorize_country(x):
    if x in asia:
        return 'Asia'
    elif x in africa:
        return 'Africa'
    elif x in europe:
        return 'Europe'
    elif x in north_america:
        return 'North_america'
    else:
        return x

lead_data['country_categorized'] = lead_data['Country'].apply(lambda x: categorize_country(x))

In [None]:
check_count_conversion_rate('country_categorized')

In [None]:
lead_data[['Country','country_categorized']].head(10)

In [None]:
lead_data['Country'] = lead_data['country_categorized']
lead_data['Country'].value_counts()

In [None]:
#Tags categorization
check_count_conversion_rate('Tags')

There are 27 different values of Tags. We will bucket them into the following categories based on business knowledge:

- Interested
- Busy
- Probable
- Lost


In [None]:
interested = ['Will revert after reading the email','Interested in other courses',
              'Closed by Horizzon',
              'Want to take admission but has financial problems',
              'Still Thinking',  
              'In confusion whether part time or DLP',
              'Interested in Next batch',
              'Shall take in the next coming month',
              'University not recognized']
lost = ['Lost to EINS','invalid number',
        'Diploma holder (Not Eligible)',
        'number not provided',
        'wrong number given',
        'Lost to Others',
        'Already a student']
busy = ['Busy',
        'opp hangup']
probable = ['Not doing further education',
            'Interested  in full time MBA',
            'Graduation in progress',
            'in touch with EINS','Lateral student',
           'Recognition issue (DEC approval)']
def categorize_tags(x):
    if x in interested:
        return 'interested'
    elif x in lost:
        return 'lost'
    elif x in busy:
        return 'busy'
    elif x in probable:
        return 'probable'
    else:
        return x

lead_data['tags_categorized'] = lead_data['Tags'].apply(lambda x: categorize_tags(x))

In [None]:
lead_data['tags_categorized'].value_counts()
check_count_conversion_rate('tags_categorized')

In [None]:
lead_data[['Tags','tags_categorized']].head(10)

In [None]:
lead_data['Tags'] = lead_data['tags_categorized']
lead_data['Tags'].value_counts()

In [None]:
lead_data.drop(['tags_categorized','country_categorized'],axis=1,inplace=True)
lead_data.columns

#### 3.2 Creating dummy variables<a id='step3.2'></a>

#### Converting Binary Columns

In [None]:
### Creating dummies for binary columns with Yes/No values
binary_cols=['Free copy required']
binary_cols

In [None]:
for col in binary_cols:
    lead_data[col]=lead_data[col].map({'Yes':1,'No':0})

In [None]:
lead_data['Free copy required'].value_counts()

In [None]:
cat_cols

In [None]:
cat_cols.remove('Free copy required')
cols_with_others = ['Lead Source','Tags','Specialization','Country','Occupation']
for col in cat_cols:
    if col not in cols_with_others:
        dummies=pd.get_dummies(lead_data[col],prefix=col,drop_first=True)
        lead_data=pd.concat([lead_data,dummies],axis=1)

In [None]:
## Deleting others value from dummy variables
for col in cols_with_others:
    dummies=pd.get_dummies(lead_data[col],prefix=col)
    lead_data=pd.concat([lead_data,dummies],axis=1)    

lead_data.drop(['Lead Source_Others','Tags_Others','Specialization_Others','Country_unknown','Occupation_Other'],axis=1,inplace=True)

In [None]:
lead_data.info(verbose=True)

In [None]:
cat_cols

#### Dropping redundant columns - final step of cleaning

In [None]:
lead_data.drop(cat_cols,inplace=True,axis=1)
lead_data.drop('Prospect ID',inplace=True,axis=1) ## some more to be dropped

In [None]:
y=lead_data.pop('Converted')
lead_data.columns

#### 3.3 Train test split<a id='step3.3'></a>

In [None]:
X_train,X_test,y_train,y_test=train_test_split(lead_data,y,train_size=0.7,test_size=0.3,random_state=100)

In [None]:
X_train.info()

#### 3.4 Scaling data<a id='step3.4'></a>

In [None]:
num_cols

In [None]:
X_train.shape

In [None]:
scaling=StandardScaler()
X_train[num_cols]=scaling.fit_transform(X_train[num_cols])
X_train.head()

In [None]:
X_train.shape

## Step 4. Model Building<a id='step4'></a>

#### Step 4.1  Build a Logistic Model<a id='step4.1'></a>

#### Feature selection using RFE

In [None]:
## Running the first time
logm=sm.GLM(y_train,sm.add_constant(X_train),family=sm.families.Binomial())
logm.fit().summary()

In [None]:
def deduce_best_model(y_train,X_train_rfe,drop_column=''):
    
    #X_train_rfe_trials=X_train_rfe
    if drop_column!='':
        X_train_rfe=X_train_rfe.drop(drop_column,axis=1)
    lm=sm.GLM(y_train,X_train_rfe,family=sm.families.Binomial()).fit()
    print(lm.summary())
    
    X_train_new=X_train_rfe.drop(['const'],axis=1)
#X_train_new
    print()
    print()
    
    print("-------------------------VIF Results-------------------------")
    vif=pd.DataFrame()
    temp=X_train_new
    vif['Columns']=temp.columns
    vif['VIF']=[variance_inflation_factor(temp.values,i) for i in range(temp.shape[1])]
    vif['VIF']=round(vif['VIF'],2)
    vif=vif.sort_values(by='VIF',ascending=False)
    print(vif)
    return X_train_rfe,lm
    

In [None]:
#plt.figure(figsize=(50,50))
#sns.heatmap(lead_data.corr(),annot = True, cmap="YlGnBu")

In [None]:
log_reg=LogisticRegression()
rfe=RFE(log_reg,n_features_to_select=20)
rfe=rfe.fit(X_train,y_train)
rfe.ranking_

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
cols=X_train.columns[rfe.support_]

In [None]:
cols

In [None]:
log_reg.fit(X_train,y_train)


In [None]:
X_train_rfe=X_train[cols]
X_train_rfe=sm.add_constant(X_train_rfe)

In [None]:
cols

#### Checking correlations on columns selected by RFE

In [None]:
plt.figure(figsize=(50,50))
sns.set(font_scale=2.5)
sns.heatmap(X_train[cols].corr(),annot = True, cmap="YlGnBu",annot_kws={"size": 20})

In [None]:
X_train_rfe,lm1=deduce_best_model(y_train,X_train_rfe)

#### Dropping 'Lead Number' as it has high VIF.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Lead Number') ## Dropping 'Lead Number' as it has high VIF.

#### Dropping 'Last Activity_Page Visited on Website' as it has high p-value.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Last Activity_Page Visited on Website')

#### Dropping 'Lead Origin_Lead Add Form' as it has high VIF.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Lead Origin_Lead Add Form')

#### Dropping 'Country_India' as it has high VIF.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Country_India')

#### Dropping 'Tags_lost' as it has high p-value.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Tags_lost')

#### Dropping 'Lead Source_Direct Traffic' as it has high p-value.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Lead Source_Direct Traffic') 

#### Dropping 'Lead Origin_Landing Page Submission' as it has high VIF.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'Lead Origin_Landing Page Submission')

#### Dropping 'TotalVisits' as it has high p-value.

In [None]:
X_train_rfe,lm2=deduce_best_model(y_train,X_train_rfe,'TotalVisits')

#### The above model is the final model as all variables are significant and the VIF values are all less than 2.02
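For reference, the VIF of a feature is $1/(1-R^2)$, where $R^2$ comes from regressing that feature on all the other features. The sketch below computes the textbook version with NumPy on synthetic data. It adds an intercept to each auxiliary regression, so its numbers can differ from those printed by the helper above, which drops the constant column before calling `variance_inflation_factor`. The function name `vif` and the synthetic columns are illustrative assumptions.

```python
import numpy as np

def vif(X):
    """Textbook VIF: 1 / (1 - R^2) from regressing each column on the others (with intercept)."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`, so both should get a high VIF
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

A high VIF means the feature is well explained by the others, which is why such columns were dropped one at a time and the model refitted after each drop.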

#### Step 4.2 Prediction and evaluation on Training Set<a id='step4.2'></a>

The logistic model predicts the probability of the target variable `Converted` being `1`.
We will convert the predicted values to an array and run some checks on the accuracy of the model.
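Since the goal stated at the top is a lead score between 0 and 100, one simple option is to scale the predicted probability by 100. The sketch below illustrates this with the logistic function; the helper names `sigmoid` and `lead_score` and the example linear predictors are hypothetical and are not part of the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lead_score(probability):
    """Scale a conversion probability to an integer score from 0 to 100."""
    return int(round(100 * probability))

# Hypothetical linear predictors (intercept + coefficients * features) for three leads.
linear_predictors = [-2.0, 0.0, 3.0]
scores = [lead_score(sigmoid(z)) for z in linear_predictors]
print(scores)
```

Because the scaling is monotonic, ranking leads by score gives the same order as ranking them by probability.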

In [None]:
#get the predicted values on train set
y_train_pred=lm2.predict(X_train_rfe).values.reshape(-1)
y_train_pred[:10]

In [None]:
print(type(y_train_pred))

In [None]:
print(type(y_train))

Next we will predict the value of `Converted` as 0 or 1 based on the probabilities predicted.

- First we will create a dataframe with actual Converted flag and the predicted probabilities which will act as `Lead Score`
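- Then we will create a new column `Predicted_Converted` with 0 if the predicted probability is at or below the cut-off and 1 otherwise. 0.5 is the usual starting point; the cut-off is tuned in the next subsection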

In [None]:
#X_train['predicted']=y_train_pred
y_train_pred_final = pd.DataFrame({'Actual_Converted':y_train.values,'Lead Score':y_train_pred,'Lead Number':X_train['Lead Number']})
y_train_pred_final['ID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Lead Score'].sort_values(ascending=False).head(10)

In [None]:
y_train_pred_final['Lead Number'].isnull().sum()

In [None]:
### Observing the top rows that gives cumulative probability of 80%

In [None]:
X_train_rfe.shape

In [None]:
X_train.shape

In [None]:
#X_train.to_csv('pred_analysis.csv')

#### Finding optimal cut-off

In [None]:

numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final['Lead Score'].map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final['Actual_Converted'], y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

A cut-off of 0.35 seems to be a good choice, as the accuracy, sensitivity and specificity curves intersect near that value.
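Reading the intersection off a plot can be backed by a small search for the cut-off where sensitivity and specificity are closest. This is a sketch in plain Python on invented labels and probabilities; `balanced_cutoff` is a hypothetical helper, not something the notebook defines.

```python
def sensitivity_specificity(actual, scores, cutoff):
    """Sensitivity and specificity when a score above the cutoff is predicted as 1."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s > cutoff)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s <= cutoff)
    tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s <= cutoff)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, scores, candidates):
    """Candidate cut-off where sensitivity and specificity are closest to each other."""
    def gap(c):
        sens, spec = sensitivity_specificity(actual, scores, c)
        return abs(sens - spec)
    return min(candidates, key=gap)

# Invented labels and probabilities, only to exercise the functions.
actual = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.32, 0.38, 0.45, 0.55, 0.65, 0.8, 0.9]
candidates = [round(0.05 * k, 2) for k in range(1, 20)]
best = balanced_cutoff(actual, scores, candidates)
print(best)
```

A finer candidate grid (steps of 0.05 or 0.01 instead of 0.1) makes it easier to locate a value such as 0.35 that falls between the plotted points.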

In [None]:
y_train_pred_final['Predicted_Converted'] = y_train_pred_final['Lead Score'].map( lambda x: 1 if x > 0.35 else 0)
y_train_pred_final['Predicted_Converted'].value_counts()
y_train_pred_final['Lead Number']=X_train['Lead Number']


In [None]:
X_train['Lead Number'].shape

In [None]:
y_train_pred_final.shape

In [None]:
y_train_pred_final.head()

#### Checking Metrics on training data

In [None]:
# Let's check the overall accuracy.
accu_train = metrics.accuracy_score(y_train_pred_final.Actual_Converted, y_train_pred_final.Predicted_Converted)

confusion_mat = metrics.confusion_matrix(y_train_pred_final.Actual_Converted, y_train_pred_final.Predicted_Converted)
confusion_mat

In [None]:
print("Training Accuracy = {}".format(accu_train))
TP = confusion_mat[1,1] # true positives
TN = confusion_mat[0,0] # true negatives
FP = confusion_mat[0,1] # false positives
FN = confusion_mat[1,0] # false negatives

# Sensitivity (recall): share of actual conversions the model catches
print("Training Sensitivity: {}".format(TP / float(TP+FN)))
# Specificity: share of non-converted leads correctly rejected
print("Training Specificity: {}".format(TN / float(TN+FP)))
# False positive rate: predicting conversion for a lead that did not convert
print("Training False Positive Rate: {}".format(FP / float(TN+FP)))
# Positive predictive value (precision)
print("Training Positive Predictive Value: {}".format(TP / float(TP+FP)))
# Negative predictive value
print("Training Negative Predictive Value: {}".format(TN / float(TN+FN)))

In [None]:
print(classification_report(y_train_pred_final.Actual_Converted, y_train_pred_final.Predicted_Converted))

### The F1 score on the training data is 84%

In [None]:
def draw_roc(actual_value,pred_prob):
    fpr,tpr,thresholds = metrics.roc_curve(actual_value,pred_prob,drop_intermediate = False)
    auc_score = metrics.roc_auc_score(actual_value,pred_prob)
    plt.figure(figsize = (5,5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr,tpr,thresholds = metrics.roc_curve(y_train_pred_final.Actual_Converted,y_train_pred_final['Lead Score'],drop_intermediate = False)

In [None]:
draw_roc(y_train_pred_final.Actual_Converted,y_train_pred_final['Lead Score'])

### The area under the ROC curve (AUC) is 0.93, which indicates good separation between converted and non-converted leads
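One way to read the AUC: it equals the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead, with ties counted as half. The sketch below illustrates that interpretation on made-up data. The function name `pairwise_auc` is hypothetical, and the quadratic loop is meant for explanation rather than for large datasets.

```python
def pairwise_auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (made up for illustration only)
toy_actual = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_auc = pairwise_auc(toy_actual, toy_scores)
print("Pairwise AUC: {:.4f}".format(toy_auc))
```

Here 11 of the 12 positive/negative pairs are ordered correctly; only the positive scored 0.4 falls below the negative scored 0.6.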

####  Step 4.3 Prediction and evaluation on Testing Set<a id='step4.3'></a>

In [None]:
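# Columns used by the final model (drop the intercept column added by statsmodels)
cols=X_train_rfe.columns
cols=cols.drop('const')
# Apply the previously fitted scaler to the numeric test features (transform only, no refit)
X_test[['TotalVisits','Website Time','Page Views Per Visit']]=scaling.transform(X_test[['TotalVisits','Website Time','Page Views Per Visit']])
# Keep a reference to the full test frame so 'Lead Number' remains available
X_test_tp=X_test
X_test=X_test[cols]
X_test.head()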

In [None]:
X_test_sm=sm.add_constant(X_test)

In [None]:
y_test_pred=lm2.predict(X_test_sm)
y_test_pred[:10]

In [None]:
y_test_pred_final=pd.DataFrame({'Actual_Converted':y_test,'Lead Score':y_test_pred,'Lead Number':X_test_tp['Lead Number']})

numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_test_pred_final[i]= y_test_pred_final['Lead Score'].map(lambda x: 1 if x > i else 0)
y_test_pred_final.head()

In [None]:
y_test_pred_final['Predicted_Converted'] = y_test_pred_final['Lead Score'].map(lambda x: 1 if x > 0.35 else 0)
#y_test_pred_final['Lead Number']=X_test['Lead Number']
y_test_pred_final.head()

In [None]:
y_test_pred_final['Lead Number'].isnull().sum()

In [None]:
# Let's check the overall accuracy.
accu_test = metrics.accuracy_score(y_test_pred_final.Actual_Converted, y_test_pred_final.Predicted_Converted)

confusion_mat = metrics.confusion_matrix(y_test_pred_final.Actual_Converted, y_test_pred_final.Predicted_Converted)
confusion_mat

In [None]:
print("Testing Accuracy = {}".format(accu_test))
TP = confusion_mat[1,1] # true positive 
TN = confusion_mat[0,0] # true negatives
FP = confusion_mat[0,1] # false positives
FN = confusion_mat[1,0] # false negatives

# Let's see the sensitivity of our logistic regression model
print("Testing Sensitivity: {}".format(TP / float(TP+FN)))
# Let us calculate specificity
print("Testing Specificity: {}".format(TN / float(TN+FP)))
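# False positive rate: predicting conversion for a lead that did not convert
print("Testing False Positive Rate: {}".format(FP / float(TN+FP)))
# Positive predictive value
print("Testing Positive Predictive Value: {}".format(TP / float(TP+FP)))
# Negative predictive value
print("Testing Negative Predictive Value: {}".format(TN / float(TN+FN)))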

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test_pred_final.Actual_Converted, y_test_pred_final.Predicted_Converted))

### The F1 score on the test data is 84%

## Step 5. Final Analysis:<a id='Step5'></a>
| Feature | Coefficient |
|---|---|
| Tags_interested | 2.2906 |
| Lead Source_Olark Chat | 0.7507 |
| Last Activity_Email Opened | 0.7293 |
| Last Activity_SMS Sent | 2.2233 |
| Free copy required | -0.3246 |
| Last Activity_Olark Chat Conversation | -1.1362 |
| Lead Source_Reference | 2.3124 |
| Website Time | 0.9219 |
| Occupation_Working Professional | 1.7101 |
| Tags_Ringing | -3.1427 |
| Last Activity_Converted to Lead | -0.9456 |
| Tags_probable | -1.6866 |

Based on the log-odds equation:

$ \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_{12} x_{12} $

where $\beta_1, \dots, \beta_{12}$ are the coefficients of the 12 features above.

The odds of a lead getting converted, $P/(1-P)$, indicate how much likelier a lead is to convert than not to convert. For example, a lead whose odds of conversion equal 3 is 3 times more likely to convert than not to convert:

P(Conversion) = 3 * P(No Conversion).
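The relationship between odds, log odds, and probability can be checked with a few lines of arithmetic. The helper names below are hypothetical and serve only to illustrate the example of odds equal to 3.

```python
import math


def probability_from_odds(odds):
    """Invert odds = P / (1 - P) to recover P."""
    return odds / (1.0 + odds)


def probability_from_log_odds(log_odds):
    """Logistic (sigmoid) function: maps ln(P / (1 - P)) back to P."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p = probability_from_odds(3.0)
print("P(Conversion):", p)
print("P(No Conversion):", 1.0 - p)
print("Ratio:", p / (1.0 - p))
print("From log odds:", probability_from_log_odds(math.log(3.0)))
```

Odds of 3 correspond to a conversion probability of 0.75 against 0.25 for no conversion, which is the 3-to-1 ratio stated above.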

The top features that increase the conversion probability are:

- Lead Source_Reference (currently 91% conversion rate)
- Tags_interested (currently 80% conversion rate)
- Last Activity_SMS Sent (currently 63% conversion rate)
- Website Time

The top features that decrease the conversion probability are:

- Tags_Ringing (currently 2% conversion rate)
- Tags_probable (currently 4% conversion rate)
- Last Activity_Olark Chat Conversation (currently 8% conversion rate)
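Exponentiating a coefficient gives an odds ratio: the factor by which the odds of conversion are multiplied when that feature increases by one unit and the other features are held fixed. The sketch below applies this to the coefficients listed in this step. The variable names are illustrative. Since `Website Time` was scaled before modelling, its ratio is per unit of the scaled variable.

```python
import math

# Coefficients copied from the final model summary in this step
coefficients = {
    "Tags_interested": 2.2906,
    "Lead Source_Olark Chat": 0.7507,
    "Last Activity_Email Opened": 0.7293,
    "Last Activity_SMS Sent": 2.2233,
    "Free copy required": -0.3246,
    "Last Activity_Olark Chat Conversation": -1.1362,
    "Lead Source_Reference": 2.3124,
    "Website Time": 0.9219,
    "Occupation_Working Professional": 1.7101,
    "Tags_Ringing": -3.1427,
    "Last Activity_Converted to Lead": -0.9456,
    "Tags_probable": -1.6866,
}

# exp(beta) is the multiplicative change in odds per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

# Rank from the strongest positive effect to the strongest negative effect
ranked = sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True)
for name, ratio in ranked:
    print("{:<40s} {:8.3f}".format(name, ratio))
```

Ratios above 1 raise the odds of conversion and ratios below 1 lower them, so the ordering matches the two lists above.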

## Adding lead score value to data

In [None]:
y_train_pred_final.head()

In [None]:
y_test_pred_final.head()

In [None]:
Final_output = pd.concat([y_train_pred_final[['Lead Number','Lead Score']],y_test_pred_final[['Lead Number','Lead Score']]],axis=0)
Final_output.head()

In [None]:
import math
Final_output['Lead Score']=Final_output['Lead Score']*100
Final_output['Lead Score']=Final_output['Lead Score'].map(lambda x:math.ceil(x))


In [None]:
Final_output['Lead Score'].isnull().sum()

In [None]:
Final_output=Final_output.sort_values(by='Lead Score',ascending=False)
Final_output.head()

In [None]:
Final_output.to_csv('Final_output.csv')