# Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

 

The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 

 

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will then focus on communicating with the promising leads rather than making calls to everyone. 

# Data
You have been provided with a leads dataset from the past with around 9000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in ultimately deciding whether a lead will be converted. The target variable is the column ‘Converted’, which tells whether a past lead was converted: 1 means it was converted and 0 means it wasn’t. You can learn more about the dataset from the data dictionary provided in the zip folder at the end of the page. You also need to watch out for the levels present in the categorical variables. Many of the categorical variables have a level called 'Select', which needs to be handled because it is as good as a null value: it is the default option of a dropdown that the user never changed (think about why).
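
One way to make the 'Select' level visible as missing data is to convert it to `NaN` before counting nulls. The sketch below uses a small made-up frame (the column names mirror the dataset, the values are invented) and is only an illustration, not the treatment this notebook applies later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", None, "Select"],
    "City": ["Mumbai", "Select", "Select", "Thane & Outskirts"],
})

# Treat the placeholder "Select" exactly like a missing value.
toy_clean = toy.replace("Select", np.nan)

# Share of missing values per column, in percent.
null_pct = toy_clean.isnull().mean() * 100
print(null_pct)
```

With the placeholder converted, the usual null-percentage checks also account for fields the user never filled in.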

# Objective
Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads, which the company can use to target potential leads. A higher score means that the lead is hot, i.e. most likely to convert, whereas a lower score means that the lead is cold and will most likely not convert.
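
A simple way to obtain such a score, assuming it is defined as the predicted conversion probability scaled to 0-100 (the problem statement does not fix the exact mapping), could look like this. The function name `lead_score` is hypothetical.

```python
def lead_score(probability: float) -> int:
    """Map a conversion probability in [0, 1] to an integer score in [0, 100]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return int(round(probability * 100))

# Made-up probabilities for three leads.
scores = [lead_score(p) for p in (0.05, 0.4, 0.873)]
print(scores)
```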

The company has presented some more problems that your model should be able to adjust to if the company's requirements change in the future, so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it in based on the logistic regression model you got in the first step. Also, make sure you include this in your final PPT where you'll make recommendations.

# Problem Solving Methodology

### Step 1. Load the dataset
### Step 2. Data Cleaning & Preparation
### Step 3. Treatment of Null Values
### Step 4. Outliers Detection & Treatment
### Step 5. Univariate Exploratory Data Analysis
### Step 6. Model Building & Evaluation

In [None]:
# import the required libraries
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import xticks
import seaborn as sns
# Suppress unnecessary warnings

import warnings
warnings.filterwarnings('ignore')

# 1. Load the DataSet

In [None]:
# load the dataset
leads = pd.read_csv("../input/leadds/Leads.csv")
leads.head()

In [None]:
# check the dimensions of the dataset
leads.shape

In [None]:
# lets check the dataset
leads.info()

# Observation: there are null values present in the dataset, we will treat these nulls later

In [None]:
# let's check for duplicate Prospect IDs
leads[leads["Prospect ID"].duplicated()]

# no duplicates found

In [None]:
leads.describe()

# 2. Data Cleaning & Preparation

In [None]:
# lets check % of null values for each column
round(100*(leads.isnull().sum()/len(leads.index)), 2).sort_values(ascending = False)

In [None]:
# lets drop columns with null values greater than 30%
drop_cols = []
for col in leads.columns:
    if round(100*(leads[col].isnull().sum()/len(leads.index)), 2)>30:
        drop_cols.append(col)

print("dropping columns:",drop_cols)
leads.drop(drop_cols, axis=1, inplace=True)  # pass axis by keyword; the positional form is rejected by newer pandas

In [None]:
# lets check % of null values again
round(100*(leads.isnull().sum()/len(leads.index)), 2).sort_values(ascending = False)

# we will treat these null values in a while

In [None]:
# let's check unique values for all the variables except "Prospect ID" & "Lead Number"
cols = leads.drop(["Prospect ID","Lead Number"],axis = 1)

print('\nUnique values in the dataframe - column wise:')
for i in cols:
    print(i,leads[i].unique(),'\n')
    
# we can see columns having "Select" values, which means user did not select any value for these columns

In [None]:
# let's get the list of columns having "Select" values
print("Column names containing the value Select:\n")
for col in cols:
    # vectorised check instead of looping over every value
    if (leads[col] == "Select").any():
        print(col)
# let's check columns with "Select" values one by one

In [None]:
# lets check "Specialization" for "Select"
print("Count of different values for Specialization:",leads["Specialization"].value_counts())

# There are 1942 records present in the dataset with Specialization value as "Select", we will drop these records
print("\nDropping the rows with value Select" )
leads = leads[~(leads['Specialization'] == "Select")]
      
print("\nCount of final values for Specialization:",leads["Specialization"].value_counts())

In [None]:
# lets check "How did you hear about X Education" for "Select"
print("How did you hear about x education values:\n",leads["How did you hear about X Education"].value_counts())
# the majority of the records have "Select" for this field, so let's drop the column

print("\nDropping the field as it has too many rows with Select Value")
leads = leads.drop("How did you hear about X Education", axis=1)

In [None]:
# lets check "Lead Profile" for "Select"

print("Lead Profile Values:",leads["Lead Profile"].value_counts())
print("\nDropping the field as it has too many rows with Select Value")
leads = leads.drop("Lead Profile", axis=1)

In [None]:
# lets check "City" for "Select"
print("City values:",leads["City"].value_counts())

# the majority of the records have one of only 2 values for this field, "Mumbai" or "Select", so we can drop this column
print("\nDropping the field as it has too many rows with Select Value")
leads = leads.drop("City", axis=1)

We are done treating the "Select" values; now let's correct other values.

In [None]:
# field Lead Source has values like google & Google, lets make it "google"
leads["Lead Source"] = leads["Lead Source"].replace("Google","google")

# lets check whether the value has been corrected or not
leads["Lead Source"].unique()

We have made all the corrections. Now let's check for variables with a single value or almost no variation, and drop them.

In [None]:
# lets check the variations for each column

for col in leads.drop(["Prospect ID","Converted","Lead Number"], axis=1).columns:
    print("column name:",col,"\n",leads[col].value_counts(),"\n\n")

* Based on the above stats, we can drop the following features, as they do not have much variation:
1. Do Not Call
2. Search
3. Magazine
4. Newspaper Article
5. X Education Forums
6. Newspaper
7. Digital Advertisement
8. Through Recommendations
9. Receive More Updates About Our Courses 
10. Update me on Supply Chain Content
11. Get updates on DM Content
12. I agree to pay the amount through cheque
13. What matters most to you in choosing a course 
14. Country
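
The list above was compiled by reading the value counts. A programmatic check along the same lines could flag columns whose most frequent value dominates. This is a sketch on made-up data; the helper name `low_variation_columns` and the 95% threshold are assumptions, not something this notebook uses.

```python
import pandas as pd

def low_variation_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for name in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        top_share = df[name].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(name)
    return flagged

# Hypothetical toy frame with 100 rows.
toy = pd.DataFrame({
    "Magazine": ["No"] * 100,
    "Search": ["No"] * 99 + ["Yes"],
    "Lead Origin": ["API"] * 50 + ["Landing Page Submission"] * 50,
})
flagged = low_variation_columns(toy)
print(flagged)
```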

In [None]:
# lets drop all the columns mentioned above

leads.drop(['Do Not Call', 'Country','Search', 'Magazine', 'Newspaper Article', 'X Education Forums', 'Newspaper', 
            'Digital Advertisement', 'Through Recommendations', 'Receive More Updates About Our Courses', 
            'Update me on Supply Chain Content', 'Get updates on DM Content', 
            'I agree to pay the amount through cheque','What matters most to you in choosing a course'], axis = 1, inplace = True)

# 3. Treat Null Values

##### Let's treat null values now

In [None]:
# check for null values
leads.isnull().sum().sort_values(ascending = False)

In [None]:
# lets check % of null values for each column
round(100*(leads.isnull().sum()/len(leads.index)), 2).sort_values(ascending = False)

In [None]:
# lets check What is your current occupation field

print("\nNo. of Null Values:",leads["What is your current occupation"].isnull().sum())
print("Check different values:\n",leads["What is your current occupation"].value_counts())

# lets replace the null values with the most frequent value
leads["What is your current occupation"] = leads["What is your current occupation"].replace(np.nan,"Unemployed")

print("\nNo. of Null Values after treating null values:",leads["What is your current occupation"].isnull().sum())
print("Check different values:\n",leads["What is your current occupation"].value_counts())

In [None]:
# let's check the Specialization field

print("\nNo. of Null Values:",leads["Specialization"].isnull().sum())
print("Check different values:\n",leads["Specialization"].value_counts())

# we will remove the rows with null values
leads = leads[~pd.isnull(leads['Specialization'])]

print("\nNo. of Null Values after treating null values:",leads["Specialization"].isnull().sum())
print("Check different values:\n",leads["Specialization"].value_counts())

In [None]:
# lets check Page Views Per Visit field
print("\nNo. of Null Values for Page Views Per Visit:",leads["Page Views Per Visit"].isnull().sum())
print("Describe Page Views Per Visit:",leads["Page Views Per Visit"].value_counts())

# we will remove the records with null values
leads = leads[~pd.isnull(leads['Page Views Per Visit'])]

print("\nNo. of Null Values for Page Views Per Visit after treatment:",leads["Page Views Per Visit"].isnull().sum())

In [None]:
# lets check TotalVisits field
print("\nNo. of Null Values:",leads["TotalVisits"].isnull().sum())
# no null values left

In [None]:
# let's check the Last Activity field
print("\nNo. of Null Values for Last Activity:",leads["Last Activity"].isnull().sum())
# no null values left

In [None]:
# lets check the column 'Lead Source'
print("\nNo. of Null Values for Lead Source:",leads["Lead Source"].isnull().sum())
print("Describe Lead Source:",leads["Lead Source"].value_counts())

# we will remove the rows with null values
leads = leads[~pd.isnull(leads['Lead Source'])]

In [None]:
# lets check % of null values again
round(100*(leads.isnull().sum()/len(leads.index)), 2).sort_values(ascending = False)

We have treated all the null values

# 4. Outliers Detection & Treatment

In [None]:
# let's drop "Prospect ID" and "Lead Number" because they are of no use in our analysis
leads = leads.drop(["Prospect ID", "Lead Number"], axis=1)

In [None]:
leads.info()

# lets check numeric fields TotalVisits, Total Time Spent on Website,Page Views Per Visit for outliers

In [None]:
def outlier_treatment(data, field):
    # keep only the rows lying between the 5th and 95th percentiles of `field`
    plt.figure(figsize=(10,8))
    plt.subplot(1,2,1)
    plt.boxplot(data[field])
    plt.title("Before Outlier Treatment")
    lower = data[field].quantile(0.05)
    upper = data[field].quantile(0.95)
    data = data[(data[field] >= lower) & (data[field] <= upper)]

    plt.subplot(1,2,2)
    plt.boxplot(data[field])
    plt.title("After Outlier Treatment")
    return data

In [None]:
leads = outlier_treatment(leads,"TotalVisits")

In [None]:
leads = outlier_treatment(leads,"Total Time Spent on Website")

In [None]:
leads = outlier_treatment(leads,"Page Views Per Visit")

We have successfully treated the outliers; let's perform EDA now.
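
The treatment above removes every row outside the 5th-95th percentile range. A different technique, capping (winsorizing), keeps all rows and only clips the extreme values. The sketch below illustrates that alternative on made-up numbers; the helper `cap_outliers` is hypothetical and is not what this notebook does.

```python
import numpy as np

def cap_outliers(values, lower_pct=5, upper_pct=95):
    """Clip values to the [lower_pct, upper_pct] percentile range instead of dropping rows."""
    arr = np.asarray(values, dtype=float)
    lo, hi = np.percentile(arr, [lower_pct, upper_pct])
    return np.clip(arr, lo, hi)

# Made-up visit counts with one extreme value.
visits = list(range(1, 20)) + [250]
capped = cap_outliers(visits)
print(len(capped), capped.max())
```

Capping trades a small distortion of the extreme values for keeping the full sample, which can matter when several columns are treated one after another.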

# 5. Univariate Exploratory Data Analysis

In [None]:
# let's take a copy of the dataset to perform EDA; we take a copy because we will create "binning" variables
# for the fields having more than 30 unique values
leads_eda = leads.copy()

In [None]:
leads_eda.head()

In [None]:
# function for univariate analysis
# if a numeric variable has more than 30 unique values, we will create bins for the variable
def univariate_plot(data,col):
    if data[col].nunique() > 30:
        col_bins = col+"_bins"
        data[col_bins] = pd.cut(data[col], 8, duplicates = 'drop') # creating bins
        sns.countplot(x = data[col_bins]) # plot for binned variables (x passed by keyword)
        plt.xlabel(col_bins,fontsize = 15)
    else:
        sns.countplot(x = data[col]) # plot for non binned variables
        plt.xlabel(col,fontsize = 15)

    plt.ylabel('Frequency',fontsize = 15)
    xticks(rotation = 30)

In [None]:
# lets perform univariate analysis for numeric type variables
plt.figure(figsize=(25,30))
fin = []
cols = leads_eda.columns
for col in cols:
    if leads_eda[col].dtypes != 'O': # getting the list of numeric variables
        fin.append(col)
        
for idx,col in enumerate(fin): # plotting for numeric variables
    plt.subplot(3, 2, idx+1)
    univariate_plot(leads_eda,col)          

#### Observation from above plots:
1. Converted: the column has good representation for both possible values
2. Total Time Spent on Website: most of the visitors spent less than 402 seconds on the website

In [None]:
# lets perform univariate analysis for non numeric variables
fin = []
cols = leads_eda.columns
for col in cols:
    if leads_eda[col].dtypes == 'O': # getting the list of non numeric variables
        fin.append(col)

plt.figure(figsize=(25,60)) # plotting graphs for non numeric variables
for idx,col in enumerate(fin):            
    plt.subplot(4, 2, idx+1)
    sns.countplot(x = leads_eda[col])
    plt.title("Count Plot for "+ col)
    xticks(rotation = 30)

plt.show()

### Observations from above graphs:
1. Lead Origin: the majority of the leads originate from "Landing Page Submission"
2. Lead Source: Direct Traffic and google are among the top lead sources
3. Do Not Email: the majority of the leads have selected "No"
4. Last Activity: for the majority of the leads, the last activity is either "Email Opened" or "SMS Sent"
5. Specialization: data is fairly evenly distributed for this column
6. What is your current occupation: the majority of the leads are unemployed



In [None]:
print(len(leads.index))
print(len(leads.index)/9240)

We are left with around 54% of the original dataset now

In [None]:
# lets check whether the dataset is balanced or not
len(leads[leads["Converted"]==1])/len(leads.index)

The share of converted leads printed above shows that the dataset is reasonably balanced.

# Dummy Variable Creation

Let's create dummy variables for the categorical features present in the data.
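
As a small illustration of what `pd.get_dummies` with `drop_first=True` produces, here is a sketch on a made-up frame (the column names mirror the dataset, the rows are invented). For each feature, the first level in sorted order is dropped and acts as the baseline.

```python
import pandas as pd

# Hypothetical three-row frame.
toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Lead Origin": ["API", "Landing Page Submission", "Lead Add Form"],
})

# One indicator column per level, minus the first level of each feature.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for the p-value and VIF checks later on.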

In [None]:
# lets find out categorical variables
obj = [col for col in leads.columns if leads[col].dtype == "O"]
obj

In [None]:
# Create dummy variables for the categorical features using the 'get_dummies' command
dummy = pd.get_dummies(leads[['Lead Origin', 'Lead Source', 'Do Not Email', 'Last Activity',
                              'What is your current occupation','A free copy of Mastering The Interview', 
                              'Last Notable Activity','Specialization']], drop_first=True)

In [None]:
# lets drop the variables for which we have created the dummy variables above
leads = leads.drop(['Lead Origin', 'Lead Source', 'Do Not Email', 'Last Activity',
                              'What is your current occupation','A free copy of Mastering The Interview', 
                              'Last Notable Activity','Specialization'], axis=1)

In [None]:
leads = pd.concat([leads,dummy], axis=1)
leads.shape

In [None]:
leads.columns

In [None]:
leads.head()

# Train Test Split

In [None]:
# Import the required library

from sklearn.model_selection import train_test_split

In [None]:
# Put all the feature variables in X

X = leads.drop(['Converted'], axis=1)
X.head()

In [None]:
# Put the target variable in y

y = leads['Converted']

y.head()

In [None]:
# Split the dataset into 70% train and 30% test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
# Import MinMax scaler

from sklearn.preprocessing import MinMaxScaler

In [None]:
# Scale the three numeric features present in the dataset

scaler = MinMaxScaler()

X_train[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']] = scaler.fit_transform(X_train[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']])

X_train.head()
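
Min-max scaling maps each value to `(x - min) / (max - min)`, where the minimum and maximum are learned from the data the scaler is fitted on. The sketch below reproduces that arithmetic by hand on made-up numbers, learning the range from the training values only and reusing it for unseen values.

```python
import numpy as np

train = np.array([0.0, 5.0, 10.0])   # made-up training values
test = np.array([2.0, 12.0])         # made-up unseen values

# Learn the range from the training data only.
t_min, t_max = train.min(), train.max()

def min_max(x):
    """Scale x with the training minimum and maximum."""
    return (x - t_min) / (t_max - t_min)

train_scaled = min_max(train)
test_scaled = min_max(test)
print(train_scaled, test_scaled)
```

Unseen values can fall outside [0, 1] (here 12.0 maps to 1.2); that is expected when the training range is reused.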

# 6. Model Building

In [None]:
# Import 'LogisticRegression' and create a LogisticRegression object

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
# Import RFE and select 15 variables

from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)   # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
# Let's take a look at which features have been selected by RFE

list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Put all the columns selected by RFE in the variable 'col'

col = X_train.columns[rfe.support_]
col

In [None]:
# Select only the columns selected by RFE

X_train = X_train[col]

In [None]:
# Import statsmodels

import statsmodels.api as sm

In [None]:
# Fit a logistic Regression model on X_train after adding a constant and output the summary

X_train_sm = sm.add_constant(X_train)
logm2 = sm.GLM(y_train, X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Import 'variance_inflation_factor'

from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Make a VIF dataframe for all the variables present

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
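
The variance inflation factor of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on the remaining features. The sketch below computes it with plain NumPy on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch; `variance_inflation_factor` regresses on exactly the columns it is given, so its numbers may differ when no constant column is present.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2), regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept (assumption of this sketch)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two nearly identical columns get a large VIF while the independent one stays close to 1, which is the pattern used above to decide which feature to drop.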

In [None]:
# p-value & VIF for Lead Origin_Lead Add Form is high, lets drop this column
X_train.drop('Lead Origin_Lead Add Form', axis = 1, inplace = True)

In [None]:
# Refit the model with the new set of features

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
# Make a VIF dataframe for all the variables present

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# p-value for this field is too high, lets drop
X_train.drop('Lead Source_Welingak Website', axis = 1, inplace = True)

In [None]:
# Refit the model with the new set of features

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
# p-value for this field is too high, lets drop
X_train.drop('What is your current occupation_Housewife', axis = 1, inplace = True)

In [None]:
# Refit the model with the new set of features

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
# Make a VIF dataframe for all the variables present

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif


In [None]:
# VIF for this field is too high, lets drop
X_train.drop('What is your current occupation_Unemployed', axis = 1, inplace = True)

In [None]:
# Refit the model with the new set of features

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
# Make a VIF dataframe for all the variables present

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif


In [None]:
# p-value for this field is too high, lets drop
X_train.drop('What is your current occupation_Other', axis = 1, inplace = True)

In [None]:
# Refit the model with the new set of features

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
# Make a VIF dataframe for all the variables present

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif


We have finally got a set of features whose p-values and VIFs are within permissible limits; now let's evaluate the model.

# 7. Model Evaluation

In [None]:
# Use 'predict' to predict the probabilities on the train set

y_train_pred = res.predict(sm.add_constant(X_train))
y_train_pred[:10]
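
For a binomial GLM with the default logit link, the predicted probability is the logistic function applied to the linear predictor (intercept plus the weighted sum of the features). A minimal sketch with made-up coefficients, not the ones fitted above:

```python
import math

def predicted_probability(intercept, coefs, features):
    """Logistic function of the linear predictor intercept + sum(coef * feature)."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and one hypothetical lead.
p = predicted_probability(-1.0, [4.0, 1.5], [0.6, 0.0])
print(round(p, 3))
```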

In [None]:
# Create a new dataframe containing the actual conversion flag and the probabilities predicted by the model

y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
# Import metrics from sklearn for evaluation

from sklearn import metrics

In [None]:
# Create confusion matrix 

confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy

print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))


In [None]:
# Let's evaluate the other metrics as well

TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Calculate the sensitivity

TP/(TP+FN)

In [None]:
# Calculate the specificity

TN/(TN+FP)
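
The same quantities can be checked by hand on a tiny example. The sketch below counts the four confusion-matrix cells directly from made-up labels and derives sensitivity, specificity, and accuracy from them; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up labels for eight leads.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)   # share of converted leads that were caught
specificity = tn / (tn + fp)   # share of non-converted leads correctly rejected
accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, sensitivity, specificity, accuracy)
```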

In [None]:
# ROC function

def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None
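
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead, with ties counted as one half. A minimal sketch of that pairwise computation on made-up data (the helper `auc_by_ranking` is hypothetical):

```python
def auc_by_ranking(actual, probs):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for a, p in zip(actual, probs) if a == 1]
    neg = [p for a, p in zip(actual, probs) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores.
auc = auc_by_ranking([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The pairwise loop is quadratic in the number of rows, so it is meant for intuition rather than for large datasets.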

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob, drop_intermediate = False )

In [None]:
# Import matplotlib to plot the ROC curve

import matplotlib.pyplot as plt

In [None]:
# Call the ROC function

draw_roc(y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob)

In [None]:
# Let's create columns with different probability cutoffs 

numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Let's create a dataframe to see the values of accuracy, sensitivity, and specificity at different values of probability cutoffs

cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])

# Confusion matrix layout: [0,0] = TN, [0,1] = FP, [1,0] = FN, [1,1] = TP

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot it as well

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

The accuracy, sensitivity, and specificity curves intersect at a cutoff value of around 0.4.
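
Instead of reading the intersection off the plot, the cutoff at which sensitivity and specificity are closest can be located programmatically. The sketch below does this on made-up labels and probabilities; the helper names are hypothetical, and the data is invented so that the answer happens to be 0.4.

```python
def sens_spec(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for probabilities above `cutoff`."""
    pred = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    tn = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 0)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, probs, cutoffs):
    """Cutoff with the smallest gap between sensitivity and specificity."""
    def gap(c):
        sensitivity, specificity = sens_spec(actual, probs, c)
        return abs(sensitivity - specificity)
    return min(cutoffs, key=gap)

# Made-up data: four non-converted and four converted leads.
actual = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.35, 0.6, 0.3, 0.45, 0.7, 0.9]
cutoffs = [i / 10 for i in range(10)]

best = balanced_cutoff(actual, probs, cutoffs)
print(best, sens_spec(actual, probs, best))
```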

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Conversion_Prob.map( lambda x: 1 if x > 0.4 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the accuracy now

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Let's create the confusion matrix once again

confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
# Let's evaluate the other metrics as well

TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Calculate Sensitivity

TP/(TP+FN)


### Let's now make predictions on the test set.

In [None]:
# Scale the three numeric features using the scaler already fitted on the training set
# (refitting on the test set would scale it differently from the data the model was trained on)

num_cols = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
X_test[num_cols] = scaler.transform(X_test[num_cols])

X_test.head()

In [None]:
# Select the columns in X_train for X_test as well

X_test = X_test[col]
X_test.head()

In [None]:
# Add a constant to X_test

X_test_sm = sm.add_constant(X_test[col])

In [None]:
# Drop from X_test the same columns that were dropped from X_train during model building

X_test.drop(['Lead Origin_Lead Add Form','Lead Source_Welingak Website','What is your current occupation_Housewife',
             'What is your current occupation_Unemployed','What is your current occupation_Other'], axis=1, inplace = True)

In [None]:
# Make predictions on the test set and store it in the variable 'y_test_pred'

y_test_pred = res.predict(sm.add_constant(X_test))

In [None]:
# Converting y_pred to a dataframe

y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head

y_pred_1.head()

In [None]:
# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

In [None]:
# Remove index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Append y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
# Check 'y_pred_final'

y_pred_final.head()

In [None]:
# Rename the column 

y_pred_final= y_pred_final.rename(columns = {0 : 'Conversion_Prob'})

In [None]:
# Let's see the head of y_pred_final

y_pred_final.head()

In [None]:
# Make predictions on the test set using 0.4 as the cutoff

y_pred_final['final_predicted'] = y_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.4 else 0)

In [None]:
# Check y_pred_final

y_pred_final.head()

In [None]:
# Let's check the overall accuracy

metrics.accuracy_score(y_pred_final['Converted'], y_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final['Converted'], y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Calculate sensitivity
TP / float(TP+FN)

In [None]:
# Calculate specificity
TN / float(TN+FP)

# Precision Recall View

In [None]:
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted )
confusion

In [None]:
# precision
confusion[1,1]/(confusion[0,1]+confusion[1,1])

In [None]:
# Recall

confusion[1,1]/(confusion[1,0]+confusion[1,1])

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.Predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
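
To see what the two curves represent at a single threshold, precision and recall can be computed by hand. The sketch below uses made-up labels and probabilities; the helper name `precision_recall_at` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_at(actual, probs, threshold):
    """Precision and recall when predicting 1 for probabilities above `threshold`."""
    pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and scores for six leads.
actual = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]

strict = precision_recall_at(actual, probs, 0.5)
loose = precision_recall_at(actual, probs, 0.25)
print(strict, loose)
```

Lowering the threshold raises recall (more converted leads are caught) and, in general, costs precision, which is the trade-off the plot shows.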

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.4 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the accuracy now

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Let's create the confusion matrix once again

confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
# Let's evaluate the other metrics as well

TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Calculate Precision

TP/(TP+FP)

In [None]:
# Calculate Recall

TP/(TP+FN)