## Problem Statement

The data set contains people's responses to a questionnaire about the H1N1 flu vaccine. The respondents are people aged 6 months and older. The survey was designed to monitor influenza immunization coverage in the 2009-10 season. Machine learning techniques can make it more efficient to predict how likely people are to opt for the flu vaccine. In this case study, we predict how likely it is that a person will take the H1N1 flu vaccine.

## Data Definition

**unique_id**: Unique identifier for each respondent - (Numerical)    

**h1n1_worry**: Worry about the h1n1 flu - (0,1,2,3) - 0=Not worried at all, 1=Not very worried, 2=Somewhat worried, 3=Very worried - (Categorical)

**h1n1_awareness**: Signifies the amount of knowledge or understanding the respondent has about h1n1 flu - (0,1,2) - 0=No knowledge, 1=Little knowledge, 2=Good knowledge - (Categorical)

**antiviral_medication**: Has the respondent taken antiviral medication - (0,1) - (Categorical)
    
**contact_avoidance**: Has avoided any close contact with people who have flu-like symptoms  - (0,1) - (Categorical)
    
**bought_face_mask**: Has the respondent bought mask or not - (0,1) - (Categorical)
    
**wash_hands_frequently**: Washes hands frequently or uses hand sanitizer - (0,1) - (Categorical)
    
**avoid_large_gatherings**: Has the respondent reduced time spent at large gatherings - (0,1) - (Categorical)
    
**reduced_outside_home_cont**: Has the respondent reduced contact with people outside own house - (0,1) - (Categorical)
    
**avoid_touch_face**: Avoids touching nose, eyes, mouth - (0,1) - (Categorical)

**dr_recc_h1n1_vacc**: Doctor has recommended h1n1 vaccine - (0,1) - (Categorical)
    
**dr_recc_seasonal_vacc**: Doctor has recommended seasonal flu vaccine - (0,1) - (Categorical)
    
**chronic_medic_condition**: Has any chronic medical condition - (0,1) - (Categorical)
    
**cont_child_undr_6_mnth**: Has regular contact with a child under the age of 6 months - (0,1) - (Categorical)

**is_health_worker**: Is respondent a health worker - (0,1) - (Categorical)
    
**has_health_insur**: Does respondent have health insurance - (0,1) - (Categorical)
    
**is_h1n1_vacc_effective**:  Does respondent think that the h1n1 vaccine is effective - (1,2,3,4,5)- (1=Thinks not effective at all, 2=Thinks it is not very effective, 3=Doesn't know if it is effective or not, 4=Thinks it is somewhat effective, 5=Thinks it is highly effective) - (Categorical)

**is_h1n1_risky**: What the respondent thinks about the risk of getting ill with h1n1 in the absence of the vaccine - (1,2,3,4,5) - (1=Thinks it is very low risk, 2=Thinks it is somewhat low risk, 3=Doesn't know if it is risky or not, 4=Thinks it is somewhat high risk, 5=Thinks it is very high risk) - (Categorical)
 
**sick_from_h1n1_vacc**: Does respondent worry about getting sick by taking the h1n1 vaccine - (1,2,3,4,5) - (1=Respondent not worried at all, 2=Respondent is not very worried, 3=Doesn't know, 4=Respondent is somewhat worried, 5=Respondent is very worried) - (Categorical)

**is_seas_vacc_effective**: Does respondent think that the seasonal vaccine is effective- (1,2,3,4,5)- (1=Thinks not effective at all, 2=Thinks it is not very effective, 3=Doesn't know if it is effective or not, 4=Thinks it is somewhat effective, 5=Thinks it is highly effective) - (Categorical)

**is_seas_flu_risky**: What the respondent thinks about the risk of getting ill with seasonal flu in the absence of the vaccine - (1,2,3,4,5) - (1=Thinks it is very low risk, 2=Thinks it is somewhat low risk, 3=Doesn't know if it is risky or not, 4=Thinks it is somewhat high risk, 5=Thinks it is very high risk) - (Categorical)
 
**sick_from_seas_vacc**: Does respondent worry about getting sick by taking the seasonal flu vaccine - (1,2,3,4,5) - (1=Respondent not worried at all, 2=Respondent is not very worried, 3=Doesn't know, 4=Respondent is somewhat worried, 5=Respondent is very worried) - (Categorical)

**age_bracket** - Age bracket of the respondent - (18 - 34 Years, 35 - 44 Years, 45 - 54 Years, 55 - 64 Years, 64+ Years) - (Categorical)
    
**qualification** - Qualification/education level of the respondent as per their response -(<12 Years, 12 Years, College Graduate, Some College) - (Categorical)
    
**race**: Respondent's race - (White, Black, Other or Multiple, Hispanic) - (Categorical)
    
**sex**: Respondent's sex - (Female, Male) - (Categorical)
    
**income_level**: Annual income of the respondent as per the 2008 poverty Census - (<=$75000-Above Poverty, >$75000, Below Poverty) - (Categorical)
    
**marital_status**: Respondent's marital status - (Not Married, Married) - (Categorical)
    
**housing_status**: Respondent's housing status - (Own, Rent) - (Categorical)
    
**employment**: Respondent's employment status - (Not in Labor Force, Employed, Unemployed) - (Categorical)
    
**census_msa**: Residence of the respondent with respect to the MSA (metropolitan statistical area) - (Non-MSA, MSA-Not Principle City, MSA-Principle City) - (Categorical)
    
**no_of_adults**: Number of adults in the respondent's house - (0,1,2,3) - (Categorical)

**no_of_children**: Number of children in the respondent's house - (0,1,2,3) - (Categorical)

**h1n1_vaccine**: (Dependent variable) Did the respondent receive the h1n1 vaccine or not - (1,0) - (Yes, No) - (Categorical)

<a id='import_lib'></a>
# 1. Import Libraries

In [1]:
# suppress display of warnings
import warnings
warnings.filterwarnings("ignore")

# 'Pandas' is used for data manipulation and analysis
import pandas as pd 

# 'Numpy' is used for mathematical operations on large, multi-dimensional arrays and matrices
import numpy as np

# 'Matplotlib' is a data visualization library for 2D and 3D plots, built on numpy
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# 'Seaborn' is based on matplotlib; used for plotting statistical graphics
import seaborn as sns

# import 'is_string_dtype' to check if the type of input is string  
from pandas.api.types import is_string_dtype

# import various functions to perform classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.linear_model import SGDClassifier


# import functions to perform logistic regression
import statsmodels
import statsmodels.api as sm

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id='set_options'></a>
# 2. Set Options

In [3]:
# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None

# use below code to convert the 'exponential' values to float
np.set_printoptions(suppress=True)

<a id='RD'></a>
# 3. Read Data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
#root_path='/content/gdrive/MyDrive/Canspirit/h1n1_vaccine_prediction.csv'
root_path='/content/gdrive/MyDrive/Colab Notebooks/ImarticusArun/4LogisticsRegression/h1n1_vaccine_prediction.csv'

In [None]:
# read the CSV data file
df_vaccine = pd.read_csv(root_path)

# display the top 5 rows of the dataframe
df_vaccine.head()

# Note: To display more rows, example 10, use head(10)

#### Dimensions of the data

In [None]:
# 'shape' attribute gives the total number of rows and columns in the data
df_vaccine.shape

<a id='data_preparation'></a>
# 4. Data Analysis and Preparation

<a id='Data_Understanding'></a>
## 4.1 Understand the Dataset

**1. Check for the data type**

In [None]:
# 'dtypes' gives the data type for each column
df_vaccine.dtypes

**2. Change the incorrect data type.**

In [None]:
# use 'for' loop to change the data type of variables 
for col in ['h1n1_worry','h1n1_awareness', 'antiviral_medication', 'contact_avoidance', 'bought_face_mask',
            'wash_hands_frequently', 'avoid_large_gatherings', 'reduced_outside_home_cont', 'avoid_touch_face', 
            'dr_recc_h1n1_vacc', 'dr_recc_seasonal_vacc', 'chronic_medic_condition','cont_child_undr_6_mnths',
           'is_health_worker', 'has_health_insur', 'is_h1n1_vacc_effective', 'is_h1n1_risky', 'sick_from_h1n1_vacc', 
            'is_seas_vacc_effective', 'is_seas_risky', 'sick_from_seas_vacc', 'no_of_adults', 'no_of_children']:

    # use .astype() to change the data type
    df_vaccine[col] = df_vaccine[col].astype('object')

**3. Recheck the data type after the conversion.**

In [None]:
# recheck the data types of all variables
df_vaccine.dtypes

In [None]:
# drop the field 'unique_id'
# axis=1: it stands for column
# inplace=True: it performs the operation on the original data
df_vaccine.drop('unique_id', axis=1, inplace=True)

In [None]:
#verify the shape
df_vaccine.shape

In [None]:
# splitting features and the target variable
# consider all the columns except 'h1n1_vaccine' using 'iloc'
df_features = df_vaccine.iloc[:, df_vaccine.columns != 'h1n1_vaccine']

# consider the target variable
df_target = df_vaccine.iloc[:, df_vaccine.columns == 'h1n1_vaccine']

Use the dataframe containing features (df_features) for further analysis.

<a id='Summary_Statistics'></a>
### 4.1.2 Summary Statistics

**1. Use describe() to summarize the variables**

In [None]:
# the describe() returns the statistical summary of the variables
# by default, it returns the summary of all categorical variables as there are no numerical variables in the dataset
# use .transpose() for better readability; however, it is optional
df_features.describe().transpose()

<a id='distribution_variables'></a>
### 4.1.3 Distribution of Variables

In [None]:
# create a list of all categorical variables
# initiate an empty list to store the categorical variables
categorical=[]

# use for loop to check the data type of each variable
for column in df_features:
    
    # use 'if' statement to check for the 'object' data type
    # (in recent pandas, is_string_dtype() is True only when every element is a string)
    if df_features[column].dtype == object:
        
        # append the variables with 'object' data type to the list 'categorical'
        categorical.append(column)

# plot the count plot for each categorical variable 
# set the number of rows in the subplot using the parameter, 'nrows'
# set the number of columns in the subplot using the parameter, 'ncols'
# 'figsize' sets the figure size
fig, ax = plt.subplots(nrows = 8, ncols = 4, figsize=(25, 30))


# use for loop to plot the count plot for each variable
for variable, subplot in zip(categorical, ax.flatten()):
    
    # use countplot() to plot the graph
    # pass the variable to the parameter, 'x'
    # pass the axes for the plot to the parameter, 'ax'
    sns.countplot(x = df_vaccine[variable], ax = subplot)

# display the plot
plt.show()

In [None]:
df_vaccine[variable]

#### 3. Distribution of dependent variable.

In [None]:
# get counts of 0's and 1's in the 'h1n1_vaccine' variable using 'value_counts()'
# store the values in 'class_frequency'
class_frequency = df_target.h1n1_vaccine.value_counts()
class_frequency

In [None]:
# plot the countplot of the variable 'h1n1_vaccine'
sns.countplot(x = df_target.h1n1_vaccine)

# use below code to print the values in the graph
# 'x' and 'y' gives position of the text
# 's' is the text on the plot
plt.text(x = -0.05, y = df_target.h1n1_vaccine.value_counts()[0] + 30, s = str((class_frequency[0])*100/len(df_target.h1n1_vaccine)) + '%')
plt.text(x = 0.95, y = df_target.h1n1_vaccine.value_counts()[1] +20, s = str((class_frequency[1])*100/len(df_target.h1n1_vaccine)) + '%')

# add plot and axes labels
# set text size using 'fontsize'
plt.title('Count Plot for Target Variable (h1n1_vaccine)', fontsize = 15)
plt.xlabel('Target Variable', fontsize = 15)
plt.ylabel('Count', fontsize = 15)

# to show the plot
plt.show()

<a id='correlation'></a>
### 4.1.4 Correlation

<a id='Missing_Values'></a>
### 4.1.5 Missing Values

In [None]:
# sort the variables on the basis of total null values in the variable
# 'isnull().sum()' returns the number of missing values in each variable
# 'ascending = False' sorts values in the descending order
# the variable with highest number of missing values will appear first
Total = df_vaccine.isnull().sum().sort_values(ascending = False)          

# calculate the percentage of missing values
# 'ascending = False' sorts values in the descending order
# the variable with highest percentage of missing values will appear first
Percent = (df_vaccine.isnull().sum()*100/df_vaccine.isnull().count()).sort_values(ascending = False)   

# concat the 'Total' and 'Percent' columns using 'concat' function
# 'keys' is the list of column names
# 'axis = 1' concats along the columns
missing_data = pd.concat([Total, Percent], axis = 1, keys = ['Total', 'Percentage of Missing Values'])    
missing_data

In [None]:
# plot heatmap to check null values
# 'cbar = False' does not show the color axis 
sns.heatmap(df_vaccine.isnull(), cbar=False)

# display the plot
plt.show()

The horizontal lines in the heatmap correspond to the missing values.

In [None]:
df_vaccine.drop(['has_health_insur','income_level','dr_recc_h1n1_vacc','dr_recc_seasonal_vacc'], axis=1, inplace=True)

In [None]:
df_vaccine.shape

In [None]:
sns.heatmap(df_vaccine.isnull(), cbar=False)
plt.show()

In [None]:
df_vaccine.dropna(axis=0, inplace=True)

In [None]:
df_vaccine.shape

After dropping the columns with many missing values and the remaining rows that contain nulls, recheck the null values.

In [None]:
sns.heatmap(df_vaccine.isnull(), cbar=False)
plt.show()

In [None]:
# recheck the null values
# 'isnull().sum()' returns the number of missing values in each variable
df_vaccine.isnull().sum()

<a id='Data_Preparation'></a>
## 4.2 Prepare the Data

To build the classification models, we need to encode the categorical variables using dummy encoding.

**1. Filter numerical and categorical variables**

There are no numerical variables except the dependent variable (h1n1_vaccine).

In [None]:
df_vaccine.dtypes

In [None]:
# create a list of all categorical variables
# initiate an empty list to store the categorical variables
categorical=[]

# use for loop to check the data type of each variable
for column in df_vaccine:
    
    # use 'if' statement to check for the 'object' data type
    # (in recent pandas, is_string_dtype() is True only when every element is a string)
    if df_vaccine[column].dtype == object:
        
        # append the variables with 'object' data type to the list 'categorical'
        categorical.append(column)

In [None]:
# dataframe with categorical features
# 'categorical' contains a list of categorical variables
df_cat = df_vaccine[categorical]

# dataframe with numerical features
# use 'drop()' to drop the categorical variables
# 'axis = 1' drops the corresponding column(s)
df_num = df_vaccine.drop(categorical, axis = 1)

**2. Dummy encode the categorical variables**

In [None]:
# print the first five observations of the 'df_cat'
df_cat.head()

In [None]:
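# use 'get_dummies()' from pandas to create dummy variables
# use 'drop_first = True' to create (n-1) dummy variables
# use 'dtype = int' to get 0/1 columns (recent pandas versions return booleans by default)
df_cat_dummies = pd.get_dummies(df_cat, drop_first = True, dtype = int)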

In [None]:
# check the first five observations of the data with dummy encoded variables
df_cat_dummies.head()
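To see what `drop_first = True` does, here is a minimal sketch on a toy dataframe. The column name `employment` is borrowed from the data definition, but the three rows and the names `toy` and `toy_dummies` are made up for illustration only.

```python
import pandas as pd

# toy data: one categorical column with three levels (illustrative values)
toy = pd.DataFrame({'employment': ['Employed', 'Unemployed', 'Not in Labor Force']})

# drop_first = True keeps (n-1) dummies; the dropped level acts as the baseline
# dtype = int gives 0/1 columns regardless of the pandas version
toy_dummies = pd.get_dummies(toy, drop_first = True, dtype = int)

# the alphabetically first level ('Employed') is the one that is dropped
print(list(toy_dummies.columns))
print(toy_dummies.values.tolist())
```

A row with zeros in every dummy column therefore belongs to the baseline level.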

After removal of missing values and dummy encoding the data, the dataframe `df_cat_dummies` contains all the independent variables and the dataframe df_num contains the target variable. We will rename these dataframes as X and y respectively.

In [None]:
# df_num contains only the target variable 'h1n1_vaccine'.
# We store it in dataframe 'y'
y = pd.DataFrame(df_num)

Now, use this 'y' as a target variable to build the classification models.

In [None]:
# df_cat_dummies contain all the dummy encoded independent variables
# We store it in dataframe 'X'
X = pd.DataFrame(df_cat_dummies)

In [None]:
# check the first five observations of X
X.head()

Use this 'X' as a set of predictors to build the classification models.

#### Create a generalized function to calculate the metrics for the test set.

In [None]:
# create a generalized function to calculate the metrics values for test set
def get_test_report(model):
    
    # return the performance measures on test set
    return(classification_report(y_test, y_pred))

#### Create a generalized function to calculate the kappa score for the test set.

In [None]:
# create a generalized function to calculate the kappa score for test set
def kappa_score(model):
    
    # return the kappa score on test set
    return(cohen_kappa_score(y_test, y_pred))

#### Define a function to plot the confusion matrix.

In [None]:
# define a function to plot a confusion matrix for the model
def plot_confusion_matrix(model):
    
    # create a confusion matrix
    # pass the actual and predicted target values to the confusion_matrix()
    cm = confusion_matrix(y_test, y_pred)

    # label the confusion matrix  
    # pass the matrix as 'data'
    # pass the required column names to the parameter, 'columns'
    # pass the required row names to the parameter, 'index'
    conf_matrix = pd.DataFrame(data = cm,columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

    # plot a heatmap to visualize the confusion matrix
    # 'annot' prints the value of each grid 
    # 'fmt = d' returns the integer value in each grid
    # 'cmap' assigns color to each grid
    # as we do not require different colors for each grid in the heatmap,
    # use 'ListedColormap' to assign the specified color to the grid
    # 'cbar = False' will not return the color bar to the right side of the heatmap
    # 'linewidths' assigns the width to the line that divides each grid
    # 'annot_kws = {'size':25})' assigns the font size of the annotated text 
    sns.heatmap(conf_matrix, annot = True, fmt = 'd', cmap = ListedColormap(['lightskyblue']), cbar = False, 
                linewidths = 0.1, annot_kws = {'size':25})

    # set the font size of x-axis ticks using 'fontsize'
    plt.xticks(fontsize = 20)

    # set the font size of y-axis ticks using 'fontsize'
    plt.yticks(fontsize = 20)

    # display the plot
    plt.show()

#### Define a function to plot the ROC curve.

In [None]:
# define a function to plot the ROC curve and print the ROC-AUC score
def plot_roc(model):
    
    # the roc_curve() returns the values for false positive rate, true positive rate and threshold
    # pass the actual target values and predicted probabilities to the function
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

    # plot the ROC curve
    plt.plot(fpr, tpr)

    # set limits for x and y axes
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])

    # plot the straight line showing worst prediction for the model
    plt.plot([0, 1], [0, 1],'r--')

    # add plot and axes labels
    # set text size using 'fontsize'
    plt.title('ROC Curve for h1n1_vaccine Classifier', fontsize = 15)
    plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
    plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)

    # add the AUC score to the plot
    # 'x' and 'y' gives position of the text
    # 's' is the text (built as a single string so it is not displayed as a tuple)
    # use round() to round-off the AUC score upto 4 digits
    plt.text(x = 0.02, y = 0.9, s = 'AUC Score: ' + str(round(roc_auc_score(y_test, y_pred_prob),4)))

    # plot the grid
    plt.grid(True)

#### Create a generalized function to create a dataframe containing the scores for the models.

In [None]:
# create an empty dataframe to store the scores for various classification algorithms
score_card = pd.DataFrame(columns=['Model', 'AUC Score', 'Precision Score', 'Recall Score', 'Accuracy Score',
                                   'Kappa Score', 'f1-score'])

# append the result table for all performance scores
# performance measures considered for comparison are 'AUC', 'Precision', 'Recall','Accuracy','Kappa Score', and 'f1-score'
# compile the required information in a user defined function 
def update_score_card(model_name):
    
    # assign 'score_card' as global variable
    global score_card

    # build a one-row dataframe with the scores of the current model
    new_row = pd.DataFrame([{'Model': model_name,
                             'AUC Score' : roc_auc_score(y_test, y_pred_prob),
                             'Precision Score': metrics.precision_score(y_test, y_pred),
                             'Recall Score': metrics.recall_score(y_test, y_pred),
                             'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                             'Kappa Score': cohen_kappa_score(y_test, y_pred),
                             'f1-score': metrics.f1_score(y_test, y_pred)}])

    # add the row to 'score_card' using pd.concat() (DataFrame.append() was removed in pandas 2.0)
    # 'ignore_index = True' resets the index labels
    score_card = pd.concat([score_card, new_row], ignore_index = True)
    return(score_card)

<a id='LogisticReg'></a>
# 5. Logistic Regression 

Logistic regression is one of the techniques used for classification. The estimates of the parameters are obtained by maximizing the likelihood function.
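As a rough illustration of what "maximizing the likelihood" means, the sketch below evaluates the Bernoulli log-likelihood of a one-feature logistic model for two candidate coefficient pairs. The data and the helper names `sigmoid` and `log_likelihood` are made up for this example and are not part of the case study.

```python
import math

def sigmoid(z):
    # map a linear score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    # sum of y*log(p) + (1-y)*log(1-p) over all observations
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# toy data: larger x tends to go with y = 1 (illustrative values only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

# intercept 0 and slope 0: every predicted probability is 0.5
ll_flat = log_likelihood(0.0, 0.0, xs, ys)

# coefficients whose probabilities follow the labels
ll_better = log_likelihood(-3.0, 2.0, xs, ys)

print(ll_flat, ll_better)
```

The second pair of coefficients gives a higher (less negative) log-likelihood, which is the quantity the fitting routine tries to increase.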

<a id='withStatsModels'></a>
## 5.1 Logistic Regression (using MLE)

**1. Introduce the intercept term**

In [None]:
# add the intercept column using 'add_constant()'
X = sm.add_constant(X)

# print the last five observations after adding intercept
X.tail()

**2. Split the dataset into train and test sets**

In [None]:
# split data into train subset and test subset
# set 'random_state' to generate the same split each time you run the code 
# 'test_size' sets the proportion of data to be included in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 10)

# check the dimensions of the train & test subset using 'shape'
# print dimension of train set
print("X_train",X_train.shape)
print("y_train",y_train.shape)

# print dimension of test set
print("X_test",X_test.shape)
print("y_test",y_test.shape)

#### 3. Build a logistic regression model using statsmodels `Logit()`.

In [None]:
# build the model on train data (X_train and y_train)
# use fit() to fit the logistic regression model
log_reg_model = sm.Logit(y_train, X_train).fit()

# print the summary of the model
print(log_reg_model.summary())

**Interpretation:** The `Pseudo R-squ.` obtained from the above model summary is the value of `McFadden's R-squared`.
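McFadden's R-squared compares the log-likelihood of the fitted model with that of an intercept-only model. In statsmodels these two quantities are available on the fitted result as `llf` and `llnull`. The helper `mcfadden_r2` and the numbers below are hypothetical and only show the arithmetic.

```python
def mcfadden_r2(ll_model, ll_null):
    # 1 - (log-likelihood of fitted model / log-likelihood of intercept-only model)
    return 1.0 - ll_model / ll_null

# illustrative log-likelihood values, not taken from the model above
example = mcfadden_r2(ll_model = -800.0, ll_null = -1000.0)
print(round(example, 4))
```

A value near 0 means the predictors add little over the intercept-only model; larger values indicate a better fit.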

**4. Do predictions on the test set**

In [None]:
# let 'y_pred_prob' be the predicted values of y
y_pred_prob = log_reg_model.predict(X_test)

# print the y_pred_prob
y_pred_prob.head()

In [None]:
# convert probabilities to 0 and 1 using a list comprehension
y_pred = ['0' if x < 0.5 else '1' for x in y_pred_prob]

In [None]:
# convert the predicted values to type 'float32'
y_pred = np.array(y_pred, dtype=np.float32)

# print the first five predictions
y_pred[0:5]

#### 5. Calculate the performance measures.

#### Build a confusion matrix.

In [None]:
# call the function to plot the confusion matrix
# pass the logistic regression model to the function
plot_confusion_matrix(log_reg_model)

**Calculate performance measures on the test set.**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the logistic regression model to the function
test_report = get_test_report(log_reg_model)

# print the performance measures
print(test_report)

**Interpretation:** The accuracy is 81% for this model. Also, there is a significant difference between specificity and sensitivity.
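Sensitivity and specificity can be read directly off the confusion matrix. The sketch below uses made-up counts for an imbalanced test set (not the output of the model above) and a hypothetical helper `sensitivity_specificity`. For binary labels, the flattened sklearn confusion matrix is ordered as tn, fp, fn, tp.

```python
def sensitivity_specificity(tn, fp, fn, tp):
    # sensitivity (recall of class 1) = TP / (TP + FN)
    sensitivity = tp / (tp + fn)
    # specificity (recall of class 0) = TN / (TN + FP)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# illustrative counts: 100 actual negatives and 20 actual positives
sens, spec = sensitivity_specificity(tn = 90, fp = 10, fn = 15, tp = 5)
print(sens, spec)
```

With counts like these, accuracy looks reasonable because the majority class dominates, while sensitivity stays low.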

In [None]:
# compute kappa score on test set
# call the function 'kappa_score'
# pass the logistic regression model to the function
kappa_value = kappa_score(log_reg_model)

# print the kappa value
print(kappa_value)

**Interpretation:** As the kappa score for the logistic regression is 0.3426, we can say that there is low to moderate agreement between the actual and predicted values.
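Cohen's kappa corrects the observed agreement for the agreement expected by chance. The following sketch computes it from the four cells of a binary confusion matrix. The helper `cohen_kappa` and the counts are illustrative assumptions, not values from this notebook.

```python
def cohen_kappa(tn, fp, fn, tp):
    n = tn + fp + fn + tp
    # observed agreement: share of cases where the prediction equals the actual label
    p_o = (tn + tp) / n
    # agreement expected by chance, from the row and column marginals
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    # kappa = (observed - expected) / (1 - expected)
    return (p_o - p_e) / (1 - p_e)

# illustrative counts: 80% observed agreement, 50% expected by chance
kappa = cohen_kappa(tn = 40, fp = 10, fn = 10, tp = 40)
print(round(kappa, 4))
```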

**Plot the ROC curve.**

In [None]:
# call the function 'plot_roc' to plot the ROC curve
# pass the logistic regression model to the function
plot_roc(log_reg_model)

**6. Tabulate the results.**

Now, we tabulate the results, so that it is easy for us to compare the models built.

In [None]:
# use the function 'update_score_card' to store the performance measures
# pass the 'Logistic Regression (MLE)' as model name to the function
update_score_card(model_name = 'Logistic Regression (MLE)')

<a id='usingSGD'></a>
## 5.2 Logistic Regression (using SGD)

**1. Scale the data features**

We do not need to scale the data, as all the variables are categorical.

**2. Split the data into training and test sets**

The data has been split in section 5.1

**3. Build the model**

The `SGDClassifier()` from sklearn contains an intercept term. Thus, there is no need to add the column of intercept.

In [None]:
# instantiate the 'SGDClassifier' to build model using SGD
# to perform logistic regression, consider the log-loss function 
# (named 'log_loss' in scikit-learn 1.1 and later; older versions use 'log')
# set 'random_state' to get reproducible results each time you run the code 
SGD = SGDClassifier(loss = 'log_loss', random_state = 10)

# fit the model on the training data
logreg_with_SGD = SGD.fit(X_train, y_train)
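For intuition, one stochastic gradient step for the log-loss can be sketched as below. This is a simplified illustration with a hypothetical helper `sgd_step` and a fixed learning rate; sklearn's `SGDClassifier` additionally applies regularization and a learning-rate schedule, which are left out here.

```python
import math

def sgd_step(w, b, x, y, lr):
    # predicted probability for a single observation
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    # gradient of the log-loss for this observation is (p - y) * x
    error = p - y
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b

# one update from zero weights on a single positive example (illustrative values)
w, b = sgd_step(w = [0.0, 0.0], b = 0.0, x = [1.0, 0.0], y = 1, lr = 0.1)
print(w, b)
```

The weight of the active feature and the intercept both move in the direction that raises the predicted probability for this positive example.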

**4. Do predictions on the test set**

In [None]:
# predict probabilities on the test set
# consider the probability of positive class by subsetting with '[:,1]'
y_pred_prob = logreg_with_SGD.predict_proba(X_test)[:,1]
#y_pred_prob = logreg_with_SGD.predict_proba(X_test)

In [None]:
y_pred_prob

In [None]:
#X_test.head(2)

In [None]:
# use predict() to predict the class labels of target variable
y_pred = logreg_with_SGD.predict(X_test)

**5. Compute accuracy measures**

#### Build a confusion matrix.

In [None]:
# call the function to plot the confusion matrix
# pass the logistic regression (SGD) model to the function
plot_confusion_matrix(logreg_with_SGD)

**Calculate performance measures on the test set.**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the logistic regression (SGD) model to the function
test_report = get_test_report(logreg_with_SGD)

# print the performance measures
print(test_report)

**Interpretation:** The accuracy is 81% for this model.

In [None]:
# compute kappa score on test set
# call the function 'kappa_score'
# pass the logistic regression (SGD) model to the function
kappa_value = kappa_score(logreg_with_SGD)

# print the kappa value
print(kappa_value)

**Interpretation:** As the kappa score for the logistic regression (SGD) is 0.3047, we can say that there is low to moderate agreement between the actual and predicted values.

**Plot the ROC curve.**

In [None]:
# call the function 'plot_roc' to plot the ROC curve
# pass the logistic regression (SGD) model to the function
plot_roc(logreg_with_SGD)

**6. Tabulate the results**

In [None]:
# use the function 'update_score_card' to store the performance measures
# pass the 'Logistic Regression (SGD)' as model name to the function
update_score_card(model_name = 'Logistic Regression (SGD)')

<a id="conclusion"> </a>
# 6. Conclusion and Interpretation

To draw the final conclusion, let us print the result table.

In [None]:
# print the 'score_card' to compare all the models
score_card

Let us plot the performance measures of the two models in a single graph.

In [None]:
# plot the graph
# by default, plot() returns the line plot
score_card.plot()

# set the text size of the title
plt.title(label = 'Comparison of the Models', fontsize = 15)

# set the model names as x-ticks
# 'score_card.Model' returns the model names
# rotate the x-axis labels vertically
plt.xticks([0,1], list(score_card.Model), rotation = 'vertical')

# display the plot
plt.show()

<a id='usingSGD'></a>
## 5.3 Logistic Regression (using sklearn), Applying ROC Thresholds to Improve the Model

Applying ROC thresholds to improve the model is not required when using the sklearn method. Use it only if there is a business reason to do so.

In [None]:
X1 = pd.DataFrame(df_cat_dummies)
X1.head(5)

In [None]:
Y1 = pd.DataFrame(df_num)
Y1.head(5)

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
#from sklearn.model_selection import cross_val_score

#from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X1,Y1,test_size=0.2,random_state=10)

In [None]:
# build the model on train data (X_train and y_train)
model = LogisticRegression()

In [None]:
# fit the model with data
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
plot_confusion_matrix(model)

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the logistic regression model to the function
test_report = get_test_report(model)

# print the performance measures
print(test_report)

In [None]:
model.predict_proba(X_test)[:10,1:]

In [None]:
y_pred_prob = model.predict_proba(X_test)[:,1]


In [None]:
y_pred_prob[0:10]    # first ten predicted probabilities of the positive class

In [None]:
y_pred = model.predict(X_test)

#### 5. Calculate the performance measures.

#### Build a confusion matrix.

In [None]:
# call the function to plot the confusion matrix
# pass the logistic regression model to the function
#plot_confusion_matrix(model)

**Calculate performance measures on the test set.**

In [None]:
# compute the performance measures on test data
# call the function 'get_test_report'
# pass the logistic regression model to the function
#test_report = get_test_report(model)

# print the performance measures
#print(test_report)

In [None]:
# compute kappa score on test set
# call the function 'kappa_score'
# pass the logistic regression model to the function
kappa_value = kappa_score(model)

# print the kappa value
print(kappa_value)

**Interpretation:** As the kappa score for the logistic regression is 0.3424, we can say that there is low to moderate agreement between the actual and predicted values.

**Plot the ROC curve.**

In [None]:
# call the function 'plot_roc' to plot the ROC curve
# pass the logistic regression model to the function
plot_roc(model)

<font color="red">
The Y-axis of the ROC graph denotes the True Positive Rate, also called Sensitivity. The X-axis of the ROC graph denotes the False Positive Rate (1 - Specificity). <br>  
</font>  
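Each classification threshold corresponds to one point on the ROC curve. The sketch below computes that point for two thresholds on made-up scores. The helper `roc_point` is hypothetical and written only for this illustration; sklearn's `roc_curve` sweeps over all thresholds at once.

```python
def roc_point(y_true, scores, threshold):
    # classify as positive when the score reaches the threshold
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    # return (false positive rate, true positive rate)
    return fp / neg, tp / pos

# illustrative labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]

points = [roc_point(y_true, scores, t) for t in (0.5, 0.75)]
print(points)
```

Raising the threshold moves the point towards the left of the plot, because fewer observations are predicted positive.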

In [None]:
roc_auc_score(y_test,y_pred_prob)

**6. Tabulate the results.**

Now, we tabulate the results, so that it is easy for us to compare the models built.

In [None]:
# use the function 'update_score_card' to store the performance measures
# pass the 'Logistic Regression' as model name to the function
update_score_card(model_name = 'Logistic Regression default (sklearn)')

In [None]:
model.predict_proba(X_test)

In [None]:
score_card

<a id="conclusion"> </a>
### 5.3.1 ROC Curve with Different Threshold Values

THRESHOLD VALUE = 0.5

In [None]:
# build the model on train data (X_train and y_train)
modelp5 = LogisticRegression()

In [None]:
# fit the model with data
modelp5.fit(X_train,y_train)

In [None]:
modelp5.predict_proba(X_test)[:10,1:]

In [None]:
y_pred_prob = (modelp5.predict_proba(X_test)[:,1]>0.5).astype(int)

In [None]:
y_pred_prob[0:10]    # class labels (0/1) obtained by applying the 0.5 threshold

In [None]:
y_pred = modelp5.predict(X_test)

In [None]:
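# convert the thresholded values to integer class labels 0 and 1
# (y_pred_prob already holds 0/1 labels at this point, so the comparison leaves them unchanged)
y_pred = [0 if x < 0.5 else 1 for x in y_pred_prob]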

**Plot the ROC curve.**

In [None]:
# call the function 'plot_roc' to plot the ROC curve
# pass the logistic regression model to the function
plot_roc(modelp5)

In [None]:
roc_auc_score(y_test,y_pred_prob)

THRESHOLD VALUE = 0.75

In [None]:
# build the model on train data (X_train and y_train)
modelp75 = LogisticRegression()

In [None]:
# fit the model with data
modelp75.fit(X_train,y_train)

In [None]:
modelp75.predict_proba(X_test)[:10,1:]

In [None]:
y_pred_prob = (modelp75.predict_proba(X_test)[:,1]>0.75).astype(int)

In [None]:
y_pred_prob[0:10]    # class labels (0/1) obtained by applying the 0.75 threshold

In [None]:
y_pred = modelp75.predict(X_test)

In [None]:
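# convert the thresholded values to integer class labels 0 and 1
# (y_pred_prob already holds 0/1 labels at this point, so the comparison leaves them unchanged)
y_pred = [0 if x < 0.75 else 1 for x in y_pred_prob]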

**Plot the ROC curve.**

In [None]:
# call the function 'plot_roc' to plot the ROC curve
# pass the logistic regression model to the function
plot_roc(modelp75)

In [None]:
roc_auc_score(y_test,y_pred_prob)
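Moving the threshold from 0.5 to 0.75 typically trades recall for precision. The sketch below shows this on made-up probabilities; the helper `precision_recall` and the data are assumptions for illustration and do not reproduce the notebook's results.

```python
def precision_recall(y_true, scores, threshold):
    # predict the positive class only when the probability exceeds the threshold
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# illustrative labels and predicted probabilities
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.4, 0.6, 0.55, 0.8, 0.9]

pr_050 = precision_recall(y_true, scores, 0.5)
pr_075 = precision_recall(y_true, scores, 0.75)
print(pr_050, pr_075)
```

On this toy data the higher threshold removes the false positive but also misses one actual positive, which is the kind of trade-off a business reason would have to justify.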

<a id="conclusion"> </a>
# 6. Conclusion and Interpretation

<font color="red">
The dotted line marks the points where the true positive rate equals the false positive rate, which is the performance of a random classifier. The farther the blue line lies above the dotted line, the better the model classifies. A curve farther from the diagonal indicates a higher True Positive Rate at the same False Positive Rate, i.e. more positive observations are correctly classified.
</font>
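The area under the ROC curve can also be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes the AUC in this pairwise way on made-up data. The helper `auc_by_pairs` is hypothetical and meant only to illustrate the interpretation.

```python
def auc_by_pairs(y_true, scores):
    # separate the scores of actual positives and actual negatives
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                # positive ranked above negative
                wins += 1.0
            elif p == n:
                # ties count as half a win
                wins += 0.5
    # share of (positive, negative) pairs that are ranked correctly
    return wins / (len(pos) * len(neg))

# illustrative labels and predicted probabilities
auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
print(auc)
```

A value of 0.5 corresponds to the dotted diagonal, and a value of 1.0 means every positive is ranked above every negative.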