# Logistic Regression

<b>Logistic Regression</b> is a <b>Classification</b> algorithm for categorical variables, i.e. for predicting discrete values ({0, 1}, {True, False}, etc.) <br><br>

The goal of Logistic Regression is to build a model that can<br>
- predict the class which the input sample case belongs to
- find the probability of input sample case belonging to a class

Logistic Regression is a variation of Linear Regression, used when the observed dependent variable, <b>y</b>, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.

Logistic regression fits a special s-shaped curve by taking the linear regression function ${\theta^TX}$ and transforming the numeric estimate into a probability with the following function, which is called the sigmoid function 𝜎:


$$
h_\theta(x) = \sigma(\theta^T X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}
$$

Or:
$$
\text{Probability of class 1} = P(Y=1 \mid X) = \sigma(\theta^T X) = \frac{1}{1+e^{-\theta^T X}}
$$

In this equation, ${\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), $e$ is the base of the exponential function, and $\sigma(\theta^TX)$ is the sigmoid, or logistic, function.
<br><br>

If ${\theta^TX}$ is a large positive value, then $\sigma({\theta^TX})$ ≈ 1, and <br>
if ${\theta^TX}$ is a large negative value, then $\sigma({\theta^TX})$ ≈ 0.<br>
Thus, the predicted value always lies in the range (0, 1).
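
The saturation behaviour described above can be checked numerically. The snippet below is a minimal sketch; the input values are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Illustrative inputs: a large negative value, zero, and a large positive value
z_values = np.array([-10.0, 0.0, 10.0])
probs = sigmoid(z_values)

# The outputs approach 0 and 1 but never reach them; sigmoid(0) is exactly 0.5
print(probs)
```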

So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:

<img
src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/images/mod_ID_24_final.png" width="400" align="center">

The objective of the **Logistic Regression** algorithm is to find the best parameters θ for $h_\theta(x) = \sigma(\theta^T X)$, such that the model best predicts the class of each case.


<b>NOTE</b>: in this notebook, we are building a binary classification model.<br><br>
However, Logistic Regression can be used just the same for multi-class classification using the <b>one-vs-all technique</b>, i.e.
- Train a Logistic Regression classifier $h_\theta^{(i)}(x)$ for each discrete class i to predict the probability that y = i
- On a new input 'x', to make a prediction, run $h_\theta^{(i)}(x)$ for all i = 1, 2, 3, ..., i.e. for all the classes
- Pick the class i that maximizes $h_\theta^{(i)}(x)$
<br><br>
In layman's terms,
- clump all classes but one together, so that the left-out class and the clumped group form the 2 classes required for Binary Classification
- find the probability of the sample belonging to the left-out class
- repeat these steps, leaving out each class in turn and finding the probability that the sample belongs to that class
- finally, assign the sample to the class with the highest probability.
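
The one-vs-all steps listed above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the three-class data is synthetic, and `fit_binary` is a hypothetical helper that trains each binary classifier with plain gradient descent rather than the optimizer used later in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.05, n_iter=2000):
    """Fit one binary logistic regression by plain gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / y.size
        theta -= lr * grad
    return theta

# Synthetic 3-class data (assumed for illustration): three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
labels = np.repeat(np.arange(3), 100)
points = centers[labels] + rng.normal(size=(300, 2))
X_ova = np.column_stack([np.ones(300), points])   # prepend the intercept column

# One classifier per class: class i versus all the other classes
thetas = np.array([fit_binary(X_ova, (labels == i).astype(float))
                   for i in range(3)])

# Column i holds the estimated P(y = i | x); pick the most probable class
class_probs = sigmoid(X_ova @ thetas.T)
predicted = np.argmax(class_probs, axis=1)
accuracy = np.mean(predicted == labels)
print('training accuracy:', accuracy)
```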

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   #the plotting cells below refer to it as 'plt'
import seaborn as sns
sns.set(style='white')
sns.set(style='whitegrid', color_codes=True)
from scipy import optimize
import utils

In [None]:
#read data using pandas method

df = pd.read_csv('dataset.csv')

## 1.1 Logistic Regression from scratch

<b> first, we will build the Logistic Regression algorithm from scratch without using the scikit-learn library for understanding the algorithm</b>

### 1.1.1 sigmoid function

recall that the logistic regression hypothesis is defined as:

$$ h_\theta(x) = g(\theta^T x)$$

where function $g$ is the sigmoid function. The sigmoid function is defined as: 

$$g(z) = \frac{1}{1+e^{-z}}$$

In [None]:
import math

In [None]:
X = df[['relevant', 'columns', 'and', 'not', 'target_var']]
y = df['target_var']

In [None]:
def sigmoid(z):
    '''
    Objective : calculate sigmoid function given input z
    
    Parameters :
    z : array_like
        The input to the sigmoid function. 
        This can be a 1-D vector or a 2-D matrix
    
    Returns:
    g : array_like
        The computed sigmoid function. g has the same shape as z,
        as the sigmoid is computed element wise on z.
    '''
    
    z = np.array(z)
    
    g = np.zeros(z.shape)
    
    g = 1/(1+np.exp(-z))
    
    return g

### 1.1.2 Cost Function

cost function in logistic regression is

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta\left( x^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - h_\theta\left( x^{(i)} \right) \right) \right]$$

and the gradient of the cost is a vector of the same length as $\theta$ where the $j^{th}$
element (for $j = 0, 1, \cdots , n$) is defined as follows:

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} $$

In [None]:
def costFunction(theta, X, y):
    '''
    Objective 
    ---------
    Compute cost and gradient for Logistic Regression
    Compute the cost of a particular choice of theta. Set J to cost.
    Compute the partial derivatives and set the grad to the partial derivatives 
    of the cost w.r.t each parameter in theta.
    
    
    Parameters
    ----------
    theta : array_like
            The parameters for Logistic Regression. 
            This is a vector of shape (n+1, )
    
    
    X : array_like
        The input dataset of shape (m x n+1) where m is the total number 
        of data points and n is the total number of features.

    y : array_like
        Labels for the input. 
        This is a vector of shape (m, )
    
    Returns
    -------
    
    J : float
        The computed version for the Cost Function
        
    grad : array_like
           A vector of shape (n+1, ) which is the gradient of the cost function
           w.r.t. theta, at the current values of theta.
                
    '''
    
    m = y.size
    
    J = 0
    grad = np.zeros(theta.shape)
    
    z = np.dot(X, theta)
    h = sigmoid(z)
    
    J = np.mean((- np.transpose(y)*np.log(h)) - (1-y)*np.log(1-h))
    
    grad = 1/m * np.dot(X.T, (h-y))
    
    return J, grad
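
One way to gain confidence that the gradient formula matches the cost is a finite-difference check. The sketch below re-declares a compact cost/gradient function so that it runs on its own; the random data and the names `cost_and_grad`, `X_demo`, `y_demo` and `theta_demo` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Unregularized logistic regression cost and its analytic gradient."""
    h = sigmoid(X @ theta)
    J = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / y.size
    return J, grad

# Small random problem, used purely for illustration
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_demo = rng.integers(0, 2, size=20).astype(float)
theta_demo = rng.normal(size=3)

# Central finite differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
eps = 1e-5
numeric_grad = np.zeros_like(theta_demo)
for j in range(theta_demo.size):
    step = np.zeros_like(theta_demo)
    step[j] = eps
    J_plus, _ = cost_and_grad(theta_demo + step, X_demo, y_demo)
    J_minus, _ = cost_and_grad(theta_demo - step, X_demo, y_demo)
    numeric_grad[j] = (J_plus - J_minus) / (2 * eps)

_, analytic_grad = cost_and_grad(theta_demo, X_demo, y_demo)
max_diff = np.max(np.abs(numeric_grad - analytic_grad))
print('largest difference between the two gradients:', max_diff)
```

A difference close to zero suggests that the analytic gradient and the cost are consistent with each other.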

### 1.1.3 find OPTIMAL learning parameters (𝜃)

we can use different techniques like <b>Gradient Descent</b> to find the optimal values of parameter <b>theta</b>


In this notebook, we will use the `scipy.optimize` module.<br>

For Logistic Regression, we optimize the Cost Function J(𝜃) with parameters 𝜃. Concretely, we use `optimize.minimize` to find the optimal parameters for the Logistic Regression cost function, given a fixed dataset (of X and y values).

`scipy.optimize.minimize` takes the following arguments:

- `costFunction` : A cost function that, given the training set and a particular 𝜃, computes the logistic regression cost and gradient with respect to 𝜃 for the dataset (X, y). We pass only the name of this function, without parentheses, indicating that we are providing a reference to the function and not the result of calling it.

- `initial_theta` : The initial values of the parameters we are trying to optimize.

- `(X, y)` : Additional arguments passed to the cost function.

- `jac` : Indicates whether the cost function returns the Jacobian (gradient) along with the cost value (here, `True`).

- `method` : The optimization method to use.

- `options` : Additional options, which may be specific to the chosen optimization method.

If we write the `costFunction` correctly, `optimize.minimize` will converge on the right optimization parameters and return the final values of the cost and 𝜃 in a class object.
<br><br>
NOTE: using `scipy.optimize.minimize`, we only had to pass in a correct cost function; we did not have to write any loops or set a learning rate, as we would for Gradient Descent.
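
For contrast, a hand-written gradient descent loop might look like the sketch below. The synthetic data, the learning rate `alpha` and the iteration count are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Synthetic, noisy binary data with an intercept column (illustrative only)
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
noise = rng.normal(size=200)
y_gd = (features[:, 0] + 0.5 * features[:, 1] + noise > 0).astype(float)
X_gd = np.column_stack([np.ones(200), features])

alpha = 0.5          # learning rate, has to be chosen by hand
n_iterations = 500   # number of update steps, also chosen by hand
theta_gd = np.zeros(3)
cost_history = [cost(theta_gd, X_gd, y_gd)]

for _ in range(n_iterations):
    gradient = X_gd.T @ (sigmoid(X_gd @ theta_gd) - y_gd) / y_gd.size
    theta_gd = theta_gd - alpha * gradient
    cost_history.append(cost(theta_gd, X_gd, y_gd))

print('cost before:', cost_history[0])
print('cost after :', cost_history[-1])
print('theta      :', theta_gd)
```

With all-zero parameters every prediction is 0.5, so the starting cost is log(2); the loop then lowers it step by step.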

In [None]:
#set options; set max number of iterations to 400
options = {'maxiter': 400}

#start the search from all zeros, one parameter per column of X
#(X is assumed to already contain the intercept column of ones)
initial_theta = np.zeros(X.shape[1])

res = optimize.minimize(costFunction, initial_theta, (X, y),
                        jac=True, method='TNC', options=options)

#the fun property of the 'OptimizeResult' object returns
#the value of costFunction at the optimized theta
cost = res.fun

#the optimized theta is in the x property
theta = res.x

print('Cost at theta found by optimize.minimize: {:.3f}'.format(cost))

print('theta:')
print('\t', np.round(theta, 3))

### 1.1.4 Visualize Decision Boundary

Use `seaborn` or `matplotlib` to plot the data together with the Decision Boundary given by the calculated optimal values of 𝜃

### 1.1.5 Evaluate Logistic Regression


After learning the parameters, we can use the model to predict 0/1, True/False, etc. for given sample input(s) by comparing the predicted probability against a user-set threshold (between 0 and 1)
<br><br>
for new Input 'X'

$$
P(y=1 \mid x) = g(\theta^T x)
$$

i.e. the probability that the output y=1 given 'X' is $g(\theta^T x)$, the sigmoid of $\theta^T x$
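
The effect of the threshold can be seen on a handful of made-up probabilities. The values and the helper name `classify` below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

# Hypothetical predicted probabilities P(y=1 | x) for five samples
probabilities = np.array([0.10, 0.45, 0.50, 0.70, 0.95])

def classify(probabilities, threshold=0.5):
    """Label a sample 1 when its probability reaches the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

default_labels = classify(probabilities)                # threshold 0.5
strict_labels = classify(probabilities, threshold=0.8)  # stricter threshold

print(default_labels)  # [0 0 1 1 1]
print(strict_labels)   # [0 0 0 0 1]
```

Raising the threshold makes the model more reluctant to predict the positive class.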

In [None]:
def predict(theta, X):
    '''
    Objective
    ---------
    Predict whether the label is 0 or 1 using learned Logistic Regression
    Computes the predictions for X using a threshold at (here) 0.5 
    (i.e., if sigmoid(theta.T*x) >= 0.5, predict 1)
    
    Parameters
    ----------
    theta : array_like
            Parameters for logistic regression.
            A vector of shape (n+1, )
    
    X : array_like
        The data to use for computing predictions.
        The row is the number of points to compute predictions,
        and columns is the number of features.
        
    Returns
    -------
    
    p : array_like
        Predictions as 0 or 1 for each row in X.     
    '''
    
    #probability that y = 1 for each row of X
    h = sigmoid(np.dot(X, theta))
    
    p = (h >= 0.5) * 1
                         # >=0.5 makes 'p' an ndarray of True and False
                         # *1 transforms True/False values to 1/0
    return p

In [None]:
#pred = predict(theta, X_test)
#print(pred)

### 1.1.6 Evaluate model performance

described later in this notebook in detail

## 1.2 Regularized Logistic Regression

<b>Regularization</b> is a technique to reduce overfitting by penalizing large coefficients in the cost function. <br>
It does so by adding a penalty term to the cost function, whose strength is controlled by the <b>regularization parameter : λ </b>
<br>
The larger lambda is, the more the coefficients are shrunk toward zero (and each other).
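
The shrinking effect of λ can be observed by fitting the same data with several values. This is a minimal sketch under assumptions: the data is synthetic, `fit_regularized` is a hypothetical helper using gradient descent, and the intercept `theta[0]` is left unpenalized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized(X, y, lambda_, lr=0.5, n_iter=3000):
    """Gradient descent on the L2-regularized cost; theta[0] is not penalized."""
    m = y.size
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lambda_ / m) * theta[1:]
        theta -= lr * grad
    return theta

# Synthetic, noisy binary data with an intercept column (illustrative only)
rng = np.random.default_rng(2)
features = rng.normal(size=(200, 2))
noise = rng.normal(size=200)
y_reg = (features[:, 0] + 0.5 * features[:, 1] + noise > 0).astype(float)
X_reg = np.column_stack([np.ones(200), features])

norms = []
for lambda_ in [0.0, 10.0, 100.0]:
    theta_reg = fit_regularized(X_reg, y_reg, lambda_)
    norms.append(np.linalg.norm(theta_reg[1:]))
    print('lambda =', lambda_, ' norm of non-intercept coefficients =', norms[-1])
```

The printed norms decrease as λ grows, which is the shrinkage described above.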

### 1.2.2 Feature Mapping 


When a dataset cannot be linearly separated, a straightforward application of Logistic Regression will not perform well on it, since Logistic Regression can only find a linear decision boundary.
<br>

One way to fit such a dataset (which requires a non-linear decision boundary, say an elliptical one) is to create more features from each data point, i.e. <b>add Polynomial features</b> to our matrix (similar to Polynomial Regression).
<br>

The function `mapFeature` defined in the `utils.py` allows us to map features into all polynomial terms up to 6th power.

$$ \text{mapFeature}(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & x_1^3 & \dots & x_1 x_2^5 & x_2^6 \end{bmatrix}^T $$

As a result, our logistic regression classifier will be trained on higher-dimension feature vector and have a more complex decision boundary and will appear non-linear when drawn in our 2-dimensional plot.<br><br>

While feature mapping allows us to build a more expressive classifier, it also makes the model more susceptible to Overfitting
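
To make the mapping concrete, here is a minimal sketch of what such a function could look like for two input features. `map_feature_sketch` is a hypothetical stand-in written for this illustration; it is not the `mapFeature` from `utils.py`, whose exact signature is not shown in this notebook.

```python
import numpy as np

def map_feature_sketch(x1, x2, degree=6):
    """Return columns [1, x1, x2, x1^2, x1*x2, x2^2, ...] up to `degree`."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = [np.ones_like(x1)]                 # intercept term
    for total in range(1, degree + 1):           # total power of each term
        for j in range(total + 1):               # power given to x2
            columns.append(x1 ** (total - j) * x2 ** j)
    return np.column_stack(columns)

# Two sample points with features (x1, x2) = (2, 3) and (0.5, -1)
mapped = map_feature_sketch([2.0, 0.5], [3.0, -1.0], degree=6)

print(mapped.shape)    # (2, 28)
print(mapped[0, :6])   # [1. 2. 3. 4. 6. 9.]
```

Two original features become 28 columns at degree 6, which is why regularization becomes important in the next step.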

In [None]:
X = df[['relevant', 'columns', 'and', 'not', 'target_var']]
y = df['target_var']

In [None]:
#mapFeature lives in utils.py, which was imported above as 'utils'
X = utils.mapFeature(X, degree=6)

### 1.2.3 Regularized Cost Function and Gradient

Regularized cost function in logistic regression is

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)}\log \left( h_\theta \left(x^{(i)} \right) \right) - \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta \left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 $$

Note that you should not regularize the parameter $\theta_0$. The gradient of the cost function is a vector where the $j^{th}$ element is defined as follows:

$$ \frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j =0 $$

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \qquad \text{for } j \ge 1 $$
<a id="costFunctionReg"></a>

In [None]:
def costFunctionReg(theta, X, y, lambda_):
    '''
    Objective
    ---------
    Compute cost and gradient for logistic regression with regularization.
    Compute the cost `J` of a particular choice of theta.
    Compute the partial derivatives and set `grad` to the partial
    derivatives of the cost w.r.t. each parameter in theta.
   
    Parameters
    ----------
    theta : array_like
            Logistic regression parameters. A vector with shape (n, ). 
            n is the number of features including any intercept. 
            If we have mapped our initial features into polynomial features, 
            then n is the total number of polynomial features. 
                
    X : array_like
        The dataset with the shape (m x n)
        m is the number of data points (examples)
        n is the number of features (After feature mapping)
        
    y : array_like
        The data labels. A vector with shape(m, )
        
    lambda_ : float
              The regularization parameter                
    
    Returns
    -------
    J : float
        The computed value for the regularized cost function
        
    grad : array_like
           A vector of shape (n, ) which is the gradient of the cost function
           w.r.t. each parameter in theta 
    '''
    
    m = y.size
    
    J = 0
    grad = np.zeros(theta.shape)
    
    z = np.dot(X, theta)
    h = sigmoid(z)
    
    J = np.mean(-y*np.log(h) - (1-y)*np.log(1-h))
    #regularization term; theta[0] (the intercept) is not penalized
    reg = (lambda_/(2*m)) * (theta[1:].T@theta[1:])
    J += reg
    
    grad = (1/m) * X.T @ (h-y)
    grad[1:] = grad[1:] + (lambda_ / m) * theta[1:]
    
    return J, grad

Once we are done with the costFunctionReg, we call it below using the initial value of  𝜃  (initialized to all zeros), and also another test case where  𝜃  is all ones.

In [None]:
# Initialize fitting parameters
initial_theta = np.zeros(X.shape[1])
lambda_ = 1

# Compute and display initial cost and gradient for regularized logistic
# regression
cost, grad = costFunctionReg(initial_theta, X, y, lambda_)

print('Cost at initial theta (zeros): {:.3f}'.format(cost))

### 1.2.4 Learning parameters using `scipy.optimize.minimize`

Similar to the previous parts, we will use `optimize.minimize` to learn the optimal parameters <b>𝜃</b>.


### 1.2.5 Visualize Decision Boundary

Use `seaborn` or `matplotlib` to plot the data together with the Decision Boundary given by the calculated optimal values of 𝜃

### 1.2.5 Predict

In [None]:
#'theta' holds the learned parameters found with 'optimize.minimize', as done above

#pred2 = predict(theta, X_test)

### 1.2.6 Evaluate model performance

described later in this notebook in detail

## 2. Logistic Regression with python

using `pandas`, `scikit-learn` and other libraries

### 2.1 load data

In [None]:
df = pd.read_csv('dataset.csv')
#OR, to read from a URL (placeholder address):
#df = pd.read_csv('http://url')

### 2.2 Exploratory Data Analysis

It is the approach of analyzing data sets to summarize their main characteristics.
<br><br>
In Logistic Regression, we find the underlying story the data tells by <b>Visualizing</b> the data.
<br>
In other words, we find the <b>main features</b> of the dataset by plotting all the features against the target variable and interpreting the plots.

In [None]:
df.sample(7)

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.shape   #shape is an attribute, not a method

In [None]:
df.info()

In [None]:
for col in df.columns:
    print(col, 'unique values :', df[col].nunique())

In [None]:
df.describe()

### 2.2.2 Grouping Data

Pandas `dataframe.groupby()` function is used to split the data into groups based on some criteria. <br>
Pandas objects can be split on any of their axes. (row/col) <br>
The abstract definition of grouping is to provide a mapping of labels to group names.
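
A small split-apply-combine example may help here. The DataFrame below is made up for illustration; only the `Team` and `Position` column names mirror the cells that follow.

```python
import pandas as pd

# Hypothetical data; 'Goals' is an invented column
players = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Position': ['GK', 'FW', 'FW', 'FW'],
    'Goals': [0, 12, 7, 9],
})

# Split by Team, apply mean() to Goals, combine the results into one Series
goals_by_team = players.groupby('Team')['Goals'].mean()
print(goals_by_team)

# Grouping on two keys yields one entry per (Team, Position) pair
goals_by_team_position = players.groupby(['Team', 'Position'])['Goals'].sum()
print(goals_by_team_position)
```

Note that `groupby` on its own only builds the groups; an aggregation such as `mean()` or `sum()` is needed to get values back.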

In [None]:
top_players = df.groupby(['Team', 'Position'])

In [None]:
country_perfomance = df.groupby(['Team'])

for grouping 3 variables, `pandas.pivot(index, columns, values)` function produces <b>pivot table</b> based on 3 columns of the DataFrame. Uses unique values from index / columns and fills with values.

In [None]:
##ex:

import pandas as pd
  
pivot_df = pd.DataFrame({'A': ['John', 'Cena', 'Mina'],
      'B': ['Masters', 'Masters', 'Graduate'],
      'C': [27, 23, 21]})
  
print(pivot_df, '\n\n')
#keyword arguments are required in recent pandas versions
print(pivot_df.pivot(index='A', columns='B', values='C'))

### 2.2.3 Plot data

Plot each feature (variable) against the target variable<br><br>

Use **countplot** and **barplot** to understand if features are relevant to the target variable, and <br>
**boxplots** to understand the relationship of features among themselves

In [None]:
#barplot is for comparison against continuous values
sns.barplot(x='feature1', y='target_var', data=df, color='darkturquoise')
plt.show()

In [None]:
plt.figure(figsize=(20,12))
temp_data = df[['continuous_feature', 'target_var']].groupby(['continuous_feature'], as_index=False).mean()
g = sns.barplot(x='continuous_feature', y='target_var', data=temp_data, color='LightSeaGreen')
plt.show()

In [None]:
#countplot is similar to barplot, it is for comparison against categorical values
sns.countplot(x='feature', hue='target_var', data=df, palette='rainbow')

## 2.3 clean data

In [None]:
#deal with missing values in dataset

df.isnull().sum()

#if data is missing in columns:
#1. check the pattern of data by plotting a histogram
#2. check %age of missing data in that column
#3. check mean and median of that column
#4. replace missing values with appropriate value (mean/median/etc) 
#5. drop missing values

In [None]:
#let, col1, col2, col3 have missing values

In [None]:
print('Percentage of missing col1 is %.f' %(df['col1'].isnull().sum()/df.shape[0] * 100))

ax = df['col1'].hist(bins=15, density=True, stacked=True, color='teal')
df['col1'].plot(kind='density', color='teal')
ax.set(xlabel='col1')
plt.show()

In [None]:
# mean col1
print('The mean of "col1" is %.2f' %(df["col1"].mean(skipna=True)))
# median col1
print('The median of "col1" is %.2f' %(df["col1"].median(skipna=True)))

#suppose this value is less (<20%) and this feature has high correlation with the target variable, 
#then we replace all the missing values with the mean or median of this column
#(depending on the distribution of data)

In [None]:
#Visualize distribution of continuous features to understand how to deal with missing values

df['col1'].hist(color='green', bins=20, figsize=(12,8))

In [None]:
print('Percentage of missing col2 is %.f' %(df['col2'].isnull().sum()/df.shape[0] * 100))
#suppose this %age is high, so we drop this column

In [None]:
print('Percentage of missing col3 is %.f' %(df['col3'].isnull().sum()/df.shape[0] * 100))
#suppose this %age is less, then

print(df['col3'].value_counts())
sns.countplot(x='col3', data=df, palette='Set2')
plt.show()
#this will print and show the most frequently occurring values in 'col3'

print('The most common value of col3 is %s.' %df['col3'].value_counts().idxmax())

In [None]:
#use boxplots to understand the relationship of features with each other

plt.figure(figsize=(12,8))
sns.boxplot(x='feature1', y='feature12', data=df, palette='winter')

### 2.3.1 adjustments to dataset for missing values

In [None]:
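#assign the results back; calling fillna/replace on a single column
#without assignment may leave the DataFrame unchanged
df['col1'] = df['col1'].fillna(df['col1'].median(skipna=True))

df.drop('col2', axis=1, inplace=True)

df['col3'] = df['col3'].fillna(df['col3'].value_counts().idxmax())

df['col4'] = df['col4'].replace(np.nan, 0)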

In [None]:
#deal with all columns with missing values in the aforementioned way

In [None]:
df.isnull().sum()

### 2.3.2 change dtypes of columns

In [None]:
df[['col1', 'col_n']] = df[['col1', 'col_n']].astype(dtype)

### 2.3.3 cleaning column names

In [None]:
df.columns = df.columns.str.strip()
df.columns = df.columns.str.lower()

### 2.3.4 Dealing with copy variables with case sensitivity

In [None]:
#lowercase every text column except the last one, as in the original loop bounds
for col in df.columns[:-1]:
    if df[col].dtype == 'object':
        df[col] = df[col].str.lower()

### 2.3.5 Bin data

Binning is the process of grouping values together into 'bins'. <br>
ex: bin 'age' into 'kid', 'teen', 'adult', 'middle-aged', etc.

In [None]:
bins = np.linspace(min(df['col_to_be_binned']), max(df['col_to_be_binned']), 4)
#4 equally spaced edges between the min and max val divide 'col_to_be_binned' into 3 bins

bin_names = ['bin1_small', 'bin2_medium', 'bin3_large']
#i.e. a list of bin names, one per bin

df['binned_col'] = pd.cut(df['col_to_be_binned'], bins, labels = bin_names, include_lowest = True)

#this creates a new col in the DataFrame with binned data
#you can delete the original column which was binned to reduce dataframe size

df.drop(['col_to_be_binned'], axis=1, inplace=True)

## 2.4 Categorical Variables into Numerical Variables

Categorical Variables (ex: df['sex'].unique() = ['M', 'F']) need to be converted into Numerical Variables before they can be included in the ML algorithm<br><br>

This can be achieved by:

1. <b>Dummy Variables</b> : A dummy variable is a binary variable that indicates whether a categorical variable takes on a specific value.
In our example, `pd.get_dummies` creates 2 columns (F, M) and gives each row [0, 1] if the data is Male and [1, 0] if the data is Female.<br>
Dummy variables work well while you are exploring and analyzing; for a model pipeline, scikit-learn's `OneHotEncoder` is often more suitable because a fitted encoder can be reapplied to new data.<br><br>

2. <b>One Hot Encoding</b> : It works essentially the same way as Dummy Variables, with one binary column per category. Because the n indicator columns always sum to 1 (a linear relationship), one of them is redundant and can be dropped, leaving n-1 new variables for n unique categories.
In our example, dropping one column leaves a single column with, for instance, [0] if the data is Male and [1] if the data is Female.<br><br>

3. <b>Label Encoding</b> : It replaces each category with an integer, which implies an ordering (Ordinal Scaling) among the categories.<br>
ex: for categories ['low', 'medium', 'high'] encoded as 1, 2, 3, a linear model treats 'medium' and 'high' as two and three times the value of 'low', so use it only when the categories really are ranked.
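
The difference between keeping all indicator columns and dropping one can be seen on a tiny example. The `people` DataFrame is hypothetical and used only for illustration.

```python
import pandas as pd

people = pd.DataFrame({'sex': ['M', 'F', 'F', 'M']})

# One indicator column per category
all_indicators = pd.get_dummies(people['sex'], prefix='sex')

# Drop the first category: n categories need only n - 1 indicators
reduced_indicators = pd.get_dummies(people['sex'], prefix='sex', drop_first=True)

print(list(all_indicators.columns))      # ['sex_F', 'sex_M']
print(list(reduced_indicators.columns))  # ['sex_M']
```

The dropped category becomes the baseline: a row with 0 in `sex_M` can only be Female.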

In [None]:
##Dummy Variables

#'binned_col' is the categorical column created in the binning step
dvar = pd.get_dummies(df['binned_col'], prefix='binned_col')
df = pd.concat([df, dvar], axis = 1)
df.drop('binned_col', axis=1, inplace=True)

# you may drop 1 dummy variable (e.g. pass drop_first=True to get_dummies)

In [None]:
## LabelEncoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()                               #labelencoderobject

copy_df = df.copy()                               #a real copy, so 'df' itself stays unchanged
copy_df['cat_var'] = le.fit_transform(copy_df['cat_var'])
#this label-encodes the 'cat_var' column of the 'copy_df' DataFrame and puts the result back in the same column
#i.e. all the categories in our categorical variable are converted into int numbers (0,1,2,..)

In [None]:
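## OneHotEncoding

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()                             #onehotencoder object

#OneHotEncoder expects 2-D input, hence the double brackets
encoded_cat_col = ohe.fit_transform(copy_df[['cat_var']]).toarray()

#wrap the array in a DataFrame so it can be joined back onto df
encoded_df = pd.DataFrame(encoded_cat_col,
                          columns=ohe.get_feature_names_out(['cat_var']),
                          index=df.index)
df = df.join(encoded_df)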

## 2.5 Feature selection (basic)
**We now understand which features are relevant to the target variable and so will keep only those variables**

In [None]:
y_data = df['target_var']
#no inplace=True here: with inplace=True, drop() returns None
x_data = df.drop(columns= ['target_var', 
                           'other_irrelevant_columns_as_shown_by_data/statistical_analysis'])

##y_data is target data
##x_data is parameters that affect the target data

## 2.6 Data Standardization

Data standardization is the process of rescaling the attributes so that each has a mean of 0 and a variance of 1. <br>
This brings all the features to a common scale without distorting the relative differences between values. <br>
ex: features with ranges as different as [np.arange(0.01, 1, 0.01), np.arange(1, 10000, 10)] end up on comparable scales.
<br><br>
For this, each data point in the col = (data point - mean of col)/std of col<br><br>
Though this is done easily using `sklearn.preprocessing.StandardScaler`, some additional context about the data is also required.<br>
ex: if col8 is mileage (kmph) in the city and col9 is mileage (mpg) on the highway, first convert either mpg to kmph or vice versa and then apply the StandardScaler.
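
The formula above can be applied by hand with NumPy. The two columns below are invented values on very different scales; note that `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# Two hypothetical features on very different scales
data = np.array([[0.1, 1000.0],
                 [0.2, 3000.0],
                 [0.3, 5000.0],
                 [0.4, 7000.0]])

# (data point - mean of column) / std of column, column by column
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardized)
print('column means:', standardized.mean(axis=0))
print('column stds :', standardized.std(axis=0))
```

Both columns are evenly spaced, so after standardization they contain the same values even though their original ranges differ by four orders of magnitude.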

In [None]:
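from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

#standardize the features only; the target is a class label and stays as it is
standard_x = scaler.fit_transform(x_data)

#keep the column names so that later cells can use X.columns
X = pd.DataFrame(standard_x, columns=x_data.columns, index=x_data.index)
y = y_data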

<b>Now that the data is cleaned and preprocessed and we have some understanding of the data, we should proceed to model building</b>

## 2.7 Model Building  (Logistic Regression with Scikit-learn)

### 2.7.1 Feature Selection

**Feature selection** refers to techniques that select a subset of the **most relevant features** (columns) for a dataset. Fewer features can allow machine learning algorithms to run more efficiently (less space or time complexity) and be more effective.

#### (i) Recursive Feature Elimination (RFE)

RFE is a wrapper-type feature selection algorithm. <br>
This means that a different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features. This is in contrast to filter-based feature selections that score each feature and select those features with the largest (or smallest) score.
<br><br>

There are two important configuration options when using RFE: 
- the choice in the number of features to select, and 
- the choice of the algorithm used to help choose features

<br>

**Recursive Feature Elimination (RFE)** is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. 
<br>
The goal of RFE is to select features by recursively considering smaller and smaller sets of features.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

#build Logistic Regression model
model = LogisticRegression()

#create RFE model and select 8 attributes (no of attributes is arbitrary, refer later part of this notebook)
rfe = RFE(estimator=model, n_features_to_select=8)
rfe.fit(X, y)

#summarize the selection of attributes
selected_features = list(X.columns[rfe.support_])
print('Selected Features : ', selected_features)

#### (ii) Feature ranking with recursive feature elimination and cross-validation (RFECV)

RFECV performs RFE in a cross-validation loop to find the optimal number or the best number of features.
<br>
The cell below applies recursive feature elimination to logistic regression, with the number of selected features tuned automatically by cross-validation.

In [None]:
from sklearn.feature_selection import RFECV

#create RFECV object and compute a cross-validation score
#The 'accuracy' is proportional to the number of correct classifications

rfecv = RFECV(estimator=LogisticRegression(), step=1, cv=10, scoring='accuracy')
rfecv.fit(X, y)

#print optimal number of features and features selected
print('Optimal number of features : %d' % rfecv.n_features_)
selected_features = list(X.columns[rfecv.support_])
print('Selected Features : %s' % selected_features)

In [None]:
#Plot number of features VS. cross-validation scores

plt.figure(figsize=(12,8))
plt.xlabel('Number of features selected')
plt.ylabel('Cross Validation score (nb of correct classifications)')
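#grid_scores_ was removed in scikit-learn 1.2; cv_results_['mean_test_score'] holds the mean score per step
mean_scores = rfecv.cv_results_['mean_test_score']
plt.plot(range(1, len(mean_scores) + 1), mean_scores)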
plt.show()

In [None]:
final_x = X[selected_features]

#features' correlation heatmap
plt.subplots(figsize=(8, 5))
sns.heatmap(final_x.corr(), annot=True, cmap='RdYlGn')
plt.show()

### 2.7.2 train_test_split

`sklearn.model_selection.train_test_split()` splits arrays or matrices into random train and test subsets<br><br>
Use the training set to train the Machine Learning model and the test set to evaluate its performance


In [None]:
from sklearn.model_selection import train_test_split

##y is target data
##final_x is final dataset consisting of all sample data and only selected features

x_train, x_test, y_train, y_test = train_test_split(final_x, y, test_size=0.2)

## 2.8 Predict result

Use `LogisticRegression` model from `sklearn.linear_model` <br> <br>

Use `predict()` method to classify sample data into discrete classes (0, 1, ..) <br> <br>

`predict_proba()` method returns estimates for all classes, ordered by the label of classes. So, the first column is the probability of class 0 P(Y=0|X), the second column is the probability of class 1 P(Y=1|X), and so on

In [None]:
#Predicting values from sample data

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

#predict() method 
y_pred = logreg.predict(x_test)

#predict_proba() method 
y_pred_probability = logreg.predict_proba(x_test)[:, 1]

In [None]:
print(y_pred)

In [None]:
print(y_pred_probability)

## 2.9 Model Evaluation

### 2.9.1 jaccard index

We can define the Jaccard index as the size of the intersection divided by the size of the union of the two label sets (predicted and true).<br><br>
If the set of predicted labels exactly matches the true set of labels, the index is 1.0; if the two sets have nothing in common, it is 0.0.
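
For binary labels the definition can be computed directly. The label vectors and the helper `jaccard_binary` below are hypothetical and only illustrate the intersection-over-union idea for one chosen class.

```python
import numpy as np

y_true_demo = np.array([1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1, 0])

def jaccard_binary(y_true, y_pred, pos_label=1):
    """Intersection over union of the samples labelled `pos_label`."""
    true_set = (y_true == pos_label)
    pred_set = (y_pred == pos_label)
    intersection = np.sum(true_set & pred_set)   # labelled pos_label in both
    union = np.sum(true_set | pred_set)          # labelled pos_label in either
    return intersection / union

score_class_1 = jaccard_binary(y_true_demo, y_pred_demo, pos_label=1)
score_perfect = jaccard_binary(y_true_demo, y_true_demo, pos_label=1)

print(score_class_1)  # 0.5
print(score_perfect)  # 1.0
```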


In [None]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, y_pred, pos_label=0)

### 2.9.2 Log Loss

**Log-loss** is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). <br>
The more the predicted probability diverges from the actual value, the higher is the log-loss value.
<br><br>

Mathematical Meaning : <br>
Log Loss is the negative average of the log of corrected predicted probabilities for each instance.
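
The definition can be written out in a few lines of NumPy. The labels and probabilities below are made up, and `log_loss_sketch` is a hypothetical helper; the clipping only guards against taking the log of exactly 0 or 1.

```python
import numpy as np

def log_loss_sketch(y_true, p_pred, eps=1e-15):
    """Negative average log-probability assigned to the true class."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true_demo = np.array([1, 0, 1])
confident = np.array([0.9, 0.1, 0.8])   # probabilities close to the true labels
hesitant = np.array([0.6, 0.4, 0.5])    # probabilities close to 0.5

loss_confident = log_loss_sketch(y_true_demo, confident)
loss_hesitant = log_loss_sketch(y_true_demo, hesitant)

print('confident predictions:', loss_confident)
print('hesitant predictions :', loss_hesitant)
```

Both sets of predictions give the same 0/1 labels at a 0.5 threshold, yet the hesitant ones receive a clearly higher log loss.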

In [None]:
from sklearn.metrics import log_loss

print('Log Loss :', log_loss(y_test, y_pred_probability))

### 2.9.3 Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at **confusion matrix**.<br><br>

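Confusion Matrix is a (2 x 2) table. With rows as true labels and columns as predicted labels (the layout returned by `sklearn.metrics.confusion_matrix`) and the positive class listed first, it has the values <br>
$$
\left(\begin{array}{cc} 
\text{True Positive} & \text{False Negative}\\
\text{False Positive} & \text{True Negative}
\end{array}\right)
$$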

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, y_pred, labels=[1,0]))

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['class=1','class=0'],normalize= False,  title='Confusion matrix')
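
To see which cell of the matrix holds which count, it can help to run `confusion_matrix` on a handful of hand-checkable labels. This is a minimal sketch with made-up labels (`y_true_toy`, `y_pred_toy`); scikit-learn puts true labels on the rows and predicted labels on the columns, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 actual positives followed by 4 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]

# With labels=[1, 0], row 0 is "actual 1" and column 0 is "predicted 1"
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 0])
print(cm_toy)

# Reading the cells row by row gives TP, FN, FP, TN for this label order
tp, fn, fp, tn = cm_toy.ravel()
print('TP:', tp, 'FN:', fn, 'FP:', fp, 'TN:', tn)
```

Changing the order in `labels` (or omitting it) rearranges the rows and columns, so the unpacking order has to change with it.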

### 2.9.4 Accuracy, Precision, Recall, F-measure and support

<br>

**Accuracy** is the proportion of correct predictions over total predictions.<br><br>

**Precision** is the ratio $ tp / (tp + fp) $ where tp is the number of true positives and fp the number of false positives. <br>
Precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.
<br><br>

**Recall** is the ratio $ tp / (tp + fn) $ where tp is the number of true positives and fn the number of false negatives.<br>
Recall is intuitively the ability of the classifier to find all the positive samples.
<br><br>

**F-beta score** can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.<br>
The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.
<br><br>
    
**Support** is the number of occurrences of each class in y_test.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy :', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
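
The ratios above can be computed directly from the counts and compared with scikit-learn. The sketch below reuses a small set of made-up labels (not the notebook's data) in which there are 2 true positives, 1 false positive and 2 false negatives.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: TP = 2, FP = 1, FN = 2, TN = 3
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]
tp, fp, fn = 2, 1, 2

precision = tp / (tp + fp)   # 2 / 3
recall = tp / (tp + fn)      # 2 / 4

# F-beta is the weighted harmonic mean of precision and recall
beta = 1.0
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print('Precision:', precision, '| sklearn:', precision_score(y_true_toy, y_pred_toy))
print('Recall   :', recall, '| sklearn:', recall_score(y_true_toy, y_pred_toy))
print('F-beta   :', f_beta, '| sklearn:', fbeta_score(y_true_toy, y_pred_toy, beta=beta))
```

Setting `beta` above 1 pulls the score toward the recall, and below 1 toward the precision.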

### 2.9.5 ROC Curve and AUC

An **ROC curve** (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. 
<br>This curve plots two parameters:
- True Positive Rate (TPR) 
    $$ TPR = TP / (TP + FN) $$
- False Positive Rate (FPR)
    $$ FPR = FP / (FP + TN) $$
    
An ROC curve plots TPR vs. FPR at different classification thresholds. <br>
Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
<br><br>

**AUC** or **Area Under the ROC Curve**  measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).<br><br>
AUC provides an aggregate measure of performance across all possible classification thresholds.<br>
AUC represents the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.<br>
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.<br>
<br>
<br>

In [None]:
# ROC curve

from sklearn.metrics import auc, roc_auc_score
from sklearn.metrics import roc_curve

# Use the predicted probabilities of class 1 (not the hard 0/1 predictions)
# so that the AUC reflects every classification threshold
y_score = logreg.predict_proba(x_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print('Logistic Regression AUC is ', auc(fpr, tpr))
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
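
AUC can also be read as a ranking statistic: the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below checks this on made-up scores (`y_true_toy`, `scores_toy` are illustrative values only) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores for class 1
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos_scores = scores_toy[y_true_toy == 1]
neg_scores = scores_toy[y_true_toy == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = np.mean((diff > 0) + 0.5 * (diff == 0))

print('Pairwise AUC :', pairwise_auc)
print('sklearn AUC  :', roc_auc_score(y_true_toy, scores_toy))
```

Here 8 of the 9 pairs are ranked correctly (only the positive scored 0.35 falls below the negative scored 0.4), so both values are 8/9.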

## 2.10 Model Evaluation using Cross-Validation

**Cross-validation** is a resampling procedure used to evaluate machine learning models on a limited data sample. It has a single parameter, **k**, the number of groups that the data sample is split into, which is why the procedure is often called **k-fold Cross Validation**.<br><br>

The general procedure is as follows:

- Shuffle the dataset randomly.
- Split the dataset into k groups.
- For each unique group:
    1. Take the group as a hold-out or test data set.
    2. Take the remaining groups as a training data set.
    3. Fit a model on the training set and evaluate it on the test set.
    4. Retain the evaluation score and discard the model.
- Summarize the skill of the model using the sample of model evaluation scores.
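
The procedure above can be written out as an explicit loop. This is a minimal sketch on synthetic data: `X_demo` and `y_demo` come from `make_classification` and are assumptions for illustration, not the notebook's `final_x` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Shuffle the data and split it into k = 5 groups
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Fit a fresh model on the k-1 training groups
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])

    # Evaluate on the held-out group, keep the score, discard the model
    fold_pred = fold_model.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], fold_pred))

# Summarize the skill of the model across the folds
print('Fold accuracies :', np.round(fold_scores, 3))
print('Mean accuracy   : %.3f (+/- %.3f)' % (np.mean(fold_scores), np.std(fold_scores)))
```

The helper functions used in the following sections wrap this same loop, so the explicit version is mainly useful for understanding what they report.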

Model evaluation using k-fold Cross Validation can be done using 2 methods:

1. `cross_val_score()` function
2. `cross_validate()`  function

### 2.10.1  Model evaluation based on K-fold cross-validation using `cross_val_score()` function

In [None]:
# 10-fold Cross Validation using Logistic Regression
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression()

scores_accuracy = cross_val_score(logreg, final_x, y, cv=10, scoring='accuracy')

scores_log_loss = cross_val_score(logreg, final_x, y, cv=10, scoring='neg_log_loss')

scores_auc      = cross_val_score(logreg, final_x, y, cv=10, scoring='roc_auc')

print('K-fold Cross-Validation results : ')

print('Average accuracy is :', scores_accuracy.mean())
# 'neg_log_loss' returns the negated log-loss, so flip the sign before reporting it
print('Average log-loss is :', -scores_log_loss.mean())
print('Average auc is      :', scores_auc.mean())

### 2.10.2  Model evaluation based on K-fold cross-validation using `cross_validate()` function

In [None]:
from sklearn.model_selection import cross_validate

scoring = {'accuracy': 'accuracy', 'log_loss': 'neg_log_loss', 'auc': 'roc_auc'}

modelCV = LogisticRegression()

# Passing the dict itself makes the result keys 'test_accuracy', 'test_log_loss' and 'test_auc'
results = cross_validate(modelCV, final_x, y, cv=10, scoring=scoring,
                         return_train_score=False)

print('K-fold cross-validation results :')
for name, scorer in scoring.items():
    test_scores = results['test_%s' % name]
    # 'neg_log_loss' is reported negated, so flip the sign to get the log-loss
    mean_score = -test_scores.mean() if scorer == 'neg_log_loss' else test_scores.mean()
    print("%s average %s: %.3f (+/-%.3f)" % (modelCV.__class__.__name__, name,
                                             mean_score, test_scores.std()))

## 2.11 GridSearchCV evaluating using multiple scorers simultaneously

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.arange(1e-05, 3, 0.1)}
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log_loss': 'neg_log_loss'}

gs = GridSearchCV(LogisticRegression(), return_train_score=True,
                  param_grid=param_grid, scoring=scoring, cv=10, refit='Accuracy')

gs.fit(final_x, y)
results = gs.cv_results_

print('='*20)
print("best estimator: " + str(gs.best_estimator_))
print("best params: " + str(gs.best_params_))
print('best score:', gs.best_score_)
print('='*20)

plt.figure(figsize=(10, 10))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",fontsize=16)

plt.xlabel("Inverse of regularization strength: C")
plt.ylabel("Score")
plt.grid()

# Reuse the current axes so the title, labels and grid set above apply to this plot
ax = plt.gca()
ax.set_xlim(0, param_grid['C'].max()) 
ax.set_ylim(0.35, 0.95)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_C'].data, dtype=float)

for scorer, color in zip(list(scoring.keys()), ['g', 'k', 'b']): 
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = -results['mean_%s_%s' % (sample, scorer)] if scoring[scorer]=='neg_log_loss' else results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = -results['mean_test_%s' % scorer][best_index] if scoring[scorer]=='neg_log_loss' else results['mean_test_%s' % scorer][best_index]
        
    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()

## 2.12 GridSearchCV evaluating using multiple scorers, RepeatedStratifiedKFold and pipeline for preprocessing simultaneously 

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline

#Define simple model
C = np.arange(1e-05, 5.5, 0.1)
scoring = {'Accuracy': 'accuracy', 'AUC': 'roc_auc', 'Log_loss': 'neg_log_loss'}
log_reg = LogisticRegression()

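#Simple pre-processing estimators
#Note: with with_mean=False and with_std=False the scaler leaves the data unchanged;
#set them to True to actually standardize the features
std_scale = StandardScaler(with_mean=False, with_std=False)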

#Defining the CV method: Using the Repeated Stratified K Fold
n_folds=5
n_repeats=5

rskfold = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_repeats, random_state=2)

#Creating simple pipeline and defining the gridsearch
log_clf_pipe = Pipeline(steps=[('scale',std_scale), ('clf',log_reg)])

log_clf = GridSearchCV(estimator=log_clf_pipe, cv=rskfold,
              scoring=scoring, return_train_score=True,
              param_grid=dict(clf__C=C), refit='Accuracy')

log_clf.fit(final_x, y)
results = log_clf.cv_results_

print("best estimator: " + str(log_clf.best_estimator_))
print("best params: " + str(log_clf.best_params_))
print('best score:', log_clf.best_score_)

plt.figure(figsize=(10, 10))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",fontsize=16)

plt.xlabel("Inverse of regularization strength: C")
plt.ylabel("Score")
plt.grid()

# Reuse the current axes so the title, labels and grid set above apply to this plot
ax = plt.gca()
ax.set_xlim(0, C.max()) 
ax.set_ylim(0.35, 0.95)

# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_clf__C'].data, dtype=float)

for scorer, color in zip(list(scoring.keys()), ['g', 'k', 'b']): 
    for sample, style in (('train', '--'), ('test', '-')):
        sample_score_mean = -results['mean_%s_%s' % (sample, scorer)] if scoring[scorer]=='neg_log_loss' else results['mean_%s_%s' % (sample, scorer)]
        sample_score_std = results['std_%s_%s' % (sample, scorer)]
        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                        sample_score_mean + sample_score_std,
                        alpha=0.1 if sample == 'test' else 0, color=color)
        ax.plot(X_axis, sample_score_mean, style, color=color,
                alpha=1 if sample == 'test' else 0.7,
                label="%s (%s)" % (scorer, sample))

    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
    best_score = -results['mean_test_%s' % scorer][best_index] if scoring[scorer]=='neg_log_loss' else results['mean_test_%s' % scorer][best_index]
        
    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score,
                (X_axis[best_index], best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.show()

## 2.13 Predicting results for new data

In [None]:
# Let the fresh data be in a DataFrame 'new_data'

predicted = pd.DataFrame({
    'id_feature': new_data['id_feature'].values,
    'outcome': log_clf.predict(new_data[selected_features]),
})
# So the 'predicted' DataFrame has 2 columns: 'id_feature', which identifies each data sample uniquely,
# and 'outcome', which is the class predicted for each data sample

predicted.to_csv('new_predicted_outcomes.csv', index=False)

## 2.14 Serialize model using pickle module

Serialize the trained (and ready to use) machine learning model using `pickle` or `joblib` (here, `pickle`).
<br><br>

The idea is that the Python object, converted to a byte stream, contains all the information necessary to reconstruct the object in another Python script, i.e. we need not preprocess the data or train the model again.<br>
We can directly read the serialized object and use it for predictions.

In [None]:
# Save the model to a file in the current working directory
import pickle

log_clf.fit(final_x, y)

Pkl_Filename = "whatever_filename_you_find_cute.pkl"  
with open(Pkl_Filename, 'wb') as file:  
    pickle.dump(log_clf, file)

In [None]:
# Load the Model back from file
with open('whatever_filename_you_find_cute.pkl', 'rb') as file:  
    model = pickle.load(file)
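
A quick way to gain confidence in a serialized model is to check that the restored object predicts exactly what the original did. The sketch below does the round trip in memory with `pickle.dumps` / `pickle.loads` on a small synthetic model; `X_demo`, `y_demo` and `demo_model` are illustrative stand-ins, not the notebook's `final_x`, `y` or `log_clf`.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the trained classifier
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Serialize to a byte string and restore it, without touching the file system
payload = pickle.dumps(demo_model)
restored_model = pickle.loads(payload)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_model.predict(X_demo),
                                  restored_model.predict(X_demo))
print('Serialized size (bytes):', len(payload))
print('Predictions identical  :', same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is generally best restored with the same scikit-learn version that saved it.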