<a href="https://colab.research.google.com/github/mohammad0alfares/MachineLearningNotebooks/blob/master/Logistic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Source GitHub Repository
We need to clone some source files from the GitHub repository; they are used throughout this tutorial.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

# Logistic Regression
**Introduction**

In this section, we will use **logistic regression** to solve a few classification problems, covering both binary and multiclass cases.


**Readings and Resources**

[1] https://medium.com/greyatom/logistic-regression-89e496433063

# Case #1: Studying Hours and Passing Exams

In this section, we will use logistic regression to infer whether a student will pass or fail an exam based on the number of hours the student spends preparing for it. The dataset covers a few students and includes, for each one, the number of study hours and whether the student passed (1) or failed (0) the exam. This is a binary classification problem that can be solved using logistic regression, as will be shown next.

**Implementation**

Read the input data (number of study hours and exam pass or fail) from the csv file (HoursPassExam.csv). Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/2_logistic/HoursPassExam.csv")
df.head()

To view the dataset, we will use a scatter plot from the matplotlib library:

In [0]:
import matplotlib.pyplot as plt
plt.scatter(df['hours'],df['pass'],color = 'red', marker = '+')

As can be seen, the output (y) is binary: 0 for failing the exam and 1 for passing it. Also, the chance of passing the exam increases as the number of study hours increases. Let us divide the dataset into training and testing datasets.

In [0]:
from sklearn.model_selection import train_test_split
x = df[['hours']]
y = df['pass']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))
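
The split above is random, so the scores reported later change from run to run. A minimal sketch, assuming a small synthetic stand-in for the dataset (the values below are made up for illustration and are not taken from HoursPassExam.csv), shows how `random_state` makes the split reproducible and how `stratify` keeps the pass/fail ratio similar in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for HoursPassExam.csv; values are illustrative only
toy = pd.DataFrame({
    'hours': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
    'pass':  [0,   0,   0,   0,   1,   0,   1,   1,   1,   1],
})
x_toy = toy[['hours']]
y_toy = toy['pass']

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
split_a = train_test_split(x_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)
split_b = train_test_split(x_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# The same rows end up in the test set on both calls
same_split = split_a[1].index.equals(split_b[1].index)
# Labels of the two test samples (one per class because of stratify)
test_labels = sorted(split_a[3].tolist())
print(same_split, test_labels)
```

Whether to fix the seed is a design choice: a fixed seed makes the tutorial output repeatable, while an unfixed seed shows how much the scores depend on the particular split.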

Let us now try to use linear regression to fit this dataset.

In [0]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(x_train,y_train)
print('coef=',reg.coef_)
print('intercept=',reg.intercept_)
print('score=',reg.score(x_train,y_train))
plt.scatter(x_train,y_train,color = 'red', marker = '+', label = 'Data')
plt.plot(x_train,reg.predict(x_train) , label = 'Linear')
plt.legend()
plt.xlabel('Hours')
plt.ylabel('Pass Exam (0:fail,1:pass)')


As can be observed from the figure above, linear regression fails to fit the data (the training score, which for a linear model is the $R^2$ value rather than an accuracy, is only about 0.56). Let us try logistic regression to fit the dataset.

In [0]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(x_train,y_train)
print(logreg.coef_)
print(logreg.intercept_)
print(logreg.score(x_train,y_train))
print(logreg.score(x_test,y_test))

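The training and testing scores of the logistic regression are higher than the score of the linear regression. Note that `score` reports classification accuracy for logistic regression but $R^2$ for linear regression, so the numbers give only a rough comparison. Still, the logistic regression fits the data better. Let us plot the logistic regression curve.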

In [0]:
plt.scatter(x,y,color = 'red', marker = '+', label = 'Data')
plt.plot(x,reg.predict(x) , label = 'Linear')
plt.xlabel('Hours')
plt.ylabel('Pass Exam (0:fail,1:pass)')

import numpy as np
x_sigmoid = np.sort(np.array(x),axis=0)
## plot the logistic regression predictions (hard 0/1 decisions)
## keep the column name used during fitting to avoid a feature-name warning
plt.plot(x_sigmoid,logreg.predict(pd.DataFrame(x_sigmoid, columns=x.columns)), label = 'Logistic', color ='orange')
plt.legend()


We observe from the above figure that the logistic regression curve (orange line) fits the dataset better than the linear regression curve (blue line). Let us also plot the sigmoid curve on the same figure using the logistic coefficient and intercept.

In [0]:
plt.scatter(x,y,color = 'red', marker = '+', label = 'Data')
plt.plot(x,reg.predict(x) , label = 'Linear')
plt.xlabel('Hours')
plt.ylabel('Pass Exam (0:fail,1:pass)')

import numpy as np
x_sigmoid = np.sort(np.array(x),axis=0)
## plot the logistic regression predictions (hard 0/1 decisions)
plt.plot(x_sigmoid,logreg.predict(pd.DataFrame(x_sigmoid, columns=x.columns)), label = 'Logistic', color ='orange')

## plot the sigmoid function using the fitted slope (a) and intercept (b)
a = logreg.coef_[0, 0]
b = logreg.intercept_[0]
y_sigmoid = 1 / (1 + np.exp(-(a * x_sigmoid + b)))
plt.plot(x_sigmoid,y_sigmoid, label = 'Sigmoid', color ='green')
plt.grid()
plt.legend()
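
As a sanity check on the sigmoid formula, the curve computed by hand from the coefficient and intercept should coincide with the probabilities returned by `predict_proba`. The sketch below assumes a small made-up dataset (not the tutorial's csv file) and the default binary `LogisticRegression` settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical study-hours data, for illustration only
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
a = model.coef_[0, 0]      # slope
b = model.intercept_[0]    # intercept

# Sigmoid evaluated by hand at every sample
manual = 1.0 / (1.0 + np.exp(-(a * hours[:, 0] + b)))
# Probability of class 1 (pass) as reported by the model
from_model = model.predict_proba(hours)[:, 1]

max_diff = np.max(np.abs(manual - from_model))
print(max_diff)
```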

We could also add the probability of classification for each point in the dataset:

In [0]:
plt.scatter(x,y,color = 'red', marker = '+', label = 'Data')
plt.plot(x,reg.predict(x) , label = 'Linear')
plt.xlabel('Hours')
plt.ylabel('Pass Exam (0:fail,1:pass)')

import numpy as np
x_sigmoid = np.sort(np.array(x),axis=0)
## plot the logistic regression predictions (hard 0/1 decisions)
plt.plot(x_sigmoid,logreg.predict(pd.DataFrame(x_sigmoid, columns=x.columns)), label = 'Logistic', color ='orange')

## plot the sigmoid function using the fitted slope (a) and intercept (b)
a = logreg.coef_[0, 0]
b = logreg.intercept_[0]
y_sigmoid = 1 / (1 + np.exp(-(a * x_sigmoid + b)))
plt.plot(x_sigmoid,y_sigmoid, label = 'Sigmoid', color ='green')

## get the probability of classification for each point in the dataset
y_prob = logreg.predict_proba(x)
plt.scatter(x,y_prob[:,1],color = 'black', marker = '+', label = 'prob')
plt.grid()
plt.legend()  # called last so that the 'prob' points appear in the legend

We observe the following from the above figure: \\
1- The sigmoid fits the dataset better than linear regression. Its values are between 0 and 1. \\
2- Linear regression extends to values less than zero and more than one, which doesn't match the dataset (pass=1, fail=0). \\
3- The logistic curve is zero when the sigmoid is less than 0.5, and one when the sigmoid is more than 0.5 (this threshold can be adjusted depending on the classification problem). \\
4- The sigmoid curve is the probability of classification (here, the probability of passing). \\
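
The 0.5 threshold mentioned in the third observation can be reproduced, and changed, by working with the probabilities directly. This is a sketch on made-up data; the stricter threshold of 0.8 is an arbitrary value chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical study-hours data, for illustration only
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
prob_pass = model.predict_proba(hours)[:, 1]

# Thresholding the probability at 0.5 reproduces model.predict
default_pred = (prob_pass > 0.5).astype(int)
matches_predict = np.array_equal(default_pred, model.predict(hours))

# A stricter (assumed) threshold predicts "pass" for fewer or equally many students
strict_pred = (prob_pass > 0.8).astype(int)
print(matches_predict, default_pred.sum(), strict_pred.sum())
```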

Also for logistic regression, the intercept (b) moves the curve left and right and the slope (a) defines the steepness of the curve.
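
To make the roles of the slope and the intercept concrete, the fitted curve can be written out with the same symbols a and b (the midpoint symbol $x_0$ is introduced here only for illustration):

$$p(x) = \frac{1}{1 + e^{-(a x + b)}}$$

The curve crosses $p = 0.5$ where the exponent is zero, that is, where $a x_0 + b = 0$:

$$x_0 = -\frac{b}{a}$$

The steepness at that point follows from $\frac{dp}{dx} = a\,p\,(1 - p)$:

$$\left.\frac{dp}{dx}\right|_{x = x_0} = a \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{a}{4}$$

So, for a fixed slope, changing b shifts the midpoint $x_0$ left or right, while a larger a makes the transition from 0 to 1 sharper.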

# Case #2: HR Analysis

In this section, we will analyze the data of employees of a company. This data includes some information about the employees who are working at the company and those who left the company. Our objective is to predict whether an existing employee would leave the company based on his/her current status. This will help us decide to offer the employee some incentives to keep him/her in the company. This could also be used to plan early to hire new employees.$^{[1]}$

[1] https://codebasicshub.com/tutorial/machine-learning/logistic-regression-binary-classification

**Implementation**

Read the input data from the csv file (HR_comma_sep.csv). The dataset is downloaded from Kaggle. Link: https://www.kaggle.com/giripujar/hr-analytics. Use the pandas library (https://pandas.pydata.org/) to read the data from the file.

In [0]:
import pandas as pd
HR = pd.read_csv('./MachineLearning/2_logistic/HR_comma_sep.csv')
HR.head()

Before applying regression to the data, we will explore and analyze the data to determine the features that influence the decision of the employees to remain or leave the company.

In [0]:
left = HR[HR.left==1] ## employees who left the company 
No_left= left.shape[0]
remain = HR[HR.left==0] ## employees who remain at the company 
No_remain = remain.shape[0]
Per_left = No_left / (No_left + No_remain)

print('No_left = {}, No_remain = {} , Percentage of left = {} %'.format(No_left,No_remain,Per_left*100))


About $23\%$ of the employees left the company. Now, let us check which features most affect the decision of employees to leave or remain in the company. To do this, we will compute the average of each numeric feature separately for the employees who remained and for those who left.

In [0]:
HR.groupby('left').mean(numeric_only=True) # average only the numeric columns (Department and salary are text)

We may conclude the following from the table above: \\
1- Employees who remain in the company have a higher satisfaction_level, so it is a good indicator for our classifier (good feature) \\
2- The last_evaluation, number of projects, and time_spend_company values are almost independent of whether employees remain or leave the company \\
3- The average_montly_hours of employees who left the company is higher than that of those who remained, which could be an indicator (good feature) \\
4- The promotion_last_5years average of employees remaining in the company is much higher than that of those who left (good feature) \\
5- Work_accident also differs between the two groups, so it is a good feature.




Let us also check the quality of the categorical features.

In [0]:
pd.crosstab(HR.salary,HR.left)

The salary table shows that employees with high salaries are more likely to stay in the company, so it is a good feature. To visualize this, we make a bar plot as follows:

In [0]:
pd.crosstab(HR.salary,HR.left).plot(kind='bar')

We also need to investigate the department feature as follows:

In [0]:
pd.crosstab(HR.Department,HR.left).plot(kind='bar')

The department has only a minor effect on the decision of employees to stay or leave the company. It doesn't look like a major factor, and thus we will ignore this feature.

Based on the above analysis, we will create the following table which includes only the good (important, major) features affecting employees' decisions to stay or leave the company

In [0]:
HR_GF = HR[['left','satisfaction_level','average_montly_hours','Work_accident','promotion_last_5years', 'salary']]
HR_GF.head()

To prepare the data for classification (using logistic regression), we will apply one hot encoding for the categorical features (salary).

In [0]:
dm = pd.get_dummies(HR_GF.salary)
HR_GF_merged = pd.concat([HR_GF,dm],axis=1)
HR_GF_merged = HR_GF_merged.drop(['salary','medium'],axis=1)
HR_GF_merged.head()
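
One of the dummy columns is dropped above because it carries no extra information: once two of the three salary indicators are known, the third is determined. The sketch below illustrates this on a tiny made-up salary column (the four rows are hypothetical):

```python
import pandas as pd

# Hypothetical salary column, for illustration only
toy = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

dummies = pd.get_dummies(toy.salary)        # columns: high, low, medium
reduced = dummies.drop('medium', axis=1)    # keep only high and low

# A row where both remaining indicators are 0 can only be 'medium'
recovered_medium = (reduced.sum(axis=1) == 0)
print(list(reduced.columns), recovered_medium.tolist())
```

Keeping all three columns would also work with scikit-learn's regularized logistic regression, but dropping one avoids perfectly redundant inputs.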

Let us define input (x) and output (y) of the model

In [0]:
x = HR_GF_merged.drop('left',axis=1)
y = HR_GF_merged.left

Before classification, we need to split the dataset into test and training parts

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))

Now, we are ready to apply logistic regression

In [0]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(x_train,y_train)
print(logreg.coef_)
print(logreg.intercept_)
print(logreg.score(x_train,y_train))
print(logreg.score(x_test,y_test))
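
When judging these accuracies, it helps to compare them with a trivial baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels whose class balance only roughly mimics the share of leavers reported earlier (about 23%); it does not use the HR data itself:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 23 employees who left (1) and 77 who remained (0)
y_demo = np.array([1] * 23 + [0] * 77)
x_demo = np.zeros((len(y_demo), 1))   # features are ignored by the dummy model

# Always predicts the majority class (here: 0, remained)
baseline = DummyClassifier(strategy='most_frequent').fit(x_demo, y_demo)
baseline_accuracy = baseline.score(x_demo, y_demo)
print(baseline_accuracy)
```

A classifier is only useful to the extent that its accuracy exceeds this kind of baseline.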

**Comment on the training and testing accuracies.**

# Case #3: Recognition of Handwritten Digits

In this section, we will try to recognize handwritten digits using logistic regression. We will be using a standard dataset available through the sklearn library called "load_digits".$^{[1][2]}$

[1] https://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction

**Readings and Resources**

1- https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py

2- https://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction

3- https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html

In the beginning, we will load the dataset as follows

In [0]:
from sklearn.datasets import load_digits
digits = load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data. Let us explore the content of the digits dataset

In [0]:
dir(digits)

The digits.data contains the features that will be used to classify the digits samples

In [0]:
print(digits.data)

The digits.images contains the images of the digits samples. They can be viewed using the following code

In [0]:
import matplotlib.pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
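
The feature vectors and the images hold the same pixel values in two different shapes. A short check, loading the dataset again under the hypothetical name `digits_demo`, makes the relationship explicit:

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()

# Each image is an 8x8 grid; each feature vector has 64 entries
print(digits_demo.images.shape, digits_demo.data.shape)

# Flattening every image row by row gives exactly the feature matrix
n_samples = len(digits_demo.images)
flattened = digits_demo.images.reshape(n_samples, -1)
same = np.array_equal(flattened, digits_demo.data)
print(same)
```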

The ground truth of the dataset is stored in digits.target

In [0]:
print(digits.target)

After exploring the content of the digits dataset, we will design a classifier using logistic regression. First, we decide the input feature vector (x) and the ground truth (y)

In [0]:
x = digits.data
y = digits.target

Then we split the dataset into testing and training parts

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print('size of test dataset = {}, size of training data = {}, percentage = {}%'.format(len(x_test),len(x_train),len(x_test)*100/(len(x_test) + len(x_train))))


Here we will train the logistic regression model

In [0]:
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(x_train,y_train)
print(logreg.score(x_train,y_train))
print(logreg.score(x_test,y_test))
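
With the default settings, the solver may stop before it has fully converged on this dataset and print a convergence warning. One possible remedy, sketched here as an alternative rather than as the tutorial's method, is to standardize the features and allow more iterations; the value `max_iter=1000` and the fixed `random_state` are assumptions made for this example:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_all, test_size=0.2, random_state=0)

# Scale the pixel features, then fit logistic regression with a larger iteration budget
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(test_accuracy)
```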

**Comment on the training and testing accuracies.**

To predict the classes of the test samples and store them in y_pred, run

In [0]:
y_pred = logreg.predict(x_test)

Sometimes, we wish to know where the model failed. This can be achieved using what is called the confusion matrix.

In [0]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm

The confusion matrix is a square matrix that shows the relationship between the ground truth and the predicted values. With `confusion_matrix(y_test,y_pred)`, each row corresponds to a ground-truth digit and each column to a predicted digit. For example, the number in the top-left cell is the number of 0 digits in the test set that were predicted as 0. In the last row, which represents the ground-truth 9 digits, the first cell contains the number of 9 digits that were classified as 0, the second cell contains the number of 9 digits that were predicted as 1, and so on.
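
The orientation of the matrix is easy to verify on a tiny hand-made example. In scikit-learn's `confusion_matrix(y_true, y_pred)`, rows index the ground truth and columns index the prediction; the six labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three classes
y_true_demo = [0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Row 0, column 1: one sample of true class 0 was predicted as class 1
# Fraction of each true class that was predicted correctly (per-class recall)
recall_per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(recall_per_class)
```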

We can also use the seaborn library to view the confusion matrix as follows

In [0]:
import seaborn as sn
sn.heatmap(cm,annot=True)

We could also increase the size of the figure using matplotlib.pyplot properties.

In [0]:
plt.figure(figsize = (10,7))
sn.heatmap(cm,annot=True)
plt.xlabel('predicted')      # columns of cm are the predicted labels
plt.ylabel('ground truth')   # rows of cm are the true labels

Let us try to find out how the logistic regression classified a specific digit. In the next code, we will visualize the images together with their target and predicted values.

In [0]:
import numpy as np
fig = plt.figure(figsize=(12, 24))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

digit_visualize = 5 # the digit that we want to visualize

cnt = 0
for i in range(len(y_test)):
    if cnt >= 128:  # the 16 x 8 grid holds at most 128 images
        break
    if y_test[i] == digit_visualize:
        # index of the first sample in the full dataset whose features match this test sample
        idx = np.where(np.all(digits.data == x_test[i, :], axis=-1))[0][0]
        ax = fig.add_subplot(16, 8, cnt + 1, xticks=[], yticks=[])
        ax.imshow(digits.images[idx], cmap=plt.cm.binary, interpolation='nearest')
        # label the image with the target value (left) and the predicted value (right)
        ax.text(0, 7, str(y_test[i]))
        ax.text(6.5, 7, str(y_pred[i]))
        cnt += 1

Let us try to find out in more detail where the logistic regression failed. In the next code, we will visualize the misclassified images together with their target and predicted values.

In [0]:
fig = plt.figure(figsize=(12, 24))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

cnt = 0
for i in range(len(y_test)):
    if cnt >= 128:  # the 16 x 8 grid holds at most 128 images
        break
    if y_test[i] != y_pred[i]:
        # index of the first sample in the full dataset whose features match this test sample
        idx = np.where(np.all(digits.data == x_test[i, :], axis=-1))[0][0]
        ax = fig.add_subplot(16, 8, cnt + 1, xticks=[], yticks=[])
        ax.imshow(digits.images[idx], cmap=plt.cm.binary, interpolation='nearest')
        # label the image with the target value (left) and the predicted value (right)
        ax.text(0, 7, str(y_test[i]))
        ax.text(6.5, 7, str(y_pred[i]))
        cnt += 1
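
Matching test samples back to images by comparing feature vectors works, but it can be avoided by splitting an index array together with the data. The sketch below is one possible alternative, not the method used above; the names `digits_demo`, `indices`, and `wrong`, as well as `random_state=0` and `max_iter=5000`, are choices made for this example:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits_demo = load_digits()
indices = np.arange(len(digits_demo.data))

# train_test_split splits all the arrays it receives in the same way,
# so idx_te records which original samples landed in the test set
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    digits_demo.data, digits_demo.target, indices, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Positions (within the test set) of the misclassified samples
wrong = np.flatnonzero(pred != y_te)
# The corresponding 8x8 images, looked up directly by original index
wrong_images = digits_demo.images[idx_te[wrong]]
print(len(wrong), wrong_images.shape)
```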

# Exercises

**1) Iris identification**: use logistic regression to build a classification model for the iris dataset stored in the sklearn library. You may import the dataset using "from sklearn.datasets import load_iris".

In [0]:
from sklearn.datasets import load_iris
iris = load_iris()
dir(iris)