[This notebook is taken from this repo and changed for our class](https://github.com/mGalarnyk/Python_Tutorials)

# Logistic Regression


## Objectives

- Use sklearn to solve logistic regression problems.

- After fitting a model, plot confusion matrices with seaborn and matplotlib.

- Understand the parameters of `LogisticRegression` in sklearn.

## Loading the Data (Digits Dataset) 

The digits dataset is one of the datasets that ship with scikit-learn, so it does not require downloading any file from an external website. The code below loads the digits dataset.

In [None]:
%matplotlib inline
from sklearn.datasets import load_digits
digits = load_digits()

Now that the dataset is loaded, you can inspect the shapes of the image data and the labels with the commands below.

In [None]:
# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print("Image Data Shape" , digits.data.shape)

# Print to show there are 1797 labels (integers from 0-9)
print("Label Data Shape", digits.target.shape)
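As a quick check on those shapes, the sketch below compares `digits.data` with `digits.images`, which holds the same pixels as 8 by 8 arrays. The names `flat`, `grids`, and `same_pixels` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

flat = digits.data      # one row of 64 pixel values per image
grids = digits.images   # the same pixels stored as 8x8 arrays

# Flattening each 8x8 image row by row reproduces the 64-column data matrix.
same_pixels = np.array_equal(grids.reshape(len(grids), -1), flat)
print(flat.shape, grids.shape, same_pixels)
```

This is why the plotting code in the next section can reshape a row of `digits.data` back into an 8 by 8 image.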

## Showing the Images and Labels (Digits Dataset)

In [None]:
import numpy as np 
import matplotlib.pyplot as plt

plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
    plt.title('Training: %i\n' % label, fontsize = 20)

## Splitting Data into Training and Test Sets (Digits Dataset)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
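Here `test_size=0.25` holds out about a quarter of the 1797 images, and a fixed `random_state` makes the shuffle reproducible. A minimal sketch to confirm the sizes of the two sets (it repeats the split so that it runs on its own):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Every image lands in exactly one of the two sets.
n_total = len(x_train) + len(x_test)
print("train:", x_train.shape, "test:", x_test.shape, "total:", n_total)
```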

## Scikit-learn 4-Step Modeling Pattern (Digits Dataset)

**Step 1.** Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes.

In [None]:
from sklearn.linear_model import LogisticRegression

**Step 2.** Make an instance of the Model

In [None]:
logisticRegr = LogisticRegression()

**Step 3.** Train the model on the data, storing the information learned from the data

The model learns the relationship between x (digits) and y (labels).

In [None]:
logisticRegr.fit(x_train, y_train)

**Step 4.** Predict the labels of new data (new images)

This step uses the information the model learned during training.

In [None]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(x_train[0].reshape(1,-1))

In [None]:
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(x_train[0:10])
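`predict` returns only the most likely label. As a hedged sketch, `predict_proba` exposes the per-class probabilities behind that choice. The name `clf` and the raised `max_iter` are assumptions of this example, not settings used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# max_iter is raised above the default (an assumption of this sketch)
# to give the solver more room to converge on the unscaled pixel values.
clf = LogisticRegression(max_iter=5000)
clf.fit(x_train, y_train)

one_image = x_train[0].reshape(1, -1)   # a single observation must be 2-D
proba = clf.predict_proba(one_image)    # one probability per class
predicted = clf.predict(one_image)[0]

# The predicted label is the class with the highest probability.
most_likely = clf.classes_[np.argmax(proba[0])]
print(predicted, most_likely, proba[0].round(3))
```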

In [None]:
# Make predictions on the entire training set
# (the confusion matrices below compare these predictions with y_train)
predictions = logisticRegr.predict(x_train)

## Measuring Model Performance (Digits Dataset)

While there are other ways of measuring model performance, we are going to keep this simple and use accuracy as our metric. 
Accuracy is most informative when measured on new data (the test set). The cell below scores the model on the training data, so read the result as an optimistic estimate.

Accuracy is defined as the fraction of correct predictions:

correct predictions / total number of data points
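The definition can be checked by hand with a minimal sketch. The labels below are made up for illustration and are not taken from the digits dataset.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

correct = int(np.sum(y_true == y_pred))   # number of matching positions
accuracy = correct / len(y_true)          # fraction of correct predictions
print(correct, "correct out of", len(y_true), "gives accuracy", accuracy)
```

For a classifier, the `score` method used in the next cell computes this same fraction on the data it is given.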

In [None]:
# Use the score method to get the accuracy of the model on the training data
score = logisticRegr.score(x_train, y_train)
print(score)

### Confusion Matrix (Digits Dataset)

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of data for which the true values are known. This section shows two Python packages (Seaborn and Matplotlib) for plotting confusion matrices.
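To make the layout concrete, here is a small sketch that builds a confusion matrix by hand for three classes. The labels are made up for illustration, and `cm_small` is a hypothetical name chosen so it does not clash with the `cm` variable used in the cells that follow. Rows are actual labels and columns are predicted labels, the same convention `metrics.confusion_matrix` uses.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])

n_classes = 3
cm_small = np.zeros((n_classes, n_classes), dtype=int)

# Count each (actual, predicted) pair: row = actual label, column = predicted label.
for actual, predicted in zip(y_true, y_pred):
    cm_small[actual, predicted] += 1

print(cm_small)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy_from_cm = np.trace(cm_small) / cm_small.sum()
print(accuracy_from_cm)
```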

In [None]:
import numpy as np 

import seaborn as sns
from sklearn import metrics

**Method 1 (Seaborn)**

In [None]:
cm = metrics.confusion_matrix(y_train, predictions)

In [None]:
plt.figure(figsize=(9,9))
# The matrix holds integer counts, so annotate the cells with an integer format
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
plt.savefig('toy_Digits_ConfusionSeabornCodementor.png')
#plt.show();


**Method 2 (Matplotlib)**

This method takes noticeably more code. It is included to show how to build the same plot with Matplotlib alone.

In [None]:
cm = metrics.confusion_matrix(y_train, predictions)

plt.figure(figsize=(9,9))
plt.imshow(cm, interpolation='nearest', cmap='Pastel1')
plt.title('Confusion matrix', size = 15)
plt.colorbar()
tick_marks = np.arange(10)
plt.xticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], rotation=45, size = 10)
plt.yticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], size = 10)
plt.tight_layout()
plt.ylabel('Actual label', size = 15)
plt.xlabel('Predicted label', size = 15)
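n_rows, n_cols = cm.shape

# Write each count in its cell: the column index is the x coordinate, the row index is the y coordinate
for row in range(n_rows):
    for col in range(n_cols):
        plt.annotate(str(cm[row, col]), xy=(col, row), 
                    horizontalalignment='center',
                    verticalalignment='center')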
plt.savefig('toy_Digits_ConfusionMatplotlibCodementor.png')
#plt.show()

The original tutorial this notebook is based on is also available as a [youtube video](https://www.youtube.com/watch?v=71iXeuKFcQM).

## Coefficients of Logistic Regression

In [None]:
logisticRegr.coef_.shape
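For a multiclass problem, `coef_` has one row per class and one column per feature, so with 10 digits and 64 pixels the shape is (10, 64). The sketch below is an illustration only: it fits a fresh model (named `model` here, with a raised `max_iter`, both assumptions of this example) on the full dataset and reshapes one row of weights back into the 8 by 8 pixel grid.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()

# Fitting on all the data keeps the sketch short; it is not an evaluation setup.
model = LogisticRegression(max_iter=5000)
model.fit(digits.data, digits.target)

print(model.coef_.shape)        # (n_classes, n_features)
print(model.intercept_.shape)   # one intercept per class

# Row i of coef_ belongs to class model.classes_[i]; here we take the digit 3.
row_for_3 = list(model.classes_).index(3)
weights_for_3 = model.coef_[row_for_3].reshape(8, 8)
print(weights_for_3.shape)
```

Passing `weights_for_3` to `plt.imshow` would show which pixels push the model toward or away from predicting a 3.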

## Does This Look Too Good to Be True?

In [None]:
from sklearn.model_selection import cross_validate

Let's take another look at how `cross_validate` works.

[Sklearn Cross_Validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)

[Sklearn Cross Validation User Guide](https://scikit-learn.org/stable/modules/cross_validation.html#multimetric-cross-validation)
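As a rough sketch of the idea behind 5-fold cross-validation, the code below partitions the indices of a tiny hypothetical dataset into 5 folds and uses each fold once for validation. This is plain, unshuffled k-fold written with NumPy for illustration. When `cross_validate` is given a classifier and an integer `cv`, it uses stratified folds that also preserve class proportions, which this sketch does not attempt.

```python
import numpy as np

n_samples = 10   # hypothetical tiny dataset size
n_folds = 5

indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)   # 5 disjoint groups of indices

splits = []
for k in range(n_folds):
    # Fold k is held out for validation; the other folds are used for training.
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    splits.append((train_idx, val_idx))
    print(f"fold {k}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each sample is validated on exactly once, and the model is refit from scratch on the training part of every split, which is why `cross_validate` returns one train score and one test score per fold.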

In [None]:
log_reg = LogisticRegression(penalty = 'l1', max_iter= 10000, multi_class= 'auto', C = 0.01, solver = 'liblinear')

cv = cross_validate(log_reg, x_train, y = y_train, cv = 5, return_train_score= True)

In [None]:
training_scores_mean =cv['train_score'].mean()

training_scores_std = cv['train_score'].std()

validation_scores_mean = cv['test_score'].mean()

validation_scores_std = cv['test_score'].std()

Let's report our cross-validation results.

In [None]:
print('5 fold training score {} +/- {} '.format(training_scores_mean, training_scores_std))

print('5 fold validation score {} +/- {} '.format(validation_scores_mean, validation_scores_std))

Note that there is some difference between our training scores and our validation scores.

So what should we expect in our test case?

## Further work:

Find the best parameters for:

penalty: ['l1' , 'l2']

C (regularization): [0.01, 0.1, 1, 10, 100]

solver: ['sag', 'newton-cg', 'liblinear']

## Exit Ticket

[Make sure that you finish this exit ticket before our second lecture](https://forms.gle/jPTUygnr5ohATsDf9)