# Logistic Regression: cat or dog?
In this lab you will teach a computer to distinguish between images of cats and dogs using logistic regression. 
The input dataset consists of 10,000 images manually labeled as "cats" or "dogs". The original dataset was downloaded from Kaggle. 

Download the entire [folder](https://drive.google.com/file/d/1V4pAtGy7VOJQlxM3g8gyDee8h5k7VTSF/view?usp=sharing) with images and unzip it into the local directory that holds the input files for this course. Then set the path below to point to the unzipped directory.

In [None]:
data_dir = "../../../data_ml_2020/cat_dog_data"

## 1. Building the model

### 1.1. Import all the required libraries. 
If you get an import error on `keras`, run one of the two installation cells that follow the import cell to install `keras` in the current Jupyter kernel, and then rerun the import cell. 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

from PIL import Image
from keras import preprocessing

ModuleNotFoundError: No module named 'keras'

In [None]:
# Install a conda package (with all its dependencies) in the current Jupyter kernel
# this will work if you have a clean installation of anaconda
import sys
!conda install --yes --prefix {sys.prefix} keras


In [None]:
# Alternatively - install keras package and its dependencies using pip
import sys
!pip install --upgrade tensorflow
!pip install --upgrade keras

### 1.2. Load images
First check if the path to the directory is correct:

In [None]:
import os
# List the contents of the data directory;
# a FileNotFoundError here means that data_dir points to the wrong place
print(os.listdir(data_dir))

Next create two lists and fill them with the paths to the corresponding images. 

In [None]:
# Collect the paths of all .jpg files in the cats folder
train_cats_files = []
train_path_cats = data_dir + "/training_set/cats/"
for path in os.listdir(train_path_cats):
    if path.endswith('.jpg'):
        train_cats_files.append(os.path.join(train_path_cats, path))

# Collect the paths of all .jpg files in the dogs folder
train_dogs_files = []
train_path_dogs = data_dir + "/training_set/dogs/"
for path in os.listdir(train_path_dogs):
    if path.endswith('.jpg'):
        train_dogs_files.append(os.path.join(train_path_dogs, path))

len(train_cats_files), len(train_dogs_files)

Now we have the paths to each image in the training set.
We need to convert each image into a numpy array. For this we use the preprocessing module in the `keras` library. 

In [None]:
k = 200
sample_dog_file = train_dogs_files[k]
img = preprocessing.image.load_img(sample_dog_file, target_size=(64, 64))
img_array = preprocessing.image.img_to_array(img)

In [None]:
plt.imshow(np.uint8(img_array))

In [None]:
img_array.shape
# print(img_array)

Each image is represented as a $64 \times 64$ matrix of pixels, and each pixel has three values: Red, Green, and Blue (RGB). The resulting array therefore has shape `(64, 64, 3)`. 
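To make this array layout concrete, here is a minimal sketch that does not need the dataset. It builds a synthetic array of random values standing in for a real photo (`fake_img` is a hypothetical name, not part of the lab code) and inspects its shape and a single pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an image: height x width x 3 colour channels
fake_img = rng.integers(0, 256, size=(64, 64, 3)).astype("float32")

print(fake_img.shape)   # (height, width, channels)
print(fake_img[0, 0])   # the R, G, B values of the top-left pixel
print(fake_img.size)    # total number of values: 64 * 64 * 3
```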

### 1.3. Images to numpy arrays
Now we load the cat images and then the dog images into a single `X_train_orig` array: the first 4000 entries hold cats and the next 4000 hold dogs.

In [None]:
# image dimensions: using 32x32 pixels just for speed
d = 32
X_train_orig = np.zeros((8000, d, d, 3), dtype='float32')
for i in range(4000):    
    path = train_cats_files[i]
    img = preprocessing.image.load_img(path, target_size=(d, d))
    X_train_orig[i] = preprocessing.image.img_to_array(img)

for i in range(4000,8000):    
    path = train_dogs_files[i-4000]
    img = preprocessing.image.load_img(path, target_size=(d, d))
    X_train_orig[i] = preprocessing.image.img_to_array(img)    

X_train_orig.shape

### 1.4. Flatten 3D image arrays
Our model requires each object to be a 1D vector of features, so we need to flatten our 3D image arrays.

After reshaping, each of the 8000 pictures in the training set is represented by a single array of $d \times d \times 3$ features.

In [None]:
X_train = X_train_orig.reshape(8000,-1)
print(X_train[0])
X_train.shape
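The effect of `reshape(n, -1)` is easier to see on a tiny example. The sketch below uses two made-up "images" of $2 \times 2$ pixels (the name `tiny` is hypothetical); the `-1` asks NumPy to infer the number of features per image.

```python
import numpy as np

# Two tiny "images" of 2x2 pixels with 3 channels each
tiny = np.arange(2 * 2 * 2 * 3, dtype="float32").reshape(2, 2, 2, 3)

# Keep the first axis (one row per image) and flatten everything else
flat = tiny.reshape(tiny.shape[0], -1)

print(tiny.shape)   # 4D: images, height, width, channels
print(flat.shape)   # 2D: images, features
print(flat[0])      # all 2 * 2 * 3 values of the first image in one row
```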

### 1.5. Create class labels
Now we need to create the corresponding class label vectors. We will mark the cats as class 1, and the dogs as class 0 (not cats).

In [None]:
Y_train_orig = np.ones((4000,)) # indices 0-3999 are cat pictures, so their label is 1
Y_train_orig = np.concatenate((Y_train_orig, np.zeros((4000,)))) # indices 4000-7999 are dog pictures, so their label is 0
Y_train = Y_train_orig.reshape(-1)
print("At position 3 should be a cat:", Y_train[3])
print("At position 4002 should be a dog:", Y_train[4002])
Y_train.shape

### 1.6. Build the model
We are using the `LogisticRegression` class from the `sklearn` package.
<ul>
<li>The <code>random_state</code> parameter seeds the random number generator that the <code>sag</code>, <code>saga</code> and <code>liblinear</code> solvers use to shuffle the samples, so that the classifier does not see all the cats first, and then the dogs. Specifying the <code>random_state</code> value ensures that the algorithm starts from the same random seed and produces reproducible results. The <code>lbfgs</code> solver does not use it.</li> 
<li>The <code>max_iter</code> parameter sets the maximum number of iterations: the algorithm stops at this limit even if it has not yet reached the threshold for convergence.</li>
    <li>The <code>solver</code> parameter specifies the optimization algorithm used to fit the model.</li>
</ul>

You can read more about the parameters of the `LogisticRegression` model [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn import linear_model

algorithms = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'] # default='lbfgs'
logreg = linear_model.LogisticRegression(solver=algorithms[1], random_state=42, max_iter=1000)
logreg.fit(X_train, Y_train)

The score of the logistic regression classifier is simply the fraction of correctly predicted data points, a value between 0 and 1. This measure is called the **accuracy** of the model.

In [None]:
acc_train = logreg.score(X_train, Y_train)
print("train accuracy: {} ".format(acc_train))
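As a sanity check on what `score` computes, the following sketch fits a classifier on synthetic data (the names `X_toy`, `y_toy` and `toy_model` are hypothetical and unrelated to the cat/dog data) and compares `score` with the fraction of correct predictions computed by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class problem: the label depends on the first two features
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Accuracy by hand: fraction of predictions that match the true labels
manual_acc = np.mean(toy_model.predict(X_toy) == y_toy)

print("manual accuracy:", manual_acc)
print("model.score:    ", toy_model.score(X_toy, y_toy))
```

The two printed values should agree, since `score` for a classifier is the mean accuracy on the given data.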

## 2. Lab Task 1: Model evaluation  \[60%\]
Accuracy on the training set says little about how well the model generalizes, so we are much more interested in how it performs on test data that it has not seen. To create a test set, repeat steps 1.2-1.5 for the `test_set` folder.

### 2.1. Load images

In [None]:
test_cats_files = []
test_path_cats = data_dir +"/test_set/cats/"
# <Your code here>

test_dogs_files = []
test_path_dogs = data_dir +"/test_set/dogs/"
# <Your code here>

len(test_cats_files), len(test_dogs_files)

### 2.2. Images to numpy arrays

In [None]:
X_test_orig = np.zeros((2000, d, d, 3), dtype='float32')  
# <Your code here>
X_test_orig.shape

### 2.3. Flatten 3D image arrays

In [None]:
X_test # <Your code here>
print(X_test[0])
X_test.shape

### 2.4. Create class labels

In [None]:
Y_test #<Your code here>
Y_test.shape

### 2.5. Accuracy for the test set

In [None]:
acc_test = logreg.score(X_test, Y_test)
print("test accuracy: {} ".format(acc_test))

### 2.6. Improve the model
If the predictive power of the classifier is too low, try to improve the model. Below are some suggestions for improving it. Rerun the model after each modification and see if the accuracy of prediction is improved. 

Carefully record the results of your experiments in a separate markdown cell.

<ol>
    <li>Increase value of $d$ (image dimensions) to 64.</li>
    <li>Normalize values in pixel arrays by dividing each value by 255 (max RGB value).</li>
    <li>Use a different model-fitting algorithm.</li>
    <li>Modify default parameters of <code>LogisticRegression</code> class.</li>
    <li>$\ldots$</li>
</ol>

You can stop once you have a good accuracy for the test set (no less than 0.60).
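As an illustration of the normalization suggestion, here is a minimal sketch on synthetic pixel values (`X_raw` and `X_scaled` are hypothetical names). It only shows the scaling step; whether it improves accuracy on the real data is for you to measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 4 flattened 32x32 RGB images with values in 0..255
X_raw = rng.integers(0, 256, size=(4, 32 * 32 * 3)).astype("float32")

# Divide by the maximum RGB value so every feature lies in [0, 1]
X_scaled = X_raw / 255.0

print(X_raw.min(), X_raw.max())
print(X_scaled.min(), X_scaled.max())
```

Whatever scaling you apply to the training set must also be applied to the test set and to any new images before calling `predict`.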

### 2.7. Predict random cats
Find a random image of a cat and another of a dog, and use your model to predict the class of each. Follow all the steps to convert the two images into an array of features and then call:

In [None]:
# X_new = [[...], [...]]
# logreg is the classifier fitted in section 1.6
Y_new = logreg.predict(X_new)

Submit your images with your lab, and specify which prediction you obtained for each image.
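The conversion steps can be sketched without any files on disk. The example below is only an assumption-laden illustration: it creates a synthetic image in memory with PIL instead of opening your photo, and it resizes with `Image.resize` rather than the `keras` loader used earlier, so the resampling may differ slightly from section 1.3. The names `img_small` and `features` are hypothetical.

```python
import numpy as np
from PIL import Image

d = 32  # must match the image size the model was trained on

# Synthetic stand-in for a photo; for a real file use Image.open(path).convert("RGB")
img = Image.new("RGB", (100, 80), color=(200, 120, 40))

# Resize to d x d pixels and convert to a float array of shape (d, d, 3)
img_small = img.resize((d, d))
img_array = np.asarray(img_small, dtype="float32")

# Flatten to one row of d * d * 3 features, the layout predict() expects
features = img_array.reshape(1, -1)
print(features.shape)
```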

### 2.8. Save model to file
When you are happy with the performance of your model and want to use it to identify cats in the future, save it to a file using pickle. An example of how to save the model and then reload it can be found [here](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/).

Test that you can save the model and then load it in the cell below. Upload your saved model to your Google Drive folder and provide the link to it in your notebook submission.
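A minimal sketch of the save/reload round trip is shown here on a toy model, not on your cat/dog classifier. The file name `toy_model.pkl` and the variables `toy_model` and `restored` are hypothetical; the file is written to a temporary directory so the example leaves your working folder untouched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data and model standing in for the trained cat/dog classifier
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Save the fitted model to a file
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and check that it makes the same predictions
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print("predictions identical after reload:", same)
```

Only unpickle files that you trust, and load the model with the same `scikit-learn` version that was used to save it.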

## 3. Lab Task 2: Support Vector Machines \[40%\]
First, watch the [video](https://www.youtube.com/watch?v=efR1C6CvhmE&vl=en) about another classifier: Support Vector Machine (SVM).

Next, train an SVM classifier on the cat/dog images.
Learn about the parameters of the sklearn SVC class [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [None]:
from sklearn.svm import SVC # "Support vector classifier"
svm = SVC(kernel='rbf', C=1E3)
#<Your code here>

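An SVM with a nonlinear kernel (such as `rbf`) can learn a curved decision boundary, whereas logistic regression is limited to a linear one, which can make SVM the more powerful classifier for this task. Try to achieve a better accuracy by tuning the algorithm parameters. Report the final values in a new markdown cell below.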

Finally, in a newly added markdown cell, briefly explain your understanding of the difference between the logistic regression and SVM learning algorithms. Pay special attention to how these algorithms treat the decision boundary. 
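To build some intuition before writing your explanation, the sketch below compares the two classifiers on a synthetic dataset of two concentric circles, generated with `make_circles`. This toy data is an assumption chosen for illustration and is unrelated to the cat/dog images; the names `lin_acc` and `rbf_acc` are hypothetical.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer circle:
# no straight line can separate them
X_toy, y_toy = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)

# Logistic regression can only draw a linear decision boundary
lin_acc = LogisticRegression(max_iter=1000).fit(X_toy, y_toy).score(X_toy, y_toy)

# An SVM with an RBF kernel can draw a closed, curved boundary
rbf_acc = SVC(kernel="rbf", C=1.0).fit(X_toy, y_toy).score(X_toy, y_toy)

print("logistic regression accuracy:", lin_acc)
print("RBF SVM accuracy:            ", rbf_acc)
```

On data like this the RBF SVM should score far higher than logistic regression; on the image data the gap depends on your parameter choices.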

Copyright &copy; 2020 Marina Barsky. All rights reserved.