# Mini Project: Binary classification with Logistic Regression

Your goal is to use your previous experience to build a simple binary classifier for Text vs. Non-text classification, and then apply it to classify image patches.

The roadmap is as follows:

* Load and display images
* Import the train/test datasets
* Train and Run the Logistic Regression classifier using two different image features:
  * Raw pixels
  * Histograms of grey values
* Evaluate and compare the classifier results using different evaluation measures
* Use your patch classifier in a sliding window scheme

## Preliminary

Load images, convert to numpy arrays, reshape.

In [None]:
import imageio
import numpy as np  # NumPy is used throughout the rest of the notebook

#Load an image example
img = imageio.imread('./img_A.pgm')

type(img),img.ndim, img.shape, img.dtype

In [None]:
import matplotlib.pyplot as plt #matplotlib plotting library

# Display an image
plt.imshow(img, cmap=plt.cm.gray)

In [None]:
#reshape image array to a vector (1D array)
img = np.reshape(img,-1)

type(img),img.ndim, img.shape, img.dtype

## Data acquisition (Optional)

This part of the notebook deals with loading images from your hard disk and converting them to NumPy arrays, which will form our dataset. The images are stored as PGM files, but they could be in any other image format.

Since we are going to read nearly 12,000 files from disk, this may take around 20 minutes. If you prefer, you can jump to the next section of the notebook and directly load the **raw_pixels_dataset_5980.pklz** file, which already contains the NumPy arrays for the entire dataset. The code is here in case you are curious about how to convert image files into NumPy arrays, and how **raw_pixels_dataset_5980.pklz** was created.

To execute this part of the notebook you must download the **scene_text_dataset.zip** (<font color='red'>~140 MB</font>) file from the Campus Virtual site and decompress it in the same directory as this Python notebook.


In [None]:
from os import listdir #now we will load all images in a given directory

datapath = 'data/characters/icdar/img_ICDAR_train/'

char_raw_pixels = np.array([ np.reshape(imageio.imread(datapath+f),-1) for f in listdir(datapath) ])
char_raw_pixels = np.reshape(char_raw_pixels,[-1,1024])

char_raw_pixels.shape

In [None]:
datapath = 'data/background/train/'

bg_raw_pixels = np.array([ np.reshape(imageio.imread(datapath+f),-1) for f in listdir(datapath) ])
bg_raw_pixels = np.reshape(bg_raw_pixels,[-1,1024])

bg_raw_pixels.shape

In [None]:
# We want a balanced dataset so we take only the first 5980 background samples
bg_raw_pixels = bg_raw_pixels[0:5980,:]
bg_raw_pixels.shape

In [None]:
#Visualize. Just to be sure the data is correct

im = char_raw_pixels[1,:]
im = np.reshape(im,[32,32])

plt.subplot(1, 2, 1)
plt.imshow(im, cmap=plt.cm.gray)

im = bg_raw_pixels[1,:]
im = np.reshape(im,[32,32])

plt.subplot(1, 2, 2)
plt.imshow(im, cmap=plt.cm.gray)

In [None]:
train_features = np.append(char_raw_pixels,bg_raw_pixels, axis=0)

char_labels = np.ones([char_raw_pixels.shape[0],1])
bg_labels   = np.zeros([bg_raw_pixels.shape[0],1])

train_labels = np.append(char_labels, bg_labels)

train_features.shape, train_labels.shape

In [None]:
#we now do the same for test data

datapath = 'data/characters/icdar/img_ICDAR_test/'

char_raw_pixels = np.array([ np.reshape(imageio.imread(datapath+f),-1) for f in listdir(datapath) ])
char_raw_pixels = np.reshape(char_raw_pixels,[-1,1024])

char_raw_pixels.shape

In [None]:
datapath = 'data/background/test/'

bg_raw_pixels = np.array([ np.reshape(imageio.imread(datapath+f),-1) for f in listdir(datapath) ])
bg_raw_pixels = np.reshape(bg_raw_pixels,[-1,1024])

bg_raw_pixels.shape

In [None]:
test_features = np.append(char_raw_pixels,bg_raw_pixels, axis=0)

char_labels = np.ones([char_raw_pixels.shape[0],1])
bg_labels   = np.zeros([bg_raw_pixels.shape[0],1])

test_labels = np.append(char_labels, bg_labels)

test_features.shape, test_labels.shape

In [None]:
# Now we can save all our data as serialized Python data, so we do not need to read
# the image files again the next time we want to execute our classification code

import pickle  # module for serializing Python object structures
import gzip    # lets us compress the data directly when writing it to a file

with gzip.open('./raw_pixels_dataset_5980.pklz','wb') as f:
    pickle.dump((train_labels,train_features,test_labels,test_features),f,pickle.HIGHEST_PROTOCOL)


## Classification using raw pixels as features

In the following we will test how well the raw pixel features work for automatically separating the two classes.

We are going to evaluate classification using Logistic Regression.
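
For reference, the model and the parameter update performed by the `GradientDescent_logistic_reg` function defined later in this section can be written as follows. This is read directly off that code, with $X$ the $m \times n$ feature matrix, $y$ the column vector of labels, $\alpha$ the learning rate and $\lambda$ the `reg_lambda` argument. Note that, as written there, the regularisation term is applied to every component of $\theta$.

$$
h_\theta(x) = \sigma(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad
\theta \leftarrow \theta - \alpha \left( \frac{1}{m} X^{T} \bigl( \sigma(X\theta) - y \bigr) + \frac{\lambda}{m}\,\theta \right)
$$

A sample is then assigned label 1 when $h_\theta(x) > 0.5$ and label 0 otherwise, which is what `classifyVector` does.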

In [None]:
# load the data
import pickle
import gzip

with gzip.open('./raw_pixels_dataset_5980.pklz','rb') as f:
    (train_labels,train_features,test_labels,test_features) = pickle.load(f)

print(train_features.shape)
print(test_features.shape)

Let's recover the logistic regression code we have defined:

In [None]:
def sigmoid(X):
    '''
    Computes the Sigmoid function of the input argument X.
    '''
    return 1.0/(1+np.exp(-X))

def GradientDescent_logistic_reg(x,y,max_iterations=2500, alpha=0.1, reg_lambda = 1):
    
    m,n = x.shape # number of samples, number of features

    # y must be a column vector
    y = y.reshape(m,1)
    
    #initialize the parameters
    theta = np.ones(shape=(n,1)) 
    
    # Repeat until convergence (or max_iterations)
    for iteration in range(max_iterations):
        h = sigmoid(np.dot(x,theta))
        error = (h-y)
        gradient = np.dot(x.T , error) / m  + reg_lambda * theta / m   # ADDED THE REGULARISATION TERM
        theta = theta - alpha*gradient
    return theta

def classifyVector(X, theta):
    '''
    Evaluate the Logistic Regression model h(x) with theta parameters,
    and returns the predicted label of x.
    '''
    prob = sigmoid(sum(np.dot(X,theta)))
    if prob > 0.5: return 1.0
    else: return 0.0

<font color=blue>Use logistic regression to learn a classifier over your training set, using the original pixel data as input</font>

In [None]:
# YOUR CODE HERE


## Handcrafted feature extraction

We have seen that using the raw pixels is not a good idea. Intuitively, there are two main reasons for the poor performance of our classifier. First, we do not have enough data to train in such a high-dimensional space (1024-D). Second, the raw pixels do not have enough discriminative power to separate the Text and Non-text examples in our dataset: a simple 1-pixel shift in one of the examples may produce a very different feature vector.

In Computer Vision (and in Pattern Recognition in general), feature extraction is a procedure to extract the pieces of information that are relevant for solving the computational task at hand. There is a long tradition of designing handcrafted features that incorporate prior knowledge about the classes to solve specific problems.

In this part of the notebook we are going to extract simple features: histograms of the intensity values of our images. The intuition is that in text image patches we expect to find bi-level histograms (two opposite dominant colors), because text is by design written with high contrast to its background.

Then we will evaluate how good those features are for automatically classifying between the two classes (Text/Non-text) using Logistic Regression.
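
The intuition above can be checked on a toy example. The sketch below builds a synthetic 32x32 patch (three thin bright strokes on a dark background, an illustrative stand-in and not a sample from the dataset), shifts it by one pixel, and compares the two patches using raw pixels and using 8-bin histograms. The fixed histogram range `(0, 256)` and the `cosine` helper are assumptions made for this illustration only.

```python
import numpy as np

# Synthetic patch: three 1-pixel-wide bright strokes on a dark background.
patch = np.zeros((32, 32), dtype=np.uint8)
patch[4:28, [8, 16, 24]] = 255

# The same patch shifted one pixel to the right.
shifted = np.roll(patch, 1, axis=1)

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Raw-pixel feature vectors (1024-D).
raw_a = patch.reshape(-1).astype(float)
raw_b = shifted.reshape(-1).astype(float)

# Normalised 8-bin grey-level histograms (fixed range, an assumption here).
hist_a = np.histogram(patch, bins=8, range=(0, 256))[0] / patch.size
hist_b = np.histogram(shifted, bins=8, range=(0, 256))[0] / shifted.size

raw_cos = cosine(raw_a, raw_b)
hist_cos = cosine(hist_a, hist_b)

print("raw-pixel cosine similarity:", raw_cos)
print("histogram cosine similarity:", hist_cos)
print("histogram of the patch:", hist_a)
```

The two raw-pixel vectors share no bright pixel, so they are orthogonal, while the histograms are identical and show only two occupied bins (the bi-level shape expected for text).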


In [None]:
#For each example we compute the histogram of grey intensity values

new_train_features = np.zeros([train_features.shape[0],8])
for i in range(train_features.shape[0]):
    new_train_features[i,:] = np.histogram(train_features[i,:],8)[0]
    new_train_features[i,:] /= np.sum(new_train_features[i,:]) #Histogram normalization
    
new_test_features = np.zeros([test_features.shape[0],8])
for i in range(test_features.shape[0]):
    new_test_features[i,:] = np.histogram(test_features[i,:],8)[0]
    new_test_features[i,:] /= np.sum(new_test_features[i,:]) #Histogram normalization
    
new_train_features.shape, new_test_features.shape

In [None]:
#Visualize the histograms of positive/negative samples

plt.subplot(2, 2, 1)
plt.imshow(np.reshape(train_features[1,:],[32,32]), cmap=plt.cm.gray)

plt.subplot(2, 2, 2)
plt.imshow(np.reshape(train_features[5981,:],[32,32]), cmap=plt.cm.gray)

bins = [0,1,2,3,4,5,6,7]

plt.subplot(2, 2, 3)
plt.bar(bins, new_train_features[1,:], align='center')

plt.subplot(2, 2, 4)
plt.bar(bins, new_train_features[5981,:], align='center')

# print(train_labels[1], train_labels[5981])

<font color=blue>Learn a new classifier using these features</font>

In [None]:
# YOUR CODE HERE


With the use of Histograms of intensity values we have improved the performance of our classifier by more than 20%!

Notice that this is a very simple feature extraction process, and it is not really used in this way in state-of-the-art algorithms. In fact, the proposed features are quite weak (as the results show). However, the idea here is to become aware that designing handcrafted features is one possible way of improving the discriminative power of our classifiers.

This experiment also serves to introduce the topic of the next Practical (PR2), where we will see how powerful features can be learned automatically, in an unsupervised way, from our training data.

<font color=blue>Using the histogram of intensity values features, evaluate the Precision and Recall measures at different operating points of the classifier. This can be done by changing the 0.5 decision threshold in the classifyVector function to a range of values between 0 and 1. Plot the resulting Precision/Recall curve and analyse it.</font>
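
As a starting point, here is a minimal sketch of how Precision and Recall can be computed at a given threshold. The helper name `precision_recall_at` and the toy scores are hypothetical, chosen for illustration; in the exercise the scores would be the sigmoid outputs of your own classifier on the test set. Returning a precision of 1.0 when nothing is predicted positive is a convention assumed here, not something the notebook prescribes.

```python
import numpy as np

def precision_recall_at(probs, labels, threshold):
    """Precision and recall of the rule 'predict 1 when prob > threshold'."""
    predicted = probs > threshold
    positives = labels == 1
    tp = np.sum(predicted & positives)    # true positives
    fp = np.sum(predicted & ~positives)   # false positives
    fn = np.sum(~predicted & positives)   # false negatives
    # Assumed convention: precision is 1.0 when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Toy scores standing in for classifier probabilities (not real results).
toy_probs = np.array([0.95, 0.80, 0.65, 0.60, 0.40, 0.30, 0.20, 0.05])
toy_labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])

for t in (0.1, 0.5, 0.9):
    p, r = precision_recall_at(toy_probs, toy_labels, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision; sweeping it over many values, for example with `np.linspace(0, 1, 50)`, gives the points of the curve.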

In [None]:
# YOUR CODE HERE


## Using Sliding Window with our Text vs. Non-text classifier

Sliding Window is a common Computer Vision technique used to apply patch-based classifiers to full-size images. The basic idea is to exhaustively evaluate the classifier response in "all" possible sub-windows of the input image.
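
To make the enumeration of sub-windows concrete, the sketch below lists the top-left corners visited for one window size. The generator name `sliding_windows` and the 100x100 image size are hypothetical; the step is taken as a fraction of the window size, mirroring the loop bounds used in the cell that follows.

```python
def sliding_windows(height, width, size, step_fraction=0.2):
    """Yield (x, y) top-left corners of size x size windows over a height x width image."""
    step = max(1, int(size * step_fraction))  # step in pixels, at least 1
    for x in range(0, width - size, step):
        for y in range(0, height - size, step):
            yield x, y

# Example: 32x32 windows over a hypothetical 100x100 image.
corners = list(sliding_windows(height=100, width=100, size=32))
print(len(corners), corners[0], corners[-1])
```

Every window stays inside the image, and neighbouring windows overlap heavily, which is why the responses are accumulated into a detection map.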

In [None]:
from PIL import Image

img = Image.open("img_scene.jpg")
#img.show()

img = np.array(img)

#pylab.rcParams['figure.figsize'] = 14, 10.5  # changes the default image size for the notebook
plt.figure(num=None, figsize=(14, 10.5), dpi=80, facecolor='w', edgecolor='k')

# Load an image and plot it
#img = imageio.imread('img_scene.jpg')
plt.subplot(1, 2, 1)
plt.imshow(img, cmap=plt.cm.gray)

detection_map = np.zeros(shape=img.shape)

win_sizes = (32,64,96)
win_step  = 0.2

for size in win_sizes:
  for x in range(0,img.shape[1]-size,int(size*win_step)):
    for y in range(0,img.shape[0]-size,int(size*win_step)):
        window = img[y:y+size,x:x+size]
        window = np.array(Image.fromarray(window).resize(size = (32,32), resample = Image.BILINEAR))
        raw_pixels = np.reshape(window,[-1,1024])
        hist_feature = np.histogram(raw_pixels,8)[0]
        hist_feature = hist_feature.astype(np.float32)
        hist_feature /= np.sum(hist_feature.T)
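        # w1 is the parameter vector you learned on the 8-bin histogram features (defined in your solution cell)
        prob = sigmoid(sum(np.dot(hist_feature,w1)))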
        detection_map[y:y+size,x:x+size] += 255*prob
        
plt.subplot(1, 2, 2)
plt.imshow(detection_map,cmap=plt.cm.jet)


## Open Exercise

<font color=blue>Propose and implement an improvement, or an extra evaluation analysis, for our Text vs. Non-text classifier in its current state.</font>

Some ideas (not a closed list):

* Implement the Stochastic Gradient Descent algorithm. Show how it improves the training time.

* Use cross-validation to tune the meta-parameters (max_iterations and learning rate $\alpha$) of Gradient Descent. 

* Other optimization algorithms?

* Evaluate the Test Accuracy as a function of the number of training examples ($m$).

* Use Histogram of Oriented Gradients (HOG) image features. http://scikit-image.org/docs/dev/auto_examples/plot_hog.html

* Improve the Histogram features. E.g. evaluate the effect of histogram size: try different numbers of bins and compare the obtained results (show precision, recall, accuracy). Can you improve the current Test Accuracy? What is the best accuracy you can reach?

* Zoning: divide the image patch into an N by N grid of cells, compute one grey-level histogram per cell (NxN histograms in total), and concatenate them to create a new feature.
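
As one possible starting point for the zoning idea, the sketch below computes a histogram in each cell of a grid and concatenates them. The function name `zoning_histogram`, the fixed histogram range `(0, 256)` and the random toy patch are assumptions made for illustration; the patch side is assumed to be divisible by the grid size.

```python
import numpy as np

def zoning_histogram(patch, grid=2, bins=8):
    """Concatenate normalised per-cell histograms over a grid x grid partition of a square patch."""
    cell = patch.shape[0] // grid  # cell side length in pixels
    features = []
    for row in range(grid):
        for col in range(grid):
            block = patch[row * cell:(row + 1) * cell, col * cell:(col + 1) * cell]
            hist = np.histogram(block, bins=bins, range=(0, 256))[0].astype(float)
            features.append(hist / hist.sum())  # each cell histogram sums to 1
    return np.concatenate(features)

# Toy 32x32 patch with random grey values (a stand-in for a dataset sample).
rng = np.random.default_rng(0)
toy_patch = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)

feature = zoning_histogram(toy_patch, grid=2, bins=8)
print(feature.shape)
```

With the dataset used here, a row of `train_features` would first be reshaped to `[32,32]` before being passed to the function. The feature length grows as `grid * grid * bins`, so larger grids bring back some of the dimensionality issues seen with raw pixels.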


In [None]:
# YOUR CODE HERE
