## Putting it all together! 

In this exercise, we will combine the tools we have encountered thus far, get acquainted with a few new ones, and use them to build a complete program.


### Task
Given a dataset of color (RGB) images of different fruit types, we want to build an automated classifier that can distinguish between the classes. To keep it simple, we provide pictures of different varieties of apples and oranges. The task, then, is proverbially *to tell Apples from Oranges*.

### Dataset

The complete dataset is provided as the archive data.zip. When uncompressed, it consists of several directories. The name of each directory is the label of a fruit type, and each directory contains numerous pictures of that fruit type in JPG format.
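
As a minimal sketch of the layout described above, the snippet below reads the class labels off the directory names and counts the JPG files in each directory. The helper name `count_images_per_label` is hypothetical, and the demo builds a throwaway directory tree with empty placeholder files instead of reading the real archive.

```python
import tempfile
from pathlib import Path

def count_images_per_label(data_dir):
    """Map each sub-directory name (the label) to its number of .jpg files."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = len(list(class_dir.glob('*.jpg')))
    return counts

# Demo on a temporary tree that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    for label, n in [('Apples', 3), ('Oranges', 2)]:
        class_dir = Path(tmp) / label
        class_dir.mkdir()
        for i in range(n):
            (class_dir / f'{i}.jpg').touch()   # empty placeholder file
    demo_counts = count_images_per_label(tmp)

print(demo_counts)
```

On the real data, the same helper would be pointed at the uncompressed data directory.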

In [None]:
# Some of the essential Python packages to be loaded
# Note how a package can be imported under a shorter alias (e.g. numpy as np)
import numpy as np 
import glob
import matplotlib.pyplot as plt
# add any other packages you might need

In [None]:
# The data is provided in two folders with names
# Apples and Oranges.
# Read the file names using glob.glob()

# Count the number of apples and oranges

orFiles = glob.glob('../data/Oranges/*.jpg')
nOranges = len(orFiles)

apFiles = glob.glob('../data/Apples/*.jpg')
nApples = len(apFiles)
nFeatures = 3  # one feature per colour channel (R, G, B)
print("Dataset contains %d apples and %d oranges" % (nApples, nOranges))

In [None]:
# Show some sample oranges and apples using pyplot
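
One possible way to fill in the cell above is sketched here. The helper `show_samples` is a hypothetical name, and the demo uses solid-colour placeholder arrays rather than the real photos, so that it runs without the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_samples(images, titles):
    """Show the given images side by side, one subplot per image."""
    fig, axes = plt.subplots(1, len(images), figsize=(3 * len(images), 3))
    for ax, img, title in zip(np.atleast_1d(axes), images, titles):
        ax.imshow(img)
        ax.set_title(title)
        ax.axis('off')
    return fig

# Placeholder images (solid colours) standing in for real photos
orange_like = np.full((32, 32, 3), (255, 165, 0), dtype=np.uint8)
apple_like = np.full((32, 32, 3), (200, 30, 30), dtype=np.uint8)
fig = show_samples([orange_like, apple_like], ['Orange', 'Apple'])
```

With the real data, the image list could be built as `[plt.imread(f) for f in orFiles[:3]]`, assuming the files are readable JPGs.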


In [None]:
# We would like to extract some features 
# to be able to compare apples vs oranges

# Can you think of which features might be
# most useful?


# Initialise empty arrays to hold some features
orFeatures = np.zeros((nOranges,nFeatures))
apFeatures = np.zeros((nApples,nFeatures))

In [None]:
# We need to assign "labels" to distinguish 
# apples from oranges

# Usually the two classes are mapped to two numbers
# We use 0 -> oranges and 1 -> apples
orLabels = np.zeros(nOranges)
apLabels = np.ones(nApples)

In [None]:
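# Feature extraction
# One of the basic features that can be extracted is the average intensity
# of each of the R, G and B channels.
# Steps (for every image file):
# 1. Load the image data
# 2. Convert it into a numpy array
# 3. Compute the mean of each channel and store the 3 values
#    as one row of orFeatures or apFeatures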
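
The three steps above can be sketched as follows. The names `mean_rgb`, `extract_features` and `fake_reader` are hypothetical, and the sketch assumes every image is an H x W x 3 (or H x W x 4) array; a grayscale image would need separate handling. The image reader is passed in as an argument so that the demo can run without the dataset.

```python
import numpy as np

def mean_rgb(image):
    """Return the mean R, G and B intensity of an H x W x 3 image array."""
    pixels = np.asarray(image, dtype=float)
    # Keep only the first three channels, flatten the pixels, average per channel
    return pixels[..., :3].reshape(-1, 3).mean(axis=0)

def extract_features(file_names, read_image):
    """Build an (n_images, 3) feature array; read_image maps a path to an array."""
    features = np.zeros((len(file_names), 3))
    for i, name in enumerate(file_names):
        features[i] = mean_rgb(read_image(name))
    return features

# Demo with a fake reader that returns a constant-colour image
def fake_reader(name):
    return np.full((4, 4, 3), (255, 128, 0), dtype=np.uint8)

demo_features = extract_features(['a.jpg', 'b.jpg'], fake_reader)
print(demo_features)
```

In the notebook, this could be used as `orFeatures = extract_features(orFiles, plt.imread)` and likewise for `apFiles`.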


In [None]:
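# Visualise one pair of features for both classes
# (features 0 and 1, assumed here to be mean R and mean G)
plt.scatter(orFeatures[:,0],orFeatures[:,1],label='Oranges')
plt.scatter(apFeatures[:,0],apFeatures[:,1],label='Apples')
plt.xlabel('Feature 0')
plt.legend()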

In [None]:
# make the dataset

# Combine features of both oranges and apples into
# a single array. Same for labels. 
X = np.concatenate((orFeatures,apFeatures),axis=0)
Y = np.concatenate((orLabels,apLabels))

# Let us shuffle the data! The same permutation is applied to
# features and labels so that each row of X keeps its label.
shuffleIdx = np.random.permutation(len(Y))
xShuffle = X[shuffleIdx]
yShuffle = Y[shuffleIdx]
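
As a small check of why one shared permutation is used in the cell above, the sketch below shuffles a toy dataset in which each label can be recomputed from its feature row. The fixed seed and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, assumed for reproducibility
X_demo = np.arange(10).reshape(5, 2)      # row i is [2i, 2i+1]
y_demo = np.arange(5)                     # label i belongs to row i

idx = rng.permutation(len(y_demo))
X_shuf, y_shuf = X_demo[idx], y_demo[idx]

# Every shuffled row still matches its label
print(np.all(X_shuf[:, 0] == 2 * y_shuf))
```

Shuffling `X` and `Y` with two independent permutations would break this pairing.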



In [None]:
# Split the data into training (60%) and test (40%) data
N = len(yShuffle)
nTrain = int(0.6*N)
nTest = N - nTrain

# Let us use only the first two features. 
xTrain, yTrain = xShuffle[:nTrain,:2], yShuffle[:nTrain]
xTest, yTest = xShuffle[nTrain:,:2], yShuffle[nTrain:]
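
As an alternative to the manual split above (not the approach this notebook uses), scikit-learn's `train_test_split` can shuffle and split in one call. The sketch below runs on made-up data and uses `stratify` so that both classes keep the same proportion in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples per class, 2 features each
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.4,      # same 60/40 ratio as the manual split
    stratify=y_demo,    # keep the class balance in both parts
    random_state=0,     # fixed seed, assumed for reproducibility
)
print(X_tr.shape, X_te.shape)
```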

In [None]:
# Now that our data is ready, let us train a classifier.


# We will use another linear classifier: logistic regression.
# It learns a linear decision boundary
# that separates the inputs into two classes.

from sklearn.linear_model import LogisticRegression
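
To make the "linear boundary" concrete, here is a minimal sketch of the model behind binary logistic regression: the probability of class 1 is the sigmoid of a linear score `w . x + b`, and the predicted class flips where that score crosses zero. The parameters and points below are made up for illustration and are not fitted values.

```python
import numpy as np

def predict_proba_class1(X, w, b):
    """P(label = 1 | x) = sigmoid(w . x + b) for each row x of X."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and points, purely for illustration
w = np.array([1.0, -1.0])
b = 0.0
X_demo = np.array([[3.0, 1.0],    # z = 2  -> class 1 side
                   [1.0, 3.0],    # z = -2 -> class 0 side
                   [2.0, 2.0]])   # z = 0  -> exactly on the boundary

p = predict_proba_class1(X_demo, w, b)
labels = (p > 0.5).astype(int)
print(p, labels)
```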

In [None]:
clf = LogisticRegression(random_state=0,solver='lbfgs').fit(xTrain, yTrain)

# Print the model coefficients.
print("Logistic regression parameters: ",clf.coef_[0],clf.intercept_[0])


In [None]:
# Check the performance on the training dataset
yTrPred = clf.predict(xTrain)
accTrain = sum(yTrPred == yTrain)/nTrain
print("Training accuracy is: %.2f"%accTrain)

In [None]:
# Check the performance on the test dataset
yTsPred = clf.predict(xTest)
accTest = sum(yTsPred == yTest)/nTest
print("Test accuracy is: %.2f"%accTest)
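
Accuracy alone does not show which class is being misclassified. As a sketch, the counts of each (true label, predicted label) pair can be tabulated with plain NumPy. The helper name `confusion_counts` and the example labels are made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return a 2x2 array C where C[i, j] counts true class i predicted as j."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    counts = np.zeros((2, 2), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)   # accumulate one count per sample
    return counts

y_true_demo = np.array([0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 0])
C = confusion_counts(y_true_demo, y_pred_demo)
accuracy = np.trace(C) / C.sum()             # correct predictions lie on the diagonal
print(C, accuracy)
```

In the notebook, this could be called as `confusion_counts(yTest, yTsPred)`, where row 0 is oranges and row 1 is apples.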

In [None]:
# Plot the decision boundary w[0]*x + w[1]*y + b = 0,
# i.e. the line y = -(w[0]/w[1])*x - b/w[1]
w = clf.coef_[0]
a = -w[0] / w[1]
xMax = X[:,0].max()+10
xMin = X[:,0].min()-10
xx = np.linspace(xMin, xMax)
yy = a * xx - (clf.intercept_[0]) / w[1]

plt.plot(xx, yy, 'k-')
plt.scatter(xTrain[:,0],xTrain[:,1],c=yTrain,cmap='Accent')

plt.scatter(xTest[:,0],xTest[:,1],c=yTest,cmap='Accent')
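
The line drawn above follows from setting the linear score to zero: `w[0]*x + w[1]*y + b = 0` gives `y = -(w[0]/w[1])*x - b/w[1]`, provided `w[1]` is non-zero. The sketch below checks this numerically with made-up parameters standing in for the fitted `clf.coef_[0]` and `clf.intercept_[0]`.

```python
import numpy as np

# Made-up parameters standing in for clf.coef_[0] and clf.intercept_[0]
w = np.array([0.5, -2.0])
b = 4.0

xx = np.linspace(0.0, 10.0, 5)
yy = -(w[0] / w[1]) * xx - b / w[1]

# Decision values on the line should all be (numerically) zero
z = w[0] * xx + w[1] * yy + b
print(np.allclose(z, 0.0))
```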