# Predicting bird species with random forests
In this notebook I attempt to predict a bird's species with a random forest classifier, using the [Caltech-UCSD Birds-200-2011 dataset](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html). This project is discussed in detail in the book 'Python Artificial Intelligence Projects for Beginners' by Joshua Eckroth ([link to book](https://www.amazon.com/Python-Artificial-Intelligence-Projects-Beginners/dp/1789539463)).

## Getting the data

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [3]:
# importing the attribute labels (long format: one row per image/attribute pair)
# on_bad_lines='skip' replaces error_bad_lines/warn_bad_lines (pandas >= 1.3)
img_att = pd.read_csv("/home/huginn/Documents/AI/data/CUB_200_2011/attributes/image_attribute_labels.txt",
                     sep=r'\s+', header=None, on_bad_lines='skip',
                     usecols=[0, 1, 2], names=['imgid', 'attid', 'present'])

In [4]:
img_att.head()

,imgid,attid,present
0,1,1,0
1,1,2,0
2,1,3,0
3,1,4,0
4,1,5,1


In [5]:
img_att.shape

(3677856, 3)

## Data processing

We want one row per image and one column per attribute, so we use `pivot()` to reshape the long table into a wide binary matrix in which each cell says whether that attribute is present (1) or absent (0) in that image.

In [6]:
img_att_mtx = img_att.pivot(index='imgid', columns='attid', values='present')

In [7]:
img_att_mtx.head()

imgid,1,2,3,4,5,6,7,8,9,10,...,303,304,305,306,307,308,309,310,311,312
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
5,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [8]:
img_att_mtx.shape

(11788, 312)

Now we can see that there are 11,788 images and 312 attributes.
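
The long-to-wide reshape done by `pivot()` can be illustrated on a toy table. This is only a sketch: the `toy_long` and `toy_wide` names and their values are made up for illustration and are not part of the dataset. It also shows the bookkeeping that holds for the real data, where 11,788 rows times 312 columns gives the 3,677,856 rows of the long table.

```python
import pandas as pd

# Toy long-format table: 2 images x 3 attributes (values are made up)
toy_long = pd.DataFrame({
    "imgid":   [1, 1, 1, 2, 2, 2],
    "attid":   [1, 2, 3, 1, 2, 3],
    "present": [0, 1, 0, 1, 0, 0],
})

# One row per image, one column per attribute
toy_wide = toy_long.pivot(index="imgid", columns="attid", values="present")
print(toy_wide)

# rows x columns of the wide table equals the row count of the long table
assert toy_wide.shape[0] * toy_wide.shape[1] == len(toy_long)
```

Note that `pivot()` raises an error if an (image, attribute) pair appears more than once, which doubles as a cheap integrity check on the input file.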

Next we load the labels, that is, the species of each image:

In [9]:
img_labels = pd.read_csv("/home/huginn/Documents/AI/data/CUB_200_2011/image_class_labels.txt", 
                         sep=' ', header=None, names=['imgid', 'label'])

In [10]:
img_labels.head()

,imgid,label
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1


In [11]:
img_labels.shape

(11788, 2)

And now we'll use `set_index()` to set the index as the image ID:

In [12]:
img_labels = img_labels.set_index('imgid')

In [13]:
img_labels.head()

imgid,label
1,1
2,1
3,1
4,1
5,1


Now we have a one-column table indexed by image ID (the IDs listed in `images.txt`), where `label` is the species ID of that image.

Now I'll attach the image labels to the attributes dataframe and shuffle the rows, before splitting them into a train and a test set:

In [14]:
birds = img_att_mtx.join(img_labels)
birds = birds.sample(frac=1)

In [15]:
birds_att = birds.iloc[:, :312]
birds_labels = birds.iloc[:, 312:]
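
The join, shuffle, and feature/label split can be sketched on toy data as follows. Everything here is illustrative: the `toy_*` names and values are made up, `random_state=0` is an addition that makes the shuffle repeatable, and selecting the label by column name is an alternative to the positional `iloc` slicing used above, not the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for img_att_mtx and img_labels (values are made up)
toy_att = pd.DataFrame({1: [0, 1, 0], 2: [1, 0, 0]},
                       index=pd.Index([1, 2, 3], name="imgid"))
toy_labels = pd.DataFrame({"label": [7, 7, 9]},
                          index=pd.Index([1, 2, 3], name="imgid"))

# join() aligns rows on the shared imgid index
toy_birds = toy_att.join(toy_labels)

# Shuffle the rows; random_state is an assumption added for repeatability
toy_birds = toy_birds.sample(frac=1, random_state=0)

# Split by column name instead of by position
toy_x = toy_birds.drop(columns="label")
toy_y = toy_birds["label"]
```

Selecting by name keeps working if the number of attribute columns ever changes, whereas `iloc[:, :312]` silently depends on there being exactly 312 of them.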

In [16]:
birds_att.head()

imgid,1,2,3,4,5,6,7,8,9,10,...,303,304,305,306,307,308,309,310,311,312
2672,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
162,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,1,0,0,0
9493,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,1,0,0,0
8493,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3434,0,0,1,0,0,0,0,0,0,0,...,0,0,1,1,0,1,0,0,0,0


In [17]:
len(birds_att)

11788

In [18]:
birds_labels.head()

imgid,label
2672,47
162,3
9493,162
8493,145
3434,60


In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(birds_att, birds_labels,
                                    test_size=0.2, random_state=42)

In [20]:
# reshaping the answers to a 1D numpy array
y_train = np.array(y_train)
y_train_vec = y_train.reshape(-1)
y_test = np.array(y_test)
y_test_vec = y_test.reshape(-1)
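
To make the reshaping step concrete, here is a small sketch on a made-up label frame (the `toy_labels` name and values are hypothetical). It shows the column-vector shape that `np.array()` produces from a one-column dataframe, the flat shape after `reshape(-1)`, and a one-step equivalent that selects the column first.

```python
import numpy as np
import pandas as pd

# Toy one-column label frame shaped like birds_labels (values are made up)
toy_labels = pd.DataFrame({"label": [47, 3, 162]}, index=[2672, 162, 9493])

as_column = np.array(toy_labels)              # shape (3, 1): a column vector
as_vector = as_column.reshape(-1)             # shape (3,): flat 1-D array of labels
same_thing = toy_labels["label"].to_numpy()   # one-step equivalent

print(as_column.shape, as_vector.shape, same_thing.shape)
```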

## Training models

### Random Forest Classifier

In [21]:
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train_vec)

RandomForestClassifier()

In [22]:
rfc.score(x_test, y_test_vec)

0.46055979643765904

We get an accuracy of 46%, which is far better than the roughly 0.5% expected from guessing at random among 200 species, but still not ideal.
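
For reference, the chance level can be checked with a trivial baseline. The sketch below assumes perfectly balanced classes, which is only an approximation of the real dataset, and uses made-up labels rather than the bird data. `DummyClassifier` is scikit-learn's baseline estimator; with `strategy="most_frequent"` it always predicts a single class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

n_classes, per_class = 200, 10          # hypothetical, perfectly balanced labels
y = np.repeat(np.arange(1, n_classes + 1), per_class)
X = np.zeros((len(y), 1))               # features are ignored by the dummy model

# Always predicts one class, so it is right for exactly one class in 200
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
chance = baseline.score(X, y)
print(chance)
```

Against that baseline of 1 in 200, an accuracy of 46% shows that the attributes carry a lot of signal about the species.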

### SVM

In [30]:
from sklearn.svm import LinearSVC

svc = LinearSVC(max_iter=2000)
svc.fit(x_train, y_train_vec)
svc.score(x_test, y_test_vec)

0.43214588634435963

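This is slightly worse than the random forest, so I'll go with the forest.
First, though, a cross-validation check of the SVM. Note that this cell cross-validates on the test split rather than on the training split: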

In [31]:
from sklearn.model_selection import cross_val_score

svc_score = cross_val_score(svc, x_test, y_test_vec, cv=4)

In [32]:
svc_score

array([0.33050847, 0.37457627, 0.31578947, 0.33616299])
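
Because the cell above passes `x_test` to `cross_val_score`, each fold trains on three quarters of the 20% test split, which probably explains why these scores are lower than the single-split score. A more conventional setup cross-validates on the training split and leaves the test split untouched. The sketch below shows that pattern on synthetic data from `make_classification`, so the numbers it prints say nothing about the bird data; the `svc_demo` and `cv_scores` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the bird attributes (not the real data)
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

svc_demo = LinearSVC(max_iter=2000)

# Cross-validate on the training split; the test split stays untouched
cv_scores = cross_val_score(svc_demo, X_tr, y_tr, cv=4)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean and standard deviation across folds gives a sense of how much a single train/test split can vary.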

### Fine tuning
We'll use `GridSearchCV` to run through different hyperparameter combinations:

In [25]:
from sklearn.model_selection import GridSearchCV

param_grid = [{"n_estimators": [200, 250], "max_features": [9, 10, 11]}]

rfc = RandomForestClassifier()
grid_search = GridSearchCV(rfc, param_grid, cv=5, return_train_score=True)
grid_search.fit(x_train, y_train_vec)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid=[{'max_features': [9, 10, 11],
                          'n_estimators': [200, 250]}],
             return_train_score=True)

In [26]:
grid_search.best_estimator_

RandomForestClassifier(max_features=11, n_estimators=250)

In [27]:
grid_search.best_score_

0.476033934252386
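
Beyond `best_estimator_` and `best_score_`, the fitted search keeps per-combination results in `cv_results_`, which is handy for seeing how close the candidates are to each other. The following is a minimal sketch on synthetic data with a deliberately tiny grid; the grid values, `search`, and `results` are illustrative and unrelated to the values tuned in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the bird attributes)
X, y = make_classification(n_samples=200, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

param_grid_demo = [{"n_estimators": [10, 20], "max_features": [2, 4]}]
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid_demo, cv=3, return_train_score=True)
search.fit(X, y)

# One row per hyperparameter combination, best validation score first
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_estimators", "param_max_features",
        "mean_train_score", "mean_test_score", "std_test_score"]
print(results[cols].sort_values("mean_test_score", ascending=False))
```

A large gap between `mean_train_score` and `mean_test_score` is a sign of overfitting, and differences smaller than `std_test_score` are usually not worth chasing.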

The best values of both `n_estimators` and `max_features` were the largest in the grid, so we'll try higher values:

In [23]:
param_grid = [{"n_estimators": [255, 270], "max_features": [10, 12]}]
grid_search = GridSearchCV(rfc, param_grid, cv=4, return_train_score=True)
grid_search.fit(x_train, y_train_vec)

GridSearchCV(cv=4, estimator=RandomForestClassifier(),
             param_grid=[{'max_features': [10, 12],
                          'n_estimators': [255, 270]}],
             return_train_score=True)

In [24]:
grid_search.best_estimator_

RandomForestClassifier(max_features=12, n_estimators=255)

In [25]:
grid_search.best_score_

0.46776272867386876

The score dropped a little (0.468 versus 0.476), although the two searches are not strictly comparable because this one used `cv=4` instead of `cv=5`. I'll stop here and keep `n_estimators=250` and `max_features=11`.

In [35]:
rfc = RandomForestClassifier(max_features=11, n_estimators=250)

So the best cross-validated accuracy that we're going to get seems to be around 47%.
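
The tuned classifier above is created but not refit or scored on the held-out test set, and the `confusion_matrix` import from the first cell is never used. A final evaluation could look like the sketch below. It runs on synthetic data with made-up hyperparameters, so `final_demo`, `test_acc`, and `cm` are illustrative names and the result is not a measurement on the bird data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the bird attributes)
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the chosen configuration on the full training split
final_demo = RandomForestClassifier(n_estimators=50, max_features=4, random_state=0)
final_demo.fit(X_tr, y_tr)

# Score once on the untouched test split
test_acc = final_demo.score(X_te, y_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, final_demo.predict(X_te), labels=[0, 1, 2])
print(test_acc)
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total equals the accuracy; the off-diagonal cells show which classes get mistaken for which.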

## Summary

In this project we processed the data with Pandas and NumPy, split it into train and test sets, and trained two classification models: an SVM and a Random Forest Classifier. The random forest proved slightly more successful. We then used `GridSearchCV` for hyperparameter tuning, through which we found that 250 trees and 11 features per split work best, giving a cross-validated accuracy of about 47%.
In a future project I will use the actual images from the dataset to train a CNN to recognize the species.