# Project 59 Shape Classification

## Authors: Julen Etxaniz and Ibon Urbina

## Objectives: The goal of the project is to compare different classification algorithms on one or more shape datasets.

## What is done in the Notebook: 
### Importing the libraries
### Reading the datasets
### Preprocessing the dataset
### Preparing data for classification
### Dividing the dataset into train and test sets for validation
### Defining the classifiers
### Learning the classifiers
### Using the classifier for predictions
### Computing the accuracy
### Computing the confusion matrices

# Importing the libraries
We start by importing all the libraries used in the notebook.

In [17]:
import os
import pandas as pd
from scipy.io import loadmat

# Reading the datasets
We read the plane and car datasets.

## Reading the plane dataset
We read the 210 files that contain the instances of the plane classification problem.

We collect all the instances in a single list called `plane_mats`.

In [24]:
plane_dir = "../shape_dataset/plane_data/"
plane_mats = []
# Load every file in the directory; loadmat returns one dict per file
for file_name in os.listdir(plane_dir):
    plane_mats.append(loadmat(os.path.join(plane_dir, file_name)))

We check that the dataset was read correctly by looking at the number of samples.

In [26]:
print('The number of samples in the plane dataset is', len(plane_mats))

The number of samples in the plane dataset is 210
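The loop above loads every entry that `os.listdir` returns, in whatever order the file system reports them. As a minimal sketch, assuming the data files use the `.mat` extension (the notebook does not state this), a small helper can skip stray files and make the order reproducible. The helper name `list_mat_files` and the demo file names are hypothetical.

```python
import os
import tempfile


def list_mat_files(directory):
    """Return the sorted paths of the .mat files in `directory` (hypothetical helper)."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".mat")
    )


# Demo on a throwaway directory with two .mat files and one unrelated file.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.mat", "a.mat", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_mat_files(demo_dir)]

print(demo_names)
```

Each returned path could then be passed to `loadmat` in place of the directory listing used in the cell above.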


## Reading the car dataset
We read the 120 files that contain the instances of the car classification problem.

We collect all the instances in a single list called `car_mats`.

In [21]:
car_dir = "../shape_dataset/car_data/"
car_mats = []
# Load every file in the directory; loadmat returns one dict per file
for file_name in os.listdir(car_dir):
    car_mats.append(loadmat(os.path.join(car_dir, file_name)))

We check that the dataset was read correctly by looking at the number of samples.

In [27]:
print('The number of samples in the car dataset is', len(car_mats))

The number of samples in the car dataset is 120


# Preprocessing the dataset

In this problem there are four classes that correspond to the types of vehicles: 'van', 'saab', 'bus' and 'opel'.
To use the classifiers, we need to convert each of these strings in the dataset to an integer between 0 and 3.
That is what we do in the next cell.

In [None]:
# The unique values in the column 'VEHICLE' of the dataset.
Vehicle_Types = all_tables['VEHICLE'].unique()
print(Vehicle_Types)

# We make a copy of the dataset
my_dataset = all_tables.copy()

# A map between the four names and the integers 0 to 3 is created
map_to_int = {name: n for n, name in enumerate(Vehicle_Types)}

# A column is created in the new dataset where names are replaced by numbers
my_dataset['CLASS'] = my_dataset['VEHICLE'].map(map_to_int)

# Finally we delete the column with the names from the new dataset
my_dataset = my_dataset.drop(columns='VEHICLE')

['van' 'saab' 'bus' 'opel']
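To make the encoding concrete, here is a minimal sketch of the same mapping using plain Python lists instead of a dataframe. The sample labels are made up for illustration, and `int_to_name` is a hypothetical addition for decoding predictions back to names.

```python
# Made-up labels for illustration; the real ones come from the 'VEHICLE' column.
labels = ["van", "saab", "bus", "opel", "bus", "van"]

# Unique names in order of first appearance, as pandas' unique() also returns them.
vehicle_types = list(dict.fromkeys(labels))

# enumerate starts at 0, so the four classes are encoded as 0, 1, 2 and 3.
map_to_int = {name: n for n, name in enumerate(vehicle_types)}
encoded = [map_to_int[name] for name in labels]

# Inverse map (hypothetical), useful to turn predicted integers back into names.
int_to_name = {n: name for name, n in map_to_int.items()}

print(map_to_int)
print(encoded)
```

Because the integers follow the order of first appearance, the mapping depends on the row order of the dataset; keeping `map_to_int` around makes later outputs easier to interpret.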


In [None]:
#import numpy as np
#print(np.histogram(my_dataset['CLASS']))

# Preparing data for classification

To apply the classifiers, we need to separate the features and the classes into two different sets.


In [None]:
# The names of the feature columns.
# Note: the slice [1:18] selects the 17 names at positions 1 to 17 of names_attributes.
features_names = names_attributes[1:18]

# all_features contains the features for the 846 samples.
all_features = my_dataset[features_names]

# all_theclass contains the classes (values between 0 and 3) for the 846 samples
all_theclass = my_dataset["CLASS"]

# Dividing the dataset into train and test sets for validation

To evaluate the accuracy of the classifiers, we split the data into two sets: train and test.
Each set has the same number of samples (846/2 = 423).


In [None]:
# We divide the data into two sets (train and test)

# Number of samples in the train and test sets (half of the number of samples)
n_samples = len(all_features) // 2

# The train data are the first half of all_features
train_data = all_features[:n_samples]
train_class = all_theclass[:n_samples]

# The test data are the second half of all_features
test_data = all_features[n_samples:]
test_class = all_theclass[n_samples:]
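The split above takes the first half of the rows for training and the second half for testing, which works well only if the rows are not ordered by class. As a minimal sketch of an alternative, assuming a shuffled split is acceptable, the row positions can be permuted with a fixed seed before halving. The helper name `shuffled_half_split` and the seed value are hypothetical, and the results reported in this notebook were obtained with the unshuffled split.

```python
import random


def shuffled_half_split(n_samples, seed=0):
    """Return two disjoint lists of row positions, each covering half of the data."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


# 846 is the number of samples mentioned in the text above.
train_idx, test_idx = shuffled_half_split(846)
print(len(train_idx), len(test_idx))
```

With pandas objects, the positions could be applied as `all_features.iloc[train_idx]` and `all_theclass.iloc[train_idx]`, and likewise for the test set.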


# Defining the classifiers
We define the three classifiers used.

In [None]:
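from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
dt  = DecisionTreeClassifier()
lda = LinearDiscriminantAnalysis()
lg  = LogisticRegression()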

# Learning the classifiers
We use the train data to fit the three classifiers.

In [None]:
dt.fit(train_data, train_class)
lda.fit(train_data, train_class)
lg.fit(train_data, train_class)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

# Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [None]:
dt_test_predictions = dt.predict(test_data)
lda_test_predictions = lda.predict(test_data)
lg_test_predictions = lg.predict(test_data)

# Computing the accuracy

Next, we compute the accuracy of the three classifiers on the test data and print it.

In [None]:
from sklearn.metrics import accuracy_score

dt_acc = accuracy_score(test_class, dt_test_predictions)
lda_acc = accuracy_score(test_class, lda_test_predictions)
lg_acc = accuracy_score(test_class, lg_test_predictions)
print("Accuracy for the decision tree :", dt_acc)
print("Accuracy for LDA :", lda_acc)
print("Accuracy for logistic regression:", lg_acc)

Accuracy for the decision tree : 0.633569739953
Accuracy for LDA : 0.706855791962
Accuracy for logistic regression: 0.737588652482


# Computing the confusion matrices
Finally, we compute the confusion matrices for the three classifiers. Rows correspond to the true classes and columns to the predicted classes. We print the confusion matrices and also generate the LaTeX code to insert them in our written report.


In [None]:
print("Confusion matrix decision tree")
cm_dt = pd.crosstab(test_class,dt_test_predictions)
print(cm_dt)
cm_dt.to_latex()

Confusion matrix decision tree
col_0   0   1   2   3
CLASS                
0      74   8   5   7
1       6  50   8  54
2       0   6  87   4
3       8  40   9  57


'\\begin{tabular}{lrrrr}\n\\toprule\ncol\\_0 &   0 &   1 &   2 &   3 \\\\\nCLASS &     &     &     &     \\\\\n\\midrule\n0     &  74 &   8 &   5 &   7 \\\\\n1     &   6 &  50 &   8 &  54 \\\\\n2     &   0 &   6 &  87 &   4 \\\\\n3     &   8 &  40 &   9 &  57 \\\\\n\\bottomrule\n\\end{tabular}\n'

In [None]:
print("Confusion matrix LDA")
cm_lda = pd.crosstab(test_class,lda_test_predictions)
print(cm_lda)
cm_lda.to_latex()

Confusion matrix LDA
col_0   0   1   2   3
CLASS                
0      80   6   3   5
1       4  65   3  46
2       1   2  94   0
3       7  45   2  60


'\\begin{tabular}{lrrrr}\n\\toprule\ncol\\_0 &   0 &   1 &   2 &   3 \\\\\nCLASS &     &     &     &     \\\\\n\\midrule\n0     &  80 &   6 &   3 &   5 \\\\\n1     &   4 &  65 &   3 &  46 \\\\\n2     &   1 &   2 &  94 &   0 \\\\\n3     &   7 &  45 &   2 &  60 \\\\\n\\bottomrule\n\\end{tabular}\n'

In [None]:
print("Confusion matrix Logistic regression")
cm_lg = pd.crosstab(test_class,lg_test_predictions)
print(cm_lg)
cm_lg.to_latex()

Confusion matrix Logistic regression
col_0   0   1   2   3
CLASS                
0      83   3   2   6
1       3  74   3  38
2       1   3  93   0
3       5  46   1  62


'\\begin{tabular}{lrrrr}\n\\toprule\ncol\\_0 &   0 &   1 &   2 &   3 \\\\\nCLASS &     &     &     &     \\\\\n\\midrule\n0     &  83 &   3 &   2 &   6 \\\\\n1     &   3 &  74 &   3 &  38 \\\\\n2     &   1 &   3 &  93 &   0 \\\\\n3     &   5 &  46 &   1 &  62 \\\\\n\\bottomrule\n\\end{tabular}\n'
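The accuracies printed earlier can be recovered from these confusion matrices, since accuracy is the sum of the diagonal (correct predictions) divided by the total number of test samples. The following sketch copies the three matrices from the outputs above into plain lists; the function name `accuracy_from_confusion` is hypothetical.

```python
# Confusion matrices copied from the outputs above (rows: true class, columns: predicted).
confusion = {
    "decision tree": [[74, 8, 5, 7], [6, 50, 8, 54], [0, 6, 87, 4], [8, 40, 9, 57]],
    "LDA": [[80, 6, 3, 5], [4, 65, 3, 46], [1, 2, 94, 0], [7, 45, 2, 60]],
    "logistic regression": [[83, 3, 2, 6], [3, 74, 3, 38], [1, 3, 93, 0], [5, 46, 1, 62]],
}


def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracies = {name: accuracy_from_confusion(m) for name, m in confusion.items()}
for name, acc in accuracies.items():
    print(name, acc)
```

In all three matrices most of the errors fall in the cells between classes 1 and 3, which correspond to 'saab' and 'opel' under the mapping printed in the preprocessing step.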