# Project 59 Shape Classification

## Authors: Julen Etxaniz and Ibon Urbina

## Objectives: The goal of the project is to compare different classification algorithms on the plane and car shape datasets.

## What is done in the Notebook: 
### [1.Importing the libraries](#1.-Importing-the-libraries)
### [2.Reading the datasets](#2.-Reading-the-datasets)
### [3.Preprocessing the datasets](#3.-Preprocessing-the-datasets)
### [4.Preparing data for classification](#4.-Preparing-data-for-classification)
### 5.Dividing dataset in train and test sets for validation
### 6.Defining the classifiers
### 7.Learning the classifiers
### 8.Using the classifier for predictions

# 1. Importing the libraries
We start by importing all the libraries used in the notebook.
    

In [218]:
# Enables interaction with the plots
%matplotlib notebook
import os
import matplotlib.pyplot as plt
from scipy.io import loadmat
import re

import pandas as pd
import numpy as np

from scipy.ndimage import gaussian_filter
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from tpot import TPOTClassifier
from sklearn.feature_selection import SelectKBest, chi2

from mpl_toolkits.mplot3d import Axes3D
from sklearn.manifold import TSNE

# 2. Reading the datasets
We read the plane and car datasets. The helper below loads every `.mat` file in a directory, in sorted order.

In [61]:
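def read_mats(directory):
    mats = []
    mats_file_name = []

    # `directory` avoids shadowing the built-in dir()
    files = os.listdir(directory)
    # Files ordered before appending to maintain same order
    sorted_files = sorted(files)
    for file in sorted_files:
        # os.path.join works whether or not the path ends with a separator
        mats.append(loadmat(os.path.join(directory, file)))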
        # To know in which order are we reading the files
        mats_file_name.append(file)
    
    return mats, mats_file_name
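
Note that `sorted` orders file names lexicographically, so a name ending in `Sample10.mat` comes before one ending in `Sample2.mat`. The labels are not affected, because class and sample numbers are parsed from the same ordered list, but the order does determine which rows sit at even and odd positions. If numeric order were preferred, a natural-sort key could be used. The sketch below is only an illustration: `natural_key` is a hypothetical helper that is not part of the notebook, and the file names are made-up examples.

```python
import re

def natural_key(file_name):
    """Split a name into text and integer chunks so that numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', file_name)]

# Made-up file names following the ClassX_SampleY.mat pattern
files = ["Class1_Sample10.mat", "Class1_Sample2.mat", "Class1_Sample1.mat"]

lexicographic = sorted(files)                # what read_mats does
natural = sorted(files, key=natural_key)     # numeric-aware alternative

print(lexicographic)
print(natural)
```

With the natural key, `Sample2` is placed before `Sample10`; with plain `sorted` it is the other way round.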

## 2.1. Reading the plane dataset
We read the 210 files that contain the instances of the plane classification problem.

We collect all the instances in a single list called `plane_mats`.

In [62]:
plane_dir = "shape_data/plane_data/"
plane_mats, plane_mats_file_name = read_mats(plane_dir)

We plot one plane of each shape.

In [237]:
for i in range(7):
    plane = plane_mats[i*30]
    x = plane['x'][:,0]
    y = plane['x'][:,1]
    plt.plot(x, y)
    plt.show()

We check that the dataset is correct by looking at the number of samples.

In [64]:
print('The number of samples in the plane dataset is', len(plane_mats))

The number of samples in the plane dataset is 210


## 2.2. Reading the car dataset
We read the 120 files that contain the instances of the car classification problem.

We collect all the instances in a single list called `car_mats`.

In [65]:
# Ibon
car_dir = "shape_data/car_data/"
car_mats, car_mats_file_name = read_mats(car_dir)

We check that the dataset is correct by looking at the number of samples.

In [66]:
print('The number of samples in the car dataset is', len(car_mats))

The number of samples in the car dataset is 120


# 3. Preprocessing the datasets

### Create dataframe
pandas DataFrames are one of the best ways to represent data, both for their flexibility and for their easy management of information. In the next cell we convert the list that holds all the data we read into a DataFrame.

In [67]:
def get_dataframe(mats):
    df = pd.DataFrame(mats)
    # Drop the metadata keys that loadmat adds to every file
    df = df.drop(['__header__', '__version__', '__globals__'], axis=1)
    return df

### Get class and sample numbers

In [242]:
# Remember we have the names of the files read (in order) in our list called 
# Let's split that list into two lists: one containing the class number and the other the sample number.

def get_samples_classes(mats_file_name):
    class_n = []
    sample_n = []

    for i in mats_file_name:
        class_n.append(int(re.findall(r'\d+', str(i))[0]))
        sample_n.append(int(re.findall(r'\d+', str(i))[1]))
    
    return class_n, sample_n
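
As a hedged illustration of what `get_samples_classes` extracts, the sketch below parses a few file names with a single anchored pattern. The pattern, the helper name `parse_name` and the example names are assumptions based on the `ClassX_SampleY.mat` naming used by the dataset. Unlike `re.findall(r'\d+', ...)`, an anchored pattern raises an error for a file that does not follow the convention instead of silently picking up unrelated digits.

```python
import re

# Assumed naming convention: ClassX_SampleY.mat
NAME_PATTERN = re.compile(r'^Class(\d+)_Sample(\d+)\.mat$')

def parse_name(file_name):
    """Return (class_number, sample_number) for one file name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    return int(match.group(1)), int(match.group(2))

# Made-up names for illustration only
example_names = ["Class1_Sample1.mat", "Class1_Sample10.mat", "Class7_Sample9.mat"]
parsed = [parse_name(name) for name in example_names]
print(parsed)
```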

In [265]:
def add_samples_classes(df, class_n, sample_n):
    df['Class'] = class_n
    df['Sample'] = sample_n

### Check if classes are balanced

In [69]:
def print_class_count(df):
    print("Quantity of samples in each class:")
    print(df['Class'].value_counts())

### Add another feature

In [70]:
def add_perimeter(df):
    length_list = []
    for i in range(len(df)):
        length_list.append(len(df['x'][i]))

    df['Perimeter_length'] = length_list
    
    return df
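
`Perimeter_length` here is the number of contour points, not a geometric length. The two are related when the contour is traced pixel by pixel, but diagonal steps are longer than horizontal or vertical ones. As a sketch of the difference, and assuming each `x` entry is an `(n, 2)` array of ordered points on a closed contour, the Euclidean perimeter could be computed as follows. `euclidean_perimeter` and the rectangle are hypothetical and not part of the notebook.

```python
import numpy as np

def euclidean_perimeter(points):
    """Sum of distances between consecutive points, closing the contour."""
    points = np.asarray(points, dtype=float)
    # np.roll pairs every point with its successor (the last with the first)
    steps = np.roll(points, -1, axis=0) - points
    return np.sqrt((steps ** 2).sum(axis=1)).sum()

# Toy contour: the four corners of a 3 x 4 rectangle
rectangle = np.array([[0, 0], [3, 0], [3, 4], [0, 4]])

n_points = len(rectangle)                  # what add_perimeter stores
length = euclidean_perimeter(rectangle)    # geometric perimeter
print(n_points, length)
```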

### Changing how x feature is represented

In [281]:
def min_length(df):
    return min(len(df['x'][i]) for i in range(len(df['x'])))

In [280]:
def separate_coordinates(df, min_length):
    x_coordinates = []
    y_coordinates = []

    for i in range(len(df['x'])):
        x_coordinates.append(np.resize((df['x'][i])[:,0], (min_length, 1)))
        y_coordinates.append(np.resize((df['x'][i])[:,1], (min_length, 1)))
    
    return x_coordinates, y_coordinates
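
When the target size is smaller than the input, `np.resize` keeps only the first `min_length` points, so the tail of every longer contour is discarded. One possible alternative, shown here only as a sketch and not as what the notebook does, is to resample each contour to a fixed number of points with linear interpolation so that the whole shape is represented. `resample_contour` is a hypothetical helper and the contour is synthetic.

```python
import numpy as np

def resample_contour(points, n_points):
    """Linearly interpolate an (n, 2) contour at n_points evenly spaced positions."""
    points = np.asarray(points, dtype=float)
    old_positions = np.linspace(0.0, 1.0, len(points))
    new_positions = np.linspace(0.0, 1.0, n_points)
    x = np.interp(new_positions, old_positions, points[:, 0])
    y = np.interp(new_positions, old_positions, points[:, 1])
    return np.column_stack((x, y))

# Synthetic contour with 10 points on a straight line
contour = np.column_stack((np.arange(10), 2 * np.arange(10)))

truncated = np.resize(contour[:, 0], (4, 1))   # notebook approach: first 4 x values
resampled = resample_contour(contour, 4)       # alternative: 4 points spanning the contour

print(truncated.ravel())
print(resampled)
```

The truncated version only covers the start of the contour, while the resampled one runs from the first point to the last.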

In [72]:
def get_stacks(x_coordinates, y_coordinates):
    x_stack = x_coordinates[0]
    y_stack = y_coordinates[0]
    
    for i in range(len(x_coordinates)-1):
        x_stack = np.column_stack((x_stack, x_coordinates[i+1]))
        y_stack = np.column_stack((y_stack, y_coordinates[i+1]))
    
    return x_stack, y_stack

In [73]:
# Insert those columns in the dataFrame
def insert_columns(df, x_stack, y_stack):
    for i in range(len(x_stack)):
        stringX = "x" + str(i)
        stringY = "y" + str(i)
        df[stringX] = x_stack[i]
        df[stringY] = y_stack[i]
        
    return df

## 3.1. Preprocessing the plane dataset

In this problem there are seven classes that correspond to the 7 types of planes: (a) Mirage, (b) Eurofighter, (c) F-14 wings closed, (d) F-14 wings opened, (e) Harrier, (f) F-22, (g) F-15. However, the database files are named "ClassX_SampleY.mat", where X is the class number and Y is the sample number.

Here is the correspondence between class number and class name (plane model name):
* 1 = Mirage
* 2 = Eurofighter
* 3 = F-14 wings closed
* 4 = F-14 wings opened
* 5 = Harrier
* 6 = F-22
* 7 = F-15

### Create dataframe

In [251]:
plane_df = get_dataframe(plane_mats)
plane_df

Unnamed: 0,x
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [..."
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [..."
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [..."
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [..."
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [..."
...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [..."
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [..."
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [..."
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [..."


### Get class and sample numbers

In [252]:
plane_class_n, plane_sample_n = get_samples_classes(plane_mats_file_name)

In [253]:
print("This is how our class_n looks like: \n")
np.array(plane_class_n)

This is how our class_n looks like: 



array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])

In [254]:
print("This is how our sample_n looks like: \n")
np.array(plane_sample_n)

This is how our sample_n looks like: 



array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23,
       24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11,
       12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27,
       28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,
        4,  5,  6,  7,  8,  9])

Let's add those lists to the plane DataFrame.

In [264]:
add_samples_classes(plane_df, plane_class_n, plane_sample_n)
print("This is, finally, how our plane dataFrame looks like: \n")
plane_df

This is, finally, how our plane dataFrame looks like: 



Unnamed: 0,x,Class,Sample
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13
...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8


### Classes are balanced? Yes

The description of the database says that each class has 30 samples, but we count them to make sure.

In [79]:
print_class_count(plane_df)

Quantity of samples in each class:
7    30
6    30
5    30
4    30
3    30
2    30
1    30
Name: Class, dtype: int64


### Add another feature

The only feature descriptor of the shapes is x, which refers to the Cartesian coordinates of each point on the perimeter of the shape. However, the number of points in each contour is not available as a separate feature. It is implicitly given by the length of each x sample, but we prefer to make it explicit.

In [80]:
plane_df = add_perimeter(plane_df)

In [81]:
print("This is how our plane dataFrame looks like: \n")
plane_df

This is how our plane dataFrame looks like: 



Unnamed: 0,x,Class,Sample,Perimeter_length
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1,1433
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10,1540
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11,1587
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12,1511
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13,1489
...,...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5,1801
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6,1943
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7,1876
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8,1661


### Changing how x feature is represented

When learning a classifier, it is useful to have features as arrays of numbers and not as arrays of sequences. In our case, x is an array of (x, y) coordinates, so we separate x and y and then create one feature column per coordinate.

In [282]:
min_len = min_length(plane_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(plane_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
plane_df = insert_columns(plane_df, x_stack, y_stack)
plane_df

890


Unnamed: 0,x,Class,Sample,x0,y0,x1,y1,x2,y2,x3,...,x885,y885,x886,y886,x887,y887,x888,y888,x889,y889
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1,64,235,65,234,66,234,67,...,471,264,471,265,471,266,471,267,471,268
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10,60,139,61,138,62,137,63,...,560,304,559,303,558,303,557,302,556,301
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11,60,219,61,218,62,217,63,...,564,246,563,246,562,246,561,246,560,246
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12,54,201,55,200,55,199,56,...,502,227,501,228,500,228,499,228,498,228
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13,64,275,65,274,66,274,67,...,490,234,490,235,490,236,490,237,491,238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5,33,234,34,233,35,232,36,...,533,202,533,203,534,204,534,205,534,206
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6,21,155,22,154,23,153,24,...,586,260,585,259,584,258,583,259,582,259
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7,45,324,46,323,47,322,48,...,597,157,597,158,597,159,597,160,596,161
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8,70,255,71,254,72,254,73,...,531,296,530,297,529,298,528,299,528,300


## 3.2. Preprocessing the car dataset

In this problem there are four classes that correspond to the 4 types of cars: sedan, pickup, minivan, or SUV. However, the database files are named "ClassX_SampleY.mat", where X is the class number and Y is the sample number.

Here is the correspondence between class number and class name (car model name):
* 1 = sedan
* 2 = pickup
* 3 = minivan
* 4 = SUV

### Create dataframe

In [299]:
# Ibon
car_df = pd.DataFrame(car_mats)

This is what the `car_df` DataFrame looks like:


In [300]:
# Ibon
car_df

Unnamed: 0,__header__,__version__,__globals__,x
0,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[113, 181], [114, 180], [114, 179], [114, 178..."
1,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[98, 180], [99, 179], [99, 178], [100, 177], ..."
2,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[70, 180], [71, 180], [72, 179], [73, 178], [..."
3,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[54, 184], [55, 183], [56, 183], [57, 183], [..."
4,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[44, 180], [45, 179], [46, 179], [47, 178], [..."
...,...,...,...,...
115,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[101, 182], [102, 182], [103, 182], [104, 182..."
116,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[46, 180], [47, 180], [48, 179], [48, 178], [..."
117,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[31, 173], [32, 173], [33, 174], [34, 174], [..."
118,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[20, 170], [21, 171], [22, 170], [23, 170], [..."


As we see in the output above, the `__header__`, `__version__` and `__globals__` columns are metadata created by the `loadmat` function when it reads .mat files.

Those columns are not useful, so we delete them.

In [301]:
# Ibon
car_df = car_df.drop(['__header__', '__version__', '__globals__'], axis=1)
car_df

Unnamed: 0,x
0,"[[113, 181], [114, 180], [114, 179], [114, 178..."
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ..."
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [..."
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [..."
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [..."
...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182..."
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [..."
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [..."
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [..."


Now, the only attribute available in our car DataFrame is x, which refers to the Cartesian coordinates of each point on the perimeter of the shape. We need to add more information, such as the class value and the sample number.

In [302]:
# Ibon
# Remember we have the names of the files read (in order) in our list called car_mats_file_name.
# Let's split that list into two lists: one containing the class number and the other the sample number.
car_class_n, car_sample_n = get_samples_classes(car_mats_file_name)

In [303]:
# Ibon
print("This is how our class_n looks like: \n")
np.array(car_class_n)

This is how our class_n looks like: 



array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4])

In [304]:
# Ibon
print("This is how our sample_n looks like: \n")
np.array(car_sample_n)

This is how our sample_n looks like: 



array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9])

Let's add those lists to the car DataFrame.

In [305]:
# Ibon
add_samples_classes(car_df, car_class_n, car_sample_n)
print("This is, finally, how our car dataFrame looks like: \n")
car_df

This is, finally, how our car dataFrame looks like: 



Unnamed: 0,x,Class,Sample
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13
...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8


### Classes are balanced? Yes

The description of the database says that each class has 30 samples, but we count them to make sure.

In [306]:
print_class_count(car_df)

Quantity of samples in each class:
4    30
3    30
2    30
1    30
Name: Class, dtype: int64


### Let's add another feature to our database

As we mentioned before, the only feature descriptor of the shapes is x, which refers to the Cartesian coordinates of each point on the perimeter of the shape. However, the number of points in each contour is not available as a separate feature. It is implicitly given by the length of each x sample, but we prefer to make it explicit.

In [307]:
car_df = add_perimeter(car_df)

In [308]:
print("This is how our car dataFrame looks like: \n")
car_df

This is how our car dataFrame looks like: 



Unnamed: 0,x,Class,Sample,Perimeter_length
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1,310
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10,331
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11,344
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12,334
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13,322
...,...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5,373
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6,358
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7,374
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8,356


### Changing how x feature is represented

When learning a classifier, it is useful to have features as arrays of numbers and not as arrays of sequences. In our case, x is an array of (x, y) coordinates, so we separate x and y and then create one feature column per coordinate.

In [309]:
min_len = min_length(car_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(car_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
car_df = insert_columns(car_df, x_stack, y_stack)
car_df

272


Unnamed: 0,x,Class,Sample,Perimeter_length,x0,y0,x1,y1,x2,y2,...,x267,y267,x268,y268,x269,y269,x270,y270,x271,y271
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1,310,113,181,114,180,114,179,...,150,189,149,189,148,190,147,191,146,191
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10,331,98,180,99,179,99,178,...,140,188,139,188,138,189,139,190,138,190
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11,344,70,180,71,180,72,179,...,131,186,130,187,129,187,128,187,127,187
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12,334,54,184,55,183,56,183,...,108,186,107,187,106,187,105,187,104,188
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13,322,44,180,45,179,46,179,...,84,189,83,189,82,190,81,191,82,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5,373,101,182,102,182,103,182,...,186,188,185,188,184,188,183,188,182,188
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6,358,46,180,47,180,48,179,...,131,186,130,186,129,186,128,186,127,186
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7,374,31,173,32,173,33,174,...,111,187,110,188,109,188,108,189,107,189
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8,356,20,170,21,171,22,170,...,76,189,75,189,74,189,73,189,72,189


# 4. Preparing data for classification

To learn the classifiers, we need to separate the features and the classes into two different sets.


## 4.1. Preparing the plane data

In [321]:
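# Keep every column except the raw contour 'x' and the 'Class' and 'Sample' labels, i.e. the
# 'xJ' and 'yJ' coordinates (J in [0, min_len - 1]) plus 'Perimeter_length' if it was added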
plane_features = plane_df.drop(columns=['x', 'Class', 'Sample'])
plane_features

Unnamed: 0,x0,y0,x1,y1,x2,y2,x3,y3,x4,y4,...,x885,y885,x886,y886,x887,y887,x888,y888,x889,y889
0,64,235,65,234,66,234,67,234,68,233,...,471,264,471,265,471,266,471,267,471,268
1,60,139,61,138,62,137,63,137,64,136,...,560,304,559,303,558,303,557,302,556,301
2,60,219,61,218,62,217,63,217,64,217,...,564,246,563,246,562,246,561,246,560,246
3,54,201,55,200,55,199,56,198,57,197,...,502,227,501,228,500,228,499,228,498,228
4,64,275,65,274,66,274,67,274,68,273,...,490,234,490,235,490,236,490,237,491,238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,33,234,34,233,35,232,36,231,37,231,...,533,202,533,203,534,204,534,205,534,206
206,21,155,22,154,23,153,24,152,25,152,...,586,260,585,259,584,258,583,259,582,259
207,45,324,46,323,47,322,48,321,49,320,...,597,157,597,158,597,159,597,160,596,161
208,70,255,71,254,72,254,73,253,74,253,...,531,296,530,297,529,298,528,299,528,300


Then we put all the class labels in a single Series.

In [322]:
# Ibon
plane_target = plane_df['Class']
plane_target

0      1
1      1
2      1
3      1
4      1
      ..
205    7
206    7
207    7
208    7
209    7
Name: Class, Length: 210, dtype: int64

## 4.2. Preparing the car data

In [323]:
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'  (J in [0, min_len - 1])
car_features = car_df.drop(columns=['x', 'Class', 'Sample'])
car_features

Unnamed: 0,Perimeter_length,x0,y0,x1,y1,x2,y2,x3,y3,x4,...,x267,y267,x268,y268,x269,y269,x270,y270,x271,y271
0,310,113,181,114,180,114,179,114,178,114,...,150,189,149,189,148,190,147,191,146,191
1,331,98,180,99,179,99,178,100,177,101,...,140,188,139,188,138,189,139,190,138,190
2,344,70,180,71,180,72,179,73,178,72,...,131,186,130,187,129,187,128,187,127,187
3,334,54,184,55,183,56,183,57,183,58,...,108,186,107,187,106,187,105,187,104,188
4,322,44,180,45,179,46,179,47,178,48,...,84,189,83,189,82,190,81,191,82,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,373,101,182,102,182,103,182,104,182,105,...,186,188,185,188,184,188,183,188,182,188
116,358,46,180,47,180,48,179,48,178,48,...,131,186,130,186,129,186,128,186,127,186
117,374,31,173,32,173,33,174,34,174,35,...,111,187,110,188,109,188,108,189,107,189
118,356,20,170,21,171,22,170,23,170,24,...,76,189,75,189,74,189,73,189,72,189


Then we put all the class labels in a single Series.

In [324]:
# Ibon
car_target = car_df['Class']
car_target

0      1
1      1
2      1
3      1
4      1
      ..
115    4
116    4
117    4
118    4
119    4
Name: Class, Length: 120, dtype: int64

# 5. Scaling the data

## 5.1. Scaling the plane data

In [231]:
plane_scaler = StandardScaler()
plane_features_scaled = plane_scaler.fit_transform(plane_features)
plane_features_scaled

array([[-0.76525741, -0.22082254, -0.30487452, ...,  0.43795551,
        -0.49593786,  0.45457855],
       [-0.14307215, -0.33949876, -2.22919207, ..., -0.27605975,
         0.67401358, -0.25741195],
       [ 0.13022418, -0.33949876, -0.62559411, ..., -0.02890062,
         0.18476116, -0.0109537 ],
       ...,
       [ 1.81070585, -0.7845346 ,  1.47912821, ...,  1.26181928,
        -0.28321941,  1.24872181],
       [ 0.56052052, -0.04280821,  0.09602497, ...,  0.87734952,
         0.48256698,  0.89272656],
       [ 1.62463176, -0.69552743, -0.34496447, ...,  0.05348576,
         0.16348932,  0.0438148 ]])

## 5.2. Scaling the car data

In [232]:
car_scaler = StandardScaler()
car_features_scaled = car_scaler.fit_transform(car_features)
car_features_scaled

array([[-1.06728963,  0.58628083,  0.59700163, ...,  0.45107198,
        -0.7215801 ,  0.4520606 ],
       [-0.89957934,  0.31002285,  0.52449131, ...,  0.83341351,
        -0.83429763,  0.8375742 ],
       [-0.79575869, -0.20565872,  0.52449131, ...,  0.77879329,
        -0.93574342,  0.78250083],
       ...,
       [-0.55617256, -0.92392947,  0.01691907, ...,  0.66955285,
        -1.19499375,  0.67235409],
       [-0.69992424, -1.12651866, -0.20061188, ...,  0.83341351,
        -1.58950513,  0.8375742 ],
       [-0.88360693, -0.83184348,  0.16193971, ...,  0.88803373,
        -1.52187461,  0.89264757]])
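
`StandardScaler` standardizes each column to zero mean and unit variance, `z = (x - mean) / std`, using the population standard deviation. In the cells above the scaler is fitted on all samples. When a held-out test set is used, the statistics are commonly estimated on the training rows only and then reused for the test rows. The NumPy sketch below illustrates that idea on synthetic numbers with an even/odd row split. The variable names are hypothetical, and it is not a drop-in replacement for the notebook cells.

```python
import numpy as np

# Synthetic feature matrix: 6 samples, 2 features
features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0],
                     [4.0, 40.0],
                     [5.0, 50.0],
                     [6.0, 60.0]])

train = features[0::2]   # even rows
test = features[1::2]    # odd rows

# Statistics estimated on the training rows only (ddof=0, population std)
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```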

# 5. Dividing datasets in train and test sets for validation

To evaluate the accuracy of the classifiers, we split each dataset into two sets: train and test.
Each set has the same number of samples of each class (15).
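
The count of 15 samples per class follows from the file order: every class occupies 30 consecutive rows, so taking every second row keeps half of each class. A quick sketch with a synthetic label list (not the real data) confirms the counts for the plane layout.

```python
from collections import Counter

# Synthetic labels mimicking the plane layout: 7 classes, 30 consecutive rows each
labels = [c for c in range(1, 8) for _ in range(30)]

train_labels = labels[0::2]   # even rows
test_labels = labels[1::2]    # odd rows

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)
print(train_counts)
print(test_counts)
```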

## 5.1. Dividing the plane data

In [310]:
# Ibon

# We divide the plane dataset into training and test data. Even rows are in the train set
# and odd rows in the test set.

# Train set 
plane_train_features = plane_features[0::2]
plane_train_target = plane_target[0::2]
plane_train_n_samples = len(plane_train_target)

# Test set
plane_test_features = plane_features[1::2]
plane_test_target = plane_target[1::2]
plane_test_n_samples = len(plane_test_target)

## 5.2. Dividing the car data

In [311]:
# Ibon

# We divide the car dataset into training and test data. Even rows are in the train set
# and odd rows in the test set.

# Train set 
car_train_features = car_features[0::2]
car_train_target = car_target[0::2]
car_train_n_samples = len(car_train_target)

# Test set
car_test_features = car_features[1::2]
car_test_target = car_target[1::2]
car_test_n_samples = len(car_test_target)

# 6. Feature extraction / Dimensionality reduction

# 6.1. Using PCA

## PCA for the plane data

In [335]:
plane_pca = PCA(n_components=3)
plane_pca.fit(plane_train_features)

PCA(n_components=3)

In [336]:
plane_train_features_trans = plane_pca.transform(plane_train_features)
plane_test_features_trans = plane_pca.transform(plane_test_features)

In [337]:
print(plane_test_features.shape, plane_test_features_trans.shape)

(105, 1780) (105, 3)
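
To make the reduction from 1780 columns to 3 more concrete, here is a minimal NumPy sketch of what a PCA projection computes: center the data with the training mean, take the leading right singular vectors, and project. It runs on random synthetic data, the names are hypothetical, and it is meant as an illustration of the technique rather than a reimplementation of scikit-learn's `PCA`. The explained variance ratio indicates how much of the variance the kept components retain.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 8))   # synthetic: 20 samples, 8 features
test = rng.normal(size=(5, 8))

n_components = 3

# Center with the training mean, then decompose
mean = train.mean(axis=0)
centered = train - mean
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

components = vt[:n_components]              # principal directions (rows)
train_proj = centered @ components.T        # shape (20, 3)
test_proj = (test - mean) @ components.T    # same projection applied to the test rows

# Fraction of total variance captured by each kept component
variances = singular_values ** 2
explained_variance_ratio = variances[:n_components] / variances.sum()

print(train_proj.shape, test_proj.shape)
print(explained_variance_ratio)
```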


In [338]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('Comp. 1')
ax.set_ylabel('Comp. 2')
ax.set_zlabel('Comp. 3')

for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]


In [206]:
plt.close(fig)

In [346]:
plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.6190476190476191
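
`accuracy_score` is the fraction of test samples whose predicted label equals the true label, so with 105 test samples a value of 0.619 corresponds to 65 correct predictions. The pure-Python sketch below shows the computation on made-up labels. `accuracy` is a hypothetical helper written for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels for illustration only
true_labels = [1, 1, 2, 2, 3, 3, 4, 4]
predicted_labels = [1, 2, 2, 2, 3, 1, 4, 4]

acc = accuracy(true_labels, predicted_labels)
print(acc)

# Relating the reported plane score to a count of correct predictions
print(65 / 105)
```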


## PCA for the car data

In [339]:
car_pca = PCA(n_components=3)
car_pca.fit(car_train_features)

PCA(n_components=3)

In [341]:
car_train_features_trans = car_pca.transform(car_train_features)
car_test_features_trans = car_pca.transform(car_test_features)

In [342]:
print(car_test_features.shape, car_test_features_trans.shape)

(60, 601) (60, 3)


In [343]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans[car_test_target==i, 0], car_test_features_trans[car_test_target==i, 1], \
               car_test_features_trans[car_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('Comp. 1')
ax.set_ylabel('Comp. 2')
ax.set_zlabel('Comp. 3')

for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]


In [211]:
plt.close(fig)

In [354]:
car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.5


# 6.2. Using TSNE

## TSNE for the plane data

In [226]:
plane_tsne = TSNE(n_components=3)
plane_tsne_trans3 = plane_tsne.fit_transform(plane_train_features)

In [227]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_tsne_trans3[plane_train_target==i, 0], plane_tsne_trans3[plane_train_target==i, 1], plane_tsne_trans3[plane_train_target==i, 2], label=str(i))

ax.set_xlabel('TSNE dim. 1')
ax.set_ylabel('TSNE dim. 2')
ax.set_zlabel('TSNE dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]


In [228]:
plt.close(fig)

## TSNE for the car data

In [None]:
car_tsne = TSNE(n_components=3)
car_tsne_trans3 = car_tsne.fit_transform(car_train_features)

In [None]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_tsne_trans3[car_train_target==i, 0], car_tsne_trans3[car_train_target==i, 1], car_tsne_trans3[car_train_target==i, 2], label=str(i))

ax.set_xlabel('TSNE dim. 1')
ax.set_ylabel('TSNE dim. 2')
ax.set_zlabel('TSNE dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]

In [None]:
plt.close(fig)

# 6.3. Using LDA

## LDA for the plane data

In [350]:
plane_lda = LinearDiscriminantAnalysis(n_components=3)
plane_lda.fit(plane_train_features, plane_train_target)

LinearDiscriminantAnalysis(n_components=3)

In [351]:
plane_train_features_trans = plane_lda.transform(plane_train_features)
plane_test_features_trans = plane_lda.transform(plane_test_features)

In [366]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]


In [365]:
plt.close(fig)

In [353]:
plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.9238095238095239


## LDA for the car data

In [361]:
car_lda = LinearDiscriminantAnalysis(n_components=3)
car_lda.fit(car_train_features, car_train_target)

In [362]:
car_train_features_trans = car_lda.transform(car_train_features)
car_test_features_trans = car_lda.transform(car_test_features)

In [363]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    # Mask the test features with the test targets (not the train targets)
    ax.scatter(car_test_features_trans[car_test_target==i, 0], car_test_features_trans[car_test_target==i, 1], \
               car_test_features_trans[car_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]

In [None]:
plt.close(fig)

In [364]:
car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.8833333333333333


# 6. Using TPOT to generate an optimal pipeline

## 6.1. TPOT for the plane data

In [239]:
plane_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
plane_tpot.fit(features=plane_train_features, target=plane_train_target)
plane_tpot.fitted_pipeline_.steps




Generation 1 - Current best internal CV score: 0.9238095238095237
Generation 2 - Current best internal CV score: 0.9238095238095237
Generation 3 - Current best internal CV score: 0.9238095238095237
Generation 4 - Current best internal CV score: 0.9238095238095237
Generation 5 - Current best internal CV score: 0.9333333333333332
Best pipeline: MLPClassifier(RandomForestClassifier(LinearSVC(input_matrix, C=5.0, dual=True, loss=squared_hinge, penalty=l2, tol=0.001), bootstrap=False, criterion=entropy, max_features=0.25, min_samples_leaf=5, min_samples_split=11, n_estimators=100), alpha=0.0001, learning_rate_init=0.001)


[('stackingestimator-1',
  StackingEstimator(estimator=LinearSVC(C=5.0, random_state=16, tol=0.001))),
 ('stackingestimator-2',
  StackingEstimator(estimator=RandomForestClassifier(bootstrap=False,
                                                     criterion='entropy',
                                                     max_features=0.25,
                                                     min_samples_leaf=5,
                                                     min_samples_split=11,
                                                     random_state=16))),
 ('mlpclassifier', MLPClassifier(random_state=16))]

In [None]:
plane_tpot_test_accuracy = plane_tpot.score(plane_test_features, plane_test_target)
print('Test accuracy of the TPOT pipeline on the plane data:', plane_tpot_test_accuracy)

## 6.2. TPOT for the car data

In [None]:
car_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
car_tpot.fit(features=car_train_features, target=car_train_target)
car_tpot.fitted_pipeline_.steps

In [233]:
car_tpot_test_accuracy = car_tpot.score(car_test_features, car_test_target)
print('Test accuracy of the TPOT pipeline on the car data:', car_tpot_test_accuracy)




Generation 1 - Current best internal CV score: 0.8
Generation 2 - Current best internal CV score: 0.8
Generation 3 - Current best internal CV score: 0.8
Generation 4 - Current best internal CV score: 0.8
Generation 5 - Current best internal CV score: 0.8166666666666668
Best pipeline: RandomForestClassifier(GradientBoostingClassifier(LinearSVC(input_matrix, C=0.1, dual=False, loss=squared_hinge, penalty=l2, tol=0.1), learning_rate=0.5, max_depth=4, max_features=0.9000000000000001, min_samples_leaf=16, min_samples_split=8, n_estimators=100, subsample=0.9000000000000001), bootstrap=False, criterion=gini, max_features=0.8, min_samples_leaf=12, min_samples_split=12, n_estimators=100)
0.8333333333333334


# 7. Classification

## Defining the classifiers
We define the three classifiers used.

In [325]:
plane_dt  = DecisionTreeClassifier()
plane_lda = LinearDiscriminantAnalysis()
plane_lg  = LogisticRegression()

In [326]:
car_dt  = DecisionTreeClassifier()
car_lda = LinearDiscriminantAnalysis()
car_lg  = LogisticRegression()

## Learning the classifiers
We use the training data to fit the three classifiers.

In [327]:
# Ibon

plane_dt.fit(plane_train_features, plane_train_target)
plane_lda.fit(plane_train_features, plane_train_target)
plane_lg.fit(plane_train_features, plane_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [328]:
# Ibon

car_dt.fit(car_train_features, car_train_target)
car_lda.fit(car_train_features, car_train_target)
car_lg.fit(car_train_features, car_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
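
The warning above reports that the logistic regression solver stopped at its iteration limit, and it suggests either raising `max_iter` or scaling the data. A minimal sketch of both remedies combined is shown below. The synthetic dataset, the value `max_iter=1000`, and the names `demo_X`, `demo_y`, and `scaled_lg` are assumptions for illustration, not settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the car data: 120 samples, 4 classes
demo_X, demo_y = make_classification(
    n_samples=120, n_features=50, n_informative=10,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
demo_X_train, demo_X_test = demo_X[0::2], demo_X[1::2]
demo_y_train, demo_y_test = demo_y[0::2], demo_y[1::2]

# The scaler is fitted on the training data only and then reused for the test data
scaled_lg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lg.fit(demo_X_train, demo_y_train)

demo_acc = scaled_lg.score(demo_X_test, demo_y_test)
print("Accuracy for scaled logistic regression:", demo_acc)
```

Wrapping both steps in one pipeline means a single `fit` and `predict` call handles scaling and classification consistently.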

## Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [329]:
# Ibon

plane_dt_test_predictions = plane_dt.predict(plane_test_features)
plane_lda_test_predictions = plane_lda.predict(plane_test_features)
plane_lg_test_predictions = plane_lg.predict(plane_test_features)

In [330]:
# Ibon

car_dt_test_predictions = car_dt.predict(car_test_features)
car_lda_test_predictions = car_lda.predict(car_test_features)
car_lg_test_predictions = car_lg.predict(car_test_features)

# 8. Validation

## Computing the accuracy

Next, we compute and print the test accuracy of each of the three classifiers.

In [331]:
def print_accuracies(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    dt_acc =  accuracy_score(test_target, dt_test_predictions)
    lda_acc =  accuracy_score(test_target, lda_test_predictions)
    lg_acc =  accuracy_score(test_target, lg_test_predictions)
    print("Accuracy for the decision tree :", dt_acc)
    print("Accuracy for LDA :", lda_acc)
    print("Accuracy for logistic regression:", lg_acc)

In [332]:
print_accuracies(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)

Accuracy for the decision tree : 0.780952380952381
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.8666666666666667


In [333]:
print_accuracies(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)

Accuracy for the decision tree : 0.7333333333333333
Accuracy for LDA : 0.8833333333333333
Accuracy for logistic regression: 0.8833333333333333


## Computing the confusion matrices
Finally, we compute the confusion matrices for the three classifiers. We print them, and the commented-out `to_latex()` calls can generate the LaTeX code to insert them in our written report.


In [296]:
def print_confusion_matrices(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    print("Confusion matrix decision tree")
    cm_dt = pd.crosstab(test_target, dt_test_predictions)
    print(cm_dt)
    print()
    #cm_dt.to_latex()
    
    print("Confusion matrix LDA")
    cm_lda = pd.crosstab(test_target, lda_test_predictions)
    print(cm_lda)
    print()
    #cm_lda.to_latex()
    
    print("Confusion matrix Logistic regression")
    cm_lg = pd.crosstab(test_target, lg_test_predictions)
    print(cm_lg)
    print()
    #cm_lg.to_latex()
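
To make the layout of a confusion matrix concrete, here is a small worked example with hand-written labels. It uses `sklearn.metrics.confusion_matrix` rather than the `pd.crosstab` call above; this is a swapped-in alternative, chosen because passing `labels` keeps a column for every class even when a class is never predicted. The arrays `demo_target` and `demo_predictions` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true classes and predictions for five test samples
demo_target = np.array([1, 1, 2, 2, 3])
demo_predictions = np.array([1, 2, 2, 2, 3])

# Rows are true classes, columns are predicted classes, both in the order given by labels
demo_cm = confusion_matrix(demo_target, demo_predictions, labels=[1, 2, 3])
print(demo_cm)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]

# The diagonal counts the correct predictions, so accuracy = trace / total
demo_accuracy = np.trace(demo_cm) / demo_cm.sum()
print(demo_accuracy)  # 0.8
```

The single off-diagonal entry in the first row is the class 1 sample that was predicted as class 2.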

## 6.1. Confusion matrices for the plane data

In [None]:
print_confusion_matrices(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)

## 6.2. Confusion matrices for the car data

In [None]:
print_confusion_matrices(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)

In [None]:
# Applying a Gaussian filter to our shapes (currently disabled)
# Filtered with a Gaussian filter (standard deviation = 10).
# Gaussian filter: https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
#for i in range(len(car_df['Contour_Coordinates'])):
#    car_df.at[i, 'Contour_Coordinates'] = gaussian_filter(np.array(car_df['Contour_Coordinates'][i]), sigma=10)
#print("This is what our filtered shapes look like: \n")
#car_df

In [None]:
car_features

# Feature Selection

At this point each shape is described by a large number of features: two coordinates for each point on the contour perimeter. We normalized the length of each contour to 300 points, so each image is represented by 300 × 2 = 600 features. Let's try reducing this number.

We use `SelectKBest` with the chi-squared score function (`chi2`, which expects non-negative feature values) for our feature selection process:

In [None]:
reduced_data_100 = SelectKBest(chi2, k=100).fit_transform(car_features, car_target)

reduced_data_200 = SelectKBest(chi2, k=200).fit_transform(car_features, car_target)

reduced_data_300 = SelectKBest(chi2, k=300).fit_transform(car_features, car_target)

reduced_data_400 = SelectKBest(chi2, k=400).fit_transform(car_features, car_target)

reduced_data_500 = SelectKBest(chi2, k=500).fit_transform(car_features, car_target)
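
The cell above scores the features on all samples before the train/test split, so the test labels take part in the selection. One possible alternative, sketched here on synthetic data, is to put the selector in a pipeline so that it is fitted on the training half only. The data, the `MinMaxScaler` step (added so that `chi2` sees non-negative values), and all `demo_` and `selector_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for car_features / car_target: 120 samples, 600 features, 4 classes
demo_target = np.repeat([1, 2, 3, 4], 30)
demo_features = rng.normal(size=(120, 600))
demo_features[:, :20] += demo_target[:, None]  # make the first 20 features informative

demo_train_X, demo_test_X = demo_features[0::2], demo_features[1::2]
demo_train_y, demo_test_y = demo_target[0::2], demo_target[1::2]

# Scaling, selection and classification are all fitted on the training half only
selector_model = make_pipeline(
    MinMaxScaler(),
    SelectKBest(chi2, k=100),
    LogisticRegression(max_iter=1000),
)
selector_model.fit(demo_train_X, demo_train_y)

demo_kept = int(selector_model.named_steps["selectkbest"].get_support().sum())
demo_test_acc = selector_model.score(demo_test_X, demo_test_y)
print(demo_kept, demo_test_acc)
```

The boolean mask from `get_support()` also shows which of the 600 contour coordinates were retained, which can help interpret the selection.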

As before, we split the data into training features, test features, and the corresponding class targets:

In [None]:
reduced_data_train_100 = reduced_data_100[0::2]
reduced_data_test_100 = reduced_data_100[1::2]

reduced_data_train_200 = reduced_data_200[0::2]
reduced_data_test_200 = reduced_data_200[1::2]

reduced_data_train_300 = reduced_data_300[0::2]
reduced_data_test_300 = reduced_data_300[1::2]

reduced_data_train_400 = reduced_data_400[0::2]
reduced_data_test_400 = reduced_data_400[1::2]

reduced_data_train_500 = reduced_data_500[0::2]
reduced_data_test_500 = reduced_data_500[1::2]


car_train_target = car_target[0::2]
car_test_target = car_target[1::2]
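
The slices `[0::2]` and `[1::2]` send even-indexed samples to the training set and odd-indexed samples to the test set. The sketch below illustrates this on a toy array and contrasts it with `train_test_split` using `stratify`, a possible alternative that the notebook does not use. All `demo_` and `strat_` names and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 120 samples, 4 classes with 30 samples each
demo_target = np.repeat([1, 2, 3, 4], 30)
demo_features = np.arange(120 * 2).reshape(120, 2)

# Even/odd slicing as in the notebook: deterministic halves of equal size
demo_train, demo_test = demo_features[0::2], demo_features[1::2]
print(demo_train.shape, demo_test.shape)  # (60, 2) (60, 2)

# Alternative: a shuffled split that preserves the class proportions
strat_train, strat_test, strat_train_y, strat_test_y = train_test_split(
    demo_features, demo_target, test_size=0.5, stratify=demo_target, random_state=0
)
strat_counts = np.bincount(strat_test_y)[1:]  # samples per class in the test half
print(strat_counts)
```

The even/odd split is only balanced when the samples of each class are stored contiguously or otherwise evenly interleaved, whereas the stratified split balances the classes regardless of file order.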

We create a separate classifier of each type for every feature subset so that the fitted models do not get mixed up:

In [None]:
car_dt_100  = DecisionTreeClassifier()
car_lda_100 = LinearDiscriminantAnalysis()
car_lg_100  = LogisticRegression()

car_dt_200  = DecisionTreeClassifier()
car_lda_200 = LinearDiscriminantAnalysis()
car_lg_200  = LogisticRegression()

car_dt_300  = DecisionTreeClassifier()
car_lda_300 = LinearDiscriminantAnalysis()
car_lg_300  = LogisticRegression()

car_dt_400  = DecisionTreeClassifier()
car_lda_400 = LinearDiscriminantAnalysis()
car_lg_400  = LogisticRegression()

car_dt_500  = DecisionTreeClassifier()
car_lda_500 = LinearDiscriminantAnalysis()
car_lg_500  = LogisticRegression()

We fit each classifier on its reduced training set and predict on the corresponding test set:

In [None]:
car_dt_100.fit(reduced_data_train_100, car_train_target)
car_lda_100.fit(reduced_data_train_100, car_train_target)
car_lg_100.fit(reduced_data_train_100, car_train_target)

car_dt_test_predictions_100 = car_dt_100.predict(reduced_data_test_100)
car_lda_test_predictions_100 = car_lda_100.predict(reduced_data_test_100)
car_lg_test_predictions_100 = car_lg_100.predict(reduced_data_test_100)

car_dt_200.fit(reduced_data_train_200, car_train_target)
car_lda_200.fit(reduced_data_train_200, car_train_target)
car_lg_200.fit(reduced_data_train_200, car_train_target)

car_dt_test_predictions_200 = car_dt_200.predict(reduced_data_test_200)
car_lda_test_predictions_200 = car_lda_200.predict(reduced_data_test_200)
car_lg_test_predictions_200 = car_lg_200.predict(reduced_data_test_200)

car_dt_300.fit(reduced_data_train_300, car_train_target)
car_lda_300.fit(reduced_data_train_300, car_train_target)
car_lg_300.fit(reduced_data_train_300, car_train_target)

car_dt_test_predictions_300 = car_dt_300.predict(reduced_data_test_300)
car_lda_test_predictions_300 = car_lda_300.predict(reduced_data_test_300)
car_lg_test_predictions_300 = car_lg_300.predict(reduced_data_test_300)

car_dt_400.fit(reduced_data_train_400, car_train_target)
car_lda_400.fit(reduced_data_train_400, car_train_target)
car_lg_400.fit(reduced_data_train_400, car_train_target)

car_dt_test_predictions_400 = car_dt_400.predict(reduced_data_test_400)
car_lda_test_predictions_400 = car_lda_400.predict(reduced_data_test_400)
car_lg_test_predictions_400 = car_lg_400.predict(reduced_data_test_400)

car_dt_500.fit(reduced_data_train_500, car_train_target)
car_lda_500.fit(reduced_data_train_500, car_train_target)
car_lg_500.fit(reduced_data_train_500, car_train_target)

car_dt_test_predictions_500 = car_dt_500.predict(reduced_data_test_500)
car_lda_test_predictions_500 = car_lda_500.predict(reduced_data_test_500)
car_lg_test_predictions_500 = car_lg_500.predict(reduced_data_test_500)
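
The five blocks above repeat the same fit and predict steps for each value of `k`. The same experiment can be written as a loop that stores the results in a dictionary. This is a sketch on synthetic data: it mirrors the notebook's procedure (selection on all samples, then an even/odd split), but the data, the `max_iter` value, and the names `demo_features`, `demo_target`, and `accuracies` are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in for car_features / car_target (non-negative, as chi2 requires)
demo_target = np.repeat([1, 2, 3, 4], 30)
demo_features = rng.uniform(0, 1, size=(120, 600))
demo_features[:, :50] += 0.3 * demo_target[:, None]

demo_train_y, demo_test_y = demo_target[0::2], demo_target[1::2]

accuracies = {}
for k in (100, 200, 300, 400, 500):
    reduced = SelectKBest(chi2, k=k).fit_transform(demo_features, demo_target)
    train_X, test_X = reduced[0::2], reduced[1::2]
    # Fresh classifiers for every k, so fitted models are never shared
    models = {
        "dt": DecisionTreeClassifier(random_state=0),
        "lda": LinearDiscriminantAnalysis(),
        "lg": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(train_X, demo_train_y)
        accuracies[(k, name)] = accuracy_score(demo_test_y, model.predict(test_X))

for (k, name), acc in accuracies.items():
    print(k, name, acc)
```

Keeping the results keyed by `(k, name)` makes it straightforward to tabulate or plot accuracy against the number of selected features.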

Let's see how the accuracy evolves with the number of features selected:

In [None]:
print("100 features: \n")
print_accuracies(car_test_target, car_dt_test_predictions_100, car_lda_test_predictions_100, car_lg_test_predictions_100)

In [None]:
print("200 features: \n")
print_accuracies(car_test_target, car_dt_test_predictions_200, car_lda_test_predictions_200, car_lg_test_predictions_200)

In [None]:
print("300 features: \n")
print_accuracies(car_test_target, car_dt_test_predictions_300, car_lda_test_predictions_300, car_lg_test_predictions_300)

In [None]:
print("400 features: \n")
print_accuracies(car_test_target, car_dt_test_predictions_400, car_lda_test_predictions_400, car_lg_test_predictions_400)

In [None]:
print("500 features: \n")
print_accuracies(car_test_target, car_dt_test_predictions_500, car_lda_test_predictions_500, car_lg_test_predictions_500)