# Project 59 Shape Classification

## Authors: Julen Etxaniz and Ibon Urbina

## Objectives: The goal of the project is to compare different classification algorithms on the solution of plane and car shape datasets. 

## Contents: 
### [1.Importing the libraries](#1.-Importing-the-libraries)
### [2.Reading the datasets](#2.-Reading-the-datasets)
### [3.Preprocessing the datasets](#3.-Preprocessing-the-datasets)
### [4.Scaling the data](#4.-Scaling-the-data)
### [5.Dividing train and test data](#5.-Dividing-train-and-test-data)
### [6.Classification](#6.-Classification)
### [7.Validation](#7.-Validation)
### [8.Feature Selection](#8.-Feature-Selection)
### [9.Feature Extraction](#9.-Feature-Extraction)
### [10.Pipeline Optimization](#10.-Pipeline-Optimization)

# 1. Importing the libraries
 We start by importing all relevant libraries to be used in the notebook.
    

In [21]:
# Reading data
from os import listdir
from scipy.io import loadmat
from re import findall

# Preprocessing
import pandas as pd
import numpy as np

# Scaling
from sklearn.preprocessing import StandardScaler

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Validation
from sklearn.metrics import accuracy_score

# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Feature Extraction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Pipeline Optimization
from tpot import TPOTClassifier

# Plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Enables interaction with the plots
%matplotlib notebook

# 2. Reading the datasets
We read the plane and car datasets

In [22]:
def read_mats(dir):
    mats = []
    mats_file_name = []

    files = os.listdir(dir)
    # Files ordered before appending to maintain same order
    sorted_files = sorted(files)
    for file in sorted_files:
        mats.append(loadmat(dir + file))
        # To know in which order are we reading the files
        mats_file_name.append(file)
    
    return mats, mats_file_name

## 2.1. Reading the plane dataset
We read the 210 files that contain the instances of the plane classification problem.

We concatenate all the instances in a unique dataframe called "plane_mats".

In [23]:
plane_dir = "shape_data/plane_data/"
plane_mats, plane_mats_file_name = read_mats(plane_dir)

We plot one plane of each shape.

In [24]:
for i in range(7):
    plane = plane_mats[i*30]
    x = plane['x'][:,0]
    y = plane['x'][:,1]
    plt.plot(x, y)
    plt.show()

<IPython.core.display.Javascript object>

We check the dataset is correct, looking at the number of samples

In [25]:
print('The number of samples in the plane dataset is', len(plane_mats))

The number of samples in the plane dataset is 210


## 2.2. Reading the car dataset
We read the 120 files that contain the instances of the car classification problem.

We concatenate all the instances in a unique dataframe called "car_mats"

In [26]:
# Ibon
car_dir = "shape_data/car_data/"
car_mats, car_mats_file_name = read_mats(car_dir)

We check the dataset is correct, looking at the number of samples

In [27]:
print('The number of samples in the car dataset is', len(car_mats))

The number of samples in the car dataset is 120


# 3. Preprocessing the datasets

### Create dataframe
One of the best ways to represent data are pandas DataFrames. Either for their flexibility and eassy management of information. That's what we are going to do in the next cell: convert the list where we read all the data to a DataFrame.

In [28]:
def get_dataframe(mats):
    df = pd.DataFrame(mats)
    df
    df = df.drop(['__header__', '__version__', '__globals__'], axis=1)
    df
    return df

### Get class and sample numbers

In [29]:
# Remember we have the names of the files read (in order) in our list called 
# Lets, divide that array in two arrays. One containing the class number and the other the sample number.

def get_samples_classes(mats_file_name):
    class_n = []
    sample_n = []

    for i in mats_file_name:
        class_n.append(int(re.findall(r'\d+', str(i))[0]))
        sample_n.append(int(re.findall(r'\d+', str(i))[1]))
    
    return class_n, sample_n

In [30]:
def add_samples_classes(df, class_n, sample_n):
    df['Class'] = class_n
    df['Sample'] = sample_n

### Check if classes are balanced

In [31]:
def print_class_count(df):
    print("Quantity of samples in each class:")
    print(df['Class'].value_counts())

### Add another feature

In [32]:
def add_perimeter(df):
    length_list = []
    for i in range(len(df)):
        length_list.append(len(df['x'][i]))

    df['Perimeter_length'] = length_list
    
    return df

### Changing how x feature is represented

In [33]:
def min_length(df):
    # return min(len(df['x'][i]) for i in range(len(df['x'])))
    return min(df['Perimeter_length'][i] for i in range(len(df['Perimeter_length'])))

In [34]:
def separate_coordinates(df, min_length):
    x_coordinates = []
    y_coordinates = []

    for i in range(len(df['x'])):
        x_coordinates.append(np.resize((df['x'][i])[:,0], (min_length, 1)))
        y_coordinates.append(np.resize((df['x'][i])[:,1], (min_length, 1)))
    
    return x_coordinates, y_coordinates

In [35]:
def get_stacks(x_coordinates, y_coordinates):
    x_stack = x_coordinates[0]
    y_stack = y_coordinates[0]
    
    for i in range(len(x_coordinates)-1):
        x_stack = np.column_stack((x_stack, x_coordinates[i+1]))
        y_stack = np.column_stack((y_stack, y_coordinates[i+1]))
    
    return x_stack, y_stack

In [36]:
# Insert those columns in the dataFrame
def insert_columns(df, x_stack, y_stack):
    for i in range(len(x_stack)):
        stringX = "x" + str(i)
        stringY = "y" + str(i)
        df[stringX] = x_stack[i]
        df[stringY] = y_stack[i]
        
    return df

### Preparing data for classification
To learn the classifiers, we need to separate in two different sets the features and the classes. 

In [37]:
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'
# Then we are going to put all Classes in a unique structure.
def get_features_target(df):
    features = df.drop(columns=['x', 'Class', 'Sample'])
    target = df['Class']
    
    return features, target

## 3.1. Preprocessing the plane dataset

In this problem there are four classes that correspond to the 7 types of planes: (a) Mirage, (b) Eurofighter, (c) F-14 wings closed, (d) F-14 wings opened, (e) Harrier, (f) F-22, (g) F-15. However, in the database files are written like this: "ClassX_SampleY.mat", where X is the corresponding class number and Y the corresponding sample number. 

Here is the correspondance of class number and class name (plane model name):
* 1 = Mirage
* 2 = Eurofighter
* 3 = F-14 wings closed
* 4 = F-14 wings opened
* 5 = Harrier
* 6 = F-22
* 7 = F-15

### Create dataframe

In [38]:
plane_df = get_dataframe(plane_mats)
plane_df

Unnamed: 0,x
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [..."
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [..."
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [..."
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [..."
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [..."
...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [..."
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [..."
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [..."
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [..."


### Get class and sample numbers

In [39]:
plane_class_n, plane_sample_n = get_samples_classes(plane_mats_file_name)

In [40]:
print("This is how our class_n looks like: \n")
np.array(plane_class_n)

This is how our class_n looks like: 



array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])

In [41]:
print("This is how our sample_n looks like: \n")
np.array(plane_sample_n)

This is how our sample_n looks like: 



array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23,
       24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11,
       12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27,
       28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,
        4,  5,  6,  7,  8,  9])

Lets add those lists to the car DataFrame.

In [42]:
add_samples_classes(plane_df, plane_class_n, plane_sample_n)
print("This is, finally, how our plane dataFrame looks like: \n")
plane_df

This is, finally, how our plane dataFrame looks like: 



Unnamed: 0,x,Class,Sample
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13
...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8


### Classes are balanced? Yes

Although in the description of the database it is said that each class has 30 samples, to make sure about it we are going to count them.

In [43]:
print_class_count(plane_df)

Quantity of samples in each class:
7    30
6    30
5    30
4    30
3    30
2    30
1    30
Name: Class, dtype: int64


### Add another feature

As we mention before, the only feature descriptor of the shapes is x, which  which refers to cartesian coordinates of each point on the perimeter of the shape. However, how many points are in each contour perimeter is not taken as a unique feature. It is implicitly measure in the length of each x sample, but, we prefer make it explicit.

In [44]:
plane_df = add_perimeter(plane_df)

In [45]:
print("This is how our plane dataFrame looks like: \n")
plane_df

This is how our plane dataFrame looks like: 



Unnamed: 0,x,Class,Sample,Perimeter_length
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1,1433
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10,1540
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11,1587
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12,1511
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13,1489
...,...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5,1801
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6,1943
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7,1876
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8,1661


### Changing how x feature is represented

When learning a classifier is useful to have features as arrays of numbers, and not as arrays of sequences. In our case, x is an array of (x, y) coordinates; so we are going to separate x and y, an then create two extra features from there.

In [46]:
min_len = min_length(plane_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(plane_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
plane_df = insert_columns(plane_df, x_stack, y_stack)
plane_df

890


Unnamed: 0,x,Class,Sample,Perimeter_length,x0,y0,x1,y1,x2,y2,...,x885,y885,x886,y886,x887,y887,x888,y888,x889,y889
0,"[[64, 235], [65, 234], [66, 234], [67, 234], [...",1,1,1433,64,235,65,234,66,234,...,471,264,471,265,471,266,471,267,471,268
1,"[[60, 139], [61, 138], [62, 137], [63, 137], [...",1,10,1540,60,139,61,138,62,137,...,560,304,559,303,558,303,557,302,556,301
2,"[[60, 219], [61, 218], [62, 217], [63, 217], [...",1,11,1587,60,219,61,218,62,217,...,564,246,563,246,562,246,561,246,560,246
3,"[[54, 201], [55, 200], [55, 199], [56, 198], [...",1,12,1511,54,201,55,200,55,199,...,502,227,501,228,500,228,499,228,498,228
4,"[[64, 275], [65, 274], [66, 274], [67, 274], [...",1,13,1489,64,275,65,274,66,274,...,490,234,490,235,490,236,490,237,491,238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,"[[33, 234], [34, 233], [35, 232], [36, 231], [...",7,5,1801,33,234,34,233,35,232,...,533,202,533,203,534,204,534,205,534,206
206,"[[21, 155], [22, 154], [23, 153], [24, 152], [...",7,6,1943,21,155,22,154,23,153,...,586,260,585,259,584,258,583,259,582,259
207,"[[45, 324], [46, 323], [47, 322], [48, 321], [...",7,7,1876,45,324,46,323,47,322,...,597,157,597,158,597,159,597,160,596,161
208,"[[70, 255], [71, 254], [72, 254], [73, 253], [...",7,8,1661,70,255,71,254,72,254,...,531,296,530,297,529,298,528,299,528,300


### Preparing data for classification

In [47]:
# Julen
plane_features, plane_target = get_features_target(plane_df)

In [48]:
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'  (J -> [0, 889])
plane_features

Unnamed: 0,Perimeter_length,x0,y0,x1,y1,x2,y2,x3,y3,x4,...,x885,y885,x886,y886,x887,y887,x888,y888,x889,y889
0,1433,64,235,65,234,66,234,67,234,68,...,471,264,471,265,471,266,471,267,471,268
1,1540,60,139,61,138,62,137,63,137,64,...,560,304,559,303,558,303,557,302,556,301
2,1587,60,219,61,218,62,217,63,217,64,...,564,246,563,246,562,246,561,246,560,246
3,1511,54,201,55,200,55,199,56,198,57,...,502,227,501,228,500,228,499,228,498,228
4,1489,64,275,65,274,66,274,67,274,68,...,490,234,490,235,490,236,490,237,491,238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205,1801,33,234,34,233,35,232,36,231,37,...,533,202,533,203,534,204,534,205,534,206
206,1943,21,155,22,154,23,153,24,152,25,...,586,260,585,259,584,258,583,259,582,259
207,1876,45,324,46,323,47,322,48,321,49,...,597,157,597,158,597,159,597,160,596,161
208,1661,70,255,71,254,72,254,73,253,74,...,531,296,530,297,529,298,528,299,528,300


We have put all Classes in a unique structure.

In [49]:
plane_target

0      1
1      1
2      1
3      1
4      1
      ..
205    7
206    7
207    7
208    7
209    7
Name: Class, Length: 210, dtype: int64

## 3.2. Preprocessing the car dataset

In this problem there are four classes that correspond to the 4 types of cars:  sedan, pickup, minivan, or SUV. However, in the database files are written like this: "ClassX_SampleY.mat", where X is the corresponding class number and Y the corresponding sample number. 

Here is the correspondance of class number and class name (car model name):
* 1 = sedan
* 2 = pickup
* 3 = minivan
* 4 = SUV

### Create dataframe

In [50]:
# Ibon
car_df = pd.DataFrame(car_mats)

This is the way car_df DataFrame looks like:


In [51]:
# Ibon
car_df

Unnamed: 0,__header__,__version__,__globals__,x
0,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[113, 181], [114, 180], [114, 179], [114, 178..."
1,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[98, 180], [99, 179], [99, 178], [100, 177], ..."
2,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[70, 180], [71, 180], [72, 179], [73, 178], [..."
3,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[54, 184], [55, 183], [56, 183], [57, 183], [..."
4,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[44, 180], [45, 179], [46, 179], [47, 178], [..."
...,...,...,...,...
115,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[101, 182], [102, 182], [103, 182], [104, 182..."
116,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[46, 180], [47, 180], [48, 179], [48, 178], [..."
117,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[31, 173], [32, 173], [33, 174], [34, 174], [..."
118,"b'MATLAB 5.0 MAT-file, Platform: PCWIN, Create...",1.0,[],"[[20, 170], [21, 171], [22, 170], [23, 170], [..."


As we see in the image above, __header__, __version__ and __globals__ columns are values created when using loadmat function to be able to read .mat files.

Those columns are not useful. We are going to delete them.

In [52]:
# Ibon
car_df = car_df.drop(['__header__', '__version__', '__globals__'], axis=1)
car_df

Unnamed: 0,x
0,"[[113, 181], [114, 180], [114, 179], [114, 178..."
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ..."
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [..."
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [..."
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [..."
...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182..."
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [..."
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [..."
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [..."


Now, the only attribute available in our car DataFrame is x, which refers to cartesian coordinates of each point on the perimeter of the shape. We need more information to include there, such as class value and sample number.

In [53]:
# Ibon
# Remember we have the names of the files read (in order) in our list called car_mats_file_name.
# Lets, divide that array in two arrays. One containing the class number and the other the sample number.
car_class_n, car_sample_n = get_samples_classes(car_mats_file_name)

In [54]:
# Ibon
print("This is how our class_n looks like: \n")
np.array(car_class_n)

This is how our class_n looks like: 



array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4])

In [55]:
# Ibon
print("This is how our sample_n looks like: \n")
np.array(car_sample_n)

This is how our sample_n looks like: 



array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9])

Lets add those lists to the car DataFrame.

In [56]:
# Ibon
add_samples_classes(car_df, car_class_n, car_sample_n)
print("This is, finally, how our car dataFrame looks like: \n")
car_df

This is, finally, how our car dataFrame looks like: 



Unnamed: 0,x,Class,Sample
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13
...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8


### Classes are balanced? Yes

Although in the description of the database it is said that each class has 30 samples, to make sure about it we are going to count them.

In [57]:
print_class_count(car_df)

Quantity of samples in each class:
4    30
3    30
2    30
1    30
Name: Class, dtype: int64


### Let's add another feature to our database

As we mention before, the only feature descriptor of the shapes is x, which  which refers to cartesian coordinates of each point on the perimeter of the shape. However, how many points are in each contour perimeter is not taken as a unique feature. It is implicitly measure in the length of each x sample, but, we prefer make it explicit.

In [58]:
car_df = add_perimeter(car_df)

In [59]:
print("This is how our car dataFrame looks like: \n")
car_df

This is how our car dataFrame looks like: 



Unnamed: 0,x,Class,Sample,Perimeter_length
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1,310
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10,331
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11,344
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12,334
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13,322
...,...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5,373
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6,358
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7,374
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8,356


### Changing how x feature is represented

When learning a classifier is useful to have features as arrays of numbers, and not as arrays of sequences. In our case, x is an array of (x, y) coordinates; so we are going to separate x and y, an then create two extra features from there.

In [60]:
min_len = min_length(car_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(car_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
car_df = insert_columns(car_df, x_stack, y_stack)
car_df

272


Unnamed: 0,x,Class,Sample,Perimeter_length,x0,y0,x1,y1,x2,y2,...,x267,y267,x268,y268,x269,y269,x270,y270,x271,y271
0,"[[113, 181], [114, 180], [114, 179], [114, 178...",1,1,310,113,181,114,180,114,179,...,150,189,149,189,148,190,147,191,146,191
1,"[[98, 180], [99, 179], [99, 178], [100, 177], ...",1,10,331,98,180,99,179,99,178,...,140,188,139,188,138,189,139,190,138,190
2,"[[70, 180], [71, 180], [72, 179], [73, 178], [...",1,11,344,70,180,71,180,72,179,...,131,186,130,187,129,187,128,187,127,187
3,"[[54, 184], [55, 183], [56, 183], [57, 183], [...",1,12,334,54,184,55,183,56,183,...,108,186,107,187,106,187,105,187,104,188
4,"[[44, 180], [45, 179], [46, 179], [47, 178], [...",1,13,322,44,180,45,179,46,179,...,84,189,83,189,82,190,81,191,82,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,"[[101, 182], [102, 182], [103, 182], [104, 182...",4,5,373,101,182,102,182,103,182,...,186,188,185,188,184,188,183,188,182,188
116,"[[46, 180], [47, 180], [48, 179], [48, 178], [...",4,6,358,46,180,47,180,48,179,...,131,186,130,186,129,186,128,186,127,186
117,"[[31, 173], [32, 173], [33, 174], [34, 174], [...",4,7,374,31,173,32,173,33,174,...,111,187,110,188,109,188,108,189,107,189
118,"[[20, 170], [21, 171], [22, 170], [23, 170], [...",4,8,356,20,170,21,171,22,170,...,76,189,75,189,74,189,73,189,72,189


### Preparing data for classification

In [61]:
# Julen
car_features, car_target = get_features_target(car_df)

In [62]:
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'  (J -> [0, 271])
car_features

Unnamed: 0,Perimeter_length,x0,y0,x1,y1,x2,y2,x3,y3,x4,...,x267,y267,x268,y268,x269,y269,x270,y270,x271,y271
0,310,113,181,114,180,114,179,114,178,114,...,150,189,149,189,148,190,147,191,146,191
1,331,98,180,99,179,99,178,100,177,101,...,140,188,139,188,138,189,139,190,138,190
2,344,70,180,71,180,72,179,73,178,72,...,131,186,130,187,129,187,128,187,127,187
3,334,54,184,55,183,56,183,57,183,58,...,108,186,107,187,106,187,105,187,104,188
4,322,44,180,45,179,46,179,47,178,48,...,84,189,83,189,82,190,81,191,82,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,373,101,182,102,182,103,182,104,182,105,...,186,188,185,188,184,188,183,188,182,188
116,358,46,180,47,180,48,179,48,178,48,...,131,186,130,186,129,186,128,186,127,186
117,374,31,173,32,173,33,174,34,174,35,...,111,187,110,188,109,188,108,189,107,189
118,356,20,170,21,171,22,170,23,170,24,...,76,189,75,189,74,189,73,189,72,189


We have put all Classes in a unique structure.

In [63]:
car_target

0      1
1      1
2      1
3      1
4      1
      ..
115    4
116    4
117    4
118    4
119    4
Name: Class, Length: 120, dtype: int64

# 4. Scaling the data

## 4.1. Scaling the plane data

In [64]:
plane_scaler = StandardScaler()
plane_features_scaled = plane_scaler.fit_transform(plane_features)
plane_features_scaled

array([[-0.76525741, -0.22082254, -0.30487452, ...,  0.01310947,
        -0.9060359 ,  0.02397567],
       [-0.14307215, -0.33949876, -2.22919207, ...,  0.63878863,
         0.51461948,  0.61316508],
       [ 0.13022418, -0.33949876, -0.62559411, ..., -0.36229803,
         0.58147385, -0.36881726],
       ...,
       [ 1.81070585, -0.7845346 ,  1.47912821, ..., -1.89968112,
         1.18316319, -1.88642634],
       [ 0.56052052, -0.04280821,  0.09602497, ...,  0.58515899,
         0.04663888,  0.59531085],
       [ 1.62463176, -0.69552743, -0.34496447, ..., -1.27400195,
         0.83217774, -1.27938271]])

## 4.2. Scaling the car data

In [65]:
car_scaler = StandardScaler()
car_features_scaled = car_scaler.fit_transform(car_features)
car_features_scaled

array([[-1.06728963,  0.58628083,  0.59700163, ...,  0.8741316 ,
        -0.66810724,  0.86509951],
       [-0.89957934,  0.31002285,  0.52449131, ...,  0.82647335,
        -0.76429545,  0.81725991],
       [-0.79575869, -0.20565872,  0.52449131, ...,  0.68349863,
        -0.89655423,  0.67374109],
       ...,
       [-0.55617256, -0.92392947,  0.01691907, ...,  0.77881511,
        -1.13702474,  0.7694203 ],
       [-0.69992424, -1.12651866, -0.20061188, ...,  0.77881511,
        -1.55784814,  0.7694203 ],
       [-0.88360693, -0.83184348,  0.16193971, ...,  0.77881511,
        -1.54582461,  0.7215807 ]])

# 5. Dividing train and test data

Also, to evaluate the accuracy of the classifiers in the dataset we will split the data in two sets. Train and Test data. 
Each set will have the same number of samples of each class (15).

### Divide train and test features

In [66]:
def train_test_features(features):
    train_features = features[0::2]
    test_features = features[1::2]

    return train_features, test_features

### Divide train and test target

In [67]:
def train_test_target(target):
    train_target = target[0::2]
    test_target = target[1::2]

    return train_target, test_target

## 5.1. Dividing the plane data

### Not scaled data

In [68]:
plane_train_features, plane_test_features = train_test_features(plane_features)

plane_train_target, plane_test_target = train_test_features(plane_target)

### Scaled data

In [69]:
plane_train_features_scaled, plane_test_features_scaled = train_test_features(plane_features_scaled)

## 5.2. Dividing the car data

### Not scaled data

In [70]:
car_train_features, car_test_features = train_test_features(car_features)
car_train_target, car_test_target = train_test_features(car_target)

### Scaled data

In [71]:
car_train_features_scaled, car_test_features_scaled = train_test_features(car_features_scaled)

# 6. Classification

### Defining the classifiers
We define the three classifiers used.

In [72]:
def get_classifiers():
    dt  = DecisionTreeClassifier()
    lda = LinearDiscriminantAnalysis()
    lg  = LogisticRegression(max_iter=2000)
    return dt, lda, lg

### Learning the classifiers
We used the train data to learn the three classifiers

In [73]:
def fit_classifiers(dt, lda, lg, train_features, train_target):
    dt.fit(train_features, train_target)
    lda.fit(train_features, train_target)
    lg.fit(train_features, train_target)

### Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [74]:
def predict_classifiers(dt, lda, lg, test_features):
    dt_test_predictions = dt.predict(test_features)
    lda_test_predictions = lda.predict(test_features)
    lg_test_predictions = lg.predict(test_features)
    
    return dt_test_predictions, lda_test_predictions, lg_test_predictions

## 6.1. Classification for the plane data

## Not scaled data

### Defining the classifiers
We define the three classifiers used.

In [75]:
plane_dt, plane_lda, plane_lg = get_classifiers()

### Learning the classifiers
We used the train data to learn the three classifiers

In [76]:
fit_classifiers(plane_dt, plane_lda, plane_lg, plane_train_features, plane_train_target)

### Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [77]:
plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions = \
predict_classifiers(plane_dt, plane_lda, plane_lg, plane_test_features)

## Scaled data

### Learning the classifiers
We used the train data to learn the three classifiers

In [78]:
fit_classifiers(plane_dt, plane_lda, plane_lg, plane_train_features_scaled, plane_train_target)

### Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [79]:
plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled = \
predict_classifiers(plane_dt, plane_lda, plane_lg, plane_test_features_scaled)

## 6.2. Classification for the car data

## Not scaled data

### Defining the classifiers
We define the three classifiers used.

In [80]:
car_dt, car_lda, car_lg = get_classifiers()

### Learning the classifiers
We used the train data to learn the three classifiers

In [81]:
fit_classifiers(car_dt, car_lda, car_lg, car_train_features, car_train_target)

### Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [82]:
car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions = \
predict_classifiers(car_dt, car_lda, car_lg, car_test_features)

## Scaled data

### Learning the classifiers
We used the train data to learn the three classifiers

In [83]:
fit_classifiers(car_dt, car_lda, car_lg, car_train_features_scaled, car_train_target)

### Using the classifier for predictions
We predict the class of the samples in the test data with the three classifiers.

In [84]:
car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled = \
predict_classifiers(car_dt, car_lda, car_lg, car_test_features_scaled)

# 7. Validation

### Computing the accuracy

We compute the accuracy using the three classifiers and print it. 

In [85]:
def print_accuracies(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    dt_acc =  accuracy_score(test_target, dt_test_predictions)
    lda_acc =  accuracy_score(test_target, lda_test_predictions)
    lg_acc =  accuracy_score(test_target, lg_test_predictions)
    print("Accuracy for the decision tree :", dt_acc)
    print("Accuracy for LDA :", lda_acc)
    print("Accuracy for logistic regression:", lg_acc)

### Computing the confusion matrices
We compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the latex code to insert it in our written report. 


In [98]:
def print_confusion_matrices(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    print("Confusion matrix decision tree")
    cm_dt = pd.crosstab(test_target, dt_test_predictions)
    print(cm_dt)
    print()
    print(cm_dt.to_latex())
    
    print("Confusion matrix LDA")
    cm_lda = pd.crosstab(test_target, lda_test_predictions)
    print(cm_lda)
    print()
    print(cm_lda.to_latex())
    
    print("Confusion matrix Logistic regression")
    cm_lg = pd.crosstab(test_target, lg_test_predictions)
    print(cm_lg)
    print()
    print(cm_lg.to_latex())

## 7.1. Validation for the plane data

## Not scaled data

## Computing the accuracy

We compute the accuracy using the three classifiers and print it. 

In [87]:
print_accuracies(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)

Accuracy for the decision tree : 0.7714285714285715
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


## Computing the confusion matrices
We compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the latex code to insert it in our written report. 


In [99]:
print_confusion_matrices(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)

Confusion matrix decision tree
col_0  1  2   3   4   5   6   7
Class                          
1      7  2   2   1   0   1   2
2      2  8   0   0   1   0   4
3      1  1  10   0   0   2   1
4      0  1   0  14   0   0   0
5      0  0   0   0  14   1   0
6      0  0   0   0   0  15   0
7      2  0   0   0   0   0  13

\begin{tabular}{lrrrrrrr}
\toprule
col\_0 &  1 &  2 &   3 &   4 &   5 &   6 &   7 \\
Class &    &    &     &     &     &     &     \\
\midrule
1     &  7 &  2 &   2 &   1 &   0 &   1 &   2 \\
2     &  2 &  8 &   0 &   0 &   1 &   0 &   4 \\
3     &  1 &  1 &  10 &   0 &   0 &   2 &   1 \\
4     &  0 &  1 &   0 &  14 &   0 &   0 &   0 \\
5     &  0 &  0 &   0 &   0 &  14 &   1 &   0 \\
6     &  0 &  0 &   0 &   0 &   0 &  15 &   0 \\
7     &  2 &  0 &   0 &   0 &   0 &   0 &  13 \\
\bottomrule
\end{tabular}

Confusion matrix LDA
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   0   0   0   2   0
2       0  12   0   0   0   0   3
3       0 

## Scaled data

## Computing the accuracy

We compute the accuracy using the three classifiers and print it. 

In [89]:
print_accuracies(plane_test_target, plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled)

Accuracy for the decision tree : 0.780952380952381
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239


## Computing the confusion matrices
We compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the latex code to insert it in our written report. 


In [90]:
print_confusion_matrices(plane_test_target, plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled)

Confusion matrix decision tree
col_0  1   2   3   4   5   6   7
Class                           
1      7   3   1   1   0   1   2
2      0  10   0   0   1   0   4
3      1   0  11   0   0   2   1
4      0   1   0  14   0   0   0
5      0   1   0   0  13   1   0
6      0   0   0   0   0  15   0
7      2   0   0   0   1   0  12

Confusion matrix LDA
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   0   0   0   2   0
2       0  12   0   0   0   0   3
3       0   0  15   0   0   0   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   0   0   0  15

Confusion matrix Logistic regression
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   2   0   0   0   0
2       1  11   0   0   0   0   3
3       0   0  14   0   0   1   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   1   0   0  14


## 7.2. Validation for the car data

## Not scaled data

## Computing the accuracy

We compute the accuracy using the three classifiers and print it. 

In [91]:
print_accuracies(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)

Accuracy for the decision tree : 0.7833333333333333
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


## Computing the confusion matrices
We compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the latex code to insert it in our written report. 


In [92]:
print_confusion_matrices(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)

Confusion matrix decision tree
col_0   1   2   3  4
Class               
1      14   1   0  0
2       1  10   0  4
3       0   1  14  0
4       0   4   2  9

Confusion matrix LDA
col_0   1   2   3   4
Class                
1      13   0   0   2
2       0  15   0   0
3       0   1  13   1
4       0   0   2  13

Confusion matrix Logistic regression
col_0   1   2   3   4
Class                
1      14   0   1   0
2       0  14   1   0
3       0   1  12   2
4       0   0   3  12



## Scaled data

## Computing the accuracy

We compute the accuracy using the three classifiers and print it. 

In [93]:
print_accuracies(car_test_target, car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled)

Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666


## Computing the confusion matrices
We compute the confusion matrices for the three classifiers. We print the confusion matrices and also generate the latex code to insert it in our written report. 


In [94]:
print_confusion_matrices(car_test_target, car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled)

Confusion matrix decision tree
col_0   1   2   3  4
Class               
1      14   0   1  0
2       0  11   1  3
3       0   1  14  0
4       0   3   3  9

Confusion matrix LDA
col_0   1   2   3   4
Class                
1      13   0   0   2
2       0  15   0   0
3       0   1  13   1
4       0   0   2  13

Confusion matrix Logistic regression
col_0   1   2   3   4
Class                
1      14   0   0   1
2       0  15   0   0
3       0   1  14   0
4       0   0   3  12



In [95]:
##Applying Gaussian filter to our shapes
# Filtered with Gaussian filter (standar deviation = 10).
# Gaussian filter: https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
#for i in range(len(car_df['Contour_Coordinates'])):
#    car_df.at[i, 'Contour_Coordinates'] = gaussian_filter(np.array(car_df['Contour_Coordinates'][i]), sigma=10)
#print("This how our shapes filtered look like: \n")
#car_df

car_features

In [222]:
car_features

Unnamed: 0,Perimeter_length,x0,y0,x1,y1,x2,y2,x3,y3,x4,...,x267,y267,x268,y268,x269,y269,x270,y270,x271,y271
0,310,113,181,114,180,114,179,114,178,114,...,150,189,149,189,148,190,147,191,146,191
1,331,98,180,99,179,99,178,100,177,101,...,140,188,139,188,138,189,139,190,138,190
2,344,70,180,71,180,72,179,73,178,72,...,131,186,130,187,129,187,128,187,127,187
3,334,54,184,55,183,56,183,57,183,58,...,108,186,107,187,106,187,105,187,104,188
4,322,44,180,45,179,46,179,47,178,48,...,84,189,83,189,82,190,81,191,82,192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,373,101,182,102,182,103,182,104,182,105,...,186,188,185,188,184,188,183,188,182,188
116,358,46,180,47,180,48,179,48,178,48,...,131,186,130,186,129,186,128,186,127,186
117,374,31,173,32,173,33,174,34,174,35,...,111,187,110,188,109,188,108,189,107,189
118,356,20,170,21,171,22,170,23,170,24,...,76,189,75,189,74,189,73,189,72,189


# 8. Feature Selection

# 8.1 Feature Selection with SelectKBest

At this point we have so many features: two features for each point in the contour perimeter. We normalized the length of each car image to 890 points, so 890 * 2 = 1780 features for representing each image. We normalized the length of each car image to 272 points, so 272 * 2 = 544 features for representing each image. Let's try reducing this amount of features...

We are going to use SelectKBest + f_classif function for our feature selection process.

### Feature selection

In [223]:
def feature_selection(features, target):
    reduced_data_100 = SelectKBest(f_classif, k=100).fit_transform(features, target)

    reduced_data_200 = SelectKBest(f_classif, k=200).fit_transform(features, target)

    reduced_data_300 = SelectKBest(f_classif, k=300).fit_transform(features, target)

    reduced_data_400 = SelectKBest(f_classif, k=400).fit_transform(features, target)

    reduced_data_500 = SelectKBest(f_classif, k=500).fit_transform(features, target)
    
    return reduced_data_100, reduced_data_200, reduced_data_300, reduced_data_400, reduced_data_500

## Plane Dataset

### Feature selection

In [224]:
# Julen
# Not Scaled data

plane_reduced_data_100, plane_reduced_data_200, plane_reduced_data_300, plane_reduced_data_400, plane_reduced_data_500 = \
feature_selection(plane_features, plane_target)

In [225]:
# Julen
# Scaled data

plane_reduced_data_100_scaled, plane_reduced_data_200_scaled, plane_reduced_data_300_scaled, plane_reduced_data_400_scaled, \
plane_reduced_data_500_scaled = feature_selection(plane_features, plane_target)

### Divide train and test features

As always, we need to separate the data in three: features data for training, features data for testing and class target:

In [226]:
# Julen
plane_reduced_data_train_100, plane_reduced_data_test_100 = train_test_features(plane_reduced_data_100)

plane_reduced_data_train_200, plane_reduced_data_test_200 = train_test_features(plane_reduced_data_200)

plane_reduced_data_train_300, plane_reduced_data_test_300 = train_test_features(plane_reduced_data_300)

plane_reduced_data_train_400, plane_reduced_data_test_400 = train_test_features(plane_reduced_data_400)

plane_reduced_data_train_500, plane_reduced_data_test_500 = train_test_features(plane_reduced_data_500)

In [227]:
# Julen
plane_reduced_data_train_100_scaled, plane_reduced_data_test_100_scaled = train_test_features(plane_reduced_data_100_scaled)

plane_reduced_data_train_200_scaled, plane_reduced_data_test_200_scaled = train_test_features(plane_reduced_data_200_scaled)

plane_reduced_data_train_300_scaled, plane_reduced_data_test_300_scaled = train_test_features(plane_reduced_data_300_scaled)

plane_reduced_data_train_400_scaled, plane_reduced_data_test_400_scaled = train_test_features(plane_reduced_data_400_scaled)

plane_reduced_data_train_500_scaled, plane_reduced_data_test_500_scaled = train_test_features(plane_reduced_data_500_scaled)

### Define classifiers
It is better to create one classifier of each type not to mix them:

In [228]:
# Julen
plane_dt_100, plane_lda_100, plane_lg_100 = get_classifiers()

plane_dt_200, plane_lda_200, plane_lg_200 = get_classifiers()

plane_dt_300, plane_lda_300, plane_lg_300 = get_classifiers()

plane_dt_400, plane_lda_400, plane_lg_400 = get_classifiers()

plane_dt_500, plane_lda_500, plane_lg_500 = get_classifiers()

### Fit Classifiers

In [229]:
# Julen
# Not Scaled data
fit_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_data_train_100, plane_train_target)

fit_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_data_train_200, plane_train_target)

fit_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_data_train_300, plane_train_target)

fit_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_data_train_400, plane_train_target)

fit_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_data_train_500, plane_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

### Predict classifiers

In [230]:
# Julen
# Not Scaled data
plane_dt_test_predictions_100, plane_lda_test_predictions_100, plane_lg_test_predictions_100 = \
predict_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_data_test_100)

plane_dt_test_predictions_200, plane_lda_test_predictions_200, plane_lg_test_predictions_200 = \
predict_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_data_test_200)

plane_dt_test_predictions_300, plane_lda_test_predictions_300, plane_lg_test_predictions_300 = \
predict_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_data_test_300)

plane_dt_test_predictions_400, plane_lda_test_predictions_400, plane_lg_test_predictions_400 = \
predict_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_data_test_400)

plane_dt_test_predictions_500, plane_lda_test_predictions_500, plane_lg_test_predictions_500 = \
predict_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_data_test_500)

### Fitting classifiers scaled

In [231]:
# Julen
# Scaled data
fit_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_data_train_100_scaled, plane_train_target)

fit_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_data_train_200_scaled, plane_train_target)

fit_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_data_train_300_scaled, plane_train_target)

fit_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_data_train_400_scaled, plane_train_target)

fit_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_data_train_500_scaled, plane_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

### Predicting classifiers scaled

In [232]:
# Julen
# Scaled data
plane_dt_test_predictions_100_scaled, plane_lda_test_predictions_100_scaled, plane_lg_test_predictions_100_scaled = \
predict_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_data_test_100_scaled)

plane_dt_test_predictions_200_scaled, plane_lda_test_predictions_200_scaled, plane_lg_test_predictions_200_scaled = \
predict_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_data_test_200_scaled)

plane_dt_test_predictions_300_scaled, plane_lda_test_predictions_300_scaled, plane_lg_test_predictions_300_scaled = \
predict_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_data_test_300_scaled)

plane_dt_test_predictions_400_scaled, plane_lda_test_predictions_400_scaled, plane_lg_test_predictions_400_scaled = \
predict_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_data_test_400_scaled)

plane_dt_test_predictions_500_scaled, plane_lda_test_predictions_500_scaled, plane_lg_test_predictions_500_scaled = \
predict_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_data_test_500_scaled)

### Calculating accuracy
Let's see if the evolution of the accuracy depending on the quantity of fetures selected:

In [233]:
print("100 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_100, plane_lda_test_predictions_100, plane_lg_test_predictions_100)

print("\n")

print("100 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_100_scaled, plane_lda_test_predictions_100_scaled, plane_lg_test_predictions_100_scaled)


100 features + Not scaled: 

Accuracy for the decision tree : 0.5904761904761905
Accuracy for LDA : 0.2571428571428571
Accuracy for logistic regression: 0.6190476190476191


100 features + Scaled: 

Accuracy for the decision tree : 0.6095238095238096
Accuracy for LDA : 0.2571428571428571
Accuracy for logistic regression: 0.6190476190476191


In [234]:
print("200 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_200, plane_lda_test_predictions_200, plane_lg_test_predictions_200)

print("\n")

print("200 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_200_scaled, plane_lda_test_predictions_200_scaled, plane_lg_test_predictions_200_scaled)

200 features + Not scaled: 

Accuracy for the decision tree : 0.6571428571428571
Accuracy for LDA : 0.49523809523809526
Accuracy for logistic regression: 0.6285714285714286


200 features + Scaled: 

Accuracy for the decision tree : 0.6571428571428571
Accuracy for LDA : 0.49523809523809526
Accuracy for logistic regression: 0.6285714285714286


In [235]:
print("300 features + Not Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_300, plane_lda_test_predictions_300, plane_lg_test_predictions_300)

print("\n")

print("300 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_300_scaled, plane_lda_test_predictions_300_scaled, plane_lg_test_predictions_300_scaled)

300 features + Not Scaled: 

Accuracy for the decision tree : 0.7238095238095238
Accuracy for LDA : 0.7047619047619048
Accuracy for logistic regression: 0.7047619047619048


300 features + Scaled: 

Accuracy for the decision tree : 0.7047619047619048
Accuracy for LDA : 0.7047619047619048
Accuracy for logistic regression: 0.7047619047619048


In [236]:
print("400 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_400, plane_lda_test_predictions_400, plane_lg_test_predictions_400)

print("\n")

print("400 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_400_scaled, plane_lda_test_predictions_400_scaled, plane_lg_test_predictions_400_scaled)

400 features + Not scaled: 

Accuracy for the decision tree : 0.7523809523809524
Accuracy for LDA : 0.7714285714285715
Accuracy for logistic regression: 0.819047619047619


400 features + Scaled: 

Accuracy for the decision tree : 0.7523809523809524
Accuracy for LDA : 0.7714285714285715
Accuracy for logistic regression: 0.819047619047619


In [237]:
print("500 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_500, plane_lda_test_predictions_500, plane_lg_test_predictions_500)

print("\n")

print("500 features+ Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_500_scaled, plane_lda_test_predictions_500_scaled, plane_lg_test_predictions_500_scaled)

500 features + Not scaled: 

Accuracy for the decision tree : 0.7333333333333333
Accuracy for LDA : 0.780952380952381
Accuracy for logistic regression: 0.8761904761904762


500 features+ Scaled: 

Accuracy for the decision tree : 0.7333333333333333
Accuracy for LDA : 0.780952380952381
Accuracy for logistic regression: 0.8761904761904762


## Car Dataset

### Feature selection

In [238]:
# Julen
# Not Scaled data

car_reduced_data_100, car_reduced_data_200, car_reduced_data_300, car_reduced_data_400, car_reduced_data_500 = \
feature_selection(car_features, car_target)

In [239]:
# Julen
# Scaled data

car_reduced_data_100_scaled, car_reduced_data_200_scaled, car_reduced_data_300_scaled, car_reduced_data_400_scaled, \
car_reduced_data_500_scaled = feature_selection(car_features, car_target)

### Divide train and test features

As always, we need to separate the data in three: features data for training, features data for testing and class target:

In [240]:
# Julen
car_reduced_data_train_100, car_reduced_data_test_100 = train_test_features(car_reduced_data_100)

car_reduced_data_train_200, car_reduced_data_test_200 = train_test_features(car_reduced_data_200)

car_reduced_data_train_300, car_reduced_data_test_300 = train_test_features(car_reduced_data_300)

car_reduced_data_train_400, car_reduced_data_test_400 = train_test_features(car_reduced_data_400)

car_reduced_data_train_500, car_reduced_data_test_500 = train_test_features(car_reduced_data_500)

In [241]:
# Julen
car_reduced_data_train_100_scaled, car_reduced_data_test_100_scaled = train_test_features(car_reduced_data_100_scaled)

car_reduced_data_train_200_scaled, car_reduced_data_test_200_scaled = train_test_features(car_reduced_data_200_scaled)

car_reduced_data_train_300_scaled, car_reduced_data_test_300_scaled = train_test_features(car_reduced_data_300_scaled)

car_reduced_data_train_400_scaled, car_reduced_data_test_400_scaled = train_test_features(car_reduced_data_400_scaled)

car_reduced_data_train_500_scaled, car_reduced_data_test_500_scaled = train_test_features(car_reduced_data_500_scaled)

### Define classifiers
It is better to create one classifier of each type not to mix them:

In [242]:
# Julen
car_dt_100, car_lda_100, car_lg_100 = get_classifiers()

car_dt_200, car_lda_200, car_lg_200 = get_classifiers()

car_dt_300, car_lda_300, car_lg_300 = get_classifiers()

car_dt_400, car_lda_400, car_lg_400 = get_classifiers()

car_dt_500, car_lda_500, car_lg_500 = get_classifiers()

### Fit Classifiers

In [243]:
# Julen
# Not Scaled data
fit_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_data_train_100, car_train_target)

fit_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_data_train_200, car_train_target)

fit_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_data_train_300, car_train_target)

fit_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_data_train_400, car_train_target)

fit_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_data_train_500, car_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Predict classifiers

In [244]:
# Julen
# Not Scaled data
car_dt_test_predictions_100, car_lda_test_predictions_100, car_lg_test_predictions_100 = \
predict_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_data_test_100)

car_dt_test_predictions_200, car_lda_test_predictions_200, car_lg_test_predictions_200 = \
predict_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_data_test_200)

car_dt_test_predictions_300, car_lda_test_predictions_300, car_lg_test_predictions_300 = \
predict_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_data_test_300)

car_dt_test_predictions_400, car_lda_test_predictions_400, car_lg_test_predictions_400 = \
predict_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_data_test_400)

car_dt_test_predictions_500, car_lda_test_predictions_500, car_lg_test_predictions_500 = \
predict_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_data_test_500)

### Fitting classifiers scaled

In [245]:
# Julen
# Scaled data
fit_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_data_train_100_scaled, car_train_target)

fit_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_data_train_200_scaled, car_train_target)

fit_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_data_train_300_scaled, car_train_target)

fit_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_data_train_400_scaled, car_train_target)

fit_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_data_train_500_scaled, car_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Predicting classifiers scaled

In [246]:
# Julen
# Scaled data
car_dt_test_predictions_100_scaled, car_lda_test_predictions_100_scaled, car_lg_test_predictions_100_scaled = \
predict_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_data_test_100_scaled)

car_dt_test_predictions_200_scaled, car_lda_test_predictions_200_scaled, car_lg_test_predictions_200_scaled = \
predict_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_data_test_200_scaled)

car_dt_test_predictions_300_scaled, car_lda_test_predictions_300_scaled, car_lg_test_predictions_300_scaled = \
predict_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_data_test_300_scaled)

car_dt_test_predictions_400_scaled, car_lda_test_predictions_400_scaled, car_lg_test_predictions_400_scaled = \
predict_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_data_test_400_scaled)

car_dt_test_predictions_500_scaled, car_lda_test_predictions_500_scaled, car_lg_test_predictions_500_scaled = \
predict_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_data_test_500_scaled)

### Calculating accuracy
Let's see if the evolution of the accuracy depending on the quantity of fetures selected:

In [247]:
print("100 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_100, car_lda_test_predictions_100, car_lg_test_predictions_100)

print("\n")

print("100 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_100_scaled, car_lda_test_predictions_100_scaled, car_lg_test_predictions_100_scaled)


100 features + Not scaled: 

Accuracy for the decision tree : 0.7833333333333333
Accuracy for LDA : 0.43333333333333335
Accuracy for logistic regression: 0.6


100 features + Scaled: 

Accuracy for the decision tree : 0.7166666666666667
Accuracy for LDA : 0.43333333333333335
Accuracy for logistic regression: 0.6


In [248]:
print("200 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_200, car_lda_test_predictions_200, car_lg_test_predictions_200)

print("\n")

print("200 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_200_scaled, car_lda_test_predictions_200_scaled, car_lg_test_predictions_200_scaled)

200 features + Not scaled: 

Accuracy for the decision tree : 0.8333333333333334
Accuracy for LDA : 0.6666666666666666
Accuracy for logistic regression: 0.7166666666666667


200 features + Scaled: 

Accuracy for the decision tree : 0.7666666666666667
Accuracy for LDA : 0.6666666666666666
Accuracy for logistic regression: 0.7166666666666667


In [249]:
print("300 features + Not Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_300, car_lda_test_predictions_300, car_lg_test_predictions_300)

print("\n")

print("300 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_300_scaled, car_lda_test_predictions_300_scaled, car_lg_test_predictions_300_scaled)

300 features + Not Scaled: 

Accuracy for the decision tree : 0.75
Accuracy for LDA : 0.8333333333333334
Accuracy for logistic regression: 0.8833333333333333


300 features + Scaled: 

Accuracy for the decision tree : 0.8166666666666667
Accuracy for LDA : 0.8333333333333334
Accuracy for logistic regression: 0.8833333333333333


In [250]:
print("400 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_400, car_lda_test_predictions_400, car_lg_test_predictions_400)

print("\n")

print("400 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_400_scaled, car_lda_test_predictions_400_scaled, car_lg_test_predictions_400_scaled)

400 features + Not scaled: 

Accuracy for the decision tree : 0.7333333333333333
Accuracy for LDA : 0.8666666666666667
Accuracy for logistic regression: 0.85


400 features + Scaled: 

Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.8666666666666667
Accuracy for logistic regression: 0.85


In [251]:
print("500 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_500, car_lda_test_predictions_500, car_lg_test_predictions_500)

print("\n")

print("500 features+ Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_500_scaled, car_lda_test_predictions_500_scaled, car_lg_test_predictions_500_scaled)

500 features + Not scaled: 

Accuracy for the decision tree : 0.7833333333333333
Accuracy for LDA : 0.8833333333333333
Accuracy for logistic regression: 0.8666666666666667


500 features+ Scaled: 

Accuracy for the decision tree : 0.75
Accuracy for LDA : 0.8833333333333333
Accuracy for logistic regression: 0.8666666666666667


# 8.2. Feature Selection with RandomForest

### Plane dataset

In [252]:
# SelectFromModel will select those features which importance 
# is greater than the mean importance of all the features

sel = SelectFromModel(RandomForestClassifier(n_estimators = 300))
sel.fit(plane_train_features, plane_train_target)
selectedFeaturesBoolean= sel.get_support()

In [253]:
selectedFeaturesTrainNames = plane_train_features.columns[(selectedFeaturesBoolean)]
selectedFeaturesTestNames = plane_test_features.columns[(selectedFeaturesBoolean)]

In [254]:
selectedTrainData = plane_train_features[selectedFeaturesTrainNames]
selectedTestData = plane_test_features[selectedFeaturesTestNames]
np.array(selectedTestData).shape

(105, 590)

In [255]:
dt  = DecisionTreeClassifier()
lda = LinearDiscriminantAnalysis()
lg  = LogisticRegression(max_iter=2000)

In [256]:
dt.fit(selectedTrainData, plane_train_target)
lda.fit(selectedTrainData, plane_train_target)
lg.fit(selectedTrainData, plane_train_target)

LogisticRegression(max_iter=2000)

In [257]:
dt_selected_prediction = dt.predict(selectedTestData)
lda_selected_prediction = lda.predict(selectedTestData)
lg_selected_prediction = lg.predict(selectedTestData)

In [258]:
accuracy_score(plane_test_target, dt_selected_prediction)

0.8

In [259]:
accuracy_score(plane_test_target, lda_selected_prediction)

0.9333333333333333

In [260]:
accuracy_score(plane_test_target, lg_selected_prediction)

0.9238095238095239

### Car dataset

In [261]:
# SelectFromModel will select those features which importance 
# is greater than the mean importance of all the features

sel = SelectFromModel(RandomForestClassifier(n_estimators = 300))
sel.fit(car_train_features, car_train_target)
selectedFeaturesBoolean= sel.get_support()

In [262]:
selectedFeaturesTrainNames = car_train_features.columns[(selectedFeaturesBoolean)]
selectedFeaturesTestNames = car_test_features.columns[(selectedFeaturesBoolean)]

In [263]:
selectedTrainData = car_train_features[selectedFeaturesTrainNames]
selectedTestData = car_test_features[selectedFeaturesTestNames]
np.array(selectedTestData).shape

(60, 164)

In [264]:
dt  = DecisionTreeClassifier()
lda = LinearDiscriminantAnalysis()
lg  = LogisticRegression(max_iter=2000)

In [265]:
dt.fit(selectedTrainData, car_train_target)
lda.fit(selectedTrainData, car_train_target)
lg.fit(selectedTrainData, car_train_target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(max_iter=2000)

In [266]:
dt_selected_prediction = dt.predict(selectedTestData)
lda_selected_prediction = lda.predict(selectedTestData)
lg_selected_prediction = lg.predict(selectedTestData)

In [267]:
accuracy_score(car_test_target, dt_selected_prediction)

0.8333333333333334

In [268]:
accuracy_score(car_test_target, lda_selected_prediction)

0.8

In [269]:
accuracy_score(car_test_target, lg_selected_prediction)

0.85

# 9. Feature engineering

# 9.1. Using PCA

## PCA for the plane data

In [270]:
plane_pca = PCA(n_components=3)
plane_pca.fit(plane_train_features)

print('Explained variation per principal component: {}'.format(plane_pca.explained_variance_ratio_))

Explained variation per principal component: [0.53829005 0.16718511 0.0947978 ]


In [271]:
plane_train_features_trans = plane_pca.transform(plane_train_features)
plane_test_features_trans = plane_pca.transform(plane_test_features)

In [272]:
print(plane_test_features.shape, plane_test_features_trans.shape)

(105, 1781) (105, 3)


In [273]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('Comp. 1')
ax.set_ylabel('Comp. 2')
ax.set_zlabel('Comp. 3')

for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [274]:
plt.close(fig)

In [275]:
plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.6190476190476191


## PCA for the car data

In [276]:
car_pca = PCA(n_components=3)
car_pca.fit(car_train_features)

print('Explained variation per principal component: {}'.format(car_pca.explained_variance_ratio_))

Explained variation per principal component: [0.71231913 0.21773543 0.03970791]


In [277]:
car_train_features_trans = car_pca.transform(car_train_features)
car_test_features_trans = car_pca.transform(car_test_features)

In [278]:
print(car_test_features.shape, car_test_features_trans.shape)

(60, 545) (60, 3)


In [279]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans[car_test_target==i, 0], car_test_features_trans[car_test_target==i, 1], \
               car_test_features_trans[car_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('Comp. 1')
ax.set_ylabel('Comp. 2')
ax.set_zlabel('Comp. 3')

for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [280]:
plt.close(fig)

In [281]:
car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.5166666666666667


# 9.2. Using TSNE

## TSNE for the plane data

In [282]:
plane_tsne = TSNE(n_components=3)
plane_tsne_trans3 = plane_tsne.fit_transform(plane_train_features_scaled)

In [283]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_tsne_trans3[plane_train_target==i, 0], plane_tsne_trans3[plane_train_target==i, 1], plane_tsne_trans3[plane_train_target==i, 2], label=str(i))

ax.set_xlabel('TSNE dim. 1')
ax.set_ylabel('TSNE dim. 2')
ax.set_zlabel('TSNE dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [284]:
plt.close(fig)

## TSNE for the car data

In [285]:
car_tsne = TSNE(n_components=3)
car_tsne_trans3 = car_tsne.fit_transform(car_train_features)

In [286]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_tsne_trans3[car_train_target==i, 0], car_tsne_trans3[car_train_target==i, 1], car_tsne_trans3[car_train_target==i, 2], label=str(i))

ax.set_xlabel('TSNE dim. 1')
ax.set_ylabel('TSNE dim. 2')
ax.set_zlabel('TSNE dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [287]:
plt.close(fig)

# 9.3. Using LDA

## LDA for the plane data

### Not scaling the data

In [288]:
plane_lda = LinearDiscriminantAnalysis(n_components=3)
plane_lda.fit(plane_train_features, plane_train_target)

LinearDiscriminantAnalysis(n_components=3)

In [289]:
plane_train_features_trans = plane_lda.transform(plane_train_features)
plane_test_features_trans = plane_lda.transform(plane_test_features)

In [290]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [291]:
plt.close(fig)

In [292]:
plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.9333333333333333


### Scaling the data

In [293]:
plane_lda = LinearDiscriminantAnalysis(n_components=3)
plane_lda.fit(plane_train_features_scaled, plane_train_target)

LinearDiscriminantAnalysis(n_components=3)

In [294]:
plane_train_features_trans_scaled = plane_lda.transform(plane_train_features_scaled)
plane_test_features_trans_scaled = plane_lda.transform(plane_test_features_scaled)

In [295]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans_scaled[plane_test_target==i, 0], plane_test_features_trans_scaled[plane_test_target==i, 1], \
               plane_test_features_trans_scaled[plane_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [296]:
plt.close(fig)

In [297]:
plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans_scaled, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans_scaled)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions_scaled)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.9523809523809523


## LDA for the car data

### Not scaling the data

In [298]:
car_lda = LinearDiscriminantAnalysis(n_components=3)
car_lda_trans = car_lda.fit(car_train_features, car_train_target)

In [299]:
car_train_features_trans = car_lda.transform(car_train_features)
car_test_features_trans = car_lda.transform(car_test_features)

In [300]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans[car_train_target==i, 0], car_test_features_trans[car_train_target==i, 1], \
               car_test_features_trans[car_train_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [301]:
plt.close(fig)

In [302]:
car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.9


### Scaling the data

In [303]:
car_lda = LinearDiscriminantAnalysis(n_components=3)
car_lda.fit(car_train_features_scaled, car_train_target)

LinearDiscriminantAnalysis(n_components=3)

In [304]:
car_train_features_trans_scaled = car_lda.transform(car_train_features_scaled)
car_test_features_trans_scaled = car_lda.transform(car_test_features_scaled)

In [305]:
fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(car_test_features_trans_scaled[car_test_target==i, 0], car_test_features_trans_scaled[car_test_target==i, 1], \
               car_test_features_trans_scaled[car_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]

<IPython.core.display.Javascript object>

In [306]:
plt.close(fig)

In [307]:
car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans_scaled, car_train_target)
car_lda_test_predictions_scaled = car_lda.predict(car_test_features_trans_scaled)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions_scaled)
print("Accuracy for LDA :", lda_acc)

Accuracy for LDA : 0.9


# 10. Pipeline Optimization
We use TPOT to generate an optimal pipeline to compare its accuracy with ours. This way we can know how good our classifiers are.

## 10.1. TPOT for the plane data

In [311]:
plane_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
plane_tpot.fit(plane_train_features, plane_train_target)
plane_tpot.fitted_pipeline_.steps



HBox(children=(HTML(value='Optimization Progress'), FloatProgress(value=0.0, max=60.0), HTML(value='')))


Generation 1 - Current best internal CV score: 0.7714285714285715
Generation 2 - Current best internal CV score: 0.7714285714285715
Generation 3 - Current best internal CV score: 0.7714285714285715
Generation 4 - Current best internal CV score: 0.8571428571428571
Generation 5 - Current best internal CV score: 0.8571428571428571
Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=75, p=2, weights=distance)


[('kneighborsclassifier',
  KNeighborsClassifier(n_neighbors=75, weights='distance'))]

In [312]:
plane_tpot_test_accuracy = plane_tpot.score(plane_test_features, plane_test_target)
print('The test accuracy obtained by tpot classification problem is:', plane_tpot_test_accuracy)

The test accuracy obtained by tpot classification problem is: 0.8


## 10.2. TPOT for the car data

In [314]:
car_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
car_tpot.fit(car_train_features, car_train_target)
car_tpot.fitted_pipeline_.steps



HBox(children=(HTML(value='Optimization Progress'), FloatProgress(value=0.0, max=60.0), HTML(value='')))


Generation 1 - Current best internal CV score: 0.8
Generation 2 - Current best internal CV score: 0.8
Generation 3 - Current best internal CV score: 0.8666666666666666
Generation 4 - Current best internal CV score: 0.8666666666666666
Generation 5 - Current best internal CV score: 0.8666666666666668
Best pipeline: MLPClassifier(SelectPercentile(RobustScaler(input_matrix), percentile=50), alpha=0.001, learning_rate_init=0.01)


[('robustscaler', RobustScaler()),
 ('selectpercentile', SelectPercentile(percentile=50)),
 ('mlpclassifier',
  MLPClassifier(alpha=0.001, learning_rate_init=0.01, random_state=16))]

In [315]:
car_tpot_test_accuracy = car_tpot.score(car_test_features, car_test_target)
print('The test accuracy obtained by tpot classification problem is:', car_tpot_test_accuracy)

The test accuracy obtained by tpot classification problem is: 0.8833333333333333
