# Scikit-learn Tutorial

Exercises from an online YouTube course, which can be found [here](https://www.youtube.com/watch?v=pqNCD_5r0IU).

## Classification 

A classification model uses the features of an instance to assign it to one of a fixed set of categories, grouping it with other instances that have similar features.

### Data 

The dataset contains car evaluation data. Using it, we will build a classification model that labels each car's evaluation as one of: unacceptable, acceptable, good or very good.

This is an example of supervised machine learning: the model is trained on examples whose labels are already known.

Features: 
* buying: Buying price (categorical)
* maint: Price of maintenance (categorical)
* doors: Number of doors (categorical)
* persons: Capacity in terms of persons to carry (categorical)
* lug_boot: Size of the car boot (categorical)
* safety: Estimated safety of the car (categorical)

Target: 
* class: The evaluation level, stored as 'unacc' (unacceptable), 'acc' (acceptable), 'good' or 'vgood' (very good)

In [33]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import neighbors, metrics 
from sklearn.preprocessing import LabelEncoder
import numpy as np 
import pandas as pd

# Reading in the data and adding column labels 
cars = pd.read_csv('/Users/imogensole/Desktop/Git/Scikit_Learn/car+evaluation/car.data')
cars = cars.set_axis(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], axis=1)
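
A note on the loading step: `car.data` has no header row, and `pd.read_csv` treats the first line of a file as column names by default. That would explain why the split below reports 1727 rows in total rather than one per record. The sketch below uses a small in-memory stand-in for the file (the three rows are made up for illustration) to show how `header=None` together with `names=` keeps the first record.

```python
import io
import pandas as pd

# Made-up stand-in for a header-less file such as car.data
raw = (
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,4,big,high,vgood\n"
    "med,med,3,4,med,med,acc\n"
)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Default behaviour: the first record is consumed as the header row
lost_first_row = pd.read_csv(io.StringIO(raw))
lost_first_row = lost_first_row.set_axis(columns, axis=1)

# header=None keeps every record; names= supplies the column labels
kept_all_rows = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(lost_first_row))  # 2
print(len(kept_all_rows))   # 3
```

If the same change were applied to the real file, the row counts printed later in this notebook would shift by one.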

### Train Test Split

In [57]:
# Split the data into features and labels 

X = cars.drop('class', axis='columns').values
y = cars['class']

# splits the data into training and test sets 
# test data = 20% of data set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1381, 6)
(346, 6)
(1381,)
(346,)
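
The shapes above follow from how `train_test_split` sizes the two sets: the test set is the 20% share rounded up (346 of 1727 rows) and the remainder goes to training. Because the shuffle is random, the rows in each set change from run to run. Passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both sets. A minimal sketch on synthetic stand-in data (the `_demo` names and the labels are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the notebook's data
X_demo = np.arange(1727 * 6).reshape(1727, 6)
y_demo = np.arange(1727) % 4  # four made-up class labels

# random_state fixes the shuffle; stratify preserves the class balance
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Repeating the call with the same random_state gives the same split
repeat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_train_demo.shape, X_test_demo.shape)   # (1381, 6) (346, 6)
print(np.array_equal(X_test_demo, repeat[1]))  # True
```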


**What is KNN?**

* KNN = 'K Nearest Neighbours'
* An algorithm that can be used for both classification and regression
* Divides the feature space into regions, one for each category
* Each axis represents a feature of the data set

1. Test point plotted based on its features 
2. Identify the K nearest neighbours to the test point 
3. Test point classified in the same category as the majority of the nearest neighbours 
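
The three steps above can be sketched in plain Python. This is an illustrative toy, not how scikit-learn implements KNN: it assumes Euclidean distance, and the function name `knn_predict` and the sample points are made up for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, test_point, k=3):
    # Step 1: the test point is located by its feature values.
    # Step 2: measure the distance to every training point and keep the k nearest.
    distances = [
        (math.dist(point, test_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among the k nearest neighbours wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1, 1)))  # a
print(knn_predict(train_points, train_labels, (5, 4)))  # b
```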

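* A larger k smooths out noise and tends to suit larger datasets; a small k follows the training data more closely
* An odd k is advised so that votes are less likely to tie (ties are only ruled out when there are two classes)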
* weights parameter = 'uniform' or 'distance'
    * 'uniform' = all datapoints given equal importance 
    * 'distance' = datapoints weighted higher if they are nearer to the test point 
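
The effect of the `weights` parameter can be seen on a tiny made-up data set, where two class-0 points lie far from the query and one class-1 point lies close to it. This is only an illustrative sketch; the `_toy` names and values are assumptions, not part of the course data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One feature, three training points: two of class 0, one of class 1
X_toy = np.array([[0.0], [0.1], [1.0]])
y_toy = np.array([0, 0, 1])
query = np.array([[0.9]])  # much closer to the class-1 point

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(query)
distance_pred = distance_knn.predict(query)

print(uniform_pred)   # [0]  two votes against one
print(distance_pred)  # [1]  the close neighbour outweighs the two distant ones
```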

In [61]:
# Convert the feature strings into integer codes
Le = LabelEncoder()

for i in range(X.shape[1]):  # iterate over each feature column
    X[:, i] = Le.fit_transform(X[:, i])  # codes follow the alphabetical order of the labels


# Converting the target strings into numerical using mapping 
# Create a dictionary for mapping labels 
label_mapping = { 
    'unacc':0, 
    'acc':1, 
    'good':2, 
    'vgood':3
}

y = cars['class']
y = y.map(label_mapping)
y = np.array(y)
print(y)


[0 0 0 ... 0 2 3]
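
`LabelEncoder` assigns codes in alphabetical order of the labels, so an ordered feature such as buying price ends up with 'high' coded below 'low'. Since KNN relies on distances between feature values, it may help to keep the natural order. One option, sketched here as an assumption rather than the course's method, is `OrdinalEncoder` with an explicit category order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

prices = np.array(["low", "med", "high", "vhigh"])

# LabelEncoder sorts the labels alphabetically: high, low, med, vhigh
label_codes = LabelEncoder().fit_transform(prices)
print(label_codes)  # [1 2 0 3]

# OrdinalEncoder accepts an explicit order, one list per feature column
encoder = OrdinalEncoder(categories=[["low", "med", "high", "vhigh"]])
ordinal_codes = encoder.fit_transform(prices.reshape(-1, 1)).ravel()
print(ordinal_codes)  # [0. 1. 2. 3.]
```

`OrdinalEncoder` can also encode all feature columns in one call when given one category list per column, which would replace the loop above.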


In [73]:
# Create the model 

knn = neighbors.KNeighborsClassifier(n_neighbors=25, weights='uniform')

# Re-create the train/test split now that X and y are numeric
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Training the model 
knn.fit(X_train, y_train)

# Make predictions on test data
prediction = knn.predict(X_test)

# Evaluate the accuracy 
accuracy = metrics.accuracy_score(y_test, prediction)

print("predictions: ", prediction)
print("accuracy: ", accuracy)

# Compare the predicted and actual value for one row of the full data set
# (row a may have been part of the training set)
a = 100
print("actual value: ", y[a])
print("predicted value: ", knn.predict(X[[a]])[0])

predictions:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 1 0 3 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 3 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 3 0 0 0 1 0 1 0 0 0 0 0 0 1 0
 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 1]
accuracy:  0.7890173410404624
actual value:  0
predicted value:  0
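
Accuracy alone can hide weak performance on rare classes: the predictions above are mostly class 0, so a model that nearly always answers 0 can still score well. A confusion matrix shows the errors per class. The sketch below uses small hand-made label arrays (not the notebook's results) purely to illustrate the call:

```python
import numpy as np
from sklearn import metrics

# Hand-made labels for illustration: rows of the matrix are true classes,
# columns are predicted classes
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 3])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 0, 3])

demo_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2, 3])

print(demo_accuracy)  # 0.75
print(cm)
```

In this made-up case the accuracy is 0.75 even though class 2 is never predicted correctly, which the third row of the matrix makes visible.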


# Support Vector Machine (SVM)

* Effective in high-dimensional spaces (= good for modelling data with many features)