In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Machine Learning for Classification with scikit-learn
Classification is the process of predicting the class (also called target or label) of given data points. In more technical terms, classification predictive modeling is the task of approximating a mapping function from input variables (X) to discrete output variables (y).  
Classification belongs to the category of supervised learning, meaning we know the true class labels of the data we use to train our model; that is, both X and y are given in the data.

For example, spam detection in email services is a classification problem. It is a binary classification problem, since there are only 2 classes: spam and not spam.  
The model uses training data to learn how the input variables relate to the class. In the email example, known spam and non-spam emails are used as the training data. Once the classifier has been trained, it can be used to label new, unseen emails.

### 23.0. Loading in the data
In the folder for today's exercises, you have been supplied with the file `seeds.data` which contains the [Wheat Seeds Dataset](http://archive.ics.uci.edu/ml/datasets/seeds).  The data consists of 210 observations of seeds from 3 different varieties of wheat. The number of observations for each class is balanced. Each seed is described by 7 attributes and the class it belongs to:

<img src="https://www.organicfoods.com.au/wp-content/uploads/2020/06/wheat-grain.png" width="300" align="right">

0. Surface area
1. Perimeter
2. Compactness
3. Length of kernel
4. Width of kernel
5. Asymmetry coefficient
6. Length of kernel groove
7. Class: {1, 2, 3}

**NB!** `seeds.data` is formatted as a tab-delimited data file and can be loaded using `np.loadtxt()` with default parameters.
___
`seeds.shape`  
\>\> `(210, 8)`
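
As a minimal sketch of why the default parameters are enough: `np.loadtxt()` splits on any whitespace, tabs included. The example below does not read `seeds.data`; it parses three rows of illustrative, made-up values held in memory (`sample_text` and `sample` are hypothetical names) that follow the same layout of 7 measurements plus a class label. It also shows one way to check class balance with `np.unique`.

```python
import io
import numpy as np

# Three made-up rows in the same layout as seeds.data:
# 7 tab-separated measurements followed by the class label.
sample_text = (
    "15.0\t14.5\t0.87\t5.7\t3.3\t2.2\t5.2\t1\n"
    "18.0\t16.0\t0.88\t6.1\t3.7\t3.6\t6.0\t2\n"
    "12.0\t13.2\t0.85\t5.2\t2.8\t4.8\t5.1\t3\n"
)

# np.loadtxt splits on any whitespace by default, so tabs need no extra arguments.
sample = np.loadtxt(io.StringIO(sample_text))
print(sample.shape)

# Count how many rows belong to each class (the last column).
classes, counts = np.unique(sample[:, -1], return_counts=True)
print(classes, counts)
```

Running the same `np.unique` call on the last column of the real data is a quick way to confirm that the classes are balanced.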

In [4]:
data = np.loadtxt('seeds.data')
data.shape

(210, 8)

### 23.1. Separate descriptive features and target feature
Extract the 7 descriptive features into a matrix, `X`, and the target feature, class, into a column vector, `y`. 

Hint: Use numpy slicing
___
`X.shape`  
\>\> `(210, 7)`

`y.shape`  
\>\> `(210,)`
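
The slicing hint can be illustrated on a small toy array. This is only a sketch: `toy`, `features`, and `labels` are hypothetical names, and the array has 3 columns instead of 8. The pattern is what matters: `[:, :-1]` keeps every column except the last, and `[:, -1]` picks the last column as a 1-D array.

```python
import numpy as np

# Toy array: 4 rows, 3 columns; the last column plays the role of the class label.
toy = np.array([
    [1.0, 2.0, 0.0],
    [3.0, 4.0, 1.0],
    [5.0, 6.0, 0.0],
    [7.0, 8.0, 1.0],
])

features = toy[:, :-1]  # all rows, every column except the last -> 2-D array
labels = toy[:, -1]     # all rows, only the last column -> 1-D array

print(features.shape)  # (4, 2)
print(labels.shape)    # (4,)
```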

In [16]:
X = data[:,:7]
Y = data[:,-1]
print(X.shape)
print(Y.shape)


(210, 7)
(210,)


## Training data and testing data
To assess your model’s performance later, you will also need to divide the data set into two parts: a training set and a test set. The first is used to train the model, while the second is used to evaluate the trained model.

A common choice is to use 70 % of the original data set as the training set, while the remaining 30 % makes up the test set.

It is often a good idea to shuffle your data before splitting, so that all classes are likely to be represented in both the training and the test data. Shuffling is random, so you should seed it to guarantee that the split is the same every time. That is particularly handy if you want reproducible results.
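
A small sketch of both points, using toy data rather than the seeds data (`X_toy`, `y_toy`, and the other names are hypothetical). It shows that a fixed `random_state` reproduces the same split, and that the optional `stratify` argument of `train_test_split` goes one step further than plain shuffling by keeping the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 samples, 2 features, 3 balanced classes (10 samples each).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = np.repeat([1, 2, 3], 10)

# The same random_state gives an identical split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=3)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=3)
same_split = np.array_equal(split_a[1], split_b[1])  # compare the two test sets
print(same_split)

# stratify=y_toy keeps the class proportions equal in the train and test parts.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3, stratify=y_toy
)
strat_classes, strat_counts = np.unique(y_test_strat, return_counts=True)
print(strat_classes, strat_counts)
```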


### 23.2. Split your data into a training set and a test set
Use the `train_test_split` function from scikit-learn to split your data, with `test_size` set to 0.3 and `random_state` set to 3 (or another number you think is cool).

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
___
`X_train.shape`  
\>\> `(147, 7)`

`y_test.shape`  
\>\> `(63,)`

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(X.shape)
print(Y_test.shape)

(147, 7)
(63, 7)
(210, 7)
(63,)


## Naive Bayes Classifier

### 23.4. Make a Naive Bayes classifier object and fit it to your training data
Now we finally get to build our machine learning model! Create a `GaussianNB()` object and call its `fit` method to train the model on your training data.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

In [26]:
clf = GaussianNB()
clf.fit(X_train,Y_train)

GaussianNB()

### 23.5. Use your Naive Bayes Classifier to (re-)predict the labels for the training data
Let's make our first prediction with the model. Call the `predict` method on your classifier object and pass it your training data. Because the model has already seen this data, the result shows how well it learned the training set, not how well it generalizes to new data.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict

In [40]:
guessed = clf.predict(X_train)  # predicted classes for the training data

correct = guessed == Y_train  # boolean mask: True where the prediction matches the true class
print(len(Y_train))   # number of training samples
print(correct.sum())  # number of correct predictions

# Percentage of training samples that were classified correctly
probability = (correct.sum() / len(Y_train)) * 100
print('Probability of guessing right is', probability, '%')

147
136
Probability of guessing right is 92.51700680272108 %


### 23.6. Measure the accuracy of your model for the training data
Use the `accuracy_score` function and pass it the true labels of the training data and your predicted labels.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
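
To connect this with the manual calculation from the previous step, here is a short sketch on made-up labels (`y_true` and `y_pred` are hypothetical names). With default arguments, `accuracy_score` returns the fraction of predictions that match the true labels, which is the same as the mean of the element-wise comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 6 predictions match the true class.
y_true = np.array([1, 2, 3, 3, 2, 1])
y_pred = np.array([1, 2, 3, 1, 2, 2])

manual = np.mean(y_true == y_pred)              # fraction of matching entries
from_sklearn = accuracy_score(y_true, y_pred)   # same quantity via scikit-learn
print(manual, from_sklearn)
```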

In [44]:
accuracy = accuracy_score(Y_train,guessed)
accuracy*100

92.51700680272108

### 23.7. Repeat the two prior steps for the testing data
Use your classifier object to make a prediction of the labels of the test data, and calculate the accuracy score by comparing the resulting labels to the true labels.

In [45]:
guessed_test = clf.predict(X_test)  # predict labels for the unseen test data
print('Test accuracy is', accuracy_score(Y_test, guessed_test) * 100, '%')


## Random Forest Classifier
### 23.8. Make a Random Forest Classifier and train it on your training data
Make a `RandomForestClassifier()` object and train it on your training data using the `fit` method.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [47]:
clf2 = RandomForestClassifier()
clf2.fit(X_train,Y_train)


RandomForestClassifier()

### 23.9. Check the accuracy of your model on both the training and testing data
Use the `predict` method on your classifier object to get predicted labels for both the known training data and the unseen test data.

Calculate the accuracy of your predictions using the `accuracy_score` function.

In [54]:
guessedTrain = clf2.predict(X_train)
guessedTest = clf2.predict(X_test)

accuracyForTrain = accuracy_score(Y_train, guessedTrain)
accuracyForTest = accuracy_score(Y_test, guessedTest)
print(accuracyForTrain)
print(accuracyForTest)

1.0
0.873015873015873


### Bonus: Play around with hyperparameters
The classifiers we have used today were both created with scikit-learn's default parameters. These will very seldom be the optimal parameters for your task.  
Make a new RandomForestClassifier object, but this time try changing some of the hyperparameters. Fit the classifier to your data, and see how the accuracy of your model changes on train and test data.

The list of possible parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).
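
One possible way to start, sketched on synthetic stand-in data from `make_classification` rather than the seeds data (all variable names here are hypothetical): loop over a few values of one hyperparameter, here `max_depth`, and record the train and test accuracy for each. The values tried are arbitrary; which settings work best depends on the data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the seeds data (210 x 7, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=210, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

results = {}
for depth in (2, None):  # shallow trees vs. unlimited depth (the default)
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    forest.fit(X_tr, y_tr)
    results[depth] = (
        accuracy_score(y_tr, forest.predict(X_tr)),  # accuracy on the training data
        accuracy_score(y_te, forest.predict(X_te)),  # accuracy on the test data
    )
    print(depth, results[depth])
```

Comparing the two numbers for each setting gives a rough indication of how strongly the model fits the training data relative to how well it generalizes.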