In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Machine Learning for Classification with scikit-learn
Classification is the process of predicting the class (also called target or label) of given data points. In more technical terms, classification predictive modeling is the task of approximating a mapping function from input variables (X) to discrete output variables (y).  
Classification belongs to the category of supervised learning, meaning we know the true class labels of the data we use to train our model; that is, both X and y are given in the data.

For example, spam detection in email services is a classification problem. It is binary classification, since there are only 2 classes: spam and not spam.  
The model uses training data to learn how the input variables relate to the class. In the email example, known spam and non-spam emails are used as the training data. Once the classifier has been trained, it can be used to label new, unseen emails.
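As a minimal sketch of this train-then-label workflow, the snippet below fits a classifier on a handful of made-up emails and labels two new ones. The two features (number of links, number of exclamation marks) and all values are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: each row is [number of links, number of exclamation marks]
X_toy = np.array([[8, 9], [7, 6], [9, 8], [0, 1], [1, 0], [0, 0]])
# Known labels for the training emails: 1 = spam, 0 = not spam
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = GaussianNB()
clf.fit(X_toy, y_toy)  # learn how the inputs relate to the class

# Label two new, unseen emails
new_emails = np.array([[8, 7], [0, 1]])
predictions = clf.predict(new_emails)
print(predictions)
```

The first new email resembles the spam examples and the second resembles the non-spam examples, so the classifier assigns them to those classes.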

### 23.0. Loading in the data
In the folder for today's exercises, you have been supplied with the file `seeds.data` which contains the [Wheat Seeds Dataset](http://archive.ics.uci.edu/ml/datasets/seeds).  The data consists of 210 observations of seeds from 3 different varieties of wheat. The number of observations for each class is balanced. Each seed is described by 7 attributes and the class it belongs to:

<img src="https://www.organicfoods.com.au/wp-content/uploads/2020/06/wheat-grain.png" width="300" align="right">

0. Surface area
1. Perimeter
2. Compactness
3. Length of kernel
4. Width of kernel
5. Asymmetry coefficient
6. Length of kernel groove
7. Class: {1, 2, 3}

**NB!** `seeds.data` is formatted as a tab-delimited data file and can be loaded using `np.loadtxt()` with default parameters.
___
`seeds.shape`  
\>\> `(210, 8)`
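To see why the default parameters are enough, here is a small sketch that reads tab-delimited text from an in-memory buffer instead of a file. The two rows in `fake_file` are a hypothetical stand-in, not the actual contents of `seeds.data`.

```python
import io
import numpy as np

# Hypothetical two-row stand-in for a tab-delimited file such as seeds.data
fake_file = io.StringIO("15.26\t14.84\t0.871\t1\n14.88\t14.57\t0.8811\t1\n")

# By default np.loadtxt splits on any whitespace, so tabs need no extra arguments
data = np.loadtxt(fake_file)
print(data.shape)
print(data.dtype)
```

Note that every value, including the class label, is read as a float.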

In [3]:
seeds = np.loadtxt('seeds.data')
seeds

array([[15.26  , 14.84  ,  0.871 , ...,  2.221 ,  5.22  ,  1.    ],
       [14.88  , 14.57  ,  0.8811, ...,  1.018 ,  4.956 ,  1.    ],
       [14.29  , 14.09  ,  0.905 , ...,  2.699 ,  4.825 ,  1.    ],
       ...,
       [13.2   , 13.66  ,  0.8883, ...,  8.315 ,  5.056 ,  3.    ],
       [11.84  , 13.21  ,  0.8521, ...,  3.598 ,  5.044 ,  3.    ],
       [12.3   , 13.34  ,  0.8684, ...,  5.637 ,  5.063 ,  3.    ]])

### 23.1. Separate descriptive features and target feature
Extract the 7 descriptive features into a matrix, `X`, and the target feature, class, into a one-dimensional vector, `y`.

Hint: Use numpy slicing
___
`X.shape`  
\>\> `(210, 7)`

`y.shape`  
\>\> `(210,)`
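The slicing pattern can be tried on a small array first. This sketch uses a hypothetical 3x4 array whose last column plays the role of the class label.

```python
import numpy as np

# Hypothetical 3x4 array: three feature columns followed by one label column
data = np.array([[1.0, 2.0, 3.0, 0.0],
                 [4.0, 5.0, 6.0, 1.0],
                 [7.0, 8.0, 9.0, 1.0]])

features = data[:, :-1]  # all rows, every column except the last -> 2-D
labels = data[:, -1]     # all rows, last column only -> 1-D

print(features.shape)
print(labels.shape)
```

Indexing a single column with an integer drops that axis, which is why `labels` is one-dimensional.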

In [30]:
# Columns 0-6 hold the descriptive features; the last column holds the class
X, y = seeds[:, :7], seeds[:, -1]

# expected output
print(X.shape)
print(y.shape)

(210, 7)
(210,)

## Training data and testing data
To assess your model’s performance later, you will also need to divide the data set into two parts: a training set and a test set. The first is used to train the model, while the second is used to evaluate the trained model.

A common splitting choice is to take 70 % of your original data set as the training set, while the remaining 30 % makes up the test set.

It is often a good idea to shuffle your data before splitting, so that all classes are likely to be represented in both the training and the test data. Shuffling is random, so you should seed it to guarantee that the split is the same every time you run the code. That is particularly handy if you want reproducible results.
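The effect of seeding can be sketched on toy data. The arrays below are hypothetical; `random_state` and `stratify` are real parameters of `train_test_split`, and `stratify` is shown only as an optional way to make the class proportions match in both parts, not as something the exercise requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples, 2 features, 3 balanced classes
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.repeat([1, 2, 3], 4)

# Same seed -> identical shuffle and identical split on every run
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3)
same_split = np.array_equal(a_test, b_test)
print(same_split)

# stratify=y_toy keeps the class proportions the same in train and test
_, _, _, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3, stratify=y_toy)
print(sorted(ys_test.tolist()))
```

Passing features and labels in one call also keeps each row paired with its own label after shuffling.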


### 23.2. Split your data into a training set and a test set
Use the `train_test_split` function from scikit-learn to split your data, with `test_size` set to 0.3 and `random_state` set to 3 (or another number you think is cool).

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
___
`X_train.shape`  
\>\> `(147, 7)`

`y_test.shape`  
\>\> `(63,)`

In [42]:
print(f'Initial Number of Datapoints: {seeds.shape[0]}')
print('Splitting the Data into Test and Train Data...')

# Split X and y in a single call so that features and labels stay aligned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

print("\n" + "-"*5 + "Train Data" + "-"*5)
print(f'Descriptive: {X_train.shape}')
print(f'Target: {y_train.shape}')
print("\n" + "-"*5 + "Test Data" + "-"*5)
print(f'Descriptive: {X_test.shape}')
print(f'Target: {y_test.shape}')

Initial Number of Datapoints: 210
Splitting the Data into Test and Train Data...

-----Train Data-----
Descriptive: (147, 7)
Target: (147,)

-----Test Data-----
Descriptive: (63, 7)
Target: (63,)


## Naive Bayes Classifier

### 23.3. Make a Naive Bayes classifier object and fit it to your training data
Now, we finally get to make our machine learning model! Make a `GaussianNB()` object, and apply the `fit` method to the object to train the model on your training data.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB

In [54]:
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

print("ID\tName\tAbsolute and Relative Frequency")
for i, class_ in enumerate(naive_bayes.classes_):
	print(f"{i}\t{class_}\t{naive_bayes.class_count_[i]} ({round(naive_bayes.class_prior_[i]*100, 2)}%)")

ID	Name	Absolute and Relative Frequency
0	1.0	48.0 (32.65%)
1	2.0	47.0 (31.97%)
2	3.0	52.0 (35.37%)
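The class frequencies printed above come from attributes that `GaussianNB` sets during `fit`. As a rough sketch on hypothetical toy data, the priors are simply the relative class frequencies of the training labels, and `theta_` holds the per-class feature means.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: one feature, class 1 has 3 samples and class 2 has 1
X_toy = np.array([[1.0], [1.2], [0.8], [5.0]])
y_toy = np.array([1, 1, 1, 2])

model = GaussianNB().fit(X_toy, y_toy)

# Priors are the relative class frequencies in the training labels
manual_priors = model.class_count_ / model.class_count_.sum()
print(model.classes_)
print(model.class_prior_)
print(manual_priors)

# Mean of each feature within each class, estimated from the training data
print(model.theta_)
```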


### 23.5. Use your Naive Bayes Classifier to (re-)predict the labels for the training data
Let's make our first prediction using our model. Use the `predict` method on your classifier object and feed it your training data. We are predicting labels for data the model has already seen, so that we can check how well it has learned the training data.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict

In [20]:
y_predtrain = naive_bayes.predict(X_train)

sum(y_predtrain == y_train)/len(y_train)

0.9115646258503401

### 23.6. Measure the accuracy of your model for the training data
Use the `accuracy_score` function, and feed it the true labels for your training data and your predicted labels.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
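For classification labels, accuracy is the fraction of predictions that match the true labels. The sketch below checks this on hypothetical label arrays by comparing a manual calculation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true = np.array([1, 2, 3, 3, 2, 1])
y_pred = np.array([1, 2, 3, 1, 2, 2])

manual = np.mean(y_true == y_pred)        # fraction of matching labels
library = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn
print(manual, library)
```

Four of the six labels match, so both values equal 4/6.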

In [21]:
accuracy_score(y_train, y_predtrain)

0.9115646258503401

### 23.7. Repeat the two prior steps for the testing data
Use your classifier object to make a prediction of the labels of the test data, and calculate the accuracy score by comparing the resulting labels to the true labels.

In [23]:
y_predtest = naive_bayes.predict(X_test)

print(sum(y_predtest == y_test)/len(y_test))
print(accuracy_score(y_test, y_predtest))

0.9206349206349206
0.9206349206349206


## Random Forest Classifier
### 23.8. Make a Random Forest Classifier and train it on your training data
Make a `RandomForestClassifier()` object and train it on your training data using the `fit` method.

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [24]:
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train);

### 23.9. Check the accuracy of your model on both the training and testing data
Use the `predict` method on your classifier object to get both the predicted labels for the known training data and the unseen test data.

Calculate the accuracy of your predictions using the `accuracy_score` function.

In [26]:
y_predtrain, y_predtest = random_forest.predict(X_train), random_forest.predict(X_test)

print(f"Accuracy Score (Test): {accuracy_score(y_test, y_predtest)}")
print(f"Accuracy Score (Train): {accuracy_score(y_train, y_predtrain)}")

Accuracy Score (Test): 0.9682539682539683
Accuracy Score (Train): 1.0
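A training accuracy at or near 1.0 is typical for a random forest with default settings, because its trees are grown until they fit the training samples very closely; the test accuracy is the more honest estimate. The sketch below measures the gap on synthetic data from `make_classification`, which is an assumption standing in for the seeds data. It also fixes `random_state` on the forest so that repeated runs give the same numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the seeds data)
X_syn, y_syn = make_classification(n_samples=210, n_features=7, n_informative=4,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=3)

forest = RandomForestClassifier(random_state=0)  # fixed seed for reproducibility
forest.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, forest.predict(X_tr))
test_acc = accuracy_score(y_te, forest.predict(X_te))
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}, gap: {train_acc - test_acc:.3f}")
```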


### Bonus: Play around with hyperparameters
The classifiers we have used today were both made with scikit-learn's default parameters. These will very seldom be the optimal parameters for your task.  
Make a new RandomForestClassifier object, but this time try changing some of the hyperparameters. Fit the classifier to your data, and see how the accuracy of your model changes on train and test data.

The list of possible parameters can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).
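One possible way to explore is a small loop over a few settings, recording train and test accuracy for each. This is only a sketch: the data is synthetic (from `make_classification`, an assumption standing in for the seeds data), and the values tried for `n_estimators` and `max_depth` are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the seeds data)
X_syn, y_syn = make_classification(n_samples=210, n_features=7, n_informative=4,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=3)

results = {}
for n_estimators in (10, 100):      # number of trees in the forest
    for max_depth in (2, None):     # None lets each tree grow fully
        forest = RandomForestClassifier(n_estimators=n_estimators,
                                        max_depth=max_depth, random_state=0)
        forest.fit(X_tr, y_tr)      # always fit on the training data only
        results[(n_estimators, max_depth)] = (
            accuracy_score(y_tr, forest.predict(X_tr)),
            accuracy_score(y_te, forest.predict(X_te)),
        )

for params, (train_acc, test_acc) in results.items():
    print(params, round(train_acc, 3), round(test_acc, 3))
```

Comparing the two columns for each setting shows whether a change helps on unseen data or only makes the model fit the training data more closely.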

In [None]:
# play around with the hyperparameters of the model: train on the training data, evaluate on the test data