# Code

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Splitting Dataset into Training Set & Test Set

**Python Documentation**
```py
(function) train_test_split: (*arrays: Any, test_size: Any | None = None, train_size: Any | None = None, random_state: Any | None = None, shuffle: bool = True, stratify: Any | None = None) -> list[Any | list]
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

Read more in the User Guide <cross_validation>.

Parameters
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

test_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, default=None
    Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary <random_state>.

shuffle : bool, default=True
    Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide <stratification>.

Returns
splitting : list, length=2 * len(arrays)
    List containing train-test split of inputs
```

- `X_train` - Matrix of Features of the training set
- `X_test` - Matrix of Features of the testing set
- `y_train` - Dependent Variable Vector of the training set
- `y_test` - Dependent Variable Vector of the testing set

- `train_test_split(matrix_of_features, dependent_variable_vector, test_size=0.25, random_state=0)` returns a list containing the split dataset
  - The Matrix of Features (inputs) and the Dependent Variable Vector (labels) must be passed as separate arrays
  - `test_size` - size of the test set given as a fraction; the default is 0.25 (leaving 0.75 for the training set)
  - `random_state` - a seed given so that the random split is the same across multiple runs
  - The dataset is split into:
    - `X_train`, `X_test`, `y_train`, `y_test`
    - `train_test_split` returns a list which is then unpacked into the 4 variables
    - The split is done at random (pseudo-random) because the rows may be ordered by label (the iris rows are sorted by species)
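As a minimal sketch of the call described above, the split can be checked by looking at the shapes of the four arrays. The explicit `test_size=0.25` and the `stratify=y` argument are illustrative additions here, not something the notebook itself uses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris["data"], iris["target"]  # features and class indices kept separate

# test_size=0.25 is the default, written out here for clarity.
# stratify=y (an illustrative addition) keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # 112 training rows and 38 test rows out of 150
print(np.bincount(y_train), np.bincount(y_test))  # per-class counts in each part
```

Without `stratify`, the per-class counts in the two parts can drift further from the 50/50/50 balance of the full dataset.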

In [None]:
iris = load_iris() # load the iris dataset
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0) # 75% training and 25% test

**Matrix of Features for Training Set**

In [None]:
print(X_train.shape)
print(X_train)

**Matrix of Features for Test Set**

In [None]:
print(X_test.shape)
print(X_test)

**Dependent Variable Vector for Training Set**

In [None]:
print(y_train.shape)
print(y_train)

**Dependent Variable Vector for Test Set**

In [None]:
print(y_test.shape)
print(y_test)

## Building First Model

- All machine learning models in `scikit-learn` are implemented in their own classes, which are parts of modules
- The K Nearest Neighbours classification algorithm is implemented in the `KNeighborsClassifier` class in the `neighbors` module
- Before we can use the model, we need to instantiate the class into an object
- This is when we set any parameters of the model
  - The most important parameter of `KNeighborsClassifier` is the number of neighbours, `n_neighbors`
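To make the idea behind the classifier concrete, here is a rough sketch of 1-nearest-neighbour prediction in plain NumPy. It is not how `scikit-learn` implements it; `predict_1nn` is a hypothetical helper and the data is made up for illustration.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Return, for each row of X_new, the label of its closest training row."""
    # Pairwise differences, shape (n_new, n_train, n_features)
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distances, shape (n_new, n_train)
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = distances.argmin(axis=1)  # index of the closest training row
    return y_train[nearest]

# Tiny made-up training set: two features, two classes
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
X_new = np.array([[0.9, 1.2], [4.0, 4.5]])

print(predict_1nn(X_train, y_train, X_new))  # [0 1]
```

With `n_neighbors` greater than 1, the classifier would instead take a majority vote among the closest training rows.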

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1) # n_neighbors is the number of neighbors to use (1)

- Fit the k-nearest neighbors classifier from the training dataset

In [None]:
knn.fit(X_train, y_train) # fit the model using the training data and training targets

## Making Predictions

### Creating Data to Predict

- A new iris sample is created as a 2D array with one row holding its four feature values

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]]) # sample data to predict
X_new.shape

### Predicting Data

- A prediction is made for the new sample to determine its species
  - `predict` returns an array of class indices, one per row of the input

In [None]:
prediction = knn.predict(X_new) # predict the class of the new data point
print(prediction) # prints the predicted class

- The prediction (index of the class) is used to find the name associated with that class index
  - `iris['target_names']` returns the names of all the classes -> `['setosa' 'versicolor' 'virginica']`
  - `iris['target_names'][prediction]` returns the name of the predicted class; since `prediction` is the array `[0]`, the result is the array `['setosa']`
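The lookup can be sketched on its own, without the classifier. Here `target_names` is typed in by hand to mirror `iris['target_names']`, and `prediction` is a stand-in for the output of `knn.predict(X_new)`.

```python
import numpy as np

target_names = np.array(["setosa", "versicolor", "virginica"])  # mirrors iris['target_names']
prediction = np.array([0])  # stand-in for the output of knn.predict(X_new)

print(target_names[prediction])     # indexing with an array gives an array: ['setosa']
print(target_names[prediction][0])  # taking element 0 gives the plain name: setosa
print(target_names[prediction[0]])  # indexing with a scalar also gives: setosa
```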

In [None]:
print(iris['target_names']) # print all the names of the classes
print(iris['target_names'][prediction]) # print the name of the class

## Evaluating the Model

### Making a Prediction

- Make a prediction for each iris in the test data so it can be compared against its known label

In [None]:
y_pred = knn.predict(X_test) # predict the class of the test data

### Measure Accuracy of Prediction

- The accuracy of the model can be measured on the test set
- Accuracy is the fraction of flowers for which the right species was predicted
- There are 2 ways of measuring accuracy, both of which return the same value:
  - [Numpy Mean](#numpy-mean)
  - [K-Nearest Neighbour](#k-nearest-neighbour) (the `score` method of the classifier)

#### Numpy Mean

In [None]:
np.mean(y_pred == y_test) # calculate the accuracy of the prediction

- The fraction of `True` values in the comparison is about 0.97 (97%)

In [None]:
print(y_pred == y_test) # print the result of the prediction

#### K-Nearest Neighbour

In [None]:
knn.score(X_test, y_test) # calculate the accuracy of the prediction

#### Checking If Both Methods Return the Same Value

In [None]:
print(np.mean(y_pred == y_test) == knn.score(X_test, y_test)) # check if the two methods give the same result
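Exact `==` on floating-point values can be fragile in general, so a tolerant comparison is a reasonable alternative. The sketch below rebuilds the same model and compares the two accuracy values with `np.isclose`; the names `acc_mean` and `acc_score` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

acc_mean = np.mean(knn.predict(X_test) == y_test)  # accuracy via the mean of matches
acc_score = knn.score(X_test, y_test)              # accuracy via the score method

print(np.isclose(acc_mean, acc_score))  # comparison with a small tolerance
```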

## Loading Data from File

In [None]:
X = np.genfromtxt("iris_data.txt") # load the data from the file
print(X.shape) # print the shape of the data

In [None]:
print(X[:3, :]) # print the first 3 rows of the data and all columns
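The layout of `iris_data.txt` is not shown here, so the following is only a sketch that assumes a whitespace-separated file of numbers. It feeds `np.genfromtxt` an in-memory stand-in (made-up rows) instead of a real file, so it runs without the data file.

```python
import io
import numpy as np

# Made-up stand-in for the first rows of a whitespace-separated data file
text = "5.1 3.5 1.4 0.2\n4.9 3.0 1.4 0.2\n4.7 3.2 1.3 0.2\n"

X_demo = np.genfromtxt(io.StringIO(text))  # whitespace is the default delimiter
print(X_demo.shape)   # (3, 4)
print(X_demo[:2, :])  # first 2 rows, all columns
```

For a comma-separated file, `delimiter=","` would need to be passed as well.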

- Check whether the dataset loaded from the text file is the same as the iris dataset from `load_iris`

In [None]:
print(np.array_equal(X, iris['data'])) # check if the data is the same as the one from the iris dataset
print(np.mean(X == iris['data'])) # check percentage of data being the same as the one from the iris dataset

# Exercises

## Question 1
Briefly explain the way `np.mean(y_pred == y_test)` is computed. 
It might help to run its part: `y_pred == y_test`.

- Each element of `y_pred` is compared with the element of `y_test` at the same position, which creates a new array of boolean values showing whether the two were the same (`True`) or different (`False`)

In [None]:
print(y_pred == y_test) # print the result of the prediction for each data point

- `np.mean()` treats `True` as 1 and `False` as 0, so it computes the fraction of `True` values in the array
  - number of `True` values divided by the total number of values
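A small worked example with made-up vectors shows the arithmetic step by step; the name `matches` is introduced here for illustration.

```python
import numpy as np

y_test = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 2])  # third prediction is wrong

matches = y_pred == y_test
print(matches)                      # [ True  True False  True]
print(matches.sum(), matches.size)  # 3 4
print(np.mean(matches))             # 0.75
```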

In [None]:
np.mean(y_pred == y_test) # calculate the accuracy of the prediction

## Question 2
Draw the test error rate of the K Nearest Neighbours algorithm on the same training set against `K`. Use the same test set for all `K`.
If you need to remember the scores you are getting for various `K`, you may use `NumPy` commands such as

In [None]:
results = np.empty(99) # create an empty array to store the results
for K in range(1, 100): # iterate over all values of k
	knn = KNeighborsClassifier(n_neighbors = K) # create a new model
	knn.fit(X_train, y_train) # fit the model using the training data and training targets
	results[K - 1] = knn.score(X_test, y_test) # calculate the accuracy of the prediction

In [None]:
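plt.plot(np.arange(1, 100), 1 - results) # test error rate (1 - accuracy) for K = 1..99
plt.xlabel("K"); plt.ylabel("Test error rate") # label the axes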

## Question 3
Check that the dataset that you loaded from file in Section 5 is identical to the one that you loaded using load_iris in Section 1. You may want to use your answer to Exercise 1. If the two data sets are not identical, please explore the difference.

- Check if the dataset imported from the `iris_data` file is the same as the one loaded using `load_iris`

In [None]:
np.array_equal(X, iris['data']) # check if the data is the same as the one from the iris dataset

- Check how much of the data matches

In [None]:
print(np.mean(X == iris['data']))

- Returns the indices for the locations where the data is different between the 2 datasets

In [None]:
print(np.where(X != iris['data'])) # print the indices of the data points that are different
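For a 2D array, `np.where` returns one index array per dimension (rows, then columns). The sketch below uses two small made-up arrays with deliberate differences to show how those indices can be turned into a readable report.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = a.copy()
b[1, 0] = 3.5  # introduce two deliberate differences
b[2, 1] = 6.1

rows, cols = np.where(a != b)  # one index array per dimension
for r, c in zip(rows, cols):
    print(f"row {r}, column {c}: {a[r, c]} vs {b[r, c]}")
```

The same loop applied to `X` and `iris['data']` would list each differing entry together with both values.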

# Quiz

## Question 1
In the context of conformal prediction, what is the minimal possible p-value for a training set of size 5?  (To two decimal places.)
*0.17* (the smallest possible p-value is 1/(n+1), which is 1/6 for n = 5 training samples)

## Question 2
In conformal prediction, *validity* is achieved automatically (under the IID assumption).  But *achieving* efficiency is an art.

## Question 3
Different data sciences often use different assumptions about the data to reach their conclusions.
- Which data science widely uses Gaussian assumptions? *Traditional statistics*
- Which data science widely uses the IID assumption? *Mainstream machine learning*

## Question 4
Suppose a conformity measure A maps the sequence of observations
(1,1), (2,0), (3,1)
to the sequence of conformity scores
0, 1, 0.
Which sequence of conformity scores does A map
(2,0), (3,1), (1,1)
to?
- *1, 0, 0*
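The answer follows from equivariance: reordering the observations reorders the scores in the same way. The sketch below assumes exactly that property and relies on the observations being distinct, so each one can be looked up by value; `score_of` is a name introduced for illustration.

```python
observations = [(1, 1), (2, 0), (3, 1)]
scores = [0, 1, 0]  # conformity scores given in the question

# Assumption: A is equivariant, so each observation keeps its score
# no matter where it appears in the sequence.
score_of = dict(zip(observations, scores))

permuted = [(2, 0), (3, 1), (1, 1)]
permuted_scores = [score_of[obs] for obs in permuted]
print(permuted_scores)  # [1, 0, 0]
```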

## Question 5
Suppose `y_pred` and `y_test` are vectors of the same length.  Then `y_pred==y_test` is a *vector*, `(y_pred==y_test)+1` is a *vector*, `y_pred==y_test+1` is a *vector*, `(y_pred==y_test)[0]+1` is a *scalar*, and `np.mean(y_pred==y_test)` is a *scalar*.

- Returns a boolean array (vector) of elementwise comparisons

In [None]:
print(y_pred == y_test) # print the result of the prediction for each data point

- Returns a list (vector)
- `+0` turns all the true and false predictions into integers
  - True = 1
  - False = 0
- `+1` adds `1` to each prediction
  - True = 2
  - False = 1

In [None]:
print((y_pred == y_test) + 1) # print the result of the prediction for each data point

- Returns a boolean array (vector)
- `+` binds more tightly than `==`, so this is evaluated as `y_pred == (y_test + 1)`
  - 1 is added to every element of `y_test`, and the result is compared elementwise with `y_pred`

In [None]:
print(y_pred == y_test + 1) # compare y_pred elementwise with y_test + 1
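The effect of operator precedence is easier to see on small made-up vectors. In this sketch, `a`, `b` and `c` are names introduced for illustration.

```python
import numpy as np

y_test = np.array([0, 1, 2])
y_pred = np.array([1, 1, 2])

a = (y_pred == y_test) + 1  # compare first, then add 1 to the booleans
b = y_pred == y_test + 1    # add 1 to y_test first, then compare
c = y_pred == (y_test + 1)  # same as b, with the precedence written out

print(a)                     # [1 2 2]
print(b)                     # [ True False False]
print(np.array_equal(b, c))  # True
```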

- `[0]` returns a single element from the array, which means it is a scalar
- Adding 1 converts that boolean to an integer (`True` becomes 2, `False` becomes 1), which is still a scalar

In [None]:
print((y_pred==y_test)[0]) 
print((y_pred==y_test)[0]+1) 

- `y_pred==y_test` returns an array of booleans; `np.mean` reduces it to a single number, which is a scalar

In [None]:
np.mean(y_pred==y_test)

## Question 6
As you know, the scikit-learn function train_test_split splits the dataset (after shuffling) into two parts: training and test.  In what proportion does it do it?  (training:test)
- *3:1*

## Question 7
Answer the last question discussed in this week's slides.  Namely, the training set is:
- positive samples: 0 and 1
- negative samples: 10 and 11.
The test sample is 12.  Compute (to one decimal place) the two p-values using the distance to the nearest sample of the same class as nonconformity score.
- For postulated label +1 (positive), the p-value is *0.2*
- For postulated label -1 (negative), the p-value is *1*
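The two p-values can be reproduced with a short plain-Python sketch. It assumes the p-value is the fraction of samples (training plus test) whose nonconformity score is at least the test sample's score; `nonconformity` and `p_value` are hypothetical helper names, and the code expects every class to contain at least two samples.

```python
def nonconformity(index, samples):
    """Distance from samples[index] to the nearest other sample with the same label."""
    x, label = samples[index]
    return min(
        abs(x - other_x)
        for j, (other_x, other_label) in enumerate(samples)
        if j != index and other_label == label
    )

def p_value(training, test_x, postulated_label):
    """Fraction of samples whose score is at least the test sample's score."""
    samples = training + [(test_x, postulated_label)]
    scores = [nonconformity(i, samples) for i in range(len(samples))]
    return sum(s >= scores[-1] for s in scores) / len(samples)

training = [(0, +1), (1, +1), (10, -1), (11, -1)]
print(p_value(training, 12, +1))  # 0.2
print(p_value(training, 12, -1))  # 1.0
```

With the label +1, the test sample is 11 away from its nearest positive neighbour while every other score is 1, so only the test sample itself counts and the p-value is 1/5. With the label -1, all five scores equal 1, giving 5/5.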

## Question 8
This question is about the iris dataset.  What is the error rate of the KNN with K=5 when using the standard train_test_split function? *0.026*

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5) # create a new model
knn.fit(X_train, y_train) # fit the model using the training data and training targets
results = knn.score(X_test, y_test) # calculate the accuracy of the prediction
print("Accuracy: ", results) # print the accuracy
print("Error Rate: ", 1 - results) # print the error rate
