In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Splitting Dataset into Training Set & Test Set

- `X_train` - Matrix of Features of the training set
- `X_test` - Matrix of Features of the testing set
- `y_train` - Dependent Variable Vector of the training set
- `y_test` - Dependent Variable Vector of the testing set

- `train_test_split(matrix_of_features, dependent_variable_vector, test_size=, random_state=)` returns a list containing the split dataset
  - `test_size` - size of the test set given as a decimal fraction; when neither `test_size` nor `train_size` is given, it defaults to 0.25 (leaving 0.75 for the training set)
  - `random_state` - a seed given so that the random split is consistent across multiple runs
  - The dataset is split into:
    - `X_train`, `X_test`, `y_train`, `y_test`
    - The split is done at random (pseudo-random) because the samples could be stored in order of their labels (the dependent variable vector)
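
The points above can be checked with a minimal sketch. It assumes the iris dataset used later in this notebook; the short names `X_tr`, `X_te`, `y_tr`, `y_te` and the seed `42` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris["data"], iris["target"]

# The iris labels are stored sorted by class, which is why the data
# is shuffled before it is split.
print(y[:5], y[-5:])

# Default split: 75% training, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_te.shape)

# Explicit 20% test set (seed 42 is an arbitrary choice);
# the same seed produces the same split on every call.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With 150 samples, the default split yields 112 training rows and 38 test rows, because the test size is rounded up.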

**Python Documentation**
```py
(function) train_test_split: (*arrays: Any, test_size: Any | None = None, train_size: Any | None = None, random_state: Any | None = None, shuffle: bool = True, stratify: Any | None = None) -> list[Any | list]
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

Read more in the User Guide <cross_validation>.

Parameters
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

test_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float or int, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, default=None
    Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary <random_state>.

shuffle : bool, default=True
    Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide <stratification>.

Returns
splitting : list, length=2 * len(arrays)
    List containing train-test split of inputs
```

In [None]:
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)

**Matrix of Features for Training Set**

In [None]:
print(X_train.shape)
print(X_train)

**Matrix of Features for Test Set**

In [None]:
print(X_test.shape)
print(X_test)

**Dependent Variable Vector for Training Set**

In [None]:
print(y_train.shape)
print(y_train)

**Dependent Variable Vector for Test Set**

In [None]:
print(y_test.shape)
print(y_test)

# Building First Model

- All machine learning models in `scikit-learn` are implemented in their own classes, which are parts of modules
- The K Nearest Neighbours classification algorithm is implemented in the `KNeighborsClassifier` class in the `neighbors` module
- Before we can use the model, we need to instantiate the class into an object.
- This is when we will set any parameters of the model
  - The most important parameter of `KNeighborsClassifier` is the number of neighbours, `n_neighbors`
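
As a quick illustration of setting parameters at instantiation, the sketch below inspects the parameters of a classifier through `get_params()`, which every scikit-learn estimator provides. The variable names `knn_default` and `knn_one` are illustrative only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Instantiate with library defaults and list every settable parameter.
knn_default = KNeighborsClassifier()
params = knn_default.get_params()
print(sorted(params))
print(params["n_neighbors"])  # default number of neighbours

# Parameters are fixed when the object is created.
knn_one = KNeighborsClassifier(n_neighbors=1)
print(knn_one.get_params()["n_neighbors"])
```

`n_neighbors` defaults to 5 when it is not given, so passing `n_neighbors=1` is a deliberate choice rather than the library default.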

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1) # n_neighbors is the number of neighbors to use (1)

In [None]:
knn.fit(X_train, y_train)

# Making Predictions

- A new iris sample is created as a matrix of features with a single row (one sample, four features)

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]]) # sample data to predict
X_new.shape

- A prediction is made for the new sample to determine its species of iris
  - `predict` returns an array holding the class index of the predicted species

In [None]:
prediction = knn.predict(X_new) # predict the class of the new data point
print(prediction) # prints the predicted class

- The prediction (index of the class) is used to find the name associated with that class index
  - `iris['target_names']` returns the names of all the classes -> `['setosa' 'versicolor' 'virginica']`
  - `iris['target_names'][prediction]` returns the name of the predicted class as a one-element array -> `['setosa' 'versicolor' 'virginica'][[0]]` -> `['setosa']`

In [None]:
print(iris['target_names']) # print all the names of the classes
print(iris['target_names'][prediction]) # print the name of the class
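
Because `predict` returns an array, the name lookup also returns an array. The sketch below contrasts array indexing with scalar indexing; the hard-coded `prediction` is a stand-in that assumes the model predicted class `0`, and the names `names` and `name` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
target_names = iris["target_names"]

# Stand-in for the output of knn.predict(X_new); class 0 is assumed here.
prediction = np.array([0])

names = target_names[prediction]    # array index  -> array of names
name = target_names[prediction[0]]  # scalar index -> a single name
print(names)  # ['setosa']
print(name)   # setosa
```

Indexing with `prediction[0]` is convenient when a plain string is needed, for example in a formatted message.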

# Evaluating the Model

## Making a Prediction

- Make a prediction for every iris in the test data so that each one can be compared against its known label (species)

In [None]:
y_pred = knn.predict(X_test) # predict the class of the test data
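
One way to carry out the comparison is an element-wise equality check between predictions and labels. The sketch below rebuilds the same split and model so that it runs on its own; `matches` and `wrong` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# True where the predicted class equals the known label.
matches = y_pred == y_test
print(matches[:10])

# List any misclassified test samples: index, predicted name, true name.
wrong = np.flatnonzero(~matches)
for i in wrong:
    print(i, iris["target_names"][y_pred[i]], iris["target_names"][y_test[i]])
```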

## Measure Accuracy of Prediction

- The accuracy of the model can be measured
- Accuracy is how many of the predictions match the known labels of the test set
  - the fraction of flowers for which the right species was predicted
- There are 2 ways of measuring accuracy, both of which return the same value:
  - [Numpy Mean](#numpy-mean)
  - [K-Nearest Neighbour](#k-nearest-neighbour) (the `score` method of the `knn` object)

### Numpy Mean

In [None]:
np.mean(y_pred == y_test) # calculate the accuracy of the prediction

### K-Nearest Neighbour

In [None]:
knn.score(X_test, y_test) # calculate the accuracy of the prediction

### Checking If Both Methods Return the Same Value

In [None]:
print(np.isclose(np.mean(y_pred == y_test), knn.score(X_test, y_test))) # check if the two methods give the same result (compared with a float tolerance)
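
A third route to the same number is `accuracy_score` from `sklearn.metrics`, which is not used elsewhere in this notebook. The sketch below rebuilds the model so that it runs on its own and compares the three values with `np.isclose`, since comparing floating-point results with a tolerance is generally safer than `==`. The names `acc_mean`, `acc_score` and `acc_metric` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc_mean = np.mean(y_pred == y_test)         # fraction of correct predictions
acc_score = knn.score(X_test, y_test)        # estimator's built-in accuracy
acc_metric = accuracy_score(y_test, y_pred)  # metrics-module equivalent

all_agree = bool(
    np.isclose(acc_mean, acc_score) and np.isclose(acc_score, acc_metric)
)
print(all_agree)
```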

# Loading Data from File