### Nearest Neighbors Classification

https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py


##### load modules

In [33]:
#Load Modules
import sklearn.neighbors as nei
import pandas as pd


##### load data

In [34]:
df = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
df.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


#### visualise

In [35]:
import seaborn as sns
sns.pairplot(df,hue='variety')

<seaborn.axisgrid.PairGrid at 0x1a49c8f0b20>

In [36]:
df.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

#### Inputs and outputs

In machine learning, inputs are the data provided to a model for learning or making predictions, while outputs are the model's predictions or responses based on that input data. Inputs typically consist of features or variables that influence the model's behavior, and outputs are the model's responses, such as classifications or numerical predictions.

In [37]:
inputs = df[['sepal.length', 'sepal.width', 'petal.length', 'petal.width']]
outputs = df['variety']
outputs

0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: variety, Length: 150, dtype: object

#### Classifier

In [38]:
knn = nei.KNeighborsClassifier(n_neighbors=5) #classification accrding to the 5 nearest neighbours

In above code:

- `nei` refers to the scikit-learn module for neighbors.
- `KNeighborsClassifier` is a machine learning model for classification based on the k-nearest neighbors algorithm.
- `n_neighbors=5` sets the number of neighbors to consider when making predictions. In this case, it's set to 5, meaning the model will look at the labels of the 5 nearest neighbors to make a prediction.

After creating this `knn` object, you would typically proceed to train it on a dataset using the `fit` method and then use it to make predictions on new data. For example:

```python
# X_train and y_train are your training features and labels
knn.fit(X_train, y_train)

# Make predictions on new data
predictions = knn.predict(X_new)

In [39]:
knn.fit(inputs, outputs)

KNeighborsClassifier()

#### Predict

In [40]:
df.loc[105]

sepal.length          7.6
sepal.width           3.0
petal.length          6.6
petal.width           2.1
variety         Virginica
Name: 105, dtype: object

In [41]:
knn.predict([[8.1, 3.5, 6.5, 1.5]]) #predict species with the petal measurements given


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


array(['Virginica'], dtype=object)

In [42]:
#test with multilpe data
knn.predict([[8.1, 3.5, 6.5, 1.5],[2.1, 2.5, 5.5, 1.5],[0.1, 0.5, 0.5, 0.5]])

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


array(['Virginica', 'Versicolor', 'Setosa'], dtype=object)

#### Evaluate

In [43]:
#test predicted outputs with actual count how many didn't match by
(knn.predict(inputs)!=outputs).sum()

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


5

Randomly split a dataset into training and testing sets. 

- **inputs and outputs:** These are the feature matrix (input data) and target variable (output or label), respectively.
- **test_size=50:** This parameter determines the size of the testing set. In this case, it's set to 50, meaning 50 samples will be reserved for testing, and the rest will be used for training.
- **inputs_train, inputs_test, outputs_train, outputs_test:** These variables will hold the training and testing sets for both the input features and output labels after the split.


In [61]:
import sklearn.model_selection as mod

inputs_train, inputs_test, outputs_train, outputs_test = mod.train_test_split(inputs, outputs, test_size=50)
inputs_train

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
55,5.7,2.8,4.5,1.3
97,6.2,2.9,4.3,1.3
121,5.6,2.8,4.9,2.0
54,6.5,2.8,4.6,1.5
11,4.8,3.4,1.6,0.2
...,...,...,...,...
110,6.5,3.2,5.1,2.0
71,6.1,2.8,4.0,1.3
59,5.2,2.7,3.9,1.4
77,6.7,3.0,5.0,1.7


Re-train with subsection of data:

In [62]:
knn = nei.KNeighborsClassifier(n_neighbors=5)
knn.fit(inputs_test, outputs_test)


KNeighborsClassifier()

In [67]:
#check how many mistakes:
mistakes=(knn.predict(inputs_test)!=outputs_test).sum()
print(f"mistakes= {mistakes}")

mistakes= 1


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
