# **k-nearest neighbors algorithm**

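In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. For classification, a sample is assigned the class most common among its k nearest neighbors; for regression, the output is the average of the neighbors' values.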
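To make the voting idea concrete, here is a minimal sketch of k-NN classification written directly in NumPy. It is only an illustration and not how scikit-learn implements the algorithm; the function name `knn_predict`, the Euclidean distance choice and the toy points are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query point to every training row.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled 0 and 1.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 0.8], [0.8, 1.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(train_X, train_y, np.array([0.15, 0.1]), k=3)
label_near_corner = knn_predict(train_X, train_y, np.array([0.9, 0.9]), k=3)
print(label_near_origin, label_near_corner)
```

With an odd k and two classes a tie cannot occur; for other settings a tie-breaking rule would have to be chosen.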



## **Importing libraries**

We import pandas as *pd* and NumPy as *np*, KNeighborsClassifier from sklearn.neighbors, train_test_split and cross_val_score from sklearn.model_selection, confusion_matrix and accuracy_score from sklearn.metrics, and the preprocessing module from sklearn.

In [7]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing

## **Loading the Data**

Read the .csv data into a DataFrame, then print the first rows, the summary statistics and the pairwise correlations.

In [8]:
data = pd.read_csv("credit_data.csv")
print(data.head())
print(data.describe())
print(data.corr())

   clientid        income        age         loan       LTI  default
0         1  66155.925095  59.017015  8106.532131  0.122537        0
1         2  34415.153966  48.117153  6564.745018  0.190752        0
2         3  57317.170063  63.108049  8020.953296  0.139940        0
3         4  42709.534201  45.751972  6103.642260  0.142911        0
4         5  66952.688845  18.584336  8770.099235  0.130989        1
          clientid        income          age          loan          LTI  \
count  2000.000000   2000.000000  2000.000000   2000.000000  2000.000000   
mean   1000.500000  45331.600018    40.927143   4444.369695     0.098403   
std     577.494589  14326.327119    13.262450   3045.410024     0.057620   
min       1.000000  20014.489470    18.055189      1.377630     0.000049   
25%     500.750000  32796.459717    29.062492   1939.708847     0.047903   
50%    1000.500000  45789.117313    41.382673   3974.719419     0.099437   
75%    1500.250000  57791.281668    52.596993   6432.4

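## **Feature Extraction**

We select income, age and loan as the features and default as the target, rescale the features to the [0, 1] range with MinMaxScaler, and split the data into training and test sets.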

In [9]:
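# Select the three predictor columns and the target column.
# Note: assigning to data.features / data.target stores plain attributes on the
# DataFrame rather than new columns, so pandas emits a UserWarning here.
data.features = data[["income","age","loan"]]
data.target = data.default

# Rescale every feature to the [0, 1] range so no single column dominates the distance.
data.features = preprocessing.MinMaxScaler().fit_transform(data.features)

# With default arguments, train_test_split holds out 25% of the rows for testing.
feature_train, feature_test, target_train, target_test = train_test_split(data.features, data.target)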
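Min-max scaling maps each column to [0, 1] with (x - min) / (max - min). The sketch below spells that formula out in NumPy. Note that it is a variation on the cell above, not a copy of it: it takes the minimum and maximum from the training rows only and reuses them for the test rows, whereas the notebook scales the whole dataset before splitting. The helper `min_max_scale` and the toy numbers are hypothetical.

```python
import numpy as np

# Hypothetical toy rows standing in for the income, age and loan columns.
train = np.array([[20000.0, 18.0, 1000.0],
                  [40000.0, 40.0, 5000.0],
                  [60000.0, 62.0, 9000.0]])
test = np.array([[50000.0, 29.0, 3000.0]])

# Column-wise minimum and maximum, taken from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def min_max_scale(x, lo, hi):
    """Apply (x - lo) / (hi - lo) column by column."""
    return (x - lo) / (hi - lo)


train_scaled = min_max_scale(train, col_min, col_max)
# The test rows reuse the training minimum and maximum.
test_scaled = min_max_scale(test, col_min, col_max)
print(test_scaled)
```

Scaled test values can fall outside [0, 1] when a test row lies beyond the range seen in training; that is expected with this approach.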



## **Training the Model**

We use the KNeighborsClassifier model imported from sklearn.neighbors, train it on feature_train and target_train, and then predict labels for feature_test.

In [10]:
# Classify each test sample by majority vote among its 32 nearest training samples.
model = KNeighborsClassifier(n_neighbors=32)
fitModel = model.fit(feature_train, target_train)
predictions = fitModel.predict(feature_test)

## **Finding Optimal K Value**

We use cross-validation to find the optimal k value. The accuracy obtained this way may be lower than the score some values of k reach on a single train/test split, but it is a more realistic estimate because every score is averaged over several folds. We loop over k values from 1 to 99, compute the mean 10-fold cross-validation accuracy for each, and collect the results in a list.


In [11]:
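cross_valid_scores = []

# Evaluate each k from 1 to 99 with 10-fold cross-validation and store the mean accuracy.
for k in range(1, 100):
  knn = KNeighborsClassifier(n_neighbors=k)
  scores = cross_val_score(knn, data.features, data.target, cv=10, scoring='accuracy')
  cross_valid_scores.append(scores.mean())

# np.argmax returns the zero-based position of the best score in the list;
# because the loop starts at k = 1, the k behind that score is the printed position plus 1.
print("Optimal k with cross-validation: \t",np.argmax(cross_valid_scores))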

Optimal k with cross-validation: 	 32
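Because np.argmax reports a position in the score list rather than a k value, it can help to keep the candidate k values in their own list and look the winner up by position. The sketch below shows the idea; the scores are made-up numbers for illustration, not results from this dataset, and the names `k_values`, `mean_scores` and `best_k` are assumptions.

```python
import numpy as np

# Candidate k values, starting at 1 as in the loop above.
k_values = list(range(1, 6))

# Hypothetical mean accuracies, one per candidate k (illustrative only).
mean_scores = [0.90, 0.93, 0.97, 0.95, 0.94]

# Zero-based position of the highest score.
best_index = int(np.argmax(mean_scores))

# Map the position back to the k that produced it.
best_k = k_values[best_index]

print("best position:", best_index, "best k:", best_k)
```

Here the best score sits at position 2, which corresponds to k = 3, one more than the position because the candidates start at 1.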


## **Printing an Error Matrix and Accuracy Score**

In [12]:
print(confusion_matrix(target_test,predictions))
print(accuracy_score(target_test,predictions))

[[423   2]
 [ 14  61]]
0.968
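The accuracy can be checked by hand from the matrix above. In scikit-learn's confusion_matrix, rows are the true classes and columns are the predicted classes, so for labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The sketch below recomputes accuracy and, treating default = 1 as the positive class (an assumption made for this example), also derives precision and recall.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[423, 2],
               [14, 61]])

# Unpack in row-major order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test samples classified correctly
precision = tp / (tp + fp)        # share of predicted defaults that were real defaults
recall = tp / (tp + fn)           # share of real defaults that were caught

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The accuracy comes to 484 / 500 = 0.968, matching the printed score. Recall for the default class is 61 / 75, about 0.813, so 14 of the 75 actual defaults in the test set were missed even though overall accuracy is high.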
