#### Important points
- Most simple machine learning algorithm. It doesn't actually learn anything.
- This algorithm rely on distance between the feature vectors(in case of image it's the raw RGB pixel intensities).
- The kNN algorithm classifies unknown data points by finding the *most common class* among the *k closest examples*.
- The primary _assumption_ is that image with same visual contect should lie closer.
- To find the similarity, we need to slect a distance metric or similarity function. 
    - L2 *or* Eucledian distance
    - L1 *or* Manhattan distance


#### k-NN Hyperparameters
There are two hyper parameters associated with k-NN.
- The value of k.
- Chosing between L1 and L2 distance. Questioning which one is the best?

#### Implementing k-NN using Animals dataset
- Step 1 - **Gather our datset** - The Animals dataset consists of 3000 images with 1000 images per dog, cat, and pandas class. Represnted in RGB colorspace. Preprocess image by resizing it to 32 $\times$ 32 pixels. Each image in  dataset is represented by 32 $\times$ 32 $\times$ 3 = 3072 integers.
- Step 2 - **Split the dataset** - Two split for now: training and testing. Sometimes we split training into validation set also.
- Step 3 - **Train the classifier** - Our k-NN classifier will be traine d on the law pixel intensities of the images in the training set.
- Step 4 - **Evaluation on the test set**

In [1]:
# import the necessary packages
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from preprocessing.simplepreprocessor import SimplePreprocessor
from datasets.simpledatasetloader import SimpleDatasetLoader
from imutils import paths
import argparse

# construct argument parse and parse the arguments
#ap = argparse.ArgumentParser()
#ap.add_argument("-d", "--dataset", required=True, help="path to input datset")
#ap.add_argument("-k", "--neighbors", type=int, default=1, help="# of nearest neighbors for classification")
#ap.add_argument("-j", "--jobs", type=int, default=-1, help="# of jobs fpr k-NN distance (-1 uses all available cores)")
#args = vars(ap.parse_args())

# arguments
args = {"dataset" : "I:\ARVision\datasets\dataset", "neighbors" : "3", "jobs" : "-1"}

print("[INFO] loading images...")
imagePaths = list(paths.list_images(args["dataset"]))

# initialize the image preprocessor, load the datset from disk, and reshape the data matrix
sp = SimplePreprocessor(32, 32)
sdl = SimpleDatasetLoader(preprocessors=[sp])
(data, labels) = sdl.load(imagePaths, verbose=500)
data = data.reshape((data.shape[0], 3072))

# show some information about images
print("[INFO features matrix: {:.1f}MB".format(data.nbytes / (1024*1000.0)))

# encoding the labels as integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# partition the data into training and testing splits
(trainX, testX, trainY, testY)  = train_test_split(data, labels, test_size=0.25, random_state=42)

# train and evaluate a k-NN classifier on raw pixel intensities
print("[INFO] evaluating k-NN classifier...")
model = KNeighborsClassifier(n_neighbors=args["neighbors"], n_jobs=args["jobs"])
model.fit(trainX, trainY)
print(classification_report(testY, model.predict(testX), target_names=le.classes_))

[INFO] loading images...
[INFO] processed 500/2000
[INFO] processed 1000/2000
[INFO] processed 1500/2000
[INFO] processed 2000/2000
[INFO features matrix: 6.0MB
[INFO] evaluating k-NN classifier...


TypeError: '<' not supported between instances of 'str' and 'int'

#### Pros and cons of k-NN
- Extremely simple to implement and understand.
- We simply store our data points for computing distances to them and obtaining our final classification.
- However, this is not effecient because classifying every **test data point** requires comparison to every **training data point**, which scales to $O(n)$, thereby making working with large data points prohibitive.
- This time cost can be combat by **Approximate Nearest Neighbor** algorithms like kd-trees, FLANN etc. but using these requires us to trade our time complexity for accuracy fro nearest neighbor.
- k-NN algorithms are more suited for low dimensional feature space, which images are not. $[70]$


#### k-NN in aspects of data of large size in GB/TB
- The most important issue is that k-NN stores **replica of whole training data** in the model. 
- Why keeping replica is an issue? Imagine size of data to be in GB/TB and think!
- The more desirable approach would be defining a machine learning model that can **learn patterns from training data**, but have benefit of being **defined by small number of parameters** regardless of size of training data.