# Homework #2
This task deals with the k-nearest neighbours (KNN) method.

In [23]:
# Various libraries used in this exercise

import math
import pandas as pd
from pandas import read_csv
import numpy as np
import operator

# importing the hamming and euclidean distance functions
from scipy.spatial.distance import hamming , euclidean

# also found a function for calculating manhattan distance
def manhattan(p, q):
    """ 
    Return the manhattan distance between points p and q
    assuming both to have the same number of dimensions
    """
    # sum of absolute difference between coordinates
    distance = 0
    for p_i,q_i in zip(p,q):
        distance += abs(p_i - q_i)
    return distance




## Warmup Exercise tasks:

1. Import the relevant data from a CSV file
 * print the first two vectors in the file
 * print the Euclidean distance between the two vectors
2. Split the data into the data vectors and the tags
3. Classify a new vector of data according to the **KNN Algorithm**
 * the vector to classify in this case is [0,0,100], but we are going to write a generic function for that. The function should be able to:
  1. take a dataset, a distance calculation method, the number of nearest neighbours k and a vector to classify.
  2. calculate the distance (with the selected distance calculation method: Euclidean, Hamming or Manhattan)
  3. create a list of distances for every vector in the dataset and sort it by distance. Every distance should be linked to its related tag.
  4. take the first k elements in the list and find the majority tag among them. According to the majority, the classification of the given vector will be printed.

 * print a generic message for the classification of the vector. It would also be useful for this function to return whether the prediction was correct or not (for later use in this homework)
  1. check k=1 and k=3 for each of the distance calculation methods (6 checks total).
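
The steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the cells below: the names `euclidean_dist`, `manhattan_dist`, `hamming_dist`, `knn_vote` and the toy data are hypothetical, and the Hamming distance is assumed to be the fraction of coordinates that differ (the convention `scipy.spatial.distance.hamming` follows).

```python
import math
from collections import Counter

# Hypothetical helper names, written out so the sketch has no dependencies.
def euclidean_dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_dist(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming_dist(p, q):
    # Assumption: fraction of coordinates that differ
    return sum(a != b for a, b in zip(p, q)) / len(p)

# Same short method codes as the notebook: "e", "m", "h"
DISTANCES = {"e": euclidean_dist, "m": manhattan_dist, "h": hamming_dist}

def knn_vote(data, target, point, k=1, method="e"):
    """Return the majority tag among the k vectors closest to point."""
    dist = DISTANCES[method]
    # Link every distance to its tag, then sort by distance only
    ranked = sorted(
        ((dist(d, point), t) for d, t in zip(data, target)),
        key=lambda pair: pair[0],
    )
    # Count the tags of the k nearest neighbours and take the most common
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up toy data, not taken from the homework files
toy_data = [[0, 0], [1, 1], [9, 9], [10, 10], [11, 11]]
toy_tags = ["A", "A", "B", "B", "B"]
print(knn_vote(toy_data, toy_tags, [0.5, 0.5], k=3))  # prints A
```

For the first two vectors of the small file, `[0, 1, 2]` and `[1, 5, 6]`, these helpers give a Manhattan distance of 9 and a Hamming distance of 1.0, since all three coordinates differ.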


In [24]:
# warmup code

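def get_csv_data(url : str , n = 0 ):
  """
  This function reads a csv file from the given url, prints the first
  n rows, and returns the data vectors and the target vector
  """
  # on_bad_lines="skip" replaces error_bad_lines=False, which was
  # deprecated in pandas 1.3 and removed in pandas 2.0
  df = pd.read_csv(url,  header=0 , on_bad_lines="skip" ) 
  # header=0 consumes the header line, so the array holds only data rows
  dataset = np.array(df)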
  data = []
  target = []

  for i in range(n):
    print(dataset[i])

  for p in dataset:
    data.append(p[0:-1])
    target.append(p[-1])
  

  return data , target


# creating a basic class for saving the distance for every point
# in the data while saving the related tag (target) value 
class myDist():
  def __init__(self , dist  : float , tag : str):
    self.dist = dist
    self.tag = tag



Now let's try to implement the basic KNN algorithm with the following function:

In [25]:

def my_classify( data , target , point , k = 1  , dist_calc_method = "e"):
  """
  This function classifies an unknown point using one of three distance
  calculation methods:
  1. euclidean ("e")
  2. manhattan ("m")
  3. hamming ("h")

  The function takes a data matrix, a target vector and a point vector of the
  same length as the rows of the data, and tries to predict the tag for the
  given point. Any other method code falls back to euclidean.
  """
  

  results = {}
  cls_list = []
  method = dist_calc_method

  # default to euclidean so the_method is always defined
  the_method = "euclidean"
  if method == "m":
    the_method = "manhattan"
  elif method == "h":
    the_method = "hamming"

  ###################################
  # calculating the distances
  for d , t in zip(data,target):
    if method == "e":
      dist = euclidean(d , point)
    elif method == "m":
      dist = manhattan(d , point)
    elif method == "h":
      dist = hamming(d , point)
    else:
      dist = euclidean(d , point)
    
    cls_list.append(myDist(dist , t))
    results[t] = 0
  
  ###################################
  # sorting the list
  cls_list.sort(key=operator.attrgetter('dist'))

  # count the tags of the k nearest points to classify the unknown point
  for cur_my_dist in cls_list[:k]:
    # every tag was initialised to 0 in the loop above, so this lookup
    # cannot fail; slicing also copes with k larger than the dataset
    results[cur_my_dist.tag] += 1

  # the tag with the most votes wins (the first such tag on a tie)
  prediction = max(results, key=results.get)

  return prediction , the_method





In this basic implementation we are using all the functions from above with the small dataset from the csv file:

In [26]:

#### Basic Implementation ####

url = 'https://github.com/rosenfa/ai/blob/master/myFile.csv?raw=true'
X , y = get_csv_data(url , 2)

# print the distance between the first two vectors
print(euclidean(X[0] , X[1]))

print("*"*80)
new_point = [0,0,100]

methods = ["e" , "m" , "h"] # run the algorithm for every distance calculation method
ks = [1,3] # and classify new_point for two different values of k

for mt in methods:
  for k in ks:
  
    print("\n")
    prd , used_method = my_classify(X , y , new_point , k = k , dist_calc_method = mt)
    print(f"The point prediction is: {prd}, ( k : {k} , distance calculation method : {used_method} ")
    

[0 1 2 'F']
[1 5 6 'F']
5.744562646538029
********************************************************************************


The point prediction is: M, ( k : 1 , distance calculation method : euclidean 


The point prediction is: M, ( k : 3 , distance calculation method : euclidean 


The point prediction is: M, ( k : 1 , distance calculation method : manhattan 


The point prediction is: F, ( k : 3 , distance calculation method : manhattan 


The point prediction is: F, ( k : 1 , distance calculation method : hamming 


The point prediction is: F, ( k : 3 , distance calculation method : hamming 






## Using the warmup functions on larger datasets
Now we are going to apply the functions we wrote for the small dataset to somewhat larger datasets, in the following steps:

1. divide the [train set](https://github.com/rosenfa/ai/blob/master/mytrain.csv?raw=true) and [test set](https://github.com/rosenfa/ai/blob/master/mytest.csv?raw=true) into data and target (X_train , X_test and y_train , y_test).
2. check each vector in the test set against the train set, using the function from the warmup section. We count the **number of times we were right** in our predictions against the **number of times we tried**. Dividing the first by the second gives the **accuracy** of our model, a fraction between 0 and 1. This section should print a generic message, something like:
```
For k=3 (nn), using Hamming distance, the accuracy is: 0.34
```
3. implement this for k=1, k=7, k=15.
Once we are done, we should have a total of 9 different results (k=1,7,15 for E,H,M distances)
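
The accuracy computation from step 2 can be illustrated on its own with a small sketch. The helper name `accuracy` and the prediction and tag lists are made up for the example and are not results from the homework data.

```python
def accuracy(predictions, true_tags):
    """Return the fraction of predictions that match the true tags."""
    if len(true_tags) == 0:
        raise ValueError("cannot compute accuracy of an empty test set")
    # number of times we were right / number of times we tried
    correct = sum(p == t for p, t in zip(predictions, true_tags))
    return correct / len(true_tags)

# Made-up predictions and tags, for illustration only
preds = ["F", "M", "M", "F"]
truth = ["F", "M", "F", "F"]
acc = accuracy(preds, truth)
print(f"For k=3, using hamming distance, the accuracy is: {round(acc, 3)}")
# prints: For k=3, using hamming distance, the accuracy is: 0.75
```

Three of the four made-up predictions match, so the sketch reports 0.75.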






In [27]:
def knn_model(train_data , train_tags , test_data , test_tags , knn = 3 ,clac_method = "e"):
  """
  This function takes train and test datasets and, for a given k and distance
  calculation method [e,m,h], predicts a tag for each point in the test dataset.
  It counts the number of correct predictions and returns the accuracy of the
  model and the name of the distance calculation method used
  """

  correct = 0
  tries = 0


  for point , tag in zip(test_data , test_tags):
    # using the function from the warmup section
    prd , used = my_classify(train_data , train_tags , point , k=knn , dist_calc_method=clac_method)

    #calculate the accuracy - checking for a correct prediction
    if prd == tag:
      correct += 1
    tries +=1
  

  ac = correct / tries
  return ac , used









Now we can use the `knn_model()` function from above with the given train and test csv files:

In [28]:

train_url = "https://raw.githubusercontent.com/rosenfa/ai/master/mytrain.csv"
test_url = "https://github.com/rosenfa/ai/blob/master/mytest.csv?raw=true"

# splitting the data with the first function we wrote
X_train , y_train = get_csv_data(train_url)
X_test , y_test = get_csv_data(test_url)

# checking every calculation method with three values of k
methods = ["e" , "m" , "h"]
knns = [1,7,15]


for mt in methods:
  for knn in knns:
    acc , used_method = knn_model(X_train , y_train , X_test , y_test , knn=knn , clac_method =mt)
    print(f"For k={knn}, using {used_method} distance, the accuracy is: {round(acc , 3)}")






For k=1, using euclidean distance, the accuracy is: 0.5
For k=7, using euclidean distance, the accuracy is: 0.74
For k=15, using euclidean distance, the accuracy is: 0.7
For k=1, using manhattan distance, the accuracy is: 0.61
For k=7, using manhattan distance, the accuracy is: 0.63
For k=15, using manhattan distance, the accuracy is: 0.69
For k=1, using hamming distance, the accuracy is: 0.61
For k=7, using hamming distance, the accuracy is: 0.55
For k=15, using hamming distance, the accuracy is: 0.57
