# K-NN ANALYSIS PROJECT

## First, we import the needed modules and write our functions.

We will use pandas to open our dataset, which is in .csv format, and to manipulate it so that it works with our KNN function.

We will use numpy to turn our pandas dataframe into an array and to perform mathematical operations on it.

We also import random. Note that the shuffling before the train/test split below is done with np.random.permutation, so the random module itself is not called in the code shown here.

In [1]:
import numpy as np
import pandas as pd
import random as rd

Now we will write our main "K-NN" algorithm as a function. It finds the k nearest neighbors of our query point, looks at their labels, and returns the most common label.

In this function, X holds the features of the "train set" (K-NN has no actual training step; the points are only stored and compared against). Y holds the labels of the same train set. query_point is the point whose label we want to predict. k is the number of neighbors. weights is meant for weighted KNN, but I didn't use it in this assignment.

In [2]:
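def knn(X, Y, query_point, k, weights=None):
    """
    Predicts the label of a query point by a vote among its k nearest neighbours
    """
    # Squared Euclidean distance to every row of X, expanded as |x|^2 + |q|^2 - 2 * q.x
    sq_distances = np.sum(X**2, axis=1) + np.sum(query_point**2) - 2 * np.dot(query_point, X.T)

    nearest = np.argsort(sq_distances)[:k]  # Indices of the k closest points
    nearest_labels = [Y[i] for i in nearest]

    if weights is None:
        # Unweighted vote: the most frequent label wins
        return max(set(nearest_labels), key=nearest_labels.count)
    else:
        # Weighted vote (unused in this assignment): weights[i] is assumed to be
        # the vote strength of the i-th nearest neighbour
        votes = {}
        for label, weight in zip(nearest_labels, weights):
            votes[label] = votes.get(label, 0) + weight
        return max(votes, key=votes.get)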
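
The first line of the function works because the squared Euclidean distance can be expanded as |x - q|^2 = |x|^2 + |q|^2 - 2 * q.x. The following is a minimal sketch that checks this identity on a small made-up array. The names `X_toy` and `q_toy` and their values are assumptions for illustration only and are not part of the data set.

```python
import numpy as np

# Toy data (hypothetical): four 2-D points and one query point
X_toy = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [-2.0, 1.0]])
q_toy = np.array([1.0, 0.0])

# Expanded form, as used in knn(): |x|^2 + |q|^2 - 2 * q.x
expanded = np.sum(X_toy**2, axis=1) + np.sum(q_toy**2) - 2 * np.dot(q_toy, X_toy.T)

# Direct form: sum of squared coordinate differences
direct = np.sum((X_toy - q_toy) ** 2, axis=1)

# Indices of the 2 nearest points, closest first
nearest_two = np.argsort(expanded)[:2]
```

Since only the ordering of the distances matters for finding neighbours, the square root is never needed.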

Now we need a feature scaling function. Although it is named standardize, it performs min-max normalization: it rescales every feature to the 0-1 range. For example, our features can range from -3 to 3, but after normalization they range from 0 to 1.

In [3]:
def standardize(X):
    """
    Rescales every feature (column) to the [0, 1] range using min-max normalization
    """
    x_min = np.min(X, axis=0)
    x_max = np.max(X, axis=0)
    
    x_normalized = (X - x_min) / (x_max - x_min)
    
    return x_normalized
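
To make the difference between the two scaling methods concrete, here is a small sketch that applies both min-max normalization (what the function above computes) and z-score standardization (subtracting the mean and dividing by the standard deviation) to a made-up array. The array `X_toy` is hypothetical.

```python
import numpy as np

# Toy feature matrix (hypothetical): two features on different ranges
X_toy = np.array([[-3.0, 0.0], [0.0, 1.0], [3.0, 2.0]])

# Min-max normalization, as in standardize(): every column ends up in [0, 1]
min_max = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# Z-score standardization, for comparison: every column gets mean 0 and std 1
z_score = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
```

Both versions divide by zero when a column is constant, so a column in which every value is the same would need to be dropped or handled separately.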

## Now that we are done writing the functions, we are ready to import our file into Python and analyze it. We will use some features of pandas for this. 

First we import our .csv file into Python with pandas. Reading the file as UTF-8 produces some encoding errors, so we can either change the encoding to latin1 or ignore the errors. I chose to ignore them.

In [4]:
df = pd.read_csv('data.csv', encoding='utf-8', encoding_errors='ignore')  # Read the data file which is in csv format and store it in a dataframe

Now we will drop the Response Id column and any columns that contain missing values.

In [5]:
df.drop(['Response Id'], axis=1, inplace=True)  # Drop the Response Id column as it is not required
df.dropna(axis=1, inplace=True)  # Drop the columns with null values if any
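
Note that `axis=1` removes whole columns, not rows. The sketch below shows the difference on a made-up frame. The column names `Q1` and `Q2` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): an id column, a complete column, and a column with a gap
toy = pd.DataFrame({
    "Response Id": [101, 102, 103],
    "Q1": [1, -2, 0],
    "Q2": [3.0, np.nan, 1.0],
})

toy = toy.drop(["Response Id"], axis=1)  # Remove the identifier column
by_column = toy.dropna(axis=1)           # Drops Q2 entirely, keeps all 3 rows
by_row = toy.dropna(axis=0)              # Drops only the second row, keeps both columns
```

Dropping by column keeps every respondent but loses a feature, while dropping by row keeps every feature but loses a respondent.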

In our data, the personalities (labels) are strings. We need to turn them into integers, and we'll do this with a dictionary.

In [6]:
personality_labels = {'ESTJ':0, 'ENTJ':1, 'ESFJ':2, 'ENFJ':3, 'ISTJ':4, 'ISFJ':5, 'INTJ':6, 'INFJ':7, 'ESTP':8,
                      'ESFP':9, 'ENTP':10, 'ENFP':11, 'ISTP':12, 'ISFP':13, 'INTP':14, 'INFP':15}  # Dictionary to map the personality types to integers

df.Personality = [personality_labels[item] for item in df.Personality]  # Map the personality types to integers
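
The list comprehension above raises a KeyError if a label is missing from the dictionary. As an alternative sketch, `Series.map` does the same lookup but marks unknown labels as NaN so they can be inspected. The labels and the shortened `mapping` below are made up for illustration.

```python
import pandas as pd

# Toy labels (hypothetical) and a shortened mapping for illustration
toy_labels = pd.Series(["ESTJ", "INFP", "ESTJ", "XXXX"])
mapping = {"ESTJ": 0, "INFP": 15}

encoded = toy_labels.map(mapping)      # Unknown keys become NaN instead of raising KeyError
unknown = toy_labels[encoded.isna()]   # Rows whose label is missing from the mapping
```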

Now that the dataframe is prepared, we need to turn it into a numpy array. We'll use the .to_numpy() method for this.

In [7]:
numpy_array = df.to_numpy()  # Convert the dataframe to a numpy array

## We've made the necessary edits to our data and made it usable. Now it's time to split our dataset into pieces and test our KNN function using these pieces!

As stated in the assignment document, we divide our data set into 5 parts (folds). Each time, one of these parts is used as the test set and the other four together as the train set. We further divide each test set and train set into two parts: the independent variables (personality traits, all columns except the last) are assigned to X, and the dependent variable (the personality type, the last column) is assigned to Y.

In [8]:
numpy_array = np.random.permutation(numpy_array)  # Shuffle the data
folds = np.array_split(numpy_array, 5)
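
As a small illustration of how one test fold and its train set are built from these pieces, the sketch below uses a made-up 10-row array in place of the real data. The array `toy` and the choice `m = 1` are assumptions for illustration.

```python
import numpy as np

# Toy data (hypothetical): 10 rows, 2 feature columns plus a label column
toy = np.arange(30).reshape(10, 3)
toy_folds = np.array_split(toy, 5)   # A list of 5 arrays with 2 rows each

m = 1                                # Use the second fold as the test set
test_set = toy_folds[m]
# Joining the two list slices gives the other four folds, which are then stacked
train_set = np.concatenate(toy_folds[:m] + toy_folds[m + 1:])

X_train, Y_train = train_set[:, :-1], train_set[:, -1]  # Last column holds the label
```

Because `np.array_split` returns a plain Python list, `+` joins the two slices of folds before `np.concatenate` stacks them into one array.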

"Confusion Matrices" are created separately for each test set. For each separate test case in each test set, a label prediction is made by placing that case and train set into the KNN function. Label prediction is placed in the confusion matrix and TP, FP, TN, FN are calculated.
Based on the calculated TP, FP, TN and FN, accuracy, precision and recall are calculated for each of the 16 classes, and this step is repeated 5 times in total, once for each test set.

In [None]:
for k in (1,3,5,7,9):
	total_accuracies = []
	total_precisions = []
	total_recalls = []

	for m in range(5):
		accuracies = []
		precisions = []
		recalls = []

		test_set = folds[m]
		train_set = np.concatenate(folds[:m] + folds[m+1:])
		print(test_set)
		X_train = train_set[:, :-1]
		Y_train = train_set[:, -1]
		X_test = test_set[:, :-1]
		Y_test = test_set[:, -1]

		confusion_matrix = [[0] * 16 for _ in range(16)]
		for example in test_set:
			true_class = example[-1]
			predicted_class = knn(X_train, Y_train, example[:-1], k)
			# Record the prediction whether it is correct or not (row: true class, column: predicted class)
			confusion_matrix[true_class][predicted_class] += 1

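		for i in range(16):
			true_positives = confusion_matrix[i][i]
			# Column i: examples predicted as class i whose true class is different
			false_positives = sum(confusion_matrix[j][i] for j in range(16) if j != i)
			# Row i: examples of class i that were predicted as another class
			false_negatives = sum(confusion_matrix[i][j] for j in range(16) if j != i)
			# Cells outside row i and column i (r, c are used so the outer k is not shadowed)
			true_negatives = sum(confusion_matrix[r][c] for r in range(16) for c in range(16) if r != i and c != i)
			accuracy = (true_positives + true_negatives) / (true_positives + false_positives + false_negatives + true_negatives)
			# Guard against classes that never occur or are never predicted in this fold
			precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
			recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0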
			accuracies.append(accuracy)
			precisions.append(precision)
			recalls.append(recall)

The accuracy, precision and recall for each test set are the averages of the per-class values, and the values for the model at a given K are the averages over the 5 test sets.

In [None]:
		accuracy = sum(accuracies) / len(accuracies)
		precision = sum(precisions) / len(precisions)
		recall = sum(recalls) / len(recalls)

		print('Accuracy For Set ', m+1, ": ", accuracy, "K = {}".format(k))
		print('Precision For Set ', m+1, ": ", precision, "K = {}".format(k))
		print('Recall For Set ', m+1, ": ", recall, "K = {}".format(k))

		total_accuracies.append(accuracy)
		total_precisions.append(precision)
		total_recalls.append(recall)

	accuracy_for_k = sum(total_accuracies) / len(total_accuracies)
	precision_for_k = sum(total_precisions) / len(total_precisions)
	recall_for_k = sum(total_recalls) / len(total_recalls)

	print('Accuracy for K = {}:'.format(k), accuracy_for_k)
	print('Precision for K = {}:'.format(k), precision_for_k)
	print('Recall for K = {}:'.format(k), recall_for_k)
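
To illustrate how TP, FP, FN and TN are read from a confusion matrix whose rows are true classes and whose columns are predicted classes, here is a minimal sketch on a made-up 3-class matrix. The matrix `cm` and the helper `per_class_metrics` are hypothetical and only meant to show the bookkeeping.

```python
import numpy as np

# Toy confusion matrix (hypothetical): rows are true classes, columns are predicted classes
cm = np.array([[5, 1, 0],
               [2, 3, 1],
               [0, 0, 4]])

def per_class_metrics(cm, i):
    """One-vs-rest precision, recall and accuracy for class i."""
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp      # Predicted as i, but the true class differs
    fn = cm[i, :].sum() - tp      # True class is i, but predicted as something else
    tn = cm.sum() - tp - fp - fn  # Everything outside row i and column i
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / cm.sum()
    return precision, recall, accuracy

# Average of the one-vs-rest accuracies over the classes
macro_accuracy = np.mean([per_class_metrics(cm, i)[2] for i in range(3)])
# Share of all examples that were classified correctly
overall_accuracy = np.trace(cm) / cm.sum()
```

In this toy case the averaged one-vs-rest accuracy is about 0.83 while only 0.75 of the examples are classified correctly. Each misclassification counts once as a false positive and once as a false negative, so with C classes the averaged value equals 1 - 2 * (error rate) / C. With 16 classes the averaged accuracy therefore sits much closer to 1 than the share of correct predictions does.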

The full version of this code, for all K values and all test sets, is given below. It also prints the first five misclassified test rows of each test set.

In [10]:
for k in (1,3,5,7,9):
	total_accuracies = []
	total_precisions = []
	total_recalls = []

	for m in range(5):
		accuracies = []
		precisions = []
		recalls = []

		test_set = folds[m]
		train_set = np.concatenate(folds[:m] + folds[m+1:])
		print(test_set)
		X_train = train_set[:, :-1]
		Y_train = train_set[:, -1]
		X_test = test_set[:, :-1]
		Y_test = test_set[:, -1]

		confusion_matrix = [[0] * 16 for _ in range(16)]
		test_row_counter = 0
		for example in test_set:
			true_class = example[-1]
			predicted_class = knn(X_train, Y_train, example[:-1], k)
			if predicted_class == true_class:
				confusion_matrix[true_class][predicted_class] += 1
			else:
				if test_row_counter < 5:
					print("Predicted Class: ", predicted_class, "True Class: ", true_class, "Test Row:", example)
					test_row_counter += 1
				confusion_matrix[true_class][predicted_class] += 1

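		for i in range(16):
			true_positives = confusion_matrix[i][i]
			# Column i: examples predicted as class i whose true class is different
			false_positives = sum(confusion_matrix[j][i] for j in range(16) if j != i)
			# Row i: examples of class i that were predicted as another class
			false_negatives = sum(confusion_matrix[i][j] for j in range(16) if j != i)
			# Cells outside row i and column i (r, c are used so the outer k is not shadowed)
			true_negatives = sum(confusion_matrix[r][c] for r in range(16) for c in range(16) if r != i and c != i)
			accuracy = (true_positives + true_negatives) / (true_positives + false_positives + false_negatives + true_negatives)
			# Guard against classes that never occur or are never predicted in this fold
			precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
			recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0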
			accuracies.append(accuracy)
			precisions.append(precision)
			recalls.append(recall)

		accuracy = sum(accuracies) / len(accuracies)
		precision = sum(precisions) / len(precisions)
		recall = sum(recalls) / len(recalls)

		print('Accuracy For Set ', m+1, ": ", accuracy, "K = {}".format(k))
		print('Precision For Set ', m+1, ": ", precision, "K = {}".format(k))
		print('Recall For Set ', m+1, ": ", recall, "K = {}".format(k))

		total_accuracies.append(accuracy)
		total_precisions.append(precision)
		total_recalls.append(recall)

	accuracy_for_k = sum(total_accuracies) / len(total_accuracies)
	precision_for_k = sum(total_precisions) / len(total_precisions)
	recall_for_k = sum(total_recalls) / len(total_recalls)

	print('Accuracy for K = {}:'.format(k), accuracy_for_k)
	print('Precision for K = {}:'.format(k), precision_for_k)
	print('Recall for K = {}:'.format(k), recall_for_k)

[[ 0  0  3 ...  3 -2 11]
 [ 0  0  1 ... -2  2 12]
 [ 0  0  1 ...  0  1 11]
 ...
 [ 0  1 -2 ... -2  3 15]
 [ 0  0 -2 ...  0  1  8]
 [ 0  0 -1 ... -2 -2 15]]
Predicted Class:  5 True Class:  1 Test Row: [ 0  0 -3  0  2  0  3  0 -2  0 -3  0  2 -1  0 -2  0  1  0 -1  0  2  0  0
  1  1  0  0  1  0 -1  0 -1 -1  1  0  3 -2  2  1 -1 -1  3  0  0  0  2  0
  1 -1  0  0  1  0 -2  0  0 -2 -2 -3  1]
Predicted Class:  0 True Class:  3 Test Row: [ 1  0 -2  3  2  0 -2  0 -1  1  0 -2 -1  2 -1 -1  2 -1 -1  2  0  0  1  0
  2  0 -1  0  1  1  0  0  0  0  0  1 -1  2  0 -1 -1 -2  0  2 -1 -1  0  0
 -2  0  0  0  1  0 -1 -1  0  1  0  1  3]
Predicted Class:  4 True Class:  7 Test Row: [ 0  0  1 -1  0  0 -2  0  1  1  0  0  2 -2  0 -1 -2 -3 -2  0  0  0  0  0
  0 -1  0  0  2  0 -1  0  1  0 -1 -1 -2  1  1 -1 -2  3  0  1  1 -1 -1  0
  1  0 -1  1 -2  0  3  0  0  0 -1  1  7]
Predicted Class:  14 True Class:  3 Test Row: [ 0  0 -2  0 -1 -1  0  0 -1  0 -1 -1  0 -1 -1 -1 -1 -3  1  1  0  0  0  0
  0  2 -2  0 -3  0  0  0 -1  

Our testing is complete. The program printed the accuracy, precision and recall values separately for each test set and k value. 

If we want to see these values more clearly:

-------------------------------------------------

Accuracy for K = 1: 0.9972166203086369

Precision for K = 1: 0.9777332031895144

Recall for K = 1: 0.9777416696373276

----------------------------------------------

Accuracy for K = 3: 0.9985478911437067

Precision for K = 3: 0.9883773269076235

Recall for K = 3: 0.9884061240428175

----------------------------------------------

Accuracy for K = 5: 0.9986270588521264

Precision for K = 5: 0.9890129861803905

Recall for K = 5: 0.989039227828042

----------------------------------------------

Accuracy for K = 7: 0.9986603932272133

Precision for K = 7: 0.9892813281456254

Recall for K = 7: 0.9893052120341815

----------------------------------------------

Accuracy for K = 9: 0.9986728939217157

Precision for K = 9: 0.9893770059931579

Recall for K = 9: 0.9894039902842648

-------------------------------------------------

These results were obtained on a non-normalized data set. Now let's try to find the results again by normalizing our data set. We will see the differences and the effects of normalizing the data set on the results.

In [13]:
numpy_array_features = numpy_array[:, :-1]
numpy_array_labels = numpy_array[:, -1]

normalized_array_features = standardize(numpy_array_features)

folds = np.array_split(normalized_array_features, 5)
label_folds = np.array_split(numpy_array_labels, 5)

for k in (1,3,5,7,9):
	total_accuracies = []
	total_precisions = []
	total_recalls = []

	for m in range(5):
		accuracies = []
		precisions = []
		recalls = []

		test_set = folds[m]
		test_set_labels = label_folds[m]
		train_set = np.concatenate(folds[:m] + folds[m+1:])
		train_labels = np.concatenate(label_folds[:m] + label_folds[m+1:])
		print(test_set)
		X_train = train_set
		Y_train = train_labels
		X_test = test_set
		Y_test = test_set_labels

		confusion_matrix = [[0] * 16 for _ in range(16)]  # Rows: true class, columns: predicted class
		for example, true_class in zip(test_set, test_set_labels):
			predicted_class = knn(X_train, Y_train, example, k)
			confusion_matrix[true_class][predicted_class] += 1
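		for i in range(16):
			true_positives = confusion_matrix[i][i]
			# Column i: examples predicted as class i whose true class is different
			false_positives = sum(confusion_matrix[j][i] for j in range(16) if j != i)
			# Row i: examples of class i that were predicted as another class
			false_negatives = sum(confusion_matrix[i][j] for j in range(16) if j != i)
			# Cells outside row i and column i (r, c are used so the outer k is not shadowed)
			true_negatives = sum(confusion_matrix[r][c] for r in range(16) for c in range(16) if r != i and c != i)
			accuracy = (true_positives + true_negatives) / (true_positives + false_positives + false_negatives + true_negatives)
			# Guard against classes that never occur or are never predicted in this fold
			precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
			recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0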
			accuracies.append(accuracy)
			precisions.append(precision)
			recalls.append(recall)

		accuracy = sum(accuracies) / len(accuracies)
		precision = sum(precisions) / len(precisions)
		recall = sum(recalls) / len(recalls)

		print('Accuracy For Set ', m+1, ": ", accuracy, "K = {}".format(k))
		print('Precision For Set ', m+1, ": ", precision, "K = {}".format(k))
		print('Recall For Set ', m+1, ": ", recall, "K = {}".format(k))

		total_accuracies.append(accuracy)
		total_precisions.append(precision)
		total_recalls.append(recall)

	accuracy_for_k = sum(total_accuracies) / len(total_accuracies)
	precision_for_k = sum(total_precisions) / len(total_precisions)
	recall_for_k = sum(total_recalls) / len(total_recalls)

	print('Accuracy for K = {}:'.format(k), accuracy_for_k)
	print('Precision for K = {}:'.format(k), precision_for_k)
	print('Recall for K = {}:'.format(k), recall_for_k)

[[0.66666667 0.5        1.         ... 0.5        1.         0.16666667]
 [0.66666667 0.5        0.66666667 ... 0.33333333 0.16666667 0.83333333]
 [0.66666667 0.5        0.66666667 ... 0.66666667 0.5        0.66666667]
 ...
 [0.66666667 1.         0.16666667 ... 0.33333333 0.16666667 1.        ]
 [0.66666667 0.5        0.16666667 ... 0.66666667 0.5        0.66666667]
 [0.66666667 0.5        0.33333333 ... 0.16666667 0.16666667 0.16666667]]
Accuracy For Set  1 :  0.9963854166666667 K = 1
Precision For Set  1 :  0.9711609075808506 K = 1
Recall For Set  1 :  0.9711382793966229 K = 1
[[0.66666667 0.5        0.83333333 ... 0.33333333 0.33333333 0.33333333]
 [0.66666667 0.5        0.83333333 ... 0.5        0.5        0.5       ]
 [0.66666667 0.5        0.83333333 ... 0.33333333 0.33333333 0.5       ]
 ...
 [0.66666667 0.5        0.16666667 ... 0.5        0.16666667 0.66666667]
 [0.66666667 0.5        0.83333333 ... 0.83333333 0.5        0.66666667]
 [0.66666667 0.5        0.33333333 ... 1.  

Now we can see the accuracy, precision and recall values for the normalized set. More clearly:

-------------------------------------------------

Accuracy for K = 1: 0.9966770307247825

Precision for K = 1: 0.9734150567435199

Recall for K = 1: 0.9734674342753958

------------------------------------------------

Accuracy for K = 3: 0.9983728878448204

Precision for K = 3: 0.9869829283601985

Recall for K = 3: 0.9869977379838055

-------------------------------------------------

Accuracy for K = 5: 0.9985458081576244

Precision for K = 5: 0.9883584002811796

Recall for K = 5: 0.9883827787290025

-------------------------------------------------

Accuracy for K = 7: 0.9985999755187933

Precision for K = 7: 0.9887942222096273

Recall for K = 7: 0.9888175383565802

-------------------------------------------------

Accuracy for K = 9: 0.9986208085048756

Precision for K = 9: 0.9889666310925083

Recall for K = 9: 0.9889868110452772

-------------------------------------------------

## Error Analysis for Classification

Now that we have our results for both versions of the data, we need to examine how the variables we changed (normalization and the K value) affect accuracy, precision and recall.

Let's start with the most obvious change, normalization. 

For the same K value (let's take K = 3), the accuracy, precision and recall values for the non-normalized and normalized data are as follows:

-------------------------------------------------

Accuracy for K = 3: 0.9985478911437067

Precision for K = 3: 0.9883773269076235

Recall for K = 3: 0.9884061240428175

-------------------------------------------------

Accuracy for K = 3: 0.9983728878448204

Precision for K = 3: 0.9869829283601985

Recall for K = 3: 0.9869977379838055

--------------------------------------------------

The top values belong to the non-normalized data while the bottom values belong to the normalized data. Based on this, normalizing the data does not help, at least for this data set. In the normalized data the accuracy is lower by about 0.0002 (0.02 percentage points), the precision by about 0.0014 (0.14 percentage points) and the recall also by about 0.0014. Precision and recall fall by almost the same amount, so false positives and false negatives both increase slightly. These differences are small, but the normalized results are lower for every tested K value, so min-max normalization is not a useful choice for this data set. A likely reason is that the features already share a similar scale (about -3 to 3), so rescaling them adds little.
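
The size of these differences can be checked directly from the K = 3 values listed above. The short sketch below only does the subtraction. The variable names are chosen for illustration.

```python
# Averaged metrics for K = 3, copied from the results above
raw = {"accuracy": 0.9985478911437067, "precision": 0.9883773269076235, "recall": 0.9884061240428175}
normalized = {"accuracy": 0.9983728878448204, "precision": 0.9869829283601985, "recall": 0.9869977379838055}

# Difference in percentage points: (raw - normalized) * 100
drop_pp = {name: (raw[name] - normalized[name]) * 100 for name in raw}

for name, value in drop_pp.items():
    print(f"{name}: {value:.3f} percentage points lower after normalization")
```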

Let us now examine how different K values affect accuracy on the unnormalized data set. 

-------------------------------------------------

Accuracy for K = 1: 0.9972166203086369

Precision for K = 1: 0.9777332031895144

Recall for K = 1: 0.9777416696373276

----------------------------------------------

Accuracy for K = 3: 0.9985478911437067

Precision for K = 3: 0.9883773269076235

Recall for K = 3: 0.9884061240428175

----------------------------------------------

Accuracy for K = 5: 0.9986270588521264

Precision for K = 5: 0.9890129861803905

Recall for K = 5: 0.989039227828042

----------------------------------------------

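As we can see, as the K value increases, accuracy, precision and recall all increase. The gains shrink quickly, however: accuracy rises by about 0.0013 from K = 1 to K = 3 but by less than 0.0001 from K = 3 to K = 5, and the results for K = 7 and K = 9 reported earlier continue this pattern. So within the tested range a larger K gives slightly better predictions, but the relationship is not a direct proportionality, and we cannot conclude that the largest possible K gives the highest accuracy. With a very large K the vote is dominated by the most frequent classes.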
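
To see how the improvement changes from one K value to the next, the sketch below computes the gain in accuracy between consecutive tested K values, using the non-normalized results reported earlier. The variable names are chosen for illustration.

```python
# Averaged accuracy on the non-normalized data, copied from the results above
accuracy_by_k = {
    1: 0.9972166203086369,
    3: 0.9985478911437067,
    5: 0.9986270588521264,
    7: 0.9986603932272133,
    9: 0.9986728939217157,
}

ks = sorted(accuracy_by_k)
# Gain in accuracy when moving from one tested K to the next
gains = {(a, b): accuracy_by_k[b] - accuracy_by_k[a] for a, b in zip(ks, ks[1:])}

for (a, b), gain in gains.items():
    print(f"K = {a} -> {b}: +{gain:.6f}")
```

Every gain is positive, but each one is smaller than the one before it.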


Using K-fold cross-validation helped us obtain more reliable accuracy, precision and recall values in this program. Because every example is used as a test case exactly once and the values calculated separately for each set are averaged, the result depends less on one particular train/test split. Increasing the number of folds can help to some extent, since each train set becomes larger, but it also increases the amount of computation.