#**BBM409: Machine Learning Laboratory**
==================================

**Programming Assignment-1**

**Instructors:** 
*   Ahmet Burak Can
*   Burçak Asal

**Prepared By:** 
* Mert Tazeoğlu `(21946606)`

**Problem Definition:** 

In this assignment, we use the K-Nearest Neighbor (KNN) algorithm to solve both a classification problem and a regression problem. The implementation therefore has two parts.

#**Part-1: Classification Problem**

##**1.1 - Importing Required Libraries**
*   **`Numpy:`** To perform a wide variety of mathematical operations on arrays.
*   **`Pandas:`** To analyse and manipulate tabular data in dataframes.
*   **`Math:`** To perform basic mathematical operations (such as square roots) on scalar values.
*   **`Random:`** To generate random integers in a given range.

In [None]:
import pandas as pd
import numpy as np
import math
from random import randrange

##**1.2 - Data Preparation**

In [None]:
# Downloading and reading dataset
df_url = "https://piazza.com/redirect/s3?bucket=uploads&prefix=paste%2Fi74ewlis7fl1f4%2F5a71fc9d9c21da11c7f27c372eb7eafcafed28fdff571c83d1dff35a85b421f2%2Fsubset_16P.csv"
original_df = pd.read_csv(df_url, encoding = "ISO-8859-1") # ISO-8859-1 is required, otherwise program gives an exception.
original_df.head(3)

Unnamed: 0,Response Id,You regularly make new friends.,You spend a lot of your free time exploring various random topics that pique your interest,Seeing other people cry can easily make you feel like you want to cry too,You often make a backup plan for a backup plan.,"You usually stay calm, even under a lot of pressure","At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know",You prefer to completely finish one project before starting another.,You are very sentimental.,You like to use organizing tools like schedules and lists.,...,You believe that pondering abstract philosophical questions is a waste of time.,"You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places.",You know at first glance how someone is feeling.,You often feel overwhelmed.,You complete things methodically without skipping over any steps.,You are very intrigued by things labeled as controversial.,You would pass along a good opportunity if you thought someone else needed it more.,You struggle with deadlines.,You feel confident that things will work out for you.,Personality
0,35874,-1,0,-1,1,-1,-2,-2,0,-1,...,0,3,0,0,0,0,1,-1,0,ENTP
1,42624,0,0,1,0,0,0,-1,0,0,...,0,2,0,0,0,0,-1,-3,2,INTP
2,55199,0,0,-2,-1,2,-2,0,0,-1,...,0,0,0,1,0,0,3,0,0,ESTP


In [None]:
# Create a deep copy so that the original dataframe is not modified
unnormalized_df = original_df.copy(deep = True)

# The target column ('Personality') holds string labels.
# Our evaluation code indexes classes by integer, so each label is encoded as an integer from 0 to 15.
personality_types = ["ESTJ", "ENTJ", "ESFJ", "ENFJ", "ISTJ", "ISFJ", "INTJ", "INFJ",
                     "ESTP", "ESFP", "ENTP", "ENFP", "ISTP", "ISFP", "INTP", "INFP"]
personality_codes = {label: code for code, label in enumerate(personality_types)}
unnormalized_df['Personality'] = unnormalized_df['Personality'].replace(personality_codes)

print(original_df['Personality'].head(5))
print(unnormalized_df['Personality'].head(5))

0    ENTP
1    INTP
2    ESTP
3    ENTP
4    ENFJ
Name: Personality, dtype: object
0    10
1    14
2     8
3    10
4     3
Name: Personality, dtype: int64


In [None]:
# The 'Response Id' column carries no predictive information, so we drop it
unnormalized_df.drop(labels=['Response Id'], axis=1, inplace=True)

In [None]:
# Splitting the data set into predictor(X) and target(y) sets
X = unnormalized_df.iloc[:, :-1].values
y = unnormalized_df['Personality'].values

X

array([[-1,  0, -1, ...,  1, -1,  0],
       [ 0,  0,  1, ..., -1, -3,  2],
       [ 0,  0, -2, ...,  3,  0,  0],
       ...,
       [ 0,  0,  2, ...,  2,  0,  1],
       [ 0,  0,  1, ...,  2,  0,  1],
       [-1,  1,  1, ...,  2,  1,  1]])

##**1.3 - Data Scaling**
As you can see, the values are spread over different ranges. In distance-based methods such as KNN, the features with the largest ranges dominate the distance, which reduces the performance and trustworthiness of the model. That's why we scale our dataset with a min-max scaler.
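For reference, min-max scaling maps each feature column to the range $[0, 1]$ using that column's own minimum and maximum. This is the standard definition, given here as a reminder; the symbols $x_{ij}$ (value of feature $j$ for sample $i$) and $x'_{ij}$ (its scaled value) are notation introduced only for this note.

```latex
x'_{ij} = \frac{x_{ij} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}
```

For example, assuming a question whose answers range from $-3$ to $3$, an answer of $-1$ would be mapped to $(-1 - (-3)) / (3 - (-3)) = 2/6 \approx 0.33$.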

In [None]:
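# This function scales every feature column to [0, 1] in place.
# The last column (classification / regression target) is left unchanged.
def minmaxScaler(numpyArray):
  num_rows, num_cols = numpyArray.shape
  for i in range(num_cols-1):
    # Minimum and maximum are taken over column i (all rows)
    col_max = float(numpyArray[:, i].max())
    col_min = float(numpyArray[:, i].min())
    if col_max == col_min:
      # Constant column: avoid a division by zero
      numpyArray[:, i] = 0.0
      continue
    numpyArray[:, i] = (numpyArray[:, i] - col_min) / (col_max - col_min)

dataColumnNames = np.array(unnormalized_df.columns)

# Convert the dataframe into float numpy arrays (one raw copy, one to be scaled)
unnormalised_df = np.array(unnormalized_df).astype(float)
normalised_df = np.array(unnormalized_df).astype(float)
minmaxScaler(normalised_df)  # scales in place and returns nothing

# Each feature column is scaled to [0, 1]; the last column still holds the class label.
print(normalised_df[:,:])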

[[ 0.15384615  0.17647059  0.18181818 ...  0.28571429  0.5
  10.        ]
 [ 0.23076923  0.17647059  0.36363636 ...  0.          1.5
  14.        ]
 [ 0.23076923  0.17647059  0.09090909 ...  0.42857143  0.5
   8.        ]
 ...
 [ 0.23076923  0.17647059  0.45454545 ...  0.42857143  1.
  11.        ]
 [ 0.23076923  0.17647059  0.36363636 ...  0.42857143  1.
   7.        ]
 [ 0.15384615  0.23529412  0.36363636 ...  0.57142857  1.
   8.        ]]


##**1.4 - Implementing KNN and Model Evaluation Algorithms**

In [None]:
# This function calculates the Euclidean distance between two rows.
# The last element of each row is the target value, so it is excluded.
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return math.sqrt(distance)

# This function returns the k nearest neighbours of a test row.
def get_n_neighbors(train_data, test_row, k, mode):
    dist_list = list()
    for curr_row in train_data:
        curr_dist = euclidean_distance(test_row, curr_row)
        # 'Standart' mode ranks neighbours by the plain distance d.
        # Any other mode ('Weighted') ranks them by d + 1/d; a zero distance
        # is mapped to infinity to avoid a division by zero.
        if(mode != 'Standart'):
            curr_dist = curr_dist + (1/curr_dist) if curr_dist > 0 else float('inf')
        dist_list.append((curr_row, curr_dist))
    # Sort by score so that the nearest neighbours come first
    dist_list.sort(key=lambda tup: tup[1])
    # Keep only the rows of the k nearest neighbours
    return [dist_list[i][0] for i in range(k)]

# This function returns the predicted class, i.e. the most common label among the k nearest neighbours
# Its regression counterpart, 'regression_prediction', is implemented in the second part of the assignment
def class_prediction(train_data, test_row, k, mode):
	nearest_neighbours = get_n_neighbors(train_data, test_row, k, mode)
	target_values = [curr_neigbour[-1] for curr_neigbour in nearest_neighbours]		
	predicted_value = max(set(target_values), key=target_values.count)
	return predicted_value

def regression_prediction(train_data, test_row, k, mode):
    # Placeholder: this function is redefined in the second part of the assignment
    return 0.0
 
# This function randomly splits the dataset into 'fold_number' equally sized folds for cross validation
def cross_validation_split(dataframe, fold_number):
	splitted_data = list()
	df_copy = list(dataframe)
	fold_size = int(len(dataframe) / fold_number)
	for n_th_fold in range(fold_number):
		curr_fold = list()
		while len(curr_fold) < fold_size:
			index = randrange(len(df_copy))
			curr_fold.append(df_copy.pop(index))
		splitted_data.append(curr_fold)
	return splitted_data

# This function evaluates model performance and returns accuracy,
# macro-averaged precision and macro-averaged recall (all in percent)
def metric_evaluation(actual, predicted):
    n_classes = 16

    # Part-1: Creation of the 16x16 confusion matrix (rows: actual, columns: predicted)
    confusion_matrix = [[0] * n_classes for _ in range(n_classes)]
    for i in range(len(actual)):
        confusion_matrix[int(actual[i])][int(predicted[i])] += 1

    # Part-2: Accuracy is the share of samples on the diagonal
    correct = sum(confusion_matrix[i][i] for i in range(n_classes))
    accuracy = correct / float(len(actual)) * 100.0

    # Part-3 and Part-4: Per-class precision (column-wise) and recall (row-wise)
    # A class with an empty column or row contributes 0 to the average
    precision = 0.0
    recall = 0.0
    for i in range(n_classes):
        predicted_total = sum(confusion_matrix[j][i] for j in range(n_classes))
        actual_total = sum(confusion_matrix[i])
        if predicted_total > 0:
            precision += confusion_matrix[i][i] / predicted_total
        if actual_total > 0:
            recall += confusion_matrix[i][i] / actual_total

    return accuracy, precision*100/n_classes, recall*100/n_classes

# This function returns the KNN prediction for every row in the test set
def k_nearest_neighbors(train, test, mode, sub_mode, k):
	all_predictions = list()
	for row in test:
		if(mode == 'Regression'):
			output = regression_prediction(train, row, k, sub_mode)
		else:
			output = class_prediction(train, row, k, sub_mode)
		all_predictions.append(output)
	return(all_predictions)
 
# This function runs and evaluates the selected algorithm on each fold
# and returns the metrics averaged over all folds as [accuracy, precision, recall]
def evaluate_algorithm(dataset, alg, n_folds, mode, sub_mode, k):
    folds = cross_validation_split(dataset, n_folds)
    fold_scores = list()
    for fold in folds:
        # Train on all other folds, test on the current one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None  # hide the target value from the predictor
            test_set.append(row_copy)
        predicted_values = k_nearest_neighbors(train_set, test_set, mode, sub_mode, k)
        actual_values = [row[-1] for row in fold]
        fold_scores.append(metric_evaluation(actual_values, predicted_values))
    # Average each metric over the folds
    return [sum(metric) / len(fold_scores) for metric in zip(*fold_scores)]
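
To make the precision and recall computation in `metric_evaluation` more concrete, here is a minimal sketch on a made-up 3-class confusion matrix. The function name `macro_precision_recall` and the `toy` matrix are assumptions for illustration only; the sketch uses NumPy instead of the nested loops above but follows the same convention (rows are actual classes, columns are predicted classes, empty classes contribute 0).

```python
import numpy as np

def macro_precision_recall(confusion):
    """Macro-averaged precision and recall (in %) from a square confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    true_positives = np.diag(confusion)
    predicted_totals = confusion.sum(axis=0)  # column sums: samples predicted as each class
    actual_totals = confusion.sum(axis=1)     # row sums: samples that really are each class
    # Classes with an empty column or row keep the default value 0
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros_like(true_positives), where=predicted_totals > 0)
    recall = np.divide(true_positives, actual_totals,
                       out=np.zeros_like(true_positives), where=actual_totals > 0)
    return precision.mean() * 100, recall.mean() * 100

# Hypothetical confusion matrix, not taken from the experiments
toy = [[5, 1, 0],
       [2, 3, 0],
       [0, 1, 4]]
macro_p, macro_r = macro_precision_recall(toy)
print(round(macro_p, 2), round(macro_r, 2))  # 77.14 74.44
```

Here the per-class precisions are 5/7, 3/5 and 4/4, and the per-class recalls are 5/6, 3/5 and 4/5; the macro average is simply their unweighted mean.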

##**1.5 - Executing KNN and Model Evaluation Algorithms**

In [None]:
# The number of folds is fixed at 5 for this assignment
n_folds = 5

# Run standard and weighted KNN for k = 1, 3, 5, 7, 9,
# first on the raw features and then on the min-max scaled features
for title, data in (('Without', unnormalised_df), ('With', normalised_df)):
  print('---Classification Report ' + title + ' Feature Normalization ---')
  print(' ')

  for i in (1,3,5,7,9):
    print('Current Neigbor Number: ' + str(i))

    # Evaluate both KNN variants with the current neighbour number i
    for sub_mode in ('Standart', 'Weighted'):
      scores = evaluate_algorithm(data.tolist(), k_nearest_neighbors, n_folds, 'Classification', sub_mode, i)
      print('For ' + sub_mode + ' KNN -> Accuracy: ' + "{:.2f}".format(scores[0]) + ' Precision: ' + "{:.2f}".format(scores[1]) + ' Recall: ' + "{:.2f}".format(scores[2]))
    print(' ')

---Classification Report Without Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Accuracy: 91.00 Precision: 91.69 Recall: 91.94
For Weighted KNN -> Accuracy: 88.00 Precision: 89.69 Recall: 89.46
 
Current Neigbor Number: 3
For Standart KNN -> Accuracy: 95.00 Precision: 95.55 Recall: 96.29
For Weighted KNN -> Accuracy: 93.00 Precision: 93.73 Recall: 94.79
 
Current Neigbor Number: 5
For Standart KNN -> Accuracy: 92.00 Precision: 91.58 Recall: 90.95
For Weighted KNN -> Accuracy: 95.00 Precision: 94.51 Recall: 95.72
 
Current Neigbor Number: 7
For Standart KNN -> Accuracy: 92.00 Precision: 93.16 Recall: 93.75
For Weighted KNN -> Accuracy: 91.00 Precision: 89.35 Recall: 92.94
 
Current Neigbor Number: 9
For Standart KNN -> Accuracy: 92.00 Precision: 93.50 Recall: 90.91
For Weighted KNN -> Accuracy: 92.00 Precision: 92.36 Recall: 93.52
 
---Classification Report With Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Accuracy: 71.00 Precision

##**1.6 - Error Analysis for Classification**

###**1.6.1 - General Performance Analysis**

* **Effect of Neighbour Number (k):** The general performance of the kNN algorithm depends on the number of neighbours. If k is too small, the model tends to overfit, because each prediction follows a handful of possibly noisy neighbours. On the other hand, if k is too large, the model tends to underfit, since it drifts towards always predicting the majority class. Our experiment results also vary with k. Therefore, finding a suitable k by comparing different metrics is critical.

* **Accuracy:** In this part, we ran 20 experiments on the same dataset. According to the results, we get the best accuracy with k = 3 without feature normalization and with k = 7 with feature normalization. Unfortunately, feature normalization gives worse accuracy, which is unusual. I think this may be caused by the dataset.

* **Precision & Recall:** Accuracy alone isn't enough for model evaluation. Precision and recall are useful measures of prediction success, especially when the classes are very imbalanced. According to Wikipedia, precision (also known as positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Both precision and recall are based on relevance. Since the classes in the dataset are balanced, the precision and recall metrics are similar to the accuracy. If these two metrics had differed noticeably from the accuracy, we could have used the F1 score (the harmonic mean of precision and recall) to comment on the results.

* **Effect of Algorithm:** The general performance of the kNN and weighted kNN algorithms is similar, but the main idea behind weighted kNN is to prevent wrong decisions. The main assumption is that neighbours which are closer to the sample should be given more relevance when deciding which class the sample belongs to, since they are more similar to it. In our runs, the precision and recall of weighted kNN are a bit better for some values of k (for example k = 5), but not consistently.

* **Effect of K-Fold:** K-fold cross validation guards against results that depend on one particular random selection of data samples. It does not by itself make the model more accurate, but it gives a more reliable performance estimate and provides some clues about overfitting and data corruption, because the model is validated on every part of the data. In this assignment we used 5-fold cross validation, and the metrics above are the average of all folds.

* **Effect of Normalization:** Normalization rescales the raw values of each feature to a common range while preserving the shape of its distribution and the ratios between values. If we don't normalize the dataset, features with larger value ranges have more effect on the distance, and therefore on the output. Normalization often also improves the accuracy of machine learning models, although it did not in our experiments.
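
The weighted-kNN idea described under "Effect of Algorithm" (closer neighbours count more) is commonly realised by weighting each neighbour's vote with the inverse of its distance. The sketch below illustrates that scheme under stated assumptions; it is not the `get_n_neighbors` function of this notebook, and the names `weighted_vote`, `neighbours` and `eps` are hypothetical.

```python
from collections import defaultdict

def weighted_vote(neighbours, eps=1e-9):
    """Return the label with the largest sum of 1/distance weights.

    `neighbours` is a list of (label, distance) pairs for the k nearest points.
    """
    totals = defaultdict(float)
    for label, distance in neighbours:
        totals[label] += 1.0 / (distance + eps)  # eps avoids a division by zero
    return max(totals, key=totals.get)

# Made-up neighbours: one very close ISTP (12) and two farther ISFP (13) samples
neighbours = [(12, 0.5), (13, 2.0), (13, 2.5)]
print(weighted_vote(neighbours))  # 12
```

A plain majority vote would return 13 here, while the weighted vote returns 12, because the single close neighbour has weight 2.0 and the two distant ones only 0.5 + 0.4 = 0.9 in total.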

###**1.6.2 - Factors That Make Hard To Classify**
k-NN is one of the most widely used instance-based machine learning algorithms. k-NN assumes that all instances are points in some n-dimensional space and defines neighbours in terms of distance. For example, my implementation is based on the Euclidean distance in R^n. Also, a smaller k produces sharper decision boundaries, while a larger k produces smoother ones.

Tuning k and preparing the dataset improve classification performance, but there are still misclassified samples. According to my observations, this usually happens at boundary points or at unusual data points. For example, a sample that lies near a class in n-dimensional space but does not belong to that class causes this issue. Classifying this kind of sample is much harder than classifying others. On the other hand, forcing the model to classify such samples correctly increases the risk of overfitting.

**Example:**
[0.0, 0.0, 0.0, -1.0, -3.0, -2.0, 0.0, 0.0, -3.0, 0.0, 1.0, -2.0, 0.0, 1.0, 1.0, 0.0, -1.0, -1.0, -1.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, -1.0, 0.0, -1.0, 1.0, -2.0, -1.0, 0.0, -1.0, 0.0, 0.0, 2.0, 3.0, 1.0, 1.0, -2.0, -1.0, -2.0, -1.0, 2.0, 1.0, 0.0, -1.0, 0.0, 3.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -3.0, -1.0, -1.0, **12.0**]

The original class is 12 (ISTP), but it is predicted as 13 (ISFP). ISTP and ISFP are quite similar personality types, and as we can see, our model sometimes fails in this kind of situation.

**Summary:** k-NN is subject to the curse of dimensionality. Irrelevant attributes affect the distance between feature vectors and therefore the classification performance.

###**1.6.3 - Computation Time**
Since I implemented KNN using the Euclidean distance and the code recalculates the distances for every sample again and again, it takes a long time to process thousands of data samples. As a self-criticism, I can say with certainty that the code is not optimized for speed for this reason. There are more efficient approaches, such as computing a distance matrix, which dramatically reduces execution time. I tested the program with samples of different sizes (10, 100, 500, 1000 etc.). 1000 data samples take ~2 days to execute. Therefore, if the execution takes too long, I strongly recommend testing it with at most 1000 samples. I can also say that increasing the data size (up to a point) has a positive effect on model performance.
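
As an illustration of the distance-matrix idea mentioned above, the following sketch computes all test-to-train Euclidean distances in one vectorised NumPy step. It is a minimal sketch and not the code used for the reported results; the helper names `pairwise_euclidean` and `k_nearest_indices` and the tiny `train` / `test` arrays are assumptions, and the inputs are expected to contain feature columns only (no target column).

```python
import numpy as np

def pairwise_euclidean(test_X, train_X):
    """Return an (n_test, n_train) matrix of Euclidean distances."""
    test_X = np.asarray(test_X, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    squared = ((test_X ** 2).sum(axis=1)[:, None]
               + (train_X ** 2).sum(axis=1)[None, :]
               - 2.0 * test_X @ train_X.T)
    # Rounding can produce tiny negative values, so clip before the square root
    return np.sqrt(np.maximum(squared, 0.0))

def k_nearest_indices(test_X, train_X, k):
    """Return, for each test row, the indices of its k closest training rows."""
    distances = pairwise_euclidean(test_X, train_X)
    return np.argsort(distances, axis=1)[:, :k]

# Made-up points for illustration
train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test = np.array([[0.0, 0.0]])
print(pairwise_euclidean(test, train))   # distances 0, 5 and sqrt(2)
print(k_nearest_indices(test, train, 2)) # [[0 2]]
```

The whole matrix is computed once per fold, so the per-row Python loops over features disappear; the trade-off is that the matrix needs memory proportional to n_test × n_train.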

#**Part-2: Regression Problem**

##**2.1 - Data Preparation**

In [None]:
# Downloading and reading dataset
# ******* Please, add dataset to working directory. (This is not imported as a link) *******
df_name = "energy_efficiency_data.csv"
original_df = pd.read_csv(df_name)
original_df.head(7)

Unnamed: 0,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Orientation,Glazing_Area,Glazing_Area_Distribution,Heating_Load,Cooling_Load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28
5,0.9,563.5,318.5,122.5,7.0,3,0.0,0,21.46,25.38
6,0.9,563.5,318.5,122.5,7.0,4,0.0,0,20.71,25.16


In [None]:
# Turn it into numpy array
unnormalized_df = np.array(original_df).astype(float)
normalised_df = np.array(unnormalized_df).astype(float)
scaler = minmaxScaler(normalised_df)

# minmaxScaler rescales the feature columns in place; the last column (the target) is left unchanged.
print(normalised_df[:,:])

[[1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  5.69597070e-01 2.13300000e+01]
 [1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  5.69597070e-01 2.13300000e+01]
 [1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  5.69597070e-01 2.13300000e+01]
 ...
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  6.02197802e-01 1.71100000e+01]
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  6.03663004e-01 1.66100000e+01]
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  6.09523810e-01 1.60300000e+01]]


##**2.2 - Implementing KNN and Model Evaluation Algorithms**

In [None]:
# Most of the required KNN functions are implemented above
# The only thing we need here is a prediction function for regression:
# the prediction is the mean target value of the k nearest neighbours
def regression_prediction(train_data, test_row, k, mode):
    nearest_neighbours = get_n_neighbors(train_data, test_row, k, mode)
    target_values = [curr_neigbour[-1] for curr_neigbour in nearest_neighbours]
    predicted_value = sum(target_values) / len(target_values)
    return predicted_value

# This function evaluates model performance and calculates mean absolute error
def mae_evaluation(actual, predicted):
	mae = 0.0
	for i in range(len(actual)):
		err = abs((actual[i] - predicted[i]))
		mae += err
	mae = mae / len(actual)
	return mae

# This function runs and evaluates the selected algorithm on each fold
# and returns the mean absolute error averaged over all folds
def evaluate_regression_algorithm(dataset, algorithm, n_folds, mode, sub_mode, k):
    folds = cross_validation_split(dataset, n_folds)
    fold_maes = list()
    for fold in folds:
        # Train on all other folds, test on the current one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None  # hide the target value from the predictor
            test_set.append(row_copy)
        predicted = k_nearest_neighbors(train_set, test_set, mode, sub_mode, k)
        actual = [row[-1] for row in fold]
        fold_maes.append(mae_evaluation(actual, predicted))
    return sum(fold_maes) / len(fold_maes)
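
`regression_prediction` above takes the plain mean of the neighbours' target values. A common alternative for weighted KNN regression is an inverse-distance-weighted mean, sketched below under stated assumptions: the function name `weighted_average_prediction`, the `(target, distance)` pair format and the `eps` parameter are hypothetical and are not part of the notebook's functions.

```python
def weighted_average_prediction(neighbours, eps=1e-9):
    """Inverse-distance-weighted mean of the neighbours' target values.

    `neighbours` is a list of (target, distance) pairs for the k nearest points.
    """
    weights = [1.0 / (distance + eps) for _, distance in neighbours]  # eps avoids division by zero
    weighted_sum = sum(w * target for w, (target, _) in zip(weights, neighbours))
    return weighted_sum / sum(weights)

# Made-up neighbours: target 20.0 at distance 1.0 and target 30.0 at distance 3.0
neighbours = [(20.0, 1.0), (30.0, 3.0)]
print(round(weighted_average_prediction(neighbours), 2))  # 22.5
```

The plain mean of these two targets would be 25.0; the weighted mean is pulled towards 20.0 because that neighbour is three times closer.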

##**2.3 - Executing KNN and Model Evaluation Algorithms (For Cooling Load)**

In [None]:
# The number of folds is fixed at 5 for this assignment
n_folds = 5

# Run standard and weighted KNN for k = 1, 3, 5, 7, 9,
# first on the raw features and then on the min-max scaled features
for title, data in (('Without', unnormalized_df), ('With', normalised_df)):
  print('---Regression Report ' + title + ' Feature Normalization ---')
  print(' ')

  for i in (1,3,5,7,9):
    print('Current Neigbor Number: ' + str(i))

    # Evaluate both KNN variants with the current neighbour number i
    for sub_mode in ('Standart', 'Weighted'):
      score = evaluate_regression_algorithm(data.tolist(), k_nearest_neighbors, n_folds, 'Regression', sub_mode, i)
      print('For ' + sub_mode + ' KNN -> Mean Absolute Error: ' + "{:.2f}".format(score))
    print(' ')

---Regression Report Without Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Mean Absolute Error: 0.38
For Weighted KNN -> Mean Absolute Error: 0.41
 
Current Neigbor Number: 3
For Standart KNN -> Mean Absolute Error: 0.45
For Weighted KNN -> Mean Absolute Error: 0.61
 
Current Neigbor Number: 5
For Standart KNN -> Mean Absolute Error: 0.57
For Weighted KNN -> Mean Absolute Error: 0.51
 
Current Neigbor Number: 7
For Standart KNN -> Mean Absolute Error: 0.73
For Weighted KNN -> Mean Absolute Error: 0.43
 
Current Neigbor Number: 9
For Standart KNN -> Mean Absolute Error: 0.26
For Weighted KNN -> Mean Absolute Error: 0.33
 
---Regression Report With Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Mean Absolute Error: 0.87
For Weighted KNN -> Mean Absolute Error: 1.22
 
Current Neigbor Number: 3
For Standart KNN -> Mean Absolute Error: 0.86
For Weighted KNN -> Mean Absolute Error: 0.56
 
Current Neigbor Number: 5
For Standart KNN -> Mea

##**2.4 - Executing KNN and Model Evaluation Algorithms (For Heating Load)**

In [None]:
# Our algorithm takes the last column as the target column.
# 'Heating_Load' is not the last column of the dataset, so we swap the last two columns first.
rearranged_df = np.array(original_df)
rearranged_df = rearranged_df[:,[0,1,2,3,4,5,6,7,9,8]]
rearranged_df = pd.DataFrame(rearranged_df, columns=["Relative_Compactness", "Surface_Area", "Wall_Area", "Roof_Area", "Overall_Height", "Orientation", "Glazing_Area", "Glazing_Area_Distribution", "Cooling_Load", "Heating_Load"])
rearranged_df.head(7)

# The last two columns of the dataset are now swapped

Unnamed: 0,Relative_Compactness,Surface_Area,Wall_Area,Roof_Area,Overall_Height,Orientation,Glazing_Area,Glazing_Area_Distribution,Cooling_Load,Heating_Load
0,0.98,514.5,294.0,110.25,7.0,2.0,0.0,0.0,21.33,15.55
1,0.98,514.5,294.0,110.25,7.0,3.0,0.0,0.0,21.33,15.55
2,0.98,514.5,294.0,110.25,7.0,4.0,0.0,0.0,21.33,15.55
3,0.98,514.5,294.0,110.25,7.0,5.0,0.0,0.0,21.33,15.55
4,0.9,563.5,318.5,122.5,7.0,2.0,0.0,0.0,28.28,20.84
5,0.9,563.5,318.5,122.5,7.0,3.0,0.0,0.0,25.38,21.46
6,0.9,563.5,318.5,122.5,7.0,4.0,0.0,0.0,25.16,20.71


In [None]:
# Turn it into numpy array
unnormalized_df = np.array(rearranged_df).astype(float)
normalised_df = np.array(unnormalized_df).astype(float)
scaler = minmaxScaler(normalised_df)

# minmaxScaler rescales the feature columns in place; the last column (the target) is left unchanged.
print(normalised_df[:,:])

[[1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  7.81318681e-01 1.55500000e+01]
 [1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  7.81318681e-01 1.55500000e+01]
 [1.90476190e-03 1.00000000e+00 1.00000000e+00 ... 0.00000000e+00
  7.81318681e-01 1.55500000e+01]
 ...
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  6.26739927e-01 1.64400000e+01]
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  6.08424908e-01 1.64800000e+01]
 [1.20505345e-03 1.57142857e+00 1.25000000e+00 ... 1.68918919e-01
  5.87179487e-01 1.66400000e+01]]


In [None]:
# The number of folds is fixed at 5 for this assignment
n_folds = 5

# Run standard and weighted KNN for k = 1, 3, 5, 7, 9,
# first on the raw features and then on the min-max scaled features
for title, data in (('Without', unnormalized_df), ('With', normalised_df)):
  print('---Regression Report ' + title + ' Feature Normalization ---')
  print(' ')

  for i in (1,3,5,7,9):
    print('Current Neigbor Number: ' + str(i))

    # Evaluate both KNN variants with the current neighbour number i
    for sub_mode in ('Standart', 'Weighted'):
      score = evaluate_regression_algorithm(data.tolist(), k_nearest_neighbors, n_folds, 'Regression', sub_mode, i)
      print('For ' + sub_mode + ' KNN -> Mean Absolute Error: ' + "{:.2f}".format(score))
    print(' ')

---Regression Report Without Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Mean Absolute Error: 0.36
For Weighted KNN -> Mean Absolute Error: 0.39
 
Current Neigbor Number: 3
For Standart KNN -> Mean Absolute Error: 0.64
For Weighted KNN -> Mean Absolute Error: 0.49
 
Current Neigbor Number: 5
For Standart KNN -> Mean Absolute Error: 0.62
For Weighted KNN -> Mean Absolute Error: 0.26
 
Current Neigbor Number: 7
For Standart KNN -> Mean Absolute Error: 0.66
For Weighted KNN -> Mean Absolute Error: 0.65
 
Current Neigbor Number: 9
For Standart KNN -> Mean Absolute Error: 0.39
For Weighted KNN -> Mean Absolute Error: 0.59
 
---Regression Report With Feature Normalization ---
 
Current Neigbor Number: 1
For Standart KNN -> Mean Absolute Error: 0.85
For Weighted KNN -> Mean Absolute Error: 0.80
 
Current Neigbor Number: 3
For Standart KNN -> Mean Absolute Error: 0.75
For Weighted KNN -> Mean Absolute Error: 0.87
 
Current Neigbor Number: 5
For Standart KNN -> Mea

##**2.5 - Error Analysis for Regression**

*   The effects of the algorithm, normalization, k-fold and neighbour number (k) are explained in the classification part. In regression, these factors have similar effects. In addition, for both targets (heating load and cooling load) the mean absolute errors are quite similar when the algorithm type, neighbour number and data type are the same.

*   The closer the MAE is to 0, the more accurate the model is. However, MAE is expressed on the same scale as the target we are predicting, so there isn't a general rule for what a good score is. For both cooling and heating load, k=1 and k=3 have smaller MAE. Also, there is no significant difference between k=1, k=3 and k=5, k=7.
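
Because MAE has no universal "good" value, one common way to put it in context is to compare it with the MAE of a trivial baseline that always predicts the mean of the training targets. The sketch below shows this comparison on made-up numbers; the helper `baseline_mae` and all values are assumptions for illustration and are not results from this assignment.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def baseline_mae(train_targets, test_targets):
    """MAE obtained by always predicting the mean of the training targets."""
    mean_target = sum(train_targets) / len(train_targets)
    return mean_absolute_error(test_targets, [mean_target] * len(test_targets))

# Hypothetical target values and model predictions
train_targets = [10.0, 20.0, 30.0]
test_targets = [12.0, 28.0]
model_predictions = [13.0, 26.0]

model_score = mean_absolute_error(test_targets, model_predictions)
baseline_score = baseline_mae(train_targets, test_targets)
print(model_score, baseline_score)  # 1.5 8.0
```

A model whose MAE is far below the baseline (here 1.5 versus 8.0) is clearly learning something from the features, whatever the scale of the target.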