**INSTRUCTIONS**



1. The assignment contains four questions. A few bonus questions are also mentioned.
2. This assignment is due on **6th Feb, 23:59** (**no further extensions**).
3. The assignment must be implemented in Python 3 only.
4. You are allowed to use libraries for data preprocessing (numpy, pandas, etc.), evaluation metrics, and data visualization (matplotlib, etc.).
5. You will be evaluated not just on the overall performance of the model but also on your experimentation with hyperparameters, data preprocessing techniques, etc.
6. The report file must be a well-documented Jupyter notebook explaining the experiments you have performed, the evaluation metrics, and the corresponding code. The code must run and be able to reproduce the accuracies, figures/graphs, etc.
7. For all the questions, you must create a train-validation data split and tune the hyperparameters on the validation set. Your Jupyter notebook must reflect this.
8. Any attempts at **plagiarism will be penalized heavily**.
9. Make sure you run and save your notebooks before submission.
10. For question 3 of the Decision Trees section, output your model's depth-first traversal into ```outputimp.txt``` and submit it along with the ipynb file.
11. The naming convention for the ipynb file is ```<roll_number>_assign1.ipynb```.
12. Compress your submission files into a zip file with the naming convention ```<roll_number>_assign1.zip``` and submit it in the portal.

# **1) REGRESSION**

The Diamond Price Prediction dataset is available at https://drive.google.com/drive/folders/1qE1tm3Ke3uotTyv6SUqruI09t-AkcwRK?usp=sharing. "description.txt" describes the features, and "diamonds.csv" contains the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn import preprocessing
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import os
import random

In [2]:
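def give_data(to_drop):
    # Read diamonds.csv with explicit column names. With header=None the file's
    # own header row is parsed as a data row, so the first row is dropped below.
    headers = ["carat", "cut", "color", "clarity", "depth", "table", "price", "x", "y", "z"]
    data = pd.read_csv('diamonds.csv', na_values='?', header=None, names=headers)
    data = data.reset_index(drop=True)
    data = data.iloc[1:]
    # Map the ordinal categorical features to integers (worst = 1, best = highest)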
    cut_dict = {'Fair':1, 'Good':2, 'Very Good':3, 'Premium':4, 'Ideal':5}
    color_dict = {'J':1, 'I':2, 'H':3, 'G':4, 'F':5, 'E':6, 'D':7}
    clarity_dict = {'I1':1, 'SI2':2, 'SI1':3, 'VS2': 4, 'VS1':5, 'VVS2':6, 'VVS1':7, 'IF':8}

    create_nums = {'cut' : cut_dict, 'clarity':clarity_dict, 'color':color_dict }

    data.replace(create_nums, inplace=True)

    data = data.astype(np.float64, copy=True)

    Y = data['price']
    Y = Y.astype(np.float64)

    data.drop(labels=to_drop, axis=1, inplace=True)
    # TRAIN TEST SPLIT
    X_train, X_test, Y_train, Y_test = train_test_split(data,  Y, test_size=0.20, random_state=40, shuffle=True)
    X_train = np.array(X_train[:30000])  # sub-sampling the data as my machine cannot handle all of it.
    Y_train = np.array(Y_train[:30000])
    X_test = np.array(X_test[:6000])
    Y_test = np.array(Y_test[:6000])
    return X_train, X_test, Y_train, Y_test

In [3]:
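# ERROR METRICS
# NOTE: each function below divides the usual metric by the number of samples
# once more (the trailing "/Y.shape[0]"), so the printed values are the standard
# RMSE/MSE/MAE scaled down by len(Y), not the standard metrics themselves.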
def RSME(Y, Y_hat):
    # Assumed that Y and Y_hat are one dimensional array
    print('RSME metric : ', end='')
    Y = np.array(Y)
    Y_hat = np.array(Y_hat)
    return np.sqrt(np.mean((Y - Y_hat)**2))/Y.shape[0]

def MSE(Y, Y_hat):
    # Assumed that Y and Y_hat are one dimensional array
    print('MSE metric : ', end='')
    Y = np.array(Y)
    Y_hat = np.array(Y_hat)
    return np.mean((Y - Y_hat)**2)/Y.shape[0]

def MAE(Y, Y_hat):
    # Assumed that Y and Y_hat are one dimensional array
    print('MAE metric : ', end='')
    Y = np.array(Y)
    Y_hat = np.array(Y_hat)
    return np.mean(np.abs(Y - Y_hat))/Y.shape[0]
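
For reference, the textbook definitions of these three metrics contain no extra division by the number of samples. The following is a minimal pure-Python sketch of those definitions; the names `mse`, `rmse` and `mae` and the toy arrays are illustrative assumptions and are not part of the notebook.

```python
import math

def mse(y, y_hat):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean absolute error: the average of the absolute residuals."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Toy values (hypothetical): residuals are 1, 0, -2, 0
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]
print(mse(y_true, y_pred))   # 1.25
print(rmse(y_true, y_pred))
print(mae(y_true, y_pred))   # 0.75
```

Multiplying the values printed by the notebook's `RSME` and `MAE` by the number of test samples should recover these standard quantities.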

In [4]:
def give_mean(X):
    return np.mean(X, axis=0) # take mean along columns

def give_Variance(X):
    return np.var(X, axis=0) # take variance along the columns

def covariance_matrix(X):
    return np.cov(X.T)  # gives the covariance of the columns

**KNN Regression [Diamond Price Prediction Dataset]**

1. a) Build a knn regression algorithm [using only python from scratch] to predict the price of diamonds.

In [5]:
# code for knn regression
class KNNRegression:
    def __init__(self, data=None, Y=None):
        self.data = data  # preprocessed (and, if required, normalised) feature matrix as a numpy array
        self.Y = Y  # target values as a numpy array
    
    def train(self, data, Y):
        self.data = data
        self.Y = Y
        
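    def give_dist(self, X, Y, p=2):
        # Distance between two vectors: p=1 gives Manhattan, p=2 gives Euclidean;
        # for any other p the un-rooted sum of |x - y|^p is returned.
        dist = np.sum(np.abs(X - Y) ** p)
        if p == 2:
            return math.sqrt(dist)
        else:
            return dist

    def dist_matrix_manhattan(self, X_test):
        # Manhattan (L1) distance from every test row to every training row
        num_test = X_test.shape[0]
        num_train = self.data.shape[0]
        dist_mat = np.zeros((num_test, num_train))
        for i in range(num_test):
            dist_mat[i, :] = np.sum(np.abs(self.data - X_test[i, :]), axis=1)
        return dist_mat

    def dist_matrix(self, X):
        # Squared Euclidean distances via ||x - d||^2 = ||x||^2 - 2 x.d + ||d||^2.
        # The square root is skipped because it does not change the neighbour ranking.
        dists = -2 * np.dot(X, self.data.T) + np.sum(self.data**2, axis=1) + np.sum(X**2, axis=1)[:, np.newaxis]
        return dists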

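    def do_regression(self, dist_mat, k, p=2):
        # p is only validated here; the distance type is fixed by how dist_mat was built
        if p > 2 or p <= 0:
            raise NotImplementedError('p must be 1 (Manhattan) or 2 (Euclidean)')
        # argpartition places the k smallest distances of each row in the first k columns (unordered)
        indices = np.argpartition(dist_mat, k)
        indices = indices[:, :k]
        values = np.take(self.Y, indices)  # targets of the k nearest neighbours
        return np.mean(values, axis=1)  # average over the k neighbours of each test point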
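
The class above is vectorised with numpy, which can make the underlying steps harder to see. As a hedged illustration, the same idea (compute distances, keep the k closest training points, average their targets) can be sketched in pure Python. The function `knn_predict` and the toy data are hypothetical and only meant for checking the logic on tiny inputs.

```python
def knn_predict(train_x, train_y, query, k, p=2):
    """Predict the target of `query` as the mean target of its k nearest training points."""
    def dist(a, b):
        # Minkowski distance: p=1 is Manhattan, p=2 is Euclidean
        total = sum(abs(u - v) ** p for u, v in zip(a, b))
        return total ** (1.0 / p)

    # Sort training indices by distance to the query and keep the first k
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Toy data (hypothetical): three points near the origin and one far away
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
print(knn_predict(train_x, train_y, [0.2, 0.1], k=3))  # 20.0
```

This version sorts all distances for every query, so it is far slower than the `argpartition` approach and is not a replacement for it on the full data set.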

In [6]:
# Drop the target and all but four features, leaving carat, x, y and z as inputs
X_train, X_test, Y_train, Y_test = give_data(['price', 'cut', 'clarity', 'color', 'depth', 'table'])
regressor = KNNRegression()
regressor.train(X_train, Y_train)
dists = regressor.dist_matrix_manhattan(X_test)
predicted = regressor.do_regression(dists, k=7)
print(predicted.shape)
print(RSME(Y_test, predicted))
r2 = r2_score(Y_test, predicted)
print('r2 score is :- {a}'.format(a=r2))

(6000,)
RSME metric : 0.24426404341630936
r2 score is :- 0.8682832208462565


1. b) Do we need to normalise the data? [If so, does it make any difference?]

In [7]:
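# NORMALISE DATA
# NOTE: this divides by the variance; standard z-score normalisation would
# divide by the standard deviation (np.sqrt(var)) instead.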
mean = give_mean(X_train)
var = give_Variance(X_train)
X_train -= mean
X_train /= var
X_test -= mean
X_test /= var

regressor = KNNRegression()
regressor.train(X_train, Y_train)
dists = regressor.dist_matrix(X_test)
predicted = regressor.do_regression(dists, k=7)
print(predicted.shape)
print(RSME(Y_test, predicted))
r2 = r2_score(Y_test, predicted)
print('r2 score is :- {a}'.format(a=r2))

(6000,)
RSME metric : 0.24673852650742942
r2 score is :- 0.8656010263045751


#### Answer
Yes, it generally helps, but in this specific example it does not make much difference: the R2 score is about 0.868 before normalisation and about 0.866 after it.

Classification and regression algorithms like **KNN** measure the **distances** between pairs of samples, and these distances depend on the units in which each feature is measured. For example, suppose we apply KNN to a data set with 3 features: the first ranging from 1 to 10, the second from 1 to 20, and the third from 1 to 1000. The nearest neighbours will then be determined almost entirely by the third feature, because differences within the ranges 1-10 and 1-20 are small compared with differences within 1-1000. To avoid the resulting misclassification or poor value prediction, we should normalize the feature variables.
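
To make the effect of feature scale concrete, here is a small hedged sketch in pure Python. The five samples are invented for illustration, and `standardise` and `nearest` are hypothetical helpers, not notebook functions. The sketch uses z-score standardisation (subtract the column mean, divide by the column standard deviation) and shows that the nearest neighbour of the first sample changes once both features are on the same scale.

```python
import statistics

def standardise(rows):
    """Z-score each column: subtract the column mean, divide by the column standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest(rows, query_idx):
    """Index of the row closest (squared Euclidean) to rows[query_idx], excluding itself."""
    q = rows[query_idx]
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], q)))

# Hypothetical samples: column 0 spans roughly 1-10, column 1 roughly 1-1000
rows = [[1.0, 500.0],    # query sample
        [9.0, 510.0],    # close to the query in column 1 only
        [1.5, 600.0],    # close to the query in column 0 only
        [5.0, 100.0],
        [10.0, 1000.0]]

print(nearest(rows, 0))               # 1: the large-range column dominates
print(nearest(standardise(rows), 0))  # 2: both columns now contribute
```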

In [8]:
# Compare Manhattan and Euclidean distances on the normalised data (k=7)

print('Manhattan Distance')
dists = regressor.dist_matrix_manhattan(X_test)
predicted = regressor.do_regression(dists, k=7)
print(RSME(Y_test, predicted))
r2 = r2_score(Y_test, predicted)
print('r2 score is :- {a}'.format(a=r2))

print('Euclidean Distance')
dists = regressor.dist_matrix(X_test)
predicted = regressor.do_regression(dists, k=7)
print(RSME(Y_test, predicted))
r2 = r2_score(Y_test, predicted)
print('r2 score is :- {a}'.format(a=r2))

Manhattan Distance
RSME metric : 0.24371112382925386
r2 score is :- 0.8688788579116036
Euclidean Distance
RSME metric : 0.24673852650742942
r2 score is :- 0.8656010263045751
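
The experiments above use a fixed k=7. One possible way to choose k, in line with the train-validation requirement in the instructions, is to score several candidate values on a held-out validation split and keep the best one. The sketch below is only an illustration under stated assumptions: the data is synthetic and one-dimensional, and `knn_predict` and `validation_mse` are hypothetical helpers rather than notebook code.

```python
import random

def knn_predict(train, query, k):
    """Mean target of the k training pairs (x, y) whose x is closest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

def validation_mse(train, val, k):
    """Mean squared error of k-NN predictions on the validation pairs."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in val) / len(val)

# Synthetic 1-D data (an assumption for illustration): y = 3x plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0.0, 1.0)) for x in xs]
rng.shuffle(data)
train, val = data[:160], data[160:]   # 80/20 train-validation split

# Score each candidate k on the validation split and keep the lowest error
scores = {k: validation_mse(train, val, k) for k in (1, 3, 5, 7, 9)}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

The test split would then be used only once, to report the final metrics for the selected k.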


3. Report the Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) score in tabular form.

In [9]:
# report a table
r2 = r2_score(Y_test, predicted)
print('r2 score is :- {a}'.format(a=r2))
print(MAE(Y_test, predicted))
print(MSE(Y_test, predicted))

r2 score is :- 0.8656010263045751
MAE metric : 0.13947561507936507
MSE metric : 365.2794027783447
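
The output above lists the metrics one per line. A plain-text table can be produced with string formatting alone, as in the minimal sketch below. The helper `metrics_table` is hypothetical, and the numbers passed to it are placeholders for illustration, not results from this notebook.

```python
def metrics_table(rows):
    """Format (name, value) pairs as a two-column plain-text table."""
    width = max(len("Metric"), max(len(name) for name, _ in rows))
    lines = [f"{'Metric':<{width}} | Value",
             f"{'-' * width}-+-{'-' * 10}"]
    lines += [f"{name:<{width}} | {value:.4f}" for name, value in rows]
    return "\n".join(lines)

# Placeholder values for illustration only
print(metrics_table([("MSE", 1.25), ("MAE", 0.75), ("R2", 0.9)]))
```

In the notebook, the values would come from the metric functions and `r2_score` evaluated on `Y_test` and `predicted`.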
