## *Table of Contents (Question 1)*
### **[Part A](#a)**
### **[Part B](#b)**
### **[Part C](#c)**
### **[Part D](#d)**
### **[Part E](#e)**
### **[Part F](#f)**

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from math import sqrt
pd.set_option('display.max_columns', None)

### Part A <a name="a"></a>

In [3]:
# read the data

df = pd.read_csv('THA_diamonds.csv')
df.tail(10)

Unnamed: 0,cut,color,depth,price,carat
202,Good,I,57.1,premium,1.01
203,Good,I,58.0,premium,0.91
204,Fair,F,65.7,premium,0.9
205,Good,I,59.6,premium,0.92
206,Fair,F,66.8,premium,0.91
207,Good,F,63.7,premium,0.96
208,Fair,D,57.5,premium,0.9
209,Fair,F,64.7,premium,0.9
210,Good,I,58.2,premium,0.93
211,Fair,F,58.9,premium,0.9


In [4]:
df.shape

(212, 5)

The original dataset has 212 observations and 5 variables.

We need to predict `carat` for a new diamond with the following values:

**`cut`** : "Good",

**`color`** : D, 

**`depth`** : 60 , 

**`price`** : "premium"

We are given the true value **`carat`** = 0.71,

but we will pretend we do not know it and predict it with the **`KNN algorithm`** using **`Euclidean Distance`**.

So our target feature is `carat`, and the remaining columns are descriptive features.


Now I will add this new observation to the data.

In [5]:
# new_rec = {"Good","D",60,"premium",0.71}

new_rec = {"cut":"Good","color":"D","depth":60,"price":"premium","carat":0.71}
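# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df, pd.DataFrame([new_rec])], ignore_index=True)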
df.tail(10)

Unnamed: 0,cut,color,depth,price,carat
203,Good,I,58.0,premium,0.91
204,Fair,F,65.7,premium,0.9
205,Good,I,59.6,premium,0.92
206,Fair,F,66.8,premium,0.91
207,Good,F,63.7,premium,0.96
208,Fair,D,57.5,premium,0.9
209,Fair,F,64.7,premium,0.9
210,Good,I,58.2,premium,0.93
211,Fair,F,58.9,premium,0.9
212,Good,D,60.0,premium,0.71


In [6]:
df.shape

(213, 5)

After appending the record, we now have 213 records with 5 variables.

For the **`KNN algorithm`**, all values must be numeric, so we will apply `one-hot-encoding` to every feature that is not numeric.

In [7]:
df.dtypes

cut       object
color     object
depth    float64
price     object
carat    float64
dtype: object

As we can see, **cut, color and price** are of type object, so we need to perform `one-hot-encoding` on these features.

The target feature `carat` is already numeric, so I copy its values to a new variable `carat` for **`training purpose`** and `remove` carat from df.

In [8]:
# creating target feature carat from data

carat = df["carat"]
carat.tail(10)

203    0.91
204    0.90
205    0.92
206    0.91
207    0.96
208    0.90
209    0.90
210    0.93
211    0.90
212    0.71
Name: carat, dtype: float64

In [9]:
# dropping target feature carat from data

df = df.drop(columns = "carat")
df.tail(10)

Unnamed: 0,cut,color,depth,price
203,Good,I,58.0,premium
204,Fair,F,65.7,premium
205,Good,I,59.6,premium
206,Fair,F,66.8,premium
207,Good,F,63.7,premium
208,Fair,D,57.5,premium
209,Fair,F,64.7,premium
210,Good,I,58.2,premium
211,Fair,F,58.9,premium
212,Good,D,60.0,premium


In [10]:
# obtaining object type from dataset and printing counts of each categorical

# np.object was removed in NumPy 1.24; the builtin object is equivalent here
categorical = df.columns[df.dtypes == object].tolist()
for x in categorical:
    print(x + "\n")
    print(df[x].value_counts())
    print("\n")

cut

Good    153
Fair     60
Name: cut, dtype: int64


color

F    92
I    63
D    58
Name: color, dtype: int64


price

low        93
medium     74
high       31
premium    15
Name: price, dtype: int64




**cut, color & price** are categorical, so I printed the counts of each unique value to check for any **`white space error`** or **`typo error`**.


Everything looks fine, so I will now perform **`one-hot-encoding`**.

In [11]:
# perform one-hot-encoding

# copy df to a new data frame, encoded_df
encoded_df = df.copy()

# encode every categorical feature of encoded_df
for x in categorical:
    levels = encoded_df[x].nunique()

    # a feature with exactly 2 unique values becomes a single 0/1 column
    # (drop_first=True keeps only the second level, here cut == "Good")
    if levels == 2:
        encoded_df[x] = pd.get_dummies(encoded_df[x], drop_first=True, dtype='uint8').iloc[:, 0]

# features with more than 2 unique values get one 0/1 column per unique value
# (dtype='uint8' keeps 0/1 output on pandas versions that default to bool)
encoded_df = pd.get_dummies(encoded_df, dtype='uint8')
encoded_df.tail(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
203,1,58.0,0,0,1,0,0,0,1
204,0,65.7,0,1,0,0,0,0,1
205,1,59.6,0,0,1,0,0,0,1
206,0,66.8,0,1,0,0,0,0,1
207,1,63.7,0,1,0,0,0,0,1
208,0,57.5,1,0,0,0,0,0,1
209,0,64.7,0,1,0,0,0,0,1
210,1,58.2,0,0,1,0,0,0,1
211,0,58.9,0,1,0,0,0,0,1
212,1,60.0,1,0,0,0,0,0,1


After one-hot-encoding, encoded_df looks like the table above.

**`color`** has 3 unique values, so it produced 3 separate columns, one per value: `color_D`, `color_F`, `color_I`.

**`cut`** has only two unique values, `Good` and `Fair`, so instead of two separate columns it was simply replaced by a single column with `Good as 1` and `Fair as 0`.
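
The encoding rule described above can be sketched on a toy frame. The three rows and the name `toy` are made up for illustration and are not taken from `THA_diamonds.csv`; `dtype=int` is passed so that the result is 0/1 regardless of the pandas version.

```python
import pandas as pd

# hypothetical toy data, only to illustrate the encoding rule
toy = pd.DataFrame({"cut": ["Good", "Fair", "Good"],
                    "color": ["D", "F", "I"]})

# binary feature: keep a single 0/1 column (levels sort alphabetically,
# so drop_first=True drops "Fair" and keeps the "Good" indicator)
toy["cut"] = pd.get_dummies(toy["cut"], drop_first=True, dtype=int).iloc[:, 0]

# feature with 3 levels: one 0/1 column per level
toy = pd.get_dummies(toy, dtype=int)

print(toy.columns.tolist())  # ['cut', 'color_D', 'color_F', 'color_I']
print(toy["cut"].tolist())   # [1, 0, 1]
```

Keeping one column for a binary feature avoids two perfectly complementary columns, which would count the same difference twice in a distance calculation.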


In [12]:
encoded_df.dtypes

cut                uint8
depth            float64
color_D            uint8
color_F            uint8
color_I            uint8
price_high         uint8
price_low          uint8
price_medium       uint8
price_premium      uint8
dtype: object

Checking the types confirms that all features are numeric and no categorical feature remains in the data frame.

In [13]:
encoded_df.describe()

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
count,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0
mean,0.71831,62.199531,0.2723,0.431925,0.295775,0.14554,0.43662,0.347418,0.070423
std,0.450883,2.829369,0.446192,0.496511,0.457465,0.353475,0.497135,0.477272,0.256461
min,0.0,55.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,60.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,63.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,64.1,1.0,1.0,1.0,0.0,1.0,1.0,0.0
max,1.0,68.7,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Here, all the `one-hot-encoded variables` are in the range `0-1`.

But we need to be careful with the **`depth`** variable, which ranges over `55.3-68.7`.

Because the **`KNN algorithm`** relies on distances, a feature with such a wide range would dominate the distance calculation and influence the predicted **`carat`** far more than the 0/1 features. To remove this effect, we need to **`normalize`** the `encoded_df`.

For normalizing, I am using **`MinMaxScaler`**, which scales every column of `encoded_df` to the range `0-1`.

The equation of `MinMaxScaler` is,
$$m = \frac{(x -min)}{(max -min)} $$

Where,

**m** = new scaled value

**x** = original value

**min** = minimum value

**max** = maximum value
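
As a small worked example of this formula, the sketch below scales the depth of the new diamond (60) using the minimum and maximum of `depth` reported by `describe()` (55.3 and 68.7). The helper name `min_max_scale` is an illustrative assumption and is not part of sklearn.

```python
def min_max_scale(x, lo, hi):
    """Scale x to [0, 1] given the column minimum lo and maximum hi."""
    return (x - lo) / (hi - lo)

# depth of the new diamond, with the min and max of depth from describe()
m = min_max_scale(60.0, 55.3, 68.7)
print(round(m, 6))  # 0.350746
```

This is the same value that appears in the `depth` column of the scaled row for the new diamond.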


In [14]:
# normalizing the encoded_df with MinMaxScaler

scaler = preprocessing.MinMaxScaler()
norm_encoded_df = scaler.fit_transform(encoded_df)
norm_encoded_df = pd.DataFrame(norm_encoded_df, 
                                    columns = encoded_df.columns)
norm_encoded_df.tail(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
203,1.0,0.201493,0.0,0.0,1.0,0.0,0.0,0.0,1.0
204,0.0,0.776119,0.0,1.0,0.0,0.0,0.0,0.0,1.0
205,1.0,0.320896,0.0,0.0,1.0,0.0,0.0,0.0,1.0
206,0.0,0.858209,0.0,1.0,0.0,0.0,0.0,0.0,1.0
207,1.0,0.626866,0.0,1.0,0.0,0.0,0.0,0.0,1.0
208,0.0,0.164179,1.0,0.0,0.0,0.0,0.0,0.0,1.0
209,0.0,0.701493,0.0,1.0,0.0,0.0,0.0,0.0,1.0
210,1.0,0.216418,0.0,0.0,1.0,0.0,0.0,0.0,1.0
211,0.0,0.268657,0.0,1.0,0.0,0.0,0.0,0.0,1.0
212,1.0,0.350746,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [15]:
norm_encoded_df.describe()

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
count,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0,213.0
mean,0.71831,0.51489,0.2723,0.431925,0.295775,0.14554,0.43662,0.347418,0.070423
std,0.450883,0.211147,0.446192,0.496511,0.457465,0.353475,0.497135,0.477272,0.256461
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.358209,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.604478,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.656716,1.0,1.0,1.0,0.0,1.0,1.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Now, as we can see, every descriptive feature is in the range `0-1`.

Next, I attach the target feature `carat` to the scaled norm_encoded_df, creating a new data-frame **`clean_encoded_df`**.

I did not scale the target feature `carat` because, in the **`KNN Algorithm`**, distances are computed from the descriptive features only and the prediction is the average of the nearest neighbours' target values, so an `unscaled target value` does not distort the distances.

The last 10 rows after `one-hot-encoding` and `scaling` are displayed below.

In [16]:
# attach the target feature carat to norm_encoded_df
clean_encoded_df = norm_encoded_df.assign(carat = carat.values)
clean_encoded_df.tail(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium,carat
203,1.0,0.201493,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.91
204,0.0,0.776119,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.9
205,1.0,0.320896,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.92
206,0.0,0.858209,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.91
207,1.0,0.626866,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.96
208,0.0,0.164179,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.9
209,0.0,0.701493,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.9
210,1.0,0.216418,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.93
211,0.0,0.268657,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.9
212,1.0,0.350746,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.71


**`clean_encoded_df`** is our `final prepared` data, and we will submit it to the **`KNN algorithm`**.

### Part B <a name="b"></a>

To obtain a prediction from the **`KNN Algorithm`** using **`Euclidean distance`**, we first need to find the Euclidean distance from our `target diamond or row` to `each diamond or row`.

The general equation of `Euclidean Distance` is given below.

$$Euclidean(a,b) = \sqrt{\sum_{i=1}^{m}(a[i] - b[i])^2}$$

where,

**m** = number of descriptive features

**a** = each row,

**b** = target row
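
The formula can be checked on a single pair of rows. The sketch below uses the scaled values of the new diamond and of the first diamond as they appear in the tables of this notebook; the helper name `euclidean` is an illustrative assumption.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# scaled feature order: cut, depth, color_D, color_F, color_I,
# price_high, price_low, price_medium, price_premium
target = [1.0, 0.350746, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # new diamond
row_0 = [1.0, 0.619403, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # first diamond

d0 = euclidean(row_0, target)
print(round(d0, 4))  # 1.4395
```

Only `depth`, `price_low` and `price_premium` differ between the two rows, so the distance is the square root of 0.268657² + 1 + 1.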

In [17]:
# excel calculation
# =SQRT((A2-1)^2 + (B2-0.350746)^2 + (C2-1)^2 + (D2-0)^2 + (E2-0)^2 + (F2-0)^2 + (G2-0)^2 + (H2-0)^2 + (I2-1)^2 )

In [18]:
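# finding euclidean distance from target row to each other row
# (the last row is the target diamond, so it is excluded; 0.350746 is its scaled depth)

euclidean = []

# .iloc is used because each row is a Series indexed by column name
for x, y in clean_encoded_df[:-1].iterrows():
    euclidean.append(sqrt(((y.iloc[0]-1)**2) + 
                          ((y.iloc[1]-0.350746)**2) + 
                          ((y.iloc[2]-1)**2) + 
                          ((y.iloc[3]-0)**2) + 
                          ((y.iloc[4]-0)**2) + 
                          ((y.iloc[5]-0)**2) + 
                          ((y.iloc[6]-0)**2) + 
                          ((y.iloc[7]-0)**2) + 
                          ((y.iloc[8]-1)**2)))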

In [19]:
# created euclidean_dist df

euclidean_dist = pd.DataFrame({'euclidean': euclidean})
euclidean_dist.head(10)

Unnamed: 0,euclidean
0,1.439506
1,2.257928
2,2.000223
3,2.014207
4,2.258976
5,2.277575
6,1.440917
7,1.440917
8,1.434234
9,2.26577


I have obtained the `euclidean distance` from the target diamond to every other diamond and added it to clean_encoded_df as the column `Euclidean Dist`.

In [20]:
# adding euclidean_dist to clean_encoded_df as a new column
# (rows are aligned by index, so the target row itself gets NaN)

clean_encoded_df["Euclidean Dist"] = euclidean_dist["euclidean"]
clean_encoded_df.head(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium,carat,Euclidean Dist
0,1.0,0.619403,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.44,1.439506
1,0.0,0.664179,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.45,2.257928
2,1.0,0.380597,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.5,2.000223
3,1.0,0.11194,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.45,2.014207
4,0.0,0.671642,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.45,2.258976
5,0.0,0.783582,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.45,2.277575
6,1.0,0.626866,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.45,1.440917
7,1.0,0.626866,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.45,1.440917
8,1.0,0.589552,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.46,1.434234
9,0.0,0.716418,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.5,2.26577


The `Euclidean Dist` from the target diamond to the first diamond is 1.439506,

    from the target diamond to the second diamond is 2.257928,
    and so on.


To predict `carat` with an `n-KNN algorithm`, we take the `average of carat` over the `n` observations with the `lowest Euclidean Dist`.

If it is **`1-KNN algorithm`**,

    we take the carat of the observation with the lowest Euclidean Dist.

If it is **`5-KNN algorithm`**, 

    we take the average carat of the 5 observations with the lowest Euclidean Dist.

If it is **`10-KNN algorithm`**, 

    we take the average carat of the 10 observations with the lowest Euclidean Dist.
    
    
So, I sorted clean_encoded_df by `Euclidean Dist` and assigned the observations with the **`10 lowest Euclidean Dist`** to clean_encoded_df_sort.
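
The rule above can be sketched without pandas. The distance and carat pairs below are the 10 nearest diamonds from the sorted output in this section, deliberately shuffled so the sorting step is visible; the function name `knn_predict_mean` is an illustrative assumption.

```python
def knn_predict_mean(distances, targets, k):
    """Average the targets of the k observations with the smallest distances."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    nearest = [targets[i] for i in order[:k]]
    return sum(nearest) / k

# (Euclidean Dist, carat) pairs of the 10 nearest diamonds, shuffled on purpose
dist = [1.414233, 0.007462, 1.419893, 0.156716, 1.017255,
        1.414529, 1.414922, 1.415808, 1.416594, 1.419245]
carat_vals = [0.51, 0.70, 0.57, 0.90, 0.90,
              0.92, 0.44, 0.54, 0.62, 0.53]

for k in (1, 5, 10):
    print(k, round(knn_predict_mean(dist, carat_vals, k), 3))
# prints: 1 0.7, then 5 0.786, then 10 0.663
```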

In [21]:
# sorting (lowest-highest) the clean_encoded_df by Euclidean Dist, 
# and assigning the first 10 rows to clean_encoded_df_sort

clean_encoded_df_sort = clean_encoded_df.sort_values(by=['Euclidean Dist']).head(10)
clean_encoded_df_sort

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium,carat,Euclidean Dist
198,1.0,0.343284,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.7,0.007462
200,1.0,0.19403,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.9,0.156716
208,0.0,0.164179,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.9,1.017255
28,1.0,0.358209,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.51,1.414233
205,1.0,0.320896,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.92,1.414529
10,1.0,0.395522,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.44,1.414922
67,1.0,0.41791,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.54,1.415808
114,1.0,0.268657,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.62,1.416594
59,1.0,0.231343,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.53,1.419245
95,1.0,0.223881,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.57,1.419893


For **`1-KNN algorithm`**, the prediction is the `carat-value` of the observation with the `lowest Euclidean dist.`

In [22]:
# prediction of the 1-KNN algorithm 
method = ["1-KNN", "5-KNN", "10-KNN"]
knn_manual = []

# column 9 is carat; row 0 is the nearest neighbour
KNN_1_ = clean_encoded_df_sort.iloc[0,9]
knn_manual.append(KNN_1_)
print(round(KNN_1_,3))

0.7


Prediction of `1-KNN algorithm` is **`0.7`**

### Part C <a name="c"></a>

For **`5-KNN algorithm`**, the prediction is the `average carat-value` of the observations with the `5 lowest Euclidean dist.`

$pred = \frac{0.7 + 0.9 + 0.9 + 0.51 + 0.92}{5} = 0.786$

In [23]:
# prediction of the 5-KNN algorithm 

KNN_5_ = clean_encoded_df_sort.iloc[0:5,9]
knn_manual.append(round(KNN_5_.mean(),3))
print(round(KNN_5_.mean(),3))

0.786


Prediction of `5-KNN algorithm` is **`0.786`**

### Part D <a name="d"></a>

For **`10-KNN algorithm`**, the prediction is the `average carat-value` of the observations with the `10 lowest Euclidean dist.`


$pred = \frac{0.7 + 0.9 + 0.9 + 0.51 + 0.92 + 0.44 + 0.54 + 0.62 + 0.53 + 0.57}{10} = 0.663$

In [24]:
# prediction of the 10-KNN algorithm 

KNN_10_ = clean_encoded_df_sort.iloc[0:10,9]
knn_manual.append(round(KNN_10_.mean(),3))
print(round(KNN_10_.mean(),3))

0.663


Prediction of `10-KNN algorithm` is **`0.663`**

Now, I create a data-frame called **`df_summary_manual`**.

In [25]:
# creating df_summary_manual

df_summary_manual = pd.DataFrame(list(zip(method, knn_manual)),
               columns =["method", "prediction"])
df_summary_manual

Unnamed: 0,method,prediction
0,1-KNN,0.7
1,5-KNN,0.786
2,10-KNN,0.663


To find the best **KNN algorithm**, we look at the absolute difference between the predicted `carat` and the actual `carat = 0.71`.

The algorithm with the `smallest difference` is the best of the three.
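
As a plain-Python sketch of this comparison, using the three predictions from the summary table above (the dictionary names are illustrative assumptions):

```python
actual = 0.71  # true carat of the target diamond, as given in Part A
predictions = {"1-KNN": 0.7, "5-KNN": 0.786, "10-KNN": 0.663}

# absolute error of each prediction, rounded to 3 decimals as in the notebook
errors = {name: round(abs(pred - actual), 3) for name, pred in predictions.items()}
best = min(errors, key=errors.get)

print(errors)  # {'1-KNN': 0.01, '5-KNN': 0.076, '10-KNN': 0.047}
print(best)    # 1-KNN
```

Note that this ranks the three values of k on a single test diamond only, so it says which k happened to work best here rather than which is best in general.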

In [26]:
# finding the best algorithm from difference

df_summary_manual["difference"] = round((abs(df_summary_manual["prediction"] - 0.71)),3)
df_summary_manual["is_best"] = np.where(df_summary_manual["difference"] == min(df_summary_manual["difference"]),True,False)
df_summary_manual

Unnamed: 0,method,prediction,difference,is_best
0,1-KNN,0.7,0.01,True
1,5-KNN,0.786,0.076,False
2,10-KNN,0.663,0.047,False


**`1-KNN algorithm`** is the best, as its difference is only `0.010`, compared to `0.076 for 5-KNN` and `0.047 for 10-KNN`.

**`5-KNN algorithm`** is the worst of the 3.

Now, I am converting `df_summary_manual` to the given format.

In [27]:
# converting df_summary_manual to given format

df_summary_manual = df_summary_manual.drop(columns = ["difference"])
df_summary_manual

Unnamed: 0,method,prediction,is_best
0,1-KNN,0.7,True
1,5-KNN,0.786,False
2,10-KNN,0.663,False


### Part E <a name="e"></a>

Now, we have to create training sets to train the **`sklearn KNeighborsRegressor`**.

The `golden rule of ML` is that the `model must not see the testing data`, i.e. the `data to be predicted must be unknown to the ML algorithm`.

For that purpose, I created **`d_train`** with all `descriptive features` and **`t_train`** with the `target feature`, both using the `whole dataset except for the one row` whose `carat` we have to predict.

I created **`d_predict`** with all `descriptive features` of the observation whose `carat` we have to predict.

In [28]:
# creating the d_train (descriptive features) without last target row to train the KNeighborsRegressor

d_train = norm_encoded_df[:-1]
d_train.tail(10)

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
202,1.0,0.134328,0.0,0.0,1.0,0.0,0.0,0.0,1.0
203,1.0,0.201493,0.0,0.0,1.0,0.0,0.0,0.0,1.0
204,0.0,0.776119,0.0,1.0,0.0,0.0,0.0,0.0,1.0
205,1.0,0.320896,0.0,0.0,1.0,0.0,0.0,0.0,1.0
206,0.0,0.858209,0.0,1.0,0.0,0.0,0.0,0.0,1.0
207,1.0,0.626866,0.0,1.0,0.0,0.0,0.0,0.0,1.0
208,0.0,0.164179,1.0,0.0,0.0,0.0,0.0,0.0,1.0
209,0.0,0.701493,0.0,1.0,0.0,0.0,0.0,0.0,1.0
210,1.0,0.216418,0.0,0.0,1.0,0.0,0.0,0.0,1.0
211,0.0,0.268657,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [29]:
# creating d_predict (descriptive features) to predict carat from KNeighborsRegressor

d_predict = norm_encoded_df.tail(1)
d_predict.tail()

Unnamed: 0,cut,depth,color_D,color_F,color_I,price_high,price_low,price_medium,price_premium
212,1.0,0.350746,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [30]:
# creating t_train (target feature) without last target row to train the KNeighborsRegressor

t_train = carat[:-1]
t_train.tail(10)

202    1.01
203    0.91
204    0.90
205    0.92
206    0.91
207    0.96
208    0.90
209    0.90
210    0.93
211    0.90
Name: carat, dtype: float64

In [31]:
# creating t_predict (target feature)

t_predict = carat.tail(1)
t_predict.tail()

212    0.71
Name: carat, dtype: float64

In [32]:
# printing the shape of split

print(d_train.shape)
print(d_predict.shape)
print(t_train.shape)
print(t_predict.shape)

(212, 9)
(1, 9)
(212,)
(1,)


The output above shows that

    d_train has 212 observations with 9 descriptive features,
    
    d_predict has 1 observation (target diamond) with 9 descriptive features,
    
    t_train has 212 observations of the target feature,
    
    t_predict has 1 observation of the target feature (carat value of target diamond).
    

From the sklearn library, I imported **`KNeighborsRegressor`** and trained it with `d_train, t_train`.

The `n_neighbors` argument specifies the number of neighbours,
       
       here, it is 1, 5 & 10,
     
The `p` argument specifies the power of the Minkowski distance (the default metric),

        if p = 1, then manhattan_distance,
        
        if p = 2, then euclidean_distance,

Here, we need the Euclidean distance,
    
    so, p = 2


Then, I predict the **`carat`** value by passing `d_predict` to the `predict method of KNeighborsRegressor`.
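
To make the role of `p` concrete, here is a minimal sketch of the Minkowski distance in plain Python. It only illustrates the formula; it is not sklearn's implementation, and the function name `minkowski` and the two points are assumptions chosen for the example.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))  # 7.0  (Manhattan: 3 + 4)
print(minkowski(a, b, 2))  # 5.0  (Euclidean: sqrt(9 + 16))
```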

In [33]:
# prediction using KNeighborsRegressor

k_list = list([1, 5, 10])

knn_predict = []

for k in k_list:
   
    knn_regressor = KNeighborsRegressor(n_neighbors=k, p=2)
    # fit or train the KNeighborsRegressor algorithm
    knn_regressor.fit(d_train, t_train)
    # prediction from KNeighborsRegressor algorithm
    knn_predict = knn_predict + [knn_regressor.predict(d_predict)]

In [34]:
# converting knn_predict to the float type
# (each prediction is a 1-element array, so take its single element)

knn_sklearn = []
for pred in knn_predict:
    knn_sklearn.append(float(pred[0]))

I created the `df_summary_sklearn` data-frame with the method and prediction.

In [35]:
# creating df_summary_sklearn

df_summary_sklearn = pd.DataFrame(list(zip(method,knn_sklearn)),
                                 columns = ["method","prediction"])
df_summary_sklearn

Unnamed: 0,method,prediction
0,1-KNN,0.7
1,5-KNN,0.786
2,10-KNN,0.663


To find the best **KNN algorithm**, we look at the absolute difference between the predicted `carat` and the actual `carat = 0.71`.

The algorithm with the `smallest difference` is the best of the three.

In [36]:
# finding the best algorithm from difference

df_summary_sklearn["difference"] = round((abs(df_summary_sklearn["prediction"] - 0.71)),3)
df_summary_sklearn["is_best"] = np.where(df_summary_sklearn["difference"] == min(df_summary_sklearn["difference"]),True,False)
df_summary_sklearn

Unnamed: 0,method,prediction,difference,is_best
0,1-KNN,0.7,0.01,True
1,5-KNN,0.786,0.076,False
2,10-KNN,0.663,0.047,False


**`1-KNN algorithm`** is the best, as its difference is only `0.010`, compared to `0.076 for 5-KNN` and `0.047 for 10-KNN`.

**`5-KNN algorithm`** is the worst of the 3.

Now, I am converting `df_summary_sklearn` to the given format.

In [37]:
# converting df_summary_sklearn to given format

df_summary_sklearn = df_summary_sklearn.drop(columns = ["difference"])
df_summary_sklearn

Unnamed: 0,method,prediction,is_best
0,1-KNN,0.7,True
1,5-KNN,0.786,False
2,10-KNN,0.663,False


### Part F <a name = "f"></a>

**`df_summary_manual`**

In [38]:
# for manual calculation

df_summary_manual

Unnamed: 0,method,prediction,is_best
0,1-KNN,0.7,True
1,5-KNN,0.786,False
2,10-KNN,0.663,False


**`df_summary_sklearn`**

In [39]:
# for sklearn calculation

df_summary_sklearn

Unnamed: 0,method,prediction,is_best
0,1-KNN,0.7,True
1,5-KNN,0.786,False
2,10-KNN,0.663,False


The output of both calculations is the same.

The `carat` value of the diamond with
**`cut`** : Good,
**`color`** : D, 
**`depth`** : 60 , 
**`price`** : premium
is 

`0.7` for **`1-KNN`**

`0.786` for **`5-KNN`**

`0.663` for **`10-KNN`**.

Of these, `1-KNN` is the <font color='green'>**best**</font> with a `difference` of only `0.010`, 

and `5-KNN` is the <font color='red'>**worst**</font> with a `difference` of `0.076`.