<b>Dataset:</b> Diamonds-Kaggle
<br /><br />
<b>Objectives:</b> Predict Diamond prices
<br /><br />
<b>Context</b>
<br /><br />
This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.
<br /><br />
<b>Content</b>
<br /><br />
<b>price</b> price in US dollars (\$326--\$18,823)
<br /><br />
<b>carat</b> weight of the diamond (0.2--5.01)
<br /><br />
<b>cut</b> quality of the cut (Fair, Good, Very Good, Premium, Ideal)
<br /><br />
<b>color</b> diamond colour, from J (worst) to D (best)
<br /><br />
<b>clarity</b> a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
<br /><br />
<b>x</b> length in mm (0--10.74)
<br /><br />
<b>y</b> width in mm (0--58.9)
<br /><br />
<b>z</b> depth in mm (0--31.8)
<br /><br />
<b>depth</b> total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
<br /><br />
<b>table</b> width of top of diamond relative to widest point (43--95)
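
The depth formula above can be checked by hand. The following is a minimal sketch, assuming the factor of 100 that the listed range (43--79) implies; the function name `depth_percentage` is ours and not part of the dataset, and the measurements are those of the first row of the data (x = 3.95, y = 3.98, z = 2.43).

```python
def depth_percentage(x, y, z):
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    if x + y == 0:
        # Some rows record 0 for a dimension (see the ranges above); avoid dividing by zero.
        return None
    return 100 * 2 * z / (x + y)


# Measurements of the first diamond in the dataset.
print(round(depth_percentage(3.95, 3.98, 2.43), 2))  # about 61.29
```

The recomputed value is close to, but not identical to, the recorded depth of 61.5 for that row, which is probably down to rounding in the recorded measurements.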

In [3]:
import numpy as np
import pandas as pd 
from sklearn import preprocessing 

In [5]:
df = pd.read_csv('diamonds.csv')
df.head()

,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


1. Attributes x, y, and z give the dimensions of the diamond (length, width, and depth in mm).
2. price is the value we are predicting, i.e. the target vector y.
3. Attributes cut, clarity, and color are categorical with a natural order, so we can map them to integers that preserve that order.
4. Attribute Unnamed: 0 is an extra index column that came with the file. Since pandas already gives df its own index, this column can be removed.

### Data Preprocessing

In [7]:
cut_dict = {'Fair' : 1, 'Good' : 2, 'Very Good' : 3, 'Premium' : 4, 'Ideal' : 5}
clarity_dict ={ 'I1' : 1, 'SI2' : 2, 'SI1' : 3, 'VS2' : 4, 'VS1' : 5, 'VVS2' : 6, 'VVS1' : 7 , 'IF' : 8}
color_dict = {'D':7, 'E':6, 'F':5, 'G':4, 'H':3, 'I':2, 'J':1}

In [8]:
df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
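
One thing to watch with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below, on a small hypothetical column with one misspelled label, shows one way to list such labels before they reach the model.

```python
import pandas as pd

cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

# Hypothetical toy column; 'Very good' (lower-case g) is not a key in cut_dict.
cut = pd.Series(['Ideal', 'Premium', 'Very good'])
mapped = cut.map(cut_dict)

# Labels that failed to map show up as NaN in the mapped column.
unmapped = cut[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is empty for all three columns, the mappings cover every category in the data.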

In [9]:
df = df.drop('Unnamed: 0', axis = 1)
df.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,5,6,2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,4,6,3,59.8,61.0,326,3.89,3.84,2.31
2,0.23,2,6,5,56.9,65.0,327,4.05,4.07,2.31
3,0.29,4,2,4,62.4,58.0,334,4.2,4.23,2.63
4,0.31,2,1,2,63.3,58.0,335,4.34,4.35,2.75


In [10]:
df.isnull().any()

carat      False
cut        False
color      False
clarity    False
depth      False
table      False
price      False
x          False
y          False
z          False
dtype: bool
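
No column contains nulls, but the ranges listed in the description show that x, y, and z can be 0, which is not a physically possible dimension and most likely stands for a missing measurement. A minimal sketch of how such rows could be flagged, using a hypothetical three-row frame rather than the real data:

```python
import pandas as pd

# Hypothetical rows; the second and third each have one dimension recorded as 0.
toy = pd.DataFrame({
    'x': [3.95, 0.0, 4.05],
    'y': [3.98, 3.84, 4.07],
    'z': [2.43, 2.31, 0.0],
})

# True for every row where at least one of x, y, z is exactly 0.
zero_dims = (toy[['x', 'y', 'z']] == 0).any(axis=1)
print(zero_dims.sum())

# One option is to drop those rows; imputing them would be another.
clean = toy[~zero_dims]
```

Whether to drop or impute these rows is a judgment call; the notebook itself keeps them.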

### Prediction

In [11]:
from sklearn.utils import shuffle   # `sklearn` itself was never imported, so import shuffle directly
df = shuffle(df, random_state = 42)
X = df.drop(['price'], axis = 1).values
X = preprocessing.scale(X)
y = df['price'].values
y = preprocessing.scale(y)

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

In [13]:
from sklearn.neighbors import KNeighborsRegressor
score = []
for k in range(1,20):   # try K = 1..19 and record the score (R^2 for a regressor) on the test set
    clf = KNeighborsRegressor(n_neighbors = k,  weights = 'distance', p=1)
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))  

In [14]:
k_max = score.index(max(score))+1
print( "At K = {}, Max Accuracy = {}".format(k_max, max(score)*100))

At K = 11, Max Accuracy = 97.27399003930822
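
Two caveats about this number. `KNeighborsRegressor.score` returns the coefficient of determination R^2, not a classification accuracy. And because K was picked by scoring on the test set, the reported value is likely a little optimistic. A common alternative is to choose K by cross-validation on the training data only. The sketch below does this on synthetic data (the arrays here are stand-ins, not the diamonds data) with the same model settings:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training split; not the diamonds data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Mean 5-fold cross-validated R^2 for each K, computed on training data only.
cv_scores = {}
for k in range(1, 20):
    model = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```

With K chosen this way, the test set is used once at the end, so its score is an honest estimate of performance on unseen data.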


In [15]:
clf = KNeighborsRegressor(n_neighbors = k_max,  weights = 'distance', p=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test ))   
y_pred = clf.predict(X_test)

0.9727399003930821
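
Because y was standardized with `preprocessing.scale`, `y_pred` is in standard-deviation units rather than dollars. To report errors in dollars, the mean and standard deviation of the original prices have to be kept and the scaling undone. A minimal sketch, using the five prices shown in the table above and made-up predictions:

```python
import numpy as np

# The five prices from the head of the dataset.
prices = np.array([326.0, 326.0, 327.0, 334.0, 335.0])

# preprocessing.scale on a 1-D array subtracts the mean and divides by the
# population standard deviation (ddof=0), which is also NumPy's default.
mean, std = prices.mean(), prices.std()
scaled = (prices - mean) / std

# Hypothetical predictions in scaled units (each off by 0.1 standard deviations).
pred_scaled = scaled + 0.1

# Undo the scaling to get predictions, and an RMSE, in dollars.
pred_dollars = pred_scaled * std + mean
rmse = np.sqrt(np.mean((pred_dollars - prices) ** 2))
print(rmse)
```

Using `StandardScaler` for y instead of `preprocessing.scale` would give the same result through its `inverse_transform` method, without storing the mean and standard deviation by hand.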
