# Machine Learning Project - Regression

### Diamond price prediction from labelled data on close to 54,000 diamonds, with price and other attributes.

This project uses regression analysis within a machine-learning framework to predict the price of a diamond.

We use three scikit-learn regressors: Linear Regression (with polynomial features of degree 1 and 2), Decision Tree and Random Forest. Parameters to understand the

To find the best hyperparameters for the Decision Tree and Random Forest models, GridSearchCV is used to search over a range of parameter values.

RMSE and R2 score are evaluated for each model to choose the best one.
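
For reference, both metrics can be written out directly. The sketch below uses made-up numbers (not values from the diamonds data) and the hypothetical names `y_true` and `y_pred`, and compares the hand-written formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only; not taken from the diamonds data
y_true = np.array([300.0, 500.0, 900.0, 1500.0])
y_pred = np.array([350.0, 450.0, 1000.0, 1400.0])

# RMSE: square root of the mean squared residual (same units as price)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same quantities via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is expressed in the units of the target (USD here), while R2 is unitless, which is why the notebook reports both.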

The data, a CSV file, is labelled with prices (the target or dependent variable) and the attributes defined below (the input or independent variables), comprising continuous and categorical features.

The properties of these features (along with Price) are:

**Continuous Variables :**
1. Price : Price in USD (326 - 18,823)
2. Carat : Weight of the Diamond (0.2 - 5.01)
3. x : Length in mm (0 - 10.74)
4. y : Width in mm (0 - 58.9)
5. z : Depth in mm (0 - 31.8)
6. Depth : Total Depth Percentage = 100 * z / mean(x, y) (43 - 79)
7. Table : Width of top of diamond relative to widest point (43 - 95)
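
The depth formula can be checked against the first five rows of the dataset, which appear in the `df.head()` output later in this notebook. A small sketch, assuming the percentage is `100 * z / mean(x, y)`; the recorded values agree with the computed ones to within a few tenths of a percentage point rather than exactly.

```python
import pandas as pd

# First five rows as printed by df.head() in this notebook
rows = pd.DataFrame({
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
})

# Assumed formula: depth % = 100 * z / mean(x, y)
rows["depth_calc"] = 100 * rows["z"] / rows[["x", "y"]].mean(axis=1)
rows["abs_diff"] = (rows["depth"] - rows["depth_calc"]).abs()

print(rows.round(2))
```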

**Categorical Variables :** *These are actually ordinal variables, since their levels have a natural order.*
1. Cut : Quality of the cut {Fair, Good, Very Good, Premium, Ideal}
2. Color : Color of Diamond {from J being Worst to D as Best}
3. Clarity : Measure of how clear the diamond is {I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)}
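
Because the three variables are ordered, one alternative to dummy encoding is to map each level to its rank. The sketch below is only an illustration and is not the encoding used later in this notebook; the orderings are the ones listed above, while the `*_rank` column names are hypothetical.

```python
import pandas as pd

# Orderings from worst to best, as listed above
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["J", "I", "H", "G", "F", "E", "D"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Tiny sample for illustration (categorical values of the first three rows)
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
})

for col, order in [("cut", cut_order), ("color", color_order), ("clarity", clarity_order)]:
    # Ordered categorical -> integer codes from 0 (worst) to n-1 (best)
    cat = pd.Categorical(sample[col], categories=order, ordered=True)
    sample[col + "_rank"] = cat.codes

print(sample)
```

A rank encoding keeps the order but also assumes equal spacing between levels, which dummy encoding does not; which of the two suits a given model is a judgment call.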

#### Approach
1. Getting Libraries & Dependencies
2. Data Collection and a first look at the data
3. Data Processing for better hygiene
4. Preparing X and Y to further segregate into Train and Test
5. Invoking sklearn dependencies
6. Linear Regression with degree 1 & 2
7. Decision Tree
8. Random Forest
9. Conclusion

**Let's kick off...!**

## 1. Getting Libraries & Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import datetime

In [2]:
# Display the result of every expression in a cell, so print() is not needed on each line
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

## 2. Data Collection

In [3]:
path = 'C:/Users/Nivedit/CreativeSpace/myWork/Github_Projects/ML/Diamond_Price_Prediction/'
df = pd.read_csv(path + 'diamonds.csv', usecols=['carat','cut','color','clarity','depth','table','price','x','y','z'])

In [4]:
df.shape
df.head()

(53940, 10)

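| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |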


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [6]:
df.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


## 3. Data Processing

### Checking for Null

In [7]:
df.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

No Null Values
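
A null check does not catch placeholder values. The `describe()` output above shows a minimum of 0 for `x`, `y` and `z`, which is not a plausible physical dimension. This notebook keeps those rows; the sketch below shows one way they could be counted and filtered, using a small made-up frame (`toy`) rather than the real data.

```python
import pandas as pd

# Made-up rows for illustration; the second and fourth contain a zero dimension
toy = pd.DataFrame({
    "x": [3.95, 0.00, 4.05, 6.10],
    "y": [3.98, 3.84, 4.07, 6.05],
    "z": [2.43, 2.31, 2.31, 0.00],
})

# Flag rows where any dimension is exactly zero
has_zero_dim = (toy[["x", "y", "z"]] == 0).any(axis=1)
n_bad = int(has_zero_dim.sum())

# Keep only rows with strictly positive dimensions
toy_clean = toy.loc[~has_zero_dim].reset_index(drop=True)

print(n_bad, toy_clean.shape)
```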

### Renaming Columns

In [8]:
df = df.rename(columns={'depth':'depth_pct', 'x':'length', 'y':'width', 'z':'depth'})

In [9]:
df.head()

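| | carat | cut | color | clarity | depth_pct | table | price | length | width | depth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |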


#### Checking the level counts of all 3 Categorical / Ordinal Variables

In [10]:
for i in ['cut','color','clarity']:
    df[i].value_counts().sort_index()

Fair          1610
Good          4906
Ideal        21551
Premium      13791
Very Good    12082
Name: cut, dtype: int64

D     6775
E     9797
F     9542
G    11292
H     8304
I     5422
J     2808
Name: color, dtype: int64

I1        741
IF       1790
SI1     13065
SI2      9194
VS1      8171
VS2     12258
VVS1     3655
VVS2     5066
Name: clarity, dtype: int64

### Dummy Encoding to convert all Categorical Variables into Numeric Columns

In [11]:
df_mod = pd.get_dummies(df, columns=['cut','color','clarity'], drop_first=True)

In [12]:
df_mod.head(3)

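| | carat | depth_pct | table | price | length | width | depth | cut_Good | cut_Ideal | cut_Premium | ... | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |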
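
With `drop_first=True`, pandas drops the first level of each variable (in sorted order for string columns), so a variable with k levels yields k - 1 indicator columns. For this dataset that gives 4 + 6 + 7 = 17 indicators, which together with the 7 numeric columns accounts for the 24 columns of `df_mod`. A toy sketch of the behaviour (the `toy` frame is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29],
    "cut": ["Ideal", "Premium", "Good", "Fair"],
})

encoded = pd.get_dummies(toy, columns=["cut"], drop_first=True)

# 'Fair' sorts first, so it becomes the baseline level and gets no column
print(list(encoded.columns))
```

A row with all `cut_*` indicators equal to 0 therefore represents the baseline level `Fair`.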


## 4. X and Y

In [13]:
Y = df_mod.price.values

In [14]:
X = df_mod.drop('price', axis=1)

In [15]:
X = X.values

In [16]:
df_mod.shape
X.shape
X[:2]

(53940, 24)

(53940, 23)

array([[ 0.23, 61.5 , 55.  ,  3.95,  3.98,  2.43,  0.  ,  1.  ,  0.  ,
         0.  ,  1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
         1.  ,  0.  ,  0.  ,  0.  ,  0.  ],
       [ 0.21, 59.8 , 61.  ,  3.89,  3.84,  2.31,  0.  ,  0.  ,  1.  ,
         0.  ,  1.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  1.  ,
         0.  ,  0.  ,  0.  ,  0.  ,  0.  ]])

### Train-Test Split

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.3, random_state=42)

In [19]:
for i in trainX, testX, trainY, testY, X, Y:
    i.shape

(37758, 23)

(16182, 23)

(37758,)

(16182,)

(53940, 23)

(53940,)

In [20]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### Applying MinMaxScaler

In [21]:
minmax = MinMaxScaler()

In [22]:
trainX_mm = minmax.fit_transform(trainX)
testX_mm = minmax.transform(testX)

In [23]:
trainX.mean()
trainX_mm.mean()
trainX.std()
trainX_mm.std()

5.993595690633946

0.1955575957201868

16.65523569417057

0.3393092720600112

### Applying StandardScaler, since the X variables differ widely in mean and standard deviation

#### This gives each feature Mean = 0 and Std-Dev = 1

In [24]:
std = StandardScaler()

In [25]:
trainX_std = std.fit_transform(trainX)  # Fitting on trainX only
testX_std = std.transform(testX)

In [26]:
trainX.mean()
trainX_std.mean()
trainX.std()
trainX_std.std()

5.993595690633946

1.4484406844777323e-14

16.65523569417057

1.000000000000018
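
Note that `trainX.mean()` and `trainX.std()` on a NumPy array aggregate over all columns at once. The scalers work column by column, so a per-feature check with `axis=0` is more informative. A sketch on synthetic data (the array `A` is made up and stands in for `trainX`):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
# Two synthetic features on very different scales
A = np.column_stack([rng.normal(60, 2, 500), rng.normal(0.8, 0.5, 500)])

A_mm = MinMaxScaler().fit_transform(A)
A_std = StandardScaler().fit_transform(A)

# Per-column statistics (axis=0) instead of one number for the whole array
print("MinMax   min/max :", A_mm.min(axis=0), A_mm.max(axis=0))
print("Standard mean/std:", A_std.mean(axis=0).round(6), A_std.std(axis=0).round(6))
```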

## 5. Invoking sklearn dependencies for models & metrics

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error

### Function to tune hyperparameters via GridSearchCV, fit the passed model, and report metrics

In [50]:
def fun_GS(clf_name, clf, param, W):
    """Grid-search `clf` over `param` with 5-fold CV, then print R2 and RMSE for train and test.

    W == 0 uses the MinMax-scaled data; W == 1 uses the StandardScaler data.
    """
    if W == 0:    # For Minmax base
        label, X_tr, X_te = "MinMax", trainX_mm, testX_mm
    elif W == 1:  # For StdScaler
        label, X_tr, X_te = "Standard Scaler", trainX_std, testX_std
    else:         # Any other value: nothing to do
        return

    GS = GridSearchCV(clf, param, cv=5)
    model = GS.fit(X_tr, trainY)

    pred_train = model.predict(X_tr)
    pred_test = model.predict(X_te)

    score_train = r2_score(trainY, pred_train)
    score_test = r2_score(testY, pred_test)

    RMSE_train = np.sqrt(mean_squared_error(trainY, pred_train))
    RMSE_test = np.sqrt(mean_squared_error(testY, pred_test))

    print("For {} base transformed data:".format(label))
    print("For model {}, best parameters are: {}".format(clf_name, model.best_params_))
    print("Score on trainX for model {} is: {:.4f}".format(clf_name, score_train))
    print("Score on testX for model {} is: {:.4f}".format(clf_name, score_test))
    print("RMSE on trainX for model {} is: {:.4f}".format(clf_name, RMSE_train))
    print("RMSE on testX for model {} is: {:.4f}".format(clf_name, RMSE_test))
    if W == 0:
        print("\n")
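
One design point: in `fun_GS` the scaler has already been fitted on the whole training set before cross-validation, so each validation fold has influenced the scaling. For tree models this should make no practical difference, but a `Pipeline` lets the scaler be refitted inside every fold. A minimal sketch on synthetic data; the step names `scale` and `model`, the `*_demo` arrays and the small grid are illustrative choices, not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),  # refitted on the training part of each CV fold
    ("model", DecisionTreeRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step>__<parameter>
grid = {"model__max_depth": [2, 4, 6]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print(search.best_params_, round(test_r2, 4))
```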

## 6. Linear Regression with degree 1 & 2

In [29]:
LR = LinearRegression()
poly_2 = PolynomialFeatures(2)  # Checking with Order 2

### 6.1) MinMax LR with Degree 1 and 2

#### Degree 1

In [30]:
# Degree 1 (a fresh estimator, so that later fits do not overwrite this model)
LR_mm_1 = LinearRegression().fit(trainX_mm, trainY)

In [31]:
# Predict
pred_train_LR_mm_1 = LR_mm_1.predict(trainX_mm)
pred_test_LR_mm_1 = LR_mm_1.predict(testX_mm)

In [32]:
# R2 Score
score_train_LR_mm_1 = r2_score(trainY, pred_train_LR_mm_1)
score_test_LR_mm_1 = r2_score(testY, pred_test_LR_mm_1)

In [33]:
score_train_LR_mm_1
score_test_LR_mm_1

0.9195976267987521

0.9201866914388086

In [34]:
# RMSE
RMSE_train_LR_mm_1 = np.sqrt(mean_squared_error(trainY, pred_train_LR_mm_1))
RMSE_test_LR_mm_1 = np.sqrt(mean_squared_error(testY, pred_test_LR_mm_1))

In [35]:
RMSE_train_LR_mm_1
RMSE_test_LR_mm_1

1136.0202087749415

1115.6905645006914

#### Degree 2

In [36]:
# Fit and Transform for Degree 2 of trainX_mm and testX_mm
trainX_mm_poly2 = poly_2.fit_transform(trainX_mm)
testX_mm_poly2 = poly_2.transform(testX_mm)

In [37]:
# Fitting a fresh LR for Degree 2
LR_mm_2 = LinearRegression().fit(trainX_mm_poly2, trainY)

In [38]:
# Predict
pred_train_LR_mm_2 = LR_mm_2.predict(trainX_mm_poly2)
pred_test_LR_mm_2 = LR_mm_2.predict(testX_mm_poly2)

In [39]:
# R2 Score
score_train_LR_mm_2 = r2_score(trainY, pred_train_LR_mm_2)
score_test_LR_mm_2 = r2_score(testY, pred_test_LR_mm_2)

In [40]:
score_train_LR_mm_2
score_test_LR_mm_2

0.971791797448781

0.8888652981306491

In [41]:
# RMSE
RMSE_train_LR_mm_2 = np.sqrt(mean_squared_error(trainY, pred_train_LR_mm_2))
RMSE_test_LR_mm_2 = np.sqrt(mean_squared_error(testY, pred_test_LR_mm_2))

In [42]:
RMSE_train_LR_mm_2
RMSE_test_LR_mm_2

672.8826494193335

1316.53056859231

#### RMSE falls on the training set but rises on the test set when the degree goes from 1 to 2, which indicates that the degree-2 model overfits the training set.

Thus, we stick to degree 1 for StandardScaler below.
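
Part of the reason is likely the size of the expanded feature space. With 23 input columns, a degree-2 expansion produces 300 columns (1 bias term, 23 linear terms, 23 squares and 253 pairwise products), many of which are products of 0/1 dummy columns. The count can be confirmed as follows:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 23  # number of columns in X in this notebook

# Fitting only needs the number of columns, so a dummy row of zeros is enough
poly = PolynomialFeatures(2).fit(np.zeros((1, n_features)))
n_poly = poly.n_output_features_

# Closed form: number of monomials of degree <= 2 in n variables
n_expected = comb(n_features + 2, 2)

print(n_poly, n_expected)
```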

### 6.2) STD-Scaler for LR with Degree = 1

In [43]:
# Degree 1 (a fresh estimator, independent of the MinMax models above)
LR_std_1 = LinearRegression().fit(trainX_std, trainY)

In [44]:
# Predict
pred_train_LR_std_1 = LR_std_1.predict(trainX_std)
pred_test_LR_std_1 = LR_std_1.predict(testX_std)

In [45]:
# R2 Score
score_train_LR_std_1 = r2_score(trainY, pred_train_LR_std_1)
score_test_LR_std_1 = r2_score(testY, pred_test_LR_std_1)

In [46]:
score_train_LR_std_1
score_test_LR_std_1

0.9195976267987521

0.9201866914388086

In [47]:
# RMSE
RMSE_train_LR_std_1 = np.sqrt(mean_squared_error(trainY, pred_train_LR_std_1))
RMSE_test_LR_std_1 = np.sqrt(mean_squared_error(testY, pred_test_LR_std_1))

In [48]:
RMSE_train_LR_std_1
RMSE_test_LR_std_1

1136.0202087749415

1115.6905645006907

#### Both Score and RMSE are the same for MinMax and Std-Scaler for Linear Regression with degree 1
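
This is expected: both scalers apply an invertible affine map to each column, and ordinary least squares with an intercept produces the same fitted values under any such map (only the coefficients change). A quick check on synthetic data (the `*_demo` arrays are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_mm = MinMaxScaler().fit_transform(X_demo)
X_std = StandardScaler().fit_transform(X_demo)

pred_mm = LinearRegression().fit(X_mm, y_demo).predict(X_mm)
pred_std = LinearRegression().fit(X_std, y_demo).predict(X_std)

# Largest difference between the two sets of fitted values (floating-point noise only)
max_gap = np.max(np.abs(pred_mm - pred_std))
print(max_gap)
```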

## 7. Decision Tree

#### Passing Hyperparameters through GridSearchCV to get best results for both MinMax and Std-Scaler

In [51]:
# Decision Tree
for W in np.arange(0,2,1):
    clf_name = "DT"
    param = {"max_depth":range(3,20,1), "max_leaf_nodes": range(10,30,2)}
    clf = DecisionTreeRegressor(random_state=42)
    
    # W == 0: MinMax-scaled data; W == 1: StandardScaler data
    fun_GS(clf_name, clf, param, W)

For MinMax base transformed data:
For model DT, best parameters are: {'max_depth': 10, 'max_leaf_nodes': 28}
Score on trainX for model DT is: 0.9298
Score on testX for model DT is: 0.9302
RMSE on trainX for model DT is: 1061.4909
RMSE on testX for model DT is: 1043.5595


For Standard Scaler base transformed data:
For model DT, best parameters are: {'max_depth': 10, 'max_leaf_nodes': 28}
Score on trainX for model DT is: 0.9298
Score on testX for model DT is: 0.9302
RMSE on trainX for model DT is: 1061.4909
RMSE on testX for model DT is: 1043.5595


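**With both MinMax and Standard Scaler we get the same results for Decision Tree, and the selected hyperparameters are the same too.**

RMSE is lower than for the Linear Regression model on both train and test. Note that the selected `max_leaf_nodes` of 28 is the largest value in the searched range, so a wider grid might improve the result further.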
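
The identical results under both scalers are expected: a tree splits on thresholds, and both scalers preserve the ordering of values within each column, so the same partitions of the data are available. A sketch on synthetic data (the `*_demo` arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 100, size=(100, 3))
y_demo = np.where(X_demo[:, 0] > 50, 10.0, 0.0) + 0.1 * X_demo[:, 1]

preds = []
for scaler in (MinMaxScaler(), StandardScaler()):
    X_scaled = scaler.fit_transform(X_demo)
    tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_scaled, y_demo)
    preds.append(tree.predict(X_scaled))

# Same predictions regardless of which scaler was applied
same = np.allclose(preds[0], preds[1])
print(same)
```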

## 8. Random Forest

In [55]:
# Random Forest
for W in np.arange(0,2,1):
    clf_name = "RF"
    param = {"n_estimators": range(5,10,1),
             "max_depth":range(3,10,2),"max_leaf_nodes":range(10,20,2)}
    clf = RandomForestRegressor(random_state=42)
    
    # W == 0: MinMax-scaled data; W == 1: StandardScaler data
    fun_GS(clf_name, clf, param, W)

For MinMax base transformed data:
For model RF, best parameters are: {'max_depth': 7, 'max_leaf_nodes': 18, 'n_estimators': 9}
Score on trainX for model RF is: 0.9234
Score on testX for model RF is: 0.9256
RMSE on trainX for model RF is: 1108.5704
RMSE on testX for model RF is: 1077.3650


For Standard Scaler base transformed data:
For model RF, best parameters are: {'max_depth': 7, 'max_leaf_nodes': 18, 'n_estimators': 9}
Score on trainX for model RF is: 0.9234
Score on testX for model RF is: 0.9256
RMSE on trainX for model RF is: 1108.5704
RMSE on testX for model RF is: 1077.3650


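**Here too, the same results are obtained with MinMax and Standard Scaler, and the hyperparameters are also the same.**

RMSE is lower than for Linear Regression but higher than for Decision Tree. The grid here is narrow (5 to 9 trees, at most 18 leaf nodes per tree, versus up to 28 for the single Decision Tree), and the selected `max_leaf_nodes` and `n_estimators` are both at the top of their ranges, which may explain why the forest does not beat the single tree.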

## 9. Conclusion

Across the three regressors, i.e. Linear Regression (with degree 1 and 2), Decision Tree and Random Forest, and considering R2 score and RMSE on the train and test sets, **Decision Tree looks like the best model here.** However, other methods and further feature engineering could improve the model.
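
The figures reported in the sections above can be collected into one table for comparison. The numbers are copied from the outputs in this notebook (rounded to four decimals); only the layout and the column names are new.

```python
import pandas as pd

results = pd.DataFrame(
    {
        "model": ["LR degree 1", "LR degree 2", "Decision Tree", "Random Forest"],
        "r2_train": [0.9196, 0.9718, 0.9298, 0.9234],
        "r2_test": [0.9202, 0.8889, 0.9302, 0.9256],
        "rmse_train": [1136.0202, 672.8826, 1061.4909, 1108.5704],
        "rmse_test": [1115.6906, 1316.5306, 1043.5595, 1077.3650],
    }
).sort_values("rmse_test").reset_index(drop=True)  # lowest test RMSE first

print(results)
```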

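Since the hyperparameters were chosen via GridSearchCV with 5-fold cross-validation, the selected values are the best within the searched ranges, and the close train and test scores for the tree models suggest they are not overfitting.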

As far as the MinMax or StandardScaler transformation is concerned, both give the same results.