# Using Machine Learning to Predict a Number Value (Part 2)

Recall:
<br>
You are a diamond broker who wants better insight into current and future diamond prices.  You obtained data on past diamond sales worldwide and stored it in the `data` directory of your server.  You think that the last 9 columns of this dataset are good predictors of the first column, `price`.  That is, you think you can use this data to predict the price of a new diamond.

**Here is your data import:**

In [1]:
import pandas as pd
diamonds = pd.read_csv('data/diamonds.csv')
print(diamonds.head())

   price  carat      cut color clarity  depth  table     x     y     z
0    326   0.23    Ideal     E     SI2   61.5   55.0  3.95  3.98  2.43
1    326   0.21  Premium     E     SI1   59.8   61.0  3.89  3.84  2.31
2    327   0.23     Good     E     VS1   56.9   65.0  4.05  4.07  2.31
3    334   0.29  Premium     I     VS2   62.4   58.0  4.20  4.23  2.63
4    335   0.31     Good     J     SI2   63.3   58.0  4.34  4.35  2.75


**Here is the pre-processing step:**

In [2]:
# One-hot encode the three categorical columns and remove the originals.
# drop_first=True drops one level per column to avoid redundant dummy columns.
diamonds = pd.get_dummies(diamonds, columns=['cut', 'color', 'clarity'], drop_first=True)
print(diamonds.head())

   price  carat  depth  table     x     y     z  cut_Good  cut_Ideal  \
0    326   0.23   61.5   55.0  3.95  3.98  2.43         0          1   
1    326   0.21   59.8   61.0  3.89  3.84  2.31         0          0   
2    327   0.23   56.9   65.0  4.05  4.07  2.31         1          0   
3    334   0.29   62.4   58.0  4.20  4.23  2.63         0          0   
4    335   0.31   63.3   58.0  4.34  4.35  2.75         1          0   

   cut_Premium      ...       color_H  color_I  color_J  clarity_IF  \
0            0      ...             0        0        0           0   
1            1      ...             0        0        0           0   
2            0      ...             0        0        0           0   
3            1      ...             0        1        0           0   
4            0      ...             0        0        1           0   

   clarity_SI1  clarity_SI2  clarity_VS1  clarity_VS2  clarity_VVS1  \
0            0            1            0            0             0  

**Here are your test and training subsets:**

In [3]:
from sklearn.model_selection import train_test_split
X = diamonds.loc[:, 'carat':'clarity_VVS2']
y = diamonds.loc[:,'price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20)

**Here is your modeling using Linear Regression:**

In [7]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

y_predicted = linear_regression.predict(X_test)
eval_metric = r2_score(y_pred = y_predicted, y_true = y_test)
print(eval_metric)

0.917413190101


Recall that the `R Squared` value you get may differ slightly from the one shown here, and from one run to the next, because the 80/20 split into test and training subsets is done randomly.
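If you want the same split on every run, `train_test_split` accepts a `random_state` argument. The sketch below illustrates this on a small synthetic array rather than the diamonds data; the names `X_demo` and `y_demo` and the seed value 42 are arbitrary choices made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (hypothetical data, not the diamonds file)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Using the same random_state twice should produce the same split twice
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Index 1 of the returned list is the test subset of the features
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)
print(len(split_a[1]))  # number of rows held out for testing
```

Fixing the seed makes the scores of different models directly comparable across runs, at the cost of evaluating on only one particular split.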

To test another regression model, all we need to do is replace some of the code in the modeling step above. We do NOT need to import the data again or re-split it into test and training subsets.
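One way to picture this reuse is a small helper that fits any model on the existing training subset and scores it on the existing test subset. This is only a sketch on synthetic data: the `evaluate` function, the `X_demo`/`y_demo` arrays, and the fixed seeds are assumptions for illustration and are not part of this notebook's cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical): y is linear in two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=200)

# The split is done once and shared by every model
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

def evaluate(model):
    """Fit the model on the shared training subset and return R Squared on the test subset."""
    model.fit(X_tr, y_tr)
    return r2_score(y_true=y_te, y_pred=model.predict(X_te))

scores = {
    'linear_regression': evaluate(LinearRegression()),
    'k_nearest': evaluate(KNeighborsRegressor()),
}
for name, score in scores.items():
    print(name, round(score, 3))
```

Because every model sees exactly the same rows, differences in the scores reflect the models rather than the luck of the split.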

**STEP 1: Decide on an algorithm.**
<br>
Suppose you want to compare the Linear Regression model's `R Squared` with that of a `K-Nearest Neighbor` model.  You would simply import the algorithm and then initialize it.

In [4]:
from sklearn.neighbors import KNeighborsRegressor
k_nearest = KNeighborsRegressor()

**STEP 2:** Split your data into test and training subsets.
<br>
This was already done in part 1 of this notebook above and therefore does not need to be done again.

**STEP 3:** Use the training data to train your model.

In [5]:
k_nearest.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

**STEP 4:** Use the test data to evaluate your model.

In [8]:
y_predicted = k_nearest.predict(X_test)
eval_metric = r2_score(y_pred = y_predicted, y_true = y_test)
print(eval_metric)

0.933791314673


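The `R Squared` of the `K-Nearest Neighbor` model is about 0.934, which means that the model explains roughly 93.4% of the variation in diamond prices in the test subset.  It does not mean that the model has a 93.4% chance of predicting a price correctly.  This is a bit higher than the `R Squared` of the `Linear Regression` model (about 0.917).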
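To see what `R Squared` measures, it can help to compute it by hand. The sketch below uses the standard definition, 1 minus the ratio of the squared prediction errors to the squared deviations from the mean, which is the same definition `r2_score` uses. The function name `r_squared` and the four made-up prices are hypothetical and exist only for this example.

```python
def r_squared(y_true, y_pred):
    """R Squared = 1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices for illustration only
actual = [300.0, 400.0, 500.0, 600.0]
predicted = [310.0, 390.0, 520.0, 580.0]

score = r_squared(actual, predicted)
print(round(score, 3))

# A model that always predicts the mean price scores 0
baseline = r_squared(actual, [450.0] * 4)
print(baseline)
```

A score of 1 means the predictions match the actual values exactly, and a score of 0 means the model does no better than always guessing the average price.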

### Exercise 

Suppose for good measure you want to compare both `Linear Regression` and `K-Nearest Neighbor` model evaluation metrics to that of a `Lasso Regression` model.  Run the following code to import and initialize your algorithm:

In [8]:
from sklearn.linear_model import Lasso
lasso = Lasso()

The above code completed **STEP 1** of machine learning modeling.  We do not need to re-do **STEP 2**.
<br>
**STEP 3:** Use the training data to train your model.  Do this in the next cell below.

**STEP 4:** Use the test data to evaluate the model.  Do this in the next cell below.

Which model had the highest `R Squared`?  Explain below.

# Using the Best Model to Make a Prediction

**STEP 5:** Make a prediction.
<br>
This is the last step of this process.  Let's create data for a new diamond using the same columns as in our X DataFrame.  We will then use that data to predict the price of this diamond.

Here is our new diamond:

In [9]:
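import numpy as np

# Feature values for the new diamond, in the same column order as X
data = [0.3, 63.3, 56, 4.26, 4.3, 2.71, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

# scikit-learn expects a 2D array: one row per diamond
diamond = np.array(data).reshape((1,-1))
print(diamond)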

[[  0.3   63.3   56.     4.26   4.3    2.71   1.     0.     0.     0.     0.
    0.     0.     0.     1.     0.     0.     0.     1.     0.     0.     0.
    0.  ]]
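A positional list like the one above only works if every value sits in exactly the same position as its column in `X`. As a hedged alternative, the row can be built from named values and then aligned to the training columns. In this sketch `feature_columns` is a shortened, hypothetical stand-in for `X.columns`, and `new_diamond` and `row` are names invented for the example.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the training columns (X.columns in the notebook)
feature_columns = ['carat', 'depth', 'table', 'x', 'y', 'z', 'cut_Good', 'cut_Ideal']

# Only the features that are non-zero need to be named
new_diamond = {'carat': 0.3, 'depth': 63.3, 'table': 56, 'x': 4.26,
               'y': 4.3, 'z': 2.71, 'cut_Good': 1}

# reindex puts the columns in training order and fills any missing ones with 0
row = pd.DataFrame([new_diamond]).reindex(columns=feature_columns, fill_value=0)
print(row)
```

Naming the values makes it harder to put a number in the wrong column, which a positional list cannot guard against.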


We will assume the `K-Nearest Neighbor` model was the best.  Here is our prediction using `K-Nearest Neighbor`:

In [10]:
price = k_nearest.predict(diamond)
print(price)

[ 491.2]


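The model predicts that a diamond with these features will cost about 491.20 US dollars.  With the default settings shown above (`n_neighbors=5`, `weights='uniform'`), this value is the average price of the five most similar diamonds in the training subset.  As a diamond broker, you now have insight into what a diamond like this would sell for on the market.  That is, you better understand how to price the diamond if you are the seller or what price you should pay if you are the buyer.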

### Exercise 

Assume the `Linear Regression` model had the best `R Squared` value.  Use this model to predict the price of the same new diamond.  In the cell below your code, explain what the model's result indicates about this diamond's price.