<a href="https://colab.research.google.com/github/sapinspys/lambda-ds-precourse/blob/master/LSDS_Intro_Assignment_7_More_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [108]:
features = ['carat', 'cut']
target = 'price'

# n_neighbors=1: each prediction copies the price of the single closest training row
model = KNeighborsRegressor(n_neighbors=1)
model.fit(train[features], train[target])

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                    weights='uniform')

In [109]:
model.predict([[0.9, 1]])

array([2351.])

In [110]:
# Mean absolute error on the training data
y_true = train[target]
y_predicted = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_predicted)

# Mean absolute error on the test data
y_true = test[target]
y_predicted = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_predicted)

print('Train Error: $', round(train_error))
print('Test Error: $', round(test_error), '\n')

# Errors reported in the lesson (n_neighbors=2, carat only), for comparison
print('Lesson Train Error: $', 210)
print('Lesson Test Error: $', 296)

Train Error: $ 0.0
Test Error: $ 1129.0 

Lesson Train Error: $ 210
Lesson Test Error: $ 296
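
The errors printed above are mean absolute errors: the average of the absolute differences between actual and predicted prices. As a minimal sketch of that arithmetic, the block below uses the first three test prices from the table above, paired with the prices of their nearest training rows. Those pairings were worked out by hand for illustration, so treat the numbers as an example rather than as output from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First three test prices, and the prices of their nearest training rows
# (pairings worked out by hand; illustrative only)
y_true = np.array([432, 687, 1124])
y_pred = np.array([422, 489, 737])

# MAE = mean of |actual - predicted|
abs_errors = np.abs(y_true - y_pred)   # one absolute error per diamond
mae_by_hand = abs_errors.mean()

# The hand calculation should agree with scikit-learn's implementation
assert np.isclose(mae_by_hand, mean_absolute_error(y_true, y_pred))
print(f"MAE: ${mae_by_hand:.2f}")
```

Because the errors are averaged without squaring, a single badly mispriced diamond raises the MAE in proportion to its error rather than dominating it.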


How do the train error and test error compare to the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 



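## Answer:
### This new model is OVERFITTING: train error is $0 while test error is $1,129!

Compared with the lesson's model, the train error dropped from $210 to $0 while the test error rose from $296 to $1,129. With `n_neighbors=1`, every training point is its own nearest neighbor, so the model reproduces the training prices exactly, yet it predicts the test prices poorly. This gap is an obvious red flag that the new model doesn't generalize to unseen data. Adding the `cut` feature most likely made things worse: `cut` ranges from 1 to 5 while `carat` only ranges from 0.3 to 1.18, so differences in `cut` dominate the distance calculation, and a test diamond is often matched to a training diamond with the same cut but a very different carat.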
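
One way to probe the overfitting discussed above is to vary `n_neighbors` and, separately, to put both features on a common scale before distances are measured. The sketch below rebuilds the same 10 training and 10 test rows (with `cut` already mapped to its 1-5 rank) so it runs on its own. The `errors` helper, the `results` dictionary, and the choice of `StandardScaler` are assumptions made for illustration, not part of the lesson.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = ['carat', 'cut', 'price']
# Same rows as above, with cut already encoded (Fair=1 ... Ideal=5)
train = pd.DataFrame(columns=columns, data=[
    [0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
    [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
    [1.18, 3, 3965], [1.18, 5, 4838]])
test = pd.DataFrame(columns=columns, data=[
    [0.30, 5, 432], [0.34, 5, 687], [0.37, 4, 1124], [0.40, 2, 720],
    [0.51, 5, 1397], [0.51, 3, 1284], [0.59, 5, 1437], [0.70, 5, 3419],
    [0.90, 4, 3484], [0.90, 1, 2964]])

features, target = ['carat', 'cut'], 'price'

def errors(model):
    """Fit on train and return (train MAE, test MAE). Hypothetical helper."""
    model.fit(train[features], train[target])
    train_mae = mean_absolute_error(train[target], model.predict(train[features]))
    test_mae = mean_absolute_error(test[target], model.predict(test[features]))
    return train_mae, test_mae

results = {}
for k in (1, 2, 3):
    # Raw features: cut (1-5) outweighs carat (0.3-1.18) in the distance
    results[('raw', k)] = errors(KNeighborsRegressor(n_neighbors=k))
    # Standardized features: both columns contribute on a comparable scale
    results[('scaled', k)] = errors(
        make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k)))

for (kind, k), (train_mae, test_mae) in results.items():
    print(f"{kind:>6} k={k}: train MAE ${train_mae:,.0f}, test MAE ${test_mae:,.0f}")
```

With only 10 rows the numbers are noisy, so the comparison is best read as a way to see how the train/test gap moves, not as evidence that any one setting is best.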


### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [111]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = train_test_split(df.copy(), random_state=0)

# Encoding ordinal variables
cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

train.shape, test.shape, train.head()

((29409, 10),
 (9804, 10),
        carat  cut color clarity  depth  table  price     x     y     z
 43601   0.31    3     E     SI1   61.2   58.0    507  4.34  4.38  2.67
 52706   0.74    1     H     VS2   66.1   61.0   2553  5.60  5.57  3.69
 1986    0.81    3     G     SI1   62.3   59.0   3095  5.93  5.98  3.71
 48617   0.70    1     G     SI2   61.5   66.0   1999  5.55  5.60  3.43
 10947   0.87    5     G     VS2   61.8   56.0   4899  6.11  6.13  3.78)

Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [112]:
features = ["carat", "cut"]
target = "price"

model = LinearRegression()
model.fit(train[features], train[target])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [113]:
y_true = train[target]
y_predicted = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_predicted)

y_true = test[target]
y_predicted = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_predicted)

print('Train Data Mean Absolute Error: $', round(train_error))
print('Test Data Mean Absolute Error: $', round(test_error), '\n')

Train Data Mean Absolute Error: $ 309.0
Test Data Mean Absolute Error: $ 310.0 



Use this model to predict the price of a half-carat diamond with a "Very Good" cut.

In [114]:
print(f"0.5 carat diamond (very good cut) predicted price: ${round(model.predict([[0.5,3]])[0],2)}")

0.5 carat diamond (very good cut) predicted price: $1489.46
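
A linear regression prediction is just the intercept plus each coefficient times its feature value. The sketch below checks that by hand. It assumes the 10-row training table from question 1 (with `cut` already ranked 1-5) instead of the full dataset, so that it runs without downloading anything; its numbers will therefore differ from the $1489.46 above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 10-row training table from question 1, cut encoded as Fair=1 ... Ideal=5
train = pd.DataFrame(columns=['carat', 'cut', 'price'], data=[
    [0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
    [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
    [1.18, 3, 3965], [1.18, 5, 4838]])

features = ['carat', 'cut']
model = LinearRegression().fit(train[features], train['price'])

# Half-carat diamond with a "Very Good" cut (rank 3)
query = pd.DataFrame([[0.5, 3]], columns=features)

by_model = model.predict(query)[0]
# price = intercept + coef_carat * carat + coef_cut * cut
by_hand = model.intercept_ + model.coef_ @ query.iloc[0].to_numpy()

assert np.isclose(by_model, by_hand)
print(f"model.predict: {by_model:.2f}, by hand: {by_hand:.2f}")
```

Passing the query as a DataFrame with the same column names used in `fit` also avoids the feature-name warning that recent scikit-learn versions emit for plain lists.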


### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, which you'd need to encode as numbers if you want to use them in your model:

In [115]:
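# Note: a notebook cell only displays its last expression, so the summary of the
# remaining text columns (color, clarity) is computed but not shown.
train.describe(include=['object'])
train.head()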

       carat  cut color clarity  depth  table  price     x     y     z
43601   0.31    3     E     SI1   61.2   58.0    507  4.34  4.38  2.67
52706   0.74    1     H     VS2   66.1   61.0   2553  5.60  5.57  3.69
1986    0.81    3     G     SI1   62.3   59.0   3095  5.93  5.98  3.71
48617   0.70    1     G     SI2   61.5   66.0   1999  5.55  5.60  3.43
10947   0.87    5     G     VS2   61.8   56.0   4899  6.11  6.13  3.78


In [116]:
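# Encode the remaining ordinal features as numbers.
# Direction note: here a LOWER rank means a BETTER grade (IF and D are the best
# clarity and color), the opposite of cut_ranks, so their coefficients are negative.
clarity_rank = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3, "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}
train.clarity = train.clarity.map(clarity_rank)
test.clarity = test.clarity.map(clarity_rank)

color_rank = {"J": 7, "I": 6, "H": 5, "G": 4, "F": 3, "E": 2, "D": 1}
train.color = train.color.map(color_rank)
test.color = test.color.map(color_rank)

features = ['color', 'clarity', 'carat', 'cut']
# A list target makes y two-dimensional, so coef_ and intercept_ come back as arrays
target = ['price']
model = LinearRegression()
model.fit(train[features], train[target])

# Coefficients are in the same order as `features`
model.coef_, model.intercept_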

(array([[-112.47381195, -161.90279604, 5529.14717977,   51.28925478]]),
 array([-447.29671118]))
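
`Series.map` silently returns `NaN` for any label missing from the dictionary, which would only surface later as an error when fitting. The sketch below wraps the mapping in a hypothetical helper, `encode_ordinal`, that fails early instead. The helper name and the small sample series are assumptions for illustration, and `.astype(object)` is a precaution in case the column is loaded as a pandas categorical.

```python
import pandas as pd

def encode_ordinal(series, ranks):
    """Map labels to integer ranks; raise if any label has no rank.

    Hypothetical helper, not part of pandas or the lesson.
    """
    encoded = series.astype(object).map(ranks)
    # Labels that the dictionary did not cover come back as NaN
    missing = series[encoded.isna()].unique()
    if len(missing) > 0:
        raise ValueError(f"No rank defined for: {list(missing)}")
    return encoded.astype(int)

clarity_rank = {"IF": 0, "VVS1": 1, "VVS2": 2, "VS1": 3,
                "VS2": 4, "SI1": 5, "SI2": 6, "I1": 7}

# Small illustrative sample, not rows from the dataset
sample = pd.Series(["SI1", "VS2", "IF"])
encoded_sample = encode_ordinal(sample, clarity_rank)
print(encoded_sample.tolist())
```

In the cell above, the same helper could be applied to both `train` and `test` so that a typo in a rank dictionary is caught before the model is fit.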

In [117]:
y_true = train[target]
y_predicted = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_predicted)

y_true = test[target]
y_predicted = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_predicted)

print('Train Data Mean Absolute Error: $', round(train_error))
print('Test Data Mean Absolute Error: $', round(test_error), '\n')

Train Data Mean Absolute Error: $ 245.0
Test Data Mean Absolute Error: $ 246.0 

