
# Lambda School, Intro to Data Science, Day 7 — More Regression!

## Assignment

### 1. Experiment with Nearest Neighbor parameter

Using the same 10 training data points from the lesson, train a `KNeighborsRegressor` model with `n_neighbors=1`.

Use both `carat` and `cut` features.

Calculate the mean absolute error on the training data and on the test data.

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

columns = ['carat', 'cut', 'price']

train = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 422],
        [0.31, 'Ideal', 489],
        [0.42, 'Premium', 737],
        [0.5, 'Ideal', 1415],
        [0.51, 'Premium', 1177],
        [0.7, 'Fair', 1865],
        [0.73, 'Fair', 2351],
        [1.01, 'Good', 3768],
        [1.18, 'Very Good', 3965],
        [1.18, 'Ideal', 4838]])

test  = pd.DataFrame(columns=columns, 
        data=[[0.3, 'Ideal', 432],
        [0.34, 'Ideal', 687],
        [0.37, 'Premium', 1124],
        [0.4, 'Good', 720],
        [0.51, 'Ideal', 1397],
        [0.51, 'Very Good', 1284],
        [0.59, 'Ideal', 1437],
        [0.7, 'Ideal', 3419],
        [0.9, 'Premium', 3484],
        [0.9, 'Fair', 2964]])

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

In [4]:
from sklearn.metrics import mean_absolute_error

features = ['carat', 'cut']
target = 'price'

# Train a KNeighborsRegressor with K = 1 on carat and cut
model = KNeighborsRegressor(n_neighbors=1)
model.fit(train[features], train[target])

# Calculate the mean absolute error for training and test data
y_true = train[target]
y_pred = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_pred)

y_true = test[target]
y_pred = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_pred)

# Print the mean absolute error for training and test data
print('Training error: $', round(train_error))
print('Test error: $', round(test_error))

Training error: $ 0.0
Test error: $ 1129.0


How do the train error and test error compare to those of the previous `KNeighborsRegressor` model from the lesson? (The previous model used `n_neighbors=2` and only the `carat` feature.)

Is this new model overfitting or underfitting? Why do you think this is happening here? 



In [0]:
'''
Compared to the previous KNeighborsRegressor model, the training error is much
lower, but the test error is 3-4x higher.

The model is overfitting: the training error is 0, but the test error shows
that it generalizes poorly. When k = 1, the prediction is the price of the
single closest training sample. For a sample that is already in the training
set, the closest point is the sample itself, so the model memorizes the
training data and never makes a mistake on it.
'''
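
The reasoning above can be checked directly. The following is a minimal sketch that reuses the 10 training points from this assignment, with `cut` already encoded as integers, and asks the fitted model for each training point's nearest neighbor via `kneighbors`. The variable names `X`, `y`, `distances`, `indices`, and `train_mae` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

# The 10 training points from this assignment, with cut already mapped to its rank
train = pd.DataFrame(
    columns=['carat', 'cut', 'price'],
    data=[[0.30, 5, 422], [0.31, 5, 489], [0.42, 4, 737], [0.50, 5, 1415],
          [0.51, 4, 1177], [0.70, 1, 1865], [0.73, 1, 2351], [1.01, 2, 3768],
          [1.18, 3, 3965], [1.18, 5, 4838]])

X, y = train[['carat', 'cut']], train['price']
model = KNeighborsRegressor(n_neighbors=1).fit(X, y)

# For every training row, look up its single nearest neighbor in the training set
distances, indices = model.kneighbors(X)

# Check that each row's nearest neighbor is the row itself, at distance 0
print((indices[:, 0] == np.arange(len(X))).all())
print(np.allclose(distances[:, 0], 0.0))

# If so, the predictions reproduce the training prices and the error is 0
train_mae = mean_absolute_error(y, model.predict(X))
print(train_mae)
```

This only works out so cleanly because no two training rows share the same `carat` and `cut` values; with duplicated feature rows and different prices, the training error could be above zero even at k = 1.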


### 2. More data, two features, linear regression

Use the following code to load data for diamonds under $5,000, and split the data into train and test sets. The training data has almost 30,000 rows, and the test data has almost 10,000 rows.

In [5]:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = sns.load_dataset('diamonds')
df = df[df.price < 5000]
train, test = train_test_split(df.copy(), random_state=0)
train.shape, test.shape

((29409, 10), (9804, 10))

Then, train a Linear Regression model with the `carat` and `cut` features. Calculate the mean absolute error on the training data and on the test data.

In [0]:
# Encode the cut ranks as integers, reusing the cut_ranks dictionary from section 1
train['cut'] = train['cut'].map(cut_ranks)
test['cut'] = test['cut'].map(cut_ranks)

In [8]:
# Train a Linear Regression model with carat and cut
model = LinearRegression()
model.fit(train[features], train[target])

# Calculate the mean absolute error on the training and test data
y_true = train[target]
y_pred = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_pred)

y_true = test[target]
y_pred = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_pred)

# Print the training and test errors
print('Training error: $', round(train_error))
print('Test error: $', round(test_error))

Training error: $ 309.0
Test error: $ 310.0


Use this model to predict the price of a half-carat diamond with a "Very Good" cut (encoded as 3).

In [9]:
model.predict([[0.5, 3]])

array([1489.45526366])
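
A linear regression prediction is the intercept plus each coefficient times its feature value. The sketch below illustrates this on a small made-up data set (the rows in `toy` are hypothetical, not taken from the diamonds data), so the numbers it prints will differ from the prediction above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data for illustration only (not rows from the diamonds dataset)
toy = pd.DataFrame({
    'carat': [0.3, 0.5, 0.7, 0.9, 1.1, 0.4],
    'cut':   [5, 3, 1, 4, 2, 3],
    'price': [450, 1400, 1900, 3500, 4100, 800],
})
features = ['carat', 'cut']

model = LinearRegression().fit(toy[features], toy['price'])

# Passing a DataFrame with the same column names used in fit keeps the
# feature order explicit: carat = 0.5, cut = 3 ("Very Good")
query = pd.DataFrame([[0.5, 3]], columns=features)
predicted = model.predict(query)[0]

# The same number, computed by hand from the fitted parameters
manual = model.intercept_ + model.coef_ @ np.array([0.5, 3])

print(predicted, manual)
```

The two printed values should agree up to floating-point rounding. Reading `model.coef_` also shows how much the predicted price changes per carat and per step in cut rank.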

### 3. More data, more features, any model

You choose what features and model type to use! Try to get a better mean absolute error on the test set than your model from the last question.

Refer to [this documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) for more explanation of the features.

Besides `cut`, there are two more ordinal features, which you need to encode as numbers if you want to use them in your model:

In [0]:
# Reload the full dataset (no price filter this time) and split it again
df = sns.load_dataset('diamonds')
train, test = train_test_split(df.copy(), random_state=0)

In [11]:
print(train.describe(include=['object']))
train.head()

          cut  color clarity
count   40455  40455   40455
unique      5      7       8
top     Ideal      G     SI1
freq    16226   8382    9791


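| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 441 | 0.89 | Premium | H | SI2 | 60.2 | 59.0 | 2815 | 6.26 | 6.23 | 3.76 |
| 50332 | 0.7 | Very Good | D | SI1 | 64.0 | 53.0 | 2242 | 5.57 | 5.61 | 3.58 |
| 35652 | 0.31 | Ideal | G | VVS2 | 62.7 | 57.0 | 907 | 4.33 | 4.31 | 2.71 |
| 9439 | 0.9 | Very Good | H | VS1 | 62.3 | 59.0 | 4592 | 6.12 | 6.17 | 3.83 |
| 15824 | 1.01 | Good | F | VS2 | 60.6 | 62.0 | 6332 | 6.52 | 6.49 | 3.94 |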


In [0]:
# Encode the cut ranks as integers
train.cut = train.cut.map(cut_ranks)
test.cut = test.cut.map(cut_ranks)

# Encode the clarity ranks as integers (0 = best, IF; 7 = worst, I1).
# Note: lower is better here, the opposite direction from cut_ranks.
clarity_rank = {'IF': 0, 'VVS1': 1, 'VVS2': 2, 'VS1': 3,
                'VS2': 4, 'SI1': 5, 'SI2': 6, 'I1': 7}
train.clarity = train.clarity.map(clarity_rank)
test.clarity = test.clarity.map(clarity_rank)

# Encode the color ranks as integers (1 = best, D; 7 = worst, J)
color_rank = {'D': 1, 'E': 2, 'F': 3, 'G': 4, 'H': 5, 'I': 6, 'J': 7}
train.color = train.color.map(color_rank)
test.color = test.color.map(color_rank)
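
`Series.map` returns `NaN` for any value that is missing from the mapping dictionary, so a typo in a key would silently create missing values. A minimal sketch of a sanity check follows, using a small made-up frame (`sample` and `bad_ranks` are hypothetical) and the same `cut_ranks` dictionary as above.

```python
import pandas as pd

cut_ranks = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

# Hypothetical sample rows for illustration
sample = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']})

encoded = sample['cut'].map(cut_ranks)

# Any category missing from the dictionary would show up as NaN here
assert not encoded.isna().any(), 'unmapped cut category found'
print(encoded.tolist())

# A misspelled key ('Very good') is not matched, so that value becomes NaN
bad_ranks = {'Fair': 1, 'Good': 2, 'Very good': 3, 'Premium': 4, 'Ideal': 5}
n_missing = sample['cut'].map(bad_ranks).isna().sum()
print(n_missing)
```

Running the same `isna()` check on `train` and `test` after encoding would catch a mismatch before the model is fit.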

In [13]:
train.head()

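| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 441 | 0.89 | 4 | 5 | 6 | 60.2 | 59.0 | 2815 | 6.26 | 6.23 | 3.76 |
| 50332 | 0.7 | 3 | 1 | 5 | 64.0 | 53.0 | 2242 | 5.57 | 5.61 | 3.58 |
| 35652 | 0.31 | 5 | 4 | 2 | 62.7 | 57.0 | 907 | 4.33 | 4.31 | 2.71 |
| 9439 | 0.9 | 3 | 5 | 3 | 62.3 | 59.0 | 4592 | 6.12 | 6.17 | 3.83 |
| 15824 | 1.01 | 2 | 3 | 4 | 60.6 | 62.0 | 6332 | 6.52 | 6.49 | 3.94 |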


In [14]:
features = ['carat', 'cut', 'color', 'clarity']
# Optional extra numeric features, left out of this run:
# features += ['x', 'y', 'z', 'depth', 'table']
target = 'price'

# Train a KNeighborsRegressor with the default n_neighbors=5
model = KNeighborsRegressor()
model.fit(train[features], train[target])

# Calculate the mean absolute error on the training and test data
y_true = train[target]
y_pred = model.predict(train[features])
train_error = mean_absolute_error(y_true, y_pred)

y_true = test[target]
y_pred = model.predict(test[features])
test_error = mean_absolute_error(y_true, y_pred)

# Print the training and test errors
print('Training error: $', round(train_error))
print('Test error: $', round(test_error))

Training error: $ 281.0
Test error: $ 338.0
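
`KNeighborsRegressor` measures Euclidean distance by default, so features with a larger numeric range (here the integer-encoded `color` and `clarity`, which span up to 7 units) can outweigh `carat`. One common option, which this notebook does not try, is to standardize the features first. The sketch below shows only the mechanics, on randomly generated data: the data, the price formula, and the variable names are all made up, and nothing here says whether scaling would lower the error on the diamonds data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data (hypothetical, not the diamonds dataset)
rng = np.random.default_rng(0)
n = 500
fake = pd.DataFrame({
    'carat': rng.uniform(0.2, 2.0, n),
    'cut': rng.integers(1, 6, n),
    'color': rng.integers(1, 8, n),
    'clarity': rng.integers(0, 8, n),
})
# Made-up price formula, only so the model has something to learn
fake['price'] = 4000 * fake['carat'] + 50 * fake['cut'] + rng.normal(0, 100, n)

features = ['carat', 'cut', 'color', 'clarity']
train, test = train_test_split(fake, random_state=0)

# Standardize each feature to mean 0 and unit variance, then fit KNN.
# The pipeline learns the scaling from the training data only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor())
scaled_knn.fit(train[features], train['price'])

y_pred = scaled_knn.predict(test[features])
scaled_error = mean_absolute_error(test['price'], y_pred)
print('Scaled KNN test error: $', round(scaled_error))
```

Wrapping the scaler and the model in one pipeline means the same `fit` and `predict` calls used elsewhere in this notebook keep working unchanged.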
