## Codio Activity 7.7: Using Non-Numeric Features

**Expected Time = 90 minutes**

**Total Points = 40**

This activity focuses on making use of features that are categorical.  

In this activity, you will explore the dummy encoding process to build and compare different regression models. Specifically, you will use the sklearn estimators `LinearRegression` and `HuberRegressor` to fit your models. These estimators minimize the mean squared error and the Huber loss, respectively, and return the parameters that minimize each loss.
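
To make the difference between the two losses concrete, here is a minimal NumPy sketch of the squared-error loss and the textbook Huber loss. The function names and the threshold `delta` are illustrative assumptions. sklearn's `HuberRegressor` uses a related formulation controlled by its `epsilon` parameter together with a learned scale, so this is not its exact objective.

```python
import numpy as np

def squared_loss(residuals):
    """Squared error for each residual."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals ** 2

def huber_loss(residuals, delta=1.0):
    """Textbook Huber loss: quadratic for |r| <= delta, linear beyond it.

    `delta` is an illustrative threshold chosen for this sketch.
    """
    residuals = np.asarray(residuals, dtype=float)
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])
print(squared_loss(residuals))  # grows quadratically with the residual
print(huber_loss(residuals))    # grows only linearly for large residuals
```

Because large residuals are penalized linearly rather than quadratically, a model fit with a Huber-style loss tends to be less influenced by extreme prices than one fit with squared error.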

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)

In [149]:
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

### The Dataset

The `diamonds` dataset from Seaborn is loaded and displayed below. You will explore models that use a single categorical feature, such as `cut` or `clarity`, as well as models that use all available features. To begin, you will use the pandas `get_dummies` function to produce the dummy encoded data. Each dummy encoded column should yield as many new columns as the original column has unique values.
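
As a small illustration of that last point, the sketch below encodes a made-up column (the `toy` DataFrame is hypothetical, not part of the dataset) and checks that the number of indicator columns matches the number of unique values.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({'cut': ['Ideal', 'Good', 'Ideal', 'Fair']})

# One indicator column is created per unique value in `cut`
toy_encoded = pd.get_dummies(toy['cut'], dtype=int)

print(toy_encoded.shape[1] == toy['cut'].nunique())
print(toy_encoded)
```

Each row of the encoded frame contains exactly one 1, marking the category that row belongs to.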

In [152]:
import urllib.request

diamonds = None

try:
    # Preferred path: let seaborn load the dataset
    diamonds = sns.load_dataset('diamonds')
except Exception:
    # Fallback: read the CSV directly from the seaborn-data repository
    diamonds_dataset_uri = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
    with urllib.request.urlopen(diamonds_dataset_uri) as response:
        diamonds = pd.read_csv(response)

In [154]:
diamonds.head()

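|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |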


[Back to top](#Index:) 

## Problem 1

### Unique Values in `cut` and `color`

**4 Points**

Using the `cut` and `color` columns, determine the number of unique values in each column.  Assign the number of unique values in each feature as integers to `num_cuts` and `num_color` below.  

In [157]:
### GRADED

num_cuts = ''
num_color = ''

# YOUR CODE HERE
num_cuts = diamonds['cut'].nunique()
num_color = diamonds['color'].nunique()

# Answer check
print(num_cuts)
print(num_color)

5
7


[Back to top](#Index:) 

## Problem 2

### Encoding the `cut` column

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of the `cut` column.  Assign your encoded data as a DataFrame to the variable `cut_encoded` below.  

In [161]:
### GRADED

cut_encoded = ''

# YOUR CODE HERE
cut_encoded = pd.get_dummies(diamonds['cut'], dtype=np.int8)

# Answer check
print(cut_encoded.shape)
print(type(cut_encoded))
cut_encoded.head()

(53940, 5)
<class 'pandas.core.frame.DataFrame'>


|   | Ideal | Premium | Very Good | Good | Fair |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |


[Back to top](#Index:) 

## Problem 3

### A Regression model on `cut`

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of the `cut` column and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator with argument `fit_intercept = False` to build a regression model. Next, call the `fit()` method with arguments `X` and `y` to train the model to predict the `price` column.

Assign the model to `cut_linreg` below.  

In [165]:
### GRADED

X = ''
y = ''
cut_linreg = ''

# YOUR CODE HERE
X = cut_encoded
y = diamonds['price']

cut_linreg = LinearRegression(fit_intercept = False).fit(X, y)

# Answer check
print(cut_linreg)
print(type(cut_linreg))
cut_linreg.coef_

LinearRegression(fit_intercept=False)
<class 'sklearn.linear_model._base.LinearRegression'>


array([3457.54197021, 4584.2577043 , 3981.75989075, 3928.86445169,
       4358.75776398])

[Back to top](#Index:) 

## Problem 4

### Interpreting the results

**4 Points**

Compare the coefficients of the model. What price does your model predict for a diamond with an `Ideal` cut? Assign your solution as a float rounded to two decimal places to `ideal_cut_prediction` below.

In [169]:
### GRADED

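# YOUR CODE HERE
# Look up the coefficient by column name rather than by position,
# so the answer does not depend on the column order of the dummy encoding
ideal_index = list(cut_encoded.columns).index('Ideal')
ideal_cut_prediction = round(float(cut_linreg.coef_[ideal_index]), 2)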

# Answer check
print(ideal_cut_prediction)
print(type(ideal_cut_prediction))

3457.54
<class 'float'>
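
With only one-hot columns and no intercept, each least-squares coefficient equals the mean target value of its category, which is why a single coefficient can be read off as the predicted price. The following sketch checks this on a small made-up dataset. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'cut': ['Fair', 'Fair', 'Good', 'Ideal', 'Ideal', 'Ideal'],
    'price': [100.0, 300.0, 250.0, 400.0, 500.0, 600.0],
})

X_toy = pd.get_dummies(toy['cut'], dtype=int)
model = LinearRegression(fit_intercept=False).fit(X_toy, toy['price'])

# Pair each coefficient with its dummy column, then compare with group means
coef_by_cut = pd.Series(model.coef_, index=X_toy.columns)
mean_by_cut = toy.groupby('cut')['price'].mean()

print(coef_by_cut)
print(mean_by_cut)
```

The two printed series should agree up to floating-point error, so the coefficient for `Ideal` in the activity can be read as the average price of `Ideal` diamonds in the data.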


[Back to top](#Index:) 

## Problem 5

### Building a model on `clarity`

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of the `clarity` column and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator with argument `fit_intercept = False` to build a regression model. Next, call the `fit()` method with arguments `X` and `y` to train the model to predict the `price` column.

Assign the model to `clarity_linreg` below.  

In [173]:
### GRADED

X = ''
y = ''
clarity_linreg = ''

# YOUR CODE HERE
clarity_encoded = pd.get_dummies(diamonds['clarity'], dtype=np.int8)
X = clarity_encoded
y = diamonds['price']

clarity_linreg = LinearRegression(fit_intercept = False).fit(X, y)


# Answer check
print(clarity_linreg.coef_)
X.head(2)

[2864.83910615 2523.11463748 3283.73707067 3839.45539102 3924.98939468
 3996.00114811 5063.02860561 3924.16869096]


|   | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


[Back to top](#Index:) 

## Problem 6

### Interpreting the results

**4 Points**

Examine your coefficients and compare these to the columns of the dummy encoded version of the `clarity` column.  What price does your model predict for a diamond with clarity `SI2`?  Assign your results as a float rounded to 2 decimal places to `clarity_si2_prediction`.

In [177]:
### GRADED

clarity_si2_prediction = ''

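# YOUR CODE HERE
# Look up the coefficient by column name rather than by a hard-coded position
si2_index = list(clarity_encoded.columns).index('SI2')
clarity_si2_prediction = round(float(clarity_linreg.coef_[si2_index]), 2)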

# Answer check
print(clarity_si2_prediction)
print(type(clarity_si2_prediction))

5063.03
<class 'float'>


[Back to top](#Index:) 

## Problem 7

### A Model with `cut`, `clarity`, and `carat`

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of the `carat`, `cut`, and `clarity` columns and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator with argument `fit_intercept = False` to build a regression model. Next, call the `fit()` method with arguments `X` and `y` to train the model to predict the `price` column.

Assign the model to `ccc_linreg` below. 

In [181]:
### GRADED

ccc_linreg = ''

# YOUR CODE HERE
carat = diamonds['carat']
concat_df = pd.concat([carat, cut_encoded, clarity_encoded], axis=1)
X = concat_df
y = diamonds['price']

ccc_linreg = LinearRegression(fit_intercept = False).fit(X, y)

# Answer check
print(ccc_linreg)
print(ccc_linreg.coef_)
concat_df.head(2)

LinearRegression(fit_intercept=False)
[ 8472.02609407 -1629.11117972 -1766.44067048 -1781.57100539
 -1979.47703992 -2651.21619656   274.04419866   -33.28676273
   -43.73965124  -576.3380288   -798.42128426 -1431.05523376
 -2265.85353667 -4933.16579328]


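|   | carat | Ideal | Premium | Very Good | Good | Fair | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |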


[Back to top](#Index:) 

## Problem 8

### Interpreting the results

**4 Points**

Examine the coefficients from the model and use them to determine the predicted price of a diamond with the following features:

```
carat = 0.8
cut = Ideal
clarity = SI2
```

Assign your solution as a float rounded to two decimal places to the variable `ccc_prediction` below.  

In [185]:
### GRADED

ccc_prediction = ''

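# YOUR CODE HERE
# Map each feature name to its fitted coefficient instead of hard-coding rounded values
coefs = dict(zip(concat_df.columns, ccc_linreg.coef_))
carat_value = 0.8

# The Ideal and SI2 dummy columns equal 1 for this diamond; all other dummies are 0
ccc_prediction = round(float(carat_value * coefs['carat'] + coefs['Ideal'] + coefs['SI2']), 2)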

# Answer check
print(ccc_prediction)
print(type(ccc_prediction))

2882.66
<class 'float'>
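
The same arithmetic can be written as a dot product between the coefficient vector and a one-row feature vector, which is what `predict` computes for a linear model without an intercept. This sketch copies the three relevant coefficients from the printed `ccc_linreg.coef_` output in Problem 7. The dictionary layout and names are illustrative.

```python
import numpy as np

# Coefficients copied from the printed output of ccc_linreg.coef_ in Problem 7
coefficients = {
    'carat': 8472.02609407,
    'Ideal': -1629.11117972,
    'SI2': -2265.85353667,
}

# Feature values for the diamond; every other dummy column is 0 and drops out of the sum
features = {'carat': 0.8, 'Ideal': 1, 'SI2': 1}

names = list(coefficients)
prediction = float(np.dot([coefficients[n] for n in names],
                          [features[n] for n in names]))
print(round(prediction, 2))
```

Only the columns that are nonzero for a given diamond contribute to its prediction, so the remaining eleven coefficients can be ignored for this calculation.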


[Back to top](#Index:) 

## Problem 9

### A Model with all features

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of all the columns in the `diamonds` DataFrame except for the column `price` and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator with argument `fit_intercept = False` to build a regression model. Next, call the `fit()` method with arguments `X` and `y` to train the model to predict the `price` column.

Assign the model to `all_features_linreg` below. 

Use the `mean_squared_error` function to compute the MSE between `all_features_linreg.predict(X)` and `y`. Assign the result to `linreg_mse` below. 

In [191]:
### GRADED
all_features_linreg = ''
linreg_mse = ''

# YOUR CODE HERE
y = diamonds['price']
color_encoded = pd.get_dummies(diamonds['color'], dtype=np.int8)
columns_to_drop = ['price', 'cut', 'color', 'clarity']
# Build X without overwriting `diamonds`, so this cell can be re-run safely
numeric_features = diamonds.drop(columns=columns_to_drop)
X = pd.concat([numeric_features, cut_encoded, clarity_encoded, color_encoded], axis=1)

all_features_linreg = LinearRegression(fit_intercept = False).fit(X, y)

all_features_prediction = all_features_linreg.predict(X)
# mean_squared_error expects (y_true, y_pred)
linreg_mse = mean_squared_error(y, all_features_prediction)

# Answer check
print(all_features_linreg)
print(all_features_linreg.coef_)
print(linreg_mse)
X.head()

LinearRegression(fit_intercept=False)
[ 1.12569783e+04 -6.38061004e+01 -2.64740847e+01 -1.00826110e+03
  9.60888648e+00 -5.01188909e+01  2.71221727e+03  2.64144937e+03
  2.60608801e+03  2.45905687e+03  1.87930542e+03  3.06769746e+03
  2.73035426e+03  2.67340929e+03  2.30099313e+03  1.98981878e+03
  1.38806730e+03  4.25181510e+02 -2.27740478e+03  2.58257671e+03
  2.37345863e+03  2.30972288e+03  2.10053781e+03  1.60231004e+03
  1.11633224e+03  2.13178649e+02]
1276545.1743083887


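|   | carat | depth | table | x | y | z | Ideal | Premium | Very Good | Good | ... | SI1 | SI2 | I1 | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 3.95 | 3.98 | 2.43 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 3.89 | 3.84 | 2.31 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 4.05 | 4.07 | 2.31 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 4.2 | 4.23 | 2.63 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 4.34 | 4.35 | 2.75 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |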


[Back to top](#Index:) 

## Problem 10

### A `HuberRegressor` on all features

**4 Points**

Use the `get_dummies()` function to create a dummy encoded version of all the columns in the `diamonds` DataFrame except for the column `price` and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `HuberRegressor` estimator with argument `fit_intercept = False` to build a Huber regression model. Next, call the `fit()` method with arguments `X` and `y` to train the model to predict the `price` column.

Assign this model to `huber_all_features` below. 

Use the `mean_squared_error` function to compute the MSE between `huber_all_features.predict(X)` and `y`. Assign the result to `huber_mse` below. 

In [195]:
### GRADED

huber_all_features = ''
huber_mse = ''

# YOUR CODE HERE
huber_all_features = HuberRegressor(fit_intercept = False).fit(X, y)
huber_all_features_prediction = huber_all_features.predict(X)
# mean_squared_error expects (y_true, y_pred)
huber_mse = mean_squared_error(y, huber_all_features_prediction)

# Answer check
print(huber_all_features)
print(huber_all_features.coef_)
print(huber_mse)

HuberRegressor(fit_intercept=False)
[ 1.02051914e+04  3.32980239e+00 -4.31279049e+00 -4.44242606e+02
 -6.78870237e+02  5.08994359e+02  2.93278243e+01  8.23264365e+01
  3.25085236e+02  6.46392084e+01 -8.50363709e+02  2.10915171e+03
  9.76981474e+02  3.91018598e+02  2.69911971e+02  4.16923833e+02
 -3.75182671e+02 -1.13671777e+03 -3.00107216e+03  8.07052084e+02
  5.59136699e+02  3.98921097e+02  5.79725586e+02  3.03951866e+01
 -4.64206672e+02 -2.26000898e+03]
1561660.4998771406
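
The `LinearRegression` MSE (about 1.28 million) is lower than the `HuberRegressor` MSE (about 1.56 million). That ordering is expected on the training data: for a fixed set of features, ordinary least squares minimizes the training MSE, so coefficients chosen under any other criterion cannot score lower on that metric. A small NumPy sketch on synthetic data (all names and values here are made up) illustrates the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix and noisy target, used only for illustration
X_toy = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=200)

def mse(y_true, y_pred):
    """Mean squared error computed by hand."""
    return float(np.mean((y_true - y_pred) ** 2))

# Least-squares solution without an intercept
ols_coef, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# Any other coefficient vector, here a slightly perturbed one
other_coef = ols_coef + np.array([0.1, -0.1, 0.05])

mse_ols = mse(y_toy, X_toy @ ols_coef)
mse_other = mse(y_toy, X_toy @ other_coef)
print(mse_ols <= mse_other)
```

A higher training MSE therefore does not by itself mean the Huber model is worse. It reflects a different objective, one that gives less weight to observations with very large residuals.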


### Conclusion

The models explored here are basic starting points, and there is much more that could be done to fine-tune them. One option is to revisit how the features are represented, through transformations and by engineering new representations of existing features. For example, the dimensions of the diamond in `x`, `y`, and `z` could be multiplied to create a "volume" feature, which represents three columns of data with a single, more interpretable one. A second approach is to use PCA to reduce the dimensionality of the data. A third is to use clustering to engineer new features from the cluster assignments. Consider exploring different representations of the features and trying to improve on these initial models.
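
As a starting point for the "volume" idea, the sketch below multiplies the three dimension columns on two rows copied from the `diamonds.head()` output earlier in this activity. The `toy` DataFrame and the `volume` column name are illustrative, and the product is only a rough size proxy rather than a true geometric volume.

```python
import pandas as pd

# Two rows copied from the diamonds.head() output earlier in this activity
toy = pd.DataFrame({
    'x': [3.95, 3.89],
    'y': [3.98, 3.84],
    'z': [2.43, 2.31],
})

# Engineered feature: product of the three dimensions
toy['volume'] = toy['x'] * toy['y'] * toy['z']

# Optionally replace the three original columns with the single new one
toy_volume_only = toy.drop(columns=['x', 'y', 'z'])
print(toy_volume_only)
```

The same two lines could be applied to the full dataset before dummy encoding, after which the models above can be refit to see whether the training MSE changes.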