## Codio Activity 7.7: Using Non-Numeric Features

**Expected Time = 90 minutes**

**Total Points = 40**

This activity focuses on making use of categorical features.  In the tips dataset example, the `day` column was initially a string (object) column.  Dummy encoding converts such a feature into numeric indicator columns that can be used in a regression model.

In this activity, you will explore the dummy encoding process to build and compare different regression models.  Specifically, you will use the sklearn estimators `LinearRegression` and `HuberRegressor` to fit your models.  These two models implement the mean squared error and Huber loss functions, respectively, and return the parameters that minimize each loss.
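
To make the difference between the two losses concrete, here is a minimal sketch of the textbook squared-error and Huber losses in NumPy. The function names `squared_loss` and `huber_loss`, the threshold `delta=1.0`, and the residual values are illustrative assumptions. scikit-learn's `HuberRegressor` uses a related formulation with an `epsilon` parameter and an estimated scale, so this is not its exact objective.

```python
import numpy as np


def squared_loss(residuals):
    """Mean of the squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.mean(residuals ** 2))


def huber_loss(residuals, delta=1.0):
    """Textbook Huber loss: quadratic near zero, linear for large residuals."""
    residuals = np.asarray(residuals, dtype=float)
    abs_r = np.abs(residuals)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return float(np.mean(np.where(abs_r <= delta, quadratic, linear)))


# Hypothetical residuals: three small errors and one outlier.
residuals = [0.5, -0.5, 0.2, 10.0]
print(squared_loss(residuals))
print(huber_loss(residuals))
```

The single outlier dominates the squared loss but contributes only linearly to the Huber loss, which is why a Huber fit tends to be less sensitive to extreme values.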

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)

In [30]:
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings("ignore")

### The Dataset

The `diamonds` dataset from seaborn is loaded and displayed below.  You will explore models that use the `cut` and `clarity` features independently, a model that combines them with `carat`, and models that use all available features.  To begin, you will use the pandas `get_dummies` function to produce the dummy encoded data.  For each encoded column, the dummy encoded data should have as many columns as that column has unique values.
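
As a small illustration of that last point, the sketch below encodes a hypothetical toy column (the values are made up, not taken from `diamonds`) and checks that the number of dummy columns matches the number of unique values.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal", "Fair", "Good"]})

toy_encoded = pd.get_dummies(toy[["cut"]])

print(toy_encoded.columns.tolist())
print(toy_encoded.shape[1] == toy["cut"].nunique())
```

Each row has exactly one indicator set, marking which category that row belongs to.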

In [31]:
import urllib.request

diamonds = None

try:
    diamonds = sns.load_dataset("diamonds")
except Exception:
    diamonds_dataset_uri = (
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
    )
    with urllib.request.urlopen(diamonds_dataset_uri) as response:
        diamonds = pd.read_csv(response)

In [32]:
diamonds.head()

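| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |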


[Back to top](#Index:) 

## Problem 1

### Unique Values in `cut` and `color`

**4 Points**

Using the `cut` and `color` columns, determine the number of unique values in each column.  Assign the number of unique values in each feature as integers to `num_cuts` and `num_color` below.  

In [33]:
### GRADED

num_cuts = diamonds["cut"].nunique()
num_color = diamonds["color"].nunique()

# Answer check
print(num_cuts)
print(num_color)

5
7


[Back to top](#Index:) 

## Problem 2

### Encoding the `cut` column

**4 Points**

Create a dummy encoded version of the `cut` column.  Assign your encoded data as a DataFrame to the variable `cut_encoded` below.  

In [34]:
### GRADED

cut_encoded = pd.get_dummies(diamonds[["cut"]])

# Answer check
print(cut_encoded.shape)
print(type(cut_encoded))
cut_encoded.head()

(53940, 5)
<class 'pandas.core.frame.DataFrame'>


| | cut_Ideal | cut_Premium | cut_Very Good | cut_Good | cut_Fair |
|---|---|---|---|---|---|
| 0 | True | False | False | False | False |
| 1 | False | True | False | False | False |
| 2 | False | False | False | True | False |
| 3 | False | True | False | False | False |
| 4 | False | False | False | True | False |


[Back to top](#Index:) 

## Problem 3

### A Regression model on `cut`

**4 Points**

Build a regression model using the dummy encoded version of the `cut` column to predict the `price` column.  Use the `LinearRegression` estimator and assign the model to `cut_linreg` below.  Be sure to set `fit_intercept = False`. 

In [35]:
### GRADED

X = cut_encoded
y = diamonds["price"]
cut_linreg = LinearRegression(fit_intercept=False).fit(X, y)

# Answer check
print(cut_linreg)
print(type(cut_linreg))
cut_linreg.coef_

LinearRegression(fit_intercept=False)
<class 'sklearn.linear_model._base.LinearRegression'>


array([3457.54197021, 4584.2577043 , 3981.75989075, 3928.86445169,
       4358.75776398])
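
With `fit_intercept=False` and a single one-hot encoded feature, the least-squares coefficient for each dummy column is the mean of the target within that category. The sketch below checks this on hypothetical toy data (the prices are made up); `coefs` and `group_means` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; prices are made up for illustration.
toy = pd.DataFrame(
    {
        "cut": ["Ideal", "Ideal", "Good", "Good", "Fair"],
        "price": [300.0, 500.0, 1000.0, 2000.0, 4000.0],
    }
)

X_toy = pd.get_dummies(toy[["cut"]])
toy_model = LinearRegression(fit_intercept=False).fit(X_toy, toy["price"])

# Pair each coefficient with its column name, then compare to group means.
coefs = pd.Series(toy_model.coef_, index=X_toy.columns)
group_means = toy.groupby("cut")["price"].mean()

print(coefs)
print(group_means)
```

Read this way, each coefficient in the output above is the model's predicted price for a diamond of the corresponding cut.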

[Back to top](#Index:) 

## Problem 4

### Interpreting the results

**4 Points**

Examine the coefficients of the model.  What price does your model predict for a diamond with an `Ideal` cut?  Assign your solution as a float rounded to two decimal places to `ideal_cut_prediction` below.  

In [36]:
### GRADED

# cut_Ideal is the first dummy column, so its coefficient is coef_[0].
ideal_cut_prediction = float(round(cut_linreg.coef_[0], 2))

# Answer check
print(ideal_cut_prediction)
print(type(ideal_cut_prediction))

3457.54
<class 'float'>


[Back to top](#Index:) 

## Problem 5

### Building a model on `clarity`

**4 Points**

Below, create a dummy encoded DataFrame of the `clarity` feature.  Assign this DataFrame to the variable `X` below, and the column `price` to `y`.  Use this encoded data to build a regression model to predict price.  Assign your fit model to the variable `clarity_linreg` below.  Be sure to set `fit_intercept = False`.  

In [37]:
### GRADED

X = pd.get_dummies(diamonds[["clarity"]])
display(X.head())
y = diamonds["price"]
clarity_linreg = LinearRegression(fit_intercept=False).fit(X, y)

# Answer check
print(clarity_linreg.coef_)

| | clarity_IF | clarity_VVS1 | clarity_VVS2 | clarity_VS1 | clarity_VS2 | clarity_SI1 | clarity_SI2 | clarity_I1 |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | True | False | False |
| 2 | False | False | False | True | False | False | False | False |
| 3 | False | False | False | False | True | False | False | False |
| 4 | False | False | False | False | False | False | True | False |


[2864.83910615 2523.11463748 3283.73707067 3839.45539102 3924.98939468
 3996.00114811 5063.02860561 3924.16869096]


[Back to top](#Index:) 

## Problem 6

### Interpreting the results

**4 Points**

Examine your coefficients and compare these to the columns of the dummy encoded version of the `clarity` column.  What price does your model predict for a diamond with clarity `SI2`?  Assign your results as a float rounded to 2 decimal places to `clarity_si2_prediction`.

In [38]:
### GRADED

# clarity_SI2 is the second-to-last dummy column, so its coefficient is coef_[-2].
clarity_si2_prediction = float(round(clarity_linreg.coef_[-2], 2))

# Answer check
print(clarity_si2_prediction)
print(type(clarity_si2_prediction))

5063.03
<class 'float'>
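
Indexing coefficients by position relies on the order of the dummy columns. The order shown in the encoded tables (for example `clarity_IF` first and `clarity_I1` last) is not alphabetical, which suggests the columns were loaded as categoricals with an explicit category order; data read from a plain CSV would presumably hold strings instead. The sketch below, on hypothetical values, shows how the two cases differ.

```python
import pandas as pd

values = ["Ideal", "Premium", "Good", "Fair"]

# Plain strings: get_dummies orders the new columns alphabetically.
as_strings = pd.DataFrame({"cut": values})
string_columns = pd.get_dummies(as_strings).columns.tolist()
print(string_columns)

# Categorical with explicit categories: columns follow the category order.
as_categorical = pd.DataFrame(
    {"cut": pd.Categorical(values, categories=["Ideal", "Premium", "Good", "Fair"])}
)
categorical_columns = pd.get_dummies(as_categorical).columns.tolist()
print(categorical_columns)
```

Looking up a coefficient by column name, for instance through `X.columns.get_loc("clarity_SI2")`, avoids depending on either ordering.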


[Back to top](#Index:) 

## Problem 7

### A Model with `cut`, `clarity`, and `carat`

**4 Points**

Now, you are to build a model with three features -- `cut`, `clarity`, `carat`.  Create the dummy encoded data and use `LinearRegression` to build a model to predict `price`.  Assign your fit model to the variable `ccc_linreg` below.  Be sure to set `fit_intercept = False`.  

In [39]:
### GRADED

X = pd.get_dummies(diamonds[["carat", "cut", "clarity"]])
display(X.head())
y = diamonds["price"]
ccc_linreg = LinearRegression(fit_intercept=False).fit(X, y)

# Answer check
print(ccc_linreg)

| | carat | cut_Ideal | cut_Premium | cut_Very Good | cut_Good | cut_Fair | clarity_IF | clarity_VVS1 | clarity_VVS2 | clarity_VS1 | clarity_VS2 | clarity_SI1 | clarity_SI2 | clarity_I1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | True | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | 0.21 | False | True | False | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.23 | False | False | False | True | False | False | False | False | True | False | False | False | False |
| 3 | 0.29 | False | True | False | False | False | False | False | False | False | True | False | False | False |
| 4 | 0.31 | False | False | False | True | False | False | False | False | False | False | False | True | False |


LinearRegression(fit_intercept=False)


[Back to top](#Index:) 

## Problem 8

### Interpreting the results

**4 Points**

Examine the coefficients from the model and use them to determine the predicted price of a diamond with the following features:

```
carat = 0.8
cut = Ideal
clarity = SI2
```

Assign your solution as a float rounded to two decimal places to the variable `ccc_prediction` below.  

In [40]:
### GRADED
display(ccc_linreg.coef_)
display([ccc_linreg.coef_[0], ccc_linreg.coef_[1], ccc_linreg.coef_[-2]])

ccc_prediction = round(
    float(0.8 * ccc_linreg.coef_[0] + ccc_linreg.coef_[1] + ccc_linreg.coef_[-2]), 2
)

# Answer check
print(ccc_prediction)
print(type(ccc_prediction))

array([ 8472.02609407, -1629.11117972, -1766.44067048, -1781.57100539,
       -1979.47703992, -2651.21619656,   274.04419866,   -33.28676273,
         -43.73965124,  -576.3380288 ,  -798.42128426, -1431.05523376,
       -2265.85353667, -4933.16579328])

[8472.026094067996, -1629.111179718865, -2265.853536674027]

2882.66
<class 'float'>
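
The prediction is a weighted sum: `carat` contributes 0.8 times its coefficient, each active dummy column contributes its coefficient once, and every other dummy contributes zero. The sketch below reproduces that arithmetic using the three coefficient values printed in the output above; the variable names are illustrative.

```python
# Coefficients copied from the output above.
carat_coef = 8472.026094067996
cut_ideal_coef = -1629.111179718865
clarity_si2_coef = -2265.853536674027

# 0.8 * carat coefficient, plus 1 * each active dummy coefficient.
prediction = 0.8 * carat_coef + cut_ideal_coef + clarity_si2_coef
print(round(prediction, 2))  # 2882.66
```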


[Back to top](#Index:) 

## Problem 9

### A Model with all features

**4 Points**

Now, build a model that uses all features to predict `price`.  Be sure to dummy encode all of the categorical features in the data.  Determine the `mean_squared_error` of your predictions.  Use the `LinearRegression` estimator and the `mean_squared_error` function from sklearn.  Be sure to set `fit_intercept = False`.  

Assign your fit model to the variable `all_features_linreg` and the mean squared error as a float to `linreg_mse` below.

In [41]:
### GRADED

X = pd.get_dummies(diamonds).drop(columns="price")
y = diamonds["price"]
# display(X.head())
all_features_linreg = LinearRegression(fit_intercept=False).fit(X, y)
linreg_mse = float(mean_squared_error(y, all_features_linreg.predict(X)))

# Answer check
print(all_features_linreg)
print(all_features_linreg.coef_)
print(type(linreg_mse))
print(linreg_mse)

LinearRegression(fit_intercept=False)
[ 1.12569783e+04 -6.38061004e+01 -2.64740847e+01 -1.00826110e+03
  9.60888648e+00 -5.01188909e+01  2.71221727e+03  2.64144937e+03
  2.60608801e+03  2.45905687e+03  1.87930542e+03  2.58257671e+03
  2.37345863e+03  2.30972288e+03  2.10053781e+03  1.60231004e+03
  1.11633224e+03  2.13178649e+02  3.06769746e+03  2.73035426e+03
  2.67340929e+03  2.30099313e+03  1.98981878e+03  1.38806730e+03
  4.25181510e+02 -2.27740478e+03]
<class 'float'>
1276545.174308389
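
For reference, the mean squared error is the average of the squared differences between the targets and the predictions. The sketch below computes it by hand on hypothetical values (made up for illustration) and compares the result to sklearn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, made up for illustration.
y_true = np.array([326.0, 334.0, 2757.0])
y_pred = np.array([300.0, 400.0, 2700.0])

manual_mse = float(np.mean((y_true - y_pred) ** 2))
sklearn_mse = float(mean_squared_error(y_true, y_pred))

print(manual_mse, sklearn_mse)
```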


[Back to top](#Index:) 

## Problem 10

### A `HuberRegressor` on all features

**4 Points**

Using all the features as in the previous problem, build a model using the `HuberRegressor` estimator from `sklearn`.  Be sure to set `fit_intercept = False` and assign your fit model to `huber_all_features` below.  Compute the mean squared error of the Huber model and assign it as a float to the variable `huber_mse` below.

In [44]:
### GRADED

X = pd.get_dummies(diamonds).drop(columns="price")
# X.head()
y = diamonds["price"]
huber_all_features = HuberRegressor(fit_intercept=False).fit(X, y)
huber_mse = float(mean_squared_error(y, huber_all_features.predict(X)))

# Answer check
print(huber_all_features)
print(huber_all_features.coef_)
print(huber_mse)

HuberRegressor(fit_intercept=False)
[ 7.43544155e+03 -1.48757842e+01 -4.15415264e+01 -4.20230257e+00
 -1.85887164e+02  5.76007856e+02  4.14942607e+01  8.79433634e+01
  4.04955111e+01  2.10704581e+02 -6.35146426e+02  8.96196938e+02
  4.79644553e+02  3.62370117e+02  2.60332737e+02  1.33284903e+02
 -4.88545049e+02 -1.89779291e+03  1.71644354e+03  9.87793502e+02
  3.91192306e+02 -1.05017831e+02 -6.07228197e+01 -3.92282525e+02
 -6.47648254e+02 -2.14426663e+03]
1833967.9141793738
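
The Huber model's higher mean squared error is expected on this metric: least squares minimizes the training mean squared error directly, so a model fit to a different loss on the same features cannot score lower on the same data. The sketch below compares the two values reported above and takes square roots to express them in the units of `price`.

```python
import math

# Training-set MSE values copied from the outputs above.
linreg_mse = 1276545.174308389
huber_mse = 1833967.9141793738

linreg_rmse = math.sqrt(linreg_mse)
huber_rmse = math.sqrt(huber_mse)

print(linreg_rmse)
print(huber_rmse)
print(huber_mse / linreg_mse)
```

This comparison does not by itself show that one model is better; it only reflects which loss each estimator was built to minimize.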


### Conclusion

While some basic initial models have been explored here, there is much more to explore to fine-tune them.  One thing that could be revisited is the representation of the features, both through transformations and by engineering new representations of existing features.  For example, the dimensions of the diamond in `x`, `y`, and `z` could be multiplied to create a "volume" feature, which summarizes three columns of data in one.  A second approach is to use PCA to reduce the dimensionality of the data.  A third is to use clustering to engineer new features based on the cluster assignments.  Consider exploring different representations of the features and trying to improve these initial models.
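
As one possible starting point for the "volume" idea, the sketch below multiplies the three dimension columns on the first three rows shown in the `diamonds.head()` output. The column name `volume` and the small `sample` DataFrame are illustrative assumptions, and the product is only a bounding-box approximation, since a cut diamond is not a box.

```python
import pandas as pd

# First three rows of x, y, z copied from the diamonds.head() output.
sample = pd.DataFrame(
    {
        "x": [3.95, 3.89, 4.05],
        "y": [3.98, 3.84, 4.07],
        "z": [2.43, 2.31, 2.31],
    }
)

# Bounding-box "volume" built from the three dimensions.
sample["volume"] = sample["x"] * sample["y"] * sample["z"]
print(sample)
```

The same expression applied to the full `diamonds` DataFrame would add the new column there; whether it improves the models is something to test rather than assume.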