## Codio Activity 7.6: Multiple Linear Regression

This assignment focuses on building a regression model using multiple features.  Using a dataset from the `seaborn` library, you are to build and evaluate regression models with one, two, and three features.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

### The Dataset

Below, a dataset containing information on diamonds is loaded and displayed.  Your task is to build a regression model that predicts the price of the diamond given different features as inputs.  

In [2]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [3]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


## Problem 1

### Regression with single feature

Use sklearn's `LinearRegression` estimator with argument `fit_intercept` equal to `False` to build a regression model. Next, chain a `fit()` function using the `carat` column as the feature and the `price` column as the target.  

Assign your result to the variable `lr_1_feature` below.

In [4]:
lr_1_feature = LinearRegression(fit_intercept = False).fit(diamonds[['carat']],diamonds['price'])
lr_1_feature
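
To see what `fit_intercept = False` does in the single-feature case, the sketch below fits a line through the origin on synthetic data and compares the coefficient with the closed-form least-squares slope `sum(x*y) / sum(x**2)`. The arrays `x` and `y` are made-up stand-ins for `carat` and `price`, not the diamonds data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for carat/price (hypothetical values, not the diamonds data)
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 2.0, size=200)
y = 5000 * x - 1000 + rng.normal(0, 200, size=200)

# Fit a line forced through the origin, as in Problem 1
model = LinearRegression(fit_intercept=False).fit(x.reshape(-1, 1), y)

# Closed-form slope of a one-feature, no-intercept least-squares fit
slope_closed_form = np.sum(x * y) / np.sum(x ** 2)

print(model.coef_[0], slope_closed_form)  # the two slopes agree
print(model.intercept_)                   # 0.0, since no intercept is estimated
```

Because the line is forced through the origin, the model cannot shift predictions up or down, which is one reason single-feature errors can be large.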

## Problem 2

### Regression with two features

Use sklearn's `LinearRegression` estimator with argument `fit_intercept` equal to `False` to build a regression model. Next, chain a `fit()` call using the `carat` and `depth` columns as the features and the `price` column as the target.  

Assign your result to the variable `lr_2_feature` below.

In [9]:
lr_2_feature = LinearRegression(fit_intercept = False).fit(diamonds[['carat', 'depth']], diamonds['price'])
lr_2_feature

## Problem 3

### Regression with three features

Use sklearn's `LinearRegression` estimator with argument `fit_intercept` equal to `False` to build a regression model. Next, chain a `fit()` call using the `carat`, `depth`, and `table` columns as the features and the `price` column as the target.  

Assign your result to the variable `lr_3_feature` below.

In [6]:
lr_3_feature = LinearRegression(fit_intercept = False).fit(diamonds[['carat','depth','table']],diamonds['price'])
lr_3_feature

## Problem 4

### Computing MSE and MAE

For each of your models, compute the mean squared error (MSE) and the mean absolute error (MAE).  Create a DataFrame to match the structure below:

| Features | MSE | MAE |
| ----- | ----- | ----- |
| 1 Feature |  -  | - |
| 2 Features | -  | -  |
| 3 Features | - | - |

Assign your solution as a DataFrame to `error_df` below.  Note that the `Features` column should be the index column in your DataFrame.

In [7]:
pred1 = lr_1_feature.predict(diamonds[['carat']])
pred1

array([1303.24211516, 1189.91671385, 1303.24211516, ..., 3966.38904615,
       4872.9922567 , 4249.70254945], shape=(53940,))

In [10]:
pred2 = lr_2_feature.predict(diamonds[['carat','depth']])
pred2

array([-471.56925195, -564.49746018, -302.6748354 , ..., 3131.32390223,
       4440.17870172, 3541.71788373], shape=(53940,))

In [11]:
pred3 = lr_3_feature.predict(diamonds[['carat','depth','table']])
pred3

array([-418.06478447, -757.13550535, -711.03706791, ..., 3075.55346593,
       4400.61598528, 3631.98674962], shape=(53940,))

In [13]:
mse_lst = []
for i in [pred1,pred2,pred3]:
    mse_lst.append(mean_squared_error(diamonds['price'],i))
print(mse_lst)

[3725918.889119714, 2385352.6720595844, 2376565.308478145]


In [14]:
mae_lst = []
for i in [pred1,pred2,pred3]:
    mae_lst.append(mean_absolute_error(diamonds['price'],i))
print(mae_lst)

[1540.1920001016372, 1005.8581107503078, 1002.3373135337706]


In [16]:
error_df = pd.DataFrame({'Features':['1 Feature','2 Features','3 Features'],'MSE':mse_lst,'MAE':mae_lst}).set_index('Features')
error_df

Features,MSE,MAE
1 Feature,3725919.0,1540.192
2 Features,2385353.0,1005.858111
3 Features,2376565.0,1002.337314
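
As a quick check on what the two metrics measure, here is a minimal sketch that computes MSE and MAE by hand on three made-up values, compares them with sklearn's functions, and stores them in a small DataFrame indexed by `Features`. The arrays and the name `toy_error_df` are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions (not taken from the diamonds data)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 10.0])

residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)      # average squared residual
mae_manual = np.mean(np.abs(residuals))   # average absolute residual

# Same structure as the requested table, with Features as the index
toy_error_df = pd.DataFrame(
    {'Features': ['toy model'], 'MSE': [mse_manual], 'MAE': [mae_manual]}
).set_index('Features')
print(toy_error_df)
```

MSE squares each residual, so the single miss of 3 dominates it, while MAE weights all misses linearly.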


## Codio Activity 7.7: Using Non-Numeric Features

This activity focuses on making use of features that are categorical.  

In this activity, you will explore the dummy encoding process to build and compare different regression models.  Specifically, you will use the sklearn estimators `LinearRegression` and `HuberRegressor` to fit your models.  These two models implement the mean squared error and Huber loss functions, returning parameters that minimize the respective loss. 
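
To make the difference between the two losses concrete, the sketch below implements the textbook Huber loss next to the squared loss. The `delta` threshold and both helper functions are assumptions for illustration; sklearn's `HuberRegressor` uses its own `epsilon` parameter and also estimates a scale term, so this is not its exact objective.

```python
import numpy as np

def squared_loss(residual):
    """Half the squared residual; grows quadratically everywhere."""
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    """Textbook Huber loss: quadratic near zero, linear beyond delta."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])
print(squared_loss(residuals))  # 0.125, 0.5, 50.0
print(huber_loss(residuals))    # 0.125, 0.5, 9.5
```

The two losses agree for small residuals, but the large residual of 10 costs 50 under the squared loss and only 9.5 under Huber, which is why Huber-based fits tend to be less sensitive to outliers.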

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)
- [Problem 6](#Problem-6)
- [Problem 7](#Problem-7)
- [Problem 8](#Problem-8)
- [Problem 9](#Problem-9)
- [Problem 10](#Problem-10)

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

### The Dataset

The `diamonds` dataset from Seaborn is loaded and displayed below.  You will explore models that use the `cut` and `color` features independently, and models using all possible features.  To begin, you will use the pandas `get_dummies` function to produce the dummy encoded data.  The dummy encoded version of a column should have as many columns as that column has unique values.
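
As a small illustration of that last point, the sketch below encodes a made-up four-row `cut` column and checks that the number of dummy columns equals the number of unique values. The `toy` DataFrame is an assumption for illustration, not the diamonds data.

```python
import pandas as pd

# Four made-up rows with three distinct cut values
toy = pd.DataFrame({'cut': ['Ideal', 'Good', 'Ideal', 'Fair']})

# Passing a one-column DataFrame prefixes each dummy column with 'cut_'
toy_encoded = pd.get_dummies(toy[['cut']])

print(toy['cut'].nunique())  # 3
print(toy_encoded)
```

Each row has exactly one active dummy column, so the encoded frame carries the same information as the original column.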

In [20]:
import urllib.request

diamonds = None

try:
    diamonds = sns.load_dataset('diamonds')
except Exception:
    # Fall back to reading the CSV directly from the seaborn-data repository
    diamonds_dataset_uri = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
    with urllib.request.urlopen(diamonds_dataset_uri) as response:
        diamonds = pd.read_csv(response)

In [21]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## Problem 1

### Unique Values in `cut` and `color`

Using the `cut` and `color` columns, determine the number of unique values in each column.  Assign the number of unique values in each feature as integers to `num_cuts` and `num_color` below.  

In [22]:
num_cuts = diamonds['cut'].nunique()
num_cuts

5

In [24]:
num_color = diamonds['color'].nunique()
num_color

7

## Problem 2

### Encoding the `cut` column

Use the `get_dummies()` function to create a dummy encoded version of the `cut` column.  Assign your encoded data as a DataFrame to the variable `cut_encoded` below.  

In [25]:
cut_encoded = pd.get_dummies(diamonds[['cut']])
cut_encoded


Unnamed: 0,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,cut_Fair
0,True,False,False,False,False
1,False,True,False,False,False
2,False,False,False,True,False
3,False,True,False,False,False
4,False,False,False,True,False
...,...,...,...,...,...
53935,True,False,False,False,False
53936,False,False,False,True,False
53937,False,False,True,False,False
53938,False,True,False,False,False


## Problem 3

### A Regression model on `cut`

Use the `get_dummies()` function to create a dummy encoded version of the `cut` column and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator  with argument `fit_intercept = False` to build a regression model. Next, use the `fit()` function with arguments `X` and `y`  to predict the `price` column.  

Assign the model to `cut_linreg` below.  

In [26]:
X = cut_encoded
y = diamonds['price']
cut_linreg = LinearRegression(fit_intercept = False).fit(X,y)


In [27]:
cut_linreg

## Problem 4

### Interpreting the results

Compare the coefficients of the model with the columns of the dummy encoded `cut` data.  What price does your model predict for a diamond with an `Ideal` cut?  Assign your solution as a float rounded to two decimal places to `ideal_cut_prediction` below.  

In [28]:
cut_linreg.coef_


array([3457.54197021, 4584.2577043 , 3981.75989075, 3928.86445169,
       4358.75776398])

In [30]:
ideal_cut_prediction = float(round(cut_linreg.coef_[0],2))
ideal_cut_prediction

3457.54
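
One way to read these coefficients: with a single dummy encoded feature and no intercept, each least-squares coefficient is the mean target value of its category. The sketch below checks this on a made-up five-row frame; the `toy` data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up prices for three cuts (not the diamonds data)
toy = pd.DataFrame({
    'cut': ['Ideal', 'Ideal', 'Good', 'Good', 'Fair'],
    'price': [300.0, 500.0, 1000.0, 2000.0, 4000.0],
})
X_toy = pd.get_dummies(toy[['cut']])
model = LinearRegression(fit_intercept=False).fit(X_toy, toy['price'])

# Label each coefficient with its dummy column
coefs = pd.Series(model.coef_, index=X_toy.columns)
group_means = toy.groupby('cut')['price'].mean()

print(coefs)
print(group_means)
```

Under this reading, the `cut_Ideal` coefficient is the model's predicted price for any diamond with an `Ideal` cut.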

## Problem 5

### Building a model on `clarity`

Use the `get_dummies()` function to create a dummy encoded version of the `clarity` column and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator  with argument `fit_intercept = False` to build a regression model. Next, use the `fit()` function with arguments `X` and `y`  to predict the `price` column.  

Assign the model to `clarity_linreg` below.  

In [31]:
X = pd.get_dummies(diamonds[['clarity']])
y = diamonds['price']
clarity_linreg = LinearRegression(fit_intercept = False).fit(X,y)

In [32]:
clarity_linreg

## Problem 6

### Interpreting the results

Examine your coefficients and compare these to the columns of the dummy encoded version of the `clarity` column.  What price does your model predict for a diamond with clarity `SI2`?  Assign your results as a float rounded to 2 decimal places to `clarity_si2_prediction`.

In [33]:
X

Unnamed: 0,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,False,False,False,False,False,False,True,False
1,False,False,False,False,False,True,False,False
2,False,False,False,True,False,False,False,False
3,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...
53935,False,False,False,False,False,True,False,False
53936,False,False,False,False,False,True,False,False
53937,False,False,False,False,False,True,False,False
53938,False,False,False,False,False,False,True,False


In [34]:
clarity_linreg.coef_

array([2864.83910615, 2523.11463748, 3283.73707067, 3839.45539102,
       3924.98939468, 3996.00114811, 5063.02860561, 3924.16869096])

In [36]:
clarity_si2_prediction = float(round(clarity_linreg.coef_[-2],2))
clarity_si2_prediction

5063.03
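
Indexing `coef_` by position works only as long as the column order is known. A minimal alternative, sketched below on made-up data, labels the coefficients with the dummy column names and looks up `clarity_SI2` by name. The `toy` frame and the names `coef_by_name` and `si2_prediction` are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up prices for three clarity grades (not the diamonds data)
toy = pd.DataFrame({
    'clarity': ['SI2', 'SI2', 'VS1', 'IF'],
    'price': [5000.0, 5200.0, 3800.0, 2900.0],
})
X_toy = pd.get_dummies(toy[['clarity']])
model = LinearRegression(fit_intercept=False).fit(X_toy, toy['price'])

# Pair each coefficient with the column it belongs to
coef_by_name = pd.Series(model.coef_, index=X_toy.columns)

# Look up the coefficient by column name instead of by position
si2_prediction = float(round(coef_by_name['clarity_SI2'], 2))
print(si2_prediction)
```

The lookup by name keeps working if the columns are reordered, for example when the data is loaded from CSV and the categories sort alphabetically.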

## Problem 7

### A Model with `cut`, `clarity`, and `carat`

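Use the `get_dummies()` function to create a dummy encoded version of the `carat`, `cut`, and `clarity` columns and assign the result to the variable `X`. Only the categorical columns `cut` and `clarity` are encoded; the numeric `carat` column passes through unchanged.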

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator  with argument `fit_intercept = False` to build a regression model. Next, use the `fit()` function with arguments `X` and `y`  to predict the `price` column.  

Assign the model to `ccc_linreg` below. 

In [37]:
X = pd.get_dummies(diamonds[['carat','cut','clarity']])
y = diamonds['price']
ccc_linreg = LinearRegression(fit_intercept = False).fit(X,y)


In [38]:
ccc_linreg

In [39]:
X

Unnamed: 0,carat,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,cut_Fair,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.23,True,False,False,False,False,False,False,False,False,False,False,True,False
1,0.21,False,True,False,False,False,False,False,False,False,False,True,False,False
2,0.23,False,False,False,True,False,False,False,False,True,False,False,False,False
3,0.29,False,True,False,False,False,False,False,False,False,True,False,False,False
4,0.31,False,False,False,True,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53935,0.72,True,False,False,False,False,False,False,False,False,False,True,False,False
53936,0.72,False,False,False,True,False,False,False,False,False,False,True,False,False
53937,0.70,False,False,True,False,False,False,False,False,False,False,True,False,False
53938,0.86,False,True,False,False,False,False,False,False,False,False,False,True,False


## Problem 8

### Interpreting the results

Examine the coefficients from the model and use them to determine the predicted price of a diamond with the following features:

```
carat = 0.8
cut = Ideal
clarity = SI2
```
Assign your solution as a float rounded to two decimal places to the variable `ccc_prediction` below.  

In [41]:
diamonds_encoded = pd.get_dummies(diamonds[['carat','cut','clarity']])
diamonds_encoded

Unnamed: 0,carat,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,cut_Fair,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.23,True,False,False,False,False,False,False,False,False,False,False,True,False
1,0.21,False,True,False,False,False,False,False,False,False,False,True,False,False
2,0.23,False,False,False,True,False,False,False,False,True,False,False,False,False
3,0.29,False,True,False,False,False,False,False,False,False,True,False,False,False
4,0.31,False,False,False,True,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53935,0.72,True,False,False,False,False,False,False,False,False,False,True,False,False
53936,0.72,False,False,False,True,False,False,False,False,False,False,True,False,False
53937,0.70,False,False,True,False,False,False,False,False,False,False,True,False,False
53938,0.86,False,True,False,False,False,False,False,False,False,False,False,True,False


In [42]:
diamonds_features = pd.DataFrame({'carat':[0.8],'cut':['Ideal'],'clarity':['SI2']})
diamonds_features

Unnamed: 0,carat,cut,clarity
0,0.8,Ideal,SI2


In [47]:
# Important: reindex to the training columns so this row has every dummy column the model was fit on; missing columns are filled with 0
diamonds_features_encoded = pd.get_dummies(diamonds_features).reindex(columns = diamonds_encoded.columns, fill_value = 0)
diamonds_features_encoded

Unnamed: 0,carat,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,cut_Fair,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.8,True,0,0,0,0,0,0,0,0,0,0,True,0


In [48]:
ccc_prediction = ccc_linreg.predict(diamonds_features_encoded)
ccc_prediction

array([2882.65615886])

In [49]:
ccc_prediction = float(round(ccc_prediction[0],2))
ccc_prediction

2882.66
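
Because the model has no intercept, the prediction for one row is the sum of the coefficients of its active dummy columns plus the `carat` coefficient times the carat value. The sketch below checks this on a made-up six-row frame; all data and names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data (not the diamonds data)
toy = pd.DataFrame({
    'carat': [0.3, 0.5, 0.8, 1.0, 1.2, 0.7],
    'cut': ['Ideal', 'Good', 'Ideal', 'Fair', 'Good', 'Fair'],
    'clarity': ['SI2', 'VS1', 'VS1', 'SI2', 'SI2', 'VS1'],
    'price': [400.0, 900.0, 2500.0, 3000.0, 4200.0, 1800.0],
})
X_toy = pd.get_dummies(toy[['carat', 'cut', 'clarity']])
model = LinearRegression(fit_intercept=False).fit(X_toy, toy['price'])

# Encode the new row, then align it with the training columns;
# dummy columns absent from this single row are filled with False
new_row = pd.DataFrame({'carat': [0.8], 'cut': ['Ideal'], 'clarity': ['SI2']})
new_encoded = (
    pd.get_dummies(new_row)
    .reindex(columns=X_toy.columns, fill_value=False)
    .astype(float)
)

# Manual prediction from the named coefficients
coefs = pd.Series(model.coef_, index=X_toy.columns)
manual = 0.8 * coefs['carat'] + coefs['cut_Ideal'] + coefs['clarity_SI2']
predicted = model.predict(new_encoded)[0]
print(manual, predicted)  # the two values agree
```

The `reindex` step matters because `get_dummies` on a single row only creates columns for the categories that appear in that row.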

## Problem 9

### A Model with all features

Use the `get_dummies()` function to create a dummy encoded version of all the columns in the `diamonds` DataFrame except for the column `price` and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `LinearRegression` estimator  with argument `fit_intercept = False` to build a regression model. Next, use the `fit()` function with arguments `X` and `y`  to predict the `price` column.  

Assign the model to `all_features_linreg` below. 

Use the `mean_squared_error` function to compute the MSE between `all_features_linreg.predict(X)` and `y`. Assign the result to `linreg_mse` below. 

In [54]:
X = pd.get_dummies(diamonds.drop('price', axis = 1))
X.head()

Unnamed: 0,carat,depth,table,x,y,z,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,...,color_I,color_J,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.23,61.5,55.0,3.95,3.98,2.43,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,0.21,59.8,61.0,3.89,3.84,2.31,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
2,0.23,56.9,65.0,4.05,4.07,2.31,False,False,False,True,...,False,False,False,False,False,True,False,False,False,False
3,0.29,62.4,58.0,4.2,4.23,2.63,False,True,False,False,...,True,False,False,False,False,False,True,False,False,False
4,0.31,63.3,58.0,4.34,4.35,2.75,False,False,False,True,...,False,True,False,False,False,False,False,False,True,False


In [56]:
y = diamonds['price']
all_features_linreg = LinearRegression(fit_intercept = False).fit(X,y)
all_features_linreg

In [57]:
linreg_mse = mean_squared_error(y, all_features_linreg.predict(X))
linreg_mse

1276545.174308389

## Problem 10

### A `HuberRegressor` on all features

Use the `get_dummies()` function to create a dummy encoded version of all the columns in the `diamonds` DataFrame except for the column `price` and assign the result to the variable `X`.

To the variable `y`, assign the column `price` in the `diamonds` dataset.

Use the `HuberRegressor` estimator  with argument `fit_intercept = False` to build a Huber regression model. Next, use the `fit()` function with arguments `X` and `y`  to predict the `price` column.  

Assign this model to `huber_all_features` below. 

Use the `mean_squared_error` function to compute the MSE between `huber_all_features.predict(X)` and `y`. Assign the result to `huber_mse` below. 

In [58]:
X = pd.get_dummies(diamonds.drop('price',axis = 1))
X.head()

Unnamed: 0,carat,depth,table,x,y,z,cut_Ideal,cut_Premium,cut_Very Good,cut_Good,...,color_I,color_J,clarity_IF,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.23,61.5,55.0,3.95,3.98,2.43,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,0.21,59.8,61.0,3.89,3.84,2.31,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
2,0.23,56.9,65.0,4.05,4.07,2.31,False,False,False,True,...,False,False,False,False,False,True,False,False,False,False
3,0.29,62.4,58.0,4.2,4.23,2.63,False,True,False,False,...,True,False,False,False,False,False,True,False,False,False
4,0.31,63.3,58.0,4.34,4.35,2.75,False,False,False,True,...,False,True,False,False,False,False,False,False,True,False


In [59]:
y = diamonds['price']
huber_all_features = HuberRegressor(fit_intercept = False).fit(X,y)
huber_all_features

In [60]:
huber_mse = mean_squared_error(y, huber_all_features.predict(X))
huber_mse

1771660.11957095
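
A higher training MSE for the Huber model is expected: ordinary least squares minimizes the training MSE by construction, so any other linear fit on the same features cannot score lower on that metric. The sketch below illustrates this on synthetic data with a few injected outliers; the data and variable names are assumptions, not results from the diamonds dataset.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic linear data with a handful of large outliers
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = X_toy @ np.array([3.0, -2.0]) + rng.normal(0, 1, size=200)
y_toy[:10] += 50

ols = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)
huber = HuberRegressor(fit_intercept=False).fit(X_toy, y_toy)

ols_mse = mean_squared_error(y_toy, ols.predict(X_toy))
huber_mse_toy = mean_squared_error(y_toy, huber.predict(X_toy))

print(ols_mse, huber_mse_toy)  # OLS has the lower training MSE
print(ols.coef_, huber.coef_)  # compare both against the true [3, -2]
```

Training MSE alone therefore does not show which model is better; the Huber fit trades a higher squared error for less sensitivity to the outlying points.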

### Conclusion

While some basic initial models have been explored here, there is much more to explore to fine-tune them. One option is to revisit how the features are represented, both by transforming them and by engineering new features from existing ones.  For example, the dimensions of the diamond in `x`, `y`, and `z` could be multiplied to create a "volume" feature, which condenses three columns of data into one.  A second approach is to use PCA to reduce the dimensionality of the data.  A third is to use clustering and engineer new features from the cluster assignments.  Consider exploring different representations of the features to improve these initial models.
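
The "volume" idea can be sketched in a few lines. The three rows below reuse the `x`, `y`, and `z` values shown in the `head()` output above; the column name `volume` and the variable `X_volume` are assumptions for illustration, and the product is a bounding-box size rather than a true gemstone volume.

```python
import pandas as pd

# First three rows of x, y, z from the diamonds head() output
toy = pd.DataFrame({
    'x': [3.95, 3.89, 4.05],
    'y': [3.98, 3.84, 4.07],
    'z': [2.43, 2.31, 2.31],
})

# Engineer a single feature from the three dimension columns
toy['volume'] = toy['x'] * toy['y'] * toy['z']

# Replace the three original columns with the engineered one
X_volume = toy.drop(columns=['x', 'y', 'z'])
print(X_volume)
```

The same pattern applies to the full `diamonds` DataFrame before calling `get_dummies()` and refitting the models.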