<a href="https://colab.research.google.com/github/jaidatta71/ML---Berkeley/blob/main/colab_activity_7_1%20-%20Simple%20Linear%20Regression%20Line%20Using%20Plotly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Self-Study Colab Activity 7.1: Fitting a Simple Linear Regression Line Using Plotly and Scikit-learn

**Expected Time: 60 Minutes**



This activity focuses on using `sklearn` to build a `LinearRegression` estimator.  The dataset is another built-in Seaborn dataset, this one with information on geyser explosions.  Using this dataset, you will build a regression model that uses the waiting time to predict the duration of the explosion.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

### The Geyser Data

The dataset contains information on the waiting time for a geyser explosion, the duration of the explosion, and a categorization of the explosion duration.  This data comes from the Seaborn built-in datasets.

In [None]:
geyser = sns.load_dataset('geyser')

In [None]:
geyser.head()

| | duration | waiting | kind |
|---|---|---|---|
| 0 | 3.6 | 79 | long |
| 1 | 1.8 | 54 | short |
| 2 | 3.333 | 74 | long |
| 3 | 2.283 | 62 | short |
| 4 | 4.533 | 85 | long |


[Back to top](#Index:)

## Problem 1

### Declaring `X` and `y`.  



Assign the column `waiting` as a DataFrame to the variable `X` and the column `duration` as a series to the variable `y` below.  

In [None]:


X = geyser[["waiting"]]
y = geyser['duration']
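
The difference between the two selections matters because scikit-learn estimators expect a two-dimensional feature matrix and a one-dimensional target. The following minimal sketch shows the resulting types and shapes. It uses only the first three rows shown by `geyser.head()`, and the names `toy`, `X_toy`, and `y_toy` are illustrative assumptions.

```python
import pandas as pd

# Small stand-in built from the first three rows printed by geyser.head()
toy = pd.DataFrame({"waiting": [79, 54, 74], "duration": [3.6, 1.8, 3.333]})

X_toy = toy[["waiting"]]   # double brackets select a DataFrame (2D)
y_toy = toy["duration"]    # single brackets select a Series (1D)

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

Selecting with `toy["waiting"]` instead would give a one-dimensional Series, which `fit()` rejects as a feature matrix.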


[Back to top](#Index:)

## Problem 2

### Building a model with `LinearRegression`

### In `LinearRegression.fit()`, the first argument is a DataFrame and the second argument is a Series


Below, instantiate a linear regression model using `LinearRegression()`. Then chain the `fit()` method with the arguments `X` and `y` from above.  Make sure to use only the default settings.  Assign your regressor to the variable `linreg` below.  

In [None]:
linreg = LinearRegression()
linreg.fit(X, y)

# Answer check
print(linreg)

LinearRegression()


[Back to top](#Index:)

## Problem 3

### Adding a prediction column



Add a column, `prediction`, to the `geyser` DataFrame. To this column assign `linreg.predict(X)`.

Make sure to check that your DataFrame `geyser` contains the new column.


In [None]:
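# Example: predicted duration for a single waiting time of 100 (result is not stored)
linreg.predict(pd.DataFrame({'waiting': [100]}))
# Predict the duration for every observed waiting time and store it as a new column
geyser['prediction'] = linreg.predict(X)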

# Answer check
print(geyser.columns)
print(geyser.shape)
#geyser

Index(['duration', 'waiting', 'kind', 'prediction'], dtype='object')
(272, 4)




[Back to top](#Index:)

## Problem 4

### Equation of line



The equation of the line will be of the form

$$\text{duration} = \text{waiting}\times \text{slope} + \text{intercept}$$

Use the `coef_` attribute on `linreg` to assign the slope of the solution as a float correct to two decimal places to the variable `slope`.

Use the `intercept_` attribute on `linreg` to assign the intercept of the solution as a float correct to two decimal places to the variable `intercept`.


In [None]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=geyser['waiting'], y=geyser['duration'], mode='markers', name='Actuals'))
fig.add_trace(go.Scatter(x=geyser['waiting'], y=geyser['prediction'], mode='markers', name='Predicted'))
fig.update_layout(title='Geyser Data', xaxis_title='Waiting Time', yaxis_title='Duration')
fig.show()

In [None]:
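# coef_ is a one-element array for a single feature; take the element and round to two decimals
slope = round(float(linreg.coef_[0]), 2)
intercept = round(float(linreg.intercept_), 2)

# Answer check
print(linreg.coef_, linreg.intercept_)
print(type(slope))
print(slope, intercept)

[0.07562795] -1.8740159864107366
<class 'float'>
0.08 -1.87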
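
As a worked check of the line equation, the sketch below plugs the unrounded coefficients printed above into `duration = waiting * slope + intercept`. The waiting time of 80 is an arbitrary illustrative input, not a value taken from the activity.

```python
# Unrounded coefficients copied from the printed output above
slope_raw = 0.07562795
intercept_raw = -1.8740159864107366

waiting = 80  # arbitrary example input

# Prediction with the full-precision coefficients
duration = waiting * slope_raw + intercept_raw
print(round(duration, 2))

# Prediction with the coefficients rounded to two decimals
duration_rounded = waiting * round(slope_raw, 2) + round(intercept_raw, 2)
print(round(duration_rounded, 2))
```

The two results differ noticeably (about 4.18 versus 4.53), so the rounded values are best used for reporting, while predictions should come from the unrounded coefficients or from `linreg.predict`.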


In [None]:
# Measure the loss function (mean squared error)
from sklearn.metrics import mean_squared_error
mean_squared_error(geyser['duration'], linreg.predict(geyser[['waiting']]))

0.2447124107084554
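
For reference, the mean squared error is the average of the squared residuals, and the mean absolute error is the average of their absolute values. A minimal NumPy sketch on made-up numbers (the arrays are illustrative assumptions, not the geyser data):

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 2.0, 4.0])
y_pred = np.array([2.5, 2.0, 5.0])

residuals = y_true - y_pred          # [0.5, 0.0, -1.0]

mse = np.mean(residuals ** 2)        # average squared error
mae = np.mean(np.abs(residuals))     # average absolute error

print(mse, mae)
```

Because the errors are squared, MSE penalizes the single large residual more heavily than MAE does.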

# MSE & MAE #


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

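| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |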


### Regression with single feature ###

In [None]:
linereg = LinearRegression(fit_intercept = False)
lr_1_feature = linereg.fit(diamonds[['carat']], diamonds['price'])
coefs_ = lr_1_feature.coef_
intercept_ = lr_1_feature.intercept_

### Regression with TWO features ###

In [None]:
linereg2 = LinearRegression(fit_intercept = False)
lr_2_features = linereg2.fit(diamonds[['carat','depth']], diamonds['price'])
coefs_ = lr_2_features.coef_
intercept_ = lr_2_features.intercept_

### Regression with THREE features ###

In [None]:
linereg3 = LinearRegression(fit_intercept = False)
lr_3_features = linereg3.fit(diamonds[['carat','depth','table']], diamonds['price'])
coefs_ = lr_3_features.coef_
intercept_ = lr_3_features.intercept_

### Computing MSE and MAE ###
For each of your models, compute the mean squared error (MSE) and the mean absolute error (MAE). Create a DataFrame to match the structure below:

| Features | MSE | MAE |
|---|---|---|
| 1-Feature | - | - |
| 2-Features | - | - |
| 3-Features | - | - |

Assign your solution as a DataFrame to `error_df` below. Note that the `Features` column should be the index column in your DataFrame.

In [None]:
pred1 = lr_1_feature.predict(diamonds[['carat']])
pred2 = lr_2_features.predict(diamonds[['carat', 'depth']])
pred3 = lr_3_features.predict(diamonds[['carat', 'depth', 'table']])

# Build the table; the index labels match the structure requested above
error_dict = {'Features': ['1-Feature', '2-Features', '3-Features'],
              'MSE': [mean_squared_error(diamonds['price'], pred) for pred in [pred1, pred2, pred3]],
              'MAE': [mean_absolute_error(diamonds['price'], pred) for pred in [pred1, pred2, pred3]]}

error_df = pd.DataFrame(error_dict).set_index('Features')

# Alternative: compute each metric explicitly, one model at a time
#mse1 = mean_squared_error(diamonds['price'],   lr_1_feature.predict(diamonds[['carat']]))
#mae1 = mean_absolute_error(diamonds['price'],  lr_1_feature.predict(diamonds[['carat']]))

#mse2 = mean_squared_error(diamonds['price'],   lr_2_features.predict(diamonds[['carat','depth']]))
#mae2 = mean_absolute_error(diamonds['price'],  lr_2_features.predict(diamonds[['carat','depth']]))

#mse3 = mean_squared_error(diamonds['price'],   lr_3_features.predict(diamonds[['carat','depth','table']]))
#mae3 = mean_absolute_error(diamonds['price'],  lr_3_features.predict(diamonds[['carat','depth','table']]))

# Answer check
#data = {
#    'Features':[1,2,3],
#    'MSE':    [mse1, mse2, mse3],
#    'MAE':    [mae1, mae2, mae3]
#}

#error_df = pd.DataFrame(data).set_index('Features')

# One-Hot Encoding #

Categorical (non-numeric) data can be used in linear regression by one-hot encoding it.
Create K features, where K is the number of unique values of the non-numeric feature of interest, and fit the linear regression equation on them.

In the example below, we test the hypothesis that the day of the week influences tips. To do so, convert the values in the `day` column into separate columns, one for each day (Thur, Fri, Sat, Sun).

Use the coefficients to calculate the tip for each day. You will see that the difference is very small.
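
A minimal sketch of the encoding step on a tiny made-up column may help. The `toy` DataFrame and its values are illustrative assumptions, not the tips data.

```python
import pandas as pd

# Made-up categorical column with K = 4 unique values
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sat"]})

# One new column per unique value; each row has exactly one True/1 entry
dummies_toy = pd.get_dummies(toy["day"])

print(dummies_toy.shape)           # (rows, K)
print(list(dummies_toy.columns))
```

Each row is "hot" in exactly one column, which is what lets a linear model assign a separate coefficient to each category.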

In [None]:
# Assumption: `df` is the Seaborn tips dataset, which has the "day" and "tip" columns discussed above
df = sns.load_dataset('tips')

# Create one dummy column per value in the "day" column, i.e., Thur, Fri, Sat, Sun
dummies = pd.get_dummies(df['day'])
dummies.iloc[[193, 50, 26], :]

# Combine the original DataFrame with the dummy columns
data_with_dummies = pd.concat([df, dummies], axis=1)

# Delete the original non-numeric "day" column, which would cause an error in model building
del data_with_dummies["day"]
data_with_dummies

# Fit the tip on the day dummies only (no intercept), so there is one coefficient per day
f_with_day = LinearRegression(fit_intercept = False)
f_with_day.fit(data_with_dummies[list(dummies.columns)], data_with_dummies['tip'])

# Show the coefficient for each day column
f_with_day.coef_

### Plotting trend lines for each cut. Each cut has a different slope and y-intercept ###
As the plot suggests, the better the cut, the higher the price at a given carat.

In [None]:
px.scatter(diamonds, x='carat', y='price', color = 'cut', trendline='ols')
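
To see what "one trend line per cut" means numerically, the sketch below fits a separate straight line to each group with `np.polyfit`. The toy data are invented for illustration, and this is not necessarily how Plotly computes its `ols` trendlines internally.

```python
import numpy as np
import pandas as pd

# Invented data: two cuts, each lying exactly on its own line
toy = pd.DataFrame({
    "carat": [0.2, 0.4, 0.6, 0.2, 0.4, 0.6],
    "price": [300.0, 700.0, 1100.0, 400.0, 1000.0, 1600.0],
    "cut": ["Good", "Good", "Good", "Ideal", "Ideal", "Ideal"],
})

# Fit price = slope * carat + intercept separately within each cut
lines = {}
for cut, group in toy.groupby("cut"):
    slope, intercept = np.polyfit(group["carat"], group["price"], deg=1)
    lines[cut] = (slope, intercept)

for cut, (slope, intercept) in lines.items():
    print(cut, round(slope, 2), round(intercept, 2))
```

Each group gets its own slope and intercept, which is why the trend lines in the plot are not parallel.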

# Categorical Features
Explore the dummy encoding process to build and compare different regression models. The diamonds dataset from Seaborn is loaded and displayed below. You will explore models that use both the cut and color features independently, and models using all possible features. To begin, you will use the pandas `get_dummies` function to produce the dummy encoded data. Your dummy encoded data should have as many features as there are unique values in the data.

In [None]:
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

In [None]:
import urllib.request

diamonds = None

try:
    diamonds = sns.load_dataset('diamonds')
except Exception:
    # Fall back to reading the CSV directly from the seaborn-data repository
    diamonds_dataset_uri = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
    with urllib.request.urlopen(diamonds_dataset_uri) as response:
        diamonds = pd.read_csv(response)
diamonds.head(10)

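| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.0 | 4.05 | 2.39 |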


### Encoding on 'cut' column

In [None]:
cut_encoded = pd.get_dummies(diamonds['cut'])

print(cut_encoded.shape)
print(type(cut_encoded))
cut_encoded.head()

(53940, 5)
<class 'pandas.core.frame.DataFrame'>


| | Ideal | Premium | Very Good | Good | Fair |
|---|---|---|---|---|---|
| 0 | True | False | False | False | False |
| 1 | False | True | False | False | False |
| 2 | False | False | False | True | False |
| 3 | False | True | False | False | False |
| 4 | False | False | False | True | False |


### A Regression model on cut

Use the get_dummies() function to create a dummy encoded version of the cut column and assign the result to the variable X. To the variable y, assign the column price in the diamonds dataset.

Use the LinearRegression estimator with argument fit_intercept = False to build a regression model. Next, use the fit() function with arguments X and y to predict the price column.

In [None]:
X = pd.get_dummies(diamonds[['cut']])
y = diamonds['price']
cut_linreg = LinearRegression(fit_intercept = False).fit(X,y)

# Answer check
print(cut_linreg)
print(type(cut_linreg))
cut_linreg.coef_

LinearRegression(fit_intercept=False)
<class 'sklearn.linear_model._base.LinearRegression'>


array([3457.54197021, 4584.2577043 , 3981.75989075, 3928.86445169,
       4358.75776398])

#### Compare the coefficients of the model. What price does your model predict for a diamond with an Ideal cut?

In [None]:
ideal_cut_prediction = float(round(cut_linreg.coef_[0],2))

# Answer check
print(ideal_cut_prediction)
print(type(ideal_cut_prediction))

3457.54
<class 'float'>
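
The coefficient can be read directly as a predicted price because, with no intercept and a full set of dummy columns, least squares assigns each category the mean of its target values. The sketch below checks this on an invented toy dataset; the `toy` DataFrame and its prices are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data: two diamonds per cut
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good", "Ideal", "Ideal"],
    "price": [100.0, 300.0, 400.0, 600.0, 700.0, 900.0],
})

# Full dummy encoding, no intercept (same setup as the model above)
X_toy = pd.get_dummies(toy["cut"], dtype=float)
model = LinearRegression(fit_intercept=False).fit(X_toy, toy["price"])

coef_by_cut = pd.Series(model.coef_, index=X_toy.columns)
mean_by_cut = toy.groupby("cut")["price"].mean()

print(coef_by_cut)
print(mean_by_cut)
```

In this toy case the coefficients and the group means coincide, so "the prediction for an Ideal cut" is simply the average price of the Ideal diamonds in the training data.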


### A Model with cut, clarity, and carat
Use the get_dummies() function to create a dummy encoded version of the carat, cut, and clarity columns and assign the result to the variable X.

To the variable y, assign the column price in the diamonds dataset.

Use the LinearRegression estimator with argument fit_intercept = False to build a regression model. Next, use the fit() function with arguments X and y to predict the price column.

In [None]:
X = pd.get_dummies(diamonds[['carat','cut','clarity']])
y = diamonds['price']
ccc_linreg = LinearRegression(fit_intercept = False).fit(X,y)

# Answer check
print(X)
print(ccc_linreg)
print(ccc_linreg.coef_)

       carat  cut_Ideal  cut_Premium  cut_Very Good  cut_Good  cut_Fair  \
0       0.23       True        False          False     False     False   
1       0.21      False         True          False     False     False   
2       0.23      False        False          False      True     False   
3       0.29      False         True          False     False     False   
4       0.31      False        False          False      True     False   
...      ...        ...          ...            ...       ...       ...   
53935   0.72       True        False          False     False     False   
53936   0.72      False        False          False      True     False   
53937   0.70      False        False           True     False     False   
53938   0.86      False         True          False     False     False   
53939   0.75       True        False          False     False     False   

       clarity_IF  clarity_VVS1  clarity_VVS2  clarity_VS1  clarity_VS2  \
0           False       

### Interpreting the results

Examine the coefficients from the model and use them to determine the predicted price of a diamond with the following features:

carat = 0.8                  
cut = Ideal             
clarity = SI2

1. Create a DataFrame with the three values above.
2. Add all the columns of `X`. If you pass only the three columns of that DataFrame, `predict` raises an error:

       ValueError: The feature names should match those that were passed during fit.
       Feature names unseen at fit time:(these columns are not in X)
         - clarity
         - cut
       Feature names seen at fit time, yet now missing:
         - clarity_I1
         - clarity_IF
3. Fill the rest of the columns with zero.

In [None]:
diamond_features = pd.DataFrame({ 'carat': [0.8], 'cut': ['Ideal'], 'clarity': ['SI2']})
diamond_features


| | carat | cut | clarity |
|---|---|---|---|
| 0 | 0.8 | Ideal | SI2 |


The `.reindex(columns=X.columns)` operation in pandas aligns the columns of a DataFrame with those of another DataFrame or a list of columns. It is commonly used to ensure that two DataFrames have the same columns before performing operations between them, such as addition or concatenation. Here, `fill_value=0` fills the columns that are missing from the new row with zero.

In [None]:

abcd = pd.get_dummies(diamond_features).reindex(columns=X.columns,  fill_value=0)
abcd

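| | carat | cut_Ideal | cut_Premium | cut_Very Good | cut_Good | cut_Fair | clarity_IF | clarity_VVS1 | clarity_VVS2 | clarity_VS1 | clarity_VS2 | clarity_SI1 | clarity_SI2 | clarity_I1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.8 | True | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | 0 |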


In [None]:
ccc_linreg = LinearRegression(fit_intercept=False).fit(X, diamonds['price'])
ccc_prediction = ccc_linreg.predict(abcd)

ccc_prediction = round(ccc_prediction[0], 2)
print(ccc_prediction)
print(type(ccc_prediction))

2882.66
<class 'numpy.float64'>
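
With `fit_intercept=False`, the prediction is just the dot product of the feature row and the coefficients, so for the diamond above it reduces to `0.8 * coef(carat) + coef(cut_Ideal) + coef(clarity_SI2)`. The sketch below verifies the dot-product identity on randomly generated toy data; the arrays and the three-feature layout are illustrative assumptions, not the diamonds model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Randomly generated toy data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=20)

model = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)

# One new row: a numeric value followed by two dummy-like entries
row = np.array([[0.8, 1.0, 0.0]])

manual = float(row[0] @ model.coef_)        # sum of feature * coefficient
via_predict = float(model.predict(row)[0])  # what predict() returns

print(manual, via_predict)
```

The zero entries contribute nothing to the sum, which is why filling the unused dummy columns with zero is the right choice.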


### A Model with all features

Use the get_dummies() function to create a dummy encoded version of all the columns in the diamonds DataFrame except for the column price and assign the result to the variable X. To the variable y, assign the column price in the diamonds dataset.

Use the LinearRegression estimator with argument fit_intercept = False to build a regression model. Next, use the fit() function with arguments X and y to predict the price column. Assign the model to all_features_linreg below.

Use the mean_squared_error function to compute the MSE between all_features_linreg.predict(X) and y.

In [None]:
# Dummy encode every column except the target, price
X = pd.get_dummies(diamonds[['carat','cut','clarity','color','depth','table','x','y','z']])
y = diamonds['price']
all_features_linreg = LinearRegression(fit_intercept = False).fit(X, y)
# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the value is unchanged
linreg_mse = mean_squared_error(y, all_features_linreg.predict(X))
# Answer check
print(all_features_linreg)
print(all_features_linreg.coef_)
print(linreg_mse)

LinearRegression(fit_intercept=False)
[ 1.12569783e+04 -6.38061004e+01 -2.64740847e+01 -1.00826110e+03
  9.60888648e+00 -5.01188909e+01  2.71221727e+03  2.64144937e+03
  2.60608801e+03  2.45905687e+03  1.87930542e+03  3.06769746e+03
  2.73035426e+03  2.67340929e+03  2.30099313e+03  1.98981878e+03
  1.38806730e+03  4.25181510e+02 -2.27740478e+03  2.58257671e+03
  2.37345863e+03  2.30972288e+03  2.10053781e+03  1.60231004e+03
  1.11633224e+03  2.13178649e+02]
1276545.174308389


# Conclusion
While some basic initial models have been explored here, there is much more to explore to fine-tune things. One thing that could be revisited is the representation of features through transformations and the engineering of different representations of existing features. For example, the dimensions of the diamond in x, y, and z could be multiplied to create a feature "volume". This allows for a more reasonable representation of three columns of data with one. A second approach we might take is to use PCA to reduce the dimensionality of the data. The third is to use clustering to engineer new features based on the cluster results. Consider exploring different representations of the features and trying to improve these initial models.
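
As a starting point for the "volume" idea, the following minimal sketch multiplies the three dimension columns into a single feature. It uses the `x`, `y`, and `z` values of the first three rows displayed earlier; the `sample` and `features` names are illustrative assumptions, and the product is only a box-shaped approximation of a diamond's volume.

```python
import pandas as pd

# x, y, z of the first three diamonds shown in diamonds.head()
sample = pd.DataFrame({
    "x": [3.95, 3.89, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 2.31],
})

# Engineer one feature from three columns
sample["volume"] = sample["x"] * sample["y"] * sample["z"]

# Keep only the engineered feature in place of the three originals
features = sample.drop(columns=["x", "y", "z"])
print(features)
```

The same pattern could be applied to the full `diamonds` DataFrame before calling `get_dummies`, and the resulting MSE compared against the all-features model.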