<a href="https://colab.research.google.com/github/jaidatta71/Chatbot/blob/main/Linear%20Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Self-Study Colab Activity 7.1: Fitting a Simple Linear Regression Line Using Plotly and Scikit-learn

**Expected Time: 60 Minutes**



This activity focuses on using `sklearn` to build a `LinearRegression` estimator. The dataset is another of Seaborn's built-in datasets and contains information on geyser explosions. Using this dataset, you will build a regression model that uses the waiting time to predict the duration of the explosion.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

### The Geyser Data

The dataset contains information on the waiting time for a geyser explosion, the duration of the explosion, and a categorization of the explosion duration.  This data comes from the Seaborn built-in datasets.

In [2]:
geyser = sns.load_dataset('geyser')

In [3]:
geyser.head()

| | duration | waiting | kind |
|---|---|---|---|
| 0 | 3.6 | 79 | long |
| 1 | 1.8 | 54 | short |
| 2 | 3.333 | 74 | long |
| 3 | 2.283 | 62 | short |
| 4 | 4.533 | 85 | long |


[Back to top](#Index:)

## Problem 1

### Declaring `X` and `y`



Assign the column `waiting` as a DataFrame to the variable `X` and the column `duration` as a Series to the variable `y` below.

In [4]:


X = geyser[["waiting"]]
y = geyser['duration']
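
The double brackets matter here: scikit-learn estimators expect the features as a two-dimensional table (rows by columns), while the target can be one-dimensional. The sketch below illustrates the difference using only the five rows shown in `geyser.head()` above. The name `sample` is introduced purely for illustration and is not part of the activity.

```python
import pandas as pd

# The five rows displayed by geyser.head(), typed in by hand for illustration
sample = pd.DataFrame({
    "duration": [3.6, 1.8, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
})

X = sample[["waiting"]]   # double brackets -> DataFrame (2-D)
y = sample["duration"]    # single brackets -> Series (1-D)

print(type(X).__name__, X.shape)  # DataFrame (5, 1)
print(type(y).__name__, y.shape)  # Series (5,)
```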


[Back to top](#Index:)

## Problem 2

### Building a model with `LinearRegression`



Below, instantiate a linear regression model using the `LinearRegression()` constructor. Then chain the `fit()` method with the arguments `X` and `y` from above. Make sure to use only the default settings. Assign your regressor to the variable `linreg` below.

In [8]:
# Instantiate with the default settings and fit in one chained call
linreg = LinearRegression().fit(X, y)

# Answer check
print(linreg)

LinearRegression()
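
Chaining works because `fit()` returns the estimator itself. A minimal sketch of that behavior, again using only the five rows from `geyser.head()` as a stand-in for the full dataset (`sample`, `model`, and `returned` are illustrative names):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five rows from geyser.head(), used here as a small stand-in dataset
sample = pd.DataFrame({
    "duration": [3.6, 1.8, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
})
X = sample[["waiting"]]
y = sample["duration"]

# Two-step version: fit() hands back the same object it was called on
model = LinearRegression()
returned = model.fit(X, y)
print(returned is model)  # True

# One-step (chained) version gives an equivalent fitted estimator
linreg = LinearRegression().fit(X, y)
print(linreg.coef_.shape)  # (1,) because there is one feature
```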


[Back to top](#Index:)

## Problem 3

### Adding a prediction column



Add a column, `prediction`, to the `geyser` DataFrame. To this column assign `linreg.predict(X)`.

Make sure to check that your `geyser` DataFrame contains the new column.


In [15]:
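# Single-point prediction for a waiting time of 100; the input is passed as a
# DataFrame so its column name matches the one used in fit(). The result is
# not displayed because this is not the last expression in the cell.
linreg.predict(pd.DataFrame({'waiting': [100]}))

# Add the fitted value for every observation as a new column
geyser['prediction'] = linreg.predict(X)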

# Answer check
print(geyser.columns)
print(geyser.shape)
#geyser

Index(['duration', 'waiting', 'kind', 'prediction'], dtype='object')
(272, 4)
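
With the predictions stored next to the observed values, the residuals (observed minus predicted) are one subtraction away. The sketch below is an illustration on the five rows from `geyser.head()` only, not on the full dataset, and the `residual` column is an addition that the activity does not ask for. For a least-squares fit that includes an intercept, which is the default, the residuals sum to zero up to floating-point error.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five rows from geyser.head(), used here as a small stand-in dataset
sample = pd.DataFrame({
    "duration": [3.6, 1.8, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
})
X = sample[["waiting"]]
linreg = LinearRegression().fit(X, sample["duration"])

# Same pattern as in the activity: store fitted values as a new column
sample["prediction"] = linreg.predict(X)

# Residual = observed - predicted (hypothetical extra column)
sample["residual"] = sample["duration"] - sample["prediction"]

print(sample)
print(sample["residual"].sum())  # very close to 0
```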




[Back to top](#Index:)

## Problem 4

### Equation of line



The equation of the line will be of the form

$$\text{duration} = \text{waiting}\times \text{slope} + \text{intercept}$$

Use the `coef_` attribute on `linreg` to assign the slope of the solution as a float correct to two decimal places to the variable `slope`.

Use the `intercept_` attribute on `linreg` to assign the intercept of the solution as a float correct to two decimal places to the variable `intercept`.


In [16]:
import plotly.graph_objects as go

# Plot the observed durations and the model's fitted values against waiting time
fig = go.Figure()
fig.add_trace(go.Scatter(x=geyser['waiting'], y=geyser['duration'], mode='markers', name='Actuals'))
fig.add_trace(go.Scatter(x=geyser['waiting'], y=geyser['prediction'], mode='markers', name='Predicted'))
fig.update_layout(title='Geyser Data', xaxis_title='Waiting Time', yaxis_title='Duration')
fig.show()

In [17]:
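# coef_ is a one-element array (one feature), so pull out the scalar before rounding.
# Unrounded values: coef_ = [0.07562795], intercept_ = -1.8740159864107366
slope = round(float(linreg.coef_[0]), 2)
intercept = round(float(linreg.intercept_), 2)

# Answer check
print(type(slope))
print(slope, intercept)

<class 'float'>
0.08 -1.87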
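
As a worked example of the equation, the sketch below plugs a waiting time into the line using the unrounded slope and intercept from the fitted model (0.07562795 and -1.8740159864107366). The waiting time of 80 is a hypothetical input chosen for illustration. The second calculation repeats the arithmetic with both values rounded to two decimal places, which shows how much the rounding shifts the result.

```python
# Unrounded values reported by the fitted model in this notebook
slope = 0.07562795
intercept = -1.8740159864107366

waiting = 80  # hypothetical waiting time, chosen for illustration

# duration = waiting * slope + intercept
duration = waiting * slope + intercept
print(round(duration, 3))  # 4.176

# Same calculation with the coefficients rounded to two decimal places
duration_rounded = waiting * round(slope, 2) + round(intercept, 2)
print(round(duration_rounded, 3))  # 4.53
```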


In [18]:
# Measure the loss function (mean squared error)
from sklearn.metrics import mean_squared_error
mean_squared_error(geyser['duration'], linreg.predict(geyser[['waiting']]))

0.2447124107084554
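
The mean squared error is just the average of the squared residuals, so it can be reproduced by hand. A minimal sketch on the five rows from `geyser.head()`, not the full dataset, so the number it prints will differ from the value above. The root mean squared error (`rmse`) is an extra that the activity does not ask for; it is included because it is expressed in the same units as `duration`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Five rows from geyser.head(), used here as a small stand-in dataset
sample = pd.DataFrame({
    "duration": [3.6, 1.8, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
})
X = sample[["waiting"]]
y = sample["duration"]
preds = LinearRegression().fit(X, y).predict(X)

# MSE by hand: average of the squared differences
mse_manual = np.mean((y - preds) ** 2)
mse_sklearn = mean_squared_error(y, preds)
print(np.isclose(mse_manual, mse_sklearn))  # True

# RMSE: square root of the MSE
rmse = np.sqrt(mse_manual)
print(rmse)
```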

# MSE & MAE #


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
diamonds = sns.load_dataset('diamonds')

### Regression with single feature ###

In [3]:
# fit_intercept=False forces the fitted line through the origin, so intercept_ is 0.0
linereg = LinearRegression(fit_intercept=False)
lr_1_feature = linereg.fit(diamonds[['carat']], diamonds['price'])
coefs_ = lr_1_feature.coef_
intercept_ = lr_1_feature.intercept_
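
To see what `fit_intercept=False` changes, the sketch below fits the same data with and without an intercept. The numbers are made up for illustration and are not the diamonds data. With a single feature and no intercept, the least-squares slope has the closed form `sum(x*y) / sum(x*x)`, which the code checks against scikit-learn's result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only: exactly y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

with_intercept = LinearRegression().fit(X, y)
through_origin = LinearRegression(fit_intercept=False).fit(X, y)

# Default fit recovers slope 2 and intercept 1 (up to floating-point error)
print(with_intercept.coef_, with_intercept.intercept_)

# Without an intercept the line must pass through (0, 0)
print(through_origin.coef_, through_origin.intercept_)

# Closed-form slope for a single feature and no intercept
closed_form = np.sum(x * y) / np.sum(x * x)
print(np.isclose(through_origin.coef_[0], closed_form))  # True
```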

### Regression with TWO features ###

In [4]:
linereg2 = LinearRegression(fit_intercept=False)
lr_2_features = linereg2.fit(diamonds[['carat','depth']], diamonds['price'])
coefs_ = lr_2_features.coef_
intercept_ = lr_2_features.intercept_

### Regression with THREE features ###

In [5]:
linereg3 = LinearRegression(fit_intercept=False)
lr_3_features = linereg3.fit(diamonds[['carat','depth','table']], diamonds['price'])
coefs_ = lr_3_features.coef_
intercept_ = lr_3_features.intercept_

### Computing MSE and MAE ###
For each of your models, compute the mean squared error and the mean absolute error. Create a DataFrame to match the structure below:

| Features | MSE | MAE |
|---|---|---|
| 1-Feature | - | - |
| 2-Features | - | - |
| 3-Features | - | - |

Assign your solution as a DataFrame to `error_df` below. Note that the `Features` column should be the index column in your DataFrame.
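
For reference, assuming the usual definitions (which are what `mean_squared_error` and `mean_absolute_error` compute with their default settings), with $y_i$ the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of rows:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Because MSE squares each error, it is expressed in squared price units and penalizes large misses more heavily, while MAE stays in the same units as the price.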

In [7]:
# Predictions from each of the three fitted models
pred1 = lr_1_feature.predict(diamonds[['carat']])
pred2 = lr_2_features.predict(diamonds[['carat', 'depth']])
pred3 = lr_3_features.predict(diamonds[['carat', 'depth', 'table']])
preds = [pred1, pred2, pred3]

error_dict = {
    'Features': ['1-Feature', '2-Features', '3-Features'],
    'MSE': [mean_squared_error(diamonds['price'], p) for p in preds],
    'MAE': [mean_absolute_error(diamonds['price'], p) for p in preds],
}

# Use the Features column as the index, as the task requires
error_df = pd.DataFrame(error_dict).set_index('Features')

# Alternative (left commented out): compute each metric one at a time

#mse1 = mean_squared_error(diamonds['price'],   lr_1_feature.predict(diamonds[['carat']]))
#mae1 = mean_absolute_error(diamonds['price'],  lr_1_feature.predict(diamonds[['carat']]))

#mse2 = mean_squared_error(diamonds['price'],   lr_2_features.predict(diamonds[['carat','depth']]))
#mae2 = mean_absolute_error(diamonds['price'],  lr_2_features.predict(diamonds[['carat','depth']]))

#mse3 = mean_squared_error(diamonds['price'],   lr_3_features.predict(diamonds[['carat','depth','table']]))
#mae3 = mean_absolute_error(diamonds['price'],  lr_3_features.predict(diamonds[['carat','depth','table']]))

# Answer check
#data = {
#    'Features':[1,2,3],
#    'MSE':    [mse1, mse2, mse3],
#    'MAE':    [mae1, mae2, mae3]
#}

#error_df = pd.DataFrame(data).set_index('Features')