<center><img src="https://i2.wp.com/hackwagon.com/wp-content/uploads/2017/02/Logo-Web-Export.png?ssl=1" width=200/></center>
<h1> Hackwagon Academy DS102 Lesson 5 </h1>
<h2> Linear Regression </h2> 
<h3> Lesson Outline </h3>

- 1. [Scikit Learn - SKLearn](#1)
    - 1.1 [5 Standard Steps](#1.1)
- 2. [Simple Linear Regression](#2)
    - 2.1 [Example - Tips and Housing Prices](#2.1)
    - [Practice I](#P1)
- 3. [Multivariate Linear Regression](#3)
    - 3.1 [Example - Housing Prices](#3.1)
    - [Practice II](#P2)

In [None]:
!pip install scikit-learn

<hr/>

<a id='1'><h2><img src="https://images.vexels.com/media/users/3/153978/isolated/preview/483ef8b10a46e28d02293a31570c8c56-warning-sign-colored-stroke-icon-by-vexels.png" width=23 align="left"><font color="salmon">&nbsp;1.</font><font color="salmon"> Scikit Learn - SKLearn </font> </h2></a>

<a id='1.1'><h3>5 Standard Steps</h3></a>

**Step 1**: Choose a class of machine learning model from the library 

**Step 2**: Choose the model’s hyperparameters by instantiating with desired values (tuning)

**Step 3**: Arrange data into features and target

**Step 4**: Fit model to your data by using the fit() method of the model 

**Step 5**: Apply the model to new data:
    - For supervised learning, using the predict() method
    - For unsupervised learning, using the predict() or transform() method
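
The supervised case of the five steps is shown in the next cell. For the unsupervised case mentioned in Step 5, the following is a minimal sketch only: the toy data is hypothetical, and `KMeans` and `PCA` are chosen here purely as examples of estimators that expose `predict()` and `transform()` respectively.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# Steps 1 & 2: choose a model class and set its hyperparameters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pca = PCA(n_components=1)

# Step 3: unsupervised models use features only, there is no target
# Step 4: fit each model to the features
kmeans.fit(X)
pca.fit(X)

# Step 5: apply the fitted models to new data
new_points = np.array([[0.1, 0.1], [5.0, 5.0]])
cluster_labels = kmeans.predict(new_points)  # one cluster label per row
projected = pca.transform(new_points)        # one component per row
print(cluster_labels, projected.shape)
```

The cluster numbers themselves are arbitrary; what matters is that points from the same group receive the same label.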

In [1]:
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# ----- STEP 1 & 2 ----- 
# Create linear regression object
regr = linear_model.LinearRegression()

#  ----- STEP 3 -----
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# ----- STEP 4 -----
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# ----- STEP 5 -----
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)


<a id='2'><h2><img src="https://images.vexels.com/media/users/3/153978/isolated/preview/483ef8b10a46e28d02293a31570c8c56-warning-sign-colored-stroke-icon-by-vexels.png" width=23 align="left"><font color="salmon">&nbsp;2.</font><font color="salmon"> Simple Linear Regression </font> </h2></a>

<a id='2.1'><h3>2.1 Example - Tips </h3></a>



In [2]:
import pandas as pd

data = { 
    'ID': [1,2,3,4,5,6],
    'Tips': [10, 25, 12, 8, 15, 20],
    'Meal': [80, 150, 75, 60, 100, 150]
}

tips_df = pd.DataFrame(data)
tips_df

| | ID | Tips | Meal |
|---|---|---|---|
| 0 | 1 | 10 | 80 |
| 1 | 2 | 25 | 150 |
| 2 | 3 | 12 | 75 |
| 3 | 4 | 8 | 60 |
| 4 | 5 | 15 | 100 |
| 5 | 6 | 20 | 150 |


In [3]:
from sklearn import datasets, linear_model

# Step 1 & 2
regr = linear_model.LinearRegression()

# Step 3
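X = tips_df[['Meal']] # << Double brackets keep X as a 2-D DataFrame, which fit() requires
y = tips_df[['Tips']] # << Double brackets keep y 2-D as well; a 1-D Series also works but changes the shape of coef_ and the predictions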

# Step 4
regr.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Step 5 is in a separate cell: once the model has been trained, there is no need to re-train it every time a prediction is made.

In [4]:
# Step 5 
# Predicting the tip for a meal of $10
regr.predict([[10]]) # < Note that it's a list of lists

array([[0.30971993]])

In [5]:
print(regr.coef_)
print(regr.intercept_)

# Tips = -1.28 + 0.159 (Meal)
# With every $1 increase in meal price, the predicted tip increases by about $0.16

[[0.15881384]]
[-1.27841845]
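
The coefficient and intercept printed above can be checked by hand. The sketch below uses the standard closed-form least-squares formulas for a single predictor, computed with NumPy rather than scikit-learn, and then reproduces the prediction for a $10 meal. The variable names are illustrative only.

```python
import numpy as np

# Same data as tips_df
meal = np.array([80, 150, 75, 60, 100, 150], dtype=float)
tips = np.array([10, 25, 12, 8, 15, 20], dtype=float)

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
meal_dev = meal - meal.mean()
tips_dev = tips - tips.mean()
slope = np.sum(meal_dev * tips_dev) / np.sum(meal_dev ** 2)

# intercept = mean_y - slope * mean_x
intercept = tips.mean() - slope * meal.mean()

# Prediction is just the linear equation evaluated at Meal = 10
tip_for_10_dollar_meal = intercept + slope * 10
print(slope, intercept, tip_for_10_dollar_meal)
```

Note that a $10 meal is far below the smallest meal in the data ($60), so this prediction is an extrapolation.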


### Housing Prices

Filter to just two columns, where the predictor variable is `sqft_living` and the target variable `price`.  

In [6]:
housing_x = pd.read_csv('housing_x.csv')
price = pd.read_csv('housing_y.csv')

### Train Test Split

Select `sqft_living` as a single-column DataFrame and use `price` as the target. Call `train_test_split()` to get the four splits of data. 

In [9]:
from sklearn.model_selection import train_test_split
sqft_living = housing_x[['sqft_living']] # << must be a DataFrame (NOT housing_x['sqft_living'], which is a 1-D Series)

X_train, X_test, y_train, y_test = train_test_split(sqft_living, price, random_state=42)
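
To make the four return values of `train_test_split()` concrete, here is a small sketch on hypothetical data (eight rows, not the housing files). It relies on the function's default of holding out 25% of the rows when no `test_size` is given, and shows that fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 rows, 1 feature, with y tied to X so pairing can be checked
X = np.arange(8).reshape(-1, 1)
y = np.arange(8) * 10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, X_test.shape)

# Same random_state gives the same shuffle on a second call
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=42)
print(np.array_equal(X_train, X_train_again))
```

Rows of `X` and `y` are shuffled together, so each feature row stays paired with its own target value.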

### Fit Model

Using the training datasets, train the model.

In [10]:
simple_housing_lr = linear_model.LinearRegression()

simple_housing_lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpret

With the fitted model, interpret and create the linear equation of the model. 

In [11]:
print(simple_housing_lr.coef_)
print(simple_housing_lr.intercept_)

# price = -34785.44 + 276.62 (sqft_living)
# For every 1 sqft increase in sqft_living, the predicted price increases by about $276.62

[[276.61559523]]
[-34785.44388888]


### Predict 

Using `.predict()`, predict the price for a custom square footage.

In [12]:
custom_sqft = 10000

simple_housing_lr.predict([[custom_sqft]]) # IMPORTANT - It's a nested list 

array([[2731370.50837813]])
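
The nested list is needed because scikit-learn estimators expect features with shape `(n_samples, n_features)`. As a rough illustration with NumPy only (the extra square-footage values are hypothetical), the shapes look like this:

```python
import numpy as np

custom_sqft = 10000

# One sample with one feature: the outer list is the rows, the inner list the columns
single_row = np.array([[custom_sqft]])

# Several samples at once: reshape a flat list into one column
several_rows = np.array([1000, 2000, 3000]).reshape(-1, 1)

# A flat list has no column dimension, which is why predict() rejects it
flat = np.array([custom_sqft])

print(single_row.shape, several_rows.shape, flat.shape)
```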

### Assess Performance of Model

Using `r2_score`, calculate the R-squared score of the model on the test set.

In [13]:
from sklearn.metrics import r2_score 
fitted_values = simple_housing_lr.predict(X_test)

print(r2_score(y_test, fitted_values))

0.49434750583619946
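
For intuition about what this number means, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses small hypothetical values (not the housing data), and the helper name `r_squared` is illustrative.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

good = r_squared(y_true, [2.8, 5.1, 7.2, 8.9])      # close predictions
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])     # worse than the mean
print(good, baseline, worse)
```

A score of 0 matches the mean baseline, and a score below 0 means the predictions are worse than that baseline.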


<a id='P1'><h2> <img src="https://cdn.shopify.com/s/files/1/1200/7374/products/book_aec28e76-52ec-44ab-bc01-41df1279c89f_550x825.png?v=1473897430" width=25 align="left"> <font color="darkorange"> &nbsp; Practice I </font><font color="skyblue"> * </font></h2></a>

### Fitness Dataset

Based on this fitness dataset, predict the `RunTime` based on the following predictors:

1. Performance

### Read  Dataset

Read the `fitness-data.csv` as `fit_df`. 

In [14]:
fit_df = pd.read_csv('fitness-data.csv')
fit_df.head()

| | Name | Gender | RunTime | Age | Weight | Oxygen_Consumption | Run_Pulse | Rest_Pulse | Maximum_Pulse | Performance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Donna | F | 8.17 | 42.0 | 68.15 | 59.57 | 166.0 | 40.0 | 172.0 | 90.0 |
| 1 | Gracie | F | 8.63 | 38.0 | 81.87 | 60.06 | 170.0 | 48.0 | 186.0 | 94.0 |
| 2 | Luanne | F | 8.65 | 43.0 | 85.84 | 54.3 | 156.0 | 45.0 | 168.0 | 83.0 |
| 3 | Mimi | F | 8.92 | 50.0 | 70.87 | 54.63 | 146.0 | 48.0 | 155.0 | 67.0 |
| 4 | Chris | M | 8.95 | 49.0 | 81.42 | 49.16 | 180.0 | 44.0 | 185.0 | 72.0 |


### Train Test Split

Select the target `RunTime` and the predictor as single-column DataFrames (`runtime_df` and `performance_df`), then call `train_test_split()` to get the four splits of data. 

In [15]:
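runtime_df = fit_df[['RunTime']]
# NOTE: the brief asks for Performance, but this solution uses Run_Pulse as the predictor;
# the outputs in the following cells correspond to Run_Pulse
performance_df = fit_df[['Run_Pulse']]

X_train, X_test, y_train, y_test = train_test_split(performance_df, runtime_df, random_state=42)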

### Fit Model

Create a LinearRegression variable named `simple_performance_lr`. Using the training datasets, train the model.

In [16]:
simple_performance_lr = linear_model.LinearRegression()
simple_performance_lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpret

With the fitted model, interpret and create the linear equation of the model. 

In [17]:
print(simple_performance_lr.coef_)
print(simple_performance_lr.intercept_)

# RunTime = 3.309 + 0.042 (Run_Pulse)
# For every 1-unit increase in Run_Pulse, predicted run time increases by about 0.04 minutes

[[0.0416063]]
[3.30945214]


### Predict 

Using `.predict()`, predict with custom values. 

In [18]:
custom_pulse = 160

simple_performance_lr.predict([[custom_pulse]])

array([[9.96646011]])

### Assess Performance of Model

Using `r2_score`, calculate the R-squared score of the model on the test set.

In [19]:
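from sklearn.metrics import r2_score 
fitted_values = simple_performance_lr.predict(X_test)

# A negative R-squared means the model fits the test set worse than always predicting the mean of y_test
print(r2_score(y_test, fitted_values))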

-0.37580118367816695



<a id='3'><h2><img src="https://images.vexels.com/media/users/3/153978/isolated/preview/483ef8b10a46e28d02293a31570c8c56-warning-sign-colored-stroke-icon-by-vexels.png" width=23 align="left"><font color="salmon">&nbsp;3.</font><font color="salmon"> Multivariate Linear Regression </font> </h2></a>

<a id='3.1'><h3>3.1 Example - Housing Prices  </h3></a>

This time, we'll apply more variables to the model and drop variables where necessary.

In [21]:
# Read the predictors and the target from CSV: housing_x.csv and housing_y.csv
x_all = pd.read_csv('housing_x.csv')
y_all = pd.read_csv('housing_y.csv')
display(x_all.head())
display(y_all.head())

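| | sqft_living | sqm_living | floors |
|---|---|---|---|
| 0 | 3500 | 106.71 | 2.0 |
| 1 | 1180 | 35.98 | 1.0 |
| 2 | 1260 | 38.41 | 1.5 |
| 3 | 1520 | 46.34 | 1.0 |
| 4 | 1780 | 54.27 | 1.0 |


| | price |
|---|---|
| 0 | 788600 |
| 1 | 600000 |
| 2 | 523000 |
| 3 | 415000 |
| 4 | 535000 |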


### Feature Selection

Check for multicollinearity using `.corr()` and drop variables where necessary.

In [22]:
x_all.corr()

| | sqft_living | sqm_living | floors |
|---|---|---|---|
| sqft_living | 1.0 | 1.0 | 0.348178 |
| sqm_living | 1.0 | 1.0 | 0.348179 |
| floors | 0.348178 | 0.348179 | 1.0 |
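
Reading a correlation matrix by eye gets harder as the number of columns grows. As a rough sketch, the helper below (a hypothetical function, `highly_correlated_pairs`, with an arbitrary threshold of 0.9) lists the pairs of predictors whose absolute correlation exceeds the threshold. It is run here on only the five rows shown by `head()` earlier, so the exact correlations differ from the full-data matrix.

```python
import pandas as pd

# The five rows displayed by x_all.head(), not the full dataset
head_rows = pd.DataFrame({
    'sqft_living': [3500, 1180, 1260, 1520, 1780],
    'sqm_living': [106.71, 35.98, 38.41, 46.34, 54.27],
    'floors': [2.0, 1.0, 1.5, 1.0, 1.0],
})

def highly_correlated_pairs(df, threshold=0.9):
    """Return column pairs whose absolute correlation is above threshold."""
    corr = df.corr().abs()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair is reported once
        for second in columns[i + 1:]:
            if corr.loc[first, second] > threshold:
                pairs.append((first, second))
    return pairs

pairs = highly_correlated_pairs(head_rows)
print(pairs)
```

Each reported pair is a candidate for keeping only one of the two columns.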


### Train Test Split

Keep only the valid predictors for the model as a DataFrame, `housing_x`. Conduct `train_test_split()` to get the four splits of data. 

In [23]:
from sklearn.model_selection import train_test_split
# sqft_living and sqm_living are perfectly correlated (r = 1.0), so keep only one of them

housing_x = x_all[['sqft_living', 'floors']]

X_train, X_test, y_train, y_test = train_test_split(housing_x, y_all, random_state=42)

### Fit Model

Create a LinearRegression variable, `multi_housing_lr`. With the train dataframes, use `.fit()` to train the model.

In [24]:
multi_housing_lr = linear_model.LinearRegression()

multi_housing_lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpret

With the fitted model, interpret and create the linear equation of the model. 

In [25]:
print(multi_housing_lr.coef_)
print(multi_housing_lr.intercept_)

# Housing Price = -48237.32 + 274.02 (sqft_living) + 12566.87 (floors)

[[  274.0203467  12566.86687756]]
[-48237.31783364]
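
To see how the equation is used, the sketch below plugs the printed intercept and coefficients into plain Python. The house being priced (2,000 sqft, 2 floors) is a hypothetical input, and `predict_price` is an illustrative helper, not part of scikit-learn.

```python
# Values copied from the output of coef_ and intercept_ above
intercept = -48237.31783364
coef_sqft_living = 274.0203467
coef_floors = 12566.86687756

def predict_price(sqft_living, floors):
    """Evaluate the fitted linear equation for one house."""
    return intercept + coef_sqft_living * sqft_living + coef_floors * floors

price = predict_price(2000, 2)
print(round(price, 2))

# Holding floors fixed, one extra square foot changes the prediction by coef_sqft_living
extra_sqft_effect = predict_price(2001, 2) - predict_price(2000, 2)
print(round(extra_sqft_effect, 4))
```

Each coefficient is the change in predicted price for a one-unit change in that predictor while the other predictor is held constant.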


### Assess Performance of Model

Using `r2_score`, calculate the R-squared score of the model on the test set.

In [26]:
from sklearn.metrics import r2_score 
fitted_values = multi_housing_lr.predict(X_test)

print(r2_score(y_test, fitted_values))

0.4947420580577476


<a id='P2'><h2> <img src="https://cdn.shopify.com/s/files/1/1200/7374/products/book_aec28e76-52ec-44ab-bc01-41df1279c89f_550x825.png?v=1473897430" width=25 align="left"> <font color="darkorange"> &nbsp; Practice II </font><font color="skyblue"> * </font></h2></a>

### Fitness Dataset

Based on this fitness dataset, predict the `RunTime` based on the following predictors:

1. Age
2. Weight
3. Oxygen_Consumption
4. Run_Pulse
5. Rest_Pulse
6. Maximum_Pulse
7. Performance

### Read 

In [27]:
fit_df = pd.read_csv('fitness-data.csv')
fit_df.head()

| | Name | Gender | RunTime | Age | Weight | Oxygen_Consumption | Run_Pulse | Rest_Pulse | Maximum_Pulse | Performance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Donna | F | 8.17 | 42.0 | 68.15 | 59.57 | 166.0 | 40.0 | 172.0 | 90.0 |
| 1 | Gracie | F | 8.63 | 38.0 | 81.87 | 60.06 | 170.0 | 48.0 | 186.0 | 94.0 |
| 2 | Luanne | F | 8.65 | 43.0 | 85.84 | 54.3 | 156.0 | 45.0 | 168.0 | 83.0 |
| 3 | Mimi | F | 8.92 | 50.0 | 70.87 | 54.63 | 146.0 | 48.0 | 155.0 | 67.0 |
| 4 | Chris | M | 8.95 | 49.0 | 81.42 | 49.16 | 180.0 | 44.0 | 185.0 | 72.0 |


### Feature Selection

With the variables above, check for multicollinearity using `.corr()` and drop variables where necessary.

In [28]:
fit_df[['Age', 'Weight', 'Oxygen_Consumption', 'Run_Pulse', 'Rest_Pulse', 'Maximum_Pulse', 'Performance']].corr()

| | Age | Weight | Oxygen_Consumption | Run_Pulse | Rest_Pulse | Maximum_Pulse | Performance |
|---|---|---|---|---|---|---|---|
| Age | 1.0 | -0.240505 | -0.311618 | -0.316065 | -0.150873 | -0.414903 | -0.71257 |
| Weight | -0.240505 | 1.0 | -0.162891 | 0.181516 | 0.043974 | 0.249381 | 0.089741 |
| Oxygen_Consumption | -0.311618 | -0.162891 | 1.0 | -0.39808 | -0.399348 | -0.236767 | 0.778902 |
| Run_Pulse | -0.316065 | 0.181516 | -0.39808 | 1.0 | 0.352461 | 0.929754 | -0.029435 |
| Rest_Pulse | -0.150873 | 0.043974 | -0.399348 | 0.352461 | 1.0 | 0.305124 | -0.2256 |
| Maximum_Pulse | -0.414903 | 0.249381 | -0.236767 | 0.929754 | 0.305124 | 1.0 | 0.090016 |
| Performance | -0.71257 | 0.089741 | 0.778902 | -0.029435 | -0.2256 | 0.090016 | 1.0 |


### Train Test Split

Based on the correlation matrix, filter for the __valid__ predictor variables and do `train_test_split` with the target variable `RunTime`. 

In [29]:
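# Run_Pulse and Maximum_Pulse are highly correlated (r = 0.93), so one of them should be dropped.
# NOTE: this solution keeps all seven predictors, and the outputs in the following cells reflect that.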
all_x = fit_df[['Age', 'Weight', 'Oxygen_Consumption', 'Run_Pulse', 'Rest_Pulse', 'Maximum_Pulse', 'Performance']]
all_y = fit_df[['RunTime']]

X_train, X_test, y_train, y_test = train_test_split(all_x, all_y, random_state=42)
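
If one of the two correlated pulse columns is dropped, as the comment in the cell above suggests, the selection could look like the sketch below. It uses only the five rows shown by `head()` and removes `Maximum_Pulse`; dropping `Run_Pulse` instead would be equally reasonable. This is an alternative to the cell above, so the coefficients and R-squared reported later in this notebook do not correspond to it.

```python
import pandas as pd

# Predictor columns for the five rows displayed by fit_df.head()
sample_x = pd.DataFrame({
    'Age': [42.0, 38.0, 43.0, 50.0, 49.0],
    'Weight': [68.15, 81.87, 85.84, 70.87, 81.42],
    'Oxygen_Consumption': [59.57, 60.06, 54.3, 54.63, 49.16],
    'Run_Pulse': [166.0, 170.0, 156.0, 146.0, 180.0],
    'Rest_Pulse': [40.0, 48.0, 45.0, 48.0, 44.0],
    'Maximum_Pulse': [172.0, 186.0, 168.0, 155.0, 185.0],
    'Performance': [90.0, 94.0, 83.0, 67.0, 72.0],
})

# drop() returns a new DataFrame and leaves sample_x unchanged
reduced_x = sample_x.drop(columns=['Maximum_Pulse'])
print(list(reduced_x.columns))
```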

### Fit Model

Create a LinearRegression variable called `multi_runtime_lr`, then use `.fit()` with the train dataframes to train the model.

In [30]:
multi_runtime_lr = linear_model.LinearRegression()

multi_runtime_lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpret

With the fitted model, interpret and create the linear equation of the model. 

In [31]:
print(multi_runtime_lr.coef_)
print(multi_runtime_lr.intercept_)

# Run Time = 23.91 - 0.184 (Age) + 0.014 (Weight) - 0.018 (Oxygen_Consumption) 
#            + 0.0013 (Run_Pulse) + 0.004 (Rest_Pulse) + 0.0008 (Maximum_Pulse) 
#            - 0.095 (Performance)

[[-0.18353192  0.01424313 -0.01806247  0.00129823  0.00402354  0.00080929
  -0.09518856]]
[23.91333899]


### Assess Performance of Model

Using `r2_score`, calculate the R-squared score of the model on the test set.

In [35]:
from sklearn.metrics import r2_score 
fitted_values = multi_runtime_lr.predict(X_test)
p = 7  # number of predictors; not used by r2_score below

print(r2_score(y_test, fitted_values))

0.9784039385590093
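
The cell above sets the number of predictors but never uses it. One common use for that count is the adjusted R-squared, which penalises a model for each extra predictor. Whether that was the intent here is an assumption, and the sketch below uses hypothetical inputs, not the fitness results, because the number of test rows is not shown in this notebook.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_samples - n_predictors - 1
    if degrees_of_freedom <= 0:
        # The formula is undefined unless there are more samples than predictors + 1
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / degrees_of_freedom

# Hypothetical example: R-squared of 0.9 from 50 samples and 7 predictors
example = adjusted_r2(0.9, n_samples=50, n_predictors=7)
print(round(example, 4))
```

With few rows and many predictors the adjustment becomes large, and with `n_samples <= n_predictors + 1` it cannot be computed at all.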
