# Linear Regression
### A motivating example
$$y = m x + b$$
or
$$\mbox{weight} = m \times \mbox{height} + b$$

<img src="Formula_Line.JPG" alt="Drawing" style="width: 550px;"/>

### Example 1 - Height vs. Weight 

We'll look at the problem of predicting **weight** from a given **height**.


* This is a simple dataset that you can download from Kaggle:

  https://www.kaggle.com/mustafaali96/weight-height
  
  
#### What are we trying to predict?

$$ \mbox{weight} = \theta_0 + \theta_1 \times \mbox{height}$$
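For reference, the `LinearRegression` model fitted later in this example uses ordinary least squares: it picks the parameters that minimise the sum of squared differences between the observed and the predicted weights. Written out for the $n$ rows of the dataset (the row index $i$ and the hats are notation introduced here for illustration):

$$ (\hat\theta_0, \hat\theta_1) = \arg\min_{\theta_0, \theta_1} \sum_{i=1}^{n} \left( \mbox{weight}_i - \theta_0 - \theta_1 \times \mbox{height}_i \right)^2 $$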

In [1]:
import pandas as pd
df = pd.read_csv("weight_height.csv", sep=',', header=0)
df.head()

|   | Gender | Height | Weight |
|---|--------|-----------|------------|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.04247 |
| 4 | Male | 69.881796 | 206.349801 |


**Code: Extracting weight and height**

In [3]:
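y = df['Weight']   # target: the value we want to predict
X = df['Height']   # the single input feature

# Sanity check: both series should have the same number of rows
print(len(X), len(y))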

10000 10000


**Code: Simple linear regression model**

In [5]:
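from sklearn import linear_model

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
X = X.to_numpy().reshape(-1, 1)

# Fit weight = intercept + coefficient * height by ordinary least squares
regr = linear_model.LinearRegression()
regr.fit(X, y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)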

Intercept: 
 -350.7371918121373
Coefficients: 
 [7.71728764]
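The fitted numbers can be read as a line: each extra unit of height adds about 7.72 units of weight. As a minimal sketch, the snippet below plugs the printed intercept and coefficient back into the equation by hand. The helper name `predict_weight` and the input height of 70 are illustrative choices, not part of the original notebook, and the units are whatever the dataset uses.

```python
# Intercept and slope copied from the printed output above
intercept = -350.7371918121373
slope = 7.71728764


def predict_weight(height):
    """Return the weight predicted by the fitted line for a given height."""
    return intercept + slope * height


# 70 is an arbitrary illustrative height, not a value taken from the dataset
print(predict_weight(70.0))
```

With the fitted model still in memory, `regr.predict([[70.0]])` should give essentially the same value, since it evaluates the same line.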


### Example 2 - Air quality prediction

We'll look at the problem of predicting **air quality** in Beijing, as measured by an index called pm2.5.

* This is a "simpler" dataset than some of the others we've been working with, as the relevant features are all real-valued

<img src="Beijing_PM25.JPG" alt="Drawing" style="width: 600px;"/>


#### What are we trying to predict?

$$ \mbox{pm2.5} = \theta_0 + \theta_1 \times \mbox{temp}$$


In [1]:
import pandas as pd
df = pd.read_csv("Datasets/Beijing_PM25_air_data.csv", sep=',', header=0)
df.head()

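|   | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir |
|---|----|------|-------|-----|------|-------|------|-------|--------|------|-------|----|----|
| 0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
| 1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
| 2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
| 3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
| 4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |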


**Code: Extracting the X and Y**

In [2]:
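# Keep only the rows where the target (pm2.5) was actually measured
dataset = df.dropna(subset=['pm2.5'])
y = dataset['pm2.5']   # target
X = dataset['TEMP']    # single input feature: temperature

# Sanity check: both series should have the same number of rows
print(len(X), len(y))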

41757 41757
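To make the effect of `dropna(subset=['pm2.5'])` concrete, here is a small sketch on a toy frame. The values are made up for illustration; only the column names mirror the real dataset. It shows that `subset` restricts the missing-value check to the listed column, so a row with a missing `TEMP` but a present `pm2.5` is kept.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the real dataset
toy = pd.DataFrame({
    "pm2.5": [np.nan, 129.0, np.nan, 148.0],
    "TEMP": [-11.0, -4.0, -5.0, np.nan],
})

# Drop only the rows whose target is missing; a missing TEMP is left alone
kept = toy.dropna(subset=["pm2.5"])

print(len(toy), len(kept))
print(kept)
```

Note that the original row labels are preserved after dropping, so the index of the result is no longer contiguous.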


**Code: Simple linear regression model**

In [6]:
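from sklearn import linear_model

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
X = X.to_numpy().reshape(-1, 1)

# Fit pm2.5 = intercept + coefficient * TEMP by ordinary least squares
regr = linear_model.LinearRegression()
regr.fit(X, y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)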

Intercept: 
 107.10183392374576
Coefficients: 
 [-0.68447989]
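The `reshape(-1, 1)` step in the cell above is easy to overlook. A single pandas column converts to a 1-D array, whereas scikit-learn estimators expect a 2-D array with one row per sample and one column per feature. The short sketch below uses the first five `TEMP` values from the table above; the variable names `temps` and `column` are illustrative.

```python
import numpy as np

# First five TEMP values shown in the df.head() output above
temps = np.array([-11.0, -12.0, -11.0, -14.0, -12.0])
print(temps.shape)   # 1-D: (5,)

# -1 lets NumPy infer the number of rows; 1 asks for a single column
column = temps.reshape(-1, 1)
print(column.shape)  # 2-D: (5, 1)
```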


### With more dimensions 

#### What are we trying to predict?

$$ \mbox{pm2.5} = \theta_0 + \theta_1 \times \mbox{temp} + \theta_2 \times \mbox{hour}$$

In [7]:
# Selecting two columns gives a 2-D DataFrame, so no reshape is needed here
X = dataset[['TEMP', 'hour']]

regr = linear_model.LinearRegression()
regr.fit(X,y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 108.4637133461681
Coefficients: 
 [-0.67340075 -0.13034581]
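To see what the fit does with more than one feature, the following sketch solves the same kind of least-squares problem directly with NumPy. It runs on synthetic data, not the Beijing dataset: the arrays `temp` and `hour` are random stand-ins, and the values in `theta_true` are made up so that the recovered parameters can be checked. The intercept is handled by adding a column of ones to the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for TEMP and hour; NOT the Beijing data
temp = rng.uniform(-15, 35, size=200)
hour = rng.integers(0, 24, size=200).astype(float)

# Made-up "true" parameters used to generate a noise-free target
theta_true = np.array([100.0, -0.5, -0.1])
y = theta_true[0] + theta_true[1] * temp + theta_true[2] * hour

# Design matrix: a column of ones for the intercept, then one column per feature
A = np.column_stack([np.ones_like(temp), temp, hour])

# Least-squares solution; the first entry plays the role of the intercept
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta_hat)
```

Because the synthetic target contains no noise, `theta_hat` should match `theta_true` up to floating-point error. On real data such as the air-quality table, the fit instead returns the parameters that make the squared prediction errors as small as possible.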


## Summary of concepts

* Demonstrated how to perform simple linear regression in Python with scikit-learn, first with one feature and then with two