## Regression
* Searches for relationships among variables
* Finds the function that maps independent variables (inputs/features) to dependent variables (outputs/responses)
* The dependent variable is usually continuous and unbounded

## Linear Regression
$y = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_r x_r + \varepsilon$, with $x_0 = 1$

$y$ - output

$\theta_0, \ldots, \theta_r$ - parameters

$x_0, \ldots, x_r$ - features/inputs

$\varepsilon$ - error
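
As a minimal sketch of what fitting this equation means, the parameters can be estimated by ordinary least squares with plain NumPy. The data are the univariate sample used in the scikit-learn cells of this notebook; the names `X` and `theta` are illustrative, and `np.linalg.lstsq` is only a stand-in for whatever solver a library uses internally.

```python
import numpy as np

# Univariate sample (same values as the scikit-learn example in this notebook)
x = np.array([5, 15, 25, 45, 37, 67, 55], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# Design matrix with x0 = 1 in the first column, so theta[0] is the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of theta = (theta_0, theta_1)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residuals play the role of the error term epsilon
residuals = y - X @ theta
print(theta)
```

The two entries of `theta` should agree with the intercept and slope that `LinearRegression` reports for the same data.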

## Implementation in Python

## Univariate Linear Regression

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
x = np.array([5,15,25,45,37,67,55]).reshape((-1,1))
y = np.array([110,220,310,560,412,732,640])

In [3]:
x.shape, y.shape

((7, 1), (7,))

In [4]:
model = LinearRegression().fit(x, y)

In [5]:
r_sq = model.score(x, y)
print(f'R square (coefficient of determination): {r_sq}')

R square (coefficient of determination): 0.9911353350742254
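
The score above is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The following sketch recomputes it by hand, assuming a straight-line fit obtained with `np.polyfit` instead of scikit-learn; the variable names are illustrative.

```python
import numpy as np

x = np.array([5, 15, 25, 45, 37, 67, 55], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# For deg=1, np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
r_sq_manual = 1 - ss_res / ss_tot
print(r_sq_manual)
```

The printed value should match `model.score(x, y)` up to floating-point noise.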


In [6]:
print(f'intercept (θ0): {model.intercept_}')
print(f'slope (θ1): {model.coef_}')

intercept (θ0): 59.802734375
slope (θ1): [10.30273438]


In [7]:
y_pred = model.predict(x)
y_pred

array([111.31640625, 214.34375   , 317.37109375, 523.42578125,
       441.00390625, 750.0859375 , 626.453125  ])

In [8]:
y_pred = model.intercept_ + model.coef_*x
y_pred

array([[111.31640625],
       [214.34375   ],
       [317.37109375],
       [523.42578125],
       [441.00390625],
       [750.0859375 ],
       [626.453125  ]])
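
The manual prediction comes back as a column of shape `(7, 1)` because `model.coef_ * x` broadcasts against the 2-D input, whereas `model.predict` returns shape `(7,)`. A small sketch of the difference, with the fitted values copied from the output above as stand-ins for `model.intercept_` and `model.coef_`:

```python
import numpy as np

x = np.array([5, 15, 25, 45, 37, 67, 55]).reshape((-1, 1))

# Fitted values copied from the notebook output (stand-ins for the model attributes)
intercept = 59.802734375
coef = np.array([10.302734375])

y_col = intercept + coef * x     # element-wise broadcast keeps x's (7, 1) shape
y_flat = intercept + x @ coef    # matrix-vector product yields shape (7,)

print(y_col.shape, y_flat.shape)  # (7, 1) (7,)
```

Either `x @ coef` or `y_col.ravel()` gives a 1-D result comparable to the output of `predict`.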

In [9]:
y_pred_new = model.predict(np.arange(10).reshape(-1,1))
y_pred_new

array([ 59.80273438,  70.10546875,  80.40820312,  90.7109375 ,
       101.01367188, 111.31640625, 121.61914062, 131.921875  ,
       142.22460938, 152.52734375])

## Multivariate Linear Regression 

In [10]:
x = np.array([[5,45,78],[15,37,38],[25,47,29],[45,34,89],[37,26,18],[67,17,49],[55,16,18]])
y = np.array([110,220,310,560,412,732,640])

In [11]:
model = LinearRegression().fit(x, y)

In [12]:
r_sq = model.score(x, y)
print(f'R square (coefficient of determination): {r_sq}')

R square (coefficient of determination): 0.993871363946136


In [13]:
print(f'intercept (θ0): {model.intercept_}')
print(f'coefficients (θ1..θ3): {model.coef_}')

intercept (θ0): 32.22988777391896
coefficients (θ1..θ3): [10.45292976  0.09564294  0.42124935]


In [14]:
y_pred = model.predict(x)
y_pred

array([121.65591833, 208.57009829, 310.2645811 , 543.35477916,
       429.05749357, 754.84332975, 616.2537998 ])

In [15]:
y_pred = model.intercept_ + np.sum(model.coef_*x, axis=1)
y_pred

array([121.65591833, 208.57009829, 310.2645811 , 543.35477916,
       429.05749357, 754.84332975, 616.2537998 ])
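
The same fit can be sketched without scikit-learn by adding a column of ones to the inputs and solving the least-squares problem directly. This is an illustration under that assumption, not a description of scikit-learn's internals; `X`, `theta` and `coefs` are illustrative names.

```python
import numpy as np

x = np.array([[5, 45, 78], [15, 37, 38], [25, 47, 29], [45, 34, 89],
              [37, 26, 18], [67, 17, 49], [55, 16, 18]], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# Column of ones for the intercept, followed by the three features
X = np.column_stack([np.ones(len(x)), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, coefs = theta[0], theta[1:]

# One dot product per row replaces np.sum(coef * x, axis=1)
y_hat = intercept + x @ coefs
print(intercept, coefs)
```

Writing the prediction as `x @ coefs` is equivalent to the `np.sum(model.coef_*x, axis=1)` form used above and is usually easier to read.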

## Polynomial Regression

In [16]:
from sklearn.preprocessing import PolynomialFeatures

In [17]:
x = np.array([5,15,25,45,37,67,55]).reshape((-1,1))
y = np.array([110,220,310,560,412,732,640])

## include_bias=False in the PolynomialFeatures transformer

In [18]:
transformer = PolynomialFeatures(degree=2, include_bias=False)
transformer.fit(x)
x_ = transformer.transform(x)
print(x_)

model = LinearRegression().fit(x_,y)

r_sq = model.score(x_,y)

print(f'R square (coefficient of determination): {r_sq}')
print(f'intercept (θ0): {model.intercept_}')
print(f'coefficients (θ1, θ2): {model.coef_}')

[[   5.   25.]
 [  15.  225.]
 [  25.  625.]
 [  45. 2025.]
 [  37. 1369.]
 [  67. 4489.]
 [  55. 3025.]]
R square (coefficient of determination): 0.991415533081388
intercept (θ0): 51.645334294410304
coefficients (θ1, θ2): [ 1.09827006e+01 -9.52302405e-03]


## include_bias=True in the PolynomialFeatures transformer

In [19]:
transformer = PolynomialFeatures(degree=2, include_bias=True)
transformer.fit(x)
x_ = transformer.transform(x)
print(x_)

model = LinearRegression().fit(x_,y)

r_sq = model.score(x_,y)

print(f'R square (coefficient of determination): {r_sq}')
print(f'intercept (θ0): {model.intercept_}')
print(f'coefficients (bias column, θ1, θ2): {model.coef_}')

[[1.000e+00 5.000e+00 2.500e+01]
 [1.000e+00 1.500e+01 2.250e+02]
 [1.000e+00 2.500e+01 6.250e+02]
 [1.000e+00 4.500e+01 2.025e+03]
 [1.000e+00 3.700e+01 1.369e+03]
 [1.000e+00 6.700e+01 4.489e+03]
 [1.000e+00 5.500e+01 3.025e+03]]
R square (coefficient of determination): 0.991415533081388
intercept (θ0): 51.645334294406155
coefficients (bias column, θ1, θ2): [ 0.00000000e+00  1.09827006e+01 -9.52302405e-03]
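
Both settings give the same fit because `LinearRegression` estimates its own intercept by default, so the extra column of ones is redundant and its coefficient comes out as 0. The sketch below shows the other way to use that column: solve least squares on `[1, x, x^2]` with no separate intercept, so the first coefficient carries it. It uses plain NumPy as an assumption-light illustration; `X_poly` and `theta` are illustrative names.

```python
import numpy as np

x = np.array([5, 15, 25, 45, 37, 67, 55], dtype=float)
y = np.array([110, 220, 310, 560, 412, 732, 640], dtype=float)

# Same columns as PolynomialFeatures(degree=2, include_bias=True): [1, x, x^2]
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])

# No separate intercept here: the column of ones carries it
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

# theta[0] is the intercept, theta[1:] are the polynomial coefficients
print(theta)
```

In scikit-learn terms this corresponds to pairing `include_bias=True` with `LinearRegression(fit_intercept=False)`; with the default `fit_intercept=True`, `include_bias=False` avoids the duplicate column.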


## Advanced Linear Regression with statsmodels

In [20]:
import statsmodels.api as sm

In [21]:
x = [[0,1],[5,1],[15,2],[25,5],[35,11],[45,15],[55,34],[60,35]]
y = [4,5,20,14,32,22,38,43]
x,y = np.array(x),np.array(y)

In [22]:
x = sm.add_constant(x)    # prepends a column of ones for the intercept (bias) term
x

array([[ 1.,  0.,  1.],
       [ 1.,  5.,  1.],
       [ 1., 15.,  2.],
       [ 1., 25.,  5.],
       [ 1., 35., 11.],
       [ 1., 45., 15.],
       [ 1., 55., 34.],
       [ 1., 60., 35.]])

In [23]:
model = sm.OLS(y,x)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.862
Model:                            OLS   Adj. R-squared:                  0.806
Method:                 Least Squares   F-statistic:                     15.56
Date:                Mon, 09 Aug 2021   Prob (F-statistic):            0.00713
Time:                        16:40:45   Log-Likelihood:                -24.316
No. Observations:                   8   AIC:                             54.63
Df Residuals:                       5   BIC:                             54.87
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.5226      4.431      1.246      0.2



In [28]:
results.predict(sm.add_constant(np.arange(10).reshape((-1, 2)))), np.arange(10).reshape((-1, 2))

(array([ 5.77760476,  7.18179502,  8.58598528,  9.99017554, 11.3943658 ]),
 array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]))
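
The new inputs need the same column of ones as the training data, which is why `sm.add_constant` is applied again before `predict`. A NumPy-only sketch of the whole round trip, assuming ordinary least squares via `np.linalg.lstsq`; `add_ones` is a hypothetical helper that mimics what `sm.add_constant` does for this data, not a statsmodels function.

```python
import numpy as np

x = np.array([[0, 1], [5, 1], [15, 2], [25, 5],
              [35, 11], [45, 15], [55, 34], [60, 35]], dtype=float)
y = np.array([4, 5, 20, 14, 32, 22, 38, 43], dtype=float)


def add_ones(a):
    """Prepend a column of ones (hypothetical stand-in for sm.add_constant)."""
    return np.column_stack([np.ones(len(a)), a])


# Fit: params[0] is the constant, params[1:] the two feature coefficients
params, *_ = np.linalg.lstsq(add_ones(x), y, rcond=None)

# Predict on new rows, which must get the same constant column
x_new = np.arange(10).reshape((-1, 2))
y_new = add_ones(x_new) @ params
print(y_new)
```

The values in `y_new` should agree with the `results.predict(...)` output above.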

## Interpreting Linear Regression using statsmodels api

In [31]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

In [34]:
df = sm.datasets.get_rdataset("Guerry","HistData").data
df = df[['Lottery','Literacy','Wealth','Region']].dropna()
df.head()

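|   | Lottery | Literacy | Wealth | Region |
|---|---------|----------|--------|--------|
| 0 | 41 | 37 | 73 | E |
| 1 | 38 | 51 | 22 | N |
| 2 | 66 | 13 | 61 | C |
| 3 | 80 | 46 | 76 | E |
| 4 | 79 | 69 | 83 | E |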


In [35]:
len(df)

85

In [36]:
df.Region.value_counts()

N    17
C    17
W    17
S    17
E    17
Name: Region, dtype: int64

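## Terms in summary:
* Df Residuals: n - k - 1 (number of observations minus number of predictors, minus 1 for the intercept)
* Covariance Type: how the covariance of the coefficient estimates (and hence the standard errors) is computed; `nonrobust` assumes constant error variance
* R-squared: the most important figure; the fraction of the variance in the dependent variable that the model explains
* Property of linear regression: adding more variables never reduces R-squared; it stays the same or increases
* Adjusted R-squared: penalizes R-squared for variables that do not contribute to the fit
* F-statistic: tests the statistical significance of the whole group of variables against the null hypothesis that all their coefficients are zero
* Log-Likelihood: the log of the likelihood of the data under the fitted model; higher is better, and it feeds into AIC and BIC
* AIC/BIC: used for model and feature selection; lower values indicate a better trade-off between fit and complexity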
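
Several of these terms can be recomputed from one another. The sketch below plugs in the rounded figures printed in the Guerry model summary in this notebook (85 observations, Df Model 6, R-squared 0.338, Log-Likelihood -375.30) and applies the standard OLS formulas; because the inputs are rounded, the results only match the summary approximately.

```python
import math

# Rounded values read off the Guerry OLS summary in this notebook
n = 85            # No. Observations
k = 6             # Df Model: predictors, excluding the intercept
r_sq = 0.338      # R-squared
llf = -375.30     # Log-Likelihood

df_resid = n - k - 1                             # Df Residuals
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / df_resid   # Adjusted R-squared
f_stat = (r_sq / k) / ((1 - r_sq) / df_resid)    # F-statistic

n_params = k + 1                                 # coefficients including the intercept
aic = 2 * n_params - 2 * llf
bic = n_params * math.log(n) - 2 * llf

print(df_resid, round(adj_r_sq, 3), round(aic, 1), round(bic, 1))
```

The adjusted R-squared formula shows the penalty directly: for a fixed R-squared, raising `k` shrinks `df_resid` and pulls the adjusted value down.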

In [38]:
mod = smf.ols(formula = 'Lottery ~ Literacy + Wealth + Region', data = df)
# formula: 'dependent_variable ~ combination of independent variables'
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Mon, 09 Aug 2021   Prob (F-statistic):           1.07e-05
Time:                        16:57:03   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

In [40]:
res = smf.ols(formula = 'Lottery ~ Literacy + Wealth + C(Region)', data = df).fit()
print(res.params)

Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64
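
`C(Region)` expands the categorical column into dummy variables with treatment coding: the first level, `C`, becomes the baseline and has no coefficient of its own, and each `T.x` coefficient is the shift relative to that baseline. A hand-rolled sketch of how the parameters combine, with the values copied from the output above; `predict_lottery` is a hypothetical helper, not a statsmodels API.

```python
# Coefficients copied from the res.params output above
params = {
    "Intercept": 38.651655,
    "C(Region)[T.E]": -15.427785,
    "C(Region)[T.N]": -10.016961,
    "C(Region)[T.S]": -4.548257,
    "C(Region)[T.W]": -10.091276,
    "Literacy": -0.185819,
    "Wealth": 0.451475,
}


def predict_lottery(literacy, wealth, region):
    """Combine the coefficients by hand (hypothetical helper for illustration)."""
    # Region 'C' is the baseline: it has no dummy term, so it contributes 0
    region_effect = params.get(f"C(Region)[T.{region}]", 0.0)
    return (params["Intercept"] + region_effect
            + params["Literacy"] * literacy
            + params["Wealth"] * wealth)


# First row of df.head(): Literacy 37, Wealth 73, Region E
pred_e = predict_lottery(37, 73, "E")
# Same inputs in the baseline region C
pred_c = predict_lottery(37, 73, "C")
print(pred_e, pred_c)
```

The gap between the two predictions equals the `C(Region)[T.E]` coefficient, which is how each region term should be read.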
