# Regression II

- Review of simple linear regression; multiple linear regression works the same way, with more predictor variables and, optionally, interactions between them
- `import statsmodels.formula.api as smf; smf.ols(formula='Y ~ C(X1)*X2', data=Dataframe).fit().summary()`
- `import statsmodels.api as sm; sm.OLS(Y,X).fit().summary() # Y outcome (endogenous); X covariate or feature (exogenous)`
- Predictors can be categorical (`C(X1)`) or quantitative (`X2`); formula terms combine by addition (`+`) or multiplication (`*`, which also adds interactions)

- Formula: $$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \cdots + \epsilon_i \quad\quad \text{ with an } \epsilon_i \sim N(0, \sigma) \text{ assumption}$$

- Indicators: a categorical variable with levels __low__, __medium__, and __high__ enters the model through one indicator per non-baseline level; the omitted level (here __high__) is the baseline absorbed into $\beta_0$. $$E[y_i] = \beta_0 + \beta_{\text{low}}\underbrace{1_{[\text{low}]}(\text{variable}_i)}_{\text{$1$ if variable$_i$ is "low"; else, $0$}} + \beta_{\text{medium}}\underbrace{1_{[\text{medium}]}(\text{variable}_i)}_{\text{$1$ if variable$_i$ is "medium"; else, $0$}}$$
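
The indicator coding can be sketched in a few lines. This is a minimal illustration, not output from the Pokémon data: the `levels` array and the three coefficient values are made-up assumptions, and "high" is treated as the baseline because it has no indicator in the formula.

```python
import numpy as np

# Hypothetical observations of a three-level categorical variable
levels = np.array(["low", "medium", "high", "medium", "low", "high"])

# Indicator functions: 1 where the level matches, 0 otherwise
is_low = (levels == "low").astype(int)
is_medium = (levels == "medium").astype(int)

# Made-up coefficients, for illustration only
beta_0, beta_low, beta_medium = 10.0, -3.0, -1.0

# E[y_i] = beta_0 + beta_low * 1[low] + beta_medium * 1[medium]
expected_y = beta_0 + beta_low * is_low + beta_medium * is_medium
print(expected_y)  # rows with level "high" get beta_0 alone
```

Each coefficient is therefore a difference from the baseline level, which is how the `C(Generation)[T.k]` rows in the summaries further down should be read.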

## Simple Linear Regression

In [7]:
import pandas as pd

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokemon = pd.read_csv(url) # convert csv to dataframe
pokemon # Observe the Dataframe we have

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [8]:
import statsmodels.formula.api as smf

smf.ols(formula='HP ~ Q("Sp. Def")', data=pokemon).fit().summary() # Can we use "Sp. Def" to predict HP values? Q() quotes column names that are not valid Python identifiers

0,1,2,3
Dep. Variable:,HP,R-squared:,0.143
Model:,OLS,Adj. R-squared:,0.142
Method:,Least Squares,F-statistic:,133.6
Date:,"Thu, 09 May 2024",Prob (F-statistic):,1.1e-28
Time:,13:46:40,Log-Likelihood:,-3664.8
No. Observations:,800,AIC:,7334.0
Df Residuals:,798,BIC:,7343.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,44.2729,2.318,19.103,0.000,39.724,48.822
"Q(""Sp. Def"")",0.3475,0.030,11.559,0.000,0.288,0.407

0,1,2,3
Omnibus:,322.832,Durbin-Watson:,1.496
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2653.174
Skew:,1.607,Prob(JB):,0.0
Kurtosis:,11.323,Cond. No.,214.0


In [13]:
import plotly.express as px
px.scatter(pokemon, x="Sp. Def", y="HP", trendline='ols')
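
To read the fitted line, plug a value into the estimated equation. The sketch below hard-codes the two coefficients reported in the summary table (44.2729 and 0.3475); the helper name `predict_hp` and the input value of 100 are illustrative assumptions.

```python
# Coefficients copied from the simple-regression summary table
INTERCEPT = 44.2729  # estimated HP when "Sp. Def" is 0
SLOPE = 0.3475       # estimated change in HP per one-point increase in "Sp. Def"

def predict_hp(sp_def):
    """Point prediction of HP from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * sp_def

print(round(predict_hp(100), 4))
```

An R-squared of 0.143 means this line explains only about 14% of the variation in HP, so individual predictions can be far off even though the slope is clearly non-zero.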

## Multiple Linear Regression

In [15]:
import statsmodels.formula.api as smf

# Can we use "Sp. Def" and "Generation" together to predict HP values?
smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation)', data=pokemon).fit().summary() 

0,1,2,3
Dep. Variable:,HP,R-squared:,0.153
Model:,OLS,Adj. R-squared:,0.147
Method:,Least Squares,F-statistic:,23.93
Date:,"Thu, 09 May 2024",Prob (F-statistic):,4.23e-26
Time:,14:04:27,Log-Likelihood:,-3660.1
No. Observations:,800,AIC:,7334.0
Df Residuals:,793,BIC:,7367.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,41.8745,2.774,15.095,0.000,36.429,47.320
C(Generation)[T.2],3.7194,2.936,1.267,0.206,-2.044,9.482
C(Generation)[T.3],-0.0153,2.614,-0.006,0.995,-5.146,5.115
C(Generation)[T.4],4.4562,2.830,1.575,0.116,-1.098,10.011
C(Generation)[T.5],6.0902,2.593,2.349,0.019,1.001,11.180
C(Generation)[T.6],0.4389,3.188,0.138,0.891,-5.819,6.697
"Q(""Sp. Def"")",0.3466,0.030,11.488,0.000,0.287,0.406

0,1,2,3
Omnibus:,332.044,Durbin-Watson:,1.514
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.657
Skew:,1.646,Prob(JB):,0.0
Kurtosis:,11.678,Cond. No.,457.0
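
In the additive model each generation shifts the intercept but shares one slope, so the fitted lines are parallel. The sketch below hard-codes the coefficients from the table; `predict_hp` and the chosen inputs are illustrative assumptions, and Generation 1 is the baseline with an offset of 0.

```python
# Coefficients copied from the additive-model summary table
INTERCEPT = 41.8745
SP_DEF_SLOPE = 0.3466
GENERATION_OFFSET = {1: 0.0, 2: 3.7194, 3: -0.0153, 4: 4.4562, 5: 6.0902, 6: 0.4389}

def predict_hp(sp_def, generation):
    """Point prediction: shared slope, generation-specific intercept (hypothetical helper)."""
    return INTERCEPT + GENERATION_OFFSET[generation] + SP_DEF_SLOPE * sp_def

# The gap between two generations is the same at every "Sp. Def" value
gaps = [predict_hp(sp_def, 5) - predict_hp(sp_def, 1) for sp_def in (50, 100)]
print([round(g, 4) for g in gaps])
```

Only the Generation 5 offset has a p-value below 0.05 in the table, and R-squared barely moves (0.143 to 0.153), so Generation adds little beyond "Sp. Def" in this additive form.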


### Interaction

In [16]:
import statsmodels.formula.api as smf

# Does the relationship between "Sp. Def" and "HP" differ by "Generation"?
smf.ols(formula='HP ~ Q("Sp. Def")*C(Generation)', data=pokemon).fit().summary() 

0,1,2,3
Dep. Variable:,HP,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.164
Method:,Least Squares,F-statistic:,15.27
Date:,"Thu, 09 May 2024",Prob (F-statistic):,3.5e-27
Time:,14:06:11,Log-Likelihood:,-3649.4
No. Observations:,800,AIC:,7323.0
Df Residuals:,788,BIC:,7379.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.8971,5.246,5.127,0.000,16.599,37.195
C(Generation)[T.2],20.0449,7.821,2.563,0.011,4.692,35.398
C(Generation)[T.3],21.3662,6.998,3.053,0.002,7.629,35.103
C(Generation)[T.4],31.9575,8.235,3.881,0.000,15.793,48.122
C(Generation)[T.5],9.4926,7.883,1.204,0.229,-5.982,24.968
C(Generation)[T.6],22.2693,8.709,2.557,0.011,5.173,39.366
"Q(""Sp. Def"")",0.5634,0.071,7.906,0.000,0.423,0.703
"Q(""Sp. Def""):C(Generation)[T.2]",-0.2350,0.101,-2.316,0.021,-0.434,-0.036
"Q(""Sp. Def""):C(Generation)[T.3]",-0.3067,0.093,-3.300,0.001,-0.489,-0.124

0,1,2,3
Omnibus:,337.229,Durbin-Watson:,1.505
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.522
Skew:,1.684,Prob(JB):,0.0
Kurtosis:,11.649,Cond. No.,1400.0
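
With an interaction, each generation gets its own intercept and its own slope. The sketch below combines the coefficients from the table to recover those lines; it covers only Generations 1 to 3 because those are the interaction rows visible in the table, and `line_for` is a hypothetical helper name.

```python
# Coefficients copied from the interaction-model summary table
BASE_INTERCEPT = 26.8971  # Generation 1 intercept (baseline)
BASE_SLOPE = 0.5634       # Generation 1 slope on "Sp. Def" (baseline)
INTERCEPT_SHIFT = {1: 0.0, 2: 20.0449, 3: 21.3662}  # C(Generation)[T.k]
SLOPE_SHIFT = {1: 0.0, 2: -0.2350, 3: -0.3067}      # Q("Sp. Def"):C(Generation)[T.k]

def line_for(generation):
    """Return (intercept, slope) of the fitted line for one generation."""
    intercept = BASE_INTERCEPT + INTERCEPT_SHIFT[generation]
    slope = BASE_SLOPE + SLOPE_SHIFT[generation]
    return intercept, slope

for generation in (1, 2, 3):
    intercept, slope = line_for(generation)
    print(generation, round(intercept, 4), round(slope, 4))
```

The negative interaction coefficients mean the "Sp. Def" slope is flatter in Generations 2 and 3 than in Generation 1, while their intercepts are higher.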


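#### _"The condition number is large, 1.4e+03. This might indicate that there are strong multicollinearity or other numerical problems."_ warns about multicollinearity or numerical instability rather than overfitting directly, though the extra interaction terms also raise the risk of OVERFITTING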
![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*_7OPgojau8hkiPUiHoGK_w.png)
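
A large condition number often reflects how the design-matrix columns are scaled or correlated. The sketch below is a hedged illustration on a synthetic predictor (an assumption, not the Pokémon data): it computes the 2-norm condition number with `numpy.linalg.cond` for an intercept-plus-predictor matrix before and after standardizing the predictor. It is meant to be analogous to the "Cond. No." entry, not a reproduction of it.

```python
import numpy as np

# Synthetic predictor on roughly the scale of a stat column (assumption)
x = np.linspace(20, 230, 50)
ones = np.ones_like(x)

# Raw design matrix: intercept column plus unscaled predictor
X_raw = np.column_stack([ones, x])

# Standardized predictor: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()
X_std = np.column_stack([ones, z])

cond_raw = np.linalg.cond(X_raw)  # ratio of largest to smallest singular value
cond_std = np.linalg.cond(X_std)
print(cond_raw, cond_std)
```

Standardizing changes the units of the coefficients but not the fitted values, so it is a cheap way to check whether the warning is mostly a scaling effect.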

## What's the difference?
- `formula='output ~ x1'`
- `formula='output ~ x1 + x2'`
- `formula='output ~ x1 * x2'` -- This is equivalent to -- `formula='output ~ x1 + x2 + x1:x2'` (both main effects plus their interaction; `x1:x2` alone is the interaction term only)
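
The three formulas differ in which columns enter the design matrix. The sketch below builds those columns by hand with NumPy instead of going through the formula interface, and uses synthetic, noise-free data whose true coefficients (1, 2, 3, 4) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Synthetic outcome with a genuine interaction and no noise (assumption)
output = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

ones = np.ones(n)
designs = {
    "output ~ x1": np.column_stack([ones, x1]),
    "output ~ x1 + x2": np.column_stack([ones, x1, x2]),
    "output ~ x1 * x2": np.column_stack([ones, x1, x2, x1 * x2]),
}

fits = {}
for formula, X in designs.items():
    # Ordinary least squares via NumPy
    coef, *_ = np.linalg.lstsq(X, output, rcond=None)
    sse = float(np.sum((output - X @ coef) ** 2))  # sum of squared residuals
    fits[formula] = (coef, sse)
    print(formula, np.round(coef, 3), round(sse, 3))
```

Only the design with the product column can reproduce the data exactly here; the additive fits leave the `x1 * x2` contribution in the residuals.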