## What is Forecasting?
Forecast: to predict or estimate a future event or trend
- Predicting weather patterns
- Estimating the quantity of stock required during a certain time span
- More generally, determining the most likely outcome of a stochastic process based on previous events
- **Learn from patterns**
- Forecasting is just fancy trendlines in high dimensions

### Quick Forecast

In [7]:
import numpy as np
import plotly.express as px

x = np.linspace(-1, 1, 101)
y = 2 * (x + np.random.rand(101))

fig = px.scatter(x=x, y=y, trendline='ols')
fig

### Forecasting
- Time Series forecasts
- Probability models
- Forecasting using machine learning
- Using ensemble methods to strengthen our understanding
- Choosing the best tool for the job

## Remembering OLS...
- The aim of the Ordinary Least Squares method is to minimize the sum of squared differences between the predicted and the real values.
- "Least squares" refers to minimizing the sum of squared errors (SSE)
- Ordinary Least Squares (OLS) is the foundation of regression analysis, and an excellent starting point for this course
- Estimates the expected outcome $(\hat{y})$ given the inputs $(x)$
- Coefficient standard errors tell us how precisely each coefficient is estimated, which reflects the level of noise in the data
- $R^2$ and Adjusted $R^2$ tell us how much of the total variation our model accounts for
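
To make the bullets above concrete, here is a minimal sketch on synthetic data. The data, the seed, and every variable name are assumptions made for illustration and are not part of the course materials. It solves the least-squares problem with NumPy, then computes the SSE, $R^2$, and the coefficient standard errors by hand.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# OLS coefficients: the beta that minimizes the sum of squared errors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta
resid = y - y_hat
sse = np.sum(resid ** 2)                 # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)        # total variation in y
r_squared = 1 - sse / sst

# Standard errors: square root of the diagonal of sigma^2 * (X'X)^-1
k = X.shape[1]
sigma2 = sse / (n - k)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print("coefficients:", beta)
print("std errors:  ", std_err)
print("R-squared:   ", r_squared)
```

Any other choice of coefficients gives a larger SSE on the same data, which is what "least squares" means in practice.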

### OLS Linear Regression with statsmodels

In [23]:
import pandas as pd
import numpy as np

data = pd.read_csv("https://github.com/dustywhite7/Econ8310/raw/master/DataSets/nflValues.csv")

In [39]:
import statsmodels.api as sm

x = data[['Playoffs', 'TVDeal', 'Expansion']]
x = sm.add_constant(x)
y = data['OperatingIncome']

fitted_model = sm.OLS(y, x).fit()
fitted_model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | OperatingIncome | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | 0.008 |
| Method: | Least Squares | F-statistic: | 2.454 |
| Date: | Fri, 27 Jan 2023 | Prob (F-statistic): | 0.0624 |
| Time: | 21:13:39 | Log-Likelihood: | -2628.9 |
| No. Observations: | 539 | AIC: | 5266.0 |
| Df Residuals: | 535 | BIC: | 5283.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 32.0112 | 1.950 | 16.419 | 0.000 | 28.181 | 35.841 |
| Playoffs | 0.6974 | 2.832 | 0.246 | 0.806 | -4.865 | 6.260 |
| TVDeal | 5.9941 | 3.671 | 1.633 | 0.103 | -1.218 | 13.206 |
| Expansion | -8.0181 | 4.336 | -1.849 | 0.065 | -16.536 | 0.499 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 401.869 | Durbin-Watson: | 0.966 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6843.384 |
| Skew: | 3.136 | Prob(JB): | 0.0 |
| Kurtosis: | 19.29 | Cond. No. | 3.66 |


In [41]:
import patsy as pt

y, x = pt.dmatrices("OperatingIncome ~ Playoffs + TVDeal + Expansion",  data=data)

fitted_model = sm.OLS(endog=y, exog=x).fit()
fitted_model.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | OperatingIncome | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | 0.008 |
| Method: | Least Squares | F-statistic: | 2.454 |
| Date: | Fri, 27 Jan 2023 | Prob (F-statistic): | 0.0624 |
| Time: | 21:14:56 | Log-Likelihood: | -2628.9 |
| No. Observations: | 539 | AIC: | 5266.0 |
| Df Residuals: | 535 | BIC: | 5283.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| Intercept | 32.0112 | 1.950 | 16.419 | 0.000 | 28.181 | 35.841 |
| Playoffs | 0.6974 | 2.832 | 0.246 | 0.806 | -4.865 | 6.260 |
| TVDeal | 5.9941 | 3.671 | 1.633 | 0.103 | -1.218 | 13.206 |
| Expansion | -8.0181 | 4.336 | -1.849 | 0.065 | -16.536 | 0.499 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 401.869 | Durbin-Watson: | 0.966 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6843.384 |
| Skew: | 3.136 | Prob(JB): | 0.0 |
| Kurtosis: | 19.29 | Cond. No. | 3.66 |


## Cause and Effect
- The goal of statistical modeling is to understand the inputs that cause some specific outcome that we want to study. The catch is that statistical models do not, by themselves, identify causation. They identify correlation, and leave causation to domain expertise.

### Questioning Causality
1. Is it possible that y causes x instead?
2. Is it possible that z (a new factor that we have not considered before) is causing both x and y?
3. Could the relationship have been observed by chance?

### Establishing Causality
In order to establish causality, we need to meet several conditions:
- We can explain why x causes y
- We can demonstrate that nothing else is driving the changes (within reason)
- We can show that there is a correlation between x and y

In other words, we need a way to statistically isolate the relationship between two variables, even when there are other "moving parts" in our model

### RCT
One way to establish causality is through Randomized Controlled Trials (RCTs). An RCT is designed so that only one variable, the treatment, is manipulated. Because individuals are assigned to the treatment and control groups at random, the two groups are comparable on average, and the researcher can use univariate statistical tests to determine the relationship between the variable of interest and the outcome.
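
As a rough sketch of that logic, the simulation below assigns synthetic individuals to treatment at random and compares group means with a two-sample t-test from SciPy. The sample size, the assumed true effect of 0.5, and all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1000

# Randomly assign exactly half of the individuals to treatment
treated = rng.permutation(n) < n // 2

# Hypothetical outcome: baseline noise plus an assumed true effect of 0.5
true_effect = 0.5
outcome = rng.normal(size=n) + true_effect * treated

# Two-sample t-test comparing the treatment and control group means
result = stats.ttest_ind(outcome[treated], outcome[~treated])
estimate = outcome[treated].mean() - outcome[~treated].mean()

print("estimated effect:", estimate)
print("p-value:", result.pvalue)
```

Because assignment is random, no confounder can be systematically related to treatment, so the simple difference in means is a reasonable estimate of the effect.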

## Regression

#### Regression Assumptions

It is important to note that the statistical models underlying linear regression depend on several assumptions:
1. Effects are linear
2. Errors are normally distributed
3. Variables are not collinear
4. No autocorrelation (problematic for time series data)
5. Homoskedasticity (errors have the same variance across all observations)

While the study of these assumptions is essential for a practitioner, and frequently occupies an entire semester-long course, it is sufficient to state these assumptions, and be aware that these assumptions are baked into the models.

#### When should we use regression?
- Regression analysis is most useful when you care about why a particular outcome occurs. Regressions are very powerful transparent models, by which I mean that it is straightforward to see how each variable leads to the predicted outcome.
- Regression models are the de facto standard for understanding how one variable causes the other to change.


## Implementing Linear Regression in Python

In order to perform regression analysis, we will utilize the statsmodels library, which is capable of performing most types of regression modeling.

In [59]:
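import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("https://github.com/dustywhite7/pythonMikkeli/raw/master/exampleData/fishWeight.csv")
x = data[['Length1', 'Species']]
# dtype=float keeps the dummy columns numeric; recent pandas versions default to bool,
# which statsmodels may reject when mixed with float columns
x = pd.get_dummies(x, columns=['Species'], drop_first=True, dtype=float)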
x = sm.add_constant(x)
y = data['Weight']

reg_fitted = sm.OLS(endog=y, exog=x).fit()
reg_fitted.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | Weight | R-squared: | 0.93 |
| Model: | OLS | Adj. R-squared: | 0.927 |
| Method: | Least Squares | F-statistic: | 286.9 |
| Date: | Fri, 27 Jan 2023 | Prob (F-statistic): | 7.78e-84 |
| Time: | 22:57:29 | Log-Likelihood: | -948.61 |
| No. Observations: | 159 | AIC: | 1913.0 |
| Df Residuals: | 151 | BIC: | 1938.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | -668.1044 | 40.472 | -16.508 | 0.000 | -748.069 | -588.139 |
| Length1 | 42.4320 | 1.221 | 34.741 | 0.000 | 40.019 | 44.845 |
| Species_Parkki | 28.2864 | 36.336 | 0.778 | 0.438 | -43.506 | 100.078 |
| Species_Perch | -41.6749 | 21.598 | -1.930 | 0.056 | -84.349 | 0.999 |
| Species_Pike | -415.5526 | 32.256 | -12.883 | 0.000 | -479.283 | -351.822 |
| Species_Roach | -55.8549 | 29.596 | -1.887 | 0.061 | -114.331 | 2.621 |
| Species_Smelt | 201.6195 | 38.457 | 5.243 | 0.000 | 125.637 | 277.602 |
| Species_Whitefish | -22.9381 | 42.825 | -0.536 | 0.593 | -107.552 | 61.676 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 30.449 | Durbin-Watson: | 0.862 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 55.264 |
| Skew: | 0.91 | Prob(JB): | 9.99e-13 |
| Kurtosis: | 5.242 | Cond. No. | 230.0 |
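
The `Species_*` rows in this summary come from `pd.get_dummies(..., drop_first=True)`. The toy sketch below shows what that call does to a categorical column. The four rows and the species labels are invented for illustration and are not taken from the fish data set.

```python
import pandas as pd

# Toy data (assumed for illustration)
toy = pd.DataFrame({
    "Length1": [23.2, 30.0, 8.4, 12.5],
    "Species": ["Bream", "Pike", "Perch", "Bream"],
})

# drop_first=True omits the first category in sorted order ("Bream" here),
# which becomes the baseline that the other species are compared against
encoded = pd.get_dummies(toy, columns=["Species"], drop_first=True, dtype=float)
print(encoded)
```

Each dummy coefficient is therefore read relative to the omitted category. In the summary above, the coefficient on `Species_Pike` says that a pike is predicted to weigh about 415.55 units less than a fish of the baseline species with the same `Length1`.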


#### Linear Model Exercise
Using the wage data provided here (https://github.com/dustywhite7/pythonMikkeli/raw/master/exampleData/wagePanelData.csv), create a linear regression model to explain and/or predict wages. Your fitted model should be stored as reg. If you do not name the model correctly, you won’t get any points!

All code needed to implement your model should be placed in the file linearModel.py found in the file tree.

In [58]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import pandas as pd

# Read Data
data = pd.read_csv('https://github.com/dustywhite7/pythonMikkeli/raw/master/exampleData/wagePanelData.csv')

# Check VIF per column to test multicollinearity
vif_data = pd.DataFrame()
x_variables = data.drop(columns=['id', 'log_wage'])
vif_data["feature"] = x_variables.columns
vif_data["VIF"] = [variance_inflation_factor(x_variables.values, i) for i in range(len(x_variables.columns))]

# Get x and y features
x = data.drop(columns=['id', 'log_wage', 'education', 'weeks_worked', 'ms'])
x = sm.add_constant(x)
y = data['log_wage']

# Fit Regression
reg = sm.OLS(endog=y, exog=x)
reg_fitted = reg.fit()

reg_fitted.summary()

| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | log_wage | R-squared: | 0.482 |
| Model: | OLS | Adj. R-squared: | 0.481 |
| Method: | Least Squares | F-statistic: | 429.8 |
| Date: | Fri, 27 Jan 2023 | Prob (F-statistic): | 0.0 |
| Time: | 22:57:05 | Log-Likelihood: | -1318.4 |
| No. Observations: | 4165 | AIC: | 2657.0 |
| Df Residuals: | 4155 | BIC: | 2720.0 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 6.3061 | 0.018 | 355.609 | 0.000 | 6.271 | 6.341 |
| year | 0.0914 | 0.003 | 34.840 | 0.000 | 0.086 | 0.097 |
| years_experience | 0.0048 | 0.000 | 9.882 | 0.000 | 0.004 | 0.006 |
| occupation_code | -0.3048 | 0.012 | -25.932 | 0.000 | -0.328 | -0.282 |
| industry_code | 0.0333 | 0.011 | 2.997 | 0.003 | 0.012 | 0.055 |
| south_region | -0.0921 | 0.012 | -7.804 | 0.000 | -0.115 | -0.069 |
| metropolitan_resident | 0.1857 | 0.011 | 16.298 | 0.000 | 0.163 | 0.208 |
| female | -0.4675 | 0.017 | -27.318 | 0.000 | -0.501 | -0.434 |
| union_member | 0.0655 | 0.012 | 5.500 | 0.000 | 0.042 | 0.089 |

| | | | |
|:--|:--|:--|:--|
| Omnibus: | 100.547 | Durbin-Watson: | 0.565 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 231.789 |
| Skew: | -0.033 | Prob(JB): | 4.65e-51 |
| Kurtosis: | 4.154 | Cond. No. | 99.6 |


## Don't Forget Logistic Regression

- Linear regression can also be used to model discrete outcomes such as success or failure, that is, to model the probability of success or failure. This is called a Linear Probability Model (LPM).
- One important part of any linear regression model is the linearity of the model. The issue is that a linear function with a nonzero slope is by definition unbounded, so its predictions will not remain within the [0,1] interval.

#### Non-Linear, But in a Good Way
- There is a better way to model probabilities using regression analysis: it is called logistic regression.
- We want to redesign our regression model to resemble a linear model, but stay within the [0,1] interval
- To fix our regression equation, we make a really simple transformation called the logistic transformation.

$$y=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k)}{1+\exp(\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_k x_k)}$$

- Where exp() represents Euler's number raised to the power of the internal element (in our case, our original linear regression function)
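
A minimal sketch of the transformation, using one regressor and hypothetical coefficients chosen only for illustration:

```python
import numpy as np

def logistic(z):
    """Map any real-valued linear predictor z into the (0, 1) interval."""
    return np.exp(z) / (1 + np.exp(z))

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -1.0, 2.0
x1 = np.array([-5.0, -1.0, 0.0, 0.5, 1.0, 5.0])

linear_part = beta0 + beta1 * x1      # unbounded, like a linear regression
probability = logistic(linear_part)   # always strictly between 0 and 1

print(linear_part)
print(probability)
```

When the linear part equals zero the predicted probability is exactly 0.5, and larger values of the linear part always give larger probabilities. For very large inputs `np.exp` can overflow; `scipy.special.expit` computes the same function in a numerically stable way.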

In [32]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd

data = pd.read_csv("https://github.com/dustywhite7/pythonMikkeli/raw/master/exampleData/passFailTrain.csv")
x = data.drop(columns=['Unnamed: 0', 'G1', 'G2', 'G3'])
x = sm.add_constant(x)
y = data['G3']

reg = sm.Logit(y, x).fit()
reg.summary()

Optimization terminated successfully.
         Current function value: 0.494047
         Iterations 6


| | | | |
|:--|:--|:--|:--|
| Dep. Variable: | G3 | No. Observations: | 296.0 |
| Model: | Logit | Df Residuals: | 265.0 |
| Method: | MLE | Df Model: | 30.0 |
| Date: | Sat, 28 Jan 2023 | Pseudo R-squ.: | 0.2096 |
| Time: | 11:06:19 | Log-Likelihood: | -146.24 |
| converged: | True | LL-Null: | -185.01 |
| Covariance Type: | nonrobust | LLR p-value: | 4.437e-06 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|:--|--:|--:|--:|--:|--:|--:|
| const | 5.2626 | 2.919 | 1.803 | 0.071 | -0.458 | 10.983 |
| school | 0.2258 | 0.545 | 0.415 | 0.678 | -0.841 | 1.293 |
| sex | 0.5340 | 0.381 | 1.403 | 0.161 | -0.212 | 1.280 |
| age | -0.3386 | 0.151 | -2.248 | 0.025 | -0.634 | -0.043 |
| address | 0.0286 | 0.402 | 0.071 | 0.943 | -0.759 | 0.817 |
| famsize | 0.1232 | 0.343 | 0.360 | 0.719 | -0.548 | 0.795 |
| Pstatus | -0.5100 | 0.543 | -0.939 | 0.348 | -1.574 | 0.554 |
| Medu | 0.1859 | 0.203 | 0.915 | 0.360 | -0.212 | 0.584 |
| Fedu | -0.0157 | 0.185 | -0.085 | 0.932 | -0.379 | 0.348 |
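
Logit coefficients are on the log-odds scale, so they are not directly comparable to OLS coefficients. One common way to read them is to exponentiate: `exp(coef)` is the factor by which the odds change for a one-unit increase in the regressor, holding the others fixed. The sketch below applies this to the `age` coefficient copied from the summary above.

```python
import numpy as np

# Coefficient on age, copied from the Logit summary above
age_coef = -0.3386

# exp(coef) is the multiplicative change in the odds for a one-unit increase
odds_ratio = np.exp(age_coef)
print(round(odds_ratio, 3))
```

An odds ratio below 1 means the odds that `G3` equals 1 fall as `age` rises. For a fitted model, `np.exp(reg.params)` gives the same quantity for every coefficient at once.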


In [28]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["feature"] = x.columns
vif_data["VIF"] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]
vif_data

| | feature | VIF |
|--:|:--|--:|
| 0 | const | 395.226396 |
| 1 | school | 1.343139 |
| 2 | sex | 1.494889 |
| 3 | age | 1.595037 |
| 4 | address | 1.262928 |
| 5 | famsize | 1.139231 |
| 6 | Pstatus | 1.178782 |
| 7 | Medu | 2.235044 |
| 8 | Fedu | 1.891042 |
| 9 | Mjob | 1.422739 |