## OLS Regression with statsmodels and DataFrames

*(Coding along with the Udemy Course [Python for Business and Finance](https://www.udemy.com/course/complete-python-for-business-and-finance-bootcamp/) by Alexander Hagmann.)*

The task now is to run a linear regression on our real-world movies dataset, decompose the variation of the dependent variable, and calculate R-squared:

- Perform the linear regression analysis with statsmodels

- The statsmodels formula API can be used whenever the data is organized in a pandas DataFrame

In [15]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols

In [16]:
df = pd.read_csv("../assets/data/bud_vs_rev.csv", parse_dates = ["release_date"], index_col = "release_date")

In [17]:
df = df.loc["2016"]

In [18]:
df

| release_date | title | budget | revenue |
|---|---|---|---|
| 2016-01-01 | Jane Got a Gun | 25.0 | 1.397284 |
| 2016-01-07 | Friend Request | 9.9 | 2.400000 |
| 2016-01-07 | The Forest | 10.0 | 40.055439 |
| 2016-01-07 | Wazir | 5.2 | 9.200000 |
| 2016-01-13 | 13 Hours: The Secret Soldiers of Benghazi | 50.0 | 69.411370 |
| ... | ... | ... | ... |
| 2016-12-23 | Resident Evil: The Final Chapter | 40.0 | 312.242626 |
| 2016-12-23 | Railroad Tigers | 50.0 | 102.205175 |
| 2016-12-23 | Dangal | 10.4 | 310.000000 |
| 2016-12-25 | Live by Night | 108.0 | 22.678555 |


In [19]:
# performing a linear regression with a dataframe and statsmodels
# dependent variable revenue in column revenue
# independent variable budget in column budget
# formula parameter expects as a string the formula for the linear regression: "dependent-variable tilde-symbol independent-variable"
# "revenue ~ budget"
model = ols("revenue ~ budget", data = df)

In [20]:
results = model.fit() # fit the model with fit()

In [21]:
results.params # getting the regression coefficients (intercept and slope)

Intercept   -9.449215
budget       3.349424
dtype: float64
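To make the coefficients concrete, here is a minimal sketch that plugs them into the regression line `revenue = Intercept + budget_coefficient * budget`. The values are copied from the `results.params` output above; the helper name `predict_revenue` and the example budget of 50 are illustrative assumptions, not part of the course code.

```python
# Coefficients copied from the results.params output shown above
intercept = -9.449215
slope = 3.349424  # coefficient of budget


def predict_revenue(budget):
    """Return the fitted revenue for a budget (same units as the DataFrame columns).

    Hypothetical helper for illustration only.
    """
    return intercept + slope * budget


# Fitted revenue for an assumed budget of 50
print(predict_revenue(50.0))
```

With the fitted model at hand, the results object's `predict` method should give the same kind of fitted values directly from a DataFrame that contains a `budget` column.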

In [22]:
results.rsquared # calculating r-squared

np.float64(0.6402124115463808)

__The budget alone explains 64% of the variation of the dependent variable revenue.__
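As a reminder of what this number measures, the usual definition of R-squared for an OLS fit with an intercept can be written as follows. The notation is an assumption for this note: $y_i$ is the observed revenue, $\hat{y}_i$ the fitted revenue, and $\bar{y}$ the mean revenue; TSS, ESS, and SSR are named after the statsmodels attributes `centered_tss`, `ess`, and `ssr`.

```latex
\begin{aligned}
\mathrm{TSS} &= \sum_i (y_i - \bar{y})^2 && \text{(total variation)} \\
\mathrm{ESS} &= \sum_i (\hat{y}_i - \bar{y})^2 && \text{(explained variation)} \\
\mathrm{SSR} &= \sum_i (y_i - \hat{y}_i)^2 && \text{(unexplained variation)} \\
\mathrm{TSS} &= \mathrm{ESS} + \mathrm{SSR}, \qquad
R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}
\end{aligned}
```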

The results object also gives us the total variation, the explained variation, and the unexplained variation (the sum of squared errors).

R-squared is again equal to 0.64, and we can double-check this by dividing the explained variation by the total variation, which yields the very same R-squared of 0.64.


In [23]:
tss = results.centered_tss # getting the total variation of the dependent variable
tss

np.float64(10848340.569368294)

In [24]:
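rss = results.ess # explained variation (explained sum of squares); mse_model = ess / df_model only matches it here because there is a single regressor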
rss

np.float64(6945242.277191712)

In [25]:
sse = results.ssr # unexplained variation
sse

np.float64(3903098.292176582)

In [26]:
r_squared = results.rsquared
r_squared

np.float64(0.6402124115463808)

In [27]:
rss/tss # double checking R-squared by calculating the explained variation divided by the total variation

np.float64(0.6402124115463806)
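The same decomposition can be reproduced by hand to see where the numbers come from. The following is a minimal sketch in plain Python with the closed-form formulas for a simple (one-regressor) OLS fit. The five data points are made up for illustration and are not taken from the movies dataset, and the variable names are assumptions chosen to mirror the statsmodels attributes.

```python
# Hypothetical toy data (NOT from the movies dataset)
budget = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [12.0, 35.0, 41.0, 80.0, 95.0]

n = len(budget)
mean_x = sum(budget) / n
mean_y = sum(revenue) / n

# Closed-form OLS estimates for a single regressor plus intercept
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(budget, revenue))
s_xx = sum((x - mean_x) ** 2 for x in budget)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted values on the regression line
fitted = [intercept + slope * x for x in budget]

# Variance decomposition
tss = sum((y - mean_y) ** 2 for y in revenue)                # total variation
ess = sum((f - mean_y) ** 2 for f in fitted)                 # explained variation
ssr = sum((y - f) ** 2 for y, f in zip(revenue, fitted))     # unexplained variation

r_squared = ess / tss

print("TSS:", tss)
print("ESS + SSR:", ess + ssr)       # matches TSS up to floating-point rounding
print("R-squared:", r_squared)
print("1 - SSR/TSS:", 1 - ssr / tss)  # same value, computed the other way
```

Both ways of computing R-squared agree because, for an OLS fit that includes an intercept, the total variation splits exactly into the explained and the unexplained part.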