# Regression

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

Regression is a statistical tool for analyzing the relationships between variables.
It comprises a family of statistical models that explore the relationship between a response variable (dependent variable) and one or more explanatory variables (independent variables), so that, given values of the explanatory variables, we can predict the value of the response variable.

Two main types:
- Linear Regression: the response variable is numeric
- Logistic Regression: the response variable is logical (True or False values)

## Before you start

Before fitting any regression, **visualize the data**.
**Scatterplots** are very useful at this stage. Seaborn's **regplot** adds a trend line to the scatterplot.

In [None]:
taiwan_real_estate = pd.read_csv('../data/taiwan_real_estate2.csv')
taiwan_real_estate.head()

In [None]:
taiwan_real_estate['house_age_years'] = taiwan_real_estate['house_age_years'].astype('category')

In [None]:
taiwan_real_estate.info()

To keep it simple, let's focus on simple linear regression, that is, using a single explanatory variable to predict the response variable. In this case, let's use the *n_convenience* variable to predict *price_twd_msq*.

In [None]:
sns.scatterplot(data=taiwan_real_estate, x='n_convenience', y='price_twd_msq')
plt.show()

In [None]:
sns.regplot(x='n_convenience',
         y='price_twd_msq',
         data=taiwan_real_estate,
         ci=90,
         scatter_kws={'alpha': 0.5})

The fitted lines are defined by:
- Intercept: y value at x=0
- Slope: steepness. The amount the y value increases when x increases 1 unit

$$
  y = \text{intercept} + \text{slope} \cdot x
$$
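For a single explanatory variable, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks this on a few made-up points (the arrays `x` and `y` are illustrative assumptions, not values from the Taiwan dataset) and compares the result with `np.polyfit`.

```python
import numpy as np

# Made-up illustrative data (not from the Taiwan real estate dataset)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([8.1, 9.2, 9.7, 10.9, 11.3, 12.4])

# Least-squares slope: covariance(x, y) / variance(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The fitted line passes through (mean(x), mean(y))
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 returns [slope, intercept]
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"polyfit: slope={slope_np:.4f}, intercept={intercept_np:.4f}")
```

Both approaches should agree up to floating-point error, which is a quick way to convince yourself what the two coefficients mean.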

## Run the linear regression model

In [None]:
from statsmodels.formula.api import ols

In [None]:
mdl_price_vs_conv = ols("price_twd_msq ~ n_convenience",
                           data=taiwan_real_estate)

In [None]:
mdl_price_vs_conv = mdl_price_vs_conv.fit()

In [None]:
print(mdl_price_vs_conv.params)

On average, a house with zero convenience stores nearby had a price of 8.2242 TWD per square meter.

If you increase the number of nearby convenience stores by one, then the expected increase in house price is 0.7981 TWD per square meter.
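The interpretation of the two coefficients can be checked directly against the model's predictions. The following is a minimal sketch on synthetic data (the `toy` DataFrame, its columns `x` and `y`, and the model name `mdl_toy` are assumptions made for illustration): the prediction at `x = 0` should equal the intercept, and moving one unit along `x` should change the prediction by exactly one slope.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data, for illustration only: y = 8 + 0.8 * x + noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.integers(0, 11, size=200)})
toy["y"] = 8 + 0.8 * toy["x"] + rng.normal(0, 1, size=200)

mdl_toy = ols("y ~ x", data=toy).fit()
intercept = mdl_toy.params["Intercept"]
slope = mdl_toy.params["x"]

# Predictions at x = 0, 1, 2
preds = mdl_toy.predict(pd.DataFrame({"x": [0, 1, 2]}))

# The prediction at x = 0 is the intercept ...
print(np.isclose(preds.iloc[0], intercept))
# ... and each extra unit of x adds exactly one slope
print(np.allclose(np.diff(preds.to_numpy()), slope))
```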



## Run the linear regression model using a categorical variable
Let's predict the price using the age of the property.

In [None]:
taiwan_real_estate.house_age_years.value_counts()

The '0 to 15' category will be used as the baseline: the intercept is its mean price, and the other coefficients are calculated relative to it.

In [None]:
sns.displot(data=taiwan_real_estate,
            x="price_twd_msq",
            col="house_age_years",
            bins=10)

In [None]:
mdl_price_vs_age = ols("price_twd_msq ~ house_age_years",
                           data=taiwan_real_estate)

In [None]:
mdl_price_vs_age=mdl_price_vs_age.fit()

In [None]:
print(mdl_price_vs_age.params)

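If we want each coefficient to be measured from zero rather than relative to the baseline, we can edit the formula slightly by adding *'+ 0'*. This removes the intercept, so each coefficient is simply the mean price of its category, which the group means computed afterwards confirm.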

In [None]:
mdl_price_vs_age = ols("price_twd_msq ~ house_age_years + 0",
                           data=taiwan_real_estate)

In [None]:
mdl_price_vs_age=mdl_price_vs_age.fit()

In [None]:
print(mdl_price_vs_age.params)

In [None]:
taiwan_real_estate.groupby('house_age_years')['price_twd_msq'].mean()
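The relationship between the two parameterizations can be seen on a tiny example. This is a sketch on made-up data (the `toy` DataFrame and its `group` and `y` columns are assumptions for illustration): with an intercept, the coefficients are the baseline mean followed by differences from that baseline; with `+ 0`, they are the group means themselves.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data, for illustration only; "a" is the first (baseline) category
toy = pd.DataFrame({
    "group": pd.Categorical(["a", "a", "b", "b", "c", "c"]),
    "y": [10.0, 12.0, 8.0, 9.0, 14.0, 16.0],
})

group_means = toy.groupby("group", observed=True)["y"].mean()

# With an intercept: baseline mean, then differences from the baseline
with_intercept = ols("y ~ group", data=toy).fit().params

# Without an intercept ("+ 0"): one coefficient per category, each a group mean
without_intercept = ols("y ~ group + 0", data=toy).fit().params

print(group_means)
print(with_intercept)
print(without_intercept)
```

Both models produce the same fitted values; only the way the coefficients are expressed changes.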

## Predictions

In [None]:
import numpy as np

# Create explanatory_data 
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})

# Use mdl_price_vs_conv to predict with explanatory_data, call it price_twd_msq
price_twd_msq = mdl_price_vs_conv.predict(explanatory_data)

# Create prediction_data
prediction_data = explanatory_data.assign(price_twd_msq = price_twd_msq)

# Print the result
print(prediction_data)

In [None]:
# Create a new figure, fig
fig = plt.figure()

sns.regplot(x="n_convenience",
            y="price_twd_msq",
            data=taiwan_real_estate,
            ci=None)
# Add a scatter plot layer to the regplot
sns.scatterplot(x="n_convenience",
            y="price_twd_msq",
            data=prediction_data,
            color='r')

# Show the layered plot
plt.show()

## The model objects

In [None]:
mdl_price_vs_conv.params

In [None]:
mdl_price_vs_conv.fittedvalues

The .fittedvalues attribute is a shortcut for the predictions on the original dataset. It's equivalent to:

In [None]:
mdl_price_vs_conv.predict(taiwan_real_estate.n_convenience)

In [None]:
# Residuals are the difference between the observed response values and the predicted ones
mdl_price_vs_conv.resid

In [None]:
mdl_price_vs_conv.summary()

In [None]:
coeffs = mdl_price_vs_conv.params

intercept = coeffs['Intercept']
slope = coeffs['n_convenience']

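# Manually calculate the predictions from the coefficients
price_twd_msq = intercept + slope * explanatory_data['n_convenience']
print(price_twd_msq)

# Compare the manual predictions with the ones returned by .predict()
print(explanatory_data.assign(predictions_manual=price_twd_msq,
                              predictions_auto=mdl_price_vs_conv.predict(explanatory_data)))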

## Regression to the mean

Response = fitted value + residual
- Fitted value: what the model can explain
- Residual: what the model cannot explain

Residuals exist because of problems in the model and because of fundamental randomness.
Extreme values are partly the product of that randomness, which does not repeat, so they tend to move toward the mean when we predict them.
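A small simulation can make this concrete. The sketch below uses synthetic data only (the names `persistent`, `return_y1` and `return_y2` are assumptions for illustration and have nothing to do with the S&P 500 file): each year's value is a persistent component plus fresh noise, so the fitted slope comes out below 1 and an extreme first-year value gets a less extreme prediction.

```python
import numpy as np

# Synthetic illustration: each year's return = persistent part + fresh noise
rng = np.random.default_rng(42)
n = 1000
persistent = rng.normal(0, 1, size=n)
return_y1 = persistent + rng.normal(0, 1, size=n)
return_y2 = persistent + rng.normal(0, 1, size=n)

# Fit return_y2 ~ return_y1 by least squares
slope, intercept = np.polyfit(return_y1, return_y2, deg=1)
fitted = intercept + slope * return_y1
residuals = return_y2 - fitted

# Response = fitted value + residual, by construction
print(np.allclose(fitted + residuals, return_y2))

# A slope below 1 means extreme year-1 values get less extreme predictions
print(f"slope={slope:.2f}")
print(f"prediction for a year-1 return of 3: {intercept + slope * 3:.2f}")
```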

In [None]:
sp500_yearly_returns = pd.read_csv('../data/sp500_yearly_returns.csv')

In [None]:
sp500_yearly_returns.info()

In [None]:
sp500_yearly_returns.head()

In [None]:
sp500_yearly_returns['symbol'] = sp500_yearly_returns['symbol'].astype('category')

In [None]:
sp500_yearly_returns.info()

In [None]:
# Create a new figure, fig
fig = plt.figure()

# Plot the first layer: y = x
plt.axline(xy1=(0,0), slope=1, linewidth=2, color="green")

# Add scatter plot with linear regression trend line
sns.regplot(data=sp500_yearly_returns, x='return_2018', y='return_2019', ci=None)

# Set the axes so that the distances along the x and y axes look the same
plt.axis('equal')

# Show the plot
plt.show()

In [None]:
mdl_returns = ols('return_2019 ~ return_2018', data=sp500_yearly_returns).fit()
mdl_returns.params

In [None]:
mdl_returns = ols("return_2019 ~ return_2018", data=sp500_yearly_returns).fit()
explanatory_data = pd.DataFrame({'return_2018': [-1, 0, 1]})
print(mdl_returns.predict(explanatory_data))

## Transforming variables

Sometimes the relation between the dependent and independent variables is not a straight line.
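To see why a transformation can help, here is a minimal sketch on synthetic data (the arrays `x` and `y` and the chosen coefficients are assumptions for illustration). The response is generated to be linear in the square root of `x`, and the squared correlation is used as a rough measure of how well a straight line fits before and after transforming.

```python
import numpy as np

# Synthetic illustration: y is linear in sqrt(x), not in x
rng = np.random.default_rng(1)
x = rng.uniform(0, 6400, size=500)
y = 50 - 0.5 * np.sqrt(x) + rng.normal(0, 0.5, size=500)

# For simple linear regression, r-squared is the squared correlation
r2_raw = np.corrcoef(x, y)[0, 1] ** 2
r2_sqrt = np.corrcoef(np.sqrt(x), y)[0, 1] ** 2

print(f"r-squared using x:       {r2_raw:.3f}")
print(f"r-squared using sqrt(x): {r2_sqrt:.3f}")
```

The straight line should fit noticeably better on the transformed variable, which is the pattern to look for in the real data.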

In [None]:
sns.regplot(data=taiwan_real_estate, x='dist_to_mrt_m', y='price_twd_msq', ci=None)
plt.show()

In [None]:
# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])

plt.figure()

# Plot using the transformed variable
sns.regplot(data=taiwan_real_estate,
            x='sqrt_dist_to_mrt_m',
            y='price_twd_msq',
            ci=None)
plt.show()

In [None]:
# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", 
                        data=taiwan_real_estate).fit()

# Use this explanatory data
explanatory_data = pd.DataFrame({"sqrt_dist_to_mrt_m": np.sqrt(np.arange(0, 81, 10) ** 2),
                                "dist_to_mrt_m": np.arange(0, 81, 10) ** 2})

# Use mdl_price_vs_dist to predict explanatory_data
prediction_data = explanatory_data.assign(
    price_twd_msq = mdl_price_vs_dist.predict(explanatory_data)
)

fig = plt.figure()
sns.regplot(x="sqrt_dist_to_mrt_m", y="price_twd_msq", data=taiwan_real_estate, ci=None)

# Add a layer of your prediction points
sns.scatterplot(data=prediction_data, 
                x='sqrt_dist_to_mrt_m', 
                y='price_twd_msq', 
                color='red')
plt.show()

In [None]:
ad_conversion = pd.read_csv('../data/ad_conversion.csv')
ad_conversion.head()

In [None]:
# Create qdrt_n_impressions and qdrt_n_clicks
ad_conversion["qdrt_n_impressions"] = ad_conversion['n_impressions'] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion['n_clicks'] ** 0.25

plt.figure()

# Plot using the transformed variables
sns.regplot(data=ad_conversion, x='qdrt_n_impressions', y="qdrt_n_clicks")
plt.show()

In [None]:
ad_conversion["qdrt_n_impressions"] = ad_conversion["n_impressions"] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion["n_clicks"] ** 0.25

mdl_click_vs_impression = ols("qdrt_n_clicks ~ qdrt_n_impressions", data=ad_conversion).fit()

explanatory_data = pd.DataFrame({"qdrt_n_impressions": np.arange(0, 3e6+1, 5e5) ** .25,
                                 "n_impressions": np.arange(0, 3e6+1, 5e5)})

# Complete prediction_data
prediction_data = explanatory_data.assign(
    qdrt_n_clicks = mdl_click_vs_impression.predict(explanatory_data.qdrt_n_impressions)
)

# Print the result
print(prediction_data)

In [None]:
# Back transform qdrt_n_clicks
prediction_data["n_clicks"] = prediction_data.qdrt_n_clicks ** 4
print(prediction_data)

In [None]:
# Back transform qdrt_n_clicks
prediction_data["n_clicks"] = prediction_data["qdrt_n_clicks"] ** 4

# Plot the transformed variables
fig = plt.figure()
sns.regplot(x="qdrt_n_impressions", y="qdrt_n_clicks", data=ad_conversion, ci=None)

# Add a layer of your prediction points
sns.scatterplot(data=prediction_data, x='qdrt_n_impressions', y='qdrt_n_clicks', color='red')
plt.show()

# Quantifying model fit

How good is our model?

## Coefficient of Determination (r-squared)

R-squared is the proportion of the variance in the response variable that is predictable from the explanatory variable.

- 1: perfect fit
- 0: the worst possible fit

Statsmodels provides the r-squared value in the **.summary()** output of the fitted model. It can be accessed via the **.rsquared** attribute too.
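As a rough sketch of what that number represents, r-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. For simple linear regression it also equals the squared correlation between the two variables. The data below is made up for illustration.

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

ss_resid = np.sum(residuals ** 2)        # variation the model leaves unexplained
ss_total = np.sum((y - y.mean()) ** 2)   # total variation in the response

r_squared = 1 - ss_resid / ss_total
r_squared_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(f"{r_squared:.4f} {r_squared_from_corr:.4f}")
```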

## Residual standard error (RSE)

The residual standard error is the "typical" difference between a prediction and an observed response.

It has the same units as the response variable.

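A related metric is the mean squared error (MSE), which is the RSE squared. Another related metric is the root-mean-square error (RMSE), which is computed like the RSE but divides the sum of squared residuals by the number of observations instead of the residual degrees of freedom.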
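The three quantities can be computed by hand from the residuals. This is a sketch on made-up data for a simple linear regression, where the residual degrees of freedom are the number of observations minus the two fitted coefficients. On a fitted statsmodels OLS model, the `.mse_resid` attribute should correspond to the `mse` value computed here.

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.8, 6.4, 7.7, 10.2, 11.6, 14.5, 15.9])

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
ss_resid = np.sum(residuals ** 2)

n = len(y)
df_resid = n - 2  # observations minus the two fitted coefficients

mse = ss_resid / df_resid      # mean squared error of the residuals
rse = np.sqrt(mse)             # residual standard error
rmse = np.sqrt(ss_resid / n)   # root-mean-square error (divides by n)

print(f"MSE={mse:.4f}, RSE={rse:.4f}, RMSE={rmse:.4f}")
```

Because the RMSE divides by a larger number, it is always slightly smaller than the RSE; the gap shrinks as the dataset grows.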


In [None]:
print(mdl_click_vs_impression.summary())


In [None]:
print(mdl_click_vs_impression.rsquared)