# Regression

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

Regression is a statistical tool for analyzing relationships between variables.
It comprises a family of statistical models that explore the relationship between a response variable (dependent variable) and one or more explanatory variables (independent variables), so that, given values of the explanatory variables, we can predict the value of the response variable.

Two main types:
- Linear Regression: the response variable is numeric
- Logistic Regression: the response variable is logical (True or False values)

## Before you start

Before playing with any regression, **visualize the data**.
**Scatterplots** are very useful at this stage. **Regplot** (seaborn's `regplot`) adds a linear trend line to the scatterplot.

In [None]:
taiwan_real_estate = pd.read_csv('../data/taiwan_real_estate2.csv')
taiwan_real_estate.head()

In [None]:
taiwan_real_estate['house_age_years'] = taiwan_real_estate['house_age_years'].astype('category')

In [None]:
taiwan_real_estate.info()

To keep it simple, let's focus on simple linear regression, that is, using a single explanatory variable to predict the response variable. In this case, let's use the *n_convenience* variable to predict *price_twd_msq*.

In [None]:
sns.scatterplot(data=taiwan_real_estate, x='n_convenience', y='price_twd_msq')
plt.show()

In [None]:
sns.regplot(x='n_convenience',
            y='price_twd_msq',
            data=taiwan_real_estate,
            ci=90,
            scatter_kws={'alpha': 0.5})
plt.show()

A fitted line is defined by two parameters:
- Intercept: the y value at x = 0
- Slope: the steepness of the line, that is, the amount y increases when x increases by 1 unit

$$
  y = \text{intercept} + \text{slope} \times x
$$
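
To make the two parameters concrete, here is a minimal sketch that computes the least-squares intercept and slope by hand for a single explanatory variable. The data are made up for illustration (they are not the Taiwan dataset), and the closed-form formulas used are the standard ones for simple linear regression.

```python
import numpy as np

# Made-up data for illustration only (not the Taiwan dataset).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # lies exactly on y = 1 + 2x

# Least-squares estimates for one explanatory variable:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(intercept, slope)
```

Because the toy points lie exactly on a line, the estimates recover that line's intercept and slope; with real, noisy data they describe the best-fitting line instead.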

## Run the linear regression model

In [None]:
from statsmodels.formula.api import ols

In [None]:
mdl_price_twd_msq = ols("price_twd_msq ~ n_convenience",
                           data=taiwan_real_estate)

In [None]:
mdl_price_twd_msq = mdl_price_twd_msq.fit()

In [None]:
print(mdl_price_twd_msq.params)

On average, a house with zero convenience stores nearby has a predicted price of 8.2242 TWD per square meter (the intercept).

For each additional nearby convenience store, the expected house price increases by 0.7981 TWD per square meter (the slope).
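
As a quick check of this interpretation, the sketch below plugs the two coefficients quoted above into the line equation. The helper `predict_price` is a hypothetical name introduced here for illustration; it is not part of statsmodels.

```python
# Coefficients as quoted in the text above (rounded to four decimals).
intercept = 8.2242
slope = 0.7981


def predict_price(n_convenience):
    """Hypothetical helper: predicted price in TWD per square meter."""
    return intercept + slope * n_convenience


# Each extra store adds `slope` to the predicted price.
for n in (0, 1, 5):
    print(n, round(predict_price(n), 4))
```

In practice the fitted results object can do the same arithmetic: calling its `predict` method with a DataFrame that has an `n_convenience` column should return these values up to rounding.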



## Run the linear regression model using a categorical variable
Let's predict the price using the age of the property.

In [None]:
taiwan_real_estate.house_age_years.value_counts()

The '0 to 15' value will be used as the baseline. The other coefficients will be calculated relative to that one.
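
Which level becomes the baseline depends on the order of the categories. The sketch below shows how to inspect that order in pandas, assuming (as is the default for treatment coding in the formula interface) that the first category is the reference level. Only '0 to 15' appears in the text; the other two labels are assumed placeholders.

```python
import pandas as pd

# Assumed labels: only '0 to 15' comes from the text, the others are placeholders.
ages = pd.Series(['30 to 45', '0 to 15', '15 to 30', '0 to 15']).astype('category')

# When the dtype is inferred, string categories are sorted lexically.
print(list(ages.cat.categories))
baseline = ages.cat.categories[0]
print(baseline)

# Reordering the categories (a hypothetical choice) changes which level comes first.
reordered = ages.cat.reorder_categories(['15 to 30', '0 to 15', '30 to 45'])
print(reordered.cat.categories[0])
```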

In [None]:
sns.displot(data=taiwan_real_estate,
            x="price_twd_msq",
            col="house_age_years",
            bins=10)

In [None]:
mdl_price_vs_age = ols("price_twd_msq ~ house_age_years",
                           data=taiwan_real_estate)

In [None]:
mdl_price_vs_age=mdl_price_vs_age.fit()

In [None]:
print(mdl_price_vs_age.params)

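If we want every coefficient to be measured from zero rather than relative to the baseline, we can slightly edit the formula by adding *'+ 0'*. This removes the intercept, so each coefficient becomes the mean price of its own category, which the final `groupby` confirms.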

In [None]:
mdl_price_vs_age = ols("price_twd_msq ~ house_age_years + 0",
                           data=taiwan_real_estate)

In [None]:
mdl_price_vs_age=mdl_price_vs_age.fit()

In [None]:
print(mdl_price_vs_age.params)

In [None]:
taiwan_real_estate.groupby('house_age_years')['price_twd_msq'].mean()
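
The relationship between the two parameterizations can be checked on a tiny made-up dataset (not the Taiwan data; the column names `level` and `y` are placeholders). This is a sketch under those assumptions: with an intercept, the first coefficient is the baseline mean and the rest are offsets from it; with `+ 0`, each coefficient is a group mean.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Made-up data for illustration only (not the Taiwan dataset).
toy = pd.DataFrame({
    'level': ['a', 'a', 'b', 'b', 'c', 'c'],
    'y': [1.0, 3.0, 10.0, 12.0, 20.0, 24.0],
})
group_means = toy.groupby('level')['y'].mean().to_numpy()

# With an intercept: baseline mean first, then differences from the baseline.
with_intercept = ols('y ~ level', data=toy).fit().params.to_numpy()

# With '+ 0': one coefficient per level, each equal to that level's mean.
no_intercept = ols('y ~ level + 0', data=toy).fit().params.to_numpy()

print(with_intercept)
print(no_intercept)
print(group_means)
```

Adding the intercept to each offset from the first model reproduces the coefficients of the second, so both models describe the same fitted values.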