### <center>Class 7: Simple Linear Regression </center>

In [None]:
import os
import sys
import warnings

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")

In [None]:
%matplotlib inline

## Data

In [None]:
path = os.path.join(os.pardir, 'data', 'hotels-vienna.csv') # this will produce a path with the right syntax for your operating system
path

In [None]:
df_hotels = pd.read_csv(path)

In [None]:
df_hotels

In [None]:
df_hotels.info()

### EDA (Exploratory Data Analysis) and Feature Engineering

What kind of accommodations do we have? We are interested in hotels only.

In [None]:
df_hotels.accommodation_type.value_counts()

Star rating? We want neither luxury nor low-end budget accommodations.

In [None]:
df_hotels.stars.value_counts()

<br>
Prices, main statistics

In [None]:
df_hotels.price.describe()

A deeper look at price distribution: what is an extremely high price? How do you interpret those numbers?

In [None]:
df_hotels.price.quantile([0.1, 0.25, 0.5, 0.75, 0.90, 0.99, 0.999])

Filtering for the right observations.

In [None]:
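df_hotels = df_hotels[
    (df_hotels.accommodation_type == 'Hotel')
    & (df_hotels.city_actual == 'Vienna')
    & (df_hotels.stars >= 3)
    & (df_hotels.stars <= 4)
    & (df_hotels.price <= 600)].copy() # .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning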


In [None]:
df_hotels

In [None]:
df_hotels.price.describe(percentiles = [0.5, 0.95, 0.99])

In [None]:
df_hotels.distance.describe(percentiles = [0.5, 0.95, 0.99])

**Charting hotels data**

In [None]:
df_hotels.price.plot(
    kind = 'hist',  bins = range(50, 425, 25), rwidth = 0.9
    , figsize = (8,5)
    , xticks = range(50, 425, 25)
    , xlabel = 'USD'
    , title = 'Distribution of room prices'
);

In [None]:
df_hotels.distance.plot(
    kind = 'hist',  bins = [i/10 for i in range(0, 70, 5)], rwidth = 0.9
    , figsize = (8,5)
    , xticks = [i/10 for i in range(0, 70, 5)] # note how we use the range() function here
    , xlabel = 'miles'
    , title = 'Distances from the city center'
);

**Close vs Far**

For a simple analysis of the effect of distance from the city center on price we split the hotels into two categories: 'close' and 'far'. We calculate the respective means and plot the mean prices by these categories. 

In [None]:
df_hotels['distance_category'] = df_hotels.distance.map(lambda x: 'Far' if x >= 2 else 'Close')

In [None]:
df_hotels.head(10)

In [None]:
df_hotels[['distance_category', 'distance', 'price']].groupby('distance_category').aggregate('describe')

Note: here we are working with a `DataFrameGroupBy` object.

In [None]:
df_hotels[['distance_category', 'distance', 'price']].groupby('distance_category')

In [None]:
type(df_hotels[['distance_category', 'distance', 'price']].groupby('distance_category'))
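A `DataFrameGroupBy` object only stores the grouping rule; nothing is computed until we iterate over it or call an aggregation. The sketch below illustrates this on a small made-up table (`df_toy` and its values are hypothetical, not taken from the hotels data).

```python
import pandas as pd

# Hypothetical toy data standing in for the hotels table
df_toy = pd.DataFrame({
    'distance_category': ['Close', 'Close', 'Far', 'Far'],
    'price': [120, 100, 80, 90],
})

grouped = df_toy.groupby('distance_category')  # only the grouping rule is stored here

# Iterating over the grouped object yields (group label, sub-DataFrame) pairs
group_sizes = {label: len(sub_df) for label, sub_df in grouped}

# An aggregation turns the grouped object back into a regular pandas object
mean_price = grouped['price'].mean()

print(group_sizes)
print(mean_price)
```

This is why `aggregate('describe')` or `.mean()` is needed before anything can be displayed or plotted.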

For plotting we turn to the `seaborn` library. It is built on top of `matplotlib` and offers additional charting options. More in the [official documentation](https://seaborn.pydata.org/index.html). 

In [None]:
ps_mean_by_distcat = df_hotels.groupby('distance_category')['price'].mean()
ps_mean_by_distcat

In [None]:
sns.pointplot(
    data = ps_mean_by_distcat
    , linestyle = 'none'
    , marker = 'o'
    , color = 'k'
)
plt.xlabel('distance category')
plt.title('Mean price by distance category')
plt.ylim(0,400)
plt.grid(linestyle = '--');

**Question**: why not `scatterplot`?

In [None]:
sns.pointplot(
    data = ps_mean_by_distcat
    , linestyle = 'none'
    , marker = 'o'
    , color = 'k'
)
plt.xlabel('distance category')
plt.text(
    x = 'Close', y= ps_mean_by_distcat['Close'] + 20 # we add 20 to the y value to lift the annotation from the point itself
    , s = str(round(ps_mean_by_distcat['Close']))
    , fontsize=12, color='k')
plt.text(
    x = 'Far', y= ps_mean_by_distcat['Far'] + 20
    , s = str(round(ps_mean_by_distcat['Far']))
    , fontsize=12, color='k')
plt.ylim(0,400)
plt.title('Mean price by distance category')
plt.grid(linestyle = '--');

<br> **Elaborating on close vs far**: creating 4 distance categories, each labelled with a roughly typical distance (in miles) for that category.

In [None]:
df_hotels['dist_4_cat'] = df_hotels.distance.map(lambda x:  0.5 + 1 * int(x >= 1) + 1 * int(x >= 2) + 2.5 * int(x >= 3))
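To see what the lambda above does, it may help to evaluate the same arithmetic on a few hand-picked distances. The helper name `dist_4_cat` and the sample distances are only for illustration.

```python
def dist_4_cat(x):
    """Same arithmetic as the lambda above: every threshold passed adds to the label."""
    return 0.5 + 1 * int(x >= 1) + 1 * int(x >= 2) + 2.5 * int(x >= 3)

# Hypothetical distances in miles, chosen to hit every category and the boundaries
examples = {d: dist_4_cat(d) for d in [0.3, 1.0, 1.9, 2.4, 3.0, 6.5]}
print(examples)
# {0.3: 0.5, 1.0: 1.5, 1.9: 1.5, 2.4: 2.5, 3.0: 5.0, 6.5: 5.0}
```

Distances under 1 mile get the label 0.5, those in [1, 2) get 1.5, those in [2, 3) get 2.5, and everything from 3 miles upwards gets 5.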

In [None]:
df_hotels[['dist_4_cat', 'distance', 'price']].groupby('dist_4_cat').aggregate('describe')

In [None]:
ps_mean_by_dist_4_cat = df_hotels.groupby('dist_4_cat')['price'].mean()
ps_mean_by_dist_4_cat

Since our categories are numerical values, we can use a `scatterplot`.

In [None]:
sns.scatterplot(
    data = ps_mean_by_dist_4_cat
    , marker = 'o'
    , color = 'k'
)
for i in ps_mean_by_dist_4_cat.index: # we are adding chart elements by using a for loop
    plt.text(
        x = i, y = ps_mean_by_dist_4_cat[i] + 15
        , s = str(round(ps_mean_by_dist_4_cat[i]))
        , fontsize = 12, color = 'k'
    )
plt.xlabel('distance category (miles)')
plt.ylabel('mean price')
plt.ylim(0,400)
plt.title('Mean price by distance category')
plt.grid(linestyle = '--');

<br> Checking outliers using a `boxplot`.

In [None]:
sns.boxplot(
    data = df_hotels, y = 'price', x = 'distance_category')
plt.xlabel('distance category')
plt.title('Typical and outlier values in hotel prices')
plt.grid(axis = 'y');

Finding the outlier observation.

In [None]:
df_hotels[df_hotels.price == df_hotels.price.max()].T

**Plotting the point-by-point relationship between distance from city center and price**

Starting with `lowess`

In [None]:
sns.regplot(
    data = df_hotels, x = 'distance', y = 'price'
    , marker = '.'
    , fit_reg = True, lowess = True
    , scatter_kws = {'color': 'dimgrey'}
    , line_kws = {'color': 'k'}
)
plt.xlabel('distance in miles')
plt.ylabel('price in USD')
plt.title('Vienna hotel prices vs distances from city center');

### Building a Linear Regression Model Using `statsmodels`

Tools: one of the best-known tools data scientists use for predictive analysis is `scikit-learn`. Here, however, we use the `statsmodels` library, which lets users explore data, estimate statistical models, and perform statistical tests. Scikit-learn is great for building all kinds of predictive machine learning models, including linear regression, but puts little emphasis on providing _insights_ into the models themselves. That's why we turn to statsmodels instead.

In [None]:
df_hotels.sort_values(by = 'distance', ascending= True, inplace = True) # we are sorting the dataframe for easier charting

#### Model 0: lowess

Note: the result of a lo(w)ess regression depends on the tools used. The values calculated below will be different compared to those seen on the `seaborn` regplot output.

In [None]:
lowess = sm.nonparametric.lowess

In [None]:
type(lowess)

We get a function that, given the $y$ and $x$ inputs, returns the _fitted values_ paired with the sorted $x$ values.

In [None]:
y_hat_lowess = lowess(df_hotels.price, df_hotels.distance)
y_hat_lowess[0:10]

In [None]:
lowes_fitted_values = [x[1] for x in y_hat_lowess]
lowes_fitted_values[0:10]

Note: these are *not predictions, but fitted values*. Lowess can fit a curve on our existing data but will not be able to give us a fitted value on a new data point. We can use interpolation to get a fitted estimate on a new observation but only if its $x$ value is between the min and max in our existing sample. 

**Question**: what is *interpolation*?
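One possible way to interpolate between lowess fitted values is linear interpolation with `np.interp`. The sketch below uses made-up (distance, fitted price) pairs in the same two-column, sorted-by-$x$ layout that `lowess` returns; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical (x, fitted value) pairs, sorted by x
fitted = np.array([[0.5, 150.0], [1.0, 130.0], [2.0, 110.0], [4.0, 100.0]])
x_known, y_known = fitted[:, 0], fitted[:, 1]

# A distance that is not in the sample but lies between its min and max
x_new = 1.5
y_new = np.interp(x_new, x_known, y_known)  # point on the straight line between the two neighbours
print(y_new)  # 120.0

# Outside the sample range np.interp just repeats the edge value,
# so this is not a meaningful estimate
y_outside = np.interp(10.0, x_known, y_known)
print(y_outside)  # 100.0
```

Linear interpolation is only one choice; the point is that the new value is built from neighbouring fitted values rather than from a model equation.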

#### Model 1: linear regression

Now we are using the `statsmodels.formula.api`. The key differences between this and the main `statsmodels.api` are
- The formula API uses string formulas to specify models, while the main API requires explicit matrix definitions.
- The formula API is generally easier to use with pandas DataFrames and for specifying complex models.
- The main API, however, offers more control and flexibility for advanced modeling techniques.
- In the formula API, similarly to the `R` approach, a constant is automatically added to your data and an intercept is fitted. In the main API you have to add the constant to the data matrix X yourself.
```python
X = sm.add_constant(X)
```

The formula is defined in R-style: the dependent variable followed by a ~ and then the independent variables. 

In [None]:
regression = smf.ols(formula = 'price ~ distance', data = df_hotels).fit(cov_type="HC0") # more on 'cov_type' later

In [None]:
type(regression)

In [None]:
type(regression.summary())

In [None]:
print(regression.summary()) # the __str__() method of a statsmodels.iolib.summary.Summary object gives you a nicely formatted output

<br>What can you say about the regression?
- Does distance from the city center seem to be important in pricing hotels in Vienna?
- If yes, does it sufficiently explain why any one hotel's room price differs from the others'?

How is the F-statistic related to the t-value of the explanatory variable in a simple linear regression?

The F-value

In [None]:
regression.fvalue

t-value of the independent variable

In [None]:
regression.tvalues

In [None]:
regression.tvalues.distance

In [None]:
regression.tvalues.distance**2

Why is this?
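With a single explanatory variable, the F-test and the t-test on the slope test the same hypothesis ($\beta_1 = 0$), so $F = t^2$. The sketch below checks this by hand on simulated data using classical (homoskedastic) standard errors. That is a simplification: the regression above uses `HC0`, where the reported F is a Wald-type statistic on the same single coefficient, so the identity is expected to hold there too. All data and variable names here are hypothetical.

```python
import numpy as np

# Simulated data (hypothetical): price falls with distance, plus noise
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 5, n)
y = 130 - 15 * x + rng.normal(0, 30, n)

# OLS estimates from the closed-form formulas
x_dev = x - x.mean()
sxx = x_dev @ x_dev
beta1 = (x_dev @ (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = ((y - y_hat) ** 2).sum()         # residual sum of squares
ess = ((y_hat - y.mean()) ** 2).sum()  # explained sum of squares
sigma2 = rss / (n - 2)                 # residual variance estimate

t_stat = beta1 / np.sqrt(sigma2 / sxx)  # t-value of the slope
f_stat = (ess / 1) / sigma2             # F-statistic with 1 numerator degree of freedom

print(t_stat ** 2, f_stat)
```

Since the explained sum of squares equals $\hat\beta_1^2 \sum_i (x_i - \bar{x})^2$, dividing it by the residual variance gives exactly the squared t-value.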

#### Fitted values of a linear regression model

We can get the fitted values using the `fittedvalues` attribute of the _regression_ object. 

In [None]:
df_hotels['price_fitted'] = regression.fittedvalues

More complex charts are still better handled with plain `matplotlib`.

In [None]:
fig = plt.figure(figsize = (6,4))
ax = fig.add_axes([0,0,1,1])
ax.scatter(df_hotels.distance, df_hotels.price, s = 3, color = 'dimgrey')
ax.plot(df_hotels.distance, df_hotels.price_fitted, color = 'k')
plt.xlabel('distance in miles')
plt.ylabel('price')
plt.title('Vienna hotel prices and fitted values');

We can add the lowess regression's fitted values as well. 

In [None]:
fig = plt.figure(figsize = (6,4))
ax = fig.add_axes([0,0,1,1])
ax.scatter(df_hotels.distance, df_hotels.price, s = 10, color = 'dimgrey')
ax.plot(df_hotels.distance, df_hotels.price_fitted, color = 'k', label ='linear regression fitted values')
ax.plot(df_hotels.distance, lowes_fitted_values, color = 'blue', label ='lowess regression fitted values')
plt.legend(labelcolor = ['black', 'blue'])
plt.xlabel('distance in miles')
plt.ylabel('price')
plt.grid(linestyle = '--')
plt.title('Vienna hotel prices and fitted values');

### Build Your Own Regression Model on PL Matches

In [None]:
path = os.path.join(os.pardir, 'data', 'premier_league_2021-22.csv')
path

In [None]:
df_premier_league = pd.read_csv(path)

In [None]:
df_premier_league.head()

Reading the csv file using the default option may result in unexpected columns. What has happened here? Check out the exact content of the csv file.

In [None]:
df_premier_league = pd.read_csv(path, index_col = 0)

In [None]:
df_premier_league.head()

In [None]:
df_premier_league.info()

#### Build a regression model of the relationship between the *difference* in player remuneration and the *difference* in betting odds of the two teams.

Interpret the results. 
- What is the interpretation of the $\beta_0$ parameter? Does it have an actual meaning?
- How about $\beta_1$? Is it significantly different from zero? If not, how would you modify your model so that it makes more sense, if that is possible at all?

In [None]:
regression = smf.ols(formula = 'ODDS_DIFF ~ HomeTeam_Excess_Weekly_Pay', data = df_premier_league).fit(cov_type="HC0")

In [None]:
print(regression.summary())

In [None]:
regression.params

In [None]:
regression.params.HomeTeam_Excess_Weekly_Pay

In [None]:
print(f'{regression.params.HomeTeam_Excess_Weekly_Pay:.8f}')

In [None]:
df_premier_league[['ODDS_DIFF', 'HomeTeam_Weekly_Pay']].describe().style.format({'ODDS_DIFF': '{:.2f}', 'HomeTeam_Weekly_Pay': '{:,.2f}'})

In [None]:
df_premier_league['Home_Excess_10Kpounds'] = df_premier_league.HomeTeam_Excess_Weekly_Pay / 10_000

In [None]:
regression = smf.ols(formula = 'ODDS_DIFF ~ Home_Excess_10Kpounds', data = df_premier_league).fit(cov_type="HC0")

In [None]:
print(regression.summary())
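Dividing the explanatory variable by 10,000 changes only the units of the slope: the coefficient becomes 10,000 times larger, while its t-value and the fit of the model stay the same. The sketch below checks this on simulated data with classical standard errors. The helper `slope_and_t` and all numbers are hypothetical and not taken from the Premier League file.

```python
import numpy as np

def slope_and_t(x, y):
    """Return the OLS slope and its classical t-value for a simple regression."""
    n = len(x)
    x_dev = x - x.mean()
    sxx = x_dev @ x_dev
    slope = (x_dev @ (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    se = np.sqrt((resid @ resid) / (n - 2) / sxx)
    return slope, slope / se

# Simulated data (hypothetical): pay difference in pounds, odds difference as outcome
rng = np.random.default_rng(2)
pay_diff = rng.normal(0, 50_000, 300)
odds_diff = -0.00002 * pay_diff + rng.normal(0, 1, 300)

slope_raw, t_raw = slope_and_t(pay_diff, odds_diff)
slope_10k, t_10k = slope_and_t(pay_diff / 10_000, odds_diff)

print(f'{slope_raw:.8f}  {slope_10k:.4f}')
print(t_raw, t_10k)
```

The rescaled slope reads as the change in the outcome per 10,000 pounds, which is easier to interpret than a coefficient with many leading zeros.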

In [None]:
import this