[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek10.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week10.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week10.ipynb)

# Week 10: Regression

We'll continue learning about machine learning. We are still looking at *supervised learning*.

- What is the difference between *supervised* and *unsupervised* learning? 

- What are some examples of each?

In [None]:
import numpy as np
import pandas as pd
from numpy import random as rng
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

We are still working with `scikit-learn`, which can help us with several machine learning tasks.

![](imgs/scikitlearn1.png)

![](imgs/scikitlearn2.png)

## Linear regression

You probably already know what linear regression is, even if the name is unfamiliar.

It is used all the time in the sciences.

Linear regression = finding the line of best fit.

![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*N1-K-A43_98pYZ27fnupDA.jpeg)

We perform a linear regression when we want to analyze how some variables (usually one) *linearly* depend on other variables.

**Simplest:** we have data points in $\mathbb{R}^2$.
- We suspect the $y$-values depend, at least in part, on the $x$-values.
- We perform some linear algebra to get the line of best fit, written here with slope $\beta$ and intercept $\varepsilon$ as
$$
	y = \beta x + \varepsilon
$$
- We calculate the $r^2$-value to see how well the line fits the data.
- If it's a decent fit, we can predict a reasonable range of $y$-values for a given $x$-value.
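
The "linear algebra" step in the list above can be sketched with NumPy's least-squares solver. This is only a minimal sketch and not necessarily how `LinearRegression` works internally; the helper name `best_fit_line` and the sample points are made up for illustration.

```python
import numpy as np

def best_fit_line(x, y):
    """Return (slope, intercept, r2) of the least-squares line through the points."""
    # Design matrix: one column for x, one column of ones for the intercept
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    # r^2 = 1 - (residual sum of squares) / (total sum of squares)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical points lying exactly on the line y = 2x - 5
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2 * x_demo - 5
slope, intercept, r2 = best_fit_line(x_demo, y_demo)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r^2 = {r2:.2f}")
```

Since these points have no noise at all, the recovered slope and intercept should match the line used to generate them, and the $r^2$-value should be essentially 1.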

Let's build a simple example.

In [None]:
x = 10 * rng.rand(50) 			# Random numbers between 0 and 10
noise = rng.randn(50)			# Random noise
y = 2 * x - 5 + noise 			# Noisy but near the line 2x - 5

In [None]:
plt.xlabel("x")
plt.ylabel("y")
plt.grid()
_ = plt.scatter(x, y, color='blue', alpha=0.5, zorder=2)

There is a very *clear* trend in the data. We expect a high $r^2$-value.

We'll use `LinearRegression` to get the line of best fit in this simple example.

In [None]:
# Fit the model
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 2)			# Only need 2 points to define a line
yfit = model.predict(xfit[:, np.newaxis])
beta = model.coef_[0]					# slope
epsilon = model.intercept_				# intercept
r2 = model.score(x[:, np.newaxis], y)	# r^2-value

# Plot the model
plt.scatter(x, y, color='blue', alpha=0.5, zorder=2)
plt.xlabel("x")
plt.ylabel("y")
plt.plot(xfit, yfit, color="red", label=f"$y = {beta:.2f}x {epsilon:+.2f}$")	# '+' keeps the sign when the intercept is positive
plt.text(0.1, 13, f"$r^2 = {r2:.2f}$", fontsize=12, color="green")
plt.title("Linear Regression")
plt.grid()
plt.legend()
plt.show()

We expected a very high $r^2$-value since we produced the data to have very little noise. 

**Try it yourself:** Scale the vector `noise` above and see how it impacts the $r^2$-value.
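
One possible way to run this experiment is sketched below. The scale factors and the fixed seed are arbitrary choices for illustration, and the exact $r^2$-values depend on the random draw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rs = np.random.RandomState(0)           # fixed seed (an assumption) so runs are repeatable
x_demo = 10 * rs.rand(50)
base_noise = rs.randn(50)

scores = {}
for scale in [0.5, 1, 5, 20]:           # arbitrary scale factors
    y_demo = 2 * x_demo - 5 + scale * base_noise
    model_demo = LinearRegression().fit(x_demo[:, np.newaxis], y_demo)
    scores[scale] = model_demo.score(x_demo[:, np.newaxis], y_demo)
    print(f"scale = {scale:>4}: r^2 = {scores[scale]:.3f}")
```

The more the noise is scaled up, the less of the variation in $y$ the line can explain, so the $r^2$-value should drop.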

## Multidimensional data

Instead of looking for a line (of best fit), we will look for a hyperplane.

It's the same idea, except we have more "independent variables".

We will still have one dependent variable.
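
As a small sketch of the idea, the synthetic example below fits a plane to data with two independent variables. The coefficients 3 and -2 and the intercept 1 are made up for illustration and have nothing to do with the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rs = np.random.RandomState(1)                   # fixed seed, chosen arbitrarily
X_demo = 10 * rs.rand(200, 2)                   # 200 samples, 2 independent variables
noise_demo = 0.1 * rs.randn(200)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 1 + noise_demo

plane = LinearRegression().fit(X_demo, y_demo)
print("coefficients:", plane.coef_)             # one coefficient per column of X_demo
print("intercept:", plane.intercept_)
print("r^2:", plane.score(X_demo, y_demo))
```

With more than one column, `X_demo` is already two-dimensional, so no `np.newaxis` is needed, and `coef_` holds one slope per independent variable.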

I gathered some data related to housing from 
- centralbank.ie
- cso.ie

#### Information on the data

The central bank tells us the interest rates for fixed rate mortgages for each quarter between 2015 and a little bit before 2020. 

The CSO tells us the number of new and existing houses sold in Ireland each month and their cumulative values. They also give the median house price for all of Ireland.

Putting this all together required some serious work. 

Below is the code to gather the data from the csv files. 

This is just for your information -- you do not need to understand this code -- though *you can*!

In [None]:
# Data from https://www.centralbank.ie/statistics/data-and-analysis/statistical-data-in-csv
# Interest data
df_int = pd.read_csv(
	"data/Interest.csv", 
	encoding = "ISO-8859-2", 
	usecols=["Table B.3.1  Retail Interest Rates - Lending for House Purchase", 
		"Reporting Date", 
		"PDH - Fixed Rate - Over 3 years"
	],
	parse_dates=["Reporting Date"]
)
df_int = (df_int
	.query("`Table B.3.1  Retail Interest Rates - Lending for House Purchase` == 'Outstanding Amounts - Rates (%)'")
	.drop(columns=["Table B.3.1  Retail Interest Rates - Lending for House Purchase"])
	.rename(columns={
		"Reporting Date": "Date", 
		"PDH - Fixed Rate - Over 3 years": "Interest"
	})
)
def month_to_interest(date):
	rel_data = df_int.query(f"Date <= '{date}'")
	if rel_data.empty:
		return np.nan
	return rel_data.tail(1).Interest.values[0]

# Data from: https://data.cso.ie/table/HPM04
# So much housing data
from functools import reduce
df = pd.read_csv(
	"data/Household.csv",
	parse_dates=["Month"], 
	date_format='%Y %m'
)
df = df.rename(columns={"Statistic Label": "Label"})
dfs = [df.query(f"Label == '{x}'") for x in df.Label.unique()]
for i, df in enumerate(dfs):
	dfs[i] = (df
		.rename(columns={"VALUE": f"{df.Label.unique()[0]}"})
		.query("`Stamp Duty Event` == 'Executions'")
		.drop(columns=["Label", "UNIT", "Stamp Duty Event", "Eircode Output"])
	)
df = reduce(pd.merge, dfs)
df = (df
	.rename(columns={
		"Volume of Sales": "Volume", 
		"Median Price": "Median", 
		"Value of Sales": "Value"
	})
	.astype({"Volume": "i"})
	.query("`Type of Buyer` == 'All Buyer Types'")
	.drop(columns=["Type of Buyer", "Mean Sale Price"])
	.set_index("Month")
)
df_all = df.query("`Dwelling Status` == 'All Dwelling Statuses'")
df_new = df.query("`Dwelling Status` == 'New'")
df_exist = df.query("`Dwelling Status` == 'Existing'")
ser_int = pd.Series([month_to_interest(date) for date in df_new.index], index=df_new.index)
df = pd.DataFrame({
	"Volume_New": df_new.Volume,
	"Value_New": df_new.Value,
	"Volume_Existing": df_exist.Volume,
	"Value_Existing": df_exist.Value,
	"Interest": ser_int,
	"Median": df_all.Median
}).dropna()

# Celebrate
df.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Columns to use as independent variables (the dependent variable is 'Median')
COLS = df.columns[[0, 2, 4]]

X = df[COLS]
y = df['Median']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Intercept: {model.intercept_:.2f}")
print("Coefficients:")
for col, coef in zip(COLS, model.coef_):
	print(f"\t{col}: {coef:.2f}")
print(f"r^2-value: {model.score(X_test, y_test):.2f}")
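
The value returned by `model.score` is the $r^2$-value of the model's predictions on the data passed in, so it can also be computed directly from predictions such as `y_pred`. The sketch below checks this on synthetic data (the housing CSV files are not assumed to be available here), using `r2_score` from `sklearn.metrics`; all names ending in `_demo` or starting with `Xd_`/`yd_` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rs = np.random.RandomState(2)                   # fixed seed, chosen arbitrarily
X_demo = rs.rand(100, 3)                        # synthetic stand-in for three feature columns
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rs.randn(100)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0   # random_state makes the split repeatable
)
demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

score_from_model = demo_model.score(Xd_test, yd_test)
score_from_preds = r2_score(yd_test, yd_pred)
print(f"{score_from_model:.4f} {score_from_preds:.4f}")
```

Passing `random_state` to `train_test_split` is optional, but without it the split (and therefore the reported $r^2$-value) changes every time the cell is run.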