# Linear Regression using the Normal Equation

In [6]:
import pandas as pd
import numpy as np
from matplotlib.pyplot import plot

## Load Dataset

In [7]:
teams = pd.read_csv("teams.csv")

In [10]:
teams.head(10)

Unnamed: 0,team,year,athletes,events,age,height,weight,prev_medals,medals
0,AFG,1964,8,8,22.0,161.0,64.2,0.0,0
1,AFG,1968,5,5,23.2,170.2,70.0,0.0,0
2,AFG,1972,8,8,29.0,168.3,63.8,0.0,0
3,AFG,1980,11,11,23.6,168.4,63.2,0.0,0
4,AFG,2004,5,5,18.6,170.8,64.8,0.0,0
5,AFG,2008,4,4,22.5,179.2,62.8,0.0,1
6,AFG,2012,6,6,24.8,171.7,60.8,1.0,1
7,AFG,2016,3,3,24.7,173.7,74.0,1.0,0
8,AHO,1964,4,4,28.5,171.2,69.4,0.0,0
9,AHO,1968,5,4,31.0,173.2,67.8,0.0,0


**Independent Variable Columns:**
- athletes 
- prev_medals

**Dependent Variable Columns:**
- medals

The goal is to predict how many medals a country will win, based on its number of athletes and the medals it won previously.

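**Equation**

${y} = {b}_{0} + {b}_{1}{x}_{1} + {b}_{2}{x}_{2} + {e}$

- ${y}$: actual medals
- ${b}_{0}$: y-intercept
- ${b}_{1}$, ${b}_{2}$: coefficients (slopes) of the predictors
- ${x}_{1}$, ${x}_{2}$: predictors (athletes and prev_medals)
- ${e}$: error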

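Stacking the observations into a matrix ${X}$ (one column per predictor, plus a column of ones for the intercept) and collecting the coefficients in a vector ${B}$, the least-squares solution for ${B}$ is given by the normal equation: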

${B} = ({X}^{T} {X})^{-1} {X}^{T} {y}$
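As a quick sanity check of the formula, the sketch below applies it to a small synthetic dataset (not the teams data). The arrays `x1`, `x2` and the coefficients 1, 2 and 3 are made up for illustration; because the targets are generated without noise, the normal equation should recover those coefficients.

```python
import numpy as np

# Hypothetical toy data: two predictors, five observations
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])

# Targets generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
y_toy = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept, then the predictors
X_toy = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equation: B = (X^T X)^-1 X^T y
B_toy = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

print(B_toy)  # approximately [1. 2. 3.]
```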

In [16]:
# Independent Variables
X = teams[["athletes", "prev_medals"]].copy() # .copy() avoids modifying the original teams dataframe

In [17]:
# Dependent Variable
y = teams[["medals"]].copy()

In [18]:
# Add a column of ones so the model can learn b0, the y-intercept
X["intercept"] = 1

In [20]:
# Reorder the columns so the intercept comes first
X = X[["intercept", "athletes", "prev_medals"]] 

In [21]:
X.head()

Unnamed: 0,intercept,athletes,prev_medals
0,1,8,0.0
1,1,5,0.0
2,1,8,0.0
3,1,11,0.0
4,1,5,0.0


In [22]:
# Create the transpose matrix
X_T = X.T

In [26]:
X_T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
intercept,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
athletes,8.0,5.0,8.0,11.0,5.0,4.0,6.0,3.0,4.0,5.0,...,52.0,20.0,47.0,28.0,21.0,26.0,14.0,16.0,9.0,31.0
prev_medals,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,15.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,0.0


In [29]:
# Calculate the Coefficients
B = np.linalg.inv(X_T @ X) @ X_T @ y # normal equation: invert (X^T X), then multiply by X^T and y

- **np.linalg.inv:** computes the matrix inverse
- **@:** Python's matrix multiplication operator
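Explicit inversion works for this small problem. A commonly used alternative is to solve the linear system directly with `np.linalg.solve`, or to call `np.linalg.lstsq`, which avoids forming the inverse. The sketch below compares the three approaches on randomly generated data; the data, the seed and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical random data: 50 observations, intercept column plus two predictors
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)

# 1) Explicit inverse, as in the cell above
B_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# 2) Solve (X^T X) B = X^T y without forming the inverse
B_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# 3) Least-squares solver working directly on X and y
B_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# All three should agree up to floating-point error
print(np.allclose(B_inv, B_solve), np.allclose(B_inv, B_lstsq))
```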

In [31]:
B

Unnamed: 0,medals
0,-1.961889
1,0.071112
2,0.734137


- 0: ${b}_{0}$ (intercept)
- 1: ${b}_{1}$ (athletes) \
  *each additional athlete is expected to add 0.071112 medals, holding prev_medals fixed*
- 2: ${b}_{2}$ (prev_medals) \
  *each medal won at the previous Olympics is expected to add 0.734137 medals, holding athletes fixed*
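To make the interpretation concrete, here is a small worked prediction using the coefficients printed above. The team with 100 athletes and 10 previous medals is hypothetical and chosen only for illustration.

```python
# Coefficients copied from the output of B above
b0, b1, b2 = -1.961889, 0.071112, 0.734137

# Hypothetical team (values made up for illustration)
athletes = 100
prev_medals = 10

# y_hat = b0 + b1 * athletes + b2 * prev_medals
predicted_medals = b0 + b1 * athletes + b2 * prev_medals

print(round(predicted_medals, 2))  # 12.49
```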

In [33]:
# Label the rows of B with the column names of X
B.index = X.columns

In [12]:
B

Unnamed: 0,medals
intercept,-1.961889
athletes,0.071112
prev_medals,0.734137


**Calculate the predictions**

$\hat{y} = {X}{B}$

In [34]:
# Calculate the predictions
predictions = X @ B

In [14]:
predictions

Unnamed: 0,medals
0,-1.392992
1,-1.606329
2,-1.392992
3,-1.179656
4,-1.606329
...,...
2009,-0.112974
2010,-0.966319
2011,1.378315
2012,1.614667


**Sum of Squared Residuals**\
Measures the total squared distance between the **actual values** *(observed values)* and the values **predicted** by the model.\
Linear regression chooses the coefficients that minimize the SSR, which gives the best fit through the data.

${SSR} = \Sigma({y} - \hat{y})^{2}$ 
- ${y}$: actual values
- $\hat{y}$: predicted values


**Total Sum of Squares**\
Measures how far the data is spread around its mean.

${SST} = \Sigma({y} - \bar{y})^{2}$ 
- ${y}$: actual values
- $\bar{y}$: mean of the data points

**R Squared**

${r}^{2} = 1 - \frac{SSR}{SST}$

In [35]:
#  SSR (Sum of Squared Residuals)
SSR = ((y - predictions)**2).sum()

In [36]:
# SST (Total Sum of Squares)
SST = ((y - y.mean()) ** 2).sum()

In [37]:
# R^2 
R2 = 1 - (SSR/SST)

In [38]:
R2

medals    0.872329
dtype: float64
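The same computation can be checked by hand on a tiny example. The sketch below wraps SSR, SST and $r^2$ in a helper; the function name `r_squared` and the four data points are hypothetical and not part of the original notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SSR/SST for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ssr / sst

# Hypothetical values: SSR = 0.10, SST = 5.0, so r^2 = 1 - 0.02
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2_toy = r_squared(y_true, y_pred)
print(round(r2_toy, 4))  # 0.98
```

In the notebook, passing `y["medals"]` and `predictions["medals"]` to such a helper should return a single float instead of a one-element Series.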

In [39]:
# Fit the same model with scikit-learn's LinearRegression for comparison
from sklearn.linear_model import LinearRegression

In [40]:
lr = LinearRegression()

In [41]:
lr.fit(teams[["athletes", "prev_medals"]], teams[["medals"]])

In [42]:
lr.intercept_

array([-1.96188939])

In [43]:
lr.coef_

array([[0.07111214, 0.73413679]])

We got the same values as scikit-learn's linear regression model: the intercept and both coefficients match those computed with the normal equation.
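The comparison can also be made explicit instead of being read off by eye. The sketch below uses the numbers copied from the outputs above and checks them with `np.allclose`; the tolerance of `1e-6` is an assumption chosen to absorb the rounding in the displayed values.

```python
import numpy as np

# Values copied from the notebook outputs above
B_normal = np.array([-1.961889, 0.071112, 0.734137])        # normal equation
B_sklearn = np.array([-1.96188939, 0.07111214, 0.73413679])  # intercept_, coef_

# Equal up to the rounding of the printed DataFrame
same = np.allclose(B_normal, B_sklearn, rtol=0, atol=1e-6)
print(same)  # True
```

In the live notebook, a similar check could compare `B["medals"].to_numpy()` against `np.concatenate([lr.intercept_, lr.coef_.ravel()])`.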