# Linear Regression with PyTorch

## Simple Linear Regression Model with sklearn

In [1]:
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load data
cols = ["CRIM",
    "ZN",
    "INDUS",
    "CHAS",
    "NOX",
    "RM",
    "AGE",
    "DIS",
    "RAD",
    "TAX",
    "PTRATIO",
    "B",
    "LSTAT",
    "MEDV",
]
# https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/housing/housing.data
df = pd.read_csv(
    "../data/raw/housing.data.txt",
    delimiter=r"\s+",
    names=cols,
)
df = df.dropna()
X = df.drop("MEDV", axis=1)
y = df[["MEDV"]]

# 2. Split Train and Test Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

# 3. Fit model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# 4. Predict
y_pred = reg.predict(X_test)

# 5. Measure performance
print(f"MSE: {mean_squared_error(y_test, y_pred)}")

MSE: 20.86929218377065


## Components
### 1. Prediction
### 2. Training

# Prediction
## Mathematical formula
$$
\begin{align}
y=\theta^Tx
\end{align}
$$
$$
\begin{align}
h_{\theta}(x)=\theta^Tx
\end{align}
$$
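- notations
    - $\theta$: weight vector (D x 1 dim.)
    - **x**: input vector for a single row (D x 1 dim.)
    - D: the number of features + 1
- Note
    - The linear regression equation is often written with the intercept as a separate term, but here $\theta_0$ is treated as the intercept (with $x_0 = 1$) to simplify the derivations that follow.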
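
To make the intercept convention concrete, here is a minimal sketch showing that prepending a constant 1 to the input gives the same prediction as the explicit-intercept form. The values of `w`, `b`, and `features` are hypothetical toy numbers, not taken from the housing data.

```python
import numpy as np

# Hypothetical toy values (not from the housing data)
w = np.array([2.0, -1.0, 0.5])         # feature weights
b = 4.0                                # intercept
features = np.array([1.0, 3.0, 2.0])   # one input row

# Explicit-intercept form: w^T x + b
explicit = w.dot(features) + b

# Absorbed form: theta = [b, w] and x = [1, features], so theta_0 is the intercept
theta = np.concatenate([[b], w])
x = np.concatenate([[1.0], features])
absorbed = theta.dot(x)

print(explicit, absorbed)  # both forms give the same number
```

Folding the intercept into $\theta$ is what lets the later cost and gradient formulas be written purely in terms of $X\theta$.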

In [66]:
import numpy as np
N = X_train.shape[0]
D = X_train.shape[1]+1
theta = np.random.normal(0, 1, D).reshape(-1,1)
X = np.concatenate([np.ones(X_train.shape[0]).reshape(-1,1), X_train], axis=1) # Add bias term
x = X_train.values[0]
x = np.concatenate([[1], x]).reshape(-1,1) # add 1 for intercept 

# using for loop: accumulate theta_i * x_i over all D terms
prediction = 0
for i in range(D):
    prediction += theta[i][0]*x[i][0]

# vectorized
prediction = theta.T.dot(x)

# Predict multiple data
predictions = X.dot(theta)

def predict(X, theta):
    X = np.concatenate([np.ones(X.shape[0]).reshape(-1,1), X], axis=1) # Add bias term
    y_hat = X.dot(theta)
    
    return y_hat.flatten()

# Training
## Cost function(= Loss function)
### Cost function of linear regression: Mean Squared Error

$$
\begin{align}
J(\theta) = \frac{1}{N}\sum_{i=1}^N(h_{\theta}(x^{(i)})-y^{(i)})^2
\end{align}
$$
$$
\begin{align}
J(\theta) = \frac{1}{N}\sum_{i=1}^N(\theta^Tx^{(i)}-y^{(i)})^2
\end{align}
$$
$$
\begin{align}
J(\theta) = \frac{1}{N}(X\theta-y)^T(X\theta-y)
\end{align}
$$
- notations
    - **$\theta$**: weight vector (D x 1 dim.)
    - **x**: input vector for a single row (D x 1 dim.)
    - **X**: input matrix (N x D dim.)
    - **y**: output vector (N x 1 dim.)
    - D: the number of features + 1
- The three expressions above are all the same formula.
- Find the **$\theta$** that minimizes the MSE!
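
As a quick numerical check, the summation form and the matrix form of the cost can be compared on random data. This is only a sketch: the shapes and the random seed are arbitrary assumptions, and the first column of `X` is the constant 1 for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3  # D counts the intercept column as well

# Input matrix with a leading column of ones, random weights and targets
X = np.concatenate([np.ones((N, 1)), rng.normal(size=(N, D - 1))], axis=1)
theta = rng.normal(size=(D, 1))
y = rng.normal(size=(N, 1))

# Summation form: (1/N) * sum_i (theta^T x_i - y_i)^2
j_sum = sum(
    (theta.T.dot(X[i].reshape(-1, 1)).item() - y[i, 0]) ** 2 for i in range(N)
) / N

# Matrix form: (1/N) * (X theta - y)^T (X theta - y)
residual = X.dot(theta) - y
j_matrix = residual.T.dot(residual).item() / N

print(np.isclose(j_sum, j_matrix))
```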

In [7]:
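# cost function: mean squared error
def mse(y_hat, y):
    N = len(y)
    # np.sum reduces arrays of any shape to a scalar, unlike the built-in sum
    cost = np.sum((y_hat - y) ** 2) / N
    return cost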

## Optimization Method
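- What process is needed to minimize the cost function defined above?
- For linear regression, the solution can be found analytically using differentiation!
- In general, however, a solution cannot be obtained by analytical methods (e.g., the method used to obtain the coin-toss probability, differentiation, etc.).
- A representative optimization method: **Gradient descent**
    - Mathematical Formula
$$
\begin{align}
\theta \leftarrow \theta - \eta{\nabla}_{\theta}J
\end{align}
$$
$$
\begin{align}
{\nabla}_{\theta}J=\frac{1}{N}\sum_{i=1}^Nx^{(i)}(h_{\theta}(x^{(i)})-y^{(i)})
\end{align}
$$
$$
\begin{align}
{\nabla}_{\theta}J=\frac{1}{N}X^T(X\theta-y)
\end{align}
$$
<br>
$$
h_{\theta}(x^{(i)}) = \theta_0 + \theta_1x_1 + \theta_2x_2+\dots+ \theta_{D-1}x_{D-1}
$$

- notations
    - $\eta$ : learning rate
    - The constant factor 2 that arises when differentiating the squared error is omitted from the gradient; it only rescales $\eta$.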
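
The update rule can be tried on a small synthetic problem. The sketch below is an illustration under assumed settings, not the notebook's training run: the data, `true_theta`, the learning rate `eta`, and the helper names `cost` and `gradient` are all hypothetical.

```python
import numpy as np

def cost(X, y, theta):
    # Mean squared error: (1/N) (X theta - y)^T (X theta - y)
    r = X.dot(theta) - y
    return r.T.dot(r).item() / len(y)

def gradient(X, y, theta):
    # (1/N) X^T (X theta - y); the factor 2 is absorbed into the learning rate
    return X.T.dot(X.dot(theta) - y) / len(y)

# Synthetic data with a known weight vector (first column of X is the intercept)
rng = np.random.default_rng(0)
N = 100
X = np.concatenate([np.ones((N, 1)), rng.normal(size=(N, 2))], axis=1)
true_theta = np.array([[1.0], [2.0], [-3.0]])
y = X.dot(true_theta) + 0.1 * rng.normal(size=(N, 1))

theta = np.zeros((3, 1))
eta = 0.1  # learning rate
costs = []
for _ in range(500):
    costs.append(cost(X, y, theta))
    theta = theta - eta * gradient(X, y, theta)  # theta <- theta - eta * grad J

print(costs[0], costs[-1])
print(theta.flatten())
```

With well-scaled features like these, a few hundred iterations should bring `theta` close to `true_theta`; poorly scaled features typically need a smaller learning rate and many more steps.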
 
<img src="../figures/GradientDescentGIF.gif" alt="drawing" width="600"/>
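
Gradient descent is not the only route for linear regression. Setting the gradient $X^T(X\theta-y)$ to zero gives the normal equation $X^TX\theta = X^Ty$, which can be solved directly. The sketch below uses hypothetical synthetic data and assumes $X^TX$ is invertible.

```python
import numpy as np

# Synthetic data with a known weight vector (first column of X is the intercept)
rng = np.random.default_rng(1)
N = 200
X = np.concatenate([np.ones((N, 1)), rng.normal(size=(N, 3))], axis=1)
true_theta = np.array([[0.5], [1.0], [-2.0], [3.0]])
y = X.dot(true_theta) + 0.05 * rng.normal(size=(N, 1))

# Normal equation: solve (X^T X) theta = X^T y without forming an explicit inverse
theta_closed = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# np.linalg.lstsq solves the same least-squares problem and also copes with
# rank-deficient X, where X^T X would be singular
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_closed, theta_lstsq))
```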

--------------------

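## A bit deeper

- "Learning" in machine learning simply means finding the model's weights ($\theta$)
    - As an analogy, whenever a new memory is formed, the strength of the connections between synapses in the brain changes!
- From this perspective, a machine learning problem is simply a matter of finding the weights that best explain the given data (X, y).
- The most widely used method for finding such weights is Maximum Likelihood Estimation (MLE).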
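
To connect MLE to the MSE cost used earlier, here is a short derivation sketch. It rests on an assumption not stated in the notes above: the targets are generated by the linear model plus i.i.d. Gaussian noise with a fixed variance $\sigma^2$.

$$
\begin{align}
y^{(i)} &= \theta^Tx^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) \\
\log L(\theta) &= \sum_{i=1}^N \log p(y^{(i)} \mid x^{(i)}; \theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N(\theta^Tx^{(i)}-y^{(i)})^2 \\
\arg\max_{\theta} \log L(\theta) &= \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^N(\theta^Tx^{(i)}-y^{(i)})^2 = \arg\min_{\theta} J(\theta)
\end{align}
$$

The first term of the log-likelihood does not depend on $\theta$, so under this noise assumption maximizing the likelihood and minimizing the MSE select the same weights.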

### Bayes' theorem
![Bayes' theorem](../figures/baise_theorem.png)

### Likelihood?
<img src="../figures/likelihood2.png" alt="drawing" width="600"/>

In [197]:
# Define the gradient descent function
def gradientDescent(X, y, theta, alpha, N, numIterations, verbose=1):
    X_tmp = X.copy()
    # Add bias term (column of ones)
    X_tmp = np.concatenate([np.ones(X_tmp.shape[0]).reshape(-1,1), X_tmp], axis=1)
    # Scale inputs by 1/100 for numerical convenience (a fixed rescaling, not true standardization)
    X_tmp = X_tmp/100
    for i in range(0, numIterations):
        # Predict
        hypothesis = X_tmp.dot(theta)
        loss = hypothesis.reshape(-1,1) - y.reshape(-1,1)
        # Average cost per example (MSE). The constant factor 2 from
        # differentiation is left out of the gradient below; it only rescales alpha.
        cost = np.sum(loss ** 2) / N
        if verbose==1:
            if(i%10000==0):
                print("Iteration %d | Cost: %f" % (i, cost))
        # Average gradient per example
        gradient = X_tmp.T.dot(loss) / N
        # Update
        theta = theta - alpha * gradient
    return theta

In [202]:
# Training
X = X_train.values
y = y_train.values.flatten()
N, D = np.shape(X)
numIterations = 30000000
alpha = 0.05
theta = np.random.normal(0, 1, D+1).reshape(-1,1) # Initialize theta
theta = gradientDescent(X, y, theta, alpha, N, numIterations, verbose=0)
# print(theta)

In [203]:
def predict(X, theta):
    X = np.concatenate([np.ones(X.shape[0]).reshape(-1,1), X], axis=1) # Add bias term
    X = X/100 # Apply the same 1/100 scaling used during training
    y_hat = X.dot(theta)

    return y_hat.flatten()
# Evaluate on the test set
y_pred = predict(X_test.values, theta)
print(f"MSE: {mean_squared_error(y_test, y_pred)}")

MSE: 22.758627549197097
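
The `X/100` step above is a fixed rescaling. A common alternative, sketched here as an assumption rather than as what this notebook does, is z-score standardization using statistics from the training split only. The helper name `standardize` and the synthetic arrays are hypothetical.

```python
import numpy as np

def standardize(X_train, X_test):
    # Compute mean and std on the training split only, then apply to both splits
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in for feature matrices (not the housing data)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

X_tr_s, X_te_s = standardize(X_tr, X_te)
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
```

The bias column of ones would be added after this step, so that it is not standardized away. Features on a comparable scale typically let gradient descent converge in far fewer iterations.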



# References
likelihood1: https://jjangjjong.tistory.com/41  
likelihood2: https://angeloyeo.github.io/2020/07/17/MLE.html  
cost function: https://computer-nerd.tistory.com/5  
Deriving Machine Learning Cost Functions using Maximum Likelihood Estimation: https://allenkunle.me/deriving-ml-cost-functions-part1  
Linear Regression Normality: https://stats.stackexchange.com/questions/327427/how-is-y-normally-distributed-in-linear-regression  
gradient descent: https://mccormickml.com/2014/03/04/gradient-descent-derivation/  
notations: https://humanunsupervised.github.io/humanunsupervised.com/topics/L2-linear-regression-multivariate.html