# Linear Regression
---

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("file:/Users/kier/OneDrive/Notas/DataScience/datasets/SuperStoreOrders.csv")

In [55]:
dataset['sales'] = (np.array([int(feature.replace(",", "")) for feature in dataset['sales']]))

## [**What is it?**](http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm)
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable and the other is considered to be a dependent variable. A common oversight when implementing a linear regressor is failing to determine whether there is any causality in the relationship between the two variables; a good fit alone does not establish it. The purpose of finding this relationship is the ability to generate a prediction (or output value) based on a feature. The first thing to consider with linear regression is the equation for a line, written here for the general case of several features:
$$
f(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n + \theta_{n+1}
$$
where $\theta_0, \dots, \theta_n$ are the coefficients, $\theta_{n+1}$ is the y-intercept, $x_i$ are the features, and $f(x)$ is the prediction.

In [48]:
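def make_prediction(theta, features):
    # Univariate case: theta = [slope, intercept] and features is a pandas Series.
    if isinstance(features, pd.Series):
        return np.array([theta[0] * feature + theta[1] for feature in features])
    # Multivariate case: theta = [coefficients..., intercept] and each row holds
    # one sample's feature values. Iterate over rows, not column names.
    rows = np.asarray(features)
    return np.array([
        sum(t * x for t, x in zip(theta, row)) + theta[-1] for row in rows
    ])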

## **Common Implementation**
The most common implementation for creating a linear regressor is shifting a line until the loss (the overall difference between the line and the actual values) has been minimized. This method is called least squares regression, and the loss function is as follows:
$$
Loss=\sum_{i=0}^{n}(y_i-y_i^p)^2
$$
where $y_i$ represents the $i^{th}$ outcome and $y_i^p$ represents the $i^{th}$ prediction.

In [53]:
def get_squared_error(actual, prediction):
    # Sum of squared differences between each outcome and its prediction.
    return sum((y - yhat) ** 2 for y, yhat in zip(actual, prediction))
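
As a quick sanity check on the loss, here is a minimal sketch on made-up data. The helper name `squared_error` and the data are assumptions for illustration only; the point is that a line passing through every point has zero loss, while shifting the intercept by 1 adds 1 to the loss for every observation.

```python
import numpy as np

def squared_error(actual, prediction):
    """Sum of squared differences between observed and predicted values."""
    actual = np.asarray(actual, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return float(np.sum((actual - prediction) ** 2))

# Made-up data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

loss_exact = squared_error(y, 2.0 * x + 1.0)    # the true line: 0.0
loss_shifted = squared_error(y, 2.0 * x + 2.0)  # intercept off by 1: 4.0
```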

## [**Univariate vs Multivariate**]()
As stated above, when it comes to linear regression, it's important to choose features and observations such that you will likely find some relationship between them. This also includes cases where you have a _set_ of features as opposed to a single one, which is where the real linear algebra comes out. Given that we cannot visualize higher dimensions, we have to rely on mathematics to determine whether our regressor is operating properly when we have more than two features. Furthermore, this means that we will have more values to tweak in an effort to find the optimal fit, which is now a hyperplane rather than a line.



In [52]:
for i, (column, dtype) in enumerate(zip(dataset.columns, dataset.dtypes)):
    print(f"Column {i:2}: {column:15} {dtype}")

Column  0: order_id        object
Column  1: order_date      object
Column  2: ship_date       object
Column  3: ship_mode       object
Column  4: customer_name   object
Column  5: segment         object
Column  6: state           object
Column  7: country         object
Column  8: market          object
Column  9: region          object
Column 10: product_id      object
Column 11: category        object
Column 12: sub_category    object
Column 13: product_name    object
Column 14: sales           int64
Column 15: quantity        int64
Column 16: discount        float64
Column 17: profit          float64
Column 18: shipping_cost   float64
Column 19: order_priority  object
Column 20: year            int64


In [50]:
observations = dataset['profit']
features_univariate = dataset['sales']
features_multivariate = dataset[['sales', 'quantity', 'discount', 'category', 'country']]

In [60]:
print(f"The first 10 univariate observations:\n{features_univariate.head(10)}")
print(f"The first 10 multivariate observations:\n{features_multivariate.head(10)}")

The first 10 univariate observations:
0    408
1    120
2     66
3     45
4    114
5     55
6    314
7    276
8    912
9    667
Name: sales, dtype: int64
The first 10 multivariate observations:
   sales  quantity  discount         category      country
0    408         2       0.0  Office Supplies      Algeria
1    120         3       0.1  Office Supplies    Australia
2     66         4       0.0  Office Supplies      Hungary
3     45         3       0.5  Office Supplies       Sweden
4    114         5       0.1        Furniture    Australia
5     55         2       0.1  Office Supplies    Australia
6    314         1       0.0       Technology       Canada
7    276         1       0.1  Office Supplies    Australia
8    912         4       0.4       Technology  New Zealand
9    667         4       0.0        Furniture         Iraq


## [**The Actual Regression**]()
Now we get to actually do the regression! 
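
One possible way to carry out the fit is sketched below. It is not run on the SuperStore data: the data is synthetic, and the choice of `np.linalg.lstsq` as the solver is an assumption made for illustration. The design matrix puts the column of ones last so that the intercept is the final coefficient, matching the ordering used in $f(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: y = 3x + 5 plus a little noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 0.5, size=200)

# Design matrix: the feature column, then a column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Least squares solution: the theta that minimizes sum((y - X @ theta) ** 2).
theta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = theta

predictions = X @ theta
loss = float(np.sum((y - predictions) ** 2))
```

With more than one feature, the same call works unchanged as long as every feature column is numeric and the column of ones stays last.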

## [**Outliers and Influential Observations**](https://www.stat.cmu.edu/~larry/=stat401/lecture-20.pdf)
An outlier is a point in the dataset which lies close to the others horizontally, but far from them vertically. An influential observation, by contrast, is usually a point which lies horizontally far away from the other data. These kinds of points are important because, if left unattended, they can have a big impact on the final line generated by the linear regressor. Another way of looking at these kinds of points is that "outlier\[s\] are points with a large residual" and influential points are those with a "large impact on the regression", noting that these two conditions are not mutually exclusive.

### _Identifying Outliers_
There are three ways we can decide if a point is an outlier:
1. Look at the point's leverage
2. Look at the point's studentized residuals
3. Look at the point's Cook's statistics
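
The three diagnostics in the list can be sketched together on a small made-up dataset. The data, the variable names, and the use of internally studentized residuals are assumptions for illustration. The formulas used are the standard ones: $r_i = e_i / \left(s\sqrt{1 - h_{ii}}\right)$ for the studentized residual and $D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$ for Cook's distance, where $e_i$ is the residual, $s^2$ the residual variance estimate, and $p$ the number of fitted coefficients.

```python
import numpy as np

# Made-up data: nine points near a line, plus one point far out in x and off the line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0])

X = np.column_stack([x, np.ones_like(x)])  # feature column, then intercept column
n, p = X.shape                             # p = number of fitted coefficients

# 1. Leverage: the diagonal of the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# 2. (Internally) studentized residuals.
residuals = y - H @ y
s2 = np.sum(residuals ** 2) / (n - p)      # residual variance estimate
studentized = residuals / np.sqrt(s2 * (1.0 - leverage))

# 3. Cook's distance combines residual size and leverage.
cooks = (studentized ** 2 / p) * leverage / (1.0 - leverage)

most_influential = int(np.argmax(cooks))   # index of the point at x = 20
```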






##### [_Leverage_]()
Leverage is a measure of how far the values of an observation are from those of the other observations along the axis which represents the independent variable. It is defined through the relation:
$$
\hat{Y}=HY
$$
where $H$ is the hat matrix. This means that each element of $\hat{Y}$ (that is, each $\hat{Y}_i$) is a linear combination of the elements of $Y$, with weights taken from the $i^{th}$ row of $H$. In particular, $H_{ii}$ is the contribution of the $i^{th}$ data point to $\hat{Y}_i$. $\therefore$ we call $h_{ii} \equiv H_{ii}$ the leverage.

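Another expression for leverage, which applies to regression on a single feature and is given by [jbstatistics](https://www.youtube.com/watch?v=xc_X9GFVuVU), is as follows:
$$
h_{ii} = \frac{1}{n} + \frac{\left(X_i - \bar{X} \right)^2}{SS_{XX}}
$$
where $n$ is the number of observations, $\bar{X}$ is the mean of the feature, and $SS_{XX} = \sum_{j=1}^{n}\left(X_j - \bar{X}\right)^2$.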
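
The two expressions for leverage can be compared numerically. The sketch below uses made-up feature values and assumes the standard construction of the hat matrix, $H = X\left(X^TX\right)^{-1}X^T$, with an intercept column in $X$; in the single-feature case the diagonal of $H$ should agree with the closed-form expression.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # made-up feature values
n = len(x)

# Closed-form leverage for regression on a single feature.
ss_xx = np.sum((x - x.mean()) ** 2)
leverage_formula = 1.0 / n + (x - x.mean()) ** 2 / ss_xx

# The same quantity read off the diagonal of the hat matrix.
X = np.column_stack([x, np.ones(n)])      # feature column, then intercept column
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage_hat = np.diag(H)

same = np.allclose(leverage_formula, leverage_hat)
```

The point at $x = 10$ sits far from the mean of the feature, so it has the largest leverage of the five.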

##### [_The Hat Matrix_](https://en.wikipedia.org/wiki/Projection_matrix)
The hat matrix (sometimes called the projection matrix or influence matrix) is a matrix which maps the vector of response values to the vector of fitted values. In other words, it maps the set of observations to the set of predictions. Furthermore, this matrix describes the influence each observation has on each prediction. The leverage values can be found on the diagonal of this matrix. You can use the hat matrix in the equation:
$$
\hat{y} = \bold{H} y
$$
where $\hat{y}$ is the prediction, $y$ is the observation, and $\bold{H}$ is the hat matrix. Moreover, the element at the $i^{th}$ row and $j^{th}$ column of $\bold{H}$ is equal to the _covariance_ between the $j^{th}$ observation and the $i^{th}$ prediction divided by the variance of the former; depicted as:
$$
h_{ij} = \frac{Cov\left[\hat{y_i}, y_j \right]}{Var\left[y_j \right]}
$$
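
To make the "influence" reading concrete, here is a minimal sketch on made-up data, again assuming $\bold{H} = X\left(X^TX\right)^{-1}X^T$. It checks that $\bold{H}y$ reproduces the fitted values of an ordinary least squares fit, and that nudging observation $j$ by some amount moves prediction $i$ by $h_{ij}$ times that amount. The indices and the nudge size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=8)                # made-up feature values
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=8)  # made-up observations

X = np.column_stack([x, np.ones_like(x)])     # feature column, then intercept column
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H maps observations to fitted values, matching a least squares fit.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted_from_fit = X @ theta
fitted_from_hat = H @ y

# Influence: nudge observation j, refit, and see how prediction i moves.
i, j, delta = 0, 3, 0.5
y_nudged = y.copy()
y_nudged[j] += delta
theta_nudged, *_ = np.linalg.lstsq(X, y_nudged, rcond=None)
change_in_prediction = (X @ theta_nudged)[i] - fitted_from_fit[i]
expected_change = H[i, j] * delta
```
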
You may also recall that covariance and variance are defined as:
$$
\begin{aligned}
Cov(X,Y) &= E\left[\left(X - E[X] \right) \left(Y - E[Y] \right) \right] \\
Var(X) &= E\left[\left(X - \mu \right)^2 \right]
\end{aligned}
$$
where $E[X]$ is the expected value of $X$ (also called the mean of $X$) and $\mu = E[X]$. Please note that $Var(X)$ is sometimes denoted as $\sigma_X^2$ and $Cov(X,Y)$ is sometimes denoted as $\sigma_{XY}$. Also, $Var(X) = Cov(X, X)$.
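
These definitions translate almost directly into code. The sketch below uses made-up samples and treats them as a whole population, so the expected value becomes a plain mean (dividing by $n$); the helper name `covariance` is an assumption for illustration.

```python
import numpy as np

# Made-up samples, treated as a whole population (divide by n, not n - 1).
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

def covariance(a, b):
    """Cov(A, B) = E[(A - E[A]) (B - E[B])], with E taken as the mean."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))

cov_xy = covariance(x, y)  # 3.5
var_x = covariance(x, x)   # Var(X) = Cov(X, X) = 5.0
```

NumPy's `np.var(x)` and `np.cov(x, y, bias=True)` use the same divide-by-$n$ convention, so they return matching values.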

##### [_Properties of the Hat Matrix_](https://www.sciencedirect.com/topics/mathematics/hat-matrix)
The first property is that the elements along the diagonal of $\bold{H}$ satisfy $0 \leq \bold{H_{ii}} \leq 1$, while all other elements ($i \neq j$) satisfy $-1 < \bold{H_{ij}} < 1$.

For a model with an intercept term and a full-rank design matrix $X$ with $n$ observations and $m$ estimated coefficients (intercept included): $\sum_{i=1}^{n}\bold{H_{ii}} = m$ and $\sum_{i=1}^{n}\bold{H_{ij}} = 1$ for every $j$. $\therefore$ the mean value of the diagonal elements $\bold{H_{ii}}$ is $\frac{m}{n}$.
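
These properties are easy to check numerically. The sketch below uses random made-up features with an intercept column and assumes $\bold{H} = X\left(X^TX\right)^{-1}X^T$; the sizes `n` and `k` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3                                 # made-up sizes: 30 observations, 3 features
features = rng.normal(size=(n, k))
X = np.column_stack([features, np.ones(n)])  # the intercept column makes m = k + 1
m = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T
diag = np.diag(H)

trace_is_m = bool(np.isclose(np.trace(H), m))
columns_sum_to_one = bool(np.allclose(H.sum(axis=0), 1.0))
diag_in_unit_interval = bool(np.all((diag >= 0) & (diag <= 1)))
mean_leverage = float(diag.mean())           # should equal m / n
```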