# Linear and Logistic Regression

In [None]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
%matplotlib

Regression models are fitted to training data and can then be used for:

* Prediction
* Classification

## Stock Market

In [None]:
smarket = pandas.read_csv('../datasets/Smarket.csv', index_col=0, parse_dates=True)
smarket.head()

Here, *Lag1* through *Lag5* are the percentage returns for the five previous days.

**Prediction**: Will the index increase or decrease, given the percentage changes in the index over the past five days?

## Advertising

In [None]:
advertising = pandas.read_csv('../datasets/Advertising.csv', index_col=0)
advertising.head()

For a single **product**, this dataset provides the following for each of 200 markets:
* Advertising budgets for TV, radio, and newspaper (in thousands of dollars)
* The sales (in thousands of units)

We wish to understand the association between advertising and sales in order to control advertising expenditure in each of the three media.

**Input variables** ($X$) :
* $X_1$ : TV budget
* $X_2$ : Radio budget
* $X_3$ : Newspaper budget

**Output variable** :
* $Y$ : Sales

The relationship between $Y$ and $X$ is modeled as:

$$
Y = f(X) + \epsilon
$$

where $f$ is an unknown function and $\epsilon$ is a random error term that is independent of $X$ and has mean zero (a nonzero mean could simply be absorbed into $f$). This error cannot be reduced to zero, no matter how well $f$ is estimated. For example, it may depend on unmeasured variables that would be useful for predicting the value of $Y$.
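The irreducible nature of $\epsilon$ can be illustrated with a small simulation. This is only a sketch: the function `f`, the noise level `sigma`, and the sample size are made-up values chosen for illustration and do not come from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return 2.0 * x + 1.0

n = 100_000
sigma = 0.5                              # assumed standard deviation of the error term
X = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)     # mean-zero noise, drawn independently of X
Y = f(X) + eps

# Even when predicting with the true f, the mean squared error stays near sigma**2
irreducible_mse = np.mean((Y - f(X)) ** 2)
print(irreducible_mse, sigma ** 2)
```

Under these assumptions, no estimate of $f$ can push the expected squared error below the variance of $\epsilon$.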

In [None]:
advertising.plot(kind='scatter', color='Blue', x='TV', y='sales')

In [None]:
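# Ordinary least squares fit of sales against the TV budget
X = advertising['TV']
Y = advertising['sales']
# For deg=1, polyfit returns [slope, intercept]
regression = np.polyfit(X, Y, deg=1)

# Scatter plot of the data with the fitted line overlaid in red
advertising.plot(kind='scatter', color='Blue', x='TV', y='sales')

plt.plot(X, regression[0]*X + regression[1], 'r')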

## Linear Regression

Given a collection of $n$ points, find the line that best approximates, or fits, the points.

In [None]:
# Evenly spaced x values spanning the range of a random normal sample
x1 = np.random.normal(size=200)
x = np.linspace(x1.min()-1, x1.max()+1, 100)
# Noisy targets scattered around the line y = 3x
y = 3*(np.random.normal(0, 1, 100)+x)

fig, ax = plt.subplots()
ax.scatter(x, y, color='red')

# Find the regression line; for degree 1, polyfit returns [slope, intercept]
regression = np.polyfit(x, y, 1)
print(regression)
# Extend the x range so the line spans the whole plot
longerX = np.append(x, [5, -5])
ax.plot(longerX, regression[0]*longerX + regression[1], color='black', linewidth=1.5)
ax.set_xlim(-4, 4.6)

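The regression line has the form:

$$
f(x) = mx + p
$$

where $m$ is the slope and $p$ is the intercept. The residual error of the point $(x_i, y_i)$ is defined as: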

$$
r_i = y_i - f(x_i)
$$
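As a minimal sketch of this definition, the snippet below computes the residuals of a few points with respect to a fixed line. The line $f(x) = 2x + 3$ and the five points are illustrative assumptions, not fitted values.

```python
import numpy as np

def f(x):
    # Assumed line with slope m = 2 and intercept p = 3
    return 2 * x + 3

# A handful of made-up points (x_i, y_i)
x = np.array([1, 2, 3, 4, 5])
y = np.array([7, 5, 11, 10, 15])

# r_i = y_i - f(x_i): positive above the line, negative below it
residuals = y - f(x)
print(residuals)
```

A positive residual means the point lies above the line, and a negative one means it lies below.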

In [None]:
# The line f(x) = 2x + 3
x = np.linspace(0, 10, 100)
y = 2*x+3
# Offset that keeps each dashed segment from running through its marker
s = 0.4

fig, ax = plt.subplots()
ax.plot(x, y)
# Sample points scattered around the line
ax.plot([1, 1, 2, 3, 3, 4, 4, 5, 6, 7], [7, 3, 12, 5, 11, 15, 10, 15, 12, 19], 'yo', markersize=7)
# Dashed vertical segments from each point to the line: the residuals
ax.plot([1, 1], [7-s, 5], 'b-', [1, 1], [3+s, 5], 'b-',
        [2, 2], [12-s, 7], 'b-', [3, 3], [5+s, 9], 'b-',
        [3, 3], [11-s, 9], 'b-', [4, 4], [15-s, 11], 'b-',
        [4, 4], [10+s, 11], 'b-', [5, 5], [15-s, 13], 'b-',
        [6, 6], [12+s+0.1, 15], 'b-', [7, 7], [19-s, 17], 'b-', linestyle='dashed')

We consider a set of $n$ points $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, where
$(x_{i1}, x_{i2}, \ldots, x_{i(m-1)})$ is the feature vector and $x_{im}$ is
the **target variable**.

**Least squares regression** finds the coefficients $w_0, \ldots, w_{m-1}$ that minimize the sum of squared residuals:

$$
\sum_{i=1}^n (y_i - f(x_i))^2, \quad \text{where}\ y_i = x_{im}\ \text{and}\ f(x) = w_0 + \sum_{j=1}^{m-1}w_jx_j
$$
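The objective can be evaluated directly for any candidate coefficient vector. The sketch below uses hypothetical helper names (`predict`, `sum_squared_residuals`) and a tiny made-up dataset that lies exactly on the line $2x + 3$, so one candidate gives zero loss and the other does not.

```python
import numpy as np

def predict(w, features):
    # f(x) = w_0 + sum_j w_j * x_j, applied to every row of `features`
    return w[0] + features @ w[1:]

def sum_squared_residuals(w, features, y):
    # Sum over all points of (y_i - f(x_i))^2
    r = y - predict(w, features)
    return float(np.sum(r ** 2))

# Made-up data with a single feature; targets are exactly 2x + 3
features = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])

loss_exact = sum_squared_residuals(np.array([3.0, 2.0]), features, y)  # w_0 = 3, w_1 = 2
loss_off = sum_squared_residuals(np.array([0.0, 3.0]), features, y)    # w_0 = 0, w_1 = 3
print(loss_exact, loss_off)
```

Least squares picks the candidate with the smallest value of this quantity.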

The residual values can be collected in the vector:

$$
(b - A \cdot w),
$$

where $b$, $A$, and $w$ are defined as follows:

$$
b = \left[
\begin{array}{c}
  x_{1m}\\
  x_{2m}\\
  \vdots \\
  x_{im}\\
  \vdots \\
  x_{nm}
\end{array}
\right], A = \left[
\begin{array}{ccccccc}
  1 & x_{11} & x_{12} & \ldots & x_{1j} & \ldots & x_{1(m-1)} \\
  1 & x_{21} & x_{22} & \ldots & x_{2j} & \ldots & x_{2(m-1)} \\
  \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
  1 & x_{i1} & x_{i2} & \ldots & x_{ij} & \ldots & x_{i(m-1)} \\
  \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
  1 & x_{n1} & x_{n2} & \ldots & x_{nj} & \ldots & x_{n(m-1)}
\end{array}
\right], w = \left[
\begin{array}{c}
  w_0\\
  w_1\\
  \vdots \\
  w_j \\
  \vdots \\
  w_{(m-1)}
\end{array}
\right].
$$
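As a rough sketch of these definitions, the snippet below builds $A$ and $b$ from an array whose rows are the points $x_i$ and whose last column is the target. The array `points` and the candidate vector `w` are made-up values for illustration.

```python
import numpy as np

# Each row is one point x_i; the last column is the target x_im (here m = 3)
points = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 0.0, 7.0],
    [3.0, 1.0, 12.0],
    [4.0, 3.0, 20.0],
])

n = points.shape[0]
b = points[:, -1]                                    # target values
A = np.column_stack([np.ones(n), points[:, :-1]])    # leading column of ones for w_0

w = np.array([1.0, 2.0, 3.0])    # arbitrary candidate coefficients (w_0, w_1, w_2)
residuals = b - A @ w            # the vector (b - A . w)
print(residuals)
```

The column of ones lets the intercept $w_0$ be handled like any other coefficient.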

The vector $w$ that leads to the best-fitting line is given by:

$$
w = (A^T A)^{-1}A^T b
$$
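A minimal sketch of this closed form on synthetic data follows. The coefficients in `true_w` and the noise level are assumptions chosen for illustration. The code solves the linear system $(A^T A)\,w = A^T b$ with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the more numerically stable choice, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
features = rng.normal(size=(n, 2))
true_w = np.array([1.5, -2.0, 0.5])      # hypothetical w_0, w_1, w_2

A = np.column_stack([np.ones(n), features])
b = A @ true_w + rng.normal(0.0, 0.1, size=n)    # targets with a little noise

# Solve (A^T A) w = A^T b, which is equivalent to w = (A^T A)^{-1} A^T b
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both routes should agree up to floating-point error, and both should land close to the assumed `true_w`.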

The right-hand side of this equation comprises the following components:

* $A^T A$ : the *covariance matrix* of the features in the data matrix (strictly, it is proportional to the covariance matrix when the features are mean-centered)
* $A^T b$ : the dot products between the features and the target values, which measure how correlated each feature is with the targets
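The covariance interpretation can be checked numerically. The sketch below assumes mean-centered features without the column of ones, and divides by $n$ to match `np.cov` with `bias=True`. The synthetic data is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
features = rng.normal(size=(n, 3))
features[:, 1] += 0.8 * features[:, 0]           # make two features correlated
target = 2.0 * features[:, 0] + rng.normal(size=n)

# Center the features and the target
centered = features - features.mean(axis=0)
target_centered = target - target.mean()

# (A^T A) / n on centered features is the covariance matrix of the features
gram = centered.T @ centered / n
cov = np.cov(features, rowvar=False, bias=True)

# (A^T b) / n on centered data is the covariance of each feature with the target
feature_target = centered.T @ target_centered / n
cov_with_target = np.array(
    [np.cov(features[:, j], target, bias=True)[0, 1] for j in range(3)]
)
print(np.allclose(gram, cov), np.allclose(feature_target, cov_with_target))
```

Without centering, $A^T A$ and $A^T b$ also carry the feature and target means, so the correspondence holds only up to those terms.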

## Dealing with outliers

## Fitting Non-Linear Functions

## Scaling

## Correlated Features

## Linear Regression as a Parameter Fitting Problem

## Regularization

## Classification

## Logistic Regression

## References

* [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
* **The Data Science Design Manual**, by Steven Skiena, 2017, Springer
* Python notebooks available at [http://data-manual.com/data](http://data-manual.com/data)
* Lecture slides available at [http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/](http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/)