# Linear and Logistic Regression

In [None]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib

Linear regression can be used to build models from training data for two kinds of tasks:

* Prediction
* Classification

## Stock Market

In [None]:
smarket = pandas.read_csv('../datasets/Smarket.csv', index_col=0, parse_dates=True)
smarket.head()

*Lag1* through *Lag5* are the percentage returns for the five previous days.

**Prediction**: Will the index increase or decrease, given the percentage changes in the index over the past 5 days?

## Advertising

In [None]:
advertising = pandas.read_csv('../datasets/Advertising.csv', index_col=0)
advertising.head()

For a particular **product**, this dataset provides the following for 200 markets:
* Advertising budgets for TV, radio, and newspaper (in thousands of dollars)
* The sales (in thousands of units)

We wish to understand the association between advertising and sales in order to control advertising expenditure in each of the three media.

**Input variables** ($X$) :
* $X_1$ : TV budget
* $X_2$ : Radio budget
* $X_3$ : Newspaper budget

**Output variable** :
* $Y$ : Sales

Relationship between $Y$ and $X$ :

$$
Y = f(X) + \epsilon
$$

where $f$ is an unknown function and $\epsilon$ is a random error term that is independent of $X$ and has mean zero (a nonzero mean could simply be absorbed into $f$). This error cannot be reduced to zero: for example, it may depend on unmeasured variables that are useful for predicting the value of $Y$.
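
The irreducible nature of $\epsilon$ can be illustrated with a small simulation. The sketch below assumes a hypothetical true function `f` and Gaussian noise with standard deviation 2 (neither comes from the advertising data). Even when predicting with the exact `f`, the mean squared error stays close to the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return 3.0 * x + 1.0

noise_std = 2.0
x = rng.uniform(0, 10, size=10_000)
epsilon = rng.normal(0.0, noise_std, size=x.size)  # mean-zero error term
y = f(x) + epsilon

# Predict with the exact f: the remaining error is the noise itself
mse_true_f = np.mean((y - f(x)) ** 2)
print(mse_true_f, noise_std ** 2)
```

No choice of model can push the error below this floor, since the noise is independent of `x`.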

In [None]:
advertising.plot(kind='scatter', color='Blue', x='TV', y='sales')

In [None]:
# Ordinary Least Squares
X = advertising['TV']
Y = advertising['sales']
regression = np.polyfit(X, Y, deg=1)

advertising.plot(kind='scatter', color='Blue', x='TV', y='sales')

plt.plot(X, regression[0]*X + regression[1], 'r')

## Linear Regression

Given a collection of $n$ points, find the line that best approximates (fits) the points.

In [None]:
# Generate noisy points scattered around the line y = 3x
x1 = np.random.normal(size=200)
x = np.linspace(x1.min()-1, x1.max()+1, 100)
y = 3*(np.random.normal(0, 1, 100)+x)

fig, ax = plt.subplots()
ax.scatter(x, y , color='red')

# Find the regression line (slope and intercept)
regression = np.polyfit(x, y, 1)
print(regression)
longerX = np.append(x, [5,-5])  # extend the x range so the line spans the plot
ax.plot(longerX, regression[0]*longerX + regression[1], color='black', linewidth=1.5)
ax.set_xlim(-4, 4.6)

The regression line has the form:

$$
f(x) = mx + p
$$

where $m$ is the slope and $p$ is the intercept.

The residual error of the $i$-th point is defined as:

$$
r_i = y_i - f(x_i)
$$

In [None]:
x = np.linspace(0, 10, 100)
y = 2*x+3
# Offset between the end of each dashed segment and the center of its marker
s = 0.4

fig, ax = plt.subplots()
ax.plot(x, y)
ax.plot([1, 1, 2, 3, 3, 4, 4, 5, 6, 7],[7, 3, 12, 5, 11, 15, 10, 15, 12, 19], 'yo', markersize=7)
# Dashed vertical segments from each point to the line (the residuals)
ax.plot([1, 1], [7-s, 5], 'b', [1, 1], [3+s, 5], 'b',
        [2, 2], [12-s, 7], 'b', [3, 3], [5+s, 9], 'b',
        [3, 3], [11-s, 9], 'b', [4, 4], [15-s, 11], 'b',
        [4, 4], [10+s, 11], 'b', [5, 5], [15-s, 13], 'b',
        [6, 6], [12+s+0.1, 15], 'b', [7, 7], [19-s, 17], 'b', linestyle='dashed')
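
The dashed segments in the figure above are the residuals of the plotted points with respect to the line $f(x) = 2x + 3$. As a minimal sketch, they can be computed directly from the definition (the variable names `px`, `py` and `residuals` are illustrative):

```python
import numpy as np

# Coordinates of the points plotted in the figure
px = np.array([1, 1, 2, 3, 3, 4, 4, 5, 6, 7])
py = np.array([7, 3, 12, 5, 11, 15, 10, 15, 12, 19])

def f(x):
    # The line drawn in the figure
    return 2 * x + 3

# r_i = y_i - f(x_i): positive above the line, negative below it
residuals = py - f(px)
sum_of_squares = np.sum(residuals ** 2)

print(residuals)
print(sum_of_squares)
```

The sign of each residual tells on which side of the line the point lies; squaring removes the sign so that errors above and below the line do not cancel out.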

We consider a set of $n$ points $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, where
$(x_{i1}, x_{i2}, \ldots, x_{i(m-1)})$ is the feature vector and $x_{im}$ is
the **target variable**.

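We use **least squares regression** to find the optimal fit, that is, the weights $w_j$ that minimize the sum of squared residuals, with $y_i = x_{im}$ denoting the target value of the $i$-th point:

$$
\sum_{i=1}^n (y_i - f(x_i))^2, \quad \text{where}\ f(x) = w_0 + \sum_{j=1}^{m-1} w_j x_j
$$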

The vector of residual values can be written in matrix form as:

$$
(b - A \cdot w),
$$

where $b$, $A$, and $w$ are defined as follows:

$$
b = \left[
\begin{array}{c}
  x_{1m}\\
  x_{2m}\\
  \vdots \\
  x_{im}\\
  \vdots \\
  x_{nm}
\end{array}
\right], A = \left[
\begin{array}{ccccccc}
  1 & x_{11} & x_{12} & \ldots & x_{1j} & \ldots & x_{1(m-1)} \\
  1 & x_{21} & x_{22} & \ldots & x_{2j} & \ldots & x_{2(m-1)} \\
  \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
  1 & x_{i1} & x_{i2} & \ldots & x_{ij} & \ldots & x_{i(m-1)} \\
  \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
  1 & x_{n1} & x_{n2} & \ldots & x_{nj} & \ldots & x_{n(m-1)}
\end{array}
\right], w = \left[
\begin{array}{c}
  w_0\\
  w_1\\
  \vdots \\
  w_j \\
  \vdots \\
  w_{(m-1)}
\end{array}
\right].
$$

The vector $w$ that leads to the best fitting line is given by :

$$
w = (A^T A)^{-1}A^T b
$$

The right-hand side of this equation comprises the following components:

* $A^T A$ : the *covariance matrix* of the features in the data matrix (up to centering and scaling)
* $A^T b$ : the dot products between the features and the target values, which measure how correlated each feature is with the target
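
As a minimal sketch of this closed form, the snippet below builds $A$ and $b$ from synthetic data (the weights `true_w` and the noise level are assumptions made for illustration) and compares the result with NumPy's least squares solver. It solves the linear system $(A^T A) w = A^T b$ rather than computing the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative only): two features and a noisy linear target
n = 200
features = rng.normal(size=(n, 2))
true_w = np.array([1.0, 2.0, -3.0])           # w_0, w_1, w_2
A = np.column_stack((np.ones(n), features))   # column of ones for the intercept w_0
b = A @ true_w + rng.normal(0.0, 0.1, size=n)

# Normal equations: solve (A^T A) w = A^T b
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least squares solver
w_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(w_normal)
print(w_lstsq)
```

Both approaches should agree up to floating-point error as long as $A^T A$ is invertible, i.e. the columns of $A$ are linearly independent.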

## Dealing with outliers

In [None]:
fig, (ax2, ax1) = plt.subplots(ncols=2, figsize=(12,4.5))

# Right graph: data without an outlier.
# Create 15 random points
np.random.seed(7)
x1 = np.random.normal(size=10)
x = np.linspace(x1.min()-1, x1.max()+1, 15) 
y = 3*(np.random.normal(0, 1, 15)+x)
ax1.scatter(x, y, color='red')

# Find the regression line
regression = np.polyfit(x, y, deg=1)
longerX = np.append(x, [5,-5]) # this makes the regression line longer
ax1.plot(longerX, regression[0]*longerX + regression[1], color='black', linewidth=1.5)
ax1.set_xlim(-3.5, 4.5)
ax1.set_ylim(-15, 20)
# The correlation coefficient value r.
r1 = stats.pearsonr(x, y)
ax1.set_title('Correlation coefficient = {:.2}'.format(r1[0]))

# Left graph: the same 15 points plus one outlier.
# Append the outlier (4, -10) and plot all 16 points.
x1 = np.append(x, [4])
y1 = np.append(y, [-10])
ax2.scatter(x1, y1, color='red')

# Find the regression line
regression = np.polyfit(x1, y1, deg=1)
longerX = np.append(x1, [5,-5]) # this makes the regression line longer
ax2.plot(longerX, regression[0]*longerX + regression[1], color='black', linewidth=1.5)
ax2.set_xlim(-3.5, 4.5)
ax2.set_ylim(-15, 20)
# The correlation coefficient value r.
r2 = stats.pearsonr(x1, y1)
ax2.set_title('Correlation coefficient = {:.2}'.format(r2[0]))

Least squares regression is sensitive to outliers, because they have a large impact on the following objective function:

$$
\sum_{i=1}^n (y_i - f(x_i))^2
$$

For example, compare the impact of a point at distance 10 from the line with that of a point at distance 1: the first contributes $10^2 = 100$ to the sum, the second only $1^2 = 1$.

A simple way to deal with such points is to first compute a linear regression on the complete data set. Then, compare the squared residuals $r_i^2 = (y_i - f(x_i))^2$ to determine which points are outliers. Finally, compute a new linear regression without these points. However, can we be sure that these points represent errors?
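
The three steps can be sketched as follows on synthetic data. The cutoff used here (2.5 times the standard deviation of the residuals) is an arbitrary assumption, not a rule prescribed by the method, and the data are generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (illustrative only): a noisy line y = 3x plus one injected outlier
x = np.linspace(-3, 3, 15)
y = 3 * x + rng.normal(0.0, 1.0, size=x.size)
x = np.append(x, 4.0)
y = np.append(y, -10.0)

# Step 1: fit on the complete data set
slope, intercept = np.polyfit(x, y, deg=1)

# Step 2: flag points whose residual is unusually large
# (assumed cutoff: 2.5 standard deviations of the residuals)
residuals = y - (slope * x + intercept)
threshold = 2.5 * residuals.std()
keep = np.abs(residuals) <= threshold

# Step 3: refit without the flagged points
slope_clean, intercept_clean = np.polyfit(x[keep], y[keep], deg=1)

print(slope, slope_clean)
```

Whether a flagged point should really be discarded remains a judgment call: a large residual may reflect a measurement error, but it may also be a genuine observation that the model fails to explain.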

## Fitting Non-Linear Functions

How can we fit a quadratic model?

$$
y = w_0 + w_1 x + w_2 x^2
$$

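Add a new column equal to $x^2$ to the data matrix. The model remains linear in the weights $w_0$, $w_1$ and $w_2$, so least squares regression still applies.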

In [None]:
fig, ax = plt.subplots(figsize=(12,4.5))

n = 20
x = np.linspace(0, n, n) 
y = np.random.normal(0, 10, n) + 0.5 * x**2
ax.scatter(x, y, color='red')

# Fit the model
A = np.vstack((np.ones(len(x)), x, x**2)).T
w = np.linalg.lstsq(A, y, rcond=None)[0]  # rcond=None avoids a FutureWarning on older NumPy

# Predict
x = np.linspace(-5, n+5, 10*n) 
A = np.vstack((np.ones(len(x)), x, x**2)).T
ax.plot(x, np.dot(A, w), 'b')
ax.set_xlim(-5, n+5)

## Scaling

Let us consider the following prediction problem :

* $y$ : Gross domestic product (GDP, in dollars)
* $x_1$ : Population size
* $x_2$ : Literacy rate

In [None]:
cols = ['Country Name', '2016']
gdp = pandas.read_csv('../datasets/API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv',
                      index_col=0, skiprows=4, usecols=cols)
gdp.rename(columns={'2016' : 'GDP'}, inplace=True)
gdp.head()

In [None]:
literacy = pandas.read_csv('../datasets/API_SE.ADT.LITR.ZS_DS2_en_csv_v2.csv',
                           index_col=0, skiprows=4, usecols=cols)
literacy.rename(columns={'2016' : 'Literacy'}, inplace=True)
literacy.head()

In [None]:
population = pandas.read_csv('../datasets/API_SP.POP.TOTL_DS2_en_csv_v2.csv',
                             index_col=0, skiprows=4, usecols=cols)
population.rename(columns={'2016' : 'Population'}, inplace=True)
population.head()

In [None]:
df = pandas.concat([population, gdp, literacy], axis=1)
df.dropna(inplace=True)
df.head()

In [None]:
# Fit the model
A = np.vstack((np.ones(df.shape[0]), df['Population'], df['Literacy'])).T
w = np.linalg.lstsq(A, df['GDP'], rcond=None)[0]
print(w)

Issues with such models:

* Unreadable coefficients
* Numerical imprecision
* Inappropriate formulations

Scaling operations are used to address these problems.
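
One common scaling operation is z-score standardization, which centers each variable to mean 0 and rescales it to standard deviation 1. The sketch below assumes this particular choice and uses synthetic stand-ins for population, literacy and GDP (not the World Bank data loaded above) to show its effect on the coefficients and on the conditioning of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (illustrative only, not the World Bank data)
n = 100
population = rng.uniform(1e5, 1e9, size=n)     # very large magnitudes
literacy = rng.uniform(30.0, 100.0, size=n)    # percentages
gdp = 2e3 * population + 5e9 * literacy + rng.normal(0.0, 1e10, size=n)

def zscore(v):
    # Center to mean 0 and scale to standard deviation 1
    return (v - v.mean()) / v.std()

A_raw = np.column_stack((np.ones(n), population, literacy))
A_scaled = np.column_stack((np.ones(n), zscore(population), zscore(literacy)))

w_raw = np.linalg.lstsq(A_raw, gdp, rcond=None)[0]
w_scaled = np.linalg.lstsq(A_scaled, zscore(gdp), rcond=None)[0]

print(w_raw)      # coefficients on very different scales
print(w_scaled)   # unitless coefficients that can be compared directly

# Condition numbers: a smaller value means a better-conditioned problem
cond_raw = np.linalg.cond(A_raw)
cond_scaled = np.linalg.cond(A_scaled)
print(cond_raw, cond_scaled)
```

After standardization, each coefficient expresses the change in the target, in standard deviations, per standard deviation of the feature, which makes the weights readable and comparable across features.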

## Correlated Features

## Linear Regression as a Parameter Fitting Problem

## Regularization

## Classification

## Logistic Regression

## References

* [An Introduction to Statistical Learning with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
* **The Data Science Design Manual**, by Steven Skiena, 2017, Springer
* Python notebooks available at [http://data-manual.com/data](http://data-manual.com/data)
* Lecture slides available at [http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/](http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/)