# Linear Regression

## Fitting a Line

Problem: You want to train a model that represents a linear relationship between the features and the target vector.

In [1]:
# Load libraries
from sklearn.linear_model import LinearRegression
# Note: load_boston was removed in scikit-learn 1.2, so this notebook needs an older release
from sklearn.datasets import load_boston

# Load data with only two features
boston = load_boston()
features = boston.data[:,0:2]
target = boston.target

# Create linear regression
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features, target)

In [2]:
# View the intercept
model.intercept_

22.485628113468223

The intercept above is $\widehat{\beta}_0$, and the coefficients $\widehat{\beta}_1$ and $\widehat{\beta}_2$ are:

In [3]:
# View the feature coefficients
model.coef_

array([-0.35207832,  0.11610909])

In [4]:
# First value in the target vector multiplied by 1000
target[0]*1000

24000.0

In [5]:
# Predict the target value of the first observation, multiplied by 1000
model.predict(features)[0]*1000

24573.366631705547

Not bad! Our model was only off by $573.37!
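The prediction above is just the fitted equation $\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2$ evaluated at the first observation. As a hand check, the sketch below plugs in the printed intercept and coefficients together with the first observation's feature values (0.00632 and 18.0, as printed later in this notebook). The variable names are illustrative, and because the coefficients are the rounded values shown above, the result should agree with `model.predict` only to a few decimal places.

```python
# Values copied from the notebook output above (the coefficients are rounded).
intercept = 22.485628113468223
coefficients = [-0.35207832, 0.11610909]

# Feature values of the first observation.
first_observation = [0.00632, 18.0]

# y_hat = beta_0 + beta_1 * x_1 + beta_2 * x_2
prediction = intercept + sum(
    coef * value for coef, value in zip(coefficients, first_observation)
)

# Multiply by 1000, as in the cells above, and compare with the true value.
prediction_dollars = prediction * 1000
error_dollars = prediction_dollars - 24000.0

print(round(prediction_dollars, 2), round(error_dollars, 2))
```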

The coefficients of the model are the effect of a one-unit change in each feature on the target vector, holding the other features constant.

In [6]:
# First coefficient multiplied by 1000
# the first feature in our solution is the number of crimes per resident
model.coef_[0]*1000

-352.07831564026765

This says that each one-unit increase in the per capita crime rate is associated with a decrease in house price of approximately $352!
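The "one-unit change" reading of a coefficient can be checked directly. The following is a minimal sketch that assumes synthetic data and NumPy's `np.linalg.lstsq` rather than the Boston dataset; the generating values (3, -2, 0.5) and the helper `predict` are made up for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): target = 3 - 2*x1 + 0.5*x2 + small noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
target = 3.0 - 2.0 * features[:, 0] + 0.5 * features[:, 1] + noise

# Ordinary least squares: a leading column of ones makes the first weight the intercept.
design = np.column_stack([np.ones(len(features)), features])
weights, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, coef = weights[0], weights[1:]

def predict(x):
    """Evaluate the fitted line at a single observation x."""
    return intercept + x @ coef

# Raise the first feature by one unit while holding the second fixed.
x = np.array([1.0, 4.0])
x_plus_one = x + np.array([1.0, 0.0])
change = predict(x_plus_one) - predict(x)

# The change in the prediction equals the first coefficient (up to rounding).
print(coef[0], change)
```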

## Handling Interactive Effects

Problem: You have a feature whose effect on the target variable depends on another feature.



In [8]:
# Load libraries
from sklearn.preprocessing import PolynomialFeatures

# Load data with only two features
boston = load_boston()

features = boston.data[:,0:2]
target = boston.target

# Create interaction term
interaction = PolynomialFeatures(
    degree=3,               # maximum number of features combined into one interaction term
    include_bias=False,     # do not add the constant column of ones (the "bias")
    interaction_only=True)  # keep only products of distinct features, dropping powers such as x1^2

features_interaction = interaction.fit_transform(features)

# Create linear regression
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features_interaction, target)

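Consider a cup of coffee described by two binary features, `sugar` and `stirred`. Neither adding sugar without stirring nor stirring without sugar makes the coffee taste sweet; it is the interaction of putting sugar in the coffee and stirring the coffee (sugar=1, stirred=1) that makes it taste sweet.

The effects of sugar and stirring on sweetness depend on each other. In this case we say there is an interaction effect between the features `sugar` and `stirred`.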
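The coffee example can be made concrete with four observations. This is a small sketch, assuming a toy target that equals 1 only when both features are 1; it uses NumPy least squares, and the helper `fit_and_predict` is a hypothetical name. A model with only the two main effects cannot reproduce the target, while adding the product column lets it fit exactly.

```python
import numpy as np

# All four combinations of the two binary features.
sugar = np.array([0.0, 1.0, 0.0, 1.0])
stirred = np.array([0.0, 0.0, 1.0, 1.0])

# Assumed toy target: the coffee is sweet only with sugar AND stirring.
sweet = sugar * stirred

def fit_and_predict(columns, y):
    """Least-squares fit with an intercept; returns the fitted values."""
    design = np.column_stack([np.ones(len(y))] + columns)
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ weights

# Main effects only: the fitted values miss every observation.
pred_main = fit_and_predict([sugar, stirred], sweet)

# Adding the interaction term sugar * stirred allows an exact fit.
pred_interaction = fit_and_predict([sugar, stirred, sugar * stirred], sweet)

print(np.round(pred_main, 2))
print(np.round(pred_interaction, 2))
```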

In [9]:
# View the feature values for first observation
features[0]

array([  6.32000000e-03,   1.80000000e+01])

In [10]:
# Import library
import numpy as np

# For each observation, multiply the values of the first and second feature
interaction_term = np.multiply(features[:, 0], features[:, 1])

In [11]:
# View interaction term for first observation
interaction_term[0]

0.11376

We can verify the output of `PolynomialFeatures` from our solution by checking whether the first observation’s feature values and interaction term value match our manually calculated version:

In [12]:
# View the values of the first observation
features_interaction[0]

array([  6.32000000e-03,   1.80000000e+01,   1.13760000e-01])
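The output has three columns even though `degree=3`, because with only two features there is no product of three distinct features to form. The sketch below reproduces that column layout in plain Python; `interaction_features` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes the ordering seen above (single features first, then pairs, and so on).

```python
from itertools import combinations
from math import isclose, prod

def interaction_features(row, degree):
    """Return the original values followed by products of distinct features.

    Groups of size 1 give the features themselves, groups of size 2 give
    pairwise products, and so on up to `degree` (or the number of features).
    """
    output = []
    max_size = min(degree, len(row))
    for size in range(1, max_size + 1):
        for group in combinations(row, size):
            output.append(prod(group))
    return output

# First observation from the cells above.
first_observation = [0.00632, 18.0]
expanded = interaction_features(first_observation, degree=3)
print(expanded)

# With three features, degree=3 yields 3 singles + 3 pairs + 1 triple.
print(len(interaction_features([1.0, 2.0, 3.0], degree=3)))
```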

## Fitting a Nonlinear Relationship

Problem: You want to model a nonlinear relationship.

In [20]:
# Load library
from sklearn.preprocessing import PolynomialFeatures

# Load data with one feature
boston = load_boston()
features = boston.data[:,0:1]
target = boston.target

# Create polynomial features x^2 and x^3
polynomial = PolynomialFeatures(degree=3,            # highest power to generate (x, x^2, x^3)
                                include_bias=False)  # do not add the constant column of ones (the "bias")
features_polynomial = polynomial.fit_transform(features)

# Create linear regression
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features_polynomial, target)

Polynomial features do not change how the linear regression fits the model; they only add new features for it to use.

The more of these new features we add, the more flexible the “line” created by our model becomes.
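To see the added flexibility, the sketch below fits the same least-squares model twice on a synthetic cubic relationship: once with only `x`, and once with `x`, `x^2`, and `x^3`. The data and the helper `fit_mse` are assumptions made for illustration, and NumPy's `np.linalg.lstsq` stands in for `LinearRegression`.

```python
import numpy as np

# Synthetic, noise-free cubic relationship (made up for illustration).
x = np.linspace(-2.0, 2.0, 50)
y = x**3 - 2.0 * x

def fit_mse(max_power):
    """Fit least squares on [1, x, ..., x^max_power]; return the mean squared error."""
    design = np.column_stack([x**p for p in range(max_power + 1)])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ weights
    return float(np.mean(residuals**2))

mse_line = fit_mse(1)   # straight line
mse_cubic = fit_mse(3)  # x, x^2, and x^3 as features

# The cubic features let the still-linear model follow the curve.
print(mse_line, mse_cubic)
```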

In [21]:
# View first observation
features[0]

array([ 0.00632])

In [22]:
# View first observation raised to the second power, x^2
features[0]**2

array([  3.99424000e-05])

In [23]:
# View first observation raised to the third power, x^3
features[0]**3

array([  2.52435968e-07])

In [24]:
# View the first observation's values for x, x^2, and x^3 from PolynomialFeatures since d = 3
features_polynomial[0]

array([  6.32000000e-03,   3.99424000e-05,   2.52435968e-07])

## Reducing Variance with Regularization

Problem: You want to reduce the variance of your linear regression model.

Use a learning algorithm that includes a shrinkage penalty (also called regularization), such as `ridge` regression or `lasso` regression.

In [25]:
# Load libraries
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Load data
boston = load_boston()
features = boston.data
target = boston.target

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create ridge regression with an alpha value
regression = Ridge(alpha=0.5)

# Fit the linear regression
model = regression.fit(features_standardized, target)

In standard linear regression, the model trains to minimize the sum of squared errors between the true ($y_i$) and predicted ($\widehat{y}_i$) target values, known as the residual sum of squares (RSS):

$$ RSS = \sum_{i=1}^n (y_i - \widehat{y}_i)^2 $$

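In ridge regression, the shrinkage penalty is a tuning hyperparameter $\alpha$ multiplied by the sum of the squared coefficients, where $\widehat{\beta}_j$ is the coefficient of the $j$th of $p$ features:

$$ RSS + \alpha \sum_{j=1}^p \widehat{\beta}_j^2 $$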

The lasso is similar, except that the shrinkage penalty is a tuning hyperparameter multiplied by the sum of the absolute values of all coefficients:

$$ \frac{1}{2n} RSS + \alpha \sum_{j=1}^p \left| \widehat{\beta}_j \right| $$

where $n$ is the number of observations.
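The three objectives can be written out directly. The sketch below is a plain NumPy transcription of the formulas in this section, evaluated on a tiny hypothetical example; the function names and the numbers are illustrative only.

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares."""
    return float(np.sum((y - y_hat) ** 2))

def ridge_loss(y, y_hat, coef, alpha):
    """RSS plus alpha times the sum of squared coefficients."""
    return rss(y, y_hat) + alpha * float(np.sum(coef**2))

def lasso_loss(y, y_hat, coef, alpha):
    """RSS scaled by 1/(2n) plus alpha times the sum of absolute coefficients."""
    n = len(y)
    return rss(y, y_hat) / (2 * n) + alpha * float(np.sum(np.abs(coef)))

# Tiny hypothetical example: two observations, two features, no intercept.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([-1.0, -2.0])
coef = np.array([0.5, -1.0])
y_hat = X @ coef  # predictions are [-1.5, -2.5]

print(rss(y, y_hat))
print(ridge_loss(y, y_hat, coef, alpha=2.0))
print(lasso_loss(y, y_hat, coef, alpha=2.0))
```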

Regardless of which one we use, both ridge and lasso regressions penalize large or complex models by including coefficient values in the loss function we are trying to minimize.

The hyperparameter $\alpha$ controls how strongly the coefficients are penalized, with higher values of $\alpha$ creating simpler models.
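The effect of $\alpha$ on the ridge coefficients can be seen with the closed-form solution of the ridge objective, $(X^T X + \alpha I)\widehat{\beta} = X^T y$. The following is a sketch on synthetic, standardized data with a centered target (so no intercept is needed); the data and the helper `ridge_coefficients` are assumptions for illustration, not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic standardized features and a centered target (made up for illustration).
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=100)
y = y - y.mean()

def ridge_coefficients(X, y, alpha):
    """Minimize RSS + alpha * sum(beta^2) by solving (X^T X + alpha I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    beta = ridge_coefficients(X, y, alpha)
    norms.append(float(np.sum(beta**2)))
    # Coefficients move toward zero as alpha grows.
    print(alpha, np.round(beta, 3))
```

With `alpha=0.0` this reduces to ordinary least squares; larger values trade some fit for smaller coefficients.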

In [26]:
# Load library
from sklearn.linear_model import RidgeCV

# Create ridge regression with three alpha values
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0]) # L2 PENALTY regularization

# Fit the linear regression
model_cv = regr_cv.fit(features_standardized, target)

# View coefficients
model_cv.coef_

array([-0.91987132,  1.06646104,  0.11738487,  0.68512693, -2.02901013,
        2.68275376,  0.01315848, -3.07733968,  2.59153764, -2.0105579 ,
       -2.05238455,  0.84884839, -3.73066646])

In [28]:
# View the alpha selected by cross-validation
regr_cv.alpha_

1.0
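Conceptually, `RidgeCV` scores each candidate $\alpha$ on held-out data and keeps the best one. By default it uses an efficient form of leave-one-out cross-validation; the sketch below swaps in plain 5-fold cross-validation with a closed-form ridge fit, purely to show the idea. The synthetic data and the helpers `ridge_fit` and `cv_mse` are assumptions, and for brevity the features are not re-standardized inside each fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration).
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=120)

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_mse(X, y, alpha, n_folds=5):
    """Average held-out mean squared error over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef, intercept = ridge_fit(X[train_mask], y[train_mask], alpha)
        predictions = X[test_idx] @ coef + intercept
        errors.append(np.mean((y[test_idx] - predictions) ** 2))
    return float(np.mean(errors))

# Same candidate values as in the cell above.
alphas = [0.1, 1.0, 10.0]
scores = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print(scores, best_alpha)
```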

Because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, **we must make sure to standardize the features prior to training**.
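A quick way to see why scale matters: the same quantity expressed in different units gets very different coefficients, so a penalty on coefficient size would treat the two versions differently. The sketch below uses a hypothetical distance feature in meters and kilometers; `standardize` applies the same mean-0, standard-deviation-1 rescaling that `StandardScaler` performs, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature measured in meters, and the same feature in kilometers.
distance_m = rng.uniform(100.0, 5000.0, size=200)
distance_km = distance_m / 1000.0
target = 2.0 * distance_km + rng.normal(scale=0.5, size=200)

def slope(x, y):
    """Least-squares slope of y on a single feature x (with an intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def standardize(x):
    """Rescale to mean 0 and standard deviation 1 (population std, ddof=0)."""
    return (x - x.mean()) / x.std()

# Same information, but the coefficients differ by a factor of 1000.
slope_m, slope_km = slope(distance_m, target), slope(distance_km, target)

# After standardizing, the unit of measurement no longer matters.
slope_m_std = slope(standardize(distance_m), target)
slope_km_std = slope(standardize(distance_km), target)

print(slope_m, slope_km, slope_m_std, slope_km_std)
```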

## Reducing Features with Lasso Regression

Problem: You want to simplify your linear regression model by reducing the number of features.

In [29]:
# Load library
from sklearn.linear_model import Lasso

# Load data
boston = load_boston()
features = boston.data
target = boston.target

# Standardize features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Create lasso regression with alpha value
regression = Lasso(alpha=0.5) # L1 PENALTY regularization

# Fit the linear regression
model = regression.fit(features_standardized, target)

One useful characteristic of the lasso penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model.

In our solution, many of the coefficients are 0 when using $\alpha = 0.5$, meaning their corresponding features are not used in the model:

In [30]:
# View coefficients
model.coef_

array([-0.11526463,  0.        , -0.        ,  0.39707879, -0.        ,
        2.97425861, -0.        , -0.17056942, -0.        , -0.        ,
       -1.59844856,  0.54313871, -3.66614361])

In [31]:
# Create lasso regression with a high alpha
regression_a10 = Lasso(alpha=10)
model_a10 = regression_a10.fit(features_standardized, target)
model_a10.coef_

array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])
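Why does lasso reach exactly zero while ridge does not? For a single standardized feature and a centered target, both objectives in this chapter have closed-form minimizers: lasso applies a soft threshold to the least-squares coefficient, whereas ridge only divides it by a larger number. The sketch below illustrates this on synthetic data; the helper names and the data are assumptions, and it covers only the one-feature case, not the full algorithm scikit-learn uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic standardized feature and a centered target (made up for illustration).
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
y = 1.5 * x + rng.normal(size=200)
y = y - y.mean()

def soft_threshold(value, threshold):
    """Move value toward zero by threshold, stopping at exactly zero."""
    return float(np.sign(value) * max(abs(value) - threshold, 0.0))

def lasso_coefficient(x, y, alpha):
    """Minimizer of (1 / (2n)) * RSS + alpha * |beta| for a single feature."""
    n = len(y)
    rho = float(x @ y) / n
    scale = float(x @ x) / n
    return soft_threshold(rho, alpha) / scale

def ridge_coefficient(x, y, alpha):
    """Minimizer of RSS + alpha * beta^2 for a single feature."""
    return float(x @ y) / (float(x @ x) + alpha)

# A moderate alpha shrinks the lasso coefficient; a large one removes it entirely.
for alpha in [0.5, 10.0]:
    print(alpha, lasso_coefficient(x, y, alpha), ridge_coefficient(x, y, alpha))
```

Once $\alpha$ exceeds the size of the feature's covariance with the target, the thresholded value is zero, which is consistent with every coefficient vanishing at $\alpha = 10$ above.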

**This lets us reduce variance while improving the interpretability of our model (since fewer features are easier to explain).**