# Multiple Linear Regression

Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.

The model for MLR, given n observations, is:

$$y_i = B_0 + B_1x_{i1} + B_2x_{i2} + ... + B_px_{ip} + E \quad\quad where \quad i = 1,2, ..., n $$

![tile](Doc\multiple linear regression.PNG)

For more detail refer MLR Example.pdf file

In [1]:
# Importing the libraries
import numpy as no
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_columns = None

### Importing Data

In [2]:
# Gettind database
df = pd.read_csv('50_Startups.csv')
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


### Data Preprocessing

In [11]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

In [12]:
X.shape

(50, 6)

In [13]:
# Avoiding the Dummy Variable Trap
X = X[:, 1:]

By including dummy variable in a regression model however, one should be careful of the ***Dummy Variable Trap***. The Dummy Variable trap is a scenario in which the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others.

The ***solution*** to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

In [14]:
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=0)

### Fitting Multiple Linear Regression to the Training set

In [15]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_test,y_test)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Predicting the Test set results

In [16]:
y_pred = lr.predict(X_test)

In [17]:
y_pred

array([ 100494.19441414,  144259.40000006,  141456.63348949,
         81648.2429768 ,  183278.81846194,  106297.95337744,
         76209.33824841,  101213.63837421,  111693.01992799,
        176222.83072951])