# Machine Learning Basics: Series

This notebook works through the Machine Learning Basics series published on TDS and applies the techniques it covers.

## Multiple Linear Regression
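Multiple Linear Regression models a dependent variable as a linear function of several independent variables:
$$y = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
where $y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $w_1, \dots, w_n$ are the coefficients of the independent variables, and $b$ is the intercept.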
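
A minimal sketch of how these quantities combine: the prediction for one observation is the weighted sum of the independent variables plus the intercept. The weights, intercept, and feature values below are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical values chosen only for illustration
w = np.array([2.0, -1.0, 0.5])   # coefficients w_1..w_3
b = 10.0                         # intercept
x = np.array([3.0, 4.0, 6.0])    # one observation x_1..x_3

# y = w_1*x_1 + w_2*x_2 + w_3*x_3 + b
y_single = float(np.dot(w, x) + b)
print(y_single)
```

Fitting the model amounts to choosing `w` and `b` so that predictions like this one are as close as possible to the observed values.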

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
from IPython.display import display, HTML

In [6]:
display(HTML('<style>.container { width:100% !important; }</style>'))

### Importing the Data Set

In [7]:
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Regression/master/Startups_Data.csv')

In [8]:
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


#### Variables
The dependent variable is **Profit**, and the independent variables are the following:
- R&D Spend - Numeric
- Administration - Numeric
- Marketing Spend - Numeric
- State - Categorical

The categorical variable, State, must be converted to dummy variables before it can be used in the machine learning model.

In [45]:
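# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to booleans)
dataset = pd.get_dummies(dataset, columns=['State'], dtype=int)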

In [48]:
dataset = dataset[['R&D Spend', 'Administration', 'Marketing Spend','State_California', 'State_Florida', 'State_New York','Profit']]

In [49]:
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State_California | State_Florida | State_New York | Profit |
|---|---|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 0 | 0 | 1 | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 1 | 0 | 0 | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 0 | 1 | 0 | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0 | 0 | 1 | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 0 | 1 | 0 | 166187.94 |
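
To make the dummy encoding concrete, here is a small sketch on a toy frame (the rows are hypothetical and not taken from the dataset). It also shows `drop_first=True`, a common alternative that is not used in this notebook: because the three indicator columns always sum to 1, one of them is redundant when the model also fits an intercept.

```python
import pandas as pd

# Toy frame with hypothetical rows, used only to show the encoding
toy = pd.DataFrame({"State": ["New York", "California", "Florida"],
                    "Profit": [3.0, 2.0, 1.0]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
full = pd.get_dummies(toy, columns=["State"], dtype=int)

# drop_first=True omits the first category (here California),
# which then acts as the baseline the other states are compared against
reduced = pd.get_dummies(toy, columns=["State"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Keeping all three columns, as this notebook does, still produces valid predictions; the individual state coefficients are just harder to interpret on their own.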


In [50]:
X = dataset.iloc[:,:-1].values

In [51]:
y = dataset.iloc[:,-1].values

In [52]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [53]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [54]:
regressor.coef_

array([ 7.74342081e-01, -9.44369585e-03,  2.89183133e-02, -2.85177769e+02,
        2.97560876e+02, -1.23831070e+01])

In [59]:
for idx, col_name in enumerate(dataset.columns[0:6]):
    print("The coefficient for {} is {:,.6f}".format(col_name, regressor.coef_[idx]))

The coefficient for R&D Spend is 0.774342
The coefficient for Administration is -0.009444
The coefficient for Marketing Spend is 0.028918
The coefficient for State_California is -285.177769
The coefficient for State_Florida is 297.560876
The coefficient for State_New York is -12.383107


In [60]:
intercept = regressor.intercept_

print("The intercept for our model is {:,.2f}".format(intercept))

The intercept for our model is 49,834.89


In [61]:
print("{:.2%}".format(regressor.score(X_test, y_test)), ' of the variation in the Profit is explained by the variables')

96.50%  of the variation in the Profit is explained by the variables


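The score is the coefficient of determination, $R^2$, computed using the formula
$$R^2 = 1 - \frac{RSS}{TSS}$$
where $RSS$ is the residual sum of squares and $TSS$ is the total sum of squares.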
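
The $R^2$ formula can be checked by hand. The sketch below uses small made-up arrays (not the notebook's test set) and computes the residual sum of squares, the total sum of squares, and the resulting score.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

rss = float(np.sum((y_true - y_hat) ** 2))          # residual sum of squares
tss = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
r2 = 1 - rss / tss
print(round(r2, 4))
```

A score of 1 means the predictions match the data exactly, while a score of 0 means the model does no better than always predicting the mean.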

In [63]:
from sklearn.metrics import mean_squared_error

y_pred = regressor.predict(X_test)

regression_model_mse = mean_squared_error(y_test, y_pred)

print("The MSE for our model is {:,.2f}".format(regression_model_mse))

The MSE for our model is 79,495,441.50


In [64]:
import math

print("The RMSE for our model is {:,.2f}".format(math.sqrt(regression_model_mse)))

The RMSE for our model is 8,916.02
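
For reference, the mean squared error is the average of the squared differences between actual and predicted values, and the root mean squared error is its square root, which puts the error back in the same units as Profit. A minimal sketch with made-up values (not the notebook's data):

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([4.0, 28.0, 30.0, 40.0])

errors = actual - predicted               # 6, -8, 0, 0
mse_toy = float(np.mean(errors ** 2))     # average squared error
rmse_toy = float(np.sqrt(mse_toy))        # back in the units of the target
print(mse_toy, rmse_toy)
```

Because the errors are squared before averaging, a few large misses raise both metrics more than many small ones.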


#### Prediction

Predicting the profit using the following parameters:

- R&D Spend - 150,000
- Administration - 95,000
- Marketing Spend - 40,000
- State - Florida

In [66]:
print("The Predicted Profit is {:,.2f}".format(regressor.predict([[150000,95000,40000,0,1,0]])[0]))

The Predicted Profit is 166,543.34
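
As a sanity check, the same prediction can be reproduced by hand from the coefficients and intercept printed earlier. This is only a sketch: the intercept is copied from the rounded printed output, so the result should agree with the model's prediction up to small rounding differences. The names `coef_printed`, `intercept_printed`, and `x_new` are introduced here for illustration.

```python
import numpy as np

# Coefficients and intercept copied from the outputs printed earlier
coef_printed = np.array([7.74342081e-01, -9.44369585e-03, 2.89183133e-02,
                         -2.85177769e+02, 2.97560876e+02, -1.23831070e+01])
intercept_printed = 49834.89  # rounded to two decimals in the printed output

# Order: R&D Spend, Administration, Marketing Spend,
#        State_California, State_Florida, State_New York
x_new = np.array([150000, 95000, 40000, 0, 1, 0])

# intercept + sum of coefficient * feature value
manual_prediction = float(intercept_printed + coef_printed @ x_new)
print("{:,.2f}".format(manual_prediction))
```

Most of the predicted profit comes from the intercept and the R&D Spend term; the Florida indicator contributes only its coefficient of about 297.56.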
