## Importing the libraries

In [196]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

## Importing the dataset

In [197]:
dataset = pd.read_csv('50_Startups.csv')

## Splitting the data into independent variables (X) and the dependent variable (y)

In [198]:
X = dataset.iloc[:, :-1]  # independent variables: every column except the last
y = dataset.iloc[:, [-1]]  # dependent variable (Profit), kept as a one-column DataFrame
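
The split above relies on column positions. As a hedged alternative, the sketch below selects the target by name instead, which keeps working if the column order changes. The `toy` frame and its values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Tiny stand-in frame with the same column names as 50_Startups.csv (values are invented).
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "Administration": [50.0, 60.0, 70.0],
    "Marketing Spend": [10.0, 20.0, 30.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [500.0, 600.0, 700.0],
})

target = "Profit"
X_toy = toy.drop(columns=[target])  # every column except the target
y_toy = toy[[target]]               # double brackets keep y as a one-column DataFrame

print(list(X_toy.columns))
print(y_toy.shape)
```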

## Take a look at X

In [199]:
X.head(10)

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York |
| 6 | 134615.46 | 147198.87 | 127716.82 | California |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York |
| 9 | 123334.88 | 108679.17 | 304981.62 | California |


## Take a look at y

In [200]:
y.head(10)

| | Profit |
|---|---|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |
| 5 | 156991.12 |
| 6 | 156122.51 |
| 7 | 155752.6 |
| 8 | 152211.77 |
| 9 | 149759.96 |


## Encoding categorical data (state feature)
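We use the pandas get_dummies function to one-hot encode the State column.
We pass drop_first=True to drop one of the resulting dummy columns.
We do this to avoid multicollinearity: a full set of dummy columns always sums to 1, which makes them perfectly collinear with the model's intercept (the "dummy variable trap").
In our case sklearn's LinearRegression would still fit and predict without complaint if we kept all of the columns, but the individual coefficients would no longer be uniquely determined, so dropping one seems like best practice to me.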

In [201]:
X = pd.get_dummies(X, columns=['State'], drop_first=True)
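
To make the multicollinearity point concrete, here is a minimal sketch on a made-up four-row State column. It builds the design matrix (intercept plus dummies) with and without drop_first and compares its rank to its number of columns. The helper `with_intercept` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Invented sample of the State column.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# astype(float) because recent pandas versions return boolean dummy columns.
full = pd.get_dummies(states, columns=["State"]).astype(float)
reduced = pd.get_dummies(states, columns=["State"], drop_first=True).astype(float)

def with_intercept(frame):
    """Prepend a column of ones, as a linear model with an intercept effectively does."""
    ones = np.ones((len(frame), 1))
    return np.hstack([ones, frame.to_numpy()])

full_design = with_intercept(full)        # intercept + 3 dummy columns
reduced_design = with_intercept(reduced)  # intercept + 2 dummy columns

print(list(reduced.columns))
print(np.linalg.matrix_rank(full_design), "of", full_design.shape[1], "columns independent")
print(np.linalg.matrix_rank(reduced_design), "of", reduced_design.shape[1], "columns independent")
```

With all three dummies the matrix has four columns but only rank 3, because the dummies add up to the intercept column. After dropping the first category the rank matches the column count.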

## Take a look at X after encoding the states

In [202]:
X.head(10)

| | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 0 | 1 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 | 0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0 | 1 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 | 0 |
| 5 | 131876.9 | 99814.71 | 362861.36 | 0 | 1 |
| 6 | 134615.46 | 147198.87 | 127716.82 | 0 | 0 |
| 7 | 130298.13 | 145530.06 | 323876.68 | 1 | 0 |
| 8 | 120542.52 | 148718.95 | 311613.29 | 0 | 1 |
| 9 | 123334.88 | 108679.17 | 304981.62 | 0 | 0 |


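Notice there is no column for State_California. It is the dropped baseline category: a row with 0 in both State_Florida and State_New York (rows 1, 6 and 9 above) is a California startup.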

## Splitting the dataset into the Training set and Test set
We go with a 20% test size, which is a common default.

In [203]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Setting random_state=0 fixes the shuffle, so the same rows land in the training and test sets on every run and the results are reproducible.
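
As a small sketch of what the fixed seed buys us, the snippet below splits a made-up ten-row frame twice with the same random_state and checks that both runs pick the same rows. It also checks that features and targets are shuffled together, so their indices stay aligned. The toy frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: ten rows, one feature, one target.
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.DataFrame({"target": [v * 2 for v in range(10)]})

# Two independent calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

print(X_test_a.index.equals(X_test_b.index))    # same seed, same test rows
print(X_train_a.index.equals(y_train_a.index))  # X and y rows stay paired
print(len(X_train_a), len(X_test_a))            # 80% / 20% of ten rows
```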

## Let's take a look at X_train

In [204]:
X_train.head()

| | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 33 | 55493.95 | 103057.49 | 214634.81 | 1 | 0 |
| 35 | 46014.02 | 85047.44 | 205517.64 | 0 | 1 |
| 26 | 75328.87 | 144135.98 | 134050.07 | 1 | 0 |
| 34 | 46426.07 | 157693.92 | 210797.67 | 0 | 0 |
| 18 | 91749.16 | 114175.79 | 294919.57 | 1 | 0 |


## and now y_train

In [205]:
y_train.head()

| | Profit |
|---|---|
| 33 | 96778.92 |
| 35 | 96479.51 |
| 26 | 105733.54 |
| 34 | 96712.8 |
| 18 | 124266.9 |


Notice that X_train's index is the same as y_train's, so each row of features still lines up with its profit.

## Now let's look at X_test

In [206]:
X_test.head()

| | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 28 | 66051.52 | 182645.56 | 118148.2 | 1 | 0 |
| 11 | 100671.96 | 91790.61 | 249744.55 | 0 | 0 |
| 10 | 101913.08 | 110594.11 | 229160.95 | 1 | 0 |
| 41 | 27892.92 | 84710.77 | 164470.71 | 1 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 | 0 |


## and y_test

In [207]:
y_test.head()

| | Profit |
|---|---|
| 28 | 103282.38 |
| 11 | 144259.4 |
| 10 | 146121.95 |
| 41 | 77798.83 |
| 2 | 191050.39 |


As above, X_test and y_test share the same index.

## Here we are fitting a linear regression model to the Training set.

In [208]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

## Getting the R-squared score

In [209]:
regressor.score(X,y) * 100

94.8522354717156

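This value is the percentage of the variability in profit that the model explains: 94.85% in our case. Note that it is computed on the full dataset (X, y), which includes the rows the model was trained on, so it is likely optimistic. Calling regressor.score(X_test, y_test) instead would score the model on held-out data only.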
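
Since score can be called on any feature/target pair, it can help to compare the training and held-out R-squared side by side. The sketch below does that on synthetic data (not the startups data) and also recomputes R-squared by hand from its definition, 1 - SS_res / SS_tot. The helper `r_squared` and all the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 50 rows, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = X_syn @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

train_r2 = model.score(X_tr, y_tr)   # fit quality on rows the model has seen
test_r2 = model.score(X_te, y_te)    # fit quality on held-out rows
manual_test_r2 = r_squared(y_te, model.predict(X_te))

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
print(f"manual test R^2: {manual_test_r2:.3f}")
```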

## Now we're predicting the Test set results.

In [210]:
predictions = regressor.predict(X_test)

## Let's see how we did by checking the MAE (mean absolute error).

In [211]:
mae = mean_absolute_error(y_test, predictions)  # signature is (y_true, y_pred)
mae

7514.293659640607

So, on average, our model's predictions on the test set are off by $7,514.29.
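
For reference, MAE is just the mean of the absolute differences between actual and predicted values. The sketch below checks a hand computation against sklearn on three made-up numbers; the arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented actual and predicted values.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

Because the absolute value is symmetric, swapping the two arguments gives the same MAE, although sklearn documents the order as (y_true, y_pred).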

## Let's put that into context by seeing it as a percent.

In [212]:
mae / y_test['Profit'].mean() * 100  # MAE as a percentage of the mean test-set profit

6.145283780543863

So the MAE is about 6.15% of the mean profit in the test set, which is not too bad.
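
The percentage above divides the MAE by the mean of the actual values. That is not the same as the mean absolute percentage error (MAPE), which averages each row's own relative error. The sketch below contrasts the two on made-up numbers, computing both by hand with NumPy; the arrays are assumptions for illustration.

```python
import numpy as np

# Invented actual and predicted values; the first actual value is small on purpose.
actual = np.array([50.0, 200.0, 350.0])
predicted = np.array([70.0, 190.0, 340.0])

abs_errors = np.abs(actual - predicted)           # 20, 10, 10

# MAE relative to the mean actual value (the calculation used in this notebook).
mae_over_mean_pct = abs_errors.mean() / actual.mean() * 100

# MAPE: average of the per-row relative errors.
mape_pct = np.mean(abs_errors / np.abs(actual)) * 100

print(f"MAE / mean: {mae_over_mean_pct:.2f}%")
print(f"MAPE:       {mape_pct:.2f}%")
```

The two can differ noticeably when some actual values are small, because MAPE weights those rows' errors more heavily.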

In the future we could do feature elimination to try to remove unneeded features and make our model even better.
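
One possible way to do this, sketched here purely as an assumption about where the notebook could go next, is recursive feature elimination with sklearn's RFE. The example uses synthetic data in which only the first two of five features influence the target. RFE with a linear model ranks features by coefficient size, so on the real data the spend columns would likely need scaling first.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 5 features on the same scale, only columns 0 and 1 drive the target.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 4.0 * X_syn[:, 0] - 3.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and drop the weakest feature until two remain.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_syn, y_syn)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```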

### Note
The 50_Startups.csv file was taken from the Machine Learning A-Z™: Hands-On Python & R In Data Science course offered on Udemy.
This served as an exercise for me to delve into machine learning and data science, as well as to work with Jupyter.