### This notebook uses a dataset of 50 startups to train a multiple linear regression model and examine which features (spending categories and state) contribute to profit.

# Data Pre-processing

## Importing Libraries

In [2]:
import pandas as pd
import numpy as np

## Importing the dataset

In [8]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 4].values

In [11]:
dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State      | Profit    |
|---|-----------|----------------|-----------------|------------|-----------|
| 0 | 165349.2  | 136897.8       | 471784.1        | New York   | 192261.83 |
| 1 | 162597.7  | 151377.59      | 443898.53       | California | 191792.06 |
| 2 | 153441.51 | 101145.55      | 407934.54       | Florida    | 191050.39 |
| 3 | 144372.41 | 118671.85      | 383199.62       | New York   | 182901.99 |
| 4 | 142107.34 | 91391.77       | 366168.42       | Florida    | 166187.94 |


## Encoding Categorical Data as Numerical Values So the Model Can Use Them

In [12]:
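from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the State column (index 3). The encoded columns come first,
# followed by the untouched numeric columns.
# (OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22.)
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = np.asarray(ct.fit_transform(X), dtype=float)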

## Avoiding Dummy Variable Trap

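### As a reminder, the one-hot (dummy) columns for a category always sum to 1, so together they are perfectly collinear with the model's intercept. Dropping one dummy column removes this redundancy without losing information.

#### The slice below drops the first column, which is the first State dummy in sorted order (California, given the states shown above), not a spend feature. R&D Spend, Administration, and Marketing Spend all remain in X.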

In [15]:
X = X[:, 1:]
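
To see why one dummy column is redundant, the following minimal sketch one-hot encodes a few state labels in plain Python. The helper `one_hot` and the sample labels are illustrative assumptions and not part of the notebook's pipeline. Every row of dummies sums to 1, which duplicates the intercept column that linear regression adds, so the dropped column can always be recovered from the remaining ones.

```python
def one_hot(labels):
    """Return (sorted categories, one-hot rows) for a list of labels."""
    categories = sorted(set(labels))
    rows = [[1 if label == cat else 0 for cat in categories] for label in labels]
    return categories, rows

# Illustrative labels only; the real State column holds 50 values.
states = ["New York", "California", "Florida", "New York", "Florida"]
categories, dummies = one_hot(states)

# Each row sums to 1, the same value as the intercept column.
row_sums = [sum(row) for row in dummies]

# Drop the first dummy (California here), as X[:, 1:] does in the notebook.
reduced = [row[1:] for row in dummies]

# The dropped column is fully determined by the remaining ones.
recovered_first = [1 - sum(row) for row in reduced]

print(categories)
print(row_sums)
print(recovered_first == [row[0] for row in dummies])
```

A row with zeros in both remaining columns therefore stands for the dropped category, which becomes the baseline that the other state coefficients are measured against.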

## Split Dataset into Training Set and Test Set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Fitting the Model to the Training Set

In [18]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
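
Since the goal is to see which features contribute to profit, one option is to pair each fitted coefficient with its column name. The sketch below assumes the column order produced by the preprocessing steps (the two remaining state dummies first, then the three spend columns). The helper `rank_features`, the `State_...` labels, and the coefficient values are all hypothetical and are used only to show the mechanics.

```python
def rank_features(names, coefficients):
    """Pair names with coefficients, largest absolute value first."""
    if len(names) != len(coefficients):
        raise ValueError("names and coefficients must have the same length")
    pairs = zip(names, (float(c) for c in coefficients))
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)

# Assumed column order after encoding and dropping the first dummy.
feature_names = ["State_Florida", "State_New York",
                 "R&D Spend", "Administration", "Marketing Spend"]

# Made-up values for illustration, NOT results from this dataset.
example_coefs = [-900.0, 700.0, 0.8, 0.03, 0.04]

ranked = rank_features(feature_names, example_coefs)
for name, coef in ranked:
    print(f"{name:>16}: {coef:10.3f}")
```

With the fitted model, the call would be `rank_features(feature_names, regressor.coef_)`. A dummy coefficient is a shift in profit for that state relative to the dropped one, while a spend coefficient is a change in profit per unit of spend, so the two kinds should not be compared by raw magnitude.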

# Predicting the Test Set Results

In [19]:
y_pred = regressor.predict(X_test)

In [20]:
y_pred

array([ 103015.20159796,  132582.27760815,  132447.73845175,
         71976.09851258,  178537.48221056,  116161.24230166,
         67851.69209676,   98791.73374687,  113969.43533013,
        167921.06569551])

In [23]:
# Plot actual test-set profits against the model's predictions to see how closely they match.
# The closer the two lines, the better the fit on unseen data.

from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()

# Bokeh 3.x uses width/height; older releases used plot_width/plot_height.
p = figure(width=450, height=450, title="Multiple L.R.: Actual (blue) & Predictions (red)")

idx = list(range(len(Y_test)))  # one x position per test sample

p.line(idx, Y_test, line_width=2)  # Actual

p.line(idx, y_pred, color="firebrick", line_width=2)  # Predictions

p.xaxis.axis_label = "Test sample index"
p.yaxis.axis_label = "Profit"

show(p)



#### Based on this graph, predictions and actual results overlap to some extent. The model could be improved by gathering more data (the dataset has only 50 rows) and by refining the feature set used for training.
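
The visual comparison can be backed by numbers. The sketch below computes two common regression metrics in plain Python: the coefficient of determination (R²) and the root mean squared error. The function names and the toy inputs are assumptions chosen for illustration, and the numbers are not taken from the startup dataset.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Toy numbers for illustration only, not taken from the dataset.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(round(r_squared(actual, predicted), 4))  # 0.968
print(round(rmse(actual, predicted), 4))       # 10.0
```

In this notebook the calls would be `r_squared(Y_test, y_pred)` and `rmse(Y_test, y_pred)`. scikit-learn's `sklearn.metrics.r2_score` computes the same R² quantity.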