# Startup Profit Prediction using Multiple Linear Regression

**Scenario**: You are a VC looking to invest in a startup company. You have data on 50 existing startup companies, with the following fields for each:
- R&D Spend (\$)
- Administration (\$)
- Marketing Spend (\$)
- State (New York, California, or Florida)
- Profit (\$)

**Goal**: Predict the profit for a startup with given attributes

## Table of Contents

* [Data Preprocessing](#Data-Preprocessing)
    * Importing the libraries
    * Importing the dataset
    * Encoding categorical data
    * Splitting the dataset into the Training set and Test set
* [Training the Multiple Linear Regression model on the Training set](#Training-the-Multiple-Linear-Regression-model-on-the-Training-set)
* [Predicting the Test set results](#Predicting-the-Test-set-results)
* [Predicting the profit of a startup with given attributes](#Predicting-the-profit-of-a-startup-with-given-attributes)
* [Getting the final linear regression equation with the values of the coefficients](#Getting-the-final-linear-regression-equation-with-the-values-of-the-coefficients)

**Note**: Feature scaling is not needed as a data preprocessing step for multiple linear regression, because the fitted coefficients adjust to the scale of each feature.
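
To illustrate why scaling is optional here, the following minimal sketch fits ordinary least squares twice, once on raw features and once on standardized features, and checks that the predictions agree. The data is synthetic and the names `X_demo`, `y_demo`, `raw_model`, and `scaled_model` are hypothetical; none of it comes from the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 1, 30), rng.uniform(0, 100_000, 30)])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.1, 30)

# Fit once on the raw features and once on standardized features
raw_model = LinearRegression().fit(X_demo, y_demo)
scaler = StandardScaler().fit(X_demo)
scaled_model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

pred_raw = raw_model.predict(X_demo)
pred_scaled = scaled_model.predict(scaler.transform(X_demo))

# The coefficients differ, but the predictions agree up to floating-point error
print(np.allclose(pred_raw, pred_scaled))
```

Scaling changes the size of each coefficient but not the fitted line, which is why this notebook skips it.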

## Data Preprocessing

### **Importing the libraries**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### **Importing the dataset**

In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Print 5 sample rows of `dataset`

In [3]:
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


Print 5 samples of `X`

In [4]:
print(X[:5])

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']]


Print 5 samples of `y`

In [5]:
print(y[:5])

[192261.83 191792.06 191050.39 182901.99 166187.94]


### **Encoding categorical data**

Apply one-hot encoding to the `State` column.

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Column index 3 (the 4th column, State) is one-hot encoded; remainder="passthrough" keeps the other columns unchanged
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [3])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

Print 5 samples of `X` with `State` one-hot encoded. The three dummy columns come first, in alphabetical order of state: `California` is `1 0 0`, `Florida` is `0 1 0`, and `New York` is `0 0 1`.

In [7]:
print(X[:5])

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]]
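
Rather than reading the dummy column order off the printed rows, it can be looked up on the fitted encoder. The sketch below refits the same kind of transformer on three rows copied from the sample output; `X_demo`, `ct_demo`, and `state_order` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows copied from the sample output (R&D, Administration, Marketing, State)
X_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
ct_demo.fit(X_demo)

# categories_ lists the learned categories in the order of the dummy columns
state_order = list(ct_demo.named_transformers_["encoder"].categories_[0])
print(state_order)
```

The list should read California, Florida, New York, which is the column order needed when building an input row by hand for a single prediction.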


### **Splitting the dataset into the Training set and Test set**

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

The `sklearn` `LinearRegression` class used below
- still gives valid predictions when all three dummy columns are kept, so there is no need to drop one manually to avoid the dummy variable trap
- fits all features at once, so a manual feature selection technique like backward elimination is not required to get accurate predictions
- is the same class used for simple linear regression

This leaves more time for deploying models and speeds up the model selection process.
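
The dummy variable point can be checked on a small example. The sketch below uses synthetic data (hypothetical, not the startup dataset) and compares a fit on the full one-hot block with a fit where one dummy column is dropped; `X_full`, `X_dropped`, and the other names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): a 3-category dummy block plus one numeric feature
rng = np.random.default_rng(0)
category = rng.integers(0, 3, size=40)
dummies = np.eye(3)[category]            # full one-hot block, each row sums to 1
spend = rng.uniform(0, 100, size=40)
y_demo = 5.0 * category + 0.8 * spend + rng.normal(0, 1, 40)

X_full = np.column_stack([dummies, spend])             # all three dummy columns
X_dropped = np.column_stack([dummies[:, 1:], spend])   # first dummy column removed

pred_full = LinearRegression().fit(X_full, y_demo).predict(X_full)
pred_dropped = LinearRegression().fit(X_dropped, y_demo).predict(X_dropped)

# Both fits describe the same model, so the predictions match
print(np.allclose(pred_full, pred_dropped))
```

The individual dummy coefficients are not uniquely determined when all columns are kept, so they are best read relative to one another rather than in isolation.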

In [9]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [10]:
# Compute the vector of predicted profits for the test set
y_pred = regressor.predict(X_test)

np.set_printoptions(precision=2)

# Print predicted (left column) and actual (right column) profits side by side
# Each vector is reshaped into a column; axis=1 joins the two columns horizontally
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), axis = 1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
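
Comparing the two columns by eye is hard, so a summary metric can help. As a rough sketch, the snippet below copies the ten printed pairs into arrays and computes R² and mean absolute error with `sklearn.metrics`; this evaluation step is an addition and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Predicted and actual profits copied from the printed output
y_pred_demo = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                        116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_demo = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                        105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

r2 = r2_score(y_test_demo, y_pred_demo)              # share of variance explained
mae = mean_absolute_error(y_test_demo, y_pred_demo)  # average absolute error in dollars
print(f"R^2: {r2:.3f}")
print(f"MAE: {mae:.2f}")
```

With only ten test rows these numbers are noisy, so they are better treated as a sanity check than as a firm estimate of accuracy.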


## Predicting the profit of a startup with given attributes

**Question 1**: How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = \$160,000, Administration Spend = \$130,000, Marketing Spend = \$300,000, and State = California?

In [11]:
# Recall the one-hot encoding for California is 1 0 0
print(regressor.predict([[1, 0, 0, 160000,130000,300000]]))

[181566.92]


Our model predicts that the profit of a startup in California which spent \$160,000 on R&D, \$130,000 on Administration, and \$300,000 on Marketing is \$181,566.92.

>**Important note**: Notice that the feature values were passed inside a double pair of square brackets. That's because the `predict` method always expects a 2D array as input, with one row per sample and one column per feature. Putting the six values into a double pair of square brackets makes the input exactly a 2D array with a single row. Simply put:
>- `160000` -> scalar
>- `[1, 0, 0, 160000, 130000, 300000]` -> 1D array
>- `[[1, 0, 0, 160000, 130000, 300000]]` -> 2D array
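
The shape requirement can be checked directly with NumPy. A minimal sketch using the feature values from Question 1 (the variable names are illustrative):

```python
import numpy as np

# Feature values in the encoded column order:
# California, Florida, New York, R&D, Administration, Marketing
features = [1, 0, 0, 160000, 130000, 300000]

one_d = np.array(features)        # shape (6,): a single 1D vector
two_d = np.array([features])      # shape (1, 6): one sample (row), six features (columns)
reshaped = one_d.reshape(1, -1)   # another way to turn the 1D vector into a single row

print(one_d.shape, two_d.shape, reshaped.shape)
```

Either `two_d` or `reshaped` has the one-row, six-column layout that `predict` expects for a single startup.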

## Getting the final linear regression equation with the values of the coefficients

**Question 2**: How do I get the final regression equation y = b0 + b1 \* x1 + b2 \* x2 + ... with the final values of the coefficients?

In [12]:
print(regressor.coef_) # b1, b2, ... = regression coefficients
print(regressor.intercept_) # b0 = regression constant

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.529248548686


Therefore, the equation of our multiple linear regression model is:
$$ \text{Profit} = (86.6 \times \text{Dummy State 1: California}) - (873 \times \text{Dummy State 2: Florida}) + (786 \times \text{Dummy State 3: New York}) + $$
$$ (0.773 \times \text{R&D Spend}) + (0.0329 \times \text{Administration}) + (0.0366 \times \text{Marketing Spend}) + 42467.5 $$

>**Important note**: To get these coefficients, we read the `coef_` and `intercept_` attributes of our regressor object. Attributes in Python differ from methods: they are accessed without parentheses and usually hold a simple value or an array of values.
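
As a sanity check, the equation can be evaluated by hand for the California startup from Question 1. The sketch below uses the rounded coefficients exactly as printed, so it only approximates what `regressor.predict` returns; `coef`, `x_new`, and `manual_profit` are illustrative names.

```python
import numpy as np

# Rounded coefficients and intercept copied from the printed output
coef = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])
intercept = 42467.529248548686

# California startup from Question 1, in the encoded column order
x_new = np.array([1, 0, 0, 160000, 130000, 300000])

# Profit = b0 + b1*x1 + ... + b6*x6
manual_profit = intercept + np.dot(coef, x_new)
print(round(manual_profit, 2))
```

The result should land within roughly \$100 of the model's prediction of \$181,566.92; the gap comes from the coefficients being printed to three significant figures.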