# Multiple Linear Regression

# Project Description
A car manufacturer operates 50 small factories in India, China, and Brazil that produce small parts. The file
`Business Data.csv` lists, for each factory, the amount (in US dollars) spent during fiscal year 2019 on
equipment, workers' salaries, and raw materials, together with the year-end profit.

Build a model that predicts profit, so that if the company decides to build new factories it can plan the budget
(i.e. the amount to spend on equipment, salaries, and materials) for each of these countries.

## Step 1. Importing the Libraries

In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Step 2. Importing the Data and Checking

In [15]:
dataset = pd.read_csv('data/Business Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [16]:
# Quick check of the data

dataset

| | Equipment Spend | Employee Salary | Raw Materials Spend | Country | Profit |
|---|---|---|---|---|---|
| 0 | 165349 | 136898 | 471784 | India | 192262 |
| 1 | 162598 | 151378 | 443899 | China | 191792 |
| 2 | 153442 | 101146 | 407935 | Brezil | 191050 |
| 3 | 144372 | 118672 | 383200 | India | 182902 |
| 4 | 142107 | 91392 | 366168 | Brezil | 166188 |
| 5 | 131877 | 99815 | 362861 | India | 156991 |
| 6 | 134615 | 147199 | 127717 | China | 156123 |
| 7 | 130298 | 145530 | 323877 | Brezil | 155753 |
| 8 | 120543 | 148719 | 311613 | India | 152212 |
| 9 | 123335 | 108679 | 304982 | China | 149760 |


## Step 3. Cleaning the Data

In [17]:
# already clean
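
The claim that the data is already clean can be checked rather than assumed. The sketch below runs three quick checks (missing values, duplicate rows, and the distinct country labels) on a small stand-in frame named `sample`, which is an assumption built from three rows of the preview above. In the notebook, the same calls would be made on `dataset`.

```python
import pandas as pd

# Hypothetical stand-in: three rows copied from the data preview.
# In the notebook, replace `sample` with `dataset`.
sample = pd.DataFrame({
    "Equipment Spend": [165349, 162598, 153442],
    "Employee Salary": [136898, 151378, 101146],
    "Raw Materials Spend": [471784, 443899, 407935],
    "Country": ["India", "China", "Brezil"],
    "Profit": [192262, 191792, 191050],
})

missing_per_column = sample.isnull().sum()    # number of NaN values in each column
duplicate_rows = sample.duplicated().sum()    # number of fully repeated rows
country_labels = sorted(sample["Country"].unique())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Country labels:", country_labels)
```

A label check of this kind is also where spelling variants in the `Country` column would show up before encoding.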

## Step 4. Encoding Categorical Data

In [18]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)

### Checking the Encoded Data

In [19]:
pd.DataFrame(X)

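| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 165349 | 136898 | 471784 |
| 1 | 0.0 | 1.0 | 0.0 | 162598 | 151378 | 443899 |
| 2 | 1.0 | 0.0 | 0.0 | 153442 | 101146 | 407935 |
| 3 | 0.0 | 0.0 | 1.0 | 144372 | 118672 | 383200 |
| 4 | 1.0 | 0.0 | 0.0 | 142107 | 91392 | 366168 |
| 5 | 0.0 | 0.0 | 1.0 | 131877 | 99815 | 362861 |
| 6 | 0.0 | 1.0 | 0.0 | 134615 | 147199 | 127717 |
| 7 | 1.0 | 0.0 | 0.0 | 130298 | 145530 | 323877 |
| 8 | 0.0 | 0.0 | 1.0 | 120543 | 148719 | 311613 |
| 9 | 0.0 | 1.0 | 0.0 | 123335 | 108679 | 304982 |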
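
The encoded frame has lost its column names, so it is not obvious which of columns 0 to 2 stands for which country. One way to recover the mapping, sketched here on a three-row stand-in array (`X_sample` and `ct_sample` are illustrative names, not part of the notebook), is to read the fitted encoder's `categories_` attribute:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Stand-in feature rows copied from the data preview (object dtype, as
# `dataset.iloc[:, :-1].values` produces for mixed numeric/text columns).
X_sample = np.array([
    [165349, 136898, 471784, "India"],
    [162598, 151378, 443899, "China"],
    [153442, 101146, 407935, "Brezil"],
], dtype=object)

ct_sample = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = ct_sample.fit_transform(X_sample)

# categories_ holds one array per encoded input column; its order is the
# order of the one-hot output columns.
categories = list(ct_sample.named_transformers_["encoder"].categories_[0])
print(categories)
```

The categories come out in sorted order, which agrees with the table above: the India rows carry their 1.0 in column 2.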


## Scaling

In [20]:
# The numeric input variables are all on a similar scale, so we skip this step for now.
# Later, we will come back and add this step to check the difference.

# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X[:, 3:6] = sc.fit_transform(X[:, 3:6])
# pd.DataFrame(X)
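
If scaling is added later, one common precaution is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not leak into training. A minimal sketch on made-up data with the same column layout (three one-hot columns followed by three spend columns; every name and number here is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up encoded matrix: 3 one-hot columns, then 3 spend columns.
one_hot = np.eye(3)[np.arange(20) % 3]
spend = rng.uniform(50_000, 500_000, size=(20, 3))
X_demo = np.hstack([one_hot, spend])
y_demo = rng.uniform(50_000, 200_000, size=20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

sc = StandardScaler()
X_tr_scaled, X_te_scaled = X_tr.copy(), X_te.copy()
# Learn mean and standard deviation from the training rows only ...
X_tr_scaled[:, 3:6] = sc.fit_transform(X_tr[:, 3:6])
# ... and apply the same transformation to the test rows.
X_te_scaled[:, 3:6] = sc.transform(X_te[:, 3:6])
```

The one-hot columns are left untouched. For an ordinary least-squares fit, standardising the inputs is not expected to change the predictions, only the scale of the coefficients.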


## Step 5. Splitting the Dataset into the Training set and Test Set

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Step 6. Training the Multiple Linear Regression Model on the Training Set

In [27]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()
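
A fitted `LinearRegression` exposes the learned weights as `coef_` and the constant term as `intercept_`, which is often the first thing to look at after training. The sketch below uses noise-free toy data with known weights (all names and numbers are made up for illustration), so the recovered values can be compared with the ones used to generate the data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Toy inputs and a known linear rule: y = 5000 + 0.8*x0 - 0.1*x1 + 0.05*x2
X_toy = rng.uniform(0, 100, size=(30, 3))
true_coef = np.array([0.8, -0.1, 0.05])
true_intercept = 5000.0
y_toy = true_intercept + X_toy @ true_coef   # no noise added

toy_model = LinearRegression().fit(X_toy, y_toy)

print("Intercept:   ", toy_model.intercept_)
print("Coefficients:", toy_model.coef_)
```

In the notebook, `regressor.coef_` would hold six values, one per column of the encoded `X`.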

## Step 7. Checking the Model with the Test set

In [31]:
y_pred = regressor.predict(X_test)

pd.DataFrame([y_pred, y_test])

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103015.995771 | 132582.258096 | 132447.973629 | 71976.311689 | 178538.193312 | 116161.551521 | 67851.843225 | 98791.559647 | 113970.00692 | 167921.073688 |
| 1 | 103282.0 | 144259.0 | 146122.0 | 77799.0 | 191050.0 | 105008.0 | 81229.0 | 97484.0 | 110352.0 | 166188.0 |


## Step 8. Measuring the Model Performance
Note: A typical performance measure for regression problems is the RMSE (Root Mean Square Error).
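
For reference, with $n$ test samples, actual profits $y_i$, and predicted profits $\hat{y}_i$, the RMSE is defined as:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

It is expressed in the same units as the target, here US dollars.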

In [32]:
from sklearn.metrics import mean_squared_error

RMSE = mean_squared_error(y_test, y_pred) ** 0.5  # argument order is (y_true, y_pred)
print('RMSE: ', RMSE)

RMSE:  9137.824294071946
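
As a cross-check, the RMSE can be recomputed by hand from the ten predicted and actual profits printed in Step 7. The arrays below are copied from that output (predictions rounded as displayed); the variable names are illustrative:

```python
import numpy as np

# Values copied from the Step 7 output: row 0 (predictions) and row 1 (actuals).
y_pred_shown = np.array([
    103015.995771, 132582.258096, 132447.973629, 71976.311689, 178538.193312,
    116161.551521, 67851.843225, 98791.559647, 113970.00692, 167921.073688,
])
y_test_shown = np.array([
    103282.0, 144259.0, 146122.0, 77799.0, 191050.0,
    105008.0, 81229.0, 97484.0, 110352.0, 166188.0,
])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_test_shown - y_pred_shown) ** 2))

# Error relative to the average actual profit, as a rough sense of scale.
relative_error = rmse / y_test_shown.mean()

print("RMSE:", rmse)
print("RMSE / mean actual profit:", relative_error)
```

The result should agree with the value printed above up to display rounding. Dividing by the mean actual profit gives a rough sense of how large the error is compared with typical profits.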


## Step 9. Using the Model to Predict

In [37]:
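# Predict the profit of a new factory for a given spending plan.
# The first three values are the one-hot country columns; [0, 0, 1] matches the
# rows labelled India in the encoded table. Repeating the prediction with
# [1, 0, 0] and [0, 1, 0] shows which country yields the highest predicted profit.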

NewFactory = np.array([0, 0, 1, 100000, 250000, 500000])

NewFactory = NewFactory.reshape(1, -1)
ProfitPrediction = regressor.predict(NewFactory)
print('The Profit Prediction for the New Factory = ', ProfitPrediction)

The Profit Prediction for the New Factory =  [147126.61557351]
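
To compare all three countries in one pass, the same spending plan can be paired with each one-hot pattern and sent to `predict` as a single batch. The sketch below is self-contained, so it trains a stand-in model (`demo_model`) on made-up data with an invented profit rule; the country it ranks first is an artifact of that rule and says nothing about the real data. In the notebook, `regressor` would take the place of `demo_model`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One-hot column order as it appears in the encoded table (sorted labels).
countries = ["Brezil", "China", "India"]

# Made-up training data with the notebook's column layout:
# 3 one-hot country columns, then equipment, salary, and materials spend.
rng = np.random.default_rng(0)
one_hot = np.eye(3)[np.arange(45) % 3]
spend = rng.uniform(50_000, 500_000, size=(45, 3))
X_demo = np.hstack([one_hot, spend])

# Invented profit rule, used only so the sketch has something to fit.
y_demo = (
    40_000
    + spend @ np.array([0.8, -0.05, 0.03])
    + one_hot @ np.array([0.0, 5_000.0, 10_000.0])
)
demo_model = LinearRegression().fit(X_demo, y_demo)

# Same spending plan as in the cell above, repeated once per country.
plan = [100_000, 250_000, 500_000]
candidates = np.array([list(np.eye(3)[i]) + plan for i in range(3)])

predictions = dict(zip(countries, demo_model.predict(candidates)))
best_country = max(predictions, key=predictions.get)

for name, profit in predictions.items():
    print(f"{name}: {profit:,.0f}")
print("Highest predicted profit:", best_country)
```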
