# Polynomial Regression

Implementing polynomial regression using scikit-learn.

In [19]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting and visualizing data
from matplotlib import rcParams
from pandas_profiling import ProfileReport
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [4]:
df = pd.read_csv('./data/50_Startups.csv')
print(df.shape)
df.head()

(50, 5)


| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [5]:
X, y = df.iloc[:, :-1], df.iloc[:, -1]
print(X.shape, y.shape)

(50, 4) (50,)


The fourth column, `State`, contains categorical data, so it must be encoded before building the model.

If we encode it with three dummy variables, D1 for `State_California`, D2 for `State_Florida` and D3 for `State_New York`, we fall into the dummy variable trap.

The three dummy variables are perfectly multicollinear: every row belongs to exactly one state, so D1 + D2 + D3 = 1 and any one of them is determined by the other two. For example, if D1 and D2 are both zero, the row belongs to neither `State_California` nor `State_Florida`, so it must belong to `State_New York` and D3 is 1.

To avoid the dummy variable trap, we drop one dummy variable when building the model.
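
The redundancy described above can be checked numerically. The following is a minimal sketch on a small made-up `State` column (the toy rows and the names `states`, `full` and `reduced` are illustrative assumptions, not part of the dataset): with all three dummies plus an intercept column, the design matrix loses a rank, while dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only (not read from 50_Startups.csv)
states = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

full = pd.get_dummies(states, dtype=int)                      # three dummy columns
reduced = pd.get_dummies(states, drop_first=True, dtype=int)  # first category dropped

# Each row of the full encoding sums to 1, which duplicates an intercept column.
row_sums = full.sum(axis=1).to_numpy()

# Prepend an intercept column and compare matrix rank with the column count.
ones = np.ones((len(states), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced.to_numpy()]))

print("row sums:", row_sums)
print("full encoding: rank", rank_full, "of", full.shape[1] + 1, "columns")
print("reduced encoding: rank", rank_reduced, "of", reduced.shape[1] + 1, "columns")
```

A rank below the number of columns means the least-squares coefficients are not uniquely determined, which is the practical symptom of the trap.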

In [7]:
X = pd.get_dummies(X, drop_first=True)
X.head(10)

| | R&D Spend | Administration | Marketing Spend | State_Florida | State_New York |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 0 | 1 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 | 0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 0 | 1 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 | 0 |
| 5 | 131876.9 | 99814.71 | 362861.36 | 0 | 1 |
| 6 | 134615.46 | 147198.87 | 127716.82 | 0 | 0 |
| 7 | 130298.13 | 145530.06 | 323876.68 | 1 | 0 |
| 8 | 120542.52 | 148718.95 | 311613.29 | 0 | 1 |
| 9 | 123334.88 | 108679.17 | 304981.62 | 0 | 0 |


In [21]:
X.iloc[:10, :1]

| | R&D Spend |
|---|---|
| 0 | 165349.2 |
| 1 | 162597.7 |
| 2 | 153441.51 |
| 3 | 144372.41 |
| 4 | 142107.34 |
| 5 | 131876.9 |
| 6 | 134615.46 |
| 7 | 130298.13 |
| 8 | 120542.52 |
| 9 | 123334.88 |


### Taking only 1 feature - **`R&D Spend`**

In [32]:
# Split the data into training and test sets (X_train, X_test, y_train, y_test)
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:, :1].values, y.values, test_size=.33, random_state=10)

In [33]:
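# Make both training arrays 2-D column vectors (X_train already is; y_train is 1-D after the split)
X_train = X_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)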

In [34]:
order = X_train[:, 0].argsort()  # row indices that sort the training set by feature value
y_train = y_train[order]
X_train = X_train[order]

In [35]:
print("X_train shape:", X_train.shape, "; y_train shape:", y_train.shape, "\nX_test shape:", X_test.shape, "; y_test shape:", y_test.shape)

X_train shape: (33, 1) ; y_train shape: (33, 1) 
X_test shape: (17, 1) ; y_test shape: (17,)


In [36]:
poly = PolynomialFeatures(degree=2)

In [37]:
X_poly = poly.fit_transform(X_train)
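
To make the transformation concrete, here is a small sketch of what `PolynomialFeatures(degree=2)` produces for a single input feature. The input values and the names `x_demo`, `demo_poly` and `expanded` are illustrative assumptions, not values from the dataset. Each row `[x]` is expanded to `[1, x, x^2]`: a bias column, the original feature, and its square.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative samples with one feature each (assumed values)
x_demo = np.array([[2.0], [3.0]])

demo_poly = PolynomialFeatures(degree=2)
expanded = demo_poly.fit_transform(x_demo)

# Columns are: bias (x^0), x, x^2
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

So with one feature and degree 2, `X_poly` has three columns, and the linear model fitted on it is a quadratic in the original feature.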

In [38]:
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y_train)

LinearRegression()
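
Once the model is fitted, new inputs have to go through the same polynomial expansion before calling `predict`. The sketch below shows that pattern on synthetic, noise-free data generated from a known quadratic. The data and the names `x_fit`, `y_fit`, `x_new`, `demo_poly` and `demo_reg` are assumptions for illustration, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): y = 1 + 2x + 3x^2, no noise
x_fit = np.arange(6, dtype=float).reshape(-1, 1)
y_fit = 1 + 2 * x_fit[:, 0] + 3 * x_fit[:, 0] ** 2
x_new = np.array([[6.0], [7.0]])  # unseen inputs

demo_poly = PolynomialFeatures(degree=2)
demo_reg = LinearRegression()
demo_reg.fit(demo_poly.fit_transform(x_fit), y_fit)

# Reuse the already fitted transformer on new data: transform(), not fit_transform()
y_new = demo_reg.predict(demo_poly.transform(x_new))
print(np.round(y_new, 6))
```

Because the synthetic targets follow an exact quadratic, the predictions should match `1 + 2x + 3x^2` at the new points up to floating-point error. On real data such as the startup profits, the same two calls (`poly.transform(X_test)` followed by `poly_reg.predict(...)`) would give the test-set predictions.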