## SVR in Python

<font size=4 >Let's import all the libraries first!</font>

In [4]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.svm import SVR  # Today's topic

<font size=4 >Let's get the Data!</font>

In [6]:
dataset = pd.read_csv('50_Startups.csv') #Loading the data into DataFrame
dataset.head() #Displaying the data to make sure DataFrame is alright

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


<font size=4>This time our independent variables (X) will be not only Marketing Spend but also <b>Administration and R&D Spend</b>.
    <br>Using more features may improve the accuracy of our model.</font>

In [7]:
# dataframe.iloc takes 1) row selector(s) and 2) column selector(s)
x = dataset.iloc[:, :3].values  # All rows; columns 0, 1, 2 (R&D Spend, Administration, Marketing Spend) are the features
y = dataset.iloc[:, 4:].values  # Profit is at column index 4 (counting from 0); the slice 4: keeps y 2-D, as StandardScaler requires
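The difference between slicing a column (`4:`) and indexing it (`4`) is easy to miss, so here is a minimal sketch on a made-up frame. The `demo` DataFrame and its values are assumptions for illustration only; just the column layout mirrors the dataset above.

```python
import pandas as pd

# Tiny stand-in frame with the same column layout as the dataset (values are made up).
demo = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "Administration": [10.0, 20.0, 30.0],
    "Marketing Spend": [1.0, 2.0, 3.0],
    "State": ["A", "B", "C"],
    "Profit": [5.0, 6.0, 7.0],
})

features = demo.iloc[:, :3].values   # columns 0, 1, 2 -> shape (3, 3)
target_2d = demo.iloc[:, 4:].values  # a slice keeps the column axis -> shape (3, 1)
target_1d = demo.iloc[:, 4].values   # a single index drops it -> shape (3,)

print(features.shape, target_2d.shape, target_1d.shape)
# (3, 3) (3, 1) (3,)
```

The 2-D form is what `StandardScaler` needs, while estimators such as `SVR` expect a 1-D target.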


In [9]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X = sc_x.fit_transform(x)
Y = sc_y.fit_transform(y)
X

array([[  2.01641149e+00,   5.60752915e-01,   2.15394309e+00],
       [  1.95586034e+00,   1.08280658e+00,   1.92360040e+00],
       [  1.75436374e+00,  -7.28257028e-01,   1.62652767e+00],
       [  1.55478369e+00,  -9.63646307e-02,   1.42221024e+00],
       [  1.50493720e+00,  -1.07991935e+00,   1.28152771e+00],
       [  1.27980001e+00,  -7.76239071e-01,   1.25421046e+00],
       [  1.34006641e+00,   9.32147208e-01,  -6.88149930e-01],
       [  1.24505666e+00,   8.71980011e-01,   9.32185978e-01],
       [  1.03036886e+00,   9.86952101e-01,   8.30886909e-01],
       [  1.09181921e+00,  -4.56640246e-01,   7.76107440e-01],
       [  6.20398248e-01,  -3.87599089e-01,   1.49807267e-01],
       [  5.93085418e-01,  -1.06553960e+00,   3.19833623e-01],
       [  4.43259872e-01,   2.15449064e-01,   3.20617441e-01],
       [  4.02077603e-01,   5.10178953e-01,   3.43956788e-01],
       [  1.01718075e+00,   1.26919939e+00,   3.75742273e-01],
       [  8.97913123e-01,   4.58678535e-02,   4.1921870

<font size=4>Firstly, let's split the data into Training and Test sets

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y.ravel(), test_size=0.2, shuffle=False)  # ravel() gives SVR the 1-D target it expects
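In this notebook the scalers are fit on the full dataset before the split, so the test rows influence the scaling. One possible alternative, sketched below on synthetic data, is to let scikit-learn fit the scalers on the training rows only. The data, the variable names and the choice of `TransformedTargetRegressor` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the startup dataset): 3 features, roughly linear target.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100_000, size=(50, 3))
y_raw = 0.8 * X_raw[:, 0] + 0.1 * X_raw[:, 2] + rng.normal(0, 1_000, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=0)

# The pipeline scales the features and TransformedTargetRegressor scales the target.
# Both scalers are fit inside .fit(), so they only ever see the training rows.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="linear")),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))   # R^2 on the held-out rows
print(model.predict(X_te[:1]))   # already converted back to the original target units
```

A side benefit of this layout is that `predict` accepts raw feature values and returns values in the original units, so no manual scaling is needed at prediction time.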

<font size=4>Let's create three different models with RBF, Linear and Polynomial kernels and compare scores of each model

In [13]:
model_rbf = SVR(kernel="rbf").fit(X_train, y_train)
model_poly = SVR(kernel="poly").fit(X_train, y_train)
model_linear = SVR(kernel="linear").fit(X_train, y_train)



<font size=4>Checking the scores on the training set</font>

In [14]:
model_rbf.score(X_train, y_train)

0.95606650940533222

In [15]:
model_poly.score(X_train, y_train)

0.80420135413437843

In [16]:
model_linear.score(X_train, y_train)

0.94620681799790973

<font size=4>As we can see, the RBF and linear kernels fit the training data well (scores of about 0.96 and 0.95). Now, let's check the scores on the test set.</font>

In [17]:
model_rbf.score(X_test, y_test)

-5.1008365827279025

In [18]:
model_poly.score(X_test, y_test)

-0.71884773961332149

In [19]:
model_linear.score(X_test, y_test)

0.43685891344036332
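The negative test scores above can look surprising. For regressors such as `SVR`, `score` returns the coefficient of determination R², which is 1 for a perfect fit, 0 for a model that always predicts the mean of the true values, and negative for anything worse than that. A minimal sketch with made-up numbers (the helper `r2_manual` is hypothetical, written here only to show the formula):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

perfect = y_true.copy()                          # exact predictions
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
far_off = y_true + 5.0                           # every prediction is off by 5

for name, pred in [("perfect", perfect), ("mean only", mean_only), ("far off", far_off)]:
    print(name, r2_manual(y_true, pred), r2_score(y_true, pred))
# perfect 1.0 1.0
# mean only 0.0 0.0
# far off -19.0 -19.0
```

So a test score of about -5 means the model's squared error on the test rows is roughly six times that of simply predicting the test-set mean.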

## Shortcut!

<font size=4>Instead of creating each model separately, we can use a <b>for loop</b> to apply all three kernels and print the scores.</font>

In [20]:
kernels=['linear','rbf','poly'] #Create a list of kernels like this
for i in kernels:
    model = SVR(kernel=i).fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("Kernel: " + i + ". Training Score: " + str(train_score) +". Test score: " + str(test_score))
    

Kernel: linear. Training Score: 0.946206817998. Test score: 0.43685891344
Kernel: rbf. Training Score: 0.956066509405. Test score: -5.10083658273
Kernel: poly. Training Score: 0.804201354134. Test score: -0.718847739613




<font size=4>So far, the SVR model with the linear kernel is the only one with a positive test score (about 0.44), so it generalizes best here. However, the kernel is only one of the model's hyperparameters; others, such as the penalty parameter C and epsilon, should also be <b>TUNED</b>.</font>
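To see what C and epsilon control, it may help to look at a common textbook formulation of the optimization problem behind epsilon-SVR. The notation ($w$, $b$, $\phi$, $\xi_i$, $\xi_i^*$) is introduced here for illustration and does not appear elsewhere in this notebook:

$$
\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)
$$

$$
\text{subject to} \quad
y_i - \left( w^\top \phi(x_i) + b \right) \le \varepsilon + \xi_i, \qquad
\left( w^\top \phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^*, \qquad
\xi_i,\ \xi_i^* \ge 0 .
$$

Errors smaller than $\varepsilon$ cost nothing (the "tube" around the prediction), and $C$ sets how heavily errors outside the tube are penalized relative to keeping $w$ small. A larger $C$ fits the training data more tightly; a larger $\varepsilon$ tolerates larger errors.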

<font size=4> Tuning Penalty Parameter

In [21]:
kernels=['linear','rbf','poly']
penalties = [1, 10, 100]

for kernel in kernels:
    print("Tuning parameters for SVR model with "+kernel+" Kernel")
    for c in penalties:
        model = SVR(kernel=kernel, C=c).fit(X_train, y_train)
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        print("Kernel: "+kernel+". Penalty: "+str(c)+". Training Score: "+str(train_score)+". Test score: "+str(test_score))
    

Tuning parameters for SVR model with linear Kernel
Kernel: linear. Penalty: 1. Training Score: 0.946206817998. Test score: 0.43685891344
Kernel: linear. Penalty: 10. Training Score: 0.945843224236. Test score: 0.454120425373
Kernel: linear. Penalty: 100. Training Score: 0.945850756719. Test score: 0.453889038499
Tuning parameters for SVR model with rbf Kernel
Kernel: rbf. Penalty: 1. Training Score: 0.956066509405. Test score: -5.10083658273
Kernel: rbf. Penalty: 10. Training Score: 0.978390037994. Test score: -4.05775739555
Kernel: rbf. Penalty: 100. Training Score: 0.98605688602. Test score: -5.15041956591
Tuning parameters for SVR model with poly Kernel
Kernel: poly. Penalty: 1. Training Score: 0.804201354134. Test score: -0.718847739613
Kernel: poly. Penalty: 10. Training Score: 0.831415388293. Test score: -0.530221529845
Kernel: poly. Penalty: 100. Training Score: 0.831861635729. Test score: -1.11800305171




<font size=4> Tuning Epsilon Parameter

<font size=4>The linear kernel gives the best test scores of the three kernels, so let's tune the penalty and epsilon parameters for the linear kernel only.</font>

In [22]:
penalties = [1, 10, 100]
epsilons = [0.1, 0.2, 0.3]


for c in penalties:
    for e in epsilons:
        model = SVR(kernel="linear", C=c, epsilon=e).fit(X_train, y_train)
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        print("Penalty: "+str(c)+". Epsilon: "+str(e)+". Training Score: "+str(train_score)+". Test score: "+str(test_score))
    

Penalty: 1. Epsilon: 0.1. Training Score: 0.946206817998. Test score: 0.43685891344
Penalty: 1. Epsilon: 0.2. Training Score: 0.944360823466. Test score: 0.391346085225
Penalty: 1. Epsilon: 0.3. Training Score: 0.937306909146. Test score: 0.186320837563
Penalty: 10. Epsilon: 0.1. Training Score: 0.945843224236. Test score: 0.454120425373
Penalty: 10. Epsilon: 0.2. Training Score: 0.945272039967. Test score: 0.400557355088
Penalty: 10. Epsilon: 0.3. Training Score: 0.941381622635. Test score: 0.307624207515
Penalty: 100. Epsilon: 0.1. Training Score: 0.945850756719. Test score: 0.453889038499
Penalty: 100. Epsilon: 0.2. Training Score: 0.945261134818. Test score: 0.400594441732
Penalty: 100. Epsilon: 0.3. Training Score: 0.941379878779. Test score: 0.308275239322
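The nested loops above pick hyperparameters by looking at the test score, which means the test set is no longer an unbiased check. A commonly used alternative is cross-validated grid search on the training data. The sketch below uses synthetic data and scikit-learn's `GridSearchCV`; the data and variable names are assumptions for illustration, and only the grid values are taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the startup dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0, 0.1, size=60)

# make_pipeline names the SVR step "svr", hence the "svr__" prefix on the parameters.
pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))
param_grid = {
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.1, 0.2, 0.3],
}

# 5-fold cross-validation: each combination is scored on held-out folds of the given data.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best combination
```

With this approach the test set can be kept aside and used once, at the end, to evaluate the chosen model.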




## Conclusion

<font size=4>Based on the training score, the best model is the one with penalty=1 and epsilon=0.1, both default values (training score: 0.946).</font>


<font size=4>Based on the test score, the best model is the one with penalty=10 and epsilon=0.1 (test score: 0.454).</font>

<font size=4>So far, we have three candidate models:</font>

<font size=4> 
Model 1: Kernel = Linear, Penalty = 1(Default), Epsilon = 0.1(Default)<br>
Model 2: Kernel = Linear, Penalty = 10, Epsilon = 0.1(Default)<br>
Model 3: Kernel = Linear, Penalty = 100, Epsilon = 0.1(Default)

In [23]:
model_1 = SVR(kernel="linear", C=1, epsilon=0.1).fit(X_train, y_train)



In [24]:
model_2 = SVR(kernel="linear", C=10, epsilon=0.1).fit(X_train, y_train)



In [37]:
model_3 = SVR(kernel="linear", C=100, epsilon=0.1).fit(X_train, y_train)



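### Attention: by default, train_test_split shuffles the data randomly, so the split (and therefore the scores) differs on each run.<br>This does not happen here because we set "shuffle=False"; fixing "random_state" is another way to get reproducible splits.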
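A small sketch of the two ways to get a reproducible split, using a toy array in place of the dataset (the array and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# shuffle=False: the last 20% of the rows always become the test set.
_, test_ordered = train_test_split(data, test_size=0.2, shuffle=False)

# Default shuffle=True with a fixed seed: a random split, but identical on every run.
_, test_seeded_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_seeded_b = train_test_split(data, test_size=0.2, random_state=42)

print(test_ordered)                                  # [8 9]
print(np.array_equal(test_seeded_a, test_seeded_b))  # True
```

One thing to keep in mind with `shuffle=False`: the five rows shown by `dataset.head()` are in descending order of Profit. If the whole file is ordered that way, the test set consists only of the lowest-profit rows, which may partly explain the low and negative test scores seen earlier. A shuffled split with a fixed `random_state` avoids this.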

## Prediction with model 3:
Let's predict the profit (y) of one known row from its features x1 (R&D Spend), x2 (Administration) and x3 (Marketing Spend) to check the prediction accuracy of model 3. We use the row at index 1: x1=162597.7, x2=151377.59, x3=443898.53, whose actual profit is y=191792.06 (192261.83 is the profit of the row at index 0).


In [45]:
y_pred= model_3.predict([[ 162597.7 ,  151377.59,  443898.53]])
y_pred

array([ 164053.26975896])

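Model 3 (Penalty: 100, Epsilon: 0.1, Training Score: 0.9458, Test score: 0.4538) returns y_pred=164053.27. Note, however, that model 3 was trained on the standardized X and Y, while the input passed to predict() here is unscaled, so this number should not be read as a profit in dollars or compared directly with the actual profit. For a meaningful prediction, transform the input with sc_x first and convert the output back with sc_y.inverse_transform. After that, you can try any new values for x1, x2 and x3 and predict y, the profit of a new startup.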
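Because the model in this notebook is trained on standardized features and a standardized target, a prediction for raw input values needs a scaling step before `predict` and an inverse scaling step after it. The following is a minimal sketch on synthetic data; the data and the names `sc_x_demo`, `sc_y_demo` and `svr` are assumptions for illustration and are not the notebook's objects.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the startup dataset); the target is in "raw" units.
rng = np.random.default_rng(1)
x_raw = rng.uniform(50_000, 150_000, size=(40, 3))
y_raw = (0.9 * x_raw[:, 0] + 0.05 * x_raw[:, 2] + 40_000).reshape(-1, 1)

# Standardize features and target, as the notebook does.
sc_x_demo, sc_y_demo = StandardScaler(), StandardScaler()
X_s = sc_x_demo.fit_transform(x_raw)
Y_s = sc_y_demo.fit_transform(y_raw).ravel()

svr = SVR(kernel="linear", C=100, epsilon=0.1).fit(X_s, Y_s)

new_x = np.array([[100_000.0, 120_000.0, 90_000.0]])

# 1) scale the raw input with the SAME scaler that was fit on the training features
pred_scaled = svr.predict(sc_x_demo.transform(new_x))
# 2) convert the scaled prediction back to the original target units
pred_raw = sc_y_demo.inverse_transform(pred_scaled.reshape(-1, 1))

print(pred_raw.ravel())  # compare with the noise-free value 134500.0 for this input
```

In the notebook itself, the same two steps would use `sc_x`, `sc_y` and `model_3`.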

## Questions

Find another dataset and construct questions