# Supervised Learning Algorithms: Decision Trees

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [13]:
%matplotlib notebook
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [14]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [15]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3)
clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8, random_state = 0).fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on training set: 0.96
Accuracy of Decision Tree classifier on test set: 0.95
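
To make the split criterion concrete, here is a minimal pure-Python sketch of the Gini impurity, which `DecisionTreeClassifier` uses by default to score candidate splits. The toy labels and the helper names `gini` and `split_impurity` are illustrative assumptions and are not part of the template.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical toy labels, not taken from Iris.csv.
parent = ['setosa'] * 4 + ['versicolor'] * 4
pure_split = split_impurity(['setosa'] * 4, ['versicolor'] * 4)
mixed_split = split_impurity(['setosa', 'setosa', 'versicolor', 'versicolor'],
                             ['setosa', 'setosa', 'versicolor', 'versicolor'])

print('parent impurity:', gini(parent))
print('pure split impurity:', pure_split)
print('mixed split impurity:', mixed_split)
```

The tree keeps choosing the split with the lowest weighted impurity; `max_depth` and `min_samples_leaf` limit how far that process is allowed to go.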


# Supervised Learning Algorithms: Gradient-boosted Decision Trees

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [16]:
%matplotlib notebook
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [17]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [18]:
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

#### COMPARE YOUR MODELS ####

# Model with learning_rate = 0.1 and max_depth = 3 (default settings)
clf = GradientBoostingClassifier(random_state = 0).fit(X_train, y_train)

print('Iris dataset (learning_rate=0.1, max_depth=3)')
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}\n'
     .format(clf.score(X_test, y_test)))

# Model with learning_rate = 0.01 and max_depth = 2
clf = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, random_state = 0).fit(X_train, y_train)

print('Iris dataset (learning_rate=0.01, max_depth=2)')
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Iris dataset (learning_rate=0.1, max_depth=3)
Accuracy of GBDT classifier on training set: 1.00
Accuracy of GBDT classifier on test set: 0.97

Iris dataset (learning_rate=0.01, max_depth=2)
Accuracy of GBDT classifier on training set: 0.97
Accuracy of GBDT classifier on test set: 0.97
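
The drop in training accuracy at the lower learning rate can be illustrated with a small hand-written booster. This is only a sketch under stated assumptions: it boosts one-split stumps on residuals for squared-error regression (not the log-loss procedure the classifier uses), and the data and the function names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, learning_rate, n_rounds):
    """Boost stumps on residuals; return the final training MSE."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_value if x < threshold else right_value)
                 for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5   # hypothetical step-shaped target

mse_fast = boost(xs, ys, learning_rate=0.1, n_rounds=100)
mse_slow = boost(xs, ys, learning_rate=0.01, n_rounds=100)
print('training MSE, learning_rate=0.1 :', mse_fast)
print('training MSE, learning_rate=0.01:', mse_slow)
```

Each round only applies a fraction of the correction, so with the same number of rounds a smaller learning rate leaves more training error, which usually has to be offset by a larger `n_estimators`.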


# Supervised Learning Algorithms: Kernelized Support Vector Machines

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [19]:
%matplotlib notebook

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [20]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the SVM with both the default radial basis function (RBF) kernel and a polynomial kernel.*

In [21]:
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)



# The default SVC kernel is radial basis function (RBF)
clf = SVC().fit(X_train, y_train)

print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

# Compare with a polynomial kernel, degree = 3 (this can be slow on large datasets)
clf = SVC(kernel = 'poly', degree = 3).fit(X_train, y_train)

print('Accuracy of poly-kernel SVC on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of poly-kernel SVC on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of RBF-kernel SVC on training set: 0.96
Accuracy of RBF-kernel SVC on test set: 0.97
Accuracy of poly-kernel SVC on training set: 0.99
Accuracy of poly-kernel SVC on test set: 0.97


### 3.1. Support Vector Machine with RBF kernel: gamma parameter

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

for this_gamma in [0.00001, 100]:
    clf = SVC(kernel = 'rbf', gamma=this_gamma).fit(X_train, y_train)
    print('SVM (RBF) with gamma = {}'.format(this_gamma))
    print('Accuracy of SVM (RBF) classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of SVM (RBF) classifier on test set: {:.2f}\n'
         .format(clf.score(X_test, y_test)))

SVM (RBF) with gamma = 1e-05
Accuracy of SVM (RBF) classifier on training set: 0.37
Accuracy of SVM (RBF) classifier on test set: 0.24

SVM (RBF) with gamma = 100
Accuracy of SVM (RBF) classifier on training set: 1.00
Accuracy of SVM (RBF) classifier on test set: 0.37
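
The two extreme results above follow from how `gamma` enters the RBF kernel, k(u, v) = exp(-gamma * ||u - v||^2). The sketch below evaluates that formula for two hypothetical flowers; the measurements are made up for illustration and are not read from `Iris.csv`.

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

# Two hypothetical flowers (sepal length, sepal width, petal length, petal width).
flower_a = [5.1, 3.5, 1.4, 0.2]
flower_b = [6.3, 3.3, 6.0, 2.5]

similarities = {gamma: rbf_kernel(flower_a, flower_b, gamma)
                for gamma in [0.00001, 1, 100]}
for gamma, value in similarities.items():
    print('gamma = {}: k(a, b) = {:.6f}'.format(gamma, value))
```

With a tiny `gamma` every pair of points looks almost identical, so the model cannot separate the classes; with a very large `gamma` each point resembles only itself, so the model memorizes the training set and generalizes poorly.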



### 3.2. Support Vector Machine with RBF kernel: using both C and gamma parameters

In [23]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

for this_gamma in [0.01, 1, 5]:
    for this_C in [0.1, 1, 15, 250]:
        clf = SVC(kernel = 'rbf', gamma = this_gamma, C = this_C).fit(X_train, y_train)
        print('SVM (RBF) with gamma = {} and C = {}'.format(this_gamma, this_C))
        print('Accuracy of SVM (RBF) classifier on training set: {:.2f}'
             .format(clf.score(X_train, y_train)))
        print('Accuracy of SVM (RBF) classifier on test set: {:.2f}\n'
             .format(clf.score(X_test, y_test)))

SVM (RBF) with gamma = 0.01 and C = 0.1
Accuracy of SVM (RBF) classifier on training set: 0.70
Accuracy of SVM (RBF) classifier on test set: 0.58

SVM (RBF) with gamma = 0.01 and C = 1
Accuracy of SVM (RBF) classifier on training set: 0.94
Accuracy of SVM (RBF) classifier on test set: 0.92

SVM (RBF) with gamma = 0.01 and C = 15
Accuracy of SVM (RBF) classifier on training set: 0.97
Accuracy of SVM (RBF) classifier on test set: 0.97

SVM (RBF) with gamma = 0.01 and C = 250
Accuracy of SVM (RBF) classifier on training set: 0.98
Accuracy of SVM (RBF) classifier on test set: 0.97

SVM (RBF) with gamma = 1 and C = 0.1
Accuracy of SVM (RBF) classifier on training set: 0.96
Accuracy of SVM (RBF) classifier on test set: 0.97

SVM (RBF) with gamma = 1 and C = 1
Accuracy of SVM (RBF) classifier on training set: 0.97
Accuracy of SVM (RBF) classifier on test set: 0.97

SVM (RBF) with gamma = 1 and C = 15
Accuracy of SVM (RBF) classifier on training set: 1.00
Accuracy of SVM (RBF) classifier on te

### 3.3. SVMs with normalized data (feature preprocessing using minmax scaling)

In [24]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = SVC(C=10).fit(X_train_scaled, y_train)
print('Iris dataset (normalized with MinMax scaling)')
print('RBF-kernel SVC (with MinMax scaling) training set accuracy: {:.2f}'
     .format(clf.score(X_train_scaled, y_train)))
print('RBF-kernel SVC (with MinMax scaling) test set accuracy: {:.2f}'
     .format(clf.score(X_test_scaled, y_test)))

Iris dataset (normalized with MinMax scaling)
RBF-kernel SVC (with MinMax scaling) training set accuracy: 0.98
RBF-kernel SVC (with MinMax scaling) test set accuracy: 0.97
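
The cell above fits the scaler on the training split only and then reuses it for the test split. The following sketch reproduces that idea by hand, assuming no column is constant (otherwise the range would be zero); the rows and the helper names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(rows):
    """Learn per-column minimum and range from the training rows."""
    columns = list(zip(*rows))
    minimums = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return minimums, ranges

def apply_min_max(rows, minimums, ranges):
    """Scale rows with statistics learned elsewhere (the training set)."""
    return [[(value - lo) / span for value, lo, span in zip(row, minimums, ranges)]
            for row in rows]

# Hypothetical measurements, not read from Iris.csv.
train_rows = [[5.0, 1.0], [6.0, 3.0], [7.0, 5.0]]
test_rows = [[8.0, 2.0]]

minimums, ranges = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, minimums, ranges)
test_scaled = apply_min_max(test_rows, minimums, ranges)

print(train_scaled)
print(test_scaled)
```

Training values land in [0, 1], while a test value beyond the training maximum scales to more than 1. That is expected, and it is the reason the test set goes through `transform` rather than `fit_transform`.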


# Supervised Learning Algorithms: Lasso Regression

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [25]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [37]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('iris.csv')

X = df[['a',
 'b',
 'c',
 'd']]
y = df['label']

## 3. The Model

*Run to build the model.*

In [38]:
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=2.0, max_iter = 10000).fit(X_train_scaled, y_train)

In [39]:


### Intercept & coefficient, # of non-zero features & weights, R-squared for training & test data set
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:{}'
     .format(linlasso.coef_))
print('\nNon-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('\nR-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))

lasso regression linear model intercept: 2.0357142857142856
lasso regression linear model coeff:[ 0. -0.  0.  0.]

Non-zero features: 0

R-squared score (training): 0.000
R-squared score (test): -0.035

Features with non-zero weight (sorted by absolute magnitude):


### 3.1. Regularization parameter alpha on R-squared

*Run to check how alpha affects the model score.*

In [40]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))

Lasso regression: effect of alpha regularization
parameter on number of features kept in final model

Alpha = 0.50
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 1.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 2.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 3.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 5.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 10.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 20.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03

Alpha = 50.00
Features kept: 0, r-squared training: 0.00, r-squared test: -0.03
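
Every alpha in the sweep above keeps zero features. A plausible reason is that all of the tested values exceed the threshold at which Lasso zeroes every coefficient. The sketch below computes that threshold for a toy problem, assuming the objective (1 / (2n)) * ||y - Xw - b||^2 + alpha * ||w||_1 with an unpenalized intercept; the data and the name `lasso_alpha_max` are hypothetical.

```python
def lasso_alpha_max(rows, targets):
    """Smallest alpha at which every Lasso coefficient is zero.

    Assumes the objective (1 / (2 * n)) * ||y - Xw - b||^2 + alpha * ||w||_1
    with an unpenalized intercept b.
    """
    n = len(targets)
    target_mean = sum(targets) / n
    centered_targets = [t - target_mean for t in targets]
    correlations = []
    for column in zip(*rows):
        column_mean = sum(column) / n
        dot = sum((x - column_mean) * t for x, t in zip(column, centered_targets))
        correlations.append(abs(dot) / n)
    return max(correlations)

# Hypothetical min-max scaled feature and a 1/2/3 label used as a numeric target.
rows = [[0.0], [0.5], [1.0]]
targets = [1.0, 2.0, 3.0]

alpha_max = lasso_alpha_max(rows, targets)
print('all coefficients are zero once alpha >= {:.3f}'.format(alpha_max))
```

On this toy data the threshold is about 0.333, already below the smallest alpha tried above. With features scaled to [0, 1] and targets between 1 and 3, a sweep over much smaller values (for example 0.001 to 0.1, an assumption to verify on the real data) is likely to be more informative.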



# Supervised Learning Algorithms: Linear Regression

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [41]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [54]:
### Data Input Example 
df = pd.read_csv('iris.csv')
X = df[['a',
 'b',
 'c',
 'd']]
y = df['label']

df.describe()



Unnamed: 0,a,b,c,d,label
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,2.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,1.0
25%,5.1,2.8,1.6,0.3,1.0
50%,5.8,3.0,4.35,1.3,2.0
75%,6.4,3.3,5.1,1.8,3.0
max,7.9,4.4,6.9,2.5,3.0


## 3. The Model

*Run to build the model.*

In [55]:
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2 ,random_state = 0)

X_train.shape , X_test.shape , y_train.shape , y_test.shape

((120, 4), (30, 4), (120,), (30,))

In [56]:
# Feature Normalization
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [57]:
linreg = LinearRegression()
linreg = linreg.fit(X_train_scaled, y_train)
linreg

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

*Predict on the test data (unseen data).*

In [58]:
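# The model was fit on X_train_scaled, so predict on the scaled test features.
# Note: the recorded output below comes from calling predict on unscaled X_test,
# which is why many values fall outside the 1-3 label range.
linreg.predict(X_test_scaled)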

array([4.67983566, 3.56374028, 2.13899303, 4.74827843, 2.23835506,
       5.03972676, 2.20245403, 4.00618698, 3.99581108, 3.68607204,
       4.38777709, 3.95009275, 3.92373159, 3.98861937, 4.01641147,
       2.15605107, 3.99121915, 3.85907007, 2.26968997, 2.20681338,
       4.42767297, 4.025512  , 2.41912921, 2.26020546, 4.23858947,
       2.0674726 , 2.479683  , 3.7974435 , 3.36718129, 2.37343671])

**Regression Equation: Y = AX + B.**

In [59]:
# Model parameters

print('linear model coeff (w): {}'.format(linreg.coef_) , " are the coefficient for each column or A.")
print('linear model intercept (b): {:.3f}'.format(linreg.intercept_) , " is our intercept or B.")
print(" X is the predictor")

linear model coeff (w): [-0.09000309 -0.01708388  0.40693384  0.47194134]  are the coefficient for each column or A.
linear model intercept (b): 2.042  is our intercept or B.
 X is the predictor


In [60]:
print('R-squared score (training): {:.3f}' .format(linreg.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test_scaled, y_test)))

R-squared score (training): 0.934
R-squared score (test): 0.906


**What Is R-Squared?**

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:

`R-squared = Explained variation / Total variation`

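For a least-squares fit evaluated on its own training data, R-squared lies between 0 and 100%. On held-out data, or for a model that fits worse than predicting the mean, it can be negative.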

`0% indicates that the model explains none of the variability of the response data around its mean.`

`100% indicates that the model explains all the variability of the response data around its mean.`

**In general, the higher the R-squared, the better the model fits your data, although this guideline has important exceptions.**
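
As a worked illustration of the definition, the sketch below computes R-squared as 1 - SS_res / SS_tot for two sets of predictions on the values x = [1, 2, 3], y = [1, 5, 25]. The function name `r_squared` and the deliberately poor "always predict 0" model are assumptions made for the example.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 5.0, 25.0]

# Least-squares line through x = [1, 2, 3]: slope 12, intercept -41/3.
line_predictions = [12.0 * x - 41.0 / 3.0 for x in [1.0, 2.0, 3.0]]
r2_line = r_squared(actual, line_predictions)

# A deliberately poor model that always predicts 0.
r2_bad = r_squared(actual, [0.0, 0.0, 0.0])

print('least-squares line: {:.3f}'.format(r2_line))
print('always predict 0:   {:.3f}'.format(r2_bad))
```

For a simple least-squares line this value equals the squared correlation between x and y, while the poor model scores below zero because it does worse than predicting the mean.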

In [62]:
import numpy as np
x_values = [1,2,3]
y_values = [1,5,25]

correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

print('r_squared is ',r_squared)

r_squared is  0.870967741935484


# Supervised Learning Algorithms: Linear Support Vector Machines

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [63]:
%matplotlib notebook
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [64]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [65]:
from sklearn.svm import SVC


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

#C parameter
this_C = 1.0

#model
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('Linear SVC, C = {:.3f}'.format(this_C))
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Linear SVC, C = 1.000
Accuracy of Linear SVC classifier on training set: 0.98
Accuracy of Linear SVC classifier on test set: 0.97


### 3.1. Linear SVM regularization: C parameter

In [66]:
for this_C in [0.00001, 100]:
    clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
    print('Linear SVM with C = {}'.format(this_C))
    print('Accuracy of Linear SVM classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Linear SVM classifier on test set: {:.2f}\n'
         .format(clf.score(X_test, y_test)))

Linear SVM with C = 1e-05
Accuracy of Linear SVM classifier on training set: 0.37
Accuracy of Linear SVM classifier on test set: 0.24

Linear SVM with C = 100
Accuracy of Linear SVM classifier on training set: 0.99
Accuracy of Linear SVM classifier on test set: 0.95



### 3.2. LinearSVC with M classes generates M one vs rest classifiers

In [67]:
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

clf = LinearSVC(C=5, random_state = 67).fit(X_train, y_train)
print('Coefficients:\n', clf.coef_)
print('Intercepts:\n', clf.intercept_)

Coefficients:
 [[ 0.31043251  0.36867151 -1.00262293 -0.5141807 ]
 [-0.09729638 -0.94587259  0.50707037 -1.06958204]
 [-1.4579303  -0.91366055  2.19421646  2.06024111]]
Intercepts:
 [ 0.14297945  2.18877299 -2.30899525]
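
Each row of `coef_` together with the matching entry of `intercept_` defines one "this class vs the rest" linear scorer, and the predicted class is the one with the highest score. The sketch below applies the printed coefficients to one hypothetical flower; it assumes row i corresponds to `clf.classes_[i]`, and the sample values are made up.

```python
# Coefficients and intercepts copied from the output above.
coefficients = [
    [0.31043251, 0.36867151, -1.00262293, -0.5141807],
    [-0.09729638, -0.94587259, 0.50707037, -1.06958204],
    [-1.4579303, -0.91366055, 2.19421646, 2.06024111],
]
intercepts = [0.14297945, 2.18877299, -2.30899525]

def decision_scores(sample):
    """One score per class: dot(w_class, sample) + b_class."""
    return [sum(w * x for w, x in zip(weights, sample)) + bias
            for weights, bias in zip(coefficients, intercepts)]

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm).
sample = [5.1, 3.5, 1.4, 0.2]
scores = decision_scores(sample)
predicted_index = scores.index(max(scores))

print('scores:', [round(s, 3) for s in scores])
print('predicted class index:', predicted_index)
```

Only the first scorer returns a positive value for this sample, so the first class wins.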




# Supervised Learning Algorithms: Linear vs Polynomial Regression

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [68]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [71]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('iris.csv')
X = df[['a',
 'b',
 'c',
 'd']]
y = df['label']


## 3. The Models

### 3.1. Linear Regression

*Run to build the Linear Regression model.*

In [72]:
from sklearn.linear_model import LinearRegression

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

# Fit the linear regression model
linreg = LinearRegression().fit(X_train, y_train)

### intercept & coefficient, R-squared for training & test data set
print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

linear model coeff (w): [-0.15330146 -0.02540761  0.26698013  0.57386186]
linear model intercept (b): 1.300
R-squared score (training): 0.940
R-squared score (test): 0.889


### 3.2. Polynomial Regression

*Run to build the Polynomial Regression model.*

In [73]:
from sklearn.preprocessing import PolynomialFeatures

'''
Now we transform the original input data to add
polynomial features up to degree 2

'''

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                   random_state = 0)
# Fit linear regression on the polynomial features
linreg = LinearRegression().fit(X_train, y_train)

### intercept & coefficient, R-squared for training & test data set
print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
     .format(linreg.score(X_test, y_test)))

(poly deg 2) linear model coeff (w):
[ 0.         -0.35383024 -1.09570199 -0.31595616  2.68998471 -0.20173422
  0.68593108  0.06865964  0.14123184 -0.31523687 -0.07634341 -0.75562247
  0.20831985 -0.87488173  1.04333891]
(poly deg 2) linear model intercept (b): 3.461
(poly deg 2) R-squared score (training): 0.961
(poly deg 2) R-squared score (test): 0.891
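
The 15 coefficients printed above match the number of degree-2 terms that can be built from four features. The sketch below enumerates those terms by hand; the helper name `polynomial_terms` is hypothetical, and the ordering is only assumed to resemble the one `PolynomialFeatures` produces.

```python
from itertools import combinations_with_replacement

def polynomial_terms(feature_names, degree):
    """List every monomial of the features up to the given degree."""
    terms = ['1']  # bias column
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            terms.append('*'.join(combo))
    return terms

terms = polynomial_terms(['a', 'b', 'c', 'd'], degree=2)
print(len(terms))
print(terms)
```

The count is 1 bias term + 4 linear terms + 10 squares and pairwise products. Because `LinearRegression` fits its own intercept, the constant bias column receives a coefficient of 0, as in the first entry of the printed array.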



### 3.3. Polynomial Regression with Regularization

*Run to build the Polynomial Regression model with a regularization penalty.*

In [74]:
from sklearn.linear_model import Ridge

'''
Addition of many polynomial features often leads to
overfitting, so we often use polynomial features in combination
with regression that has a regularization penalty, like ridge
regression.
'''

X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                   random_state = 0)
linreg = Ridge().fit(X_train, y_train)

### intercept & coefficient, R-squared for training & test data set
print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

(poly deg 2 + ridge) linear model coeff (w):
[ 0.          0.0626021  -0.10693389  0.37870369  0.19859575 -0.07053918
  0.18970256 -0.03593867  0.10221477 -0.12045426 -0.07232519 -0.12093446
  0.05957226 -0.03572845  0.07498019]
(poly deg 2 + ridge) linear model intercept (b): 0.873
(poly deg 2 + ridge) R-squared score (training): 0.953
(poly deg 2 + ridge) R-squared score (test): 0.908
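
The smaller ridge coefficients above come from the L2 penalty, which shrinks weights toward zero. For a single feature with an unpenalized intercept, the ridge slope has the closed form S_xy / (S_xx + alpha); the sketch below evaluates it on hypothetical data. `Ridge()` uses `alpha = 1.0` by default, and the name `ridge_slope` is an assumption for this example.

```python
def ridge_slope(xs, ys, alpha):
    """Slope of a one-feature ridge fit with an unpenalized intercept.

    Minimizes sum((y - w * x - b)^2) + alpha * w^2.
    """
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / (s_xx + alpha)

# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in [0.0, 1.0, 10.0]}
for alpha, slope in slopes.items():
    print('alpha = {}: slope = {:.3f}'.format(alpha, slope))
```

With `alpha = 0` the slope is the ordinary least-squares value of 2; larger alphas pull it toward zero, trading a little training fit for lower variance.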


# Supervised Learning Algorithms: Logistic regression

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [75]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [76]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [77]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

clf = LogisticRegression().fit(X_train, y_train)

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.98
Accuracy of Logistic regression classifier on test set: 0.97


### 3.1. Logistic regression regularization: C parameter

In [78]:
for this_C in [0.1, 1, 100]:
    clf = LogisticRegression(C=this_C).fit(X_train, y_train)
    print('Logistic Regression with C = {}'.format(this_C))
    print('Accuracy of Logistic regression classifier on training set: {:.2f}'
         .format(clf.score(X_train, y_train)))
    print('Accuracy of Logistic regression classifier on test set: {:.2f}\n'
         .format(clf.score(X_test, y_test)))

Logistic Regression with C = 0.1
Accuracy of Logistic regression classifier on training set: 0.93
Accuracy of Logistic regression classifier on test set: 0.92

Logistic Regression with C = 1
Accuracy of Logistic regression classifier on training set: 0.98
Accuracy of Logistic regression classifier on test set: 0.97

Logistic Regression with C = 100
Accuracy of Logistic regression classifier on training set: 0.99
Accuracy of Logistic regression classifier on test set: 0.97



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
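
In section 3.1, `C` is the inverse of the regularization strength: a smaller `C` penalizes large weights more heavily. The sketch below shows this on a one-feature problem, using plain gradient descent on sum(log-loss) + w^2 / (2C) with no intercept. It is a simplified stand-in rather than the solver scikit-learn uses, and the data and the name `fit_logistic_slope` are hypothetical.

```python
import math

def fit_logistic_slope(xs, labels, C, n_steps=2000):
    """Gradient descent on sum(log-loss) + w^2 / (2 * C), no intercept.

    labels are -1 or +1; a smaller C means a stronger L2 penalty.
    """
    curvature_bound = sum(x * x for x in xs) / 4.0 + 1.0 / C
    step = 1.0 / curvature_bound
    w = 0.0
    for _ in range(n_steps):
        gradient = w / C
        for x, t in zip(xs, labels):
            margin = t * x * w
            gradient += -t * x / (1.0 + math.exp(margin))
        w -= step * gradient
    return w

# Hypothetical, non-separable one-feature data.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [-1, -1, 1, -1, 1, 1]

weights = {C: fit_logistic_slope(xs, labels, C) for C in [0.1, 1, 100]}
for C, w in weights.items():
    print('C = {}: w = {:.3f}'.format(C, w))
```

The learned weight grows with `C`, which mirrors the pattern above: `C = 0.1` underfits slightly, while larger values fit the training set more closely.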


# Supervised Learning Algorithms: Naive Bayes classifiers

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [81]:
%matplotlib notebook
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [82]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 
### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [83]:
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

nbclf = GaussianNB().fit(X_train, y_train)
print('Iris dataset')
print('Accuracy of GaussianNB classifier on training set: {:.2f}'
     .format(nbclf.score(X_train, y_train)))
print('Accuracy of GaussianNB classifier on test set: {:.2f}'
     .format(nbclf.score(X_test, y_test)))

Iris dataset
Accuracy of GaussianNB classifier on training set: 0.95
Accuracy of GaussianNB classifier on test set: 1.00
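
To show what Gaussian Naive Bayes computes, the sketch below fits one mean and variance per class for a single feature and picks the class with the highest log prior plus log likelihood. It is a simplified illustration: it omits the variance smoothing that `GaussianNB` applies, and the data and function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)

def fit_gaussian_nb(samples_by_class):
    """Per class: log prior plus mean and variance of the single feature."""
    total = sum(len(values) for values in samples_by_class.values())
    model = {}
    for label, values in samples_by_class.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        model[label] = (math.log(len(values) / total), mean, variance)
    return model

def predict(model, x):
    """Pick the class with the highest log prior + log likelihood."""
    return max(model, key=lambda label: model[label][0]
               + gaussian_log_pdf(x, model[label][1], model[label][2]))

# Hypothetical petal lengths (cm) for two classes.
model = fit_gaussian_nb({
    'short': [1.0, 1.2, 0.8, 1.1],
    'long': [4.0, 4.5, 5.0, 4.7],
})
print(predict(model, 1.3), predict(model, 4.2))
```

With several features, the "naive" independence assumption means the per-feature log likelihoods are simply added together before the classes are compared.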


# Supervised Learning Algorithms: Neural Networks - Classification

*In this template, only **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [84]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [85]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

In [86]:
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nnclf = MLPClassifier(hidden_layer_sizes = 2, solver='lbfgs', random_state = 0).fit(X_train, y_train)

print('Accuracy of NN Classifier on training set: {:.2f}'.format(nnclf.score(X_train, y_train)))
print('Accuracy of NN Classifier on test set: {:.2f}'.format(nnclf.score(X_test, y_test)))

Accuracy of NN Classifier on training set: 0.98
Accuracy of NN Classifier on test set: 0.97


### 3.1. Accuracy with different hidden layer sizes

In [87]:
for units in [1, 10, 100]:
    nnclf = MLPClassifier(hidden_layer_sizes = [units], solver='lbfgs',
                         random_state = 0).fit(X_train, y_train)
    print('Accuracy of NN Classifier with hidden layers = {} on training set: {:.2f}'.format(units, nnclf.score(X_train, y_train)))
    print('Accuracy of NN Classifier with hidden layers = {} on test set: {:.2f}\n'.format(units, nnclf.score(X_test, y_test)))



Accuracy of NN Classifier with hidden layers = 1 on training set: 0.37
Accuracy of NN Classifier with hidden layers = 1 on test set: 0.24

Accuracy of NN Classifier with hidden layers = 10 on training set: 1.00
Accuracy of NN Classifier with hidden layers = 10 on test set: 0.95

Accuracy of NN Classifier with hidden layers = 100 on training set: 1.00
Accuracy of NN Classifier with hidden layers = 100 on test set: 0.97
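
The loop above varies the number of units in a single hidden layer, which changes how many weights the network can tune. The sketch below counts weights and biases for a fully connected network with 4 inputs and 3 outputs, which is an assumption matching the four Iris features and three species; the name `mlp_parameter_count` is hypothetical.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs] + list(hidden_layer_sizes) + [n_outputs]
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 4 Iris features in, 3 species out, one hidden layer of varying width.
counts = {units: mlp_parameter_count(4, [units], 3) for units in [1, 10, 100]}
for units, count in counts.items():
    print('hidden units = {}: {} trainable parameters'.format(units, count))
```

A single hidden unit forces all four features through one number before the output layer, a bottleneck that may explain the weak accuracy for `units = 1`, while 10 or 100 units give the model far more capacity.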



### 3.2. Accuracy with a regularization parameter: alpha

In [88]:
for this_alpha in [0.01, 0.1, 1.0, 5.0]:
    nnclf = MLPClassifier(solver='lbfgs', activation = 'tanh',
                         alpha = this_alpha,
                         hidden_layer_sizes = [100, 100],
                         random_state = 0).fit(X_train, y_train)
    print('Accuracy of NN Classifier with alpha = {} on training set: {:.2f}'.format(this_alpha, nnclf.score(X_train, y_train)))
    print('Accuracy of NN Classifier with alpha = {} on test set: {:.2f}\n'.format(this_alpha, nnclf.score(X_test, y_test)))



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Accuracy of NN Classifier with alpha = 0.01 on training set: 1.00
Accuracy of NN Classifier with alpha = 0.01 on test set: 0.97

Accuracy of NN Classifier with alpha = 0.1 on training set: 1.00
Accuracy of NN Classifier with alpha = 0.1 on test set: 0.97

Accuracy of NN Classifier with alpha = 1.0 on training set: 0.99
Accuracy of NN Classifier with alpha = 1.0 on test set: 0.97

Accuracy of NN Classifier with alpha = 5.0 on training set: 0.96
Accuracy of NN Classifier with alpha = 5.0 on test set: 0.97
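
*The lbfgs warning in the output above suggests raising `max_iter` or scaling the data. A minimal sketch of both, assuming scikit-learn's bundled Iris data as a stand-in for 'Iris.csv'. The names `X_demo`, `y_demo` and `scaled_nn` are illustrative and leave `df`, `X` and `y` untouched. Whether the warning disappears depends on the data and the scikit-learn version.*

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for 'Iris.csv'.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training data only;
# max_iter is raised above MLPClassifier's default of 200.
scaled_nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=[100, 100], activation='tanh', alpha=1.0,
                  solver='lbfgs', max_iter=1000, random_state=0),
)
scaled_nn.fit(X_tr, y_tr)

test_accuracy = scaled_nn.score(X_te, y_te)
print('Accuracy of scaled NN Classifier on test set: {:.2f}'.format(test_accuracy))
```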


### 3.3. Choices of activation function - tanh, logistic and relu

##### 3.3.1. Activation functions representation

In [89]:
# plot tanh, logistic and relu

xrange = np.linspace(-2, 2, 200)

plt.figure(figsize=(7,6))

plt.plot(xrange, np.maximum(xrange, 0), label = 'relu')
plt.plot(xrange, np.tanh(xrange), label = 'tanh')
plt.plot(xrange, 1 / (1 + np.exp(-xrange)), label = 'logistic')
plt.legend()
plt.title('Neural network activation functions')
plt.xlabel('Input value (x)')
plt.ylabel('Activation function output')

plt.show()

*[Figure: Neural network activation functions (relu, tanh, logistic); x-axis: Input value (x), y-axis: Activation function output]*
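
*The same three functions can also be compared numerically. A minimal sketch using only the standard library; the helper names `relu` and `logistic` are illustrative.*

```python
import math

def relu(x):
    # rectified linear unit: passes positive inputs, clips negatives to 0
    return max(x, 0.0)

def logistic(x):
    # logistic (sigmoid) function: maps any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in [-2.0, 0.0, 2.0]:
    print('x = {:+.1f}: relu = {:.3f}, tanh = {:.3f}, logistic = {:.3f}'
          .format(x, relu(x), math.tanh(x), logistic(x)))
```

*relu is unbounded above, tanh stays within (-1, 1) and logistic stays within (0, 1).*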

##### 3.3.2. The effect of different choices of activation function

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for this_activation in ['logistic', 'tanh', 'relu']:
    nnclf = MLPClassifier(solver='lbfgs', activation = this_activation,
                         alpha = 0.1, hidden_layer_sizes = [10, 10],
                         random_state = 0).fit(X_train, y_train)
    print('Accuracy of NN Classifier with {} activation function on training set: {:.2f}'.format(this_activation, nnclf.score(X_train, y_train)))
    print('Accuracy of NN Classifier with {} activation function on test set: {:.2f}\n'.format(this_activation, nnclf.score(X_test, y_test)))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Accuracy of NN Classifier with logistic activation function on training set: 1.00
Accuracy of NN Classifier with logistic activation function on test set: 0.97

Accuracy of NN Classifier with tanh activation function on training set: 1.00
Accuracy of NN Classifier with tanh activation function on test set: 0.97

Accuracy of NN Classifier with relu activation function on training set: 1.00
Accuracy of NN Classifier with relu activation function on test set: 0.97


### 3.4. Accuracy with different activation functions and regularization parameter alpha

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)


# Accuracy with different activation functions and regularization parameter alpha
for thisactivation in ['tanh', 'relu']:
    for thisalpha in [0.0001, 1.0, 100]:
        nnclf = MLPClassifier(hidden_layer_sizes = [100,100],
                             activation = thisactivation,
                             alpha = thisalpha,
                             solver = 'lbfgs').fit(X_train, y_train)
        print('Accuracy of NN Classifier with activation function = {} and alpha = {} on training set: {:.2f}'.format(thisactivation, thisalpha, nnclf.score(X_train, y_train)))
        print('Accuracy of NN Classifier with activation function = {} and alpha = {} on test set: {:.2f}\n'.format(thisactivation, thisalpha, nnclf.score(X_test, y_test)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Accuracy of NN Classifier with activation function = tanh and alpha = 0.0001 on training set: 1.00
Accuracy of NN Classifier with activation function = tanh and alpha = 0.0001 on test set: 0.97

Accuracy of NN Classifier with activation function = tanh and alpha = 1.0 on training set: 0.99
Accuracy of NN Classifier with activation function = tanh and alpha = 1.0 on test set: 0.97

Accuracy of NN Classifier with activation function = tanh and alpha = 100 on training set: 0.37
Accuracy of NN Classifier with activation function = tanh and alpha = 100 on test set: 0.24

Accuracy of NN Classifier with activation function = relu and alpha = 0.0001 on training set: 1.00
Accuracy of NN Classifier with activation function = relu and alpha = 0.0001 on test set: 0.97

Accuracy of NN Classifier with activation function = relu and alpha = 1.0 on training set: 0.99
Accuracy of NN Classifier with activation function = relu and alpha = 1.0 on test set: 0.97

Accuracy of NN Classifier with activation function = relu and alpha = 100 on training set: 0.37
Accuracy of NN Classifier with activation function = relu and alpha = 100 on test set: 0.24


# Supervised Learning Algorithms: Neural Networks - Regression

*In this template, only the **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [92]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [95]:
df = pd.read_csv('iris.csv')
X = df[['a',
 'b',
 'c',
 'd']]
y = df['label']


## 3. The Model

In [96]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

# feature normalization: scale each feature to the range [0, 1]
scaler = MinMaxScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# model (score() of a regressor returns R-squared, not classification accuracy)
clf = MLPRegressor(hidden_layer_sizes = [100, 100], alpha = 5.0, random_state = 0, solver='lbfgs').fit(X_train_scaled, y_train)

print('R-squared score of NN regressor on training set: {:.2f}'
     .format(clf.score(X_train_scaled, y_train)))
print('R-squared score of NN regressor on test set: {:.2f}'
     .format(clf.score(X_test_scaled, y_test)))

R-squared score of NN regressor on training set: 0.92
R-squared score of NN regressor on test set: 0.88
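
*The `score` method of a scikit-learn regressor returns the R-squared value, not a classification accuracy. A minimal sketch of how R-squared is computed by hand, reusing the small true/predicted value lists from the error-metric example at the end of this notebook.*

```python
# Values reused from the error-metric example at the end of this notebook.
y_true = [1, 1, 2, 2, 4]
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]

mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
r_squared = 1 - ss_res / ss_tot

print('R-squared: {:.3f}'.format(r_squared))
```

*A value of 1 means perfect predictions; 0 means the model does no better than always predicting the mean of the targets.*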


### 3.1. R-squared with different activation functions and regularization parameter alpha

*Run to check how the activation function and alpha affect the model score.*

In [97]:
# feature normalization
scaler = MinMaxScaler()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# R-squared with different activation functions and regularization parameter alpha
# (each model is fitted and scored on the scaled features)
for thisactivation in ['tanh', 'relu']:
    for thisalpha in [0.0001, 1.0, 100]:
        mlpreg = MLPRegressor(hidden_layer_sizes = [100,100],
                             activation = thisactivation,
                             alpha = thisalpha,
                             solver = 'lbfgs').fit(X_train_scaled, y_train)
        print('R-squared score of NN regressor with activation function = {} and alpha = {} on training set: {:.2f}'.format(thisactivation, thisalpha, mlpreg.score(X_train_scaled, y_train)))
        print('R-squared score of NN regressor with activation function = {} and alpha = {} on test set: {:.2f}\n'.format(thisactivation, thisalpha, mlpreg.score(X_test_scaled, y_test)))


# Supervised Learning Algorithms: Random Forests

*In this template, only the **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [98]:
%matplotlib notebook
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [99]:
### Data Input
# df = 

### Defining Variables  
# X = 
# y = 

### Data Input Example 
df = pd.read_csv('Iris.csv', on_bad_lines='skip')  # error_bad_lines was removed in pandas 2.0
X = df[['SepalLengthCm', 'SepalWidthCm' , 'PetalLengthCm' , 'PetalWidthCm']]
y = df['Species']

## 3. The Model

*Run to build the model.*

In [100]:
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

clf = RandomForestClassifier(max_features = 2, random_state = 0).fit(X_train, y_train)

print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Accuracy of RF classifier on training set: 1.00
Accuracy of RF classifier on test set: 0.97
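
*The model above fixes `max_features = 2`, the number of features considered at each split. A minimal sketch of how that setting could be varied, in the same loop style as the other sections. It assumes scikit-learn's bundled Iris data as a stand-in for 'Iris.csv'; the names `X_demo`, `y_demo` and `test_scores` are illustrative.*

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for 'Iris.csv'.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

test_scores = {}
for this_max_features in [1, 2, 3, 4]:
    rf = RandomForestClassifier(max_features=this_max_features,
                                random_state=0).fit(X_tr, y_tr)
    test_scores[this_max_features] = rf.score(X_te, y_te)
    print('Accuracy of RF classifier with max_features = {} on test set: {:.2f}'
          .format(this_max_features, test_scores[this_max_features]))
```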


# Supervised Learning Algorithms: Ridge Regression

*In this template, only the **data input** and **input/target variables** need to be specified (see the "Data Input and Variables" section for further instructions). None of the other sections needs to be adjusted. As a data input example, a .csv file from the IBM Box web repository is used.*

## 1. Libraries

*Run to import the required libraries.*

In [101]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## 2. Data Input and Variables

*Define the data input as well as the input (X) and target (y) variables and run the code. Do not change the data & variable names **['df', 'X', 'y']** as they are used in further sections.*

In [104]:
df = pd.read_csv('iris.csv')
X = df[['a',
 'b',
 'c',
 'd']]
y = df['label']

## 3. The Model

*Run to build the model.*

In [105]:
from sklearn.linear_model import Ridge

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)
# feature normalization
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ridge regression def
linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

### intercept & coefficient, # of non-zero features & weights, R-squared for training & test data set
print('Ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('Ridge regression linear model coeff: {}\n'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Ridge regression linear model intercept: 2.0357142857142856
Ridge regression linear model coeff: [ 0.08652397 -0.07775645  0.29427123  0.35684307]

R-squared score (training): 0.924
R-squared score (test): 0.884

Number of non-zero features: 4


### 3.1. Regularization parameter alpha on R-squared

*Run to check how alpha affects the model score.*

In [106]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 0, r-squared training: 0.94, r-squared test: 0.89

Alpha = 1.00
num abs(coeff) > 1.0: 0, r-squared training: 0.94, r-squared test: 0.89

Alpha = 10.00
num abs(coeff) > 1.0: 0, r-squared training: 0.93, r-squared test: 0.89

Alpha = 20.00
num abs(coeff) > 1.0: 0, r-squared training: 0.92, r-squared test: 0.88

Alpha = 50.00
num abs(coeff) > 1.0: 0, r-squared training: 0.90, r-squared test: 0.86

Alpha = 100.00
num abs(coeff) > 1.0: 0, r-squared training: 0.86, r-squared test: 0.82

Alpha = 1000.00
num abs(coeff) > 1.0: 0, r-squared training: 0.39, r-squared test: 0.35
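
*The drop in R-squared at large alpha reflects how the penalty shrinks the coefficients towards zero. A minimal NumPy sketch of the closed-form ridge solution on centred data, which minimises the squared error plus alpha times the squared coefficient norm. The synthetic data, the coefficients in `true_coef` and the helper `ridge_coefficients` are all hypothetical.*

```python
import numpy as np

# Assumption: synthetic data with hypothetical coefficients, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
true_coef = np.array([1.0, -2.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=50)

# Centre features and target so that no intercept term is needed.
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()

def ridge_coefficients(X, y, alpha):
    # closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

coef_norms = []
for this_alpha in [0, 1, 10, 100, 1000]:
    coef = ridge_coefficients(X_c, y_c, this_alpha)
    coef_norms.append(np.linalg.norm(coef))
    print('Alpha = {:.2f}, coefficient norm: {:.3f}'.format(this_alpha, coef_norms[-1]))
```

*The coefficient norm shrinks as alpha increases, so the model underfits once alpha becomes large.*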



# RMSE

In [107]:
from sklearn.metrics import mean_squared_error 

# Given values 
Y_true = [1,1,2,2,4] # Y_true = Y (original values) 

# calculated values 
Y_pred = [0.6,1.29,1.99,2.69,3.4] # Y_pred = Y' 

# Calculation of Mean Squared Error (MSE) 
print(mean_squared_error(Y_true,Y_pred)) 

0.21606
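
*The cell above prints the mean squared error (MSE); the RMSE named in the heading is its square root. A minimal sketch with the same values, using only the standard library.*

```python
import math

y_true = [1, 1, 2, 2, 4]               # original values
y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]  # predicted values

# mean of the squared differences, then its square root
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)

print('MSE: {:.5f}'.format(mse))
print('RMSE: {:.4f}'.format(rmse))
```

*This prints an MSE of 0.21606, matching the cell above, and an RMSE of 0.4648, which is in the same units as the target variable.*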
