---
# scikit-learn
In this notebook, we study two major problem types in machine learning: 
regression and classification. 
scikit-learn is the library that houses the majority of the machine learning functions we use. 
More information on this library is available at the following link:
https://scikit-learn.org/stable/

The flow of this notebook is as follows: first we introduce each concept with code, and then we apply the concept in a full-fledged exercise. 

In [1]:
import numpy as np
import pandas as pd

---
# Regression

Predicting a continuous-valued attribute associated with an object.

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

---
## Abalone data

Abalone are a type of edible marine snail, and they have internal rings that correspond to their age (like trees). In the following, we will use a dataset of [abalone measurements](https://archive.ics.uci.edu/ml/datasets/abalone). It has the following fields:

    Name / Data type / Unit / Description
    Sex / nominal / -- / M, F, and I (infant)
    Length / continuous / mm / longest shell measurement
    Diameter / continuous / mm / perpendicular to length
    Height / continuous / mm / with meat in shell
    Whole weight / continuous / grams / whole abalone
    Shucked weight / continuous / grams / weight of meat
    Viscera weight / continuous / grams / gut weight (after bleeding)
    Shell weight / continuous / grams / after being dried
    Rings / integer / -- / +1.5 gives the age in years

Suppose we are interested in predicting the age of the abalone given their measurements. This is an example of a regression problem.
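
The age conversion noted in the field list (rings + 1.5) can be applied directly to an array of ring counts. The following is a minimal sketch; the helper name `rings_to_age` and the example ring counts are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rings_to_age(rings):
    """Convert ring counts to age in years (age = rings + 1.5)."""
    return np.asarray(rings, dtype=float) + 1.5

# Illustrative ring counts only, not rows from the abalone data
example_rings = np.array([7, 9, 15])
ages = rings_to_age(example_rings)
print(ages)
```

Since the conversion is a constant shift, a model that predicts `rings` well also predicts age equally well, which is why the notebook uses `rings` as the target.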

In [2]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
                 header = None, 
                 names = ['sex', 'length', 'diameter', 'height', 'weight', 'shucked_weight','viscera_weight', 'shell_weight', 'rings'])

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             4177 non-null   object 
 1   length          4177 non-null   float64
 2   diameter        4177 non-null   float64
 3   height          4177 non-null   float64
 4   weight          4177 non-null   float64
 5   shucked_weight  4177 non-null   float64
 6   viscera_weight  4177 non-null   float64
 7   shell_weight    4177 non-null   float64
 8   rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [4]:
# data preparation
X_clean = df.drop(['rings'], axis=1)
print(X_clean.columns)
# convert categorical variable 'sex' into dummy/indicator variables.
print(pd.get_dummies(X_clean).columns)
X_full = pd.get_dummies(X_clean).to_numpy()
y = df['rings'].to_numpy()


Index(['sex', 'length', 'diameter', 'height', 'weight', 'shucked_weight',
       'viscera_weight', 'shell_weight'],
      dtype='object')
Index(['length', 'diameter', 'height', 'weight', 'shucked_weight',
       'viscera_weight', 'shell_weight', 'sex_F', 'sex_I', 'sex_M'],
      dtype='object')
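
To see what `pd.get_dummies` does to a categorical column, here is a small sketch on a hand-made frame. The values are illustrative only, and `dtype=int` is passed so the indicator columns are 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values only)
toy = pd.DataFrame({
    "sex": ["M", "F", "I", "M"],
    "length": [0.455, 0.530, 0.330, 0.440],
})

# Each category of 'sex' becomes its own 0/1 indicator column;
# numeric columns are passed through unchanged and come first
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which is how the single `sex` column turns into the three columns `sex_F`, `sex_I`, and `sex_M`.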


In [5]:
from sklearn.model_selection import train_test_split

# split training/test data
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.3, random_state=206)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(2923, 10) (2923,)
(1254, 10) (1254,)
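
The two arguments that matter most here are `test_size` (the fraction of samples held out) and `random_state` (which makes the shuffle reproducible). A minimal sketch on a small synthetic array, with all variable names chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 synthetic samples with 2 features each (illustrative data only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Splitting twice with the same random_state gives the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=206)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=206)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

With 10 samples and `test_size=0.3`, 3 samples go to the test set and 7 to the training set.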


---
## Multiple linear regression

In [6]:
from sklearn import linear_model

# create an estimator of a linear regression model
model_mlr = linear_model.LinearRegression()
# fit the model to the data
model_mlr.fit(X_train, y_train)
# print model attributes
print(model_mlr.coef_, model_mlr.intercept_)

[  0.21322442  11.54076824   9.11920244   9.6726404  -20.07767893
 -12.0572845    6.99158725   0.25311087  -0.60640995   0.35329909] 3.497827402964525
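
The fitted `coef_` and `intercept_` fully describe the model: a prediction is the dot product of the features with `coef_` plus `intercept_`. The sketch below checks this on synthetic data; the data and the names `X_demo`, `y_demo`, and `model_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x0 - 1*x1 + 0.5*x2 + 4 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

model_demo = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand from the learned parameters
manual = X_demo @ model_demo.coef_ + model_demo.intercept_
print(np.allclose(manual, model_demo.predict(X_demo)))
```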


In [7]:
# Apply the model to predict the target for the test data
model_mlr.predict(X_test)

array([13.03096611, 14.57733669, 13.37106689, ..., 12.09053647,
        6.97686387,  9.97738752])

In [8]:
# ground truth
y_test

array([18, 20,  9, ...,  8,  8,  6])

---
## Evaluating your model
R^2, the coefficient of determination, tells us what fraction of the variation in the target variable is explained by our model.

https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
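
R^2 compares the model's squared error with the squared error of always predicting the mean of the target: R^2 = 1 - SS_res / SS_tot. A minimal sketch of that formula, checked against `sklearn.metrics.r2_score`; the helper `r2_manual` and the example values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

r2_demo = r2_manual(y_true_demo, y_pred_demo)
print(r2_demo)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean gets a negative score.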

In [9]:
# R^2 coefficient of determination 
model_mlr.score(X_test, y_test)

0.5355509018060667

In [10]:
import sklearn.metrics as metrics

# R^2 coefficient of determination 
metrics.r2_score(y_test, model_mlr.predict(X_test))

0.5355509018060667

In [11]:
# mean squared error
metrics.mean_squared_error(y_test, model_mlr.predict(X_test))

4.98772910720453
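
The mean squared error is expressed in squared units of the target (rings squared). Taking its square root gives an error in rings, which is easier to interpret: for the value reported above it is about 2.23 rings. The sketch below computes the MSE by hand on illustrative values and converts the reported MSE; `mse_manual` is a hypothetical helper.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only
mse_demo = mse_manual([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 8.0])

# Root mean squared error for the test-set MSE reported above
rmse_reported = float(np.sqrt(4.98772910720453))
print(mse_demo, round(rmse_reported, 2))
```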

---
## Cross validation

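Another method of evaluating a model. Instead of relying on a single train/test split, the data is divided into k folds, and each fold serves once as the test set while the remaining folds are used for training.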

https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score

In [12]:
# Cross validation 5-fold
from sklearn.model_selection import cross_val_score

model = linear_model.LinearRegression()
scores = cross_val_score(model, X_full, y, cv = 5)
scores

array([0.42892258, 0.20421095, 0.49486196, 0.51858889, 0.44955174])

In [13]:
print(f"5-fold CV R^2: {scores.mean()} (+/- {scores.std()})")

5-fold CV R^2: 0.41922722379930466 (+/- 0.112106135908679)
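
To make explicit what `cross_val_score` does for a regressor with `cv=5`, the sketch below repeats the procedure by hand with `KFold` on synthetic data and checks that both give the same fold scores. The data from `make_regression` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Manual 5-fold loop: train on four folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function performs the same steps in one call
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(np.allclose(manual_scores, auto_scores))
```

Note that `KFold` does not shuffle by default, so the folds are contiguous blocks of rows. Passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` shuffles the rows first.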


---
## &diams; Exercise

Try to fit some of the models in the following cell to the abalone data. Compute the 5-fold cv R^2 statistics.

Look up the documentation for the regressor, and see if the regressor takes any parameters. How does changing the parameter affect the result?

In [14]:
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from xgboost.sklearn import XGBRegressor
from lightgbm import LGBMRegressor

In [15]:

# INSERT_YOUR_ANSWER

# Model: SVR-1 (default parameters)
model = SVR(kernel='rbf', C=1, epsilon=0.1)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 of SVR with default parameters (C=1, epsilon=0.1): {scores.mean()} (+/- {scores.std()})")


# Model: SVR-2 (larger epsilon)
model = SVR(kernel='rbf', C=1, epsilon=0.5)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 with greater epsilon (=0.5): {scores.mean()} (+/- {scores.std()})")

# Result: the mean R^2 increases only slightly when we increase epsilon,
# by much less than the spread across folds.


# Model: SVR-3 (larger C)
model = SVR(kernel='rbf', C=1e2, epsilon=0.1)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 with greater C(=100): {scores.mean()} (+/- {scores.std()})")

# Result: the mean R^2 increases and the spread across folds shrinks when we increase the parameter C.


5-fold CV R^2 of SVR with default parameters (C=1, epsilon=0.1): 0.4478334068480342 (+/- 0.10896876359157394)
5-fold CV R^2 with greater epsilon (=0.5): 0.4516831910485199 (+/- 0.10478344681123607)
5-fold CV R^2 with greater C(=100): 0.5156661259781938 (+/- 0.05891758319999079)


In [16]:
# Model KernelRidge-1: default parameters (linear kernel, alpha=1)
model = KernelRidge(kernel='linear', alpha=1)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 of KernelRidge with default parameters: {scores.mean()} (+/- {scores.std()})")

# Model KernelRidge-2: polynomial kernel
model = KernelRidge(kernel='poly', alpha=1)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 of KernelRidge with poly kernel: {scores.mean()} (+/- {scores.std()})")
# Result: the mean R^2 increases slightly with the poly kernel relative to the linear kernel.

# Model KernelRidge-3: linear kernel with stronger regularization
model = KernelRidge(kernel='linear', alpha=10)
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2 of KernelRidge with linear kernel and alpha=10: {scores.mean()} (+/- {scores.std()})")
# Result: with the linear kernel, the mean R^2 decreases when we increase alpha from 1 to 10.

5-fold CV R^2 of KernelRidge with default parameters: 0.4266418481646834 (+/- 0.10936364173706144)
5-fold CV R^2 of KernelRidge with poly kernel: 0.4351594978406229 (+/- 0.10610431412505102)
5-fold CV R^2 of KernelRidge with linear kernel and alpha=10: 0.3803706972941808 (+/- 0.11358423402519184)


In [17]:
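# Model-3: GradientBoostingRegressor with default parameters
# (no random_state is set, so the scores can vary slightly between runs)
model = GradientBoostingRegressor()
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2: {(scores.mean())} (+/- {scores.std()})")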

5-fold CV R^2: 0.46438069879998733 (+/- 0.09846334978376436)


# Model-4: BayesianRidge with default parameters
model = BayesianRidge()
scores = cross_val_score(model, X_full, y, cv = 5)
print(f"5-fold CV R^2: {(scores.mean())} (+/- {scores.std()})")

Summary: Changing the parameters can have a significant effect on the performance of the SVR model. In the experiments above, raising C from 1 to 100 increased the mean 5-fold R^2 from about 0.45 to about 0.52, while raising epsilon from 0.1 to 0.5 changed it very little. In general, a very large C can lead to overfitting, while a very small C can lead to underfitting. Similarly, the choice of kernel function affects the model's ability to capture non-linear relationships in the data. The epsilon parameter sets the width of the tube around the prediction within which errors are not penalized, so it controls how much small-scale noise the model ignores. The optimal values for these parameters depend on the specific dataset and the problem being solved, and can be determined through hyperparameter tuning.
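
One common way to carry out the hyperparameter tuning mentioned in the summary is an exhaustive grid search with cross validation. The following is a minimal sketch using `GridSearchCV` on synthetic data; the grid values and the data from `make_regression` are assumptions chosen to keep the example fast, not recommended settings for the abalone data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

# Every combination of these values is evaluated with 5-fold cross validation
param_grid = {"C": [1, 10, 100], "epsilon": [0.1, 0.5]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)  # scored with R^2 by default for regressors
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated R^2 of the best combination, so it is directly comparable to the `scores.mean()` values printed in the cells above.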

---
# Classification

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Another typical example of a supervised machine learning problem is classification.

---
## Iris data

Here we will use a dataset of flower measurements from three different flower species of *Iris* (*Iris setosa*, *Iris virginica*, and *Iris versicolor*). We aim to predict the species of the flower. Because the species is not a numerical output, it is not a regression problem, but a classification problem.

https://archive.ics.uci.edu/ml/datasets/iris

The iris dataset is included in `scikit-learn`.

https://scikit-learn.org/stable/datasets/toy_dataset.html

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

In [18]:
from sklearn import datasets
iris = datasets.load_iris()

In [19]:
print(iris.data.shape)
iris.feature_names

(150, 4)


['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [20]:
print(iris.target.shape)
iris.target_names

(150,)


array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [21]:
from sklearn.model_selection import train_test_split

# data preparation
X = iris.data
y = iris.target_names[iris.target]
# split training/test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=206)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(120, 4) (120,)
(30, 4) (30,)
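
A plain random split does not guarantee that the three species appear in equal proportions in the test set. If balanced proportions are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch, assuming the same `test_size` and `random_state` as above; the variable names are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
labels = iris_demo.target_names[iris_demo.target]

# stratify=labels keeps the class proportions the same in the train and test sets
_, _, _, y_te_strat = train_test_split(
    iris_demo.data, labels, test_size=0.2, random_state=206, stratify=labels
)

names, counts = np.unique(y_te_strat, return_counts=True)
class_counts = dict(zip(names.tolist(), counts.tolist()))
print(class_counts)
```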


---
## K-nearest neighbor classifier

https://scikit-learn.org/stable/modules/neighbors.html

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors


In [22]:
from sklearn.neighbors import KNeighborsClassifier

# create an estimator of a KNN classifier
model_knn = KNeighborsClassifier(n_neighbors = 3)
# fit the model to the training data
model_knn.fit(X_train, y_train)
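
The idea behind the classifier is simple: to label a new point, find the k closest training points and take a majority vote over their labels. The sketch below implements that idea with NumPy on toy data. It is an illustration of the concept, not scikit-learn's implementation (which uses faster neighbor searches and its own tie-breaking); `knn_predict` and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for point in X_query:
        # Euclidean distance from the query point to every training point
        distances = np.linalg.norm(X_train - point, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Two well-separated toy clusters (illustrative values only)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_pred)
```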

In [23]:
# Apply the model to classify test data
model_knn.predict(X_test)

array(['setosa', 'versicolor', 'setosa', 'versicolor', 'virginica',
       'versicolor', 'versicolor', 'virginica', 'setosa', 'versicolor',
       'virginica', 'virginica', 'virginica', 'setosa', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica',
       'versicolor', 'versicolor', 'versicolor', 'versicolor', 'setosa',
       'setosa', 'virginica', 'setosa', 'versicolor', 'virginica'],
      dtype='<U10')

In [24]:
# ground truth
y_test

array(['setosa', 'versicolor', 'setosa', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'virginica', 'setosa', 'versicolor',
       'virginica', 'virginica', 'virginica', 'setosa', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'setosa', 'setosa', 'versicolor', 'setosa',
       'versicolor', 'virginica'], dtype='<U10')

---
## Evaluating your model

https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

In [25]:
# Accuracy
model_knn.score(X_test, y_test)

0.9

In [26]:
# Accuracy
np.mean(model_knn.predict(X_test) == y_test) 

0.9

In [27]:
import sklearn.metrics as metrics
# Accuracy
metrics.accuracy_score(y_test, model_knn.predict(X_test))

0.9

In [28]:
from sklearn.metrics import confusion_matrix

# confusion matrix
# https://en.wikipedia.org/wiki/Confusion_matrix
print(confusion_matrix(y_test, model_knn.predict(X_test)))

[[ 8  0  0]
 [ 0 13  3]
 [ 0  0  6]]


In [29]:
# classification report
print(metrics.classification_report(y_test, model_knn.predict(X_test)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         8
  versicolor       1.00      0.81      0.90        16
   virginica       0.67      1.00      0.80         6

    accuracy                           0.90        30
   macro avg       0.89      0.94      0.90        30
weighted avg       0.93      0.90      0.90        30
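
The precision and recall values in the report can be derived from the confusion matrix shown earlier, where rows are the true classes and columns are the predicted classes. A minimal sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from above (rows: true class, columns: predicted class)
cm = np.array([[8, 0, 0],
               [0, 13, 3],
               [0, 0, 6]])

# Recall: correct predictions divided by the number of true members of each class
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions of each class
precision = np.diag(cm) / cm.sum(axis=0)
# Accuracy: all correct predictions divided by all samples
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), accuracy)
```

The three versicolor flowers predicted as virginica lower the recall of versicolor (13/16) and the precision of virginica (6/9).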



---
## Cross validation

https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score

In [30]:
# Cross validation 5-fold
from sklearn.model_selection import cross_val_score
model = KNeighborsClassifier()
scores = cross_val_score(model, X, y, cv = 5)
scores

array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ])

In [31]:
print(f"5-fold CV Accuracy: {scores.mean()} (+/- {scores.std()})")

5-fold CV Accuracy: 0.9733333333333334 (+/- 0.02494438257849294)


---
## &diams; Exercise

Try to fit some of the models in the following cell to the same data. Compute the relevant statistics (e.g. accuracy, precision, recall) with 5-fold cv. 

Look up the documentation for the classifier, and see if the classifier takes any parameters. How does changing the parameter affect the result?
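
`cross_val_score` returns a single metric per fold. To collect accuracy, precision, and recall in one pass, one option is `cross_validate` with a list of scorer names. A minimal sketch, assuming macro-averaged precision and recall are the statistics of interest and using the KNN classifier as a stand-in for any of the models:

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

iris_demo = datasets.load_iris()

# Macro averaging computes the metric per class and then takes the unweighted mean
scoring = ["accuracy", "precision_macro", "recall_macro"]
cv_results = cross_validate(
    KNeighborsClassifier(), iris_demo.data, iris_demo.target, cv=5, scoring=scoring
)

for name in scoring:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```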

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [33]:
# INSERT_YOUR_ANSWER

# Model QuadraticDiscriminantAnalysis: reg_param=0.1 (the library default is reg_param=0.0)
model = QuadraticDiscriminantAnalysis(priors=None, reg_param=0.1, store_covariance=True)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of QuadraticDiscriminantAnalysis with reg_param=0.1: {scores.mean()} (+/- {scores.std()})")

# Model QuadraticDiscriminantAnalysis: reg_param=0.5
model = QuadraticDiscriminantAnalysis(priors=None, reg_param=0.5, store_covariance=True)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of QuadraticDiscriminantAnalysis with reg_param=0.5: {scores.mean()} (+/- {scores.std()})")

# We note that increasing reg_param from 0.1 to 0.5 leads to a decrease in the accuracy.

5-fold CV Accuracy of QuadraticDiscriminantAnalysis with reg_param=0.1: 0.9733333333333334 (+/- 0.02494438257849294)
5-fold CV Accuracy of QuadraticDiscriminantAnalysis with reg_param=0.5: 0.9466666666666667 (+/- 0.03399346342395189)


In [34]:
# Model RandomForestClassifier: default parameters
# (max_features='sqrt' replaces 'auto', which meant the same thing for classifiers
# and has been removed from recent scikit-learn releases)
model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', random_state=42)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of RandomForestClassifier with default parameters: {scores.mean()} (+/- {scores.std()})")

# Model RandomForestClassifier: min_samples_split increased to 3
model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=3, min_samples_leaf=1, max_features='sqrt', random_state=42)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of RandomForestClassifier with min_samples_split increased (=3): {scores.mean()} (+/- {scores.std()})")

# Model RandomForestClassifier: n_estimators increased to 1000
model = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', random_state=42)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of RandomForestClassifier with n_estimators increased (=1000): {scores.mean()} (+/- {scores.std()})")

# Model RandomForestClassifier: min_samples_leaf increased to 2
model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=2, max_features='sqrt', random_state=42)
scores = cross_val_score(model, X, y, cv = 5)
print(f"5-fold CV Accuracy of RandomForestClassifier with min_samples_leaf increased (=2): {scores.mean()} (+/- {scores.std()})")

# We observe that increasing min_samples_split to 3 or min_samples_leaf to 2 leaves the accuracy unchanged.
# Increasing the number of estimators to 1000 lowers the mean accuracy slightly (0.967 to 0.960),
# which is well within the spread across folds.
# One reason could be that this classifier is already very good with the default parameters.

5-fold CV Accuracy of RandomForestClassifier with default parameters: 0.9666666666666668 (+/- 0.02108185106778919)
5-fold CV Accuracy of RandomForestClassifier with min_samples_split increased (=3): 0.9666666666666668 (+/- 0.02108185106778919)
5-fold CV Accuracy of RandomForestClassifier with n_estimators increased (=1000): 0.96 (+/- 0.024944382578492935)
5-fold CV Accuracy of RandomForestClassifier with min_samples_leaf increased (=2): 0.9666666666666668 (+/- 0.02108185106778919)
