# Sklearn library

In this notebook, we will go deeper into the sklearn library.
We begin by importing a new dataset. This is the iris dataset, a classification problem over different types of iris plant. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
X = iris.data
y = iris.target
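
As a quick sanity check of the description above, the following sketch prints the shape of the feature matrix and the number of samples per class. It relies only on attributes of the object returned by `load_iris` (`data`, `target`, `feature_names`, `target_names`); the name `class_counts` is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 150 samples with 4 numeric features each
print(X.shape)
print(iris.feature_names)

# class labels are the integers 0, 1, 2; target_names maps them to species
print(iris.target_names)

# number of samples in each class
class_counts = np.bincount(y)
print(class_counts)
```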

### Preprocessing
Split the dataset into train and test sets, and then scale it using Standard scaling. 

In [4]:
# split data into train_test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# use your favorite scaling to scale the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
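
The scaler is fit on the training set only and then reused on the test set, so no test-set statistics leak into preprocessing. The sketch below checks what that means in practice: the scaled training features have mean 0 and standard deviation 1, while the scaled test features are only approximately centered. The `random_state=0` and `stratify=y` arguments are assumptions added here for reproducibility and are not part of the original cell, as are the names `X_train_s` and `X_test_s`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# random_state and stratify are assumptions for this sketch:
# they make the split reproducible and keep class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learn mean/std from the training set only
X_test_s = sc.transform(X_test)        # reuse the training statistics on the test set

# training features: mean ~0 and std ~1 by construction
print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
# test features: close to, but not exactly, zero mean
print(X_test_s.mean(axis=0).round(3))
```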

### Linear Model
Import sklearn's Logistic Regression model and use it to construct a classification model. Report your score.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# train a logistic regression model on the data and report the score

l_classifier = LogisticRegression()
l_classifier.fit(X_train, y_train)

print(accuracy_score(y_test, l_classifier.predict(X_test)))

0.9473684210526315


### Random Forest
Construct a random forest model and compare the score with the linear model.

In [6]:
from sklearn.ensemble import RandomForestClassifier

# train a random forest model on the data and report the score
r_classifier = RandomForestClassifier()
r_classifier.fit(X_train, y_train)

print(accuracy_score(y_test, r_classifier.predict(X_test)))

0.9473684210526315


Train-test splitting is performed randomly. That means every time you run it, you get a different train-test split, so your accuracy might change.
One way to get a more reliable value for accuracy is to use cross validation.
In cross validation, we divide the dataset into a number of folds. Each fold is used once as the test set, with the remaining folds as the train set.
For example, the picture below shows a 5-fold cross validation.

![alt text](https://drive.google.com/uc?id=1tDEdnjfYx3r8Sxl_QxyVeDJMIGXy3fyS)
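
The fold-by-fold procedure described above can also be written as an explicit loop. This is a minimal sketch, assuming `StratifiedKFold` as the splitter (which keeps class proportions similar across folds) and `LogisticRegression(max_iter=500)` as the model; `fold_scores` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # fit a fresh model on the four training folds
    model = LogisticRegression(max_iter=500)
    model.fit(X[train_idx], y[train_idx])
    # evaluate on the single held-out fold
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(fold_scores)           # one accuracy per fold
print(np.mean(fold_scores))  # averaged estimate
```

Averaging the per-fold accuracies gives a single estimate that depends less on any one split.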

Use cross validation with 5 folds to get the accuracy of both Logistic Regression and Random Forest.

In [7]:
from sklearn.model_selection import cross_validate
lr_model = LogisticRegression(max_iter=500)  # a higher max_iter gives the solver more room to converge on the unscaled data
cross_validate(lr_model, X, y, cv = 5)

{'fit_time': array([0.01818657, 0.03986788, 0.02003455, 0.0171926 , 0.02361083]),
 'score_time': array([0.        , 0.        , 0.00505924, 0.        , 0.        ]),
 'test_score': array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ])}

In [8]:
rf_model = RandomForestClassifier(n_estimators=200)
cross_validate(rf_model, X, y, cv=5, scoring='accuracy')

{'fit_time': array([0.20752001, 0.20262647, 0.19990039, 0.22745609, 0.19748998]),
 'score_time': array([0.0123136 , 0.01218486, 0.00794697, 0.0202775 , 0.01191139]),
 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.        ])}
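
The cross-validation cells above work on the unscaled `X`. If scaling is wanted inside cross validation, one option is to wrap the scaler and the model in a pipeline, so the scaler is refit on the training folds of each split. The sketch below assumes `make_pipeline` and `cross_val_score`, neither of which appears in the original notebook; `scaled_lr` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# the pipeline scales and then classifies; during cross validation the
# scaler only ever sees the training part of each split
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

scores = cross_val_score(scaled_lr, X, y, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```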

# Grid Search

Another extremely useful tool in the sklearn library is GridSearch. Several machine learning models have one or more hyper-parameters. At times, fine-tuning the hyper-parameters can boost model performance significantly.

The example below shows how to use Grid Search to find the best number of trees in a RandomForest.

In [9]:
from sklearn.model_selection import GridSearchCV
rf_model = RandomForestClassifier()
param_grid = {'n_estimators':[5, 10, 50, 100, 200, 500]}
grid = GridSearchCV(rf_model, param_grid=param_grid, cv=3, scoring= 'accuracy', verbose=1)
grid.fit(X, y)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'n_estimators': [5, 10, 50, 100, 200, 500]},
             scoring='accuracy', verbose=1)
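
The "6 candidates, 18 fits" figure in the log follows directly from the grid: each parameter combination is one candidate, and each candidate is trained once per fold. A small sketch using `ParameterGrid` (a helper from `sklearn.model_selection` that is not used in the original cell) makes the count explicit; `n_folds` and `total_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"n_estimators": [5, 10, 50, 100, 200, 500]}
n_folds = 3  # matches cv=3 in the GridSearchCV call

# every combination of parameter values is one candidate
candidates = list(ParameterGrid(param_grid))
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 6 18
```

With a second parameter in the grid, the number of candidates would be the product of the list lengths, so grids grow quickly.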

In [10]:
grid.cv_results_

{'mean_fit_time': array([0.00346851, 0.00952387, 0.06847914, 0.11338441, 0.20820793,
        0.55482086]),
 'std_fit_time': array([0.00447129, 0.00017958, 0.01435892, 0.02340348, 0.00625826,
        0.01321341]),
 'mean_score_time': array([0.00578117, 0.00023047, 0.00535417, 0.01088587, 0.01622558,
        0.050131  ]),
 'std_score_time': array([0.00428128, 0.00032594, 0.00096204, 0.00091419, 0.00374892,
        0.01009359]),
 'param_n_estimators': masked_array(data=[5, 10, 50, 100, 200, 500],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_estimators': 5},
  {'n_estimators': 10},
  {'n_estimators': 50},
  {'n_estimators': 100},
  {'n_estimators': 200},
  {'n_estimators': 500}],
 'split0_test_score': array([0.98, 0.98, 0.98, 0.98, 0.98, 0.98]),
 'split1_test_score': array([0.9 , 0.94, 0.94, 0.94, 0.94, 0.94]),
 'split2_test_score': array([0.94, 0.96, 0.98, 0.98, 0.96, 0.96]),
 'mean_test_score': array([0

Use Grid Search to find the best regularization parameter `C` in Logistic Regression.

In [13]:
from sklearn.model_selection import GridSearchCV

params = { "C": [0.1, 1, 10, 100, 1000] }
grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid=params, cv=3, scoring="accuracy", verbose=1)
grid.fit(X, y)

grid.cv_results_

Fitting 3 folds for each of 5 candidates, totalling 15 fits


{'mean_fit_time': array([0.01345722, 0.01846083, 0.02648735, 0.0387044 , 0.06208014]),
 'std_fit_time': array([0.00252996, 0.00761275, 0.00409736, 0.00661937, 0.01379493]),
 'mean_score_time': array([0.        , 0.        , 0.        , 0.00150967, 0.        ]),
 'std_score_time': array([0.        , 0.        , 0.        , 0.00213499, 0.        ]),
 'param_C': masked_array(data=[0.1, 1, 10, 100, 1000],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.1}, {'C': 1}, {'C': 10}, {'C': 100}, {'C': 1000}],
 'split0_test_score': array([0.92, 0.98, 0.98, 1.  , 1.  ]),
 'split1_test_score': array([0.94, 0.96, 0.98, 0.96, 0.96]),
 'split2_test_score': array([0.98, 0.98, 0.96, 0.96, 0.96]),
 'mean_test_score': array([0.94666667, 0.97333333, 0.97333333, 0.97333333, 0.97333333]),
 'std_test_score': array([0.02494438, 0.00942809, 0.00942809, 0.01885618, 0.01885618]),
 'rank_test_score': array([5, 1, 1, 1, 1])}
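
Reading the full `cv_results_` dictionary is rarely necessary. A fitted `GridSearchCV` also exposes `best_params_` and `best_score_`, and the per-candidate means can be printed in a compact loop. The sketch below repeats the search so that it runs on its own; the exact scores may differ between sklearn versions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

params = {"C": [0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(
    LogisticRegression(max_iter=500), param_grid=params, cv=3, scoring="accuracy"
)
grid.fit(X, y)

# best candidate and its mean cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)

# one line per candidate: parameters, mean score, spread across folds
results = grid.cv_results_
for p, m, s in zip(results["params"], results["mean_test_score"], results["std_test_score"]):
    print(p, round(m, 3), round(s, 3))
```

When several candidates tie on the mean score, as in the output above, the standard deviation across folds can help in choosing between them.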

## Other Linear Models

In addition to Logistic Regression, sklearn has other linear models. They differ mostly in the regularization. Some of these models are only for regression.
See https://scikit-learn.org/stable/modules/linear_model.html

As an exercise, use RidgeClassifier on the dataset.

In [16]:
from sklearn.linear_model import RidgeClassifier

classifier = RidgeClassifier()
classifier.fit(X_train, y_train)

RidgeClassifier()

In [18]:
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)

0.8421052631578947
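
The score above comes from a single train-test split with the default regularization strength. As a possible follow-up, the tools from the earlier sections can be combined: a pipeline for scaling, cross validation for a steadier estimate, and Grid Search over `alpha`, the regularization parameter of `RidgeClassifier`. This is a sketch only; the `alphas` values and the step names `scale` and `ridge` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# step names are illustrative; they prefix the parameter names in the grid
pipe = Pipeline([("scale", StandardScaler()), ("ridge", RidgeClassifier())])

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # illustrative values, not tuned
ridge_grid = GridSearchCV(
    pipe, param_grid={"ridge__alpha": alphas}, cv=5, scoring="accuracy"
)
ridge_grid.fit(X, y)

print(ridge_grid.best_params_)
print(ridge_grid.best_score_)
```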