# Scikit-learn

**Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.**

## Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each estimator can be fitted to some data using its fit method.

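Here is a simple example where we fit a RandomForestClassifier to a small salary dataset. The target in this data is a salary, which is a continuous value, so the classifier treats every distinct salary as a separate class; a regressor would normally be the more natural choice for such a target: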

In [2]:
# import libraries
import pandas as pd
import numpy as np

df = pd.read_csv("data/salary_data.csv", encoding="Latin-1")
df.head()

# assign X and y from the dataset
X = df.iloc[:, :-1].values  # feature matrix: every column except the last
y = df.iloc[:, 1].values  # target vector: the second column (index 1)

In [3]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)


RandomForestClassifier(random_state=0)

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:

In [4]:
clf.predict(X)  # predict classes of the training data

array([ 39343,  46205,  37731,  43525,  39891,  56642,  60150,  64445,
        64445,  57189,  63218,  55794,  55794,  57081,  61111,  67938,
        66029,  83088,  81363,  93940,  91738,  98273, 101302, 113812,
       109431, 105582, 116969, 112635, 122391, 121872], dtype=int64)

In [5]:
clf.predict([[10]])  # predict classes of new data


array([122391], dtype=int64)
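
Because the salary target is continuous, the same fit and predict workflow can also be sketched with a regressor. The example below is only an illustration under stated assumptions: it uses synthetic experience and salary numbers (not the contents of `data/salary_data.csv`), and the names `years`, `salary`, and `reg_rf` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the salary data (made-up numbers, not the CSV above)
rng = np.random.default_rng(0)
years = np.linspace(1, 10, 30).reshape(-1, 1)  # feature matrix, shape (30, 1)
salary = 25_000 + 9_500 * years.ravel() + rng.normal(0, 3_000, size=30)

# Same estimator API as the classifier: construct, fit, predict
reg_rf = RandomForestRegressor(n_estimators=50, random_state=0)
reg_rf.fit(years, salary)

# Predict for an unseen sample; the input must be 2D (n_samples, n_features)
new_pred = reg_rf.predict([[10.5]])
print(new_pred)
```

A random forest regressor averages training targets, so its predictions stay within the range of salaries seen during fitting; it does not extrapolate beyond them.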

## Transformers and pre-processors

In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same BaseEstimator class). The transformer objects don’t have a predict method but rather a transform method that outputs a newly transformed sample matrix X:

In [6]:
from sklearn.preprocessing import StandardScaler

# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)


array([[-1.51005294],
       [-1.43837321],
       [-1.36669348],
       [-1.18749416],
       [-1.11581443],
       [-0.86493538],
       [-0.82909552],
       [-0.75741579],
       [-0.75741579],
       [-0.57821647],
       [-0.50653674],
       [-0.47069688],
       [-0.47069688],
       [-0.43485702],
       [-0.29149756],
       [-0.1481381 ],
       [-0.07645838],
       [-0.00477865],
       [ 0.21026054],
       [ 0.2461004 ],
       [ 0.53281931],
       [ 0.6403389 ],
       [ 0.92705781],
       [ 1.03457741],
       [ 1.21377673],
       [ 1.32129632],
       [ 1.50049564],
       [ 1.5363355 ],
       [ 1.78721455],
       [ 1.85889428]])
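
To make the transform step more concrete, the sketch below checks what StandardScaler computes on a tiny hand-made matrix. The array `X_demo` and the other variable names are assumptions introduced for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0],
                   [4.0, 500.0]])

# fit learns the per-column mean and standard deviation
scaler = StandardScaler().fit(X_demo)

# transform applies (x - mean) / std column by column
X_scaled = scaler.transform(X_demo)
X_manual = (X_demo - scaler.mean_) / scaler.scale_

# inverse_transform maps the scaled values back to the original units
X_restored = scaler.inverse_transform(X_scaled)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, which is why the output above is centered around zero.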

## Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be combined into a single unifying object: a Pipeline. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. Using a pipeline also helps protect you from data leakage, i.e. disclosing some testing data in your training data.

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

# split the salary data loaded above into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [8]:
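# we can now use it like any other estimator
# accuracy_score expects (y_true, y_pred); a score of 0.0 is plausible here,
# since the classifier would have to reproduce exact salary values
accuracy_score(y_test, pipe.predict(X_test))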

0.0
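
For comparison, here is a minimal sketch of the same pipeline on a genuine classification problem. It assumes the iris dataset bundled with scikit-learn rather than the salary data; the variable names (`iris_pipe`, `Xi_train`, and so on) are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small classification dataset and split it
X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, random_state=0
)

# The scaler is fitted on the training split only, inside pipe.fit
iris_pipe = make_pipeline(StandardScaler(), LogisticRegression())
iris_pipe.fit(Xi_train, yi_train)

# At predict time the pipeline reuses the training statistics to scale X
iris_acc = accuracy_score(yi_test, iris_pipe.predict(Xi_test))
print(iris_acc)
```

Because the scaling statistics come from the training split alone, no information from the test split reaches the model, which is the leakage protection mentioned above.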

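## Model evaluation

A single train/test split, like the one made with train_test_split above, gives only one estimate of how well a model generalizes. Scikit-learn provides many other tools for model evaluation, in particular for cross-validation.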

In [9]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])
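
The perfect scores above come from a noise-free synthetic dataset. As a rough sketch of a more realistic setup, the example below adds noise and requests two metrics at once; the noise level, the explicit KFold splitter, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Noisy synthetic regression problem (illustrative settings)
X_cv, y_cv = make_regression(
    n_samples=500, n_features=10, noise=20.0, random_state=0
)

# Explicit splitter: 5 shuffled folds with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# With several metrics, results are keyed as "test_<metric name>"
cv_result = cross_validate(
    LinearRegression(), X_cv, y_cv,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
)

mean_r2 = cv_result["test_r2"].mean()
# Error metrics are negated so that higher is always better; flip the sign
mean_mae = -cv_result["test_neg_mean_absolute_error"].mean()
print(mean_r2, mean_mae)
```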

## Automatic parameter searches

Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV object. When the search is over, the RandomizedSearchCV object behaves like a RandomForestRegressor that has been fitted with the best set of parameters.

In [10]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}


In [11]:
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)

search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [12]:
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)


0.7357088178446147
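
The fitted search object exposes more than best_params_. The sketch below inspects a few of its attributes on a small synthetic dataset, used here only to avoid downloading the California housing data; the dataset settings and variable names are assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic dataset standing in for the housing data
X_s, y_s = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_s, y_s, random_state=0)

# randint(low, high) samples integers from low up to high - 1
demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(1, 5),
                         "max_depth": randint(5, 10)},
    n_iter=5,
    random_state=0,
)
demo_search.fit(Xs_train, ys_train)

tried = demo_search.cv_results_["params"]  # one dict per sampled candidate
best_model = demo_search.best_estimator_   # refitted on the full training set
test_r2 = demo_search.score(Xs_test, ys_test)  # delegates to best_model
print(demo_search.best_params_, test_r2)
```

With the default refit=True, the best candidate is refitted on the whole training set, which is why predict and score work directly on the search object.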

# 1.1. Linear Models

## 1.1.1. Ordinary Least Squares

In [13]:
from sklearn import linear_model

# X and y are still the California housing data loaded above
reg = linear_model.LinearRegression()
reg.fit(X, y)

reg.coef_


array([ 4.36693293e-01,  9.43577803e-03, -1.07322041e-01,  6.45065694e-01,
       -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01])
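
Ordinary least squares picks the coefficients that minimize the residual sum of squares between the observed targets and the linear prediction. As a sanity check, the sketch below compares LinearRegression with a direct least-squares solve in NumPy on synthetic data; the data and the names `X_ols`, `true_coef`, and `solution` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear model plus a little noise
rng = np.random.default_rng(0)
X_ols = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_ols = X_ols @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_ols, y_ols)

# Same problem solved directly: append a column of ones for the intercept
A = np.column_stack([X_ols, np.ones(len(X_ols))])
solution, *_ = np.linalg.lstsq(A, y_ols, rcond=None)

print(ols.coef_, ols.intercept_)
print(solution)  # first three entries are coefficients, last is the intercept
```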

## 1.1.2. Ridge regression and classification
### 1.1.2.1. Regression

In [14]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)
reg.fit(X, y)

reg.coef_

array([ 4.36643796e-01,  9.43658673e-03, -1.07227325e-01,  6.44563694e-01,
       -3.97336560e-06, -3.78645054e-03, -4.21306864e-01, -4.34499254e-01])

In [15]:
reg.intercept_

-36.94025397892831
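
Ridge regression adds an L2 penalty, alpha times the squared norm of the coefficients, to the least-squares objective. The sketch below, on synthetic data with hypothetical variable names, compares Ridge against the closed-form solution on centered data and shows the coefficients shrinking as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data from a known linear model (illustrative values)
rng = np.random.default_rng(0)
X_r = rng.normal(size=(100, 4))
y_r = X_r @ np.array([3.0, 0.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.5, size=100)

alpha = 0.5
ridge = Ridge(alpha=alpha).fit(X_r, y_r)

# Closed form: center the data so the intercept is not penalized,
# then solve (Xc^T Xc + alpha * I) w = Xc^T yc
x_mean, y_mean = X_r.mean(axis=0), y_r.mean()
Xc, yc = X_r - x_mean, y_r - y_mean
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_r.shape[1]), Xc.T @ yc)
b = y_mean - x_mean @ w

# A larger alpha means a stronger penalty and a smaller coefficient norm
norms = [np.linalg.norm(Ridge(alpha=a).fit(X_r, y_r).coef_)
         for a in (0.01, 1.0, 100.0)]
print(ridge.coef_, w, norms)
```

This also explains why the Ridge coefficients above are so close to the ordinary least squares ones: with alpha=0.5 and many samples, the penalty is small relative to the data term.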