## Example of a basic `RandomForestClassifier` [estimator](https://scikit-learn.org/stable/glossary.html#term-estimators) implementation

#### Import `RandomForestClassifier` and create an instance of the classifier

In [7]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)

#### An example set of training samples
**X**: the samples matrix (or design matrix). Its shape is typically (n_samples, n_features): samples are rows and features are columns.\
**y**: the target values, which are real numbers for regression tasks and integers (or any other discrete set of labels) for classification. For unsupervised learning tasks, y does not need to be specified. y is usually a 1d array whose i-th entry is the target of the i-th sample (row) of X.

In [8]:
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
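
The shape convention described above can be checked directly. The following is a minimal sketch, assuming NumPy is available (it is a dependency of scikit-learn); the names `X_arr` and `y_arr` are introduced here for illustration only.

```python
import numpy as np

X = [[1, 2, 3],      # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]           # one class label per sample

# Convert the nested lists to arrays to inspect their shapes
X_arr = np.asarray(X)
y_arr = np.asarray(y)

n_samples, n_features = X_arr.shape
print(X_arr.shape)  # (2, 3): rows are samples, columns are features
print(y_arr.shape)  # (2,): one target per row of X

# y must have exactly one entry per sample
assert y_arr.shape == (n_samples,)
```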

#### Fit the model using the training data and do predictions.

In [12]:
clf.fit(X, y)

# clf.predict(X)  # predict classes of the training data
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data


array([0, 1])
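
Besides hard class labels, a fitted `RandomForestClassifier` can also report per-class probabilities. The sketch below refits the same toy data and is meant only as an illustration; the variable name `proba` is an assumption of this example, and the exact probability values depend on the installed scikit-learn version, so none are shown.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[1, 2, 3],
     [11, 12, 13]]
y = [0, 1]

clf = RandomForestClassifier(random_state=0).fit(X, y)

# classes_ lists the labels in the column order used by predict_proba
print(clf.classes_)

# One row per new sample, one column per class; each row sums to 1
proba = clf.predict_proba([[4, 5, 6], [14, 15, 16]])
print(proba.shape)  # (2, 2)
```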

## Example of basic [data pre-processing](https://scikit-learn.org/stable/glossary.html#term-transform)

#### A typical pipeline consists of a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
Importing the `StandardScaler` transformer

In [13]:
from sklearn.preprocessing import StandardScaler

X = [[0, 15],
     [1, -10]]

#### Fit the `StandardScaler` on the data and transform it

In [14]:
# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
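
To make the scaling step less opaque, the sketch below reproduces it by hand: each column has its mean subtracted and is divided by its standard deviation. It assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`; the name `manual` is introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15],
              [1, -10]], dtype=float)

scaler = StandardScaler().fit(X)

# Column-wise standardization: (x - mean) / std, with ddof=0
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The fitted scaler stores the statistics it learned from X
print(scaler.mean_)   # per-column means: 0.5 and 2.5
print(scaler.scale_)  # per-column standard deviations: 0.5 and 12.5

# The manual computation matches the transformer's output
assert np.allclose(manual, scaler.transform(X))
```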

## Example of a basic [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)

#### Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. 

In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#### Create the pipeline: scaler --> classifier

In [21]:
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
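
`make_pipeline` names each step automatically after the lowercased class name. The sketch below shows how those names can be listed and used to address a step's parameters with the `<step>__<parameter>` syntax; the value `C=0.5` is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Steps are stored as (name, estimator) pairs in execution order
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Parameters of a step are addressed as <step name>__<parameter name>
pipe.set_params(logisticregression__C=0.5)
print(pipe.named_steps["logisticregression"].C)  # 0.5
```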

#### Load the IRIS dataset and split it into training and test sets

In [34]:
X, y = load_iris(return_X_y=True)
print(X, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 ...
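
Printing the full arrays is rarely informative; inspecting shapes is usually enough. The sketch below assumes the default `test_size` of `train_test_split`, which holds out 25% of the samples when neither `test_size` nor `train_size` is given.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 150 iris samples with 4 features each
print(X.shape, y.shape)             # (150, 4) (150,)

# Default split: 75% train, 25% test (test size rounded up)
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```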

#### Fit the whole pipeline using the training data

In [38]:
# fit the whole pipeline
pipe.fit(X_train, y_train)

#### Check the final accuracy score for the fitted pipeline

In [None]:
# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
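
For intuition, the pipeline can be unrolled into its individual steps. The following sketch, which uses the illustrative names `scaler`, `clf` and `manual_pred`, fits the scaler on the training data only, reuses those statistics on the test data, and checks that the predictions match the pipeline's.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: learn scaling statistics from the training set only
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# The test set is scaled with the training statistics, never refitted
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline version: the same two steps behind a single fit/predict
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

assert (manual_pred == pipe.predict(X_test)).all()
```

Keeping the scaler inside the pipeline is what prevents test-set statistics from leaking into training.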

## Example of automatic parameter (hyperparameter) search

#### Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation).

In [49]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

In [50]:
# load the California housing dataset and split it into train and test sets
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#### Define the hyperparameter search space for [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)

In [51]:
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}
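
Note that `scipy.stats.randint(low, high)` includes `low` but excludes `high`, so `randint(1, 5)` only produces the values 1 to 4. The short sketch below draws samples to illustrate this; the name `samples` and the sample size of 1000 are arbitrary choices for the example.

```python
from scipy.stats import randint

# Discrete uniform distribution over the integers 1, 2, 3, 4
n_estimators_dist = randint(1, 5)

# Draw reproducible samples from the distribution
samples = n_estimators_dist.rvs(size=1000, random_state=0)

print(samples.min(), samples.max())

# The upper bound 5 is never drawn
assert samples.min() >= 1
assert samples.max() <= 4
```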

#### Create a `RandomizedSearchCV` object and fit it to the training data

In [52]:
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)

search.fit(X_train, y_train)

#### Find the best hyperparameters found by the search

In [53]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [54]:
search.score(X_test, y_test)

0.735363411343253
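
When no `scoring` argument is given, `search.score` delegates to the underlying estimator's own `score` method, which for regressors is the coefficient of determination (R²). The sketch below checks this on a small synthetic dataset from `make_regression`, used here as a stand-in so the example runs without downloading the California housing data; its scores are therefore not comparable to the value above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption of this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(1, 5),
                         "max_depth": randint(5, 10)},
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

# After fitting, the search refits the best candidate on all training data
r2_via_search = search.score(X_test, y_test)
r2_via_metric = r2_score(y_test, search.best_estimator_.predict(X_test))

# Both routes compute the same R² on the held-out data
assert abs(r2_via_search - r2_via_metric) < 1e-12
```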