# scikit-learn
---
These notes follow the Getting Started guide at [scikit-learn.org](https://scikit-learn.org/stable/getting_started.html).

sklearn calls its built-in algorithms and models estimators. They all inherit from the same base class, `BaseEstimator`. Here's an example, a random forest classifier:

In [8]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]
clf.fit(X, y)

clf.predict([[4,5,6], [14,15,16]])

array([0, 1])
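As a quick check of the shared base class, the sketch below tests a few estimators with `isinstance` and reads their constructor parameters through `get_params()`, which `BaseEstimator` provides. The three estimators are picked arbitrarily for illustration.

```python
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# A classifier, a linear model, and a transformer (arbitrary picks for illustration)
estimators = [RandomForestClassifier(random_state=0), LogisticRegression(), StandardScaler()]

# Each of them is an instance of BaseEstimator
all_share_base = all(isinstance(est, BaseEstimator) for est in estimators)
print(all_share_base)

# get_params() returns the constructor arguments as a dict
rf_params = estimators[0].get_params()
print(rf_params["random_state"])
```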

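Next is a transformer: an estimator that pre-processes the data instead of predicting from it. `StandardScaler` rescales each feature (column) to zero mean and unit variance; it does not impute missing values.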

In [5]:
from sklearn.preprocessing import StandardScaler
X = [[0,15],[1, -10]]
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
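To see where those numbers come from, here is a minimal sketch that redoes the scaling by hand with NumPy: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The names `manual` and `scaled` are just mine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15], [1, -10]], dtype=float)

# Column-wise mean and population standard deviation (NumPy's default ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)
manual = (X - mean) / std

# Same data through StandardScaler for comparison
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```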

This is pretty cool: it's creating a pipeline made up of a transformer (here it scales the features) and the estimator itself, in this case logistic regression. When you fit the pipeline to the data, it automatically fits and applies the transformer first, then fits the estimator on the transformed data.

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0)
)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=0))])

We can then evaluate this model using the imported `accuracy_score`:

In [7]:
accuracy_score(y_test, pipe.predict(X_test))  # signature is accuracy_score(y_true, y_pred)

0.9736842105263158
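To convince myself of what the pipeline does under the hood, here is a sketch that performs the same steps by hand and compares the predictions. The key point is that the scaler is fit on the training data only and then reused on the test data. Variable names such as `manual_pred` are mine, and this is my reading of the behavior rather than the library's actual implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(random_state=0).fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```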

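sklearn has a bunch of tools to evaluate your model, including cross-validation. Cross-validation doesn't help the model learn better; it gives a more reliable estimate of how well the model generalizes. The data is split into k folds, the model is trained on k-1 of them and scored on the one held out, and this repeats until every fold has been the test set once. `cross_validate` uses 5 folds by default, which is why there are five scores below.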

In [11]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)  # generates a random regression problem; noise defaults to 0.0, so a linear model fits it exactly
lr = LinearRegression()

result = cross_validate(lr, X, y)
result['test_score']

array([1., 1., 1., 1., 1.])
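The fold-by-fold procedure can be sketched as an explicit loop. This assumes the default splitter for a regressor is an unshuffled 5-fold `KFold`, and that the default score is the estimator's own `score` method (R² for `LinearRegression`). `manual_scores` is my own name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Train on four folds, score on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# Compare against what cross_validate reports
library_scores = cross_validate(LinearRegression(), X, y)["test_score"]
matches = np.allclose(manual_scores, library_scores)
print(matches)
```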

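Estimators have parameters, called hyper-parameters, that you can tune. It's not clear beforehand what these should be, since they depend on the data you're working with. sklearn has tools to search for a good combination automatically; the example below uses `RandomizedSearchCV`, which tries `n_iter` randomly sampled combinations and keeps the one with the best cross-validated score.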

In [16]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_distributions = {'n_estimators': randint(1, 5),   # draws from 1..4 (upper bound is exclusive)
                       'max_depth': randint(5, 10)}     # draws from 5..9

search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)

search.fit(X_train, y_train)

search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [17]:
search.score(X_test, y_test)

0.735363411343253
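To get a feel for the kind of candidates a randomized search draws, here is a small sketch using `ParameterSampler` from `sklearn.model_selection` with the same distributions. Whether these are exactly the five candidates the search above tried is an assumption I have not verified; the point is only to show the sampling and the value ranges.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same distributions as in the search above; randint's upper bound is exclusive
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

# Draw five random parameter combinations
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for params in candidates:
    print(params)
```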