## Beginner's guide to the `scikit` API

![Choosing the right estimator](http://scikit-learn.org/stable/_static/ml_map.png)

[Link to the interactive map from above](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

`scikit-learn` is a library of Python tools for machine learning (ML), imported in code as `sklearn`. 
It covers a wide variety of ML tasks, 
so the codebase can look complex at first glance. 
Fortunately, `scikit` was designed to be (more or less) intuitive to use once the basics of the API are understood. 

## Validation splits / training vs. test data

For statistical learning, it is important to separate your dataset into independent variables (the features, `X`) and a dependent variable (the response, `Y`), 
and then to split the data further into training and test sets.
Selecting features is up to the user, 
but the training/test split is handled by `train_test_split()` in `sklearn.model_selection`. 

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Any feature matrix X and response vector Y will do; the iris dataset is only a stand-in here
X, Y = load_iris(return_X_y=True)

# test_size is the fraction of the data (between 0 and 1) to reserve for testing
f = 0.25
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=f)
```

With long module paths such as these, it can be useful to alias imports with the `as` keyword; 
just be considerate of anyone else who may later read or work with your code.
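
As a small illustration of the aliasing mentioned above, the sketch below imports the same function under two shorter names. The aliases `ms` and `split` and the toy data are arbitrary choices made for this example, not conventions from the library.

```python
import sklearn.model_selection as ms                            # alias for a module
from sklearn.model_selection import train_test_split as split   # alias for a function

# Hypothetical toy data: 10 samples with one feature each
X = [[i] for i in range(10)]
Y = [i % 2 for i in range(10)]

# The alias is called exactly like the original function
X_train, X_test, Y_train, Y_test = split(X, Y, test_size=0.2, random_state=0)

# Both names point at the same function object
same_function = ms.train_test_split is split
```

Short aliases save typing, but an unfamiliar alias forces the reader to scroll back to the imports, so descriptive names are usually the safer choice.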

## Model construction

Aside from data preparation (covered above), 
the first step to solving a machine learning problem in `scikit` is to construct a model. 
All of the green boxes in the flowchart above represent separate (though often similar) 
modeling algorithms within the `scikit` API. 
As long as the relevant modules were correctly imported, 
actually declaring and initializing a model looks the same for all of those model types:

```python
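# Template only: <model_category> and ModelType are placeholders, not real names
from sklearn.<model_category> import ModelType
...
model = ModelType()
# to fit the model with training data (fit() returns the fitted model itself):
fitted_model = model.fit(X_train, Y_train)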
```

For example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

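# X holds the predictor variables, Y the response variable
X, Y = load_iris(return_X_y=True)

# with method chaining, you can declare and fit on one line;
# KMeans is unsupervised, so it only needs X (passing Y is allowed but ignored)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)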
```
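
To make the "looks the same for all model types" point concrete, here is a minimal sketch that fits three different estimators with identical calls. The choice of estimators, their parameters, and the iris dataset are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)

# Three different algorithms, constructed the same way
models = [
    KMeans(n_clusters=3, n_init=10, random_state=0),  # clustering (ignores Y)
    KNeighborsClassifier(n_neighbors=5),              # classification
    DecisionTreeClassifier(random_state=0),           # classification
]

# The same fit() call works for each of them
fitted = [m.fit(X, Y) for m in models]

# fit() returns the estimator itself, which is what makes chaining possible
returns_self = all(f is m for f, m in zip(fitted, models))
names = [type(m).__name__ for m in fitted]
```

Because every estimator shares this interface, swapping one algorithm for another usually means changing only the import and the constructor line.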

## Using a fitted model for prediction

Once a model has been fitted to the training data, 
you can typically use it to make predictions with `model.predict()` 
(see each model's documentation for return types, variant functions and notes on their usage).

```python
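from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# validation split (iris and LogisticRegression are stand-ins for your own data and model)
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

# fit model
fitted_model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# predict class labels
Y_predict = fitted_model.predict(X_test)

# probabilistic (only for models that implement predict_proba):
Y_pred_proba = fitted_model.predict_proba(X_test)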
```
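
Not every model offers every prediction method, which is why checking the documentation matters. The sketch below, which assumes the iris dataset and two arbitrarily chosen classifiers, checks for `predict_proba` before calling it.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

supports_proba = {}
proba_shape = None

for model in (LogisticRegression(max_iter=1000), RidgeClassifier()):
    model.fit(X_train, Y_train)
    labels = model.predict(X_test)  # one predicted label per test sample

    # predict_proba exists only on some estimators, so check before calling it
    name = type(model).__name__
    supports_proba[name] = hasattr(model, "predict_proba")
    if supports_proba[name]:
        proba = model.predict_proba(X_test)  # one row per sample, one column per class
        proba_shape = proba.shape
```

Here `LogisticRegression` provides class probabilities while `RidgeClassifier` does not, even though both are fitted and used for `predict()` in exactly the same way.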

## General notes

The [scikit documentation](http://scikit-learn.org/stable/index.html) is consistently good. 
Most questions about algorithm selection and what parameters can/should be tuned with each model 
can be resolved with a good look at the relevant doc pages. 
It also contains information on the library's wide array of preprocessing and dimensionality reduction resources
(this may not mean anything yet, but it should as you begin to tackle more complicated ML problems).
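
For a first taste of those resources, here is a minimal sketch assuming the iris dataset, with `StandardScaler` and `PCA` picked arbitrarily as one preprocessing tool and one dimensionality reduction tool. They follow the same construct-then-fit pattern as models, but expose `transform()` rather than `predict()`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Preprocessing: rescale each feature to zero mean and unit variance
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Dimensionality reduction: project the 4 scaled features onto 2 components
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
```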

Specific similarities between different parts of the API will be discussed as they arise in the rest of this lab's examples.