# Machine Learning with Scikit-Learn

Now that we've had an introduction to machine learning, in this lecture we're going to discuss how to use Python and the Scikit-Learn package to perform machine learning.

- As mentioned, we're going to be using the **Scikit-Learn** package for Python.
- It's the most popular machine learning package for Python and it has a lot of algorithms already built into it.

In case you need to install Scikit-Learn:


If you have the Anaconda distribution:

```bash
$ conda install scikit-learn
```

or 

If you have another Python distribution:

```bash
$ pip install scikit-learn
```
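To confirm that the installation worked, a quick check along these lines may help. It only assumes that the package imports under the name `sklearn`, which is the name used in the imports later in this lecture.

```python
# Minimal sanity check: import the package and report its version.
import sklearn

version = sklearn.__version__
print("scikit-learn version:", version)
```

If the import fails, the package is most likely installed in a different Python environment than the one running the code.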

## Machine Learning Process

- Let's talk about the basic structure of how to use Scikit-Learn

First a quick review of the machine learning process: 

[image]

The machine learning process starts off with your data. Somehow you need to acquire data and then the next step usually is to clean that data and format it so that the machine learning model can accept it. 

Before you actually give it to the machine learning model, however, you're going to split that cleaned data into a **training set** and a **test set**.

You **train** your model on the **training set** and then, in the next step, you **test** your model on the **test set**.

You **iterate** your model and **tune the parameters** of your model until it's ready to **deploy**.
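As a rough sketch of that loop, the example below splits some data, then trains and tests one model for several values of a single parameter. The synthetic data, the choice of `Ridge`, and the `alpha` values are all assumptions made for illustration; they are not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y is linear in X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split the cleaned data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Iterate: train and test the model for several values of one parameter.
results = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)                   # train on the training set
    results[alpha] = model.score(X_test, y_test)  # R^2 on the test set

best_alpha = max(results, key=results.get)
print(results)
print("best alpha:", best_alpha)
```

In practice, tuning is often done against a separate validation set or with cross-validation, so that the test set stays untouched until the end.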

## Scikit-Learn

- Now let's go over an example of the process to use Scikit-Learn.
- Now don't worry about memorizing any of this. We're going to get plenty of practice and review all of this when we actually start coding in subsequent lectures.

Every algorithm in Scikit-Learn is exposed via an **"Estimator"** object.

First you'll import the model and the general form of this is:

```python
from sklearn.family import Model
```

Example:

```python
from sklearn.linear_model import LinearRegression
```

`LinearRegression` is the Estimator object; it is the model itself. The next step is to **instantiate** that model or estimator.

**Estimator parameters**: All the parameters of an estimator can be set when it is instantiated, and all of them have suitable default values.

You can use `Shift+Tab` in Jupyter to check the possible parameters.
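Outside of Jupyter, the same information can be read programmatically. A minimal sketch, using the estimator's `get_params()` method to list each parameter with its current (here, default) value:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# get_params() returns a dict mapping parameter names to their current values.
params = model.get_params()
for name, value in sorted(params.items()):
    print(name, "=", value)
```

The exact set of parameters printed depends on the installed scikit-learn version.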

**For example**: you can instantiate a linear regression model with the parameter `normalize=True` and then print the model you just instantiated. (The `normalize` parameter was removed in scikit-learn 1.2, so this exact call only works on older versions.)

```python
model = LinearRegression(normalize=True)
print(model)
```

Output:

```
LinearRegression(copy_X=True, fit_intercept=True, normalize=True)
```

You can check the parameters that were left at their defaults, such as `copy_X=True` and `fit_intercept=True`. Again, you don't actually need to specify `normalize=True`. These are just additional parameters you can use to tune the model to something more specific.

Once you have created your model with your parameters, it's time to fit your model on some data.

But remember we should split this data into a **training set** and a **test set**. Let's see an example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
X
```

Output:

```
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
```

```python
list(y)
```

Output:

```
[0, 1, 2, 3, 4]
```

We need to import NumPy to create some fake data.

Then, to do a train/test split with Scikit-Learn (splitting your data into training and test data), you import the `train_test_split` function from `sklearn.model_selection`.

Then we have a set of data, `X` and `y`, where `X` holds the actual features and `y` holds the actual label for each row of features.

Using `train_test_split`, you pass in `X` and `y` along with the test size, the fraction of the data to hold out for testing.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

```python
X_train
```

Output:

```
array([[0, 1],
       [6, 7],
       [2, 3]])
```

```python
y_train
```

Output:

```
[0, 3, 1]
```

```python
X_test
```

Output:

```
array([[4, 5],
       [8, 9]])
```

```python
y_test
```

Output:

```
[2, 4]
```

You can use `train_test_split` on your features and labels and Scikit-Learn will automatically output your training set and your testing set.

You end up with four objects: the features and labels for training (`X_train`, `y_train`) and the features and labels for testing (`X_test`, `y_test`).
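Because the split is shuffled at random, the rows that land in each set can change from run to run. A minimal sketch of making the split repeatable with the `random_state` parameter of `train_test_split` (the value `42` is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)

# random_state fixes the shuffle, so both calls produce the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)         # 3 training rows, 2 test rows
print(np.array_equal(X_train, X_train2))   # True: the split is repeatable
```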

Now that we have split the data, we can **train/fit our model** on the **training data**. 

This is done through the `model.fit()` method:

```python
model.fit(X_train,y_train)
```

Remember that we instantiated `model` as a linear regression estimator. You call `model.fit()` and pass in your training data: first your `X` training data, which holds the features, and then your `y` training data, which holds the training labels.

- Now the model has been fit and trained on the training data.
- The model is ready to predict labels or values on the test set.
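To see what fitting produces, here is a small sketch on the toy data from the split example. It assumes a plain `LinearRegression()` with default parameters and an arbitrary `random_state`, and it inspects the `coef_` and `intercept_` attributes that `fit()` sets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data from the split example; random_state=0 is an arbitrary choice.
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)  # learn from the training features and labels

# Attributes ending in an underscore are set by fit().
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# The fitted model can now predict values for the held-out rows.
predictions = model.predict(X_test)
print("predictions:", predictions, "actual:", y_test)
```

The toy labels are an exact linear function of the features, so the predictions here match the actual values up to floating-point error. Real data will not behave this neatly.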

> Keep in mind I'm showing an example of a supervised learning process.
The process for an unsupervised model is going to be a little different because you're actually not going to have those labels.

For a supervised learning model we get **predicted values** using the `predict` method:

```python
predictions = model.predict(X_test)
```



We can then **evaluate** our model by comparing our predictions to the correct values.

The evaluation method depends on what sort of machine learning algorithm we are using (e.g. regression, classification, clustering).
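For a regression model, one possible evaluation looks like the sketch below. The synthetic data and the choice of metrics (`mean_squared_error` and `r2_score` from `sklearn.metrics`) are assumptions for illustration; a classification model would use different metrics, such as accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compare the predictions with the correct values from the test set.
mse = mean_squared_error(y_test, predictions)  # lower is better
r2 = r2_score(y_test, predictions)             # closer to 1 is better
print(f"MSE: {mse:.3f}  R^2: {r2:.3f}")
```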

Let's get a quick recap!


Scikit-Learn strives to have a uniform interface across all estimators, and the list below summarizes it.

Given a Scikit-Learn _estimator_ object named `model`, the following methods are available, depending on the type of estimator.


- Available in **all estimators**:

    - `model.fit()`: fit training data.
        - For supervised learning applications, this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).
        - For unsupervised learning applications, this accepts only a single argument, the data `X` (e.g. `model.fit(X)`).

- Available in **supervised estimators**:

    - `model.predict()`: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`), and returns the predicted label for each object in the array.
    - `model.predict_proba()`: for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by `model.predict()`.
    - `model.score()`: for classification or regression problems, most estimators implement a score method. For classifiers the default score is accuracy, between 0 and 1. For regressors it is R^2, which is at most 1 and can be negative for a poor fit. In both cases a larger score indicates a better fit.

- Available in **unsupervised estimators**:

    - `model.predict()`: predict labels in clustering algorithms.
    - `model.transform()`: given an unsupervised model, transform new data into the new basis. This also accepts one argument, `X_new`, and returns the new representation of the data based on the unsupervised model.
    - `model.fit_transform()`: some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
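The sketch below exercises the methods listed above on three different estimators. The synthetic two-blob data and the specific estimators (`LogisticRegression`, `KMeans`, `PCA`) are assumptions chosen for illustration; the point is only that the calls have the same shape across all of them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-3, 1, size=(50, 2)),
    rng.normal(3, 1, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Supervised: fit(X, y), then predict / predict_proba / score.
clf = LogisticRegression()
clf.fit(X, y)
labels = clf.predict(X)
proba = clf.predict_proba(X)   # one column per class, each row sums to 1
accuracy = clf.score(X, y)     # mean accuracy for classifiers

# Unsupervised clustering: fit(X) only, then predict cluster labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
clusters = km.predict(X)

# Unsupervised transform: fit_transform(X) fits and transforms in one call.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
X_again = pca.transform(X)     # same representation from the fitted model

print(proba[:2], accuracy)
print(clusters[:5], X_reduced.shape)
```

Scoring on the same data used for fitting, as done here for brevity, overstates how well a model generalizes; a held-out test set gives a fairer number.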

## Choosing the right Algorithm

Maybe you're wondering how to choose an algorithm: classification versus regression versus clustering.

If you search for "Scikit-Learn algorithm cheat sheet", you should find an image like the one at this link:


https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


This is from the official Scikit-Learn documentation. It works as a decision tree, or walkthrough guide, for choosing an algorithm. Starting at the top, it first asks whether you have more than 50 samples.

If not, you should really get more data. If you do have more than 50 samples, it then asks whether you are trying to predict a category or predict a quantity, whether you have labeled data, and so on.
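Those first few questions can be written down as a tiny function. This is only a rough sketch: `suggest_task` is a hypothetical helper, not part of Scikit-Learn, and the final fallback to dimensionality reduction is an assumption about how the chart continues. The real chart has many more branches.

```python
def suggest_task(n_samples, predicting_category, has_labels, predicting_quantity):
    """Hypothetical helper mirroring only the first questions of the cheat sheet."""
    if n_samples <= 50:
        return "get more data"
    if predicting_category:
        # Labeled data points to classification, unlabeled data to clustering.
        return "classification" if has_labels else "clustering"
    if predicting_quantity:
        return "regression"
    # Assumed fallback: neither a category nor a quantity is being predicted.
    return "dimensionality reduction"


print(suggest_task(30, True, True, False))
print(suggest_task(500, True, True, False))
print(suggest_task(500, True, False, False))
print(suggest_task(500, False, True, True))
```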