# Machine Learning with a Library

### Introduction

In the last few lessons, we built a simple linear regression machine learning model.  We did so using the three components of any machine learning algorithm: first choosing a prediction model, then fitting the model, and finally using the model to predict future data.

1. Our **prediction model** was simply a line, or a function, that given an input predicted an output.  In our example of basketball shooting, given a shooting angle, the model predicted a distance.  

2. We **fit** the model by comparing it against the actual data.  We calculate the difference between each actual value and the value that our model predicts; this difference is called the error.  Then we square each of those errors and add up the squared errors.  This is expressed mathematically as Sum of Squared Errors $= (actual_1 - predicted_1)^2 + (actual_2 - predicted_2)^2 + ...$ .  The best-fitting model is the one with the lowest sum of squared errors.

3. Now we can **predict** new distances with our fitted model.  We pass in new angles, and the fitted model predicts a distance for each one.
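As a rough sketch of the fitting step, the sum of squared errors for one candidate line can be computed in a few lines of Python.  The data is the angle and distance table used later in this lesson, the candidate line (`12 + 5*angle`) is just an initial guess, and the helper name `sum_of_squared_errors` is made up for illustration.

```python
# Data used later in this lesson: shooting angles and the distances they produced.
angles = [0.3, 0.5, 0.7]
distances = [8, 11, 17]


def predicted_distance(angle):
    # Candidate line (an initial guess): intercept 12, slope 5.
    return 12 + 5 * angle


def sum_of_squared_errors(model, inputs, actuals):
    # Hypothetical helper: square each (actual - predicted) difference, then add them up.
    return sum((actual - model(x)) ** 2 for x, actual in zip(inputs, actuals))


sse = sum_of_squared_errors(predicted_distance, angles, distances)
print(round(sse, 2))  # 44.75
```

Fitting amounts to searching for the intercept and slope that make this number as small as possible.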

### Using a machine learning library

Now wouldn't it be nice if, instead of writing these algorithms from scratch, we could use a tool to do these for us?  Well we can.

<img src="./scikitlearn.png" >

Scikit-learn is an excellent tool for running machine learning algorithms.  Let's get going.

Our first step, of course, is to install the library.

In [2]:
!pip install scikit-learn


> Press shift + enter to run the cell and install the library.

Then we import the library.

In [1]:
import sklearn

Now that we have this library, let's import the linear regression model.

## Going through our three steps

Now that we have downloaded and imported the scikit-learn library, it is time to follow our three step process of (1) creating an initial model, (2) fitting the model, and (3) using the fitted model to make new predictions.

### 1. Creating an initial model

In our introduction to machine learning lesson, we created an initial model simply by writing a function that takes an input and predicts an output.  

In [21]:
def predicted_distance(angle):
    return 12 + 5*angle 

Now when working with scikit-learn we also create an initial model, but we do so by using the `LinearRegression` class from the scikit-learn library.

So first we import the `LinearRegression` class.

In [16]:
from sklearn.linear_model import LinearRegression

And now we can create our initial model.

In [19]:
linear_regression = LinearRegression()

This model is fairly abstract at this point, and admittedly difficult to understand.  But the big takeaway is that, when displayed, it looks pretty similar to a dictionary.  As you can see, it lists its settings as key value pairs just like a dictionary does.

In [22]:
linear_regression

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

It's ok that we don't understand what these key value pairs mean yet; we will in time.  Right now let's focus on the fact that we were able to create an initial model with the lines:

```python
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
```

And now that we have an initial model, we are ready to move onto step 2.

### 2. Fit the model with the actual data

Now that we have initialized our model with the code `linear_regression = LinearRegression()`, it is time to pass some data into this model so that we can fit the model.  In the last lesson, we saw how we can fit the model by comparing what a model predicts versus the results of our actual data.  We then choose the model with the lowest error by finding the model that has the lowest residual sum of squares.

Here we do the same thing, and we do so by passing the data into our model.  Let's use the same data as earlier.

| angle | distance |
| ------------- |:-------------:|
| .30 | 8 feet |
| .50 | 11 feet |
| .70 | 17 feet |

Now remember that the shooting angles are the inputs and that each angle leads to an output of a distance.  Ok, so we might like to simply pass these inputs and outputs to our model as two lists.  And our model has a `fit` method to do precisely that.

In [29]:
angles = [.3, .5, .7]
distances = [8, 11, 17]
# linear_regression.fit()

However, scikit-learn requires our lists to be in a specific format before we can pass this data through.  It wants us to organize our data like so:

In [30]:
inputs = [[.3], [.5], [.7]]
outputs = [8, 11, 17]

### A special format...but why??

Scikit-learn recognizes that we can include more than one input when predicting an output.  For example, in the table above we just used `angle` to predict `distance`, but later we may decide to use both `angle` and `arm speed` to predict `distance`.

| arm speed | angle | distance |
| ---- | ------------- |:-------------:|
| 5 mph | .30 | 9 feet |
| 6 mph | .50 | 15 feet |
| 4 mph | .70 | 21 feet |

So scikit-learn, along with other machine learning libraries, recognizes that we have rows of data, and that each row can have multiple inputs, but just one output.

So to represent the rows of inputs of `arm speed` and `angle`, we can organize our data like so:

In [31]:
inputs = [
    [5, .30],
    [6, .50],
    [4, .70],
]

So we use a nested list, where each element of the outer list is a row, with each column being a different input for that row.  For our outputs, however, there is only ever one output per row, so we can just organize our outputs in one unnested list.

In [32]:
outputs = [9, 15, 21]
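As a quick sketch of why this format is convenient, the two-input rows can be handed to a model exactly as written, using the `fit` method mentioned above.  The variable name `two_input_model` is made up for this illustration, and with only three rows of data this is a demonstration of the format rather than a meaningful model.

```python
from sklearn.linear_model import LinearRegression

# Each inner list is one row: [arm speed, angle].
inputs = [
    [5, 0.30],
    [6, 0.50],
    [4, 0.70],
]
# One output (distance) per row.
outputs = [9, 15, 21]

# Hypothetical name for a separate model that uses both input columns.
two_input_model = LinearRegression()
two_input_model.fit(inputs, outputs)

# The fitted model holds one coefficient per input column.
print(len(two_input_model.coef_))  # 2
```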

So even when we move back to working with just one column of input data:

| angle | distance |
| ------------- |:-------------:|
| .30 | 8 feet |
| .50 | 11 feet |
| .70 | 17 feet |

We still use a nested list of inputs, with each element of the outer list being a row, and each element of the inner list being an input column.

In [36]:
inputs = [
    [.30],
    [.50],
    [.70],
]
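If the data starts out as flat lists, a list comprehension is one simple way to build the nested format.  This is only a sketch; the names `arm_speeds`, `single_column_inputs`, and `two_column_inputs` are illustrative and do not appear elsewhere in the lesson.

```python
# Flat lists, one value per shot.
angles = [0.3, 0.5, 0.7]
arm_speeds = [5, 6, 4]  # hypothetical name for the arm speed column

# One input column: wrap each value in its own single-element row.
single_column_inputs = [[angle] for angle in angles]
print(single_column_inputs)  # [[0.3], [0.5], [0.7]]

# Two input columns: pair the values up row by row.
two_column_inputs = [[speed, angle] for speed, angle in zip(arm_speeds, angles)]
print(two_column_inputs)  # [[5, 0.3], [6, 0.5], [4, 0.7]]
```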

### 2 (continued). Now back to fitting the model

Ok, now that we know the format for our data, the next thing to do is to fit our linear model to the data.  We do this by using the `fit` method on our linear model and passing through the data in the proper format.

In [37]:
# nested list for the inputs
inputs = [
    [.30],
    [.50],
    [.70],
]

# single list for the outputs
outputs = [8, 11, 17]

linear_regression.fit(inputs, outputs)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### 3. Using our fit model to predict

Believe it or not, with that `fit` line at the end, we have fit our model to our data.  Here we'll prove it.  Remember that in our Introduction to Machine Learning lesson, when we fit our model, we went forward with a function that looked like the following.



In [39]:
def predicted_distance(angle):
    return 3 + 4*angle 

So the key things that change in our model are the two numbers.  The number on the left (that's not multiplied by anything) is called the y intercept.  We can see the intercept that scikit-learn arrived at after fitting our data with the following:

In [40]:
linear_regression.intercept_

0.7500000000000018

The number that we multiply by our angle is called the coefficient, and we can find it by reading the `coef_` attribute.  It comes back as an array because there is one coefficient per input column.

In [41]:
linear_regression.coef_

array([22.5])

So scikit-learn is telling us that really our model should look like the following:

In [2]:
def predicted_distance(angle):
    return (.75 + 22.5*angle)
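To see where these two numbers come from, here is a minimal sketch that computes them by hand with the standard least-squares formulas for a single input.  This is the textbook calculation, not necessarily how scikit-learn computes the fit internally, and the variable names are chosen for this illustration.

```python
angles = [0.3, 0.5, 0.7]
distances = [8, 11, 17]

mean_angle = sum(angles) / len(angles)
mean_distance = sum(distances) / len(distances)

# Slope: how angle and distance vary together, divided by how much angle varies.
numerator = sum(
    (a - mean_angle) * (d - mean_distance) for a, d in zip(angles, distances)
)
denominator = sum((a - mean_angle) ** 2 for a in angles)
slope = numerator / denominator

# Intercept: chosen so the line passes through the point of means.
intercept = mean_distance - slope * mean_angle

print(round(intercept, 2), round(slope, 2))  # 0.75 22.5
```

The results match the `intercept_` and `coef_` values above, up to floating point rounding.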

### 4. Predicting new distances

Now we can get a sense of how well our new formula does.  So this was our original data that we used to fit our model:

| angle | distance |
| ------------- |:-------------:|
| .30 | 8 feet |
| .50 | 11 feet |
| .70 | 17 feet |

And these are the predictions that our updated `predicted_distance` function produces.

In [7]:
predicted_distance(.3)  # 7.5
predicted_distance(.5) # 12.0
predicted_distance(.7) # 16.5

16.5

Not bad at all.  Of course, scikit-learn has a built in method, `predict`, that allows us to see the outputs of our model.  We can pass through our three rows of inputs.  But because scikit-learn does not know how many columns of inputs will be in each row, once again we use a nested list.

In [49]:
inputs = [
    [.3],
    [.5],
    [.7]
]

linear_regression.predict(inputs)

array([ 7.5, 12. , 16.5])

And we can pass through new inputs not in our data, to predict new outputs as well.

In [50]:
new_inputs = [
    [.4],
    [.6],
    [.8]
]
linear_regression.predict(new_inputs)

array([ 9.75, 14.25, 18.75])

So you can see that if we were to use shooting angles of .4, .6, and .8, our model would predict distances of 9.75, 14.25, and 18.75.
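As a sanity check, the sketch below refits the model and compares `predict` against the same calculation done by hand from the fitted intercept and coefficient.  The names `new_angles`, `library_predictions`, and `manual_predictions` are made up for this illustration.

```python
from sklearn.linear_model import LinearRegression

inputs = [[0.3], [0.5], [0.7]]
outputs = [8, 11, 17]

linear_regression = LinearRegression()
linear_regression.fit(inputs, outputs)

new_angles = [0.4, 0.6, 0.8]

# Predictions from scikit-learn (one single-element row per angle).
library_predictions = linear_regression.predict([[angle] for angle in new_angles])

# The same predictions computed by hand: intercept + coefficient * angle.
manual_predictions = [
    linear_regression.intercept_ + linear_regression.coef_[0] * angle
    for angle in new_angles
]

for library_value, manual_value in zip(library_predictions, manual_predictions):
    print(round(float(library_value), 2), round(float(manual_value), 2))
```

Each printed pair should agree, which shows that `predict` is applying the fitted line to every row of inputs.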

### Summary

In this lesson, we saw how to use the scikit-learn library to fit a machine learning model and make new predictions with our fitted model.



We do so using similar steps to what we saw in our introduction to machine learning lesson.

1. Create an initial model
2. Fit the model to data
3. Use the fitted model to make new predictions

We can translate these steps into code with the following:

In [8]:
# import libraries
import sklearn
from sklearn.linear_model import LinearRegression

# 1. Create an initial model
linear_regression = LinearRegression()

# 2. Fit the model to data
linear_regression.fit(inputs, outputs)

# 3. Use the fitted model to make new predictions
linear_regression.predict(inputs)

The other thing to remember is that if we want to see the numbers behind these new predictions, we can read them from the fitted model's `intercept_` and `coef_` attributes.