# Introduction to Machine Learning

Hello! This notebook is intended to be an introduction to machine learning. The model we will create won't be super advanced, but it will serve as a good basis for you to explore more on your own.

## The Data

Before we get started, I should introduce the data we will be working with. The Boston Housing dataset is a very common beginner dataset (so much so that it's included in the library we will be using!).

This data was collected by the US Census Bureau in 1978 in the suburbs surrounding Boston, Massachusetts. It contains 14 statistics for 506 suburbs, including:
* Per capita crime rate by town
* Average number of rooms per dwelling
* Proportion of owner-occupied units built prior to 1940

Let's go ahead and load this dataset into our environment. We are using the sklearn library in this tutorial as it provides an easy way to access all of the information we need.

In [10]:
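from sklearn.datasets import load_boston  # Import the Boston dataset loader function from the datasets module.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.

dataset = load_boston()  # Load a dictionary-like object holding the dataset into our variable.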


Our `dataset` variable now includes some useful information:


In [18]:
print(dataset.keys())

['filename', 'data', 'target', 'DESCR', 'feature_names']


`'filename'` just tells us where the dataset is stored on our computer.

`'data'` contains all the 506 examples and their values.

`'target'` is the value we will try to predict for all of the 506 examples.

`'DESCR'` is more information about the dataset which you should certainly read.

`'feature_names'` tells you to what each of the columns in `'data'` corresponds (the statistics I mentioned above). 

In our case, the `'target'` we will be trying to predict is the median home value in thousands of dollars (\$1000s).

## But How?
The algorithm we will be looking at today is called linear regression. If you have ever used the "line of best fit" feature in Microsoft Excel, you are already familiar with it. In fact, you could probably do what we're about to do in Excel, but where's the fun in that?

### The Hypothesis Function
We are all familiar with the basic formula of a line:

\begin{aligned}
y = mx+b \\
\end{aligned}

In this case, we have our y (the value of the house) and our x (the 13 other statistics). Now we want to find the m that will, given any _unseen_ x values (as in, values that are not already in our dataset), give us a y as close as possible to the correct one.

In machine learning terms, this formula is called our _hypothesis function_. Typically, this will be written in the form:

\begin{aligned}
h_\theta(x) = \theta x \\
\end{aligned}

In our case, it would actually look more like this:

\begin{aligned}
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dotsb + \theta_{12} x_{12} \\
\end{aligned}

There are 13 terms because we have 13 x's, one for each feature.
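
That long sum is just a dot product, so it can be sketched in a few lines of NumPy. This is only an illustration, not part of the model we build later: the function name `hypothesis` and the values of `theta` and `x` are made up, and I use three features instead of 13 to keep it readable.

```python
import numpy as np

def hypothesis(theta, x):
    """Return theta_0*x_0 + theta_1*x_1 + ... for a single example x."""
    return np.dot(theta, x)

# Hypothetical weights and one hypothetical example with three features.
theta = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, 4.0])

prediction = hypothesis(theta, x)
print(prediction)  # 2*1 - 1*3 + 0.5*4 = 1.0
```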

### The Cost Function
So now that we have our function defined, how do we "grade" our prediction? That's where the cost function comes into play! There are many cost functions out there, and which one you use depends very much on the problem you are trying to solve and on the steps that come after it. That discussion, however, is outside the scope of this tutorial.

In our case, the cost function that we will be using is called the Mean Squared Error. That function looks like this:
\begin{aligned}
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 \\
\end{aligned}
Where $m$ is the number of examples (not the slope from the line formula above), $h_\theta(x_i)$ is what we predicted for the $i^{th}$ example, and $y_i$ is the correct value. As you can see, this simply takes the average of the squared difference between the predicted value and the real value, then halves it. The $\frac{1}{2}$ part just makes the math nice when you take the first derivative of the equation (we'll talk about that later).
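
To make the cost function a little more concrete, here is a minimal NumPy sketch of it. The function name `cost` and the tiny dataset are made up for illustration; each row of `X` is one example.

```python
import numpy as np

def cost(theta, X, y):
    """Half the mean squared error of the linear hypothesis X @ theta."""
    m = len(y)                  # number of examples
    errors = X @ theta - y      # predicted value minus correct value, per example
    return np.sum(errors ** 2) / (2 * m)

# Made-up data: 3 examples with 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [0.0, 1.0]])
y = np.array([5.0, 2.0, 2.0])

bad_guess = cost(np.array([1.0, 1.0]), X, y)   # errors are [-2, 0, -1]
good_guess = cost(np.array([1.0, 2.0]), X, y)  # predicts every y exactly
print(bad_guess, good_guess)
```

The better the guess for $\theta$, the closer the cost gets to zero.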

### Gradient Descent
Okay, so now that we have the hypothesis function, the correct values, and a way to quantify our error, what next? We need to find a way to improve our guess for $\theta$ using our cost function. The way we are going to do that is called gradient descent.

Gradient descent will use the cost function we defined above to gradually change the value of $\theta$ until the cost is zero, or as close to it as we can get. The formula is as follows:

\begin{aligned}
\theta_j := \theta_j - \alpha\frac{\partial}{\partial \theta_j}J(\theta)\\
\end{aligned}

So, for every $\theta$ value, we subtract from its current value the derivative of the cost function with respect to that $\theta$, multiplied by $\alpha$. $\alpha$ is called the learning rate, and it controls how big a step we take on each update. It typically has a low value like 0.01.
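
For our linear hypothesis and the cost function above, the derivatives for all the $\theta$ values can be written at once. Here $X$ is my own shorthand for the matrix with one example per row, and $y$ for the vector of correct values:

\begin{aligned}
\nabla J(\theta) = \frac{1}{m} X^T (X\theta - y) \\
\end{aligned}

The update rule can then be sketched in plain NumPy. Treat this as a rough illustration under my own assumptions (the function name `gradient_descent`, the made-up data, and updating on the whole dataset at every step), not as what Keras does internally.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, steps=1000):
    """Batch gradient descent for linear regression with J = sum(err^2) / (2m)."""
    m, n = X.shape
    theta = np.zeros(n)                     # start from all-zero weights
    for _ in range(steps):
        errors = X @ theta - y              # prediction minus correct value
        gradient = X.T @ errors / m         # dJ/dtheta_j for every j at once
        theta = theta - alpha * gradient    # the update rule
    return theta

# Made-up data generated from known weights [1, 2], so we can check the result.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
true_theta = np.array([1.0, 2.0])
y = X @ true_theta

theta = gradient_descent(X, y, alpha=0.1, steps=2000)
print(theta)  # should end up very close to the true weights
```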

## The Real Fun
Enough boring math, it's time to try this out!

### The Model
So the first step is to import the libraries that we will need. We will be using keras in this case since it provides (in my opinion) the easiest interface for creating models.

In [23]:
''' I wouldn't pay too much attention to what these do for now, they are just libraries specific to keras
  that we will need
'''
from keras.models import Sequential
from keras.layers import Input, Dense

In [27]:
model = Sequential() # Create our model
model.add(Dense(1, activation="linear", input_shape=(13,))) # Create the first and only layer of our model
model.compile(optimizer="SGD", loss="mean_squared_error", metrics=["mean_absolute_error"])

Let me explain the last two lines a little. Since Keras is made for neural networks, we have to do some ugly stuff to make it do linear regression. The second line creates a single "neuron" using the "linear" function (our hypothesis function) and tells it that our input will be our 13 features.

Then, in the last line, we tell it we want to learn using SGD (Stochastic Gradient Descent, a variant of the gradient descent algorithm above that updates on small batches of examples). We also tell it we want to use our Mean Squared Error function as the loss, and that we will be grading ourselves on the mean absolute error. Accuracy, the usual choice for classification, isn't meaningful when predicting a continuous value like a price.
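
To connect that single "neuron" back to the hypothesis function, here is a rough NumPy sketch of what one linear neuron computes. One thing to note is that a Keras `Dense` layer also adds a bias term by default, which plays the role of the b in y = mx + b. The function name `linear_neuron` and the random inputs and weights are made up for illustration.

```python
import numpy as np

def linear_neuron(x, weights, bias):
    """One 'neuron' with a linear activation: a weighted sum plus a bias."""
    return x @ weights + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 13))         # 4 made-up examples with 13 features each
weights = rng.normal(size=(13, 1))   # one weight per feature, one output
bias = np.zeros(1)                   # a single bias value

outputs = linear_neuron(x, weights, bias)
print(outputs.shape)  # (4, 1): one prediction per example
```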

### The Data
So, before we try to run our model, we should do some data preprocessing. What we are going to do is normalize the features we have. We do this to ensure our model trains well. If we leave the data as it is, the changes our gradient descent makes to our weights can be drastic or otherwise unhelpful. Let's consider the first sample:

In [46]:
data = dataset['data']
print(data[0])

[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
 6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
 4.9800e+00]


As you can see, some values are very small and others are rather large. That means a step that is small for the large features can be a huge step for the small ones, and we'll potentially never reach the minimum!

We will be using numpy for this part.

In [47]:
import numpy as np
data = np.array(data) # Make sure data is a numpy array for these operations
averages = np.mean(data, axis=0) # Calculate the average of every feature (column)
ranges = np.std(data, axis=0) # Calculate the standard deviation of every feature (column)

data = (data - averages) / ranges

Now, as you can see, the values span a much smaller range:

In [48]:
print(data[0])

[-0.41978194  0.28482986 -1.2879095  -0.27259857 -0.14421743  0.41367189
 -0.12001342  0.1402136  -0.98284286 -0.66660821 -1.45900038  0.44105193
 -1.0755623 ]
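
If you want to see why this matters, here is a small experiment on made-up data (not the Boston data). It runs a few steps of plain gradient descent on two features with very different scales, once on the raw values and once after scaling each column by its own mean and standard deviation. The helper names `cost` and `descend`, the data, and the choice of 20 steps are all my own assumptions for the sketch.

```python
import numpy as np

def cost(theta, X, y):
    """Half the mean squared error of the linear hypothesis X @ theta."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def descend(X, y, alpha, steps):
    """Plain batch gradient descent starting from all-zero weights."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
    return theta

rng = np.random.default_rng(0)
small = rng.uniform(0, 1, size=50)     # a feature with values below 1
large = rng.uniform(0, 500, size=50)   # a feature with values in the hundreds
X_raw = np.column_stack([small, large])
y = 3 * small + 0.01 * large

# Scale each column: subtract its mean, divide by its standard deviation.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

results = {}
for name, X in [("raw", X_raw), ("scaled", X_scaled)]:
    before = cost(np.zeros(2), X, y)
    after = cost(descend(X, y, alpha=0.01, steps=20), X, y)
    results[name] = (before, after)
    print(name, "cost before:", before, "cost after:", after)
```

With the same learning rate, the cost on the raw features grows with every step, while the cost on the scaled features goes down.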


Now let's split up the data so that we reserve 10% of the dataset for testing our model.

In [54]:
split_index = int(len(data) * .10)
test_data = data[:split_index]
train_data = data[split_index:]

test_labels = dataset['target'][:split_index]
train_labels = dataset['target'][split_index:]

Now, you should generally not split your data the way we just did. The reason is that if the rows are stored in some meaningful order, the training and test sets will cover different parts of the data, and your model will learn trends that don't hold for new data. You should typically randomly sample the data for the splits, but in our case it will be fine.
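
For reference, a random split could be sketched like this with NumPy. The function name `random_split`, its parameters, and the stand-in arrays are made up for illustration (scikit-learn also ships a `train_test_split` helper that does this kind of thing for you).

```python
import numpy as np

def random_split(features, labels, test_fraction=0.10, seed=0):
    """Shuffle the row indices, then reserve the first test_fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    n_test = int(len(features) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

# Made-up stand-ins with the same shape as our data: 506 examples, 13 features.
features = np.arange(506 * 13, dtype=float).reshape(506, 13)
labels = np.arange(506, dtype=float)

train_x, train_y, test_x, test_y = random_split(features, labels)
print(train_x.shape, test_x.shape)  # (456, 13) (50, 13)
```

The important detail is that the same shuffled indices are used for both the features and the labels, so every example stays paired with its own target.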

### Putting it all together
Now is the moment of truth. Let's train our model.

In [57]:
model.fit(x=train_data, y=train_labels, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x12c994dd0>