<a href="https://colab.research.google.com/github/jdhaecker/Training/blob/master/IntroToMachineLearningProgramming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/Intro-to-ML-Programming.git cloned-repo
%cd cloned-repo
!ls

In [0]:
from IPython.display import Image
def page(num):
    #print(str(num))
    return Image("Intro to Machine Learning Programming ("+ str(num) + ").png")
print("done")

In [0]:
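from IPython.display import Image, display
# Only a cell's last expression is shown automatically, so display the image explicitly.
display(Image("Intro to Machine Learning Programming.png"))
print("done")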

Google Colab is similar to Jupyter Notebook. 
They both run IPython, an interactive Python shell. 

When a cell is run, the code is compiled and executed. <br>
The user can run a cell, make changes and run the cell again. <br>
<br>
To save changes, download the notebook: **File>Download .ipynb**
or **File>Save a copy in Drive**

In [0]:
page(1)

In [0]:
page(2)

In [0]:
page(3)

In [0]:
page(4)

In [0]:
# Use seaborn for pairplot
!pip install seaborn


In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Install TensorFlow
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)


In [0]:
page(5)

In [0]:
page(6)

# Regression: predict fuel efficiency




In [0]:
page(7)

In [0]:
page(8)

Note: we are trying to get the smallest overall error for our model's predictions. This implies that **we expect some error in our machine learning model!**

In [0]:
page(9)

In a regression problem, we aim to predict a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, recognizing whether a picture contains an apple or an orange).

This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and builds a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.

This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

## The Auto MPG dataset
The dataset we are using is publicly available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

### Get the data
First download the dataset.

Then import it using pandas.

Pandas can import 14 different file types. This dataset is stored as a CSV file.<br>
[14 File Types you can Import with Pandas](https://www.cbtnuggets.com/blog/technology/programming/14-file-types-you-can-import-into-pandas)

In [0]:
page(10)

In [0]:
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

In [0]:
page(11)

In [0]:
page(12)

In [0]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
print("done")
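The `read_csv` options used above can be tried in isolation on a small inline sample. This is a minimal sketch: the three rows and the car names are made up for illustration, and only mimic the layout of `auto-mpg.data` (space-separated numbers, `?` for a missing value, and a tab before the car name).

```python
import io
import pandas as pd

# Three made-up rows laid out like auto-mpg.data (not real records).
sample = (
    '18.0   8   307.0   130.0   3504.   12.0   70   1\t"car a"\n'
    '25.0   4   98.00   ?       2046.   19.0   71   1\t"car b"\n'
    '27.0   4   97.00   88.00   2130.   14.5   70   3\t"car c"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

sample_df = pd.read_csv(
    io.StringIO(sample), names=column_names,
    na_values="?",          # treat "?" as a missing value (NaN)
    comment='\t',           # ignore everything after the tab (the car name)
    sep=" ",                # columns are separated by spaces
    skipinitialspace=True)  # runs of spaces count as a single separator

print(sample_df)
print(sample_df.isna().sum())
```

The `?` in the second row should show up as one missing value in the `Horsepower` column, and the car names should be dropped.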

In [0]:
page(13)

Print the first few rows of the dataset; this will help you understand your data better. 

In [0]:
dataset.head()

### Prepare the data

In [0]:
page(14)

### Clean the data

In [0]:
dataset.isna().sum()

For this tutorial, drop the rows that contain NA values. In some cases you may need to fill in (impute) missing values instead.
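As a rough illustration of the two options, the sketch below compares dropping rows with filling missing values on a tiny made-up frame. The values are hypothetical, and the column median is just one possible fill strategy, not something this notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not rows from the Auto MPG dataset.
toy = pd.DataFrame({
    'Horsepower': [130.0, np.nan, 88.0, 95.0],
    'Weight': [3504.0, 2046.0, 2130.0, 2372.0],
})

# Option 1 (used in this notebook): drop any row with a missing value.
dropped = toy.dropna()

# Option 2: fill each missing value with that column's median.
filled = toy.fillna(toy.median())

print(len(dropped), "rows after dropna")
print(filled)
```

Dropping rows is simpler but loses data; filling keeps every row at the cost of inventing a plausible value.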

In [0]:
dataset = dataset.dropna()
dataset.isna().sum()

In [0]:
dataset.describe()

You could also look at the number of times each value appears in a column. 

In [0]:
dataset['Cylinders'].value_counts()

Note (Jesse): The counts above are biased toward larger engines. Should we throw away the 3- and 5-cylinder engines? In this case, the business stakeholders said to keep them; otherwise, we would probably drop them.

### Convert categorical data

The `"Origin"` column is categorical, not numeric, so convert it to a one-hot encoding:

In [0]:
page(15)

In [0]:
origin = dataset.pop('Origin')
print("done")

In [0]:
dataset.tail()

In [0]:
dataset['USA'] = (origin == 1)*1.0
dataset['Europe'] = (origin == 2)*1.0
dataset['Japan'] = (origin == 3)*1.0
dataset.tail()
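The same encoding can also be produced with `pd.get_dummies`. This is an alternative sketch, not the method used in this notebook; the `Origin` codes below are made up, and the code-to-name mapping follows the cell above (1 = USA, 2 = Europe, 3 = Japan).

```python
import pandas as pd

# Made-up Origin codes for illustration.
origin_codes = pd.Series([1, 3, 2, 1], name='Origin')

# Map the numeric codes to names, then expand to one column per category.
origin_names = origin_codes.map({1: 'USA', 2: 'Europe', 3: 'Japan'})
one_hot = pd.get_dummies(origin_names, dtype=float)

print(one_hot)
```

Each row has a single 1.0 in the column for its origin. Note that `get_dummies` orders the columns alphabetically, which differs from the USA, Europe, Japan order used above.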

### Split the data into train and test sets
Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.

In [0]:
page(16)

In [0]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
print("done")
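To see why `sample` followed by `drop` gives a clean split, the sketch below repeats the same two calls on a made-up 10-row frame and checks that the two parts do not overlap. The frame and the `toy_` names are hypothetical.

```python
import pandas as pd

# A made-up 10-row frame standing in for the cleaned dataset.
toy = pd.DataFrame({'MPG': range(10, 20)})

toy_train = toy.sample(frac=0.8, random_state=0)  # a random 80% of the rows
toy_test = toy.drop(toy_train.index)              # whatever is left over

# The two index sets should not overlap and should cover every row.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

Fixing `random_state` makes the split reproducible, so rerunning the notebook selects the same training rows.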

### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

In [0]:
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
plt.show()

Have a quick look at the joint distribution of a few pairs of columns from the test set.

In [0]:
sns.pairplot(test_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
plt.show()

Also look at the overall statistics.<br> The stats for the training set and the test set should be similar.

In [0]:
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

In [0]:
test_stats = test_dataset.describe()
test_stats.pop("MPG")
test_stats = test_stats.transpose()
test_stats

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.

In [0]:
page(17)

In [0]:
page(18)

In [0]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
print("done")

### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

In [0]:
page(19)

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution that the model has been trained on.

In [0]:
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
print("done")

This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier. That includes the test set as well as live data when the model is used in production.
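The caution above can be sketched on a toy example: the mean and standard deviation come from the training rows only, and the same numbers are then reused for any new row. All values and the `toy_` names below are made up for illustration.

```python
import pandas as pd

# Made-up training rows with two features only.
toy_train = pd.DataFrame({'Weight': [2000.0, 3000.0, 4000.0],
                          'Horsepower': [80.0, 100.0, 150.0]})
toy_stats = toy_train.describe().transpose()  # one row per feature, like train_stats

def toy_norm(x):
    # Same formula as norm() above, using statistics from the training rows only.
    return (x - toy_stats['mean']) / toy_stats['std']

# A new, unseen car is scaled with the *training* mean and std, not its own.
new_car = pd.DataFrame({'Weight': [3000.0], 'Horsepower': [110.0]})

print(toy_norm(toy_train))
print(toy_norm(new_car))
```

Here the new car happens to sit exactly at the training mean for both features, so both normalized values come out as zero.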

## The model

In [0]:
page(20)

In [0]:
page(21)

### Define the model

In [0]:
inputs = len(train_dataset.keys())
print("number of inputs to the model = " + str(inputs))

def build_model():
  # Two hidden layers of 64 ReLU units, then one linear output unit for MPG.
  model = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation=tf.nn.relu),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])
  return model

# Printed once the function is defined (the original print sat after return and never ran).
print("done")

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on.

In [0]:
model = build_model()
print("done")

### Inspect the model

Use the `.summary` method to print a simple description of the model.

In [0]:
model.summary()
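The parameter counts in the summary can be checked by hand. The sketch below assumes 9 input features (the six numeric columns plus the three one-hot origin columns) and the 64, 64, 1 layer sizes from `build_model`; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    # Each unit has one weight per input plus one bias.
    return n_inputs * n_units + n_units

n_features = 9            # assumed feature count, see the lead-in
layer_sizes = [64, 64, 1]

total = 0
fan_in = n_features
for units in layer_sizes:
    count = dense_params(fan_in, units)
    print(f"Dense({units}): {count} parameters")
    total += count
    fan_in = units  # this layer's outputs feed the next layer

print("Total trainable parameters:", total)
```

Under these assumptions the three layers contribute 640, 4160, and 65 parameters, for a total of 4865.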

In [0]:
page(22)

### Train the model

Train the model for 1000 epochs, and record the training and validation loss and error metrics in the `history` object.

ML Vocabulary: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
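As a small worked example of that vocabulary: one epoch is a full pass over the training rows, and each iteration (step) processes one batch. The row count below is hypothetical; the batch size of 32 is the Keras default when `batch_size` is not passed to `fit`.

```python
import math

def steps_per_epoch(n_train_rows, batch_size=32):
    # One step processes one batch; the last batch may be smaller than the rest.
    return math.ceil(n_train_rows / batch_size)

n_train_rows = 250  # hypothetical number of rows actually used for training
epochs = 1000

steps = steps_per_epoch(n_train_rows)
print(steps, "steps per epoch")
print(steps * epochs, "weight updates in total")
```

With these made-up numbers, each epoch takes 8 steps (seven full batches and one partial batch), so 1000 epochs perform 8000 weight updates.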

In [0]:
page(23)

In [0]:
page(24)

In [0]:
page(25)

In [0]:
page(26)

In [0]:
page(27)

In [0]:
page(28)

In [0]:
page(29)

In [0]:
page(30)

In [0]:
page(31)

In [0]:
page(32)

In [0]:
page(33)

In [0]:
# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 100 == 0: print('')
    print('.', end='')

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[PrintDot()])

Visualize the model's training progress using the stats stored in the `history` object.

In [0]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [0]:
page(35)

In [0]:
page(36)

For more on reading learning curves, see [this guide to learning curves](https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/).


In [0]:
def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [MPG]')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.ylim([0,5])
  plt.legend()

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$MPG^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.ylim([0,20])
  plt.legend()
  plt.show()


plot_history(history)

This graph shows little improvement, or even degradation, in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition after every epoch. If a set number of epochs elapses without improvement, training stops automatically.

You can learn more about this callback [here](https://keras.io/callbacks/#earlystopping)
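The patience idea can be sketched in plain Python. This is a simplified illustration of the logic, not the Keras implementation (which has further options); `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls after the third epoch.
toy_losses = [10.0, 8.0, 7.0, 7.2, 7.1, 7.3, 7.4]
print(early_stopping_epoch(toy_losses, patience=3))
```

Here the best loss (7.0) is reached at epoch index 2, and the next three epochs fail to beat it, so training would stop at epoch index 5. A larger patience tolerates longer plateaus at the cost of extra training time.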

In [0]:
page(37)

You can also save time by training for far fewer epochs: now that you know where overfitting begins, stop just before that point.

In [0]:
model = build_model()

# The patience parameter is the number of epochs to wait for an improvement before stopping
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(normed_train_data, train_labels, epochs=EPOCHS,
                    validation_split = 0.2, verbose=0, callbacks=[early_stop, PrintDot()])

plot_history(history)

The graph shows that on the validation set, the average error is usually around +/- 2 MPG. Is this good? We'll leave that decision up to you.

Let's see how well the model generalizes by using the **test** set, which we did not use when training the model. This tells us how well we can expect the model to predict when we use it in the real world.

In [0]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=0)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))

### Test the model

Finally, predict MPG values using data in the testing set:

In [0]:
page(38)

In [0]:
page(39)

In [0]:
test_predictions = model.predict(normed_test_data).flatten()

plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100])
plt.show()

It looks like our model predicts reasonably well. Let's take a look at the error distribution.

In [0]:
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
plt.show()

It's not quite Gaussian, but we might expect that because the number of samples is very small.

In [0]:
page(40)

In [0]:
page(41)

In [0]:
page(42)

In [0]:
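import numpy as np

# A made-up input row with one value per feature, in the same column order as normed_train_data.
# The values are used as-is, so the model interprets them as already normalized.
trial = np.array([[1, 1, 1, 1, 1, 1, 1.0, 0.0, 0.0]])
trial_predictions = model.predict(trial).flatten()
print("done")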

In [0]:
print(trial_predictions)

## Conclusion

This notebook introduced a few techniques to handle a regression problem.

* Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems).
* Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).
* When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
* If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.
* Early stopping is a useful technique to prevent overfitting.
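
The two error measures named above can be computed by hand with NumPy. This is a minimal sketch on made-up true and predicted MPG values, not results from the model in this notebook.

```python
import numpy as np

# Made-up true and predicted MPG values for illustration.
y_true = np.array([20.0, 30.0, 25.0, 15.0])
y_pred = np.array([22.0, 27.0, 25.0, 16.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)     # squaring penalizes large misses more heavily
mae = np.mean(np.abs(errors))  # average miss, in the same units as the label

print("MSE:", mse, "MPG^2")
print("MAE:", mae, "MPG")
```

For these values the MSE is 3.5 MPG² and the MAE is 1.5 MPG. The MAE is easier to interpret because it is in the same units as the label.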

# Please provide feedback: 

[Intro to ML Programming Feedback](https://docs.google.com/forms/d/e/1FAIpQLScMf6j9h9Yxm5zdUhoSPsXEP_c5ruO2ZDmNYTDlW-9XKQ3Ogg/viewform?usp=pp_url)