# Lesson 1: Laptop Machine Learning

## Learning objectives of this lesson

* Get a quick refresher on what machine learning is
* Review some of the ways to do machine learning in Python
* Build random forests with scikit-learn
* Build boosted trees with XGBoost
* Build neural networks with Keras and TensorFlow

The purpose of this refresher is (1) to set the scene and (2) to give us some typical ML code that we can then productionize!

## What is Machine Learning?

Machine learning is the science and art of teaching computers to "learn" patterns from data. In some ways, we can consider it a subdiscipline of data science, which is often sliced into

* Descriptive analytics (BI, classic analytics, dashboards),
* Predictive analytics (machine learning), and
* Prescriptive analytics (decision science).

Machine learning itself is often sliced into

* Supervised learning (predicting a label, known as classification, or a continuous variable, known as regression),
* Unsupervised learning (pattern recognition for unlabelled data, a paradigm being clustering),
* Reinforcement learning, in which software agents are placed in constrained environments and given “rewards” and “punishments” based on their activity (AlphaGo Zero, self-driving cars). 

## Machine Learning: Classification

We're now going to jump in and build our first machine learning model. We'll use the (now) famous Iris dataset, where each row consists of measurements of a flower and the target variable (the one you're trying to predict) is the species of flower.

**On terminology:**

- The **target variable** is the variable you are trying to predict;
- The other variables, which you use to predict the target variable, are known as **features** (or **predictor variables**).
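
To make the terminology concrete, here is a minimal sketch that loads Iris with scikit-learn and separates the features from the target. The variable names `X` and `y` are a common convention rather than a requirement.

```python
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data    # features: one row per flower, one column per measurement
y = iris.target  # target variable: the species, encoded as an integer

print(iris.feature_names)  # names of the feature columns
print(iris.target_names)   # species names behind the integer codes
print(X.shape, y.shape)
```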

**On practice and procedure:**

To build machine learning models, you require two things:

- **Training data** (which the algorithms learn from) and
- An **evaluation metric**, such as accuracy.

For more on these, check out Cassie Kozyrkov's wonderful articles [Forget the robots! Here’s how AI will get you](https://towardsdatascience.com/forget-the-robots-heres-how-ai-will-get-you-b674c28d6a34) and [Machine learning — Is the emperor wearing clothes?](https://medium.com/@kozyrkov/machine-learning-is-the-emperor-wearing-clothes-928fe406fe09).

After training your algorithm on your training data, you can use it to make predictions on a _labelled_ **holdout** (or **test**) set and compare those predictions with the known labels to compute how well it performs.
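
As a rough sketch of the holdout procedure just described, the following splits Iris into training and test sets, fits a model on the former, and scores it on the latter. The 25% test size, the stratified split, and the decision tree are illustrative choices, not something the procedure prescribes.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Hold out 25% of the labelled rows; stratify keeps the species balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train on the training data only
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Compare predictions on the holdout set with the known labels
y_pred = clf.predict(X_test)
holdout_accuracy = accuracy_score(y_test, y_pred)
print(holdout_accuracy)
```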

You can also use a technique called **(k-fold) cross validation**, where you train and test several times using different holdout sets and compute the relevant accuracies (see more [here](https://en.wikipedia.org/wiki/Cross-validation_(statistics))). Image from Wikipedia:

![flow0](../img/cv.png)
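
To show what k-fold cross-validation does under the hood, here is a minimal sketch that writes the loop out by hand. It assumes 5 stratified, shuffled folds and a decision tree; in practice a helper such as `cross_val_score` wraps this loop for you.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Each of the 5 folds takes one turn as the holdout set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Fit a fresh model on the other 4 folds
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # Score it on the fold that was held out
    fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(fold_scores)
print(np.mean(fold_scores))
```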

Also note that the ML ingredients of *training data* and *evaluation metric* can introduce all types of biases and other problems into your ML algorithms, for example:

* If your training data is biased, your model more than likely will be;
* If you optimize solely for accuracy, what happens to groups that are under-represented in your training data?

The latter challenge follows from the broader class of problems we face when optimizing anything, as detailed by Rachel Thomas in ["The problem with metrics is a big problem for AI"](https://www.fast.ai/2019/09/24/metrics/):

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The problem with metrics is a big problem for AI<br>- Most AI approaches optimize metrics<br>- Any metric is just a proxy<br>- Metrics can, and will, be gamed<br>- Metrics overemphasize short-term concerns<br>- Online metrics are gathered in highly addictive environment</p>&mdash; Rachel Thomas (@math_rachel) <a href="https://twitter.com/math_rachel/status/1176606580264951810?ref_src=twsrc%5Etfw">September 24, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
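
The point about optimizing solely for accuracy can be illustrated with a small sketch. The labels below are synthetic and hypothetical (95 rows from a majority group, 5 from a minority group), and the "model" is a baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, hypothetical labels: 95 majority rows (0) and 5 minority rows (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # the features are irrelevant for this baseline

# A baseline that always predicts the most frequent class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

overall_accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)

print(overall_accuracy)  # looks good on paper
print(minority_recall)   # share of minority rows that were found
```

Accuracy alone rewards this baseline even though it never identifies a single minority row, which is why it helps to look at per-group metrics as well.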

### Typical Machine Learning code in Python

We'll now show how to build some typical ML models in Python for

* random forests,
* boosted trees, and
* neural networks (deep learning).

The intention is not to be exhaustive but rather to show typical code for the 3 most practical types of models that you will write. We won't go into the details of all of these models but we will link to relevant resources so you can explore to your heart's content!

#### Random Forests

[Random forests](https://scikit-learn.org/stable/modules/ensemble.html#forest) are powerful and commonly used ML algorithms. In the following, we

* Load our dataset,
* Instantiate three models: decision tree, random forest, and extra trees classifier, and
* Perform cross-validation for each model

Note that we're building more than just random forests here but we couldn't help ourselves as scikit-learn makes it so easy! These examples are from the [scikit-learn documentation](https://scikit-learn.org/stable/modules/ensemble.html#forest).

In [None]:
# Import the dataset loader, the cross-validation helper, and the tree-based models
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the dataset: X holds the features, y the target (species)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Single decision tree, scored with 5-fold cross-validation
clf_dt = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
    random_state=0)
scores_dt = cross_val_score(clf_dt, X, y, cv=5)
print(scores_dt)

# Random forest of 10 trees
clf_rf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores_rf = cross_val_score(clf_rf, X, y, cv=5)
print(scores_rf)

# Extremely randomized trees (extra trees), also with 10 trees
clf_et = ExtraTreesClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores_et = cross_val_score(clf_et, X, y, cv=5)
print(scores_et)


#### Boosted trees



[Boosted trees](https://en.wikipedia.org/wiki/Gradient_boosting) are similar to random forests, in that they're both ensembles of decision trees. They are built differently, however. You can read [here](https://medium.com/@aravanshad/gradient-boosting-versus-random-forest-cfa3fa8f0d80) about the differences.
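
To give a feel for the difference, here is a minimal sketch that puts a random forest next to a boosted ensemble on Iris. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for boosted trees so that the two can be scored in the same way; the hyperparameters are illustrative assumptions, not tuned values.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Random forest: trees are trained independently of one another on
# bootstrap samples, and their predictions are averaged
clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Gradient boosting: shallow trees are added one after another, each
# fitted to the errors left by the ensemble built so far
clf_gb = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, learning_rate=0.1, random_state=0
)

scores_rf = cross_val_score(clf_rf, X, y, cv=5)
scores_gb = cross_val_score(clf_gb, X, y, cv=5)
print(scores_rf.mean(), scores_gb.mean())
```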

We'll use XGBoost, which is a popular package for boosted trees:

In [None]:
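import xgboost as xgb
# read in data (LIBSVM-formatted text files); the ?format=libsvm suffix states
# the file format explicitly, which newer XGBoost releases require
dtrain = xgb.DMatrix('../data/agaricus.txt.train?format=libsvm')
dtest = xgb.DMatrix('../data/agaricus.txt.test?format=libsvm')
# specify parameters: shallow trees, binary classification, log loss for evaluation
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic', 'eval_metric': 'logloss'}
# number of boosting rounds (trees to add)
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make predictions: with binary:logistic these are probabilities of the positive class
preds = bst.predict(dtest)
print(preds)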
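
With the `binary:logistic` objective, `bst.predict` returns probabilities rather than class labels. The following sketch shows one way to turn them into labels and an error rate. The arrays are hypothetical stand-ins for `bst.predict(dtest)` and `dtest.get_label()`, and the 0.5 threshold is an assumption you may want to adjust.

```python
import numpy as np

# Hypothetical stand-ins for bst.predict(dtest) and dtest.get_label()
preds = np.array([0.28, 0.92, 0.28, 0.03, 0.75])
labels = np.array([0, 1, 0, 0, 1])

# Threshold the predicted probabilities to obtain class labels
pred_labels = (preds > 0.5).astype(int)

# Fraction of rows where the predicted label disagrees with the true one
error_rate = np.mean(pred_labels != labels)
print(pred_labels, error_rate)
```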

#### Neural nets and deep learning

The third type of algorithm we'll now build is a neural network (neural networks with many layers are what is usually meant by deep learning). These are:


- ML models inspired by biological neural networks.
- Performant for image classification, NLP, and more.




![flow0](../img/george.jpg)

Image from [here](https://www.pnas.org/content/116/4/1074/tab-figures-data).

When making predictions with neural networks, we use a procedure called **forward propagation**. When training neural networks (that is, finding the parameters, called weights), we use a procedure called **backpropagation**, which computes how the loss changes with each weight so that the weights can be updated. To put it another way,

- **forward propagation** is for prediction (`.predict()`);
- **backpropagation** is for training (`.fit()`), which also runs a forward pass at every step.
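
To make the two procedures less abstract, here is a minimal NumPy sketch of a network with one hidden layer, trained by plain gradient descent. The toy data, layer sizes, learning rate, and helper names (`forward`, `loss_fn`) are all assumptions made for illustration; libraries such as Keras do this for you, with many refinements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 2 features, label is 1 when their sum is positive
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Weights of a network with one hidden layer of 4 units
W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros((1, 1))

def forward(X):
    """Forward propagation: inputs -> hidden activations -> probability."""
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, p

def loss_fn(p, y):
    """Binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

learning_rate = 0.5
losses = []
for step in range(200):
    # Forward pass, also needed during training
    h, p = forward(X)
    losses.append(loss_fn(p, y))

    # Backpropagation: chain rule from the output back towards the input
    n = X.shape[0]
    d_out = (p - y) / n                       # gradient at the output pre-activation
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)

    # Gradient descent update of the weights
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(losses[0], losses[-1])
```

Prediction after training is just another call to `forward`, which is the part that `.predict()` corresponds to.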



The following is (somewhat) typical deep learning code. We're using Keras & TensorFlow (and the example is based on the [Keras documentation](https://keras.io/examples/vision/mnist_convnet/)) but you have many other options, such as PyTorch, fast.ai, JAX, and/or PyTorch Lightning.

In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()



In [None]:
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
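
Two steps in the code above are easy to gloss over: turning integer class labels into binary class matrices (one-hot encoding) and turning softmax outputs back into a predicted class. The NumPy sketch below mimics both. The label values and probability rows are hypothetical, and `np.eye` indexing is used here as a stand-in for `keras.utils.to_categorical`.

```python
import numpy as np

num_classes = 10

# Integer class labels (hypothetical digits)
labels = np.array([3, 0, 9])

# One-hot encoding: row i has a 1 in column labels[i] and 0 elsewhere
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot.shape)

# Hypothetical softmax outputs for two images: each row sums to 1
probs = np.full((2, num_classes), 0.02)
probs[0, 7] = 0.82
probs[1, 1] = 0.82

# The predicted class is the column with the highest probability
predicted = probs.argmax(axis=1)
print(predicted)
```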


## Lesson Recap

In this lesson, we covered the following:

* A quick refresher on what machine learning is
* Reviewing some of the ways to do machine learning in Python
* Building random forests with scikit-learn
* Building boosted trees with XGBoost
* Building neural networks with Keras and TensorFlow

In the next lesson, we'll take these machine learning workflows and see what it means to productionize them!