# P-ai AI/ML Workshop: Session 4

Welcome to P-ai's fourth session of the AI/ML workshop series! Today we'll learn about:
- Deep learning
    - Intuition behind neural nets
    - How to build and train a neural net with Tensorflow and Keras
    - Types of neural nets
- Key takeaways from the workshops thus far

<img src="https://images.squarespace-cdn.com/content/5d5aca05ce74150001a5af3e/1580018583262-NKE94RECI46GRULKS152/Screen+Shot+2019-12-05+at+11.18.53+AM.png?content-type=image%2Fpng" width="200px">

## 1. Feed-forward Neural Nets

### A simple example

Let's say we have a text, and we want to predict whether the language of the text is English, Spanish, or German. One simple way to attempt this would be to calculate the frequency of each letter in the alphabet, and use those frequencies to predict the language. For this example, we'll only use the 26 standard English letters, although in practice, that would be a very dumb decision since the presence of letters like `ñ` or `ü` would pretty much immediately give away the answer.

So, our features would look like:
- $x_1$: frequency of the letter `a`
- $x_2$: frequency of the letter `b`  
...
- $x_{26}$: frequency of the letter `z`

To build a little bit of intuition for neural nets, let's say we saw this string of text:

In [1]:
text = 'Mi abuela tiene noventa y cuatro años. Cuando era joven trabajó como enfermera en un hospital. Ahora le gustan las manualidades y hace pulseras para toda la familia. Por las mañanas sale a dar un paseo con sus amigas y por las tardes ve la televisión.'
# Remove special characters for the sake of example
text = text.replace('ñ', 'n')
text = text.replace('ó', 'o')
text

'Mi abuela tiene noventa y cuatro anos. Cuando era joven trabajo como enfermera en un hospital. Ahora le gustan las manualidades y hace pulseras para toda la familia. Por las mananas sale a dar un paseo con sus amigas y por las tardes ve la television.'

In [2]:
# Helpful for counting
from collections import Counter
# Case doesn't matter
text = text.lower()
# Punctuation doesn't matter
text = text.replace(' ','')
text = text.replace('.','')
text = text.replace(',','')
text_len = len(text)
counter = Counter(text)
# Sort entries
counts = dict(sorted(counter.items(), key=lambda item: -item[1]))
print('Letter frequencies\n------------------')
for letter in counts:
    print(f"{letter}: {counts[letter] / text_len}")

Letter frequencies
------------------
a: 0.19306930693069307
e: 0.09900990099009901
s: 0.08415841584158416
n: 0.07920792079207921
o: 0.07920792079207921
l: 0.06435643564356436
r: 0.0594059405940594
i: 0.04455445544554455
u: 0.04455445544554455
t: 0.04455445544554455
m: 0.034653465346534656
d: 0.0297029702970297
p: 0.0297029702970297
c: 0.024752475247524754
v: 0.019801980198019802
y: 0.01485148514851485
h: 0.01485148514851485
b: 0.009900990099009901
j: 0.009900990099009901
f: 0.009900990099009901
g: 0.009900990099009901


It's not exactly trivial to guess which language this is from the frequencies alone, but you might have a bit of intuition. For example, `t` is extremely common in English (the second most frequent letter, actually), so we might suspect this isn't English given that `t` is nearly halfway down the frequency list. In other words, the weight for $x_{20}$ (the frequency of the letter `t`) should be higher for English than for Spanish.

If we assume that the probability of the text being a certain language is linearly related to these frequencies, we could frame this problem like a logistic regression problem:

$$
w_{1e}x_1 + w_{2e}x_2 + \dots + w_{26e}x_{26} + b_e = \text{logit}_e
$$

Here, $x_1$ is the frequency of the letter `a` in our text, $w_{1e}$ is its respective weight (for English), and ditto for the rest of the input variables. $b_e$ is the bias.

> What's up with the logit?

You might remember from Workshop 2 that the output of the linear combination in logistic regression is called a **logit**; it's related to the probability of the input data belonging to a certain class, but it's not quite a probability yet (which should be apparent because it can be any real value, whereas probabilities must be between `0` and `1`). In logistic regression, we used the `sigmoid` to get from a logit to a probability, but we'll need to do something different for multiclass classification. We'll get to that in a second!

Notice that we'll have something very similar for Spanish and German, but with different weights and biases.

$$
\begin{aligned}
w_{1s}x_1 + w_{2s}x_2 + \dots + w_{26s}x_{26} + b_s &= \text{logit}_s \\
w_{1g}x_1 + w_{2g}x_2 + \dots + w_{26g}x_{26} + b_g &= \text{logit}_g
\end{aligned}
$$

Note how every feature ($x_1$ through $x_{26}$) gets 'sent' to all three outputs. Conversely, each output receives all the inputs. We can sketch our neural net like this:

<img src="images/nn_1.png" width="500px">

Notice the dimensions. Our input layer has shape `(26,)` (vector of size 26) and the output layer has shape `(3,)` (vector of size 3).
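To make those shapes concrete, here is a small NumPy sketch of the three linear equations above written as a single matrix-vector product. The weights are random placeholders (an untrained model), the input is a made-up frequency vector, and the names `W`, `b`, and `logits` are chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(26)
x = x / x.sum()                 # Stand-in for 26 letter frequencies (sums to 1)

W = rng.normal(size=(3, 26))    # One row of 26 weights per language (English, Spanish, German)
b = np.zeros(3)                 # One bias per language

logits = W @ x + b              # Each entry is w_1*x_1 + ... + w_26*x_26 + b for one language
print(logits.shape)             # (3,)
```

Each row of `W` plays the role of one of the three equations, which is why the weight matrix has shape `(3, 26)`.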

In [3]:
# Illustrate the point by creating the input vector from the example above
import numpy as np
x = np.zeros(shape=(26,), dtype=np.float32)                # Create a vector of zeros
for i, letter in enumerate('abcdefghijklmnopqrstuvwxyz'):  # Iterate through 26 letters
    if letter in counts:                                   # If the ith letter appeared in the text
        x[i] = counts[letter]                              # Set x_i to that letter's count
x = x / x.sum()                                            # Turn counts into frequencies
print("X_vector:\n", x)
print("Shape:", x.shape)

X_vector:
 [0.19306931 0.00990099 0.02475248 0.02970297 0.0990099  0.00990099
 0.00990099 0.01485149 0.04455446 0.00990099 0.         0.06435644
 0.03465347 0.07920792 0.07920792 0.02970297 0.         0.05940594
 0.08415841 0.04455446 0.04455446 0.01980198 0.         0.
 0.01485149 0.        ]
Shape: (26,)


Notice how we turned an input (text, in this case) into a vector. This vector would then go into the neural net during training (which is how the model gets $x_1$ through $x_{26}$).
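The text-to-vector step can also be wrapped in a small reusable helper. This is only a sketch: the function name `letter_frequencies` is made up for illustration, and it simply ignores any character outside `a` to `z` instead of replacing accented letters the way the cells above do.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return a list of 26 frequencies, one per letter a-z (hypothetical helper)."""
    # Keep only the 26 standard letters; case doesn't matter
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return [0.0] * 26            # No usable letters: return all zeros
    return [counts[ch] / total for ch in string.ascii_lowercase]

freqs = letter_frequencies("Hello, world!")
print(len(freqs))   # 26
print(freqs[11])    # Frequency of 'l': 3 of the 10 letters
```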

> What about the $y$ data? Why is it a vector of size 3?

For binary classification, our $y$ data can simply be a `1` or a `0`; either an input belongs to the class, or it does not. For multiclass classification, we think of having a separate probability for the input belonging to each class. Since our task has three languages we're trying to classify between, our output vector would have size 3. For example, the text we saw above was Spanish, so the correct $y$ would be `[0, 1, 0]`, assuming each value is the probability of the text being English, Spanish, and German, respectively (we could have picked any order; we just need to stick with one).
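As a quick illustration of this encoding, the sketch below builds the target vector for a given label. The `LANGUAGES` list and the `one_hot` helper are hypothetical names, and the class order is the one assumed in the paragraph above.

```python
LANGUAGES = ['english', 'spanish', 'german']   # Fixed class order (an arbitrary choice)

def one_hot(label):
    """Return a vector with a 1 in the position of `label` and 0 elsewhere."""
    return [1 if lang == label else 0 for lang in LANGUAGES]

print(one_hot('spanish'))   # [0, 1, 0]
```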

> Okay, what about that whole logit business?

Remember that, in logistic regression with one output variable, we used the sigmoid function to turn a logit into a probability. For **multiclass classification**, we need to use a *generalized* version of the sigmoid function, which is called **softmax**. Softmax takes in *multiple* logits and transforms them such that each one lies in the range `[0,1]` and together they sum to `1`. That is, it turns your outputs into probabilities of the input belonging to each class!
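Softmax itself is only a few lines. Here is a minimal NumPy sketch; the logits are made-up numbers, and subtracting the maximum is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # Avoids overflow in exp; the output is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([1.0, 3.0, 0.5])      # Made-up logits for English, Spanish, German
probs = softmax(logits)
print(probs)          # Three values between 0 and 1; the largest logit gets the largest probability
print(probs.sum())    # 1.0, up to floating-point rounding
```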

When a function is applied to the output of a layer (the weighted sums plus biases), it's called an **activation function**. So, we apply an activation function of `softmax` on our output layer to turn our logits into probabilities. There are other activation functions we might apply to layers in our neural net. For example, you can actually think of "doing nothing" as the linear activation function ($y=x$). A very common one is `ReLU`, which just turns any negative number into `0`. There are plenty of others, such as `sigmoid` (which we've seen), `tanh`, and `leaky ReLU`. [Here's](https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/) a helpful guide to activation functions and what they're used for.
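For a feel of what these functions do, here is a rough NumPy sketch of a few of them applied to the same made-up values. The leaky ReLU slope of `0.01` is an assumed default for illustration; libraries let you choose it.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)             # Negative values become 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)  # Negative values are scaled down instead of zeroed

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # Squashes values into (0, 1)

z = np.array([-2.0, -0.5, 0.0, 1.5])      # Made-up layer outputs
print(relu(z))
print(leaky_relu(z))
print(np.tanh(z))                         # Squashes values into (-1, 1)
print(sigmoid(z))
```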

> So to review what we have in this architecture:

This is a **neural network** with an **input layer**, an **output layer**, and **no hidden layers**. The input layer has **26 neurons** and the output layer has **3 neurons**. There are **weights** (multiplied with the inputs) that connect each neuron in the first layer to each neuron in the second layer, and **biases** (added to the weighted sum) for each neuron in the output layer. We apply the **softmax activation function** to the last layer of the neural net to output a probability vector.

<img src="images/nn_2.png" width="500px">

> So... how does it learn?

The model makes a certain number of predictions and calculates a **loss** (see Workshop 3 for a more detailed explanation of this). The model then uses **gradient descent** and a process called **backpropagation** (shortened to backprop or just BP) to iteratively adjust the weights and biases to improve the model's performance (minimize loss). Then, the model makes more predictions, and the whole process repeats.

Backpropagation is a pretty cool process; it's not magic, it's vector calculus! I won't go into the details right now, but [here](https://medium.com/coinmonks/backpropagation-concept-explained-in-5-levels-of-difficulty-8b220a939db5) is a cool article that explains backprop in 5 degrees of in-depth-ness.

Note that the weights and biases are called the **trainable parameters**, which should make sense, because they're parameters that are learned via training.
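To give a feel for "adjust the weights to reduce the loss", here is a minimal NumPy sketch of gradient descent on the no-hidden-layer model, using a single made-up training example. It relies on a standard result: for softmax followed by cross-entropy loss, the gradient of the loss with respect to the logits is `probs - y`. The learning rate, the number of steps, and the random input are arbitrary assumptions; in practice Keras handles all of this for you.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

rng = np.random.default_rng(0)
x = rng.random(26)
x = x / x.sum()                       # Made-up letter frequencies
y = np.array([0.0, 1.0, 0.0])         # True label: Spanish

W = np.zeros((3, 26))                 # Trainable weights
b = np.zeros(3)                       # Trainable biases
learning_rate = 0.5                   # Arbitrary choice for this sketch

losses = []
for step in range(100):
    probs = softmax(W @ x + b)                     # Forward pass: make a prediction
    losses.append(-np.sum(y * np.log(probs)))      # Cross-entropy loss
    grad_logits = probs - y                        # Gradient of the loss w.r.t. the logits
    W -= learning_rate * np.outer(grad_logits, x)  # Step against the gradient w.r.t. W
    b -= learning_rate * grad_logits               # Step against the gradient w.r.t. b

print(losses[0], losses[-1])          # The loss shrinks as training proceeds
```

With only one layer, "backpropagation" amounts to this single gradient formula; with hidden layers, the chain rule carries the same gradient backward through each layer in turn.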

### Hidden layers

This is all well and good, but we may suspect that the problem is more complex than pure linear combinations. This is what leads us to add **hidden layers** between our input and output layers, which allow the model to learn more complex relationships between variables. That would look something like this:

> Again, note the architecture:

Everything is the same as the previous example, except we now have **two hidden layers**. In our code, we would say how many neurons each of these hidden layers has. For example, the first could have 128 and the second could have 256 (powers of 2 are common). All of these layers are **densely connected**, meaning every neuron is connected to every neuron in the next layer.

> How would the code for this look?

Something like this:

In [4]:
import tensorflow
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Build model 
model = Sequential()
model.add(Dense(128, input_dim=26, activation='relu')) # Input layer and first hidden layer
model.add(Dense(256, activation='relu'))               # Second hidden layer
model.add(Dense(3, activation='softmax'))              # Output layer
# Choose optimizer and loss function
opt = tensorflow.keras.optimizers.Adam(learning_rate=0.001)
loss = 'categorical_crossentropy'
# Compile 
model.compile(optimizer=opt, 
    loss=loss,
    metrics=['accuracy'])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 128)               3456      
_________________________________________________________________
dense_7 (Dense)              (None, 256)               33024     
_________________________________________________________________
dense_8 (Dense)              (None, 3)                 771       
Total params: 37,251
Trainable params: 37,251
Non-trainable params: 0
_________________________________________________________________
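The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, neuron) pair plus one bias per neuron. A quick sketch (the helper name `dense_params` is made up for illustration):

```python
def dense_params(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

layer_sizes = [26, 128, 256, 3]   # Input, two hidden layers, output
params = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params)        # [3456, 33024, 771]
print(sum(params))   # 37251
```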


Other important keywords:

- The **batch size** is the number of data points the model sees before updating its internal parameters. If the batch size is 1, the model updates its weights after every data point it sees, and if it is 50, it only updates them after every 50 data points.
- An **epoch** is a complete run-through of the training data. If you train a model for 10 epochs, it sees all the data 10 times.
- A **hyperparameter** is a parameter that does *not* get updated by backpropagation and is therefore not learned by the model. Rather, it is chosen before training, as part of the architecture or the training setup. For example, the number of hidden layers, the size of the hidden layers, and the batch size are all hyperparameters.
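To see how batch size and epochs interact, the sketch below counts how many weight updates a training run would perform. The dataset size and the helper `iterate_minibatches` are made up for illustration; in Keras, these two settings correspond to the `batch_size` and `epochs` arguments of `model.fit`.

```python
import math

def iterate_minibatches(n_samples, batch_size):
    """Yield (start, end) index ranges that cover the data once (one epoch)."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size, epochs = 1000, 50, 10   # Made-up numbers
updates = 0
for epoch in range(epochs):
    for start, end in iterate_minibatches(n_samples, batch_size):
        updates += 1   # A real training loop would update the weights here

print(updates)                                     # 200
print(epochs * math.ceil(n_samples / batch_size))  # 200
```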

### Other neural nets!

There are many kinds of neural networks. Here are some examples:

- **Multilayer Perceptron (MLP)**: The 'vanilla' neural network, which is what we were just looking at!
- **Convolutional Neural Network (CNN)**: A type of neural network heavily used for 2D data, including images! [CNNs explained](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)
- **Long Short-Term Memory (LSTM)**: A type of RNN (recurrent neural network) with specially designed cells that work well with time-based data, like weather or stock prices! [LSTMs explained](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)

<img src="https://upload.wikimedia.org/wikipedia/commons/3/3b/The_LSTM_cell.png" height="600px" width="900px">

## 2. Key takeaways

<img src="https://images.squarespace-cdn.com/content/v1/56f19dfb4d088e32bdf80799/1587922501998-14ODTFZJ2BDTQO7YWVKE/ke17ZwdGBToddI8pDm48kF7XTateWj6md6x9JeEx8zB7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z5QPOohDIaIeljMHgDF5CVlOqpeNLcJ80NK65_fV7S1UcQ3LBX3nFqIgEwW1SNWcHb6uj2GNAWCrthT2xRQtbfxv31nTNa4bR0Fwgc9EJtadg/pasta%2B-%2B1%2B%25281%2529.jpg" height="400px">

*fig (1). Spaghetti thrown at a wall*

So what are the most important lessons from all four workshops? I've tried to summarize them all here:

- **Machine learning** comes in all kinds and flavors for all kinds of tasks. Some of the most common types of tasks are:
  - **Classification**: Assign a data point to a particular class
  - **Regression**: Assign a data point some continuous value
  - **Reinforcement learning**: Train an agent to interact with some environment

For classification and regression:

- **Understand** your problem before writing code!
  - Perform **research** to understand the problem better
  - **Explore** your data with tables and plots
  - Perform **feature engineering** to make your features as predictive as possible
- **Data is king**!
- **Select your model** with a reason, but don't be afraid to try a few different ones!
- Split your data into **training**, **validation**, and (if you have enough data) **test** sets to diagnose overfitting / test how generalizable your model is. For a slightly different approach, check out [k-fold cross validation](https://machinelearningmastery.com/k-fold-cross-validation/).
- While your model can learn trainable parameters, adjust your **hyperparameters** to see how they affect your model. There are also ways to automate testing different hyperparameters, like [grid search](https://scikit-learn.org/stable/modules/grid_search.html).
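As a rough illustration of the splitting advice, here is a plain-Python sketch of a shuffled training/validation/test split. The function name and the 70/15/15 proportions are assumptions for the example, not a recommendation; libraries such as scikit-learn provide ready-made utilities for this.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle `data` and split it into training, validation, and test lists."""
    items = list(data)
    random.Random(seed).shuffle(items)       # Shuffle a copy, reproducibly
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))   # 70 15 15
```

Shuffling before splitting matters when the data is ordered (for example, all English texts first), since otherwise one of the sets could end up with only one class.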

And most importantly...

- **Read!** Whenever you come across something you're not familiar with (which will probably be very often, unless you are [Ian Goodfellow](https://en.wikipedia.org/wiki/Ian_Goodfellow)), look it up! It's a great way to fill in your gaps in knowledge.
- **Code!** A hands-on approach is the best way for a human to learn how machines learn. [Kaggle](http://kaggle.com/) is a great place to get started.

## Happy learning!

<img src="https://wompampsupport.azureedge.net/fetchimage?siteId=7575&v=2&jpgQuality=100&width=700&url=https%3A%2F%2Fi.kym-cdn.com%2Fentries%2Ficons%2Ffacebook%2F000%2F028%2F926%2Fcove3.jpg">