
# AI Camp 2021 Kickoff
***

# Agenda
***
1. What to expect
2. Timeline
3. How we code
    * Jupyter notebooks
4. **Lecture 1**: Intro Deep Learning
5. Challenge 1: Minimal MNIST

# What to expect
***
The AI Camp is for Artificial Intelligence enthusiasts
* You have an interest in Data Science and math?

4 Mini-lectures on selected topics
* 4 Data science challenges, first one starting today!
    * Can you beat us?

Big joint competition in the last month
* It's us against the world

A platform for your own AI projects
* Write us!

# Timeline
***
- **KW 16**: (20.04) Kickoff, Intro lecture, Minimal MNIST Competition
    - KW 17: Q&A, Gradient Descent lecture, optional
- **KW 18**: Reinforcement Learning
    - KW 19: Q&A, optional
- **KW 20**: Guest lecture: Prof. Kacprowski (Data Science for Biomedicine), **TBD COMP**
    - KW 21: Q&A, optional
- **KW 22**: Comp Recap, NLP Lecture, Sentiment Analysis Competition
    - KW 23: Q&A, optional
- **KW 24: Presentation & selection of our grand competition**
    - KW 25 to 28: Working on our competition
- **KW 29: AI Camp wrap up, presenting the results**

# Regular date
***
We need a fixed day and time for our meetings.

Please check our doodle:

https://doodle.com/poll/wq8gvb3d762wr6aa


# Organizational
***
 - Lectures & Examples in Python
     - Frameworks: Tensorflow, Pytorch, sklearn, etc.
 - Slides & Code in <a href="https://jupyter.org/">Jupyter Notebooks</a>
 - Everything is located in Niklas' <a href="https://github.com/nikrruun/jupyter-notebooks/tree/aicamp2020">GitHub Repo</a>
     - Branch "aicamp2020"
 - Join our Discord:
     - https://discord.gg/77uHPAMt
     - All notifications happen here

# Why Jupyter Notebooks?
***
We want to present math and code for AI models
> We have PowerPoint for this!

But wouldn't it be nice if we could also
- Actually run the code during presentation
- Interact with it and observe changes
- Easily share it and ship to remote hardware
   


# You're in a simulation
***
This presentation is just code + markdown: 

In [None]:
import time
print("HELLO")
for i in range(5):
    print(f"2^{i} =", 2**i)
    #time.sleep(0.5)

This is a running python interpreter!

# Features 
***
* Inline LaTeX support: $e^yx_k\sum_{i=1}{2^{-i}}$
* Text narratives via markdown (essentially a readme.md)
* Usually, a notebook is a list of sequential code or text cells
    - But, with some effort it automatically translates into a slideshow
* Run a copy on <a href="https://colab.research.google.com">Google Colab</a> in < 1 minute

# How to run this notebook
***
Option 1: Google Colab
* Run everything in the cloud, everything comes pre-installed
    * The content is the same, but no fancy slides!

Option 2: Run it on your machine
* Try cloning the GitHub repo and running the setup (very WIP)
    * Currently only available on Windows
        * Docker support coming soon

# Google Colab walkthrough
***
1. Open the link <a href="https://colab.research.google.com">Google Colab</a>
2. Sign up & in, and open a new notebook
3. Go to tab "GitHub" and search for user "nikrruun". Choose as shown below:
<img src="img/slides/colab_github_menu.png">

# Google Colab (2)
***
4. Click on "<i>kickoff_presentation.ipynb</i>"
    - Generally, Google Colab lists all notebooks it finds in a given repository
5. Use the "Content" Navigation (on the left) or scroll down until you see this cell
6. Check if you can run this code cell:

In [None]:
data = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
print("".join([chr(x) for x in data]))

# So, why *python*?
***
Python is the current standard for ML and Data Science frameworks:
* Forced indentation
* Interpreted, not compiled
* Extremely slow

It is possible to expose C/C++/CUDA code to python
* Tons of highly optimized frameworks and libraries exist
    * numpy, tensorflow, pytorch, pycuda ...
* Performance still not perfect, but close
* Interpreted code allows for interactive and explorative programming!

# Lecture 1: Intro Deep Learning
***

# What is an artificial neural network?
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

An artificial neural network is a set of interconnected neurons
* In most modern frameworks, neurons are grouped into layers
    * Often, we talk about these layers rather than about single neurons
        * "The first hidden layer contains 100 neurons!"

In general, there are three kinds of layers of neurons:
* `input layer`: The input data on which our net operates, e.g. images
    * A single node represents a single value, e.g. a pixel
* `hidden layer`: Intermediate neurons between `input` and `output`
    * They receive the preceding layer's output as input
* `output layer`: Think a `hidden layer` without any successor
    * These are the "results" the net computed on the "input"

# Artificial neurons
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

They are the smallest unit in a neural network, loosely inspired by biological neurons
* Each has a set of input connections to some preceding neurons
    * Each connection is assigned a **weight** $w$, representing the \
    importance of the incoming signal for a neuron
* It *reacts* to incoming signals by *firing* an output signal:
    * It sums up all weighted inputs
    * Computes and outputs an **activation** based on the summed inputs
* The *way it reacts* to a certain input is determined by its **activation function**
    * The activation is inhibited by a **bias**

# Weights? Bias? Activation?
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Consider the top node in the hidden layer (blue circle):
* There are three incoming signals from three input nodes
    * Signals are just float-values!
* We want to assign different values of importance to each of the three signals
    * Let's call the input values $x_0,x_1,x_2\in\mathbb{R}$
        * With respective **weights** $w_0, w_1,w_2\in\mathbb{R}$
* By multiplying a weight $w_i$ with its signal $x_i$
    * We can express how $w_i$ influences the neuron's activation:
        * $w_i\gg 1$: $x_i$ will have significant positive impact
        * $w_i\approx0$: $x_i$ will have small to no impact
        * $w_i\ll -1$: $x_i$ will have significant negative impact

# Summing up
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Next, we need to sum up the weighted signals:

$\displaystyle w_0*x_0+w_1*x_1+w_2*x_2=\sum_{i}w_ix_i$
* This is the weighted input for our neuron!

The **bias** of a neuron represents some sort of threshold to overcome
* The higher this threshold, the harder it is to get a large activation
* In math: Just add it: $\sum_{i}w_ix_i + b$\
(Intuitively, you would probably rather subtract, but it\
has become the standard to instead add a negative value)

# Getting excited
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Lastly, we compute the output $y$ of a neuron by applying the activation function $f$

$\displaystyle y=f\left(\sum_{i}w_ix_i + b\right)$

Why do we need an activation function?
* Without it, our neuron will always be a linear function
    * This limits the kind of functions we can express
* Instead, we can freely control how our neuron should react to certain values
    * E.g. only fire if there is a positive input
        * $f_{relu}(x)=max(0,x)$

Activation functions allow us to introduce non-linearities into our models!
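
To make the point about linearity concrete, here is a minimal NumPy sketch. The layer sizes and the random weights are assumptions chosen only for illustration. It checks that two stacked layers without an activation function can always be merged into one linear layer, while a ReLU in between breaks that equivalence in general.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes (an assumption): 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two layers WITHOUT an activation function ...
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# ... compute the same as ONE linear layer with merged weights and bias
W_merged, b_merged = W2 @ W1, W2 @ b1 + b2
one_linear_layer = W_merged @ x + b_merged
print("Same result without activation:", np.allclose(two_linear_layers, one_linear_layer))

def relu(z):
    # f_relu(z) = max(0, z), applied element-wise
    return np.maximum(0, z)

# With a ReLU in between, the merged linear layer is in general no longer equivalent
with_relu = W2 @ relu(W1 @ x + b1) + b2
print("Output with ReLU:", with_relu)
```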

# Time for an example
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Let's say we have our three inputs\
$x_0=2,x_1=-3,x_2=0.2$

And our weights and the bias are set to\
$w_0=-10,w_1=0,w_2=50$, and $b=-30$

Further, we choose $f(x)=x^2$

The weighted input is then:\
$\sum_{i}w_ix_i + b=2*(-10)+(-3)*0+0.2*50 + (-30)=-40$

And plugging that into $f=x^2$ we get our output $y$:\
$y=f\left(-40\right)=1600$
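
The same calculation can be checked in a few lines of plain Python. The helper name `neuron_output` is made up for this sketch and is not part of any framework.

```python
def neuron_output(inputs, weights, bias, activation):
    # Weighted input: sum of w_i * x_i over all inputs, plus the bias b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function turns the weighted input into the neuron's output
    return activation(z)

x = [2, -3, 0.2]
w = [-10, 0, 50]
b = -30

z = neuron_output(x, w, b, activation=lambda v: v)       # identity: just the weighted input
y = neuron_output(x, w, b, activation=lambda v: v ** 2)  # f(x) = x^2 as chosen above
print("weighted input:", z, "output:", y)
```

Swapping the `activation` argument for another function, e.g. `lambda v: max(0, v)`, shows how strongly the choice of $f$ changes the output for the same weighted input.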

# The forward pass
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Computing a neuron's output based on its inputs is called a **forward pass**

We can repeat this procedure for each neuron from left to right:
* We can compute the state of the whole network
    * And get the output of the net!

What about the green nodes?
* They depend on the output of the blue neurons
* Run them last!

# Fancy functions
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

Think of an artificial neural network as a fancy way to express a function
* It has multiple input arguments, namely $x_0,x_1,\dots,x_n$
* To evaluate it, we have to:
    * Assign the corresponding values to the input layer
    * Apply layer-wise forward passes until the output layer is set
    * Return the output values of the network

# Fancy indeed
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

$\displaystyle y_{h_i}=f_{h_i} \left(\sum_{j}w_{ji}x_j + b_{h_i}\right)$

and\
$\displaystyle y_{o_k}=f_{o_k} \left(\sum_{i}w_{ik}y_{h_i} + b_{o_k}\right)$

$\displaystyle y_{o_k}=f_{o_k} \left(\left[\sum_{i}w_{ik}f_{h_i} \left(\sum_{j}w_{ji}x_j + b_{h_i}\right)\right] + b_{o_k}\right)$

$y_{o_k}$ is the output of the $k$-th output neuron in an arbitrary two-layer network
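
The nested formula maps directly onto two matrix products. Below is a rough NumPy sketch. The tiny network (3 inputs, 2 hidden neurons, 1 output), its weights, and the choice of ReLU for the hidden layer and identity for the output layer are all assumptions made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, W_h, b_h, W_o, b_o, f_h=relu, f_o=lambda z: z):
    # W_h[j, i] is the weight from input j to hidden neuron i (w_ji in the formula)
    y_h = f_h(x @ W_h + b_h)
    # W_o[i, k] is the weight from hidden neuron i to output neuron k (w_ik in the formula)
    y_o = f_o(y_h @ W_o + b_o)
    return y_o

x = np.array([1.0, 2.0, 3.0])
W_h = np.array([[1.0, -1.0],
                [0.0,  1.0],
                [1.0,  0.0]])
b_h = np.array([0.0, -5.0])
W_o = np.array([[2.0],
                [1.0]])
b_o = np.array([0.5])

y_o = forward(x, W_h, b_h, W_o, b_o)
print(y_o)  # hidden layer gives [4, 0] after ReLU, so the output is 4*2 + 0*1 + 0.5 = 8.5
```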

# What about those functions?
***
Neural networks can be *trained*
* By supervision: Learning from examples
    * E.g. regression & classification
* Unsupervised: Through its own experiences
    * E.g. reinforcement learning, clustering

It can be shown that ANNs can approximate any continuous function to an arbitrary degree of accuracy
* In theory, they could learn anything
    * Funnily, *how* to get such a perfect network is a much harder problem!
    
Today: Supervised learning!

# Why children are smarter than you think
***
The idea of supervised training is analogous to how you would teach a child:
* First, you would show some examples, e.g. in a book
    * "See, here is a cat!"
* Then, whenever you run across a cat, you'd ask the child:
    * "Hey, what animal is that?"
* If the child were wrong, you would correct it
    * The child might then think: "Oh, *that is a cat*!"
       * And hopefully learn from that

# Supervised training for ANNs
***
Supervised training roughly goes like this:
1. Initialize the network weights (*parameters*) randomly
2. Make a forward pass on some examples and obtain the network's outputs
    * E.g. pictures of handwritten digits $\rightarrow$ integers
3. Compare the net's predictions to the correct *label* of the example
    * Measure some sort of *error*, indicating correctness of prediction
4. Adapt the weights, such that the error gets smaller over time
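
The four steps above can be sketched for the smallest possible "network": a single linear neuron that should learn $y=2x+1$. The toy data, the learning rate, and the hand-written update rule (plain gradient descent on the mean squared error) are assumptions for this sketch. Real frameworks do step 4 for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (an assumption): examples x with labels y = 2x + 1
x = np.linspace(-1, 1, 50)
y_true = 2 * x + 1

# Step 1: initialize the parameters randomly
w, b = rng.normal(), rng.normal()
learning_rate = 0.1
losses = []

for step in range(200):
    # Step 2: forward pass on the examples
    y_pred = w * x + b
    # Step 3: compare predictions to the labels and measure an error
    error = y_pred - y_true
    losses.append(np.mean(error ** 2))
    # Step 4: adapt the parameters so that the error gets smaller
    w -= learning_rate * np.mean(2 * error * x)
    b -= learning_rate * np.mean(2 * error)

print("learned w, b:", w, b)
print("first loss:", losses[0], "last loss:", losses[-1])
```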



# Prerequisites for training
***
To run any training, we need:
* A set of examples, together with their label
    * E.g. a set of images, of which we know what kind of object they show
    * Called a **dataset**
* A notion of error which we would like to minimize
    * Usually called the **loss**
* A method that finds weights that lead to a small error (loss)
    * We are looking for an optimization method!
    * The current standard for ANNs is Gradient Descent
        * More on that later

# The loss function
***
We need a notion of error that tells us how far off a network's prediction is:
* "On the last 10 pictures I was slightly off on the first, but completely screwed up on the third"
    * Only in a digital version!
* What an error is, and how *bad* it is, usually depends on the task at hand

There are myriads of different loss functions for all sorts of tasks
* E.g. mean squared error (MSE) for regression/classification
    * Or cross-entropy, hinge, Kullback-Leibler divergence, absolute error ...
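
As a rough illustration, here are two of these losses written out in NumPy. The function names and the example vectors are made up for this sketch. The target vector marks the second of three classes as correct.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between target and prediction
    return np.mean((y_true - y_pred) ** 2)

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0); only the entry of the correct class contributes
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

target = np.array([0.0, 1.0, 0.0])
slightly_off = np.array([0.1, 0.8, 0.1])
screwed_up = np.array([0.7, 0.1, 0.2])

for name, pred in [("slightly off", slightly_off), ("screwed up", screwed_up)]:
    print(name,
          "MSE:", mean_squared_error(target, pred),
          "cross-entropy:", categorical_crossentropy(target, pred))
```

Both losses rank the two predictions the same way, but cross-entropy punishes the confident wrong answer much harder.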


# POV: You are a post office
***
We want to recognize handwritten digits to speed up our letter sorting machine!
* We want to *classify* each image into its integer category

Let's load the MNIST dataset of handwritten digits
* Keras has some built-in functions for that:

In [None]:
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

* `x_train`: 60k 28x28 images used for training
* `y_train`: 60k integer labels
* `x_test`: 10k 28x28 images used for testing
* `y_test`: 10k integer labels

What do these images look like? matplotlib shows us how:

In [None]:
import matplotlib.pyplot as plt
for i in range(5):
    img = x_train[i]
    plt.imshow(img)
    plt.title(f"x_train[{i}], label {y_train[i]}")
    plt.show()

# Classes are not continuous
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

How can we adapt our network to output a class?
* Idea: For each class we want to predict:
    * Dedicate a neuron in the output layer just for that class
        * Let the output be in the range $[0,1]$
            * By using the right activation function!
* Then, interpret $y_{o_k}$ as a probability of observing class $k$
    * E.g. $y_{o_1}=0.4$ would express, that the model is 40% sure\
    that the class with index $k=1$ is the correct class

# But what about our labels?
***
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/851px-Colored_neural_network.svg.png" style="max-width:25%;float:right;">

* Instead of showing the network which integer would have been correct
* We tell the network which output neuron should compute to probability $1$

Then, the labels are just binary vectors:
* 4 becomes $[0,0,0,0,1,0,0,0,0,0]$
    * This is called a one-hot encoding for class 4

The model might return a less clean vector:
* E.g. $y_o=[0.2,0.8,0,\dots,0]$
    * It says, "I think it is a $1$ with $80\%$ probability"
        * "Or a $0$ with $20\%$"
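
A small NumPy sketch of both directions: turning an integer label into a one-hot vector, and turning a model output back into a class. The helper `one_hot` is hypothetical and only for illustration.

```python
import numpy as np

def one_hot(label, num_classes=10):
    # All zeros, except a 1 at the index of the correct class
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

label_vector = one_hot(4)
print(label_vector)

# A "less clean" model output: the most probable class is the index of the largest entry
y_o = np.array([0.2, 0.8, 0, 0, 0, 0, 0, 0, 0, 0])
predicted_class = int(np.argmax(y_o))
print("predicted class:", predicted_class, "with probability", y_o[predicted_class])
```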

# Setting up a model in keras
***
We need to:
* Create an input layer, representing a $28\times 28$ image
    * Serialize to $784$
* Create a hidden layer
    * How many neurons?
    * Which activation function?
* Create an output layer
    * Exactly 10 neurons
    * Which activation function??

# Flattening the images
***
We need to serialize our images to 1D arrays
* Simple row-by-row is sufficient

In [None]:
import numpy as np
x_train = x_train.reshape(60000, 28*28)
x_test = x_test.reshape(10000, 28*28)
print(x_train.shape, x_test.shape)
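
To see what "row-by-row" means, here is a tiny sketch on a made-up $2\times 3$ "image". The `toy_` names are chosen so that nothing from the notebook gets overwritten.

```python
import numpy as np

toy_image = np.array([[1, 2, 3],
                      [4, 5, 6]])

# reshape(-1) concatenates the rows one after another
toy_flat = toy_image.reshape(-1)
print(toy_flat)

# For a whole batch, keep the first axis (the samples) and flatten the rest
toy_batch = np.zeros((5, 28, 28))
toy_flat_batch = toy_batch.reshape(len(toy_batch), -1)
print(toy_flat_batch.shape)

# Flattening loses nothing: reshaping back restores the original image
toy_restored = toy_flat.reshape(2, 3)
```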


# Converting the labels
***
Our labels `y_train` and `y_test` are still in integer format
* keras has built-in functionality to get binary vectors, too!

In [None]:
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
print(y_train.shape, y_test.shape)
print("Label example:", y_train[0])

# Creating the layers
***
Creating layers is straightforward with keras:

In [None]:
from tensorflow.keras.layers import Input, Dense

input_layer = Input(shape=(28*28,))
hidden_layer = Dense(units=16, activation="relu")(input_layer)
output_layer = Dense(10, "softmax")(hidden_layer) 

* `input_layer` holding $28\times28=784$ values
* `hidden_layer` with 16 neurons and a `relu` activation
    * $f_{relu}(x)=max(x,0)$
* `output_layer` with 10 neurons and the `softmax` activation
    * The `softmax` *squashes* the outputs into range $[0,1]$ and makes them sum to $1$
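
For intuition, a minimal NumPy version of the softmax could look like this. It is a sketch for a single vector, not the keras implementation, and the input scores are made up.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result, but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, "sum:", probs.sum())
```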

# Creating the model
***
keras organizes several layers into a *model*:

In [None]:
model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer="SGD", loss="MSE", metrics=["accuracy"])
print("Our model has", model.count_params(), "weights")

* `SGD` is short for *stochastic gradient descent*, the vanilla optimizer in keras
    * There are many others, give them a try!
* `MSE` is the *mean squared error* loss
* We add the `accuracy` metric, which measures how many digits we correctly classified
    * A `metric` has no influence on the training, as opposed to the `loss`
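
The number reported by `count_params` can be re-derived by hand: every `Dense` neuron has one weight per input plus one bias. A quick sanity check, with `dense_params` being a made-up helper:

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, neuron) pair, plus one bias per neuron
    return n_inputs * n_units + n_units

hidden_params = dense_params(28 * 28, 16)  # 784 * 16 + 16
output_params = dense_params(16, 10)       # 16 * 10 + 10
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```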

# Training the model
***
Training is started via the `fit` method:

In [None]:
_ = model.fit(x_train, y_train, epochs=1, batch_size=256, verbose=1)

* We train on the serialized images `x_train` and the binary labels `y_train`
* `epochs` is the number of times we run over all samples in `x_train`
* `batch_size` is the number of samples for each *mini-batch*
    * After every `batch_size` samples, the weights of the model are adapted
        * The weight update is based on the last `batch_size` samples!
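
A quick back-of-the-envelope calculation shows how many weight updates this means. It assumes the final, smaller batch is still used for an update, which to our knowledge is the default behavior of `fit` for in-memory arrays.

```python
import math

n_samples, batch_size, epochs = 60000, 256, 1

# The last batch of an epoch is smaller if batch_size does not divide n_samples
updates_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, "updates per epoch, last batch has", last_batch_size, "samples")
```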

# Testing the model
***
Have you wondered about the use of `x_test` and `y_test`?
* Training may collapse to perfect memorization (*overfitting*)
    * Every sample in `x_train` will be remembered perfectly
        * But unseen samples won't!
    * Think of the "peek-a-boo" game with babies!
        * By covering parts of your face, the still-developing brain can't *recognize* you

To check whether a model *generalizes* from the training data to unseen samples
* An evaluation on held-out data is crucial!
    * That's what `x_test` and `y_test` are used for
    * **Never** use those in training!

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
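
What the reported accuracy means can be reproduced by hand. The sketch below uses four made-up samples with three classes and `toy_` names, so it does not touch the real test data.

```python
import numpy as np

def accuracy(labels_one_hot, predicted_probs):
    # A prediction counts as correct if the most probable class matches the label
    return np.mean(labels_one_hot.argmax(axis=1) == predicted_probs.argmax(axis=1))

toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]])
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3],
                      [0.5, 0.3, 0.2],   # wrong: predicts class 0, label is class 2
                      [0.2, 0.7, 0.1]])

toy_acc = accuracy(toy_labels, toy_probs)
print("accuracy:", toy_acc)
```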

# Putting it all together
***

In [None]:
# Layers
input_layer = Input(shape=(28*28,))
hidden_layer = Dense(units=16, activation="relu")(input_layer)
output_layer = Dense(10, "softmax")(hidden_layer)

# Model
model = keras.models.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer="SGD", loss="MSE", metrics=["accuracy"])
print("Our model has", model.count_params(), "weights")

# Training
print("Training:")
model.fit(x_train, y_train, epochs=1, batch_size=256, verbose=1)

# Evaluation
print("Evaluation:")
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=1)

# Visualizing the predictions
***
Let's take a look at the kind of mistakes our model makes:

In [None]:
y_pred = model.predict(x_test)
classes = y_pred.argmax(axis=1)
wrong_indices = np.where(classes != y_test.argmax(axis=1))[0]
print("Model failed on samples:", wrong_indices)
print("Mis-classified", len(wrong_indices),"out of",len(y_test),"test samples")

In [None]:
fig, axes = plt.subplots(2,3, squeeze=True)
axes = axes.flatten()
for i in range(6):
    index = wrong_indices[i]
    pred = classes[index]
    axes[i].imshow(x_test[index].reshape(28,28))
    axes[i].set_title(f"Model says it is a {pred}")
plt.tight_layout()
plt.show()

# Challenge time!
***

1. Open the link <a href="https://colab.research.google.com">Google Colab</a>
2. Sign up & in, and open a new notebook
3. Go to tab "GitHub" and search for user "nikrruun". Choose as shown below:
<img src="img/slides/colab_github_menu.png">