# This Talk

- Problem definition -- AI
- The main framework of learning
- Knowledge roadmap
- More Examples


# Artificial Intelligence

## What is AI

- What makes AI 
    - Or may we start with: what looks like AI but does NOT stand up to scrutiny?
    - __Q__: which is the more challenging problem: auto-pilot from Sydney to Singapore, or auto-driving from school to the shops?
    - Decision-making machines: check this [very short introduction to robots (in Chinese)][AIClip].
[AIClip]:https://www.bilibili.com/video/av28946374/    

- Modelling the world (and hopefully do something about it)
    - Models: ideally, generally applicable patterns
        - Can we build simple rules like what Archimedes did for geometry (or Newton for classical physics)?
            - Yes! Why not try? Perhaps that was the idea of computers held by the first programmer!
            <img src="https://upload.wikimedia.org/wikipedia/commons/a/a4/Ada_Lovelace_portrait.jpg" alt="Ada" height="200" width="100">

> [The Analytical Engine] might act upon other things besides number, were objects found whose mutual fundamental relations could be expressed by those of the abstract science of operations, and which should be also susceptible of adaptations to the action of the operating notation and mechanism of the engine...Supposing, for instance, that the fundamental relations of pitched sounds in the science of harmony and of musical composition were susceptible of such expression and adaptations, the engine might compose elaborate and scientific pieces of music of any degree of complexity or extent ...


- Unfortunately, things do not go this way -- check [here][Minsky] for a powerful argument from Minsky.
    - Intelligent brains are the result of hundreds of millions of years of construction by evolution, so you would expect more structure
    - EXAMPLE-INTRO
[Minsky]:https://youtu.be/RZ3ahBm3dCk?t=1m28s

# EXAMPLE-INTRO

Consider data in which each sample consists of three numbers and an associated answer.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
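def triarea(a, b, gamma):
    """
    Area of a triangle from two sides and the included angle.

    :param a: length of one side
    :param b: length of the other side
    :param gamma: included angle, in degrees
    :return: the area, 0.5 * a * b * sin(gamma)
    """
    return 0.5 * a * b * torch.sin(gamma / 180.0 * np.pi)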

from matplotlib.patches import Polygon
#from matplotlib.collections import PatchCollection

def draw_sample(sample, ax):
    """
    Draw the triangle with sides a, b and included angle gamma (degrees) on ax.
    """
    a, b, gamma = sample.numpy()
    y = b*np.sin(gamma/180.0*np.pi)
    x = b*np.cos(gamma/180.0*np.pi)
    tri = Polygon(np.array([[0, 0], [a, 0], [x, y]]), closed=True)
    ax.add_patch(tri)

In [3]:
X = torch.rand(1000, 3)
X[:,0] *= 10
X[:,1] *= 10
X[:,2] *= 170
y = triarea(X[:,0], X[:,1], X[:,2])
TRAIN_NUM = 800
train_x, test_x = X[:TRAIN_NUM], X[TRAIN_NUM:]
train_y, test_y = y[:TRAIN_NUM], y[TRAIN_NUM:]

In [None]:
for i in range(2):
    for j in range(3):
        ax = plt.subplot(2,3,i*3+j+1)
        draw_sample(X[i*3+j], ax)
        ax.axis('off')
        ax.set_xlim([0,10])
        ax.set_ylim([-5,5])
        ax.set_title("Area={:.3f}".format(y[i*3+j].item()))

In [None]:
class ArchimedesNet(nn.Module):
    def __init__(self):
        super(ArchimedesNet, self).__init__()
        self.linear1 = nn.Linear(3, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 16)
        self.linear4 = nn.Linear(16, 1)
    
    def forward(self, x):
        h = torch.tanh(self.linear1(x))
        h = torch.tanh(self.linear2(h))
        h = torch.tanh(self.linear3(h))
        # areas are non-negative, hence the ReLU on the output
        y = F.relu(self.linear4(h))
        return y   

In [None]:
ar_net = ArchimedesNet()
optim = Adam(ar_net.parameters(), lr=0.01)

In [None]:
for i in range(3000):
    train_pred = ar_net(train_x).squeeze()
    loss = F.mse_loss(train_pred, train_y)
    optim.zero_grad()
    loss.backward()
    optim.step()
    if (i % 50 == 0 and i<200) or i%200 == 0:
        with torch.no_grad():  # evaluation only, no gradients needed
            test_pred = ar_net(test_x).squeeze()
            test_loss = F.mse_loss(test_pred, test_y)
        print("{}: loss {:.2f}, test-loss {:.6f}".format(i, loss.item(), test_loss.item()))


In [None]:
for y0, y1 in zip(test_y, test_pred):
    print("{:.3f}-{:.3f}".format(y0.item(), y1.item()))

There are a few aspects of this example to think about. It can almost be viewed as a _counter-example_ of statistical learning: the trend is inferred statistically although an exact formula is available, so no inference is needed.
- But what if the semantics of the attributes are not given?
- Then you are facing a collection of numbers with their respective desired answers, and the above scheme IS a way of extracting trends out of observations
- and HOPE that the trend generalises - a powerful, time-tested line of thought!
        
    <img src="ref/Justus_Sustermans_-_Portrait_of_Galileo_Galilei_1636.jpg" alt="Galileo" height="400" width="300">
        
>  Philosophy is written in this grand book, the universe ... It is written in the language of mathematics, and its characters are triangles, circles, and other geometric figures;....
> -- <cite>Galileo, _The Assayer_</cite>
        

## Learn to Generalise

With that hope, we face the greatest challenge / promise of machine learning - to generalise.

Let us check our empirical mathematician. First, let's check its response to each of the variables.

In [None]:
DOES_TEST_GENERALISE = True
X1 = torch.ones(1000, 3)
X1[:, 1] = 5.0
X1[:, 2] = 90.0 # right angled triangles
for i in range(1000):
    X1[i, 0] = 0.1 + 0.01*float(i)
    if i>900 and DOES_TEST_GENERALISE:
        X1[i, 0] += i*0.03
y1 = triarea(X1[:,0], X1[:,1], X1[:,2])
pred1 = ar_net(X1)

X2 = torch.ones(1000, 3)
X2[:, 0] = 5.0
X2[:, 2] = 90.0
for i in range(1000):
    X2[i, 1] = 0.1 + 0.01*float(i)
    if i>900 and DOES_TEST_GENERALISE:
        X2[i, 1] += i*0.01
y2 = triarea(X2[:,0], X2[:,1], X2[:,2])
pred2 = ar_net(X2)

X3 = torch.ones(1000, 3)
X3[:, 0] = 3.0
X3[:, 1] = 4.0
for i in range(1000):
    X3[i, 2] = 0.01 + float(i)*175.0/1000
y3 = triarea(X3[:,0], X3[:,1], X3[:,2])
pred3 = ar_net(X3)


In [None]:
plt.plot(X1[:, 0].detach().numpy(), pred1.detach().numpy(), 'b')
plt.plot(X1[:, 0].detach().numpy(), y1.detach().numpy(), 'r')
plt.legend(['Pred', 'Ground-truth'])
plt.xlabel("Length: a")
plt.ylabel("Area")

In [None]:
plt.plot(X2[:, 1].detach().numpy(), pred2.detach().numpy(), 'b')
plt.plot(X2[:, 1].detach().numpy(), y2.detach().numpy(), 'r')
plt.legend(['Pred', 'Ground-truth'])
plt.xlabel("Length: b")
plt.ylabel("Area")

In [None]:
plt.plot(X3[:, 2].detach().numpy(), pred3.detach().numpy(), 'b')
plt.plot(X3[:, 2].detach().numpy(), y3.detach().numpy(), 'r')
plt.legend(['Pred', 'Ground-truth'])
plt.xlabel("Angle: gamma (degrees)")
plt.ylabel("Area")

__Q__: The true world model is, of course, $$
A = \frac{1}{2} a b \sin(\gamma)
$$
However, can you think of any generalisability issue for the learned model?
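
One way to start probing this question is to look at what a tanh network can output at all. The sketch below uses a small stand-in network rather than the trained `ar_net`; the name `TinyTanhNet` and its layer sizes are assumptions made for illustration. Because every hidden activation lies in $[-1, 1]$, the output stays bounded however large the inputs become, whereas the true area keeps growing with $a$ and $b$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyTanhNet(nn.Module):
    """Hypothetical stand-in: one tanh hidden layer, one linear output."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(3, 16)
        self.out = nn.Linear(16, 1)

    def forward(self, x):
        return self.out(torch.tanh(self.hidden(x)))

net = TinyTanhNet()

# tanh outputs lie in [-1, 1], so |output| <= sum|w_out| + |b_out|.
bound = (net.out.weight.abs().sum() + net.out.bias.abs().sum()).item()

# Right-angled triangles with sides far outside the range [0, 10].
far_inputs = torch.tensor([[1e3, 1e3, 90.0],
                           [1e6, 1e6, 90.0]])
with torch.no_grad():
    max_abs_output = net(far_inputs).abs().max().item()

# The true area at the same inputs (sin of 90 degrees is 1).
true_area = 0.5 * far_inputs[:, 0] * far_inputs[:, 1]

print("output bound: {:.3f}, largest |output|: {:.3f}".format(bound, max_abs_output))
print("largest true area: {:.1f}".format(true_area.max().item()))
```

The same argument applies to deeper tanh stacks: only the last linear layer determines the bound.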

# Learning from Data Framework


<img src="ref/learning.png" alt="LearningFramework" height="400" width="500">

The major players:
- data
- models (hypotheses)
- algorithm
- selection criterion

__Let's understand the data first!__

<span style="color:blue">__DATA__:$\mathcal{D}$</span>

If we strip away all "semantics/domain-specific interpretations/conceptual icing-on-the-cake-of-theory", each object of interest boils down to a bunch of numbers in any analytics on a digital computer ([Cool works](https://www.youtube.com/watch?v=Ecvv-EvOj8M) have revealed evidence supporting that the brain works in a similar way, too!). 

We call such a collection of numbers a _sample_. Each of the numbers is an _attribute_. Say each sample contains $p$ attributes.

Some alternative names for these concepts imply interesting views of the data. 
- Alternatively, samples are called data points or sample points. It is not hard to imagine that "point" alludes to the geometric understanding of the data as living in a $p$-dimensional space, each sample corresponding to a point in the space. The point is determined by its coordinates, which are just the numbers making up that sample. 
- Note that an attribute is more generic than a number in a particular sample -- it refers to that number across all possible samples. We can take an attribute-centric view of the data, where each sample is an instance of the $p$ _random variables_.

The two views (geometric and stochastic) of the data are non-exclusive. For example, you can think of the data to be analysed as some cloudy stuff spreading (distribution) across the $p$-dimensional space. While the distribution is both unknown and difficult to describe (even if we think for a while it were known), all the analyser can get her hands on is a number of points scattered within the "distribution cloud", and the points are our sample points. Here our adventure begins.
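
As a rough illustration of the two views, the sketch below stores a handful of made-up samples as the rows of a matrix. The sizes `n` and `p` and the random values are assumptions chosen only for the example. Reading a row gives the geometric view (a point with $p$ coordinates); reading a column gives the stochastic view (one random variable observed once per sample).

```python
import torch

torch.manual_seed(0)

n, p = 5, 3                      # 5 samples, 3 attributes (illustrative sizes)
data = torch.rand(n, p)          # each row is one sample

# Geometric view: a sample is a point in p-dimensional space, and its
# coordinates are the numbers making up the sample.
point = data[0]                              # shape (p,)
distance = torch.dist(data[0], data[1])      # Euclidean distance between two points

# Stochastic view: an attribute is a column, i.e. one random variable
# observed once per sample; we can summarise its empirical distribution.
attribute = data[:, 2]                       # shape (n,)
attr_mean, attr_std = attribute.mean(), attribute.std()

print(point.shape, attribute.shape, distance.item(), attr_mean.item(), attr_std.item())
```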

__Q__: what is the difference between training and test data in the above example (or, if you are familiar with the area, in general)?

__Q__: can you talk about what makes a data model?

In practical analytics, when we talk about _models_, we actually operate with another concept -- the _model family_. Each model is a very particular theory about the relationship between the attributes of samples and the corresponding targets. Such a theory is called a _hypothesis_. However, any particular hypothesis may be of little use in practice: say you have a linear model with $w_1=0.1573, w_2=-1.2856$ and $b = -0.0021$ for some data of two attributes; it is hard to imagine that this exact model is optimal for any practical 2D data analytic task. So in the business of data analytics, we adopt the _learning_ approach -- instead of focusing on an individual data model, we define a _model family_, which consists of many, often infinitely many, hypotheses. 
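
The distinction between one hypothesis and a model family can be sketched in a few lines. The helper `linear_hypothesis` is a hypothetical name introduced here: calling it with specific parameters yields one hypothesis, while the function itself, ranging over all parameter values, stands for the whole family of two-attribute linear models.

```python
import torch

def linear_hypothesis(w1, w2, b):
    """Return one specific hypothesis h(x) = w1*x1 + w2*x2 + b."""
    def h(x):
        return w1 * x[:, 0] + w2 * x[:, 1] + b
    return h

# One particular hypothesis, using the parameter values quoted in the text.
h_fixed = linear_hypothesis(0.1573, -1.2856, -0.0021)

# Another member of the same family: same functional form, other parameters.
h_other = linear_hypothesis(1.0, 1.0, 0.0)

x = torch.tensor([[1.0, 2.0], [0.0, 0.0]])   # two made-up samples
print(h_fixed(x))   # predictions of the fixed hypothesis
print(h_other(x))   # predictions of another hypothesis from the family
```

Learning then amounts to choosing the arguments of `linear_hypothesis` from data instead of fixing them by hand.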

<span style="color:blue">__HYPOTHESIS SPACE__:$\mathcal{H}$</span>

__Targets__

Recall our question above. The target of analytics should be given for example samples. This piece of information, or the lack of it, defines many different types of machine learning. If you think about the situation carefully, there is no essential distinction between "targets" and "attributes" from the point of view of an object. The distinction is made on the model learner's side. Targets are a special set of attributes, which are accessible during the stage when we adjust our model. Once the model has been fixed and deployed, the "target" variables are no longer available to the model; true to their name, they become the "target" of the model's prediction. 

From the various ways the targets are given (or missing) arise supervised (we will see shortly), unsupervised, reinforcement, transfer, ..., learning schemes.

Given the hypotheses in some $\mathcal{H}$, and in the light of data samples drawn from $\mathcal{D}$, we perform some process to pick the one that fits best. The process is called _learning_.

<span style="color:blue">__LEARNING ALGORITHM__:$\mathcal{A}$</span>

The purpose of learning is of course to make the model's prediction of the targets, given the observable attributes, better aligned with the true data. Technically, we need a single criterion; you can consider it the "KPI" of the model. We optimise over the model parameters (i.e. pick a specific model/hypothesis from $\mathcal{H}$) to obtain the best KPI measurement on the _training set_ of data. By convention, the criterion is often formulated as the discrepancy between the desired target and the model prediction, which is to be _minimised_.

<span style="color:blue">__LOSS__:$\mathcal{L}$</span>.

By now we have encountered all elements in learning-based machine intelligence $(\mathcal{D}, \mathcal{H}, \mathcal{L}, \mathcal{A})$ -- if you accept the view of intelligence as "capable of summarising from past experience to shape behaviour for future reward". 
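
To make the four elements concrete, here is a minimal sketch that maps each of them to one piece of code on a toy regression task. The data-generating relation, the choice of a linear family, the mean-squared-error loss and plain gradient descent are all assumptions picked for brevity, not the only possible choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# D: samples drawn from a data source (here a made-up noisy linear relation).
x = torch.rand(200, 2)
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 0.5 + 0.01 * torch.randn(200)

# H: the model family, here all linear maps from 2 attributes to 1 target.
model = nn.Linear(2, 1)

# L: the loss, the discrepancy between prediction and target.
def loss_fn(pred, target):
    return F.mse_loss(pred, target)

# A: the learning algorithm that searches H for a low-loss hypothesis.
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x).squeeze(1), y).item()
for _ in range(1000):
    optimiser.zero_grad()
    loss = loss_fn(model(x).squeeze(1), y)
    loss.backward()
    optimiser.step()
final_loss = loss_fn(model(x).squeeze(1), y).item()

print("loss before: {:.4f}, after: {:.4f}".format(initial_loss, final_loss))
```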

__Q__
Can you identify a key assumption in the entire formulation?

# Knowledge Mindmap
I provide a knowledge map [here](https://sketchboard.me/xA4SQKJWSZcd#/).

# EXAMPLE-HAND-WRITTEN DIGITS

In [None]:
%matplotlib inline
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from torch.optim import Adam
import os
import numpy as np
import matplotlib.pyplot as plt

DDIR = os.path.expanduser("~/data/common")
if not os.path.exists(DDIR):
    os.makedirs(DDIR)
mnist_transform = transforms.Compose([transforms.ToTensor()])
mnist_trainset = torchvision.datasets.MNIST(
    root=DDIR, train=True, download=True,
    transform=mnist_transform)
mnist_trainloader = torch.utils.data.DataLoader(
    mnist_trainset, batch_size=32,
    shuffle=True, num_workers=2
)
mnist_testset = torchvision.datasets.MNIST(
    root=DDIR, train=False, download=True,
    transform=mnist_transform)
mnist_testloader = torch.utils.data.DataLoader(
    mnist_testset, batch_size=32
)

def show(img):
    # copy so that the normalisation below does not modify the tensor in place
    npimg = img.detach().numpy().copy()
    npimg -= npimg.min()
    npimg /= npimg.max()
    if npimg.shape[0] in [3,4]:
        plt.imshow(np.transpose(npimg, (1,2,0)), interpolation='nearest')
    else:
        plt.imshow(npimg.squeeze(), interpolation='nearest', cmap='gray')

In [None]:
for x, y in mnist_trainloader:
    break
show(x[0])

Let us construct a _convolutional_ neural network for the image data. Here we have a [nice animation](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md) for intuitive interpretation.

In [None]:
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)

In [None]:
p_ = list(conv.parameters())
print("There are {} parameter objects.".format(len(p_)))
print(p_[0])

In [None]:
# Hand-set a horizontal-edge kernel. The weights require gradients, so
# in-place edits must happen under torch.no_grad().
with torch.no_grad():
    p_[0][0, 0] = torch.tensor([[ 1.0,  1.0,  1.0],
                                [ 0.0,  0.0,  0.0],
                                [-1.0, -1.0, -1.0]])
print(p_[0])

Let us check the effect of the operation of `conv` on our image.

In [None]:
h = conv(x)
show(x[0])
plt.show()
show(h[0])
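
To see what such a kernel computes without looking at a digit, the sketch below applies the same nine weights to a tiny made-up image using `F.conv2d`. The 5x5 image, bright in its top two rows and dark below, is an assumption chosen so the result can be checked by hand.

```python
import torch
import torch.nn.functional as F

# The same 3x3 kernel as above: +1 on the top row, -1 on the bottom row.
kernel = torch.tensor([[ 1.0,  1.0,  1.0],
                       [ 0.0,  0.0,  0.0],
                       [-1.0, -1.0, -1.0]]).view(1, 1, 3, 3)

# A made-up 5x5 image: bright (1) in the top two rows, dark (0) below.
image = torch.zeros(1, 1, 5, 5)
image[0, 0, :2, :] = 1.0

# conv2d slides the kernel over the image (no padding), giving a 3x3 map.
# Each output is (sum of the window's top row) - (sum of its bottom row).
response = F.conv2d(image, kernel)
print(response[0, 0])
```

The response is 3 wherever the window has a bright top row and a dark bottom row, and 0 where the window is uniformly dark, which is why this kernel acts as a horizontal-edge detector.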

However, the point of _learning_ is to automatically discover meaningful processing steps (recall our old friend, the empiricist "mathematician"), and at a much more complex level.

In [None]:
class HandwrittenDigitNet(nn.Module):
    def __init__(self):
        super(HandwrittenDigitNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, 
                               out_channels=64, 
                               kernel_size=3)
        self.conv2 = nn.Conv2d(in_channels=64, 
                               out_channels=128, 
                               kernel_size=3)
        self.linear = nn.Linear(3200, 10)
        
    def forward(self, x):
        """
        :param x: a batch of images
        """
        h = self.conv1(x)
        h = F.leaky_relu(h, 0.2, inplace=True)
        h = F.max_pool2d(h, 2)
        h = self.conv2(h)
        h = F.leaky_relu(h, 0.2, inplace=True)
        h = F.max_pool2d(h, 2)
        h = self.linear(h.view(h.shape[0], 3200))
        h = F.log_softmax(h, dim=1)
        return h
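
The number 3200 in the linear layer follows from the sizes of the intermediate feature maps. The sketch below traces a dummy batch through the same sequence of convolution and pooling layers to confirm the arithmetic; it assumes 28x28 single-channel inputs, as in MNIST.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

x = torch.zeros(2, 1, 28, 28)          # a dummy batch of two MNIST-sized images
with torch.no_grad():
    h = conv1(x)                       # 28 -> 26 (3x3 kernel, no padding)
    shape_after_conv1 = tuple(h.shape)
    h = F.max_pool2d(h, 2)             # 26 -> 13
    h = conv2(h)                       # 13 -> 11
    h = F.max_pool2d(h, 2)             # 11 -> 5 (sizes are rounded down)
    shape_before_linear = tuple(h.shape)

flat_features = h[0].numel()           # 128 channels * 5 * 5
print(shape_after_conv1, shape_before_linear, flat_features)
```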

In [None]:
mnist_net = HandwrittenDigitNet()
mnist_optim = Adam(mnist_net.parameters(), lr=1e-4)

In [None]:
epoch = 0
last_test_accu = 0.1
small_improve_evals = 0
max_epoch = 100
max_small_imporve_evals = 20
evaluate_every_n_steps = 10
stop_cond = False
while not stop_cond:
    it = 0
    for X, y in mnist_trainloader:
        mnist_optim.zero_grad()
        h = mnist_net(X)
        loss = F.nll_loss(h, y)
        loss.backward()
        mnist_optim.step()
        it += 1
        if it % evaluate_every_n_steps==0:
            correct_num = 0
            with torch.no_grad():  # evaluation only, no gradients needed
                for Xt, yt in mnist_testloader:
                    test_h = mnist_net(Xt)
                    test_c = torch.argmax(test_h, dim=1)
                    correct_num += torch.sum(test_c == yt).item()
            test_accu = float(correct_num) / len(mnist_testset)
            imporve_rate = (1.0-last_test_accu) / (1.0-test_accu) - 1.0
            if imporve_rate < 0.05:
                small_improve_evals += 1
            else:
                small_improve_evals = 0
            last_test_accu = max(test_accu, last_test_accu)
            print("Epoch {}, iteration {} "
                  "train loss {:.3f} test accuracy {:.3f} "
                  "imp {:.3f} no-improve-steps {}".format(
                      epoch, it, loss.item(), test_accu, 
                      imporve_rate,
                      small_improve_evals))
            if small_improve_evals >= max_small_imporve_evals:
                stop_cond = True
                break
                
    epoch += 1
    stop_cond = stop_cond or epoch >= max_epoch
    # you can validate the model on test data here, try
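
The stopping rule in the loop above compares error rates rather than accuracies. The sketch below isolates that expression in a hypothetical helper, `improvement_rate` (a name introduced only for this illustration), and evaluates it on a few made-up accuracy pairs to show which cases fall under the 0.05 threshold.

```python
def improvement_rate(last_accu, new_accu):
    """Old error rate divided by new error rate, minus one.

    Mirrors the expression used in the training loop above.
    """
    return (1.0 - last_accu) / (1.0 - new_accu) - 1.0

# Error rate halves (10% -> 5%): a large improvement, which resets the counter.
big = improvement_rate(0.90, 0.95)

# Error rate barely moves (5.0% -> 4.9%): below the 0.05 threshold.
small = improvement_rate(0.950, 0.951)

# Accuracy gets worse: the rate is negative, which also counts as "small".
worse = improvement_rate(0.95, 0.94)

print(big, small, worse)
```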

# EXAMPLE CYCLE-GAN

We can expand the above scheme to a much larger scale.
- many more layers
- output many more answers for each data sample -- e.g. 
> we can output a "fake" image, entirely produced by the data model, while each pixel of the fake image is an answer!

<img src="ref/cgan.png" alt="CGAN" height="300" width="800">

In [None]:
import os
import urllib.request
import utils.cganimstyler as cim
import matplotlib.pyplot as plt
%matplotlib inline

AVAILABLE_TARGET_STYLES = [
    "apple2orange", "orange2apple", 
    "summer2winter_yosemite", "winter2summer_yosemite", 
    "horse2zebra", "zebra2horse", "monet2photo", 
    "style_monet", "style_cezanne", "style_ukiyoe", 
    "style_vangogh", "sat2map", "map2sat", 
    "cityscapes_photo2label", "cityscapes_label2photo", 
    "facades_photo2label", "facades_label2photo", "iphone2dslr_flower"
]

TARGET_STYLE = AVAILABLE_TARGET_STYLES[10]
print("TARGET_STYLE: ", TARGET_STYLE)
# download trained style-conversion models
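model_path = "checkpoints/saved_style_models/" + TARGET_STYLE + ".pth"
if not os.path.exists(model_path):
    # make sure the target directory exists before downloading
    os.makedirs(os.path.dirname(model_path), exist_ok=True)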
    urllib.request.urlretrieve(
        "http://efrosgans.eecs.berkeley.edu/cyclegan/pretrained_models/" + \
        TARGET_STYLE + ".pth",
        model_path)

# build the style model
netG = cim.load_generator_from(model_path)

In [None]:
im = cim.load_image('data/Jun.jpeg') # Put your own image here!
res = netG(im)
npim = cim.tensor2im(im)
res_npim = cim.tensor2im(res)

In [None]:
plt.subplot(1,2,1)
plt.imshow(npim)
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(res_npim)
plt.axis('off')
plt.show()

# EXAMPLE A3C

This represents a family of algorithms that simultaneously collect experience while training a reinforcement learning agent.

## Problem Definition
We build a program that can play video games:
```
Input: Game-Environment (env)
Output: Game-Policy (policy)
```

__env__:
```
Input: action (0~k, say, 2)
Output: screen-image, reward, game-is-over
```

__policy__:
```
Input: screen-image
Output: action
```

It is not difficult to imagine how a policy “plays” an env. The goal is to design the policy's self-inspection and adjustment scheme (call this a meta-policy if you like), so that the total reward it receives in a game is maximised. Note that this setting can be more generic than you might think:
- the game can give a constant positive reward at each step simply to encourage the player to keep playing as long as possible, which makes sense in some balancing or jumping games. In some games that can even mean discouraging winning, such as in ball games!
- the game can give a constant negative reward at each step to encourage quick play, such as in a maze game without a suicide option; this is equivalent to saying "hurry up!"
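
The env/policy interface and the effect of a constant per-step reward can be sketched with a toy game. Everything below is hypothetical: `QuitOrContinueEnv`, the two fixed policies and the `play` helper are invented for illustration and are not part of any game library.

```python
class QuitOrContinueEnv:
    """Hypothetical toy game: action 1 keeps playing, action 0 ends the game.

    Every step (including the last one) pays the same constant reward, and
    the game also ends by itself after `horizon` steps.
    """
    def __init__(self, step_reward, horizon=10):
        self.step_reward = step_reward
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                       # the "screen" is just the step counter

    def step(self, action):
        self.t += 1
        done = (action == 0) or (self.t >= self.horizon)
        return self.t, self.step_reward, done

def always_continue(screen):
    return 1

def always_quit(screen):
    return 0

def play(env, policy):
    """Let the policy play one episode of env; return the total reward."""
    screen = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(screen)
        screen, reward, done = env.step(action)
        total_reward += reward
    return total_reward

for step_reward in (1.0, -1.0):
    env = QuitOrContinueEnv(step_reward)
    print(step_reward, play(env, always_continue), play(env, always_quit))
```

With a positive step reward the policy that keeps playing collects more; with a negative one the policy that finishes quickly does better, matching the two cases above.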

## Building Blocks

1. Neural Networks
    1. Network building using torch
    2. Network training using gradient descent
        1. Compute gradients with back propagation
        2. Commit parameter update along gradient direction
        
2. Reinforcement Learning Algorithms
    1. Long-term evaluation of actions -- building a slot-machine playing agent
    2. Handling machines with internal states
    3. Handling machines with MANY internal states
        1. Using neural networks to estimate action values
        2. Adjust action probability by reviewing consequences

## Technical Terms

A small number of technical terms are used in our discussion. One way to treat an unfamiliar piece of jargon is simply to ignore it when you encounter it, and let its meaning emerge during your study. If you find your short-term memory is about to explode because of the need to keep track of many strange notions, it might be helpful to look them up in a glossary such as [here](https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/). Or simply Google the new concept. However, I am afraid a Google/wiki definition can only partially help your study -- the key is that many new concepts need to be learned, so only careful review leads to a deep understanding.

## Math Glossary
Some math representation of useful concepts are:

- $s_t$: the state observed at time $t$. This is usually, but NOT always, what is returned to the agent at the time of taking actions. The most prominent exception is the screen-image-based states in our video-game playing examples. In that case one applies some simple preprocessing to the states.

- $a_t$: the action taken at time $t$. Generally, it is an integer in $\{0, 1, ..., K-1\}$ if there are $K$ different actions. Keep in mind that the actual action could be represented differently, such as "press A-button". For a decision-making agent, given all possible action choices, choosing actions is equivalent to choosing the indexes.

- $r_t$: the immediate reward received at time $t$. Note that some authors let $r_t$ refer to the reward received __after__ taking action $a_t$ in state $s_t$, while others take $r_t$ as the reward received __at the beginning__ of time $t$, after taking action $a_{t-1}$ in state $s_{t-1}$. Either way, the procedure of taking action $a_t$ in $s_t$ according to some policy $\pi$, arriving at the next state $s_{t+1}$ and receiving a reward $r_{t}$ (or $r_{t+1}$, depending on your choice of notation) is called a __transition step__.

- $\pi(\cdot|s)$: given the state $s$, a policy $\pi(a|s)$ (not to be confused with $\pi\approx3.1416$) assigns a non-negative real number to each action -- the probability of choosing that action given $s$, so the numbers sum to one over all actions. If a policy is deterministic, rather than stochastic, it attributes all probability to one particular action, so that the corresponding $\pi(a|s)=1$ and $\pi(a'|s)=0$ for all other actions $a'$.

- $Q^\pi(s, a)$: evaluation of the __long-term__ return for taking action $a$ in state $s$. Since it considers future effects, it relies on the on-going policy, $\pi$. Note that the very first step of this evaluation, taking $a$ in the current state $s$, is not necessarily chosen according to $\pi$. Consider this $Q$-evaluation as answering a hypothetical question: what would the long-term reward be if the agent took action $a$ at $s$ and followed $\pi$ thereafter? Of course, if this evaluation is known, it is wise to take the action maximising this $Q$ at each $s$.

- Note that in a neural network implementation, $Q$- and $\pi$-nets share the same structure: they map states to $K$ numbers, where $K$ is the number of actions.
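
The notation for $\pi$ and $Q$ can be illustrated with $K$ raw numbers for a single state. In the sketch below the logits and Q-values are made up, and using a softmax to turn network outputs into probabilities is a common choice assumed here for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

K = 3                                        # number of actions (illustrative)
logits = torch.tensor([2.0, 0.5, -1.0])      # made-up policy-net outputs for one state

# Stochastic policy: softmax turns the K numbers into probabilities pi(a|s).
pi = F.softmax(logits, dim=0)
sampled_action = torch.multinomial(pi, num_samples=1).item()

# Deterministic policy: all probability mass on one particular action.
pi_det = torch.zeros(K)
pi_det[torch.argmax(logits)] = 1.0

# Made-up Q-values for the same state; the greedy action maximises Q(s, a).
q_values = torch.tensor([0.1, 1.7, -0.3])
greedy_action = torch.argmax(q_values).item()

print(pi, sampled_action, pi_det, greedy_action)
```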

In [None]:
from utils.a3c import a3c
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import gym
env = gym.make('Pong-v0')
s = env.reset()
plt.imshow(s)

In [None]:
def a3c_test(env, model):
    state = env.reset()
    state = torch.from_numpy(state)
    reward_sum = 0
    done = True

    # a quick hack to prevent the agent from getting stuck
    episode_length = 0
    cx = hx = None
    while True:
        episode_length += 1
        # Sync with the shared model
        if done:
            with torch.no_grad():
                cx = torch.zeros(1, 256)
                hx = torch.zeros(1, 256)
        else:
            with torch.no_grad():
                cx = cx.detach()
                hx = hx.detach()

        with torch.no_grad():
            s_ = state.unsqueeze(0)
        value, logit, (hx, cx) = model((s_, (hx, cx)))
        prob = F.softmax(logit, dim=1)
        # print logit.data.numpy()
        action = prob.max(1, keepdim=True)[1].data.cpu().numpy()

        state, reward, done, _ = env.step(action[0, 0])

        env.render()
        time.sleep(0.3)
        done = done or episode_length >= 10000
        reward_sum += reward

        # a quick hack to prevent the agent from getting stuck
        # actions.append(action[0, 0])
        # if actions.count(actions[0]) == actions.maxlen:
        #     done = True

        if done:
            print("Reward {}, episode length {}".
                  format(reward_sum, episode_length))
            env.close()
            break
        state = torch.from_numpy(state)

In [None]:
A3C_CHECKPOINTS = [40, 100, 200]
A3C_CP = 2

model = a3c.ActorCritic(env.observation_space.shape[0], env.action_space)
if A3C_CP >= 0:
    checkpoint = torch.load('checkpoints/a3c_models/PongDeterministic-v4_worker2_{}'.format(A3C_CHECKPOINTS[A3C_CP]))
    model.load_state_dict(checkpoint['state_dict'])
model.eval()

In [None]:
env = a3c.create_atari_env('PongDeterministic-v4')
env.reset()
a3c_test(env, model)

In [None]:
env.close()

# EXAMPLE - YOLO

This section is under construction

The example follows the [Yolo Tutorial from paperspace][1].

[1]:https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/

#### Understanding YOLO-Network output and detection task

Each cell in the final convolutional feature map has $B\times(5+C)$ entries. $B$ is the number of bounding boxes each cell can predict. Each of the boxes is for detecting a certain kind of object. Each bounding box has $5+C$ attributes -- centre coordinates (x, y, offset?), dimensions (width, height), an objectness score and $C$ class confidences. __Q__: if each of the $B$ bounding boxes corresponds to one certain kind of object, then why does each bounding box have $C$ class likelihoods?

Each cell can predict an object in one of its bounding boxes _if the centre of the object falls in the receptive field of the cell_.

For training (or computing the training loss), the input image is divided according to the final feature map -- if the final feature map is a 32x downsampling of the image, the image is divided into grid cells of $32 \times 32$ pixels. See the picture below <img src="ref/yolo-5.png" alt="Smiley face" height="400">

Each cell defines $B$ anchors (default boxes). 


The network outputs $t_x, t_y, t_w, t_h$. For the $x$-coordinate, $b_x = \sigma(t_x)+c_x$: a sigmoid predicts the offset in the $x$ direction with respect to the top-left corner $c_x$ of the grid cell. 

> For example, consider the case of our dog image. If the prediction for center is (0.4, 0.7), then this means that the center lies at (6.4, 6.7) on the 13 x 13 feature map. (Since the top-left co-ordinates of the red cell are (6,6)).

$b_w = p_w e^{t_w}$ -- the width of the box, where $p_w$ is the width of the anchor. $1.0$ is the edge length of a cell.

In any case, each cell of the image is treated as $1.0 \times 1.0$.

> The resultant predictions, bw and bh, are normalised by the height and width of the image. (Training labels are chosen this way). So, if the predictions bx and by for the box containing the dog are (0.3, 0.8), then the actual width and height on 13 x 13 feature map is (13 x 0.3, 13 x 0.8).
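
The arithmetic in this subsection can be checked with a few lines of plain Python. The sketch below reproduces the two quoted examples (the centre at (6.4, 6.7) and the size on the 13 x 13 feature map). The values `B = 3` and `C = 80` are illustrative assumptions, not taken from the text, and `logit` is a helper used only to construct an input that yields the quoted offsets.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Inverse of the sigmoid, used here only to build an example input."""
    return math.log(p / (1.0 - p))

# Number of entries per cell: B boxes, each with 5 + C attributes.
B, C = 3, 80                      # illustrative values
entries_per_cell = B * (5 + C)

# Centre decoding b_x = sigmoid(t_x) + c_x, reproducing the quoted example:
# offsets (0.4, 0.7) inside the cell whose top-left corner is (6, 6).
c_x, c_y = 6, 6
t_x, t_y = logit(0.4), logit(0.7)
b_x = sigmoid(t_x) + c_x
b_y = sigmoid(t_y) + c_y

# Sizes normalised by the image map back to the 13 x 13 feature map by scaling.
grid = 13
norm_w, norm_h = 0.3, 0.8
w_on_grid, h_on_grid = grid * norm_w, grid * norm_h

print(entries_per_cell, (round(b_x, 3), round(b_y, 3)), (w_on_grid, h_on_grid))
```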

#### Understanding YOLO -- Processing Steps

The image is processed and downsampled until the stride reaches 32, then upsampled with skip-layer connections. Detections are made at strides 32, 16 and 8.

[IDEA] can we use a high-resolution image, rather than upsampling, for fine-grained detection?

# Epilogue


<img src="ref/Frans_Hals-Descartes.jpg" width="200" height="400">

> Cogito, ergo sum

    -- René Descartes