#### Outlook for UK AI and Machine Learning

#### 2018-05-11

#### Neil D. Lawrence

#### Amazon Cambridge and **University of Sheffield**

`@lawrennd` [inverseprobability.com](http://inverseprobability.com)

The aim of this presentation is to give a sense of the current situation in
machine learning and artificial intelligence as well as some perspective
on the immediate outlook for the field.

### What is Machine Learning?

First of all, we'll consider the question, what is machine learning? By
my definition Machine Learning is a combination of

$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

where *data* is our observations. They can be actively or passively
acquired (meta-data). The *model* contains our assumptions, based on
previous experience. That experience can be other data, it can come from
transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our inductive
biases. The *prediction* is an action to be taken or a categorization or
a quality score. The reason that machine learning has become a mainstay
of artificial intelligence is the importance of predictions in
artificial intelligence.

In practice we normally perform machine learning using two functions. To
combine data with a model we typically make use of:

**a prediction function**: a function which is used to make the
predictions. It includes our beliefs about the regularities of the
universe, our assumptions about how the world works, e.g. smoothness,
spatial similarities, temporal similarities.

**an objective function**: a function which defines the cost of
misprediction. Typically it includes knowledge about the world's
generating processes (probabilistic objectives) or the costs we pay for
mispredictions (empirical risk minimization).

The combination of data and model through the prediction function and
the objective function leads to a *learning algorithm*. The class of
prediction functions and objective functions we can make use of is
restricted by the algorithms they lead to. If the prediction function or
the objective function is too complex, then it can be difficult to find
an appropriate learning algorithm.
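
As a minimal sketch of how the two functions combine into a learning algorithm, the Python below pairs a linear prediction function with a squared-error objective and minimises it by gradient descent. The toy data, the step size and the names `predict`, `objective` and `fit` are assumptions made for illustration; they are not taken from the talk.

```python
# Toy data: y is roughly 2*x + 1 (made-up numbers for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]


def predict(w, b, x):
    """Prediction function: encodes the assumption that y is linear in x."""
    return w * x + b


def objective(w, b):
    """Objective function: mean squared cost of misprediction on the data."""
    return sum((y - predict(w, b, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(steps=5000, rate=0.02):
    """Learning algorithm: plain gradient descent on the objective."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(-2 * x * (y - predict(w, b, x)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - predict(w, b, x)) for x, y in zip(xs, ys)) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


w, b = fit()
print(round(w, 2), round(b, 2), round(objective(w, b), 4))
```

A different choice of prediction function or objective changes the gradients, and so changes the algorithm: that is the sense in which the two functions restrict which learning algorithms are available.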

A useful reference for the state of the art in machine learning is the UK
Royal Society Report, [Machine Learning: Power and Promise of Computers
that Learn by
Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).

You can also check my blog post on ["What is Machine
Learning?"](http://inverseprobability.com/2017/07/17/what-is-machine-learning).

Machine learning technologies have been the driver of two related, but
distinct disciplines. The first is *data science*. Data science is an
emerging field that arises from the fact that we now collect so much
data by happenstance, rather than by *experimental design*. Classical
statistics is the science of drawing conclusions from data, and to do so
statistical experiments are carefully designed. In the modern era we
collect so much data that there's a desire to draw inferences directly
from the data.

As well as machine learning, the field of data science draws from
statistics, cloud computing, data storage (e.g. streaming data),
visualization and data mining.

In contrast, artificial intelligence technologies typically focus on
emulating some form of human behaviour, such as understanding an image,
or some speech, or translating text from one form to another. The recent
advances in artificial intelligence have come from machine learning
providing the automation. But in contrast to data science, in artificial
intelligence the data is normally collected with the specific task in
mind. In this sense it has relations to classical statistics.

Classically artificial intelligence worried more about *logic* and
*planning* and focussed less on data driven decision making. Modern
machine learning owes more to the field of *Cybernetics* than artificial
intelligence. Related fields include *robotics*, *speech recognition*,
*language understanding* and *computer vision*.

There are strong overlaps between the fields: the wide availability of
data by happenstance makes it easier to collect data for designing AI
systems. These relations are coming through the wide availability of sensing
technologies that are interconnected by cellular networks, WiFi and the
internet. This phenomenon is sometimes known as the *Internet of
Things*, but this feels like a dangerous misnomer. We must never forget
that we are interconnecting people, not things.

### What does Machine Learning do?

-   ML Automates through Data
    -   *Strongly* related to statistics.
    -   Field underpins revolution in *data science* and *AI*
-   With AI:
    -   *logic*, *robotics*, *computer vision*, *speech*
-   With Data Science:
    -   *databases*, *data mining*, *statistics*, *visualization*

### "Embodiment Factors"

<table>
<tr>
<td>
</td>
<td align="center">
<img src="../slides/diagrams/IBM_Blue_Gene_P_supercomputer.jpg" width="60%" style="background:none; border:none; box-shadow:none;" align="center">
</td>
<td align="center">
<img src="../slides/diagrams/ClaudeShannon_MFO3807.jpg" width="100%" style="background:none; border:none; box-shadow:none;" align="center">
</td>
</tr>
<tr>
<td>
compute
</td>
<td align="center">
\~10 gigaflops
</td>
<td align="center">
\~ 1000 teraflops?
</td>
</tr>
<tr>
<td>
communicate
</td>
<td align="center">
\~1 gigabit/s
</td>
<td align="center">
\~ 100 bit/s
</td>
</tr>
<tr>
<td>
(compute/communicate)
</td>
<td align="center">
10
</td>
<td align="center">
\~ 10<sup>13</sup>
</td>
</tr>
</table>

See ["Living Together: Mind and Machine
Intelligence"](https://arxiv.org/abs/1705.07996)
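
The bottom row of the table is the ratio of the two rows above it. As a quick check of that arithmetic, the sketch below uses the order-of-magnitude figures from the table; the variable names are illustrative, and the human compute figure is the rough guess marked with a question mark in the table.

```python
# Order-of-magnitude figures taken from the table.
machine_compute = 10e9        # ~10 gigaflops, in flop/s
machine_communicate = 1e9     # ~1 gigabit/s, in bit/s
human_compute = 1000e12       # ~1000 teraflops (a rough guess), in flop/s
human_communicate = 100.0     # ~100 bit/s

# Embodiment factor: compute available per bit communicated.
machine_factor = machine_compute / machine_communicate
human_factor = human_compute / human_communicate

print(f"machine: {machine_factor:.0e}, human: {human_factor:.0e}")
```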

```{.python}
import pods
pods.notebook.display_plots('information-flow{sample:0>3}.svg',
                            '../slides/diagrams/data-science', sample=(1,3))
```

### What does Machine Learning do?

-   Automation scales by codifying processes and automating them.
-   Need:
    -   Interconnected components
    -   Compatible components
-   Early examples:
    -   cf Colt 45, Ford Model T

### Codify Through Mathematical Functions

-   How does machine learning work?
-   Jumper (jersey/sweater) purchase with logistic regression
    $$ \text{odds} = \frac{\text{bought}}{\text{not bought}} $$
    $$ \log \text{odds}  = \beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}$$

### Codify Through Mathematical Functions

-   How does machine learning work?
-   Jumper (jersey/sweater) purchase with logistic regression
    $$ p(\text{bought}) =  {f}\left(\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}\right)$$

### Codify Through Mathematical Functions

-   How does machine learning work?
-   Jumper (jersey/sweater) purchase with logistic regression
    $$ p(\text{bought}) =  {f}\left(\boldsymbol{\beta}^\top {{\bf {x}}}\right)$$

. . .

We call ${f}(\cdot)$ the *prediction function*
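
To make the jumper example concrete, here is a minimal sketch of the logistic prediction function in Python. The coefficient values, the example customer and the helper names `logistic` and `p_bought` are hypothetical, chosen only for illustration; nothing here is fitted to real data.

```python
import math


def logistic(z):
    """Prediction function f: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_bought(beta, x):
    """p(bought) = f(beta^T x), with x = [1, age, latitude]."""
    log_odds = sum(b * xi for b, xi in zip(beta, x))
    return logistic(log_odds)


# Hypothetical coefficients [beta_0, beta_1, beta_2], for illustration only.
beta = [-6.0, 0.02, 0.1]

# A 40-year-old customer at latitude 53 degrees north.
p = p_bought(beta, [1.0, 40.0, 53.0])
odds = p / (1.0 - p)
print(f"p(bought) = {p:.3f}, odds = {odds:.3f}")
```

Taking the log of `odds` recovers the linear term $\beta_0 + \beta_1 \text{age} + \beta_2 \text{latitude}$, which is why the odds form and the probability form describe the same model.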

### Fit to Data

-   Use an objective function
    $${E}(\boldsymbol{\beta}, {\mathbf{Y}}, {{\bf X}})$$

. . .

-   E.g. least squares
    $${E}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left({y}_i - {f}({{\bf {x}}}_i)\right)^2$$
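
As a worked example, assume a linear prediction function ${f}({{\bf {x}}}_i) = \boldsymbol{\beta}^\top {{\bf {x}}}_i$ (an assumption made here for illustration; the logistic case has no closed-form solution). Setting the gradient of the least squares objective to zero then gives a linear system for $\boldsymbol{\beta}$:

$$\frac{\partial {E}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -2\sum_{i=1}^{n}\left({y}_i - \boldsymbol{\beta}^\top {{\bf {x}}}_i\right){{\bf {x}}}_i = \mathbf{0} \quad\Rightarrow\quad \left(\sum_{i=1}^{n} {{\bf {x}}}_i {{\bf {x}}}_i^\top\right)\boldsymbol{\beta} = \sum_{i=1}^{n} {y}_i {{\bf {x}}}_i$$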

### Two Components

-   Prediction function, ${f}(\cdot)$
-   Objective function, ${E}(\cdot)$

```{.python}
import pods
pods.notebook.display_plots('anne-bob-conversation{sample:0>3}.svg',
                            '../slides/diagrams', sample=(0,7))
```

### Machine Learning and Narratives

<img class="" src="./diagrams/Classic_baby_shoes.jpg" width="60%" align="" style="background:none; border:none; box-shadow:none;">

<center>
*For sale: baby shoes, never worn.*
</center>

### Heider and Simmel (1944)

```{.python}
from IPython.lib.display import YouTubeVideo
YouTubeVideo('8FIEZXMUM2I')
```

### Deep Learning

-   These are interpretable models: vital for disease etc.

-   Modern machine learning methods are less interpretable

-   Example: face recognition

### 

<small>Outline of the DeepFace architecture. A front-end of a single
convolution-pooling-convolution filtering on the rectified input,
followed by three locally-connected layers and two fully-connected
layers. Color illustrates feature maps produced at each layer. The net
includes more than 120 million parameters, where more than 95% come from
the local and fully connected layers.</small>

<img class="" src="../slides/diagrams/deepface_neg.png" width="100%" align="" style="background:none; border:none; box-shadow:none;">

<p align="right">
<small>Source: DeepFace</small>
</p>

###

<img class="" src="../slides/diagrams/576px-Early_Pinball.jpg" width="50%" align="" style="background:none; border:none; box-shadow:none;">

We can think of what these models are doing as being similar to early
pinball machines. In a neural network, we input a number (or numbers),
whereas in pinball, we input a ball. The location of the ball on the
left-right axis can be thought of as the number. As the ball falls
through the machine, each layer of pins can be thought of as a different
layer of neurons. Each layer acts to move the ball to the left or to the right.

In a pinball machine, when the ball gets to the bottom it might fall
into a hole defining a score. In a neural network, that is equivalent to
the decision: a classification of the input object.

An image has more than one number associated with it, so it's like
playing pinball in a *hyper-space*.

```{.python}
import pods
pods.notebook.display_plots('pinball{sample:0>3}.svg',
                            '../slides/diagrams', sample=(1,2))
```

At initialization, the pins aren't in the right place to bring the ball
to the correct decision.

Learning involves moving all the pins to be in the right position, so
that the ball falls in the right place. But moving all these pins in
hyperspace can be difficult. In a hyperspace you have to put a lot of
data through the machine to explore the positions of all the pins.
Adversarial learning reflects the fact that a ball can be moved a small
distance and lead to a very different result.

Probabilistic methods explore more of the space by considering a range
of possible paths for the ball through the machine.
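
The pinball picture can be sketched in a few lines of Python. This is a one-dimensional toy, not the DeepFace network: the "pin positions" (one weight and bias per layer) are hypothetical values picked by hand, and `drop_ball` and `decide` are illustrative names.

```python
import math

# Hypothetical "pin positions": one (weight, bias) pair per layer.
layers = [(3.0, -0.3), (4.0, 0.0), (5.0, 0.0)]


def drop_ball(x, layers):
    """Pass the ball position x down through each layer of pins."""
    for weight, bias in layers:
        # Each layer nudges the ball to the left or to the right.
        x = math.tanh(weight * x + bias)
    return x


def decide(x, layers):
    """The hole the ball lands in: a two-class decision."""
    return "right" if drop_ball(x, layers) > 0 else "left"


# Two starting positions only a small distance apart.
print(decide(0.09, layers), decide(0.11, layers))
```

The two starting positions differ by 0.02 yet end up in different holes, which is the small-move, different-result behaviour the adversarial remark refers to. Learning would correspond to adjusting the values in `layers`.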

### Uncertainty and Learning

### Comparison with Human Learning & Embodiment

### Data Science

-   Industrial Revolution 4.0?

-   *Industrial Revolution* (1760-1840) term coined by Arnold Toynbee,
    late 19th century.

-   Maybe: But this one is dominated by *data* not *capital*

-   That presents *challenges* and *opportunities*

cf [digital
oligarchy](https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data)
vs [how Africa can benefit from the data
revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information)

-   Apple vs Nokia: How you handle disruption.

### A Time for Professionalisation?

-   New technologies historically led to new professions:
    -   Brunel (born 1806): Civil, mechanical, naval
    -   Tesla (born 1856): Electrical and power
    -   William Shockley (born 1910): Electronic
    -   Watts S. Humphrey (born 1927): Software

### Why?

-   Codification of best practice.
-   Developing trust

### Where are we?

-   Perhaps around the 1980s of programming.
    -   We understand `if`, `for`, and procedures
    -   But we don't share best practice.
-   Let's *avoid* the over-formalisation of software engineering.

### The Software Crisis

> The major cause of the software crisis is that the machines have
> become several orders of magnitude more powerful! To put it quite
> bluntly: as long as there were no machines, programming was no problem
> at all; when we had a few weak computers, programming became a mild
> problem, and now we have gigantic computers, programming has become an
> equally gigantic problem.
>
> Edsger Dijkstra, The Humble Programmer

### The Data Crisis

> The major cause of the data crisis is that machines have become more
> interconnected than ever before. Data access is therefore cheap, but
> data quality is often poor. What we need is cheap high quality data.
> That implies that we develop processes for improving and verifying
> data quality that are efficient.
>
> There would seem to be two ways for improving efficiency. Firstly, we
> should not duplicate work. Secondly, where possible we should automate
> work.
>
> Me

### 

<img class="" src="../slides/diagrams/Medievalplowingwoodcut.jpg" width="" align="" style="background:none; border:none; box-shadow:none;">

Feudal era data ecosystem.

### Rest of this Talk: Two Areas of Focus

-   Data Infrastructure

-   Deployment of Machine Learning Systems

```{.python}
import pods
pods.notebook.display_plots('uk_tin_coal_railways{sample:0>3}.svg',
                            '../slides/diagrams/data-science', sample=(1,5))
```

### Data Readiness Levels

[<img class="" src="../slides/diagrams/data-science/data-readiness-levels.png" width="" align="" style="background:none; border:none; box-shadow:none;">](https://arxiv.org/pdf/1705.02245.pdf)

[Data Readiness
Levels](http://inverseprobability.com/2017/01/12/data-readiness-levels)

### Three Grades of Data Readiness:

-   Grade C - accessibility

-   Grade B - validity

-   Grade A - usability

### Accessibility: Grade C

-   *Hearsay* data.
-   Availability: is it actually being recorded?
-   Privacy or legal constraints on the accessibility of the recorded
    data: have ethical constraints been alleviated?
-   Format: log books, PDF ...
-   Limitations on access due to topology (e.g. it's distributed across
    a number of devices)
-   At the end of Grade C data is ready to be loaded into analysis
    software (R, SPSS, Matlab, Python, Mathematica)

### Validity: Grade B

-   faithfulness and representation
-   visualisations.
-   exploratory data analysis
-   noise characterisation.
-   Missing values.
-   Schema alignment, record linkage, data fusion?
-   Example, was a column or columns accidentally perturbed (e.g.
    through a sort operation that missed one or more columns)? Or was a
    [gene name accidentally converted to a
    date](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80)?
-   At the end of Grade B the data is ready for defining a candidate
    question and its context, and for loading into OpenML
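
Two of the Grade B checks above, missing values and gene names converted to dates, can be sketched as a small Python routine. The example column, the `DATE_LIKE` pattern and the function name `validity_report` are assumptions made for illustration; a real check would need to match whatever date formats the spreadsheet software actually produced.

```python
import re

# Hypothetical column of gene symbols after a round trip through a spreadsheet.
gene_column = ["TP53", "1-Mar", "BRCA1", "", "2-Sep", None, "SEPT2"]

# Spreadsheet-style dates such as "1-Mar" or "2-Sep" suggest a converted gene name.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def validity_report(values):
    """Return the row indices of missing entries and of date-like entries."""
    missing = [i for i, v in enumerate(values) if v is None or v.strip() == ""]
    date_like = [i for i, v in enumerate(values)
                 if v is not None and DATE_LIKE.match(v.strip())]
    return {"missing": missing, "date_like": date_like}


report = validity_report(gene_column)
print(report)
```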

### Usability: Grade A

-   The usability of data
-   Grade A is about data in context.
-   Consider appropriateness of a given data set to answer a particular
    question or to be subject to a particular analysis.
-   Data integration?
-   At the end of Grade A the data is ready for platforms such as RAMP
    or Kaggle, or for defining a *task* in OpenML.

### 

[<img class="" src="../slides/diagrams/data-science/data-trusts.png" width="100%" align="" style="background:none; border:none; box-shadow:none;">](https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy)

### 

[<img class="" src="../slides/diagrams/data-science/data-trusts-review.png" width="" align="" style="background:none; border:none; box-shadow:none;">](https://www.out-law.com/en/articles/2017/october/review-calls-for-data-trusts-to-help-grow-artificial-intelligence-in-the-uk/)

### 

<img src="../slides/diagrams/user-centric-data.svg" align="">

### Fragility of AI Systems

### Pigeonholing

<img class="" src="../slides/diagrams/TooManyPigeons.jpg" width="60%" align="" style="background:none; border:none; box-shadow:none;">

The way we are deploying artificial intelligence systems in practice is
to build up systems of machine learning components. To build a machine
learning system, we decompose the task into parts which we can emulate
with ML methods. Each of these parts can be, typically, independently
constructed and verified. For example, in a driverless car we can
decompose the tasks into components such as "pedestrian detection" and
"road line detection". Each of these components can be constructed with,
for example, an independent classifier. We can then superimpose a logic
on top. For example, "Follow the road line unless you detect a
pedestrian in the road".

This allows for verification of car performance, as long as we can
verify the individual components. However, it also implies that the AI
systems we deploy are *fragile*.
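
The decomposition described above can be sketched as follows. The two detectors are stand-ins (simple dictionary lookups rather than trained classifiers), and the function names, the `frame` dictionary and the returned action strings are all hypothetical.

```python
def detect_pedestrian(frame):
    """Stand-in for an independently built and verified pedestrian classifier."""
    return frame.get("pedestrian_in_road", False)


def detect_road_line(frame):
    """Stand-in for a road line detector; returns a steering offset or None."""
    return frame.get("line_offset")


def drive(frame):
    """Logic layered on top: follow the road line unless a pedestrian is detected."""
    if detect_pedestrian(frame):
        return "brake"
    offset = detect_road_line(frame)
    if offset is None:
        # A situation the hand-written logic never anticipated: this is
        # where the composed system shows its fragility.
        return "undefined"
    return f"steer {offset:+.1f}"


print(drive({"line_offset": 0.2}))
print(drive({"line_offset": 0.2, "pedestrian_in_road": True}))
print(drive({}))
```

Each component can be tested in isolation, but the behaviour of the whole depends on the logic covering every combination of component outputs, including ones nobody pigeonholed in advance.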

### Rapid Reimplementation

### Early AI

<img class="rotateimg90" src="../slides/diagrams/2017-10-12 16.47.34.jpg" width="40%" align="" style="background:none; border:none; box-shadow:none;">

### Machine Learning Systems Design

<img class="" src="../slides/diagrams/SteamEngine_Boulton&Watt_1784_neg.png" width="50%" align="" style="background:none; border:none; box-shadow:none;">

### Adversaries

-   Stuxnet
-   Mischievous-Adversarial

### Turnaround And Update

-   There is a massive need for turnaround and update
-   A redeploy of the entire system.
    -   This involves changing the way we design and deploy.
-   Interface between security engineering and machine learning.

### Peppercorns

-   A new name for system failures which aren't bugs.
-   Difference between finding a fly in your soup vs a peppercorn in
    your soup.

<!--
### {.slide: data-transition="none"}

<center><video height="600" type="video/mp4"><source src="../slides/diagrams/paolo-peppercorn.mp4" height="80%"></video></center>

### {.slide: data-transition="none"}

<center><video type="video/mp4"><source src="../slides/diagrams/paolo-save.mp4"></video></center>
-->
### Thanks!

-   twitter: @lawrennd
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)
-   [Mike Jordan's Medium
    Post](https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7)