# How Engineers Solve Big and Difficult Problems Part 1: The Challenges/Opportunities Presented to Engineers by AI/ML

### [Neil D. Lawrence](http://inverseprobability.com), University of Cambridge

### 2022-11-14

**Abstract**: Machine learning solutions, in particular those based on
deep learning methods, form an underpinning of the current revolution in
“artificial intelligence” that has dominated popular press headlines and
is having a significant influence on the wider tech agenda. In this talk
I will give an overview of where we are now with machine learning
solutions, and what challenges we face both in the near and far future.
These include practical application of existing algorithms in the face
of the need to explain decision making, mechanisms for improving the
quality and availability of data, and dealing with large unstructured
datasets.


::: {.cell .markdown}


## Setup

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 22})


## notutils


This small helper package provides various notebook utilities.

The software can be installed using

In [None]:
%pip install notutils

from the command prompt where you can access your python installation.

The code is also available on GitHub:
<https://github.com/lawrennd/notutils>

Once `notutils` is installed, it can be imported in the usual manner.

In [None]:
import notutils

## pods


In Sheffield we created a suite of software tools for ‘Open Data
Science.’ Open data science is an approach to sharing code, models and
data that should make it easier for companies, health professionals and
scientists to gain access to data science techniques.

You can also check this blog post on [Open Data
Science](http://inverseprobability.com/2014/07/01/open-data-science).

The software can be installed using

In [None]:
%pip install pods

from the command prompt where you can access your python installation.

The code is also available on GitHub: <https://github.com/lawrennd/ods>

Once `pods` is installed, it can be imported in the usual manner.

In [None]:
import pods

## mlai


The `mlai` software is a suite of helper functions for teaching and
demonstrating machine learning algorithms. It was first used in the
Machine Learning and Adaptive Intelligence course in Sheffield in 2013.

The software can be installed using

In [None]:
%pip install mlai

from the command prompt where you can access your python installation.

The code is also available on GitHub: <https://github.com/lawrennd/mlai>

Once `mlai` is installed, it can be imported in the usual manner.

In [None]:
import mlai

## Complexity in Action


As an exercise in understanding complexity, watch the following video.
You will see the basketball being bounced around, and the players
moving. Your job is to count the passes of those dressed in white and
ignore those of the individuals dressed in black.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('vJG698U2Mvo')

Figure: <i>Daniel Simons’s famous illusion “monkey business.” Focus on
the movement of the ball distracts the viewer from seeing other aspects
of the image.</i>

In a classic study, Simons and Chabris (1999) asked subjects to count the
number of passes of the basketball between players on the team wearing
white shirts. Fifty percent of the time, these subjects didn’t notice the
gorilla moving across the scene.

The phenomenon of inattentional blindness is well known; for example, in
their paper Simons and Chabris quote the Hungarian neurologist Rezsö Bálint:

> It is a well-known phenomenon that we do not notice anything happening
> in our surroundings while being absorbed in the inspection of
> something; focusing our attention on a certain object may happen to
> such an extent that we cannot perceive other objects placed in the
> peripheral parts of our visual field, although the light rays they
> emit arrive completely at the visual sphere of the cerebral cortex.
>
> Rezsö Bálint 1907 (translated in Husain and Stein 1988, page 91)

When we combine the complexity of the world with our relatively low
bandwidth for information, problems can arise. Our focus on what we
perceive to be the most important problem can cause us to miss other
(potentially vital) contextual information.

This phenomenon is known as selective attention or ‘inattentional
blindness.’

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('_oGAzq5wM_Q')

Figure: <i>For a longer talk on inattentional bias from Daniel Simons
see this video.</i>

## Data Selective Attention Bias


We are going to see how inattention biases can play out in data analysis
by going through a simple example. The analysis involves body mass index
and activity information.

## BMI Steps Data


The BMI Steps example is taken from Yanai and Lercher (2020). We are
given a data set of body-mass index measurements against step counts.
For convenience we have packaged the data so that it can be easily
downloaded.

In [None]:
import pods

In [None]:
data = pods.datasets.bmi_steps()
X = data['X'] 
y = data['Y']

It is good practice to give our variables interpretable names so that
the analysis may be clearly understood by others. Here the `steps` count
is the first dimension of the covariate, the `bmi` is the second
dimension and the `gender` is stored in `y` with `1` for female and `0`
for male.

In [None]:
steps = X[:, 0]
bmi = X[:, 1]
gender = y[:, 0]

We can check the mean steps and the mean of the BMI.

In [None]:
print('Steps mean is {mean}.'.format(mean=steps.mean()))

In [None]:
print('BMI mean is {mean}.'.format(mean=bmi.mean()))

## BMI Steps Data Analysis


We can also separate out the means from the male and female populations.
In python this can be done by setting male and female indices as
follows.

In [None]:
male_ind = (gender==0)
female_ind = (gender==1)

And now we can extract the variables for the two populations.

In [None]:
male_steps = steps[male_ind]
male_bmi = bmi[male_ind]

And as before we compute the mean.

In [None]:
print('Male steps mean is {mean}.'.format(mean=male_steps.mean()))

In [None]:
print('Male BMI mean is {mean}.'.format(mean=male_bmi.mean()))

Similarly, we can get the same result for the female portion of the
population.

In [None]:
female_steps = steps[female_ind]
female_bmi = bmi[female_ind]

In [None]:
print('Female steps mean is {mean}.'.format(mean=female_steps.mean()))

In [None]:
print('Female BMI mean is {mean}.'.format(mean=female_bmi.mean()))

Interestingly, the female BMI average is slightly higher than the male BMI
average. The number of steps in the male group is higher than that in
the female group. Perhaps the steps and the BMI are anti-correlated. The
more steps, the lower the BMI.

The SciPy library provides a statistics module, `scipy.stats`. We’ll
import its `pearsonr` function so that we can try to understand the
correlation between `steps` and `bmi`.

In [None]:
from scipy.stats import pearsonr

In [None]:
corr, _ = pearsonr(steps, bmi)
print("Pearson's overall correlation: {corr}".format(corr=corr))

In [None]:
male_corr, _ = pearsonr(male_steps, male_bmi)
print("Pearson's correlation for males: {corr}".format(corr=male_corr))

In [None]:
female_corr, _ = pearsonr(female_steps, female_bmi)
print("Pearson's correlation for females: {corr}".format(corr=female_corr))
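
Before reading too much into pooled and per-group correlations, it can help to see how they may disagree. The following is a minimal sketch on synthetic data (not the BMI steps data): the group means, spreads and variable names are all assumptions chosen for illustration. Within each hypothetical group the two variables are independent, but the group means differ in opposite directions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_per_group = 500

# Hypothetical groups (not the BMI steps data): within each group the two
# variables are drawn independently, but the group means differ.
a_u = rng.normal(loc=8000, scale=1000, size=n_per_group)
a_v = rng.normal(loc=24, scale=1, size=n_per_group)
b_u = rng.normal(loc=5000, scale=1000, size=n_per_group)
b_v = rng.normal(loc=27, scale=1, size=n_per_group)

def correlation(u, v):
    """Pearson correlation coefficient between two one-dimensional arrays."""
    return np.corrcoef(u, v)[0, 1]

# Correlation within each group, and after pooling the two groups.
corr_a = correlation(a_u, a_v)
corr_b = correlation(b_u, b_v)
pooled_corr = correlation(np.concatenate([a_u, b_u]),
                          np.concatenate([a_v, b_v]))

print('Group A correlation: {corr:.2f}'.format(corr=corr_a))
print('Group B correlation: {corr:.2f}'.format(corr=corr_b))
print('Pooled correlation: {corr:.2f}'.format(corr=pooled_corr))
```

In this sketch the pooled correlation is strongly negative even though neither group shows a relationship on its own, so a difference in group means is not, by itself, evidence that the two variables are related within a group.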

In [None]:
import mlai.plot as plot
import mlai
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(X[male_ind, 0], X[male_ind, 1], 'g.',markersize=10)
_ = ax.plot(X[female_ind, 0], X[female_ind, 1], 'r.',markersize=10)
_ = ax.set_xlabel('steps', fontsize=20)
_ = ax.set_ylabel('BMI', fontsize=20)
xlim = (0, 15000)
ylim = (15, 32.5)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(filename='bmi-steps.svg',
                directory='./datasets',
                transparent=True)

## A Hypothesis as a Liability

This analysis is from an article titled “A Hypothesis is a Liability”
(Yanai and Lercher, 2020). They start their article with the following
quote from Hermann Hesse.

> " ‘When someone seeks,’ said Siddhartha, ‘then it easily happens that
> his eyes see only the thing that he seeks, and he is able to find
> nothing, to take in nothing. \[…\] Seeking means: having a goal. But
> finding means: being free, being open, having no goal.’ "
>
> Hermann Hesse

Their idea is that having a hypothesis can constrain our thinking.
However, in response to their paper, Felin et al. (2021) argue that some
form of hypothesis is always necessary, suggesting that a hypothesis
*can* be a liability.

My view is captured in the introductory chapter to an edited volume on
computational systems biology that I worked on with Mark Girolami,
Magnus Rattray and Guido Sanguinetti.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/licsb-popper-quote.png" style="width:80%">

Figure: <i>Quote from Lawrence (2010) highlighting the importance of
interaction between data and hypothesis.</i>

Popper nicely captures the interaction between hypothesis and data by
relating it to the chicken and the egg. The important thing is that
these two co-evolve.

## Number Theatre


Unfortunately, we don’t always have time to wait for this process to
converge to an answer we can all rely on before a decision is required.

Not only can we be misled by data before a decision is made, but
sometimes we can be misled by data to justify the making of a decision.
David Spiegelhalter refers to the phenomenon of “Number Theatre” in a
conversation with Andrew Marr from May 2020 on the presentation of data.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('9388XmWIHXg')

Figure: <i>Professor Sir David Spiegelhalter on Andrew Marr on 10th May
2020 speaking about some of the challenges around data, data
presentation, and decision making in a pandemic. David mentions number
theatre at 9 minutes 10 seconds.</i>


## Data Theatre

Data Theatre exploits data inattention bias to present a particular view
of events that may misrepresent them through selective presentation.
Statisticians are one of the few groups that are trained with a
sufficient degree of data skepticism, but data theatre can also be
combatted by ensuring that domain experts are present and that they can
speak freely.

<img src="https://inverseprobability.com/talks/./slides/diagrams//business/data-theatre001.svg" class="" width="60%" style="vertical-align:middle;">

Figure: <i>The phenomenon of number theatre or *data theatre* was
described by David Spiegelhalter and is nicely summarized by Martin
Robbins in this Substack article
<https://martinrobbins.substack.com/p/data-theatre-why-the-digital-dashboards>.</i>

The best book I have found for teaching the skeptical sense of data that
underlies the statistician’s craft is David Spiegelhalter’s *Art of
Statistics*.

# The Art of Statistics


<center>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip0">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

David Spiegelhalter

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/david-spiegelhalter.png" clip-path="url(#clip0)"/>

</svg>
</center>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//books/the-art-of-statistics.jpg" style="width:40%">

Figure: <i>[The Art of Statistics by David
Spiegelhalter](https://www.amazon.co.uk/Art-Statistics-Learning-Pelican-Books-ebook/dp/B07HQDJD99)
is an excellent read on the pitfalls of data interpretation.</i>

David’s book (Spiegelhalter, 2019) brings important examples from
statistics to life in an intelligent and entertaining way. It is highly
readable and gives an opportunity to fast-track towards the important
skill of data-skepticism that is the mark of a professional
statistician.

# What is Machine Learning?


What is machine learning? At its most basic level machine learning is a
combination of

$$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$$

where *data* is our observations. They can be actively or passively
acquired (meta-data). The *model* contains our assumptions, based on
previous experience. That experience can be other data, it can come from
transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our inductive
biases. The *prediction* is an action to be taken or a categorization or
a quality score. The reason that machine learning has become a mainstay
of artificial intelligence is the importance of predictions in
artificial intelligence. The data and the model are combined through
computation.

In practice we normally perform machine learning using two functions. To
combine data with a model we typically make use of:

**a prediction function**, which is used to make the predictions. It
includes our beliefs about the regularities of the universe, our
assumptions about how the world works, e.g., smoothness, spatial
similarities, temporal similarities.

**an objective function**, which defines the ‘cost’ of misprediction.
Typically, it includes knowledge about the world’s generating processes
(probabilistic objectives) or the costs we pay for mispredictions
(empirical risk minimization).

The combination of data and model through the prediction function and
the objective function leads to a *learning algorithm*. The class of
prediction functions and objective functions we can make use of is
restricted by the algorithms they lead to. If the prediction function or
the objective function is too complex, then it can be difficult to find
an appropriate learning algorithm. Much of the academic field of machine
learning is the quest for new learning algorithms that allow us to bring
different types of models and data together.
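
To make the roles of the two functions concrete, here is a minimal sketch under simple assumptions: a linear prediction function, a mean squared error objective, and gradient descent as the learning algorithm that results from combining them. The synthetic data and all names (`inputs`, `targets`, `learn`, the learning rate) are illustrative choices, not taken from any particular library.

```python
import numpy as np

def prediction_function(inputs, slope, offset):
    """Linear prediction: encodes the assumption that the output varies linearly with the input."""
    return slope * inputs + offset

def objective_function(targets, predictions):
    """Mean squared error: the 'cost' of misprediction."""
    return np.mean((targets - predictions) ** 2)

def learn(inputs, targets, learning_rate=0.1, iterations=500):
    """Gradient descent on the objective: a simple learning algorithm."""
    slope, offset = 0.0, 0.0
    for _ in range(iterations):
        error = prediction_function(inputs, slope, offset) - targets
        # Gradients of the mean squared error with respect to each parameter.
        slope -= learning_rate * 2 * np.mean(error * inputs)
        offset -= learning_rate * 2 * np.mean(error)
    return slope, offset

# Synthetic data (an assumption for illustration): targets = 2 * inputs + 1 plus noise.
rng = np.random.default_rng(0)
inputs = rng.uniform(-1, 1, size=100)
targets = 2.0 * inputs + 1.0 + 0.1 * rng.normal(size=100)

slope, offset = learn(inputs, targets)
final_cost = objective_function(targets, prediction_function(inputs, slope, offset))
print('slope = {slope:.2f}, offset = {offset:.2f}, cost = {cost:.4f}'.format(
    slope=slope, offset=offset, cost=final_cost))
```

Here the learning algorithm is easy to derive because both functions are simple. With a more complex prediction or objective function the gradients, or a reliable way of following them, may be much harder to obtain.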

A useful reference for the state of the art in machine learning is the UK
Royal Society Report, [Machine Learning: Power and Promise of Computers
that Learn by
Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).

You can also check my blog post on [What is Machine
Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning).

## Artificial Intelligence and Data Science


Machine learning technologies have been the driver of two related, but
distinct disciplines. The first is *data science*. Data science is an
emerging field that arises from the fact that we now collect so much
data by happenstance, rather than by *experimental design*. Classical
statistics is the science of drawing conclusions from data, and to do so
statistical experiments are carefully designed. In the modern era we
collect so much data that there’s a desire to draw inferences directly
from the data.

As well as machine learning, the field of data science draws from
statistics, cloud computing, data storage (e.g. streaming data),
visualization and data mining.

In contrast, artificial intelligence technologies typically focus on
emulating some form of human behaviour, such as understanding an image,
or some speech, or translating text from one form to another. The recent
advances in artificial intelligence have come from machine learning
providing the automation. But in contrast to data science, in artificial
intelligence the data is normally collected with the specific task in
mind. In this sense it has strong relations to classical statistics.

Classically artificial intelligence worried more about *logic* and
*planning* and focused less on data driven decision making. Modern
machine learning owes more to the field of *Cybernetics* (Wiener, 1948)
than artificial intelligence. Related fields include *robotics*, *speech
recognition*, *language understanding* and *computer vision*.

There are strong overlaps between the fields: the wide availability of
data by happenstance makes it easier to collect data for designing AI
systems. These relations are coming through the wide availability of
sensing technologies that are interconnected by cellular networks, WiFi and the
internet. This phenomenon is sometimes known as the *Internet of
Things*, but this feels like a dangerous misnomer. We must never forget
that we are interconnecting people, not things.

<center>

Convention for the Protection of *Individuals* with regard to Automatic
Processing of *Personal Data* (1981/1/28)

</center>

## Societal Effects


We have already seen the effects of this changed dynamic in biology and
computational biology. Improved sensorics have led to the new domains of
transcriptomics, epigenomics, and ‘rich phenomics’ as well as
considerably augmenting our capabilities in genomics.

Biologists have had to become data-savvy: they require a rich
understanding of the available data resources and need to assimilate
existing data sets in their hypothesis generation as well as their
experimental design. Modern biology has become a far more quantitative
science, but the quantitativeness has required new methods developed in
the domains of *computational biology* and *bioinformatics*.

There is also great promise for personalized health, but in health the
wide data-sharing that has underpinned success in the computational
biology community is much harder to carry out.

We can expect to see these phenomena reflected in wider society,
particularly as we make use of more automated decision making based only
on data. This is leading to a requirement to better understand our own
subjective biases to ensure that the human to computer interface allows
domain experts to assimilate data driven conclusions in a well
calibrated manner. This is particularly important where medical
treatments are being prescribed. It also offers potential for different
kinds of medical intervention. More subtle interventions are possible
when the digital environment is able to respond to users in a bespoke
manner. This has particular implications for treatment of mental health
conditions.

The main phenomenon we see across the board is the shift in dynamic from
the direct pathway between human and data, as traditionally mediated by
classical statistics, to a new flow of information via the computer.
This change of dynamics gives us the modern and emerging domain of *data
science*, where the interactions between human and data are mediated by
the machine.

## Challenges


The field of data science is rapidly evolving. Different practitioners
from different domains have their own perspectives. We identify three
broad challenges that are emerging, challenges which have not been
addressed in the traditional sub-domains of data science. The challenges
have social implications but require technological advance for their
solutions.

1.  Paradoxes of the Data Society
2.  Quantifying the Value of Data
3.  Privacy, loss of control, marginalization

You can also check this blog post on [Three Data Science
Challenges](http://inverseprobability.com/2016/07/01/data-science-challenges).

## The Big Data Paradox


The big data paradox is the modern phenomenon of “as we collect more
data, we understand less.” It is emerging in several domains: political
polling, characterization of patients for trials data, and monitoring
Twitter for political sentiment.

I like to think of the phenomenon as relating to the notion of “can’t
see the wood for the trees.” Classical statistics, with randomized
controlled trials, improved society’s understanding of data. It improved
our ability to monitor the forest, to consider population health, voting
patterns etc. It is critically dependent on active approaches to data
collection that deal with confounders. This data collection can be very
expensive.

In business today, it is still the gold standard: A/B tests are used to
understand the effect of an intervention on revenue or customer capture
or supply chain costs.
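
As a rough illustration of why active data collection helps, the sketch below simulates an A/B test. Everything in it is an assumption made for the example: the revenue distribution, the size of the effect, and the use of a permutation test to judge whether the observed difference could have arisen by chance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_customers = 10_000

# Randomized assignment: each customer is placed in group B by a coin flip,
# so confounders are balanced across the two groups on average.
in_b = rng.random(n_customers) < 0.5

# Hypothetical revenue per customer; the intervention (B) adds 1.0 on average.
baseline = rng.gamma(shape=2.0, scale=5.0, size=n_customers)
revenue = baseline + np.where(in_b, 1.0, 0.0)

# Estimated effect of the intervention: difference in group means.
effect = revenue[in_b].mean() - revenue[~in_b].mean()

# Permutation test: shuffle the group labels to see what differences
# arise when the labels carry no information.
n_permutations = 1000
null_effects = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(in_b)
    null_effects[i] = revenue[shuffled].mean() - revenue[~shuffled].mean()
p_value = np.mean(np.abs(null_effects) >= abs(effect))

print('Estimated effect: {effect:.2f}'.format(effect=effect))
print('Permutation p-value: {p:.3f}'.format(p=p_value))
```

The key design choice is the coin flip: because we decide who receives the intervention, differences between the groups can be attributed to the intervention rather than to who happened to end up in each group.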

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//Grib_skov.jpg" style="width:50%">

Figure: <i>New beech leaves growing in the Gribskov Forest in the
northern part of Sealand, Denmark. Photo from wikimedia commons by
Malene Thyssen, <http://commons.wikimedia.org/wiki/User:Malene>.</i>

The new phenomenon is *happenstance data*: data that is not actively
collected with a question in mind. As a result, it can mislead us. For
example, if we assume the politics of active users of Twitter is
reflective of the wider population’s politics, then we may be misled.
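
A small simulation can show how this happens. The sketch below is built entirely on assumptions chosen for illustration: a population split evenly on some question, and a platform where supporters are three times as likely to be active as non-supporters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: True means 'supports a policy'.
population_size = 100_000
supports = rng.random(population_size) < 0.5  # true support is about 50%

# Assumption for illustration: supporters are three times as likely to be
# active on the platform as non-supporters.
p_active = np.where(supports, 0.3, 0.1)
active = rng.random(population_size) < p_active

# Happenstance estimate: the views of whoever happens to be active.
happenstance_estimate = supports[active].mean()

# Designed estimate: a much smaller, but uniformly random, sample.
sample = rng.choice(population_size, size=1000, replace=False)
designed_estimate = supports[sample].mean()

print('Active users: {n}'.format(n=active.sum()))
print('Happenstance estimate: {est:.2f}'.format(est=happenstance_estimate))
print('Designed estimate: {est:.2f}'.format(est=designed_estimate))
```

In this sketch the happenstance sample is far larger than the designed one, yet it gives the worse answer, because more data does not correct for the way the data was selected.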

However, this happenstance data often allows us to characterise a
particular individual to a high degree of accuracy. Classical statistics
was all about the forest, but big data can often become about the
individual tree. As a result we are misled about the situation.

The phenomenon is more dangerous because our perception is that we are
characterizing the wider scenario with ever increasing accuracy, whereas
we are just becoming distracted by detail that may or may not be
pertinent to the wider situation.

This is related to our limited bandwidth as humans, and the ease with
which we are distracted by detail: the data-inattention-cognitive-bias.

## Breadth or Depth Paradox


The first challenge we’d like to highlight is the unusual paradoxes of
the data society. It is too early to determine whether these paradoxes
are fundamental or transient. Evidence for them is still somewhat
anecdotal, but they seem worthy of further attention.

### The Paradox of Measurement

We are now able to quantify to a greater and greater degree the actions
of individuals in society, and this might lead us to believe that social
science, politics, economics are becoming quantifiable. We are able to
get a far richer characterization of the world around us. Paradoxically
it seems that as we measure more, we understand less.

How could this be possible? It may be that the greater preponderance of
data is making society itself more complex. Therefore, traditional
approaches to measurement (e.g. polling by random sub-sampling) are
becoming harder, for example due to more complex batch effects or a
greater stratification of society, where it is more difficult to weight
the various sub-populations correctly.

The end result is that we have a Curate’s egg of a society: it is only
‘measured in parts.’ Whether by examination of social media or through
polling we no longer obtain the overall picture that can be necessary to
obtain the depth of understanding we require.

[One example of this
phenomenon](http://www.theguardian.com/politics/2015/nov/13/new-research-general-election-polls-inaccurate)
is the 2015 UK election, which polls had as a tie and yet in practice was
won by the Conservative party with a seven-point advantage. A
post-election poll which was truly randomized suggested that this lead
was measurable, but pre-election polls are conducted online and via
phone. These approaches can under-represent certain sectors. The
challenge is that the truly randomized poll is expensive and time
consuming. In practice online and phone polls are usually weighted to
reflect the fact that they are not truly randomized, but in a rapidly
evolving society the correct weights may move faster than they can be
tracked.
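
The weighting idea can be sketched as follows. The two strata, their population shares, the support levels and the respondent counts are all hypothetical numbers chosen for illustration. The sketch compares a raw pooled estimate with one that averages within strata and then weights by population share.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical strata with assumed population shares and support levels.
population_share = np.array([0.4, 0.6])  # e.g. 'often online', 'rarely online'
true_support = np.array([0.3, 0.5])      # support for a party within each stratum

# An online poll over-samples the first stratum.
respondents = np.array([8000, 2000])
responses = [rng.random(n) < p for n, p in zip(respondents, true_support)]

# Raw estimate: pool all respondents and ignore the strata.
raw_estimate = np.concatenate(responses).mean()

# Weighted estimate: average within strata, then weight by population share.
stratum_means = np.array([r.mean() for r in responses])
weighted_estimate = np.sum(population_share * stratum_means)

# The value we would like to recover.
population_value = np.sum(population_share * true_support)

print('Population value: {val:.3f}'.format(val=population_value))
print('Raw estimate: {val:.3f}'.format(val=raw_estimate))
print('Weighted estimate: {val:.3f}'.format(val=weighted_estimate))
```

The correction only works if `population_share` is right. If the shares, or the choice of strata, drift faster than they can be measured, the weighted estimate inherits that error.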

Another example is clinical trials. Once again, these rely on
randomized studies to verify the efficacy of the drug. But now, rather
than the population becoming more stratified, the difficulty comes from
the more personalized nature of the drugs we wish to test. A targeted
drug which has efficacy in a sub-population may be harder to test due to
difficulty in recruiting the sub-population. The benefit of the drug is
also for a smaller sub-group, so the expense of drug trials increases.

There are other less clear cut manifestations of this phenomenon. We
seem to rely increasingly on social media as a news source, or as an
indicator of opinion on a particular subject. But it is beholden to the
whims of a vocal minority.

Similar to the way we required more paper when we first developed the
computer, the solution is more *classical* statistics. We need to do
more work to verify the tentative conclusions we produce so that we know
that our new methodologies are effective.

As we increase the amount of data we acquire, we seem to be able to get
better at characterizing the actions of individuals, predicting how they
will behave. But we seem, somehow, to be becoming less capable at
understanding society. Somehow it seems that as we measure more, we
understand less.

That seems counter-intuitive. But perhaps the preponderance of data is
making society itself, or the way we measure society, somehow more
complex. And in turn, this means that traditional approaches to
measurement are failing. So when we realize we are getting better at
characterising individuals, perhaps we are only measuring society in
parts.

## Breadth vs Depth

Classical approaches to data analysis made use of many subjects to
achieve statistical power. Traditionally, we measure a few things about
many people. For example, cardiac disease risks can be based on a limited
number of factors in many patients (such as whether the patient smokes,
blood pressure, cholesterol levels etc). Because, traditionally, data
matrices are stored with individuals in rows and features in columns[1],
we refer to this as *depth* of measurement. In statistics this is
sometimes known as the *large $n$, small $p$* domain because
traditionally $p$ is used to denote the number of features we know about
an individual and $n$ is used to denote the number of individuals.

The data-revolution is giving us access to far more detail about each
individual, this is leading to a *breadth* of coverage. This
characteristic first came to prominence in computational biology and
genomics where we became able to record information about mutations and
transcription in millions of genes. So $p$ became very large, but due to
expense of measurement, the number of patients recorded, $n$, was
relatively small. But we now see this increasingly for other domains.
With an increasing number of sensors on our wrists or in our mobile
phones, we are characterizing indivdiuals in unprecedented detail. This
domain can also be effectively dealt with by modifying the models that
are used for the data.

<!-- https://upload.wikimedia.org/wikipedia/commons/5/5b/Grib_skov.jpg-->

So we can know an individual extremely well, or we can know a population
well. The saying “Can’t see the wood for the trees,” means we are
distracted by the individual trees in a forest, and can’t see the wider
context. This seems appropriate for what may be going on here. We are
becoming distracted by the information on the individual and we can’t
see the wider context of the data.

We know that a rigorous, randomized, study would characterize that
forest well, but it seems we are unwilling to invest the money required
to do that and the proxies we are using are no longer effective, perhaps
because of shifting patterns of behaviour driven by the rapidly evolving
digital world.

Further, it’s likely that we are interested in *strata* within our data
set. These are equivalent to the structure within the forest: a clearing,
a transition between types of tree, a shift in the nature of the
undergrowth.

[1] In statistics this is known as a *design matrix*, representing the
design of a study. But in databases, one might think of each patient
being in a row, or record of the database.

## Examples

Examples exhibiting this phenomenon include recent elections, which have
proven difficult to predict, including the UK 2015 elections, the EU
referendum, the US 2016 elections and the UK 2017 elections. In each
case individuals may have taken actions on the back of polls that showed
one thing or another but turned out to be inaccurate. Indeed, the only
accurate pre-election poll for the UK 2017 election, [the YouGov
poll](https://yougov.co.uk/news/2017/05/31/how-yougov-model-2017-general-election-works/),
was not a traditional poll: it relied on a new type of statistical model
called [Multilevel Regression and Poststratification
(MRP)](http://andrewgelman.com/2013/10/09/mister-p-whats-its-secret-sauce/)
(Gelman and Hill, 2006).
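
To give a feel for the poststratification half of MRP, here is a minimal sketch. It is not the YouGov model: the multilevel regression step that would produce the per-cell estimates is omitted, and the cell names, counts and support levels are hypothetical numbers chosen only for illustration. The idea is that estimates within each demographic cell are reweighted by the cell's share of the population rather than its share of the sample.

```python
# Hypothetical illustration: the cells, counts and support levels below
# are made up and do not come from any real poll.

# Each cell is a demographic group (here a coarse age band).
# "population" is the number of voters in the cell (e.g. from a census),
# "sampled" is how many poll respondents fell in the cell, and
# "support" is the estimated support for a party within the cell
# (in MRP this estimate would come from a multilevel regression).
cells = {
    "18-34": {"population": 3_000, "sampled": 100, "support": 0.60},
    "35-64": {"population": 5_000, "sampled": 300, "support": 0.45},
    "65+": {"population": 2_000, "sampled": 600, "support": 0.30},
}


def raw_estimate(cells):
    """Average support weighted by who happened to respond."""
    total = sum(c["sampled"] for c in cells.values())
    return sum(c["sampled"] * c["support"] for c in cells.values()) / total


def poststratified_estimate(cells):
    """Average support weighted by each cell's share of the population."""
    total = sum(c["population"] for c in cells.values())
    return sum(c["population"] * c["support"] for c in cells.values()) / total


raw = raw_estimate(cells)
post = poststratified_estimate(cells)
print(f"raw sample estimate:     {raw:.3f}")
print(f"poststratified estimate: {post:.3f}")
```

In this made-up example the sample over-represents the oldest group, so the raw estimate understates support relative to the poststratified one.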

Another example is stratified medicine. If a therapy is effective only
in a sub-type of a disease, then statistical power can be lost across
the whole population, particularly when that sub-type is a minority. But
characterization of that sub-type is difficult. For example, new cancer
immunotherapy treatments can have a dramatic effect, leading to almost
total elimination of the cancer in some patients, but characterizing
this sub-population is hard. This also makes it hard to develop clinical
trials that prove the efficacy of the drugs.
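
The loss of statistical power can be made concrete with a rough sample size calculation. The sketch below uses the standard normal-approximation formula for a two-arm comparison of means, and makes the simplifying assumption that a therapy which only works in a fraction of patients shows up as a proportionally diluted average effect (it ignores the extra variance that mixing responders and non-responders would add). The effect size and responder fraction are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate patients per arm for a two-arm trial comparing means.

    Normal approximation:
        n = 2 * sigma^2 * (z_{1 - alpha/2} + z_{power})^2 / effect^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * sigma**2 * (z_alpha + z_power) ** 2 / effect**2)


effect_in_responders = 1.0  # hypothetical effect, in units of sigma
responder_fraction = 0.2    # hypothetical: 20% of patients have the sub-type

# Trial restricted to the sub-type (if we could identify it in advance).
n_targeted = n_per_arm(effect_in_responders)

# Trial across the whole population: the average effect is diluted.
n_whole = n_per_arm(responder_fraction * effect_in_responders)

print(f"patients per arm, sub-type only:    {n_targeted}")
print(f"patients per arm, whole population: {n_whole}")
```

Because the required sample size scales with the inverse square of the effect, a sub-type that makes up a fifth of patients needs roughly twenty-five times as many participants when it cannot be characterized in advance.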

A final example is our measurement of our economy, which increasingly
may not capture where value is being generated. This is characterized by
the changing nature of work, and the way individuals contribute towards
society. For example, the open source community has driven the backbone
of the majority of operating system software we use today, as well as
cloud compute. But this value is difficult to measure as it was
contributed by volunteers, not by a traditional corporate structure.
Data itself may be driving this change, because the value of data
accumulates in a similar way to the value of capital. The movement of
data in the economy, and the value it generates is also hard to measure,
and it seems there may be a large class of “have nots,” in terms of
those industries whose productivity has suffered relative to the top
performers. The so-called productivity gap may not just be due to skills
and infrastructure, but also due to data-skills and data-infrastructure.

## Challenges

The nature of the digital society has a closed loop feedback on itself.
This is characterized by social media memes, which focus attention on
particular issues very quickly. A good example is the photograph of
Aylan Kurdi, the young Syrian boy found drowned on a Turkish beach. This
photograph had a dramatic effect on attitudes towards immigration, more
than the statistics that were showing that thousands were dying in the
Mediterranean each month (see [this report by the University of
Sheffield’s Social Media
Lab](https://www.dropbox.com/s/hnydewwtido6nhv/VISSOCMEDLAB_AYLAN%20KURDI%20REPORT.pdf?dl=0)).
Similarly, the dynamics of our social circles have changed. There are
filter bubbles, where our searches and/or newsfeed have been personalized
to things that algorithms already know we like, and echo chambers, where
we interact mainly with people we agree with and our opinions aren’t
challenged. Each of these is changing the dynamic of society, and yet
there is a strong temptation to use digital media for surveying
information.

## Solutions

The solutions to these challenges come in three flavours. Firstly, there
is a need for more data. In particular data that is actively acquired to
cover the gaps in our knowledge. We also need more use of classical
statistical techniques, and a wider understanding of what they involve.
This situation reminds me somewhat of the idea of the ‘paperless
office.’ The innovative research at Xerox PARC that brought us the
Graphical User Interface, so prevalent today, was driven by the
realization, in the 1970s, that eventually offices would stop using
paper. Xerox focussed research on what that office would look like as it
was a perceived threat to their business. The paperless office may still
come, but in practice computers brought about a significant increase in
the need for paper due to the additional amounts of information that
they caused to be summarized or generated. In a similar way, the world
of *big data* is driving a need for more experimental design and more
classical statistics. Any perception of the automated computer algorithm
that drives all before it is at least as far away as the paperless
office was in the 1970s.

We also need a better social, cognitive and biological understanding of
humans and how we and our social structures respond to these
interventions. Over time some of the measurables will likely stabilize,
but it is not yet clear which ones.

## Big Model Paradox


The big data paradox has a sister: the big model paradox. As we build
more and more complex models, we start believing that we have a
high-fidelity representation of reality. But the complexity of reality
is way beyond our feeble imaginings. So we end up with a highly complex
model, but one that falls well short in terms of reflecting reality. The
complexity of the model means that it moves beyond our understanding.

# Quantifying the Value of Data


The situation is reminiscent of a thirsty castaway, set adrift. There is
a sea of data, but it is not fit to drink. We need some form of data
desalination before it can be consumed. But like real desalination, this
is a non-trivial process, particularly if we want to achieve it at
scale.

There’s a sea of data, but most of it is undrinkable.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//sea-water-ocean-waves.jpg" style="width:50%">

Figure: <i>The abundance of uncurated data is reminiscent of the
abundance of undrinkable water for those cast adrift at sea.</i>

We require data-desalination before it can be consumed!

I spoke about the challenges in data science at the NIPS 2016 Workshop
on Machine Learning for Health. NIPS mainly focuses on machine learning
methodologies, and many of the speakers were doing so. But before my
talk, I listened to some of the other speakers talk about the challenges
they had with data preparation.

-   90% of our time is spent on validation and integration (Leo Anthony
    Celi)
-   “The Dirty Work We Don’t Want to Think About” (Eric Xing)
-   “Voodoo to get it decompressed” (Francisco Giminez)

A further challenge in healthcare is that the data is collected by
clinicians, often at great inconvenience to both themselves and the
patient, but the control of the data is sometimes used to steer the
direction of research.

The fact that we put so much effort into processing the data, but so
little into allocating credit for this work is a major challenge for
realizing the benefit in the data we have.

This type of work is somewhat thankless. With the exception of the
clinicians’ control of the data, which probably takes things too far,
those that collate and correct data sets gain little credit. In the
domain of *reinforcement learning* the aim is to take a series of
actions to achieve a stated goal and gain a reward. The *credit
assignment problem* is the challenge in the learning algorithm of
distributing credit to each of the actions which brought about the
reward. We also experience this problem in society: we use proxies such
as monetary reward to incentivise intermediate steps in our economy.
Modern society functions because we agree to make basic expenditure on
infrastructure, such as roads, which we all make use of. Our
data-society is not sufficiently mature to be correctly crediting and
rewarding those that undertake this work.
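
As a minimal sketch of one simple way credit can be pushed back to earlier actions in reinforcement learning, the code below computes the discounted return-to-go for each step of an episode. This is only one of many credit assignment schemes, and the step names, rewards and discount factor are hypothetical, chosen to echo the data preparation pipeline.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards from the end of the episode so that each step
    # inherits a discounted share of the rewards that follow it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: three preparatory steps earn nothing directly,
# and the reward only arrives with the final action.
steps = ["collect", "clean", "integrate", "publish result"]
rewards = [0.0, 0.0, 0.0, 1.0]

credit = discounted_returns(rewards, gamma=0.9)
for step, g in zip(steps, credit):
    print(f"{step:>15}: {g:.3f}")
```

Even in this toy setting the earliest steps receive the least credit, although the final reward could not have been obtained without them.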

This situation is no better in industry than in academia. Many companies
have been persuaded to accumulate all their data centrally in a
so-called “data lake.” This attractive idea is problematic, because data
is added to the “lake” without thought to its quality. As a result, a
better name for these resources would be data swamps, because the
quality of data in them is often dubious. Data scientists working
with these sources often need to develop their own processes for
checking the quality of the data before it is used. Unfortunately, the
quality improvements they make are rarely fed back into the ecosystem,
meaning the same purification work needs to be done repeatedly.

We need to properly incentivize the sharing and production of clean data
sets, and we need to correctly quantify the value in the contribution of
each actor. Otherwise there won’t be enough clean data to satiate the
thirst of our decision-making processes.

In [None]:
import notutils as nu

In [None]:
nu.display_plots('pomdp{samp:0>3}.svg', 
                            directory='./', samp=(1, 4))

<img src="https://inverseprobability.com/talks/./slides/diagrams//pomdp004.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Partially observable Markov decision process observing reward
as actions are taken in different states</i>

The value of shared data infrastructures in computational biology was
recognized by the 2010 joint statement from the Wellcome Trust and other
funders of research at the “Foggy Bottom” meeting. They recognised three
key benefits to sharing of health data:

-   faster progress in improving health
-   better value for money
-   higher quality science

But incentivising sharing requires incentivising collection and
collation of data, and the associated credit allocation models.

## Data Readiness Levels




[Data Readiness
Levels](http://inverseprobability.com/2017/01/12/data-readiness-levels)
(Lawrence, 2017) are an attempt to develop a language around data
quality that can bridge the gap between technical solutions and decision
makers such as managers and project planners. They are inspired by
Technology Readiness Levels which attempt to quantify the readiness of
technologies for deployment.

See this blog post on [Data Readiness
Levels](http://inverseprobability.com/2017/01/12/data-readiness-levels).

### Three Grades of Data Readiness


Data-readiness describes, at its coarsest level, three separate stages
of data graduation.

-   Grade C - accessibility
    -   Transition: data becomes electronically available
-   Grade B - validity
    -   Transition: pose a question to the data.
-   Grade A - usability

The important definitions are at the transition. The move from Grade C
data to Grade B data is delimited by the *electronic availability* of
the data. The move from Grade B to Grade A data is delimited by posing a
question or task to the data (Lawrence, 2017).
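
The grades and their transitions can be sketched as a small piece of code. This is an illustrative encoding only, not part of the Data Readiness Levels proposal itself: the class and field names are hypothetical, and each transition is reduced to a single yes/no flag.

```python
from dataclasses import dataclass
from enum import Enum


class Grade(Enum):
    C = "accessibility"
    B = "validity"
    A = "usability"


@dataclass
class DataSetStatus:
    # Hypothetical flags standing in for the two transitions in the text.
    electronically_available: bool = False  # transition from Grade C to B
    question_posed: bool = False            # transition from Grade B to A


def current_grade(status):
    """Return the grade a data set is working through."""
    if not status.electronically_available:
        return Grade.C
    if not status.question_posed:
        return Grade.B
    return Grade.A


hearsay = DataSetStatus()
loaded = DataSetStatus(electronically_available=True)
in_use = DataSetStatus(electronically_available=True, question_posed=True)

for name, status in [("hearsay", hearsay), ("loaded", loaded), ("in use", in_use)]:
    grade = current_grade(status)
    print(f"{name:>8}: Grade {grade.name} ({grade.value})")
```

Note that in this sketch a question posed to data that is not yet electronically available still leaves the data set in Grade C, since access has to be resolved first.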

## Accessibility: Grade C

The first grade refers to the accessibility of data. Most data science
practitioners will be used to working with data-providers who, perhaps
having had little experience of data-science before, state that they
“have the data.” More often than not, they have not verified this. A
convenient term for this is “Hearsay Data,” someone has *heard* that
they have the data so they *say* they have it. This is the lowest grade
of data readiness.

Progressing through Grade C involves ensuring that this data is
accessible. Not just in terms of digital accessibility, but also for
regulatory, ethical and intellectual property reasons.

## Validity: Grade B

Data transits from Grade C to Grade B once we can begin digital analysis
on the computer. Once the challenges of access to the data have been
resolved, we can make the data available either via API, or for direct
loading into analysis software (such as Python, R, Matlab, Mathematica
or SPSS). Once this has occurred the data is at B4 level. Grade B
involves the *validity* of the data. Does the data really represent what
it purports to? There are challenges such as missing values, outliers,
record duplication. Each of these needs to be investigated.
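
A minimal sketch of what such an investigation might look like is given below, using only the Python standard library. The record layout and field names are hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with a threshold of 3.5) is one common heuristic swapped in for illustration, not a prescribed part of Grade B.

```python
from collections import Counter
from statistics import median


def validity_report(records, field, threshold=3.5):
    """Flag missing values, duplicate records and outliers for one numeric field."""
    values = [r.get(field) for r in records]

    # Missing values: the field is absent or explicitly None.
    missing = [i for i, v in enumerate(values) if v is None]

    # Duplicates: records whose contents are identical to an earlier record.
    counts = Counter(tuple(sorted(r.items())) for r in records)
    duplicates = sum(c - 1 for c in counts.values())

    # Outliers: modified z-score based on the median absolute deviation.
    present = [v for v in values if v is not None]
    outliers = []
    if present:
        med = median(present)
        mad = median([abs(v - med) for v in present])
        if mad > 0:
            outliers = [
                i for i, v in enumerate(values)
                if v is not None and 0.6745 * abs(v - med) / mad > threshold
            ]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


# Hypothetical patient records with one of each kind of problem.
records = [
    {"patient_id": 1, "systolic_bp": 120},
    {"patient_id": 2, "systolic_bp": 118},
    {"patient_id": 3, "systolic_bp": None},  # missing value
    {"patient_id": 4, "systolic_bp": 125},
    {"patient_id": 4, "systolic_bp": 125},   # duplicated record
    {"patient_id": 5, "systolic_bp": 1220},  # likely a data-entry error
    {"patient_id": 6, "systolic_bp": 122},
]

report = validity_report(records, "systolic_bp")
print(report)
```

A report like this only flags candidates. Whether a flagged value is an error or a genuine observation still needs to be investigated, ideally with a domain expert.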

Grades B and C are important because, if the work done in these grades is
documented well, it can be reused in other projects. Reuse of this
labour is key to reducing the costs of data-driven automated decision
making. There is a strong overlap between the work required in this
grade and the statistical field of [*exploratory data
analysis*](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
(Tukey, 1977).

The need for Grade B emerges due to the fundamental change in the
availability of data. Classically, the scientific question came first,
and the data came later. This is still the approach in a randomized
control trial, e.g. in A/B testing or clinical trials for drugs. Today
data is being laid down by happenstance, and the question we wish to ask
about the data often comes after the data has been created. The Grade B
of data readiness ensures thought can be put into data quality *before*
the question is defined. It is this work that is reusable across
multiple teams. It is these processes that the team which is *standing
up* the data must deliver.

## Usability: Grade A

Once the validity of the data is determined, the data set can be
considered for use in a particular task. This stage of data readiness is
more akin to what machine learning scientists are used to doing in
universities: bringing an algorithm to bear on a well understood data
set.

In Grade A we are concerned about the utility of the data given a
particular task. Grade A may involve additional data collection
(experimental design in statistics) to ensure that the task is
fulfilled.

This is the stage where the data and the model are brought together, so
expertise in learning algorithms and their application is key. Further
ethical considerations, such as the fairness of the resulting
predictions, are required at this stage. At the end of this stage a
prototype model is ready for deployment.

Deployment and maintenance of machine learning models in production is
another important issue, for which Data Readiness Levels are only part of
the solution.

## Recursive Effects

To find out more, or to contribute ideas go to
<http://data-readiness.org>

Throughout the data preparation pipeline, it is important to have close
interaction between data scientists and application domain experts.
Decisions on data preparation taken outside the context of application
have dangerous downstream consequences. This provides an additional
burden on the data scientist as they are required for each project, but
it should also be seen as a learning and familiarization exercise for
the domain expert. Long term, just as biologists have found it necessary
to assimilate the skills of the bioinformatician to be effective in
their science, most domains will also require a familiarity with the
nature of data driven decision making and its application. Working
closely with data-scientists on data preparation is one way to begin
this sharing of best practice.

The processes involved in Grade C and B are often badly taught in
courses on data science. Perhaps not due to a lack of interest in the
areas, but maybe more due to a lack of access to real world examples
where data quality is poor.

These stages of data science are also ridden with ambiguity. In the long
term they could do with more formalization, and automation, but best
practice needs to be understood by a wider community before that can
happen.

# Assessing the Organization's Readiness


Assessing the readiness of data for analysis is one action that can be
taken, but assessing teams that need to assimilate the information in
the data is the other side of the coin. With this in mind both [Damon
Civin](https://medium.com/@damoncivin/the-joel-test-for-data-readiness-4882aae64753)
and [Nick
Elprin](https://blog.dominodatalab.com/joel-test-data-science/) have
independently proposed the idea of a “Data Joel Test.” A “[Joel
Test](https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/)”
is a short questionnaire to establish the ability of a team to handle
software engineering tasks. It is designed as a rough and ready
capability assessment. A “Data Joel Test” is similar, but for assessing
the capability of a team in performing data science.

## Privacy, Loss of Control and Marginalization


Society is becoming harder to monitor, but the individual is becoming
easier to monitor. Social media monitoring for ‘hate speech’ can easily
be turned to monitoring of political dissent. Marketing becomes more
sinister when the target of the marketing is so well understood and the
digital environment of the target is so well controlled.

## Marketing and Free Will


What does it mean for our free will if a computer can predict our
individual behavior better than we ourselves can?

There is potential for both explicit and implicit discrimination on the
basis of race, religion, sexuality or health status. All of these are
prohibited under European law but can pass unawares or be implicit.

The GDPR is the General Data Protection Regulation, but a better name
for it would simply be Good Data Practice Rules. It covers how to deal
with discrimination which has a consequential effect on the individual.
For example, entrance to university, access to loans or insurance. But
the new phenomenon is dealing with a series of inconsequential decisions
that taken together have a consequential effect.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//woman-tends-house-in-village-of-uganda-africa.jpg" style="width:60%">

Figure: <i>A woman tends her house in a village in Uganda.</i>

Statistics as a community is also focused on the single consequential
effect of an analysis (efficacy of drugs, or distribution of Mosquito
nets). Associated with happenstance data is *happenstance decision
making*.

The algorithms behind these decisions are developed in a particular
context, the so-called Silicon Valley bubble, but they are deployed
across the world. To address this, a key challenge is capacity building
in contexts which are remote from the Western norm.

## Amelioration


Addressing challenges in privacy, loss of control and marginalization
includes ensuring that the individual retains control of their own data.
We accept privacy in our real lives, we need to accept it in our digital
persona. This is vital for our control of persona and our ability to
project ourselves.

Fairness goes hand in hand with privacy to protect the individual.
Regulations like the GDPR date from a time when the main worry was
*consequential* decision making. Today we also face problems from
accumulation of inconsequential decisions leading to a resulting
consequential effect.

Capacity building in different contexts, empowering domain experts to
solve their own problems, is one aspect to the solution. A further
proposal is the use of data trusts to reintroduce control of personal
data for the individual.

## Personal Data Trusts


The machine learning solutions we are dependent on to drive automated
decision making are dependent on data. But with regard to personal data
there are important issues of privacy. Data sharing brings benefits, but
also exposes our digital selves. From the use of social media data for
targeted advertising to influence us, to the use of genetic data to
identify criminals, or natural family members. Control of our virtual
selves maps on to control of our actual selves.

The feudal system that is implied by current data protection legislation
has significant power asymmetries at its heart, in that the data
controller has a duty of care over the data subject, but the data
subject may only discover failings in that duty of care when it’s too
late. Data controllers also may have conflicting motivations, and often
their primary motivation is *not* towards the data-subject, who is
merely one consideration in their wider agenda.

[Personal Data
Trusts](https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy)
(Delacroix and Lawrence, 2018; Edwards, 2004; Lawrence, 2016) are a
potential solution to this problem. They are inspired by *land societies*
that formed in the 19th century to bring democratic representation to the
growing middle classes. A land society was a mutual organization where
resources were pooled for the common good.

A Personal Data Trust would be a legal entity where the trustees’
responsibility was entirely to the members of the trust. So the
motivation of the data-controllers is aligned only with the
data-subjects. How data is handled would be subject to the terms under
which the trust was convened. The success of an individual trust would
be contingent on it satisfying its members with appropriate balancing of
individual privacy with the benefits of data sharing.

Formation of Data Trusts became the number one recommendation of the
Hall-Pesenti report on AI, but unfortunately, the term was confounded
with more general approaches to data sharing that don’t necessarily
involve fiduciary responsibilities or personal data rights. It seems
clear that we need to better characterize the data sharing landscape as
well as propose mechanisms for tackling specific issues in data sharing.

It feels important to have a diversity of approaches, and yet it feels
important that any individual trust would be large enough to be taken
seriously in representing the views of its members in wider
negotiations.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/data-trusts.png" style="width:100%">

Figure: <i>For thoughts on data trusts see Guardian article on [Data
Trusts](https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy).</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/data-trusts-review.png" style="width:50%">

Figure: <i>Data Trusts were the first recommendation of the
<a href="https://www.out-law.com/en/articles/2017/october/review-calls-for-data-trusts-to-help-grow-artificial-intelligence-in-the-uk/" target="_blank">Hall-Pesenti
Report</a>. Unfortunately, since then the role of data trusts vs other
data sharing mechanisms in the UK has been somewhat confused.</i>

See the Guardian articles on [Digital
Oligarchies](https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data)
and [Information
Feudalism](https://www.theguardian.com/media-network/2015/nov/16/information-barons-threaten-autonomy-privacy-online).

## Data Trusts Initiative


The [Data Trusts Initiative](https://datatrusts.uk/), funded by the
Patrick J. McGovern Foundation, is supporting three pilot projects that
consider how bottom-up empowerment can redress the imbalance associated
with the digital oligarchy.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/data-trusts-initiative-project-page.png" style="width:60%">

Figure: <i>The Data Trusts Initiative
(<a href="https://datatrusts.uk/" target="_blank">http://datatrusts.uk</a>)
hosts blog posts helping build understanding of data trusts and supports
research and pilot projects.</i>

## Progress So Far

In its first 18 months of operation, the Initiative has:

-   Convened over 200 leading data ethics researchers and practitioners;

-   Funded 7 new research projects tackling knowledge gaps in data trust
    theory and practice;

-   Supported 3 real-world data trust pilot projects establishing new
    data stewardship mechanisms.

## Data Science Africa


<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science-africa-logo.png" style="width:30%">

Figure: <i>Data Science Africa <http://datascienceafrica.org> is a
ground up initiative for capacity building around data science, machine
learning and artificial intelligence on the African continent.</i>

<img src="https://inverseprobability.com/talks/./slides/diagrams//dsa/dsa-events-october-2021.svg" class="" width="60%" style="vertical-align:middle;">

Figure: <i>Data Science Africa meetings held up to October 2021.</i>

Data Science Africa is a bottom up initiative for capacity building in
data science, machine learning and artificial intelligence on the
African continent.

As of October 2021 there have been five workshops and five schools,
located in Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania;
Abuja, Nigeria; Addis Ababa, Ethiopia; Accra, Ghana; Kampala, Uganda and
Kimberley, South Africa.

The main notion is *end-to-end* data science. For example, going from
data collection in the farmer’s field to decision making in the Ministry
of Agriculture. Or going from malaria disease counts in health centers
to medicine distribution.

The philosophy is laid out in (Lawrence, 2015). The key idea is that the
modern *information infrastructure* presents new solutions to old
problems. Modes of development change because less capital investment is
required to take advantage of this infrastructure. The philosophy is
that local capacity building is the right way to take advantage of these
opportunities when addressing data science problems in the African
context.

Data Science Africa is now a non-governmental organization registered in
Kenya. The organising board of the meeting is entirely made up of
scientists and academics based on the African continent.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/africa-benefit-data-revolution.png" style="width:70%">

Figure: <i>The lack of existing physical infrastructure on the African
continent makes it a particularly interesting environment for deploying
solutions based on the *information infrastructure*. The idea is
explored further in this Guardian op-ed on [How Africa
can benefit from the data
revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information).</i>

Guardian article on [Data Science
Africa](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information)

## Example: Prediction of Malaria Incidence in Uganda


<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip1">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Martin Mubangizi

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/martin-mubangizi.png" clip-path="url(#clip1)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip2">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Ricardo Andrade Pacheco

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/ricardo-andrade-pacheco.png" clip-path="url(#clip2)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip3">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

John Quinn

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/john-quinn.jpg" clip-path="url(#clip3)"/>

</svg>

As an example of using Gaussian process models within the full pipeline
from data to decision, we’ll consider the prediction of malaria incidence
in Uganda. For the purposes of this study malaria reports come in two
forms: HMIS reports from health centres and Sentinel data, which is
curated by the WHO. There are limited sentinel sites and many HMIS
sites.

The work is from Ricardo Andrade Pacheco’s PhD thesis, completed in
collaboration with John Quinn and Martin Mubangizi (Andrade-Pacheco et
al., 2014; Mubangizi et al., 2014). John and Martin were initially from
the AI-DEV group from the University of Makerere in Kampala and
later they were based at UN Global Pulse in Kampala. You can see the
work summarized on the UN Global Pulse [disease outbreaks project site
here](https://diseaseoutbreaks.unglobalpulse.net/uganda/).

-   See [UN Global Pulse Disease Outbreaks
    Site](https://diseaseoutbreaks.unglobalpulse.net/uganda/)

Malaria data is spatial data. Uganda is split into districts, and health
reports can be found for each district. This suggests that models such
as conditional random fields could be used for spatial modelling, but
there are two complexities with this. First of all, occasionally
districts split into two. Secondly, sentinel sites are a specific
location within a district, such as Nagongera which is a sentinel site
based in the Tororo district.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/uganda-districts-2006.png" style="width:50%">

Figure: <i>Ugandan districts. Data SRTM/NASA from
<https://dds.cr.usgs.gov/srtm/version2_1>.</i>

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

The common standard for collecting health data on the African continent
is from the Health management information systems (HMIS). However, this
data suffers from missing values (Gething et al., 2006) and diagnosis of
diseases like typhoid and malaria may be confounded.

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Tororo_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Tororo district, where the sentinel site, Nagongera, is
located.</i>

[World Health Organization Sentinel Surveillance
systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)
are set up “when high-quality data are needed about a particular disease
that cannot be obtained through a passive system.” Several sentinel
sites give accurate assessment of malaria disease levels in Uganda,
including a site in Nagongera.

<img class="negate" src="https://inverseprobability.com/talks/./slides/diagrams//health/sentinel_nagongera.png" style="width:100%">

Figure: <i>Sentinel and HMIS data along with rainfall and temperature
for the Nagongera sentinel station in the Tororo district.</i>

In collaboration with the AI Research Group at Makerere we chose to
investigate whether Gaussian process models could be used to assimilate
information from these two different sources of disease information.
Further, we were interested in whether local information on rainfall and
temperature could be used to improve malaria estimates.

The aim of the project was to use WHO Sentinel sites, alongside rainfall
and temperature, to improve predictions from HMIS data of levels of
malaria.

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Mubende_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Mubende District.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/mubende.png" style="width:80%">

Figure: <i>Prediction of malaria incidence in Mubende.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//gpss/1157497_513423392066576_1845599035_n.jpg" style="width:80%">

Figure: <i>The project arose out of the Gaussian process summer school
held at Makerere in Kampala in 2013. The school led, in turn, to the
Data Science Africa initiative.</i>

## Early Warning Systems

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Kabarole_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Kabarole district in Uganda.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/kabarole.gif" style="width:100%">

Figure: <i>Estimate of the current disease situation in the Kabarole
district over time. Estimate is constructed with a Gaussian process with
an additive covariance function.</i>

The figure shows a health monitoring system for the Kabarole district.
Here we have fitted the reports with a Gaussian process with an additive
covariance function. It has two components: one is a long time scale
component (in red above), the other is a short time scale component (in
blue).
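A minimal sketch of the additive covariance idea, using only NumPy. The data here is synthetic and the kernel choice (two exponentiated quadratic terms), variances, lengthscales and noise level are all assumptions made for illustration; they are not the settings used for the Kabarole model. The point is that when the covariance is a sum, the posterior mean splits into one term per component.

```python
import numpy as np

def rbf(x1, x2, variance, lengthscale):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Synthetic weekly "report counts" (illustrative only, not real HMIS data).
rng = np.random.default_rng(0)
t = np.arange(0.0, 104.0)  # two years of weekly time stamps
y = 0.05 * t + np.sin(2 * np.pi * t / 12.0) + 0.1 * rng.standard_normal(t.size)
y = y - y.mean()  # centre the data so a zero-mean GP is reasonable

# Additive covariance: long time scale + short time scale + observation noise.
# All hyperparameters below are assumed values, not fitted ones.
K_long = rbf(t, t, variance=4.0, lengthscale=50.0)
K_short = rbf(t, t, variance=1.0, lengthscale=3.0)
noise_variance = 0.1**2
K = K_long + K_short + noise_variance * np.eye(t.size)

# Because the covariance is a sum, the posterior mean decomposes per component.
alpha = np.linalg.solve(K, y)
long_mean = K_long @ alpha    # slowly varying trend
short_mean = K_short @ alpha  # faster variation around the trend
signal_mean = long_mean + short_mean  # denoised estimate of the reports
```

In practice the hyperparameters would be learned from the data, for example by maximizing the marginal likelihood, rather than fixed by hand as they are here.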

Monitoring proceeds by considering two aspects of the curve. Is the blue
line (the short term report signal) above the red (which represents the
long term trend)? If so we have higher than expected reports. If this is
the case *and* the gradient is still positive (i.e. reports are going
up) we encode this with a *red* color. If it is the case and the
gradient of the blue line is negative (i.e. reports are going down) we
encode this with an *amber* color. Conversely, if the blue line is below
the red *and* decreasing, we color *green*. On the other hand if it is
below red but increasing, we color *yellow*.

This gives us an early warning system for disease. Red is a bad
situation getting worse, amber is bad, but improving. Green is good and
getting better and yellow good but degrading.

Finally, there is a gray region which represents when the scale of the
effect is small.
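The color coding described above can be written as a small decision rule. This is a sketch only: the function name, the `min_effect` threshold that triggers the gray region, and the treatment of an exactly zero gradient as "not increasing" are assumptions, since the text does not specify them.

```python
def warning_color(short_term, long_term, gradient, min_effect=0.0):
    """Map the state of the fitted curves at one time point to a warning color.

    short_term: value of the short term report signal (the blue line)
    long_term:  value of the long term trend (the red line)
    gradient:   gradient of the short term signal at the same time point
    min_effect: hypothetical threshold below which the gap is treated as small
    """
    gap = short_term - long_term
    if abs(gap) < min_effect:
        return "gray"  # scale of the effect is small
    if gap > 0:
        # Higher than expected reports.
        return "red" if gradient > 0 else "amber"
    # Lower than (or equal to) expected reports.
    return "yellow" if gradient > 0 else "green"


# Example: reports above trend and still rising.
status = warning_color(short_term=1.4, long_term=1.0, gradient=0.2, min_effect=0.1)
```

Applying the rule to every district at the latest time point gives one color per district, which is the information the spatial map needs.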

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/monitor.gif" style="width:50%">

Figure: <i>The map of Ugandan districts with an overview of the Malaria
situation in each district.</i>

These colors can now be observed directly on a spatial map of the
districts to give an immediate impression of the current status of the
disease across the country.

## Supply Chain


<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/packhorse-bridge-burbage-brook.jpg" style="width:80%">

Figure: <i>Packhorse Bridge under Burbage Edge. This packhorse route
climbs steeply out of Hathersage and heads towards Sheffield. Packhorses
were the main means of transporting goods across the Peak District. The
high cost of transport is one driver of the ‘smith’ model, where there
is a local skilled person responsible for assembling or creating goods
(e.g. a blacksmith). </i>

On Sunday mornings in Sheffield, I often used to run across Packhorse
Bridge in Burbage valley. The bridge is part of an ancient network of
trails crossing the Pennines that, before Turnpike roads arrived in the
18th century, was the main way in which goods were moved. Given that the
moors around Sheffield were home to sand quarries, tin mines, lead mines
and the villages in the Derwent valley were known for nail and pin
manufacture, this wasn’t simply movement of agricultural goods, but it
was the infrastructure for industrial transport.

A person whose profession was leading the horses was known as a Jagger,
and leading out of the village of Hathersage is Jagger’s Lane, a trail
that headed underneath Stanage Edge and into Sheffield.

The movement of goods from regions of supply to areas of demand is
fundamental to our society. The physical infrastructure of supply chain
has evolved a great deal over the last 300 years.

## Cromford


<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/cromford-mill.jpg" style="width:80%">

Figure: <i>Richard Arkwright is regarded as the founder of the modern
factory system. Factories exploit distribution networks to centralize
production of goods. Arkwright located his factory in Cromford due to
proximity to Nottingham Weavers (his market) and availability of water
power from the tributaries of the Derwent river. When he first arrived
there was almost no transportation network. Over the following 200 years
the Cromford Canal (1790s), a Turnpike (now the A6, 1816-18) and the
High Peak Railway (now closed, 1820s) were all constructed to improve
transportation access as the factory blossomed.</i>

Richard Arkwright is known as the father of the modern factory system.
In 1771 he set up a [Mill](https://en.wikipedia.org/wiki/Cromford_Mill)
for spinning cotton yarn in the village of Cromford, in the Derwent
Valley. The Derwent valley is relatively inaccessible. Raw cotton
arrived in Liverpool from the US and India. It needed to be transported
by packhorse across the bridleways of the Pennines. But Cromford was a
good location due to proximity to Nottingham, where weavers were
consuming the finished thread, and the availability of water power from
small tributaries of the Derwent river for Arkwright’s [water
frames](https://en.wikipedia.org/wiki/Water_frame) which automated
the production of yarn from raw cotton.

By 1794 the [Cromford
Canal](https://en.wikipedia.org/wiki/Cromford_Canal) was opened to bring
coal in to Cromford and give better transport to Nottingham. The
construction of the canals was driven by the need to improve the
transport infrastructure, facilitating the movement of goods across the
UK. The construction of canals, roads and railways was initially driven
by the economic need to move goods, that is, to improve the supply chain.

The A6 now passes through Cromford, but at the time Arkwright moved
there, there was merely a track. The High Peak Railway was opened in
1832. It has since been converted to the High Peak Trail, but it remains
the highest railway built in Britain.

Cooper (1991)

## Containerization


<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/container-2539942_1920.jpg" style="width:80%">

Figure: <i>The container is one of the major drivers of globalization,
and arguably the largest agent of social change in the last 100 years.
It reduces the cost of transportation, significantly changing the
appropriate topology of distribution networks. The container makes it
possible to ship goods halfway around the world for cheaper than it
costs to process those goods, leading to an extended distribution
topology.</i>

Containerization has had a dramatic effect on global economics, placing
many people in the developing world at the end of the supply chain.

<table>
<tr>
<td width="45%">

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/wild-alaskan-cod.jpg" style="width:90%">

</td>
<td width="45%">

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/wild-alaskan-cod-made-in-china.jpg" style="width:90%">

</td>
</tr>
</table>

Figure: <i>Wild Alaskan Cod, being sold in the Pacific Northwest, that
is a product of China. It is cheaper to ship the deep frozen fish
thousands of kilometers for processing than to process locally.</i>

For example, you can buy Wild Alaskan Cod fished from Alaska, processed
in China, sold in North America. This is driven by the low cost of
transport for frozen cod vs the higher relative cost of cod processing
in the US versus China. Similarly,
<a href="https://www.telegraph.co.uk/news/uknews/1534286/12000-mile-trip-to-have-seafood-shelled.html" target="_blank">Scottish
prawns are also processed in China for sale in the UK.</a>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//supply-chain/environmental-impact-of-food-by-life-cycle.png" style="width:70%">

Figure: <i>The transport cost of most foods is a very small portion of
the total cost. The exception is if foods are air freighted. Source:
<https://ourworldindata.org/food-choice-vs-eating-local> by Hannah
Ritchie CC-BY</i>

This effect on cost of transport vs cost of processing is the main
driver of the topology of the modern supply chain and the associated
effect of globalization. If transport is much cheaper than processing,
then processing will tend to agglomerate in places where processing
costs can be minimized.
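The trade-off can be made concrete with a toy cost model. Everything in this sketch is hypothetical: the site names, the per-unit processing costs, the distances and the transport rates are invented numbers chosen only to show how the preferred processing location flips as transport gets cheaper.

```python
def landed_cost(processing_cost, distance_km, transport_cost_per_km):
    """Total cost per unit: processing plus transport to and from the site."""
    return processing_cost + transport_cost_per_km * distance_km


def cheapest_site(sites, transport_cost_per_km):
    """Return the name of the site with the lowest landed cost.

    sites maps a site name to (processing_cost, round_trip_distance_km).
    """
    return min(
        sites,
        key=lambda name: landed_cost(
            sites[name][0], sites[name][1], transport_cost_per_km
        ),
    )


# Hypothetical sites: cheap processing far away vs expensive processing nearby.
sites = {
    "local": (10.0, 0.0),
    "remote": (4.0, 16000.0),
}

# Expensive transport keeps processing local ...
before = cheapest_site(sites, transport_cost_per_km=0.001)
# ... while cheap transport moves it to where processing is cheapest.
after = cheapest_site(sites, transport_cost_per_km=0.0001)
```

With these invented numbers the remote site costs 20.0 per unit at the higher transport rate and 5.6 at the lower one, against 10.0 for local processing, so a tenfold drop in transport cost is enough to relocate the processing step.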

Large scale global economic change has principally been driven by
changes in the technology that drives supply chain.

Supply chain is a large-scale automated decision making network. Our aim
is to make decisions not only based on our models of customer behavior
(as observed through data), but also by accounting for the structure of
our fulfilment center, and delivery network.

Many of the most important questions in supply chain take the form of
counterfactuals. E.g. “What would happen if we opened a manufacturing
facility in Cambridge?” A counterfactual is a question that implies a
mechanistic understanding of a system. It goes beyond simple smoothness
assumptions or translation invariances. It requires a physical, or
*mechanistic* understanding of the supply chain network. For this
reason, the type of models we deploy in supply chain often involve
simulations or more mechanistic understanding of the network.

In supply chain, machine learning alone is not enough. We need to bridge
between models that contain real mechanisms and models that are entirely
data driven.

This is challenging, because as we introduce more mechanism to the
models we use, it becomes harder to develop efficient algorithms to
match those models to data.

## SafeBoda


The complexity of building safe, maintainable systems that are based on
interacting components which include machine learning models means that
smaller companies can be excluded from access to these technologies due
to the technical and intellectual debt incurred when maintaining such
systems in a real-world environment.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//ai/safe-boda.png" style="width:60%">

Figure: <i>SafeBoda is a ride allocation system for Boda Boda drivers.
Let’s imagine the capabilities we need for such an AI system.</i>

[SafeBoda](https://safeboda.com/ug/index.php#whysafeboda) is a Kampala
based ride allocation system for Boda Boda drivers. Boda bodas are
motorcycle taxis which give employment to many people, often young men,
across Kampala. SafeBoda is driven by the knowledge that road accidents
are set to match HIV/AIDS as the highest cause of death in low/middle
income countries by 2030.

> With road accidents set to match HIV/AIDS as the highest cause of
> death in low/middle income countries by 2030, SafeBoda’s aim is to
> modernise informal transportation and ensure safe access to mobility.

A key aim of the AutoAI agenda is to reduce these technical challenges,
so that such software can be maintained safely and reliably by a small
team of software engineers. Without this capability it is hard to
imagine how low resource environments can fully benefit from the ‘data
revolution’ without heavy reliance on technical provision from
high-resource environments. Such dependence would inevitably mean a skew
towards the challenges that high-resource economies face, rather than
the more urgent and important problems that are faced in low-resource
environments.

## Prime Air


One project where the components of machine learning and the physical
world come together is Amazon’s Prime Air drone delivery system.

Automating the process of moving physical goods through autonomous
vehicles completes the loop between the ‘bits’ and the ‘atoms.’ In other
words, the information and the ‘stuff.’ The idea of the drone is to
complete a component of package delivery, the notion of last mile
movement of goods, but in a fully autonomous way.

<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip4">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Gur Kimchi

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/gur-kimchi.png" clip-path="url(#clip4)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip5">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Paul Viola

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/paul-viola.png" clip-path="url(#clip5)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip6">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

David Moro

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/david-moro.png" clip-path="url(#clip6)"/>

</svg>

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('3HJtmx5f1Fc')

Figure: <i>An actual ‘Santa’s sleigh.’ Amazon’s prototype delivery
drone. Machine learning algorithms are used across various systems
including sensing (computer vision for detection of wires, people, dogs
etc) and piloting. The technology is necessarily a combination of old
and new ideas. The transition from vertical to horizontal flight is
vital for efficiency and uses sophisticated machine learning to
achieve.</i>

As Jeff Wilke (who was CEO of Amazon Retail at the time) [announced in
June
2019](https://blog.aboutamazon.com/transportation/a-drone-program-taking-flight)
the technology is ready, but still needs operationalization including
e.g. regulatory approval.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('wa8DU-Sui8Q')

Figure: <i>Jeff Wilke (CEO Amazon Consumer) announcing the new drone at
the Amazon 2019 re:MARS event alongside the scale of the Amazon supply
chain.</i>

> When we announced earlier this year that we were evolving our Prime
> two-day shipping offer in the U.S. to a one-day program, the response
> was terrific. But we know customers are always looking for something
> better, more convenient, and there may be times when one-day delivery
> may not be the right choice. Can we deliver packages to customers even
> faster? We think the answer is yes, and one way we’re pursuing that
> goal is by pioneering autonomous drone technology.

> Today at Amazon’s re:MARS Conference (Machine Learning, Automation,
> Robotics and Space) in Las Vegas, we unveiled our latest Prime Air
> drone design. We’ve been hard at work building fully electric drones
> that can fly up to 15 miles and deliver packages under five pounds to
> customers in less than 30 minutes. And, with the help of our
> world-class fulfillment and delivery network, we expect to scale Prime
> Air both quickly and efficiently, delivering packages via drone to
> customers within months.

Covering 15 miles in less than 30 minutes implies average air speeds of
around 50 kilometers per hour.
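The arithmetic behind that figure is short enough to check directly. The sketch assumes the full 15 miles is flown one way in exactly 30 minutes, with no allowance for take-off, landing or a return leg, so it gives a lower bound on the average speed needed.

```python
KM_PER_MILE = 1.609344  # exact by definition of the international mile

distance_km = 15 * KM_PER_MILE  # 15 miles is roughly 24.1 km
time_hours = 30 / 60            # 30 minutes

# Minimum average speed needed to cover the distance in the time allowed.
average_speed_kmh = distance_km / time_hours
print(f"{average_speed_kmh:.1f} km/h")
```

This comes out at a little over 48 km/h, consistent with the round figure of 50 quoted above.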

> Our newest drone design includes advances in efficiency, stability
> and, most importantly, in safety. It is also unique, and it advances
> the state of the art. How so? First, it’s a hybrid design. It can do
> vertical takeoffs and landings – like a helicopter. And it’s efficient
> and aerodynamic—like an airplane. It also easily transitions between
> these two modes—from vertical-mode to airplane mode, and back to
> vertical mode.

> It’s fully shrouded for safety. The shrouds are also the wings, which
> makes it efficient in flight.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//ai/amazon-prime-air-remars-june-2019.jpg" style="width:80%">

Figure: <i>Picture of the drone from the Amazon re:MARS event in 2019.</i>

> Our drones need to be able to identify static and moving objects
> coming from any direction. We employ diverse sensors and advanced
> algorithms, such as multi-view stereo vision, to detect static objects
> like a chimney. To detect moving objects, like a paraglider or
> helicopter, we use proprietary computer-vision and machine learning
> algorithms.

> A customer’s yard may have clotheslines, telephone wires, or
> electrical wires. Wire detection is one of the hardest challenges for
> low-altitude flights. Through the use of computer-vision techniques
> we’ve invented, our drones can recognize and avoid wires as they
> descend into, and ascend out of, a customer’s yard.

We separated the challenges we face into three groups: (1) paradoxes of
the modern data society, (2) quantifying the value of data and (3)
privacy, loss of control and marginalization. We’ve noted the origins of
the paradoxes, speculating that they are based in a form of data (or
modelling) inattention bias demonstrated through the Gorilla. We’ve
drawn parallels between challenges of rewarding the addition of value
and the credit assignment problem in reinforcement learning and we’ve
looked at approaches to introduce the voice of marginalized societies
and people into the conversation.

## Thanks!


For more information on these subjects and more you might want to check
the following resources.

-   twitter: [@lawrennd](https://twitter.com/lawrennd)
-   podcast: [The Talking Machines](http://thetalkingmachines.com)
-   newspaper: [Guardian Profile
    Page](http://www.theguardian.com/profile/neil-lawrence)
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)

## References

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014.
Consistent mapping of government malaria records across a changing
territory delimitation. Malaria Journal 13.
<https://doi.org/10.1186/1475-2875-13-S1-P5>

Cooper, B., 1991. Transformation of a valley: Derbyshire Derwent.
Scarthin Books.

Delacroix, S., Lawrence, N.D., 2018. Disturbing the ‘one size fits all’
approach to data governance: Bottom-up data trusts. SSRN.
<https://doi.org/10.2139/ssrn.3265315>

Edwards, L., 2004. The problem with privacy. International Review of Law
Computers & Technology 18, 263–294.

Felin, T., Koenderink, J., Krueger, J.I., Noble, D., Ellis, G.F.R.,
2021. The data-hypothesis relationship. Genome Biology 22.
<https://doi.org/10.1186/s13059-021-02276-4>

Gelman, A., Hill, J., 2006. Data analysis using regression and
multilevel/hierarchical models, Analytical methods for social research.
Cambridge University Press, Cambridge, UK.
<https://doi.org/10.1017/CBO9780511790942>

Gething, P.W., Noor, A.M., Gikandi, P.W., Ogara, E.A.A., Hay, S.I.,
Nixon, M.S., Snow, R.W., Atkinson, P.M., 2006. Improving imperfect data
from health management information systems in Africa using space–time
geostatistics. PLoS Medicine 3.
<https://doi.org/10.1371/journal.pmed.0030271>

Lawrence, N.D., 2017. Data readiness levels. ArXiv.

Lawrence, N.D., 2016. Data trusts could allay our privacy fears.

Lawrence, N.D., 2015. How Africa can benefit from the data revolution.

Lawrence, N.D., 2010. Introduction to learning and inference in
computational systems biology.

Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence,
N.D., 2014. Malaria surveillance with multiple data sources using
Gaussian process models, in: 1st International Conference on the Use of
Mobile ICT in Africa.

Simons, D.J., Chabris, C.F., 1999. Gorillas in our midst: Sustained
inattentional blindness for dynamic events. Perception 28, 1059–1074.
<https://doi.org/10.1068/p281059>

Spiegelhalter, D.J., 2019. The art of statistics. Pelican.

Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.

Wiener, N., 1948. Cybernetics: Control and communication in the animal
and the machine. MIT Press, Cambridge, MA.

Yanai, I., Lercher, M., 2020. A hypothesis is a liability. Genome
Biology 21.