# How Engineers Solve Big and Difficult Problems Part 1: The Challenges/Opportunities Presented to Engineers by AI/ML

### [Neil D. Lawrence](http://inverseprobability.com), University of Cambridge

### 2022-11-14

**Abstract**: Machine learning solutions, in particular those based on
deep learning methods, form an underpinning of the current revolution in
“artificial intelligence” that has dominated popular press headlines and
is having a significant influence on the wider tech agenda. In this talk
I will give an overview of where we are now with machine learning
solutions, and what challenges we face both in the near and far future.
These include the practical application of existing algorithms in the
face of the need to explain decision making, mechanisms for improving
the quality and availability of data, and dealing with large
unstructured datasets.

::: {.cell .markdown}

## Setup

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 22})

<!--setupplotcode{import seaborn as sns
sns.set_style('darkgrid')
sns.set_context('paper')
sns.set_palette('colorblind')}-->

## notutils

This small helper package provides various notebook utilities.

The software can be installed using

In [None]:
%pip install notutils

from the command prompt where you can access your python installation.

The code is also available on GitHub:
<https://github.com/lawrennd/notutils>

Once `notutils` is installed, it can be imported in the usual manner.

In [None]:
import notutils

## pods

In Sheffield we created a suite of software tools for ‘Open Data
Science.’ Open data science is an approach to sharing code, models and
data that should make it easier for companies, health professionals and
scientists to gain access to data science techniques.

You can also check this blog post on [Open Data
Science](http://inverseprobability.com/2014/07/01/open-data-science).

The software can be installed using

In [None]:
%pip install pods

from the command prompt where you can access your python installation.

The code is also available on GitHub: <https://github.com/lawrennd/ods>

Once `pods` is installed, it can be imported in the usual manner.

In [None]:
import pods

## mlai

The `mlai` software is a suite of helper functions for teaching and
demonstrating machine learning algorithms. It was first used in the
Machine Learning and Adaptive Intelligence course in Sheffield in 2013.

The software can be installed using

In [None]:
%pip install mlai

from the command prompt where you can access your python installation.

The code is also available on GitHub: <https://github.com/lawrennd/mlai>

Once `mlai` is installed, it can be imported in the usual manner.

In [None]:
import mlai

## Complexity in Action

As an exercise in understanding complexity, watch the following video.
You will see the basketball being bounced around, and the players
moving. Your job is to count the passes of those dressed in white and
ignore those of the individuals dressed in black.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('vJG698U2Mvo')

Figure: <i>Daniel Simons’s famous illusion “monkey business.” Focusing
on the movement of the ball distracts the viewer from seeing other
aspects of the image.</i>

In a classic study Simons and Chabris (1999) ask subjects to count the
number of passes of the basketball between players on the team wearing
white shirts. Fifty percent of the time, these subjects don’t notice the
gorilla moving across the scene.

The phenomenon of inattentional blindness is well known; for example, in
their paper Simons and Chabris quote the Hungarian neurologist, Rezsö
Bálint,

> It is a well-known phenomenon that we do not notice anything happening
> in our surroundings while being absorbed in the inspection of
> something; focusing our attention on a certain object may happen to
> such an extent that we cannot perceive other objects placed in the
> peripheral parts of our visual field, although the light rays they
> emit arrive completely at the visual sphere of the cerebral cortex.
>
> Rezsö Bálint 1907 (translated in Husain and Stein 1988, page 91)

When we combine the complexity of the world with our relatively low
bandwidth for information, problems can arise. Our focus on what we
perceive to be the most important problem can cause us to miss other
(potentially vital) contextual information.

This phenomenon is known as selective attention or ‘inattentional
blindness.’

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('_oGAzq5wM_Q')

Figure: <i>For a longer talk on inattentional bias from Daniel Simons
see this video.</i>

## Data Selective Attention Bias

We are going to see how inattention biases can play out in data analysis
by going through a simple example. The analysis involves body mass index
and activity information.

## BMI Steps Data

The BMI Steps example is taken from Yanai and Lercher (2020). We are
given a data set of body-mass index measurements against step counts.
For convenience we have packaged the data so that it can be easily
downloaded.

In [None]:
import pods

In [None]:
data = pods.datasets.bmi_steps()
X = data['X'] 
y = data['Y']

It is good practice to give our variables interpretable names so that
the analysis may be clearly understood by others. Here the `steps` count
is the first dimension of the covariate, the `bmi` is the second
dimension and the `gender` is stored in `y` with `1` for female and `0`
for male.

In [None]:
steps = X[:, 0]
bmi = X[:, 1]
gender = y[:, 0]

We can check the mean steps and the mean of the BMI.

In [None]:
print('Steps mean is {mean}.'.format(mean=steps.mean()))

In [None]:
print('BMI mean is {mean}.'.format(mean=bmi.mean()))

## BMI Steps Data Analysis

We can also separate out the means for the male and female populations.
In Python this can be done by setting male and female indices as
follows.

In [None]:
male_ind = (gender==0)
female_ind = (gender==1)

And now we can extract the variables for the two populations.

In [None]:
male_steps = steps[male_ind]
male_bmi = bmi[male_ind]

And as before we compute the mean.

In [None]:
print('Male steps mean is {mean}.'.format(mean=male_steps.mean()))

In [None]:
print('Male BMI mean is {mean}.'.format(mean=male_bmi.mean()))

Similarly, we can get the same result for the female portion of the
population.

In [None]:
female_steps = steps[female_ind]
female_bmi = bmi[female_ind]

In [None]:
print('Female steps mean is {mean}.'.format(mean=female_steps.mean()))

In [None]:
print('Female BMI mean is {mean}.'.format(mean=female_bmi.mean()))

Interestingly, the female BMI average is slightly higher than the male
BMI average, while the number of steps in the male group is higher than
that in the female group. Perhaps the steps and the BMI are
anti-correlated: the more steps, the lower the BMI.

SciPy provides a statistics package, `scipy.stats`. We’ll import its
`pearsonr` function so that we can try to understand the correlation
between `steps` and `bmi`.

In [None]:
from scipy.stats import pearsonr

In [None]:
corr, _ = pearsonr(steps, bmi)
print("Pearson's overall correlation: {corr}".format(corr=corr))

In [None]:
male_corr, _ = pearsonr(male_steps, male_bmi)
print("Pearson's correlation for males: {corr}".format(corr=male_corr))

In [None]:
female_corr, _ = pearsonr(female_steps, female_bmi)
print("Pearson's correlation for females: {corr}".format(corr=female_corr))
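
As a reminder of what `pearsonr` reports, the sketch below evaluates the
Pearson correlation coefficient directly from its definition: the
covariance of the two variables divided by the product of their standard
deviations. The arrays `a` and `b` are made-up values for illustration
(they are not the BMI steps data) and `pearson_by_hand` is a hypothetical
helper name. The result is compared against NumPy's `corrcoef`, which
should agree with the first value returned by `pearsonr`.

```python
import numpy as np

def pearson_by_hand(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centred = a - a.mean()
    b_centred = b - b.mean()
    numerator = (a_centred * b_centred).sum()
    denominator = np.sqrt((a_centred**2).sum() * (b_centred**2).sum())
    return numerator / denominator

# Made-up values for illustration only (not the BMI steps data).
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 1.9, 3.2, 3.8, 5.1])

by_hand = pearson_by_hand(a, b)
reference = np.corrcoef(a, b)[0, 1]
print("By hand: {corr}".format(corr=by_hand))
print("NumPy:   {corr}".format(corr=reference))
```

A single coefficient like this summarizes only the linear association
between the two variables, which is one reason to plot the data as well.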

In [None]:
import mlai.plot as plot
import mlai
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(X[male_ind, 0], X[male_ind, 1], 'g.',markersize=10)
_ = ax.plot(X[female_ind, 0], X[female_ind, 1], 'r.',markersize=10)
_ = ax.set_xlabel('steps', fontsize=20)
_ = ax.set_ylabel('BMI', fontsize=20)
xlim = (0, 15000)
ylim = (15, 32.5)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(filename='bmi-steps.svg',
                directory='./datasets',
                transparent=True)

## A Hypothesis as a Liability

This analysis is from an article titled “A Hypothesis as a Liability”
(Yanai and Lercher, 2020). They start their article with the following
quote from Hermann Hesse.

> ‘When someone seeks,’ said Siddhartha, ‘then it easily happens that
> his eyes see only the thing that he seeks, and he is able to find
> nothing, to take in nothing. \[…\] Seeking means: having a goal. But
> finding means: being free, being open, having no goal.’
>
> Hermann Hesse

Their idea is that having a hypothesis can constrain our thinking.
However, in answer to their paper Felin et al. (2021) argue that some
form of hypothesis is always necessary, suggesting that a hypothesis
*can* be a liability.

My view is captured in the introductory chapter to an edited volume on
computational systems biology that I worked on with Mark Girolami,
Magnus Rattray and Guido Sanguinetti.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/licsb-popper-quote.png" style="width:80%">

Figure: <i>Quote from Lawrence (2010) highlighting the importance of
interaction between data and hypothesis.</i>

Popper nicely captures the interaction between hypothesis and data by
relating it to the chicken and the egg. The important thing is that
these two co-evolve.

# What is Machine Learning?

What is machine learning? At its most basic level machine learning is a
combination of

$$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$$

where *data* is our observations. They can be actively or passively
acquired (meta-data). The *model* contains our assumptions, based on
previous experience. That experience can be other data, it can come from
transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our inductive
biases. The *prediction* is an action to be taken or a categorization or
a quality score. The reason that machine learning has become a mainstay
of artificial intelligence is the importance of predictions in
artificial intelligence. The data and the model are combined through
computation.

In practice we normally perform machine learning using two functions. To
combine data with a model we typically make use of:

**a prediction function**: it is used to make the predictions. It
includes our beliefs about the regularities of the universe, our
assumptions about how the world works, e.g., smoothness, spatial
similarities, temporal similarities.

**an objective function**: it defines the ‘cost’ of misprediction.
Typically, it includes knowledge about the world’s generating processes
(probabilistic objectives) or the costs we pay for mispredictions
(empirical risk minimization).

The combination of data and model through the prediction function and
the objective function leads to a *learning algorithm*. The class of
prediction functions and objective functions we can make use of is
restricted by the algorithms they lead to. If the prediction function or
the objective function are too complex, then it can be difficult to find
an appropriate learning algorithm. Much of the academic field of machine
learning is the quest for new learning algorithms that allow us to bring
different types of models and data together.
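
To make the roles of the two functions concrete, here is a minimal
sketch under simple assumptions: a linear prediction function, a sum of
squared errors objective, and gradient descent as the resulting learning
algorithm. The function names, the synthetic data and the learning rate
are all illustrative choices, not something prescribed by the text.

```python
import numpy as np

def prediction_function(x, w, b):
    """Linear prediction: assumes the output varies linearly with the input."""
    return w * x + b

def objective_function(x, y, w, b):
    """Sum of squared errors: the 'cost' of misprediction."""
    return ((y - prediction_function(x, w, b)) ** 2).sum()

def learning_algorithm(x, y, learn_rate=0.01, num_iters=2000):
    """Gradient descent on the objective with respect to w and b."""
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        residual = y - prediction_function(x, w, b)
        grad_w = -2.0 * (residual * x).sum()  # derivative of objective w.r.t. w
        grad_b = -2.0 * residual.sum()        # derivative of objective w.r.t. b
        w -= learn_rate * grad_w
        b -= learn_rate * grad_b
    return w, b

# Synthetic data: a known line (slope 2, offset -0.5) plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x - 0.5 + 0.05 * rng.standard_normal(x.shape)

w, b = learning_algorithm(x, y)
print("Learned slope {w:.3f} and offset {b:.3f}".format(w=w, b=b))
```

Here the objective is simple enough that gradient descent works well. A
more complex prediction function or objective would change the update
rule and might call for a different learning algorithm altogether.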

A useful reference for state of the art in machine learning is the UK
Royal Society Report, [Machine Learning: Power and Promise of Computers
that Learn by
Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).

You can also check my blog post on [What is Machine
Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning).

## Artificial Intelligence and Data Science

Machine learning technologies have been the driver of two related, but
distinct disciplines. The first is *data science*. Data science is an
emerging field that arises from the fact that we now collect so much
data by happenstance, rather than by *experimental design*. Classical
statistics is the science of drawing conclusions from data, and to do so
statistical experiments are carefully designed. In the modern era we
collect so much data that there’s a desire to draw inferences directly
from the data.

As well as machine learning, the field of data science draws from
statistics, cloud computing, data storage (e.g. streaming data),
visualization and data mining.

In contrast, artificial intelligence technologies typically focus on
emulating some form of human behaviour, such as understanding an image,
or some speech, or translating text from one form to another. The recent
advances in artificial intelligence have come from machine learning
providing the automation. But in contrast to data science, in artificial
intelligence the data is normally collected with the specific task in
mind. In this sense it has strong relations to classical statistics.

Classically artificial intelligence worried more about *logic* and
*planning* and focused less on data driven decision making. Modern
machine learning owes more to the field of *Cybernetics* (Wiener, 1948)
than artificial intelligence. Related fields include *robotics*, *speech
recognition*, *language understanding* and *computer vision*.

There are strong overlaps between the fields: the wide availability of
data by happenstance makes it easier to collect data for designing AI
systems. These relations are coming through the wide availability of
sensing technologies that are interconnected by cellular networks, WiFi
and the internet. This phenomenon is sometimes known as the *Internet of
Things*, but this feels like a dangerous misnomer. We must never forget
that we are interconnecting people, not things.

<center>

Convention for the Protection of *Individuals* with regard to Automatic
Processing of *Personal Data* (1981/1/28)

</center>

# Evolved Relationship with Information

The high bandwidth of computers has resulted in a close relationship
between the computer and data. Large amounts of information can flow
between the two. The degree to which the computer is mediating our
relationship with data means that we should consider it an intermediary.

Originally our low bandwidth relationship with data was affected by two
characteristics. Firstly, our tendency to over-interpret, driven by our
need to extract as much knowledge from our low bandwidth information
channel as possible. Secondly, our improved understanding of the domain
of *mathematical* statistics and of how our cognitive biases can mislead
us.

With this new setup there is a potential for assimilating far more
information via the computer, but the computer can present this to us in
various ways. If its motives are not aligned with ours then it can
misrepresent the information. This needn’t be nefarious; it can simply
be because the computer is pursuing a different objective from us. For
example, if the computer is aiming to maximize our interaction time,
that may be a different objective from ours, which may be to summarize
information in a representative manner in the *shortest* possible length
of time.

For example, for me, it was a common experience to pick up my telephone
with the intention of checking when my next appointment was, but to soon
find myself distracted by another application on the phone and end up
reading something on the internet. By the time I’d finished reading, I
would often have forgotten the reason I picked up my phone in the first
place.

There are great benefits to be had from the huge amount of information
we can unlock from this evolved relationship between us and data. In
biology, large scale data sharing has been driven by a revolution in
genomic, transcriptomic and epigenomic measurement. The improved
inferences that can be drawn through summarizing data by computer have
fundamentally changed the nature of biological science. Now this
phenomenon is also influencing us in our daily lives, as data measured
by *happenstance* is increasingly used to characterize us.

Better mediation of this flow requires a better understanding of
human-computer interaction. This in turn involves understanding our own
intelligence better, what its cognitive biases are and how these might
mislead us.

For further thoughts see this Guardian article on [marketing in the
internet
era](https://www.theguardian.com/media-network/2015/jul/23/data-driven-economy-marketing)
from 2015.

You can also check my blog post on [System
Zero](http://inverseprobability.com/2015/12/04/what-kind-of-ai). This
was also written in 2015.

## New Flow of Information

Classically the field of statistics focused on mediating the
relationship between data and the human. Our limited bandwidth of
communication means we tend to over-interpret the limited information
that we are given; in the extreme we assign motives and desires to
inanimate objects (a process known as anthropomorphizing). Much of
mathematical statistics was developed to help temper this tendency and
understand when we are justified in drawing conclusions from data.

<img src="https://inverseprobability.com/talks/./slides/diagrams//data-science/new-flow-of-information003.svg" class="" width="70%" style="vertical-align:middle;">

Figure: <i>The trinity of human, data, and computer, highlighting the
modern phenomenon. The communication channel between computer and data
now has an extremely high bandwidth. The channels between human and
computer and between data and human are narrow. There is a new direction
of information flow: information is reaching us mediated by the
computer. The focus on classical statistics reflected the importance of
the direct communication between human and data. The modern challenges
of data science emerge when that relationship is being mediated by the
machine.</i>

Data science brings new challenges. In particular, there is a very large
bandwidth connection between the machine and data. This means that our
relationship with data is now commonly being mediated by the machine.
This is true both in the acquisition of new data, which now happens by
happenstance rather than with purpose, and in the interpretation of that
data, where we are increasingly relying on machines to summarize what
the data contains. This is leading to the emerging field of data
science, which must not only deal with the same challenges that
mathematical statistics faced in tempering our tendency to
over-interpret data but must also deal with the possibility that the
machine has either inadvertently or maliciously misrepresented the
underlying data.

## Data Science Africa

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science-africa-logo.png" style="width:30%">

Figure: <i>Data Science Africa <http://datascienceafrica.org> is a
ground up initiative for capacity building around data science, machine
learning and artificial intelligence on the African continent.</i>

<img src="https://inverseprobability.com/talks/./slides/diagrams//dsa/dsa-events-october-2021.svg" class="" width="60%" style="vertical-align:middle;">

Figure: <i>Data Science Africa meetings held up to October 2021.</i>

Data Science Africa is a bottom up initiative for capacity building in
data science, machine learning and artificial intelligence on the
African continent.

As of October 2021 there have been five workshops and five schools,
located in Nyeri, Kenya (twice); Kampala, Uganda; Arusha, Tanzania;
Abuja, Nigeria; Addis Ababa, Ethiopia; Accra, Ghana; Kampala, Uganda and
Kimberley, South Africa.

The main notion is *end-to-end* data science. For example, going from
data collection in the farmer’s field to decision making in the Ministry
of Agriculture. Or going from malaria disease counts in health centers
to medicine distribution.

The philosophy is laid out in Lawrence (2015). The key idea is that the
modern *information infrastructure* presents new solutions to old
problems. Modes of development change because less capital investment is
required to take advantage of this infrastructure. The philosophy is
that local capacity building is the right way to leverage these
challenges in addressing data science problems in the African context.

Data Science Africa is now a non-governmental organization registered in
Kenya. The organising board of the meeting is entirely made up of
scientists and academics based on the African continent.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/africa-benefit-data-revolution.png" style="width:70%">

Figure: <i>The lack of existing physical infrastructure on the African
continent makes it a particularly interesting environment for deploying
solutions based on the *information infrastructure*. The idea is
explored more in this Guardian op-ed on [How Africa can benefit from the
data
revolution](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information).</i>

Guardian article on [Data Science
Africa](https://www.theguardian.com/media-network/2015/aug/25/africa-benefit-data-science-information)

## Example: Prediction of Malaria Incidence in Uganda

<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip0">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Martin Mubangizi

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/martin-mubangizi.png" clip-path="url(#clip0)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip1">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Ricardo Andrade Pacheco

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/ricardo-andrade-pacheco.png" clip-path="url(#clip1)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip2">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

John Quinn

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/john-quinn.jpg" clip-path="url(#clip2)"/>

</svg>

As an example of using Gaussian process models within the full pipeline
from data to decision, we’ll consider the prediction of malaria
incidence in Uganda. For the purposes of this study malaria reports come
in two forms: HMIS reports from health centres and Sentinel data, which
is curated by the WHO. There are few sentinel sites but many HMIS sites.

The work is from Ricardo Andrade Pacheco’s PhD thesis, completed in
collaboration with John Quinn and Martin Mubangizi (Andrade-Pacheco et
al., 2014; Mubangizi et al., 2014). John and Martin were initially from
the AI-DEV group at the University of Makerere in Kampala and were later
based at UN Global Pulse in Kampala. You can see the work summarized on
the UN Global Pulse [disease outbreaks project site
here](https://diseaseoutbreaks.unglobalpulse.net/uganda/).

Malaria data is spatial data. Uganda is split into districts, and health
reports can be found for each district. This suggests that models such
as conditional random fields could be used for spatial modelling, but
there are two complexities with this. First of all, occasionally
districts split into two. Secondly, sentinel sites are a specific
location within a district, such as Nagongera which is a sentinel site
based in the Tororo district.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/uganda-districts-2006.png" style="width:50%">

Figure: <i>Ugandan districts. Data SRTM/NASA from
<https://dds.cr.usgs.gov/srtm/version2_1>.</i>

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

The common standard for collecting health data on the African continent
is through health management information systems (HMIS). However, this
data suffers from missing values (Gething et al., 2006), and diagnoses
of diseases like typhoid and malaria may be confounded.

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Tororo_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Tororo district, where the sentinel site, Nagongera, is
located.</i>

[World Health Organization Sentinel Surveillance
systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)
are set up “when high-quality data are needed about a particular disease
that cannot be obtained through a passive system.” Several sentinel
sites give accurate assessment of malaria disease levels in Uganda,
including a site in Nagongera.

<img class="negate" src="https://inverseprobability.com/talks/./slides/diagrams//health/sentinel_nagongera.png" style="width:100%">

Figure: <i>Sentinel and HMIS data along with rainfall and temperature
for the Nagongera sentinel station in the Tororo district.</i>

In collaboration with the AI Research Group at Makerere we chose to
investigate whether Gaussian process models could be used to assimilate
information from these two different sources of disease information.
Further, we were interested in whether local information on rainfall and
temperature could be used to improve malaria estimates.

The aim of the project was to use WHO Sentinel sites, alongside rainfall
and temperature, to improve predictions from HMIS data of levels of
malaria.

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Mubende_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Mubende District.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/mubende.png" style="width:80%">

Figure: <i>Prediction of malaria incidence in Mubende.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//gpss/1157497_513423392066576_1845599035_n.jpg" style="width:80%">

Figure: <i>The project arose out of the Gaussian process summer school
held at Makerere in Kampala in 2013. The school led, in turn, to the
Data Science Africa initiative.</i>

## Early Warning Systems

<img src="https://inverseprobability.com/talks/./slides/diagrams//health/Kabarole_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Kabarole district in Uganda.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/kabarole.gif" style="width:100%">

Figure: <i>Estimate of the current disease situation in the Kabarole
district over time. The estimate is constructed with a Gaussian process
with an additive covariance function.</i>

The figure shows a health monitoring system for the Kabarole district.
Here we have fitted the reports with a Gaussian process with an additive
covariance function. It has two components: one is a long time scale
component (in red above), the other is a short time scale component (in
blue).
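
Written out, an additive covariance of this kind has the form below. The
exponentiated quadratic form of each component and the symbols
$\alpha_c$ and $\ell_c$ are assumptions made for illustration; the text
only states that there is one long and one short time scale component.

$$
k(t, t') = k_\text{long}(t, t') + k_\text{short}(t, t'), \qquad
k_c(t, t') = \alpha_c \exp\left(-\frac{(t - t')^2}{2\ell_c^2}\right), \qquad
\ell_\text{long} \gg \ell_\text{short}.
$$

The sum of two independent Gaussian processes is itself a Gaussian
process whose covariance is the sum of the two covariances, which is why
the fit can be separated into a slowly varying trend and a faster
varying signal.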

Monitoring proceeds by considering two aspects of the curve. Is the blue
line (the short term report signal) above the red (which represents the
long term trend)? If so we have higher than expected reports. If this is
the case *and* the gradient is still positive (i.e. reports are going
up) we encode this with a *red* color. If it is the case and the
gradient of the blue line is negative (i.e. reports are going down) we
encode this with an *amber* color. Conversely, if the blue line is below
the red *and* decreasing, we color *green*. On the other hand if it is
below red but increasing, we color *yellow*.

This gives us an early warning system for disease. Red is a bad
situation getting worse; amber is bad but improving. Green is good and
getting better; yellow is good but degrading.

Finally, there is a gray region which represents when the scale of the
effect is small.
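
The color rules above can be sketched as a small function. This is a
minimal illustration rather than the project's code: the function name,
the `min_effect` threshold used for the gray region, and the treatment
of a zero gradient as "not increasing" are all assumptions.

```python
def warning_color(short_term, long_term, gradient, min_effect=0.1):
    """Map the state of the two fitted components to a warning color.

    short_term: value of the short time scale component (blue line)
    long_term:  value of the long time scale component (red line)
    gradient:   gradient of the short time scale component
    min_effect: hypothetical threshold below which the effect is "small"
    """
    difference = short_term - long_term
    # Gray region: the scale of the effect is small (threshold is assumed).
    if abs(difference) < min_effect:
        return "gray"
    if difference > 0:
        # Higher than expected reports: still rising (red) or falling (amber).
        return "red" if gradient > 0 else "amber"
    # Lower than expected reports: rising (yellow) or falling (green).
    return "yellow" if gradient > 0 else "green"


print(warning_color(2.0, 1.0, 0.5))   # red
print(warning_color(1.0, 2.0, -0.5))  # green
```

Applying such a function to each district's latest estimate would give
one color per district.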

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//health/monitor.gif" style="width:50%">

Figure: <i>The map of Ugandan districts with an overview of the Malaria
situation in each district.</i>

These colors can now be observed directly on a spatial map of the
districts to give an immediate impression of the current status of the
disease across the country.

## Challenges


The field of data science is rapidly evolving. Different practitioners
from different domains have their own perspectives. We identify three
broad challenges that are emerging, challenges which have not been
addressed in the traditional sub-domains of data science. The challenges
have social implications but require technological advance for their
solutions.

1.  Paradoxes of the Data Society
2.  Quantifying the Value of Data
3.  Privacy, loss of control, marginalization

You can also check this blog post on [Three Data Science
Challenges](http://inverseprobability.com/2016/07/01/data-science-challenges).

## The Big Data Paradox


The big data paradox is the modern phenomenon of “as we collect more
data, we understand less.” It is emerging in several domains: political
polling, characterization of patients for trials data, and monitoring
Twitter for political sentiment.

I like to think of the phenomenon as relating to the notion of “can’t
see the wood for the trees.” Classical statistics, with randomized
controlled trials, improved society’s understanding of data. It improved
our ability to monitor the forest, to consider population health, voting
patterns etc. It is critically dependent on active approaches to data
collection that deal with confounders. This data collection can be very
expensive.

In business today, it is still the gold standard: A/B tests are used to
understand the effect of an intervention on revenue or customer capture
or supply chain costs.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//Grib_skov.jpg" style="width:50%">

Figure: <i>New beech leaves growing in the Gribskov Forest in the
northern part of Sealand, Denmark. Photo from wikimedia commons by
Malene Thyssen, <http://commons.wikimedia.org/wiki/User:Malene>.</i>

The new phenomenon is *happenstance data*. Data that is not actively
collected with a question in mind. As a result, it can mislead us. For
example, if we assume the politics of active users of twitter is
reflective of the wider population’s politics, then we may be misled.

However, this happenstance data often allows us to characterise a
particular individual to a high degree of accuracy. Classical statistics
was all about the forest, but big data can often become about the
individual tree. As a result we are misled about the situation.

The phenomenon is more dangerous because our perception is that we are
characterizing the wider scenario with ever increasing accuracy, whereas
we are just becoming distracted by detail that may or may not be
pertinent to the wider situation.

This is related to our limited bandwidth as humans, and the ease with
which we are distracted by detail: a data-inattention cognitive bias.

## Number Theatre


Unfortunately, we don’t always have time to wait for this process to
converge to an answer we can all rely on before a decision is required.

Not only can we be misled by data before a decision is made, but
sometimes we can be misled by data to justify the making of a decision.
David Spiegelhalter refers to the phenomenon of “Number Theatre” in a
conversation with Andrew Marr from May 2020 on the presentation of data.

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('9388XmWIHXg')

Figure: <i>Professor Sir David Spiegelhalter on Andrew Marr on 10th May
2020 speaking about some of the challenges around data, data
presentation, and decision making in a pandemic. David mentions number
theatre at 9 minutes 10 seconds.</i>


## Data Theatre

Data Theatre exploits data inattention bias to present a particular view
on events that may misrepresent them through selective presentation.
Statisticians are one of the few groups that are trained with a
sufficient degree of data skepticism. But data theatre can also be
combatted by ensuring there are domain experts present, and that they
can speak freely.

<img src="https://inverseprobability.com/talks/./slides/diagrams//business/data-theatre001.svg" class="" width="60%" style="vertical-align:middle;">

Figure: <i>The phenomenon of number theatre or *data theatre* was
described by David Spiegelhalter and is nicely summarized by Martin
Robbins in this Substack article
<https://martinrobbins.substack.com/p/data-theatre-why-the-digital-dashboards>.</i>

## Big Model Paradox


The big data paradox has a sister: the big model paradox. As we build
more and more complex models, we start believing that we have a
high-fidelity representation of reality. But the complexity of reality
is way beyond our feeble imaginings. So we end up with a highly complex
model, but one that falls well short in terms of reflecting reality. The
complexity of the model means that it moves beyond our understanding.

# Quantifying the Value of Data


The situation is reminiscent of a thirsty castaway, set adrift. There is
a sea of data, but it is not fit to drink. We need some form of data
desalination before it can be consumed. But like real desalination, this
is a non-trivial process, particularly if we want to achieve it at
scale.

There’s a sea of data, but most of it is undrinkable.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//sea-water-ocean-waves.jpg" style="width:50%">

Figure: <i>The abundance of uncurated data is reminiscent of the
abundance of undrinkable water for those cast adrift at sea.</i>

We require data-desalination before it can be consumed!

I spoke about the challenges in data science at the NIPS 2016 Workshop
on Machine Learning for Health. NIPS mainly focuses on machine learning
methodologies, and many of the speakers were doing so. But before my
talk, I listened to some of the other speakers talk about the challenges
they had with data preparation.

## African Data Sharing Covid-19


<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip3">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Morine Amutorine

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/morine-amutorine.png" clip-path="url(#clip3)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:10%">

<defs> <clipPath id="clip4">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Jessica Montgomery

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/jessica-montgomery.jpg" clip-path="url(#clip4)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:10%">

<defs> <clipPath id="clip5">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Victor Ohuruogu

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/victor-ohuruogu.jpg" clip-path="url(#clip5)"/>

</svg>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/increasing-data-sharing-from-africas-covid-19-response.png" style="width:70%">

Figure: <i>Blog post, <https://www.datascienceafrica.org/dsablog/>, by
Morine Amutorine and Jessica Montgomery summarising some of the issues
around data sharing in the Covid-19 response.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/morine-slides-areas-where-data-has-been-impactful-covid19.png" style="width:90%">

Figure: <i>Areas where data has been impactful summarised from Morine
Amutorine’s survey work on data sharing in Africa during the Covid-19
response.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/morine-slides-four-areas-of-opportunity.png" style="width:90%">

Figure: <i>Four areas of opportunity to learn from identified by Morine
Amutorine in her survey work on Africa’s data driven Covid-19
response.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/morine-slides-areas-for-action.png" style="width:90%">

Figure: <i>Areas for action arising from Morine Amutorine’s survey work
on Africa’s data driven Covid-19 response.</i>

## Morine’s Areas for Action

-   Building the capacity of organisations in the public and private
    sectors to reuse and act on data through investments in training,
    education, and reskilling of relevant authorities.

-   Establishing data stewards in organisations who can coordinate and
    collaborate with counterparts on using data in the public’s interest
    and acting on it.

-   Technical skills and expertise: researchers (e.g. data scientists)
    to develop and deploy useful, privacy-preserving technologies.

-   Developing and clarifying governance frameworks to enable the
    trusted, transparent, and accountable reuse of privately held data
    in the public interest under a clear regulatory framework.

-   Data Collaboratives: a new form of collaboration, beyond the
    public-private partnership model, in which participants from
    different sectors, in particular companies, exchange their data to
    create public value.

## Privacy, Loss of Control and Marginalization


Society is becoming harder to monitor, but the individual is becoming
easier to monitor. Social media monitoring for ‘hate speech’ can easily
be turned to monitoring of political dissent. Marketing becomes more
sinister when the target of the marketing is so well understood and the
digital environment of the target is so well controlled.

## Case Study: Text Mining for Misinformation


<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip6">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Joyce Nakatumba-Nabende

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/joyce-nabende.jpg" clip-path="url(#clip6)"/>

</svg>

We consider a case study from Joyce Nabende, Head of the [Makerere AI
Lab](https://air.ug/). This case study is based on a presentation given
by Joyce to the DSA Research Grants, “Project Progress” session on 20th
August 2021.

The aim of the case study is to map some of the approaches used by Joyce
onto the Access, Assess, Address paradigm.

The aim of the project is to develop tools for automated misinformation
detection on web and mobile based social media platforms, targeting
social media posts that are invalid, inaccurate, or potentially harmful.
This is set within the context of the Covid-19 pandemic within Uganda.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-uganda-social-media-killing.png" style="width:60%">

Figure: <i>Misinformation through media has been a challenge for as long
as we’ve been communicating. Social media misinformation is a particular
challenge due to the number of possible sources, the scale and speed
with which it can propagate. Slide from Joyce Nabende’s
presentation.</i>

In common with many applications of data science, and in line with
traditional statistics, the question here comes first, at the beginning
of the data collection. But access to the data is made easier by the
fact that the data exists in the digital space already. There are APIs
for collecting data from Facebook and Twitter.

The focus here will be trying to understand which parts of this data
collection process might be reusable for others. The aim is to separate
those reusable parts from aspects that are specific to the question.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/napoleoncat-social-media-statistics-facebook-users-in-uganda_2021_06.png" style="width:70%">

Figure: <i>Social media is widespread in Uganda, perhaps largely due to
widespread availability of mobile phone access.</i>

As with any data science problem, it’s vital that domain knowledge is
included in the analysis of the problem. To set context, the figure
above shows how widespread use of social media is in Uganda for
different age groups. The total population of Uganda is around 47
million.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-data-science-objective.png" style="width:90%">

Figure: <i>The objective of the project is to track misinformation and
understand perceptions of the Ugandan Government’s COVID-19 transmission
mitigation strategies.</i>

One particular challenge for this project is dealing with a data set
with multiple languages. In Uganda, people don’t just communicate in
English, but they will
[code-switch](https://en.wikipedia.org/wiki/Code-switching) or
communicate purely in, e.g., Luganda. Tools and resources for dealing
with code-switching or the Luganda language in NLP are much less common
than tools for dealing with high resource languages (e.g. German,
English, French, Spanish, Mandarin). See Magueresse et al. (2020) for a
review of NLP in low resource languages. Multilingual data sets bring
their own problems (Aman Ullah et al., 2020).

The Luganda language is the most widely spoken indigenous language in
Uganda with more than seven million speakers. By definition, a low
resource language has fewer tools available for data annotation and
augmentation, e.g. part of speech taggers.

## Data Access

The social media data was collected from a set of pages (media
institutions, ministry of health, media personalities, top
Twitter/Facebook users from Uganda). All data was then filtered using
the keywords ‘ssenyiga,’ ‘kolona,’ ‘corona,’ ‘virus,’ ‘obulwadde,’
‘corona,’ ‘covid,’ ‘abalwadde,’ ‘ekirwadde,’ ‘akawuka,’ ‘staysafeug,’
‘stayhome,’ ‘tonsemberera,’ ‘tokwatakudereva,’ ‘vaccine’ to select
Covid-19 related posts. Very short Facebook posts were also removed.
Data was collected in two phases, from March 2020 to March 2021 and then
from June 2021 to August 2021. The raw data consisted of 15,354 posts
from Twitter and 430,075 from Facebook.
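
A keyword filter of this kind might look like the sketch below. The
keyword list is taken from the text; everything else is an assumption
made for illustration, including case-insensitive substring matching,
the hypothetical `MIN_WORDS` cut-off for "very short" posts (applied
here to every post rather than only Facebook posts), and the invented
example posts.

```python
# Keywords quoted in the text; the repeated 'corona' collapses in the set.
KEYWORDS = {
    "ssenyiga", "kolona", "corona", "virus", "obulwadde", "covid",
    "abalwadde", "ekirwadde", "akawuka", "staysafeug", "stayhome",
    "tonsemberera", "tokwatakudereva", "vaccine",
}

MIN_WORDS = 3  # hypothetical threshold for a "very short" post


def is_covid_related(text):
    """Return True if any keyword occurs in the lower-cased text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def filter_posts(posts, min_words=MIN_WORDS):
    """Keep posts that mention a keyword and are not very short."""
    return [
        post for post in posts
        if is_covid_related(post) and len(post.split()) >= min_words
    ]


# Invented posts, for illustration only.
example_posts = [
    "Please #StaySafeUG and wash your hands",
    "Who won the dance show last night?",
    "covid",
]
print(filter_posts(example_posts))
```

Substring matching also catches hashtags such as `#StaySafeUG`, but it
can let through unrelated posts that happen to contain a keyword, which
is one reason the assessment step that follows matters.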

Note that in this case, knowledge of the question has been used in
accessing the data. The context of the data is Uganda and the focus is
Covid-19. That focus is driven by the pandemic. However, as we see when
we get to data assessment, there is still an amount of reusable work
that could/should be automated.

## Data Assessment

After collecting data, the initial assessment was performed to understand
the data, uncover patterns and gain insights. Here various
visualisations can be used to find any unexpected factors in the data.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-word-cloud-twitter.png" style="width:60%">

Figure: <i>Word cloud from the Twitter data collected through the
filtering.</i>

In the case of the Uganda data set, Joyce found that mixed in with the
Covid-19 data were topics focussed on popular Ugandan TV shows and the
Ugandan election.
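
A word cloud is driven by term counts, so a quick way to look for
unexpected topics of this kind is to list the most frequent terms. The
sketch below is an assumption-laden illustration: the function name, the
tiny stop word list and the example posts are all invented, and it is
not the tooling used in the project.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "to", "of", "and", "in", "is", "for", "on", "at"}


def top_terms(posts, n=10):
    """Return the n most frequent non-stop-word terms across the posts."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z0-9']+", post.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)


# Invented posts, for illustration only.
example_posts = [
    "Covid vaccine arrives",
    "Vaccine queue at the hospital",
    "Dance show final tonight",
]
print(top_terms(example_posts, n=3))
```

Terms that are frequent but unrelated to the question are a prompt to
revisit the filtering with someone who knows the local context.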

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-word-cloud-facebook.png" style="width:60%">

Figure: <i>Word cloud from the Facebook data collected through the
filtering.</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-lda-topics.png" style="width:90%">

Figure: <i>LDA topics and topic distance maps. Interspersed with the
Covid-19 topics are topics associated with television dance shows,
elections, and the president, showing the importance of having domain
knowledge.</i>

Topic modeling highlights the different subjects present in the data,
and how they interrelate.

-   Annotation attributes:
    1.  Data source \[Facebook, Twitter\]
    2.  Language \[English, Luganda, and codemixed\]
    3.  Aspect \[truck drivers, hospitals, vaccine, cases, SOPs, NPIs,
        Testing, Border, Covid19_Impact, Presidential address, death,
        elections and Covid19\]
    4.  Sentiment \[positive, negative and neutral\]
    5.  Misinformation \[Not Fake, Fake, Partially Fake, and Others\]
-   As part of quality assurance, the data was reviewed by an
    independent team to ensure that the annotation guidelines were
    followed.

Annotation was carried out by seven annotators who could understand both
English and Luganda. The data was labeled with the
[Doccano](https://github.com/doccano/doccano) text annotation tool.
Annotations included the data source, the language, the aspect, the
sentiment and the misinformation status.


Table: Portion of data that was annotated.

|                          | Twitter Data | Facebook Data |
|--------------------------|--------------|---------------|
| Initial dataset          | 15,354       | 430,075       |
| Dataset after annotation | 3,527        | 4,479         |


[Cohen’s kappa](https://en.wikipedia.org/wiki/Cohen%27s_kappa) was used
to measure inter-annotator agreement.

Table: Cohen’s kappa agreement scores for the data.

| Attribute      | Cohen’s kappa |
|----------------|---------------|
| Language       | 0.89          |
| Aspect         | 0.69          |
| Sentiment      | 0.73          |
| Misinformation | 0.74          |
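
Cohen’s kappa compares the observed agreement between two annotators,
$p_o$, with the agreement expected by chance from their label
frequencies, $p_e$, as $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch
below computes it for one pair of annotators on invented labels. How the
scores in the table were combined across the seven annotators is not
stated in the text, so this is only an illustration of the statistic
itself.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels, for illustration only.
annotator_1 = ["Fake", "Fake", "Not Fake", "Not Fake"]
annotator_2 = ["Fake", "Not Fake", "Not Fake", "Not Fake"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In this toy example the annotators agree on three of four items
($p_o = 0.75$) and chance agreement is $p_e = 0.5$, giving
$\kappa = 0.5$.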


<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/joyce-nabende-data-annotation-example.png" style="width:70%">

Figure: <i>Example of data annotation for sentiment and misinformation
from the data set.</i>

The idea of the analysis is to bring this information together for
sentiment and misinformation analysis in a [dashboard for Covid-19 in
Uganda](https://dsa-uganda.herokuapp.com/dashboard/).

## Personal Data Trusts


The machine learning solutions we are dependent on to drive automated
decision making are dependent on data. But with regard to personal data
there are important issues of privacy. Data sharing brings benefits, but
also exposes our digital selves, from the use of social media data for
targeted advertising to influence us, to the use of genetic data to
identify criminals or natural family members. Control of our virtual
selves maps on to control of our actual selves.

The feudal system that is implied by current data protection legislation
has significant power asymmetries at its heart, in that the data
controller has a duty of care over the data subject, but the data
subject may only discover failings in that duty of care when it’s too
late. Data controllers also may have conflicting motivations, and often
their primary motivation is *not* towards the data subject; the data
subject is merely one consideration in their wider agenda.

[Personal Data
Trusts](https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy)
(Delacroix and Lawrence, 2018; Edwards, 2004; Lawrence, 2016) are a
potential solution to this problem. They are inspired by *land
societies* that formed in the 19th century to bring democratic
representation to the growing middle classes. A land society was a
mutual organization where resources were pooled for the common good.

A Personal Data Trust would be a legal entity where the trustees’
responsibility was entirely to the members of the trust. So the
motivation of the data-controllers is aligned only with the
data-subjects. How data is handled would be subject to the terms under
which the trust was convened. The success of an individual trust would
be contingent on it satisfying its members with appropriate balancing of
individual privacy with the benefits of data sharing.

Formation of Data Trusts became the number one recommendation of the
Hall-Pesenti report on AI, but unfortunately, the term was confounded
with more general approaches to data sharing that don’t necessarily
involve fiduciary responsibilities or personal data rights. It seems
clear that we need to better characterize the data sharing landscape as
well as propose mechanisms for tackling specific issues in data sharing.

It feels important to have a diversity of approaches, and yet it feels
important that any individual trust would be large enough to be taken
seriously in representing the views of its members in wider
negotiations.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/data-trusts.png" style="width:100%">

Figure: <i>For thoughts on data trusts see the Guardian article on [Data
Trusts](https://www.theguardian.com/media-network/2016/jun/03/data-trusts-privacy-fears-feudalism-democracy).</i>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//data-science/data-trusts-review.png" style="width:50%">

Figure: <i>Data Trusts were the first recommendation of the
<a href="https://www.out-law.com/en/articles/2017/october/review-calls-for-data-trusts-to-help-grow-artificial-intelligence-in-the-uk/" target="_blank">Hall-Pesenti
Report</a>. Unfortunately, since then the role of data trusts vs other
data sharing mechanisms in the UK has been somewhat confused.</i>

See the Guardian articles on [Digital
Oligarchies](https://www.theguardian.com/media-network/2015/mar/05/digital-oligarchy-algorithms-personal-data)
and [Information
Feudalism](https://www.theguardian.com/media-network/2015/nov/16/information-barons-threaten-autonomy-privacy-online).

## Data Trusts Initiative


The [Data Trusts Initiative](https://datatrusts.uk/), funded by the
Patrick J. McGovern Foundation, is supporting three pilot projects that
consider how bottom-up empowerment can redress the imbalance associated
with the digital oligarchy.

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//governance/data-trusts-initiative-project-page.png" style="width:60%">

Figure: <i>The Data Trusts Initiative
(<a href="https://datatrusts.uk/" target="_blank">http://datatrusts.uk</a>)
hosts blog posts helping build understanding of data trusts and supports
research and pilot projects.</i>

## Progress So Far

In its first 18 months of operation, the Initiative has:

-   Convened over 200 leading data ethics researchers and practitioners;

-   Funded 7 new research projects tackling knowledge gaps in data trust
    theory and practice;

-   Supported 3 real-world data trust pilot projects establishing new
    data stewardship mechanisms.

# The Art of Statistics


The best book I have found for teaching the skeptical sense of data that
underlies the statistician’s craft is David Spiegelhalter’s *Art of
Statistics*.

<center>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip7">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

David Spiegelhalter

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://inverseprobability.com/talks/./slides/diagrams//people/david-spiegelhalter.png" clip-path="url(#clip7)"/>

</svg>
</center>

<img class="" src="https://inverseprobability.com/talks/./slides/diagrams//books/the-art-of-statistics.jpg" style="width:40%">

Figure: <i>[The Art of Statistics by David
Spiegelhalter](https://www.amazon.co.uk/Art-Statistics-Learning-Pelican-Books-ebook/dp/B07HQDJD99)
is an excellent read on the pitfalls of data interpretation.</i>

David’s book (Spiegelhalter, 2019) brings important examples from
statistics to life in an intelligent and entertaining way. It is highly
readable and gives an opportunity to fast-track towards the important
skill of data-skepticism that is the mark of a professional
statistician.

## Thanks!

For more information on these subjects and more you might want to check
the following resources.

-   twitter: [@lawrennd](https://twitter.com/lawrennd)
-   podcast: [The Talking Machines](http://thetalkingmachines.com)
-   newspaper: [Guardian Profile
    Page](http://www.theguardian.com/profile/neil-lawrence)
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)

## References

Aman Ullah, M., Azman, N., Mohd Zaki, Z., Monirul Islam, Md., 2020.
Dataset creation from multilingual data of social media: Challenges and
consequences, in: 2020 IEEE International Women in Engineering (WIE)
Conference on Electrical and Computer Engineering (WIECON-ECE). pp.
288–291. <https://doi.org/10.1109/WIECON-ECE52138.2020.9398002>

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014.
Consistent mapping of government malaria records across a changing
territory delimitation. Malaria Journal 13.
<https://doi.org/10.1186/1475-2875-13-S1-P5>

Delacroix, S., Lawrence, N.D., 2018. Disturbing the ‘one size fits all’
approach to data governance: Bottom-up data trusts. SSRN.
<https://doi.org/10.2139/ssrn.3265315>

Edwards, L., 2004. The problem with privacy. International Review of Law
Computers & Technology 18, 263–294.

Felin, T., Koenderink, J., Krueger, J.I., Noble, D., Ellis, G.F.R.,
2021. The data-hypothesis relationship. Genome Biology 22.
<https://doi.org/10.1186/s13059-021-02276-4>

Gething, P.W., Noor, A.M., Gikandi, P.W., Ogara, E.A.A., Hay, S.I.,
Nixon, M.S., Snow, R.W., Atkinson, P.M., 2006. Improving imperfect data
from health management information systems in Africa using space–time
geostatistics. PLoS Medicine 3.
<https://doi.org/10.1371/journal.pmed.0030271>

Lawrence, N.D., 2016. Data trusts could allay our privacy fears.

Lawrence, N.D., 2015. How Africa can benefit from the data revolution.

Lawrence, N.D., 2010. Introduction to learning and inference in
computational systems biology.

Magueresse, A., Carles, V., Heetderks, E., 2020. Low-resource languages:
A review of past work and future challenges. CoRR.

Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence,
N.D., 2014. Malaria surveillance with multiple data sources using
Gaussian process models, in: 1st International Conference on the Use of
Mobile ICT in Africa.

Simons, D.J., Chabris, C.F., 1999. Gorillas in our midst: Sustained
inattentional blindness for dynamic events. Perception 28, 1059–1074.
<https://doi.org/10.1068/p281059>

Spiegelhalter, D.J., 2019. The art of statistics. Pelican.

Wiener, N., 1948. Cybernetics: Control and communication in the animal
and the machine. MIT Press, Cambridge, MA.

Yanai, I., Lercher, M., 2020. A hypothesis is a liability. Genome
Biology 21.