Digital Skills

Introduction

This is a hands-on course that integrates introductory programming, statistics and data science. Throughout this course, we will formulate scientific hypotheses, design experiments, and collect and analyse data visually and through formal models. You are expected to supplement this course with homework, self-study and other courses in descriptive statistics and python programming, but no prior knowledge is assumed. Formal concepts will be introduced through class activities and examples.

The course will focus neither on statistical theory nor on programming. We will only lightly build those skills within the course. Our main aim will be to build scientific and statistical intuition through practical work. However, to achieve this you need to be able to independently pick up theoretical and programming skills. For that reason, it is necessary for you to do outside reading either (ideally) by taking statistics and programming courses in parallel, or through self-study.

Learning goals:

Graphical comprehension

  1. Recognise structural elements in a statistical graph (e.g. axes, symbols, labels) and evaluate the effectiveness (for perception and judgment) and appropriateness (for the type of data) of each structural element.
  2. Translate relationships reflected in a graph to the data represented.
  3. Recognise when one graph is more useful than another and organise/reorganise data to make an alternative representation.
  4. Use context to make sense of what is presented in a graph and avoid reading too much into any relationships observed.
  5. Express creative thinking via the production of an innovative graphical presentation.

Scientific process

  1. Understanding the randomness, variability and uncertainty inherent in a problem.
  2. Developing clear statements of the problem/scientific research question; understanding the purpose of the answer.
  3. Be able to perform a basic experiment design.
  4. Identify sources of bias in data collection and analysis.
  5. Ensuring acquisition of high-quality data and not just a lot of numbers.
  6. Understanding the process that produced the data, to provide proper context for analysis.
  7. Allowing domain knowledge to guide both data collection and analysis.
  8. Quantify uncertainty—and knowledge—visually.
  9. Realise that all visualisations are model summaries.
  10. Be able to write simple python programs for data science workflows.

Administration

  1. Make sure you are registered on IS-Academia
  2. Also register on Moodle: this is where the assignments will be
  3. Clone this git repository

Assessment

The assessment is purely through in-class exercises, quizzes and homework assignments. There will be assignments spread over the semester, as well as a group project. The project will be performed in pairs.

For all assignments and the project, the following rubric is used. Some of the assignments may not involve all parts.

Experiment design. The first stage of any project, no matter how small, is the experiment design and analysis. This includes a plan for how to collect data, methodologies for analysing the data, and the development of a pipeline, preferably in the form of a program, for collecting and analysing data. In addition, the experiment design must be reproducible: this can be ensured by running the data collection and analysis pipeline on simulated data, and seeing if the results are as expected.

Computation. Here you must instantiate the experiment design and analysis with concrete computations. For reproducibility, the computations you perform should be independent of the data you actually have. Correctness of the computations is the most important aspect here. However, you should also take care to document why and how you are doing the computations.

Graphics. This addresses the creation of visualisations of your analysis. It is recommended to do this fully automatically, so that you can simply run your pipeline and get all the results you need. Be sure to quantify uncertainty.

Text. Here you should explain in text what the graphics mean. Point out any interesting things you can see in the visualisation and try to explain it. Do not be overconfident, but quantify uncertainty properly.

Synthesis. Here you should summarise the most important findings from your analysis. Be careful to not over-interpret your results. A lot of results can be imaginary and can be attributed to insufficient data, biased sampling, improper modelling or $p$-value hacking. Again, be sure to quantify uncertainty.

| Skill | Needs Improvement | Basic Level | Advanced Level |
|-------+-------------------+-------------+----------------|
| Experiment design: Data collection and analysis pipeline | Inappropriate sampling, non-reproducible analysis. | Data collection biased or analysis not reproducible. | Unbiased sampling and reproducible experiment design and analysis. |
| Computation: Perform computations | Computations contain errors and extraneous code. | Computations are correct but contain extraneous/unnecessary code. | Computations are correct and properly justified and explained. |
| Graphics: Communicate findings graphically clearly, precisely and concisely | Inappropriate choice of plots; poorly labelled plots; plots missing. | Plots convey information correctly but lack context for interpretation. | Plots convey information correctly with adequate and appropriate information. |
| Text: Communicate findings clearly, precisely and concisely | Explanation is illogical, incorrect or incoherent. | Explanation is partially correct but incomplete or unconvincing. | Explanation is correct, complete and convincing. |
| Synthesis: Identify key features of the analysis and interpret results | Conclusions are missing, incorrect, or not based on the analysis. | Conclusions reasonable, but partially correct or incomplete. | Relevant conclusions explicitly connected to analysis and context. |

Pass: All parts must be addressed; the ‘default’ grade is 75%. 5% is added for every ‘advanced’ skill and removed for every ‘needs improvement’ skill. Thus the passing grades are 50-100%.

Fail: If not all parts are explicitly addressed, the assignment is failed.

Data sources

This course will consider the following data sources in order of importance.

Synthetic data

This data is obtained through simulation, and it is useful in order to test whether a particular pipeline is working as intended. In particular, it is a great way to test the performance of a method as you vary the data generation process so that different assumptions are satisfied. This allows you to verify robustness.

UCI machine learning repository

The UCI repository has a large collection of datasets in an easy to access format. These have already been used in many academic papers, and are a good starting point for you to look at real data. All the data is formatted in an easy-to-use format, but some pre-processing may still be necessary.

Wikipedia and newspaper articles

Wikipedia has many interesting articles, from which you can extract tabular data, as well as more contextual information. Newspaper articles can also be discussed. Wikipedia and newspaper articles can be used in the context of some assignments.

Economics data

  • FRED: Federal Reserve Economic Data
  • OECD: Organisation for Economic Co-operation and Development

PET Statistics Hackathon

https://petlab.officialstatistics.org/

NASA Mars Challenge

https://www.drivendata.org/competitions/97/nasa-mars-gcms/

Module 1: Visualisation as models and data summary

What is visualisation? It is a way to summarise data. It is also a way to view relationships between variables. Visualisation helps us to find patterns and understand the underlying laws behind how the data was generated. This is, in fact, the essence of modelling.

A model is also a way of summarising the essential features of the data. A visualisation differs from a model only in one sense: it is easy to interpret visually.

Every data visualisation implicitly assumes a model of the data generating process. This is true for even the simplest visualisations, like histograms. There is no escape from the fact that any visualisation makes a lot of assumptions. We must emphasize what those assumptions are. What happens if they are not true?

Every data visualisation, then, proceeds in three steps:

  1. Data transformation
  2. Model creation
  3. Model visualisation

Parameters. Every model is defined by a number of parameters. This is what is displayed when we visualise data. You can think of the model as the underlying theory, and the visualisation as a way to explain the theory visually.

Histograms: model a distribution

Histograms are a simple tool for modelling distributions. In their simplest application, they are used to simply count the number of items in distinct bins of a dataset. While typically employed to represent the empirical distribution of one-dimensional variables, they can be generalised to multiple dimensions.

Bar graph activity

  1. All students who are male raise their hand
  2. All students who are female raise their hand.
  3. We count, and draw a bar graph: the number of male, female and other students.
  4. We count how many are in the BSc of DS of those
  5. We also count how many are taking a programming course
  6. We also count how many are taking a maths/stats course
  7. What does the graph tell us about:
    1. Computer Science students
    2. Students at Neuchatel
    3. Residents of Neuchatel
    4. Other subsets of the population
  8. Can we make a similar graph when measuring a continuous variable?

Definition of a bar graph

Consider a set of $k$ categories $C = \{1, \ldots, k\}$. Every individual belongs in one category. This can be defined through a function $f$ assigning categories to individuals.

In particular, imagine a dataset of individuals $D$. There exists a membership function \[ f : D → C \] so that $f(x)$ is the category to which individual $x ∈ D$ belongs.

A bar graph is a visual representation of the following count vector, \[ n_c(D) = ∑_{x ∈ D} \mathbb{I}\{f(x) = c\}, \] where $\mathbb{I}$ is an indicator function: \[ \mathbb{I}\{A\} = \begin{cases} 1, & \textrm{$A$ is true} \\ 0, & \textrm{$A$ is false}, \end{cases} \] so that the height of the $c$-th bar is proportional to $n_c$.
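As a small illustration (a minimal sketch with made-up individuals and categories, not part of the course data), the count vector can be computed directly:

import matplotlib.pyplot as plt

D = ["anna", "bob", "carol", "dave", "eve"]  # hypothetical individuals
f = {"anna": "BSc DS", "bob": "CS", "carol": "BSc DS", "dave": "Other", "eve": "CS"}  # membership function f : D -> C
C = ["BSc DS", "CS", "Other"]  # the categories
n = {c: sum(1 for x in D if f[x] == c) for c in C}  # the count vector n_c(D)
plt.bar(list(n.keys()), list(n.values()))  # bar heights proportional to n_c
plt.show()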

Introduction to histograms

Assume data is in $\mathbb{R}$. Then split the real line into intervals $[a_i, b_i)$. For a given dataset $D$, for each interval $i$, count the amount of data $n_i(D)$ in the interval. We can also normalise to obtain $p_i(D) = n_i(D) / ∑_j n_j(D)$

More generally, a (counting) histogram is defined as a collection of disjoint sets called bins

$\{ A_i | i=1, \ldots, k\}$

with associated counts $n_i$, so that, given some data $D$,

$n_i(D) = ∑_{x ∈ D} \mathbb{I}[x ∈ A_i]$,

where $n_i$ is the number of datapoints in $A_i$. Typically $A_i ⊂ \mathbb{R}$.

We can use the histogram as the model of a distribution. For that, we use the relative frequency of points in each bin: $p_i(D) = n_i(D) / ∑_j n_j(D)$. The selection of bins influences the model.

See also: https://en.wikipedia.org/wiki/Histogram
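As a minimal sketch (with hand-picked bins and made-up data, purely for illustration), the counts $n_i(D)$ and relative frequencies $p_i(D)$ can be computed with numpy:

import numpy as np

D = np.array([1.2, 1.9, 2.1, 2.3, 3.7, 4.0, 4.1, 4.5, 5.2])  # hypothetical data
edges = np.array([0, 2, 4, 6])  # bins [0,2), [2,4), [4,6]
n, _ = np.histogram(D, bins=edges)  # counts n_i(D) in each bin
p = n / n.sum()  # relative frequencies p_i(D), a model of the distribution
print(n, p)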

Histogram activity

  1. Introduce the concept of a histogram on the board.
  2. Split the students in two groups.
  3. Have each group collect the height of every student.
  4. How can we summarise the data of each group?
  5. Now the students will individually draw a histogram from the data of their group.
  6. Show two different histograms from two people in the same group. Why are they different? Discuss in pairs and then in class.
  7. Now show a histogram from a person in another group. Why are the histograms in the two groups different? Discuss.
  8. Collect the data of all students in the online excel file.
  9. Now we shall plot a histogram of the students using the sheet. How does that differ?

[If there are not enough students, the exercise can be performed by adding random numbers using dice]

Measuring a discrete distribution

  1. Toss a coin 10 times and record each one of the results, e.g. {0,1,1,0,0,0,0,1,1,1}.
  2. Count the number of times it comes heads or tails.
  3. We then summarise the result.

Let us denote the number of times you observe outcome $k$ (e.g. heads) by $N(x = k)$. It should be approximately true that $N(x = k) ≈ 5$, however, this may not be true for everybody.

We can visualise this by plotting bars or lines, whose height is proportional to $N(x=k)$.

We typically assume that individual coin tosses are generated from a Bernoulli distribution. This means that the probability of heads or tails is fixed, and does not depend on the result of the previous tosses. Why might that not be the case?

If individual tosses are Bernoulli, then the distribution of the number of heads (or tails) is a binomial distribution.

We will now show how to achieve the same results programmatically.
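As a quick preview (a minimal sketch; the sections below introduce the necessary Python step by step):

import random

tosses = [random.randint(0, 1) for _ in range(10)]  # 0 = tails, 1 = heads
n_heads = sum(tosses)  # N(x = 1), the number of heads
print(tosses, "heads:", n_heads)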

General python help

To help yourself understand python, you can always take a look at the documentation in English or French. To start with, check out the Tutorial. Then use the library reference for advanced usage.

Sometimes it is quicker to just use the help command in the python console or the ? command in the jupyter notebook.

Python can be used as a simple calculator

$ python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 1
2
>>> 2 * 3 + 1
7
>>> 2 * (3 + 1)
8
>>> exit()

This interactive console is the most usual way of playing with python in the beginning, but it is of limited use for larger programs.

A simple program

Python programs are executed one statement at a time. Statements are separated by newlines. Anything appearing after a # symbol is not executed.

print("Hello world") # first statement, with a comment
print("Goodbye, world.") # second statement!
# print("This is not printed") - it is a comment, you see

Python programs can generate text, read from and write to files, access the internet, generate, display or save plots to disk, play and record music, record images from a camera, and much more.

Before we actually run the program, one of you can play the role of the python interpreter. The interpreter goes through each line of the program, interprets it and executes it.

When we execute a program in the console, we are assuming the role of the interpreter that steps through the program.

Run the above statements in the python interpreter

$ python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print ("Hello world")
Hello world
>>> print ("Goodbye, world")
Goodbye, world
>>> exit()
$

The statements print() and exit() are called functions. Function names must always be used with parentheses. The contents of the parentheses are called arguments. Changing the arguments to a function changes its effect.

We can now try and save the above commands in a file called “pythontest.py” and execute it via

python3 pythontest.py

Consoles, Scripts and Notebooks

Console input is used when you want to have a purely interactive session to test something. In console mode, the interpreter executes each line as you enter it.

Script files are used to save your work and re-run it. They also allow you to build complex programs from multiple files, where each file has a different functionality. In script mode, python acts as though you were entering each line one-by-one. It reads each line and executes in turn.

Notebooks are something in between. They are script files with interactive output, and are very useful for rapid development and testing. They also save their state and output in between runs, so they help to document your code. We will use them a lot in class. There are two methods to use notebooks:

Most of the time, you want to be saving the code you write in a script file and executing it, instead of using the console. However, sometimes the interactivity of the console is helpful. This is when notebooks are used. After you are done developing something with the notebook, you can then extract what you need in a simple python script.

Python variables

The python interpreter has a state. This includes the contents of a memory where variables are stored, and the current location of the code pointer, that is, which line will be executed next.

Variables are alphanumeric references to simple or complex objects. Possible variable names:

  • X
  • NumberOfApples
  • salary
  • scratch_variable
  • y2

A variable can be assigned a value with the = operator.

x = 2 # this gives the numeric value of '2' to the variable

Variables must be defined before they can be used for the first time

>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> x = 2
>>> x # typing the name of a variable in the console gives you its value
2

Numerical Python variables are very simple entities. Let us go through this easy program as a warm-up.

  • x=value; assigns a value to a variable named x
  • print(); displays something in the terminal
x = 1 # a variable
y = 2 # another variable
print(x+y) # print the value of this variable sum
x = y # assignment operation: now x has the same value as y
print(x) #what would this value be?
y = 3
print(x) #is x changed?

Possible confusion point: assignment operator and math equations

The assignment operator = is not a mathematical equation. For example, in mathematics I may write \begin{align} x &= y + 1 \\ x &= 5 \end{align} This is a system of equations, which can be solved to obtain $5 = y + 1$ and so $y = 4$. This is not what the assignment operator means.

Consequently, the following program will fail with an error, as y is not defined

x = y + 1
x = 5

In the following program, the value of y will remain -1 after the program ends

y = -1 # y = -1, x is not defined
x = y + 1 # now y = -1, x = 0
x = 5 # now y = -1, x = 5

In fact, writing the above as a system of equations makes no sense, as $x = y+1$ cannot be true if $x = 5, y = -1$. Replacing, we obtain $5 = 0$, which is false.

So, while math-like notation is used in programming, its meaning is not really the same as in mathematics, most of the time.

Python lists

Slightly more complex objects are Python lists. A list can contain anything, and so is very flexible. It can contain numbers, strings, or arbitrary ‘objects’.

Now check out the first part of the Histogram example. For that we need one line of setup so we can plot stuff.

import matplotlib.pyplot as plt # this is used for plotting
X=[0,1,0,0,0,0,1];  # list of coin tosses
plt.hist(X) # plot a histogram - this automatically splits everything into bins

In reality, the histogram function creates a so-called bar plot

import matplotlib.pyplot as plt # this is used for plotting
X=[0,1,0,0,0,0,1];  # list of coin tosses
plt.bar(["heads","tails"], [sum(X), len(X) - sum(X)]) # do a bar plot!

The following source creates a list of four numbers and returns one element. Things to unpack here:

  • x[i] returns the (i+1)-th element: we start counting from 0
  • the return statement sends a value back to whatever started the python program: in this case this .org file.
x = [1, 2, 3, 4]
return x[3] # returns the last element of the list

The following program assigns arrays-values to variables. Now x, y are both lists.

x = [1, 2, 3, 4]
y = [-1, -2]
x = y # assignment operation: now x is just a different name for y
y[0] = 1 # modify the 0th element of y
return x # what would the value of x be?

Lists are different in one respect: when we assign one list name to another, this does not copy any data. Both names refer to the same data. Consequently, if we change the data, it changes for both variable names. The way to avoid that is to use the copy() function.

x = [1, 2, 3, 4]
y = [-1, -2]
x = y.copy() # copy operation: now x has a copy of y's data
y[0] = 1 # modify the 0th element of y
return x # what would the value of x be?

Python lists and tests

A python list is similar, but not identical to, a mathematical set.

A finite set $S$

\[ S = \{x_1, x_2, \ldots, x_n\}. \]

Numpy arrays

Because lists are very flexible, they are a bit slow. A special type of object, an array, is used to handle lists of numbers. This is not defined in basic python, but in a module called numpy. Even though basic Python has only a few commands, it has many modules that extend the language to perform complex tasks without having to code everything from scratch.

import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([-1, -2])
x = y # assignment operation: 
y[0] = 1
return x

Python control structures

Sometimes we want to repeat some code. For example, we have two matrices of X, Y values and we wish to plot them:

import matplotlib.pyplot as plt
import numpy as np
X = np.random.uniform(size=[10,128])
Y = X + np.random.uniform(size=[10,128])
plt.plot(X[0], Y[0])
plt.plot(X[1], Y[1])

#.... etc - to avoid repetition we can use this:
for t in range(10): # this defines the variable t and it cycles it through the values 0, 1, ..., 9.
    # start of repeated block
    plt.plot(X[t], Y[t])
# end of repeated block - blocks are identified by indentation

# we can also loop through a specific list of values
for t in [1, 2, -1]: 
    print(t)
# this should output 1, 2, -1

Python functions

Sometimes we want to repeat a complex bit of code in different places. So a loop won’t do. The way to do that is to use a function:

def function_name(first_argument, second_argument): # there can be zero or more arguments to a function
    return first_argument + second_argument # this function just returns the sum of its arguments
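For example, the function above could be called as follows (the arguments here are just for illustration):

result = function_name(3, 4)  # call the function with two arguments
print(result)  # prints 7, the value returned by the function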

Function scope

Whenever code is executed inside a function, the variables created there are only valid within the function. The function arguments are also effectively new variables. To see this, consider the following example.

 def example_function(argument):
	# The following does not necessarily modify the original variable
	# passed.  It depends on the effect of 'argument =
	# original_variable'.  If it copies the value, then the original
	# variable remains the same.  If it merely acts as a reference (as
	# is the case with lists and arrays) then a modification happens.
	argument += 1
	# this variable should not be visible outside the function
	hidden_variable = 2
	# variables defined outside the function are still readable!
	print(outside_variable)
	# but, we cannot affect the variables outside the
	# function. Otherwise there would be a mess.
	another_variable = 0
	# for that reason, functions should only use the arguments passed
	return argument
 test = 100
 outside_variable = "I am defined outside the function"
 another_variable = -1 # so am I
 foo = example_function(test) # line to unpack
 print("test:", test)
 print("foo:", foo)
 print("outside_variable:", outside_variable)
 #print("hidden_variable:", hidden_variable) # it complains of 'hidden_variable' not being defined

Let us unpack what happens. When we write foo = example_function(test), what happens is as follows.

argument = test # create a new variable: all other variables are now hidden from scope
argument += 1 # execute the function's code block
foo = argument # apply the return operator to 'foo = example_function()'

Pandas and Histograms

For this, we work on the Histogram example.

Pandas is a module for simple and efficient data I/O processing and visualisation. The following code snippet demonstrates a couple of features.

import pandas as pd # we need to load a library first
df = pd.read_csv("class-data.csv") # loading data into pandas creates a data frame df (any CSV file will do)
df['column-name'] # selects a column
df.hist() # creates a plot with many histograms, one per numeric column

Coin example

Plotting is also possible directly through matplotlib. This is the module that pandas uses to plot; pandas just provides a simpler interface for doing so. But if you want to create custom plots, matplotlib is what you need to use.

X = [1, 0, 1, 0, 1, 1, 0, 1, 0] # a sequence of coin tosses.
import matplotlib.pyplot as plt # python has no default plot function, we must IMPORT it
plt.hist(X) # this function plots the histogram

Each one of you should predict the result of a number of coin tosses. Let us do a histogram of the predictions. This is a binomial distribution.

  1. The students record their data in the shared spreadsheet
  2. Firstly, plot the histogram of the data with default settings.
  3. What is the eff

Let us look at the student data: see src/histograms/heights.ipynb

Heights example

import pandas as pd
X = pd.read_csv("class-data.csv") # read the data into a DataFrame
X['Height (cm)'].hist() #directly plot the histogram

Histograms vs Pie Charts

While histograms are good visualisations of distributions on the real line, distributions over a discrete set of possible values are best represented by a pie chart. This is especially true if there is no relation between the different values. As an example, if the values are distinct categories, there is no particular reason to order them on an axis.

  • What are the advantages and disadvantages of pie charts and histograms?
    | Criterion                | Histogram | Pie Chart |
    |--------------------------+-----------+-----------|
    | To show proportions      |           |           |
    | For more categories      |           |           |
    | To compare relative size |           |           |
    | For real-valued data     |           |           |
  • Why is a 3D pie chart never a good idea?
    plt.pie(counts) # plot counts
        

Randomness

Random algorithms using coins.

y = 0 # y is a variable, with the value zero currently
import numpy as np # this library has many useful functions
x = np.random.choice(100) # x takes values 'randomly'. It is a 'random variable'.
return x # let's see what value it takes

Uncertainty vs randomness: coin-flipping experiment

  1. Everybody flips a coin 10 times.
  2. Record each throw with 0, 1 in this spreadsheet: https://docs.google.com/spreadsheets/d/1E4bs05HnKXf1GZe4g3v6RLnHsj-YcaWg3Qe_RQyfhHU/edit?usp=sharing
  3. Then record how you threw the coin and what coin it was.
  4. Discuss if the coin is really random.
  5. What is the distribution of coin throws for the first throw?
  6. What is the distribution of recorded coin biases? Why do some coins appear more biased than others?
  7. Does it make sense to aggregate all the results? What does that assume?

In the context of experiment design and data analysis, it is very common to have conditions like those in this example. Even though we wish there was such a thing as the ‘repeated experiment’, in practice an experiment is never repeated under exactly the same conditions. There is always some varying factor.

Pseudo-random numbers

Let us now repeat the experiment with data generated via a computer.

# here is a default way to generate 'random' numbers
import random
import matplotlib.pyplot as plt
X = random.choices([0, 1], k=10) # uniformly choose 10 times between 0 and 1.
plt.hist(X) # every time we run these commands, we get a different proportion

This python code is completely deterministic. A complicated calculation is used to generate the next ‘random’ number from the previous one. Consider this example:

import random
random.seed(5) # this sets the 'state' of the random number generating machine
print(random.uniform(0,1)) # the random number is a function of the state
print(random.uniform(0,1)) # the state changes after we generate a new number
print(random.uniform(0,1))
random.seed(5) # when we reset the state, we get the same sequence of numbers
print(random.uniform(0,1))
print(random.uniform(0,1))
print(random.uniform(0,1))

For cryptographically strong random numbers you need to use the secrets module:

import secrets
secrets.choice(range(100))

Physical sources of randomness

Let’s go back to throwing coins now. Coins are completely deterministic. Whenever we have a specific coin to throw in the air, there are two things we do not know. The first is which side the coin will land on. Why is that? The second is uncertainty about the coin bias: is the probability of landing heads exactly 50%? How can we quantify this? What does it depend on? Discuss in class.

What physical source of randomness can we use instead of coins?

Uncertainty

Probability is not only used to model random events. In fact, almost nothing can be said to be really random, unless we go into quantum physics. Even a die thrown in the air follows precise mechanical laws. Given enough information, it is possible to accurately predict the outcome of a throw.

For that reason, probability is best thought of as a way to model any residual uncertainty we have about an event. Then the probability of an event is simply a subjective measure of the likelihood.

While probability offers a nice mathematical formulation of uncertainty, when this uncertainty is subjective, the question arises: how can we elicit precise probabilities about uncertain events from individuals? Here is an example.

The number of immigrants

Consider the following question: how many immigrants live in Switzerland?

  1. In-class discussion: what do we mean by that?
  2. Now everybody can make a guess and record it on this form: https://moodle.unine.ch/mod/evoting/view.php?id=295622

What does this distribution mean? Can we use it as an estimate of uncertainty?

  1. Now let us create some confidence intervals. The procedure is as follows. Let us take a first guess at an interval (say 5-10%) and ask: (a) Are you willing to take an even bet that the true number is between 5-10%?

Time-Series: model the evolution of a system

A time series $x_1, \ldots, x_t$ is simply a sequence of variables. We typically assume that this is random. How can we capture the dependency between variables? Does the value of $x_t$ depend only on the value of $x_{t-1}$? On all the previous values? Only on the time index $t$?

Frequently, sequential observations of a variable $x_t$ are in fact noisy measurements of the true variable of interest, $y_t$, which we never observe. As an example, consider covid infections. There is a true, underlying, number of infections, but we only ever measure the number of positive cases detected in a day.

Generally, there are three tasks associate with time series modelling, always given data up to this point, i.e. $x_1, \ldots, x_t$.

  1. Smoothing: What has happened in the past? Here we estimate $y_{t-k}$ for $k > 0$.
  2. Filtering: What is the current situation? To solve this problem we must estimate $y_t$.
  3. Prediction: What will happen in the future? This involves predicting $y_{t+k}$ for some $k > 0$.

These problems are all related and can be formalised in a statistical manner, and there are multiple algorithms that can be used to solve each problem. When $x_t = y_t$, then smoothing and filtering are trivial, but prediction is still an important problem. We focus here on a simple linear transformation such as the moving average as a basic solution method.

Smoothing For smoothing, a moving average filter is typically sufficient whenever $\mathbb{E}(x_t) = y_t$, i.e. $x$ is just a zero-mean noisy measurement of $y$. Then we can construct the estimator \[ \hat{y}_t = \frac{1}{2n+1} ∑_{k = t-n}^{t+n} x_k. \]

Filtering When we wish to filter, at best we can take the moving average from the past $n$ observations. If $n$ is very large, then there is a corresponding delay between our filtered estimate and the underlying signal. \[ \hat{y}_t = \frac{1}{n+1} ∑_{k = t-n}^{t} x_k. \] The only way to remove the lag is to perform a more complex transformation of the original data. To see this, consider the problem of prediction.

Prediction Prediction means estimating something in the future. This task is never trivial, even with perfect observations, i.e. when $x_t = y_t$. In this setting moving averages do not make sense. A simple idea is to assume a linear trend, e.g. that $y_{t+1} - y_t = y_t - y_{t-1}$. By re-arranging terms, we have that $y_{t+1} = 2y_t - y_{t-1}$. This gives us the estimator: \[ \hat{y}_{t+1} = 2x_t - x_{t-1} \]
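Here is a minimal sketch of the three estimators on synthetic data (the signal and noise level are assumptions chosen purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

T, n = 200, 5
y = np.sin(np.linspace(0, 4 * np.pi, T))  # hidden signal y_t
x = y + 0.3 * np.random.normal(size=T)  # noisy observations x_t
# Smoothing: centred moving average over 2n+1 points (boundary effects ignored)
smooth = np.convolve(x, np.ones(2 * n + 1) / (2 * n + 1), mode="same")
# Filtering: average of the last n+1 observations only, so it lags behind the signal
filt = np.array([x[max(0, t - n):t + 1].mean() for t in range(T)])
# Prediction: linear-trend estimate of the next value from the last two observations
pred_next = 2 * x[-1] - x[-2]
plt.plot(x, alpha=0.4, label="observations")
plt.plot(smooth, label="smoothed")
plt.plot(filt, label="filtered")
plt.legend()
plt.show()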

Plotting lines

Here is a simple example of line plotting.

import numpy as np
import matplotlib.pyplot as plt # import the plotting library
X = [1, 2, 3, 4, 5, 4, 3, 2, 1] # define a small number of points
plt.plot(X) # perform a standard, simple plot
f = "line-plot.png" # name of the file to save the figure in (chosen here as an example)
plt.savefig(f) # save the figure to the file
return f

What are such plots useful for?

Race times

https://en.wikipedia.org/wiki/1500_metres_world_record_progression

Wikipedia has a table that shows the progression of 1500m world records.

  1. Let us first show the records up to 1950.
  2. Try and predict the progression of world records on the board.
  3. Let us now look at the actual graph. Is it what you expected?
  4. How do you expect the progression to continue after 2020?
  5. How do you explain this progression? Can you find data to validate or refute your explanation?

Scraping tables example :example:data-collection:

import pandas
import datetime
tables = pandas.read_html("URL") # read all tables from a web page (put the page address in place of URL)
# convert a date-string to a year (assuming some string variable 'string'):
dt = datetime.datetime.strptime(string, '%Y-%m-%d').year
# string manipulation
string.replace("+", "0") # replaces a + with a 0
string.split(":") # splits a string into multiple strings
# data formats
float("12.2") # converts a string into a float

Example: The inclination of Mars

  1. Plot Mars data
  2. Show orbits
  3. 3-body system, chaos and randomness

Example: Covid

  1. Plot covid data.
  2. Smooth the data: moving average plots
  3. Try and estimate past, current and future infections with simple tools.
  4. Discuss: Are those simple tools sufficient? Is our visualisation consistent? Do we need something further?

Example: Stock market prices

See: Trading Economics

Scatterplots: model a relationship

Let us start with an example where we just have three variables. We can plot the relationship between any two of them.

X=[1, 2, 3, 4, 10, 6]
Y=[5, 2, 5, 3, 1, 2]
Z=[0, 1, 0, 1, 0, 1]
import matplotlib.pyplot as plt
plt.scatter(X,Y)

Frequently, the variables are stored in an array instead.

import numpy as np
import matplotlib.pyplot as plt
n_data = 10
n_features = 3
data = np.random.uniform(size=[n_data, n_features]) # create some random data
plt.scatter(data[:,0], data[:,1]) # plot the first against the second column
# We can always take a 'slice' of the data:
data[:,[1,2]] # get columns 1 and 2 and all the rows
data[1:10, [0,2]] # get columns 0 and 2 and rows 1 to 9
## : means everything
## a:b means everything from a up to (but not including) b
## [a,b,c] means a, b and c.

In dataframes, we can refer to multiple variables by name

import pandas as pd
df = pd.DataFrame(data, columns =  ["Alcohol", "Caffeine", "Sugar"])
plt.scatter(df["Alcohol"], df["Caffeine"]) #plot the first against the second column
# getting slices is also possible in pandas dataframes, just slightly different:
df.loc[:,'Alcohol'] # get column Alcohol
df.loc[:,['Alcohol', 'Caffeine']] # get column Alcohol and Caffeine

Relationships as functions

A lot of relationships between two variables $x ∈ X$, $y ∈ Y$, can be described through some deterministic function $f : X → Y$, i.e. \[ y = f(x).\]

If the relationship is one-to-one, then there exists an inverse function $f^{-1} : Y → X$, so that \[ x = f^{-1}(y), \] with \[ x = f^{-1}[f(x)]. \]

Sometimes, however, the relationship between the two variables is not deterministic, that is, the value of $x$ does not uniquely determine the value of $y$, or the converse may occur… or both.

Physical relations

Many equations in physics relate two quantities. For example, there is the equation relating current $I$, voltage $V$ and resistance $R$: \[ V = IR. \] This relation can be inverted to obtain \[ R = V/I, \qquad I = V / R. \] Let us say that the resistance $R$ is fixed. By altering the voltage (e.g. by adding more batteries to a circuit) we can see that the current increases.

Relationships as joint distributions

The simplest way to model a stochastic relationship between two variables is to model the joint distribution $P(X,Y)$. Consider the example of heights versus weights: We can expect that the taller a person is, the heavier they will be. However, their weight will depend on the mass of their muscle and adipose tissue. These in turn depend on their age, sex, genetics and lifetime calorie expenditure and intake.

Weight and height distribution

Here we can plot the number of people having a certain height and weight combination. This can be done with a colour-map. This is not much different than a normal histogram - and is called hist2d in pyplot:

import numpy as np; import matplotlib.pyplot as plt
X = 100 / (1 + np.exp(-np.random.normal(size=100))) + 125  # synthetic 'heights'
Y = X *  (1 + 0.1*abs(np.random.normal(size=100))) - 100   # synthetic 'weights'
plt.hist2d(X, Y, bins=10) # two-dimensional histogram shown as a colour-map

Relationships as conditional distribution.

Here we model the distribution of one variable given a fixed value for the other, e.g. $P(X|Y)$. The simplest thing is to only try to model the expected value $E[X|Y]$. How can we do this?

Method 1: polynomial fitting!

import numpy as np
import matplotlib.pyplot as plt
# data_x, data_y are assumed to be arrays of observations (e.g. from the data above)
a, b = np.polyfit(data_x, data_y, 1) # returns the best fitting line to the data
# Why is this the 'best' line? Because it minimises the total squared
# error between the predicted values and the actual ones.
# We can plot the line using this simple linear equation
ax = np.linspace(0,1)
plt.plot(ax, a * ax + b)

Example: Unemployment, GDP

Get some financial data from FRED. This is time-series data. Can we actually make sense out of it in terms of correlations? Explore.

First, the unemployment rate: https://fred.stlouisfed.org/series/UNRATE Then, the GDP: https://fred.stlouisfed.org/series/GDPC1 This gives two different data frames.

# read the files
import pandas as pd
ur=pd.read_csv("UNRATE.csv")
gdp=pd.read_csv("GDPC1.csv")
# the date ranges are different, so we must try to merge them (inner join!)
merged = ur.merge(gdp)
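A possible continuation (a sketch; it assumes the merged frame keeps FRED's standard column names, DATE plus the series names UNRATE and GDPC1 — adjust the names if your downloaded files differ):

import matplotlib.pyplot as plt
plt.scatter(merged["GDPC1"], merged["UNRATE"]) # one point per common date
plt.xlabel("Real GDP")
plt.ylabel("Unemployment rate (%)")
plt.show()
print(merged["GDPC1"].corr(merged["UNRATE"])) # sample correlation between the two series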

Module 2: Experiment design

Data collection and cleaning

Random sampling

In this module, we will perform the following activities:

  1. Uniformly random sampling. How can we perform it?
  2. Biased sampling, correcting for effects.
  3. Importance sampling.

Survey of political opinions

You each support one of the following political groups:

  • R: Red.
  • B: Blue.

In this exercise, we will try to measure the support for different political parties.

Fixed affiliations

  1. Deal cards so that 40% of the students are red, and 60% are blue.
  2. Now sample the population in the following manner:

    Everybody throws a die. Those with a value 5 or greater are part of our sample. The sample $ω$ comes to the board.

    The set of all possible samples is called the universe $Ω$. It is possible that we sample everybody, or nobody, or only the boys, or only the girls, or any combination.

    Mathematically, we can say that $Ω = \{0,1\}^n$, where $n$ is the number of students, and $ω_i = 1$ if a student has been selected.

  3. Write a tick mark in the box saying “Red” or “Blue”.
  4. We then measure the proportion of red and blue votes. This proportion is the random variable!

We can repeat the same procedure with a different sampling method:

  1. Assign a number to each student.
  2. Cast a die and see which student it corresponds to.
  3. The student tells me their vote.
  4. I repeat.

Simulated sampling from the larger population

Here we assign a random affiliation to each one of you. Throw a die for your political affiliation:

| Die | Party |
|-----+-------|
| 0   | Red   |
| 1   |       |
| 2   |       |
| 3   |       |
| 4   |       |
| 5   |       |
| 6   | Blue  |
| 7   |       |
| 8   |       |
| 9   |       |
We now count the number of people having different affiliations. These are your true voting affiliations. If you were to vote, then you would vote for these specific parties.

Here, the underlying random space is the combined dice throws of all the class, and the random variable of interest is the number of votes for each candidate.

Given the number of people in the course, what is the expected number of votes for each party?

From random dice to random votes

From a probability perspective, we can think of the die as having random outcomes in $Ω = \{0, 1, \ldots, 9\}$. The random variable is the party vote $v : Ω → \{R, B\}$ with \[ v(ω) = \begin{cases} R, & ω ∈ \{0,1,2,3\} \\ B, & ω ∈ \{4,5,6,7,8,9\} \end{cases} \]

Let us assume that the die has a uniform distribution $P$ so that $P(ω) = 1/10$ for all $ω ∈ Ω$ and $P(S) = |S|/10$ for all subsets $S ⊂ Ω$. What is then the probability that somebody supports the Red party?

The probability that $v = R$, which we write informally as $Pr(v = R)$, is simply the probability of all $ω$ that lead to $R$, that is: \[ Pr(v = R) = P(\{ω : v(ω) = R\}) = P(\{0,1,2,3\}) = 4/10. \]

Expectations

If we know the probability that a randomly chosen voter will vote for the i-th party, then what is the probability of different numbers of votes? What is the expected number of votes for each party? To solve this problem, we need to define another random variable:

  • $n_i$: the number of votes cast for each party $i$.

This total number of votes depends on the party affiliation of each voter. Let

  • $ω_t$ be the random die of person $t$.
  • $v_t = v(ω_t)$ is then the party affiliation of person $t$.

We collect the random dice of all individuals into one big vector \[ ω = (ω_1, \ldots, ω_t, \ldots, ω_T), \qquad ω ∈ Ω^T, ω_t ∈ Ω \] Then the total number of votes for party $i$ is simply \[ n_i(ω) = ∑_{t=1}^T \mathbb{I} \{v_t = i\} = ∑_{t=1}^T \mathbb{I} \{v(ω_t) = i\} \]

Clearly, if there is only one voter, the expected number of votes for each party $i$ is simply $Pr(v = i)$. If we had $T$ voters then this should be $\E_P[n_i] = T Pr(v = i)$. We can verify this by writing out the expectation.

\[ \E_P[n_i] = ∑_ω P^T(ω) n_i(ω) \]

Here $P^T$ is the joint distribution of all persons' dice. The main assumption we must make is that each person's die is independent of everybody else's. Then this joint distribution can be written as the following product: \[ P^T(ω) = ∏_{t=1}^T P(ω_t). \] Consequently, the expectation becomes \begin{align*} \E_P[n_i] &= ∑_{ω ∈ Ω^T} P^T(ω) n_i(ω) \tag{by definition}\\ &= ∑_{ω ∈ Ω^T} P^T(ω) ∑_{j=1}^T \mathbb{I} \{v_j = i\} \tag{by the definition of $n_i$}\\ &= ∑_{ω ∈ Ω^T} ∑_{j=1}^T P^T(ω) \mathbb{I} \{v_j = i\} \tag{moving $P^T$ inside}\\ &= ∑_{j=1}^T ∑_{ω ∈ Ω^T} P^T(ω) \mathbb{I} \{v_j = i\} \tag{changing the order of summation}\\ &= ∑_{j=1}^T P^T(\{ω : v(ω_j) = i\}) \tag{rewriting the event}\\ &= ∑_{j=1}^T P(\{ω_j : v(ω_j) = i\}) \tag{by independence}\\ &= ∑_{j=1}^T P_v(i) \tag{by definition}\\ &= T P_v(i). \end{align*} The conceptually tricky part is that \[ ∑_{ω ∈ Ω^T} P^T(ω) \mathbb{I} \{v_j = i\} = P^T(\{ω : v(ω_j) = i\}). \] This does not follow from any complex probabilistic reasoning. The right hand side is simply the probability of the set of all $ω ∈ Ω^T$ with $v(ω_j) = i$. The left hand side is summing over all possible $ω$, but multiplying with zero all those for which the corresponding value is not $i$. Thus, the two terms are equal. The independence assumption is used to show that $P^T(\{ω : v(ω_j) = i\}) = P(\{ω_j : v(ω_j) = i\})$. The other steps do not require any assumptions.
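We can also check the result $\E_P[n_i] = T P_v(i)$ by simulation (a sketch; the class size and number of runs are arbitrary choices):

import numpy as np

T = 30  # number of voters
n_runs = 10000  # number of simulated classes
omega = np.random.randint(0, 10, size=(n_runs, T))  # everybody's die rolls
n_red = (omega <= 3).sum(axis=1)  # votes for Red in each run, with v(omega) = R for omega in {0,1,2,3}
print("simulated average:", n_red.mean(), "theory:", T * 0.4)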

The probability measure induced by a random variable f

We have already defined a probability measure $P$ on the outcomes $Ω$. Since $f$ is a function on $Ω$, we can also define an appropriate probability measure $P_f$ for the outputs of $f$. In particular, for any subset $S ⊂ \{R, G, B\}$, we define \[ P_f(S) = P(\{ω : f(ω) ∈ S\}). \] We can thus identify the informal probability $Pr(f = R)$ with the measure $P_f(R)$. Here, even though $R$ is an element, and not a set, we abuse notation. Normally we would write $P_f(\{R\})$ to denote the set consisting only of $R$.

Random sampling

Each one of you throws a second die and records the outcome. We now have

| Die | Party | Response      |
|-----+-------+---------------|
| 0   | Red   | Not Reachable |
| 1-2 |       | Refuse        |
| 3-9 |       | Red           |
| 0-4 | Blue  | Not Reachable |
| 5   |       | Refuse        |
| 6-9 |       | Blue          |

  • Make a histogram / bar-chart / pie plot of the number of votes

Uniform sampling

In uniform sampling, the probability of all outcomes is the same.

Sampling with replacement In sampling with replacement, each outcome can appear more than once.

import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population the
# following will give us a sample drawn with replacement: It doesn't
# matter who we selected before, the next one will be randomly
# selected independently of the previous selection
sample = np.random.choice(population_size, size = n_samples)
return sample

The following code has the same effect

import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population
sample = np.zeros(n_samples)
for i in range(n_samples):
    sample[i] = np.random.choice(population_size)
return sample

Sampling without replacement In sampling without replacement, each outcome can appear at most once.

import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population
# the following will give us a sample drawn WITHOUT replacement: once
# somebody has been selected, they cannot be selected again
sample = np.random.choice(population_size, size = n_samples, replace = False)
return sample

Expectation

Recall that a random variable $f$ is a function $f : Ω → \mathbb{R}$. The expectation of a random variable with underlying distribution $P(ω)$ is simply \[ \mathbb{E}_P[f] \defn ∑_{ω ∈ Ω} f(ω) P(ω). \] There is nothing random about the variable itself, it is only the random input that makes its value random.

In the following example, $ω ∈ \{0, 1, 2, 3\}$, with $P(ω)$ specified by the vector P below.

 import numpy as np
 # Let us define the space Omega
 Omega  = np.array([0, 1, 2, 3])

 # Let us define a vector P so that P[omega] is the probability of omega
 # e.g the probability that omega = 0 is 0.1, and that omega = 3 is 0.5.
 P = np.array([0.1, 0.2, 0.3, 0.4])

 # Here is our random variable f(). It is just a function of omega. There
 # is nothing random about it, only omega is random!
 def f(omega):
     return omega * omega

 # Let us generate a random omega:
 random_outcome = np.random.choice(Omega, p = P)
 random_variable_value = f(random_outcome)
 print("omega:", random_outcome, 
	 "f:", random_variable_value)

 # We can also easily calculate the expectation of the random variable
 # through the dot product.
 print ("Expected value:", np.dot(P, f(Omega)))

What is the expectation of this variable?

In our case, $f(ω) = ω^2$. Let us first consider a discrete $Ω$:

Let $ω ∈ \{0,1,2\}$ with $P(0) = 0.5$, $P(1) = 0.3$ and $P(2) = 0.2$. Then \begin{align*} \E_P(f) &= ∑_ω P(ω) f(ω) \\ &= P(0) f(0) + P(1) f(1) + P(2) f(2) \\ &= 0.5 × 0^2 + 0.3 × 1^2 + 0.2 × 2^2 \\ &= 0.3 + 0.8 = 1.1. \end{align*}

Let us now consider a continuous $Ω$ with probability measure $P(A)$ for all (measurable) subsets $A$ of $Ω$. This can be defined through the probability density $p(ω)$,

\[P(A) = ∫_A p(ω) dω.\]

The corresponding expectation of $f$ is then given by \[ \E_P(f) = ∫_Ω f(ω) p(ω) dω \] For our specific example, let us choose $p$ to be the uniform distribution on $[0,2]$. Then $p(ω) = 1/2$ and \[ \E_P(f) = ∫_0^2 ω^2 /2 dω = [ω^3/6]_0^2 = 2^3/6 = 8/6 = 4/3. \]
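We can sanity-check this value by Monte Carlo sampling (a quick sketch, not part of the course materials):

import numpy as np

omega = np.random.uniform(0, 2, size=1000000)  # samples from the uniform distribution on [0,2]
print(np.mean(omega ** 2), "should be close to", 4 / 3)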

Centime exercise

A jar with coins is passed around the class.

  1. The students are asked to guess how many coins it contains.
  2. The students agree on a 50% confidence interval.
  3. The students fit a normal distribution on this interval $[μ - \frac{2}{3}σ, μ + \frac{2}{3}σ]$.
  4. Is this normal distribution a good choice? Are you 90% sure the number of coins is less than $x$?
  5. Is a normal distribution generally appropriate?
  6. Puzzle: Guess how many coins there are. If correct, then the class will share the money. If not, they will get nothing. What is the correct guess?

(If students have trouble with this, try with small numbers of coins and finite number of possibilities - demonstrate by playing the guessing game repeatedly)

Survey of heights

  1. First, randomly select a student. How? Everybody gets a different number. Then I throw a die until I get a single student matching this number. The student is my sample $ω$.
  2. The student comes to the board and measures their height. This height is a random variable $h(ω)$. Each student has their own fixed height, but which student I select is random. Thus, the height that I measure is random.
  3. Repeat the experiment once more.
  4. Now we randomly select multiple students. How? Each student throws a die. If the die is >4, then they are selected. All the students come to the board. They are our sample $ω$.
  5. We now write the student heights, and average. The average height of the sample is our random variable.

The data science pipeline

The experimental pipeline has a number of different components.

  1. Formulating the problem.
  2. Deciding what type of data is needed.
  3. Choosing the model and visualisation needed.
  4. Designing the experimental protocol.
  5. Generating data conforming to our assumptions.
  6. Testing the protocol on synthetic data. Is it working as expected?
  7. Running the protocol on real data.

Salaries of men and women

You want to measure if men and women have different salaries, as well as the reasons why. In particular, you want to check if men are paid more than women on average, and if this can be explained purely by age.

Formulate the problem

Let us define three variables, gender $g$, age $a$ and salary $s$. Our possible hypotheses can be expressed mathematically as follows:

For background, check out conditional expectations: hypotheses about the expected salary are statements about conditional expectations.

  • $E(s | g = m) = E(s | g = f)$. Men and women have the same salary in expectation
  • $E(s | g = m) > E(s | g = f)$. Women have a lower salary in expectation. But how much lower?

(Note that a stronger hypothesis would involve the conditional probabilities rather than the expectation)

If we find a difference, then we may want to explain it. If the age is a sufficient explanation for the salary, then the salary is conditionally independent of the gender given the age:

  • $E(s | a, g ) = E(s | a)$. The salary is independent of the gender, given the age.

What type of data do we want?

We need salaries, ages and genders.

Model and visualisation

Since we are looking at means, it may be enough to look at averages. Averages are single numbers, so a bar plot may be enough.

Experimental protocol

  1. Collect data.
  2. Plot average for men and women. Use the simulation to find the right type of plot.
  3. If average salary is very different (how different?) then say women are paid less than men. Use the simulation to get an idea.

Generate data according to our assumptions

We can have three different tests:

  1. Generate everything independently
import numpy as np
# Generate data
def identical_populations(n_samples):
    gender = np.random.choice(2, size = n_samples)
    age = 18 + 60*np.random.beta(3,4, size = n_samples)
    salary = 100* np.random.exponential(100, size = n_samples)
    return age, gender, salary
  2. Make the salary depend on the gender
def different_populations(n_samples):
    gender = np.random.choice(2, size = n_samples)
    age = 18 + 60*np.random.beta(3,4, size = n_samples)
    salary = (100 + gender*10)* np.random.exponential(100, size = n_samples)
    return age, gender, salary
  3. Make the salary depend on the age, but have fewer women working after some age.
def age_effect_populations(n_samples):
    age = 18 + 60*np.random.beta(3,4, size = n_samples)
    q = (age - 18) / 60
    p = q * 1 + (1 - q)*0.5  # probability of gender 0, increasing with age
    gender = np.random.binomial(1, 1 - p)  # gender 1 becomes rarer with age
    salary = (100 + age)* np.random.exponential(100, size = n_samples)
    return age, gender, salary

See Salary example notebook.

If we estimate the means, we see there is a lot of variability. How can we fix that? One idea is to perform a lot of simulations and note how much variability we do have. Another is to use a bootstrap sample.
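A minimal sketch of the bootstrap idea (the salary samples here are synthetic placeholders):

import numpy as np

def bootstrap_std_of_mean(x, n_boot=1000):
    # resample the data with replacement and see how much the mean varies
    means = [np.mean(np.random.choice(x, size=len(x), replace=True)) for _ in range(n_boot)]
    return np.std(means)

salaries_m = np.random.exponential(100, size=50)  # placeholder data
salaries_f = np.random.exponential(100, size=50)  # placeholder data
print(bootstrap_std_of_mean(salaries_m), bootstrap_std_of_mean(salaries_f))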

A question of voting

An election is coming in 100 days. A political party supporting one of the options [assume it’s just a yes/no vote] in the election gives you 10,000 CHF to spend over these 100 days so as to measure the mood of the population. They can use that information to increase their chances of success. How should you do the study?

Formulate the problem

Who is going to use our visualisation? How are they going to use it? Let us brainstorm a little bit about this.

What type of data do we want?

What do we want to know about the people we collect the data from?

Choose the model and visualisation needed

What can we assume about people’s opinions? How about their responses? Are they truthful? What is the most useful visualisation?

After we get the data, we need to analyse it. A simple model would be a moving average of polls over time. Would that work?

Design the experimental protocol

Assume we need to pay 1 CHF for every time we ask a poll question, and we have a budget of 10,000 CHF. Then, how should we ask questions, assuming the election is in 100 days from now?

(a) Ask 10,000 people now. (b) Ask 10,000 people one day before the election. (c) Ask 10,000 people 50 days from now. (d) Ask 100 people every day? (e) Ask 1,000 people every 10 days.

Generate data according to our assumptions

We can assume a simple model of the electorate here… what should it be? Maybe their opinion changes over time. Maybe some people are not responding, or not reachable. Which one corresponds to our assumptions?
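One possible toy model (purely an assumption for illustration: support drifts as a slow random walk and everybody responds truthfully) could look like this:

import numpy as np

days, daily_sample = 100, 100  # 100 days, 100 respondents per day (budget of 10,000 questions)
support = np.clip(0.5 + np.cumsum(np.random.normal(0, 0.005, size=days)), 0, 1)  # drifting true support
polls = np.random.binomial(daily_sample, support) / daily_sample  # daily poll estimates
print("true support on election day:", support[-1], "last poll:", polls[-1])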

Test the protocol

After testing the whole pipeline, we can see if it actually works as intended.

Module 3: Inference

Logic and bar graphs

How many people are taking CS? How many are in other courses? Let us consider the following statements. Each statement is either true or false. If a statement is true, it has the value 1. It consequently has the value 0 if it is false.

For each student $i$, we define:

  • $m_i$: the student is male
  • $f_i$: the student is female
  • $s_i$: the student is in the faculty of science
  • $l_i$: the student is in the faculty of law
  • $e_i$: the student is in the faculty of economics

Clearly, if $s_i$ is true then $l_i, e_i$ are false.

Logic and sets

We can associate events with sets in a universe. The following laws apply:

  1. There is a universe $Ω$ of possible outcomes
  2. Consider an individual outcome $ω ∈ Ω$:

(a) If $ω ∈ A$, then we say that $A$ is true. (b) If $ω ∉ A$, then we say that $A$ is false.

  3. Let $¬ A$, read ‘not A’, be the complement $Ω \setminus A$: If $A$ is true, then $¬ A$ is false and vice-versa.
  4. $B$ is a subset of $A$ (i.e. $B ⊂ A$) iff $B$ implies $A$.
  5. If $A$ and $B$ are disjoint, i.e. $A ∩ B = ∅$ (they have an empty intersection), then $A, B$ are mutually exclusive. That means that it is impossible for $A, B$ to both be true.

Conditional Probability and the Theorem of Bayes

Bayes theorem

  • Recall the definition of Conditional probability: \[ P(A | B) = P(A ∩ B) / P(B) \] i.e. the probability of A happening if B happens.
  • It is also true that: \[ P(B | A) = P(A ∩ B) / P(A) \]
  • Combining the two equations, we can reverse the conditioning: \[ P(A | B) = P(B | A) P (A) / P(B) \]
  • So we can reverse the order of conditioning, i.e. relate the probability of A given B to that of B given A.

The cards problem

  1. Print out a number of cards, with either [A|A], [A|B] or [B|B] on their sides.
  2. If you have an A, what is the probability of an A on the other side?
  3. Have the students perform the experiment with:
    1. Draw a random card.
    2. Count the number of people with A.
    3. What is the probability that somebody with an A on one side will have an A on the other?
    4. Half of the people should have an A?

The prior and posterior probabilities

| Card (visible/hidden) | Prior | Observation | Posterior |
|-----------------------+-------+-------------+-----------|
| AA                    | 2/6   | A observed  | 2/3       |
| AB                    | 1/6   | A observed  | 1/3       |
| BA                    | 1/6   |             |           |
| BB                    | 2/6   |             |           |

Bayesian analysis example

The murder problem

  • Somebody saw somebody matching the suspect's description, and he was found in the neighbourhood. There is no other evidence.

  • There are two possibilities:
  • $H_0$: They are innocent.
  • $H_1$: They are guilty.

What is your belief that they have committed the crime?

Prior elicitation

  • All those that think the accused is guilty, raise their hand.
  • Divide by the number of people in class
  • Let us call this $P(H_1)$.
  • This is a purely subjective measure!

DNA test

  • Let us now do a DNA test on the suspect

DNA test properties

  • $D$: Test is positive
  • $P(D | H_0) = 10\%$: False positive rate
  • $P(D | H_1) = 100\%$: True positive rate

Run the test

  • The result is either positive ($D$) or negative ($¬ D$).
  • What is your belief now that the suspect is guilty?

Everybody is a suspect

  • Run a DNA test on everybody.
  • What is different from before?
  • Who has a positive test?
  • What is your belief that the people with the positive test are guilty?

Explanation

  • Prior: $P(H_i)$.
  • Likelihood $P(D | H_i)$.
  • Posterior: $P(H_i | D) = P(D ∩ H_i) / P(D) = P(D | H_i) P(H_i) / P(D)$
  • Marginal probability: $P(D) = P(D | H_0) P(H_0) + P(D | H_1) P(H_1)$
  • Posterior: $P(H_0 | D) = \frac{P(D | H_0) P(H_0)}{P(D | H_0) P(H_0) + P(D | H_1) P(H_1)}$
  • Assuming $P(D | H_1) = 1$ and a false positive rate $P(D | H_0) = 0.1$, and setting $P(H_0) = q$, this gives

\[ P(H_0 | D) = \frac{0.1 q}{0.1 q + 1 - q} = \frac{q}{10 - 9q} \]

  • The posterior can always be updated with more data!

Python example

# Input:
# - prior: prior probability of hypothesis 0 (scalar)
# - data: a single observed data point
# - likelihood: array indexed as likelihood[data][hypothesis]
# Returns:
# - the posterior probability of hypothesis 0 given the data point
def get_posterior(prior, data, likelihood):
    marginal = prior * likelihood[data][0] + (1 - prior) * likelihood[data][1]
    posterior = prior * likelihood[data][0] / marginal
    return posterior

import numpy as np
prior = 0.9
likelihood = np.zeros([2, 2])
# pr of negative test if not a match
likelihood[0][0] = 0.9
# pr of positive test if not a match
likelihood[1][0] = 0.1
# pr of negative test if a match
likelihood[0][1] = 0
# pr of positive test if a match
likelihood[1][1] = 1
data = 1
return get_posterior(prior, data, likelihood)
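
To illustrate the last point, here is a small usage sketch that feeds the posterior from each observation back in as the prior for the next one. It repeats the function definition so that the block is self-contained; the sequence of test results is made up.

import numpy as np

def get_posterior(prior, data, likelihood):
    marginal = prior * likelihood[data][0] + (1 - prior) * likelihood[data][1]
    return prior * likelihood[data][0] / marginal

likelihood = np.array([[0.9, 0.0],    # P(negative | H_0), P(negative | H_1)
                       [0.1, 1.0]])   # P(positive | H_0), P(positive | H_1)
belief = 0.9                          # prior probability of H_0 (not a match)
for observation in [1, 1]:            # two positive tests in a row
    belief = get_posterior(belief, observation, likelihood)
    print(belief)                     # roughly 0.47, then 0.08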

More general problems

The $k$-meteorologists problem

We have $k$ meteorological stations, each publishing a probability of rain for every day.

Predictions and outcomes

| Station      | Mon | Tue | Wed | Thu |
|--------------+-----+-----+-----+-----|
| MeteoSuisse  | 25% | 20% | 10% | 5%  |
| Wunderground | 30% | 50% | 20% | 10% |
| AccuWeather  | 90% | 70% | 10% | 0%  |
| Rain         | Y   | N   | N   | Y   |

$H_i$: The $i$-th station’s model is correct

  • $P(H_i)$: prior
  • $P(D | H_i)$: likelihood according to station $i$
  • $P(H_i | D)$: posterior
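
As a minimal sketch (assuming a uniform prior over the three stations, and treating each forecast as that station's model probability of rain), the posterior can be computed as follows.

import numpy as np

# Forecast rain probabilities (rows: stations, columns: days) and observed outcomes
predictions = np.array([[0.25, 0.20, 0.10, 0.05],    # MeteoSuisse
                        [0.30, 0.50, 0.20, 0.10],    # Wunderground
                        [0.90, 0.70, 0.10, 0.00]])   # AccuWeather
rain = np.array([1, 0, 0, 1])                        # Y, N, N, Y
prior = np.ones(3) / 3                               # uniform prior over stations
likelihood = np.prod(np.where(rain == 1, predictions, 1 - predictions), axis=1)
posterior = prior * likelihood / np.sum(prior * likelihood)
print(posterior)   # AccuWeather gets posterior 0: it gave 0% to a day when it rained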

Types of hypothesis testing problems

Simple Hypothesis Test

Example: DNA evidence, Covid tests

  • Two hypotheses $H_0, H_1$
  • $P(D | H_i)$ is defined for all $i$

Multiple Hypotheses Test

Example: Model selection

  • $H_i$: One of many mutually exclusive models
  • $P(D | H_i)$ is defined for all $i$

Null Hypothesis Test

Example: Are men’s and women’s heights the same?

  • $H_0$: The ‘null’ hypothesis
  • $P(D | H_0)$ is defined
  • The alternative is undefined

Pitfalls

Problem definition

  • Defining the models $P(D | H_i)$ incorrectly.
  • Using an “unreasonable” prior $P(H_i)$

The garden of many paths

  • Having a huge hypothesis space
  • Selecting the relevant hypothesis after seeing the data

The covid test problem

10% of the class has covid, i.e. P(covid) = 0.1. Each one of you performs a covid test. If you have covid, the test is correct 80% of the time, i.e. P(positive | covid) = 0.8. Conversely, if you do not have covid, there is still a 10% chance of a positive test, with P(positive | not-covid) = 0.1

How likely is it that you have covid if your test is positive or negative, i.e. P(covid | positive), vs. P(covid | negative)?

First of all, each one of you should independently generate a uniform random number between 1 and 10. For that, you can each throw a 10-sided die, and record the outcome.

Then you throw a second die, and record that as well.

I will now pass over the tables and tell each one of you if they have a positive test.

Now, everybody with a positive test raises their hand. Since P(positive) = 0.8 × 0.1 + 0.1 × 0.9 = 0.17, I expect roughly 17% of hands, somewhat more than the 10% who actually have covid (but the exact number will vary).
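
For reference, the exact answers follow from Bayes' theorem; the short computation below uses only the numbers given above.

# Bayes' theorem for the covid test example
p_covid = 0.1
p_pos_covid = 0.8        # P(positive | covid)
p_pos_not = 0.1          # P(positive | not covid)
p_pos = p_pos_covid * p_covid + p_pos_not * (1 - p_covid)              # 0.17
print(p_pos_covid * p_covid / p_pos)                                   # P(covid | positive), about 0.47
p_neg = (1 - p_pos_covid) * p_covid + (1 - p_pos_not) * (1 - p_covid)  # 0.83
print((1 - p_pos_covid) * p_covid / p_neg)                             # P(covid | negative), about 0.02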

Hypothesis testing

Hypothesis testing, or model selection, is a general problem in machine learning and statistics.

Estimation intervals

Intervals are a generalisation of hypothesis tests. We want to know if a given unknown parameter is within some range. Let us give an example.

Estimating an expectation

We might have data $x_1, \ldots, x_T$, with $x_t ∼ Ber(θ)$ from a Bernoulli distribution with parameter $θ$. While we can calculate the mean from the samples, what can we actually say about the unknown expectation $\E[x_i] = θ$?

Bayesian credible interval

The Bayesian idea is intuitive. Through the posterior, we can calculate the probability that the true parameter is in any interval $S ⊂ [0,1]$: \[ P(θ ∈ S | x_1, \ldots, x_T) \] This idea works for any kind of estimation.
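
As a minimal sketch (assuming a uniform Beta(1,1) prior, which is an assumption not specified above), a credible interval for a Bernoulli parameter can be read off from posterior samples.

import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)       # simulated Bernoulli data, true theta = 0.3
# Under a uniform prior, the posterior of theta is Beta(1 + successes, 1 + failures)
samples = rng.beta(1 + x.sum(), 1 + len(x) - x.sum(), size=100000)
print(np.quantile(samples, [0.025, 0.975]))   # a 95% credible interval for theta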

Confidence interval

Confidence intervals work the other way round. First we construct an algorithm for obtaining an estimate of the unknown parameter, e.g. the empirical mean estimate $μ(x_1, \ldots, x_T)$. The confidence interval tells us the probability that our estimate is further than $ε$ away from the expectation, as an analytic function of $ε$ and the amount of data $T$: \[ P(|μ - θ| > ε \mid θ) \leq δ(ε, T) \]

Difference between confidence intervals and credible intervals.

In confidence intervals, we condition on the unknown parameter $θ$, but the function $δ$ is constructed so that it is independent of $θ$, since that is unknown.

Simulation study

For a simple visualisation problem, vary parameter values and simulate thousands of times under each set of conditions. Summarise your findings graphically.
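
A minimal sketch of such a simulation study, using Bernoulli data with a couple of assumed parameter values: for each sample size, we simulate many datasets and plot how far the empirical mean typically falls from the true parameter.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample_sizes = [10, 100, 1000]
for theta in [0.1, 0.5]:
    errors = [np.abs(rng.binomial(n, theta, size=10000) / n - theta).mean()
              for n in sample_sizes]
    plt.plot(sample_sizes, errors, marker="o", label=f"theta = {theta}")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("number of samples")
plt.ylabel("mean absolute error of the empirical mean")
plt.legend()
plt.show()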

Introduction: Estimation, Prediction and Decision Making

Correlation example

./figures/pirates-global-warming.jpg

The three main problems in statistics

Estimation

  • How many covid patients are there now?
  • Which is the best meteorological station?
  • What is the right model for planetary motion?

Prediction

  • How many covid patients will there be next week?
  • Will it rain tomorrow?
  • When is the next lunar eclipse?

Decision making

  • What covid measures should I take?
  • Should I take the umbrella?
  • What’s the best way to reach the next eclipse?

Types of variables

Observed variables

  • Telescope images
  • Survey data from a health authority
  • Radar measurements in an autonomous vehicle (AV)

Latent variables

  • The mass, position and velocity of Mars
  • Actual number of covid patients
  • The location and speed of nearby vehicles

Decision variables

  • Can be observed or latent
  • Selected arbitrarily by humans or machines

From correlation to causation.

Statistical relation.

  • Is smoking related to lung cancer?

Statistical causation.

  • Does smoking cause lung cancer?
  • Will I get lung cancer if I smoke?

Actual causes.

  • Did I get lung cancer because I smoked?

Correlation versus Causality

Correlation and dependence

  • Does knowing the value of one variable give us information about another variable?

Causality

  • Does changing the value of one variable affect other variables?

Joint and Conditional Probabilities

Here we go into some more detail about joint and conditional probabilities.

Correlation: Setting

Let us start with the general setting.

Outline

  • Random variables $x, y$ taking values in $\mathcal{X}, \mathcal{Y}$.
  • We write $Pr(x,y)$ to informally mean their joint distribution.

For a formal definition, we must define an appropriate probability measure on $Ω$, which will then result in a new measure on the space defined by the two variables of interest, as follows:

Formal definition

  • Underlying probability space $(Ω, Σ, P)$ with
  • Outcome space $Ω$
  • Event space $Σ$, so that events $A ∈ Σ$ are subsets of $Ω$.
  • Probability measure $P : Σ → [0,1]$.
  • RVs $x : Ω → \mathcal{X}$, $y : Ω → \mathcal{Y}$
  • Joint measure $P_{x,y}(S_x, S_y) \defn P(\{ω : x(ω) ∈ S_x, y(ω) ∈ S_y\})$

However, this is not, strictly speaking, necessary for an intuitive understanding.

Conditional probabilities

Definition

The conditional distribution of $x$ given $y$ is: \[ Pr(x | y) = Pr(x, y) / Pr(y) \] Thus, for every value of $y$ we get a different distribution for $x$, called $Pr(x | y)$.

Recall definition for events

This has the same form. \[ P(A | B) = P(A ∩ B) / P(B). \]

Python example: independent RVs

Let us begin with a simple example of two variables which are independent Bernoulli.

Bernoulli-distributed $x, y ∈ \{0,1\}$

  • $Pr(x = 1) = θ$
  • $Pr(y = 1) = v$

The independence can also be seen in the structure of the program: x and y are generated by two separate, unrelated calls to np.random.choice()

One draw of $x,y$

import numpy as np
theta = 0.6
v = 0.8
x = np.random.choice(2, p = [1 - theta, theta])
y = np.random.choice(2, p = [1 - v, v])

Dependent, Discrete $x, y$

Now let us look at dependent variables. Here $x$ is generated first. Then $y$ has a value which depends on what value of $x$ has been generated.

Bernoulli-distributed $x, y ∈ \{0,1\}$

  • $Pr(x = 1) = θ$
  • $Pr(y = 1 | x = 0) = v_0$
  • $Pr(y = 1 | x = 1) = v_1$

We can also calculate $Pr(y = 1)$ by using the marginalisation rule: \[ Pr(y = 1) = Pr(y = 1 | x=0) Pr(x = 0) + Pr(y = 1 | x=1) Pr(x = 1) = v_0 (1 - θ) + v_1 θ. \] Now let us see how to do this programmatically. Here, the value of x determines which probability is used to generate y:

One draw of x, y

import numpy as np
theta = 0.6
v = np.zeros(2)
v[0] = 0.4
v[1] = 0.8
x = np.random.choice(2, p = [1 - theta, theta])
y = np.random.choice(2, p = [1 - v[x], v[x]])
return x,y

Python example: multiple draws

Doing this for multiple draws is a bit more complicated, but it can be done. We just need a list comprehension to generate one $y$ for each $x$ generated.

import numpy as np
n = 10000
theta = 0.6
v = np.zeros(2)
v[0] = 0.4
v[1] = 0.8
x = np.random.choice(2, p = [1 - theta, theta], size = n)
y = np.array([np.random.choice(2, p = [1 - v[x_t], v[x_t]]) for x_t in x])
import matplotlib.pyplot as plt
A = np.zeros([2,2])

for i in range(2):
    for j in range(2):
        A[i, j] = sum((x == i) & (y == j))

plt.imshow(A)
plt.savefig("correlated-binary.png")
plt.show()
return A

Empirical joint probability of correlated x, y

./figures/correlated-binary.png

Empirical joint probability of independent x, y

./figures/independent-binary.png

Continuous $x, y$

This is the typical structure of regression problems. Let us start with a simple normal distribution for both x, y:

Normal-distributed $x, y$

  • $x ∼ Normal(0, 1)$.
  • $y | x ∼ Normal(x, 1)$.

The above notation means that:

  • $x$ has a normal distribution with mean 0 and variance 1.
  • Given the value of $x$, $y$ has a normal distribution with mean $x$ and variance 1.

One draw from x, y

This dependence is made explicit in the following code.

import numpy as np
x = np.random.normal(0, 1)
y = np.random.normal(x, 1)
return x, y
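
A minimal sketch of many draws from the same model: the scatterplot shows a cloud of points centred on the line $y = x$.

import numpy as np
import matplotlib.pyplot as plt

n = 1000
x = np.random.normal(0, 1, size=n)
y = np.random.normal(x, 1)          # one y for each x, with mean x and variance 1
plt.scatter(x, y, s=5)
plt.xlabel("x")
plt.ylabel("y")
plt.show()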

Continuous $x$, Discretre $y$

This is the typical structure of classification problems.

Normal-distributed $x$, Bernoulli-distributed $y$

  • $y ∼ Bernoulli(0.6)$
  • $x | y ∼ Normal(160 + 10 y, 1)$.

The above notation means that:

  • $y$ has a Bernoulli distribution with mean 0.6.
  • Given the value of $y$, $x$ has a normal distribution with mean $160 + 10y$ and variance 1.

One draw from x, y

import numpy as np
y = np.random.choice(2, p = [0.4, 0.6])
x = np.random.normal(160 + 10 * y, 1)
return x, y
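
With many draws, the classification structure becomes visible: a histogram of $x$ for each value of $y$ shows two bumps, centred at 160 and 170. This is a minimal sketch.

import numpy as np
import matplotlib.pyplot as plt

n = 1000
y = np.random.choice(2, p=[0.4, 0.6], size=n)
x = np.random.normal(160 + 10 * y, 1)
plt.hist(x[y == 0], bins=30, alpha=0.5, label="y = 0")
plt.hist(x[y == 1], bins=30, alpha=0.5, label="y = 1")
plt.legend()
plt.show()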

Covariance matrix

  • Consider a collection of RVs $x_1, \ldots, x_n$.
  • The joint distribution is a complicated object.
  • Visualised with scatterplots $(x_i, x_j)$, e.g. sns.pairplot()

Covariance matrix $C$

Instead, we can calculate the normalised covariance matrix \[ C_{ij} = \frac{\E\{[x_i - \E(x_i)][x_j - \E(x_j)]\}}{\sqrt{\E\{[x_i - \E(x_i)]^2\}\,\E\{[x_j - \E(x_j)]^2\}}} \] With this normalisation $C_{ij} ∈ [-1, 1]$; this matrix is also known as the correlation matrix.

Approximating the covariance matrix

  • Assuming data $x(t)$ with components $x_i(t)$:
  • $C_{ij} ≈ \frac{1}{T} ∑_{t=1}^T [x_i(t) - μ_i] [x_j(t) - μ_j] / (σ_i σ_j)$.
  • $μ_i$: (empirical) mean of $x_i$
  • $σ_i$: (empirical) standard deviation of $x_i$
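
As a sketch on simulated data (the dependence of x2 on x1 below is an assumption made purely for illustration), the empirical formula above agrees with numpy's built-in np.corrcoef.

import numpy as np

rng = np.random.default_rng(0)
T = 10000
x1 = rng.normal(0, 1, size=T)
x2 = x1 + rng.normal(0, 1, size=T)       # x2 depends on x1
C_12 = ((x1 - x1.mean()) * (x2 - x2.mean())).mean() / (x1.std() * x2.std())
print(C_12)                              # close to 1/sqrt(2), i.e. about 0.71
print(np.corrcoef(np.vstack([x1, x2])))  # full 2x2 matrix with the same off-diagonal entry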

Correlation versus dependence

Dependent random variables

  • $x, y$ are independent if $Pr(x,y) = Pr(x)Pr(y)$
  • equivalently, if $Pr(x | y) = Pr(x)$
  • $x, y$ are dependent if they are not independent.

Correlated random variables

  • $x, y$ are uncorrelated if $\E(xy) = \E(x)\E(y)$
  • In particular, this holds when $\E(x | y) = \E(x)$ (mean independence)
  • $x, y$ are correlated if $\E(xy) ≠ \E(x)\E(y)$

Theorem

  • If $x, y$ are correlated then they are dependent.
  • If $x, y$ are independent then they are uncorrelated.
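
Note that the converse statements do not hold. A standard illustration (not from the text above) is $y = x^2$ with symmetric $x$: the two variables are clearly dependent, yet nearly uncorrelated.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=100000)
y = x ** 2                        # y is a deterministic function of x: fully dependent
print(np.corrcoef(x, y)[0, 1])    # close to 0: uncorrelated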

Models of Causation

Causal inference vs the actual cause

Causal inference

  • Can aspirin cure headaches?
  • Does smoking cause lung cancer?
  • Or do cancer patients become smokers?
  • Or is there a third factor causing both?

The actual cause

  • Did aspirin cure my headache?
  • Did smoking cause my cancer?

Applications

  • Causal inference useful in a scientific setting.
  • Reliable methods for causal inference exist.
  • Actual causes useful in a legal setting.
  • No reliable method or definition exists for determining actual causes.

Confounding variables

Arrival at work

  • Tom and Fatima both work in Lausanne.
  • Whenever Tom is late to work, so is Fatima.
  • When this happens, there is also a traffic jam.

Kidney stone treatment

  • On average, treatment A appears effective 70% of the time
  • while treatment B appears effective only 66% of the time.
  • Why is that? (see src/causality/confounders)
| Treatment | Small Stones | Large Stones |
|-----------+--------------+--------------|
| A         | 80%          | 30%          |
| B         | 90%          | 60%          |

We see here that treatment B is always better, but Large Stones are harder to treat. Let us now consider the treatment policy:

| Treatment | Small Stones | Large Stones |
|-----------+--------------+--------------|
| A         | 80%          | 20%          |
| B         | 20%          | 80%          |

The treatment policy mostly assigns treatment A when patients have small stones, and B for large stones. Hence, as treatment B is applied to harder cases, it might look worse on average. In particular, if we calculate the average treatment effect, by marginalising over stones, we have

| Treatment | Average Effect |
|-----------+----------------|
| A         | 0.70           |
| B         | 0.66           |

So, even though the second treatment is better for both large and small stones, if we ignore the stone size it looks worse.

Ignoring the stone size makes it a confounder: it influenced which treatment was given, yet we leave it out of the comparison. This would not matter if stone size had not been used to select the treatment, which is why we typically use randomised trials to study the effect of treatments.
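
The numbers above can be reproduced with a few lines of Python; this is just a re-computation of the tables, not new data.

# Success rates per (treatment, stone size) and the share of each treatment's
# patients with small or large stones, taken from the tables above.
success = {"A": {"small": 0.8, "large": 0.3},
           "B": {"small": 0.9, "large": 0.6}}
share   = {"A": {"small": 0.8, "large": 0.2},
           "B": {"small": 0.2, "large": 0.8}}
for t in ["A", "B"]:
    average = sum(share[t][size] * success[t][size] for size in ["small", "large"])
    print(t, average)   # A: 0.70, B: 0.66, although B is better for each stone size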

Instrumental variables

Setting

  • $a$: Treatment
  • $x$: Patient characteristics
  • $y$: Outcome

Confounding

  • $x$ is ignored
  • $a$ depends on $x$.
  • $y$ depends on $a, x$.

Using $x$ as an instrument

  • $x$ is taken into account
  • We model $Pr(y | x, a)$, $Pr(a | x)$

Avoiding confounding

Randomised studies

  • Treatment is random
  • Hence dependence to $x$ is cut.

Selecting an instrument

  • Must be an informed choice
  • A complex instrument can lead to spurious correlations

Explicitly model policies $π$

  • Treatment policy $π$, which assigns treatments with probabilities $π(a | x)$.
  • Outcomes depend on policy $π$

Module 4: Advanced visualisation

Geographical data

https://scikit-learn.org/stable/auto_examples/neighbors/plot_species_kde.html

Colour maps

  1. Colour as a continuous variable.
  2. Colour as a discrete variable.
  3. Colour perception and interpretation.

Contour maps

  1. Geographical contour
  2. Density plots

Text data

Assignments

The course contains assignments and a project. The instructions for each assignment are given below. The assignments are largely done in class, but completed at home.

Table To Picture

TLDR: Find a table on Wikipedia on a topic of interest, and convert the table into a graph.

The purpose of this assignment is for you to create a graphic that demonstrates something interesting about a data table found on the web. Please provide answers that are as precise and concise as possible. This assignment is graded. You are encouraged to discuss the assignment with other students in a group. However each student must prepare their own individual report. Please submit your answers on Moodle.

Instructions

In this exercise, you must create a plot from an existing dataset and write a short report. Use the following steps as a guideline.

  1. Find a data table on the internet.
  2. Write a short description of the data on the table.
  3. Create one or two plots of your choice, summarising the data on the table.
  4. Explain what the graph shows about the data.
  5. Try and draw some conclusions or generalisations from the graph. Does it make logical sense?

Example

This is an example of this exercise for a dataset we already saw in class.

  1. Use the world records data for 100m/400m/1500m or some other distance.
  2. Explain what the records show.
  3. Show how the world record changes over time for men and women, with different colours. Be sure to plot records with the x axis showing time.
  4. Draw regression lines over the world record graph: the records reduce over time.
It is not logically possible for the times to keep decreasing linearly forever! Is there a fundamental limit? How can the data be best explained? Is human performance constant over time, with records falling due to random chance? Or is human performance slowly increasing over time?

Plot deconstruction

Instructions

Find an interesting plot from a web page, e.g. on Wikipedia. Try to identify some problem with the plot. To help you, ask yourself the following questions:

  • Is the plot type appropriate?
  • Is the data correct?
  • Does the plot convey an appropriate message?
  • Is there more data somewhere that you could combine with the original to obtain a better picture?

After you have identified problems with the plot, data sources, or missing data, create a new plot, along with an explanation of how you addressed the original plot’s deficiencies.

Example

Newspaper article analysis

In this assignment you will read a newspaper article with some statistics and visualisations, and try to interpret what it says. You must study the article critically. Are the conclusions supported by the data? Does the methodology make sense? Find primary sources that confirm or challenge the article to obtain a more rounded picture.

Here is a list of possible articles you can use. Feel free to suggest your own article and add it to the list.

https://docs.google.com/spreadsheets/d/1QKj_L9f0UIH80qgs2kcjc8AU1eKZzsTOcYFSU06HJBY/edit#gid=0

Open project

Project proposal

Propose a problem to solve, including:

  • Hypotheses to test
  • How to collect data
  • How to analyse the collected data

Project Highlight

After you have started your project, each one of the project members presents a preliminary plot and explains it (5 minutes).

Project presentation

Project report

The completed project should include a report written by both students in the team. This should address the points in the Assessment description.

Suggested structure for the report:

  1. Hypotheses and scientific question.

What is the main scientific question you want to answer? After a general introduction to the question, be more precise and list all alternative hypotheses that you would like to examine.

Example: We wanted to see if electric, diesel or gasoline vehicles are more environmentally friendly. More specifically, we will compare the lifetime carbon emissions of vehicles of all types, including material sourcing, construction, driving and fuel sources, maintenance and disposal.

  1. Data collection and experiment design

What data did you use to answer your question? How has the data been collected? What are the main characteristics of the data?

Example: We will collect data from a few representative manufacturers which produce both EV and internal combustion vehicles. This will be combined with measurements from national environmental agencies and NGOs. [Then detail the list of sources and what data you obtain from each]

  1. Methodology

How will you use the data to answer the question? Explain in as precise terms as possible.

Example: The easiest part to measure is the CO2 emissions during car use. We can obtain measurements from government agencies, magazines and the manufacturers about the power consumption of each vehicle. These can be synthesized into a range of consumption numbers [Detail how exactly you do the synthesis]. For the construction of a vehicle it is hard to get exact numbers, but we used estimates from these articles [cite the articles]. Because there might be a manufacturer bias (some might generally use more intensive processes than others), we will focus on comparing equivalent models within a manufacturer with different engines and drive-trains.

  1. Results

Here give a number of plots to answer your question. Section 3 should say how the plots were arrived at. In this section, you should explain the meaning of these plots and what they show.

Example: Stacked bar plots of CO2 emissions for different types of vehicles, grouped by manufacturer. Differences in CO2 emissions between categories for each manufacturer. The plots might show that e.g. Renault overall has lower emissions than Ford, but that in either case the EV vehicles have lower lifetime emissions than their Diesel counterparts.

  1. Conclusion

Are you able to arrive at a firm conclusion or not? If so, what is it? If not, would more data help you to obtain a clearer conclusion? Are you happy with your methodology, or could it be improved?

Example: You have firmly concluded that in-use emissions are lower for EVs than internal combustion vehicles. However, you were unable to obtain data regarding disposal and so have left that out of the analysis. Moreover, the uncertainty about construction emissions is too great to see a significant difference.

Theoretical Background

Notation

For convenience, I include the necessary mathematical notation.

Sets

  • $\mathbb{R}$: Real numbers
  • $\mathbb{R}^d$: d-dimensional Euclidean space
  • $∅$: The empty set
  • $A ⊂ B$: A is a subset of B.
  • $A ∩ B$: The intersection of A and B
  • $A ∪ B$: The union of A and B
  • $A \setminus B$: Removing B from A
  • $Ω$: The “universe”
  • $A^c = Ω \setminus A$: The complement of a set.
  • $\{x | f(x) = 0\}$: The set of x so that $f(x) = 0$.

Analysis

  • $\mathbb{I}\{x ∈ A\}$: indicator function (takes the value $1$ if $x ∈ A$, $0$ otherwise)
  • $∑_{x ∈ X} f(x) = f(x_1) + \cdots + f(x_n)$, with $X = \{x_1, \ldots, x_n\}$
  • $d/dx f(x)$: derivative of $f$
  • $∂/∂ x f(x,y)$: partial derivative of $f$
  • $∇_x = (∂/∂ x_1, \ldots, ∂/∂ x_n)$, vector of partial derivatives.

Probability

  • $Pr$: Probability (used informally)
  • $\mathbb{E}$: Expectation
  • $P$: A probability measure
  • $p$: A probability density
  • $P(A | B) = P(A ∩ B) / P(B)$: Conditional probability, for $A, B ⊂ Ω$.
  • $\param$: Parameter
  • $\Param$: Parameter set
  • $\{P_\param | \param ∈ \Param\}$: A family of parametrised models
  • $Pr(x | y)$ conditional probability for random variables x, y (generally)

Probability background

The theory of probability is used to mathematically define processes with uncertain outcomes. The set of all possible outcomes depends on the process. For example, if we throw a die, this is the set of all possible ways, locations etc that the die can fall and land. However, we may only be interested in two events: whether the die lands showing a ‘6’ or not. Formally, an event $A$ is a subset of all the possible outcomes $Ω$. In our example, $A$ can be the set of all ways in which the die can land so that its top shows “6”. A probability measure $P$ simply assigns a number between 0 and 1 to every subset $A$ we might be interested in. This can be thought of as the area of $A$. Different probability measures $P$ assign different areas to different sets.

For some more technical details, see Probability space.

See also:

Probability space

In probability theory, we typically define the set of all possible events that we care about as the algebra $Σ$, so that any possible event $A ∈ Σ$ is a subset $A ⊂ Ω$. The algebra has the property that it is closed under union and complement, that is:

  1. If $A, B ∈ Σ$ then $A ∪ B ∈ Σ$
  2. If $A ∈ Σ$ then $¬ A ∈ Σ$.

Here, $¬ A \defn Ω \setminus A$, i.e. the set of outcomes in $Ω$ that are not in $A$.

Together with a probability measure $P$, the tuple $(Ω, Σ, P)$ defines a probability space. Simply put, $P(A)$ is the probability that event $A$ happens.

Probability measures

A probability measure $P$ is a function from sets to the interval $[0,1]$. Measuring the probability of a set is technically the same as measuring the area of a region, or the number of items in a given region. Formally, a probability measure is defined on:

  • A universe $Ω$ of outcomes
  • The algebra $Σ$ of subsets of $Ω$ (which we can think of as all the ‘events’ of interest) so that:

(a) If $A ∈ Σ$, then $A ⊂ Ω$.
(b) If $A, B ∈ Σ$, then $A ∪ B ∈ Σ$.
(c) If $A ∈ Σ$, then $Ω \setminus A ∈ Σ$.

The axioms of probability A probability measure $P: Σ → [0,1]$ on $Ω$ satisfies the following axioms:

  1. $P(Ω) = 1$.
  2. If $A ∩ B = ∅$ then $P(A ∪ B) = P(A) + P(B)$.

From these, it also follows that $P(∅) = 0$.

See also: https://en.wikipedia.org/wiki/Probability_measure

Marginalisation

If $B_1, \ldots, B_n$ is a partition of $Ω$, i.e. a collection of sets so that $B_i ∩ B_j = ∅$ for $i ≠ j$ (they are disjoint) and $\bigcup_i B_i = Ω$ (they cover all of $Ω$), then we have \[ P(A) = ∑_i P(A ∩ B_i), \] because the sets $(A ∩ B_i)$ are disjoint and $\bigcup_i (A ∩ B_i) = A$.

Mutually exclusive events

Two events $A, B$ are mutually exclusive if $A ∩ B = ∅$.

This means that there is no random outcome $ω$ that is in both of them, so they can never be true at the same time.

Independence

Two events $A, B$ are said to be independent iff $P(A ∩ B) = P(A) P(B)$.

Intuitively, this means that knowing if $A$ happened tells us nothing about whether $B$ happened.

Conditional probability

For any events $A, B ⊂ Ω$ and any probability on $Ω$, we define the conditional probability that $A$ is true if $B$ is true as follows: \[ P(A | B) = \frac{P(A ∩ B)}{P(B)} \] In other words, this is the proportion of times that A is true when B is also true. It is akin to restricting the universe of outcomes to B and then asking: how often is A true?

We can also express independence in terms of conditional probability. A and B are independent if: \[ P(A | B) = P(A), \] or, equivalently (provided $P(A) > 0$): \[ P(B | A) = P(B). \]

Bayes theorem

We can use the conditional probability definition to relate $P(A|B)$ to $P(B|A)$. \[ P(A | B) = \frac{P(A ∩ B) }{P(B)} = \frac{P(B | A) P(A) }{P(B)}, \] since $P(B | A) = P(A ∩ B) / P(A)$.

This is most useful for statistical inference, where we think of $P(A)$ as the prior, $P(A | B)$ as the posterior, and $P(B | A)$ as the likelihood of the evidence.

It is also useful to write this rule when $A_1, \ldots, A_n$ is a partition of $Ω$. Then $P(B) = ∑_i P(B ∩ A_i) = ∑_i P(B | A_i) P(A_i)$ and so: \[ P(A | B) = \frac{P(B | A) P(A) }{∑_i P(B | A_i) P(A_i)}, \]

Conditional independence

Two events $A, B$ are said to be independent given another event $C$ iff \[ P(A ∩ B | C) = P(A | C) P(B | C) \]

This can also be written as \[ P(A | C, B) = P(A | C) \]

Example Distributions

We focus on distributions where there is a finite number of possible outcomes, and hence a finite number of possible events that we might care about. All such distributions are characterised by one or more parameters. The simplest such distribution is a distribution on only two outcomes, the family of Bernoulli distributions.

The Bernoulli distribution.

Let us start with a simple example, the Bernoulli distribution with parameter $θ ∈ [0,1]$. This is the distribution over two outcomes $\{0,1\}$, so that if $x$ is a Bernoulli random variable, then: \[ Pr(x = 1) = θ, \qquad Pr(x = 0) = 1 - θ. \] It is typical to think of the distribution of heads and tails of a coin as being Bernoulli, with parameter $θ = 1/2$.

Probability space Formally, if the underlying probability space is $(Ω, Σ, P)$, with random outcomes $ω ∈ Ω$ and random variable $x : Ω → \{0,1\}$ then \[ Pr(x = 1) = P(\{ω : x(ω) = 1\}). \]

See also: https://en.wikipedia.org/wiki/Bernoulli_distribution

Binomial distribution

If we repeat a Bernoulli trial, we can also count the number of times the coin comes up heads. The distribution of the counts is called the Binomial distribution. If $y$ is a binomial random variable for $n$ throws with parameter $θ$, then we can write it as the sum of $n$ Bernoulli random variables $x_1, \ldots, x_n$, i.e.: \[ y = ∑_{t=1}^n x_t. \] The probability of $k$ heads after $n$ throws is given by the formula: \[ Pr(y = k) = \binom{n}{k} θ^k (1 - θ)^{n - k}, \] where $\binom{n}{k}$ is the binomial coefficient.

See also: https://en.wikipedia.org/wiki/Binomial_distribution

The Categorical/Multinomial distribution

A multinomial distribution is an extension of the Bernoulli and binomial distributions to $m \geq 2$ outcomes.

Categorical distributions Let us start with one trial, e.g. a single throw of a die. We can model this die throw as the distribution where the probability that the die lands with its $k$-th face on top is $θ_k$, \[ Pr(x = k) = θ_k.\] Thus, this distribution is parametrised by the vector $θ = (θ_1, \ldots, θ_m)$. A random variable $x : Ω → \{1,\ldots, m\}$ obeying this distribution is called categorical (or multinomial with a single trial).

If the underlying probability space is $(Ω, Σ, P)$, then \[ Pr(x = k) = P(\{ω : x(ω) = k\}) \]

See also: https://en.wikipedia.org/wiki/Multinomial_distribution

Uniform distributions

A special case of the categorical (multinomial) distribution is the uniform distribution. This is defined as follows.

Let $|A|$ be the size of a set $A$. Then a distribution $P$ is uniform if it obeys \[ P(A) = \frac{|A|}{|Ω|}. \]

This definition applies to continuous distributions as well. A standard example is the uniform distribution on the interval $[0,1)$. Then the probability that we obtain an outcome in the set $[0,p)$ is always equal to $p$, i.e. \[ P([0,p)) = Pr(ω ∈ [0,p)) = p. \]
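
A quick simulation check of the last statement (the value of $p$ below is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100000)   # uniform draws in [0, 1)
p = 0.3
print(np.mean(u < p))          # fraction of draws in [0, p): close to 0.3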

Random variables

A real-valued random variable $f : Ω → \mathbb{R}$ is simply a function from the outcomes to the real numbers. Even though it is a fixed function, its values are random, because the actual value $ω ∈ Ω$ that will be used to calculate its value $f(ω)$ is random.

Random variables can be easily generalised to other domains than the real numbers.

See also: https://en.wikipedia.org/wiki/Random_variable In French: https://fr.wikipedia.org/wiki/Variable_al%C3%A9atoire

Example random variable

Take a 10-sided die. The outcomes of the die represent the space $Ω$. We can now create a Bernoulli variable from the die. For example, set $x(ω) = 1$ when $ω ∈ \{1, \ldots, 5\}$ and $x(ω) = 0$ when $ω ∈ \{6, \ldots, 10\}$.

What is the distribution of $x$? Have you seen it before?

Generating a Bernoulli random variable.

In python, there are procedures for generating data from many types of random variables. However, such methods do not generate truly random numbers. They rely on something called a pseudorandom number generator. The values output by this generator are then transformed so as to become similar to the random variable we want.

  1. Let us start by throwing a 10-sided die. How do you generate a Bernoulli random variable with $θ = 0.6$ from the outcomes of the dice throw?
  2. Now let us consider the following python code, which generates values from a Bernoulli random variable.
import numpy as np
return np.random.choice(2,p=[0.6, 0.4])

Let us replicate it using a uniformly distributed random variable in $[0,1)$.

import numpy as np
x = np.random.uniform() # returns a uniformly generated number in [0,1). 
if x < 0.6:
  return 1
else:
  return 0

Expectation

The expectation of a random variable $f : Ω → \mathbb{R}$ with respect to a probability $P$ on $Ω$ is defined as follows, for finite $Ω$: \[ \E_P(f) = ∑_{ω ∈ Ω} f(ω) P(ω). \] When $P$ is clear from the context, we can just write $\E(f)$.

Expectation of general random variables

For the general case, we define it in terms of integrals: \[ \E_P(f) = ∫_Ω f(ω) \,d P(ω). \]

Averages and expected values

If we have obtained a sequence of samples $ω_1, \ldots, ω_n$ from $P(ω)$, and calculated the average \[ M_n(ω_1, \ldots, ω_n) = \frac{1}{n} ∑_{i=1}^n f(ω_i), \] then the average is approximately equal to the expectation: \[ \E_P(f) ≈ \frac{1}{n} ∑_{i=1}^n f(ω_i). \]

To see this, first note that $\E_P(f(ω_i)) = \E_P(f)$, so the expected value of the average is equal to the expected value of $f$: \[ \E\left[\frac{1}{n} ∑_{i=1}^n f(ω_i)\right] = \frac{1}{n} \E\left[∑_{i=1}^n f(ω_i)\right] = \frac{1}{n}∑_{i=1}^n \E[f(ω_i)] = \frac{1}{n}∑_{i=1}^n \E(f) = \E(f). \]

In fact, the average $M_n$ is another random variable, defined on $Ω^n$. We say that $M_n$ concentrates around the expected value, with the property that \[ M_n = \E(f) \pm O(1/\sqrt{n}). \] Informally, we should expect our error to shrink proportionally to $1/\sqrt{n}$, where $n$ is the number of samples we have.
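
A minimal sketch of this concentration effect, using throws of an ordinary six-sided die (expectation 3.5): the error of the average shrinks roughly like $1/\sqrt{n}$.

import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    rolls = rng.integers(1, 7, size=n)          # n die throws, values 1-6
    print(n, rolls.mean(), abs(rolls.mean() - 3.5))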

Variance

The variance of a random variable $f$ is defined as \[ σ_f^2 \defn \E\{[f - \E(f)]^2\}. \] We can rewrite the variance as follows: \[ σ_f^2 = \E\{[f - \E(f)]^2\} = \E[f^2 - 2f \E(f) + \E(f)^2] = \E[f^2] - 2\E(f) \E(f) + \E(f)^2 = \E[f^2] -\E(f)^2. \] The variance is another example of a moment of a random variable. More complicated moments, such as skew and kurtosis exist.

Conditional expectation

We can define the conditional expectation of a random variable through conditional probabilities.

In particular, let $f: Ω → \Reals$ be a random variable and $B ⊂ Ω$ an event. Then we define \[ \E(f | B) = ∑_{ω ∈ Ω} f(ω) P(ω | B). \]

We can even re-write this in a more intuitive way. Note that if $ω ∉ B$ then $P(ω | B) = 0$. So, we have \[ \E(f | B) = ∑_{ω ∈ B} f(ω) P(ω | B) = ∑_{ω ∈ B} f(ω) P(ω) / P(B). \]

Group comparison

Let us say we wish to compare a distribution among multiple groups. Perhaps we wish to compare the proportion of students who pass a certain test, as in the table below.

| School  | Success (Male) | Success (Female) |
|---------+----------------+------------------|
| A       | 62%            | 82%              |
| B       | 63%            | 68%              |
| C       | 37%            | 34%              |
| D       | 33%            | 35%              |
| E       | 28%            | 24%              |
| F       | 6%             | 7%               |
| Average | 45%            | 38%              |
  1. Let us plot the success rate for females and males over different schools.
  2. Does this show a bias? What information is missing?
  3. Let us now combine these two plots into one. For this, we can use the following code:
import numpy as np
import matplotlib.pyplot as plt
w = 0.5
X = np.arange(6)                       # one position per school A-F
M = [62, 63, 37, 33, 28, 6]            # male success rates (%) from the table above
F = [82, 68, 34, 35, 24, 7]            # female success rates (%) from the table above
plt.bar(X, M, align='edge', width=w/2)
plt.bar(X + w/2, F, align='edge', width=w/2)
plt.xticks(X + w/2, ["A", "B", "C", "D", "E", "F"])
plt.show()

Graphics types

  1. Histogram and 3D extensions
  2. Density curve
  3. Scatterplot
  4. Smooth scatterplot
  5. Violin plot
  6. Line Plot
  7. Confidence Intervals
  8. Geographical/topological maps
  9. Network graphs
  10. Word cloud

See also: Catalogue of data visualisation

Schedule and links to other courses

The schedule of this and the other courses is in flux, but I do not expect it to change very much. In any case, the course will operate independently of the other courses. You should expect to cover the same topic more than once. However, this course will not focus on either statistical theory or programming.

WeekStatisticsProgrammingIn-CourseHomework
1Course introPython introHistograms
23 SepRandomness
Math score
2R IntroData typesForm groups
30 SepData manipulationUncertainty
HistogramsDiscrete Variables
ScatterplotsContinuous Variables
Boxplots
Variable types
Mosaic plots
Functions
3Quantifying VariabilityControlTime-Series
7 OctDistributionLinear functionsForm groups
Density functionStock market prices
HistogramsCrime statistics
SkewnessS&P index
QuantilesWorld Records
4Qualitative vars in RStructures
14 OctDiscrete vars in RProj. Proposal
Scatterplots
Unemployment
5Continuous RVFunctions
21 OctData CleaningTable2Picture
6Continuous RVComplementsExperiment design
28 OctRandom SamplingDeconstruction
Undercounting
Representative samples
7Continuous RVClassesExpectations
4 NovNewsPaper
8Dependencies.ObjectsBayesian Inference
11 NovJoint distribution.Project Highlight
Conditional distribution.
9MomentsErrorsPipelinesProject
18 NovSimulation studies
10CovarianceIterators
25 NovCorrelationHypothesis testingProject
ScatterplotsThe garden of many paths
11Prices, returnsFPExamples, project work
2 DecProject presentation
12Conditional expectationsProject report
9 DecExamples, project work
13
16 DecProject presentationsProject report

Glossary

  • Area: Aire
  • Die (Dice): Dé(s)
  • Expectation: Espérance
  • Experiment: Expérience
  • Histogram: Histogramme
  • Pie chart: Diagramme circulaire
  • Bar chart: Diagramme à barres
  • Scatterplot: Nuage de points
  • Randomness: Hasard
  • Uncertainty: Incertitude
  • Probability: Probabilité
  • Stochastic: stochastique
  • Random: Aléatoire
  • Random Variable: Variable aléatoire
  • Sample Space: see Universe
  • Sample (v): Échantilloner
  • Sample (n): Échantillon
  • Sampling: Échantillonnage
  • Set: Ensemble
  • Subset: Sous-ensemble
  • Superset: Sur-ensemble
  • Survey: Sondage
  • Universe: Univers
  • $σ$-algebra: tribu, $σ$-algèbre

References

Python help: Use Python's help() function whenever you can.

help(print)

Books