This is a hands-on course that integrates introductory programming, statistics and data science. Throughout this course, we will formulate scientific hypotheses, design experiments, and collect and analyse data visually and through formal models. You are expected to supplement this course with homework, self-study and other courses in descriptive statistics and Python programming, but no prior knowledge is assumed. Formal concepts will be introduced through class activities and examples.
The course will focus neither on statistical theory nor on programming. We will only lightly build those skills within the course. Our main aim will be to build scientific and statistical intuition through practical work. However, to achieve this you need to be able to independently pick up theoretical and programming skills. For that reason, it is necessary for you to do outside reading either (ideally) by taking statistics and programming courses in parallel, or through self-study.
Graphical comprehension
- Recognise structural elements in a statistical graph (e.g. axes, symbols, labels) and evaluate the effectiveness (for perception and judgment) and appropriateness (for the type of data) of each structural element.
- Translate relationships reflected in a graph to the data represented.
- Recognise when one graph is more useful than another and organise/reorganise data to make an alternative representation.
- Use context to make sense of what is presented in a graph and avoid reading too much into any relationships observed.
- Express creative thinking via the production of an innovative graphical presentation.
Scientific process
- Understand the randomness, variability and uncertainty inherent in a problem.
- Develop clear statements of the problem/scientific research question; understand the purpose of the answer.
- Perform basic experiment design.
- Identify sources of bias in data collection and analysis.
- Ensure acquisition of high-quality data, and not just a lot of numbers.
- Understand the process that produced the data, to provide proper context for analysis.
- Allow domain knowledge to guide both data collection and analysis.
- Quantify uncertainty (and knowledge) visually.
- Realise that all visualisations are model summaries.
- Write simple Python programs for data science workflows.
- Make sure you are registered on IS-Academia
- Also register on Moodle: this is where the assignments will be
- Clone this git repository
The assessment is purely through in-class exercises, quizzes and homework assignments. There will be assignments spread over the semester, as well as a group project. The project will be performed in pairs.
For all assignments and the project, the following rubric is used. Some of the assignments may not involve all parts.
Experiment design. The first stage of any project, no matter how small, is the experiment design and analysis. This includes a plan for how to collect data, methodologies for analysing the data, and the development of a pipeline, preferably in the form of a program, for collecting and analysing data. In addition, the experiment design must be reproducible: this can be ensured by running the data collection and analysis pipeline on simulated data, and seeing whether the results are as expected.
Computation. Here you must instantiate the experiment design and analysis with concrete computations. For reproducibility, the computations you perform should be independent of the data you actually have. Correctness of the computations is the most important aspect here. However, you should also take care to document why and how you are doing the computations.
Graphics. This addresses the creation of visualisations of your analysis. It is recommended to do this fully automatically, so that you can simply run your pipeline and get all the results you need. Be sure to quantify uncertainty.
Text. Here you should explain in text what the graphics mean. Point out any interesting things you can see in the visualisation and try to explain it. Do not be overconfident, but quantify uncertainty properly.
Synthesis. Here you should summarise the most important findings from your analysis. Be careful not to over-interpret your results. A lot of results can be imaginary, and can be attributed to insufficient data, biased sampling, improper modelling or mere chance.
Skill | Needs Improvement | Basic Level | Advanced Level |
---|---|---|---|
Experiment design: Data collection and analysis pipeline | Inappropriate sampling; non-reproducible analysis. | Data collection biased or analysis not reproducible. | Unbiased sampling and reproducible experiment design and analysis. |
Computation: Perform computations | Computations contain errors and extraneous code. | Computations are correct but contain extraneous/unnecessary code. | Computations are correct and properly justified and explained. |
Graphics: Communicate findings graphically clearly, precisely and concisely | Inappropriate choice of plots; poorly labelled plots; plots missing. | Plots convey information correctly but lack context for interpretation. | Plots convey information correctly with adequate and appropriate information. |
Text: Communicate findings clearly, precisely and concisely | Explanation is illogical, incorrect or incoherent. | Explanation is partially correct but incomplete or unconvincing. | Explanation is correct, complete and convincing. |
Synthesis: Identify key features of the analysis and interpret results | Conclusions are missing, incorrect, or not based on the analysis. | Conclusions are reasonable, but partially correct or incomplete. | Relevant conclusions explicitly connected to analysis and context. |
Pass: All parts must be addressed; the ‘default’ grade is 75%. 5% is added for every ‘advanced’ skill and removed for every ‘needs improvement’ skill. Thus the passing grades are 50-100%.
Fail: If not all parts are explicitly addressed, the assignment is failed.
This course will consider the following data sources in order of importance.
This data is obtained through simulation, and it is useful in order to test whether a particular pipeline is working as intended. In particular, it is a great way to test the performance of a method as you vary the data generation process so that different assumptions are satisfied. This allows you to verify robustness.
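As a sketch of this idea (the function name, bias value and tolerance are all illustrative), one can generate coin-toss data with a known parameter and check that a simple analysis pipeline recovers it:

```python
import random

def estimate_bias(tosses):
    # a minimal 'analysis pipeline': estimate the probability of heads
    return sum(tosses) / len(tosses)

# simulate data where the true parameter is KNOWN
random.seed(0)
true_bias = 0.7
tosses = [1 if random.random() < true_bias else 0 for _ in range(10000)]

# if the pipeline works, the estimate should be close to the truth
estimate = estimate_bias(tosses)
print(abs(estimate - true_bias) < 0.05)  # True
```

Varying `true_bias` or the sample size then lets you probe the robustness of the pipeline under different data-generation assumptions.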
The UCI repository has a large collection of datasets in an easy-to-access format. These have already been used in many academic papers, and are a good starting point for you to look at real data. All the data is formatted in an easy-to-use format, but some pre-processing may still be necessary.
Wikipedia has many interesting articles, from which you can extract tabular data, as well as more contextual information. It is possible to also discuss newspaper articles. Wikipedia and newspaper articles can be used in the context of some assignments.
https://petlab.officialstatistics.org/
https://www.drivendata.org/competitions/97/nasa-mars-gcms/
What is visualisation? It is a way to summarise data. It is also a way to view relationships between variables. Visualisation helps us to find patterns and understand the underlying laws behind how the data was generated. This is, in fact, the essence of modelling.
A model is also a way of summarising the essential features of the data. A visualisation differs from a model in only one sense: it is easy to interpret visually.
Every data visualisation implicitly assumes a model of the data generating process. This is true for even the simplest visualisations, like histograms. There is no escape from the fact that any visualisation makes a lot of assumptions. We must emphasize what those assumptions are. What happens if they are not true?
Every data visualisation, then, proceeds in three steps:
- Data transformation
- Model creation
- Model visualisation
Parameters. Every model is defined by a number of parameters. This is what is displayed when we visualise data. You can think of the model as the underlying theory, and the visualisation as a way to explain the theory visually.
Histograms are a simple tool for modelling distributions. In their simplest application, they simply count the number of items in distinct bins of a dataset. While typically employed to represent the empirical distribution of one-dimensional variables, they can be generalised to multiple dimensions.
- All students who are male raise their hand
- All students who are female raise their hand.
- We count, and draw a bar graph: the number of male, female and other students.
- We count how many are in the BSc of DS of those
- We also count how many are taking a programming course
- We also count how many are taking a maths/stats course
- What does the graph tell us about:
- Computer Science students
- Students at Neuchatel
- Residents of Neuchatel
- Other subsets of the population
- Can we make a similar graph when measuring a continuous variable?
Consider a set of categories $C$. In particular, imagine a dataset $D$ of individuals $i$, each belonging to a category $f(i) \in C$.
A bar graph is a visual representation of the following count vector,
\[
n_c(D) = \sum_{i \in D} \mathbb{I}[f(i) = c],
\]
where the indicator function is defined as
\[
\mathbb{I}[A] = \begin{cases}
1, & \textrm{$A$ is true},\\
0, & \textrm{$A$ is false},
\end{cases}
\]
so that the height of the bar for category $c$ is the number of individuals with $f(i) = c$.
Assume the data is in $\mathbb{R}$.
More generally, a (counting) histogram is defined as a collection of disjoint sets called bins, $A_1, \ldots, A_k$,
with associated counts
$n_i(D) = \sum_{x \in D} \mathbb{I}[x \in A_i]$,
where $A_i$ is the $i$-th bin.
We can use the histogram as the model of a distribution. For that, we use the relative frequency of points in each bin: $p_i(D) = n_i(D) / \sum_j n_j(D)$. The selection of bins influences the model.
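As a small illustration of these definitions (the data and bins are made up), the counts $n_i(D)$ and relative frequencies $p_i(D)$ can be computed directly:

```python
D = [1.2, 1.7, 2.3, 2.9, 3.4, 3.8, 4.1]  # a made-up dataset
bins = [(1, 2), (2, 3), (3, 4), (4, 5)]  # disjoint bins A_i = [a, b)

# n_i(D): number of points falling in each bin
counts = [sum(1 for x in D if a <= x < b) for a, b in bins]
# p_i(D): relative frequency of each bin
probs = [n / sum(counts) for n in counts]

print(counts)  # [2, 2, 2, 1]
print(probs)   # each count divided by 7; the probabilities sum to 1
```

Changing the bin boundaries changes both vectors, which is exactly why the bin selection is part of the model.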
See also: https://en.wikipedia.org/wiki/Histogram
- Introduce the concept of a histogram on the board.
- Split the students in two groups.
- Have each group collect the height of every student.
- How can we summarise the data of each group?
- Now the students will individually draw a histogram from the data of their group.
- Show two different histograms from two people in the same group. Why are they different? Discuss in pairs and then in class.
- Now show a histogram from a person in another group. Why are the histograms in the two groups different? Discuss.
- Collect the data of all students in the online excel file.
- Now we shall plot a histogram of the students using the sheet. How does that differ?
[If there are not enough students, the exercise can be performed by adding random numbers using dice]
- Toss a coin 10 times and record each one of the results, e.g. {0,1,1,0,0,0,0,1,1,1}.
- Count the number of times it comes heads or tails.
- We then summarise the result.
Let us denote the number of times you have heads by $n_H$. We can visualise this by plotting bars or lines, whose height is proportional to $n_H$.
We typically assume that individual coin tosses are generated from a Bernoulli distribution. This means that the probability of heads or tails is fixed, and does not depend on the results of previous tosses. Why might that not be the case?
If individual tosses are Bernoulli, then the distribution of the number of heads (or tails) is a binomial distribution.
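A sketch of this connection (the number of trials is arbitrary): simulate many runs of 10 tosses and compare the empirical frequency of exactly 5 heads with the binomial formula $\binom{n}{k} p^k (1-p)^{n-k}$:

```python
import random
from math import comb

n, p = 10, 0.5  # number of tosses and probability of heads

def binom_pmf(k):
    # probability of exactly k heads in n Bernoulli(p) tosses
    return comb(n, k) * p**k * (1 - p)**(n - k)

# simulate many repetitions of '10 coin tosses'
random.seed(1)
trials = 50000
heads = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]

# compare empirical frequency of 5 heads with the binomial formula
empirical = heads.count(5) / trials
print(round(binom_pmf(5), 3))                # 0.246
print(abs(empirical - binom_pmf(5)) < 0.01)  # True
```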
We will now show how to achieve the same results programmatically.
To help yourself understand python, you can always take a look at the documentation in English or French. To start with, check out the Tutorial. Then use the library reference for advanced usage.
Sometimes it is quicker to just use the help command in the python console or the ? command in the jupyter notebook.
Python can be used as a simple calculator
$ python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 1
2
>>> 2 * 3 + 1
7
>>> 2 * (3 + 1)
8
>>> exit()
This interactive console is the most usual way of playing with python in the beginning, but it is not useful in general.
Python programs are executed one statement at a time. Statements are separated by newlines. Anything appearing after a # symbol is not executed.
print("Hello world") # first statement, with a comment
print("Goodbye, world.") # second statement!
# print("This is not printed") - it is a comment, you see
Python programs can generate text, write to and read from files, access the internet, generate and display or save plots to disk, play and record music, record images from a camera, and much more.
Before we actually run the program, one of you can play the role of the python interpreter. The interpreter goes through each line of the program, interprets it and executes it.
When we execute a program in the console, we are assuming the role of the interpreter that steps through the program.
Run the above statements in the python interpreter
$ python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print("Hello world")
Hello world
>>> print("Goodbye, world")
Goodbye, world
>>> exit()
$
The statements print() and exit() are called functions. Function names must always be used with parentheses. The contents of the parentheses are called arguments. Changing the arguments to a function changes its effect.
We can now try and save the above commands in a file called “pythontest.py” and execute it via
python3 pythontest.py
Console input is used when you want to have a purely interactive session to test something. In console mode, the interpreter executes each line as you enter it.
Script files are used to save your work and re-run it. They also allow you to build complex programs from multiple files, where each file has a different functionality. In script mode, python acts as though you were entering each line one-by-one. It reads each line and executes in turn.
Notebooks are something in between. They are script files with interactive output, and are very useful for rapid development and testing. They also save their state and output in between runs, so they help to document your code. We will use them a lot in class. There are two methods to use notebooks:
- Locally through e.g. jupyter-lab or jupyter-notebook
- Online through https://colab.research.google.com/ https://replit.com or https://noto.epfl.ch
Most of the time, you want to be saving the code you write in a script file and executing it, instead of using the console. However, sometimes the interactivity of the console is helpful. This is when notebooks are used. After you are done developing something with the notebook, you can then extract what you need in a simple python script.
The python interpreter has a state. This includes the contents of a memory where variables are stored, and the current location of the code pointer, that is which line will be executed next.
Variables are alphanumeric references to simple or complex objects. Possible variable names:
- X
- NumberOfApples
- salary
- scratch_variable
- y2
A variable can be assigned a value with the = operator.
x = 2 # this gives the numeric value of '2' to the variable
Variables must be defined before they can be used for the first time:
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
>>> x = 2
>>> x  # typing the name of a variable in the console gives you its value
2
Numerical Python variables are very simple entities. Let us go through this easy program for a warm-up.
- x = value: assigns a value to a variable named x
- print(): displays something in the terminal
x = 1 # a variable
y = 2 # another variable
print(x+y) # print the value of this variable sum
x = y # assignment operation: now x has the same value as y
print(x) #what would this value be?
y = 3
print(x) #is x changed?
The assignment operator = is not a mathematical equation. For example, in mathematics I may write
\begin{align}
x &= y + 1\\
x &= 5
\end{align}
This is a system of equations, which can be solved to obtain $x = 5$, $y = 4$.
Consequently, the following program will fail with an error, as y is not defined:
x = y + 1
x = 5
In the following program, the value of y will remain -1 after the program ends.
y = -1 # y = -1, x is not defined
x = y + 1 # now y = -1, x = 0
x = 5 # now y = -1, x = 5
In fact, writing the above as a system of equations makes no sense, as $x$ cannot simultaneously equal $y + 1 = 0$ and $5$.
So, while math-like notation is used in programming, its meaning is not really the same as in mathematics, most of the time.
A slightly more complex object is the Python list. A list can contain anything, and so is very flexible. It can contain numbers, strings, or arbitrary ‘objects’.
Now check out the first part of the Histogram example. For that we need one line of setup so we can plot stuff.
import matplotlib.pyplot as plt # this is used for plotting
X=[0,1,0,0,0,0,1]; # list of coin tosses
plt.hist(X) # plot a histogram - this automatically splits everything into bins
In reality, the histogram function creates a so-called bar plot
import matplotlib.pyplot as plt # this is used for plotting
X=[0,1,0,0,0,0,1]; # list of coin tosses
plt.bar(["heads","tails"], [sum(X), len(X) - sum(X)]) # do a bar plot!
The following source creates a list of four numbers and returns one element. Things to unpack here:
- x[i] returns the (i+1)-th element: we start counting from 0
- the return statement sends a value back to whatever started the python program: in this case this .org file.
x = [1, 2, 3, 4]
return x[3] # returns the last element of the list
The following program assigns lists to variables. Now x and y are both lists.
x = [1, 2, 3, 4]
y = [-1, -2]
x = y # assignment operation: now x is just a different name for y
y[0] = 1 # modify the 0th element of y
return x # what would the value of x be?
Lists are different in one respect: when we assign one list name to another, this does not copy any data. Both names refer to the same data. Consequently, if we change the data, it changes for both variable names. The way to avoid that is to use the copy() function.
x = [1, 2, 3, 4]
y = [-1, -2]
x = y.copy() # copy operation: now x has a copy of y's data
y[0] = 1 # modify the 0th element of y
return x # what would the value of x be?
A python list is similar, but not identical to, a mathematical set.
\[ S = \{x_1, x_2, \ldots, x_n\}. \]
Because lists are very flexible, they are a bit slow. A special type of object, an array, is used to handle lists of numbers. This is not defined in basic Python, but in a module called numpy. Even though basic Python has only a few commands, it has many modules that extend the language to perform complex tasks without having to code everything from scratch.
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([-1, -2])
x = y # assignment operation:
y[0] = 1
return x
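Numpy arrays behave like lists in this respect: plain assignment aliases the data, while .copy() makes an independent copy. A minimal check:

```python
import numpy as np

y = np.array([-1, -2])
x = y              # assignment: x and y now refer to the same array
y[0] = 1
print(x[0])        # 1: the change is visible through both names

y = np.array([-1, -2])
x = y.copy()       # copy: x now has its own data
y[0] = 1
print(x[0])        # -1: the copy is unaffected
```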
Sometimes we want to repeat some code. For example, we have two matrices of X, Y values and we wish to plot them:
import matplotlib.pyplot as plt
import numpy as np
X = np.random.uniform(size=[10, 128])
Y = X + np.random.uniform(size=[10, 128])
plt.plot(X[0], Y[0])
plt.plot(X[1], Y[1])
#.... etc - to avoid repetition we can use this:
for t in range(10): # this defines the variable t and cycles it through the values 0, 1, ..., 9.
    # start of repeated block
    plt.plot(X[t], Y[t])
    # end of repeated block - blocks are identified by indentation
# we can also loop through a specific list of values
for t in [1, 2, -1]:
    print(t)
# this should output 1, 2, -1
Sometimes we want to repeat a complex bit of code in different places. So a loop won’t do. The way to do that is to use a function:
def function_name(first_argument, second_argument): # there can be zero or more arguments to a function
    return first_argument + second_argument # this function just returns the sum of its arguments
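A hypothetical usage example (the name `add` is made up; it follows the same pattern as function_name above):

```python
def add(first_argument, second_argument):
    return first_argument + second_argument  # same pattern as above

print(add(2, 3))                # 5
print(add("Hello, ", "world"))  # Hello, world
```

Note that the same function works for anything that supports the + operator, numbers and strings alike.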
Whenever code is executed inside a function, the variables created there are only valid within the function. The function arguments are also effectively new variables. To see this, consider the following example.
def example_function(argument):
    # The following does not necessarily modify the original variable
    # passed. It depends on the effect of 'argument =
    # original_variable'. If it copies the value, then the original
    # variable remains the same. If it merely acts as a reference (as
    # is the case with lists and arrays) then a modification happens.
    argument += 1
    # this variable should not be visible outside the function
    hidden_variable = 2
    # variables defined outside the function are still readable!
    print(outside_variable)
    # but, we cannot affect the variables outside the
    # function. Otherwise there would be a mess.
    another_variable = 0
    # for that reason, functions should only use the arguments passed
    return argument
test = 100
outside_variable = "I am defined outside the function"
another_variable = -1 # so am I
foo = example_function(test) # line to unpack
print("test:", test)
print("foo:", foo)
print("outside_variable:", outside_variable)
#print("hidden_variable:", hidden_variable) # it complains of 'hidden_variable' not being defined
Let us unpack what happens. When we write foo = example_function(test), what happens is as follows.
argument = test   # create a new variable: all other variables are now hidden from scope
argument += 1     # execute the function's code block
foo = argument    # apply the return operator to 'foo = example_function()'
For this, we work on the Histogram example.
Pandas is a module for simple and efficient data I/O, processing and visualisation. The following code snippet demonstrates a couple of features.
import pandas as pd # we need to load a library first
df = pd.read_csv("data.csv") # loading data (here a hypothetical 'data.csv') creates a data frame df
df['column-name'] # selects a column
df.hist() # creates a plot with many histograms
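Since the snippet above is schematic, here is a self-contained version with a hand-built DataFrame (the column names and numbers are made up):

```python
import pandas as pd

# build a small DataFrame by hand instead of loading a file
df = pd.DataFrame({"height": [160, 172, 181, 168, 175],
                   "weight": [55, 70, 85, 62, 74]})

print(df["height"].mean())  # select a column and summarise it: 171.2
axes = df.hist()            # one histogram per numeric column
```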
Plotting is also possible directly through matplotlib. This is the module that pandas uses to plot; pandas just provides a simpler interface for doing so. But if you want to create custom plots, matplotlib is what you need to use.
X = [1, 0, 1, 0, 1, 1, 0, 1, 0] # a sequence of coin tosses.
import matplotlib.pyplot as plt # python has no default plot function, we must IMPORT it
plt.hist(X) # this function plots the histogram
Each one of you should predict the result of a number of coin tosses. Let us do a histogram of the predictions. This is a binomial distribution.
- The students record their data in the shared spreadsheet
- Firstly, plot the histogram of the data with default settings.
- What is the effect of changing the number of bins?
Let us look at the student data: see src/histograms/heights.ipynb
import pandas as pd
X = pd.read_csv("class-data.csv") # read the data into a DataFrame
X['Height (cm)'].hist() #directly plot the histogram
While histograms are good visualisations of distributions on the real line, distributions over a discrete set of possible values are best represented by a pie chart. This is especially true if there is no relation between the different values. As an example, if the values are distinct categories, there is no particular reason to order them on an axis.
- What are the advantages and disadvantages of pie charts and histograms?
Histogram | Pie Chart |
---|---|
For more categories | To show proportions |
For real-valued data | To compare relative size |

- Why is a 3D pie chart never a good idea?
import matplotlib.pyplot as plt
counts = [5, 3, 2] # hypothetical counts for three categories
plt.pie(counts) # plot counts as a pie chart
Random algorithms using coins.
y = 0 # y is a variable, with the value zero currently
import numpy as np # this library has many useful functions
x = np.random.choice(100) # x takes values 'randomly'. It is a 'random variable'.
return x # let's see what value it takes
Uncertainty vs randomness: coin-flipping experiment
- Everybody flips a coin 10 times.
- Record each throw with 0, 1 in this spreadsheet: https://docs.google.com/spreadsheets/d/1E4bs05HnKXf1GZe4g3v6RLnHsj-YcaWg3Qe_RQyfhHU/edit?usp=sharing
- Then record how you threw the coin and what coin it was.
- Discuss if the coin is really random.
- What is the distribution of coin throws for the first throw?
- What is the distribution of recorded coin biases? Why do some coins appear more biased than others?
- Does it make sense to aggregate all the results? What does that assume?
In the context of experiment design and data analysis, it is very common to have conditions like those in this example. Even though we wish there was such a thing as the ‘repeated experiment’, in practice an experiment is never repeated exactly. There is always some varying factor.
Pseudo-random numbers
Let us now repeat the experiment with data generated via a computer.
# here is a default way to generate 'random' numbers
import random
import matplotlib.pyplot as plt
X = random.choices([0, 1], k=10) # uniformly choose 10 times between 0 and 1.
plt.hist(X) # every time we run these commands, we get a different proportion
This python code is completely deterministic. A complicated calculation is used to generate the next ‘random’ number from the previous one. Consider this example:
import random
random.seed(5) # this sets the 'state' of the random number generating machine
print(random.uniform(0,1)) # the random number is a function of the state
print(random.uniform(0,1)) # the state changes after we generate a new number
print(random.uniform(0,1))
random.seed(5) # when we reset the state, we get the same sequence of numbers
print(random.uniform(0,1))
print(random.uniform(0,1))
print(random.uniform(0,1))
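We can check this reproducibility programmatically: resetting the seed must reproduce the exact same sequence.

```python
import random

random.seed(5)
first_run = [random.uniform(0, 1) for _ in range(3)]

random.seed(5)   # reset the generator's state
second_run = [random.uniform(0, 1) for _ in range(3)]

print(first_run == second_run)  # True: same state, same 'random' numbers
```

This is exactly what makes pseudo-random numbers useful for reproducible experiment pipelines.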
For cryptographically strong random numbers you need to use the secrets module:
import secrets
secrets.choice(range(100))
Physical sources of randomness
Let’s go back to throwing coins now. Coins are completely deterministic. Whenever we have a specific coin to throw in the air, there are two things we do not know. The first is which side the coin will land on. Why is that? The second is uncertainty about the coin bias: is the probability of landing heads exactly 50%? How can we quantify this? What does it depend on? Discuss in class.
What physical source of randomness can we use instead of coins?
Probability is not only used to model random events. In fact, almost nothing can be said to be really random, unless we go into quantum physics. Even a die thrown in the air follows precise mechanical laws. Given enough information, it is possible to accurately predict the outcome of a throw.
For that reason, probability is best thought of as a way to model any residual uncertainty we have about an event. The probability of an event is then simply a subjective measure of its likelihood.
While probability offers a nice mathematical formulation of uncertainty, when this uncertainty is subjective, the question arises: how can we elicit precise probabilities about uncertain events from individuals? Here is an example.
Consider the following question: how many immigrants live in Switzerland?
- In-class discussion: what do we mean by that?
- Now everybody can make a guess and record it on this form: https://moodle.unine.ch/mod/evoting/view.php?id=295622
What does this distribution mean? Can we use it as an estimate of uncertainty?
- Now let us create some confidence intervals. The procedure is as follows. Let us take a first guess at an interval (say 5-10%) and ask: (a) Are you willing to take an even bet that the true number is between 5 and 10%?
A time series is a sequence of observations $y_1, y_2, \ldots, y_t$ of some variable, made sequentially over time.
Generally, there are three tasks associated with time series modelling, always given data up to this point, i.e. $y_1, \ldots, y_t$:
- Smoothing: What has happened in the past? Here we estimate $y_{t-k}$ for $k > 0$.
- Filtering: What is the current situation? To solve this problem we must estimate $y_t$.
- Prediction: What will happen in the future? This involves predicting $y_{t+k}$ for some $k > 0$.
These problems are all related and can be formalised in a statistical manner, and there are multiple algorithms that can be used to solve each problem.
Smoothing: For smoothing, a moving average filter is typically sufficient whenever the underlying signal changes slowly relative to the noise.
Filtering: When we wish to filter, at best we can take the moving average of past observations only.
Prediction: Prediction means estimating something in the future. This task is never trivial, even with perfect observations.
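A moving average of the kind mentioned above can be sketched in a few lines (the window size and data are made up; this causal version averages only past values, so it can also serve as a simple filter):

```python
import numpy as np

def moving_average(y, k):
    # average each point with its k-1 predecessors; this causal window
    # uses only past data, so it can double as a simple filter
    y = np.asarray(y, dtype=float)
    out = np.empty(len(y))
    for t in range(len(y)):
        out[t] = y[max(0, t - k + 1):t + 1].mean()
    return out

noisy = [1, 2, 1, 2, 1, 2, 1, 2]
smoothed = moving_average(noisy, 2)
print(smoothed[0], smoothed[1])  # 1.0 1.5
```

A centred window (averaging both past and future points) would be more appropriate for smoothing, where future observations are available.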
Here is a simple example of line plotting.
import matplotlib.pyplot as plt # import the plotting library
X = [1, 2, 3, 4, 5, 4, 3, 2, 1] # define a small number of points
plt.plot(X) # perform a standard, simple plot
f = "lineplot.png" # a hypothetical output filename
plt.savefig(f)
return f
What are such plots useful for?
https://en.wikipedia.org/wiki/1500_metres_world_record_progression
Wikipedia has a table that shows the progression of 1500m world records.
- Let us first show the records up to 1950.
- Try and predict the progression of world records on the board.
- Let us now look at the actual graph. Is it what you expected?
- How do you expect the progression to continue after 2020?
- How do you explain this progression? Can you find data to validate or refute your explanation?
import pandas
import datetime
tables = pandas.read_html("URL") # read all tables from a web page
# convert a date string to a year:
dt = datetime.datetime.strptime(string, '%Y-%m-%d').year
# string manipulation
string.replace("+", "0") # replaces a + with a 0
string.split(":") # splits a string into multiple strings
# data formats
float("12.2") # converts a string into a float
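As a worked example of these tools (the record string and date are made up), here is how a 'm:ss.s' time can be converted into seconds and a year extracted from a date string:

```python
import datetime

record = "3:32.1"                     # a hypothetical 1500m time
minutes, seconds = record.split(":")  # ['3', '32.1']
total_seconds = float(minutes) * 60 + float(seconds)
print(round(total_seconds, 1))        # 212.1

year = datetime.datetime.strptime("1998-07-14", "%Y-%m-%d").year
print(year)                           # 1998
```

This kind of cleaning is exactly the pre-processing needed before the Wikipedia table can be plotted.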
- Plot Mars data
- Show orbits
- 3-body system, chaos and randomness
- Plot covid data.
- Smooth the data: moving average plots
- Try and estimate past, current and future infections with simple tools.
- Discuss: Are those simple tools sufficient? Is our visualisation consistent? Do we need something further?
See: Trading Economics
Let us start with an example where we just have three variables. We can plot the relationship between any two of them.
X=[1, 2, 3, 4, 10, 6]
Y=[5, 2, 5, 3, 1, 2]
Z=[0, 1, 0, 1, 0, 1]
import matplotlib.pyplot as plt
plt.scatter(X,Y)
Frequently, variables are instead stored in an array.
import numpy as np
n_data = 10
n_features = 3
data = np.random.uniform(size=[n_data, n_features]) # create some random data
plt.scatter(data[:,0], data[:,1]) #plot the first against the second column
# We can always take a 'slice' of the data:
data[:,[1,2]] # get columns 1 and 2 and all the rows
data[1:10, [0,2]] # get columns 0 and 2 and all rows 1-10
## : means everything
## a:b means everything from a to b
## [a,b,c] means a, b and c.
In dataframes, we can deal with multiple variables by name:
import pandas as pd
df = pd.DataFrame(data, columns = ["Alcohol", "Caffeine", "Sugar"])
plt.scatter(df["Alcohol"], df["Caffeine"]) #plot the first against the second column
# getting slices is also possible in pandas dataframes, just slightly different:
df.loc[:,'Alcohol'] # get column Alcohol
df.loc[:,['Alcohol', 'Caffeine']] # get column Alcohol and Caffeine
A lot of relationships between two variables can be described by a function $y = f(x)$.
If the relationship is one-to-one, then there exists an inverse function $f^{-1} : Y → X$, so that \[ x = f^{-1}(y), \] with \[ x = f^{-1}[f(x)]. \]
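A numeric sketch with a made-up one-to-one function $f(x) = 2x + 1$ and its inverse $f^{-1}(y) = (y-1)/2$:

```python
def f(x):
    return 2 * x + 1       # a one-to-one function

def f_inverse(y):
    return (y - 1) / 2     # its inverse: solves y = 2x + 1 for x

print(f(3))                # 7
print(f_inverse(f(3)))     # 3.0: applying the inverse undoes f
```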
Sometimes, however, the relationship between the two variables is not deterministic; that is, the value of $y$ is not uniquely determined by $x$.
Many equations in physics relate two quantities. For example, there is the equation relating current $I$ to voltage $V$ through a resistance $R$: $V = IR$.
The simplest way to model a stochastic relationship between two variables is to model their joint distribution $P(x, y)$.
Here we can plot the number of people having a certain height and weight combination. This can be done with a colour-map. This is not much different from a normal histogram, and is called hist2d in pyplot:
import numpy as np
import matplotlib.pyplot as plt
X = 100 / (1 + np.exp(-np.random.normal(size=100))) + 125
Y = X * (1 + 0.1*abs(np.random.normal(size=100))) - 100
plt.hist2d(X, Y)
Here we model the distribution of one variable given a fixed value for the other, e.g. $P(y \mid x)$.
Method 1: polynomial fitting!
# returns the best fitting line to the data
a, b = np.polyfit(data_x, data_y, 1)
# Why is this the 'best' line? Because it minimises the total squared
# error between the predicted value and the actual ones.
# We can plot the line by this simple linear equation
ax = np.linspace(0,1)
plt.plot(ax, a * ax + b)
Get some financial data from FRED. This is time-series data. Can we actually make sense of it in terms of correlations? Explore.
First, the unemployment rate: https://fred.stlouisfed.org/series/UNRATE Then, the GDP: https://fred.stlouisfed.org/series/GDPC1 This has two different data frames.
# read the files
import pandas as pd
ur=pd.read_csv("UNRATE.csv")
gdp=pd.read_csv("GDPC1.csv")
# the date ranges and frequencies are different, so we merge
# on the shared DATE column (an inner join!)
merged = ur.merge(gdp, on="DATE")
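A minimal sketch of the merge with toy frames in the FRED format (a DATE column plus one value column); the column names UNRATE and GDPC1 match the real files, but the numbers here are made up:

```python
import pandas as pd

# monthly unemployment rate, FRED-style (made-up values)
ur = pd.DataFrame({"DATE": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"],
                   "UNRATE": [3.5, 3.5, 4.4, 14.7]})
# quarterly real GDP: only some of the dates overlap (made-up values)
gdp = pd.DataFrame({"DATE": ["2020-01-01", "2020-04-01"],
                    "GDPC1": [19000.0, 17000.0]})

# an inner join on the shared DATE column keeps only the common dates
merged = ur.merge(gdp, on="DATE")
print(merged)

# correlation between the two series on the overlapping dates
print(merged["UNRATE"].corr(merged["GDPC1"]))
```

Only the dates present in both frames survive the inner join, which is exactly what we need before computing a correlation.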
In this module, we will perform the following activities:
- Uniformly random sampling. How can we perform it?
- Biased sampling, correcting for effects.
- Importance sampling.
You each support one of the following political groups:
- R: Red.
- B: Blue.
In this exercise, we will try to measure the support for different political parties.
- Deal cards so that 40% of the students are red, and 60% are blue.
- Now sample the population in the following manner:
Everybody throws a die. Those with a value 5 or greater are part of our sample $ω$, which comes to the board. The set of all possible samples is called the universe $Ω$. It is possible that we sample everybody, or nobody, or only the boys, or only the girls, or any combination. Mathematically, we can say that $Ω = \{0,1\}^n$, where $n$ is the number of students, and $ω_i = 1$ if student $i$ has been selected.
- Write a tick mark in the box saying “Red” or “Blue”.
- We then measure the proportion of red and blue votes. This proportion is the random variable!
We can repeat the same procedure with a different sampling method:
- Assign a number to each student.
- Cast a die and see which student it corresponds to.
- The student tells me their vote.
- I repeat.
Here we assign a random affiliation to each one of you. Throw a ten-sided die (faces 0-9) for your political affiliation:
Die | Party |
---|---|
0-3 | Red |
4-9 | Blue |
We now count the number of people having different affiliations. These are your true voting affiliations. If you were to vote, then you would vote for these specific parties.
Here, the underlying random space is the combined dice throws of all the class, and the random variable of interest is the number of votes for each candidate.
Given the number of people in the course, what is the expected number of votes for each party?
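As a quick check, assuming a class of 30 students (the class size is not given in the notes) and the 40/60 die mapping above, the expected number of red votes is $n × 0.4$; a simulation agrees:

```python
import numpy as np

n_students = 30   # assumed class size
p_red = 0.4       # die outcomes 0-3 -> Red, 4-9 -> Blue

# expected number of red votes: E[n_red] = n * p_red
print("expected red votes:", n_students * p_red)

# check by simulating many classes
rng = np.random.default_rng(0)
dice = rng.integers(0, 10, size=(10000, n_students))  # one die per student per class
red_votes = (dice <= 3).sum(axis=1)
print("simulated mean:", red_votes.mean())  # close to 12
```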
From a probability perspective, we can think of the die as having
random outcomes in $Ω = \{0, 1, \ldots, 9\}$, with party affiliation
\[
v(ω) =
\begin{cases}
R, & ω ∈ \{0,1,2,3\} \\
B, & ω ∈ \{4,5,6,7,8,9\}
\end{cases}
\]
Let us assume that the die has a uniform distribution over its outcomes.
The probability that a randomly chosen voter is red is then $P(v(ω) = R) = 0.4$, and blue $P(v(ω) = B) = 0.6$.
If we know the probability that a randomly chosen voter will vote for the i-th party, then what is the probability of different numbers of votes? What is the expected number of votes for each party? To solve this problem, we need to define another random variable:
- $n_i$: the number of votes cast for each party $i$.

This total number of votes depends on the party affiliation of each voter. Let

- $ω_t$ be the random die of person $t$.
- $v_t = v(ω_t)$ is then the party affiliation of person $t$.
We collect the random dice of all individuals into one big vector
\[
ω = (ω_1, \ldots, ω_t, \ldots, ω_T), \qquad ω ∈ Ω^T, ω_t ∈ Ω
\]
Then the total number of votes for party $i$ is $n_i(ω) = ∑_{j=1}^T \mathbb{I}\{v_j = i\}$.
Clearly, if there is only one voter, the expected number of votes for
each party equals the probability that the single voter supports that party. More generally,
\begin{align*}
\E_P[n_i] &= ∑_{ω ∈ Ω^T} P^T(ω) n_i(ω) \\
&= ∑_{ω ∈ Ω^T} P^T(ω) ∑_{j=1}^T \mathbb{I} \{v_j = i\} \tag{by the definition of $n_i$}
\end{align*}
Here $P^T$ is the joint distribution of the $T$ independent die throws. We have already defined a probability measure $P$ for a single throw; the joint measure is the product $P^T(ω) = ∏_{t=1}^T P(ω_t)$.
Each one of you throws a second die and records the outcome. We now have
Die | Party | Response |
---|---|---|
0 | Red | Not Reachable |
1-2 | Red | Refuse |
3-9 | Red | Red |
0-4 | Blue | Not Reachable |
5 | Blue | Refuse |
6-9 | Blue | Blue |
- Make a histogram / bar chart / pie plot of the number of votes
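A sketch of this biased-response process (the class size and seed are arbitrary): reds respond on die outcomes 3-9, blues only on 6-9, so the observed red share overstates the true 40%.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 1000  # assumed number of people polled
# true affiliation: 40% Red, 60% Blue, as in the earlier exercise
party = rng.choice(["Red", "Blue"], p=[0.4, 0.6], size=n)
die = rng.integers(0, 10, size=n)

# response behaviour from the table: Reds answer on 3-9, Blues only on 6-9
responded = np.where(party == "Red", die >= 3, die >= 6)
answers = party[responded]

labels, counts = np.unique(answers, return_counts=True)
plt.bar(labels, counts)
plt.ylabel("number of responses")
plt.savefig("responses.png")

# the observed Red share is inflated relative to the true 40%
print("observed red share:", (answers == "Red").mean())
```

Among respondents the red share is roughly $0.4 × 0.7 / (0.4 × 0.7 + 0.6 × 0.4) ≈ 0.54$, even though only 40% of the population is red.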
In uniform sampling, the probability of all outcomes is the same.
Sampling with replacement. In sampling with replacement, each outcome can appear more than once.
import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population
# the following will give us a sample drawn with replacement: it doesn't
# matter who we selected before, the next one will be randomly
# selected independently of the previous selections
sample = np.random.choice(population_size, size = n_samples)
print(sample)
The following code has the same effect
import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population
sample = np.zeros(n_samples, dtype=int)
for i in range(n_samples):
    sample[i] = np.random.choice(population_size)
print(sample)
Sampling without replacement. In sampling without replacement, each outcome can appear at most once.
import numpy as np
population_size = 10 # say we have 10 people we want to sample from
n_samples = 5 # say we want to take 5 samples from the population
# the following will give us a sample drawn WITHOUT replacement: once
# somebody has been selected, they cannot be selected again
# (note: replace must be the boolean False, not the string 'False',
# which would be truthy and silently sample with replacement)
sample = np.random.choice(population_size, size = n_samples, replace = False)
print(sample)
Recall that a random variable is simply a function $f(ω)$ of the random outcome $ω$.
In the following example, P[omega] gives the probability of each outcome omega.
import numpy as np
# Let us define the space Omega
Omega = np.array([0, 1, 2, 3])
# Let us define a vector P so that P[omega] is the probability of omega
# e.g the probability that omega = 0 is 0.1, and that omega = 3 is 0.5.
P = np.array([0.1, 0.2, 0.3, 0.4])
# Here is our random variable f(). It is just a function of omega. There
# is nothing random about it, only omega is random!
def f(omega):
return omega * omega
# Let us generate a random omega:
random_outcome = np.random.choice(Omega, p = P)
random_variable_value = f(random_outcome)
print("omega:", random_outcome,
"f:", random_variable_value)
# We can also easily calculate the expectation of the random variable
# through the dot product.
print ("Expected value:", np.dot(P, f(Omega)))
In our case, the expectation of a random variable $f$ is \[ \E_P[f] = ∑_{ω ∈ Ω} P(ω) f(ω). \]
Let, as a smaller example, $Ω = \{0, 1, 2\}$ with $P(0) = 0.5$, $P(1) = 0.3$, $P(2) = 0.2$ and $f(ω) = ω^2$. Then
\begin{align*}
\E_P[f] &= P(0) f(0) + P(1) f(1) + P(2) f(2) \\
&= 0.5 × 0^2 + 0.3 × 1^2 + 0.2 × 2^2 \\
&= 0.3 + 0.8 \\
&= 1.1.
\end{align*}
Let us now consider a continuous outcome space $Ω$, where the probability of a set $A$ is given through a density $p$:
\[P(A) = ∫_A p(ω) dω.\]
The corresponding expectation of a random variable $f$ is \[ \E_P[f] = ∫_Ω p(ω) f(ω) dω. \]
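As a numerical sketch: approximating $\E_P[f]$ for $f(ω) = ω^2$ under a standard normal density, where the answer is 1 (the variance of a standard normal).

```python
import numpy as np

# standard normal density p(omega)
def p(omega):
    return np.exp(-omega**2 / 2) / np.sqrt(2 * np.pi)

# the random variable f(omega) = omega^2
def f(omega):
    return omega**2

# approximate E_P[f] = ∫ p(ω) f(ω) dω with a Riemann sum on a fine grid
omega = np.linspace(-10, 10, 100001)
d_omega = omega[1] - omega[0]
expectation = np.sum(p(omega) * f(omega)) * d_omega
print(expectation)  # close to 1.0
```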
A jar with coins is passed around the class.
- The students are asked to guess how many coins it contains.
- The students agree on a 50% confidence interval.
- The students fit a normal distribution on this interval $[μ - \frac{2}{3}σ, μ + \frac{2}{3}σ]$.
- Is this normal distribution a good choice? Are you 90\% sure the number of coins is less than $x$?
- Is a normal distribution generally appropriate?
- Puzzle: Guess how many coins there are. If correct, then the class will share the money. If not, they will get nothing. What is the correct guess?
(If students have trouble with this, try with small numbers of coins and finite number of possibilities - demonstrate by playing the guessing game repeatedly)
- First, randomly select a student. How? Everybody gets a different
  number. Then I throw a die until I get a single student matching
  this number. The student is my sample $ω$.
- The student comes to the board and measures his height. This height
  is a random variable $h(ω)$. Each student has their own fixed height,
  but which student I select is random. Thus, the height that I measure
  is random.
- Repeat the experiment once more.
- Now we randomly select multiple students. How? Each student throws
  a die. If the die is > 4, then they are selected. All the students
  come to the board. They are our sample $ω$.
- We now write down the student heights and average them. The average
  height of the sample is our random variable.
The experimental pipeline has a number of different components.
- Formulating the problem.
- Deciding what type of data is needed.
- Choosing the model and visualisation needed.
- Designing the experimental protocol.
- Generating data conforming to our assumptions.
- Testing the protocol on synthetic data. Is it working as expected?
- Running the protocol on real data.
You want to measure if men and women have different salaries, as well as the reasons why. In particular, you want to check if men are paid more than women on average, and if this can be explained purely by age.
Let us define three variables: gender $g$, age $a$ and salary $s$.
For background, check out the material on conditional expectation. Hypotheses about the expected salary relate to conditional expectations:
- $E(s | g = m) = E(s | g = f)$: men and women have the same salary in expectation.
- $E(s | g = m) > E(s | g = f)$: women have a lower salary in expectation. But how much lower?
(Note that a stronger hypothesis would involve the conditional probabilities rather than the expectation)
If we find a difference, then we may want to explain it. If the age is a sufficient explanation for the salary, then the salary is conditionally independent of the gender given the age:
- $E(s | a, g) = E(s | a)$: the salary is independent of the gender, given the age.
We need salaries, ages and genders.
Since we are looking at means, it may be enough to look at averages. Averages are single numbers, so perhaps a bar plot is enough.
- Collect data.
- Plot average for men and women. Use the simulation to find the right type of plot.
- If average salary is very different (how different?) then say women are paid less than men. Use the simulation to get an idea.
We can test against three different simulated scenarios:
- Generate everything independently
import numpy as np
# Generate data
def identical_populations(n_samples):
    gender = np.random.choice(2, size = n_samples)
    age = 18 + 60*np.random.beta(3, 4, size = n_samples)
    salary = 100 * np.random.exponential(100, size = n_samples)
    return age, gender, salary
- Make the salary depend on the gender
def different_populations(n_samples):
    gender = np.random.choice(2, size = n_samples)
    age = 18 + 60*np.random.beta(3, 4, size = n_samples)
    salary = (100 + gender*10) * np.random.exponential(100, size = n_samples)
    return age, gender, salary
- Make the salary depend on the age, but have fewer women working after some age.
def age_effect_populations(n_samples):
    age = 18 + 60*np.random.beta(3, 4, size = n_samples)
    q = (age - 18) / 60
    # the probability of gender 0 increases with age, so there are
    # fewer gender-1 workers at higher ages
    p = q * 1 + (1 - q) * 0.5
    # one Bernoulli draw per person, with a per-person probability
    gender = np.random.binomial(1, 1 - p)
    salary = (100 + age) * np.random.exponential(100, size = n_samples)
    return age, gender, salary
See Salary example notebook.
If we estimate the means, we see there is a lot of variability. How can we fix that? One idea is to perform a lot of simulations and note how much variability we do have. Another is to use a bootstrap sample.
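A minimal bootstrap sketch for the variability of a mean, on synthetic exponential salaries like those generated in the simulations above:

```python
import numpy as np

rng = np.random.default_rng(3)
salaries = 100 * rng.exponential(100, size=200)  # synthetic data

# bootstrap: resample the data with replacement and recompute the mean each time
boot_means = np.array([
    rng.choice(salaries, size=len(salaries), replace=True).mean()
    for _ in range(2000)
])

print("empirical mean:", salaries.mean())
print("bootstrap std of the mean:", boot_means.std())
print("90% bootstrap interval:", np.percentile(boot_means, [5, 95]))
```

The spread of the bootstrap means tells us how much the estimated mean would vary across datasets, without having to collect new data.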
An election is coming in 100 days. A political party supporting one of the options [assume it’s just a yes/no vote] in the election gives you 10,000 CHF to spend over these 100 days so as to measure the mood of the population. They can use that information to increase their chances of success. How should you do the study?
Who is going to use our visualisation? How are they going to use it? Let us brainstorm a little bit about this.
What do we want to know about the people we collect the data from?
What can we assume about people’s opinions? How about their responses? Are they truthful? What is the most useful visualisation?
After we get the data, we need to analyse it. A simple model would be a moving average of polls over time. Would that work?
Assume we need to pay 1 CHF for every time we ask a poll question, and we have a budget of 10,000 CHF. Then, how should we ask questions, assuming the election is in 100 days from now?
(a) Ask 10,000 people now. (b) Ask 10,000 people one day before the election. (c) Ask 10,000 people 50 days from now. (d) Ask 100 people every day? (e) Ask 1,000 people every 10 days.
We can assume a simple model of the electorate here… what should it be? Maybe their opinion changes over time. Maybe some people are not responding, or not reachable. Which one corresponds to our assumptions?
After testing the whole pipeline, we can see if it actually works as intended.
How many people are taking CS? How many are in other courses? Let us consider the following statements. Each statement is either true or false. If a statement is true, it has the value 1. It consequently has the value 0 if it is false.
For each student $i$:
- $m_i$: the student is male
- $f_i$: the student is female
- $s_i$: the student is in the faculty of science
- $l_i$: the student is in the faculty of law
- $e_i$: the student is in the faculty of economics
Clearly, if $m_i = 1$ then $f_i = 0$: the two statements are mutually exclusive.
We can associate events with sets in a universe. The following laws apply:
- There is a universe $Ω$ of possible outcomes.
- Consider an individual outcome $ω ∈ Ω$: if $ω ∈ A$, the event $A$ has occurred; otherwise it has not.
- Let $¬A$, read ‘not A’, be the complement $Ω \setminus A$: if $A$ is true, then $¬A$ is false and vice-versa.
- $B$ is a subset of $A$ (i.e. $B ⊂ A$) iff $B$ implies $A$.
- If $A$ and $B$ are disjoint, i.e. $A ∩ B = ∅$ (they have an empty intersection), then $A, B$ are mutually exclusive. That means that it is impossible for $A$ and $B$ to both be true.
- Recall the definition of Conditional probability: \[ P(A | B) = P(A ∩ B) / P(B) \] i.e. the probability of A happening if B happens.
- It is also true that: \[ P(B | A) = P(A ∩ B) / P(A) \]
- Combining the two equations, we can reverse the conditioning: \[ P(A | B) = P(B | A) P(A) / P(B) \]
- So we can reverse the order of conditioning, i.e. relate the probability of A given B to that of B given A.
- Print out a number of cards, with either [A|A], [A|B] or [B|B] on their sides.
- If you have an A, what is the probability of an A on the other side?
- Have the students perform the experiment with:
- Draw a random card.
- Count the number of people with A.
- What is the probability that somebody with an A on one side will have an A on the other?
- Half of the people should have an A?
Side up | Side down | Probability | Observation | Posterior |
---|---|---|---|---|
A | A | 2/6 | A observed | 2/3 |
A | B | 1/6 | A observed | 1/3 |
B | A | 1/6 | | |
B | B | 2/6 | | |
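We can check the 2/3 answer by simulating the cards (one AA, one AB, one BB; pick a card and a face uniformly):

```python
import numpy as np

rng = np.random.default_rng(4)
cards = [("A", "A"), ("A", "B"), ("B", "B")]

n = 100000
shown_A = 0   # times we observe an A face up
both_A = 0    # ...of which the hidden side is also A
for _ in range(n):
    card = cards[rng.integers(3)]   # pick a card uniformly
    side = rng.integers(2)          # pick a face uniformly
    if card[side] == "A":
        shown_A += 1
        if card[1 - side] == "A":
            both_A += 1

print("P(other side is A | A observed) =", both_A / shown_A)  # close to 2/3
```

Intuitively: of the three A faces you could be looking at, two belong to the AA card.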
- Somebody matching the suspect's description was seen, and the suspect was found
  in the neighbourhood. There is no other evidence.
- There are two possibilities:
- $H_0$: They are innocent.
- $H_1$: They are guilty.
What is your belief that they have committed the crime?
- All those who think the accused is guilty, raise your hand.
- Divide by the number of people in class.
- Let us call this $P(H_1)$.
- This is a purely subjective measure!
- Let us now do a DNA test on the suspect
- $D$: Test is positive.
- $P(D | H_0) = 10\%$: False positive rate.
- $P(D | H_1) = 100\%$: True positive rate.
- The result is either positive or negative ($¬D$).
- What is your belief now that the suspect is guilty?
- Run a DNA test on everybody.
- What is different from before?
- Who has a positive test?
- What is your belief that the people with the positive test are guilty?
- Prior: $P(H_i)$.
- Likelihood: $P(D | H_i)$.
- Posterior: $P(H_i | D) = P(D ∩ H_i) / P(D) = P(D | H_i) P(H_i) / P(D)$.
- Marginal probability: $P(D) = P(D | H_0) P(H_0) + P(D | H_1) P(H_1)$.
- Posterior: $P(H_0 | D) = \frac{P(D | H_0) P(H_0)}{P(D | H_0) P(H_0) + P(D | H_1) P(H_1)}$.
- Assuming $P(D | H_1) = 1$ and $P(D | H_0) = 0.1$, and setting $P(H_0) = q$, this gives
\[ P(H_0 | D) = \frac{0.1 q}{0.1 q + 1 - q} = \frac{q}{10 - 9q} \]
- The posterior can always be updated with more data!
# Input:
# - prior: the prior probability of hypothesis 0 (scalar)
# - data: a single data point
# - likelihood: array so that likelihood[data][h] = P(data | hypothesis h)
# Returns:
# - the posterior probability of hypothesis 0 given the data point
#   (if multiple points are given, the calculation can be repeated,
#   feeding each posterior back in as the new prior)
def get_posterior(prior, data, likelihood):
    marginal = prior * likelihood[data][0] + (1 - prior) * likelihood[data][1]
    posterior = prior * likelihood[data][0] / marginal
    return posterior
import numpy as np
prior = 0.9
likelihood = np.zeros([2, 2])
# pr of negative test if not a match
likelihood[0][0] = 0.9
# pr of positive test if not a match
likelihood[1][0] = 0.1
# pr of negative test if a match
likelihood[0][1] = 0
# pr of positive test if a match
likelihood[1][1] = 1
data = 1
print(get_posterior(prior, data, likelihood))
We have rain forecasts (probability of rain) from three stations for Monday through Thursday, together with the actual outcomes:
Station | M | T | W | T |
---|---|---|---|---|
MeteoSuisse | 25% | 20% | 10% | 5% |
Wunderground | 30% | 50% | 20% | 10% |
AccuWeather | 90% | 70% | 10% | 0% |
Rain | Y | N | N | Y |
- $P(H_i)$: prior
- $P(D | H_i)$: likelihood according to station $i$
- $P(H_i | D)$: posterior
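As a sketch, the likelihood of the observed sequence (rain on Monday and Thursday only) under each station's forecasts, taking $P(\text{rain})$ on rainy days and $1 - P(\text{rain})$ on dry ones:

```python
# rain forecasts (probability of rain) per station for Mon-Thu, from the table
forecasts = {
    "MeteoSuisse":  [0.25, 0.20, 0.10, 0.05],
    "Wunderground": [0.30, 0.50, 0.20, 0.10],
    "AccuWeather":  [0.90, 0.70, 0.10, 0.00],
}
rain = [True, False, False, True]  # the observed outcomes (Y, N, N, Y)

# likelihood of the observations: multiply P(rain) on rainy days
# and 1 - P(rain) on dry days
likelihoods = {}
for station, probs in forecasts.items():
    L = 1.0
    for prob, rained in zip(probs, rain):
        L *= prob if rained else (1 - prob)
    likelihoods[station] = L
    print(station, L)
```

AccuWeather ends up with likelihood zero: it assigned probability 0% to Thursday's rain, and a single impossible observation destroys the whole product.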
Example: DNA evidence, Covid tests
- Two hypotheses $H_0, H_1$
- $P(D | H_i)$ is defined for all $i$
Example: Model selection
- $H_i$: one of many mutually exclusive models
- $P(D | H_i)$ is defined for all $i$
Example: Are men’s and women’s heights the same?
- $H_0$: the ‘null’ hypothesis
- $P(D | H_0)$ is defined
- The alternative is undefined
- Defining the models $P(D | H_i)$ incorrectly.
- Using an “unreasonable” prior $P(H_i)$.
- Having a huge hypothesis space.
- Selecting the relevant hypothesis after seeing the data.
10% of the class has covid, i.e. P(covid) = 0.1. Each one of you performs a covid test. If you have covid, the test is correct 80% of the time, i.e. P(positive | covid) = 0.8. Conversely, if you do not have covid, there is still a 10% chance of a positive test, with P(positive | not-covid) = 0.1
How likely is it that you have covid if your test is positive or negative, i.e. P(covid | positive), vs. P(covid | negative)?
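Plugging these numbers into Bayes' rule:

```python
p_covid = 0.1
p_pos_covid = 0.8     # P(positive | covid)
p_pos_nocovid = 0.1   # P(positive | not covid)

# marginal probability of a positive test
p_pos = p_pos_covid * p_covid + p_pos_nocovid * (1 - p_covid)

# Bayes' rule, for a positive and a negative result
p_covid_pos = p_pos_covid * p_covid / p_pos
p_covid_neg = (1 - p_pos_covid) * p_covid / (1 - p_pos)

print("P(positive):", p_pos)                # about 0.17
print("P(covid | positive):", p_covid_pos)  # about 0.47
print("P(covid | negative):", p_covid_neg)  # about 0.024
```

Even with a positive test, the probability of having covid is below one half, because the disease is rare and false positives are common.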
First of all, each one of you should independently generate a uniform random number between 1 and 10. For that, you can each throw a die, and record the outcome.
Then you throw a second die, and record that as well.
I will now pass over the tables and tell each one of you if they have a positive test.
Now, everybody with a positive test raises their hand. I expect it to be slightly more than 10% (but it depends).
Hypothesis testing, or model selection, is a general problem in machine learning and statistics.
Intervals are a generalisation of hypothesis tests. We want to know if a given unknown parameter is within some range. Let us give an example.
We might have data $x_1, \ldots, x_n$ drawn from a distribution with an unknown parameter, e.g. its mean $μ$.
The Bayesian idea is intuitive. Through the posterior, we can
calculate the probability that the true parameter is in any interval $[a, b]$.
Confidence intervals work the other way round. First we construct an
algorithm for obtaining an estimate of the unknown parameter, e.g. the
empirical mean estimate $\hat{μ} = \frac{1}{n} ∑_{t=1}^n x_t$, together with an interval around it.
In confidence intervals, we condition on the unknown parameter: we require that, whatever the true parameter is, the interval contains it with a given probability (e.g. 95%) over repeated experiments.
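A coverage sketch (assuming normal data with known variance): an interval $\hat{μ} ± 1.96 σ/\sqrt{n}$ around the empirical mean contains the true parameter in roughly 95% of repeated experiments.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean = 2.0            # known here only because we simulate the data
n, sigma = 100, 1.0
n_experiments = 2000

covered = 0
for _ in range(n_experiments):
    data = rng.normal(true_mean, sigma, size=n)
    mean = data.mean()
    half_width = 1.96 * sigma / np.sqrt(n)  # known-variance normal interval
    if mean - half_width <= true_mean <= mean + half_width:
        covered += 1

print("coverage:", covered / n_experiments)  # about 0.95
```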
For a simple visualisation problem, vary parameter values and simulate thousands of times under each set of conditions. Summarise your findings graphically.
- How many covid patients are there now?
- Which is the best meteorological station?
- What is the right model for planetary motion?
- How many covid patients will there be next week?
- Will it rain tomorrow?
- When is the next lunar eclipse?
- What covid measures should I take?
- Should I take the umbrella?
- What’s the best way to each the next eclipse?
- Telescope images
- Survey data from a health authority
- Radar measurements in an AV
- The mass, position and velocity of Mars
- Actual number of covid patients
- The location and speed of nearby vehicles
- Can be observed or latent
- Selected arbitrarily by humans or machines
- Is smoking related to lung cancer?
- Does smoking cause lung cancer?
- Will I get lung cancer if I smoke?
- Did I get lung cancer because I smoked?
- Does knowing the value of one variable give us information about another variables?
- Does changing the value of one variable affect other variables?
Here we go into some more detail about joint and conditional probabilities.
Let us start with the general setting.
- Random variables $x, y$ taking values in $\mathcal{X}, \mathcal{Y}$.
- We write $Pr(x, y)$ to informally mean their joint distribution.
For a formal definition, we must define an appropriate underlying
probability measure:
- Underlying probability space $(Ω, Σ, P)$ with
- Outcome space $Ω$
- Event space $Σ$, so that $A ∈ Σ$ are subsets of $Ω$.
- Probability measure $P : Σ → [0, 1]$.
- RVs $x : Ω → \mathcal{X}$, $y : Ω → \mathcal{Y}$
- Joint measure $P_{x,y}(S_x, S_y) \defn P(\{ω : x(ω) ∈ S_x, y(ω) ∈ S_y\})$
However, this is not, strictly speaking, necessary for an intuitive understanding.
The conditional distribution of one variable given the other has the same form as for events: \[ P(A | B) = P(A ∩ B) / P(B). \]
Let us begin with a simple example of two variables which are independent Bernoulli.
$Pr(x = 1) = θ$ $Pr(y = 1) = v$
The independence can also be seen in the structure of the program: x and y are generated by two separate calls to np.random.choice()
import numpy as np
theta = 0.6
v = 0.8
x = np.random.choice(2, p = [1 - theta, theta])
y = np.random.choice(2, p = [1 - v, v])
Now let us look at dependent variables. Here
$Pr(x = 1) = θ$ $Pr(y = 1 | x = 0) = v_0$ $Pr(y = 1 | x = 1) = v_1$
We can also calculate the marginal $Pr(y = 1) = θ v_1 + (1 - θ) v_0$.
import numpy as np
theta = 0.6
v = np.zeros(2)
v[0] = 0.4
v[1]= 0.8
x = np.random.choice(2, p = [1 - theta, theta])
y = np.random.choice(2, p = [1 - v[x], v[x]])
print(x, y)
Doing this for multiple draws is a bit more complicated, but it can be done. We just need a list comprehension to generate one $y_t$ for each $x_t$:
import numpy as np
n = 10000
theta = 0.6
v = np.zeros(2)
v[0] = 0.4
v[1] = 0.8
x = np.random.choice(2, p = [1 - theta, theta], size = n)
y = np.array([np.random.choice(2, p = [1 - v[x_t], v[x_t]]) for x_t in x])
import matplotlib.pyplot as plt
A = np.zeros([2,2])
for i in range(2):
for j in range(2):
A[i,j] = sum((x==i) & (y==j))
plt.imshow(A)
plt.savefig("correlated-binary.png")
plt.show()
print(A)
This is the typical structure of regression problems. Let us start with a simple normal distribution for both x, y:
-
$x ∼ Normal(0, 1)$ . -
$y | x ∼ Normal(x, 1)$ .
The above notation means that:
-
$x$ has a normal distribution with mean 0 and variance 1. - Given the value of
$x$ ,$y$ has a normal distribution with mean$x$ and variance 1.
This dependence is made explicit in the following code.
import numpy as np
theta = 0.8
x = np.random.normal(0, 1)
y = np.random.normal(x, 1)
print(x, y)
This is the typical structure of classification problems.
- $y ∼ Bernoulli(0.6)$.
- $x | y ∼ 160 + Normal(10y, 1)$.
The above notation means that:
- $y$ has a Bernoulli distribution with mean 0.6.
- Given the value of $y$, $x$ has a normal distribution with mean $160 + 10y$ and variance 1.
import numpy as np
y = np.random.choice(2, p = [0.4, 0.6])
x = np.random.normal(160 + 10*y, 1)
print(x, y)
- Consider a collection of RVs $x_1, \ldots, x_n$.
- The joint distribution is a complicated object.
- It can be visualised with scatterplots of pairs $(x_i, x_j)$, e.g. with sns.pairplot()
Instead, we can calculate the correlation matrix \[ C_{ij} = \frac{\E\{[x_i - \E(x_i)][x_j - \E(x_j)]\}} {\sqrt{\E\{[x_i - \E(x_i)]^2\}\E\{[x_j - \E(x_j)]^2\}}} \]
- Assuming data $x(t)$ with components $x_i(t)$:
- $C_{ij} ≈ \frac{1}{T} ∑_{t=1}^T [x_i(t) - μ_i] [x_j(t) - μ_j] / (σ_i σ_j)$
- $μ_i$: (empirical) mean of $x_i$
- $σ_i$: (empirical) standard deviation of $x_i$
- $x, y$ are independent if $Pr(x,y) = Pr(x)Pr(y)$
- equivalently, if $Pr(x | y) = Pr(x)$
- $x, y$ are dependent if they are not independent.

- $x, y$ are uncorrelated if $\E(xy) = \E(x)\E(y)$
- in particular, this holds whenever $\E(x | y) = \E(x)$
- $x, y$ are correlated if $\E(xy) ≠ \E(x)\E(y)$
- If $x, y$ are correlated then they are dependent.
- If $x, y$ are independent then they are uncorrelated.
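The converse fails. A standard counterexample (not from the notes): with symmetric $x$ and $y = x^2$, the variables are fully dependent yet uncorrelated, since $\E(xy) = \E(x^3) = 0 = \E(x)\E(y)$.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=100000)
y = x ** 2   # y is a deterministic function of x: fully dependent

# for symmetric x, E[xy] = E[x^3] = 0 = E[x]E[y], so the correlation is near 0
corr = np.corrcoef(x, y)[0, 1]
print("correlation:", corr)  # close to 0
```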
- Can aspirin cure headaches?
- Does smoking cause lung cancer?
- Or do cancer patients become smokers?
- Or is there a third factor causing both?
- Did aspirin cure my headache?
- Did smoking cause my cancer?
- Causal inference useful in a scientific setting.
- Reliable methods for causal inference exist.
- Actual causes useful in a legal setting.
- No reliable method or definition exists for determining actual causes.
- Tom and Fatima both work in Lausanne.
- Whenever Tom is late to work, so is Fatima.
- When this happens, there is also a traffic jam.
- Treatment A is effective 90% of the time
- Treatment B is effective 50% of the time.
- Why is that? (see src/causality/confounders)
Treatment | Small Stones | Large Stones |
---|---|---|
A | 80% | 30% |
B | 90% | 60% |
We see here that treatment B is always better, but Large Stones are harder to treat. Let us now consider the treatment policy:
Treatment | Small Stones | Large Stones |
---|---|---|
A | 80% | 20% |
B | 20% | 80% |
The treatment policy mostly assigns treatment A when patients have small stones, and B for large stones. Hence, as treatment B is applied to harder cases, it might look worse on average. In particular, if we calculate the average treatment effect, by marginalising over stones, we have
Treatment | Average Effect |
---|---|
A | 0.7 |
B | 0.66 |
So, even though the second treatment is better for both large and small stones, if we ignore the stone size, it looks worse.
Ignoring the stone size makes it a confounder, because it was used in the policy, but we ignore it. It wouldn’t matter if stone size was not used to select the treatment, which is why we typically use randomised trials to study the effect of treatments.
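The averages above can be reproduced directly from the two tables:

```python
# success rates by treatment and stone size (first table)
success = {"A": {"small": 0.8, "large": 0.3},
           "B": {"small": 0.9, "large": 0.6}}
# fraction of each treatment's patients with each stone size (second table)
policy = {"A": {"small": 0.8, "large": 0.2},
          "B": {"small": 0.2, "large": 0.8}}

# marginalise over stone size to get the average effect of each treatment
average = {}
for t in ["A", "B"]:
    average[t] = sum(policy[t][s] * success[t][s] for s in ["small", "large"])
    print(t, round(average[t], 2))
# B is better for every stone size, yet worse on average
```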
- $a$: Treatment
- $x$: Patient characteristics
- $y$: Outcome
- $x$ is ignored
- $a$ depends on $x$.
- $y$ depends on $a, x$.
- $x$ is taken into account
- We model $Pr(y | x, a)$, $Pr(a | x)$
- Treatment is random
- Hence the dependence on $x$ is cut.
- Must be an informed choice
- A complex instrument can lead to spurious correlations
- Treatment policy $π$, with $π(a | x)$ the probability of assigning treatment $a$ to a patient with characteristics $x$.
- Outcomes depend on the policy $π$.
https://scikit-learn.org/stable/auto_examples/neighbors/plot_species_kde.html
- Colour as a continuous variable.
- Colour as a discrete variable.
- Colour perception and interpretation.
- Geographical contour
- Density plots
The course contains assignments and a project. The instructions for each assignment are given below. The assignments are largely done in class, but completed at home.
TLDR: Find a table on Wikipedia on a topic of interest, and convert the table into a graph.
The purpose of this assignment is for you to create a graphic that demonstrates something interesting about a data table found on the web. Please provide answers that are as precise and concise as possible. This assignment is graded. You are encouraged to discuss the assignment with other students in a group. However, each student must prepare their own individual report. Please submit your answers on Moodle.
In this exercise, you must create a plot from an existing dataset and write a short report. Use the following steps as a guideline.
- Find a data table on the internet.
- Write a short description of the data on the table.
- Create one or two plots of your choice, summarising the data on the table.
- Explain what the graph shows about the data.
- Try and draw some conclusions or generalisations from the graph. Does it make logical sense?
This is an example of this exercise for a dataset we already saw in class.
- Use the world records data for 100m/400m/1500m or some other distance.
- Explain what the records show.
- Show how the world record changes over time for men and women, with different colours. Be sure to plot records with the x axis showing time.
- Draw regression lines over the world record graph: the records reduce over time.
- It is not logically possible to expect the times to reduce linearly with time! Is there a fundamental limit? How can the data be best explained? Is human performance constant over time, and records are falling due to random chance? Is human performance slowly increasing over time?
Find an interesting plot from a web page, e.g. on Wikipedia. Try to identify some problem with the plot. To help you, ask yourself the following questions:
- Is the plot type appropriate?
- Is the data correct?
- Does the plot convey an appropriate message?
- Is there more data somewhere that you could combine with the original to obtain a better picture?
After you have identified problems with the plot, data sources, or missing data, create a new plot, along with an explanation of how you addressed the original plot’s deficiencies.
In this assignment you will read a newspaper article with some statistics and visualisations, and try to interpret what it says. You must study the article critically. Are the conclusions supported by the data? Does the methodology make sense? Find primary sources that confirm or challenge the article to obtain a more rounded picture.
Here is a list of possible articles you can use. Feel free to suggest your own article and add it to the list.
https://docs.google.com/spreadsheets/d/1QKj_L9f0UIH80qgs2kcjc8AU1eKZzsTOcYFSU06HJBY/edit#gid=0
Propose a problem to solve, including:
- Hypotheses to test
- How to collect data
- How to analyse the collected data
After you have started your project, each one of the project members presents a preliminary plot and explains it (5 minutes).
The completed project should include a report written by both students in the team. This should address the points in the Assessment description.
Suggested structure for the report:
- Hypotheses and scientific question.
What is the main scientific question you want to answer? After a general introduction to the question, be more precise and list all alternative hypotheses that you would like to examine.
Example: We wanted to see if electric, diesel or gasoline vehicles are more environmentally friendly. More specifically, we will compare the lifetime carbon emissions of vehicles of all types, including material sourcing, construction, driving and fuel sources, maintenance and disposal.
- Data collection and experiment design
What data did you use to answer your question? How has the data been collected? What are the main characteristics of the data?
Example: We will collect data from a few representative manufacturers which produce both EV and internal combustion vehicles. This will be combined with measurements from national environmental agencies and NGOs. [Then detail the list of sources and what data you obtain from each]
- Methodology
How will you use the data to answer the question? Explain in as precise terms as possible.
Example: The easiest part to measure is the CO2 emissions during car use. We can obtain measurements from government agencies, magazines and the manufacturers about power consumption of each vehicle. These can be synthesized into a range of consumption numbers [Detail how exactly you do the synthesis]. For the construction of a vehicle it is hard to get exact numbers but we used estimates from these articles [cite the articles]. Because there might be a manufacturer bias (some might generally use more intensive processes than others), we will focus on comparing equivalent models within a manufacturer with different engine and drive-trains.
- Results
Here give a number of plots to answer your question. Section 3 should say how the plots were arrived at. In this section, you should explain the meaning of these plots and what they show.
Example: Stacked bar plots of CO2 emissions for different types of vehicles, grouped by manufacturer. Differences in CO2 emissions between categories for each manufacturer. The plots might show that e.g. Renault overall has lower emissions than Ford, but that in either case the EV vehicles have lower lifetime emissions than their Diesel counterparts.
- Conclusion
Are you able to arrive at a firm conclusion or not? If so, what is it? If not, would more data help you to obtain a clearer conclusion? Are you happy with your methodology, or could it be improved?
Example: You have firmly concluded that in-use emissions are lower for EVs than internal combustion vehicles. However, you were unable to obtain data regarding disposal and so have left that out of the analysis. Moreover, the uncertainty about construction emissions is too great to be able to see a significant difference.
For convenience, I include the necessary mathematical notation.
- $\mathbb{R}$: Real numbers
- $\mathbb{R}^d$: d-dimensional Euclidean space
- $∅$: The empty set
- $A ⊂ B$: A is a subset of B
- $A ∩ B$: The intersection of A and B
- $A ∪ B$: The union of A and B
- $A \setminus B$: Removing B from A
- $Ω$: The “universe”
- $A^c = Ω \setminus A$: The complement of a set
- $\{x | f(x) = 0\}$: The set of x so that $f(x) = 0$

- $\mathbb{I}\{x ∈ A\}$: indicator function (takes the value $1$ if $x ∈ A$, $0$ otherwise)
- $∑_{x ∈ X} f(x) = f(x_1) + \cdots + f(x_n)$, with $X = \{x_1, \ldots, x_n\}$
- $d/dx f(x)$: derivative of $f$
- $∂/∂x f(x,y)$: partial derivative of $f$
- $∇_x = (∂/∂x_1, \ldots, ∂/∂x_n)$: vector of partial derivatives
-
$Pr$ : Probability (informally generally) -
$\mathbb{E}$ : Probability -
$P$ : A probability measure -
$p$ : A probability density -
$P(A | B) = P(A ∪ B) / P(B)$ . Conditional probability,$A, B ⊂ Ω$ . -
$\param$ : Parameter -
$\Param$ : Parameter set -
$\{P_\param | \param ∈ \Param\}$ : A family of parametrised models -
$Pr(x | y)$ conditional probability for random variables x, y (generally)
The theory of probability is used to mathematically define processes with uncertain outcomes. The set of all possible outcomes depends on the process. For example, if we throw a die, this is the set of all possible ways, locations etc. that the die can fall and land. However, we may only be interested in two events: whether the die lands showing a '6' or not. Formally, an event is a subset of the set of possible outcomes; here, the event 'the die shows a 6' is the set of all such outcomes.
For some more technical details, see Probability space.
In probability theory, we typically define the set of all possible events that we care about as an algebra $Σ$ of subsets of $Ω$, satisfying:

- If $A, B ∈ Σ$ then $A ∪ B ∈ Σ$.
- If $A ∈ Σ$ then $A^c ∈ Σ$.
Here, $Ω$ itself is always an event (the certain event), and so is the empty set $∅$ (the impossible event). Together with a probability measure $P$, the triplet $(Ω, Σ, P)$ forms a probability space.

A probability measure is thus defined with respect to:

- A universe $Ω$ of outcomes.
- An algebra $Σ$ of subsets of $Ω$ (which we can think of as all the 'events' of interest).

The axioms of probability. A probability measure $P : Σ → [0, 1]$ satisfies:

- $P(Ω) = 1$.
- If $A ∩ B = ∅$ then $P(A ∪ B) = P(A) + P(B)$.
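As a concrete illustration, a finite probability space can be represented directly in code and the axioms checked numerically. This is a minimal sketch; the `prob` helper and the fair-die example are my own, not part of the course code.

```python
# A finite probability space: a dictionary mapping each outcome in the
# universe Omega to its probability.
def prob(P, A):
    """P(A): the probability of event A, a subset of the universe."""
    return sum(P[w] for w in A)

# A fair six-sided die: Omega = {1, ..., 6}, each outcome has probability 1/6.
P = {w: 1 / 6 for w in range(1, 7)}
Omega = set(P)

A = {6}      # the event "the die shows a 6"
B = {1, 2}   # an event disjoint from A

# Axiom 1: prob(P, Omega) == 1.
# Axiom 2: for disjoint A and B, prob(P, A | B) == prob(P, A) + prob(P, B).
# Consequence: prob of the complement of A equals 1 - prob(P, A).
```

Representing events as Python sets makes the set operations $∪$, $∩$ and $\setminus$ available directly as `|`, `&` and `-`.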
From these, it also follows that $P(∅) = 0$ and that $P(A^c) = 1 - P(A)$ for any event $A$.
See also: https://en.wikipedia.org/wiki/Probability_measure
Two events $A, B$ are disjoint (mutually exclusive) if $A ∩ B = ∅$.

This means that there is no random outcome $ω$ that belongs to both $A$ and $B$.

Two events $A, B$ are independent if $P(A ∩ B) = P(A) P(B)$.

Intuitively, this means that knowing if $A$ occurred gives no information about whether $B$ occurred, and vice versa.
For any events $A, B$ with $P(B) > 0$, recall the conditional probability $P(A | B) = P(A ∩ B) / P(B)$.
We can also express independence in terms of conditional probability. A and B are independent if:
\[
P(A | B) = P(A),
\]
or, equivalently (when $P(A) > 0$), if $P(B | A) = P(B)$.
We can use the conditional probability definition to relate $P(A | B)$ and $P(B | A)$. This gives Bayes' theorem: \[ P(A | B) = \frac{P(B | A) P(A)}{P(B)}. \]

This is most useful for statistical inference, where we think of $A$ as a hypothesis and $B$ as the observed data. It is also useful to write this rule when there are several alternative hypotheses $A_1, \ldots, A_n$ partitioning $Ω$: \[ P(A_i | B) = \frac{P(B | A_i) P(A_i)}{∑_{j} P(B | A_j) P(A_j)}. \]
Two events $A, B$ are conditionally independent given $C$ if \[ P(A ∩ B | C) = P(A | C) P(B | C). \]

This can also be written as \[ P(A | C, B) = P(A | C). \]
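These definitions can be checked on a small example. The sketch below (helper names are mine) computes conditional probabilities for a fair die and shows that the events "even" and "at most 4" are independent:

```python
def prob(P, A):
    # P(A): sum the probabilities of the outcomes in the event A.
    return sum(P[w] for w in A)

def cond(P, A, B):
    # Conditional probability P(A | B) = P(A ∩ B) / P(B).
    return prob(P, A & B) / prob(P, B)

P = {w: 1 / 6 for w in range(1, 7)}  # a fair six-sided die
A = {2, 4, 6}      # "the outcome is even"
B = {1, 2, 3, 4}   # "the outcome is at most 4"

# P(A | B) = (2/6) / (4/6) = 1/2 = P(A), so A and B are independent,
# even though they share outcomes.
```

Note that independence is a property of the probabilities, not of whether the events overlap as sets.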
We focus on distributions where there is a finite number of possible outcomes, and hence a finite number of possible events that we might care about. All such distributions are characterised by one or more parameters. The simplest such distribution is a distribution on only two outcomes, the family of Bernoulli distributions.
Let us start with a simple example, the Bernoulli distribution with parameter $θ ∈ [0, 1]$. This is the distribution of a coin that comes up heads with probability $θ$; we encode heads as $1$ and tails as $0$.

Probability space. Formally, if the underlying probability space is $(Ω, Σ, P)$, a Bernoulli random variable is a function $X : Ω → \{0, 1\}$ with $P(\{ω ∈ Ω | X(ω) = 1\}) = θ$.
See also: https://en.wikipedia.org/wiki/Bernoulli_distribution
If we repeat a Bernoulli trial, we can also count the number of times the coin comes up heads. The distribution of the counts is called the binomial distribution. If we perform $n$ independent trials, each with success probability $θ$, the probability of observing exactly $k$ successes is \[ P(k) = \binom{n}{k} θ^k (1 - θ)^{n - k}. \]
See also: https://en.wikipedia.org/wiki/Binomial_distribution
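The connection between the two distributions can be simulated directly: a binomial draw is the sum of $n$ Bernoulli trials. A sketch (the function name and parameter values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.6, 10  # success probability and number of trials

def binomial_by_counting(rng, n, theta):
    # One binomial draw: count successes over n Bernoulli trials,
    # each generated by thresholding a Uniform[0, 1) number at theta.
    return int(np.sum(rng.uniform(size=n) < theta))

# The empirical mean over many repetitions should approach n * theta = 6.
draws = [binomial_by_counting(rng, n, theta) for _ in range(10_000)]
empirical_mean = float(np.mean(draws))
```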
A multinomial distribution is an extension of the Bernoulli and binomial distributions to more than two possible outcomes.
Categorical distributions. Let us start with one trial, e.g. a single throw of a die. We can model this die throw as the categorical distribution where the probability that the die lands with its $i$-th face up is $θ_i$, with $θ_i ≥ 0$ and $∑_i θ_i = 1$.

If the underlying probability space is $(Ω, Σ, P)$, such a variable is a function $X : Ω → \{1, \ldots, k\}$ with $P(X = i) = θ_i$. Repeating the trial $n$ times and counting how often each outcome occurs gives the multinomial distribution.
See also: https://en.wikipedia.org/wiki/Multinomial_distribution
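NumPy provides a multinomial sampler directly. A brief sketch drawing the face counts of a fair die (the seed and number of throws are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.full(6, 1 / 6)  # a fair die: six categories with equal probabilities

# One multinomial draw: the counts of each face over 600 throws.
counts = rng.multinomial(600, p)
# The counts always sum to the number of throws, and each count
# should be near 600 / 6 = 100.
```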
A special case of binomial and multinomial distributions is the uniform distribution. This is defined as follows.
Let $X$ take values in a finite set $S$ with $k$ elements. Then $X$ has the uniform distribution on $S$ if $P(X = x) = 1/k$ for every $x ∈ S$.

This definition applies to continuous distributions as well. A standard example is the uniform distribution on the interval $[0, 1]$, under which the probability of any subinterval equals its length.
A real-valued random variable $X$ is a function $X : Ω → \mathbb{R}$ from the universe to the real numbers.

Random variables can be easily generalised to domains other than the real numbers.

See also: https://en.wikipedia.org/wiki/Random_variable In French: https://fr.wikipedia.org/wiki/Variable_al%C3%A9atoire
Take a 10-sided die. The outcomes of the die represent the space $Ω = \{1, \ldots, 10\}$.

What is the distribution of the outcomes if each face is equally likely?
In python, there are procedures for generating data from many types of random variables. However, no such method produces truly random numbers: they rely on something called a pseudorandom number generator. The values output by this generator are then transformed so as to mimic the random variable we want.
- Let us start by throwing a 10-sided die. How do you generate a Bernoulli random variable with $θ = 0.6$ from the outcome of the die throw?
- Now let us consider the following python code, which generates values from a Bernoulli random variable.
import numpy as np

def bernoulli():
    # Return 1 with probability 0.6. Note the order of p:
    # p[0] is the probability of drawing 0, p[1] that of drawing 1.
    return np.random.choice(2, p=[0.4, 0.6])
Let us replicate it using a uniform random variable in python:
import numpy as np

def bernoulli():
    x = np.random.uniform()  # a uniformly generated number in [0, 1)
    if x < 0.6:
        return 1
    else:
        return 0
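As a sanity check on this construction, we can estimate the frequency of 1s over many draws and compare it with $θ = 0.6$ (a self-contained sketch; the function name and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli(rng, theta=0.6):
    # Return 1 with probability theta by thresholding a Uniform[0, 1) draw.
    return 1 if rng.uniform() < theta else 0

samples = [bernoulli(rng) for _ in range(10_000)]
freq = sum(samples) / len(samples)  # should be close to theta
```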
The expectation of a random variable $f : Ω → \mathbb{R}$ under a distribution $P$ is, for a discrete universe, \[ \mathbb{E}_P(f) = ∑_{ω ∈ Ω} f(ω) P(ω). \]

For the general case, we define it in terms of integrals: \[ \mathbb{E}_P(f) = ∫_Ω f(ω) \, dP(ω). \]
If we have obtained a sequence of samples $ω_1, \ldots, ω_T$ drawn from $P$, we can estimate the expectation through the sample average \[ \frac{1}{T} ∑_{t=1}^T f(ω_t). \]

To see this, first note that each term $f(ω_t)$ has expectation $\mathbb{E}_P(f)$, so the average itself has expectation $\mathbb{E}_P(f)$.

In fact, the average converges to the expectation as $T$ grows (the law of large numbers).
The variance of a random variable $f$ measures its spread around the expectation: \[ \mathrm{Var}(f) = \mathbb{E}_P[(f - \mathbb{E}_P(f))^2]. \]
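These definitions can be checked numerically. The sketch below computes the exact expectation and variance of a fair die directly from the definitions and compares them with sample-average estimates:

```python
import numpy as np

rng = np.random.default_rng(3)

# Exact expectation and variance of a fair die, from the definitions:
outcomes = np.arange(1, 7)
p = np.full(6, 1 / 6)
mean_exact = float(np.sum(outcomes * p))                     # 3.5
var_exact = float(np.sum((outcomes - mean_exact) ** 2 * p))  # 35/12

# Sample-average estimates from simulated die throws:
samples = rng.integers(1, 7, size=100_000)  # uniform on {1, ..., 6}
mean_est = float(samples.mean())
var_est = float(samples.var())
```

The estimates approach the exact values as the number of samples grows, illustrating the law of large numbers above.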
We can define the conditional expectation of a random variable through conditional probabilities.
In particular, let a random variable $X$ take values $x$, and let $B$ be an event with $P(B) > 0$. Then \[ \mathbb{E}(X | B) = ∑_x x \, P(X = x | B). \]

We can re-write this in a more intuitive way: if we restrict attention to the outcomes in $B$ and renormalise their probabilities by $P(B)$, the conditional expectation is simply the expectation of $X$ under this restricted distribution.
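A minimal sketch of this definition for a fair die, conditioning on the event that the outcome is even (variable names are mine):

```python
# E[X | B] = sum over x in B of x * P(X = x | B), for a fair die and
# the event B = "the outcome is even".
P = {w: 1 / 6 for w in range(1, 7)}
B = {2, 4, 6}

P_B = sum(P[w] for w in B)                 # P(B) = 1/2
cond_exp = sum(w * P[w] / P_B for w in B)  # (2 + 4 + 6) / 3 = 4.0
```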
Let us say we wish to compare a distribution among multiple groups. Perhaps we wish to compare the proportion of students who pass a certain test, as in the table below.
School | Success (Male) | Success (Female) |
---|---|---|
A | 62% | 82% |
B | 63% | 68% |
C | 37% | 34% |
D | 33% | 35% |
E | 28% | 24% |
F | 6% | 7% |
Average | 45% | 38% |
- Let us plot the success rate for females and males over the different schools.
- Does this show a bias? What information is missing?
- Let us combine these two plots into one. For this, we can use the following code:
import numpy as np
import matplotlib.pyplot as plt

w = 0.5
X = np.linspace(1, 6 * w, 6)  # one x-position per school
M = ...  # success rates for males
F = ...  # success rates for females
plt.bar(X, M, align='edge', width=w / 2)          # 'edge' must be a string
plt.bar(X + w / 2, F, align='edge', width=w / 2)  # offset by one bar width
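One piece of missing information is the number of students in each group. The sketch below uses the success rates from the table together with hypothetical group sizes (invented purely for illustration, not from the table) to show how the overall average can favour males even though females do better in most individual schools (Simpson's paradox):

```python
# Success rates per school (A..F), taken from the table above.
male_rate   = [0.62, 0.63, 0.37, 0.33, 0.28, 0.06]
female_rate = [0.82, 0.68, 0.34, 0.35, 0.24, 0.07]

# HYPOTHETICAL group sizes (not in the table): males concentrated in the
# high-success schools, females in the low-success ones.
male_n   = [400, 300, 100, 100, 50, 50]
female_n = [50, 50, 100, 100, 300, 400]

# Overall success rate for each group: size-weighted average of the rates.
male_avg = sum(r * n for r, n in zip(male_rate, male_n)) / sum(male_n)
female_avg = sum(r * n for r, n in zip(female_rate, female_n)) / sum(female_n)
# Females do better in 4 of the 6 schools, yet male_avg > female_avg.
```

This is why the per-school plot and the combined averages can tell opposite stories: the averages depend on how students are distributed across schools.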
- Histogram and 3D extensions
- Density curve
- Scatterplot
- Smooth scatterplot
- Violin plot
- Line Plot
- Confidence Intervals
- Geographical/topological maps
- Network graphs
- Word cloud
See also: Catalogue of data visualisation
The schedule of this and the other courses is in flux, but I do not expect it to change very much. In any case, the course will operate independently of the other courses. You should expect to cover the same topic more than once. However, this course will not focus on either statistical theory or programming.
Week | Statistics | Programming | In-Course | Homework |
---|---|---|---|---|
1 | Course intro | Python intro | Histograms | |
23 Sep | Randomness | |||
Math score | ||||
2 | R Intro | Data types | Form groups | |
30 Sep | Data manipulation | Uncertainty | ||
Histograms | Discrete Variables | |||
Scatterplots | Continuous Variables | |||
Boxplots | ||||
Variable types | ||||
Mosaic plots | ||||
Functions | ||||
3 | Quantifying Variability | Control | Time-Series | |
7 Oct | Distribution | Linear functions | Form groups | |
Density function | Stock market prices | |||
Histograms | Crime statistics | |||
Skewness | S&P index | |||
Quantiles | World Records | |||
4 | Qualitative vars in R | Structures | ||
14 Oct | Discrete vars in R | Proj. Proposal | ||
Scatterplots | ||||
Unemployment | ||||
5 | Continuous RV | Functions ||
21 Oct | Data Cleaning | Table2Picture | ||
6 | Continuous RV | Complements | Experiment design | |
28 Oct | Random Sampling | Deconstruction | ||
Undercounting | ||||
Representative samples ||||
7 | Continuous RV | Classes | Expectations | |
4 Nov | Newspaper |||
8 | Dependencies. | Objects | Bayesian Inference | |
11 Nov | Joint distribution. | Project Highlight | ||
Conditional distribution. | ||||
9 | Moments | Errors | Pipelines | Project |
18 Nov | Simulation studies | |||
10 | Covariance | Iterators | ||
25 Nov | Correlation | Hypothesis testing | Project | |
Scatterplots | The garden of many paths | |||
11 | Prices, returns | FP | Examples, project work | |
2 Dec | Project presentation | |||
12 | Conditional expectations | Project report | ||
9 Dec | Examples, project work | |||
13 | ||||
16 Dec | Project presentations | Project report | ||
- Area: Aire
- Die (Dice): Dé(s)
- Expectation: Espérance
- Experiment: Expérience
- Histogram: Histogramme
- Pie chart: Diagramme circulaire
- Bar chart: Diagramme à barres
- Scatterplot: Nuage de points
- Randomness: Hasard
- Uncertainty: Incertitude
- Probability: Probabilité
- Stochastic: stochastique
- Random: Aléatoire
- Random Variable: Variable aléatoire
- Sample Space: see Universe
- Sample (v): Échantillonner
- Sample (n): Échantillon
- Sampling: Échantillonnage
- Set: Ensemble
- Subset: Sous-ensemble
- Superset: Sur-ensemble
- Survey: Sondage
- Universe: Univers
- $σ$-algebra: Tribu, $σ$-algèbre
Python help: Use python's help() function whenever you can.
help(print)
- [BT] Introduction to Probability, Bertsekas and Tsitsiklis. Preprint at: https://vfu.bg/en/e-Learning/Math–Bertsekas_Tsitsiklis_Introduction_to_probability.pdf
- [CA] The Truthful Art. Cairo, Alberto 📖
- [GJ] Data Science Par La Pratique (Data Science from Scratch). Grus, Joel (Ch3: Visualisation, Ch6: Probability, Ch7: Inference, Ch9: Data collection, Ch10: Exploration, Ch14: Regression, Ch15: Regression+Bootstrap)