# Introduction to Statistics with Python

Python is a script-based programming language that is now widely used in the scientific community. Many libraries support statistical data analysis in general, and further packages are tailored to specific scientific fields.

In general, a Python script can be written and executed directly (you don't need to compile it). However, there is also an interactive shell called IPython that can be very useful, e.g. for quick tests and debugging. Jupyter notebooks provide a convenient interface to IPython: they store the code together with its documentation, while still letting you execute the code interactively.

In this course, some exercises will be provided in the form of Jupyter notebooks. You will find those exercises in the corresponding folder in the GitHub repository. You should then upload your completed notebooks to the submissions folder. Please remember to add your name and the exercise sheet number to the filename!

Note: This introduction shows only the most important things and is by no means a complete documentation. For more information you should have a look at the [Python Docs](https://docs.python.org/3/) or the documentation of the individual libraries. Furthermore, I can recommend the "Python Data Science Handbook" by Jake VanderPlas.


## NumPy

Data can be stored and handled in different ways. One of the most common is the form of an array (which of course can have more than one dimension). The NumPy library (short for Numerical Python) provides an efficient and user-friendly way of storing and operating on data. The main difference to a 'list' (a type already built into Python) is its better scaling performance when handling larger and larger datasets. However, in contrast to 'list' objects, an array only allows elements of the same type. If you have installed NumPy in your conda environment, you can simply import it in your code as follows and check the version:

In [33]:
import numpy

numpy.__version__

'1.26.0'

It is also very common to import NumPy with an alias 'np':

In [34]:
import numpy as np

### Initializing a NumPy array

A NumPy array can be initialized in many ways. For example, we can create a NumPy array from a standard Python list:

In [35]:
np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

Note that, as NumPy only allows elements of the same type, passing it a list like the following results in something slightly different:

In [36]:
np.array([1.1, 2, 3, 4])

array([1.1, 2. , 3. , 4. ])

As you can see, we now have an array of floats instead of integers, because the integers were converted to match the float. We can also force the NumPy array to contain a certain element type:

In [37]:
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

A big advantage of NumPy arrays over lists in terms of data operations is of course that they can be multidimensional:

In [38]:
np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Or, alternatively:

In [39]:
np.array([range(i, i+3) for i in [1, 4, 7]])

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Of course, for larger datasets it is not very practical to type in lists by hand. There are functions to automatically fill an array with zeros ('np.zeros()') or ones ('np.ones()'), or to create an a x b matrix filled with a certain element n ('np.full((a,b),n)'). Sometimes it is also useful to create a set of random values:

In [40]:
np.random.random((3,3))

array([[0.65279032, 0.63505887, 0.99529957],
       [0.58185033, 0.41436859, 0.4746975 ],
       [0.6235101 , 0.33800761, 0.67475232]])

This gives you a 3x3 matrix of uniformly distributed random numbers in the interval [0, 1). Similarly, one can also create such a matrix with random numbers following a normal distribution with mean 0 and standard deviation 1:

In [41]:
np.random.normal(0, 1, (3,3))

array([[ 1.0657892 , -0.69993739,  0.14407911],
       [ 0.3985421 ,  0.02686925,  1.05583713],
       [-0.07318342, -0.66572066, -0.04411241]])
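As a side note, recent NumPy versions also offer a generator-based interface for random numbers via 'np.random.default_rng()'. The following is only a minimal sketch of the same kinds of draws with that interface; the seed value 42 and all variable names are arbitrary choices for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np

# Seeded generator: the same seed reproduces the same stream of numbers.
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))                # uniform floats in [0, 1)
gauss = rng.normal(0.0, 1.0, size=(3, 3))   # mean 0, standard deviation 1
ints = rng.integers(0, 10, size=6)          # integers from 0 to 9 (10 is excluded)

# A second generator with the same seed yields identical values.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(uniform, rng_again.random((3, 3))))
```

The legacy functions such as 'np.random.random' used in this notebook keep working; the generator object mainly makes it explicit which seed a given set of numbers depends on.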

An identity matrix can be created like this:

In [42]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
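The fill functions 'np.zeros()', 'np.ones()' and 'np.full()' mentioned earlier were not shown in action. A short sketch could look as follows; the shapes and the fill value 7 are arbitrary choices for illustration.

```python
import numpy as np

zeros = np.zeros(5)                  # 1D array of five zeros (float64 by default)
ones = np.ones((2, 3), dtype=int)    # 2x3 matrix of integer ones
sevens = np.full((2, 4), 7)          # 2x4 matrix in which every element is 7

print(zeros)
print(ones)
print(sevens)
```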

However, very often you will be given data in the form of a txt or csv file. For this case, NumPy provides the 'genfromtxt' function, which creates an array directly from the contents of such a file:

In [43]:
np.genfromtxt("data/IntroTestfile.csv", delimiter=",", dtype='float32')

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.],
       [13., 14., 15., 16.],
       [17., 18., 19., 20.]], dtype=float32)

There are a few things one needs to take care of, though. For instance, one needs to know whether the dataset is complete or whether some values are missing:

In [44]:
np.genfromtxt("data/IntroTestfile2.csv", delimiter=",", dtype='float32')

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., nan, 12.],
       [13., 14., 15., 16.],
       [17., 18., 19., 20.]], dtype=float32)

Here, we got a 'nan' (not a number) output because the corresponding element in the file is missing. 'genfromtxt' allows for dealing with these missing elements by defining a default value for them:

In [45]:
np.genfromtxt("data/IntroTestfile2.csv", delimiter=",", dtype='float32', filling_values=-1.0)

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., -1., 12.],
       [13., 14., 15., 16.],
       [17., 18., 19., 20.]], dtype=float32)
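If you want to try this without the course files, the behaviour can be reproduced with an in-memory text buffer. The following sketch assumes a small, made-up CSV string ('csv_text' is hypothetical and unrelated to IntroTestfile2.csv) and also counts the missing entries with 'np.isnan':

```python
import io
import numpy as np

# Hypothetical CSV content with one missing entry in the second row.
csv_text = "1,2,3\n4,,6\n7,8,9\n"

# Without a fill value, the missing entry becomes nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(np.isnan(raw).sum())       # number of missing entries

# With filling_values, the missing entry is replaced by the given default.
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=",", filling_values=-1.0)
print(filled)
```

Keep in mind that a fill value such as -1 enters every later calculation like a real measurement, so it should be chosen with care.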

### Properties of NumPy Arrays

When working with NumPy arrays, it is often very useful to be able to access their properties. In this example, we will look at a three-dimensional array, using the random number generator once more. (Specifying the random seed makes sure the results can be reproduced every time the code is run. 'randint' gives us a set of random integers, in this case from 0 to 9, as the upper bound 10 is excluded.)

In [46]:
np.random.seed(0)

a3 = np.random.randint(10, size=(3,4,5))

We can now ask for the attributes of this array, i.e. the number of dimensions ('ndim'), the size of each dimension ('shape'), the total number of elements ('size'), the data type ('dtype'), the memory size of each element in bytes ('itemsize') and the total memory size of the array in bytes ('nbytes'):

In [47]:
print("a3 ndim: ", a3.ndim)
print("a3 shape: ", a3.shape)
print("a3 size: ", a3.size)
print("a3 dtype: ", a3.dtype)
print("a3 itemsize: ", a3.itemsize)
print("a3 nbytes: ", a3.nbytes)

a3 ndim:  3
a3 shape:  (3, 4, 5)
a3 size:  60
a3 dtype:  int64
a3 itemsize:  8
a3 nbytes:  480
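These attributes are related to each other: 'size' is the product of the entries in 'shape', and 'nbytes' is 'size' times 'itemsize'. The sketch below checks this on a separate array with an explicitly chosen dtype, since the default integer type can differ between platforms; the array names 'a' and 'b' are only for illustration.

```python
import numpy as np

# Explicit dtype so the result does not depend on the platform default.
a = np.zeros((3, 4, 5), dtype=np.int64)

print(a.size == a.shape[0] * a.shape[1] * a.shape[2])  # size is the product of the shape
print(a.nbytes == a.size * a.itemsize)                 # 60 elements * 8 bytes each

# A smaller dtype shrinks the memory footprint accordingly.
b = a.astype(np.int16)
print(b.itemsize, b.nbytes)
```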


Another important aspect of handling NumPy arrays is to be able to access and set individual elements by index. Let's start with a one-dimensional array:

In [48]:
a1 = np.random.randint(10, size=6)

In one dimension, accessing a specific element is quite straightforward: you can just use the index of the element you want to access in square brackets:

In [49]:
print(a1)
print("3rd element: ", a1[2])

[8 1 1 7 9 9]
3rd element:  1


Please note that the counting (as usual in programming) starts at '0'. Alternatively, you can also start counting from the end and use negative indices:

In [50]:
print(a1)
print("last element: ", a1[-1])

[8 1 1 7 9 9]
last element:  9


In a two-dimensional array, individual elements can be accessed by a comma-separated tuple of indices (row first, then column):

In [51]:
a2 = np.random.randint(10, size=(3,4))
print(a2)
print("first element in the second row: ", a2[1,0])

[[3 6 7 2]
 [0 3 5 9]
 [4 4 6 4]]
first element in the second row:  0


Using the same notation, individual elements can also be modified:

In [52]:
a2[1,0] = 100
print(a2)

[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]


Instead of accessing only individual elements, one can also work with slices of arrays, using the notation 'start:stop:step', where the element at index 'stop' is not included:

In [53]:
print(a1)
print("first five elements: ", a1[:5])
print("all elements after the fifth: ", a1[5:])
print("elements between the second and the fourth: ", a1[1:4])
print("every second element, starting from index 1: ", a1[1::2])
print("reversing the array: ", a1[::-1])

[8 1 1 7 9 9]
first five elements:  [8 1 1 7 9]
all elements after the fifth:  [9]
elements between the second and the fourth:  [1 1 7]
every second element, starting from index 1:  [1 7 9]
reversing the array:  [9 9 7 1 1 8]


In the case of more than one dimension, one has to provide a slice for each dimension, separated by commas as before:

In [54]:
print(a2)
print("first two rows, first three columns: ", a2[:2, :3])
print("all rows, every second column: ", a2[:3, ::2])
print("first column: ", a2[:,0])
print("first row: ", a2[0,:])
print("reversing the array: ", a2[::-1, ::-1])

[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]
first two rows, first three columns:  [[  3   6   7]
 [100   3   5]]
all rows, every second column:  [[  3   7]
 [100   5]
 [  4   6]]
first column:  [  3 100   4]
first row:  [3 6 7 2]
reversing the array:  [[  4   6   4   4]
 [  9   5   3 100]
 [  2   7   6   3]]


One important thing to keep in mind is that extracting slices from arrays doesn't provide you with an independent copy of the original array's slice but rather returns a view of it. This means that changing the slice will also change the corresponding part of the original array! If you want a copy instead of a view, you can achieve this by using the 'copy()' method:

In [55]:
new_a2_slice = a2[:2,:2].copy()
print(new_a2_slice)

[[  3   6]
 [100   3]]


Modifying 'new_a2_slice' now won't have an effect on the original 'a2' array. (Bonus task: Try it out!)
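For reference, the difference between a view and a copy can be sketched on a small, separate array. The array 'grid' and the values written into it are hypothetical and chosen only for illustration; 'np.shares_memory' reports whether two arrays use the same underlying data.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # hypothetical example array

view = grid[:2, :2]          # slicing returns a view on the same memory
view[0, 0] = -1
print(grid[0, 0])            # the original changed as well

independent = grid[:2, :2].copy()
independent[0, 0] = 99
print(grid[0, 0])            # the original is unaffected by the copy

print(np.shares_memory(grid, view), np.shares_memory(grid, independent))
```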

A rather useful feature is so-called masking, with which only specific elements of the array are selected. For example, we may only want to display entries that satisfy a certain requirement, like being smaller than 7. Note that the result is a one-dimensional array of the selected elements:

In [56]:
print(a2)

print("\n", a2[(a2<7)])

[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]

 [3 6 2 3 5 4 4 6 4]
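The mask itself is an ordinary boolean array, so it can be counted, combined with other conditions, or used for assignment. A minimal sketch, using a separate array 'data' with the same values as 'a2' so that the original stays untouched (the threshold values are arbitrary):

```python
import numpy as np

data = np.array([[3, 6, 7, 2],
                 [100, 3, 5, 9],
                 [4, 4, 6, 4]])

mask = data < 7                  # boolean array with the same shape as data
print(mask.sum())                # number of elements that satisfy the condition

# Conditions are combined element-wise with & (and) and | (or);
# the parentheses around each condition are required.
print(data[(data > 3) & (data < 7)])

# A mask can also be used to assign values.
clipped = data.copy()
clipped[clipped > 9] = 9
print(clipped.max())
```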


### Standard operations on one or more NumPy Arrays

Arrays can be put together (concatenated) or one array split into multiple arrays:

In [57]:
x = np.array([1,2,3])
y = np.array([4,5,6])
print(x, y)

z = np.concatenate([x,y])
print(z)

z1, z2, z3 = np.split(z, [2, 4])
print(z1, z2, z3)

[1 2 3] [4 5 6]
[1 2 3 4 5 6]
[1 2] [3 4] [5 6]


For concatenations with multi-dimensional arrays, an axis can be specified:

In [58]:
print("joining rows:")
print(np.concatenate([a2, a2], axis=0))
print("\njoining columns:")
print(np.concatenate([a2, a2], axis=1))

joining rows:
[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]
 [  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]

joining columns:
[[  3   6   7   2   3   6   7   2]
 [100   3   5   9 100   3   5   9]
 [  4   4   6   4   4   4   6   4]]


Splitting also works in more than one dimension but has dedicated functions for it:

In [59]:
print(a2)
upper, lower = np.vsplit(a2, [2])
print(upper)
print(lower)

[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]
[[  3   6   7   2]
 [100   3   5   9]]
[[4 4 6 4]]


In [60]:
left, right = np.hsplit(a2, [2])
print(left)
print(right)

[[  3   6]
 [100   3]
 [  4   4]]
[[7 2]
 [5 9]
 [6 4]]


Besides concatenating and splitting arrays, mathematical operations can also be performed on them. These built-in operations are often very useful as they avoid inefficient loops over the whole array. These 'ufuncs' (universal functions) provide vectorized operations, i.e. the operation is carried out on each element. In general, there are unary ufuncs, which operate on a single input array, and binary ufuncs, which take two inputs.

The arithmetic operations below are binary ufuncs (here combining the array with a scalar), while functions such as the absolute value, sine, exponential and logarithm are unary ufuncs:

In [61]:
print(a1, "\n")

print("adding 5 to all elements: ", a1+5)
print("alternative: ", np.add(a1, 5), "\n")

print("subtracting 3 from all elements: ", a1-3)
print("alternative: ", np.subtract(a1, 3), "\n")

print("multiplying 4 to all elements: ", a1*4)
print("alternative: ", np.multiply(a1,4), "\n")

print("dividing all elements by 2: ", a1/2)
print("alternative: ", np.divide(a1,2), "\n")

print("dividing all elements by 2 and only keep floor number: ", a1//2)
print("alternative: ", np.floor_divide(a1,2), "\n")

print("square each element: ", a1**2)
print("alternative: ", np.power(a1,2), "\n")

print("take modulus of each element: ", a1%2)
print("alternative: ", np.mod(a1,2), "\n")

print("absolute value of each element: ", abs(a1))
print("alternative: ", np.absolute(a1), "\n")

print("sinus of each element: ", np.sin(a1), "\n")
print("exponential: ", np.exp(a1), "\n")
print("logarithm: ", np.log(a1), "\n")

[8 1 1 7 9 9] 

adding 5 to all elements:  [13  6  6 12 14 14]
alternative:  [13  6  6 12 14 14] 

subtracting 3 from all elements:  [ 5 -2 -2  4  6  6]
alternative:  [ 5 -2 -2  4  6  6] 

multiplying 4 to all elements:  [32  4  4 28 36 36]
alternative:  [32  4  4 28 36 36] 

dividing all elements by 2:  [4.  0.5 0.5 3.5 4.5 4.5]
alternative:  [4.  0.5 0.5 3.5 4.5 4.5] 

dividing all elements by 2 and only keep floor number:  [4 0 0 3 4 4]
alternative:  [4 0 0 3 4 4] 

square each element:  [64  1  1 49 81 81]
alternative:  [64  1  1 49 81 81] 

take modulus of each element:  [0 1 1 1 1 1]
alternative:  [0 1 1 1 1 1] 

absolute value of each element:  [8 1 1 7 9 9]
alternative:  [8 1 1 7 9 9] 

sinus of each element:  [0.98935825 0.84147098 0.84147098 0.6569866  0.41211849 0.41211849] 

exponential:  [2.98095799e+03 2.71828183e+00 2.71828183e+00 1.09663316e+03
 8.10308393e+03 8.10308393e+03] 

logarithm:  [2.07944154 0.         0.         1.94591015 2.19722458 2.19722458] 
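Whether a ufunc is unary or binary can be checked through its 'nin' attribute, which holds the number of inputs. The sketch below also shows a binary ufunc acting on two arrays rather than on an array and a scalar; the arrays 'x' and 'y' are made up for illustration.

```python
import numpy as np

# nin is the number of input arguments a ufunc expects.
print(np.sin.nin, np.absolute.nin)   # unary ufuncs
print(np.add.nin, np.power.nin)      # binary ufuncs

x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# A binary ufunc combines two arrays element by element.
print(np.add(x, y))
# With a scalar as second input, the scalar is applied to every element.
print(np.add(x, 5))
```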



These functions all perform the same operation on the individual elements while the form of the array doesn't change. But there are also possibilities to aggregate the elements by e.g. calling the 'reduce()' function. This will apply an operation on the elements repeatedly until only one single result is obtained. For instance, the 'add' function can be 'reduced' to sum up all the elements of the array:

In [62]:
print(a1, "\n")
print(np.add.reduce(a1))

[8 1 1 7 9 9] 

35
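The same mechanism works for other binary ufuncs, and 'accumulate()' keeps the intermediate results instead of only the final one. A short sketch on a separate array 'values' with the same entries as 'a1':

```python
import numpy as np

values = np.array([8, 1, 1, 7, 9, 9])

print(np.add.reduce(values))          # same result as values.sum()
print(np.multiply.reduce(values))     # same result as values.prod()
print(np.add.accumulate(values))      # running sum, same as np.cumsum(values)
```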


There are many more ufuncs that can come in handy depending on the use case. If you need a specific operation, it is therefore always a good idea to look up the documentation on ufuncs on the [NumPy webpage](https://numpy.org/doc/stable/user/basics.ufuncs.html). It also covers, for example, how operations are performed on arrays with different shapes and sizes ('broadcasting').
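To give a rough idea of broadcasting, the following sketch adds arrays of different shapes. NumPy stretches dimensions of length 1 (or missing leading dimensions) so that the shapes match; the arrays used here are made up for illustration.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)     # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)
column = np.array([[100], [200]])       # shape (2, 1)

print(matrix + row)       # row is added to each row of matrix
print(matrix + column)    # column is added to each column of matrix
print((row + column).shape)
```

Shapes that cannot be matched in this way, for example (2, 3) and (2,), raise a 'ValueError'.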

### Statistics with NumPy arrays

Now that we have covered the basics, it is time to look at what NumPy arrays can do for us when we deal with statistical data analysis. In principle, with the ufuncs we've just seen, you should be able to implement functions to calculate quantities like the mean and the standard deviation yourself. However, as these quantities are needed so often, and a naive implementation with for-loops would be inefficient and slow, such functions are already conveniently provided by the NumPy library.
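As an illustration of the first point, the mean and the standard deviation can be written with ufuncs alone and compared against the built-in functions. This is only a sketch on a separate array 'values' with the same entries as 'a1'; it uses the population definition (division by N), which is also the default of 'np.std'.

```python
import numpy as np

values = np.array([8, 1, 1, 7, 9, 9])

# Mean: sum of all elements divided by their number.
mean = np.add.reduce(values) / values.size
# Variance: mean squared deviation from the mean (division by N).
variance = np.add.reduce((values - mean) ** 2) / values.size
std = np.sqrt(variance)

print(np.isclose(mean, np.mean(values)))
print(np.isclose(std, np.std(values)))
```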

You can, for example, easily determine the minimum or the maximum element or the sum of all elements:

In [63]:
print("minimum: ", np.min(a2))
print("maximum: ", np.max(a2))
print("sum: ", np.sum(a2))
print("\nalternatively:")
print("minimum: ", a2.min())
print("maximum: ", a2.max())
print("sum: ", a2.sum())

minimum:  2
maximum:  100
sum:  153

alternatively:
minimum:  2
maximum:  100
sum:  153


Again, looking at multidimensional arrays, these operations can also be carried out for each row or column, just by specifying an 'axis' argument:

In [64]:
print(a2)
print("minimum per column: ", np.min(a2, axis=0))
print("minimum per row: ", np.min(a2, axis=1))

[[  3   6   7   2]
 [100   3   5   9]
 [  4   4   6   4]]
minimum per column:  [3 3 5 2]
minimum per row:  [2 3 4]


More interesting statistics functions that are available within NumPy include:

In [65]:
print(a1, "\n")
print("product of elements: ", np.prod(a1))
print("mean of elements: ", np.mean(a1))
print("standard deviation of elements: ", np.std(a1))
print("variance of elements: ", np.var(a1))
print("median: ", np.median(a1))
print("25th percentile: ", np.percentile(a1, 25))

[8 1 1 7 9 9] 

product of elements:  4536
mean of elements:  5.833333333333333
standard deviation of elements:  3.4840908267278117
variance of elements:  12.138888888888888
median:  7.5
25th percentile:  2.5
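One detail worth knowing for statistics: by default 'np.std' and 'np.var' divide by the number of elements N (population definition). The sample estimate, which divides by N - 1, is obtained with the 'ddof' (delta degrees of freedom) argument. A minimal sketch on a separate array 'values' with the same entries as 'a1':

```python
import numpy as np

values = np.array([8, 1, 1, 7, 9, 9])

population_std = np.std(values)          # divides by N (ddof=0, the default)
sample_std = np.std(values, ddof=1)      # divides by N - 1

print(population_std, sample_std)
```

For small samples such as this one the two values differ noticeably, so it is worth stating which definition was used when quoting a result.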


As there can be all sorts of artifacts in a dataset, it is worth mentioning that a 'nan-safe' version exists for each of these statistical functions, which ignores 'nan' entries. The names of the functions are the same, just with 'nan' at the beginning (e.g. 'np.nanmean').
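The effect of a single missing value, and of the nan-aware variants, can be sketched as follows. The array 'measurements' is made-up data for illustration.

```python
import numpy as np

measurements = np.array([1.0, 2.0, np.nan, 4.0])   # hypothetical data with a gap

print(np.mean(measurements))      # nan: a single missing value spoils the result
print(np.nanmean(measurements))   # mean of the three valid entries
print(np.nanmax(measurements), np.nansum(measurements))
```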

Concerning statistical properties of more than one array, there are also built-in functions. Using again the dummy data of IntroTestfile.csv:

In [66]:
file = np.genfromtxt("data/IntroTestfile.csv", delimiter=",", dtype='float32')
file

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.],
       [13., 14., 15., 16.],
       [17., 18., 19., 20.]], dtype=float32)

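We can now ask: if the numbers in column 1 change, how do the numbers in column 2 change relative to that? This can be answered by the covariance between the two columns and by their correlation coefficient, respectively. Both functions return a 2x2 matrix: the diagonal entries describe each column with itself, the off-diagonal entries describe the relation between the two columns: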

In [67]:
col1 = file[:,0].copy()
col2 = file[:,1].copy()
print(col1)
print(col2)

print("\n", np.cov(col1, col2))
print(np.corrcoef(col1, col2))

[ 1.  5.  9. 13. 17.]
[ 2.  6. 10. 14. 18.]

 [[40. 40.]
 [40. 40.]]
[[1. 1.]
 [1. 1.]]


As can be expected, those two columns are 100% correlated, as one of them is simply the other plus 1.
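To make the numbers above less of a black box, the off-diagonal entries can be recomputed by hand. The sketch below rebuilds the two columns directly instead of reading the file, and assumes the sample definition (division by N - 1), which is what 'np.cov' uses by default; the names 'dev1', 'dev2', 'cov12' and 'corr12' are only for illustration.

```python
import numpy as np

col1 = np.array([1.0, 5.0, 9.0, 13.0, 17.0])
col2 = col1 + 1.0

# Sample covariance: sum of products of deviations, divided by N - 1
# (np.cov uses N - 1 by default).
dev1 = col1 - col1.mean()
dev2 = col2 - col2.mean()
cov12 = np.sum(dev1 * dev2) / (col1.size - 1)

# Pearson correlation: covariance scaled by both sample standard deviations.
corr12 = cov12 / (np.std(col1, ddof=1) * np.std(col2, ddof=1))

print(cov12, np.cov(col1, col2)[0, 1])
print(corr12, np.corrcoef(col1, col2)[0, 1])
```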

As with the standard ufuncs, this list is by no means complete. You can find more details in the statistics section of the [NumPy user manual](https://numpy.org/doc/stable/reference/routines.statistics.html).

Very often, a dataset contains different kinds of data, which translate to different 'dtypes' in Python. This is not ideal with NumPy, as an array can only hold elements of one dtype, as mentioned in the beginning. NumPy does offer a solution to this in the form of structured arrays. However, in everyday work, using Pandas dataframes is much more convenient. This will be discussed in the next notebook.
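For completeness, a structured array could look roughly like the following sketch. The field names and all values are invented for illustration; each field has its own dtype and can be accessed by name.

```python
import numpy as np

# Hypothetical records: each entry has a name, an age, and a weight.
people = np.zeros(3, dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")])

people["name"] = ["Alice", "Bob", "Carol"]
people["age"] = [25, 45, 37]
people["weight"] = [55.0, 85.5, 68.0]

print(people["age"].mean())                 # operate on one field like a normal array
print(people[people["age"] < 40]["name"])   # mask on one field, read another
```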