# Introduction to Statistics with Python

```
Koen Plevoets
Last modified: 2020-12-06
```

- Chapter 1: Data types
  - 1.1: Numbers
  - 1.2: Strings
  - 1.3: Lists
  - 1.4: Tuples
  - 1.5: Sets
  - 1.6: Dictionaries
- Chapter 2: Programming tools
  - 2.1: Functions
  - 2.2: Control structures
  - 2.3: Exception handling
  - 2.4: Classes, methods and object-oriented programming
- Chapter 3: External data
  - 3.1: Importing modules
  - 3.2: Reading and writing data files
- Chapter 4: Numerical computations with NumPy
  - 4.1: Creating NumPy arrays
  - 4.2: Computing with NumPy arrays
  - 4.3: Indexing with NumPy arrays
  - 4.4: NumPy submodules
- Chapter 5: Data handling with pandas
  - 5.1: Working with pandas Series and DataFrames
  - 5.2: Exploratory statistics with pandas
  - 5.3: Reading data with pandas
  - 5.4: Data wrangling with pandas
  - 5.5: Aggregate statistics with GroupBy objects
  - 5.6: Reshaping pandas DataFrames
- Chapter 6: Visualization with Matplotlib
  - 6.1: Creating graphs with Matplotlib
  - 6.2: Statistical graphs with Matplotlib
  - 6.3: Style sheets
  - 6.4: Object-oriented approach
  - 6.5: Graphical methods in pandas
- Chapter 7: Statistical visualization with seaborn
  - 7.1: Style sheets in seaborn
  - 7.2: Visualizing relationships between variables
  - 7.3: Visualizing categorical data
  - 7.4: Visualizing distributions of data
  - 7.5: Visualizing regression models
  - 7.6: Visualizing facets
- Chapter 8: Statistical inference with SciPy
  - 8.1: Tests for categorical data
  - 8.2: Tests for continuous data
- Chapter 9: Statistical modelling with statsmodels

## Chapter 1: Data types

### 1.1 Numbers

Python can best be seen as a **sophisticated pocket calculator**: if you type in an expression and press **Enter**, then Python will print the result. In Jupyter Notebooks this comes down to selecting a code block (by clicking on it) and pressing the button **Run** above.

In [None]:
2 + 3

In [None]:
3 - 5

In [None]:
3 * 6

In [None]:
6 / 2

In [None]:
3 ** 2

Division of two integers gives the correct solution in Python **3**:

In [None]:
7 / 4

However, in Python **2** the `/` operator performs **integer division** when both numbers are integers (so `7 / 4` gives `1`). In Python 2 it is therefore safer to type in one of the numbers as a **floating point number**:

In [None]:
7.0 / 4

In [None]:
7 / 4.0
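
As a small sketch of how Python 3 handles the different kinds of division: the operators `//` (floor division) and `%` (remainder) are not used in the cells above, but they are the standard way to ask for an integer quotient explicitly.

```python
# True division always returns a floating point number in Python 3.
quotient = 7 / 4

# Floor division rounds the result down to the nearest integer.
floored = 7 // 4

# The remainder (modulo) operator gives what is left over after floor division.
remainder = 7 % 4

print(quotient, floored, remainder)
```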

The arithmetic operators above are **built in**. Many mathematical functions, however, are **not** automatically available.

In [None]:
log(9) # Error

You first have to **load/import** the module `math`. You always have to write the module name before the function name separated by a dot (`.`).

In [None]:
import math
math.log(9)

Modules contain various functions, constants, etc., which are related to each other (Python has many different modules). They are not loaded automatically in order not to make Python too "heavy": you import only what you need.

If you want to access the functions directly, then you can type in the following:

In [None]:
from math import *
log(9)

However, this is considered to be a **bad practice** (because it may overwrite existing function names).
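
Two common alternatives to the star import can be sketched as follows: import only the names you need, or give the module a shorter alias. The alias `m` is an arbitrary name chosen for this illustration.

```python
# Import only the specific names that are needed.
from math import log, sqrt

# Or import the whole module under a shorter alias (the alias name is arbitrary).
import math as m

natural_log = log(9)     # same function as math.log
root = sqrt(16)          # square root of 16
same_log = m.log(9)      # the same function, accessed through the alias

print(natural_log, root, same_log)
```

Both forms keep it visible where a function comes from, which is what the star import hides.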

Every function in Python has a **help page** which explains the details of the function. You can access the help page of a function with the function `help()`. In-between the brackets you specify the name of your function between quotation marks.

In [None]:
help('math.log')

You can also ask for the help page of a whole module.

In [None]:
help('math')

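Python lets you **assign** values to **variables**, so that you can reuse them by name. (Python is **object-oriented**: every value is an object, and a variable is a name which refers to an object.) You do assignment with the `=` operator: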

In [None]:
mass = 30
speed = 5
mass * speed
momentum = mass * speed
momentum

**Spelling matters**: Python is case-sensitive, so capital letters, dots, commas, etc. have to be correct!

In [None]:
Momentum # Error

### 1.2 Strings

Another data type in Python is the **string**. A string is made by placing a sequence of characters **between** either **double quotes** or **single quotes**.

In [None]:
s = 'hello'
s

In [None]:
ss = "hello"
ss

The two kinds of quotes only differ when your string itself contains quotation marks. The rule is:
```
A quotation mark which differs from the surrounding two quotation marks is always possible,
but a quotation mark of the same kind has to be "escaped", i.e. preceded by a backslash (\).
```

In [None]:
'My dad says "hello".'

In [None]:
"My dad's words are heard."

In [None]:
'My dad\'s words are "hello".'

In [None]:
"My dad's words are \"hello\"."

Backslashes can always be added. So, the following is also okay:

In [None]:
'My dad\'s words are \"hello\".'

In [None]:
"My dad\'s words are \"hello\"."

But this is not:

In [None]:
'My dad's words are "hello".' # Error

In [None]:
"My dad's words are "hello"." # Error

Quotation marks **define** the sequence as a string. That is why Python shows the surrounding quotation marks, any backslashes, etc. when you evaluate the string: this is the "**string literal**". If you want to see the **content** of a string, then you use the function `print()`:

In [None]:
print(s)

In [None]:
print('My dad\'s words are \"hello\".')

There are also strings defined by **triple quotation marks**: those are strings surrounded by either three single quotation marks (`'''`) or three double quotation marks (`"""`). You use triple quotation marks if your string spans **multiple lines**:

In [None]:
sss = '''This is the first line
and this is the second'''
print(sss)

Notice that the string literal contains a `\n` character between the two lines:

In [None]:
sss

This `\n` is the so-called "**newline**" (meta)character.

You can perform some operations on strings (and other sequence types such as lists - see later). Two fundamental operators are `+` and `*`. `+` **concatenates** two strings:

In [None]:
'pa' + 'pa'

`*` **repeats** the string:

In [None]:
'pa' * 2

In [None]:
'My dad says' + ' ' + 'a' * 34

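Two string **literals** written next to each other are concatenated automatically, so the `+` operator is not strictly necessary in that case (this works only for literals, not for variables):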

In [None]:
'pa' 'pa'

You can **test** whether a particular character sequence **occurs in** the string.

In [None]:
'e' in s

In [None]:
'z' in s

You can also test whether a character sequence does **not** occur in the string.

In [None]:
'e' not in s

In [None]:
'z' not in s

The **order** is important: Python does not test for the individual characters.

In [None]:
'le' in s

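You can ask for the **smallest** or the **largest** element in the string, i.e. the first or last character **in alphabetical order** (more precisely, in the order of the Unicode code points, in which the uppercase letters A-Z come before the lowercase letters a-z).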

In [None]:
min(s)

In [None]:
max(s)

Both functions also work on strings of digits.

In [None]:
min('375')

In [None]:
max('375')

The **number of characters** in a string can be found with the function `len()`:

In [None]:
len(s)

The **empty string** has, of course, length zero.

In [None]:
len('')

You can also access the individual elements of a sequence. Python works with numbers for the positions **between** the elements of a sequence; these are the "**indexes**" (or "**indices**"). Compare the following scheme:
```
	+---+---+---+---+---+
	| h | e | l | l | o |
	+---+---+---+---+---+
	0   1   2   3   4   5
```
Python treats the areas between the indexes as a **ruler**:
- all numbers in the **first** area (i.e. from 0.0... to 0.9...) can be **rounded down** to `0`,
- all numbers in the **second** area (i.e. from 1.0... to 1.9...) can be **rounded down** to `1`,
- all numbers in the **third** area (i.e. from 2.0... to 2.9...) can be **rounded down** to `2`,
- etc.

As a consequence, the index of the **first** element is `0`, the index of the **second** element is `1`, the index of the **third** element is `2`, etc.

In other words, indexing in Python is "**zero-based**" (so the last element in a sequence has index *len-1*).

In [None]:
s[0]

In [None]:
s[1]

In [None]:
s[2]

In [None]:
s[3]

In [None]:
s[4]

In [None]:
s[5] # Error

Indexes also allow for accessing a **whole segment** of a sequence, a so-called **slice**, which is defined by a `:`. Before the `:` you specify the index **from which** you want to slice; after the `:` you specify the index **until which** you want to slice.

**Remark**: this means that the upper index is in fact the index of the **next** element in the sequence.

In [None]:
s[1:4]

In [None]:
s[1:5]

You can also slice "with an **increment**". To do that, you specify a second `:` followed by the value of the increment:

In [None]:
s[0:5:2]

There are straightforward **default values** for slices:
- if you omit the first index, then Python considers this as equivalent to the beginning of the sequence (index = `0`)
- if you omit the last index, then Python considers this as equivalent to the end of the sequence (index = *len*)

In [None]:
s[:2]

In [None]:
s[3:]

That means that `s[:i] + s[i:] == s` holds for **every** index `i` (including `0` and *len(s)*)!
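
The identity can be checked for every split point with a short loop. This is only a sketch: it uses a `for` loop and `range()`, which belong to the control structures of Chapter 2, and the variable names `left`, `right` and `all_equal` are chosen for this illustration.

```python
s = 'hello'

# Try every split point from 0 up to and including len(s).
all_equal = True
for i in range(len(s) + 1):
    left, right = s[:i], s[i:]
    print(i, repr(left), repr(right))
    # Gluing the two halves together should give back the original string.
    if left + right != s:
        all_equal = False

print(all_equal)
```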

If you specify an **upper index** which is **too high**, then Python will **automatically reduce** it to the actual length of the sequence:

In [None]:
s[:50]

However, if you specify a **lower limit** which is **too high**, then you will get the **empty sequence**:

In [None]:
s[50:]

The same happens if you **change** the **order** of the indexes:

In [None]:
s[3:1]

Python does not raise an error in these last two cases, so an unexpected empty result may point to a mistake in your indexes!

If you specify **negative** indexes, then you count from the **end** of the sequence. In other words, starting from *len* (= *5*), index `-1` accesses the **last** element, index `-2` accesses the **second-to-last** element, etc.

**Important**: Python still counts with **positive** indexes "behind the scenes":
- `-1` is actually a **shorthand** for *len-1* (and equals *4* in the example of "hello")
- `-2` is actually a **shorthand** for *len-2* (and equals *3* in the example of "hello")
- etc.

In order to make everything clear, we extend the scheme above with negative indexes:
```
    +---+---+---+---+---+
    | h | e | l | l | o |
    +---+---+---+---+---+
    0   1   2   3   4   5
   -5  -4  -3  -2  -1
```
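
The correspondence between negative and positive indexes can be verified with a small sketch (the names `n`, `last` and `first` are only used for this illustration, and the `for` loop anticipates Chapter 2):

```python
s = 'hello'
n = len(s)

# For k = 1, 2, ..., len(s) the negative index -k and the positive
# index len(s) - k refer to the same element.
for k in range(1, n + 1):
    print(-k, n - k, s[-k], s[n - k])

last = s[-1]     # same element as s[n - 1]
first = s[-n]    # same element as s[0]
```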

In [None]:
s[-1]

In [None]:
s[-2]

In [None]:
s[-3]

In [None]:
s[-4]

In [None]:
s[-5]

In [None]:
s[-6] # Error

Index `-0` is equal to index `0`, because `-0` is simply the number `0` (so it is **not** converted to *len-0*):

In [None]:
s[-0]

You can also **slice** with **negative indexes**:

In [None]:
s[-4:-2]

In [None]:
s[:-3]

In [None]:
s[-2:]

Just as with positive indexes, the order has to be **from small to large** (otherwise you will again get the empty sequence).

In [None]:
s[-2:-4]

### Exercises
1. Compute (pay attention to decimal digits):
  - (1000 times 3) divided by 2
  - (minus 1000 divided by 3) times (2 to the power 5)
  - 3000 times (5 to the power minus 20)
  - the square root of 10000
  - the logarithm of 3000
  - the tangent of minus 10000
  - the sine of 90
  - the cosine of minus pi
  - the tangent of (minus pi divided by 4)
2. Compute the following:
  - Create two variables, the first with your first name as its value and the second with your surname.
  - Count the number of letters in both variables.
  - Extract the first two letters in both variables and extract the last two letters.

String objects have many different "**methods**". Methods belong to the **object-oriented** paradigm: they are small functions which you can apply on an object of a certain type (in this case: strings).

All string methods can be found in the [Python Standard Library under String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods). We will illustrate a few of them by means of the following two examples:

In [None]:
vs1 = 'what shall we do with the drunken sailor'
vs2 = '    what shall we do with the\tdrunken sailor    '

The method `.count()` counts how often a certain substring occurs in a string.

In [None]:
vs1.count('d')

In [None]:
vs1.count('dr')

You can optionally also specify the **start** and **stop index** between which you want to search.

In [None]:
vs1.count('d', 10)

In [None]:
vs1.count('d', 10, 20)

The method `.find()` returns the **first** index at which a certain substring is found, or `-1` if the substring does not occur. You can optionally specify a start and stop index.

In [None]:
vs1.find('d')

In [None]:
vs1.find('dr')

In [None]:
vs1.find('z')

The method `.index()` does the same as `.find()` but raises an **error** if nothing is found (instead of `-1`). You can optionally specify a start and stop index.

In [None]:
vs1.index('d')

In [None]:
vs1.index('dr')

In [None]:
vs1.index('z') # Error

There are also the methods `.rfind()` and `.rindex()` which give the **last** index.
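
As a brief sketch of the difference with `.find()` and `.index()` (the variable names are chosen for this illustration):

```python
vs1 = 'what shall we do with the drunken sailor'

first_d = vs1.find('d')      # index of the first 'd' (in "do")
last_d = vs1.rfind('d')      # index of the last 'd' (in "drunken")
last_d_too = vs1.rindex('d') # same result as .rfind() when the substring exists

# Like .find(), .rfind() returns -1 when nothing is found;
# .rindex() would raise a ValueError instead.
missing = vs1.rfind('z')

print(first_d, last_d, last_d_too, missing)
```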

There are various methods which **test** for a property of your string, e.g.:
- `.isalnum()` is `True` if all characters in your string are alphanumeric and is `False` otherwise
- `.isalpha()` is `True` if all characters in your string are alphabetic and is `False` otherwise
- etc.

See the full list of test methods under [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [None]:
vs1.isalnum() # False because of the spaces

In [None]:
vs1.isalpha()

In [None]:
vs1.islower()

The method `.join()` uses the string to **concatenate** the sequence specified as the argument. Usually the argument is a list (see later).

In [None]:
'_'.join( ['h', 'e', 'l', 'l', 'o'] )

In [None]:
' '.join( ['what', 'shall', 'we', 'do', 'with', 'the', 'drunken', 'sailor'] )

The methods `.ljust()`, `.center()` and `.rjust()` do **line adjustment** of a string (left, centered and right, respectively). You always have to specify the **total width** of the resulting string. Optionally you can specify the **filling character** used for padding the string (the default is one space).

In [None]:
vs1.ljust(50)

In [None]:
vs1.center(50)

In [None]:
vs1.rjust(50)
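
The optional filling character is not shown in the cells above. A minimal sketch, using a short example string (`word`) so that the padding is easy to see:

```python
word = 'hello'

# The second argument is the character used for padding (default: a space).
left = word.ljust(11, '.')
centered = word.center(11, '*')
right = word.rjust(11, '-')

print(left)
print(centered)
print(right)
```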

The methods `.lower()` and `.upper()` **convert** the character **case** of a string (to lowercase and uppercase, respectively).

In [None]:
vs1.lower()

In [None]:
vs1.upper()

There are also the methods `.capitalize()`, `.casefold()`, `.swapcase()` and `.title()`; see the [String Methods](https://docs.python.org/3/library/stdtypes.html#string-methods).
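
A short sketch of what these four methods do; the German word `'Straße'` is only an illustrative input for `.casefold()`, which is a more aggressive version of `.lower()` intended for case-insensitive comparisons.

```python
vs1 = 'what shall we do with the drunken sailor'

capitalized = vs1.capitalize()   # only the first character in uppercase
titled = vs1.title()             # the first character of every word in uppercase
swapped = titled.swapcase()      # uppercase becomes lowercase and vice versa
folded = 'Straße'.casefold()     # the "ß" is converted to "ss"

print(capitalized)
print(titled)
print(swapped)
print(folded)
```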

The methods `.lstrip()`, `.rstrip()` and `.strip()` **remove characters** from a string: `.lstrip()` from the beginning, `.rstrip()` from the end and `.strip()` from both the beginning and the end. You can optionally specify the set of characters to remove (the default is the set of all whitespace characters).

In [None]:
vs2.lstrip()

In [None]:
vs2.lstrip(' wh')

In [None]:
vs2.rstrip()

In [None]:
vs2.rstrip('or ')

In [None]:
vs2.strip()

The method `.replace()` **replaces** the occurrences of a certain substring (in your string) by another substring.

In [None]:
vs1.replace('a', 'o')

In [None]:
vs1.replace('drunken', 'sober')

You can optionally specify the number of replacements.

In [None]:
vs1.replace('a', 'o', 2)

The methods `.rsplit()` and `.split()` **split** your string on the occurrences of a "**separator string**", to be specified as the argument. You can optionally specify the maximal number of splits. `.rsplit()` starts splitting from the end of the string, `.split()` starts from the beginning of the string.

In [None]:
vs1.rsplit(' ')

In [None]:
vs1.rsplit(' ', 2)

In [None]:
vs1.split(' ')

In [None]:
vs1.split(' ', 2)

**Note**: the methods `.split()` and `.join()` perform opposite operations, but notice the difference in syntax:

`SEQUENCE = STRING.split(sep)` vs. `STRING = sep.join(SEQUENCE)`.
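
The round trip can be sketched as follows (the names `words`, `rebuilt` and `underscored` are chosen for this illustration):

```python
vs1 = 'what shall we do with the drunken sailor'

# Split the string into a list of words on the separator ' ' ...
words = vs1.split(' ')

# ... and join the list back together with the same separator.
rebuilt = ' '.join(words)

# Joining with a different separator gives a new string.
underscored = '_'.join(words)

print(words)
print(rebuilt == vs1)
print(underscored)
```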

The method `.splitlines()` **splits** your string on every **newline** character. You can optionally specify `True` to keep the newline character itself.

In [None]:
sss.splitlines()

In [None]:
sss.splitlines(True)

The methods `.startswith()` and `.endswith()` **test** whether your string starts with or ends with a **specified substring**, respectively. You can optionally specify the start index and stop index **between** which to search.

In [None]:
vs1.startswith('what')

In [None]:
vs1.endswith('or')

In [None]:
vs1.endswith('er')

The method `.zfill()` **adds zeros to the left** of your string until the resulting string has a certain **specified width**.

In [None]:
vs1.zfill(50)

Numbers and strings can also be converted to one another. You can do that with the functions `int()` and `str()`: `int()` converts a string to an integer; `str()` converts a number to a string.

In [None]:
int('5') * 3

In [None]:
str(5) * 3

For completeness' sake, we also mention the function `float()` with which you can convert to a floating point number:

In [None]:
float('5') * 3
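
One detail which the cells above do not show: `int()` does not accept a string with a decimal point, so such a string has to go through `float()` first. A small sketch (the variable names are chosen for this illustration):

```python
text = '5.9'

as_float = float(text)         # the string becomes a floating point number
truncated = int(as_float)      # int() cuts off the decimals (no rounding)
rounded = round(as_float)      # round() gives the nearest integer instead
back = str(truncated)          # and back to a string again

# Note: int('5.9') itself would raise a ValueError.
print(as_float, truncated, rounded, back)
```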

Finally, there is the special character `#` (the hash sign). Everything which appears **after it** (until the end of the line) is **ignored** by Python, which makes it useful for adding **comments** to programming code.

**Important**: a `#` occurring **inside** of a string is, of course, just the `#` character itself:

In [None]:
'The # here is still printed' # but now not anymore.

### Exercises
3. Use the two string variables (with your first name and your surname) from exercise 2 for:
  - Center both variables, left-align them and right-align them in a total string of 30 characters.
  - Find in all those (6) variables the position of your initial (which can be done with **two** different methods).
  - Try to do the same, counting from the **end** of the string.
    - **Hint**: Your initial is the letter before which there is no preceding letter. (Pay attention to the fact that you want the index of the initial itself and not of the preceding character.)
  - Remove the whitespace at the beginning of each variable, at the end and at both the beginning and the end.
  - Join the variable with your first name and the one with your surname together into one string with a space in-between.
  - Put all letters in that one string in uppercase and put them in lowercase.
  - Try to put only the first letter of your first name and of your surname in uppercase (and all other letters in lowercase).
  - Join your first name and your surname again together into one string but now with a newline in-between. This can be done in **two** ways.
  - Display that string as intended: your first name and your surname on separate lines without the newline character.

### 1.3 Lists

The third major data type in Python is the **list**. A list is an ordered sequence of elements; it is created by placing the elements between **square brackets** and separating them by **commas**.

In [None]:
l = ['morning', 'midday', 'evening', 123, math.log]
l

Like strings, lists are **sequences**. That means that many above-mentioned operations can also be applied on lists:

Concatenation:

In [None]:
l + [456, math.sin, 789, math.cos]

Repetition:

In [None]:
l * 2

Test for membership:

In [None]:
'midday' in l

In [None]:
math.tan in l

In [None]:
'midday' not in l

In [None]:
math.tan not in l

The smallest or largest element (this does not make sense for `l`, which contains different data types):

In [None]:
min(['a', 'b', 'c', 'd', 'e'])

In [None]:
max([1, 2, 3, 4, 5])

Length:

In [None]:
len(l)

Just like the empty string, the empty list has length zero:

In [None]:
len([])

Indexing & slicing (in the same **zero-based** fashion):

In [None]:
l[0]

In [None]:
l[2:4]

In [None]:
l[-3]

In [None]:
l[-2:]

An obvious difference with strings is that lists can contain elements of different types. The **core** property of lists, however, is that they are **mutable** sequences, whereas strings are **immutable**:

In [None]:
l[1] = 'noon'
l

In [None]:
s[1] = 'a' # Error

That means that new elements can always be added at the end of a list with the following generic formula:

`LIST1[len(LIST1):] = LIST2` (abbreviation of: `LIST1[len(LIST1):len(LIST1)] = LIST2`).

In [None]:
l[len(l):] = ['dad', 'mum']
l

This is not possible with strings:

In [None]:
s[len(s):] = 'hey' # Error

In the same way, you can add new elements at the **beginning** of a list with: `LIST1[:0] = LIST2` (abbreviation of: `LIST1[0:0] = LIST2`).

In [None]:
l[:0] = ['mother', 'father']
l

Again, this is not possible with strings:

In [None]:
s[:0] = 'bye' # Error

Because a list can contain objects of any type, lists can also be **nested**: i.e. you can create lists of lists.

In [None]:
ll = [456, l, 789]
ll

In the overarching (super)list (`ll`) the sublist (`l`) is just a single element:

In [None]:
len(ll)

In [None]:
ll[1]

In order to access elements in nested lists, you can work with multiple indexes: i.e. you place the index of the sublist **immediately after** the index of the superlist.

In [None]:
ll[1][5]

Depending on the depth of your list, you can keep on specifying indexes (between square brackets).

Slicing remains similar:

In [None]:
lll = ll[1][2:7]
lll

Lists also have some **methods**. We will discuss them all.

The method `.append()` **adds** a specified **element** to the **end** of your list.

In [None]:
lll.append('dawn')
lll

This has the same effect as the in-place **concatenation** `lll += ['dawn']`. Hence, it is also equivalent to: `lll[len(lll):] = ['dawn']`.

The method `.extend()` **adds** a **list** to the **end** of your list.

In [None]:
lll.extend(['dawn', 'dusk'])
lll

This is in fact **pure** in-place concatenation: `lll += ['dawn','dusk']`. Hence, it is equivalent to: `lll[len(lll):] = ['dawn','dusk']`.
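
The difference between these methods and the `+` operator can be made explicit with a small sketch (the lists `base`, `combined` and `nested` are made up for this illustration): `+` builds a **new** list, whereas `.append()` and `.extend()` change the existing list.

```python
base = ['morning', 'noon']
same_object = base                 # a second name for the same list

combined = base + ['dawn']         # new list; base itself is unchanged here
base.append('dawn')                # adds ONE element to base itself
base.extend(['dusk', 'night'])     # adds every element of the given list

# Appending a list adds it as a single (nested) element.
nested = ['a']
nested.append(['b', 'c'])

print(combined)
print(base)
print(nested)
```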

The method `.count()` **counts** the number of occurrences in your list of a certain specified **element**.

In [None]:
lll.count('morning')

The method `.copy()` creates a **copy** of your list.

In [None]:
l_2 = lll.copy()

This is equivalent to `l_2 = lll[:]`:

In [None]:
l_2

The method `.clear()` **removes all elements** from your list.

In [None]:
l_2.clear()

This is equivalent to: `del l_2[:]`:

In [None]:
l_2

The method `.index()` gives the **first** index of a certain specified **element** in your list (and an error if nothing is found).

In [None]:
lll.index(123)

The method `.insert()` **inserts** an **element** at a certain **position** (index) in the list. You specify the index as the first argument and the new element as the second argument. The element which already occupied the indexed position moves one position to the right.

In [None]:
lll.insert(5, 'night')
lll

The method `.pop()` **removes *and* returns** an **element** from your list. By default this is the last element in the list, but you can specify an index as the argument.

In [None]:
v = lll.pop()
v

In [None]:
lll.pop(4)

The method `.remove()` **removes** the **first** occurrence of a specified **element** from your list.

In [None]:
lll.remove(123)

In [None]:
lll.remove('morning')

In [None]:
lll

The method `.reverse()` **reverses** the **order** of the elements in your list.

In [None]:
lll.reverse()
lll

The method `.sort()` **sorts** the elements in your list. By default, this is **increasing** order, but you can specify the argument `reverse = True` to get **decreasing** order.

In [None]:
lll.sort()
lll

In [None]:
lll.sort(reverse = True)
lll
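
Note that `.sort()` changes the list itself and returns nothing. As a sketch of an alternative which is not covered above: the built-in function `sorted()` leaves the original list untouched and returns a new sorted list; both accept an optional `key` argument. The list `words` is made up for this illustration.

```python
words = ['noon', 'dawn', 'evening', 'night']

ordered = sorted(words)              # new sorted list; words is unchanged
by_length = sorted(words, key=len)   # sort on the length of each word
unchanged = words[:]                 # copy of words before sorting in place

result = words.sort()                # sorts words itself and returns None

print(ordered)
print(by_length)
print(result)
```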

Because lists are mutable, you can also **delete** elements or segments from a list. That is possible with the `del` statement.

In [None]:
del lll[1:4]

This is equivalent to: `lll[1:4] = []`:

In [None]:
lll

We come back to the `.copy()` method because it reveals a fundamental aspect of assignment in Python: when you **assign** an existing list to another list, you do **not** create a new object but only a new **name**. In other words, you create two names for the same object:

In [None]:
l_2 = lll
l_2

In [None]:
l_2[1] = 'daybreak'
l_2

In [None]:
lll

With the `.copy()` method you create a new list object (which is called a "**shallow**" copy: the list itself is new, but any nested lists inside it are still shared with the original):

In [None]:
l_3 = lll.copy()
l_3[1] = 'nightfall'
l_3

In [None]:
lll
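
For nested lists the distinction between a shallow and a deep copy becomes visible. The following sketch uses the function `deepcopy()` from the standard module `copy`, which is not discussed in the text above; the list `nested` is made up for this illustration.

```python
import copy

nested = ['noon', ['dawn', 'dusk']]

shallow = nested.copy()          # new outer list, but the inner list is shared
deep = copy.deepcopy(nested)     # new outer list AND a new inner list

shallow[0] = 'midnight'          # only changes the shallow copy
nested[1].append('twilight')     # changes the inner list shared with shallow

print(nested)
print(shallow)
print(deep)
```

Only the deep copy is completely independent of the original.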

However, this only matters for lists (and other **mutable** objects), because only those can be changed in place. Numbers and strings are immutable: a statement such as `n_2 = 20` does not change the existing object but makes the name `n_2` refer to a new object, so the other name keeps its value:

In [None]:
n_1 = 30
n_2 = n_1
n_2 = 20
n_2

In [None]:
n_1

In [None]:
s_1 = 'hey'
s_2 = s_1
s_2 = 'bye'
s_2

In [None]:
s_1
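
The difference between changing an object and giving a name a new value can be sketched with the `is` operator, which tests whether two names refer to the same object (the names `a`, `b` and `c` are made up for this illustration):

```python
a = [1, 2, 3]
b = a              # b is a second name for the same list
b[0] = 99          # changes the one list that both names refer to
same_list = a is b

c = a
c = [4, 5, 6]      # gives the name c a new list; a is not affected
different_list = a is not c

print(a, b, same_list)
print(c, different_list)
```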

### Exercises
4. Create the following list (copy-paste): `exList = ['shall','I','compare','thee','to','a','summer\'s','day']` and:
  - Create (**manually**) a **nested list** on the basis of the **indexes** in `exList` with:
    - the words which have `"a"` as their only vowel in one sublist (i.e. `"shall"`, `"a"` and `"day"`)
    - the words which start with an `"s"` in one sublist (i.e. `"shall"` and `"summer's"`)
    - the words of a single character in one sublist (i.e. `"I"` and `"a"`)
    - all other words as subsequent elements (i.e. `"compare"`, `"thee"` and `"to"`)
  - Reverse the order in every sublist.
  - Sort every sublist.
  - Remove from every sublist the second element. That can be done in **three** different ways, so use a different way for each of the first three sublists.
  - Give all the ways in which you can add elements to the end of a list and illustrate them with `['rose']` and `['snow']`.