# Introduction to Statistics with Python

```
Koen Plevoets
Last modified: 2020-12-07
```

# Class 2

Python has some other data types such as tuples, sets and dictionaries.

### 1.4 Tuples

A **tuple** is an **immutable** sequence of elements. Tuples can be created with **round brackets**:

In [None]:
t = (1, 2, 3)
t

However, the round brackets are not necessary:

In [None]:
tt = 1, 2, 3
tt

Since tuples are sequences, you can perform any sequence operation on them: concatenation, repetition, membership testing, indexing, slicing, minimum, maximum and len(gth).
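The sequence operations listed above can be sketched on two small tuples. The names `t_a` and `t_b` are chosen for this illustration only and do not occur elsewhere in the course.

```python
# Illustrative tuples (hypothetical names, not used elsewhere in the notebook)
t_a = (1, 2, 3)
t_b = (4, 5)

concatenated = t_a + t_b    # concatenation: (1, 2, 3, 4, 5)
repeated = t_b * 2          # repetition: (4, 5, 4, 5)
has_two = 2 in t_a          # membership testing: True
first = t_a[0]              # indexing (zero-based): 1
tail = t_a[1:]              # slicing: (2, 3)
smallest = min(t_a)         # minimum: 1
largest = max(t_a)          # maximum: 3
size = len(t_a)             # length: 3
```

Each of these operations returns a new object and leaves `t_a` and `t_b` unchanged.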

In contrast to lists, tuples are immutable (in that regard, tuples are more like strings):

In [None]:
t[1] = 4 # Error

Nonetheless, tuples can contain (nested) elements which themselves are mutable:

In [None]:
ttt = (1,2,['mummy','daddy'])
ttt[2].append('baby')
ttt

The major use of tuples is to have ordered sequences of elements which cannot be changed anymore.

### 1.5 Sets

A **set** is an **unordered** collection of **unique elements**. That means that sets have no sequence, nor do they contain repeated values. They can be created with the function `set()`, which extracts the unique elements from a sequence (or from any other iterable):

In [None]:
v = set('baby')
v

In [None]:
vv = set([1,2,3,2,3])
vv

The first example shows that sets have no ordering. Hence, indexes/slices are not applicable to sets:

In [None]:
v[1] # Error

In [None]:
vv[1] # Error

Sets do have a length: this is their "**cardinal number**" (the number of unique elements).

In [None]:
len(v)

In [None]:
len(vv)

You can also **test for membership** in a set.

In [None]:
'y' in v

In [None]:
5 not in vv

Of course, sets allow for all the usual set-theoretical operations. In order to illustrate them, we create some examples:

In [None]:
v1 = set(['father', 'mother', 'son', 'daughter'])
v2 = set(['daddy', 'mummy', 'son', 'daughter'])
v3 = set(['father', 'son'])

- Union:

In [None]:
v1 | v2

In [None]:
v1.union(v2)

- Intersection:

In [None]:
v1 & v2

In [None]:
v1.intersection(v2)

- Difference:

In [None]:
v1 - v2

In [None]:
v1.difference(v2)

- Symmetric difference ("Exclusion"):

In [None]:
v1 ^ v2

In [None]:
v1.symmetric_difference(v2)
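The four operations are related by simple identities, which can be checked directly. The following is a minimal sketch; the names `s1` and `s2` are illustrative copies of the example sets, so that `v1` and `v2` are left untouched.

```python
# Illustrative sets (hypothetical names) with the same elements as v1 and v2
s1 = set(['father', 'mother', 'son', 'daughter'])
s2 = set(['daddy', 'mummy', 'son', 'daughter'])

# The symmetric difference is the union without the intersection ...
union_minus_intersection = (s1 ^ s2) == (s1 | s2) - (s1 & s2)

# ... and also the union of the two differences
union_of_differences = (s1 ^ s2) == (s1 - s2) | (s2 - s1)

# Union, intersection and symmetric difference are symmetric; difference is not
difference_is_symmetric = (s1 - s2) == (s2 - s1)

print(union_minus_intersection, union_of_differences, difference_is_symmetric)
```

The first two comparisons hold for any pair of sets. The last one is `False` here because each set contains elements that the other lacks.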

Furthermore, there are various **tests** (next to the above-mentioned test for membership):

- Test for disjointness (i.e. whether two sets have no elements in common):

In [None]:
v1.isdisjoint(v2)

In [None]:
(v1 & v2).isdisjoint(v1 ^ v2)

In [None]:
(v1 ^ v2).isdisjoint(v1 & v2)

- Test for subset:

In [None]:
v3 <= v1

In [None]:
v3.issubset(v1)

- There is also a test for a **proper** (or **strict**) subset:

In [None]:
v1 <= v1

In [None]:
v1 < v1

In [None]:
v3 < v1 # PROPER subset

- Test for superset:

In [None]:
v1 >= v3

In [None]:
v1.issuperset(v3)

- There is also a test for a **proper** (or **strict**) superset:

In [None]:
v1 >= v1

In [None]:
v1 > v1

In [None]:
v1 > v3 # PROPER superset

Finally, there are some operations which modify a set **in place**:

- Update:

In [None]:
v1 |= set(['uncle', 'aunt'])

In [None]:
v1

In [None]:
v1.update(set(['uncle', 'aunt']))

- Intersection update:

In [None]:
v1 &= set(['mother', 'daughter', 'uncle', 'aunt', 'nephew', 'niece'])
v1

In [None]:
v1.intersection_update(set(['mother', 'daughter', 'uncle', 'aunt', 'nephew', 'niece']))

- Difference update:

In [None]:
v1 -= set(['son', 'daughter', 'nephew', 'niece'])
v1

In [None]:
v1.difference_update(set(['son', 'daughter', 'nephew', 'niece']))

- Exclusion update:

In [None]:
v1 ^= set(['uncle', 'aunt', 'brother', 'sister'])
v1

In [None]:
v1.symmetric_difference_update(set(['uncle', 'aunt', 'brother', 'sister']))

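**Important**: these operations change `v1` itself, so after running the cells above `v1` is of course another set than the one we started with! (Applying the exclusion update twice with the same argument undoes it again.)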

- Copy:

In [None]:
v4 = v2.copy()
v4

- Add:

In [None]:
v2.add('bro')
v2

- Remove:

In [None]:
v2.remove('daddy')
v2

In [None]:
v2.remove('aunt') # Error

- Remove (without Error):

In [None]:
v2.discard('aunt')
v2

In [None]:
v2.discard('son')
v2

- Remove an **arbitrary** element (you cannot control which one):

In [None]:
v2.pop()

In [None]:
v2

- Empty set:

In [None]:
v2.clear()
v2

In [None]:
len(v2)

### 1.6 Dictionaries

Another important data type in Python is the **dictionary**. The core property of a dictionary is that every element in it is accessed by a **name** instead of a positional index. (Since Python 3.7 dictionaries do preserve the order in which items were inserted, but you still cannot index them by position.)

Such a name is called the "**key**", and Python allows any **hashable** object as a key. In practice that means immutable objects such as strings, numbers and tuples (of immutable elements). The elements themselves of a dictionary are called the "**values**", and every pair of key and value is called an "**item**".

The quickest way to create a dictionary is with **curly braces** (`{` and `}`). The key and value of every pair are separated by a `:`, and the pairs themselves by a `,`.

In [None]:
d = {'subject1' : 'John', 'subject2' : 'Mark', 'object' : ['apples', 'oranges', 'pears', 'fish'], 'price' : 12345 }
d

You can access the value of any item by specifying the **key** between **square brackets**:

In [None]:
d['price']

In [None]:
d['subject500'] # Error

You can also **assign** items to a dictionary (in other words, dictionaries are mutable):

In [None]:
d['subject3'] = 'Bart'
d

You can remove an item from a dictionary with the `del` statement. Again, you specify the key between square brackets.

In [None]:
del d['subject2']
d

You can **test for membership** in a dictionary by testing on a **key**:

In [None]:
'subject1' in d

In [None]:
'subject2' not in d

In [None]:
'John' in d # False

The number of items in a dictionary is found with the function `len()`:

In [None]:
len(d)
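To make the restriction on keys concrete, the following sketch tries a few kinds of keys. The dictionary `key_demo` is a hypothetical example, and the `try`/`except` construction (not covered in this class) is only used here to catch the error so that the cell keeps running.

```python
# Hypothetical dictionary for trying out different kinds of keys
key_demo = {}
key_demo['name'] = 'a string as key'
key_demo[42] = 'an integer as key'
key_demo[(1, 2)] = 'a tuple as key'

# A list is mutable and therefore cannot be used as a key
try:
    key_demo[[1, 2]] = 'a list as key'
except TypeError as err:
    list_key_error = str(err)   # the message states that lists are unhashable

print(len(key_demo))    # the failed assignment did not add an item
print(list_key_error)
```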

Dictionaries also have some **methods**. We discuss all of them. First we create an example dictionary:

In [None]:
vd0 = {'key1' : 'value1', 'key2' : 'value2', 'key3' : 'value3'}

The method `.clear()` **removes all items** from your dictionary.

In [None]:
vd0.clear()
vd0

In [None]:
len(vd0)

The method `.copy()` creates a **copy** of your dictionary.

In [None]:
vd0 = d.copy()
vd0

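The method `.fromkeys()` creates a **new dictionary** from a **list** of **keys**, which is specified as the first argument. The keys do not have to occur in the old dictionary, and the values of the old dictionary are not copied. As the second argument you can specify a value, which is then assigned to **every** key (the default is `None`). Note that a list given as the second argument is not distributed over the keys: every key receives the whole list.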

In [None]:
d.fromkeys(['subject1', 'price', 'subject500'])

In [None]:
d.fromkeys(['subject1', 'price', 'subject500'], ['pineapple', 'peach', 'prune'])
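The behaviour of the second argument is easy to misread, so here is a small sketch with hypothetical names (`fruit_keys`, `zero_counts`, `shared`, `paired`). It assumes that what you actually want is either one common start value or one value per key.

```python
fruit_keys = ['apple', 'pear', 'prune']

# One common start value for every key
zero_counts = dict.fromkeys(fruit_keys, 0)

# A mutable value is shared: all keys refer to the very same list object
shared = dict.fromkeys(fruit_keys, [])
shared['apple'].append('x')     # also visible under 'pear' and 'prune'

# To pair every key with its own value, zip() and dict() can be combined
paired = dict(zip(fruit_keys, ['pineapple', 'peach', 'prune']))

print(zero_counts)
print(shared)
print(paired)
```

The function `zip()` is discussed in the section on control structures. Here it simply pairs the first key with the first value, the second key with the second value, and so on.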

The method `.get()` gives the **value** of the **key** which you specify as the first argument. If the dictionary does not contain the key, then the value specified as the second argument is returned (the default is `None`).

In [None]:
d.get('subject1')

In [None]:
d.get('subject500')

In [None]:
d.get('subject500', 'noway')

In [None]:
d.get('subject1', 'noway')

The method `.items()` gives a **view** (a list-like object) of all **items** (i.e. pairs of keys and values) in your dictionary. You can convert it to a list with `list()`.

In [None]:
d.items()

**Important**: the items are not alphabetically ordered. They appear in the order in which they were inserted into the dictionary (this is guaranteed since Python 3.7).
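In Python 3.7 and later, the order in which a dictionary shows its items is the order of insertion. A minimal sketch, with a hypothetical dictionary `order_demo`:

```python
# Hypothetical dictionary, filled in a deliberately non-alphabetical order
order_demo = {}
order_demo['pear'] = 3
order_demo['apple'] = 1
order_demo['banana'] = 2

insertion_order = list(order_demo.items())
# [('pear', 3), ('apple', 1), ('banana', 2)]

# Overwriting an existing key changes its value, not its position
order_demo['pear'] = 30
after_overwrite = list(order_demo.items())
# [('pear', 30), ('apple', 1), ('banana', 2)]
```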

The method `.keys()` gives a **view** of all **keys** in your dictionary.

In [None]:
d.keys()

It can sometimes be convenient to sort the keys:

In [None]:
sorted(d.keys())

The method `.pop()` **returns *and* removes** the value of the key which you specify as the first argument. If your dictionary does not contain the key, then Python gives an error unless you specify a default value as the second argument.

In [None]:
d.pop('subject1')

In [None]:
d

In [None]:
d.pop('subject500') # Error

In [None]:
d.pop('subject500', 'noway')

In [None]:
d

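The method `.popitem()` returns and removes an item from your dictionary. Since Python 3.7 this is always the **most recently inserted** item (in older versions it was an arbitrary one).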

In [None]:
d.popitem()

In [None]:
d

The method `.setdefault()` gives the **value** for the **key** which you specify as an argument. If your dictionary does not contain the key, then a new item will be **assigned** to your dictionary with the value specified as the second argument (the default is `None`).

In [None]:
d.setdefault('subject500') # None

In [None]:
d

In [None]:
d.setdefault('subject501', 'William Wordsworth')

In [None]:
d

In [None]:
d.setdefault('subject501')
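A common use of `.setdefault()` is to collect several values under one key. The sketch below uses a hypothetical dictionary `groups` in which every value is a list. Because `.setdefault()` returns the value stored under the key, `.append()` can be called on it directly.

```python
# Hypothetical dictionary which groups words by their first letter
groups = {}

# The key 'a' does not exist yet: it is created with an empty list
groups.setdefault('a', []).append('apple')

# The key 'a' exists now: the existing list is returned and extended
groups.setdefault('a', []).append('avocado')

groups.setdefault('b', []).append('banana')

print(groups)   # {'a': ['apple', 'avocado'], 'b': ['banana']}
```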

The method `.update()` **adds** the **items** from the dictionary specified as the argument to your dictionary. Existing items with the same key are overwritten.

In [None]:
d.update({'subject1' : 'John', 'subject3' : 'Bart', 'subject501' : 'Samuel Taylor Coleridge',
          'object' : ['pineapple', 'peach', 'prune'], 'price' : 98765})
d

The method `.values()` gives a **view** of all **values** in your dictionary.

In [None]:
d.values()

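Again, you can sort these values (provided that they can be compared with one another, which is not the case for a mix of strings, numbers and lists), and you can also test for membership: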

In [None]:
'Samuel Taylor Coleridge' in d.values()

### Exercises
5. Create a dict `PythonBooks` with the following items:


  |         KEY               |              VALUE                    |
  |:-------------------------:|:-------------------------------------:|
  | McKinney, Wes             |   Python for Data Analysis            |
  | Matthes, Eric             |   Python Crash Course                 |
  | Saha, Amit                |   Doing Math with Python              |
  | Sweigart, Al              |   Automate Boring Stuff with Python   |
  | VanderPlas, Jake          |   Python Data Science Handbook        |
  | Venkitachalam, Mahesh     |   Python Playground                   |


  - Print the keys of `PythonBooks` in alphabetical order.
  - Print the values of `PythonBooks` in alphabetical order.
  - Print the items of `PythonBooks` in alphabetical order.


## Chapter 2: Programming tools

### 2.1 Functions

Because Python is a programming language, you can also create your own **functions** (next to Python's built-in functions). To recapitulate, a function is an operation which gives a certain output value on the basis of input values, e.g.:

```
f(x) = 3x + 1
```

In Python you can create this function with the `def` statement as follows:

In [None]:
def f(x):
    print(3*x + 1)


After the `def` you specify the **name** of the function (which is `f` in this case, but you have the freedom to choose), and **in-between brackets** you specify the **arguments** of your function (the argument is `x` here, but you can choose both the number of arguments as well as their names).

Immediately after the enclosing brackets there **has to be** a `:` and an **enter**, otherwise Python will give an error. On the next line(s) you give the actual **function definition** (i.e. what your function will do).

It is important that all the lines of the function definition are **indented**, otherwise Python will give an error. Many Python-related tools now indent automatically, otherwise you have to do it yourself. Indentation with a **tab** will work, but the convention nowadays is **four spaces** (or you can set your tab to be four spaces).

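Finally, when you type a function definition in the interactive Python interpreter, it has to end with an **empty line**, otherwise Python will give an error. (In a script or a notebook cell no empty line is required: the function definition **terminates** at the first line which is no longer indented.)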

If you have done that, then you can use the function to perform computations:

In [None]:
f(1)

In [None]:
f(2)

In [None]:
f(4)

In [None]:
f(-50)

In [None]:
f(7.5)

Functions are not restricted to numbers: you can define functions for any data type.

In [None]:
def g(x):
    print('Shall I compare thee to '+ x +'?')


In [None]:
g("a summer's day")

In [None]:
g("bananas")

You can assign **default values** to your **arguments** in the function definition. Such default values will then be used unless other values are explicitly specified (i.e. the default values are "overridden").

In [None]:
def h(x = "a winter's breeze"):
    print('Shall I compare thee to '+ x +'?')


In [None]:
h()

In [None]:
h('mangoes')

There are a few restrictions to be kept in mind. The first one is that the **order of the arguments** is important when many of them have default values. The second one is that you **cannot assign more than one value** to an argument. For the rest, there is great liberty, as is exemplified by the next function.

In [None]:
def hey(subject = 'John', object1 = 'Marc', object2 = "a winter's breeze"):
    print('Shall ' + subject + ' compare ' + object1 + ' to ' + object2 + '?')


In [None]:
hey()

In [None]:
hey(object1 = 'Bart')

In [None]:
hey(object1 = 'Bart', subject = 'you')

**But**:

In [None]:
hey('Bart', 'you')

In [None]:
hey('Bart', 'you', "a summer's day", object2 = 'mangoes') # Error
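The following sketch spells out how values are matched to arguments. The function `compare()` is a hypothetical copy of `hey()`, and the `try`/`except` construction (not covered in this class) is only used to catch the error so that the cell keeps running.

```python
def compare(subject='John', object1='Marc', object2="a winter's breeze"):
    print('Shall ' + subject + ' compare ' + object1 + ' to ' + object2 + '?')

# Values without a name are matched by position:
# 'Bart' becomes subject and 'you' becomes object1
compare('Bart', 'you')

# Values with a name can be given in any order
compare(object1='Bart', subject='you')

# The third position already fills object2, so naming it again is an error
try:
    compare('Bart', 'you', "a summer's day", object2='mangoes')
except TypeError as err:
    error_message = str(err)

print(error_message)
```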

The **result** of a function can also be **assigned** to an object. For that you need to replace the `print()` command by `return`:

In [None]:
def i(x = "a winter's night"):
    return 'Shall I compare thee to '+ x +'?'


In [None]:
y = h('mangoes')

In [None]:
y # (Nothing)

In [None]:
y = i('mangoes')

In [None]:
y

You can also use the object which you want to modify as one of your **arguments**. (Arguments without default values always have to appear before arguments with default values.)

In [None]:
def j(y, x = "a winter's night"):
    y.append('Shall I compare thee to '+ x +'?')


In [None]:
y=[]
j(y, 'mangoes')
j(y, 'bananas')
j(y, 'apples')
y

Next to the `def` statement you can also create (small) functions with a `lambda` expression: after the `lambda` keyword you specify the arguments and after a `:` you specify the computation. The body of a lambda has to be a single expression (which is why lambdas usually are quite simple functions):

In [None]:
j_alt = lambda y, x: y.append('Shall I compare thee to '+ x +'?')

In [None]:
j_alt(y, 'pears')
y

Lambda expressions are useful when you need to perform quick computations on the fly. Applications of `NumPy` and/or `pandas` contain many examples of lambda expressions.
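A typical "on the fly" use in plain Python is the `key` argument of the built-in function `sorted()`, which expects a function that computes the sorting criterion for each element. The names `f_alt`, `fruits`, `by_length` and `by_last_letter` are illustrative.

```python
# The function f(x) = 3x + 1 written as a lambda (it returns the result)
f_alt = lambda x: 3*x + 1
result = f_alt(2)       # 7

fruits = ['banana', 'fig', 'apple', 'kiwi']

# Sort by word length instead of alphabetically
by_length = sorted(fruits, key=lambda w: len(w))
# ['fig', 'kiwi', 'apple', 'banana']

# Sort by the last letter of every word
by_last_letter = sorted(fruits, key=lambda w: w[-1])
# ['banana', 'apple', 'fig', 'kiwi']
```

The lambda is created inside the call and never receives a name of its own, which is exactly the situation for which lambdas are intended.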

### 2.2 Control structures

Another programming feature of Python is its **control structures** `if`, `for` and `while`.

An `if` statement tests whether a certain condition is `True` before executing an operation.

In [None]:
x1 = 'bananas'
if len(x1) > 4:
    print(x1)


In [None]:
x2 = 'fish'
if len(x2) > 4:
    print(x2)

# (Nothing)

**Note** how the syntax is similar to the syntax of function definitions:

- The first line (with the actual control structure) ends on a `:` and an **enter**
- The second line (with the executable commands) is indented
- The block ends where the indentation ends (in the interactive interpreter this requires a blank line)

This syntax also holds for `for` and `while`.

The `if` statement can be combined with `elif` or `else` statements in order to control what happens if the condition is `False`. An `elif` statement tests for another condition; an `else` statement specifies what happens if all previous conditions are `False`. Hence, `elif` statements can appear multiple times, but an `else` statement can appear only once and has to come last. (MIND the `:`!)

In [None]:
x3 = 'apples'
if len(x3) > 8:
    print('nine or more')
elif len(x3) > 6:
    print('seven or more')
elif len(x3) > 4:
    print('five or more')
else:
    print('four or less')


In [None]:
x4 = 'fish'
if len(x4) > 4:
    print('more than four')
else:
    print('four or less')


A `for` statement loops over the individual elements in a sequence.

In [None]:
for x in ['apples', 'oranges', 'pears', 'fish']:
    print(x)


Sometimes you have to loop over the **indexes** of a sequence. To that end you can use the function `range()` which returns a (zero-based) sequence of integers up to (but not including) a certain value. That is to say, `range()` returns a so-called range object, which can be converted to a list.

In [None]:
range(5)

In [None]:
list(range(5))

Looping over the indexes can then be done by specifying the **length** of your sequence as the argument to `range()`. The previous `for` loop can therefore also be run as follows (the `for` statement iterates over the range object directly, so no conversion to a list is needed):

In [None]:
xList = ['apples', 'oranges', 'pears', 'fish']
for x in range(len(xList)):
    print(xList[x])


For completeness' sake, we mention that you can also specify `range()` with a **start value** as well as with an **increment**. It is even possible to work with **negative values**.

In [None]:
list(range(5,10))

In [None]:
list(range(0,10,2))

In [None]:
list(range(-10,0))

In [None]:
list(range(-10,0,2))

In [None]:
list(range(0,-10,-2))

Sometimes you need to loop over **both** the indexes and elements of a sequence. For that you can use the function `enumerate()` (which returns both indexes and elements):

In [None]:
for i, x in enumerate(xList):
    print(i, x)


Sometimes you need to loop over **two or more** sequences. Then you can use the function `zip()`:

In [None]:
for x, z in zip(xList, y): # z (not y) as loop variable, so that the list y is not overwritten
    print('Food in the morning: ' + x + ', fruit at 4pm: ' + z)


A `while` statement keeps on executing a certain operation as long as a certain condition is fulfilled.

In [None]:
while len(xList) > 0:
    print(xList.pop())


In other words, a `while` loop can be seen as a combination of a `for` loop and an `if` statement.

The logical tests (in `if` or `while`) make use of the following **comparison operators** (which are all specified with a left-hand side and a right-hand side):

| Operator   |  Meaning                       |
|-----------:|:-------------------------------|
|   `==`     |  is equal to                   |
|   `!=`     |  is not equal to               |
|   `<`      |  is smaller than               |
|   `<=`     |  is smaller than or equal to   |
|   `>`      |  is larger than                |
|   `>=`     |  is larger than or equal to    |
|   `is`     |  is the same object as         |
|   `is not` |  is not the same object as     |

The outcome of any logical test is either `True` or `False`.

In [None]:
x2 is x4

In [None]:
3 > 5

In [None]:
3 >= 5

In [None]:
'e' < 'o'

In [None]:
'e' >= 'o'

**Note** the difference between `==` (logical comparison) and `=` (assignment)!
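The difference between `==` (equal values) and `is` (the very same object) shows up most clearly with mutable objects such as lists. A minimal sketch with the illustrative names `list_a`, `list_b` and `list_c`:

```python
list_a = [1, 2, 3]
list_b = [1, 2, 3]      # a second list with the same elements
list_c = list_a         # a second NAME for the first list

equal_values = list_a == list_b     # True: same elements
same_object = list_a is list_b      # False: two separate lists
same_name = list_a is list_c        # True: one list with two names

# Because list_a and list_c are the same object, a change via one name
# is visible via the other name as well
list_c.append(4)
print(list_a)   # [1, 2, 3, 4]
print(list_b)   # [1, 2, 3]
```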

Logical tests can be combined to complex expressions by means of the **Boolean operators** `and`, `or` and `not`. In the following descriptions, the symbols `P` and `Q` stand for any logical expression.

The Boolean operator `and` returns `True` if **both** logical expressions are `True` and returns `False` otherwise:

 |  `P`    |  `Q`    | `P and Q` |
 |:-------:|:-------:|:---------:|
 | `True`  | `True`  | `True`    |
 | `True`  | `False` | `False`   |
 | `False` | `True`  | `False`   |
 | `False` | `False` | `False`   |

In [None]:
3 < 4 and 'c' < 'e'

In [None]:
3 < 4 and 'c' >= 'e'

The Boolean operator `or` returns `True` if **at least one** of the logical expressions is `True` and returns `False` otherwise:

 |  `P`    |  `Q`    | `P or Q`  |
 |:-------:|:-------:|:---------:|
 | `True`  | `True`  | `True`    |
 | `True`  | `False` | `True`    |
 | `False` | `True`  | `True`    |
 | `False` | `False` | `False`   |

In [None]:
3 > 4 or 'c' < 'e'

In [None]:
3 > 4 or 'c' >= 'e'

The Boolean operator `not` returns `True` if the logical expression is `False` and returns `False` otherwise:

 |   `P`    | `not P`   |
 |:--------:|:---------:|
 |  `True`  |  `False`  |
 |  `False` |  `True`   |

In [None]:
not 3 < 4

In [None]:
not 3 >= 4

A **convenient** feature in Python is that logical expressions are processed in **consecutive order** (from left to right). That means that **superfluous** expressions are **not evaluated**, e.g.:

- If the left-hand side of `and` is `False`, then the result is always `False` no matter what the right-hand side is.
- If the left-hand side of `or` is `True`, then the result is always `True` no matter what the right-hand side is.

In [None]:
x5 == 'hoi' # Error

In [None]:
if 3 > 4 and x5 == 'hoi':
    print('hello')

# No error

In [None]:
if 3 < 4 or x5 == 'hoi':
    print('hello')

# No error
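In practice this left-to-right evaluation is often used as a "guard": the left-hand side checks whether it is safe to evaluate the right-hand side. The function `first_is_positive()` below is a hypothetical example.

```python
def first_is_positive(numbers):
    # numbers[0] would give an error for an empty list, but it is only
    # evaluated if the length test on the left-hand side is True
    return len(numbers) > 0 and numbers[0] > 0

print(first_is_positive([3, -1]))   # True
print(first_is_positive([-2, 5]))   # False
print(first_is_positive([]))        # False, and no error
```

Swapping the two sides of `and` would make the last call fail, because the indexing would then be evaluated first.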

There are also some special cases of data types which Python considers as equivalent to `False`:

- `None`
- `0` (i.e. the number zero, also in the floating-point version: `0.0`)
- `''` (i.e. the empty string)
- `[]` (i.e. the empty list)
- `()` (i.e. the empty tuple)
- `{}` (i.e. the empty dictionary)

To illustrate this, we ask for the "truth value" of any object in Python with the built-in function `bool()`.

In [None]:
bool(5)

In [None]:
bool('z')

In [None]:
bool([3, 'hello'])

In [None]:
bool()

In [None]:
bool(None)

In [None]:
bool(0)

In [None]:
bool(0.0)

In [None]:
bool('')

In [None]:
bool([])

In [None]:
bool({})

As a consequence, you can use **short-hand** notations for logical tests:

In [None]:
xCopy = ['apples', 'oranges', 'pears', 'fish']
while xCopy: # becomes False when xCopy becomes the empty list [].
    print(xCopy.pop())


The full extent of Python's programming possibilities resides in **combining** control structures with function definitions. As an example, we will count how many times each element in a list occurs and store the results in a dictionary. In order to do that, we first define a function which updates the (frequency) values for the keys:

In [None]:
def updateFreq(d, k):
    if k in d.keys():
        d[k] += 1
    else:
        d[k] = 1


The symbol `+=` is a **short-hand** operator: `x += 1` means the same as `x = x + 1`. This means that the value of an existing key gets incremented by 1. Any new key is created with value 1:

In [None]:
vbD = {}

In [None]:
updateFreq(vbD, 'Marc')

In [None]:
vbD

In [None]:
updateFreq(vbD, 'Marc')
vbD

In [None]:
updateFreq(vbD, 'Bart')
vbD

In [None]:
updateFreq(vbD, 'Bart')
updateFreq(vbD, 'John')
vbD

**Note the indentation**: the `if` statement is itself indented, so any conditional command has to be indented twice. Similarly, more complex (nested) control structures lead to further indentation. The **indentation has meaning in Python**: it is the way to control for the "**level**" of your computations.
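As an aside, the same update can be written without an `if` statement by means of the dictionary method `.get()`, which returns a default value for missing keys. This is a sketch of an alternative, not a replacement for `updateFreq()`; the names `update_freq_get` and `get_demo` are hypothetical.

```python
def update_freq_get(d, k):
    # d.get(k, 0) is the current count of k, or 0 if k is not yet a key
    d[k] = d.get(k, 0) + 1

get_demo = {}
update_freq_get(get_demo, 'Marc')
update_freq_get(get_demo, 'Marc')
update_freq_get(get_demo, 'Bart')
print(get_demo)     # {'Marc': 2, 'Bart': 1}
```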

The next step is to extend our function in order to loop over the elements in a sequence:

In [None]:
def countFreqs(d, l):
    for i in l:
        if i in d.keys():
            d[i] += 1
        else:
            d[i] = 1


Note that **within** the `for` loop the code remains the same as in our function `updateFreq()`. Because we have already defined that function, we can also use it in the definition of `countFreqs()`:

In [None]:
def countFreqs_alt(d, l):
    for i in l:
        updateFreq(d, i)


We create an example list in order to show some results.

In [None]:
vbL = ['banana', 'apple', 'pear', 'apple', 'banana', 'apple', 'prune', 'pear', 'apple', 'peach', 'pear']

In [None]:
vbD = {}
countFreqs(vbD, vbL)
vbD

In [None]:
vbD_alt = {}
countFreqs_alt(vbD_alt, vbL)
vbD_alt

In [None]:
vb2 = [7, 8, 9, 2, 10, 9, 9, 9, 9, 4, 5, 6, 1, 5, 6, 7, 8, 6, 1, 10]

In [None]:
vbD = {}
countFreqs(vbD, vb2)
vbD

In [None]:
vbD = {}
countFreqs_alt(vbD, vb2)
vbD
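Counting frequencies is so common that Python's standard library offers a ready-made tool for it: the class `Counter` in the module `collections`. It is not part of this class, but it is a convenient way to check the results of your own functions. The names `fruit_list` and `fruit_counts` are illustrative.

```python
from collections import Counter

fruit_list = ['banana', 'apple', 'pear', 'apple', 'banana', 'apple',
              'prune', 'pear', 'apple', 'peach', 'pear']

# A Counter is a dictionary whose values are the frequencies of the keys
fruit_counts = Counter(fruit_list)

print(fruit_counts['apple'])    # 4
print(fruit_counts['mango'])    # 0: missing keys count as zero
print(dict(fruit_counts))       # conversion to an ordinary dictionary
```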

An **extra example** is counting the words in a string. In order to do that we need to split the string into a list. We also make use of string methods to "clean up" the string:

- Converting to lowercase
- Removing punctuation

Incidentally, we do not specify the output dictionary as an argument, but we let the function produce it itself:

In [None]:
def countWords(s):
    s = s.lower()
    l = s.split(' ')
    outD = {}
    for i in l:
        i = i.strip('.,:;?!')
        if i in outD.keys():
            outD[i] += 1
        else:
            outD[i] = 1
    return outD

vbS = 'A rose is a rose is a rose.'
countWords(vbS)
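One detail of `countWords()` deserves attention: `s.split(' ')` splits on every single space, so two consecutive spaces produce an empty string, which would then be counted as a "word". Calling `.split()` without an argument splits on any run of whitespace instead. A small sketch with the hypothetical string `spaced`:

```python
spaced = 'A rose  is a rose.'   # note the double space after the first 'rose'

on_single_space = spaced.split(' ')
# ['A', 'rose', '', 'is', 'a', 'rose.']

on_whitespace = spaced.split()
# ['A', 'rose', 'is', 'a', 'rose.']
```

For the example string `vbS` both variants give the same result, because it contains no consecutive spaces.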

### Exercises
6. Define the following functions:
  - A function `triLinear(a, b, c)` which gives the result `3a + 5b + 4c + 11`. The results should be able to be stored in an object.
  - A function `capRepeat(l, s, x)` which repeats a string `s` a number of `x` times in capital letters and adds that to a list `l`. The default number of repetitions should be five.
  - A function `myAverage(l)` which computes the mean of list `l` of numbers, i.e. the sum of the numbers in `l` divided by the size. The results should be able to be stored in an object. Apply the function to the list `vb2`.
  - **(Optional:)** A function `myVariance(l)` which computes the variance of `l`, i.e. the sum of the squared differences to the mean divided by size minus one. The results should be able to be stored in an object. Apply the function to the list `vb2`.
  - **(Optional:)** A function `myMode(l)` which computes the mode of `l`, i.e. the most common element in `l`. Use the function `countFreqs()` for that. The results should be able to be stored in an object. Apply the function to the list `vb2`.