# Python data types

Python has quite a number of built-in data types, i.e., data types that are part of the core language.  In addition, the standard library implements many more data types, many of which simplify your task as a developer. We will not cover every method defined for these data structures, only those that are most commonly used. As always, it is a good idea to read through the [documentation](https://docs.python.org/3/library/index.html) for more information.

For this tutorial, we will use a very simple data file that allows us to illustrate use cases for Python's data types. The data file is a text file with three columns. It represents patients by listing their first name, their age, and their weight. The data types are `str`, `int`, and `float` respectively. Columns are separated by spaces.

In [1]:
!cat Data/patients.txt

name age weight
ivy 35 59.9
bob 33 72.4
gitta 27 56.3
carol 33 58.7
elly 27 61.3
alice 35 63.2
freddy 33 78.9
harry 41 65.3
darren 35 68.5
john 40 67.1

To ensure that all cells in this notebook can be evaluated without errors, we will use `try`-`except` to catch exceptions. To show the actual errors, we need to print the traceback, hence the following import.

In [2]:
from traceback import print_exc

## Tuples

It would be convenient to have a function that reads through the file, and returns the patients one at a time.  But how to represent a patient so that it can be returned as a value from a function?  We can represent a patient as a tuple with three fields. The first represents the patient's name, the second her age, the third her weight.

In [3]:
patient = ('suzan', 15, 48.9)

The type of this data structure is `tuple`, and we can access the values of its fields by 0-based index.  The number of fields can be determined using the `len` function.

In [4]:
type(patient)

tuple

In [5]:
patient[0]

'suzan'

In [6]:
len(patient)

3

Similar to a `str` in Python, a `tuple` is immutable, i.e., it cannot be modified after its creation. So no birthdays for our patients.

In [7]:
try:
    patient[1] = 16
except Exception:
    print_exc()

Traceback (most recent call last):
  File "<ipython-input-7-ad58dd2021f1>", line 2, in <module>
    patient[1] = 16
TypeError: 'tuple' object does not support item assignment


A `tuple` seems a reasonable data type to represent our patients, so let's proceed with the function to read them.

In [8]:
def read_patients(file_name):
    '''
    Generator function that yields the patients in a file.
        file_name: name of the data file
        returns tuples representing the patients
    '''
    with open(file_name, 'r') as patients:
        # read but ignore the column names
        _ = patients.readline()
        # iterate over all the patient data
        for line in patients:
            # strip the line endings, and split on whitespace
            data = line.rstrip().split()
            # turn this function into a generator by yielding a tuple
            # representing a patient; for convenience, do appropriate
            # type conversion
            yield (str(data[0]), int(data[1]), float(data[2]))

The `yield` statement suspends the execution of the function and hands a value to the caller. When the next value is requested, e.g., in the next iteration of a `for`-loop, execution resumes from that same point in the function, with its state retained.
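
This suspend-and-resume behaviour can be observed directly with the built-in `next` function. The following is only a minimal sketch; the generator `countdown` is a hypothetical example and not part of the patient code.

```python
def countdown(start):
    '''
    Hypothetical generator function that yields start, start - 1, ..., 1.
    '''
    current = start
    while current > 0:
        # execution is suspended here until the next value is requested
        yield current
        current -= 1

counter = countdown(3)      # creates a generator object, no code runs yet
first = next(counter)       # runs up to the first yield
second = next(counter)      # resumes after the yield, state is retained
remaining = list(counter)   # consumes whatever is left
print(first, second, remaining)
```

A `for`-loop over a generator does the same thing behind the scenes: it requests values until the generator is exhausted.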

Let's test the function by simply printing the results.

In [9]:
for patient in read_patients('Data/patients.txt'):
    print(patient)

('ivy', 35, 59.9)
('bob', 33, 72.4)
('gitta', 27, 56.3)
('carol', 33, 58.7)
('elly', 27, 61.3)
('alice', 35, 63.2)
('freddy', 33, 78.9)
('harry', 41, 65.3)
('darren', 35, 68.5)
('john', 40, 67.1)


That seems to work quite well.  Let's proceed with some data analytics.

## Named tuples

Although this implementation works, it has a drawback. We have to remember that `patient[0]` is the name, `patient[1]` the age, and `patient[2]` the weight of the patient.  This is error-prone, and may give rise to subtle bugs.  It would be much more convenient if we could access the `tuple`'s fields by name, rather than by index.  Python's standard library implements a convenient way to define named tuples through `typing.NamedTuple`.

In [10]:
from typing import NamedTuple

class Patient(NamedTuple):
    name: str
    age: int
    weight: float

We will cover defining our own classes elsewhere, but this is quite straightforward. Note that the class syntax for `typing.NamedTuple` shown here requires Python 3.6 or later; for earlier versions of Python, you can use the somewhat more cumbersome `collections.namedtuple`.  A named tuple of type `Patient` has three fields, `name`, `age`, and `weight`, which we declare using type hints, `str`, `int`, and `float` respectively.  Let's redefine the tuple representing Suzan.

In [11]:
patient = Patient('suzan', 15, 48.9)

The type of this data structure is `Patient`, and we can access the values of its fields by 0-based index, but also by name.  The number of fields can be determined using the `len` function.

In [12]:
type(patient)

__main__.Patient

In [13]:
patient[0], patient.name

('suzan', 'suzan')

In [14]:
len(patient)

3

Similar to Python's built-in `tuple`, a named tuple is immutable, i.e., it cannot be modified after its creation.

In [15]:
try:
    patient.age = 16
except Exception:
    print_exc()

Traceback (most recent call last):
  File "<ipython-input-15-5a8845bc5820>", line 2, in <module>
    patient.age = 16
AttributeError: can't set attribute
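
Since a named tuple cannot be modified in place, a changed copy has to be created instead. The sketch below uses the `_replace` method that named tuples provide for this purpose; the `Patient` class is repeated only to keep the example self-contained.

```python
from typing import NamedTuple

class Patient(NamedTuple):
    name: str
    age: int
    weight: float

patient = Patient('suzan', 15, 48.9)
# _replace returns a new named tuple, the original is left untouched
older_patient = patient._replace(age=16)
print(patient.age, older_patient.age)
```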


Modifying the function `read_patients` to return named tuples is trivial.

In [16]:
def read_patients(file_name):
    '''
    Generator function that yields the patients in a file.
        file_name: name of the data file
        returns named tuples representing the patients
    '''
    with open(file_name, 'r') as patients:
        # read but ignore the column names
        _ = patients.readline()
        # iterate over all the patient data
        for line in patients:
            # strip the line endings, and split on whitespace
            data = line.rstrip().split()
            # turn this function into a generator by yielding a named tuple
            # representing a patient; for convenience, do appropriate
            # type conversion
            yield Patient(str(data[0]), int(data[1]), float(data[2]))

To illustrate, let's read the patients, but only print their names.

In [17]:
for patient in read_patients('Data/patients.txt'):
    print(patient.name)

ivy
bob
gitta
carol
elly
alice
freddy
harry
darren
john


## Lists

A `list` is a very convenient data type, representing an ordered sequence.  We can for instance create a list of names of our patients.

In [18]:
patient_names = list()
for patient in read_patients('Data/patients.txt'):
    patient_names.append(patient.name)

We start off with an empty list, created using the `list()` function, and we append the name of each patient to it. The resulting list contains the patient names, in the order they have been read from the file, and added to the list `patient_names`.

In [19]:
patient_names

['ivy',
 'bob',
 'gitta',
 'carol',
 'elly',
 'alice',
 'freddy',
 'harry',
 'darren',
 'john']

List elements can be accessed by 0-based index, and the length of a list can be obtained using the `len` function.

In [20]:
patient_names[1]

'bob'

In [21]:
len(patient_names)

10

We can iterate over the elements of a `list` using a `for`-loop.

In [22]:
for patient_name in patient_names:
    print(patient_name.capitalize())

Ivy
Bob
Gitta
Carol
Elly
Alice
Freddy
Harry
Darren
John


In contrast to a `tuple`, the elements of a list can be modified. For instance, if we want to change `'bob'` to `'robert'`, we can just do that.

In [23]:
patient_names[1] = 'robert'

Surreptitiously, we have already used a `list` when we used the `split` method on the lines of our data file. The inverse operation, converting a `list` of `str` elements into a single `str`, can be accomplished using the `join` method.

In [24]:
', '.join(patient_names)

'ivy, robert, gitta, carol, elly, alice, freddy, harry, darren, john'

Note that `join` is a method defined on a `str`, the separator, and that its argument must be an iterable over `str` values.
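
Because `join` only accepts `str` values, a list of numbers has to be converted first. A minimal sketch, using a hypothetical list `ages`:

```python
ages = [35, 33, 27]
try:
    ', '.join(ages)
except TypeError as error:
    # join refuses elements that are not of type str
    print(f'TypeError: {error}')

# convert each element to str before joining
ages_str = ', '.join(str(age) for age in ages)
print(ages_str)
```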

Indices can be negative: the element at index `-1` is the last element of the list, `-2` the second to last, and so on. Hence, if the list has $n$ elements, legal index values run from $-n$ to $n - 1$. Although that is a nice shortcut, it is also a good way to introduce subtle bugs in your code, since an index that accidentally becomes negative does not necessarily raise an error.

In [25]:
patient_names[-1] == patient_names[len(patient_names) - 1]

True

In [26]:
patient_names[0] == patient_names[-len(patient_names)]

True
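
An index outside the range from $-n$ to $n - 1$ raises an `IndexError`. The sketch below checks the boundary values on a short hypothetical list `names`, rather than on `patient_names`.

```python
names = ['ivy', 'bob', 'gitta']
n = len(names)

in_range = {}
for index in (-n, n - 1, -n - 1, n):
    try:
        names[index]
        in_range[index] = True
    except IndexError:
        # the index lies outside -n ... n - 1
        in_range[index] = False
print(in_range)
```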

### Modifying lists

Whereas the number of fields of a `tuple` is always the same, elements can be added to or removed from a list at any time. Besides the `append` method we have already used to build the list, there is the `insert` method to add elements anywhere in the list, and the `pop` method to remove elements.

In [27]:
patient_names.insert(2, 'kathy')

In [28]:
patient_names.pop(5)

'elly'

Our `patient_names` list now has `'kathy'` as the third element, while `'elly'` is no longer an element.

In [29]:
', '.join(patient_names)

'ivy, robert, kathy, gitta, carol, alice, freddy, harry, darren, john'

Using the `pop()` method without an index will remove and return the list's last element, `'john'` in this case.

In [30]:
patient_names.pop()

'john'

Checking list membership is easy using the `in` operator, for instance, `'robert'` is in our list, while `'mary'` isn't.

In [31]:
'robert' in patient_names

True

In [32]:
'mary' in patient_names

False

The `index` method returns the index at which an element first occurs in a list, and raises an exception otherwise.

In [33]:
patient_names.index('robert')

1

In [34]:
try:
    patient_names.index('mary')
except Exception:
    print_exc()

Traceback (most recent call last):
  File "<ipython-input-34-39cd85352f6e>", line 2, in <module>
    patient_names.index('mary')
ValueError: 'mary' is not in list


Note that the same element can occur multiple times in a list, for instance, we can add another `'alice'` at the end of the list.

In [35]:
patient_names.append('alice')

The `index` method will only return the index of the first occurrence.

In [36]:
patient_names.index('alice')

5

The `pop` method removes an element at a given index, whereas the `remove` method removes the first occurrence of a given value. If we remove `'alice'` from the list, only the `'alice'` at the end of the list will remain.  Removing a value that doesn't occur in the list will raise a `ValueError`.

In [37]:
', '.join(patient_names)

'ivy, robert, kathy, gitta, carol, alice, freddy, harry, darren, alice'

In [38]:
patient_names.remove('alice')

In [39]:
', '.join(patient_names)

'ivy, robert, kathy, gitta, carol, freddy, harry, darren, alice'
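
The exception raised when removing a missing value is not shown above. The following sketch illustrates it on a short hypothetical list `names`, together with one way to avoid it.

```python
names = ['robert', 'kathy', 'alice']
try:
    names.remove('mary')
except ValueError as error:
    # 'mary' does not occur in the list
    print(f'ValueError: {error}')

# checking membership first avoids the exception
if 'mary' in names:
    names.remove('mary')
print(names)
```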

### Aliasing versus copying

It may be a bit counter-intuitive, but assigning a list to a new variable doesn't copy the list. The new variable refers to the same list as the original one. We assign `patient_names` to another variable, remove the first element, and check the value of both variables.

In [40]:
other_patient_names = patient_names

In [41]:
other_patient_names.pop(0)

'ivy'

In [42]:
', '.join(other_patient_names)

'robert, kathy, gitta, carol, freddy, harry, darren, alice'

In [43]:
', '.join(patient_names)

'robert, kathy, gitta, carol, freddy, harry, darren, alice'

The `copy` method will create an actual copy of the original list.

In [44]:
other_patient_names = patient_names.copy()

In [45]:
other_patient_names.pop()

'alice'

In [46]:
', '.join(other_patient_names)

'robert, kathy, gitta, carol, freddy, harry, darren'

In [47]:
', '.join(patient_names)

'robert, kathy, gitta, carol, freddy, harry, darren, alice'
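
Whether two variables refer to the same list can be checked with the `is` operator, which compares identity, whereas `==` compares the elements. A small sketch with a hypothetical list `names`:

```python
names = ['robert', 'kathy', 'gitta']
alias = names               # second name for the same list object
duplicate = names.copy()    # new list with the same elements

same_object = alias is names
copy_is_same_object = duplicate is names
copy_is_equal = duplicate == names
print(same_object, copy_is_same_object, copy_is_equal)
```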

### Slicing

We can create sublists out of a list by "slicing", i.e., indexing by a start index, an end index, and, optionally, a stride.  The element at the end index is not included in the slice.  For instance, we could select the first three elements of the list, or every other element from the second up to and including the sixth.

In [48]:
patient_names[0:3]

['robert', 'kathy', 'gitta']

In [49]:
patient_names[1:6:2]

['kathy', 'carol', 'harry']

Both the start and the end index can be left out; the slice then starts from the beginning, or runs up to the end of the list, respectively. Combined with negative indices, this is quite expressive. Getting the first or the last three elements of a list is trivial that way.

In [50]:
patient_names[:3]

['robert', 'kathy', 'gitta']

In [51]:
patient_names[-3:]

['harry', 'darren', 'alice']

Leaving out both start and end index, and using a stride of -1 is a neat trick to reverse a list.

In [52]:
patient_names[::-1]

['alice', 'darren', 'harry', 'freddy', 'carol', 'gitta', 'kathy', 'robert']

Note that slicing creates a new list, but that it is a shallow copy. This is important when list elements are mutable.
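
What "shallow" means is easiest to see with a list of lists. The sketch below uses hypothetical `[age, weight]` pairs, and `copy.deepcopy` from the standard library as one way to obtain a fully independent copy.

```python
from copy import deepcopy

measurements = [[35, 59.9], [33, 72.4]]   # hypothetical [age, weight] pairs
sliced = measurements[:]

# the slice is a new list, so appending doesn't affect the original...
sliced.append([27, 56.3])
print(len(measurements), len(sliced))

# ...but its elements are the same objects as those in the original
sliced[0][0] = 36
print(measurements[0])

# deepcopy copies the nested lists as well
independent = deepcopy(measurements)
independent[0][0] = 99
print(measurements[0])
```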

### Creating lists

The simplest way to construct a list is by a literal enumeration of its elements.

In [53]:
first_names = ['peter', 'sally', 'vaughan', 'sophie', 'patrick']

However, new lists can be constructed from iterables by comprehension. For instance, a list of capitalized names can be constructed.

In [54]:
[name.capitalize() for name in first_names]

['Peter', 'Sally', 'Vaughan', 'Sophie', 'Patrick']

It is also possible to select only certain elements for the new list by adding an `if` clause to the comprehension.

In [55]:
[name.capitalize() for name in first_names if name.startswith('p')]

['Peter', 'Patrick']

Returning to our running example, we can combine this to select the names of the patients that are older than 35.

In [56]:
[patient.name for patient in read_patients('Data/patients.txt') if patient.age > 35]

['harry', 'john']

Using the `list` function, we can also create a list of the patients in our data file using the generator.

In [57]:
patients = list(read_patients('Data/patients.txt'))

In [58]:
patients[0].age

35

## Sets

Which ages are represented in our group of patients? When we ask this question, we actually ask for a mathematical set containing the ages, in which each element occurs only once. We could accomplish this using a `list` as well, but that would be pretty cumbersome.

In [59]:
age_list = list()
for patient in read_patients('Data/patients.txt'):
    if patient.age not in age_list:
        age_list.append(patient.age)

In [60]:
age_list

[35, 33, 27, 41, 40]

Using Python's built-in `set` type, this task is not only easier, but the data structure represents the mathematical concept we actually have in mind.

In [61]:
age_set = set()
for patient in read_patients('Data/patients.txt'):
    age_set.add(patient.age)

In [62]:
age_set

{27, 33, 35, 40, 41}

This is even simpler using a set comprehension, similar to the list comprehension we've encountered before.

In [87]:
age_set = {patient.age for patient in read_patients('Data/patients.txt')}

In [64]:
age_set

{27, 33, 35, 40, 41}

The number of elements of a `set` is obtained using the `len` function, and we can test membership using the `in` operator.

In [65]:
len(age_set)

5

In [66]:
40 in age_set

True

In [67]:
53 in age_set

False

As opposed to a `list`, a `set` is not a sequence, i.e., it is not ordered, it has no first, second, or last element. This is of course the same for a mathematical set, which is obviously no coincidence.

Iterating over the elements of a `set` is done using a `for`-loop.

In [88]:
for age in age_set:
    print(age)

33
35
40
41
27
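
Since the iteration order of a `set` should not be relied upon, the built-in `sorted` function can be used when a reproducible order is needed. A minimal sketch, with the set of ages written out as a literal:

```python
age_set = {35, 33, 27, 41, 40}
# sorted returns a new list, the set itself remains unordered
sorted_ages = sorted(age_set)
for age in sorted_ages:
    print(age)
```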


### Modifying sets

The `add` method adds an element to an existing set. To remove and return an element, we can use the `pop()` method.

In [68]:
age_set.pop()

33

Note that the element that is removed is arbitrary (implementation dependent, to be precise). To remove a specific element from the set, you can use either the `remove` or the `discard` method.

In [70]:
age_set.remove(41)

In [71]:
try:
    age_set.remove(41)
except Exception:
    print_exc()

Traceback (most recent call last):
  File "<ipython-input-71-c2dc3b58f16a>", line 2, in <module>
    age_set.remove(41)
KeyError: 41


The `discard` method will not raise an exception when we try to remove an element that is not in the set. To remove all elements from a `set`, we can use the `clear` method.

In [72]:
age_set.clear()

In [73]:
len(age_set)

0
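
The `discard` method mentioned above is not demonstrated in the cells. A short sketch, using a small hypothetical set of ages:

```python
age_set = {27, 35, 40}
age_set.discard(35)   # removes 35
age_set.discard(41)   # 41 is not in the set, nothing happens
print(len(age_set))
```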

### Set operations

All the mathematical operations you would expect on sets are defined, e.g., union (`|`), intersection (`&`), difference (`-`), and symmetric difference (`^`).

In [77]:
set1 = {1, 2, 3, 4, 6, 12}
set2 = {1, 3, 5, 15}

In [78]:
set1 | set2

{1, 2, 3, 4, 5, 6, 12, 15}

In [79]:
set1 & set2

{1, 3}

In [80]:
set1 - set2

{2, 4, 6, 12}

In [82]:
set2 - set1

{5, 15}

In [81]:
set1 ^ set2

{2, 4, 5, 6, 12, 15}

All these operators create a new set. Methods that perform the operations in place are implemented as well, mainly for performance reasons.  For example, `difference_update` applied to `set1` modifies that set.

In [83]:
set1.difference_update(set2)

In [84]:
set1

{2, 4, 6, 12}
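
The other operations have in-place counterparts as well: `update` for the union, `intersection_update`, and `symmetric_difference_update`. The sketch below applies each to a copy of a hypothetical set `divisors`, so that the results can be compared with those of the operators.

```python
divisors = {1, 2, 3, 4, 6, 12}
odd_numbers = {1, 3, 5, 15}

union = divisors.copy()
union.update(odd_numbers)                        # like divisors | odd_numbers

common = divisors.copy()
common.intersection_update(odd_numbers)          # like divisors & odd_numbers

either = divisors.copy()
either.symmetric_difference_update(odd_numbers)  # like divisors ^ odd_numbers

print(sorted(union), sorted(common), sorted(either))
```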

Three Boolean methods are available as well,
  * `s1.isdisjoint(s2)` will test whether the sets are disjoint,
  * `s1.issubset(s2)` will check whether `s1` is a subset of `s2`, and
  * `s1.issuperset(s2)` checks whether `s1` is a superset of `s2`.
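
These three methods can be illustrated as follows; the sets `s1`, `s2`, and `s3` are hypothetical values chosen for the example.

```python
s1 = {1, 3}
s2 = {1, 2, 3, 4, 6, 12}
s3 = {5, 15}

disjoint = s1.isdisjoint(s3)      # no elements in common
subset = s1.issubset(s2)          # every element of s1 is in s2
superset = s2.issuperset(s1)      # s2 contains every element of s1
not_subset = s2.issubset(s1)      # s2 has elements that s1 lacks
print(disjoint, subset, superset, not_subset)
```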

## Dicts