# Day 1: Sets and Dictionaries
Lukas Jarosch

## Sets
Sets are a special type of data that contain only **unique** and **immutable** objects. They are the Python implementation of mathematical sets (German: "Mengen"). Sets differ from lists and tuples because they are **unordered**: in lists and tuples the order of the contained items matters and you can access items by their position, which is not possible for sets.

Sets are usually used to find unique values in a list/tuple/string/... or determine overlaps between different collections of unique values.

### Creating sets
We can define sets with curly brackets `{}` or the `set()` function.

In [13]:
# set definition
set_1 = {1, 3, 2}
set_2 = set(("A", "C", "B", 2, 3, 1))

# sets do not keep order information
print(set_1, set_2, sep = "\n")

{1, 2, 3}
{1, 2, 3, 'B', 'C', 'A'}
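
Because sets are unordered, they do not support positional indexing. The following is a minimal sketch of what happens when you try, and one possible way around it. The variable names are only illustrative.

```python
numbers = {1, 3, 2}

# sets have no positions, so indexing raises a TypeError
try:
    numbers[0]
except TypeError as error:
    print("TypeError:", error)

# if you need positions, convert the set into a sorted list first
sorted_numbers = sorted(numbers)
print(sorted_numbers[0])
```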


In [14]:
# sets can only contain immutable objects, so nested lists are not allowed because lists are mutable objects
set_3 = {1, 2, 3, [3, 4]}

TypeError: unhashable type: 'list'

In [None]:
# instead of a list, we can use a nested tuple
set_3 = {1, 2, 3, (3, 4)}
print(set_3)

{1, 2, 3, (3, 4)}


In [None]:
# sets do not contain duplicates
set_4 = {1, 1, 1, 1, 2, 2, 3}

# we can easily convert strings or lists into sets
lst = ["A", "A", 1, 2, 2]
set_5 = set(lst)
set_6 = set("ATTGCTT")

print(set_4, set_5, set_6, sep = "\n")

{1, 2, 3}
{1, 2, 'A'}
{'A', 'T', 'C', 'G'}
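
Since a set drops duplicates, comparing its length with the length of the original list is a quick way to check whether the list contained repeated values. A small sketch with made-up example data (`gene_ids` is hypothetical):

```python
gene_ids = ["geneA", "geneB", "geneA", "geneC"]  # hypothetical example data

unique_ids = set(gene_ids)

# if the set is shorter than the list, the list contained duplicates
has_duplicates = len(unique_ids) < len(gene_ids)
print(has_duplicates)

# number of distinct values
print(len(unique_ids))
```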


### Specific methods and functions
Python has many different functions for dealing with sets. Most of them are methods that you call with the syntax `myset.method(argument)`, analogous to list methods. The most important ones are listed below:

* `.add()`: add an element to the set
* `.copy()`: return a copy of the set
* `.difference()`: return set of elements that are in the first set but not in the second set (the order matters here!)
* `.symmetric_difference()`: return elements that occur in only one of both sets (the order doesn't matter here)
* `.intersection()`: return the intersection between two sets
* `.union()`: return the combination of two sets as one new set
* item `in` set: returns `True` if an item is in the set, `False` if it is not
* `len()`: length of the set

Again, we will provide a few code examples for this:

In [6]:
sequence_1 = "TTTTAAAA"
set_1 = set(sequence_1)

sequence_2 = "GCGGGCCTT"
set_2 = set(sequence_2)

print(set_1, set_2, sep="\n")

# add an element to the set
set_3 = set_1.copy()
set_3.add("C")
print(set_3)

{'A', 'T'}
{'G', 'C', 'T'}
{'A', 'C', 'T'}


In [17]:
organisms = {"zebrafish", "fruit fly", "human", "mouse"}
mammals = {"human", "rat", "mouse", "cow", "chimpanzee"}

##########################
## differences between "organisms" and "mammals", note the difference between .difference()
## and .symmetric_difference()

# elements that are in "organisms" but not in "mammals"
print(organisms.difference(mammals))

#  elements that are in "mammals" but not in "organisms"
print(mammals.difference(organisms))

# elements that only occur in one of both sets
print(mammals.symmetric_difference(organisms))

##########################

# intersection between "organisms" and "mammals"
print(organisms.intersection(mammals))

# elements from both sets combined
print(mammals.union(organisms))

# check if item is in set
print("zebrafish" in mammals)
print("zebrafish" in organisms)

# length of the set
print(len(organisms))


{'fruit fly', 'zebrafish'}
{'rat', 'chimpanzee', 'cow'}
{'fruit fly', 'rat', 'zebrafish', 'cow', 'chimpanzee'}
{'human', 'mouse'}
{'fruit fly', 'rat', 'mouse', 'zebrafish', 'cow', 'human', 'chimpanzee'}
False
True
4
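
The same set operations can also be written with operators instead of method calls. The sketch below reuses the two example sets and simply checks that both spellings give the same result. Which style you use is a matter of taste.

```python
organisms = {"zebrafish", "fruit fly", "human", "mouse"}
mammals = {"human", "rat", "mouse", "cow", "chimpanzee"}

# each operator corresponds to one of the methods shown above
print((organisms - mammals) == organisms.difference(mammals))
print((organisms ^ mammals) == organisms.symmetric_difference(mammals))
print((organisms & mammals) == organisms.intersection(mammals))
print((organisms | mammals) == organisms.union(mammals))
```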


## Dictionaries
Like lists, dictionaries can store any kind of data type. However, instead of accessing the stored objects via their index as you would do in lists, dictionaries give a specific **key** to each stored object, which you can use to access the underlying values. In theory, you can use immutable data types like `tuples`, `integers`, or `floats` for your dictionary keys but most often you use a `string`.
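
The restriction to immutable keys can be checked directly. In this small sketch, the dictionary contents are invented for illustration. A tuple works as a key, while a list raises the same `TypeError` you get when putting a list into a set.

```python
# tuples, integers and strings are immutable and work as keys
coordinates = {("chr1", 100): "geneA", 42: "answer", "name": "value"}
print(coordinates[("chr1", 100)])

# a list is mutable and cannot be used as a key
try:
    broken = {["chr1", 100]: "geneA"}
except TypeError as error:
    print("TypeError:", error)
```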

Dictionaries are very handy if you want to replace certain elements with others, for example to convert RNA to DNA, form an antisense strand, or translate one biological ID into another.
The applications are numerous!
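
As a sketch of such a replacement, the following code builds the complementary strand of a short DNA sequence. The sequence is a made-up example, and the expression inside `join()` is a compact kind of loop that is only covered in a later section, so treat it as a preview rather than something you need to understand now.

```python
# map each DNA base to its complementary base
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}

sequence = "ATTGC"  # hypothetical example sequence

# look up every base in the dictionary and join the results into a string
complement_strand = "".join(complement[base] for base in sequence)
print(complement_strand)

# reverse the complement to read the antisense strand in 5' -> 3' direction
antisense = complement_strand[::-1]
print(antisense)
```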

You can also use dictionaries for data storage. For example, if you took multiple measurements in an experiment you could store the individual values as lists and assign them to dictionary keys like "measurement 1", "measurement 2", ... Just imagine the dictionary keys as column names of a datatable and the underlying lists as the individual entries in that column.
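
A minimal sketch of this idea, with invented measurement values:

```python
# hypothetical measurement values, the keys act like column names
measurements = {
    "measurement 1": [0.51, 0.48, 0.53],
    "measurement 2": [1.02, 0.97, 1.10],
}

# access a whole "column" by its key
print(measurements["measurement 1"])

# access a single entry in that column
print(measurements["measurement 2"][0])
```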

### Creating dictionaries
We can create dictionaries with the following syntax: `{key1:value1, key2:value2, ...}` or from a list of nested lists such as `[[key1, value1], [key2, value2], ...]` with the `dict()` function. If you already have keys and values in two separate lists, you can create a dictionary from them with the help of the `zip()` function (this function will be explained in the loops section).

In [2]:
# {} syntax
dict_1 = {"A": [1, 2, 3, 4], "B": {"A", "B"}, "C": 1.35}

# dict() function
dict_2 = dict([["A", [1, 2, 3, 4]], ["B", {"A", "B"}], ["C", 1.35]])

# zip() function
keys = ["A", "B", "C"]
values = [[1, 2, 3, 4], {"A", "B"}, 1.35]

dict_3 = dict(zip(keys, values))

print(dict_1 == dict_2 == dict_3)
print(dict_1)

True
{'A': [1, 2, 3, 4], 'B': {'A', 'B'}, 'C': 1.35}


### Indexing
Indexing with dictionaries is simple. For a list we would ask a question like *"What value does this list have at the 5th position?"* and code this as `mylist[4]`. For a dictionary, indexing is analogous: we code a question like *"What value does my dictionary have at key 'X'?"* as `mydict["X"]`.

You can also add a key:value pair to an existing dictionary by simply assigning the value to the respective index.

In [None]:
mydict = {"sequence 1": "ATTTGCTGAC", "sequence 2": "ACGTTGACTAAAAA"}

print(mydict["sequence 1"])

# add key:value pair
mydict["sequence 3"] = "ATGACTGACTTTT"

print(mydict)

ATTTGCTGAC
{'sequence 1': 'ATTTGCTGAC', 'sequence 2': 'ACGTTGACTAAAAA', 'sequence 3': 'ATGACTGACTTTT'}
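
Two related details are worth knowing. Assigning to a key that already exists overwrites its value, and accessing a key that does not exist raises a `KeyError`. The sketch below shows one way to guard against the latter with `in` or the `.get()` method. The replacement sequence is made up.

```python
mydict = {"sequence 1": "ATTTGCTGAC", "sequence 2": "ACGTTGACTAAAAA"}

# assigning to an existing key overwrites the old value
mydict["sequence 1"] = "GGGG"
print(mydict["sequence 1"])

# check whether a key exists before accessing it
print("sequence 3" in mydict)

# .get() returns None (or a default you choose) instead of raising a KeyError
print(mydict.get("sequence 3"))
print(mydict.get("sequence 3", "not found"))
```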


### Specific methods and functions
Python also has many methods for handling dictionaries. Again, you can find an overview of the most relevant ones below:

* `.update()`: used to add another dictionary to an existing dictionary
* `.copy()`: returns a copy of the dictionary
* `.items()`: returns the key:value pairs of the dictionary as tuples
* `.keys()`: returns the dictionary keys
* `.values()`: returns the dictionary values

Below are some code examples for these methods. Note that the last three methods return special object types such as `dict_items`, which you can convert to a list if you want.

In [3]:
# add two dicts together
dict_1 = {"A":5, "B":6}
dict_2 = {"C":7, "D":8}

dict_1.update(dict_2)

print(dict_1)

# copy a dict
dict_3 = dict_1.copy()
print(dict_3 == dict_1)

# print items, keys or values
print(dict_1.items())
print(dict_1.keys())
print(dict_1.values())

# convert special dict objects into lists
print(list(dict_1.values()))

{'A': 5, 'B': 6, 'C': 7, 'D': 8}
True
dict_items([('A', 5), ('B', 6), ('C', 7), ('D', 8)])
dict_keys(['A', 'B', 'C', 'D'])
dict_values([5, 6, 7, 8])
[5, 6, 7, 8]
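
One thing to keep in mind with `.update()`: it changes the dictionary in place, and if both dictionaries share a key, the value from the dictionary passed as the argument wins. A short sketch with illustrative names (`dict_a` and `dict_b` are not used elsewhere in this chapter):

```python
dict_a = {"A": 5, "B": 6}
dict_b = {"B": 60, "C": 7}

# .update() changes dict_a in place and returns None
result = dict_a.update(dict_b)
print(result)

# the value of the shared key "B" now comes from dict_b
print(dict_a)
```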


### Useful utility: defaultdict()
Python also supports a special dictionary type called `defaultdict`. This functionality is a bit more advanced, but you might see it in some code later on and can then come back to this chapter. `defaultdict` is part of Python's standard library but not loaded by default, so we have to import it explicitly with the following code (don't worry, imports will be explained in another chapter):

In [13]:
# import defaultdict from the collections module
from collections import defaultdict

A `defaultdict` lets us define a default value that is used for every key that does not exist yet. For this, you have to supply `defaultdict` with a function that returns your default value when executed. `defaultdict` is very often used with the `list` function. Any key you access that is not yet in the dictionary is then automatically created with an empty list `[ ]` as its value, unless you explicitly assign a different value to it.

**Note:** You may be confused why we write `list` instead of `list()` in the definition. This is because we want to supply the `list` function itself, not the value that it returns when we execute it as `list()` (which would be an empty list `[ ]`).

In [14]:
## create an empty dictionary with default value []
mydefaultdict = defaultdict(list)
#  any new keys will automatically be assigned to the default value if you don't
#  reassign them to something else
print(mydefaultdict["new_key"])

## create a regular empty dictionary for comparison
mydict = dict()
#  a normal dict has no default value so it will throw an error if you try to
#  print the value of a key that didn't exist before
print(mydict["new_key"])

[]


KeyError: 'new_key'
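
To make the difference between `list` and `list()` concrete, the following sketch passes both to `defaultdict`. It also shows that other functions can serve as the default, here `int`, which returns `0` when called without arguments. The variable names are only illustrative.

```python
from collections import defaultdict

# passing the function itself works
works = defaultdict(list)
print(works["x"])

# passing the result of list() (an empty list) does not
try:
    fails = defaultdict(list())
except TypeError as error:
    print("TypeError:", error)

# other functions work too: int() returns 0
counts = defaultdict(int)
counts["A"] += 1
print(counts["A"])
```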

When programming with Python you will often store lists inside dictionaries and append values to them step by step. The great thing about a defaultdict with type list is that if you access it with a new key that is not present in the dictionary yet, it will automatically create a key:value pair for that key with an empty list as the value which you can directly append values to.

In [22]:
# instead of this...
mydict["newkey"] = []
mydict["newkey"].append(1)

# ...you can do this
mydefaultdict["newkey"].append(1)

print(mydict, mydefaultdict, sep="\n")

{'newkey': [1]}
defaultdict(<class 'list'>, {'newkey': [1]})


This might not seem like a huge advantage at the moment but later when you get to know loops and functions you might want to come back to this chapter.

Also this hypothetical example might help:

Suppose you have some data with genes and the organisms they belong to. You want to sort this into a dictionary that contains each organism name as a key and the list of genes that belong to it as the respective value. The script you wrote for this is now in the middle of your data, finds a gene that belongs to the organism "fruit fly", and tries to add it to the dictionary.

With a normal dictionary you would have to make an exception for the case where this is the first gene you ever found for that organism (because you can't append to a list that doesn't exist yet), while for a defaultdict the code is much more elegant.

In [None]:
# normal dict implementation
mydict = dict()

if "fruit fly" in mydict:
    mydict["fruit fly"].append("new_gene")
else:
    mydict["fruit fly"] = []
    mydict["fruit fly"].append("new_gene")

# defaultdict implementation
mydefaultdict = defaultdict(list)

mydefaultdict["fruit fly"].append("new_gene")
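
Put together, the scenario described above could look like the following sketch. The `records` list is invented example data, and the `for` loop is only explained in a later chapter, so this is meant as a preview of how the pieces fit rather than a finished script.

```python
from collections import defaultdict

# hypothetical (gene, organism) pairs
records = [
    ("geneA", "fruit fly"),
    ("geneB", "human"),
    ("geneC", "fruit fly"),
]

genes_by_organism = defaultdict(list)

# no special case needed for the first gene of each organism
for gene, organism in records:
    genes_by_organism[organism].append(gene)

# convert to a regular dict for a cleaner printout
print(dict(genes_by_organism))
```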