# 5  Sets are like dictionaries but with only keys
Sets are unordered, mutable collections of unique objects. (The corresponding immutable type is frozenset.) Conceptually, they are dictionaries that have only keys, without associated values (Python tutorial, realpython.com).
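
To make the difference between set and frozenset concrete, here is a minimal sketch. The names `mutable`, `frozen`, `labels` and `nested` are made up for illustration.

```python
# A set can be changed in place; a frozenset cannot.
mutable = {1, 2}
mutable.add(3)

frozen = frozenset([1, 2, 2, 3])   # duplicates are dropped here too

try:
    frozen.add(4)                  # frozenset has no add() method
    add_failed = False
except AttributeError:
    add_failed = True

# Because a frozenset is hashable, it can serve as a dictionary key
# or as an element of another set (a plain set cannot).
labels = {frozenset({"a", "b"}): "pair"}   # hypothetical example dictionary
nested = {frozenset({1}), frozenset({1, 2})}

print(mutable, frozen, add_failed)
print(labels[frozenset({"b", "a"})], len(nested))
```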

Like dictionaries, sets allow for very efficient membership testing (item in my_set). This is possible because Python computes a hash of the item, a number derived from its value, and uses that number to work out where in the set's internal table the item would be stored. For example, if it sees the integer 1, it checks one specific slot and can immediately say whether 1 is present in the set. If it sees the string "abc", it checks another slot. If it sees the string "a", it checks a third slot, and so on. Python can therefore check (in O(1) time on average) whether the item is present. It doesn't need to look at all the other items in the set, like it would have to in a list or tuple. This is also why duplicates in a set/dictionary are not allowed (equal items end up in the same slot), why items must be hashable, and why indexing is not possible: the items have no positions. It's the same with dictionary keys: no duplicates, no indexing by position, and very quick lookups. Internally, this works via a "hash table", as we'll learn later.
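
The role of hashing can be explored with the built-in hash() function. The following is a small sketch, not a description of the hash table internals; the helper `is_hashable` is a hypothetical name introduced here.

```python
def is_hashable(obj):
    """Return True if obj can be stored in a set or used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

# Equal objects always have equal hashes, so Python looks in the same place for both.
same_hash = hash("abc") == hash("a" + "bc")

# 1, 1.0 and True compare equal, so a set keeps only one of them.
merged = {1, 1.0, True}

print(same_hash, len(merged))
# Strings and tuples of hashable items are hashable; lists and sets are not.
print(is_hashable("abc"), is_hashable((1, 2)), is_hashable([1, 2]), is_hashable({1, 2}))
```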
Basic uses:

- searching (membership testing)
- eliminating duplicates
- comparisons: set operations like union, intersection, difference, symmetric difference
For example, to determine the number of distinct characters in the string s = 'AGHCDFJALSDFHVHASKDFLVHTUYPCV', you can write len(set(s)).
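
As a runnable version of this example, the sketch below also shows two common follow-ups: sorting the distinct characters, and removing duplicates while keeping the order of first appearance. The `dict.fromkeys` approach is one option among several, and the variable names are chosen for illustration.

```python
s = 'AGHCDFJALSDFHVHASKDFLVHTUYPCV'

distinct = set(s)                 # duplicates disappear, order is not kept
n_distinct = len(distinct)

alphabetical = sorted(distinct)   # sorted() returns a list in a defined order

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance.
in_order = "".join(dict.fromkeys(s))

print(n_distinct)
print(alphabetical)
print(in_order)
```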

Set operators and corresponding method syntax:

set1 < set2  # proper subset: True if every set1 element is also in set2 and set2 has additional elements
set1 <= set2 # set1.issubset(collection) -> is set1 a subset of set2 (or equal to it)?
set1 | set2  # set1.union(collection) -> combine set1 and set2
set1 & set2  # set1.intersection(collection) -> get elements that are in both set1 and set2
set1 - set2  # set1.difference(collection) -> get set1 elements that are not in set2
set1 ^ set2  # set1.symmetric_difference(collection) -> get elements that are in set1 or set2 but not both
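
One practical difference between the two syntaxes: the operators require sets on both sides, whereas the methods accept any iterable. A short sketch with made-up values:

```python
set1 = {1, 2}
set2 = {1, 2, 3}

proper = set1 < set2             # True: set2 contains everything in set1, plus 3
not_proper = set1 < set1         # False: a set is not a proper subset of itself
subset_or_equal = set1 <= set1   # True

# The method form accepts any iterable, here a list.
from_list = set1.union([3, 4])

# The operator form insists on a set and raises TypeError otherwise.
try:
    set1 | [3, 4]
    operator_ok = True
except TypeError:
    operator_ok = False

# Methods also take several arguments at once.
several = set1.union([3], (4,), "5")

print(proper, not_proper, subset_or_equal)
print(from_list, operator_ok, several)
```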

Basic set operations

Because locating items in sets is much faster than in lists or tuples (Stack Overflow), sets are often used for comparisons between collections of items. On the downside, sets are unordered and cannot hold duplicates (like dictionary keys).
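
A typical pattern that follows from this: when many lookups are made against the same collection, convert it to a set once and test against the set. The identifiers below are invented for illustration.

```python
# Hypothetical data: a large collection of known identifiers and a few queries.
all_ids = [f"{letter}{digit}" for letter in "efg" for digit in range(10)]  # 'e0' ... 'g9'
wanted_ids = ["f3", "e2", "zz"]

known = set(all_ids)   # built once; each later lookup is a hash lookup

# Using a list comprehension keeps the order of wanted_ids.
found = [i for i in wanted_ids if i in known]
missing = [i for i in wanted_ids if i not in known]

print(found)
print(missing)
```

The results are the same as with `i in all_ids`, but each test against the list would scan its items one by one.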

Example:

In [1]:
A = {3,6,8}
B = {6,7}

print(A & B)  # A ∩ B (intersection)

{6}


In [None]:
# Operators/methods to update sets (in place):

set1.add(element)
set1.remove(element)  # raises KeyError if missing; see also: set1.discard(element), set1.pop()
set1 |= set2  # set1.update(coll) -> add elements to set1
set1 &= set2  # set1.intersection_update(coll)
set1 -= set2  # set1.difference_update(coll)
set1 ^= set2  # set1.symmetric_difference_update(coll)
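
The listing above is a reference rather than a runnable cell. A minimal runnable sketch with made-up values, showing the effect of each update step by step:

```python
set1 = {1, 2, 3}

set1.add(4)                 # {1, 2, 3, 4}
set1.discard(99)            # element is absent: discard() silently does nothing

try:
    set1.remove(99)         # remove() raises KeyError for an absent element
    remove_raised = False
except KeyError:
    remove_raised = True

set1 |= {5, 6}              # {1, 2, 3, 4, 5, 6}
set1 &= {2, 3, 4, 5, 6, 7}  # {2, 3, 4, 5, 6}
set1 -= {2}                 # {3, 4, 5, 6}
set1 ^= {6, 7}              # 6 is in both and is dropped, 7 is added: {3, 4, 5, 7}

before_pop = set(set1)      # copy, so the effect of pop() can be inspected
popped = set1.pop()         # removes and returns an arbitrary element

print(remove_raised, before_pop, popped, set1)
```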

In [None]:
>>> basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
>>> print(basket)                      # show that duplicates have been removed
{'orange', 'banana', 'pear', 'apple'}
>>> 'orange' in basket                 # fast membership testing
True
>>> 'crabgrass' in basket
False

>>> # Demonstrate set operations on unique letters from two words
...
>>> a = set('abracadabra')
>>> b = set('alacazam')
>>> a                                  # unique letters in a
{'a', 'r', 'b', 'c', 'd'}
>>> a - b                              # letters in a but not in b
{'r', 'd', 'b'}
>>> a | b                              # letters in a or b or both
{'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}
>>> a & b                              # letters in both a and b
{'a', 'c'}
>>> a ^ b                              # letters in a or b but not both
{'r', 'd', 'b', 'm', 'z', 'l'}

# Similar to list comprehensions, set comprehensions are also supported:
>>> a = {x for x in 'abracadabra' if x not in 'abc'}
>>> a
{'r', 'd'}

1. Your colleague compiled a list of sequence identifiers they want to use in the experiment: best_sequids = ['f3', 'g9', 'e2', 'r0']. Another colleague compiled a second list of sequence identifiers they want to use in the experiment: optimal_sequids = ['e2', 'e3', 'e4', 'f3', 'n1']
    - Determine how many distinct sequence identifiers are in both lists.
    - Determine which identifiers are only in the first list.
    - Determine which identifiers are unique in each one of the lists.

In [23]:
# the two lists from the task, written directly as sets
best_sequids = {'f3', 'g9', 'e2', 'r0'}
optimal_sequids = {'e2', 'e3', 'e4', 'f3', 'n1'}

In [38]:
# determine how many distinct sequence identifiers are in both lists.
print(f"identifiers in both lists: {best_sequids & optimal_sequids}")
print(f"that is {len(best_sequids & optimal_sequids)} distinct identifiers.")
# & keeps only the items that are in list 1 and in list 2 (intersection);
# ^ would instead give the items that are in exactly one list, i.e. what distinguishes the lists

identifiers in both lists: {'e2', 'f3'}
that is 2 distinct identifiers.


In [25]:
# determine which identifiers are only in the first list.
best_sequids - optimal_sequids

{'g9', 'r0'}

In [35]:
# all identifiers in both lists:
print(f"all identifiers that are listed: {best_sequids | optimal_sequids}")  # identifiers in list 1 or list 2 or both

# determine which identifiers are unique to each one of the lists
print(f"identifiers in list 1 OR list 2: {best_sequids ^ optimal_sequids}")
# ^ excludes identifiers that are present in both lists (in list 1 or list 2, but not in both)

# determine which of the identifiers printed above belong only to list 1
print(f"identifiers only in list 1: {best_sequids - optimal_sequids}")

# determine which of the identifiers printed above belong only to list 2
print(f"identifiers only in list 2: {optimal_sequids - best_sequids}")

all identifiers that are listed: {'e3', 'r0', 'e4', 'f3', 'e2', 'g9', 'n1'}
identifiers in list 1 OR list 2: {'n1', 'g9', 'e3', 'r0', 'e4'}
identifiers only in list 1: {'r0', 'g9'}
identifiers only in list 2: {'e4', 'n1', 'e3'}
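
The task hands over lists, so a solution can also start from lists and convert them with set(). The following compact sketch is one possible way to do this; the names `best_list`, `optimal_list`, `best` and `optimal` are chosen here for illustration.

```python
best_list = ['f3', 'g9', 'e2', 'r0']
optimal_list = ['e2', 'e3', 'e4', 'f3', 'n1']

best, optimal = set(best_list), set(optimal_list)

shared = best & optimal          # identifiers present in both lists
only_best = best - optimal       # identifiers only in the first list
only_optimal = optimal - best    # identifiers only in the second list
in_exactly_one = best ^ optimal  # identifiers unique to one of the lists

# The symmetric difference is the union of the two one-sided differences.
assert in_exactly_one == only_best | only_optimal

# sorted() gives a stable order for printing; sets themselves print in arbitrary order.
print(f"in both lists: {sorted(shared)} ({len(shared)} identifiers)")
print(f"only in list 1: {sorted(only_best)}")
print(f"only in list 2: {sorted(only_optimal)}")
```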
