# Sets

Sets are an unordered datatype in Python. Sets have two important properties:

* They are **unordered**. Thus, the elements of a set are not in a particular order, as they are in a list.
* Each element can be contained in a set **only once**. For example, if you have a set of names, each name can occur only once in that set. To detect duplicates efficiently, Python requires set elements to be hashable, which is why mutable data types such as lists are not allowed as values in a set. Therefore, you cannot create a set of lists, for example.

In Python there are two types of sets:

* **set** is the data type for mutable sets. This means that elements can be added and removed as needed.
* **frozenset** is the data type for immutable sets. For example, you can convert a list or a `set` into a `frozenset`. This `frozenset` cannot be changed afterwards. Since `frozenset` objects do not differ from normal sets apart from their immutability, we will not delve deeper into them.
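
The two properties above can be tried out directly. The following is a minimal sketch with made-up values: it shows that a list is rejected as a set element, that a tuple (which is immutable) is accepted, and that a `frozenset` offers no `add()` method.

```python
# A list is mutable and therefore unhashable: Python refuses it as a set element.
try:
    invalid = {['Santa', 'Claus']}
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Tuples are immutable, so they are allowed as set elements.
pairs = {('Santa', 'Claus'), ('Klara', 'Marko')}
print(len(pairs))  # 2

# A frozenset removes duplicates like a set, but cannot be modified afterwards.
frozen = frozenset(['Santa', 'Claus', 'Santa'])
print(len(frozen))  # 2
has_add = hasattr(frozen, 'add')
print(has_add)  # False
```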

In [None]:
from IPython.display import Image
Image("img/Python-data-structure.jpg")

## Create sets

Just as square brackets create a list, curly brackets create a set:

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
type(names)

Another possibility is to convert other data types (sequence types) into a set. For this we need the function `set()`: 

In [None]:
names = ['Santa', 'Claus', 'Klara', 'Marko']
nameset = set(names)
type(nameset)

So the function `set()` converts other types into a set. It is applicable to all *iterables*, that is, to any data type that can return its elements one by one.
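
As a small illustration of "any iterable", the sketch below passes a tuple and a `range` object to `set()`. The values are chosen only for demonstration.

```python
# set() accepts any iterable, not only lists.
from_tuple = set(('a', 'b', 'a'))   # duplicates in the tuple are dropped
from_range = set(range(5))          # the numbers 0 to 4

print(len(from_tuple))   # 2
print(len(from_range))   # 5
```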

### Example: Convert list to set

In the last notebook, we created a distinct list of first names by checking in a `for` loop for each entry in the list of names whether it was already present in the second list `distinct_names`. Since by definition there cannot be duplicate entries in a set, we can take advantage of this to accomplish the same thing: We simply convert our list of names into a set.

First we read all the names into the list `clean_names` again:

In [None]:
with open('data/names/names_short.txt', encoding='utf-8') as fh:
    clean_names = [line.rstrip() for line in fh.readlines()]

Then we convert the list into a `set`:

In [None]:
distinct_names = set(clean_names)
print('clean_names: {} entries, distinct_names: {} entries'.format(len(clean_names), len(distinct_names)))

However, `distinct_names` is now no longer a list, but of type `set`:

In [None]:
type(distinct_names)

We could simply convert the set created above back into a list to apply the already known list methods to it:

In [None]:
distinct_names = list(distinct_names)
type(distinct_names)

We can even do it in one go:

In [None]:
distinct_names = list(set(clean_names))
type(distinct_names), len(distinct_names)
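
One caveat: because a set is unordered, the list produced by `list(set(...))` does not necessarily keep the original order of the names. If a predictable order matters, one option is the built-in `sorted()`, which accepts a set and returns a sorted list. A minimal sketch, using a short hypothetical list `sample_names` instead of the file contents:

```python
# sample_names is a made-up stand-in for clean_names.
sample_names = ['Marko', 'Santa', 'Klara', 'Santa', 'Marko']

# sorted() takes any iterable (here a set) and always returns a list.
distinct_sorted = sorted(set(sample_names))
print(distinct_sorted)  # ['Klara', 'Marko', 'Santa']
```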

Often, however, this is not necessary, because many things that work with lists can also be applied to sets, as the following sections show.

## Count elements with len()

The `len()` function can also be applied to sets to count the number of elements present in the set:

In [None]:
distinct_names = set(clean_names)
len(distinct_names)

## Check for the presence of elements with the `in` operator

The `in` operator also works exactly the same as with lists: with `in` we can test whether a value is present in the set:

In [None]:
names = {'Santa', 'Claus', 'Klara'}
'Santa' in names
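
The negated form `not in` works the same way as it does for lists. A small sketch, where `'Marko'` is just an illustrative value that is not in the set:

```python
names = {'Santa', 'Claus', 'Klara'}

# 'not in' is True when the value is absent from the set.
is_missing = 'Marko' not in names
print(is_missing)  # True
```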

## Iterate through sets with for ... in

You can iterate through a set with `for ... in` in exactly the same way as you can iterate through a list. However, you should note that unlike lists, there is no fixed order of elements in sets. So it is not predictable in which order the elements will be processed in the loop. 

To illustrate this, we can again divide the first names into long, medium-length and short names and count them:

In [None]:
short_length_names = 0
medium_length_names = 0
long_length_names = 0

# if you have run the cells above, distinct_names is a set, not a list like in the original example!
print(type(distinct_names))
for name in distinct_names: # distinct_names is a set
    if len(name) > 8:
        long_length_names += 1
    elif len(name) < 5:
        short_length_names += 1
    else:
        medium_length_names += 1
        
print('{} short names, {} medium-length and {} long names'.format(
    short_length_names, medium_length_names, long_length_names))

<div class="alert alert-block alert-info">
<b>Exercise 1</b>
<p>Since strings are also sequence types, you can cast a string to a set (of characters). Use this technique to find out how many <b>different</b> letters make up your name.</p>
</div>

In [None]:
name = input('Input your name: ')
distinct_chars = # TODO
print(f'Your name contains {len(name)} characters, of which {len(distinct_chars)} are different')

## Modify sets

### Add values to a set

A new element can be added to a set using the `add()` method. Please note the subtle semantic difference from `append()` for lists: `append()` appends to the end of the list, but since a set has no defined order, the corresponding functionality is offered here by the `add()` method: the new element is simply added to the set without implying a specific position.

In [None]:
names = {'Santa', 'Claus', 'Klara'}
names.add('Kat')
names

Of course you can also start with an empty set. However, the notation with the curly brackets does not work here because it is already being used to create an empty dictionary (coming in the notebook after next). Therefore, we need to explicitly create a `set` object:

In [None]:
names = set()
names.add('Tim')
names.add('Blue power ranger')
names

Since each element can only appear once in a set, a set silently ignores an added element if it already exists:

In [None]:
names = {'Santa', 'Claus', 'Klara'}
names.add('Santa')
names
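
If several elements should be added at once, the `update()` method of `set` accepts any iterable and adds each of its elements, again skipping those that already exist. A minimal sketch with made-up names:

```python
names = {'Santa', 'Claus'}

# update() adds every element of the iterable; 'Santa' is already present and is skipped.
names.update(['Klara', 'Marko', 'Santa'])

print(sorted(names))  # ['Claus', 'Klara', 'Marko', 'Santa']
```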

### Removing values from a set with discard()

The `discard()` method removes a specific element from a set:

In [None]:
names = {'Santa', 'Claus', 'Klara'}
names.discard('Klara')
print(names)

If we try to discard a nonexistent element with `discard()`, Python silently ignores it:

In [None]:
names = {'Santa', 'Claus', 'Klara'}
names.discard('Powrranger')
print(names)

### Remove values from a set with remove()

An alternative method to remove an element is available with `remove()`:

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
print(names)
names.remove('Santa')
names

The difference from `discard()` lies in the way Python reacts when we try to remove a nonexistent element:

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
print(names)
names.remove('Lucija')
names

Here, the missing value is not simply ignored; instead, an exception (a `KeyError`) is raised. We will learn what that is in one of the next notebooks.
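
Using only what has been covered so far, one way to call `remove()` safely is to test for membership with `in` first. This is just a sketch of one possible approach; the variable `removed` is introduced here purely for illustration.

```python
names = {'Santa', 'Claus', 'Klara', 'Marko'}

# Only call remove() if the element is really in the set.
if 'Lucija' in names:
    names.remove('Lucija')
    removed = True
else:
    removed = False
    print('Lucija is not in the set, nothing was removed')

print(len(names))  # 4
```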

### Remove a value with `pop()`

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
print(names)
names.pop()
names

We learned that `pop()` removes the last element of a list. Since a set has no defined order, `pop()` removes an arbitrary element from the set, and you cannot predict which one. The method is still useful, for example if you want to remove one element at a time in a `while` loop:

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
while names:  # True as long as there are elements in the set
    print(names.pop())
names # now empty

### Delete all values from a set

There is a method `clear()` for deleting all values from a set:

In [None]:
names = {'Santa', 'Claus', 'Klara', 'Marko'}
print(names)
names.clear()
names

<div class="alert alert-block alert-info">
<b>Exercise 2</b>
<ol>
<li>Read the file <tt>data/names/first_2015.txt</tt> line by line, remove the line breaks and convert the resulting list into a set <tt>distinct_names</tt>.</li>
<li>Iterate through this set in a loop and print out all entries that consist of 3, 4 or 5 characters.</li>
</ol>
</div>

## Set operations

While removing duplicate elements is a nice feature, the real purpose of sets is that we can use them to do set theory.

### Form the intersection

Let's assume we have two groups of friends: One from the university, a second we meet regularly at yoga:

In [None]:
uni_friends = {'Ana', 'Tino', 'Hannes', 'Sabrina'}
yoga_friends = {'Emil', 'Sabrina', 'Tino'}

If we want to find out who belongs to both groups, we have to form the intersection of the two sets.

![intersection](img/set_intersection.png)

In [None]:
uni_friends & yoga_friends

We have learned here about the `&` operator, which, applied to two sets, forms the intersection of these two sets.

As an alternative to the `&` operator, you can use the `intersection()` method of a `set` object:

In [None]:
uni_friends.intersection(yoga_friends)

### Forming the difference of two sets

The difference of two sets is formed by removing from the first set all elements that are also present in the second set. Thus, only the elements that occur in the first set but not in the second remain.

![difference set](img/set_difference.png)

To make this a bit more concrete: If we want to find out which of our friends we only know from university, the difference operator `-` is exactly what we need:

In [None]:
uni_friends - yoga_friends

Here, too, there is an alternative method to the operator: `difference()`.


In [None]:
uni_friends.difference(yoga_friends)

If we want to find out which friends we only know from yoga, we have to swap the two sets:

In [None]:
yoga_friends - uni_friends

In [None]:
yoga_friends.difference(uni_friends)

### Form union of two sets

The union of two sets can be formed just as easily:

In [None]:
uni_friends | yoga_friends

The result contains all elements from both sets.

![union set](img/set_union.png)

Again, there is a method that does the same as the `|` operator: `.union()`:

In [None]:
uni_friends.union(yoga_friends)
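
To recap the three operations side by side, here is a compact sketch on two small sets of numbers. The sets `a` and `b` are made up for this summary, and `sorted()` is only used to get a predictable printout.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

intersection = a & b   # elements present in both sets
difference = a - b     # elements of a that are not in b
union = a | b          # elements present in at least one of the sets

print(sorted(intersection))  # [3, 4]
print(sorted(difference))    # [1, 2]
print(sorted(union))         # [1, 2, 3, 4, 5]

# The operators return new sets; a and b themselves are unchanged.
print(sorted(a), sorted(b))  # [1, 2, 3, 4] [3, 4, 5]
```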

## Other useful methods

When programming, it's not uncommon to write something yourself and later find out that this functionality already existed. Therefore, here are a few more useful methods of the `set` object. (From here on, this is not exam material, because you can look it up at any time.)

### Are there any overlaps?

To find out if two sets have no elements in common, you can use `isdisjoint()`:

In [None]:
uni_friends = {'Ana', 'Elena', 'Alex', 'George'}
yoga_friends = {'Emil', 'Sabrina', 'George'}
highschool_friends = {'Maja', 'Mario', 'Elisabeth'}

uni_friends.isdisjoint(yoga_friends)

In [None]:
uni_friends.isdisjoint(highschool_friends)
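
Put differently, two sets are disjoint exactly when their intersection is empty. The sketch below checks this on small made-up sets; the names `group_a`, `group_b` and `group_c` are chosen only for this illustration.

```python
group_a = {'Ana', 'Elena', 'George'}
group_b = {'Emil', 'George'}
group_c = {'Maja', 'Mario'}

# group_a and group_b share 'George', so they are not disjoint.
print(group_a.isdisjoint(group_b))   # False
print(len(group_a & group_b) == 0)   # False

# group_a and group_c share nothing, so they are disjoint.
print(group_a.isdisjoint(group_c))   # True
print(len(group_a & group_c) == 0)   # True
```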

### Is one set a subset of the other set?

To find out if all elements of one set are contained in the other set, you can use `issubset()`.

In [None]:
uni_friends = {'Ana', 'Elena', 'Alex', 'George'}
yoga_friends = {'Ana', 'George'}

yoga_friends.issubset(uni_friends)

We can also reverse it:

In [None]:
uni_friends = {'Ana', 'Elena', 'Alex', 'George'}
yoga_friends = {'Ana', 'George'}

uni_friends.issuperset(yoga_friends)
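
As with the set operations, there are operator forms for these two methods: `<=` corresponds to `issubset()` and `>=` to `issuperset()`. A short sketch using the same two sets:

```python
uni_friends = {'Ana', 'Elena', 'Alex', 'George'}
yoga_friends = {'Ana', 'George'}

# <= tests "is subset of", >= tests "is superset of".
is_subset = yoga_friends <= uni_friends
is_superset = uni_friends >= yoga_friends

print(is_subset)    # True
print(is_superset)  # True
```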

# Literature:
* https://docs.python.org/3/tutorial/datastructures.html
* https://www.w3schools.com/python/python_sets.asp
* https://realpython.com/python-sets/
* https://www.youtube.com/watch?v=W8KRzm-HUcc