# Sets
Another variable type that can be used to store multiple items is the `set`. Sets are similar to lists, but they are an *unordered* collection of items that can contain *no duplicate values*. This is useful whenever we only care whether an item is present, and not how many times it occurs. We can also add and remove items from a set, and perform operations on two sets to compare their contents.

Just like lists, sets can hold elements of many different types - `int`, `float`, `str` etc. - but they cannot contain lists, dictionaries or other sets. These types are mutable: their contents could be changed after being added to the set, and the set could then no longer guarantee that it contains no duplicate items.
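In Python terms, set elements must be *hashable*, which the built-in mutable containers are not. The sketch below illustrates this with made-up values (the variable names are only for illustration): adding a list raises a `TypeError`, while a tuple or a `frozenset` (an immutable set) is accepted.

```python
# Immutable values such as strings, numbers and tuples are hashable,
# so they can be stored in a set.
mixed = {"ATG", 42, 3.14, ("A", "T")}

# A list is mutable and therefore unhashable: adding one raises TypeError.
try:
    mixed.add(["A", "T"])
    list_allowed = True
except TypeError:
    list_allowed = False

print(list_allowed)

# A frozenset is an immutable set, so it can be an element of another set.
nested = {frozenset({"A", "T"}), frozenset({"G", "C"})}
print(frozenset({"T", "A"}) in nested)
```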

### Creating a Set
Sets can be created in a similar way to lists, but using curly braces `{}` instead of square brackets `[]`.

In [1]:
cell_types = {"hepatocyte", "adipocyte", "erythrocyte", "fibroblast"}

print(cell_types)

{'fibroblast', 'hepatocyte', 'erythrocyte', 'adipocyte'}


Sets can also be created from an existing list using the `set()` function.

In [2]:
amino_acids = ['Lysine', 'Glutamine', 'Alanine', 'Lysine']

amino_set = set(amino_acids)

print(amino_set)

{'Lysine', 'Glutamine', 'Alanine'}


**Note that in the second example, the duplicate value from `amino_acids` was removed when converted to a set.**

<div class="alert alert-info">
Note that an empty set cannot be created with `my_set = {}`, as this creates an empty dictionary (we will cover dictionaries in the next topic). To create an empty set, use `my_set = set()`.
</div>
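A quick way to check the difference is to ask Python for the type of each object. This is a small sketch; `empty_dict` and `empty_set` are illustrative names.

```python
empty_dict = {}      # curly braces alone give a dictionary
empty_set = set()    # set() with no arguments gives an empty set

print(type(empty_dict))
print(type(empty_set))

# The empty set can then be filled like any other set.
empty_set.add("ATG")
print(empty_set)
```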

### Set Containment
One of the main uses of a set is simply to keep track of which elements are present, and to test for them with the `in` operator.

In [3]:
sequence = "ATGCTGACATGACTGTACATC"

nucleotides = set(sequence)

print(nucleotides)

{'A', 'C', 'G', 'T'}


In [4]:
print( 'A' in nucleotides )

True


In [5]:
print( '*' in nucleotides )

False


In [6]:
codons = ['ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC']

unique_codons = set(codons)

print(unique_codons)

{'ATG', 'TGA', 'CTG', 'ACA', 'ATC', 'TAC'}


In [7]:
print( 'TGA' in unique_codons )

True


In [8]:
print( 'GTG' in unique_codons )

False


#### Subset and Superset
Sometimes it is useful to know whether one set contains all the elements of another. If set A contains all the elements of set B, then A is a *superset* of B, and B is a *subset* of A.

In [9]:
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
codons2 = {'ATG', 'TGA', 'CTG', 'TAC', 'ATC'}

if codons2 <= codons1: # If codons2 is a subset of codons1
    print("They require the same amino acids")

They require the same amino acids


*The `<=` comparison is also `True` if the two sets are identical.*

It is possible to check that B is strictly a subset of A (a subset, but not identical to it) with `B < A`.

Similarly, it is possible to test if two sets are identical with the comparison operator we have already seen (`==`).
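The comparisons described in this section can be seen side by side in the sketch below. It reuses the codon sets from the example above (without the repeated `'CTG'`); the names ending in `_check` are illustrative.

```python
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'TAC', 'ATC'}
codons2 = {'ATG', 'TGA', 'CTG', 'TAC', 'ATC'}

subset_check = codons2 <= codons1      # every element of codons2 is in codons1
strict_check = codons2 < codons1       # subset, and the two sets differ
superset_check = codons1 >= codons2    # same test, seen from codons1
print(subset_check, strict_check, superset_check)

# A set is a subset of itself, but never a strict subset of itself.
print(codons1 <= codons1, codons1 < codons1)

# Equality ignores the order in which the elements were written.
print(codons2 == {'ATC', 'TAC', 'CTG', 'TGA', 'ATG'})
```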

### Set Operations

#### Adding and Removing Elements
Adding and removing elements from a set individually can be done with the `add()` and `remove()` methods:

In [10]:
codons = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
codons.add('GCC')
print(codons)
codons.remove('CTG')
print(codons)

{'ATG', 'TGA', 'CTG', 'ACA', 'GCC', 'ATC', 'TAC'}
{'ATG', 'TGA', 'ACA', 'GCC', 'ATC', 'TAC'}


As with the list `remove()` method, Python raises an error if the element is not present in the set (a `KeyError` in the case of sets).
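A short sketch of that error, using a small made-up set of codons. It also shows `discard()`, a built-in set method that removes an element if it is present and does nothing otherwise, which can be convenient when you are not sure the element is there.

```python
codons = {'ATG', 'CTG', 'ACA'}

# remove() raises KeyError when the element is missing.
try:
    codons.remove('GGG')
except KeyError:
    print("'GGG' is not in the set")

# discard() silently ignores a missing element...
codons.discard('GGG')

# ...and removes an element that is present.
codons.discard('CTG')
print(len(codons))
```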

It is possible to add and remove multiple elements at once using the operations below.

#### Loops
It is possible to iterate through all the elements of the set, but remember that they are not ordered.

In [11]:
codons = ['ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC']

unique_codons = set(codons)

for codon in unique_codons:
    print('G' in codon)

True
True
True
False
False
False


*Note that the sequence of `True` and `False` values does not match the order of the original list.*
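If a predictable order is needed, one option is to loop over `sorted(unique_codons)`, which returns a new sorted list and leaves the set itself unchanged. A minimal sketch (the name `ordered` is illustrative):

```python
codons = ['ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC']
unique_codons = set(codons)

# sorted() returns a list, so the loop order is the same on every run.
ordered = sorted(unique_codons)

for codon in ordered:
    print(codon, 'G' in codon)
```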

#### Union

    A | B
Returns a *new* set that contains all elements that were in either set.

In [12]:
sequence1 = "ATGCTGACATGACTGTACATC"
sequence2 = "TACCAGTAGATCCATCATGAC"

codons1 = ['ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC']
codons2 = ['TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'CAT', 'GAC']

unique_codons1 = set(codons1)
unique_codons2 = set(codons2)

print(unique_codons1)
print(unique_codons2)

{'ATG', 'TGA', 'CTG', 'ACA', 'ATC', 'TAC'}
{'CAT', 'TAG', 'ATC', 'GAC', 'TAC', 'CAG'}


In [13]:
unique_codons = unique_codons1 | unique_codons2
print(unique_codons)

{'ATG', 'CAT', 'TAG', 'TGA', 'ACA', 'CTG', 'ATC', 'GAC', 'TAC', 'CAG'}


#### Update

    A |= B
Adds all the elements of set B to set A, modifying A in place.

In [14]:
unique_codons = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
print(unique_codons)

{'ATG', 'TGA', 'CTG', 'ACA', 'ATC', 'TAC'}


In [15]:
codons2 = {'TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'CAT', 'GAC'}
unique_codons |= codons2

print(unique_codons)

{'ATG', 'CAT', 'TAG', 'TGA', 'CTG', 'ACA', 'ATC', 'GAC', 'TAC', 'CAG'}


*This operation is the same as union, except that the equals sign stores the result in set A. The operations that follow have equivalent in-place forms that store the result in A (`&=`, `-=`, `^=`).*

#### Intersection
    A & B
Returns a *new* set that contains only elements present in both A *and* B

In [16]:
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
codons2 = {'TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'CAT', 'GAC'}

common_codons = codons1 & codons2
print(common_codons)

{'ATC', 'TAC'}


#### Difference
    A - B
Returns a *new* set that contains only elements that were present in A but not in B

In [17]:
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
codons2 = {'TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'CAT', 'GAC'}

unique_to_1 = codons1 - codons2
print(unique_to_1)

{'ATG', 'CTG', 'ACA', 'TGA'}


#### Symmetric Difference
    A ^ B
Returns a *new* set that contains elements that belong to either *only* A or *only* B (not both)

In [18]:
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'CTG', 'TAC', 'ATC'}
codons2 = {'TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'CAT', 'GAC'}

personal_codons = codons1 ^ codons2
print(personal_codons)

{'CAT', 'TAG', 'TGA', 'ACA', 'CTG', 'GAC', 'ATG', 'CAG'}
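The in-place forms `&=`, `-=` and `^=` give the same results as the operators above, but modify the set on the left-hand side instead of returning a new one. The sketch below works on copies (made with `set(...)`) so that `codons1` itself is left untouched; the names `shared`, `only_in_1` and `in_one_only` are illustrative.

```python
codons1 = {'ATG', 'CTG', 'ACA', 'TGA', 'TAC', 'ATC'}
codons2 = {'TAC', 'CAG', 'TAG', 'ATC', 'CAT', 'GAC'}

shared = set(codons1)        # copy of codons1
shared &= codons2            # keep only elements also in codons2

only_in_1 = set(codons1)
only_in_1 -= codons2         # drop elements that are in codons2

in_one_only = set(codons1)
in_one_only ^= codons2       # keep elements found in exactly one set

print(sorted(shared))
print(sorted(only_in_1))
print(sorted(in_one_only))
```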


# Exercises

- Find all the amino acids that are common to both these sequences

In [19]:
sequence1 = "CAWEWPRHRYT"
sequence2 = "GATTCGAWRPQY"

- Find out which of these proteins can be created with the available amino acids

In [20]:
available = { 'C', 'G', 'R', 'T', 'Q', 'Y', 'A', 'H'}

protein1 = "AGTHYAHYCTGR"
protein2 = "GATTCGAWRPQY"
protein3 = "CAWEWPRHRYT"


- Which amino acids need to be added to the available pool for protein 3 to be synthesizable?