# Studying research group applications

**Learning goal**: By the end of this case, you will be able to differentiate between sets, lists, tuples, and dictionaries in Python. Additionally, you will know how to create and use them.

You are the assistant of a top-rated professor who has been collecting applications for her research group. She has been recording them in a Word document that she updates every time she receives an email from an applicant. So far, there are four people interested in joining the group. Your professor has asked you to help her get some preliminary insight on the kinds of interests that applicants have in common, as well as organize the data for further analysis and interviewing.

## Analysis of affinities

These are the preferences that the applicants reported (the names of the applicants and the subjects are in no particular order):

* **Dr. Hannah Parker**:
    - Topology
    - Applied mathematics
    - Number theory
* **Dr. Tim Neely**:
    - Statistics
    - Labor economics
    - Applied mathematics
    - Medieval literature
* **Dr. Matt Harford**:
    - Software development
    - Statistics
    - Applied mathematics
    - Number theory
* **Dr. Mithuna Patel**:
    - Quantum mechanics
    - General relativity
    - Quantum computing
    
Let's create a Python object for each academic to store these preferences:

In [16]:
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}
matt_harford_interests = {"software development", "statistics", "applied mathematics", "number theory"}
mithuna_patel_interests = {"quantum mechanics", "general relativity"}

This particular kind of object is called a **set**. Sets are created by placing elements between curly braces and separating them with commas. One important thing about sets is that they don't allow duplicates. To illustrate this, let's say you wanted to add `quantum computing` (which we left out above) to `mithuna_patel_interests`. You can do that easily with the **`.add()`** method.

Before adding the element:

In [2]:
mithuna_patel_interests

{'general relativity', 'quantum mechanics'}

After adding it:

In [3]:
mithuna_patel_interests.add("quantum computing")
mithuna_patel_interests

{'general relativity', 'quantum computing', 'quantum mechanics'}

But if we try to add an element that is already present, we simply get the exact same set as before, because Python detects that would create a duplicate entry and ignores our attempts to add it again:

In [4]:
mithuna_patel_interests.add("general relativity")
mithuna_patel_interests

{'general relativity', 'quantum computing', 'quantum mechanics'}
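
The same rule applies when a set is first built: duplicates are dropped at creation time. Here is a minimal sketch, assuming a hypothetical `raw_interests` list that is not part of the professor's data. It also shows a common pitfall: empty curly braces create a dictionary (a type covered later in this case), not an empty set.

```python
# Hypothetical list with a repeated entry (not taken from the applications)
raw_interests = ["quantum mechanics", "general relativity", "quantum mechanics"]

# Converting the list to a set keeps one copy of each distinct element
unique_interests = set(raw_interests)

print(len(raw_interests))     # 3
print(len(unique_interests))  # 2

# Careful: {} is an empty dictionary; use set() for an empty set
print(type({}))     # <class 'dict'>
print(type(set()))  # <class 'set'>
```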

Sets are super useful if you want to compare collections of items. This is done using **[set logic](https://medium.com/better-programming/mathematical-set-operations-in-python-e065aac07413)**. For instance, suppose we want to know what interests our academics have in common. Finding the shared elements is called **set intersection**. If you picture a set as a circle, this is what an intersection of two sets (circles) looks like:

![Intersection between two sets](data/images/intersection_two.png)

Intersecting two sets gets you all the elements that are present in both of them. The syntax to do this in Python is:

~~~python
A & B
~~~

where `A` and `B` are sets. So, if we wanted to find the intersection of interests of Dr. Hannah Parker and Dr. Tim Neely, we'd do the following:

In [5]:
hannah_parker_interests & tim_neely_interests

{'applied mathematics'}

This effectively shows us the subjects that are both in `hannah_parker_interests` *and* `tim_neely_interests`.
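
If you prefer a method call over an operator, sets also provide `.intersection()`. The sketch below redefines the two sets so it runs on its own and checks that both spellings give the same result without changing the originals.

```python
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}

# Operator form and method form of the same operation
common_operator = hannah_parker_interests & tim_neely_interests
common_method = hannah_parker_interests.intersection(tim_neely_interests)

print(common_operator == common_method)  # True

# Neither form modifies the sets being compared
print(len(hannah_parker_interests))  # 3
print(len(tim_neely_interests))      # 4
```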

### Exercise 1

Find the interests that Dr. Tim Neely and Dr. Matt Harford have in common. How about the interests that all four applicants have in common?

**Answer.**

-------

In [8]:
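# Interests shared by Dr. Tim Neely and Dr. Matt Harford:
# {'applied mathematics', 'statistics'}
tim_neely_interests & matt_harford_interests
# Interests shared by all four applicants (a notebook cell only displays this last result)
tim_neely_interests & matt_harford_interests & hannah_parker_interests & mithuna_patel_interests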

set()

The result of the intersection of all the sets, `set()`, is called the **empty set**. If you look carefully, you'll see that Dr. Hannah Parker, Dr. Tim Neely, and Dr. Matt Harford have `applied mathematics` in common. That would be a diagram like this:

![Intersection between three sets](data/images/intersection_three.png)

But when you intersect that with the interests of Dr. Mithuna Patel, the result is the empty set, because Dr. Mithuna Patel isn't interested in `applied mathematics`. This is an empty intersection.

You can check that this set is empty by calling the **`len()`** function, which tells you the number of elements in the set being passed to it:

In [3]:
interests_all_common = hannah_parker_interests & tim_neely_interests & matt_harford_interests & mithuna_patel_interests
len(interests_all_common)

0

You can use the **`in`** keyword to verify that `general relativity` is in Dr. Mithuna Patel's set of subjects, while it isn't in Dr. Hannah Parker's:

In [4]:
print("general relativity" in mithuna_patel_interests)
print("general relativity" in hannah_parker_interests)

True
False
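
Two other ways to check for an empty intersection may be handy. An empty set counts as false in an `if` statement, and the `.isdisjoint()` method answers the question directly. A small sketch, with the sets redefined so it runs on its own:

```python
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}
mithuna_patel_interests = {"quantum mechanics", "general relativity", "quantum computing"}

shared = hannah_parker_interests & mithuna_patel_interests

# An empty set is treated as False, a non-empty set as True
if not shared:
    print("No interests in common")

# .isdisjoint() returns True when two sets share no elements
print(hannah_parker_interests.isdisjoint(mithuna_patel_interests))  # True
print(hannah_parker_interests.isdisjoint(tim_neely_interests))      # False
```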


Now let's find what interests Dr. Tim Neely has that Dr. Hannah Parker doesn't. This is called a **set difference**:

In [5]:
tim_neely_interests - hannah_parker_interests

{'labor economics', 'medieval literature', 'statistics'}

If we wanted to find the reverse; that is, which interests Dr. Hannah Parker has that aren't in Dr. Tim Neely's interests, we would do this:

In [6]:
hannah_parker_interests - tim_neely_interests

{'number theory', 'topology'}

You can visualize this by comparing these two set differences. They are indeed different:

!["Set difference"](data/images/set_difference_1.png)
!["Set difference reversed"](data/images/set_difference_2.png)
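
As a quick sanity check, the size of a set difference should equal the size of the first set minus the number of shared elements. The sketch below verifies this using the `.difference()` method, which is the method form of the `-` operator.

```python
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}

only_tim = tim_neely_interests.difference(hannah_parker_interests)
only_hannah = hannah_parker_interests.difference(tim_neely_interests)
shared = tim_neely_interests & hannah_parker_interests

# The order of the operands matters
print(only_tim == only_hannah)  # False

# Elements unique to one person = all of their elements minus the shared ones
print(len(only_tim) == len(tim_neely_interests) - len(shared))        # True
print(len(only_hannah) == len(hannah_parker_interests) - len(shared))  # True
```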

Now, we know that Dr. Tim Neely is interested in labor economics, medieval literature and statistics, while Dr. Hannah Parker isn't. And we know that Dr. Hannah Parker is interested in number theory and topology, while Dr. Tim Neely isn't. Therefore, the set of all the interests that are not common between those two colleagues is:

* `labor economics`
* `medieval literature`
* `statistics`
* `number theory`
* `topology`

This is called the **symmetric difference** between the two sets. It is the result of joining together both set differences. In our circle intersection diagram, this would look like this:

!["Symmetric difference"](data/images/symmetric_difference.png)

And this would be the code (we use a caret, `^`, as the operator):

In [7]:
hannah_parker_interests ^ tim_neely_interests

{'labor economics',
 'medieval literature',
 'number theory',
 'statistics',
 'topology'}
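
To confirm that the symmetric difference really is the two set differences joined together, you can build it by hand and compare. This sketch uses the `|` operator to join sets; it is the union operator, covered at the end of this section.

```python
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}

symmetric = hannah_parker_interests ^ tim_neely_interests

# Join the two one-sided differences
joined_differences = (hannah_parker_interests - tim_neely_interests) | (
    tim_neely_interests - hannah_parker_interests
)

print(symmetric == joined_differences)  # True

# Unlike the plain difference, the order of the operands doesn't matter
print(symmetric == tim_neely_interests ^ hannah_parker_interests)  # True
```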

Let's now find how many interests each academic has in common with every other academic:

In [8]:
hp_tn = (hannah_parker_interests & tim_neely_interests)
print("Dr. Hannah Parker and Dr. Tim Neely:", len(hp_tn))

hp_mh = (hannah_parker_interests & matt_harford_interests)
print("Dr. Hannah Parker and Dr. Matt Harford:", len(hp_mh))

hp_mp = (hannah_parker_interests & mithuna_patel_interests)
print("Dr. Hannah Parker and Dr. Mithuna Patel:", len(hp_mp))

tn_mh = (tim_neely_interests & matt_harford_interests)
print("Dr. Tim Neely and Dr. Matt Harford:", len(tn_mh))

tn_mp = (tim_neely_interests & mithuna_patel_interests)
print("Dr. Tim Neely and Dr. Mithuna Patel:", len(tn_mp))

mh_mp = (matt_harford_interests & mithuna_patel_interests)
print("Dr. Matt Harford and Dr. Mithuna Patel:", len(mh_mp))

Dr. Hannah Parker and Dr. Tim Neely: 1
Dr. Hannah Parker and Dr. Matt Harford: 2
Dr. Hannah Parker and Dr. Mithuna Patel: 0
Dr. Tim Neely and Dr. Matt Harford: 2
Dr. Tim Neely and Dr. Mithuna Patel: 0
Dr. Matt Harford and Dr. Mithuna Patel: 0
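
Writing out every pair by hand gets tedious as the number of applicants grows. One possible alternative, sketched below, is to let `itertools.combinations` from the standard library generate the pairs. The `applicants` list and the `pair_counts` variable are names introduced here for illustration, and the example relies on lists and tuples, which are covered later in this case.

```python
from itertools import combinations

# Each entry pairs a display name with that person's set of interests
applicants = [
    ("Dr. Hannah Parker", {"topology", "applied mathematics", "number theory"}),
    ("Dr. Tim Neely", {"statistics", "labor economics", "applied mathematics", "medieval literature"}),
    ("Dr. Matt Harford", {"software development", "statistics", "applied mathematics", "number theory"}),
    ("Dr. Mithuna Patel", {"quantum mechanics", "general relativity", "quantum computing"}),
]

pair_counts = []
# combinations(applicants, 2) yields every unordered pair exactly once
for (name_a, interests_a), (name_b, interests_b) in combinations(applicants, 2):
    count = len(interests_a & interests_b)
    pair_counts.append(count)
    print(f"{name_a} and {name_b}: {count}")
```

With four applicants this produces the same six lines as the manual version.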


To wrap up this section, let's print the set of all the interests that the applicants reported. This would be the **union** of the sets, which means the set that includes all the elements that are in any of the sets. The union of two sets would be represented by this diagram:

![Union](data/images/union.png)

Or, if they happen to be disjoint (i.e. have no intersection), by this diagram:

![Union disjoint](data/images/union_disjoint.png)

This is the code to get the union of all four sets of interests (we use a vertical bar, `|`, as our operator):

In [9]:
hannah_parker_interests | tim_neely_interests | matt_harford_interests | mithuna_patel_interests

{'applied mathematics',
 'general relativity',
 'labor economics',
 'medieval literature',
 'number theory',
 'quantum computing',
 'quantum mechanics',
 'software development',
 'statistics',
 'topology'}
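
If the sets are stored in a collection, the union can also be accumulated in a loop instead of chaining `|` by hand. This is only a sketch; `all_sets` and `all_interests` are names made up for this example, and Dr. Mithuna Patel's set includes `quantum computing` as added earlier.

```python
hannah_parker_interests = {"topology", "applied mathematics", "number theory"}
tim_neely_interests = {"statistics", "labor economics", "applied mathematics", "medieval literature"}
matt_harford_interests = {"software development", "statistics", "applied mathematics", "number theory"}
mithuna_patel_interests = {"quantum mechanics", "general relativity", "quantum computing"}

all_sets = [hannah_parker_interests, tim_neely_interests, matt_harford_interests, mithuna_patel_interests]

# Start from the empty set and merge each person's interests into it
all_interests = set()
for interests in all_sets:
    all_interests = all_interests | interests

chained = hannah_parker_interests | tim_neely_interests | matt_harford_interests | mithuna_patel_interests
print(all_interests == chained)  # True

# sorted() returns a list, which is convenient for a stable display order
print(sorted(all_interests))
```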

## Contact details

The applicants were asked to provide their phone numbers and emails *in order of priority*, which means that sets wouldn't be enough to encode all the information we want to keep (sets ignore order altogether). One option is to use **lists** instead.

A list is created just like a set, with the difference that instead of curly braces (`{` and `}`), we use square brackets (`[` and `]`). The elements of a list are accessed by indicating their numerical position in the list, *starting from zero* (not from one!). Thus, to get the second element of the following list, we'd write:

In [1]:
my_list = ["One", "Two", "Three", "Four", "Four"]
my_list[1]

'Two'

This doesn't output `One` (the first element of the list) as you might expect, but rather it outputs `Two` (the second element of the list). Additionally, lists do allow duplicates, as you can verify (`Four` appears twice):

In [2]:
my_list

['One', 'Two', 'Three', 'Four', 'Four']

You can **slice** a list by defining a start index and an end index and separating them with a colon (`:`), like this:

In [3]:
my_list[1:4] # This retrieves the elements in positions 1, 2 and 3 (i.e., 4 is not included)

['Two', 'Three', 'Four']
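
Either side of the colon can be left out, and an optional third number sets the step. The sketch below shows a few common variations on the same `my_list`.

```python
my_list = ["One", "Two", "Three", "Four", "Four"]

# Omitting the start means "from the beginning"
print(my_list[:2])   # ['One', 'Two']

# Omitting the end means "through the last element"
print(my_list[3:])   # ['Four', 'Four']

# A third number is the step: here, every second element
print(my_list[::2])  # ['One', 'Three', 'Four']

# Slicing always returns a new list, so the original is untouched
list_copy = my_list[:]
list_copy[0] = "Zero"
print(my_list[0])    # One
```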

These are the contact details of the four academics:

* **Dr. Hannah Parker**:
    - Phone: +1 978-385-3123, +1 610-684-1514
    - Email: hannahp@unk.edu, hannah7897@gmail.com
* **Dr. Tim Neely**:
    - Phone: +1 503-831-5725, +1 530-832-3841
    - Email: tim@mij.edu, neelo@outlook.com
* **Dr. Matt Harford**:
    - Phone: +44 07872-179187, +44 01279-877139
    - Email: statistics@nru.ac.uk, mattharx@yahoo.co.uk
* **Dr. Mithuna Patel**:
    - Phone: +61 (08)87150531, +61 (08)90881262
    - Email: mithuna.patel@nua.edu.au, pattmit66@gmail.com
    

Let's code Dr. Hannah Parker's contact details, starting with her phone numbers:

In [4]:
hannah_parker_phones = ["+1 978-385-3123", "+1 610-684-1514"]
hannah_parker_phones

['+1 978-385-3123', '+1 610-684-1514']

We can access her preferred phone number by using the `0` **positional index** (the first element of this list):

In [5]:
hannah_parker_phones[0]

'+1 978-385-3123'

And here's the same for her emails:

In [6]:
hannah_parker_emails = ["hannahp@unk.edu", "hannah7897@gmail.com"]
hannah_parker_emails[0]

'hannahp@unk.edu'

### Exercise 2

Translate Dr. Tim Neely's phone numbers to a Python list and access his second phone number option.

**Answer.**

-------

In [9]:
tim_neely_phones = ["+1 503-831-5725", "+1 530-832-3841"]
tim_neely_phones[1]

'+1 530-832-3841'

An alternative way of retrieving his second phone number option makes use of the fact that it is the last element in the list. In Python, you can use **negative indexes**, which work just like ordinary indexes but count from right to left, starting at `-1` for the last element. Thus, the following gives you the same result:

In [10]:
tim_neely_phones[-1]

'+1 530-832-3841'

If you wanted to get the first element of the list with a negative index, you would use this:

In [11]:
tim_neely_phones[-2] # Equivalent to tim_neely_phones[0] in this case, because the list has 2 elements

'+1 503-831-5725'

Here's the code to translate the remaining data:

In [13]:
tim_neely_emails = ["tim@mij.edu", "neelo@outlook.com"]
matt_harford_phones = ["+44 07872-179187", "+44 01279-877139"]
matt_harford_emails = ["statistics@nru.ac.uk", "mattharx@yahoo.co.uk"]
mithuna_patel_phones = ["+61 (08)87150531", "+61 (08)90881262"]
mithuna_patel_emails = ["mithuna.patel@nua.edu.au", "pattmit66@gmail.com"]

The next step would be to consolidate all these lists into a single directory. For this, you can **nest** them inside another list, thus creating a *list of lists* for each person, like this:

In [17]:
hannah_parker = [hannah_parker_phones, hannah_parker_emails, hannah_parker_interests]
tim_neely = [tim_neely_phones, tim_neely_emails, tim_neely_interests]
matt_harford = [matt_harford_phones, matt_harford_emails, matt_harford_interests]
mithuna_patel = [mithuna_patel_phones, mithuna_patel_emails, mithuna_patel_interests]

# Inspecting hannah_parker
hannah_parker

[['+1 978-385-3123', '+1 610-684-1514'],
 ['hannahp@unk.edu', 'hannah7897@gmail.com'],
 {'applied mathematics', 'number theory', 'topology'}]

When we inspect `hannah_parker`, we see that we have a list, `hannah_parker`, that has two other lists inside, one corresponding to her phone numbers and another one corresponding to her emails. These are nested lists. The third element of the list is a set - the set of interests we created before.
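
One thing worth knowing about nesting is that the outer list stores a reference to each inner list, not a copy. The sketch below illustrates this with a shortened version of `hannah_parker`; the third phone number is a made-up placeholder used only to show the effect.

```python
hannah_parker_phones = ["+1 978-385-3123", "+1 610-684-1514"]
hannah_parker_emails = ["hannahp@unk.edu", "hannah7897@gmail.com"]
hannah_parker = [hannah_parker_phones, hannah_parker_emails]

# The nested list and the original variable are the very same object
print(hannah_parker[0] is hannah_parker_phones)  # True

# Hypothetical placeholder number, not part of the real contact data
hannah_parker_phones.append("+1 000-000-0000")

# The change is visible through the outer list as well
print(len(hannah_parker[0]))  # 3
```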

But now we can go even further and nest these lists again to create our unified directory:

In [18]:
# This is a list of lists that have lists inside!
directory = [
    hannah_parker,
    tim_neely,
    matt_harford,
    mithuna_patel
]

directory

[[['+1 978-385-3123', '+1 610-684-1514'],
  ['hannahp@unk.edu', 'hannah7897@gmail.com'],
  {'applied mathematics', 'number theory', 'topology'}],
 [['+1 503-831-5725', '+1 530-832-3841'],
  ['tim@mij.edu', 'neelo@outlook.com'],
  {'applied mathematics',
   'labor economics',
   'medieval literature',
   'statistics'}],
 [['+44 07872-179187', '+44 01279-877139'],
  ['statistics@nru.ac.uk', 'mattharx@yahoo.co.uk'],
  {'applied mathematics',
   'number theory',
   'software development',
   'statistics'}],
 [['+61 (08)87150531', '+61 (08)90881262'],
  ['mithuna.patel@nua.edu.au', 'pattmit66@gmail.com'],
  {'general relativity', 'quantum computing', 'quantum mechanics'}]]

Now, let's say we wanted to display Dr. Mithuna Patel's primary email. We would have to accomplish this by using positional indexing. First, we know that she is the last element of the directory (therefore, index `3`, or `-1` if you want to use negative indexing), then we see that emails are the second element of her personal data list (index `1`), and finally we notice that primary emails are always first on the emails list (index `0`). The way to access this element would then be:

In [19]:
directory[3][1][0] # Equivalent to directory[-1][1][0] with negative indexing

'mithuna.patel@nua.edu.au'

Notice how we worked hierarchically, starting from the outermost list (the list that nests everything else) and working inwards, creating multiple positional indexes along the way.
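
It can help to take the chained indexes apart one step at a time. The sketch below does so on a trimmed-down `directory` that keeps only phones and emails (the interests sets are omitted for brevity); `person`, `emails`, and `primary_email` are names introduced just for this illustration.

```python
# Trimmed directory: each person is [phones, emails]
directory = [
    [["+1 978-385-3123", "+1 610-684-1514"], ["hannahp@unk.edu", "hannah7897@gmail.com"]],
    [["+1 503-831-5725", "+1 530-832-3841"], ["tim@mij.edu", "neelo@outlook.com"]],
    [["+44 07872-179187", "+44 01279-877139"], ["statistics@nru.ac.uk", "mattharx@yahoo.co.uk"]],
    [["+61 (08)87150531", "+61 (08)90881262"], ["mithuna.patel@nua.edu.au", "pattmit66@gmail.com"]],
]

person = directory[3]      # Step 1: Dr. Mithuna Patel's record (fourth element)
emails = person[1]         # Step 2: her list of emails (second element)
primary_email = emails[0]  # Step 3: the first email in that list

print(primary_email)                        # mithuna.patel@nua.edu.au
print(primary_email == directory[3][1][0])  # True
```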

**Note**. There is another Python collection type called a **tuple**. Tuples are very similar to lists, with the difference that you can't modify them after you've created them (you can do this with lists). If you need to alter the contents of a tuple, you have to create a new one. This property is sometimes useful, but lists are the preferred option in most cases. To create a tuple, you use ordinary parentheses:

In [20]:
my_tuple = ("One", "Two", "Three", "Four", "Four")
my_tuple[1]

'Two'
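
To see immutability in action, the sketch below tries to overwrite an element of `my_tuple` and catches the resulting error. It then builds a new tuple with the changed value; `new_tuple` is a name introduced for this example.

```python
my_tuple = ("One", "Two", "Three", "Four", "Four")

# Item assignment is not allowed on tuples and raises a TypeError
try:
    my_tuple[1] = "Second"
except TypeError as error:
    print("Tuples can't be modified:", error)

# To "change" a tuple, build a new one from slices of the old one
new_tuple = my_tuple[:1] + ("Second",) + my_tuple[2:]
print(new_tuple)  # ('One', 'Second', 'Three', 'Four', 'Four')
print(my_tuple)   # ('One', 'Two', 'Three', 'Four', 'Four')
```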

## Making the directory more user-friendly

The `directory` list is great - it has all the data we need and is readily accessible. However, it has the downside that if you want to retrieve a particular email or phone, or the set of interests of a person, you have to be very familiar with the structure of the list and type very meticulously so as not to pass the wrong index number. For example, in extracting Dr. Mithuna Patel's email we had to correctly pass in three indexes - leaving lots of opportunity for error!

This difficulty can be overcome by using **dictionaries**. Dictionaries are collections in which you can assign **keys** to your elements, so that instead of accessing the items using a positional index, you simply ask Python to get you the item (called a **value**) that corresponds to a given key. Dictionaries can't have duplicate keys, but you can have duplicate values provided that they have different keys. Just like lists, dictionaries are **mutable** (i.e., you can modify their contents without having to recreate the dictionary). Since Python 3.7, dictionaries preserve the order in which their keys were inserted, but you still look items up by key rather than by position.

Dictionaries are created using this syntax:

In [22]:
my_dict = {
    "One":"This is number 1",
    "Two":2,
    "Four":4,
    "Another four":4,
    "Three":["This", "is", "a", "list", "of", "strings"],
}

And its elements are accessed like this:

In [23]:
my_dict["One"]

'This is number 1'
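
What happens when a key doesn't exist, and how do you add or change entries? The sketch below uses a shortened `my_dict` to show the behavior of square brackets, the `.get()` method, and assignment.

```python
my_dict = {"One": "This is number 1", "Two": 2}

# Square brackets raise a KeyError when the key is missing
try:
    my_dict["Five"]
except KeyError as error:
    print("Missing key:", error)

# .get() returns None (or a default you choose) instead of raising
print(my_dict.get("Five"))               # None
print(my_dict.get("Five", "not found"))  # not found

# Assigning to a new key creates it; assigning to an existing key replaces its value
my_dict["Five"] = 5
my_dict["Two"] = "two"
print(my_dict)  # {'One': 'This is number 1', 'Two': 'two', 'Five': 5}
```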

So let's make `directory` a dictionary (observe how we nest dictionaries inside the dictionary):

In [24]:
directory_dict ={
    "hannah_parker":{ # The key is hannah_parker and the value is another dictionary
        "phones":hannah_parker_phones,
        "emails":hannah_parker_emails,
        "interests":hannah_parker_interests,
    },
    "tim_neely":{
        "phones":tim_neely_phones,
        "emails":tim_neely_emails,
        "interests":tim_neely_interests,
    },
    "matt_harford":{
        "phones":matt_harford_phones,
        "emails":matt_harford_emails,
        "interests":matt_harford_interests,
    },
    "mithuna_patel":{
        "phones":mithuna_patel_phones,
        "emails":mithuna_patel_emails,
        "interests":mithuna_patel_interests,
    },
}

Now it's a lot easier to display Dr. Mithuna Patel's primary email address. We first determine the key of the person's contact data (`mithuna_patel`), then we find the key of the email data (`emails`) and then we access the 0th index of the resulting list (we coded `emails` as a nested list inside the dictionary, to preserve the order of the elements):

In [25]:
directory_dict["mithuna_patel"]["emails"][0]

'mithuna.patel@nua.edu.au'
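
Because every person's record uses the same keys, it becomes easy to process the whole directory in a loop. Here is a minimal sketch that collects everyone's primary email; it uses a trimmed `directory_dict` containing only the `emails` key, and `primary_emails` is a name introduced for this example.

```python
# Trimmed version of directory_dict with emails only
directory_dict = {
    "hannah_parker": {"emails": ["hannahp@unk.edu", "hannah7897@gmail.com"]},
    "tim_neely": {"emails": ["tim@mij.edu", "neelo@outlook.com"]},
    "matt_harford": {"emails": ["statistics@nru.ac.uk", "mattharx@yahoo.co.uk"]},
    "mithuna_patel": {"emails": ["mithuna.patel@nua.edu.au", "pattmit66@gmail.com"]},
}

primary_emails = {}
# .items() yields each key together with its value
for name, record in directory_dict.items():
    primary_emails[name] = record["emails"][0]
    print(name, "->", primary_emails[name])
```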

### Exercise 3

Using `directory_dict`, display Dr. Matt Harford's interests. What kind of data structure is the resulting output?

**Answer.**

-------

In [26]:
directory_dict["matt_harford"]["interests"]
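# It's a set: the curly braces hold plain elements rather than key:value pairs
# (you can confirm this by passing the result to type())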

{'applied mathematics', 'number theory', 'software development', 'statistics'}

## Takeaways

In this case, you learned about some basic data structures in Python and important properties of each of those data structures:

* **Sets**. If order doesn't matter, you can store elements in sets. Sets don't allow for duplicates and don't let you access elements by positional indexing (i.e., by stating the position the element has in the object), because they have no inherent order. You can't modify an element in place, but you can add and remove elements. You can perform various kinds of operations using sets, such as set intersections, set differences, symmetric differences, and set unions - these all fall under set logic.
* **Lists**. Lists are ordered. They allow for duplicates and let you access elements by positional indexing. You can delete, add, or modify elements in a list, without needing to create the list again.
* **Tuples**. These are basically like lists, with the difference that tuples are immutable. This means that if you want to modify the contents of a tuple, you have to create it from scratch with the modified data.
* **Dictionaries**. Dictionaries act a lot like lists (and are mutable too), but you don't access the elements using positional indexes. You use key names instead. Dictionaries can't have duplicate keys, but you can have duplicate values provided that they have different keys. Dictionaries keep their keys in insertion order (Python 3.7 and later), but unlike lists, they don't support positional indexing.

## Cheat sheet of `set`, `list`, `tuple` and `dict`

We can't possibly cover here everything that can be learned about Python data structures, so here's a cheat sheet we've put together for your future reference:

### Set methods

* Initializing a set: `my_set = {13, 47, 56}`
* Initializing an empty set: `my_set = set()`
* Union: `A | B`
* Intersection: `A & B`
* Difference: `A - B` or `B - A` (this operation is not [commutative](https://en.wikipedia.org/wiki/Commutative_property))
* Symmetric difference: `A ^ B`
* Membership test: `item in A`
* Add an element to a set: `A.add(item)`
* Remove an element from a set: `A.remove(item)` (raises a `KeyError` if `item` is not in the set; `A.discard(item)` does not)
* Remove all elements from a set: `A.clear()`
* Returning the size of a set: `len(my_set)`

You can take advantage of the fact that sets don't allow duplicates to remove duplicates from a list by turning it into a set and back into a list again (note that the original order of the elements is not guaranteed to survive the round trip). For instance, if you have a list:

~~~python
list_with_duplicates = [1,2,2,2,3]
~~~

You can do this to get a list with the duplicates removed:

~~~python
list_with_no_duplicates = list( set(list_with_duplicates) )
list_with_no_duplicates
~~~
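
If the original order matters, one commonly used alternative is `dict.fromkeys()`, which relies on dictionaries keeping their keys in insertion order (Python 3.7 and later). The sketch below compares both approaches on a hypothetical unsorted list.

```python
# Hypothetical list whose first occurrences appear in the order 3, 1, 2
list_with_duplicates = [3, 1, 3, 2, 1]

# Via a set: duplicates are removed, but the resulting order is not guaranteed
via_set = list(set(list_with_duplicates))
print(sorted(via_set))  # [1, 2, 3]

# Via dict.fromkeys: duplicates are removed and first-seen order is kept
via_dict = list(dict.fromkeys(list_with_duplicates))
print(via_dict)  # [3, 1, 2]
```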

### List methods

* Creating a list: `my_list = [1, 2, 2, 3]`
* Creating an empty list: `my_list = []`
* Append an element to the end of a list: `my_list.append(item)`
* Append a list to the end of another list: `my_list.extend(another_list)`
* Retrieve an element of the list: `my_list[positional_index]`
* Retrieve a slice of the list: `my_list[initial_position:final_position]`
* Remove an element based on value: `my_list.remove(value)` (this will remove the first element in the list that has a value of `value`)
* Remove an element based on positional index: `del my_list[positional_index]` (can be used with slices as well)
* Retrieve a value and remove it from the list at the same time: `my_list.pop(positional_index)` (the index is optional; `my_list.pop()` removes and returns the last element)
* Remove all the elements from a list: `my_list.clear()`
* Display the index of the first element that has a given value: `my_list.index(value)`
* Count the number of times a value appears in a list: `my_list.count(value)`
* Sort the elements of the list: `my_list.sort()` (to sort Z-A, use `my_list.sort(reverse=True)`)
* Reverse the current order of the elements in a list: `my_list.reverse()`
* Return a copy of the list: `my_list.copy()`
* Returning the length of a list: `len(my_list)`
* Membership test: `item in my_list`
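
A frequent source of confusion is that `.sort()` changes the list in place and returns `None`. If you want a sorted copy and need to keep the original intact, the built-in `sorted()` function is an option. A short sketch with a hypothetical `numbers` list:

```python
numbers = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched
sorted_copy = sorted(numbers)
print(sorted_copy)  # [1, 2, 3]
print(numbers)      # [3, 1, 2]

# .sort() reorders the list itself and returns None
result = numbers.sort()
print(result)   # None
print(numbers)  # [1, 2, 3]
```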

### Tuple methods

* Creating a tuple: `my_tuple = (1,2,2,3)`
* Creating an empty tuple: `my_tuple = ()`
* Creating a tuple with one element: `my_tuple = ("One element",)` (notice the comma)
* Retrieve an element of the tuple: `my_tuple[positional_index]`
* Retrieve a slice of the tuple: `my_tuple[initial_position:final_position]`
* Concatenating two tuples: `tuple1 + tuple2`
* Repeat a tuple several times: `my_tuple * times_to_repeat`
* Returning the length of a tuple: `len(my_tuple)`
* Membership test: `item in my_tuple`

### Dictionary methods

* Creating a dictionary: `my_dict = {"key":"value", "another_key":"another_value",}`
* Retrieving the value that corresponds to a given key: `my_dict["key"]`. Alternatively, `my_dict.get("key")`, which returns `None` instead of raising a `KeyError` when the key is missing
* Deleting an element from the dictionary based on key: `del my_dict["key"]`
* List all the keys in the dictionary: `list(my_dict)`
* Membership test: `key in my_dict`
* Replacing / creating an element: `my_dict["key"] = expression`
* Remove all the elements from a dictionary: `my_dict.clear()`
* Return a copy of the dictionary: `my_dict.copy()`
* Retrieve a value and remove it from the dictionary at the same time: `my_dict.pop("key")`
* Returning the length of a dictionary: `len(my_dict)`
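
The sketch below strings a few of these dictionary operations together on the small example dictionary from the first bullet. The variable names `pairs`, `removed`, and `fallback` are introduced here for illustration.

```python
my_dict = {"key": "value", "another_key": "another_value"}

# .items() lets you loop over keys and values together
pairs = []
for key, value in my_dict.items():
    pairs.append((key, value))
    print(key, "->", value)

# .pop() returns the value and removes the key from the dictionary
removed = my_dict.pop("key")
print(removed)         # value
print("key" in my_dict)  # False

# A second argument to .pop() is returned when the key is missing
fallback = my_dict.pop("missing_key", None)
print(fallback)      # None
print(len(my_dict))  # 1
```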

You can find more information about Python data structures in the [documentation](https://docs.python.org/3/tutorial/datastructures.html).

## Attribution

Venn0001.svg. Wikimedia Commons. Public Domain. https://commons.wikimedia.org/wiki/File:Venn0001.svg

Venn 0000 0001.svg. Wikimedia Commons. Public Domain. https://commons.wikimedia.org/wiki/File:Venn_0000_0001.svg

Venn0010.svg. Wikimedia Commons. Public Domain. https://commons.wikimedia.org/wiki/File:Venn0010.svg

Venn0110.svg. Wikimedia Commons. Public Domain. https://commons.wikimedia.org/wiki/File:Venn0110.svg

Venn0111.svg. Wikimedia Commons. Public Domain. https://commons.wikimedia.org/wiki/File:Venn0111.svg