# Assignment 4a: Data structures (CSV/TSV and JSON)

**Deadline for Assignment 4a+b: Friday, October 14, 2022 (5pm) via Canvas (Assignment 4)** 

* Please name your notebook with the following naming convention: 
  ASSIGNMENT_4a_FIRSTNAME_LASTNAME.ipynb 
* Please submit your complete assignment (4a + 4b) by compressing all your material into **a single .zip file** following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip.  

## Please note that there is a BA and an MA version of Assignment 4b

In case you are not sure about creating a zip file from a folder, please refer to [this guide](https://fossbytes.com/how-to-zip-file-in-windows-mac/) (or any other guide you find online).



If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com. Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1ynQAqPa2CGB02okyyE4F1StytDqpyRoBqUpWfeBqI_Y/edit?usp=sharing), so please check if your question has already been answered. 

In this block, we covered the following chapters about data formats:

- Chapter 16 - Data Formats I (CSV/TSV)
- Chapter 17 - Data Formats II (JSON)
- Chapter 18 - Data Formats III (XML) *only for master-level course*

In this assignment, you will also have to apply your knowledge about containers. If you get stuck, you are likely to find solutions in the chapters about containers (Block 2). 


**Tip**:

It could happen that your code throws a unicode error when you're trying to open one of the files used in this assignment. If this is the case, you can probably solve it by specifying the encoding when reading in the file:

```python
with open('your/file/path', 'r', encoding='utf-8') as infile:
    content = infile.read()  # your code
```
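
To see the effect of the `encoding` argument without touching the assignment data, the sketch below writes a small throwaway file with non-ASCII characters and reads it back. The helper name `read_lines` and the file `demo.tsv` are made up for this illustration.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read all lines of a text file using an explicit encoding."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.read().splitlines()


# Write a small throwaway file that contains non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.tsv")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8") as outfile:
        outfile.write("name\tcity\nZoë\tMálaga\n")
    demo_lines = read_lines(demo_path)

print(demo_lines)
```

Stating the encoding when writing and when reading keeps the result independent of the default encoding of your operating system.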


## Exercise 1: Trump's Facebook Status Updates (CSV/TSV)

In the folder `../Data/csv_data` there is a TSV file called `trump_facebook.tsv` that contains Facebook status updates posted by Donald Trump. It was downloaded from [here](https://www.reddit.com/r/datasets/comments/581hqm/all_of_donald_trumps_facebook_statuses_reaction). Follow the instructions below to read the file and find specific status updates.


### 1a. Write your own function for reading CSV
Write a function called `read_csv()` that has two parameters: 

* **`input_file`** (positional parameter) and 
* **`delimiter`** (keyword parameter with default string `","`). 

The function should read the file and return `status_updates`, which contains the content of the file as a list of dicts. When tested on `../Data/csv_data/trump_facebook.tsv`, the first two status updates should thus be represented as follows:

```
[{'link_name': 'Timeline Photos',
  'num_angrys': '7',
  'num_comments': '543',
  'num_hahas': '17',
  'num_likes': '6178',
  'num_loves': '572',
  'num_reactions': '6813',
  'num_sads': '0',
  'num_shares': '359',
  'num_wows': '39',
  'status_id': '153080620724_10157915294545725',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'status_published': '10/17/2016 20:56:51',
  'status_type': 'photo'},
 {'link_name': '',
  'num_angrys': '5211',
  'num_comments': '3644',
  'num_hahas': '75',
  'num_likes': '26649',
  'num_loves': '487',
  'num_reactions': '33768',
  'num_sads': '191',
  'num_shares': '17653',
  'num_wows': '1155',
  'status_id': '153080620724_10157914483265725',
  'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!",
  'status_published': '10/17/2016 18:00:41',
  'status_type': 'video'}]
```

**DO NOT USE THE CSV MODULE FOR THIS EXERCISE!**
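
The core idea can be tried on a string first, before any file is involved. This is only a minimal sketch: `sample_text` is made-up data that borrows three column names from the file, and the delimiter is assumed to be a tab.

```python
# Made-up sample with the same layout as the TSV file: header line first.
sample_text = "status_id\tstatus_type\tnum_likes\n1\tphoto\t6178\n2\tvideo\t26649\n"

lines = sample_text.splitlines()
header = lines[0].split("\t")

# Pair every column name with the value at the same position in the row.
rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]
print(rows)
```

Note that all values are strings, including the counts, so they need to be converted with `int()` before you compare them numerically.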

In [13]:
def read_csv(input_file, delimiter=","):
    '''
Reads a CSV file and returns a list of dictionaries with the data.

        Parameters:
                input_file (str): Name of the CSV file.
                delimiter (str): Character that separates the data in the CSV file. Defaults to ",".

        Returns:
                output (list): A list of dictionaries where each dictionary represents a line in the CSV file.
'''
    with open(input_file, 'r', encoding='utf-8') as infile:
        contents = infile.readlines()
    headers = contents.pop(0).strip('\n').split(delimiter)
    output = []
    for line in contents:
        dictionary = dict()
        elements = line.strip('\n').split(delimiter)
        for key, value in zip(headers, elements):
            dictionary[key] = value
        output.append(dictionary)
    return output
        

# test your function here
filename = "../Data/csv_data/trump_facebook.tsv"
status_updates = read_csv(filename, delimiter="\t") 
status_updates[0:2]

[{'status_id': '153080620724_10157915294545725',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'link_name': 'Timeline Photos',
  'status_type': 'photo',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_published': '10/17/2016 20:56:51',
  'num_reactions': '6813',
  'num_comments': '543',
  'num_shares': '359',
  'num_likes': '6178',
  'num_loves': '572',
  'num_wows': '39',
  'num_hahas': '17',
  'num_sads': '0',
  'num_angrys': '7'},
 {'status_id': '153080620724_10157914483265725',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 

In case you didn't manage to create the `read_csv()` function, run the following code using the `DictReader()` method from the `csv` module to get the data in the right format for the following exercises:

In [8]:
import csv

filename = "../Data/csv_data/trump_facebook.tsv"
with open(filename, "r") as infile:
    status_updates = []
    csv_reader = csv.DictReader(infile, delimiter='\t')
    for row in csv_reader:
        status_updates.append(row)
status_updates[0:2]

[{'status_id': '153080620724_10157915294545725',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'link_name': 'Timeline Photos',
  'status_type': 'photo',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_published': '10/17/2016 20:56:51',
  'num_reactions': '6813',
  'num_comments': '543',
  'num_shares': '359',
  'num_likes': '6178',
  'num_loves': '572',
  'num_wows': '39',
  'num_hahas': '17',
  'num_sads': '0',
  'num_angrys': '7'},
 {'status_id': '153080620724_10157914483265725',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 

### 1b. Find the status updates with the most responses

Define a function called **`get_update_most_responded_to()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`response_type`** (keyword parameter with default string `"likes"`) 

The function should find the status update that received the most responses of the given type. The response type can be any of the possible reactions to a Facebook status ('angrys', 'comments', 'hahas', etc.: anything that follows the `num_` prefix in the column names). It should return three strings: the **`status_message`**, the **`status_type`** and the **`status_link`** of this particular status update.
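
One possible way to express this search is the built-in `max()` with a `key` function. The sketch below uses made-up data, and the helper name `most_responses` is hypothetical. It also shows why the counts must be converted to integers before comparing.

```python
# Made-up status updates; counts are strings, as they are after reading the TSV.
sample_updates = [
    {"status_message": "first", "status_type": "photo",
     "status_link": "link-1", "num_likes": "9", "num_hahas": "120"},
    {"status_message": "second", "status_type": "video",
     "status_link": "link-2", "num_likes": "100", "num_hahas": "3"},
]


def most_responses(updates, response_type="likes"):
    """Return the update with the highest count for the given response type."""
    key = "num_" + response_type
    return max(updates, key=lambda update: int(update[key]))


best_likes = most_responses(sample_updates)
best_hahas = most_responses(sample_updates, response_type="hahas")

# Without int(), the strings are compared character by character,
# so "9" would wrongly count as larger than "100".
wrong_likes = max(sample_updates, key=lambda update: update["num_likes"])

print(best_likes["status_message"], best_hahas["status_message"])
```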


In [15]:
# Here is some fake data for testing
test_data = [
    {
        'status_message': 'This one has the most likes.',
        'status_type': 'video',
        'status_link': 'www.apple.com',
        'num_likes': '19789000',
        'num_reactions': '33768',
        'num_comments': '3644',
        'num_shares': '17653',
        'num_loves': '487',
        'num_wows': '1155',
        'num_hahas': '75',
    },
    {
        'status_message': 'This one has the most hahas.',
        'status_type': 'photo',
        'status_link': 'www.google.com',
        'num_likes': '19789',
        'num_reactions': '33768',
        'num_comments': '3644',
        'num_shares': '17653',
        'num_loves': '487',
        'num_wows': '1155',
        'num_hahas': '999999999',
    },
    {
        'status_message': 'This one has the most loves.',
        'status_type': 'status',
        'status_link': 'www.twitter.com',
        'num_likes': '19789',
        'num_reactions': '33768',
        'num_comments': '3644',
        'num_shares': '17653',
        'num_loves': '4800000',
        'num_wows': '1155',
        'num_hahas': '99',
    }
]

def get_update_most_responded_to(status_updates, /, response_type="likes"):
    '''
Returns the status update, type, and link of the status update that received the most responses.

        Parameters:
                status_updates (list): A list of dictionaries, where each dictionary represents a status update.
                response_type (str): The type of responses to compare. Defaults to 'likes'.

        Returns:
                status_msg (str): The status message of the status update that received the most responses.
                status_type (str): The type of the status update that received the most responses.
                status_link (str): The link to the status update that received the most responses.
'''
    key = 'num_' + response_type
    max_responses = -1
    max_instance = None
    
    for status in status_updates:
        if int(status[key]) >= max_responses:
            max_responses = int(status[key])
            max_instance = status
            
    if max_instance:
        return max_instance['status_message'], max_instance['status_type'], max_instance['status_link']
    else:
        return None, None, None

# This one should have the most likes
print(get_update_most_responded_to(test_data))

# This one should have the most hahas
print(get_update_most_responded_to(test_data, 'hahas'))

# This one should have the most loves
print(get_update_most_responded_to(test_data, 'loves'))

('This one has the most likes.', 'video', 'www.apple.com')
('This one has the most hahas.', 'photo', 'www.google.com')
('This one has the most loves.', 'status', 'www.twitter.com')


### 1c. Find the longest status updates

Define a function called **`get_longest_update()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`length_type`** (keyword parameter with default string `"tokens"`). 

The function should find the longest update. By default, the function should find the status update that is the longest in terms of number of tokens. Also implement the options to find the longest status update in terms of characters or sentences in the message. These options should be carried out when `length_type` is set to `"characters"` or `"sentences"`.

The function should return the status message (called `'status_message'` in the data structure) of the longest update as a string. 

**Attention**: It is recommended to use NLTK for this exercise. 
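
The three length types can be organized as a dictionary that maps each option to a counting function. The sketch below is an illustration under simplifying assumptions: it uses whitespace splitting and a regular expression as rough stand-ins for `nltk.word_tokenize` and `nltk.sent_tokenize`, and all names in it are hypothetical.

```python
import re


def count_characters(message):
    return len(message)


def count_tokens(message):
    # Rough stand-in for nltk.word_tokenize: split on whitespace only.
    return len(message.split())


def count_sentences(message):
    # Rough stand-in for nltk.sent_tokenize: split on ., ! and ?
    return len([part for part in re.split(r"[.!?]+", message) if part.strip()])


LENGTH_FUNCTIONS = {
    "characters": count_characters,
    "tokens": count_tokens,
    "sentences": count_sentences,
}


def longest_message(messages, length_type="tokens"):
    """Return the message that is longest according to the chosen measure."""
    measure = LENGTH_FUNCTIONS[length_type]
    return max(messages, key=measure)


sample_messages = [
    "One two three four five six",
    "A. B. C.",
    "Supercalifragilisticexpialidocious!!!!",
]

print(longest_message(sample_messages))
print(longest_message(sample_messages, "sentences"))
print(longest_message(sample_messages, "characters"))
```

With this layout, an unknown `length_type` raises a `KeyError` instead of silently returning nothing, which makes typos in the argument easier to spot.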


In [18]:
import nltk
# Data for testing
status_updates = [
    {
        'status_message': "This one definitely has the most tokens because I am just writing whhatever comes to my mind, like what does it really mea to be succesful at something? Is there a way to know?"
    },
    {
        'status_message': 'Here. We. Have. Lots. Of. Short. Sentences.'
    },
    {
        'status_message': 'Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis '
    }
]
def get_longest_update(status_updates, /, length_type="tokens"):
    '''
Returns the longest status update, based on the given length type.

        Parameters:
                status_updates (list): A list of dictionaries, where each dictionary represents a status update.
                length_type (str): The type of length to compare ('characters', 'sentences', or 'tokens'). Defaults to 'tokens'.

        Returns:
                max_instance (str): The status message of the longest status update.
'''
    # your code here
    max_instance = None
    max_num = -1
    
    for instance in status_updates:
        if length_type == 'characters':
            if len(instance['status_message']) >= max_num:
                max_num = len(instance['status_message'])
                max_instance = instance['status_message']
        elif length_type == 'sentences':
            if len(nltk.sent_tokenize(instance['status_message'])) >= max_num:
                max_num = len(nltk.sent_tokenize(instance['status_message']))
                max_instance = instance['status_message']
        elif length_type == 'tokens':
            if len(nltk.word_tokenize(instance['status_message'])) >= max_num:
                max_num = len(nltk.word_tokenize(instance['status_message']))
                max_instance = instance['status_message']
    return max_instance
            

# test your function here
print('Most TOKENS:', get_longest_update(status_updates))
print('Most SENTENCES:', get_longest_update(status_updates, 'sentences'))
print('Most CHARACTERS:', get_longest_update(status_updates, 'characters'))

Most TOKENS: This one definitely has the most tokens because I am just writing whhatever comes to my mind, like what does it really mea to be succesful at something? Is there a way to know?
Most SENTENCES: Here. We. Have. Lots. Of. Short. Sentences.
Most CHARACTERS: Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis Pneumonoultramicroscopicsilicovolcanoconiosis 


### 1d. Find the status updates containing specific keywords

Define a function called **`get_updates_with_keywords()`** that takes three input arguments: 

* **`status_updates`** (mandatory positional argument) 
* **`keywords`** (mandatory positional argument) 
* **`case_sensitive`** (keyword argument with default `False`)

The function should find the status updates that contain **any of the keywords**. The parameter `case_sensitive` should specify whether uppercase and lowercase characters must be treated as distinct. 

The function should return **`filtered_status_updates`**, which is a list of dictionaries with all information about the status updates (same format as the input argument `'status_updates'`). 

**Attention**: It is highly recommended to use NLTK for this exercise. Make sure that you **tokenize** the messages before you look for keywords. 
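
The reason for tokenizing first is that a plain substring test also matches parts of longer words. The sketch below illustrates the difference. It uses a simple regular expression as a stand-in for `nltk.word_tokenize`, and the helper names are hypothetical.

```python
import re


def simple_tokens(message):
    """Split a message into word tokens (rough stand-in for nltk.word_tokenize)."""
    return re.findall(r"\w+", message)


def contains_keyword(message, keyword, case_sensitive=False):
    """Check whether the keyword occurs as a whole token in the message."""
    tokens = simple_tokens(message)
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
        keyword = keyword.lower()
    return keyword in tokens


message = "Johnson is here"
substring_hit = "John" in message              # matches inside "Johnson"
token_hit = contains_keyword(message, "John")  # "John" is not a token here

print(substring_hit, token_hit)
print(contains_keyword("John is here", "john"))
print(contains_keyword("John is here", "john", case_sensitive=True))
```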

In [59]:
def get_updates_with_keywords(status_updates, keywords, /, case_sensitive=False):
    '''
Gets status updates that contain any of the given keywords.

        Parameters:
                status_updates (list of dicts): A list of status updates, where each status update is represented by a dictionary with a 'status_message' key.
                keywords (list of str): A list of keywords to search for in the status messages.
                case_sensitive (bool): If True, the search is case-sensitive. Otherwise, it is not.

        Returns:
                filtered_updates (list of dicts): A list of status updates that contain at least one of the given keywords.
'''
    filtered_updates = []
    if not case_sensitive:
        keywords = [keyword.lower() for keyword in keywords]
    for update in status_updates:
        message = update['status_message']
        tokens = nltk.word_tokenize(message)
        if not case_sensitive:
            tokens = [token.lower() for token in tokens]
        # Keep the update if at least one keyword occurs as a whole token
        if any(keyword in tokens for keyword in keywords):
            filtered_updates.append(update)
    return filtered_updates

test_updates = [
    {'status_message': 'Clinton is here'},
    {'status_message': 'Obama is here'},
    {'status_message': 'Bob is here'},
    {'status_message': 'John is here'}
]

keywords = ["clinton", "obama", "John", 'Benny'] # test with these keywords; also experiment with other keywords
# test your function here
print(f"Case insensitive keywords: {keywords} ->",get_updates_with_keywords(test_updates, keywords, case_sensitive=False))

print(f"Case sensitive keywords: {keywords} ->",get_updates_with_keywords(test_updates, keywords, case_sensitive=True))

Case insensitive keywords: ['clinton', 'obama', 'John', 'Benny'] -> [{'status_message': 'Clinton is here'}, {'status_message': 'Obama is here'}, {'status_message': 'John is here'}]
Case sensitive keywords: ['clinton', 'obama', 'John', 'Benny'] -> [{'status_message': 'John is here'}]


## Exercise 2: Nobel Prize Winners (JSON)

There is a lot of interesting data online. For example, the [Nobel Prize Organisation](https://www.nobelprize.org) provides the [Nobel Prize API](https://nobelprize.readme.io) that allows you to download information about the prizes, the laureates and the countries. 

The information is formatted in JSON. Have a look at the following URLs:
- http://api.nobelprize.org/v1/prize.json
- http://api.nobelprize.org/v1/laureate.json
- http://api.nobelprize.org/v1/country.json

For this exercise, we will only look at the prizes and the laureates. 

We can download the data using the `requests` module. How this works is shown below.

In [28]:
import requests

In [29]:
# Download data on prizes
api_url = "http://api.nobelprize.org/v1/prize.json"
r = requests.get(api_url)
dict_prizes = r.json()
# uncomment the line below if you'd like to see what's inside dict_prizes
#dict_prizes 

In [30]:
# Download data on laureates
api_url = "http://api.nobelprize.org/v1/laureate.json"
r = requests.get(api_url)
dict_laureates = r.json()
# uncomment the line below if you'd like to see what's inside dict_laureates
#dict_laureates 

### 2a. Read the JSON files

We have already stored the data as the JSON files `laureate.json` and `prize.json` in the folder `../Data/json_data/NobelPrize`. Open these JSON files and load them as the Python dictionaries `dict_laureates` and `dict_prizes`.
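
If you want to get a feeling for the structure before opening the files, you can parse a small JSON string with `json.loads()`. The sample below is an assumption modelled on the keys used later in this exercise (`prizes`, `year`, `category`, `laureates`, `firstname`, `surname`); the real files contain more fields.

```python
import json

# Hand-written sample that imitates the layout of prize.json (assumption).
sample_json = """
{"prizes": [
    {"year": "2018",
     "category": "peace",
     "laureates": [
        {"firstname": "Denis", "surname": "Mukwege"},
        {"firstname": "Nadia", "surname": "Murad"}
     ]}
]}
"""

data = json.loads(sample_json)  # json.load() does the same for an open file
prizes = data["prizes"]         # the top level is a dict, the value is a list

print(type(data).__name__, type(prizes).__name__)
print(prizes[0]["year"], prizes[0]["category"])
```

In this sample the year is stored as a string, so it has to be converted (or compared to a string) when you filter on it.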

In [42]:
# load laureate.json and prize.json here
import json
with open('../Data/json_data/NobelPrize/laureate.json', 'rb') as infile:
    dict_laureates = json.load(infile)['laureates']

with open('../Data/json_data/NobelPrize/prize.json', 'rb') as infile:
    dict_prizes = json.load(infile)['prizes']

### 2b. Get all laureates from year and category

Create a function called **`get_laureates()`** that has three parameters: 

* **`dict_prizes`** (positional parameter) 
* **`year`** (keyword parameter with default `None`) 
* **`category`** (keyword parameter with default `None`) 

The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category (specified using the keywords `year` and `category`). It should return a list of the full names of the laureates.

For example, for the year 2018 and category "peace" it should return the list `['Denis Mukwege', 'Nadia Murad']`.

In [60]:
def get_laureates(dict_prizes, /, year=None, category=None):
    '''
Returns a list of laureates from the Nobel Prize API. Defaults to returning all laureates from all years and all categories.

    Parameters:
            dict_prizes (dict): The dictionary of prizes from the Nobel Prize API
            year (int): An optional parameter to specify which year to return laureates from
            category (str): An optional parameter to specify which category to return laureates from

    Returns:
            people (list): A list of laureates
'''
    people = []
    for prize in dict_prizes:
        # Skip prizes that do not match the requested year and/or category
        if year is not None and int(prize['year']) != year:
            continue
        if category is not None and prize['category'] != category:
            continue
        # Guard against entries without a 'laureates' or 'surname' key
        for laureate in prize.get('laureates', []):
            names = [laureate.get('firstname', ''), laureate.get('surname', '')]
            people.append(' '.join(name for name in names if name))
    return people

year = 2018
category = "peace"
# test your function here
get_laureates(dict_prizes, year=2018, category='peace')

['Denis Mukwege', 'Nadia Murad']

### 2c. Get all prizes from affiliations

Create a function called **`get_affiliation_prizes()`** that takes one input parameter: 

* **`dict_laureates`** (positional parameter) 

The function should find all affiliations that were involved in winning the Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:

```
{
    "A.F. Ioffe Physico-Technical Institute": [
        {"category": "physics", "year": "2000"}
    ],
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics","year": "2010"}
    ]
}
```

**Tip:** some of the entries will lack information (for example, there is no associated affiliation). Use `if-statements` to check if essential information is present. 

**General tip for working with data**: If your code breaks, check whether your assumptions about the data hold (very often, they unfortunately do not). For instance, a dictionary key you thought was always present may be missing from a couple of dictionaries. 
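
The tips above can be combined with `dict.get()` and `dict.setdefault()`, which avoid a `KeyError` when a key is missing. The sketch below is only an illustration on hand-written sample data: the nesting is an assumption about `laureate.json`, and the second entry imitates an affiliation list that contains an empty list instead of a dictionary.

```python
# Hand-written sample laureates (assumption about the structure of the file).
sample_laureates = [
    {"prizes": [{"category": "physics", "year": "1902",
                 "affiliations": [{"name": "Leiden University"}]}]},
    {"prizes": [{"category": "peace", "year": "1901",
                 "affiliations": [[]]}]},  # no usable affiliation
    {"prizes": [{"category": "physics", "year": "1913",
                 "affiliations": [{"name": "Leiden University"}]}]},
]

grouped = {}
for laureate in sample_laureates:
    for prize in laureate.get("prizes", []):
        for affiliation in prize.get("affiliations", []):
            # Skip entries that are not dictionaries or that lack a name.
            if not isinstance(affiliation, dict) or "name" not in affiliation:
                continue
            entry = {"category": prize.get("category"), "year": prize.get("year")}
            # setdefault creates the list on first use, then appends to it.
            grouped.setdefault(affiliation["name"], []).append(entry)

print(grouped)
```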

In [94]:
def get_affiliation_prizes(dict_laureates):
    '''
Returns a dictionary of universities and their prizes.

Parameters:
    dict_laureates (dict): A dictionary of laureates with their prizes

Returns:
    universities (dict): A dictionary of universities with their prizes
'''
    universities = dict()
    for laureate in dict_laureates:
        for prize in laureate['prizes']:
            p = {}
            if 'category' in prize:
                p['category'] = prize['category']
            if 'year' in prize:
                p['year'] = prize['year']
            if p == {}:
                continue
                
            if 'affiliations' in prize:
                for affiliation in prize['affiliations']:
                    # Some entries are empty lists instead of dictionaries
                    if 'name' in affiliation:
                        if affiliation['name'] not in universities:
                            universities[affiliation['name']] = [p]
                        else:
                            universities[affiliation['name']].append(p)
    return universities
            
        
affiliations = get_affiliation_prizes(dict_laureates)
affiliations

{'Munich University': [{'category': 'physics', 'year': '1901'},
  {'category': 'chemistry', 'year': '1905'},
  {'category': 'chemistry', 'year': '1915'},
  {'category': 'chemistry', 'year': '1927'}],
 'Leiden University': [{'category': 'physics', 'year': '1902'},
  {'category': 'physics', 'year': '1913'},
  {'category': 'medicine', 'year': '1924'}],
 'Amsterdam University': [{'category': 'physics', 'year': '1902'},
  {'category': 'physics', 'year': '1910'}],
 'École Polytechnique': [{'category': 'physics', 'year': '1903'},
  {'category': 'physics', 'year': '2018'}],
 'École municipale de physique et de chimie industrielles (Municipal School of Industrial Physics and Chemistry)': [{'category': 'physics',
   'year': '1903'}],
 'Sorbonne University': [{'category': 'chemistry', 'year': '1911'},
  {'category': 'physics', 'year': '1908'},
  {'category': 'physics', 'year': '1926'},
  {'category': 'chemistry', 'year': '1906'},
  {'category': 'medicine', 'year': '1913'},
  {'category': 'peace',

### 2d. Write to JSON

Next, write the dictionary created in the previous exercise to a JSON file using the following path: 

`../Data/json_data/NobelPrize/nobel_prizes_affiliations.json`.

In [92]:
# write the resulting dictionary to 'json_file'
with open('../Data/json_data/NobelPrize/nobel_prizes_affiliations.json', 'w') as outfile:
    json.dump(affiliations, outfile)
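
To check that a written file can be read back, you can do a small round trip. The sketch below writes a sample dictionary to a temporary folder rather than to the assignment path. The arguments `indent` and `ensure_ascii=False` are optional additions that make the file easier to read and keep characters such as "É" as they are; the file name is made up.

```python
import json
import os
import tempfile

# Small sample in the same format as the affiliations dictionary.
sample_affiliations = {
    "École Polytechnique": [{"category": "physics", "year": "1903"}]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_affiliations.json")  # hypothetical name
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(sample_affiliations, outfile, indent=4, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as infile:
        reloaded = json.load(infile)

print(reloaded == sample_affiliations)
```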