# Assignment 4a: Data structures (CSV/TSV and JSON)


* Please name your notebook with the following naming convention: 
  ASSIGNMENT_4a_FIRSTNAME_LASTNAME.ipynb 
* Please submit your complete assignment (4a + 4b) by compressing all your material into **a single .zip file** following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip.  

## Please note that there is a BA and an MA version of Assignment 4b

In case you are not sure about creating a zip file from a folder, please refer to [this guide](https://fossbytes.com/how-to-zip-file-in-windows-mac/) (or any other guide you find online).



If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com. Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1ynQAqPa2CGB02okyyE4F1StytDqpyRoBqUpWfeBqI_Y/edit?usp=sharing), so please check if your question has already been answered. 

In this block, we covered the following chapters about data formats:

- Chapter 16 - Data Formats I (CSV/TSV)
- Chapter 17 - Data Formats II (JSON)
- Chapter 18 - Data Formats III (XML) *only for master-level course*

In this assignment, you will also have to apply your knowledge about containers. If you get stuck, you are likely to find solutions in the chapters about containers (Block 2). 


**Tip**:

It could happen that your code throws a Unicode error when you try to open one of the files used in this assignment. If this is the case, you can probably solve it by specifying the encoding when reading the file:

```python
with open("your/file/path", "r", encoding="utf-8") as infile:
    content = infile.read()  # your code
```
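
To illustrate why the `encoding` argument matters, here is a minimal sketch. It writes a small made-up sample (not taken from the assignment data) to a temporary file as UTF-8 and then reads it back twice: once with the matching encoding and once with a different one.

```python
import os
import tempfile

# Hypothetical sample content with non-ASCII characters
sample_text = "status_id\tstatus_message\n1\tCafé naïve\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.tsv")

    # Write the sample to disk as UTF-8
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(sample_text)

    # Reading with the same encoding recovers the text exactly
    with open(path, "r", encoding="utf-8") as infile:
        content = infile.read()

    # Reading the same bytes with another encoding garbles the accented letters
    with open(path, "r", encoding="latin-1") as infile:
        garbled = infile.read()

print(content == sample_text)
print(garbled == sample_text)
```

When no encoding is given, Python falls back to a platform-dependent default, which is why the same notebook can work on one machine and fail on another.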


## Exercise 1: Trump's Facebook Status Updates (CSV/TSV)

In the folder `../Data/csv_data` there is a TSV file called `trump_facebook.tsv` that contains Facebook status updates posted by Donald Trump. It was downloaded from [here](https://www.reddit.com/r/datasets/comments/581hqm/all_of_donald_trumps_facebook_statuses_reaction). Follow the instructions below to read the file and find specific status updates.


### 1a. Write your own function for reading CSV
Write a function called `read_csv()` that has two parameters: 

* **`input_file`** (positional parameter) and 
* **`delimiter`** (keyword parameter with default string `","`). 

The function should read the file and return `status_updates`, which contains the content of the file as a 'list of dicts'. When tested on `../Data/csv_data/trump_facebook.tsv`, the first two status updates should be represented as follows:

```
[{'link_name': 'Timeline Photos',
  'num_angrys': '7',
  'num_comments': '543',
  'num_hahas': '17',
  'num_likes': '6178',
  'num_loves': '572',
  'num_reactions': '6813',
  'num_sads': '0',
  'num_shares': '359',
  'num_wows': '39',
  'status_id': '153080620724_10157915294545725',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'status_published': '10/17/2016 20:56:51',
  'status_type': 'photo'},
 {'link_name': '',
  'num_angrys': '5211',
  'num_comments': '3644',
  'num_hahas': '75',
  'num_likes': '26649',
  'num_loves': '487',
  'num_reactions': '33768',
  'num_sads': '191',
  'num_shares': '17653',
  'num_wows': '1155',
  'status_id': '153080620724_10157914483265725',
  'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!",
  'status_published': '10/17/2016 18:00:41',
  'status_type': 'video'}]
```

**DO NOT USE THE CSV MODULE FOR THIS EXERCISE!**
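
As a rough sketch of the idea behind a 'list of dicts', the snippet below splits a tiny made-up tab-separated string (the column names and values are hypothetical) into rows and cells, and pairs each cell with its header using `zip()`.

```python
# Hypothetical two-row sample in tab-separated format
sample = "name\tcolor\napple\tred\nbanana\tyellow"

# Split the text into rows, then split the first row into column names
rows = sample.split("\n")
header = rows[0].split("\t")

records = []
for row in rows[1:]:
    cells = row.split("\t")
    # zip() pairs each header with the cell in the same position
    records.append(dict(zip(header, cells)))

print(records)
# [{'name': 'apple', 'color': 'red'}, {'name': 'banana', 'color': 'yellow'}]
```

Note that every value is still a string at this point; numbers have to be converted explicitly when they are needed.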

In [1]:
def read_csv(input_file, /, delimiter=","):
    """
    Reads in a given file and returns the content as a list of dictionaries.
    
    :param input_file: the input file you would like to read
    :param delimiter: the character on which the data is separated (default is ",")
    :return: a list of dictionaries containing the status updates
    """
    
    # Create an empty list
    status_updates = []
    
    # Read the file and assign its content as a string to the variable 'data'
    with open(input_file, "r", encoding="utf-8") as infile:
        data = infile.read()
    
    # Splitting the string on the newline character, creating a list of 
    # strings where each item in the list is a row from the original file
    data_list = data.split('\n')
    
    # Taking the first row of the data_list as this is the header
    header_row = data_list[0]
    
    # Splitting the header row on the delimiter 
    header_row_list = header_row.split(delimiter)
    
    # Excluding the header, looping over the remaining rows
    for row in data_list[1:]:
        # Skip empty rows, such as the one produced by a trailing newline
        if not row:
            continue
        
        # Split the row into a list where each item in the list is a cell value
        row_list = row.split(delimiter)
        
        # Create empty dictionary
        row_dict = {}
        
        # Using zip() to simultaneously iterate over two lists
        for header, cell_value in zip(header_row_list, row_list):
            # Create a key value pair for each cell value and its corresponding header
            row_dict[header] = cell_value
        
        # Append the row_dict to the list of status updates 
        status_updates.append(row_dict)
    
    # Return the list of dictionaries
    return status_updates

# test your function here
filename = "../Data/csv_data/trump_facebook.tsv"
status_updates = read_csv(filename, delimiter="\t") 
status_updates[0:2]

[{'status_id': '153080620724_10157915294545725',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'link_name': 'Timeline Photos',
  'status_type': 'photo',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_published': '10/17/2016 20:56:51',
  'num_reactions': '6813',
  'num_comments': '543',
  'num_shares': '359',
  'num_likes': '6178',
  'num_loves': '572',
  'num_wows': '39',
  'num_hahas': '17',
  'num_sads': '0',
  'num_angrys': '7'},
 {'status_id': '153080620724_10157914483265725',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 

In case you didn't manage to create the `read_csv()` function, run the following code, which uses the `DictReader` class from the `csv` module, to get the data in the right format for the following exercises:

In [2]:
import csv

filename = "../Data/csv_data/trump_facebook.tsv"
with open(filename, "r", newline="", encoding="utf-8") as infile:
    status_updates = []
    csv_reader = csv.DictReader(infile, delimiter='\t')
    for row in csv_reader:
        status_updates.append(row)
status_updates[0:2]

[{'status_id': '153080620724_10157915294545725',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'link_name': 'Timeline Photos',
  'status_type': 'photo',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_published': '10/17/2016 20:56:51',
  'num_reactions': '6813',
  'num_comments': '543',
  'num_shares': '359',
  'num_likes': '6178',
  'num_loves': '572',
  'num_wows': '39',
  'num_hahas': '17',
  'num_sads': '0',
  'num_angrys': '7'},
 {'status_id': '153080620724_10157914483265725',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 

### 1b. Find the status updates with the most responses

Define a function called **`get_update_most_responded_to()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`response_type`** (keyword parameter with default string `"likes"`) 

The function should find the status update that received the highest number of responses of the given type. Any response count in the data can be used (for example 'angrys', 'comments' or 'hahas': anything stored under a key that starts with 'num_'). It should return three strings: the **`status_message`**, the **`status_type`** and the **`status_link`** of this particular status update.
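
One possible way to think about this task, sketched on made-up sample data: the counts are stored as strings, so they have to be converted with `int()` before comparing. The helper `count_of()` and the sample updates are hypothetical and only serve as an illustration.

```python
# Hypothetical sample in the same shape as the status updates
sample_updates = [
    {"status_message": "first", "num_likes": "10", "num_hahas": "3"},
    {"status_message": "second", "num_likes": "250", "num_hahas": "1"},
    {"status_message": "third", "num_likes": "99", "num_hahas": "7"},
]

def count_of(update, response_type):
    """Return the count for a response type as an integer."""
    return int(update[f"num_{response_type}"])

# max() with a key function returns the update with the highest count
most_liked = max(sample_updates, key=lambda update: count_of(update, "likes"))
most_hahas = max(sample_updates, key=lambda update: count_of(update, "hahas"))

# Comparing the raw strings instead gives a misleading result
largest_as_string = max(update["num_likes"] for update in sample_updates)

print(most_liked["status_message"])  # second
print(most_hahas["status_message"])  # third
print(largest_as_string)             # 99
```

The last line shows why the conversion matters: as strings, `"99"` sorts after `"250"` because the comparison is done character by character.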


In [3]:
def get_update_most_responded_to(status_updates, /, response_type="likes"):
    """
    Finds the status update that received the highest number of reactions of a 
    specified response type. 
    
    :param status_updates: a list of dicts of the status updates
    :param response_type: the response type you are interested in (default is "likes")
    :return: the status message, type and link (as strings) of the most responded 
    status update
    """
    
    # Creating variable to keep track of the highest response value
    highest_response_value = 0
    
    # Creating variable to store the status update with the highest response value
    most_responded_status_update = None
    
    # Looping over each status update
    for status_update in status_updates:
        # Getting the current response value and converting it to an integer 
        str_current_response_value = status_update.get(f"num_{response_type}")
        int_current_response_value = int(str_current_response_value)
        
        # Comparing the current value to the highest value and updating the variables if it's higher
        if int_current_response_value > highest_response_value:
            highest_response_value = int_current_response_value
            most_responded_status_update = status_update
    
    # Getting the message, type and link of the status update with the highest response value
    status_message = most_responded_status_update.get('status_message')
    status_type = most_responded_status_update.get('status_type')
    status_link = most_responded_status_update.get('status_link')
    
    # Returning the message, type and link
    return status_message, status_type, status_link
    
# Testing the output of the function for different response types
status_update_most_likes = get_update_most_responded_to(status_updates, "likes")
print(f"Status update with the most 'likes':\n{status_update_most_likes}")
print()

status_update_most_angrys = get_update_most_responded_to(status_updates, "angrys")
print(f"Status update with the most 'angrys':\n{status_update_most_angrys}")
print()

status_update_most_hahas = get_update_most_responded_to(status_updates, "hahas")
print(f"Status update with the most 'hahas':\n{status_update_most_hahas}")

Status update with the most 'likes':
('Stop congratulating Obama for killing Bin Laden. The Navy Seals killed Bin Laden.', 'status', '')

Status update with the most 'angrys':
('Happy Cinco de Mayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics!', 'photo', 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157008375200725/?type=3')

Status update with the most 'hahas':
("The media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails.", 'status', '')


### 1c. Find the longest status updates

Define a function called **`get_longest_update()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`length_type`** (keyword parameter with default string `"tokens"`). 

The function should find the longest update. By default, the function should find the status update that is the longest in terms of number of tokens. Also implement the options to find the longest status update in terms of characters or sentences in the message. These options should be used when `length_type` is set to `"sentences"` or `"characters"`.

The function should return the status message (called `'status_message'` in the data structure) of the longest update as a string. 

**Attention**: It is recommended to use NLTK for this exercise. 
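
The three length types can pick different messages, as the sketch below illustrates on made-up examples. To keep it free of extra downloads, it uses simple stand-ins rather than NLTK: whitespace splitting for tokens and a naive regular expression for sentences. The helper `measure()` is hypothetical.

```python
import re

def measure(message, length_type="tokens"):
    """Hypothetical helper: length of a message under a given length type."""
    if length_type == "tokens":
        # Whitespace tokens, a rough stand-in for word_tokenize
        return len(message.split())
    if length_type == "sentences":
        # Naive split after '.', '!' or '?', a rough stand-in for sent_tokenize
        return len([s for s in re.split(r"(?<=[.!?])\s+", message) if s])
    if length_type == "characters":
        return len(message)
    raise ValueError(f"Unknown length_type: {length_type!r}")

# Made-up messages chosen so that each length type has a different winner
messages = [
    "Go. Now. Run!",
    "Extraordinarily long words everywhere",
    "a b c d e",
]

longest = {}
for length_type in ("tokens", "sentences", "characters"):
    longest[length_type] = max(messages, key=lambda m: measure(m, length_type))
    print(length_type, "->", longest[length_type])
```

Raising an error for an unknown `length_type` makes a typo in the argument visible immediately instead of producing a confusing failure later.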


In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

def get_longest_update(status_updates, /, length_type="tokens"):
    """
    Finds the longest status update based on a specific type.
    
    :param status_updates: a list of dicts of the status updates
    :param length_type: the type for which you want to count (default is "tokens")
    :return: the status message of the longest status update
    """
    # Creating variables to keep track of the length of the longest update 
    # and its corresponding status message
    length_longest_update = 0
    status_message_longest_update = ''

    # Looping over the status updates
    for status_update in status_updates:
        
        # Getting the message of the current status update
        current_message = status_update.get("status_message")
        
        # Determining the length of the current update based on the specified type
        if length_type == "tokens":
            tokens = word_tokenize(current_message)
            length_current_update = len(tokens)
        elif length_type == "sentences":
            sentences = sent_tokenize(current_message)
            length_current_update = len(sentences)
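        elif length_type == "characters": 
            length_current_update = len(current_message)
        else:
            # Any other value would leave length_current_update undefined
            raise ValueError(f"Unknown length_type: {length_type}")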
        
        # Comparing the length of the current update to the length of the longest update
        if length_current_update > length_longest_update:
            # If current update is longer adjust the variables accordingly
            length_longest_update = length_current_update
            status_message_longest_update = current_message

    # Return the status message of the longest update
    return status_message_longest_update

# Testing the output of the function for the length type "tokens"
status_update_longest_token = get_longest_update(status_updates, "tokens")
print(status_update_longest_token)

***Message from Eric Trump*** Before my father takes the stage to face Hillary Clinton, I'll be giving him a list of supporters who made a contribution just before the big debate.   Add your name to the list: http://bit.ly/2duCRlw   Please contribute $100, $65, $35, $20, $15, or even $3 before 8pm ET tonight to get your name on the list of supporters I give him before he takes the stage.  Get on the list here: http://bit.ly/2duCRlw Paula P. James N. Kathy W. Erick M. Curt C. Mark C. Nancy  L. Barbara S. David L. Roy B. Kris M. Daniel S. Daniel D. Eugene L. Caihua W. Ken M. Tommy H. Bill S. Thomas O. Christine W. Dennis J. Erin C. Chad M. Rachel T. Carolyn G. William J. Cindy C. Eugene L. Judy F. Manny C. Edward R. Garry L. Grace B. Boris V. Chris  L. William J. Steven  T. Joann M. Paul S. James E. John P. Marc S. Jim B. Melynda S. Richard S. Jonathan J. Craig O. Ed K. Eileen M. Carmen M. Sherry P. Daniel Mabrey T. Chad B. Ellen  R. Scott P. Keith T. Steven W. Alan A. Don L. Vickie S. C

### 1d. Find the status updates containing specific keywords

Define a function called **`get_updates_with_keywords()`** that takes three input arguments: 

* **`status_updates`** (mandatory positional argument) 
* **`keywords`** (mandatory positional argument) 
* **`case_sensitive`** (keyword argument with default `False`)

The function should find the status updates that contain **any of the keywords**. The parameter `case_sensitive` should specify whether uppercase and lowercase characters must be treated as distinct. 

The function should return **`filtered_status_updates`**, which is a list of dictionaries with all information about the status updates (same format as the input argument `'status_updates'`). 

**Attention**: It is highly recommended to use NLTK for this exercise. Make sure that you **tokenize** the messages before you look for keywords. 
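
To see why tokenizing first matters, consider the sketch below. It uses a simple regular expression as a hypothetical stand-in for a real tokenizer, and a made-up message: searching the raw string finds a keyword hidden inside a longer word, while searching the token list does not.

```python
import re

def simple_tokens(text):
    """Hypothetical stand-in for a tokenizer: runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

message = "Crowds in Birmingham were amazing!"
keyword = "ham"

# Substring search matches 'ham' inside 'Birmingham'
substring_match = keyword in message.lower()

# Token search only matches whole tokens
token_match = keyword in simple_tokens(message.lower())

# Case sensitivity: the lowercase keyword only matches after lowercasing
case_sensitive_match = "birmingham" in simple_tokens(message)
case_insensitive_match = "birmingham" in simple_tokens(message.lower())

print(substring_match, token_match)                  # True False
print(case_sensitive_match, case_insensitive_match)  # False True
```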

In [5]:
def get_updates_with_keywords(status_updates, keywords, /, case_sensitive=False):
    """
    Finds the status updates that contain any of the given keywords.
    
    :param status_updates: a list of dicts of the status updates
    :param keywords: a list of keywords you want to search for in the status updates
    :param case_sensitive: a bool indicating if the search is case sensitive or not (default is False)
    :return: a list of dicts of the status updates containing any of the keywords
    """
    
    # Creating an empty list
    filtered_status_updates = []
    
    # Convert the keywords to lowercase once if case_sensitive is set to False
    if not case_sensitive:
        keywords = [keyword.lower() for keyword in keywords]
    
    # Looping over the status updates
    for status_update in status_updates:
        
        # Getting the message of the current status update
        current_message = status_update.get("status_message")
        
        # Convert the current message to lowercase if case_sensitive is set to False
        if not case_sensitive:
            current_message = current_message.lower()
        
        # Tokenize the current message 
        tokenized_current_message = word_tokenize(current_message)
        
        # Loop over the keywords
        for keyword in keywords:
            if keyword in tokenized_current_message: # Check if keyword is in the message
                filtered_status_updates.append(status_update) # If there is a match, append status update to list
                break # Break out of the loop after the first match is found
    
    # Return the list of filtered status updates
    return filtered_status_updates

# Testing the function for the keywords 'clinton' and 'obama'
keywords = ["clinton", "obama"]
find_updates_clinton_obama = get_updates_with_keywords(status_updates, keywords)

# Only displaying the first 5 status updates as the output would be very long if everything is printed
find_updates_clinton_obama[:5]

[{'status_id': '153080620724_10157912962325725',
  'status_message': 'JournoCash: Media gives $382,000 to Clinton, $14,000 Trump, 27-1 margin:',
  'link_name': 'JournoCash: Media gives $382,000 to Clinton, $14,000 Trump, 27-1 margin',
  'status_type': 'link',
  'status_link': 'http://www.washingtonexaminer.com/journocash-media-gives-382000-to-clinton-14000-trump-27-1-margin/article/2604736',
  'status_published': '10/17/2016 14:17:24',
  'num_reactions': '22696',
  'num_comments': '3665',
  'num_shares': '5082',
  'num_likes': '14029',
  'num_loves': '122',
  'num_wows': '2091',
  'num_hahas': '241',
  'num_sads': '286',
  'num_angrys': '5927'},
 {'status_id': '153080620724_10157911739680725',
  'status_message': 'Voter fraud! Crooked Hillary Clinton even got the questions to a debate, and nobody says a word. Can you imagine if I got the questions?',
  'link_name': '',
  'status_type': 'status',
  'status_link': '',
  'status_published': '10/17/2016 10:33:34',
  'num_reactions': '83038

## Exercise 2: Nobel Prize Winners (JSON)

There is a lot of interesting data online. For example, the [Nobel Prize Organisation](https://www.nobelprize.org) provides the [Nobel Prize API](https://nobelprize.readme.io) that allows you to download information about the prizes, the laureates and the countries. 

The information is formatted in JSON. Have a look at the following URLs:
- http://api.nobelprize.org/v1/prize.json
- http://api.nobelprize.org/v1/laureate.json
- http://api.nobelprize.org/v1/country.json

For this exercise, we will only look at the prizes and the laureates. 

We can download the data using the `requests` module. How this works is shown below.

In [6]:
import requests

In [7]:
# Download data on prizes
api_url = "http://api.nobelprize.org/v1/prize.json"
r = requests.get(api_url)
dict_prizes = r.json()
# uncomment the line below if you'd like to see what's inside dict_prizes
# dict_prizes 

In [8]:
# Download data on laureates
api_url = "http://api.nobelprize.org/v1/laureate.json"
r = requests.get(api_url)
dict_laureates = r.json()
# uncomment the line below if you'd like to see what's inside dict_laureates
#dict_laureates 

### 2a. Read the JSON files

We have already stored the data as the JSON files `laureate.json` and `prize.json` in the folder `../Data/json_data/NobelPrize`. Open these JSON files and load them as the Python dictionaries `dict_laureates` and `dict_prizes`.

In [9]:
# load laureate.json and prize.json here
import json

with open("../Data/json_data/NobelPrize/laureate.json", "r") as infile:
    dict_laureates = json.load(infile)
    
with open("../Data/json_data/NobelPrize/prize.json", "r") as infile:
    dict_prizes = json.load(infile)

### 2b. Get all laureates from year and category

Create a function called **`get_laureates()`** that has three parameters: 

* **`dict_prizes`** (positional parameter) 
* **`year`** (keyword parameter with default `None`) 
* **`category`** (keyword parameter with default `None`) 

The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category (specified using the keywords `year` and `category`). It should return a list of the full names of the laureates.

For example, for the year 2018 and category "peace" it should return the list `['Denis Mukwege', 'Nadia Murad']`.
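
The optional filters can be sketched on a small sample, as below. The sample is an assumption about the shape of the data (a top-level `"prizes"` list whose entries have `"year"`, `"category"` and, usually, `"laureates"`); the two `"example"` entries and the helper `matches()` are hypothetical and only there to show entries with missing information.

```python
# Abridged, partly hypothetical sample in the assumed shape of prize.json
sample_prizes = {
    "prizes": [
        {"year": "2018", "category": "peace",
         "laureates": [{"firstname": "Denis", "surname": "Mukwege"},
                       {"firstname": "Nadia", "surname": "Murad"}]},
        # Hypothetical entry: a laureate without a surname
        {"year": "2018", "category": "example",
         "laureates": [{"firstname": "Example Organisation"}]},
        # Hypothetical entry: a prize without any laureates
        {"year": "1900", "category": "example"},
    ]
}

def matches(prize, year=None, category=None):
    """Hypothetical helper: True if the prize passes every filter that was given."""
    year_ok = year is None or str(year) == prize["year"]
    category_ok = category is None or category == prize["category"]
    return year_ok and category_ok

def names_for(prizes, year=None, category=None):
    """Collect full names, tolerating missing 'laureates' and 'surname' keys."""
    names = []
    for prize in prizes:
        if not matches(prize, year, category):
            continue
        for laureate in prize.get("laureates", []):
            parts = [laureate.get("firstname"), laureate.get("surname")]
            names.append(" ".join(part for part in parts if part))
    return names

print(names_for(sample_prizes["prizes"], year=2018, category="peace"))
# ['Denis Mukwege', 'Nadia Murad']
print(names_for(sample_prizes["prizes"], year=2018))
# ['Denis Mukwege', 'Nadia Murad', 'Example Organisation']
```

Because a filter set to `None` always passes, the same function covers all four combinations of year and category.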

In [10]:
def get_laureates(dict_prizes, / , year=None, category=None):
    """
    Finds laureates that won the Nobel Prize, optionally filtering by year and/or category.
    
    :param dict_prizes: dict of all the nobel prizes
    :param year: the year for which you want to find the laureates (default is None)
    :param category: the category for which you want to find the laureates (default is None)
    :return: list of names of the laureates
    """
    
    # Creating an empty list to store the names of the laureates in 
    laureates = []
    
    # Looping over the prizes
    for prizes in dict_prizes.values():
        for prize in prizes:
            
            # Continue if a specific year is provided and doesn't match the current prize year
            if (year is not None) and (str(year) != prize['year']):
                continue
            
            # Continue if a specific category is provided and doesn't match the current category
            if (category is not None) and (category != prize['category']):
                continue
            
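            # Loop over the laureates of the current prize (some prizes may list none)
            for laureate in prize.get('laureates', []):
                # Construct the full name from the name parts that are present
                # (organisations, for example, may have no surname)
                name_parts = [laureate.get('firstname'), laureate.get('surname')]
                full_name_laureate = " ".join(part for part in name_parts if part)
                laureates.append(full_name_laureate)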
    
    # Returning the list of laureates
    return laureates

# Testing the function to find the laureates of the peace prize in 2018
year = 2018
category = "peace"
laureates_peace_2018 = get_laureates(dict_prizes, year, category)
print(laureates_peace_2018)

['Denis Mukwege', 'Nadia Murad']


### 2c. Get all prizes from affiliations

Create a function called **`get_affiliation_prizes()`** that takes one input parameter: 

* **`dict_laureates`** (positional parameter) 

The function should find all affiliations that were involved in winning the Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:

```
{
    "A.F. Ioffe Physico-Technical Institute": [
        {"category": "physics", "year": "2000"}
    ],
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics","year": "2010"}
    ]
}
```

**Tip:** some of the entries will lack information (for example, there is no associated affiliation). Use `if-statements` to check if essential information is present. 

**General tip for working with data**: If your code breaks, check whether your assumptions about the data hold (very often, they unfortunately do not). For instance, a dictionary key you thought was always present may be missing from a couple of dictionaries. 
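
The following sketch shows one way to guard against incomplete entries while grouping prizes by affiliation. The sample data is made up, and its shape (including an empty list standing in for a missing affiliation) is an assumption for illustration.

```python
# Hypothetical sample: the second laureate has no usable affiliation
sample_laureates = [
    {"prizes": [{"category": "physics", "year": "2000",
                 "affiliations": [{"name": "Example University"}]}]},
    {"prizes": [{"category": "peace", "year": "2001",
                 "affiliations": [[]]}]},
    {"prizes": [{"category": "chemistry", "year": "2002",
                 "affiliations": [{"name": "Example University"}]}]},
]

by_affiliation = {}
for laureate in sample_laureates:
    # .get() with a default avoids a KeyError when a key is missing
    for prize in laureate.get("prizes", []):
        for affiliation in prize.get("affiliations", []):
            # Skip entries that are not dictionaries or that lack a name
            if not isinstance(affiliation, dict) or not affiliation.get("name"):
                continue
            entry = {"category": prize["category"], "year": prize["year"]}
            # setdefault() creates the list the first time a name is seen
            by_affiliation.setdefault(affiliation["name"], []).append(entry)

print(by_affiliation)
```

Here `setdefault()` replaces the explicit "if the key is not in the dictionary yet, add an empty list" step; both approaches give the same result.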

In [11]:
def get_affiliation_prizes(dict_laureates, /):
    """
    Finds affiliations that have been associated with Nobel Prize winners and provides 
    details about the categories and years of the corresponding Nobel Prizes.
    
    :param dict_laureates: a dict containing information about laureates and their prizes
    :return: a dict that maps affiliations to their associated Nobel Prizes
    """
    
    # Create empty dictionary
    dict_affiliations_prizes = {}
    
    for laureates in dict_laureates.values():
        # Looping over the laureates
        for laureate in laureates:
            # Looping over the prizes of the laureates
            for prize in laureate["prizes"]:
                # If information on category and year exists, store it in prize_dict
                if prize.get("category") and prize.get("year"):
                    prize_dict = {}
                    prize_dict["category"] = prize["category"]
                    prize_dict["year"] = prize["year"]
                    # Looping over the associated affiliations
                    for affiliation in prize["affiliations"]:
                        # If affiliation information is available, retrieve the name
                        if affiliation != [] and affiliation.get("name"):
                            name_affiliation = affiliation["name"]
                            # If affiliation not in dict, add
                            if name_affiliation not in dict_affiliations_prizes:
                                dict_affiliations_prizes[name_affiliation] = []
                            # Append associated prize to the list
                            dict_affiliations_prizes[name_affiliation].append(prize_dict)
    
    # Return the dict of the affiliations and their associated prizes                       
    return dict_affiliations_prizes

# Testing the code
affiliations_prizes = get_affiliation_prizes(dict_laureates)
affiliations_prizes

{'Munich University': [{'category': 'physics', 'year': '1901'},
  {'category': 'chemistry', 'year': '1905'},
  {'category': 'chemistry', 'year': '1915'},
  {'category': 'chemistry', 'year': '1927'}],
 'Leiden University': [{'category': 'physics', 'year': '1902'},
  {'category': 'physics', 'year': '1913'},
  {'category': 'medicine', 'year': '1924'}],
 'Amsterdam University': [{'category': 'physics', 'year': '1902'},
  {'category': 'physics', 'year': '1910'}],
 'École Polytechnique': [{'category': 'physics', 'year': '1903'},
  {'category': 'physics', 'year': '2018'}],
 'École municipale de physique et de chimie industrielles (Municipal School of Industrial Physics and Chemistry)': [{'category': 'physics',
   'year': '1903'}],
 'Sorbonne University': [{'category': 'chemistry', 'year': '1911'},
  {'category': 'physics', 'year': '1908'},
  {'category': 'physics', 'year': '1926'},
  {'category': 'chemistry', 'year': '1906'},
  {'category': 'medicine', 'year': '1913'},
  {'category': 'peace',

### 2d. Write to JSON

Next, write the dictionary created in the previous exercise to a JSON file using the following path: 

`../Data/json_data/NobelPrize/nobel_prizes_affiliations.json`.

In [12]:
# write the resulting dictionary to the JSON file

with open("../Data/json_data/NobelPrize/nobel_prizes_affiliations.json", "w") as outfile:
    json.dump(affiliations_prizes, outfile)
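
As an optional variation, the sketch below writes a small dictionary to a temporary file (the file name is made up) with two extra arguments of `json.dump()`: `indent` for a more readable file and `ensure_ascii=False` so that characters such as "É" are written as-is instead of as escape sequences. It then reads the file back to check that nothing was lost.

```python
import json
import os
import tempfile

# Small sample in the same format as the affiliation dictionary
sample = {"École Polytechnique": [{"category": "physics", "year": "1903"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")

    # indent=4 pretty-prints; ensure_ascii=False keeps non-ASCII characters readable
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(sample, outfile, indent=4, ensure_ascii=False)

    # Read the raw text back and parse it again
    with open(path, "r", encoding="utf-8") as infile:
        raw_text = infile.read()

reloaded = json.loads(raw_text)

print(raw_text)
print(reloaded == sample)
```

Both versions load back into the same dictionary; the difference is only in how the file looks when opened in a text editor.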