# Data formats and open data
**Exercises for week 4B** in Digital Methods, University of Copenhagen

## 1. Data formats

We will be working with two main types of data formats: tabular and non-tabular. In most practical scenarios,
that translates to files in either CSV format (tabular) or JSON format (non-tabular).
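
To make the distinction concrete, here is a minimal sketch that reads one small CSV snippet and one small JSON snippet from in-memory strings. The sample values (`alice`, `bob`, and the field names in the JSON string) are made up for illustration and do not come from the course data.

```python
import csv
import io
import json

# Tabular: every row has the same columns (hypothetical sample data).
csv_text = "username,retweets\nalice,3\nbob,0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["username"])  # note: the csv module reads every value as a string

# Non-tabular: values can be nested lists or dicts (hypothetical sample data).
json_text = '{"username": "alice", "tweets": [{"text": "hi", "hashtags": ["a", "b"]}]}'
record = json.loads(json_text)
print(record["tweets"][0]["hashtags"])
```

The CSV rows all share one flat set of columns, while the JSON record holds a list of tweets that each hold their own list of hashtags, which would not fit into a single table cell without flattening.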

### 1.1 Tabular data

> **Ex. 1**: Download [this file](https://www.dropbox.com/s/8ntns0td580be3i/output_got.csv?dl=0) and load it as a list
of lists. Print the first 10 lines, as well as the total number of lines.

In [27]:
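import csv

# newline='' lets the csv module handle line breaks inside quoted fields
with open('output_got.csv', 'r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    data = list(reader)  # list of lists; the first row is the header

print(data[:10])  # first 10 lines
print(len(data))  # total number of lines, header included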

[['date', 'username', 'to', 'replies', 'retweets', 'favorites', 'text', 'geo', 'mentions', 'hashtags', 'id', 'permalink'], ['2020-02-28 08:15:19', 'raebengala', '', '0', '0', '0', 'Per tranquillizzarci ci dicono che morti #coronavirus sono quasi tutti 80enni. Ci chiedono di tornare normalità, come se questi #anziani non contassero nulla. Pochi giorni fa parlavamo di come si viva più a lungo. Oggi ci sembra una fortuna che muoiano "solo" anziani #COVID2019', '', '', '#coronavirus #anziani #COVID2019', '1233304662671069184', 'https://twitter.com/raebengala/status/1233304662671069184'], ['2020-02-28 08:15:19', 'martin_lainez', '', '0', '0', '0', 'El sevillano con coronavirus: «Mando un mensaje de tranquilidad, tenemos una sanidad preparada» https://sevilla.abc.es/sevilla/sevi-sevillano-coronavirus-mando-mensaje-tranquilidad-tenemos-sanidad-preparada-202002272347_noticia.html#vca=rrss-inducido&vmc=abcdesevilla-es&vso=tw&vli=noticia-entrevista … vía @abcdesevilla @RamonRomanR', '', '@abcdes

> **Ex. 2**: Load the same file into a `pandas.DataFrame`. What do you think it describes, and can you characterize
it at a glance? What would be some things we could investigate about this data?
>
> *Hint: To load a CSV file as a pandas DataFrame you can `import pandas as pd` (if you do not have it installed, run `pip install pandas`
in your terminal/command prompt) and load it using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).*

In [28]:
#! pip install pandas

import pandas as pd
data = pd.read_csv("output_got.csv")
data

# data.head() shows only the first five rows. Displaying `data` itself shows
# the first and last rows, with the rows in between elided.

Unnamed: 0,date,username,to,replies,retweets,favorites,text,geo,mentions,hashtags,id,permalink
0,2020-02-28 08:15:19,raebengala,,0,0,0,Per tranquillizzarci ci dicono che morti #coro...,,,#coronavirus #anziani #COVID2019,1233304662671069184,https://twitter.com/raebengala/status/12333046...
1,2020-02-28 08:15:19,martin_lainez,,0,0,0,El sevillano con coronavirus: «Mando un mensaj...,,@abcdesevilla @RamonRomanR,#vca,1233304662092152832,https://twitter.com/martin_lainez/status/12333...
2,2020-02-28 08:15:19,Davenews5,,0,0,0,#HDPROS Frontières/coronavirus : contrôler les...,,,#HDPROS #Minable #Irresponsable,1233304660473151488,https://twitter.com/Davenews5/status/123330466...
3,2020-02-28 08:15:19,nnoeliaperez,,0,0,0,No quiero ser alarmista pero VAMOS A MORIR ALF...,,,,1233304660188041216,https://twitter.com/nnoeliaperez/status/123330...
4,2020-02-28 08:15:19,paulstpancras,,0,0,0,Why Trump is secretly terrified of coronavirus...,,,,1233304659156180994,https://twitter.com/paulstpancras/status/12333...
...,...,...,...,...,...,...,...,...,...,...,...,...
95,2020-02-28 08:15:04,92Andrianiutari,,0,0,0,"“Aksi Kemanusiaan mengantisipasi virus corona,...",,,#BersamaCegahCoronapic,1233304599550758913,https://twitter.com/92Andrianiutari/status/123...
96,2020-02-28 08:15:04,TheRealOlaDiab,,0,0,0,No coronavirus cases in #Qatar and risk is cur...,,,#Qatar,1233304598305214464,https://twitter.com/TheRealOlaDiab/status/1233...
97,2020-02-28 08:15:04,TMZ,,0,0,1,Ski World Cup Finals In Italy Bans Spectators ...,,,,1233304597650956290,https://twitter.com/TMZ/status/123330459765095...
98,2020-02-28 08:15:04,SpeedBird_NCL,,0,0,0,Spreading coronavirus is making some American ...,,,#Aviation #Airlines #COVID19 #Coronavirus,1233304597172772865,https://twitter.com/SpeedBird_NCL/status/12333...
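
A few standard pandas calls help characterize a table at a glance. The sketch below runs them on a hypothetical three-row stand-in that reuses a subset of the column names from `output_got.csv`; the rows themselves are invented so that the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in using a subset of the real column names.
sample_csv = io.StringIO(
    "date,username,retweets,favorites,hashtags\n"
    "2020-02-28 08:15:19,user_a,0,0,#coronavirus\n"
    "2020-02-28 08:15:19,user_b,2,5,\n"
    "2020-02-28 08:15:04,user_c,0,1,#COVID19\n"
)
df = pd.read_csv(sample_csv, parse_dates=["date"])

print(df.shape)                   # (number of rows, number of columns)
print(df.dtypes)                  # the type pandas inferred for each column
print(df.head(2))                 # the first two rows
print(df["retweets"].describe())  # summary statistics for a numeric column
```

On the real file, the same calls applied to `data` would show how many tweets there are, which columns are numeric, and how skewed the retweet counts are.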


> **Ex. 3**: Are there missing values in the data? How does `pandas` handle these? Can we remedy missing values in
any way?
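
As a starting point for this exercise, the sketch below shows how pandas reads empty CSV fields as `NaN` and two common ways to deal with them. The sample rows are hypothetical; only the column names are borrowed from the dataset. Whether filling or dropping is appropriate depends on the question being asked, so treat both as options rather than a recommendation.

```python
import io
import pandas as pd

# Hypothetical rows; empty fields between commas are read as missing values.
sample_csv = io.StringIO(
    "username,to,retweets,hashtags\n"
    "user_a,,0,#coronavirus\n"
    "user_b,user_a,2,\n"
    "user_c,,1,#COVID19\n"
)
df = pd.read_csv(sample_csv)

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: replace missing text fields with an empty string.
filled = df.fillna({"hashtags": "", "to": ""})

# Option 2: drop the rows that lack a value in a column we care about.
complete_rows = df.dropna(subset=["hashtags"])
print(len(complete_rows))
```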

> ***If you are motivated***: Install [GetOldTweets3](https://pypi.org/project/GetOldTweets3/). Use it to download
a dataset of tweets about the coronavirus originating from Denmark (or in Danish). Print a handful of tweets and summarize the discourse.

### 1.2 Non-tabular data

> **Ex. 4**: In Python, a JSON object is represented as a `dict`. `dict`s are key-value stores, much like
dictionaries that map words between e.g. Danish and English: input a word in Danish (the key) and get the word
in English (the value). In Python this lookup is done by "keying" into a dictionary with code like
`my_dict[the_key]`. JSON objects are, however, a step more complicated, because values can themselves be
dictionaries (or lists), which nests the data in a tree structure. In this exercise, you should "key" into `my_json_obj`
below to access the list of useless cats.

In [30]:
my_json_obj = {
    'cats': {
        'awesome': ['Missy'],
        'useless': ['Kim', 'Frank', 'Sandy']
    },
    'dogs': {
        'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'],
        'useless': []
    }
}

In [32]:
my_json_obj['cats']['useless']

['Kim', 'Frank', 'Sandy']
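
JSON usually arrives as text rather than as a ready-made `dict`. The sketch below parses a JSON string with the standard-library `json` module and keys into the result in the same way. The string is a trimmed copy of the cats example, and the `birds` key is a hypothetical missing key used to show `.get`.

```python
import json

# json.loads parses JSON text into Python dicts and lists.
json_text = '{"cats": {"awesome": ["Missy"], "useless": ["Kim", "Frank", "Sandy"]}}'
pets = json.loads(json_text)

print(type(pets))
print(pets["cats"]["useless"])

# .get returns a default instead of raising KeyError when a key is missing.
print(pets.get("birds", {}).get("useless", []))

# json.dumps goes the other way, from a Python object back to JSON text.
print(json.dumps(pets, indent=2))
```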

> **Ex. 5**: Run the code snippet below to download the front page of posts from the *[r/coronavirus](https://www.reddit.com/r/Coronavirus/)* subreddit in
[JSON format](https://www.reddit.com/r/coronavirus.json). With pen and paper (or another illustration tool), draw a sketch that outlines the data
as a tree (similar to the example in the lecture).

In [40]:
import requests as rq
data = rq.get(
    "https://www.reddit.com/r/coronavirus.json",    # link to the data
    headers={'User-agent': 'digital_methods_2020'}  # a user agent that tells Reddit who we are (good practice)
).json()                                            # parse the response from Reddit's server into Python dicts and lists


data

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Coronavirus',
     'selftext': 'As of 5 PM Feb 27 GMT  there are 82,586 (🔺 2,228 or +148%  vs yesterday 🔺897) confirmed coronavirus cases worldwide.\n \nFor those who are confused with percentages, it’s day to day delta: the difference of increased (or decreased) numbers compared to yesterday’s increase or decrease. \n\nTop 3 infected countries will be featured here from now on:\n\nChina 🇨🇳:\n\nThere are 78,514 (🔺450 or +11% vs yesterday 🔺406) \n\nAmong them, 8,346 (🔻206  vs yesterday 🔻574) people are listed in serious conditions. \n\n29,745 recovered cases in China 💚 \n(🔺2,750 or +9% vs yesterday 🔺2,515)\n\nSouth Korea 🇰🇷:\n1,766 confirmed cases (🔺505 or +78% vs yesterday 🔺284)\n\n20,716 people are being tested for Coronavirus (🔺6,834 vs yesterday) \n\nItaly 🇮🇹:\n528 confirmed cases (🔺154 or +69% vs yesterday 🔺91)\n\nWorldwide death toll is 
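
The raw output is hard to read as a tree. One way to check a hand-drawn sketch is to print only the keys and value types, indented by nesting depth. The helper `outline` below is a hypothetical function written for this purpose, not part of any library, and `sample` is a heavily trimmed stand-in for the Reddit response so that the example runs offline.

```python
def outline(obj, indent=0, max_items=1):
    """Return lines describing the key structure of nested dicts and lists."""
    lines = []
    pad = "  " * indent
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(outline(value, indent + 1, max_items))
    elif isinstance(obj, list):
        # Only descend into the first max_items elements to keep the outline short.
        for i, item in enumerate(obj[:max_items]):
            lines.append(f"{pad}[{i}]: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(outline(item, indent + 1, max_items))
        if len(obj) > max_items:
            lines.append(f"{pad}... ({len(obj) - max_items} more)")
    return lines


# Hypothetical, heavily trimmed stand-in for the Reddit response.
sample = {
    "kind": "Listing",
    "data": {
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"subreddit": "Coronavirus"}},
            {"kind": "t3", "data": {"subreddit": "Coronavirus"}},
        ],
    },
}
tree_lines = outline(sample)
print("\n".join(tree_lines))
```

Calling `outline(data)` on the real response should give a similar but much longer outline, since each post carries many more fields than the stand-in.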

> **Ex. 6**: Similar to how you returned the list of useless cats in Ex. 4, now "key" into `data`
to access a value that informs you this data is from the 'Coronavirus' subreddit.

In [65]:
data['data']['children'][0]['data']['subreddit']
# data['data'] is a dictionary. Its 'children' key holds a list of posts, so we
# index position 0 to get the first post, then key into that post's own 'data'
# dictionary to read 'subreddit'.

'Coronavirus'
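
The same keying pattern extends from one post to all of them with a loop over the `children` list. The sketch below uses a hypothetical two-post stand-in named `listing` that mirrors the nesting of the Reddit response; the `selftext` values are invented.

```python
# Hypothetical stand-in with the same nesting as the Reddit response.
listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "Coronavirus", "selftext": "first post"}},
            {"kind": "t3", "data": {"subreddit": "Coronavirus", "selftext": ""}},
        ]
    },
}

posts = listing["data"]["children"]                         # a list of post dicts
subreddits = [post["data"]["subreddit"] for post in posts]  # one value per post
print(subreddits)
print(set(subreddits))                                      # the distinct subreddits
```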