# Data formats and open data
**Exercises for week 4B** in Digital Methods, University of Copenhagen

## 1. Data formats

We will be working with two main types of data formats: tabular and non-tabular. In practice, that almost always
means files in either CSV format (tabular) or JSON format (non-tabular).

### 1.1 Tabular data

> **Ex. 1**: Download [this file](https://www.dropbox.com/s/8ntns0td580be3i/output_got.csv?dl=0) and load it as a list
of lists. Print the first 10 lines, as well as the total number of lines.

In [6]:
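lines = []
# Read the file line by line (utf-8 is assumed, since the tweets contain non-ASCII characters)
with open('output_got.csv', encoding='utf-8') as fp:
    for line in fp:
        # Naive split on commas: quoted text that itself contains commas gets split too
        lines.append(line.split(","))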

In [7]:
lines[:10]

[['date',
  'username',
  'to',
  'replies',
  'retweets',
  'favorites',
  'text',
  'geo',
  'mentions',
  'hashtags',
  'id',
  'permalink\n'],
 ['2020-02-28 08:15:19',
  'raebengala',
  '',
  '0',
  '0',
  '0',
  '"Per tranquillizzarci ci dicono che morti #coronavirus sono quasi tutti 80enni. Ci chiedono di tornare normalità',
  ' come se questi #anziani non contassero nulla. Pochi giorni fa parlavamo di come si viva più a lungo. Oggi ci sembra una fortuna che muoiano ""solo"" anziani #COVID2019"',
  '',
  '',
  '#coronavirus #anziani #COVID2019',
  '1233304662671069184',
  'https://twitter.com/raebengala/status/1233304662671069184\n'],
 ['2020-02-28 08:15:19',
  'martin_lainez',
  '',
  '0',
  '0',
  '0',
  '"El sevillano con coronavirus: «Mando un mensaje de tranquilidad',
  ' tenemos una sanidad preparada» https://sevilla.abc.es/sevilla/sevi-sevillano-coronavirus-mando-mensaje-tranquilidad-tenemos-sanidad-preparada-202002272347_noticia.html#vca=rrss-inducido&vmc=abcdesevilla-es&
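
Note that in the output above the tweet text is cut in two wherever it contains a comma, because `line.split(",")` does not know about quoted fields. A minimal sketch of an alternative, using the standard library's `csv` module, is shown here. The helper name `read_rows` and the sample row are made up for illustration; only the file name `output_got.csv` in the comment comes from the exercise.

```python
import csv
import io


def read_rows(fp):
    """Parse an open CSV file object into a list of lists, respecting quoted fields."""
    return list(csv.reader(fp))


# Hypothetical sample standing in for the real file, so the snippet runs on its own
sample_line = '2020-02-28 08:15:19,some_user,"Hello, world with ""quotes"""'
sample = io.StringIO("date,username,text\n" + sample_line + "\n")

rows = read_rows(sample)
naive_row = sample_line.split(",")

print(rows[1])         # three fields, the quoted text stays in one piece
print(len(naive_row))  # the naive split yields one field too many

# With the real file (assuming it is in the working directory):
# with open('output_got.csv', newline='', encoding='utf-8') as fp:
#     lines = read_rows(fp)
# print(lines[:10])
# print(len(lines))  # total number of rows, header included
```

Because tweets can in principle contain line breaks inside a quoted field, the number of parsed rows may differ from the number of physical lines in the file.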

> **Ex. 2**: Load the same file into a `pandas.DataFrame`. What do you think it describes, and can you characterize
it just from a glance? What would be some things we could investigate about this data?
>
> *Hint: To load a csv as a pandas dataframe you can `import pandas as pd` (if you do not have it installed, run `pip install pandas`
in your terminal/command prompt) and load it using [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).*

In [8]:
import pandas as pd

df = pd.read_csv('output_got.csv')

In [10]:
df.head()

| | date | username | to | replies | retweets | favorites | text | geo | mentions | hashtags | id | permalink |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-28 08:15:19 | raebengala | | 0 | 0 | 0 | Per tranquillizzarci ci dicono che morti #coro... | | | #coronavirus #anziani #COVID2019 | 1233304662671069184 | https://twitter.com/raebengala/status/12333046... |
| 1 | 2020-02-28 08:15:19 | martin_lainez | | 0 | 0 | 0 | El sevillano con coronavirus: «Mando un mensaj... | | @abcdesevilla @RamonRomanR | #vca | 1233304662092152832 | https://twitter.com/martin_lainez/status/12333... |
| 2 | 2020-02-28 08:15:19 | Davenews5 | | 0 | 0 | 0 | #HDPROS Frontières/coronavirus : contrôler les... | | | #HDPROS #Minable #Irresponsable | 1233304660473151488 | https://twitter.com/Davenews5/status/123330466... |
| 3 | 2020-02-28 08:15:19 | nnoeliaperez | | 0 | 0 | 0 | No quiero ser alarmista pero VAMOS A MORIR ALF... | | | | 1233304660188041216 | https://twitter.com/nnoeliaperez/status/123330... |
| 4 | 2020-02-28 08:15:19 | paulstpancras | | 0 | 0 | 0 | Why Trump is secretly terrified of coronavirus... | | | | 1233304659156180994 | https://twitter.com/paulstpancras/status/12333... |


It describes tweets about the coronavirus (COVID-19). One could:
* Analyse which languages the tweets are written in, to get an idea of which
populations were talking about it.
* Reconstruct a network of users based on replies and/or retweets (if more data
were available).
* Analyse the language in the tweets: is it fear-driven, or do people seem more
rational?
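
As a rough illustration of the network idea, reply relations can be turned into a weighted edge list. This is only a sketch: the records below are invented, and it assumes that the `to` column holds the user a tweet replies to, which should be checked against the data before relying on it.

```python
from collections import Counter

# Hypothetical records mimicking the `username` and `to` columns of the dataset
tweets = [
    {"username": "alice", "to": "bob"},
    {"username": "carol", "to": "bob"},
    {"username": "alice", "to": "bob"},
    {"username": "dave", "to": None},  # not a reply, so it contributes no edge
]

# Count how often each (sender, receiver) pair occurs; the count is the edge weight
edges = Counter((t["username"], t["to"]) for t in tweets if t["to"])

for (sender, receiver), weight in edges.most_common():
    print(f"{sender} -> {receiver} (weight {weight})")
```

An edge list like this can later be handed to a graph library for further analysis.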

> **Ex. 3**: Are there missing values in the data? How does `pandas` handle these? Can we remedy missing values in
any way?

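There are lots of missing values. If we inspect the raw data, we see that a
missing value occurs every time there are two consecutive commas. Pandas represents
these as `NaN` ("Not a Number"), a special floating-point value that it uses as
its default marker for missing data. We can remedy missing values by dropping the
affected rows or columns (`dropna`) or by filling them with a placeholder
(`fillna`), but neither recovers information that was never recorded.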
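
A small sketch of how one might count and handle missing values with pandas. The sample table is invented and only borrows a few column names from the dataset; which columns to fill or drop in the real data is a judgment call.

```python
import io

import pandas as pd

# Hypothetical mini-CSV: empty fields between commas become NaN when loaded
raw = io.StringIO(
    "username,to,replies,geo\n"
    "user_a,,0,\n"
    "user_b,user_a,2,\n"
)
sample = pd.read_csv(raw)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Option 1: fill a column with a placeholder (here an empty string)
filled = sample.fillna({"to": ""})

# Option 2: drop columns that contain no data at all
no_empty_cols = sample.dropna(axis="columns", how="all")
print(no_empty_cols.columns.tolist())
```

Filling makes the column easier to work with, but the placeholder is not real data, so it should be chosen so that it cannot be mistaken for an observed value.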

> ***If you are motivated***: Install [GetOldTweets3](https://pypi.org/project/GetOldTweets3/). Use it to download
a dataset of tweets about Corona virus originating from Denmark (or in Danish). Print a handful of tweets and summarize the discourse.

In [11]:
!GetOldTweets3 --querysearch "corona" --near "56.035014, 10.392624" --within 150km --maxtweets 100 --output "coronadk.csv"

Downloading tweets...
Saved 100
Done. Output file generated "coronadk.csv".


In [12]:
df = pd.read_csv("coronadk.csv")

In [15]:
for text in df['text']:
    print(text, end="\n\n")

vent fr, jeg skal vidst ud og hamstre corona til sommeren

Jeg giver det 48 timer, så er hele det danske mediebilledes Corona-selvsving skiftet ud med FLYGTNINGENE KOMMER!!!« https://twitter.com/piphotosagency/status/1233410354023149569 …

»Granddad… was that before the Corona virus apocalypse?«

I forhold til hvordan man straks har sat et beredskab igang i lande som f.eks. England når det gælder Corona-virus. Og hvor man i Irland straks frarådede rejser til Italien efter udbruddet der. Så virker det danske myndigheds håndtering total amatør-agtig. #dkpol #covid2019

Aldi i Aalborgs Vestby udnytter lige hypen og navnegenkendelsen til at introducere en nyhed i sortimentet. Ps. Bare rolig. Man kan forholdsvist risikofrit drikke Corona, blot man husker at bruge maske. pic.twitter.com/upMnL4LJTr

har corona med skal du smage nilaash

Corona virussen er en oplagt mulighed til at starte kønsdebatten

Og hey: IKKE i orden, at SARS og Corona parrer sig.

corona makes u stronger. so its a drug.

My summary: Lots of tweets in other languages (primarily English) talk about
the situation in Denmark (3 infected at the time of writing). Most of these are heavy
on fear. At the same time, many tweets in Danish criticize the Danish media for being
hysterical and for drawing attention away from more important matters.

### 1.2 Non-tabular data

> **Ex. 4**: In Python, JSON is rendered as a `dict`-type object. `dict`s are key-value stores, much like 
dictionaries that map words between e.g. Danish and English: input a word in Danish (the key) and get the word
in English (the value). In Python this operation is achieved by "keying" into a dictionary with code like
`my_dict[the_key]`. JSON objects are, however, a step more complicated. How? Because values can themselves be
dictionaries (or lists), thus nesting the data in a tree structure. In this exercise, you should "key" into `my_json_obj`
below to access the list of useless cats.

In [16]:
my_json_obj = {
    'cats': {
        'awesome': ['Missy'],
        'useless': ['Kim', 'Frank', 'Sandy']
    },
    'dogs': {
        'awesome': ['Finn', 'Dolores', 'Fido', 'Casper'],
        'useless': []
    }
}

In [17]:
my_json_obj['cats']['useless']

['Kim', 'Frank', 'Sandy']
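
For completeness, a minimal sketch of how JSON text becomes such a nested `dict`, using the standard library's `json` module. The JSON string is a shortened copy of the object above, and the `birds` key is a made-up example of a key that does not exist.

```python
import json

# JSON arrives as text; json.loads parses it into nested dicts and lists
text = '{"cats": {"awesome": ["Missy"], "useless": ["Kim", "Frank", "Sandy"]}}'
obj = json.loads(text)

# Keying step by step down the tree
useless_cats = obj["cats"]["useless"]
print(useless_cats)

# Keying with [] raises KeyError for a missing key; .get returns a default instead
useless_birds = obj.get("birds", {}).get("useless", [])
print(useless_birds)
```

Using `.get` with a default is handy when exploring data where some branches of the tree may be absent.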

> **Ex. 5**: Run the code snippet below to download the front page of posts from the *[r/coronavirus](https://www.reddit.com/r/Coronavirus/)* subreddit in
[json format](https://www.reddit.com/r/coronavirus.json). With pen and paper (or another illustration tool) draw a sketch that outlines the data
as a tree (similar to the example in the lecture).

In [18]:
import requests as rq
data = rq.get(
    "https://www.reddit.com/r/coronavirus.json",    # link to the data
    headers={'User-agent': 'digital_methods_2020'}  # a user agent that tells Reddit who we are (good practice)
).json()                                            # parse the response from Reddit's server as JSON data

[Link to my sketch](https://www.dropbox.com/s/2et2bs5d1xu07ud/ex5_sketch.png?dl=0)
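
As a complement to a hand-drawn sketch, the tree can also be outlined programmatically. The helper `outline` below is a hypothetical function written for this illustration, and the sample object only mimics a small part of the structure of the Reddit response; it is not real downloaded data.

```python
def outline(node, depth=0, max_depth=4):
    """Return indented lines describing the keys of a nested dict/list structure."""
    lines = []
    if depth > max_depth:
        return lines
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key} ({type(value).__name__})")
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only the first element is shown, assuming list items share one structure
        lines.append(f"{indent}[0] of {len(node)} ({type(node[0]).__name__})")
        lines.extend(outline(node[0], depth + 1, max_depth))
    return lines


# Hypothetical stand-in for `data`, heavily simplified
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "Coronavirus", "title": "example"}}
        ]
    },
}

tree = outline(sample)
print("\n".join(tree))
```

Calling `outline(data)` on the downloaded object should give a similar, much longer listing; lowering `max_depth` keeps it readable.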

> **Ex. 6**: Similar to how you returned the list of useless cats in Ex. 4, now "key" into `data`
to access a value that informs you this data is from the 'Coronavirus' subreddit.

In [40]:
data['data']['children'][0]['data']['subreddit']

'Coronavirus'