### Exploring data from datahub.io

This notebook acquires data from the [datasets/covid-19](https://github.com/datasets/covid-19) repository, with the aim of eventually automating the integration of that data with Wikidata.

Just some things to think about (jvfe):
- How to properly reference the data? Choose [datahub.io](https://datahub.io/core/covid-19) as the reference?
    - They aggregate it from various sources
    
    
- I've acquired the country outbreak items via the following query and modified it slightly to better merge the items.
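```sparql
SELECT ?item ?itemLabel ?countryid ?countryidLabel
WHERE
{
  # Items that are an instance of (P31) disease outbreak (Q3241045) ...
  ?item p:P31 ?statement.
  ?statement ps:P31 wd:Q3241045.
  # ... of (P642) COVID-19 (Q84263196), valid in place (P3005) some country.
  ?statement pq:P642 wd:Q84263196.
  ?statement pq:P3005 ?countryid.
  # Keep only places that are countries (Q6256).
  ?countryid wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```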
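
For the merge further down, the query result has to end up as a table whose country-name column matches the `Country` column of the datahub.io CSV. The following is a minimal sketch of that step, assuming the result was downloaded in the standard SPARQL JSON results format. The helper names (`bindings_to_rows`, `rows_to_csv`) and the renaming of `countryidLabel` to `Country` are assumptions for illustration, not the exact modification used to build `country_outbreaks.csv`.

```python
import csv
import io

WD_ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def bindings_to_rows(bindings):
    """Flatten SPARQL JSON bindings into plain dicts keyed for the merge."""
    rows = []
    for b in bindings:
        rows.append({
            # Keep only the QID (e.g. "Q87768605"), dropping the entity URI prefix.
            "item": b["item"]["value"].replace(WD_ENTITY_PREFIX, ""),
            "itemLabel": b["itemLabel"]["value"],
            "countryid": b["countryid"]["value"].replace(WD_ENTITY_PREFIX, ""),
            # Assumed rename so the column lines up with the datahub.io CSV.
            "Country": b["countryidLabel"]["value"],
        })
    return rows

def rows_to_csv(rows):
    """Serialize the rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["item", "itemLabel", "countryid", "Country"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# One binding shaped like an entry of results["results"]["bindings"].
sample = [{
    "item": {"type": "uri", "value": WD_ENTITY_PREFIX + "Q87768605"},
    "itemLabel": {"type": "literal", "value": "2020 coronavirus pandemic in Afghanistan"},
    "countryid": {"type": "uri", "value": WD_ENTITY_PREFIX + "Q889"},
    "countryidLabel": {"type": "literal", "value": "Afghanistan"},
}]
print(rows_to_csv(bindings_to_rows(sample)))
```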

In [1]:
import pandas as pd

In [2]:
from datetime import date, timedelta
yesterday = date.today() - timedelta(days=1)
today = date.today()

yesterday_table = yesterday.strftime("%Y-%m-%d")
today_table = today.strftime("%Y-%m-%d")


In [3]:
countries = pd.read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv")
wdt_items = pd.read_csv("../data/country_outbreaks.csv")

In [4]:
full = pd.merge(countries, wdt_items, on="Country")
full

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
0,2020-01-22,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
1,2020-01-23,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
2,2020-01-24,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
3,2020-01-25,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
4,2020-01-26,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
...,...,...,...,...,...,...,...,...
11825,2020-04-17,Zimbabwe,24,2,3,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
11826,2020-04-18,Zimbabwe,25,2,3,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
11827,2020-04-19,Zimbabwe,25,2,3,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
11828,2020-04-20,Zimbabwe,25,2,3,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
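
Because `pd.merge` defaults to an inner join, any country whose name is spelled differently in the two tables is silently dropped from `full`. A small sketch of how such names could be listed, using toy frames rather than the real data; the `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two tables: "US" has no matching label on the Wikidata side.
countries = pd.DataFrame({"Country": ["Afghanistan", "US", "Zimbabwe"]})
wdt_items = pd.DataFrame({
    "Country": ["Afghanistan", "Zimbabwe"],
    "item": ["Q87768605", "Q88164033"],
})

# A left join keeps every row of the dataset, matched or not.
checked = pd.merge(countries, wdt_items, on="Country", how="left", indicator=True)

# Rows tagged "left_only" found no Wikidata item under that name.
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].unique().tolist()
print(unmatched)
```

Names reported this way would need a manual mapping before their rows can be included.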


In [5]:
# The most recent data seems to be from the day before
recent = full.query("Date == @yesterday_table")
recent

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
90,2020-04-21,Afghanistan,1092,150,36,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
181,2020-04-21,Algeria,2811,1152,392,Q87202921,2020 coronavirus pandemic in Algeria,Q262
272,2020-04-21,Angola,24,6,2,Q88082534,2020 coronavirus pandemic in Angola,Q916
363,2020-04-21,Antigua and Barbuda,23,7,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781
454,2020-04-21,Argentina,3031,840,147,Q87235137,2020 coronavirus pandemic in Argentina,Q414
...,...,...,...,...,...,...,...,...
11465,2020-04-21,Venezuela,285,117,10,Q87652010,2020 coronavirus pandemic in Venezuela,Q717
11556,2020-04-21,Vietnam,268,216,0,Q83873057,2020 coronavirus pandemic in Vietnam,Q881
11647,2020-04-21,Yemen,1,0,0,Q89695985,2020 coronavirus pandemic in Yemen,Q805
11738,2020-04-21,Zambia,70,35,3,Q87976629,2020 coronavirus pandemic in Zambia,Q953
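
Filtering on yesterday's date returns an empty table whenever the upstream CSV has not been updated yet. One possible alternative, sketched here on a toy frame, is to select the latest date that is actually present in the data; this is a suggestion under that assumption, not what the notebook does.

```python
import pandas as pd

# Toy frame with the same "Date" and "Country" columns as the merged table.
full = pd.DataFrame({
    "Date": ["2020-04-20", "2020-04-21", "2020-04-20", "2020-04-21"],
    "Country": ["Afghanistan", "Afghanistan", "Zambia", "Zambia"],
})

# ISO dates (YYYY-MM-DD) sort correctly as plain strings, so max() is the newest day.
latest_date = full["Date"].max()
recent = full[full["Date"] == latest_date]
print(latest_date, len(recent))
```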


In [6]:
# The following countries appear to be updated manually from more specific sources,
# so they are excluded from the automated batch.
idx = recent['Country'].isin(['US', 'United Kingdom', 'France', 'Sweden', 'Brazil', 'Netherlands',
                             'China', 'Italy', 'Spain', 'Germany', 'Iran', 'Mexico', 'Argentina',
                             'Canada', 'Norway', 'Portugal', 'Tunisia', 'Uruguay'])
not_manual = recent[~idx]

In [7]:
yesterday_wdt = yesterday.strftime("+%Y-%m-%dT00:00:00Z/11")
today_wdt = today.strftime("+%Y-%m-%dT00:00:00Z/11")

# Wikidata properties to write, paired with the dataset column holding each value.
properties = [("P1603", "Confirmed"), ("P1120", "Deaths"), ("P8010", "Recovered")]

with open(f'../data/{today_table}.qs', 'w') as file:
    for _, row in not_manual.iterrows():
        for prop, column in properties:
            # One QuickStatements line per statement: value, point in time (P585),
            # reference URL (S854) and retrieval date (S813).
            print(f'{row["item"]}|{prop}|{int(row[column])}|P585|{yesterday_wdt}'
                  f'|S854|"https://github.com/datasets/covid-19"|S813|{today_wdt}',
                  file=file)
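
The line format written by the loop above can also be isolated in a small helper, which makes it easy to check a single statement before generating a whole batch. This is only a sketch: the names `wikidata_day` and `qs_line` are hypothetical, and the pipe-separated layout simply mirrors the one used in this notebook.

```python
from datetime import date

SOURCE_URL = "https://github.com/datasets/covid-19"

def wikidata_day(d):
    """Format a date as a Wikidata time value with day precision (/11)."""
    return d.strftime("+%Y-%m-%dT00:00:00Z/11")

def qs_line(item, prop, value, point_in_time, retrieved):
    """Build one statement line: value, P585 qualifier, S854 and S813 reference."""
    return "|".join([
        item, prop, str(int(value)),
        "P585", wikidata_day(point_in_time),
        "S854", f'"{SOURCE_URL}"',
        "S813", wikidata_day(retrieved),
    ])

# Confirmed cases for Afghanistan on 2020-04-21, retrieved the following day.
line = qs_line("Q87768605", "P1603", 1092, date(2020, 4, 21), date(2020, 4, 22))
print(line)
```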