### Exploring data from datahub.io

The data comes from the [datasets/covid-19 repository](https://github.com/datasets/covid-19); the goal is to eventually automate integrating it into Wikidata.

Some things to think about (jvfe):
- How should the data be referenced? Should [datahub.io](https://datahub.io/core/covid-19) be cited as the reference?
    - They aggregate it from various sources.

- I acquired the country outbreak items with the following query and modified it slightly so that the items merge more cleanly.
```
SELECT ?item ?itemLabel ?countryid ?countryidLabel
WHERE
{
  ?item p:P31 ?statement.
  ?statement ps:P31 wd:Q3241045.
  ?statement pq:P642 wd:Q84263196.
  ?statement pq:P3005 ?countryid.
  ?countryid wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
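
The results of this query can be flattened into a table like the one that `../data/country_outbreaks.csv` appears to hold. The following is a minimal sketch, assuming the standard SPARQL JSON results format returned by the Wikidata Query Service. The helper name `bindings_to_rows` and the renaming of `countryidLabel` to `Country` (so that the merge key matches the datahub.io data) are assumptions, not necessarily what was done for this notebook.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def bindings_to_rows(sparql_json):
    """Flatten SPARQL JSON results into plain dicts, one per result row."""
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        row = {}
        for name, cell in binding.items():
            value = cell["value"]
            # Entity URIs become bare QIDs, e.g. ".../entity/Q889" -> "Q889".
            if cell["type"] == "uri" and value.startswith(ENTITY_PREFIX):
                value = value[len(ENTITY_PREFIX):]
            row[name] = value
        # Assumed renaming so the label column can serve as the merge key.
        row["Country"] = row.pop("countryidLabel")
        rows.append(row)
    return rows


# Hand-written sample in the shape of a query service response (not live data).
sample = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": ENTITY_PREFIX + "Q87768605"},
                "itemLabel": {"type": "literal",
                              "value": "2020 coronavirus pandemic in Afghanistan"},
                "countryid": {"type": "uri", "value": ENTITY_PREFIX + "Q889"},
                "countryidLabel": {"type": "literal", "value": "Afghanistan"},
            }
        ]
    }
}

rows = bindings_to_rows(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame` and calling `to_csv` would then produce a file that can be merged on `Country`.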

In [1]:
import pandas as pd

In [2]:
from datetime import date, timedelta
yesterday = date.today() - timedelta(days=1)
today = date.today()

yesterday_table = yesterday.strftime("%Y-%m-%d")
today_table = today.strftime("%Y-%m-%d")


In [3]:
countries = pd.read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv")
wdt_items = pd.read_csv("../data/country_outbreaks.csv")

In [4]:
full = pd.merge(countries, wdt_items, on="Country")
full

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
0,2020-01-22,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
1,2020-01-23,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
2,2020-01-24,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
3,2020-01-25,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
4,2020-01-26,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
...,...,...,...,...,...,...,...,...
12735,2020-04-24,Zimbabwe,29,2,4,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
12736,2020-04-25,Zimbabwe,31,2,4,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
12737,2020-04-26,Zimbabwe,31,2,4,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
12738,2020-04-27,Zimbabwe,32,5,4,Q88164033,2020 coronavirus pandemic in Zimbabwe,Q954
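
An inner merge silently drops countries whose names differ between the two sources. One way to spot them is a left merge with `indicator=True`. The sketch below uses two tiny stand-in frames rather than the real `countries` and `wdt_items`; the row where `US` has no matching outbreak item is made up for illustration.

```python
import pandas as pd

# Stand-ins for `countries` and `wdt_items`; "US" deliberately has no match.
countries_demo = pd.DataFrame({
    "Country": ["Afghanistan", "US", "Zimbabwe"],
    "Confirmed": [0, 1, 0],
})
items_demo = pd.DataFrame({
    "Country": ["Afghanistan", "Zimbabwe"],
    "item": ["Q87768605", "Q88164033"],
})

# A left merge keeps every row of the left frame; the `_merge` column
# records whether a matching row was found on the right.
checked = pd.merge(countries_demo, items_demo, on="Country",
                   how="left", indicator=True)
unmatched = sorted(
    checked.loc[checked["_merge"] == "left_only", "Country"].unique()
)
print(unmatched)
```

Running the same check on the real frames would list the country names that need to be harmonised before the inner merge.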


In [5]:
# The most recent data seems to be from the previous day.
recent = full.query("Date == @yesterday_table")
recent

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
97,2020-04-28,Afghanistan,1828,228,58,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
195,2020-04-28,Algeria,3649,1651,437,Q87202921,2020 coronavirus pandemic in Algeria,Q262
293,2020-04-28,Angola,27,6,2,Q88082534,2020 coronavirus pandemic in Angola,Q916
391,2020-04-28,Antigua and Barbuda,24,11,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781
489,2020-04-28,Argentina,4127,1162,207,Q87235137,2020 coronavirus pandemic in Argentina,Q414
...,...,...,...,...,...,...,...,...
12347,2020-04-28,Venezuela,329,142,10,Q87652010,2020 coronavirus pandemic in Venezuela,Q717
12445,2020-04-28,Vietnam,270,222,0,Q83873057,2020 coronavirus pandemic in Vietnam,Q881
12543,2020-04-28,Yemen,1,1,0,Q89695985,2020 coronavirus pandemic in Yemen,Q805
12641,2020-04-28,Zambia,95,42,3,Q87976629,2020 coronavirus pandemic in Zambia,Q953


In [6]:
# The following countries appear to be updated manually from more specific sources.
idx = recent['Country'].isin(['US', 'United Kingdom', 'France', 'Sweden', 'Brazil', 'Netherlands',
                             'China', 'Italy', 'Spain', 'Germany', 'Iran', 'Mexico', 'Argentina',
                             'Canada', 'Norway', 'Portugal', 'Tunisia', 'Uruguay'])
not_manual = recent[~idx]

In [7]:
yesterday_wdt = yesterday.strftime("+%Y-%m-%dT00:00:00Z/11")
today_wdt = today.strftime("+%Y-%m-%dT00:00:00Z/11")

# Wikidata property for each column:
# number of cases (P1603), number of deaths (P1120), number of recoveries (P8010).
properties = {"Confirmed": "P1603", "Deaths": "P1120", "Recovered": "P8010"}
reference_url = "https://github.com/datasets/covid-19"

with open(f"../data/{today_table}.qs", "w") as file:
    for index, row in not_manual.iterrows():
        for column, prop in properties.items():
            # One QuickStatements line per statement: value, point in time (P585),
            # reference URL (S854) and retrieval date (S813).
            print(f'{row["item"]}|{prop}|{int(row[column])}|P585|{yesterday_wdt}'
                  f'|S854|"{reference_url}"|S813|{today_wdt}',
                  file=file)
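
Before pasting the generated file into QuickStatements, it can help to check that every line has the expected shape. The following is a rough sketch: the regular expression only encodes the layout written by the cell above (item, property, integer value, `P585` date, `S854` URL, `S813` date) and is not a full QuickStatements grammar. `is_expected_line` is a hypothetical helper name.

```python
import re

# Day-precision date as written by the cell above, e.g. +2020-04-28T00:00:00Z/11
DATE = r"\+\d{4}-\d{2}-\d{2}T00:00:00Z/11"

LINE_PATTERN = re.compile(
    r"Q\d+\|P\d+\|\d+"            # item, property, non-negative integer value
    rf"\|P585\|{DATE}"            # qualifier: point in time
    r'\|S854\|"https?://[^"]+"'   # reference URL in double quotes
    rf"\|S813\|{DATE}"            # reference: retrieval date
)


def is_expected_line(line):
    """Return True if the line matches the layout produced by this notebook."""
    return LINE_PATTERN.fullmatch(line.strip()) is not None


good = ('Q87768605|P1603|1828|P585|+2020-04-28T00:00:00Z/11'
        '|S854|"https://github.com/datasets/covid-19"'
        '|S813|+2020-04-29T00:00:00Z/11')
bad = "Q87768605|P1603|1828.0|P585|+2020-04-28T00:00:00Z/11"

print(is_expected_line(good), is_expected_line(bad))
```

A float such as `1828.0` fails the check, which is the kind of slip the `int(...)` conversion in the cell above guards against.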

In [8]:
%run check_last_update_for_country_items.py
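
The script run above is expected to define `get_timestamp_of_last_edits`, which the next cell uses; its source is not shown here. As a rough idea of what such a helper might do, the sketch below asks the MediaWiki API of Wikidata for the latest revision timestamp of each title. This is an assumption about the approach, not the script's actual code, and the names `build_query_url`, `parse_timestamps` and `fetch_timestamps`, as well as the User-Agent string, are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_query_url(qids):
    """URL asking for the latest revision timestamp of each item."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": "|".join(qids),  # the API caps the number of titles per request
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_timestamps(response):
    """Map each page title (the QID) to the timestamp of its latest revision."""
    pages = response["query"]["pages"]
    return {page["title"]: page["revisions"][0]["timestamp"]
            for page in pages.values() if "revisions" in page}


def fetch_timestamps(qids):
    """Network call, kept separate so the parsing can be tried offline."""
    # Placeholder User-Agent; replace with a descriptive one of your own.
    request = Request(build_query_url(qids),
                      headers={"User-Agent": "covid-notebook-sketch/0.1"})
    with urlopen(request) as reply:
        return parse_timestamps(json.load(reply))


# Offline demo with a hand-written response in the API's default JSON layout.
sample_response = {"query": {"pages": {"1": {
    "title": "Q87768605",
    "revisions": [{"timestamp": "2020-04-29T13:29:02Z"}],
}}}}
demo = parse_timestamps(sample_response)
print(demo)
```

Keeping the URL building and parsing apart from the network call makes it possible to test the logic without hitting the API.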

In [9]:
country_outbreak_items_of_interest = list(recent["item"])

# The API only accepts 50 items per request, so the list has to be split into chunks.

# Implementation from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def get_chunks(items, n):
    """Return a generator of successive chunks of at most n elements."""
    n = max(1, n)
    return (items[i:i+n] for i in range(0, len(items), n))


chunks_of_country_outbreak_items_of_interest = list(get_chunks(country_outbreak_items_of_interest, 50))
        
outbreak_item_to_timestamp = {}

for chunk in chunks_of_country_outbreak_items_of_interest:
    outbreak_item_to_timestamp.update(get_timestamp_of_last_edits(chunk))


In [10]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
recent = recent.assign(timestamp_of_last_edit=recent["item"].map(outbreak_item_to_timestamp))

recent.head(1)


Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit
97,2020-04-28,Afghanistan,1828,228,58,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-04-29T13:29:02Z


In [11]:
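from datetime import datetime, timezone

def convert_timestamp_to_time_until_now(timestamp):
    """Return the time elapsed since a UTC timestamp such as 2020-04-29T13:29:02Z."""
    # The trailing "Z" marks UTC, so compare against the current time in UTC, not local time.
    time_in_datetime_format = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    diff = datetime.now(timezone.utc) - time_in_datetime_format
    return diff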

In [12]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
recent = recent.assign(
    time_from_last_edit_until_now=recent["timestamp_of_last_edit"].map(convert_timestamp_to_time_until_now)
)

recent.head(1)


Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
97,2020-04-28,Afghanistan,1828,228,58,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-04-29T13:29:02Z,03:37:35.333148


In [13]:
outdated_items = recent[recent["time_from_last_edit_until_now"] > timedelta(hours=23)]

In [15]:
outdated_items.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
293,2020-04-28,Angola,27,6,2,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-04-27T10:08:48Z,2 days 06:57:49.333267
391,2020-04-28,Antigua and Barbuda,24,11,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-04-23T21:42:43Z,5 days 19:23:54.333292
587,2020-04-28,Australia,6744,5665,89,Q83873548,2020 coronavirus pandemic in Australia,Q408,2020-04-27T03:48:57Z,2 days 13:17:40.333333
979,2020-04-28,Barbados,80,39,6,Q87902902,2020 coronavirus pandemic in Barbados,Q244,2020-04-27T23:27:57Z,1 days 17:38:40.333421
1077,2020-04-28,Benin,64,33,1,Q87781572,2020 coronavirus pandemic in Benin,Q962,2020-04-23T11:38:16Z,6 days 05:28:21.333453
