### Exploring data from datahub.io

The data comes from the [datasets/covid-19 repository](https://github.com/datasets/covid-19). The goal is to be able to automate, maybe soon, the integration of that data with Wikidata.

Some things to think about (jvfe):
- How should the data be referenced? Should [datahub.io](https://datahub.io/core/covid-19) be the reference?
    - They aggregate it from various sources.
- I acquired the country outbreak items with the following query and modified it slightly to make merging the items easier.
```
SELECT ?item ?itemLabel ?countryid ?countryidLabel
WHERE 
{
  ?item p:P31 ?statement. 
      ?statement ps:P31 wd:Q3241045. 
      ?statement pq:P642 wd:Q84263196.
      ?statement pq:P3005 ?countryid.
      ?countryid wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
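
The query above can also be run from Python. The following is a minimal sketch, assuming the public Wikidata Query Service endpoint `https://query.wikidata.org/sparql` and its standard SPARQL JSON result format; the helper names `bindings_to_rows` and `run_query` and the User-Agent string are hypothetical and not part of this notebook.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def bindings_to_rows(bindings):
    """Flatten SPARQL JSON bindings into dicts, reducing entity URIs to QIDs."""
    rows = []
    for binding in bindings:
        row = {}
        for key, cell in binding.items():
            value = cell["value"]
            if cell.get("type") == "uri":
                # http://www.wikidata.org/entity/Q889 -> Q889
                value = value.rsplit("/", 1)[-1]
            row[key] = value
        rows.append(row)
    return rows


def run_query(sparql, endpoint=WDQS_ENDPOINT):
    """Send the query to the endpoint and return the flattened rows."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "covid-datahub-notebook-example/0.1"})
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return bindings_to_rows(data["results"]["bindings"])
```

The rows returned by `run_query` could then be passed to `pd.DataFrame` and saved as a CSV file for the merge step.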

In [1]:
%load_ext pycodestyle_magic

In [2]:
%flake8_on

In [3]:
import pandas as pd

In [4]:
from datetime import date, timedelta
yesterday = date.today() - timedelta(days=1)
today = date.today()

yesterday_table = yesterday.strftime("%Y-%m-%d")
today_table = today.strftime("%Y-%m-%d")


In [5]:
countries = pd.read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv")
wdt_items = pd.read_csv("../data/country_outbreaks.csv")

1:80: E501 line too long (115 > 79 characters)


In [6]:
full = pd.merge(countries, wdt_items, on="Country")
full.head(3)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
0,2020-01-22,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
1,2020-01-23,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
2,2020-01-24,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
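
`pd.merge` defaults to an inner join, so any row whose `Country` value has no exact match in the other table is dropped silently. The following is a minimal sketch for listing such rows. It uses toy frames in place of the two CSV files; only the two item IDs come from this notebook's outputs.

```python
import pandas as pd

# Toy stand-ins for the two tables; the real ones are read from CSV files.
countries = pd.DataFrame({
    "Date": ["2020-06-13", "2020-06-13", "2020-06-13"],
    "Country": ["Afghanistan", "Algeria", "US"],
})
wdt_items = pd.DataFrame({
    "Country": ["Afghanistan", "Algeria"],
    "item": ["Q87768605", "Q87202921"],
})

# how="left" keeps every dataset row; indicator=True adds a "_merge" column
# telling whether a row was found in both tables or only in the left one.
checked = pd.merge(countries, wdt_items, on="Country",
                   how="left", indicator=True)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Country"])
print(unmatched)
```

Names that show up in `unmatched` would need to be aligned by hand before the inner merge can pick them up.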


In [7]:
from datetime import datetime
# The most recent data usually seems to be from the day before

query = "Date == @yesterday_table"
recent = full.query(query)

# That does not always happen, though, so look up the latest date instead.


dates_in_full = [datetime.strptime(date, "%Y-%m-%d") for date in full["Date"]]
most_recent_date = max(dates_in_full).strftime("%Y-%m-%d")

# DataFrame.query did not work here, so filter with a boolean mask
recent = full[full["Date"] == most_recent_date]

recent.head(2)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
143,2020-06-13,Afghanistan,24102,4201,451,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
287,2020-06-13,Algeria,10810,7420,760,Q87202921,2020 coronavirus pandemic in Algeria,Q262
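
The same "latest date" selection can be written without the explicit list of parsed dates. This is a sketch on toy data, assuming the `Date` column always uses the `%Y-%m-%d` format seen above:

```python
import pandas as pd

# Toy stand-in for the merged table
full = pd.DataFrame({
    "Date": ["2020-06-12", "2020-06-13", "2020-06-12", "2020-06-13"],
    "Country": ["Afghanistan", "Afghanistan", "Algeria", "Algeria"],
})

# Parse the column once, then keep the rows that carry the newest date.
parsed_dates = pd.to_datetime(full["Date"], format="%Y-%m-%d")
recent = full[parsed_dates == parsed_dates.max()]
most_recent_date = parsed_dates.max().strftime("%Y-%m-%d")
```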


In [8]:
# The following countries appear to be updated
# manually from more specific sources.
idx = recent['Country'].isin(['US', 'United Kingdom', 'France', 'Sweden',
                              'Brazil', 'Netherlands', 'China', 'Italy',
                              'Spain', 'Germany', 'Iran', 'India', 'Mexico',
                              'Argentina', 'Canada', 'Norway', 'Uruguay'])
not_manual = recent[~idx]

In [9]:
yesterday_wdt = yesterday.strftime("+%Y-%m-%dT00:00:00Z/11")
today_wdt = today.strftime("+%Y-%m-%dT00:00:00Z/11")

with open(f'../data/{today_table}.qs', 'w') as file:
    for index, row in not_manual.iterrows():
        print(
            row['item'] + "|P1603|" + str(int(row['Confirmed'])) +
            "|P585|" + yesterday_wdt + "|S854|" + '"' +
            "https://github.com/datasets/covid-19" + '"' +
            "|S813|" + today_wdt + "\n" +

            row['item'] + "|P1120|" + str(int(row['Deaths'])) +
            "|P585|" + yesterday_wdt + "|S854|" + '"' +
            "https://github.com/datasets/covid-19" + '"' +
            "|S813|" + today_wdt + "\n" +

            row['item'] + "|P8010|" + str(int(row['Recovered'])) +
            "|P585|" + yesterday_wdt + "|S854|" + '"' +
            "https://github.com/datasets/covid-19" + '"' +
            "|S813|" + today_wdt + "\n",
            file=file)
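
The three statements written for each row differ only in the property and the count, so the string building can be factored into one function. This is a sketch that reproduces the same pipe-separated format as the cell above; the helper name `quickstatement` is hypothetical and the dates in the usage lines are only examples.

```python
def quickstatement(item, prop, value, point_in_time, retrieved,
                   url="https://github.com/datasets/covid-19"):
    """Build one statement line with a P585 qualifier and a URL reference."""
    parts = [
        item, prop, str(int(value)),
        "P585", point_in_time,   # qualifier: point in time
        "S854", f'"{url}"',      # reference: reference URL
        "S813", retrieved,       # reference: retrieved
    ]
    return "|".join(parts)


# Example values only
line = quickstatement("Q87768605", "P1120", 451,
                      "+2020-06-13T00:00:00Z/11",
                      "+2020-06-14T00:00:00Z/11")
print(line)
```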

In [10]:
%run check_last_update_for_country_items.py

In [11]:
country_outbreak_items = list(recent["item"])

# The API only takes 50 items at a time, so we have to split the list.


# implementation from    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def get_chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))


chunks_of_country_outbreak_items = list(get_chunks(country_outbreak_items, 50))

outbreak_item_to_timestamp = {}

for chunk in chunks_of_country_outbreak_items:
    outbreak_item_to_timestamp.update(get_timestamp_of_last_edits(chunk))


6:80: E501 line too long (116 > 79 characters)
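
The script `check_last_update_for_country_items.py` is not shown in this notebook. As an illustration only, a function like `get_timestamp_of_last_edits` could be built on the MediaWiki API (`action=query` with `prop=revisions`), which returns the latest revision of each requested title. The function names, the User-Agent string, and the parsing below are assumptions, not the actual script.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def parse_last_edit_timestamps(api_response):
    """Map each page title (QID) to the timestamp of its latest revision."""
    pages = api_response["query"]["pages"]
    return {
        page["title"]: page["revisions"][0]["timestamp"]
        for page in pages.values()
        if "revisions" in page  # skip pages that carry no revision data
    }


def fetch_last_edit_timestamps(qids, api_url=API_URL):
    """Hypothetical stand-in for get_timestamp_of_last_edits (one chunk)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": "|".join(qids),
        "format": "json",
    })
    request = urllib.request.Request(
        f"{api_url}?{params}",
        headers={"User-Agent": "covid-datahub-notebook-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return parse_last_edit_timestamps(json.load(response))
```

Keeping the parsing separate from the network call makes it possible to check the mapping logic on a saved response.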


In [12]:
recent["timestamp_of_last_edit"] = recent["item"].map(outbreak_item_to_timestamp)

recent.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit
143,2020-06-13,Afghanistan,24102,4201,451,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-11T21:49:41Z


1:80: E501 line too long (81 > 79 characters)
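
The `SettingWithCopyWarning` above appears because `recent` was created by filtering `full`, and a new column is then assigned to it. One common way to avoid the warning is to take an explicit copy when filtering. This is a sketch on toy data, not a change to the cells above:

```python
import pandas as pd

# Toy stand-ins for the merged table and the timestamp lookup
full = pd.DataFrame({
    "Date": ["2020-06-12", "2020-06-13"],
    "item": ["Q87768605", "Q87768605"],
})
outbreak_item_to_timestamp = {"Q87768605": "2020-06-11T21:49:41Z"}

# .copy() makes `recent` an independent frame, so adding a column is
# unambiguous and leaves `full` untouched.
recent = full[full["Date"] == "2020-06-13"].copy()
recent["timestamp_of_last_edit"] = recent["item"].map(
    outbreak_item_to_timestamp)
```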


In [13]:
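from datetime import datetime, timezone


def convert_timestamp_to_time_until_now(timestamp):
    """Return the time elapsed since a UTC timestamp ending in 'Z'."""
    time_in_datetime_format = datetime.strptime(
        timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # The timestamps are in UTC, so compare against the current UTC time
    diff = datetime.now(timezone.utc) - time_in_datetime_format
    return diff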

In [14]:
recent["time_from_last_edit_until_now"] = recent["timestamp_of_last_edit"].map(convert_timestamp_to_time_until_now)

recent.head(1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
143,2020-06-13,Afghanistan,24102,4201,451,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-11T21:49:41Z,3 days 00:19:43.386745


1:80: E501 line too long (115 > 79 characters)


In [15]:
outdated_items = recent[recent["time_from_last_edit_until_now"] > timedelta(hours=23)]

In [16]:
outdated_items.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
143,2020-06-13,Afghanistan,24102,4201,451,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-11T21:49:41Z,3 days 00:19:43.386745
287,2020-06-13,Algeria,10810,7420,760,Q87202921,2020 coronavirus pandemic in Algeria,Q262,2020-06-08T17:00:32Z,6 days 05:08:52.386792
431,2020-06-13,Angola,138,61,6,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-05-29T04:41:32Z,16 days 17:27:52.386814
575,2020-06-13,Antigua and Barbuda,26,20,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-05-14T21:22:34Z,31 days 00:46:50.386831
863,2020-06-13,Australia,7320,6838,102,Q83873548,2020 coronavirus pandemic in Australia,Q408,2020-05-28T22:20:24Z,16 days 23:49:00.386860


In [17]:

table_date_in_wikidata_format = datetime.strptime(
    outdated_items["Date"].values[0], "%Y-%m-%d").strftime(
    "+%Y-%m-%dT00:00:00Z/11")

point_in_time = "|P585|" + table_date_in_wikidata_format

today_wdt = today.strftime("+%Y-%m-%dT00:00:00Z/11")


reference_URL = "|S854|" + '"' + "https://datahub.io/core/covid-19" + '"'
retrieved_in = "|S813|" + today_wdt
filename_in_archive = "|S7793|" + '"' + "r/countries-aggregated.csv" + '"'

reference = reference_URL + retrieved_in + filename_in_archive


with open(f'../data/{today_table}_outdated_items.qs', 'w') as file:
    for index, row in outdated_items.iterrows():
        print(
            row['item'] + "|P1603|" + str(int(row['Confirmed'])) +
            point_in_time + reference + "\n" +
            row['item'] + "|P1120|" + str(int(row['Deaths'])) +
            point_in_time + reference + "\n" +
            row['item'] + "|P8010|" + str(int(row['Recovered'])) +
            point_in_time + reference + "\n",
            file=file)

In [18]:
outdated_items.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
143,2020-06-13,Afghanistan,24102,4201,451,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-11T21:49:41Z,3 days 00:19:43.386745
287,2020-06-13,Algeria,10810,7420,760,Q87202921,2020 coronavirus pandemic in Algeria,Q262,2020-06-08T17:00:32Z,6 days 05:08:52.386792
431,2020-06-13,Angola,138,61,6,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-05-29T04:41:32Z,16 days 17:27:52.386814
575,2020-06-13,Antigua and Barbuda,26,20,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-05-14T21:22:34Z,31 days 00:46:50.386831
863,2020-06-13,Australia,7320,6838,102,Q83873548,2020 coronavirus pandemic in Australia,Q408,2020-05-28T22:20:24Z,16 days 23:49:00.386860


In [22]:
from wikidataintegrator import wdi_core, wdi_login
import credentials

# Login object
login_instance = wdi_login.WDLogin(user=credentials.username,
                                   pwd=credentials.password)

# Data object holding the death count (P1120) for one country
death_counts_for_country = wdi_core.WDQuantity(value=451, prop_nr='P1120')

# Data goes into a list, because several data objects can be provided at once
data_to_update_for_country = [death_counts_for_country]

# Search for the item, then edit it (or create a new one)
wd_item = wdi_core.WDItemEngine(wd_item_id="Q87768605",
                                data=data_to_update_for_country)
wd_item.write(login_instance,
              bot_account=True,
              edit_summary="updating case counts for today")

https://www.wikidata.org/w/api.php
Successfully logged in as CovidDatahubBot


'Q96288412'



In [None]:
credentials