### Exploring data from datahub.io

The data comes from the [datasets/covid-19 repository](https://github.com/datasets/covid-19). The hope is to soon automate the integration of that data with Wikidata.

Some things to think about (jvfe):
- How should the data be referenced? Should [datahub.io](https://datahub.io/core/covid-19) itself be the reference?
    - They aggregate the data from various sources.
- I acquired the country outbreak items via the following query and modified it slightly so that the items merge more easily.
```
SELECT ?item ?itemLabel ?countryid ?countryidLabel
WHERE 
{
  ?item p:P31 ?statement. 
      ?statement ps:P31 wd:Q3241045. 
      ?statement pq:P642 wd:Q84263196.
      ?statement pq:P3005 ?countryid.
      ?countryid wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
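The notebook later reads the query results from `../data/country_outbreaks.csv`, but does not show how that file was built. As a rough sketch of how results from the query above could be flattened into rows: the Wikidata Query Service returns SPARQL JSON in which every variable is an object with a `value` field and entities appear as full URIs. The function names, the `Country` column name, and the hand-written sample response are assumptions for illustration, not the author's code.

```python
def qid_from_uri(uri):
    """Return the trailing Q-id of an entity URI ('.../entity/Q889' -> 'Q889')."""
    return uri.rsplit("/", 1)[-1]


def bindings_to_rows(sparql_json):
    """Flatten SPARQL JSON results into one dict per result row."""
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        rows.append({
            "item": qid_from_uri(binding["item"]["value"]),
            "itemLabel": binding["itemLabel"]["value"],
            "countryid": qid_from_uri(binding["countryid"]["value"]),
            # Assumption: the country label is stored as "Country" so it can
            # serve as the merge key against the case data.
            "Country": binding["countryidLabel"]["value"],
        })
    return rows


# Hand-written sample in the shape of a SPARQL JSON response.
sample = {
    "results": {
        "bindings": [
            {
                "item": {"value": "http://www.wikidata.org/entity/Q87768605"},
                "itemLabel": {
                    "value": "2020 coronavirus pandemic in Afghanistan"
                },
                "countryid": {"value": "http://www.wikidata.org/entity/Q889"},
                "countryidLabel": {"value": "Afghanistan"},
            }
        ]
    }
}

rows = bindings_to_rows(sample)
print(rows[0])
```

Rows in this shape could then be written out with `csv.DictWriter` or passed to `pd.DataFrame`.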

In [1]:
%load_ext pycodestyle_magic

In [2]:
%flake8_on

In [3]:
import pandas as pd

In [4]:
from datetime import date, time, timedelta
yesterday = date.today() - timedelta(days=1)
today = date.today()

yesterday_table = yesterday.strftime("%Y-%m-%d")
today_table = today.strftime("%Y-%m-%d")


In [5]:
countries = pd.read_csv(
    "https://raw.githubusercontent.com/datasets/covid-19/"
    "master/data/countries-aggregated.csv"
)
wdt_items = pd.read_csv("../data/country_outbreaks.csv")


In [6]:
full = pd.merge(countries, wdt_items, on="Country")
full.head(3)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
0,2020-01-22,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
1,2020-01-23,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
2,2020-01-24,Afghanistan,0,0,0,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
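`pd.merge` performs an inner join by default, so any country whose name is spelled differently in the two tables is dropped without warning. Below is a minimal, pandas-free sketch of how such mismatches could be listed before merging. The helper name `unmatched_names` and the two short name lists are made up for illustration; they are not taken from either file.

```python
def unmatched_names(left_names, right_names):
    """Return the names found only on the left and only on the right."""
    left, right = set(left_names), set(right_names)
    return sorted(left - right), sorted(right - left)


# Illustrative inputs only: one spelling differs between the two lists.
case_data_countries = ["Afghanistan", "Algeria", "US"]
wikidata_countries = ["Afghanistan", "Algeria", "United States of America"]

only_in_case_data, only_in_wikidata = unmatched_names(
    case_data_countries, wikidata_countries
)
print(only_in_case_data)  # names an inner merge would drop from the case data
print(only_in_wikidata)   # names an inner merge would drop from the item list
```

With the real data, the two arguments would be `countries["Country"]` and `wdt_items["Country"]`.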


In [7]:
from datetime import datetime
# The most recent data usually seems to be from the day before.

query = "Date == @yesterday_table"
recent = full.query(query)

# That does not always hold, though, so take the latest date
# that is actually present in the data instead.
dates_in_full = [datetime.strptime(d, "%Y-%m-%d") for d in full["Date"]]
most_recent_date = max(dates_in_full).strftime("%Y-%m-%d")

# DataFrame.query did not work here, so filter with a boolean mask.
recent = full[full["Date"] == most_recent_date]

recent.head(2)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid
153,2020-06-23,Afghanistan,29481,9260,618,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889
307,2020-06-23,Algeria,12076,8674,861,Q87202921,2020 coronavirus pandemic in Algeria,Q262
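Because the `Date` column uses the zero-padded `YYYY-MM-DD` format, its strings sort in chronological order, so the latest date can also be found without parsing. The small sketch below compares both approaches on three dates that appear in the tables above.

```python
from datetime import datetime

dates = ["2020-01-22", "2020-06-23", "2020-01-24"]

# Plain string comparison is enough for zero-padded ISO dates.
latest_by_string = max(dates)

# The parse-then-format route used in the cell above gives the same answer.
latest_by_parsing = max(
    datetime.strptime(d, "%Y-%m-%d") for d in dates
).strftime("%Y-%m-%d")

print(latest_by_string, latest_by_parsing)
```

On the DataFrame this would amount to `full["Date"].max()`.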


In [8]:
# The following countries appear to be updated
# manually from more specific sources.
idx = recent['Country'].isin(['US', 'United Kingdom', 'France', 'Sweden',
                              'Brazil', 'Netherlands', 'China', 'Italy',
                              'Spain', 'Germany', 'Iran', 'India', 'Mexico',
                              'Argentina', 'Canada', 'Norway', 'Uruguay'])
not_manual = recent[~idx]

In [9]:
%run check_last_update_for_country_items.py
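The script run above is not shown; the next cell relies on it to define `get_timestamp_of_last_edits`. As a hedged sketch of what such a helper might involve (not the actual script), the MediaWiki Action API can return the latest revision timestamp for a batch of titles via `action=query` with `prop=revisions`. The function names, the placeholder page ids, and the sample response are assumptions. The network call is left out so the parsing logic can be run on its own.

```python
def build_revision_query_params(qids):
    """Build Action API parameters asking for each item's latest revision."""
    if len(qids) > 50:
        raise ValueError("at most 50 titles are accepted per request")
    return {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": "|".join(qids),
        "format": "json",
    }


def parse_last_edit_timestamps(api_json):
    """Map each page title (a Q-id) to the timestamp of its latest revision."""
    timestamps = {}
    for page in api_json["query"]["pages"].values():
        revisions = page.get("revisions")
        if revisions:  # missing pages carry no "revisions" key
            timestamps[page["title"]] = revisions[0]["timestamp"]
    return timestamps


params = build_revision_query_params(["Q87768605", "Q87202921"])

# Hand-written sample response; the page ids are placeholders.
sample_response = {
    "query": {
        "pages": {
            "101": {
                "title": "Q87768605",
                "revisions": [{"timestamp": "2020-06-16T14:02:42Z"}],
            },
            "102": {
                "title": "Q87202921",
                "revisions": [{"timestamp": "2020-06-23T11:17:14Z"}],
            },
            "-1": {"title": "Q0", "missing": ""},
        }
    }
}

print(parse_last_edit_timestamps(sample_response))
```

In a real run, `params` would be sent to `https://www.wikidata.org/w/api.php` and the decoded JSON response passed to the parser.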

In [10]:
country_outbreak_items = list(recent["item"])

# The API only takes 50 items at a time, so the list is split into chunks.


# Implementation from
# https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def get_chunks(items, n):
    n = max(1, n)
    return (items[i:i+n] for i in range(0, len(items), n))


chunks_of_country_outbreak_items = list(get_chunks(country_outbreak_items, 50))

outbreak_item_to_timestamp = {}

for chunk in chunks_of_country_outbreak_items:
    outbreak_item_to_timestamp.update(get_timestamp_of_last_edits(chunk))


In [11]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is set on a slice of another DataFrame.
recent = recent.assign(
    timestamp_of_last_edit=recent["item"].map(outbreak_item_to_timestamp)
)

recent.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit
153,2020-06-23,Afghanistan,29481,9260,618,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-16T14:02:42Z
307,2020-06-23,Algeria,12076,8674,861,Q87202921,2020 coronavirus pandemic in Algeria,Q262,2020-06-23T11:17:14Z
461,2020-06-23,Angola,189,77,10,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-06-16T14:02:49Z
615,2020-06-23,Antigua and Barbuda,26,22,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-06-17T19:41:19Z
769,2020-06-23,Argentina,47203,13576,1078,Q87235137,2020 coronavirus pandemic in Argentina,Q414,2020-06-23T13:57:43Z


In [12]:
from datetime import datetime


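def convert_timestamp_to_time_until_now(timestamp):
    # Wikidata timestamps end in "Z" (UTC), while datetime.now() is naive
    # local time, so the result is shifted by the local UTC offset.
    time_in_datetime_format = datetime.strptime(timestamp,
                                                "%Y-%m-%dT%H:%M:%SZ")
    diff = datetime.now() - time_in_datetime_format
    return diff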

In [13]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning.
recent = recent.assign(
    time_from_last_edit_until_now=recent["timestamp_of_last_edit"].map(
        convert_timestamp_to_time_until_now
    )
)

recent.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
153,2020-06-23,Afghanistan,29481,9260,618,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-16T14:02:42Z,7 days 17:23:43.160619
307,2020-06-23,Algeria,12076,8674,861,Q87202921,2020 coronavirus pandemic in Algeria,Q262,2020-06-23T11:17:14Z,0 days 20:09:11.160779
461,2020-06-23,Angola,189,77,10,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-06-16T14:02:49Z,7 days 17:23:36.160853
615,2020-06-23,Antigua and Barbuda,26,22,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-06-17T19:41:19Z,6 days 11:45:06.160899
769,2020-06-23,Argentina,47203,13576,1078,Q87235137,2020 coronavirus pandemic in Argentina,Q414,2020-06-23T13:57:43Z,0 days 17:28:42.160939


In [14]:
# Filter disabled for now, so every item is treated as outdated.
# outdated_items = recent[
#     recent["time_from_last_edit_until_now"] > timedelta(hours=23)
# ]

outdated_items = recent


In [15]:
outdated_items.head(5)

Unnamed: 0,Date,Country,Confirmed,Recovered,Deaths,item,itemLabel,countryid,timestamp_of_last_edit,time_from_last_edit_until_now
153,2020-06-23,Afghanistan,29481,9260,618,Q87768605,2020 coronavirus pandemic in Afghanistan,Q889,2020-06-16T14:02:42Z,7 days 17:23:43.160619
307,2020-06-23,Algeria,12076,8674,861,Q87202921,2020 coronavirus pandemic in Algeria,Q262,2020-06-23T11:17:14Z,0 days 20:09:11.160779
461,2020-06-23,Angola,189,77,10,Q88082534,2020 coronavirus pandemic in Angola,Q916,2020-06-16T14:02:49Z,7 days 17:23:36.160853
615,2020-06-23,Antigua and Barbuda,26,22,3,Q87708331,2020 coronavirus pandemic in Antigua and Barbuda,Q781,2020-06-17T19:41:19Z,6 days 11:45:06.160899
769,2020-06-23,Argentina,47203,13576,1078,Q87235137,2020 coronavirus pandemic in Argentina,Q414,2020-06-23T13:57:43Z,0 days 17:28:42.160939
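The commented-out filter above would keep only items last edited more than 23 hours ago. Below is a sketch of that staleness check using timezone-aware datetimes, so that the UTC timestamps from Wikidata are compared against UTC rather than local time. The helper names and the fixed `now` value are assumptions for illustration; the two timestamps are the Afghanistan and Argentina values from the table.

```python
from datetime import datetime, timedelta, timezone


def time_since_edit(timestamp, now=None):
    """Return how long ago a 'YYYY-MM-DDTHH:MM:SSZ' timestamp was, in UTC."""
    edited = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    edited = edited.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - edited


def is_outdated(timestamp, max_age=timedelta(hours=23), now=None):
    """True when the last edit is older than max_age."""
    return time_since_edit(timestamp, now) > max_age


# A fixed reference time keeps the example reproducible.
fixed_now = datetime(2020, 6, 24, 7, 0, 0, tzinfo=timezone.utc)

afghanistan_outdated = is_outdated("2020-06-16T14:02:42Z", now=fixed_now)
argentina_outdated = is_outdated("2020-06-23T13:57:43Z", now=fixed_now)
print(afghanistan_outdated, argentina_outdated)
```

Passing `now` explicitly also makes the check easy to test, since the result no longer depends on when the cell is run.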


In [16]:
# The following countries appear to be updated
# manually from more specific sources.
idx = outdated_items['Country'].isin(
    ['US', 'United Kingdom', 'France', 'Sweden', 'Brazil', 'Netherlands',
     'China', 'Italy', 'Spain', 'Germany', 'Iran', 'India', 'Mexico',
     'Argentina', 'Canada', 'Norway', 'Uruguay'])
outdated_items = outdated_items[~idx]


In [17]:
outdated_items = outdated_items[10:]

In [18]:
def make_url_statement_reference(url):
    return wdi_core.WDString(value=url, prop_nr="P854", is_reference=True)


def make_retrieved_today_statement_reference():
    today = datetime.now()
    today_wikidata_format = today.strftime("+%Y-%m-%dT00:00:00Z")
    reference_retrieved_in = wdi_core.WDTime(today_wikidata_format,
                                             prop_nr='P813', is_reference=True)
    return reference_retrieved_in


def get_date_of_ocurrence_statement_qualifier(row):
    # P585 is "point in time": the date the counts refer to.
    date_string = row["Date"]
    parsed_date = datetime.strptime(date_string, '%Y-%m-%d')
    date_string_in_wikidata_format = parsed_date.strftime(
        "+%Y-%m-%dT00:00:00Z")
    qualifier_date_of_ocurrence = wdi_core.WDTime(
        date_string_in_wikidata_format, prop_nr="P585", is_qualifier=True)

    return qualifier_date_of_ocurrence


In [19]:
from wikidataintegrator import wdi_core, wdi_login
import credentials
# login object
login_instance = wdi_login.WDLogin(user=credentials.username,
                                   pwd=credentials.password)


https://www.wikidata.org/w/api.php
Successfully logged in as CovidDatahubBot


In [21]:

# Preview the first item that would be updated.
for index, row in outdated_items.iterrows():
    item_being_updated = row["Country"] + " " + row["item"]
    print(item_being_updated)
    break

Bhutan Q87715166
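The qualifier and reference helpers defined in cell 18 pass dates to `wdi_core.WDTime` as strings of the form `+YYYY-MM-DDT00:00:00Z`. A standalone sketch of that conversion, with no Wikidata dependency, is shown below; the function name `to_wikidata_time` is hypothetical.

```python
from datetime import datetime


def to_wikidata_time(date_string):
    """Convert 'YYYY-MM-DD' into a day-precision Wikidata time string."""
    # strptime raises ValueError for strings that are not valid dates,
    # which catches malformed input before anything is written.
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return parsed.strftime("+%Y-%m-%dT00:00:00Z")


wikidata_time = to_wikidata_time("2020-06-23")
print(wikidata_time)
```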


In [44]:
from datetime import datetime, date

url_for_reference = make_url_statement_reference(
    "https://datahub.io/core/covid-19")
reference_retrieved_in = make_retrieved_today_statement_reference()

references_list = [[url_for_reference, reference_retrieved_in]]

for index, row in outdated_items.iterrows():
    item_being_updated = row["Country"] + " " + row["item"]
    print(item_being_updated)
    # Use the helper under the name it was defined with in cell 18.
    qualifier = get_date_of_ocurrence_statement_qualifier(row)
    qualifier_list = [qualifier]

    outbreak_item = row["item"]
    deaths = row["Deaths"]
    confirmeds = row["Confirmed"]
    recovereds = row["Recovered"]

    print(deaths)
    print(confirmeds)
    print(recovereds)

    # P1120: number of deaths
    deaths_statement = wdi_core.WDQuantity(
        value=deaths, prop_nr='P1120',
        references=references_list, qualifiers=qualifier_list)
    # P1603: number of cases
    confirmeds_statement = wdi_core.WDQuantity(
        value=confirmeds, prop_nr='P1603',
        references=references_list, qualifiers=qualifier_list)
    # P8010: number of recoveries
    recovereds_statement = wdi_core.WDQuantity(
        value=recovereds, prop_nr='P8010',
        references=references_list, qualifiers=qualifier_list)

    data_to_update_for_country = [deaths_statement,
                                  confirmeds_statement,
                                  recovereds_statement]

    wd_item = wdi_core.WDItemEngine(wd_item_id=outbreak_item,
                                    data=data_to_update_for_country)

    wd_item.write(login_instance,
                  bot_account=True,
                  max_retries=3,
                  edit_summary="updating case counts for today")

Bhutan Q87715166
0
67
22
2020-06-16 14:10:15.949445: maxlag. sleeping for 5.866666666666666 seconds


KeyboardInterrupt: 

In [26]:
credentials

<module 'credentials' from '/home/lubianat/Documents/wikidata_covid19/sandbox/worldwide_data/src/credentials.py'>