# Gathering Influenza-Related Data via the DELPHI Epidata API
### Purpose of this notebook
* Gather the Wikipedia pageviews and ILInet data from the DELPHI epidata API
* Store the data in pandas DataFrames for analysis in other notebooks
* Create a DataFrame that maps epiweeks to their corresponding datetimes

*The relevant computed variables are stored in IPython's local data store to avoid recomputation. Other notebooks can load them with the `%store -r` magic command.*
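Outside IPython, the same "load if cached, otherwise compute and save" pattern can be sketched with `pickle`. This is an alternative to `%store`, not what this notebook uses, and `load_or_compute` is a hypothetical helper name.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the object pickled at `path`; if the file is absent,
    call `compute()`, pickle its result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    value = compute()
    with open(path, 'wb') as f:
        pickle.dump(value, f)
    return value
```

The second call with the same path reads the file instead of calling `compute` again, which is the behaviour the `%store` checks in the cells below aim for.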

In [1]:
import epidata as delphi
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict
from datetime import datetime

%store -r pageViewResps
%store -r wILIresp
%store -r pageViews
%store -r wILI
%store -r epiweeksDf

# names defined so far, including anything restored by %store -r above
definedVariables = set(globals())

#### Getting Wikipedia Page Views for Flu-Related Articles

In [2]:
if 'pageViewResps' not in definedVariables:
    
    print("Calling the DELPHI epidata API for wikipedia pageviews")
    
    epidata = delphi.Epidata() # interface to CMU delphi API

    with open('./data/allarticles.txt') as f:
        fluRelatedArticles = [article.strip() for article in f]
    years = range(2008, 2017) # 2008 - 2016 full years
    # epiweeks are YYYYWW integers; request weeks 01-52 of each year
    epiranges = [ epidata.range(yr * 100 + 1, yr * 100 + 52) for yr in years ]
    pageViewResps = []
    # one request per year, pausing between calls
    for epiyear in epiranges:
        resp = epidata.wiki(fluRelatedArticles, epiweeks=epiyear)['epidata']
        pageViewResps.extend(resp)
        time.sleep(15)
    %store pageViewResps
else:
    print("Found pageViewResps")

Found pageViewResps
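The year-by-year ranges above rely on epiweeks being encoded as integers of the form YYYYWW. A minimal sketch of that encoding follows; `make_epiweek` and `split_epiweek` are hypothetical helper names, not part of the epidata client.

```python
def make_epiweek(year, week):
    """Encode a (year, week) pair as a YYYYWW integer, e.g. (2008, 1) -> 200801."""
    return year * 100 + week


def split_epiweek(epiweek):
    """Decode a YYYYWW integer back into a (year, week) pair."""
    return divmod(epiweek, 100)


# first and last requested epiweek for each year from 2008 to 2016
yearly_bounds = [(make_epiweek(yr, 1), make_epiweek(yr, 52))
                 for yr in range(2008, 2017)]
```

Because each range stops at week 52, any week 53 (such as 200853 or 201453) is never requested for the pageviews.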


#### Getting national-level ILInet data

In [3]:
if 'wILIresp' not in definedVariables:
    
    print("Calling the DELPHI epidata API for ILInet data")
    
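    # re-create the client here: the previous cell only defines it when pageViewResps is not cached
    epidata = delphi.Epidata()
    wILIresp = epidata.fluview('nat', epidata.range(200801, 201652))['epidata']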
    %store wILIresp
else:
    print("Found wILIresp")

Found wILIresp


#### Putting pageViews API response data in DataFrame

In [4]:
if 'pageViews' not in definedVariables:
    # map each article to its weekly view counts, keyed by epiweek (from 2008 to 2016)
    pageToViews = defaultdict(dict)
    for week in pageViewResps:
        pageToViews[week['article']][week['epiweek']] = week['count']

    # rows are epiweeks, columns are articles; keying by epiweek keeps counts aligned
    # even if an article is missing some weeks
    pageViews = pd.DataFrame(pageToViews).sort_index()
    # missing weeks show up as NaN, which forces float columns; fill with 0 and cast back to int
    pageViews = pageViews.fillna(0).astype('int')
    
    %store pageViews
else:
    print("Found pageViews")

pageViews[:5]

Found pageViews


epiweek,influenzavirus_c,influenza_a_virus_subtype_h1n1,rhinorrhea,sore_throat,equine_influenza,swine_influenza,influenza_b_virus,influenza_a_virus_subtype_h3n2,antiviral_drugs,influenza_a_virus_subtype_h7n7,...,fatigue_(medical),cat_flu,paracetamol,influenzavirus_a,influenza,influenzalike_illness,human_flu,viral_neuraminidase,influenza_a_virus_subtype_h7n2,avian_influenza
200801,209,34,1511,3513,187,17,14,242,135,6,...,957,319,24246,1500,17568,0,350,0,14,3292
200802,243,33,1821,3841,212,21,12,260,185,12,...,1045,339,24699,2229,23338,0,453,0,19,4870
200803,228,45,1751,3549,268,16,18,97,193,16,...,960,291,22948,2488,22742,0,534,0,25,5478
200804,247,40,1786,3736,297,24,13,27,171,12,...,440,298,19869,3295,23488,0,577,0,16,6522
200805,341,64,1707,3859,230,38,22,25,202,9,...,1174,354,19414,4557,29240,0,735,0,14,6773
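As a minimal sketch of the same reshaping, long-format records can also be turned into a wide epiweek-by-article table with `DataFrame.pivot`. The `records` list below is hypothetical sample data using the `article`, `epiweek`, and `count` keys read in the cell above; it is not real API output.

```python
import pandas as pd

# hypothetical rows with the same keys the cell above reads from the response
records = [
    {'article': 'influenza', 'epiweek': 200802, 'count': 23338},
    {'article': 'influenza', 'epiweek': 200801, 'count': 17568},
    {'article': 'cat_flu', 'epiweek': 200801, 'count': 319},
]

long_df = pd.DataFrame(records)

# one row per epiweek, one column per article; absent (epiweek, article)
# pairs become NaN, so fill them with 0 before casting to int
wide = (long_df.pivot(index='epiweek', columns='article', values='count')
        .sort_index()
        .fillna(0)
        .astype(int))
```

Because each count is placed by its own epiweek, a week that is missing for one article leaves a zero in that row rather than shifting later values up.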


#### Putting ILInet API response in DataFrame

In [5]:
if 'wILI' not in definedVariables:
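    # sort the rows by epiweek first so that values and index stay aligned
    wILIrows = sorted(wILIresp, key=lambda week: week['epiweek'])
    wILIvalues = [ week['ili'] for week in wILIrows ]
    wILIindex = [ week['epiweek'] for week in wILIrows ]
    wILI = pd.DataFrame(wILIvalues, columns=['Weekly ILI'], index=wILIindex)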
    wILI.drop([200853, 201453], inplace=True) # these epiweeks aren't in pageViews
    
    %store wILI
else:
    print("Found wILI")
wILI[:5]

Found wILI


epiweek,Weekly ILI
200801,2.254048
200802,2.091472
200803,2.359343
200804,3.323314
200805,4.43381
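Values and epiweeks have to be sorted together, not separately. The sketch below shows one way to do that, assuming records with the `epiweek` and `ili` keys used in the cell above; the two records are hypothetical and deliberately out of order.

```python
import pandas as pd

# hypothetical records, deliberately listed out of epiweek order
records = [
    {'epiweek': 200802, 'ili': 2.091472},
    {'epiweek': 200801, 'ili': 2.254048},
]

# set the epiweek as the index before sorting, so each value moves with its week
wili_sketch = (pd.DataFrame(records)
               .set_index('epiweek')[['ili']]
               .rename(columns={'ili': 'Weekly ILI'})
               .sort_index())
```

Sorting only the index list while leaving the values in response order would attach 2.091472 to 200801 in this example.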


#### Creating DataFrame that maps Epiweek number to a datetime
This is useful for time series analysis, where integer epiweeks (e.g. 200840) are awkward to work with. The data is taken from:
> https://ibis.health.state.nm.us/resource/MMWRWeekCalendar.html

Instead of sending a GET request, I saved the page's HTML source in the data folder as `data/epiweeks.html`.

In [6]:
if 'epiweeksDf' not in definedVariables:
    
    print("Creating epiweeks dataframe")
    
    with open("data/epiweeks.html") as f:
        html = f.read().replace('\n', '')
        soup = BeautifulSoup( html, 'lxml' )
        tables = soup.find_all("table", {'class':'Info'})

    epiweeksDf = pd.DataFrame()

    # walk the tables in reverse page order so the year columns come out ascending
    for table in tables[::-1]:
        rows = iter(table.find_all('tr'))
        next(rows) # skip header
        years = [int(year) for year in next(rows).text.split()]
        df = pd.DataFrame(columns=years)
        for i, row in enumerate(rows):
            # the first cell is the week number; the rest are one date per year
            weeks = [ datetime.strptime(d, '%m/%d/%Y') for d in row.text.split()[1:] ]
            # skip rows that lack a date for some year (e.g. week 53)
            if len(weeks) == len(years):
                df.loc[i+1] = weeks
        epiweeksDf = pd.concat([epiweeksDf, df], axis=1)
    %store epiweeksDf

else:
    print("Found epiweeksDf")

epiweeksDf[:5] 

Found epiweeksDf


week,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
1,2006-01-07,2007-01-06,2008-01-05,2009-01-10,2010-01-09,2011-01-08,2012-01-07,2013-01-05,2014-01-04,2015-01-10,2016-01-09,2017-01-07,2018-01-06,2019-01-05,2020-01-04
2,2006-01-14,2007-01-13,2008-01-12,2009-01-17,2010-01-16,2011-01-15,2012-01-14,2013-01-12,2014-01-11,2015-01-17,2016-01-16,2017-01-14,2018-01-13,2019-01-12,2020-01-11
3,2006-01-21,2007-01-20,2008-01-19,2009-01-24,2010-01-23,2011-01-22,2012-01-21,2013-01-19,2014-01-18,2015-01-24,2016-01-23,2017-01-21,2018-01-20,2019-01-19,2020-01-18
4,2006-01-28,2007-01-27,2008-01-26,2009-01-31,2010-01-30,2011-01-29,2012-01-28,2013-01-26,2014-01-25,2015-01-31,2016-01-30,2017-01-28,2018-01-27,2019-01-26,2020-01-25
5,2006-02-04,2007-02-03,2008-02-02,2009-02-07,2010-02-06,2011-02-05,2012-02-04,2013-02-02,2014-02-01,2015-02-07,2016-02-06,2017-02-04,2018-02-03,2019-02-02,2020-02-01
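The dates in this table can also be computed directly. The sketch below assumes the usual MMWR convention: weeks run Sunday through Saturday, and week 1 is the week containing January 4. The helper name `epiweek_end_date` is hypothetical, and the function does not check that the week number is valid for the given year.

```python
from datetime import date, timedelta


def epiweek_end_date(epiweek):
    """Return the Saturday that ends a YYYYWW epiweek, assuming MMWR week 1
    is the Sunday-to-Saturday week that contains January 4."""
    year, week = divmod(epiweek, 100)
    jan4 = date(year, 1, 4)
    # date.weekday() is Monday=0 ... Sunday=6, so shift to count days since Sunday
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1, days=6)
```

For example, `epiweek_end_date(200801)` gives `date(2008, 1, 5)`, matching the first row of the 2008 column above.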
