# Rare Disease Dataset: Monthly Article Traffic

This notebook contains code to construct, analyze, and publish a dataset of monthly article traffic for a select set of pages from English Wikipedia from July 1, 2015 through September 30, 2024.

## Access Page View Data

The following sub-section contains code to access page view data using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API). The API documentation, [pageviews/per-article](https://wikimedia.org/api/rest_v1/#/Pageviews%20data), covers additional details that may be helpful when trying to use or understand this example.

### License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.3 - August 16, 2024

In [1]:
#
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

The example relies on some constants that help make the code a bit more readable.

In [2]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so each request is preceded by a
# short wait: 1/100 of a second minus the assumed latency, which works out to 0.008 seconds
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error
REQUEST_HEADERS = {
    'User-Agent': 'hnaidu36@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023040100"    # this is likely the wrong end date
}
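
As a minimal sketch of how these constants fit together, the snippet below fills the parameterized string from a dictionary and prints the resulting request URL. The names `example_params` and `example_url` are introduced here for illustration only, and the two constants are repeated so the snippet runs on its own.

```python
# Illustration only: the constants are repeated here so the snippet is self-contained
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# Same keys as the template above, with one of the example titles filled in
example_params = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "Horseshoe_bat",   # spaces already replaced with "_"
    "granularity": "monthly",
    "start":       "2015010100",
    "end":         "2023040100",
}

# str.format(**dict) replaces each {name} placeholder with the matching dictionary value
example_url = API_REQUEST_PAGEVIEWS_ENDPOINT + API_REQUEST_PER_ARTICLE_PARAMS.format(**example_params)
print(example_url)
```

Every key named in the format string must be present in the dictionary, otherwise `format` raises a `KeyError`.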


The API request is made by a single reusable procedure. The procedure is parameterized, but it takes its defaults from the constants above. Because it is meant to request data for a set of article pages, the parameter most likely to change is `article_title`.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):

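    # work on a copy so the shared template (a mutable default argument) is never modified;
    # otherwise an already-encoded title left in the template could be encoded a second time
    request_template = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    # safe='' also encodes "/" so a title containing a slash is not read as extra path segments
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded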

    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
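
Title encoding is the step most likely to go wrong for unusual titles, so here is a small sketch of it in isolation. By default `urllib.parse.quote` leaves "/" unencoded; passing `safe=''` encodes it as `%2F`, which keeps a title with a slash from being read as extra URL path segments. The helper name `encode_title` and the title `Example/Title` are hypothetical and used only for illustration.

```python
import urllib.parse

def encode_title(title):
    """Hypothetical helper: replace spaces with underscores, then percent-encode everything else."""
    # safe='' means no reserved character, including "/", is left unencoded
    return urllib.parse.quote(title.replace(' ', '_'), safe='')

print(encode_title('Horseshoe bat'))          # Horseshoe_bat
print(encode_title('Example/Title'))          # Example%2FTitle
print(urllib.parse.quote('Example/Title'))    # Example/Title (default keeps the slash)
```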


## Rare Disease Data Acquisition

IMPORTS

In [13]:
import os
import numpy as np
import pandas as pd

CONSTANTS

In [12]:
DEV_ENVIRONMENT = True
RARE_DISEASE_CSV_PATH = "rare-disease_cleaned.AUG.2024.csv"

In [15]:
# Set the date range to match the dataset: July 1, 2015 through September 30, 2024
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['start'] = "2015070100"
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['end'] = "2024093000"

# Create access-specific params templates
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE.copy()
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP['access'] = "desktop"

ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILE = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE.copy()
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILE['access'] = "mobile-web"

ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILE_APP = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE.copy()
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILE_APP['access'] = "mobile-app"

FUNCTIONS

In [16]:
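def write_json_to_file(data, filename, mode = "a+"):
    # Note: the default mode appends, so calling this twice on the same file leaves two
    # concatenated JSON documents; pass mode="w" to overwrite instead
    with open(filename, mode, encoding="utf-8") as json_file:
        json.dump(data, json_file, ensure_ascii=False)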
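
One thing to watch with append mode: each call adds another JSON document to the file, and a file holding several concatenated documents cannot be read back with a single `json.load`. A minimal sketch of one alternative, assuming all results are first gathered into one dictionary and written once with mode `"w"`, is shown below. The record contents and the file name are made-up placeholders.

```python
import json, os, tempfile

# Made-up placeholder data, shaped like {article title: list of monthly items}
example_data = {"Example article": [{"timestamp": "2015070100", "views": 0}]}

# Write into a temporary directory so the sketch does not touch the working folder
example_path = os.path.join(tempfile.mkdtemp(), "example_views.json")

# Mode "w" writes the whole structure in one call, so the file holds exactly one JSON document
with open(example_path, "w", encoding="utf-8") as json_file:
    json.dump(example_data, json_file, ensure_ascii=False)

# Reading it back gives the same structure that was written
with open(example_path, "r", encoding="utf-8") as json_file:
    reloaded = json.load(json_file)

print(reloaded == example_data)   # True
```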

In [None]:
def get_page_views_json_for_article(article_title = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  headers = REQUEST_HEADERS) -> list:
    '''
    Get page views for a given article title

    Parameters:
    ----------
    article_title: str
        The title of the article to get page views for
    
    endpoint_url: str
        The URL of the endpoint to make the request to

    request_template: dict
        The template for the request parameters
    
    endpoint_params: str
        The format string for the request path.
        Its placeholders are filled with the values in request_template

    headers: dict
        The headers to be used in the request
    
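    Returns:
    -------
    views: list of dict
        The 'items' list from the API response, one entry per month

    '''
    views = request_pageviews_per_article(article_title, endpoint_url, endpoint_params, request_template, headers)
    # the request procedure returns None on a failed request, and an error response has no 'items' key
    if not views or 'items' not in views:
        raise ValueError(f"No page view data returned for article: {article_title}")
    return views['items']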
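
Since separate templates exist for `mobile-web` and `mobile-app`, a combined mobile series may be wanted later. The sketch below shows one way to do that, assuming each returned item carries a `timestamp` in `YYYYMMDDHH` form and a `views` count. The helper names `views_by_month` and `add_series` are hypothetical, the sample items show only the fields used here, and the view counts are made-up placeholders rather than real traffic.

```python
def views_by_month(items):
    """Hypothetical helper: map 'YYYYMM' to the view count for one article's items."""
    return {item["timestamp"][:6]: item["views"] for item in items}

def add_series(first, second):
    """Hypothetical helper: add two month-to-views mappings, treating a missing month as 0."""
    months = sorted(set(first) | set(second))
    return {month: first.get(month, 0) + second.get(month, 0) for month in months}

# Made-up placeholder items; real responses contain more fields and real counts
sample_mobile_web = [{"timestamp": "2015070100", "access": "mobile-web", "views": 10},
                     {"timestamp": "2015080100", "access": "mobile-web", "views": 20}]
sample_mobile_app = [{"timestamp": "2015070100", "access": "mobile-app", "views": 1}]

mobile_total = add_series(views_by_month(sample_mobile_web), views_by_month(sample_mobile_app))
print(mobile_total)   # {'201507': 11, '201508': 20}
```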

### Load CSV and Query API

In [17]:
rare_disease_df = pd.read_csv(os.path.join(os.getcwd(), RARE_DISEASE_CSV_PATH))

In [18]:
rare_disease_df.count()

disease    1773
pageid     1773
url        1773
dtype: int64

In [19]:
rare_disease_df.head(1)

                disease    pageid                                                url
0  Klinefelter syndrome  19833554  https://en.wikipedia.org/wiki/Klinefelter_synd...


In [21]:
if DEV_ENVIRONMENT:
    rare_disease_df = rare_disease_df.head(5)
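
With the titles loaded, the next step is to query the API once per title. The sketch below is one possible shape for that loop, not necessarily how the rest of this notebook does it. The helper `collect_views` is hypothetical, and it takes the fetch function as an argument so that `fake_fetch`, a stand-in with made-up data, can replace the real API call and let the sketch run offline.

```python
def collect_views(titles, fetch):
    """Hypothetical helper: call fetch(title) for each title and keep the results by title."""
    results, failed = {}, []
    for title in titles:
        try:
            results[title] = fetch(title)
        except Exception as error:
            # Record the failure and keep going so one bad title does not stop the whole run
            failed.append((title, str(error)))
    return results, failed

def fake_fetch(title):
    """Stand-in for the real API call; returns made-up placeholder data."""
    if title == "Missing page":
        raise ValueError("no page view data")
    return [{"timestamp": "2015070100", "views": 0}]

results, failed = collect_views(["Klinefelter syndrome", "Missing page"], fake_fetch)
print(sorted(results))   # ['Klinefelter syndrome']
print(failed)            # [('Missing page', 'no page view data')]
```

In the notebook itself, `titles` would be `rare_disease_df['disease']` and `fetch` could be a small lambda that calls `get_page_views_json_for_article` with one of the access-specific templates.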