# Homework 1: Professionalism & Reproducibility
## Data Acquisition

The goal of this assignment is to construct, analyze, and publish a dataset of monthly article traffic for a selected set of pages from English Wikipedia from July 1, 2015 through September 30, 2024. We follow the best practices for open scientific research described in the chapters "Assessing Reproducibility" and "The Basic Reproducible Workflow Template" of "The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences".


This notebook covers data acquisition, the first step in this assignment. We query the Wikimedia Analytics API, specifically its Pageviews API, to collect monthly desktop, mobile web, and mobile app traffic from July 1, 2015 through September 30, 2024 for a curated list of English Wikipedia articles matched to rare diseases. The collected data are then organized into three JSON files: one for monthly mobile access, another for monthly desktop access, and a third for cumulative pageviews, with each file named according to the specified date range.

Every step in this notebook is documented to ensure transparency and reproducibility.

### 1. Import required Libraries and Dependencies

In [58]:
# standard python modules
import json, time, urllib.parse
from IPython.display import clear_output

# not standard modules, need to be installed with pip/pip3 if not done earlier
import requests
import pandas as pd

### 2. Get Article Names

In this section, we obtain the disease names that will be used to make API requests in the next step. Let us start by loading a cleaned CSV file containing rare diseases ([provided with this assignment](https://drive.google.com/file/d/15_FiKhBgXB2Ch9c0gAGYzKjF0DBhEPlY/view?usp=drive_link)) into a DataFrame. From this, we extract the disease names from the 'disease' column to prepare for further API calls.

This list of pages was collected using a database of rare diseases maintained by the [National Organization for Rare Disorders (NORD)](https://rarediseases.org).

In [37]:
rare_diseases_df = pd.read_csv('../data/input_data/rare-disease_cleaned.AUG.2024.csv')
rare_diseases_df.head()

| Unnamed: 0 | disease | pageid | url |
|---|---|---|---|
| 0 | Klinefelter syndrome | 19833554 | https://en.wikipedia.org/wiki/Klinefelter_synd... |
| 1 | Aarskog–Scott syndrome | 7966521 | https://en.wikipedia.org/wiki/Aarskog–Scott_sy... |
| 2 | Abetalipoproteinemia | 68451 | https://en.wikipedia.org/wiki/Abetalipoprotein... |
| 3 | MT-TP | 20945466 | https://en.wikipedia.org/wiki/MT-TP |
| 4 | Ablepharon macrostomia syndrome | 10776100 | https://en.wikipedia.org/wiki/Ablepharon_macro... |


In [38]:
rare_diseases_names = rare_diseases_df['disease'].tolist()  # get the list of diseases
rare_diseases_names.sort()  # sort alphabetically so a specific disease name is easy to locate in the long list
print("Total Diseases: ", len(rare_diseases_names))
print("Sample Disease Names: ", rare_diseases_names[:10])

Total Diseases:  1773
Sample Disease Names:  ['18p', '18p-', '2006 in Africa', '2007 in Africa', '2009 swine flu pandemic vaccine', '21-Hydroxylase', '22q13 deletion syndrome', '3-M syndrome', '3-Methylglutaconic aciduria', 'AA amyloidosis']


### 3. Define the constants

This section is a snippet from the [example notebook provided with this assignment, revision dated: August 16, 2024](https://drive.google.com/file/d/1fYTIX79t9jk-Jske8IwysV-rbRkD4_dc/view?usp=drive_link) (licensed [CC-BY](https://www.google.com/url?q=https%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby%2F4.0%2F)).

The following constants are needed to make requests to the Wikimedia Pageviews API, and naming them makes the code more readable.

In [39]:
# This variable holds the base URL for all pageviews API requests.
# It serves as the starting point for constructing requests to gather pageview metrics for specific articles.
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is the specific format for making per-article pageviews requests.
# It uses placeholders that will be replaced with actual values when generating the final request URL.
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# To comply with the Wikimedia API's rate limiting policy, these constants define a slight delay between requests.
# We assume roughly 2ms latency on the API and network
# Throttle wait variable calculates the wait time between requests to ensure we do not exceed the limit of 100 requests per second.
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# This dictionary contains headers that are included in each API request.
# We need to provide an email address so that we can be notified of rate limit violations or other errors.
REQUEST_HEADERS = {
    'User-Agent': '<pj2901@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}

# We have the rare disease names that we extracted in the previous step here.
# It will be used to iterate through and make requests for each specific article to collect pageview data.
ARTICLE_TITLES = rare_diseases_names

# This dictionary acts as a template for constructing the parameters for our API requests with a start date of 07/01/2015, and an end date of 09/30/2024.
# Most fields remain constant across requests, such as project, agent, and granularity.
# However, the article name and access type will be dynamically set before each request.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",  # this value will be changed for the different access types
    "agent":       "user",
    "article":     "",  # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2024093000"
}
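
To make the role of these constants concrete, here is a minimal sketch of how the endpoint, the parameter format string, and a filled-in template combine into one request URL, and how the throttle wait is computed. The values are copied from the constants above; the names `example_params`, `example_url`, and `throttle_wait` are introduced only for this illustration, and the article title is written already encoded.

```python
# Values copied from the constants defined above
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
params_format = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# Hypothetical filled-in template for a single desktop request
example_params = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "Klinefelter_syndrome",  # spaces already replaced with "_"
    "granularity": "monthly",
    "start":       "2015070100",
    "end":         "2024093000",
}

# The placeholders in the format string are filled from the dictionary keys
example_url = endpoint + params_format.format(**example_params)
print(example_url)

# 100 requests per second leaves a 10 ms budget per request;
# subtracting the assumed 2 ms latency gives the sleep time
throttle_wait = (1.0 / 100.0) - 0.002
print(round(throttle_wait, 3))  # 0.008
```

No request is sent here; the sketch only shows the string that would be passed to `requests.get`.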


### 4. Define required Functions

This section contains user-defined functions for interacting with the Wikimedia Pageviews API.

#### 4.1. request_pageviews_per_article

This section is a snippet from the [example notebook provided with this assignment, revision dated: August 16, 2024](https://drive.google.com/file/d/1fYTIX79t9jk-Jske8IwysV-rbRkD4_dc/view?usp=drive_link) (licensed [CC-BY](https://www.google.com/url?q=https%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby%2F4.0%2F)).

Here we construct and send REST API requests to the Wikimedia Pageviews API for pageview data associated with a specific article.

In [40]:
def request_pageviews_per_article(article_title = None,
                                  access_type = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    """
    Fetches pageview data for a specified article and access type from the API.

    Parameters:
        article_title (str): The title of the article for which to fetch pageview data.
        access_type (str): The type of access ('mobile-app', 'mobile-web', 'desktop') to filter the data.
        endpoint_url (str): The URL of the API endpoint to send the request to.
        endpoint_params (str): The URL parameters template for the API request.
        request_template (dict): A template dictionary for formatting the request parameters.
        headers (dict): The headers to include in the request.

    Raises:
        Exception: If no article title or access type is supplied.

    Returns:
        dict or None: The JSON response containing the pageview data if successful, otherwise None.
    """

    # work on a copy so the shared template dictionary is not modified between calls
    request_template = dict(request_template)

    # article title can be a parameter to the request_template call
    if article_title:
        request_template['article'] = article_title
    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # titles are supposed to have spaces replaced with "_" and be URL encoded;
    # safe='' means characters such as "/" inside a title are percent-encoded as well
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded

    # access type can be a parameter to the request_template call
    if access_type:
        request_template['access'] = access_type
    if not request_template['access']:
        raise Exception("Must supply an access type to make a pageviews request.")

    # create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)

    # make the request
    try:
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT) # throttling is always good practice with free community resources
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
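
The title-encoding step in the function above is easy to overlook, so here is a small sketch of what it does to a few titles. The helper name `encode_title` is hypothetical and exists only for this illustration, and the third title is made up to show how a slash is handled. No request is sent.

```python
import urllib.parse

def encode_title(title):
    """Replace spaces with underscores, then percent-encode any remaining unsafe characters."""
    return urllib.parse.quote(title.replace(' ', '_'), safe='')

# A plain title only needs its space replaced
print(encode_title("Klinefelter syndrome"))    # Klinefelter_syndrome

# The en dash is encoded as its three UTF-8 bytes
print(encode_title("Aarskog–Scott syndrome"))  # Aarskog%E2%80%93Scott_syndrome

# Made-up title: with safe='' the slash is encoded instead of acting as a path separator
print(encode_title("A/B example title"))       # A%2FB_example_title
```

Encoding the slash matters because the article title is one path segment of the request URL; an unencoded slash would be read as the start of the next segment.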


#### 4.2. combine_mobile_views

In this function, we combine the pageviews from both mobile app and mobile web sources for each article.

In [41]:
def combine_mobile_views(mobile_app_view, mobile_web_view):
    """
    Combines mobile app and mobile web pageview data into a single list of combined views.

    Parameters:
        mobile_app_view (dict): The pageview data from the mobile app.
        mobile_web_view (dict): The pageview data from the mobile web.

    Raises:
        Exception: If either mobile_app_view or mobile_web_view is missing.

    Returns:
        list: A list of dictionaries containing combined views for each article with their respective timestamps.
    """
    # Make sure that we have the required views to append
    if not mobile_app_view or not mobile_web_view:
        raise Exception("Must have mobile app and web views to append.")

    # Combine the views and have it in a combined_view variable
    combined_view = []
    for app, web in zip(mobile_app_view['items'], mobile_web_view['items']):
        combined_view.append({'article': app['article'],
                            'timestamp': app['timestamp'],
                            'views': app['views'] + web['views']})
    return combined_view
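
The function above pairs records by position with `zip`, which assumes both responses cover the same months in the same order. If one response were missing a month, positional pairing would add up different months and `zip` would drop the trailing records. The sketch below is an alternative that pairs records by timestamp instead; it is not the method used in this notebook, and the name `combine_views_by_timestamp` and the sample records are hypothetical.

```python
def combine_views_by_timestamp(first_items, second_items):
    """Sum 'views' for records that share a timestamp.

    A month present in only one of the two lists keeps its own count.
    """
    totals = {}
    for record in list(first_items) + list(second_items):
        key = record['timestamp']
        if key not in totals:
            totals[key] = {'article': record['article'], 'timestamp': key, 'views': 0}
        totals[key]['views'] += record['views']
    # YYYYMMDDHH strings sort chronologically when sorted as text
    return [totals[key] for key in sorted(totals)]

# Made-up records: the web series is missing July 2015
app_items = [
    {'article': 'Example', 'timestamp': '2015070100', 'views': 5},
    {'article': 'Example', 'timestamp': '2015080100', 'views': 7},
]
web_items = [
    {'article': 'Example', 'timestamp': '2015080100', 'views': 10},
]

combined = combine_views_by_timestamp(app_items, web_items)
for record in combined:
    print(record['timestamp'], record['views'])
```

With these inputs, July keeps the app count of 5 and August becomes 17, whereas positional pairing would have added the July app count to the August web count.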

#### 4.3. test_view_saved_JSON_file

This function checks that a JSON file was saved successfully and is in the format we need for further analysis, by loading it and printing its contents.

In [54]:
def test_view_saved_JSON_file(filename):
    """
    This function takes a filename as input, constructs the full file path using the specified filename and date range.
    It then opens the JSON file, loads its contents, and prints the data in a readable format.

    Args:
        filename (str): The part of the file name that identifies the dataset, for example "monthly_desktop".

    Returns:
        None
    """
    start_date = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['start'][:6]
    end_date = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['end'][:6]
    file_path = f'../data/generated_data/rare-disease_{filename}_{start_date}-{end_date}.json'
    with open(file_path, 'r') as file:
        data = json.load(file)
    print(json.dumps(data, indent=4))
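
Printing an entire file is useful for a quick look, but with many articles the output gets long. As a complementary sketch, the helper below loads a file and returns only a summary, raising an error if any record lacks the fields used later in the analysis. The name `summarize_pageview_file` and the sample data are hypothetical; the file layout (a dictionary mapping each article title to a list of records) is the one produced in section 5.

```python
import json
import os
import tempfile

def summarize_pageview_file(file_path):
    """Load a saved pageview JSON file and return (article_count, record_count)."""
    with open(file_path, 'r') as file:
        data = json.load(file)
    required_keys = {'article', 'timestamp', 'views'}
    record_count = 0
    for title, records in data.items():
        for record in records:
            # extra keys are allowed; only the required ones are checked
            missing = required_keys - set(record)
            if missing:
                raise ValueError(f"{title}: record is missing {sorted(missing)}")
            record_count += 1
    return len(data), record_count

# Made-up data written to a temporary file so the sketch runs on its own
sample = {'Example': [{'article': 'Example', 'timestamp': '2015070100', 'views': 3}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json')
    with open(sample_path, 'w') as file:
        json.dump(sample, file)
    summary = summarize_pageview_file(sample_path)

print(summary)  # (1, 1)
```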

### 5. Generate the dataset

In this section, we iterate through a predefined list of article titles, and get pageview data for each access type (desktop, mobile app, and mobile web). Later, we combine both the mobile views and store the whole dataset in three distinct JSON files for future analysis.

We are asked to produce these files for further analysis:
- **Monthly mobile access** - Because the API reports mobile app and mobile web traffic through separate requests, we sum the two and store a single mobile series per article in a file named `rare-disease_monthly_mobile_\<startYYYYMM>-\<endYYYYMM>.json`
- **Monthly desktop access** - Monthly desktop traffic comes from a single request per article. We store it in a file named `rare-disease_monthly_desktop_\<startYYYYMM>-\<endYYYYMM>.json`
- **Monthly cumulative** - Monthly cumulative data is the sum of all mobile and all desktop traffic per article. We store it in a file named `rare-disease_monthly_cumulative_\<startYYYYMM>-\<endYYYYMM>.json`

In every file name, \<startYYYYMM> and \<endYYYYMM> are the starting and ending year and month written as digit strings.

The generated dataset is stored in the `data/generated_data/` folder to keep it clearly separate from the original input data.
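
The naming rule can be sketched as a small helper that takes the first six characters (YYYYMM) of the start and end timestamps used in the request template. The function name `build_output_filename` is hypothetical and is not part of the notebook's code.

```python
def build_output_filename(access_label, start, end):
    """Build a file name such as rare-disease_monthly_mobile_<startYYYYMM>-<endYYYYMM>.json.

    start and end are YYYYMMDDHH strings; only the YYYYMM prefix is kept.
    """
    return f"rare-disease_monthly_{access_label}_{start[:6]}-{end[:6]}.json"

# Timestamps taken from the request template defined in section 3
for label in ("mobile", "desktop", "cumulative"):
    print(build_output_filename(label, "2015070100", "2024093000"))
```

For the date range used in this notebook, the mobile file is therefore named `rare-disease_monthly_mobile_201507-202409.json`.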

In [59]:
mobile_data = {}
desktop_data = {}
cumulative_data = {}

for article in ARTICLE_TITLES:
    # Fetch data for each access type
    print(f"\nFetching data for article: {article}...")
    mobile_app_views = request_pageviews_per_article(article, 'mobile-app')
    mobile_web_views = request_pageviews_per_article(article, 'mobile-web')
    mobile_views = combine_mobile_views(mobile_app_views, mobile_web_views) # Combine the mobile views

    desktop_views = request_pageviews_per_article(article, 'desktop')

    # Calculate cumulative views (desktop + mobile)
    cumulative_views = []
    for desktop, mobile in zip(desktop_views['items'], mobile_views):
        cumulative_views.append({'article': desktop['article'],
                                'timestamp': desktop['timestamp'],
                                'views': desktop['views'] + mobile['views']})

    # Store the data
    mobile_data[article] = mobile_views
    desktop_data[article] = desktop_views['items']
    cumulative_data[article] = cumulative_views
    print("Data fetched successfully!")

clear_output(wait=True) # clear the per-article progress messages to keep the final output readable

print(f"Successfully fetched {len(ARTICLE_TITLES)} rare disease articles!")    # Print a summary of the completed fetch

# Save the data to JSON files
start_date = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['start'][:6]
end_date = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['end'][:6]

with open(f'../data/generated_data/rare-disease_monthly_mobile_{start_date}-{end_date}.json', 'w') as mobile_file:
    json.dump(mobile_data, mobile_file, indent=4)
print("\nMonthly mobile access data has been successfully saved to a JSON file!")

with open(f'../data/generated_data/rare-disease_monthly_desktop_{start_date}-{end_date}.json', 'w') as desktop_file:
    json.dump(desktop_data, desktop_file, indent=4)
print("Monthly desktop access data has been successfully saved to a JSON file!")

with open(f'../data/generated_data/rare-disease_monthly_cumulative_{start_date}-{end_date}.json', 'w') as cumulative_file:
    json.dump(cumulative_data, cumulative_file, indent=4)
print("Monthly cumulative data has been successfully saved as a JSON file!")

Successfully fetched 1773 rare disease articles!

Monthly mobile access data has been successfully saved to a JSON file!
Monthly desktop access data has been successfully saved to a JSON file!
Monthly cumulative data has been successfully saved as a JSON file!
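
The loop above assumes that every request returns a dictionary containing an `items` list. That held for this run, but a failed request returns `None` from `request_pageviews_per_article`, and an error response may not contain `items` at all. As a hedged sketch, a guard like the one below could be applied to each response so that such articles are skipped and recorded instead of stopping the run. The helper name `extract_items` is hypothetical, and the shape of the error payload is an assumption made for illustration.

```python
def extract_items(response):
    """Return the list stored under 'items', or None if the response is unusable."""
    if not isinstance(response, dict):
        return None
    items = response.get('items')
    return items if isinstance(items, list) else None

# Made-up responses; the error payload shape is assumed, not taken from the API
good_response = {'items': [{'article': 'Example', 'timestamp': '2015070100', 'views': 3}]}
error_response = {'title': 'Not found.'}

skipped = []
for label, response in [('good', good_response), ('error', error_response), ('failed', None)]:
    if extract_items(response) is None:
        skipped.append(label)  # remember which requests produced no data

print(skipped)  # ['error', 'failed']
```

Keeping a list of skipped articles also documents which titles are absent from the saved files, which supports reproducibility.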


In [None]:
# Test that the rare-disease_monthly_desktop file was saved successfully
test_view_saved_JSON_file("monthly_desktop")

In [None]:
# Test that the rare-disease_monthly_mobile file was saved successfully
test_view_saved_JSON_file("monthly_mobile")

In [None]:
# Test that the rare-disease_monthly_cumulative file was saved successfully
test_view_saved_JSON_file("monthly_cumulative")

### 6. Conclusion

As per the assignment requirements, we have successfully saved the JSON files: `rare-disease_monthly_mobile_\<startYYYYMM>-\<endYYYYMM>.json`, `rare-disease_monthly_desktop_\<startYYYYMM>-\<endYYYYMM>.json`, and `rare-disease_monthly_cumulative_\<startYYYYMM>-\<endYYYYMM>.json`.

We will use these files next in the `step2_data-analysis.ipynb` notebook.