# **Data 512: HW 1 - Professionalism & Reproducibility**
The goal of this assignment is to construct, analyze, and publish datasets of monthly article traffic, broken down by access type, for a select set of pages from English Wikipedia from July 1, 2015 through September 30, 2024. The purpose of the assignment is to develop and follow best practices for open scientific research, as exemplified by this repository. We focus specifically on a subset of English Wikipedia articles covering a large number of rare diseases, collected from the [National Organization for Rare Diseases (NORD)](https://rarediseases.org).



## License
Parts of the code below were taken as-is or with minimal changes from the example code that was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.3 - August 16, 2024



## Part 1: Data Acquisition

In this part, we gather data from the [Wikimedia Analytics API](https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/reference/page-views.html), which provides access to desktop, mobile app, and mobile web pageview data. Specifically, we gather pageview data from July 1, 2015 through September 30, 2024 for the [specified diseases (article names)](./rare-disease_cleaned.AUG.2024.csv) and store it as JSON files.

As part of this step, we will create three JSON files:
1. Monthly mobile access - This will contain the monthly page traffic data for mobile web and mobile app.
  - The output file will be called: rare-disease_monthly_mobile_201507-202409.json

2. Monthly desktop access - This will contain the data for monthly desktop page traffic.
  - The output file will be called: rare-disease_monthly_desktop_201507-202409.json

3. Monthly cumulative - This will contain the monthly cumulative data of all mobile, and all desktop traffic per article.
  - The output file will be called: rare-disease_monthly_cumulative_201507-202409.json
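
As a rough sketch of the expected file layout: each output file maps an article title to a list of monthly records, with the `access` field removed. The field names follow the Pageviews API response used later in this notebook; the values below (including the view count) are illustrative placeholders, not real data.

```python
import json

# Illustrative record only: the field names mirror the Pageviews API response
# (minus "access"); the values are placeholders, not real traffic data.
example_output = {
    "Lamellar ichthyosis": [
        {
            "project": "en.wikipedia",
            "article": "Lamellar_ichthyosis",
            "granularity": "monthly",
            "timestamp": "2015070100",
            "agent": "user",
            "views": 0,
        }
    ]
}

print(json.dumps(example_output, indent=4))
```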

In [1]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Switch to the project folder (change this path if you are reproducing the notebook from a different location)
%cd 'drive/MyDrive/Data 512/data-512-homework_1'

/content/drive/MyDrive/Data 512/data-512-homework_1


In [3]:
# These are standard python modules
import json, time, urllib.parse

# The 'requests' and 'pandas' modules are not standard Python modules.
# You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

We define all the required constants below to improve the readability of the code. Note: you will need to change the email address in the request headers so that you can be contacted if any issues arise.
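
As a quick check of the throttling arithmetic used in the constants cell that follows: a limit of 100 requests per second gives each request a 10 ms budget, and subtracting the assumed 2 ms of latency leaves roughly 8 ms of sleep per request. The helper name `throttled_call` is hypothetical and only illustrates the idea; the notebook itself sleeps inside its request function.

```python
import time

API_LATENCY_ASSUMED = 0.002                               # assumed ~2 ms latency
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED   # roughly 0.008 s of sleep per request

def throttled_call(func, *args, **kwargs):
    """Hypothetical helper: sleep for the throttle interval, then call func."""
    if API_THROTTLE_WAIT > 0.0:
        time.sleep(API_THROTTLE_WAIT)
    return func(*args, **kwargs)

# Stand-in for an API call: each call is preceded by the throttle sleep
results = [throttled_call(len, "abc") for _ in range(3)]
print(results)
```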

In [4]:
#########
#
#    CONSTANTS
#

#
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error
REQUEST_HEADERS = {
    'User-Agent': '<gmihir@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # start and end dates need to be set
    "end":         "2024100100"
}

# Here we create templates for Mobile App and Mobile web by updating the "access" value in the Desktop template.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEAPP = {
    **ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
    "access":      "mobile-app",
}

ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEWEB = {
    **ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
    "access":      "mobile-web",
}

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.
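
To make the URL construction concrete, here is a minimal sketch of how the endpoint, the parameter format string, and an encoded article title combine into a single request URL. The helper `build_pageviews_url` is illustrative and not part of the notebook; it only mirrors the encoding and formatting steps, without making a network request.

```python
import urllib.parse

ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(article_title, access="desktop"):
    """Illustrative helper (not part of the notebook): build one request URL."""
    template = {
        "project": "en.wikipedia.org",
        "access": access,
        "agent": "user",
        # Spaces become "_" and everything else is URL encoded; safe='' also encodes "/"
        "article": urllib.parse.quote(article_title.replace(' ', '_'), safe=''),
        "granularity": "monthly",
        "start": "2015070100",
        "end": "2024100100",
    }
    return ENDPOINT + PARAMS.format(**template)

print(build_pageviews_url("Lamellar ichthyosis"))
```

Passing `safe=''` matters for titles that contain a slash, which would otherwise be read as an extra path segment.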

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP,
                                  headers = REQUEST_HEADERS):

    # Work on a copy so the shared template constant is not modified between calls
    request_params = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['article'] = article_title

    if not request_params['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    # We also add safe='' to ensure '/' characters are encoded correctly
    article_title_encoded = urllib.parse.quote(request_params['article'].replace(' ','_'), safe='')
    request_params['article'] = article_title_encoded

    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Now we load the rare disease data from a CSV file and initialize dictionaries to store the pageview data for desktop, mobile, and cumulative views.


In [None]:
# Read the list of rare diseases and corresponding Wikipedia articles from the CSV file into a DataFrame
rear_disease_articles = pd.read_csv("./rare-disease_cleaned.AUG.2024.csv")

# Create empty dictionaries for each file to be created
desktop_views_dict = {}
mobile_views_dict = {}
cumulative_views_dict = {}


The functions below fetch pageview data from the Wikimedia API for desktop and mobile views, and merge the results to calculate total views across access types for each rare disease.
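
The merge step can be sketched on toy data. When two DataFrames that both contain a `views` column are merged on the remaining fields, pandas suffixes the overlapping column as `views_x` and `views_y`, which is why those names appear in the processing loop. An outer merge leaves `NaN` where one source has no row for a month, so this sketch fills those gaps with 0 before summing. All numbers below are made up.

```python
import pandas as pd

KEYS = ["project", "article", "granularity", "timestamp", "agent"]
base = {"project": "en.wikipedia", "article": "Example", "granularity": "monthly", "agent": "user"}

# Toy data (made-up numbers): the app frame has no row for the second month
app_df = pd.DataFrame([{**base, "timestamp": "2015070100", "views": 10}])
web_df = pd.DataFrame([
    {**base, "timestamp": "2015070100", "views": 100},
    {**base, "timestamp": "2015080100", "views": 200},
])

merged = pd.merge(app_df, web_df, on=KEYS, how="outer")
# Both frames have a "views" column, so pandas names them views_x and views_y
merged["views"] = (merged["views_x"].fillna(0) + merged["views_y"].fillna(0)).astype(int)
merged = merged.drop(["views_x", "views_y"], axis=1)
print(merged[["timestamp", "views"]])
```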


In [None]:
# This function fetches pageviews and converts them into a DataFrame, dropping the "access" column
def fetch_pageviews(article_title, request_template):
    """
    This function fetches pageview data for a given Wikipedia article and converts it to a DataFrame.

    Args:
        article_title (str): The title of the Wikipedia article to fetch data for.
        request_template (dict): The request template with parameters for the API request.

    Returns:
        DataFrame: A pandas DataFrame containing the pageview data for the article with the "access" column dropped.
    """
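    response = request_pageviews_per_article(article_title, request_template=request_template)
    # Fail with a clear message if the request failed or the API returned no data for this article
    if response is None or "items" not in response:
        raise ValueError(f"No pageview data returned for article: {article_title}")
    data = response["items"]
    df = pd.DataFrame(data).drop('access', axis=1)  # Drop "access" column
    return df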

# This function handles merging dataframes and calculating total views after the merge
def merge_and_calculate_views(df1, df2, view_col1, view_col2):
    """
    This function merges two DataFrames containing pageview data and calculates the total views by summing
    the values of two specified columns.

    Args:
        df1 (DataFrame): First DataFrame
        df2 (DataFrame): Second DataFrame
        view_col1 (str): The name of the view column in the first DataFrame.
        view_col2 (str): The name of the view column in the second DataFrame.

    Returns:
        DataFrame: A merged DataFrame with total views calculated and irrelevant columns dropped.
    """
    merged_df = pd.merge(df1, df2, on=["project", "article", "granularity", "timestamp", "agent"], how="outer")
    # An outer merge leaves NaN where one source has no row for a month; treat those as 0 views
    merged_df["views"] = (merged_df[view_col1].fillna(0) + merged_df[view_col2].fillna(0)).astype(int)
    return merged_df.drop([view_col1, view_col2], axis=1)  # Drop the view columns to ensure only total views exist


Here the script fetches and merges the pageview data using the functions above for each disease in the dataset, calculates the cumulative views, and stores the results in separate JSON files for desktop, mobile, and total views.

Note that the following step takes about 45 minutes to run on Google Colab, as it goes through all the articles listed in the dataset. You can skip this step and start the analysis directly using the already generated JSON files, which are stored in the "JSON Data Files" folder.
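
If you skip the download, the generated files can be read back with the standard `json` module. The sketch below uses a hypothetical helper, `load_views`, and demonstrates it on a temporary file with placeholder data so that it runs on its own. In the notebook the path would be one of the files in the "JSON Data Files" folder, for example `./JSON Data Files/rare-disease_monthly_desktop_201507-202409.json`.

```python
import json
import os
import tempfile

def load_views(path):
    """Hypothetical helper: read one of the generated JSON files back into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Self-contained demo: write placeholder data to a temporary file, then load it back
demo = {"Example article": [{"timestamp": "2015070100", "views": 0}]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f, ensure_ascii=False, indent=4)
    loaded = load_views(demo_path)

print(loaded == demo)
```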


In [6]:
# Iterate over each row in the DataFrame, fetch pageviews for desktop and mobile, and then calculate cumulative views
for index, row in rear_disease_articles.iterrows():
    article_title = row["disease"]
    print(f"Processing pageview data for: {article_title} | index: {index}")

    # Fetch desktop views
    desktop_df = fetch_pageviews(article_title, ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_DESKTOP)
    desktop_views_dict[article_title] = desktop_df.to_dict(orient="records")
    print("Added Desktop views to dictionary")

    # Fetch mobile views (App and Web)
    mobile_app_df = fetch_pageviews(article_title, ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEAPP)
    mobile_web_df = fetch_pageviews(article_title, ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE_MOBILEWEB)

    # Merge Mobile App and Web data
    mobile_df = merge_and_calculate_views(mobile_app_df, mobile_web_df, "views_x", "views_y")
    mobile_views_dict[article_title] = mobile_df.to_dict(orient="records")
    print("Added Mobile views to dictionary")

    # Calculate cumulative views by merging desktop and mobile data
    cumulative_df = merge_and_calculate_views(desktop_df, mobile_df, "views_x", "views_y")
    cumulative_views_dict[article_title] = cumulative_df.to_dict(orient="records")
    print("Added Cumulative views to dictionary")

    print("-x-x-x-x-x-x-x-")

# This function writes a dictionary to a JSON file
def write_to_json(data_dict, filename):
    """
    This function writes a dictionary to a specified JSON file.

    Args:
        data_dict (dict): The dictionary containing the data to be written to a JSON file.
        filename (str): The path of the JSON file to write to.

    Returns:
        None
    """
    try:
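        # Article titles can contain non-ASCII characters, so write the file explicitly as UTF-8
        with open(filename, "w", encoding="utf-8") as f: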
            json.dump(data_dict, f, ensure_ascii=False, indent=4)
    except Exception as e:
        print(f"Error writing {filename}: {e}")

# Save the dictionaries to the required JSON files
write_to_json(desktop_views_dict, "./JSON Data Files/rare-disease_monthly_desktop_201507-202409.json")
write_to_json(mobile_views_dict, "./JSON Data Files/rare-disease_monthly_mobile_201507-202409.json")
write_to_json(cumulative_views_dict, "./JSON Data Files/rare-disease_monthly_cumulative_201507-202409.json")

Streaming output truncated to the last 5000 lines.
Processing pageview data for: Lambert–Eaton myasthenic syndrome | index: 773
Added Desktop views to dictionary
Added Mobile views to dictionary
Added Cumulative views to dictionary
-x-x-x-x-x-x-x-
Processing pageview data for: Lamellar ichthyosis | index: 774
Added Desktop views to dictionary
Added Mobile views to dictionary
Added Cumulative views to dictionary
-x-x-x-x-x-x-x-
Processing pageview data for: Landau–Kleffner syndrome | index: 775
Added Desktop views to dictionary
Added Mobile views to dictionary
Added Cumulative views to dictionary
-x-x-x-x-x-x-x-
Processing pageview data for: Developmental regression | index: 776
Added Desktop views to dictionary
Added Mobile views to dictionary
Added Cumulative views to dictionary
-x-x-x-x-x-x-x-
Processing pageview data for: Langerhans cell histiocytosis | index: 777
Added Desktop views to dictionary
Added Mobile views to dictionary
Added Cumulative views to dictionary
-x

You have now created the three required JSON files:

1. `rare-disease_monthly_mobile_201507-202409.json` - This contains the combined monthly page traffic data for mobile web and mobile app for the mentioned articles.

2. `rare-disease_monthly_desktop_201507-202409.json` - This contains the monthly desktop page traffic data for the mentioned articles.

3. `rare-disease_monthly_cumulative_201507-202409.json` - This contains the monthly cumulative data of all mobile and all desktop traffic per article for the mentioned articles.

The next step is to analyze the data, which is done in the following notebook: [Wikipedia_Traffic_Analysis_Part2_DataAnalysis.ipynb](./Wikipedia_Traffic_Analysis_Part2_DataAnalysis.ipynb)
