# **Data 512: HW 2 - Considering Bias in Data**

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles about political figures from different countries. We combine a dataset of Wikipedia articles with a dataset of country populations and use a machine learning service called ORES to estimate the quality of each article. We then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary among countries.
The analysis will consist of a series of tables that show:
- The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
- The countries with the highest and lowest proportion of high quality articles about politicians.
- A ranking of geographic regions by articles-per-person and proportion of high quality articles.




## License
Parts of the code below were taken as-is or with minimal changes from the example code that was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

Note: A copy of all reference code and provided data files is available in the [Resources Repository](./Resources).



## Setting up the Google Colab Workspace

In [81]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Switching to the required folder (user might have to change in case trying to reproduce)
%cd 'drive/MyDrive/Data 512/data-512-homework_2'

/content/drive/MyDrive/Data 512/data-512-homework_2


## Part 1: Data Acquisition

In this section, we will gather and utilize pre-collected datasets provided in the repository under the [Resources](./Resources) folder. The data required for this analysis consists of politician articles and population data.

### Available Data:

1. [**politicians_by_country.csv**](./Resources/politicians_by_country.csv): This file contains a list of Wikipedia articles about politicians from various countries. It was generated by crawling the Wikipedia [Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) page to create a dataset of politician articles along with related country information.

2. [**population_by_country_AUG.2024.csv**](./Resources/population_by_country_AUG.2024.csv): This dataset contains population data sourced from the Population Reference Bureau’s [world population data sheet](https://www.prb.org/international/indicator/population/table). The file includes population counts for individual countries and cumulative totals for regions.

Note: Regional rows can be identified by ALL CAPS values in the 'Geography' field, such as AFRICA or OCEANIA.

As part of this step, we will create two files:

1. [**all_articles_pageinfo.json**](Generated_Data_Files/all_articles_pageinfo.json): This file contains the revision ID and other page information for all the politician articles.

2. [**all_articles_ores_scores.csv**](Generated_Data_Files/all_articles_ores_scores.csv): This file contains the ORES scores fetched from the API using the revision IDs from the previous file.

Load required libraries

In [3]:
#
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd
from pathlib import Path

### Step 1: Fetching Page Info

The provided politicians dataset contains article titles but no revision IDs, so we need to retrieve the latest revision ID for each article. We use the [**MediaWiki Action API**](https://www.mediawiki.org/wiki/API:Main_page) for this purpose, following the official documentation available at [API:Info](https://www.mediawiki.org/wiki/API:Info). With `prop=info`, the API returns the current revision ID (`lastrevid`) for each page.

The code relies on some constants to make it a bit more readable.

In [4]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<gmihir@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# Input file path containing article titles.
ARTICLE_LIST = "Resources/politicians_by_country_AUG.2024.csv"
POPULATION_List = "Resources/population_by_country_AUG.2024.csv"

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # The article title can be supplied as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [6]:
def get_article_titles(df):
    """
    Reports duplicate (name, url) rows in the politicians DataFrame and returns
    the unique article titles from its 'name' column.

    Args:
    - df (pd.DataFrame): Politician article data with 'name' and 'url' columns.

    Returns:
    - list: A list of unique article titles.
    """
    # Identifying duplicates based on both 'name' and 'url'
    duplicates = df[df.duplicated(subset=['name', 'url'], keep=False)]
    total_duplicate_rows = len(duplicates)

    # Count occurrences of each duplicate (name, url) combination
    duplicate_counts = duplicates.groupby(['name', 'url']).size().reset_index(name='count')
    total_unique_combinations = len(duplicate_counts) # Count unique duplicate combinations

    # Calculate the number of rows that will be removed (one copy kept per duplicate)
    rows_to_remove = total_duplicate_rows - total_unique_combinations

    # Print the required information
    print(
        "Total number of unique duplicate (name, url) combinations: " + str(total_unique_combinations) + ".\n"
        "However, the total number of rows removed is " + str(rows_to_remove) + ", because for each of the "
        + str(total_unique_combinations) + " duplicate combinations, only 1 row is kept, and there are "
        + str(rows_to_remove - total_unique_combinations) + " extra rows removed due to some combinations "
        "occurring more than twice."
    )

    # Display duplicate counts as a DataFrame
    wikie_count_dup_df = duplicate_counts[duplicate_counts['count'] > 1]
    print("\nTable of duplicate combinations with their count:")
    display(wikie_count_dup_df)

    # Extract the 'name' column, remove duplicates, and convert to a list
    article_titles = df['name'].drop_duplicates().tolist()
    return article_titles
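
The duplicate bookkeeping in `get_article_titles` can be checked on a small example. This is a minimal sketch on made-up data (the `toy` DataFrame is hypothetical), using the same `duplicated(..., keep=False)` and `groupby(...).size()` calls as the function above:

```python
import pandas as pd

# Toy data: pair A appears three times, pair B twice, pair C once
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B", "C"],
    "url":  ["u/A", "u/A", "u/A", "u/B", "u/B", "u/C"],
})

dupes = toy[toy.duplicated(subset=["name", "url"], keep=False)]   # every row that has a twin
combos = dupes.groupby(["name", "url"]).size()                     # one entry per repeated pair
rows_removed = len(dupes) - len(combos)                            # one row is kept per pair
extra = rows_removed - len(combos)                                 # copies beyond the second

print(len(dupes), len(combos), rows_removed, extra)
```

Here five rows belong to two repeated pairs, so three rows are removed, one of them because pair A occurs a third time. The same arithmetic gives the 41, 44, and 3 reported for the real data.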


In [7]:
article_df = pd.read_csv(ARTICLE_LIST)
print(f"The original number of articles: {len(article_df)}")

The original number of articles: 7155


In [8]:
articles = get_article_titles(article_df)

Total number of unique duplicate (name, url) combinations: 41.
However, the total number of rows removed is 44, because for each of the 41 duplicate combinations, only 1 row is kept, and there are 3 extra rows removed due to some combinations occurring more than twice.

Table of duplicate combinations with their count:


Unnamed: 0,name,url,count
0,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,2
1,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,2
2,Ali al-Qaradaghi,https://en.wikipedia.org/wiki/Ali_al-Qaradaghi,2
3,Antonio Gutiérrez y Ulloa,https://en.wikipedia.org/wiki/Antonio_Gutiérre...,2
4,Antonín Janoušek,https://en.wikipedia.org/wiki/Antonín_Janoušek,2
5,Ashab Uddin Ahmad,https://en.wikipedia.org/wiki/Ashab_Uddin_Ahmad,2
6,Bak Jungyang,https://en.wikipedia.org/wiki/Bak_Jungyang,2
7,Bona Malwal,https://en.wikipedia.org/wiki/Bona_Malwal,2
8,Count Václav Antonín Chotek of Chotkov and Vojnín,https://en.wikipedia.org/wiki/Count_Václav_Ant...,2
9,Djama Ali Moussa,https://en.wikipedia.org/wiki/Djama_Ali_Moussa,2


In [9]:
print(f"The updated number of articles after removing duplicates: {len(articles)}")

The updated number of articles after removing duplicates: 7111


We can request information for multiple pages at once by separating the page titles with the vertical bar character (`|`). However, as noted in the [documentation](https://www.mediawiki.org/w/api.php?action=help&modules=query), a single request is limited to 50 titles.

In [10]:
def process_and_save_as_single_json(articles, output_file):
    """
    This function processes Wikipedia articles by querying the API in batches and
    saving all responses as a single JSON file.

    Args:
    - articles (list): List of article titles to query.
    - output_file (str): Path to save the combined JSON file.

    Returns:
    - None
    """
    # Ensure the output directory exists
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)

    all_article_data = {}  # Dictionary to store data for all articles
    failed_articles = []    # List to store failed articles
    total_processed = 0     # Counter for successfully processed articles

    # Process articles in batches, with a maximum of 50 titles per request
    max_titles_per_request = 50
    for i in range(0, len(articles), max_titles_per_request):
        # Get the batch of article titles
        batch_articles = articles[i:i + max_titles_per_request]
        page_titles = '|'.join(batch_articles)  # Concatenate titles with '|'

        # Copy the template and update the titles
        request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
        request_info['titles'] = page_titles

        # Make the API request
        response = request_pageinfo_per_article(request_template=request_info)

        if response and 'query' in response:
            # Check for pages in the response
            pages = response['query']['pages']
            if pages:
                all_article_data.update(pages)
                total_processed += len(batch_articles)  # Increment processed count
                print(f"Added data for batch: {batch_articles}")
            else:
                print(f"Empty response for batch: {batch_articles}")
                failed_articles.extend(batch_articles)  # Log failed articles
        else:
            print(f"Failed to process batch: {batch_articles}")
            failed_articles.extend(batch_articles)  # Log failed articles

    # Save all collected article data as a single JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_article_data, f, ensure_ascii=False, indent=4)

    print(f"All article data saved to {output_file}")
    print(f"Total articles processed: {total_processed}")
    print(f"Number of failed requests or empty responses: {len(failed_articles)}")
    if failed_articles:
        print(f"Failed articles: {', '.join(failed_articles)}")

In [14]:
process_and_save_as_single_json(articles, "Generated_Data_Files/all_articles_pageinfo.json")

Added data for batch: ['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi', 'Ismael Balkhi', 'Abdul Baqi Turkistani', 'Mohammad Ghous Bashiri', 'Jan Baz', 'Bashir Ahmad Bezan', 'Rafiullah Bidar', 'Mohammad Siddiq Chakari', 'Cheragh Ali Cheragh', 'Nasir Ahmad Durrani', 'Muhammad Hashim Esmatullahi', 'Ezatullah (Nangarhar)', 'Aimal Faizi', 'Gajinder Singh Safri', 'Sharif Ghalib', 'Hashmat Ghani Ahmadzai', 'Abdul Ghani Ghani', 'Ghulam Ghaus', 'Ghulam Muhammad Ghobar', 'Mohammad Gul (Helmand Council)', 'Sayed Yousuf Halim', 'Rangina Hamidi', 'Sayed Zafar Hashemi', 'Qutbuddin Hilal', 'Mahboba Hoqomal', 'Musa Hotak', 'Mirza Muhammad Ismail', 'Sayed Jalal', 'Said Tayeb Jawad', 'Sayed Jalal Karim', 'Hafizullah Shabaz Khail', 'Masoud Khalili', 'Mohammad Khan (athlete)', 'Samoud Khan', 'Baran Khan Kudezai', 'Azizullah Lodin', 

We now have the latest revision ID for each article, with duplicate titles removed.
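
One caveat is worth checking before moving on. With the default JSON format, the API appears to return titles that do not exist under negative keys (`"-1"`, `"-2"`, ...) with a `missing` field and no `lastrevid`. Because those keys restart in every response, merging batches with `dict.update` can overwrite earlier missing entries. The sketch below separates found pages from missing ones within one batch. The helper name `split_found_and_missing` and the sample response are hypothetical, and the response shape is an assumption based on the `pages` object used above.

```python
def split_found_and_missing(pages):
    """Split a 'query.pages' mapping into pages with a revision ID and titles without one."""
    found, missing = {}, []
    for page_id, details in pages.items():
        if "missing" in details or "lastrevid" not in details:
            # Record the title, since the negative key is not unique across batches
            missing.append(details.get("title", page_id))
        else:
            found[page_id] = details
    return found, missing

# Hypothetical response fragment shaped like response['query']['pages']
sample_pages = {
    "12345": {"pageid": 12345, "title": "Example Politician", "lastrevid": 987654321},
    "-1": {"title": "No Such Article", "missing": ""},
}

found, missing = split_found_and_missing(sample_pages)
print(f"found={len(found)}, missing={missing}")
```

Collecting the missing titles per batch gives a more accurate failure count than counting every requested title as processed.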

### Step 2: Fetching ORES Scores

In this step, we estimate the quality of each politician's Wikipedia article using [**ORES (Objective Revision Evaluation Service)**](https://www.mediawiki.org/wiki/ORES). ORES predicts a quality rating for an article from its latest revision.

ORES provides quality predictions by categorizing articles into one of six quality levels, from highest to lowest:

1. FA - Featured article
2. GA - Good article (also known as A-Class)
3. B - B-Class article
4. C - C-Class article
5. Start - Start-class article
6. Stub - Stub-class article
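
For later analysis it can help to treat these labels as an ordered scale. This is a minimal sketch: the names `QUALITY_ORDER`, `QUALITY_RANK`, and `is_high_quality` are hypothetical, and treating FA and GA as "high quality" is an assumption rather than something this section defines.

```python
# Quality labels ordered from highest to lowest, as listed above
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]
QUALITY_RANK = {label: rank for rank, label in enumerate(QUALITY_ORDER)}

def is_high_quality(label, threshold="GA"):
    """Return True when `label` ranks at or above `threshold` (assumed cutoff: GA)."""
    return label in QUALITY_RANK and QUALITY_RANK[label] <= QUALITY_RANK[threshold]

print([label for label in QUALITY_ORDER if is_high_quality(label)])
```

Unknown or missing labels (for example `None` after a failed ORES request) are treated as not high quality.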

In [11]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "gmihir@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be "en" - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

**Get your access token**

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note that when you create a Personal API token you are granted three items: a Client ID, a Client secret, and an Access token. You should save all three. When you dismiss the box they are gone. If you lose any of them, you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


In [12]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys: a Client ID,
#   a Client secret, and an Access token.

# Please enter your username and access token here
USERNAME = ""
ACCESS_TOKEN = ""

Define a function to make the ORES API request

The API request will be made using a function that encapsulates the call and makes access reusable in other notebooks. The function is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is `article_revid`. One should be able to simply pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels.

In [13]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
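
A successful response nests the prediction several levels deep, and a failed request returns `None`, so it helps to extract the label defensively. The sketch below assumes the response layout that the scoring code in this notebook indexes into (`enwiki` → `scores` → revision ID → `articlequality` → `score`). The helper name `extract_prediction` is hypothetical, and the sample response, including its `probability` values, is invented for illustration.

```python
def extract_prediction(score_response, revision_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality label, or None if the response lacks the expected keys."""
    try:
        return score_response[wiki]["scores"][str(revision_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # TypeError covers score_response being None after a failed request
        return None

# Hypothetical response; the probability values are made up
sample_response = {
    "enwiki": {"scores": {"123456": {"articlequality": {"score": {
        "prediction": "Stub",
        "probability": {"FA": 0.01, "GA": 0.02, "B": 0.03, "C": 0.04, "Start": 0.2, "Stub": 0.7},
    }}}}}
}

print(extract_prediction(sample_response, 123456))
print(extract_prediction(None, 123456))
```

Returning `None` instead of raising keeps a long scoring run going when a single revision fails.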


**Note** - Running the `fetch_article_ratings` function takes about 2 hours, because there is no way to request ORES scores for multiple pages at once. You might therefore want to skip running the next two cells. The output from the API is stored in [**all_articles_ores_scores.csv**](Generated_Data_Files/all_articles_ores_scores.csv) for your reference.
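
The two-hour figure is consistent with the throttle constants defined earlier. As a back-of-the-envelope sketch, using the 7,111 unique titles found in Step 1 and ignoring the server's response time (so the result is a lower bound):

```python
REQUESTS_PER_HOUR = 5000      # rate limit quoted in the constants cell
LATENCY_ASSUMED = 0.002       # seconds, same assumption as API_LATENCY_ASSUMED
N_ARTICLES = 7111             # unique article titles found in Step 1

# Seconds slept before each request, mirroring API_THROTTLE_WAIT
throttle_wait = (60.0 * 60.0) / REQUESTS_PER_HOUR - LATENCY_ASSUMED
min_runtime_minutes = N_ARTICLES * throttle_wait / 60.0

print(f"Sleep per request: {throttle_wait:.3f} s")
print(f"Minimum total sleep time: {min_runtime_minutes:.0f} minutes")
```

The sleep time alone comes to roughly 85 minutes. The actual request and response time for each call accounts for the remainder.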

In [14]:
def fetch_article_ratings(json_file, output_csv, access_token):
    """
    This function fetches ratings for articles based on their revision IDs
    from a JSON file and saves the results as a CSV.

    Args:
    - json_file (str): Path to the input JSON file containing article data.
    - output_csv (str): Path to save the enriched article data as a CSV file.
    - access_token (str): Access token for authentication.

    Returns:
    - None
    """
    # Load article data from JSON file
    with open(json_file, 'r', encoding='utf-8') as file:
        article_data = json.load(file)

    # Prepare a list to store enriched article data
    enriched_data = []

    # Iterate through articles in the JSON data
    for article_id, details in article_data.items():

        revision_id = details.get('lastrevid')

        # If 'lastrevid' is not found, skip the article
        if revision_id is None:
            print(f"Skipping article '{details.get('title', article_id)}' due to missing revision ID.")
            continue

        article_title = details['title']

        # Create a dictionary to hold the enriched article data
        article_info = details.copy()

        try:
            # Fetch the article quality score
            score_response = request_ores_score_per_article(article_revid=revision_id, email_address="gmihir@uw.edu", access_token=access_token)

            # Extract the predicted score
            predicted_score = score_response["enwiki"]["scores"][str(revision_id)]["articlequality"]["score"]["prediction"]
            article_info["article_rating"] = predicted_score

        except Exception as e:
            print(f"Error fetching article rating for '{article_title}' (Revision: {revision_id}): {e}")
            article_info["article_rating"] = None  # Assign None if an error occurs

        # Append the enriched article information to the list
        enriched_data.append(article_info)

    # Create a DataFrame from the enriched data
    enriched_df = pd.DataFrame(enriched_data)

    # Remove duplicate entries if any
    enriched_df.drop_duplicates(inplace=True)

    # Save the DataFrame with ORES scores to a CSV file
    enriched_df.to_csv(output_csv, index=False)
    print(f"ORES article data has been saved to '{output_csv}'.")

In [None]:
fetch_article_ratings('Generated_Data_Files/all_articles_pageinfo.json', 'Generated_Data_Files/all_articles_ores_scores.csv', ACCESS_TOKEN)


In [99]:
ORES_Data = pd.read_csv('Generated_Data_Files/all_articles_ores_scores.csv')
error_rate = (ORES_Data['article_rating'].isnull().sum() / len(ORES_Data)) * 100
print(f"Error rate: {error_rate:.2f}%")

Error rate: 0.04%


## Part 2: Combining the Datasets

In this section, we will focus on the process of merging the datasets obtained in Part 1. After enriching the politician articles with quality predictions from ORES, we need to combine the Wikipedia data with the population data. This step is crucial for ensuring that we have a comprehensive dataset that includes essential information about each country, such as population and article quality.

To effectively merge the datasets, we will follow these steps:

1. **Merging Wikipedia and Population Data**: Both datasets contain fields with country names, which will allow us to merge them. This step ensures that each article is associated with its corresponding population data.

2. **Identifying Mismatches**: During the merging process, we may encounter entries that cannot be matched. This can occur in either direction:
   - The Wikipedia dataset includes a country name that has no entry in the population data.
   - The population dataset includes a country that has no articles in the Wikipedia dataset.

   We will identify all countries that do not have matches and compile a list of these countries. This list will be saved in a text file named [`wp_countries-no_match.txt`](./Generated_Data_Files/wp_countries-no_match.txt), with each country on a separate line.

3. **Consolidating the Remaining Data**: After setting aside the unmatched countries, we will consolidate the remaining data into a single CSV file called [`wp_politicians_by_country.csv`](./Generated_Data_Files/wp_politicians_by_country.csv). This CSV will include the following columns:

   - **country**: The name of the country.
   - **region**: The geographical region to which the country belongs.
   - **population**: The population of the country.
   - **article_title**: The title of the politician’s Wikipedia article.
   - **revision_id**: The latest revision ID of the article.
   - **article_quality**: The quality rating of the article assigned by ORES.

This consolidated dataset will serve as the foundation for our subsequent analyses and insights.
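
The mismatch step can be illustrated on toy data. This is a minimal sketch: the country names and values are made up, and it relies only on the `how="outer"` and `indicator=True` options of `pd.merge`, which add a `_merge` column marking each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Toy data; the country names and values are made up for illustration
articles = pd.DataFrame({"country": ["Atlantis", "Borduria", "Borduria"],
                         "title": ["A. One", "B. Two", "B. Three"]})
population = pd.DataFrame({"country": ["Borduria", "Cascadia"],
                           "population": [1.5, 3.2],
                           "region": ["EUROPE", "NORTHERN AMERICA"]})

merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

# Rows present on only one side are the unmatched countries
no_match = sorted(merged.loc[merged["_merge"] != "both", "country"].unique())
matched = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")

print(no_match)
print(len(matched))
```

An outer merge catches mismatches in both directions: "Atlantis" has articles but no population row, and "Cascadia" has a population row but no articles.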

In [84]:
def load_and_merge_data(json_file_path, csv_file_path, source_data_file_path):
    """
    This function loads politician article data with revision IDs (JSON), ORES scores (CSV),
    and the source politician article data (CSV), and merges them into a single DataFrame.

    Args:
        json_file_path (str): Path to the JSON file containing politician data.
        csv_file_path (str): Path to the CSV file containing ORES scores.
        source_data_file_path (str): Path to the CSV file containing source article data.

    Returns:
        pd.DataFrame: Merged DataFrame with politician article information, ORES scores, and source data.
    """

    # Load data
    with open(json_file_path, 'r', encoding='utf-8') as json_file:
        politicians_data = json.load(json_file)

    df_article_info = pd.DataFrame.from_dict(politicians_data, orient='index')
    source_data_df = pd.read_csv(source_data_file_path)

    # Merge the provided politicians data with the JSON article pageinfo data on 'name' and 'title'
    merged_df = pd.merge(source_data_df, df_article_info, left_on='name', right_on='title', how='left')

    ores_scores_df = pd.read_csv(csv_file_path) # Load the ORES scores CSV file

    # Merge the updated article pageinfo data (with country) data with ORES scores on last revision id
    final_merged_df = pd.merge(merged_df, ores_scores_df, left_on='lastrevid', right_on='lastrevid', how='left')

    # Select and rename relevant columns for clarity
    final_merged_df = final_merged_df.rename(columns={
        'pageid_x': 'pageid',
        'title_x': 'title',
        'article_rating': 'article_quality',
        'lastrevid': 'revision_id'
    })

    # Ensure revision_id is treated as Int64 to handle missing values
    final_merged_df['revision_id'] = final_merged_df['revision_id'].astype('Int64')

    # Select relevant columns to output
    final_merged_df = final_merged_df[['name', 'country', 'pageid', 'title', 'revision_id', 'article_quality']]

    print(f"Data merging complete. Total rows in merged DataFrame: {final_merged_df.shape[0]}")
    display(final_merged_df.head())

    # Summary statistic
    total_articles = len(df_article_info)
    print(f"Total number of articles: {total_articles}")
    oressucceeded_articles = final_merged_df['article_quality'].notnull().sum()

    print(f"Total number of articles for which both pageInfo call and ORES API call succeeded: {oressucceeded_articles}")

    return final_merged_df

In [73]:
def process_population_data(population_data_path):
    """
    This function processes the population data to distinguish countries from regions
    and assigns each country to the region row that most recently precedes it in the file.

    Args:
        population_data_path (str): Path to the population data CSV file.

    Returns:
        pd.DataFrame: Cleaned population DataFrame with countries, regions, and populations.
    """
    population_df = pd.read_csv(population_data_path)

    region = ""  # Variable to hold current region
    population_list = []

    # Processing population data to distinguish regions and countries
    for _, row in population_df.iterrows():
        row_dict = row.to_dict()
        if row["Geography"].isupper():  # Region is uppercase
            region = row["Geography"]
        else:
            row_dict["region"] = region
            population_list.append(row_dict)

    # Convert list to DataFrame
    population_df_regions = pd.DataFrame(population_list)
    population_df_regions = population_df_regions.rename(columns={"Geography": "country", "Population": "population"})

    # Ensure the DataFrame has the 'region' column
    population_df_regions = population_df_regions[['country', 'population', 'region']]

    print(f"Total countries in population data: {population_df_regions.shape[0]}")
    display(population_df_regions)
    return population_df_regions
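
The row-by-row loop above can also be written with a vectorized idiom. The following is a minimal sketch on a toy DataFrame, assuming the same all-uppercase convention for region rows; the frame `raw` and the population values on its region rows are made up for illustration and are not taken from the population file.

```python
import pandas as pd

# Toy stand-in for the population CSV; the region rows' population values are made up.
raw = pd.DataFrame({
    "Geography": ["NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Tonga"],
    "Population": [256.0, 46.8, 105.2, 45.0, 0.1],
})

# Region header rows are the ones written entirely in uppercase.
is_region = raw["Geography"].str.isupper()

# Keep the name on region rows, blank it elsewhere, then carry it forward.
raw["region"] = raw["Geography"].where(is_region).ffill()

# Drop the region header rows and rename to match the notebook's column names.
countries = (
    raw[~is_region]
    .rename(columns={"Geography": "country", "Population": "population"})
    .reset_index(drop=True)
)
print(countries)
```

Either approach discards the region rows themselves, so the region-level population figures in the file are not carried into the result.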

In [78]:
def merge_and_identify_mismatches(final_df, population_df, output_no_match_file, output_csv_file):
    """
    This method merges the article data with population data and identifies unmatched countries.
    It also saves the unmatched countries to a text file and the consolidated data to a CSV file.

    Args:
        final_df (DataFrame): DataFrame with article, ORES, and source data.
        population_df (DataFrame): DataFrame with population data.
        output_no_match_file (str): Path to save unmatched countries.
        output_csv_file (str): Path to save the consolidated data.

    Returns:
        pd.DataFrame: DataFrame with matched article data and population data.
    """
    # Merging article data with population data
    merged_data = pd.merge(final_df, population_df, on="country", how="outer", indicator=True)

    # Identify unmatched countries
    unmatched_countries = merged_data[merged_data['_merge'] != 'both']['country'].dropna().unique()
    print(f"Total unmatched countries: {len(unmatched_countries)}")
    print(unmatched_countries)

    # Saving unmatched countries to wp_countries-no_match.txt
    with open(output_no_match_file, 'w') as f:
        for country in unmatched_countries:
            f.write(f"{country}\n")

    # Consolidate matched data
    consolidated_data = merged_data[merged_data['_merge'] == 'both'].drop(columns=["_merge"])

    # Keep only the output columns, including 'region' (the rename is a no-op when
    # the quality column is already called 'article_quality')
    consolidated_data = consolidated_data.rename(columns={"quality_prediction": "article_quality"})
    consolidated_data = consolidated_data[["country", "region", "population", "title", "revision_id", "article_quality"]].drop_duplicates()

    # Saving consolidated data to wp_politicians_by_country.csv
    consolidated_data.to_csv(output_csv_file, index=False)
    print(f"\nConsolidated data total rows: {consolidated_data.shape[0]}")
    display(consolidated_data)

    return consolidated_data
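
The `_merge` column produced by `indicator=True` also records which side an unmatched country came from. The sketch below uses toy data (the titles, the populations, and the variable names `articles` and `population` are illustrative assumptions) to show how the unmatched names could be split into those found only in the article data and those found only in the population data.

```python
import pandas as pd

# Toy article and population tables; all values here are illustrative.
articles = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C"],
    "country": ["Algeria", "Korean", "Tonga"],
})
population = pd.DataFrame({
    "country": ["Algeria", "Samoa", "Tonga"],
    "population": [46.8, 0.2, 0.1],
})

# An outer merge keeps rows from both sides; indicator=True adds a '_merge' column
# whose values are 'left_only', 'right_only', or 'both'.
merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

# Countries that appear only in the article data (no population row).
left_only = sorted(merged.loc[merged["_merge"] == "left_only", "country"])

# Countries that appear only in the population data (no articles).
right_only = sorted(merged.loc[merged["_merge"] == "right_only", "country"])

print("Only in articles:", left_only)
print("Only in population:", right_only)
```

Separating the two lists makes it easier to see which input file uses the differing spelling of a country name.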

In [65]:
# Constants
json_file = 'Generated_Data_Files/all_articles_pageinfo.json'
ores_scores_file = 'Generated_Data_Files/all_articles_ores_scores.csv'
source_data_file = 'Resources/politicians_by_country_AUG.2024.csv'
population_data_file = 'Resources/population_by_country_AUG.2024.csv'
no_match_output_file = 'Generated_Data_Files/wp_countries-no_match.txt'
consolidated_output_file = 'Generated_Data_Files/wp_politicians_by_country.csv'

In [85]:
# Load and merge article data with ORES scores and source data
df_1 = load_and_merge_data(json_file_path=json_file, csv_file_path=ores_scores_file, source_data_file_path=source_data_file)

Data merging complete. Total rows in merged DataFrame: 7155


| | name | country | pageid | title | revision_id | article_quality |
|---|---|---|---|---|---|---|
| 0 | Majah Ha Adrif | Afghanistan | 10483286.0 | Majah Ha Adrif | 1233202991 | Start |
| 1 | Haroon al-Afghani | Afghanistan | 11966231.0 | Haroon al-Afghani | 1230459615 | B |
| 2 | Tayyab Agha | Afghanistan | 46841383.0 | Tayyab Agha | 1225661708 | Start |
| 3 | Khadija Zahra Ahmadi | Afghanistan | 71600382.0 | Khadija Zahra Ahmadi | 1234741562 | Stub |
| 4 | Aziza Ahmadyar | Afghanistan | 47805901.0 | Aziza Ahmadyar | 1195651393 | Start |


Total number of articles: 7104
Total number of articles for which both pageInfo call and ORES API call succeeded: 7100


In [74]:
# Process population data and get regions
population_df = process_population_data(population_data_file)

Total countries in population data: 209


| | country | population | region |
|---|---|---|---|
| 0 | Algeria | 46.8 | NORTHERN AFRICA |
| 1 | Egypt | 105.2 | NORTHERN AFRICA |
| 2 | Libya | 6.9 | NORTHERN AFRICA |
| 3 | Morocco | 37.0 | NORTHERN AFRICA |
| 4 | Sudan | 48.1 | NORTHERN AFRICA |
| ... | ... | ... | ... |
| 204 | Samoa | 0.2 | OCEANIA |
| 205 | Solomon Islands | 0.8 | OCEANIA |
| 206 | Tonga | 0.1 | OCEANIA |
| 207 | Tuvalu | 0.0 | OCEANIA |


In [79]:
# Merge all the data and identify any mismatched or missing country data
final_df = merge_and_identify_mismatches(df_1, population_df, no_match_output_file, consolidated_output_file)

Total unmatched countries: 46
['Andorra' 'Australia' 'Brunei' 'Canada' 'China (Hong Kong SAR)'
 'China (Macao SAR)' 'Curacao' 'Denmark' 'Dominica' 'Fiji' 'French Guiana'
 'French Polynesia' 'Georgia' 'Guadeloupe' 'Guam' 'Guinea-Bissau'
 'GuineaBissau' 'Iceland' 'Ireland' 'Jamaica' 'Kiribati' 'Korea (North)'
 'Korea (South)' 'Korea, South' 'Korean' 'Liechtenstein' 'Martinique'
 'Mauritius' 'Mayotte' 'Mexico' 'Nauru' 'Netherlands' 'New Caledonia'
 'New Zealand' 'Palau' 'Philippines' 'Puerto Rico' 'Reunion' 'Romania'
 'San Marino' 'Sao Tome and Principe' 'Suriname' 'United Kingdom'
 'United States' 'Western Sahara' 'eSwatini']

Consolidated data total rows: 7013


| | country | region | population | title | revision_id | article_quality |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | SOUTH ASIA | 42.4 | Majah Ha Adrif | 1233202991 | Start |
| 1 | Afghanistan | SOUTH ASIA | 42.4 | Haroon al-Afghani | 1230459615 | B |
| 2 | Afghanistan | SOUTH ASIA | 42.4 | Tayyab Agha | 1225661708 | Start |
| 3 | Afghanistan | SOUTH ASIA | 42.4 | Khadija Zahra Ahmadi | 1234741562 | Stub |
| 4 | Afghanistan | SOUTH ASIA | 42.4 | Aziza Ahmadyar | 1195651393 | Start |
| ... | ... | ... | ... | ... | ... | ... |
| 7192 | Zimbabwe | EASTERN AFRICA | 16.7 | Josiah Tongogara | 1203429435 | C |
| 7193 | Zimbabwe | EASTERN AFRICA | 16.7 | Langton Towungana | 1246280093 | Stub |
| 7194 | Zimbabwe | EASTERN AFRICA | 16.7 | Sengezo Tshabangu | 1228478288 | Start |
| 7195 | Zimbabwe | EASTERN AFRICA | 16.7 | Herbert Ushewokunze | 959111842 | Stub |


## Part 3: Analysis and Results

In this section, we analyze the combined dataset using two metrics: total articles per capita and high-quality articles per capita. The population file reports population in millions, so both metrics are effectively articles per million people.

- Total Articles per Capita is the number of Wikipedia articles about politicians from a country divided by that country's population.
- High-Quality Articles per Capita is the number of those articles that ORES classifies as either "FA" (Featured Article) or "GA" (Good Article), divided by the country's population.

The analysis will yield six data tables, showcasing:

1. The top 10 countries with the highest total articles per capita.
2. The bottom 10 countries with the lowest total articles per capita.
3. The top 10 countries with the highest high-quality articles per capita.
4. The bottom 10 countries with the lowest high-quality articles per capita.
5. A ranked list of geographic regions by total articles per capita.
6. A ranked list of geographic regions by high-quality articles per capita.
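
To make the units concrete, here is a small worked example using Albania's figures from the results table in this section (70 articles, 7 of them high-quality, population 2.7 million). The variable names are illustrative and not part of the notebook's code.

```python
# Albania's figures as reported in the results table of this section.
total_articles = 70
high_quality_articles = 7
population_millions = 2.7  # the population file reports millions of people

# Dividing by the population in millions gives articles per million people.
articles_per_million = total_articles / population_millions
high_quality_per_million = high_quality_articles / population_millions

# Converting to a true per-person figure requires scaling the population up.
articles_per_person = total_articles / (population_millions * 1_000_000)

print(round(articles_per_million, 6))      # 25.925926
print(round(high_quality_per_million, 6))  # 2.592593
print(articles_per_person)
```

The first two values match the `total_articles_per_capita` and `high_quality_articles_per_capita` columns for Albania, which confirms that the tables report articles per million people.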

In [16]:
def calculate_article_ratios(final_df, population_df):
    """
    This method calculates the total articles per capita and high-quality articles per capita,
    and merges this with the region information.

    Args:
        final_df (DataFrame): DataFrame with articles and region information.
        population_df (DataFrame): DataFrame with population data.

    Returns:
        pd.DataFrame: DataFrame with calculated ratios and region information.
    """
    # Count total articles and high-quality articles
    total_articles = final_df.groupby('country')['title'].count().reset_index(name='total_articles')
    high_quality_articles = final_df[final_df['article_quality'].isin(['FA', 'GA'])] \
        .groupby('country')['title'].count().reset_index(name='high_quality_articles')

    # Merge total articles and high-quality articles
    articles_df = pd.merge(total_articles, high_quality_articles, on='country', how='left')

    # Merge with population data to get population and region
    articles_df = pd.merge(articles_df, population_df[['country', 'population', 'region']], on='country', how='left')

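    # Calculate articles per capita. Population is in millions, so these are articles
    # per million people; a population listed as 0.0 produces inf.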
    articles_df['total_articles_per_capita'] = articles_df['total_articles'] / (articles_df['population'])
    articles_df['high_quality_articles_per_capita'] = articles_df['high_quality_articles'] / (articles_df['population'])

    return articles_df
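
Countries whose population rounds to 0.0 million produce an infinite ratio in the division above. One possible guard, sketched here on toy data, is to treat a zero population as missing before dividing. The frame `ratios` and Tuvalu's article count are made-up illustrations; this is not how the notebook's results were computed.

```python
import numpy as np
import pandas as pd

# Toy per-country counts; Tuvalu's article count is made up for illustration.
ratios = pd.DataFrame({
    "country": ["Tuvalu", "Tonga", "Albania"],
    "total_articles": [5, 10, 70],
    "population": [0.0, 0.1, 2.7],  # millions of people
})

# Treat a population of 0.0 as unknown so the ratio becomes NaN instead of inf.
safe_population = ratios["population"].replace(0, np.nan)
ratios["total_articles_per_capita"] = ratios["total_articles"] / safe_population

print(ratios)
```

With this guard, such countries carry a missing ratio that can be filtered out explicitly rather than appearing at the top of a ranking.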

In [80]:
# Get the total_articles_per_capita and high_quality_articles_per_capita information
article_ratios_df = calculate_article_ratios(final_df, population_df)
article_ratios_df

| | country | total_articles | high_quality_articles | population | region | total_articles_per_capita | high_quality_articles_per_capita |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 85 | 3.0 | 42.4 | SOUTH ASIA | 2.004717 | 0.070755 |
| 1 | Albania | 70 | 7.0 | 2.7 | SOUTHERN EUROPE | 25.925926 | 2.592593 |
| 2 | Algeria | 71 | 1.0 | 46.8 | NORTHERN AFRICA | 1.517094 | 0.021368 |
| 3 | Angola | 58 | 2.0 | 36.7 | MIDDLE AFRICA | 1.580381 | 0.054496 |
| 4 | Antigua and Barbuda | 33 | NaN | 0.1 | CARIBBEAN | 330.000000 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 161 | Venezuela | 56 | 1.0 | 28.8 | SOUTH AMERICA | 1.944444 | 0.034722 |
| 162 | Vietnam | 36 | 2.0 | 98.9 | SOUTHEAST ASIA | 0.364004 | 0.020222 |
| 163 | Yemen | 32 | NaN | 34.4 | WESTERN ASIA | 0.930233 | NaN |
| 164 | Zambia | 3 | NaN | 20.2 | EASTERN AFRICA | 0.148515 | NaN |


In [96]:
def generate_summary_tables(article_ratios_df):
    """
    This function generates all the required summary tables.

    Args:
        article_ratios_df (DataFrame): DataFrame with calculated ratios.

    Returns:
        dict: Dictionary containing summary tables.
    """
    # Top 10 countries by total articles per capita (sorted in descending order)
    top_10_countries_total = article_ratios_df.nlargest(10, 'total_articles_per_capita')[['country', 'total_articles_per_capita']] \
        .sort_values('total_articles_per_capita', ascending=False).reset_index(drop=True)

    # Bottom 10 countries by total articles per capita (sorted in ascending order)
    bottom_10_countries_total = article_ratios_df.nsmallest(10, 'total_articles_per_capita')[['country', 'total_articles_per_capita']] \
        .sort_values('total_articles_per_capita', ascending=True).reset_index(drop=True)

    # Filter for countries with at least 1 high-quality article
    filtered_high_quality = article_ratios_df[article_ratios_df['high_quality_articles_per_capita'] > 0]

    # Top 10 countries by high-quality articles per capita (sorted in descending order)
    top_10_countries_quality = filtered_high_quality.nlargest(10, 'high_quality_articles_per_capita')[['country', 'high_quality_articles_per_capita']] \
        .sort_values('high_quality_articles_per_capita', ascending=False).reset_index(drop=True)

    # Bottom 10 countries by high-quality articles per capita (sorted in ascending order)
    bottom_10_countries_quality = filtered_high_quality.nsmallest(10, 'high_quality_articles_per_capita')[['country', 'high_quality_articles_per_capita']] \
        .sort_values('high_quality_articles_per_capita', ascending=True).reset_index(drop=True)

    # Regional coverage: article count per region divided by the population of the
    # first country listed in that region (final_df holds country-level populations,
    # not regional totals), sorted in descending order.
    # Note: this reads the notebook-level final_df rather than a function argument.
    regions_total_coverage = (final_df.groupby('region')['title'].count() / final_df.groupby('region')['population'].first()).sort_values(ascending=False)
    regions_total_coverage = pd.DataFrame(regions_total_coverage).reset_index()
    regions_total_coverage.columns = ['Region', 'Total Articles per Capita']

    # Regional high-quality coverage: count of FA or GA articles per region divided by
    # the same denominator. The result keeps the alphabetical region order and is not sorted.
    regions_quality_coverage = final_df[(final_df['article_quality'] == 'FA') | (final_df['article_quality'] == 'GA')]
    regions_quality_coverage = regions_quality_coverage.groupby('region')['title'].count() / final_df.groupby('region')['population'].first()
    regions_quality_coverage = pd.DataFrame(regions_quality_coverage).reset_index()
    regions_quality_coverage.columns = ['Region', 'High Quality Articles per Capita']

    return {
        'Top 10 Countries by Total Articles': top_10_countries_total,
        'Bottom 10 Countries by Total Articles': bottom_10_countries_total,
        'Top 10 Countries by High Quality Articles': top_10_countries_quality,
        'Bottom 10 Countries by High Quality Articles': bottom_10_countries_quality,
        'Regions by Total Coverage': regions_total_coverage,
        'Regions by High Quality Coverage': regions_quality_coverage
    }
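
The regional calculation above divides each region's article count by a single country's population. A minimal sketch of an alternative is shown below: it sums each country's population once per region and sorts both regional metrics. The toy rows and the names `final` and `summary` are illustrative assumptions, and the summed population only covers countries that have at least one article, so it is an approximation of the true regional population.

```python
import pandas as pd

# Toy article-level data in the same shape as the consolidated DataFrame;
# titles and quality labels are made up.
final = pd.DataFrame({
    "country": ["Algeria", "Algeria", "Egypt", "Tonga"],
    "region": ["NORTHERN AFRICA", "NORTHERN AFRICA", "NORTHERN AFRICA", "OCEANIA"],
    "population": [46.8, 46.8, 105.2, 0.1],  # millions, repeated on every article row
    "title": ["Article A", "Article B", "Article C", "Article D"],
    "article_quality": ["GA", "Stub", "Start", "C"],
})

# Count each country's population once, then total it per region.
region_population = final.drop_duplicates("country").groupby("region")["population"].sum()

# Article counts per region, overall and for high-quality (FA or GA) articles.
region_articles = final.groupby("region")["title"].count()
is_high_quality = final["article_quality"].isin(["FA", "GA"])
region_high_quality = (
    final[is_high_quality].groupby("region")["title"].count()
    .reindex(region_articles.index, fill_value=0)  # regions with none get 0
)

# Both ratios are articles per million people, ranked by the high-quality ratio.
summary = pd.DataFrame({
    "articles_per_million": region_articles / region_population,
    "high_quality_per_million": region_high_quality / region_population,
}).sort_values("high_quality_per_million", ascending=False)

print(summary)
```

Using the region rows from the original population file, where available, would give the full regional population instead of this sum over matched countries.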

In [97]:
# Generate summary tables
summary_tables = generate_summary_tables(article_ratios_df)

In [88]:
print("Top 10 countries by coverage:")
summary_tables['Top 10 Countries by Total Articles']

Top 10 countries by coverage:


| | country | total_articles_per_capita |
|---|---|---|
| 0 | Monaco | inf |
| 1 | Tuvalu | inf |
| 2 | Antigua and Barbuda | 330.0 |
| 3 | Federated States of Micronesia | 140.0 |
| 4 | Marshall Islands | 130.0 |
| 5 | Tonga | 100.0 |
| 6 | Barbados | 83.333333 |
| 7 | Montenegro | 60.0 |
| 8 | Seychelles | 60.0 |
| 9 | Bhutan | 55.0 |


In [89]:
print('Bottom 10 countries by coverage:')
summary_tables['Bottom 10 Countries by Total Articles']

Bottom 10 countries by coverage:


| | country | total_articles_per_capita |
|---|---|---|
| 0 | China | 0.011337 |
| 1 | Ghana | 0.087977 |
| 2 | India | 0.105698 |
| 3 | Saudi Arabia | 0.135501 |
| 4 | Zambia | 0.148515 |
| 5 | Norway | 0.181818 |
| 6 | Israel | 0.204082 |
| 7 | Egypt | 0.304183 |
| 8 | Cote d'Ivoire | 0.323625 |
| 9 | Ethiopia | 0.347826 |


In [90]:
print('Top 10 countries by high quality:')
summary_tables['Top 10 Countries by High Quality Articles']

Top 10 countries by high quality:


| | country | high_quality_articles_per_capita |
|---|---|---|
| 0 | Montenegro | 5.0 |
| 1 | Luxembourg | 2.857143 |
| 2 | Albania | 2.592593 |
| 3 | Kosovo | 2.352941 |
| 4 | Maldives | 1.666667 |
| 5 | Lithuania | 1.37931 |
| 6 | Croatia | 1.315789 |
| 7 | Guyana | 1.25 |
| 8 | Palestinian Territory | 1.090909 |
| 9 | Slovenia | 0.952381 |


In [91]:
print('Bottom 10 countries by high quality:')
summary_tables['Bottom 10 Countries by High Quality Articles']

Bottom 10 countries by high quality:


| | country | high_quality_articles_per_capita |
|---|---|---|
| 0 | Bangladesh | 0.005764 |
| 1 | Egypt | 0.009506 |
| 2 | Ethiopia | 0.01581 |
| 3 | Japan | 0.016064 |
| 4 | Pakistan | 0.016632 |
| 5 | Colombia | 0.019157 |
| 6 | Congo DR | 0.01955 |
| 7 | Vietnam | 0.020222 |
| 8 | Uganda | 0.020576 |
| 9 | Algeria | 0.021368 |


In [92]:
print('Geographic regions by total coverage:')
summary_tables['Regions by Total Coverage']

Geographic regions by total coverage:


| | Region | Total Articles per Capita |
|---|---|---|
| 0 | CARIBBEAN | 2180.0 |
| 1 | OCEANIA | 720.0 |
| 2 | CENTRAL AMERICA | 376.0 |
| 3 | SOUTHERN EUROPE | 295.185185 |
| 4 | WESTERN ASIA | 203.0 |
| 5 | NORTHERN EUROPE | 136.428571 |
| 6 | EASTERN EUROPE | 77.065217 |
| 7 | WESTERN EUROPE | 54.021739 |
| 8 | EASTERN AFRICA | 50.378788 |
| 9 | SOUTHERN AFRICA | 45.555556 |


In [98]:
print('Geographic regions by high quality coverage:')
summary_tables['Regions by High Quality Coverage']

Geographic regions by high quality coverage:


| | Region | High Quality Articles per Capita |
|---|---|---|
| 0 | CARIBBEAN | 90.0 |
| 1 | CENTRAL AMERICA | 20.0 |
| 2 | CENTRAL ASIA | 0.251256 |
| 3 | EAST ASIA | 0.002126 |
| 4 | EASTERN AFRICA | 1.287879 |
| 5 | EASTERN EUROPE | 4.130435 |
| 6 | MIDDLE AFRICA | 0.217984 |
| 7 | NORTHERN AFRICA | 0.363248 |
| 8 | NORTHERN EUROPE | 5.714286 |
| 9 | OCEANIA | 10.0 |
