# Homework \#2: Considering Bias in Data

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary among countries. Your analysis will consist of a series of tables that show:
1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.

You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings help you understand the causes and consequences of biased data in large, complex data science projects.


To reproduce my analysis, run all the cells in order.


# License

Some portions of the code below are derived from examples created by Dr. David W. McDonald for the DATA 512 course in the UW MS Data Science degree program. This code is shared under the Creative Commons CC-BY license. Revision 1.2 - September 16, 2024.


A copy of the reference code can be found in this repository in the folder labeled `code_references`.

This code is provided "as is," without any warranty of any kind. The author is not responsible for any issues that may arise from its use. Contributions to this project are welcome. Please submit a pull request or open an issue to discuss potential improvements and additions.



# Dataset Details

1. The Wikipedia **Category:Politicians by nationality** was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as `politicians_by_country_AUG.2024.csv`.

2. The **population data** is available in CSV format as `population_by_country_AUG.2024.csv`. This dataset was downloaded from the World Population Data Sheet published by the Population Reference Bureau.

In my analysis, I manually checked whether some of the people listed in the data are, in fact, not politicians. During this process, I found several prominent figures who are better known for other professions, including Mohammad Khan, an athlete from Afghanistan; Karen Sargsyan, a sociologist from Armenia; Julius Lippert, a historian from Austria; and Rick James, an actor from Antigua and Barbuda.

There may be no systematic way to search for such cases across diverse professions, but I found that each of these individuals had engaged in political activities at some point in their lives. Because of these associations with political roles, despite their different backgrounds, I decided to keep them in my dataset for analysis.

### Necessary Python modules for the script

`json`: For working with JSON data.  
`time`: For time-related operations.  
`urllib.parse`: For URL parsing.  
`requests`: For making HTTP requests.  
`csv`: For working with CSV files.  
`pandas`: For data manipulation and analysis.  
`os`: For operating system-related tasks.  

In [56]:
#
# These are standard python modules
import json
import time
import urllib.parse
import csv
import os

#
# The following modules are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

### Required Credentials

This cell stores the credentials required to access a particular service or API.

`USERNAME` holds the user's name or identifier for authentication purposes. Replace "Navyaeedula" with your actual username if needed.  
`EMAIL_ADDRESS` holds the email address used to set up the account and API keys.  
`ACCESS_TOKEN` holds the API token that is required to authorize access to the ORES service. For security reasons, the token should be kept confidential and should not be hard-coded, so I have removed the access key I used and added a placeholder instead.


In [57]:
#
# Credentials

# Replace with your actual username.
USERNAME = "Navyaeedula"
# Replace with the email address you used to set up your account and API keys.
EMAIL_ADDRESS = "needula@uw.edu"
# Replace with your API access token. Keep the real token out of shared copies of this notebook.
ACCESS_TOKEN = "<YOUR_ACCESS_TOKEN>"
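
One way to keep the token out of the notebook entirely is to read it from an environment variable. The sketch below is only an illustration of that idea: the variable name `ORES_ACCESS_TOKEN` and the helper `load_access_token` are assumptions introduced here, not part of the original course code.

```python
import os


def load_access_token(environ=os.environ, var_name="ORES_ACCESS_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is unset or empty.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    token = environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return token


# Possible usage in the credentials cell (assumes the variable has been exported):
# ACCESS_TOKEN = load_access_token()
```

Passing the environment mapping as a parameter makes the helper easy to check with a plain dictionary, without touching the real process environment.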


In [58]:
#
# Input file path containing politician article titles and population details for each country

ARTICLE_LIST_FILE = "/content/politicians_by_country_AUG.2024.csv" # Path to the CSV file with politician article titles.
POPULATION_FILE = "/content/population_by_country_AUG.2024.csv" # Path to the CSV file with population data by country and region

politician_df = pd.read_csv(ARTICLE_LIST_FILE)
population_df = pd.read_csv(POPULATION_FILE)


# Data Acquisition
This code defines constants and templates for interacting with the [Wikimedia ORES API](https://www.mediawiki.org/wiki/ORES#API_usage) to score the quality of English Wikipedia articles. It sets the API endpoint, specifies the "enwiki-articlequality" model, and calculates the necessary delay between requests to avoid exceeding the rate limit of 5000 requests per hour. It includes templates for the request headers, which contain the user's email and an access token for authentication, and the payload that includes the article revision ID and language (English). The code ensures that requests are properly structured and throttled to comply with API usage limits.

In [59]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': f"<{EMAIL_ADDRESS}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : EMAIL_ADDRESS,
    'access_token'  : ACCESS_TOKEN
}

#
#    Data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}
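
To make the throttling arithmetic and the header templating concrete, here is a small standalone sketch. It recomputes the wait time from the 5000 requests per hour limit and fills a header template with `str.format`. The names `header_template` and `params`, the email address, and the token value are placeholders made up for this illustration.

```python
REQUESTS_PER_HOUR = 5000   # rate limit granted by the access token
LATENCY_ASSUMED = 0.002    # seconds, the same assumption as in the cell above

# One hour is 3600 seconds, so each request may use 3600 / 5000 = 0.72 seconds
seconds_per_request = (60.0 * 60.0) / REQUESTS_PER_HOUR
throttle_wait = seconds_per_request - LATENCY_ASSUMED   # about 0.718 seconds

# Placeholder template and values, for illustration only
header_template = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}
params = {"email_address": "uwnetid@uw.edu", "access_token": "PLACEHOLDER_TOKEN"}

# Each header value is formatted with the same parameter dictionary;
# values without placeholders pass through unchanged
headers = {key: value.format(**params) for key, value in header_template.items()}

print(round(throttle_wait, 3))
print(headers["Authorization"])
```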




This code sets up constants and configurations for making requests to the [English Wikipedia API](https://www.mediawiki.org/wiki/API:Info) to retrieve information about specific pages. It defines the API endpoint, assumes a small latency between requests to avoid overloading the server (throttling), and customizes the request headers to include the user's email address as the user agent. The code also outlines a template for the API request, which fetches page details such as the talk page ID, URL, and the number of watchers. These parameters are structured to query one Wikipedia page at a time.

In [60]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'


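# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
# Note: these assignments overwrite the API_LATENCY_ASSUMED and API_THROTTLE_WAIT values set in the ORES constants cell,
# so any function that reads API_THROTTLE_WAIT after this cell runs uses this shorter wait
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED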

# When making automated requests we should include something that is unique to the person making the request
REQUEST_HEADERS = {
    'User-Agent': f'<{EMAIL_ADDRESS}>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
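
As a rough illustration of how the parameter template turns into a request, the sketch below copies the template for one title and prints the resulting query URL. The helper `build_pageinfo_params` is a hypothetical name introduced here, and the constants are re-declared so the snippet runs on its own.

```python
import urllib.parse

ENDPOINT = "https://en.wikipedia.org/w/api.php"
PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}


def build_pageinfo_params(title, template=PARAMS_TEMPLATE):
    """Return a fresh parameter dict for one title, leaving the template untouched."""
    if not title:
        raise ValueError("An article title is required.")
    params = dict(template)   # shallow copy is enough: all values are strings
    params["titles"] = title
    return params


params = build_pageinfo_params("Tayyab Agha")
# urlencode percent-encodes the values, e.g. spaces become '+' and '|' becomes '%7C'
print(f"{ENDPOINT}?{urllib.parse.urlencode(params)}")
```

Copying the template for each request is a design choice in this sketch; it avoids one article's title lingering in the shared dictionary when the next request is built.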


## Functions for Handling Wikipedia Article Data, API Requests, and Data Integrity Checks

`read_article_file(file_path)`: This function reads article titles from a CSV file and extracts them using the 'name' column. It handles errors such as missing files, CSV parsing errors, or other unexpected issues. The returned result is a list of Wikipedia article titles for politicians.

`process_ores_scores_from_json(json_file_path, email_address, access_token)`: This function reads the last revision ID of each Wikipedia article from a JSON file of page information. For each revision ID, it queries the ORES API to retrieve the quality prediction and the class probabilities, and it saves this data to a CSV file.

`request_pageinfo_per_article(article_title, endpoint_url, request_template, headers)`: This function queries Wikipedia's API for detailed page information about a given article title. It accepts the article title, an API endpoint, a request template, and HTTP headers as input, and returns the JSON response, or `None` if the request fails.

`process_and_save_data(articles)`: This function processes the page information for each article by querying the Wikipedia API. It extracts relevant data, such as page ID, title, and last revision ID, and saves the results in a CSV file.

`read_json_keys(file_path)`: This function takes a file path to a JSON file, reads the file, and returns a list of the top-level keys found in the JSON object. It handles common errors such as file not being found, invalid JSON, or any unexpected issues that might occur during reading. If any error occurs, an empty list is returned, and an error message is printed to the console.

`print_duplicates(lst)`: This function scans a list for duplicate elements. It keeps a set of "seen" items to track efficiently whether an element has been encountered before. An element that appears more than once is added to the "duplicates" list the first time it repeats, and the function returns that list. If no duplicates are found, it prints a message saying so. This function is useful for identifying repeated entries in a list.
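
For comparison, the same duplicate check can be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the implementation used in this notebook; `find_duplicates` is a hypothetical name, and the sample titles are taken from the article list only as example strings.

```python
from collections import Counter


def find_duplicates(items):
    """Return each item that occurs more than once, in order of first appearance."""
    counts = Counter(items)   # Counter keeps insertion order, like a dict
    return [item for item, count in counts.items() if count > 1]


titles = ["Tayyab Agha", "Jan Baz", "Tayyab Agha", "Musa Hotak", "Jan Baz"]
print(find_duplicates(titles))   # ['Tayyab Agha', 'Jan Baz']
```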

In [66]:
#
# Function to read article titles of politicians from Wikipedia from the CSV file
def read_article_file(file_path):
    """
    Reads article titles from a CSV file.

    Args:
        file_path (str): Path to the CSV file containing article titles.

    Returns:
        list: A list of article titles extracted from the 'name' column of the CSV file.

    Raises:
        FileNotFoundError: If the specified CSV file is not found.
        csv.Error: If there is an error parsing the CSV file.
        Exception: For any other unexpected errors.
    """
    try:
        # Open the CSV file and read its contents
        with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)

            # Extract the 'name' column values, if they exist
            article_titles = [row['name'] for row in reader if 'name' in row and row['name']]

    except FileNotFoundError as fnf_error:
        print(f"Error: CSV file '{file_path}' not found. Please check the file path and try again.")
        raise fnf_error

    except csv.Error as csv_error:
        print(f"Error: CSV file '{file_path}' could not be parsed properly. {csv_error}")
        raise csv_error

    except Exception as e:
        print(f"Unexpected error while reading '{file_path}': {e}")
        raise e

    return article_titles

#
# Function to process and save data from Wikipedia API
def process_and_save_data(articles):
    """
    Processes page information for each article from Wikipedia API and saves it to a CSV file.

    Args:
        articles (list): List of article titles to query from the Wikipedia API.

    Raises:
        Exception: If there is an error saving data to a CSV file.
    """
    # Specify the folder name
    folder_name = 'generated_files'

    # Create the folder if it doesn't exist
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
        print(f"Folder '{folder_name}' created successfully.")
    else:
        print(f"Folder '{folder_name}' already exists.")

    output_file = "generated_files/articles_page_info.csv"
    article_data = []
    failed_to_process = []

    try:
        for article in articles:
            print(f"Processing article: {article}")

            # Request page information for each article
            response = request_pageinfo_per_article(article)

            if response is not None:
                # Append the response to the list of article data
                article_data.append(response)
            else:
                print(f"Failed to process {article}")
                failed_to_process.append(article)

        # Convert the article data to a DataFrame
        article_info_df = pd.DataFrame(article_data)
        result_df = pd.DataFrame()

        for index, row in article_info_df.iterrows():
            page_info = row['query']
            page_id = list(page_info['pages'].keys())[0]
            page_data = page_info['pages'][page_id]

            # Convert the page data (a nested dictionary) to a DataFrame row
            page_data_df = pd.DataFrame.from_dict(page_data, orient='index').T
            result_df = pd.concat([result_df, page_data_df])
            result_df = result_df.loc[:, ['pageid', 'title', 'lastrevid']]

        # Save the DataFrame to a CSV file
        try:
            result_df.to_csv(output_file, index=False, encoding='utf-8')
            print(f"Data successfully saved to {output_file}.")
        except Exception as e:
            print(f"Error saving data to CSV: {e}")
            print(failed_to_process)

    except KeyboardInterrupt:
        print("This code was previously run and the data file was retrieved.")


def request_ores_score_per_article(article_revid=None, email_address=None, access_token=None,
                                   endpoint_url=API_ORES_LIFTWING_ENDPOINT,
                                   model_name=API_ORES_EN_QUALITY_MODEL,
                                   request_data=ORES_REQUEST_DATA_TEMPLATE,
                                   header_format=REQUEST_HEADER_TEMPLATE,
                                   header_params=REQUEST_HEADER_PARAMS_TEMPLATE):
    """
    Sends a request to the ORES (Objective Revision Evaluation Service) API to retrieve a quality score
    for a specific Wikipedia article revision.

    Parameters:
    ----------
    article_revid : int, optional
        The revision ID of the Wikipedia article for which to request a score.
        Must be provided for the request to succeed.

    email_address : str, optional
        The email address associated with the API request for identification purposes.
        Must be provided for the request to succeed.

    access_token : str, optional
        The access token required to authenticate the API request.
        Must be provided for the request to succeed.

    endpoint_url : str, optional
        The base URL of the ORES API endpoint. Defaults to a pre-defined constant.

    model_name : str, optional
        The specific ORES model to use for scoring (e.g., article quality).
        Defaults to a pre-defined model for English Wikipedia article quality.

    request_data : dict, optional
        The data to be sent in the request, containing details such as revision ID.
        Defaults to a template dictionary.

    header_format : dict, optional
        The format for the headers of the request. Defaults to a template dictionary.

    header_params : dict, optional
        The parameters to be filled in the request headers, such as email and access token.
        Defaults to a template dictionary.

    Returns:
    -------
    dict
        A JSON response from the ORES API containing the article's quality score, or None if the request fails.
    """

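    # Work on copies so the shared module-level templates are not modified between calls
    request_data = dict(request_data)
    header_params = dict(header_params)

    # Fill in the article revision ID, email, and access token when they are provided
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token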

    # Raise an exception if any of the critical parameters are missing
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Format the request URL by including the specific ORES model name
    request_url = endpoint_url.format(model_name=model_name)

    # Prepare headers using the provided format and parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # Make the API request to ORES with throttling to avoid exceeding rate limits
    try:
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)  # Implement delay if throttling is set
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))  # Make the POST request
        json_response = response.json()  # Parse the JSON response
    except Exception as e:
        print(e)  # Log any exceptions
        json_response = None  # Return None if the request fails
    return json_response

def request_pageinfo_per_article(article_title=None,
                                 endpoint_url=API_ENWIKIPEDIA_ENDPOINT,
                                 request_template=PAGEINFO_PARAMS_TEMPLATE,
                                 headers=REQUEST_HEADERS):
    """
    Fetches page information from the Wikipedia API for a given article.

    Args:
        article_title (str): Title of the Wikipedia article to query.
        endpoint_url (str): Wikipedia API endpoint URL (default is set to the English Wikipedia).
        request_template (dict): Template for API request parameters, including action, format, and props.
        headers (dict): HTTP headers to be included in the request, especially containing 'User-Agent' for identification.

    Returns:
        dict or None: JSON response from the API if successful, otherwise None.

    Raises:
        Exception: If article title is not supplied or if the headers do not contain a valid 'User-Agent'.
    """

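    # Copy the template so the shared module-level dictionary is not modified between calls
    request_template = dict(request_template)

    # Set the article title in the request parameters if it's provided as an argument
    if article_title:
        request_template['titles'] = article_title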

    # Raise an exception if no article title is supplied
    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # Check if the 'User-Agent' field is present in the headers for responsible API usage
    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    # Ensure the user has used their UW email address correctly in the 'User-Agent' field
    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # Attempt to make the request to the Wikipedia API
    try:
        # Implementing a throttle wait to avoid overwhelming the free API with too many requests
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)

        # Make the request to the Wikipedia API
        response = requests.get(endpoint_url, headers=headers, params=request_template)

        # Parse the response into JSON format
        json_response = response.json()

    except Exception as e:
        # Print the exception if an error occurs during the request
        print(e)
        json_response = None

    return json_response

#
# Function to process ORES scores of each Wikipedia article from a JSON file and save to CSV
def process_ores_scores_from_json(json_file_path, email_address, access_token):
    """
    Retrieves ORES scores for articles based on revision IDs and saves the data to a CSV file.

    Args:
        json_file_path (str): Path to the JSON file containing article page information.
        email_address (str): Email address for the API request header.
        access_token (str): Access token for ORES API authentication.

    Raises:
        FileNotFoundError: If the JSON file is not found.
        json.JSONDecodeError: If the JSON file cannot be decoded.
    """

    output_csv = 'generated_files/articles_ores_scores.csv'
    all_article_scores = []
    articles_without_revid = []

    try:
        # Load article data from the JSON file
        with open(json_file_path, mode='r', encoding='utf-8') as json_file:
            articles = json.load(json_file)

            for article in articles.values():  # Assuming articles is a dictionary
                lastrevid = article.get('lastrevid')

                if lastrevid:
                    print(f"Requesting ORES score for revision ID: {lastrevid}")

                    # Request ORES scores for the given revision ID
                    response = request_ores_score_per_article(
                        article_revid=lastrevid,
                        email_address=email_address,
                        access_token=access_token
                    )

                    if response is not None:
                        # Store score data
                        score_data = {
                            'revision_id': lastrevid,
                            'quality_prediction': response.get('enwiki', {}).get('scores', {}).get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('prediction')
                        }

                        # Add probabilities to the score data
                        probabilities = response.get('enwiki', {}).get('scores', {}).get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('probability', {})
                        score_data.update({f'Probability {key}': value for key, value in probabilities.items()})

                        all_article_scores.append(score_data)
                    else:
                        print(f"Failed to get score for lastrevid {lastrevid}")
                else:
                    print(f"lastrevid not found for article: {article.get('title')}")
                    articles_without_revid.append(article.get('title'))  # record the title, since lastrevid is empty here

        # Save the scores to a CSV file
        all_article_scores_df = pd.DataFrame(all_article_scores)
        all_article_scores_df.to_csv(output_csv, index=False)
        print(articles_without_revid)
        print(f"Scores saved to {output_csv}")

    except FileNotFoundError:
        print(f"Error: JSON file '{json_file_path}' not found. Please check the file path and try again.")
        raise
    except json.JSONDecodeError:
        print(f"Error: Failed to decode JSON from '{json_file_path}'.")
        raise
    except KeyboardInterrupt:
        print("This code was previously run, and the data file was retrieved.")

def read_json_keys(file_path):
    """
    Reads a JSON file and extracts all the keys present at the top level.

    Args:
    - file_path (str): Path to the JSON file.

    Returns:
    - list: A list of keys found in the JSON file, or an empty list in case of errors.

    Raises:
    - FileNotFoundError: If the file is not found.
    - json.JSONDecodeError: If the file cannot be parsed as JSON.
    - Exception: For other unexpected errors during file reading.
    """
    try:
        # Open and load the JSON file into a Python object
        with open(file_path, 'r', encoding='utf-8') as jsonfile:
            data = json.load(jsonfile)  # Load JSON data

        # Extract and return the list of top-level keys in the JSON object
        keys_list = list(data.keys())
        return keys_list

    except FileNotFoundError:
        # Handle missing file error
        print(f"Error: JSON file '{file_path}' not found. Please check the file path and try again.")
        return []

    except json.JSONDecodeError:
        # Handle JSON parsing errors
        print(f"Error: The file '{file_path}' could not be parsed as valid JSON.")
        return []

    except Exception as e:
        # Handle any other unexpected errors
        print(f"Unexpected error while reading '{file_path}': {e}")
        return []

def print_duplicates(lst):
    """
    Identifies the duplicate elements in a given list.

    Args:
    - lst (list): The list to check for duplicate elements.

    Returns:
    - list: The elements that appear more than once, each listed once. The list is empty
      (and a message is printed) if no duplicates are found.
    """
    # Create an empty set to keep track of seen elements
    seen = set()

    # List to store any found duplicates
    duplicates = []

    # Iterate through each element in the list
    for x in lst:
        # Check if the element is already in the 'seen' set and not already in duplicates
        if x in seen and x not in duplicates:
            duplicates.append(x)  # Add to duplicates if it's repeated
        else:
            seen.add(x)  # Add element to the 'seen' set if new

    # Report when nothing was found, and always return a list so callers can use len()
    if not duplicates:
        print("No duplicates found.")
    return duplicates
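
The nested lookups in `process_ores_scores_from_json` can be easier to follow in isolation. The sketch below pulls the prediction and the class probabilities out of a single response dictionary using the same key path as the function above. The helper name `extract_quality` and the sample response, including its revision ID and probabilities, are invented for illustration and are not real ORES output.

```python
def extract_quality(response, rev_id, wiki="enwiki", model="articlequality"):
    """Pull the predicted class and class probabilities out of one response dict."""
    score = (
        response.get(wiki, {})
        .get("scores", {})
        .get(str(rev_id), {})   # revision IDs are string keys in the response
        .get(model, {})
        .get("score", {})
    )
    row = {"revision_id": rev_id, "quality_prediction": score.get("prediction")}
    for label, probability in score.get("probability", {}).items():
        row[f"Probability {label}"] = probability
    return row


# Hypothetical response with made-up values, for illustration only
sample = {"enwiki": {"scores": {"12345": {"articlequality": {"score": {
    "prediction": "Stub",
    "probability": {"Stub": 0.8, "Start": 0.15, "C": 0.05},
}}}}}}

print(extract_quality(sample, 12345))
print(extract_quality({}, 12345))   # missing keys fall back to None
```

Because every `.get` call supplies an empty dictionary as its default, a response that lacks any level of the path yields a row with `quality_prediction` set to `None` instead of raising a `KeyError`.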

The next cell generates the `.json` file containing the Wikipedia page information. It took approximately 30 minutes to run in a Google Colab notebook using the CPU runtime.

In [67]:
# Reading the Wikipedia article titles from the CSV file
article_titles = read_article_file(ARTICLE_LIST_FILE)

# Processing and saving Page Information data for all the Wikipedia articles
process_and_save_data(article_titles)

Folder 'generated_files' already exists.
Processing article: Majah Ha Adrif
Processing article: Haroon al-Afghani
Processing article: Tayyab Agha
Processing article: Khadija Zahra Ahmadi
Processing article: Aziza Ahmadyar
This code was previously run and the data file was retrieved.
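
Each page info response nests the page record under `query` and `pages`, keyed by page ID. As a hedged sketch of that extraction step, the snippet below collects `pageid`, `title`, and `lastrevid` from a list of responses and builds the DataFrame once at the end. `extract_page_rows` is a hypothetical helper, and the second sample response is invented to show how a missing page behaves.

```python
import pandas as pd


def extract_page_rows(responses, fields=("pageid", "title", "lastrevid")):
    """Collect the selected fields from every page record in a list of API responses."""
    rows = []
    for response in responses:
        pages = response.get("query", {}).get("pages", {})
        for page in pages.values():
            # Missing pages carry no pageid or lastrevid, so .get() yields None
            rows.append({field: page.get(field) for field in fields})
    return pd.DataFrame(rows, columns=list(fields))


sample_responses = [
    {"query": {"pages": {"46841383": {
        "pageid": 46841383, "ns": 0, "title": "Tayyab Agha", "lastrevid": 1225661708}}}},
    # Invented example of a title that does not exist
    {"query": {"pages": {"-1": {"ns": 0, "title": "No Such Page", "missing": ""}}}},
]

df = extract_page_rows(sample_responses)
print(df)
```

Building the list of rows first and creating the DataFrame once avoids repeated concatenation inside the loop, which matters more as the number of articles grows.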


In [68]:
print("The number of politician wikipedia articles in the dataset is:",len(article_titles))

The number of politician wikipedia articles in the dataset is: 7155


In [69]:
# Example usage
file_path_json_page_info = "/content/generated_files/articles_page_info.json"
keys = read_json_keys(file_path_json_page_info)
print("The unique Wikipedia article titles (keys) found in the JSON file containing page information are:", keys)

The unique Wikipedia article titles (keys) found in the JSON file containing page information are: ['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi', 'Ismael Balkhi', 'Abdul Baqi Turkistani', 'Mohammad Ghous Bashiri', 'Jan Baz', 'Bashir Ahmad Bezan', 'Rafiullah Bidar', 'Mohammad Siddiq Chakari', 'Cheragh Ali Cheragh', 'Nasir Ahmad Durrani', 'Muhammad Hashim Esmatullahi', 'Ezatullah (Nangarhar)', 'Aimal Faizi', 'Gajinder Singh Safri', 'Sharif Ghalib', 'Hashmat Ghani Ahmadzai', 'Abdul Ghani Ghani', 'Ghulam Ghaus', 'Ghulam Muhammad Ghobar', 'Mohammad Gul (Helmand Council)', 'Sayed Yousuf Halim', 'Rangina Hamidi', 'Sayed Zafar Hashemi', 'Qutbuddin Hilal', 'Mahboba Hoqomal', 'Musa Hotak', 'Mirza Muhammad Ismail', 'Sayed Jalal', 'Said Tayeb Jawad', 'Sayed Jalal Karim', 'Hafizullah Shabaz Khail', 'Masoud Kha

In [70]:
duplicate_titles = print_duplicates(article_titles)  # compute the duplicates once and reuse the result
print(f"There are a total of {len(duplicate_titles)} duplicate Wikipedia article names.")
print("Here is a list of all duplicate Wikipedia article names...")
print(duplicate_titles)

There are a total of 41 duplicate Wikipedia article names.
Here is a list of all duplicate Wikipedia article names...
['Count Václav Antonín Chotek of Chotkov and Vojnín', 'Eduard Hedvicek', 'Leopold, Count von Thun und Hohenstein', 'Ibrahim Harun', 'José Francisco Barrundia', 'Manuel Carrascalão', 'Bak Jungyang', 'Visar Ymeri', 'Torokul Dzhanuzakov', 'Tadeusz Kościuszko', 'Venko Markovski', 'Ashab Uddin Ahmad', 'Moinuddin Ahmed Chowdhury', 'Mohammad Toaha', 'Ali al-Qaradaghi', 'Aleksandr Nikitin (politician, born 1987)', 'José Alejandro de Aycinena', 'Shqiprim Arifi', 'Melko Čingrija', 'Oliver Ivanović', 'Stjepan Mitrov Ljubiša', 'Svetozar Pribićević', 'Goran Rakić', 'Lazar Tomanović', 'Antonín Janoušek', 'Juraj Košút', 'Josip Ferfolja', 'Djama Ali Moussa', 'Antonio Gutiérrez y Ulloa', 'Manuel Marliani', 'Rafael Montoro', 'Bona Malwal', 'George Kongor Arop', 'Luigi Adwok', 'Siricio Iro Wani', 'Abir Al-Sahlani', 'Jacob Magnus Sprengtporten', 'Hrant Maloyan', 'Yat Hwaidi', 'Sergey Abiso

The next cell generates the `.csv` file containing the Wikipedia article quality predictions and ORES scores. It took approximately 2 hours to run in a Google Colab notebook using the CPU runtime.

In [71]:
json_file_path_page_info = "/content/generated_files/articles_page_info.json"
process_ores_scores_from_json(json_file_path_page_info, EMAIL_ADDRESS, ACCESS_TOKEN)

Requesting ORES score for revision ID: 1233202991
Requesting ORES score for revision ID: 1230459615
Requesting ORES score for revision ID: 1225661708
Requesting ORES score for revision ID: 1234741562
This code was previously run, and the data file was retrieved.


# Data Preparation and Processing
This code loads the two generated files into DataFrames so that they can later be joined and analyzed. The JSON file contains metadata on Wikipedia pages (such as page ID, title, and last revision ID), while the CSV file stores the article quality predictions from ORES.

In [23]:
# Load JSON data containing page information into a Python object
with open(json_file_path_page_info, 'r') as file:
    json_data = json.load(file)  # json_data is now a dictionary where keys are article titles and values are details

# Convert JSON data to a list of dictionaries where each dictionary represents an article's details
# and 'article_title' is added as a key to store the title in the DataFrame
data = [{'article_title': title, **details} for title, details in json_data.items()]

# Convert the list of dictionaries into a DataFrame (df_page_info), which stores page information
df_page_info = pd.DataFrame(data)

# Read the ORES scores data from a CSV file and convert it into a DataFrame (df_ores_scores)
csv_file_path_ores_scores = '/content/generated_files/articles_ores_scores.csv'
df_ores_scores = pd.read_csv(csv_file_path_ores_scores)

# Display the first few rows of the page-info DataFrame to verify the data has been loaded correctly
df_page_info.head()

Unnamed: 0,article_title,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,missing,redirect,new
0,Majah Ha Adrif,10483286.0,0,Majah Ha Adrif,wikitext,en,en,ltr,2024-09-30T14:32:18Z,1233203000.0,3188.0,13330265.0,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,https://en.wikipedia.org/w/index.php?title=Maj...,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,,,,
1,Haroon al-Afghani,11966231.0,0,Haroon al-Afghani,wikitext,en,en,ltr,2024-10-05T14:27:29Z,1230460000.0,17027.0,15250816.0,https://en.wikipedia.org/wiki/Haroon_al-Afghani,https://en.wikipedia.org/w/index.php?title=Har...,https://en.wikipedia.org/wiki/Haroon_al-Afghani,,,,
2,Tayyab Agha,46841383.0,0,Tayyab Agha,wikitext,en,en,ltr,2024-10-11T00:26:57Z,1225662000.0,6346.0,46843786.0,https://en.wikipedia.org/wiki/Tayyab_Agha,https://en.wikipedia.org/w/index.php?title=Tay...,https://en.wikipedia.org/wiki/Tayyab_Agha,,,,
3,Khadija Zahra Ahmadi,71600382.0,0,Khadija Zahra Ahmadi,wikitext,en,en,ltr,2024-10-11T00:30:22Z,1234742000.0,2569.0,71610138.0,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,https://en.wikipedia.org/w/index.php?title=Kha...,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,,,,
4,Aziza Ahmadyar,47805901.0,0,Aziza Ahmadyar,wikitext,en,en,ltr,2024-10-08T13:30:38Z,1195651000.0,3790.0,47806200.0,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,https://en.wikipedia.org/w/index.php?title=Azi...,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,,,,
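
The dictionary-to-DataFrame step can be tried on a small stand-in for the JSON file. The titles and values below are made up. The point of the sketch is that a page with no metadata leaves `NaN` in its row, which turns integer columns such as `pageid` and `lastrevid` into floats (a likely reason those columns print with a trailing `.0` above).

```python
import pandas as pd

# Toy stand-in for articles_page_info.json: keys are article titles,
# values are the per-page metadata dictionaries (all values are made up)
json_data = {
    "Article A": {"pageid": 101, "lastrevid": 9001, "length": 3188},
    "Article B": {"pageid": 102, "lastrevid": 9002, "length": 17027},
    "Article C": {"missing": ""},  # a page with no pageid or lastrevid
}

# Same pattern as the notebook: one record per article, with the title added as a column
records = [{"article_title": title, **details} for title, details in json_data.items()]
df_toy_page_info = pd.DataFrame(records)

print(df_toy_page_info)
print(df_toy_page_info.dtypes)
```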


In [24]:
# Display the first few rows of the ORES scores DataFrame to verify the data has been loaded correctly
df_ores_scores.head()

Unnamed: 0,revision_id,quality_prediction,Probability B,Probability C,Probability FA,Probability GA,Probability Start,Probability Stub
0,1233202991,Start,0.114586,0.251813,0.006372,0.017664,0.548056,0.061509
1,1230459615,B,0.416808,0.377938,0.057959,0.089012,0.052925,0.005359
2,1225661708,Start,0.082645,0.247194,0.005548,0.018469,0.594374,0.05177
3,1234741562,Stub,0.019056,0.034997,0.003352,0.009019,0.264906,0.66867
4,1195651393,Start,0.046899,0.098852,0.004757,0.01937,0.712785,0.117338


The error rate is the ratio of articles without a valid `quality_prediction` (that is, with a missing ORES score) to the total number of articles, and it indicates how reliable the ORES score retrieval was. An error rate above 1% would point to problems in data collection, such as API request failures or network issues, that should be investigated.

In this case, the error rate is **0.06%** (4 articles), well below the 1% threshold. The score retrieval was therefore successful, and the data is complete enough for further analysis.

In [25]:
# Count the number of articles with missing 'quality_prediction' values (i.e., articles with no ORES score)
missing_scores_count = df_ores_scores['quality_prediction'].isna().sum()

# Calculate the error rate as the ratio of articles without ORES scores to the total number of articles
error_rate = missing_scores_count / len(df_ores_scores)

# Print the total number of articles without ORES scores
print(f"Total number of articles with missing ORES scores: {missing_scores_count}")

# Print the calculated error rate (should be less than 1% for a good result)
print(f"Error rate of missing ORES scores: {error_rate:.4%}")


Total number of articles with missing ORES scores: 4
Error rate of missing ORES scores: 0.0563%
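
The same check can be sketched on a toy table to show how the threshold comparison works. The data and the `threshold` and `needs_investigation` names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy scores table: one of eight revisions has no prediction (made-up data)
toy_scores = pd.DataFrame({
    "revision_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "quality_prediction": ["Start", "B", None, "Stub", "Start", "GA", "C", "Start"],
})

missing = int(toy_scores["quality_prediction"].isna().sum())
rate = missing / len(toy_scores)

threshold = 0.01  # the 1% limit described above
needs_investigation = rate > threshold

print(f"{missing} missing of {len(toy_scores)} -> {rate:.2%}")
print("Investigate data collection:", needs_investigation)
```

With one missing score out of eight, the toy rate is far above the limit, whereas the notebook's 4 missing scores stay well under it.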


# Data Manipulation and Table Joins


All joins against the source table are left joins, so that no article is lost during merging. The source table contains every Wikipedia article in the analysis, and a left join retains each of them even when the other table has no corresponding data (such as an ORES score). Keeping the full set of articles lets us handle missing values explicitly later instead of silently discarding entries.

In [26]:
# Perform a left join between the politician DataFrame and the page info DataFrame
# - 'politician_df' is the source table containing a list of politicians
# - 'df_page_info' contains metadata about the Wikipedia articles
# The join is done on 'name' from politician_df and 'article_title' from df_page_info
# This ensures that all rows from politician_df are retained, even if there is no matching page info.

merged_df_join_with_source = pd.merge(politician_df,
                                      df_page_info,
                                      left_on='name',
                                      right_on='article_title',
                                      how='left')
merged_df_join_with_source.head()

Unnamed: 0,name,url,country,article_title,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,...,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,missing,redirect,new
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,10483286.0,0,Majah Ha Adrif,wikitext,en,en,...,1233203000.0,3188.0,13330265.0,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,https://en.wikipedia.org/w/index.php?title=Maj...,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,,,,
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,11966231.0,0,Haroon al-Afghani,wikitext,en,en,...,1230460000.0,17027.0,15250816.0,https://en.wikipedia.org/wiki/Haroon_al-Afghani,https://en.wikipedia.org/w/index.php?title=Har...,https://en.wikipedia.org/wiki/Haroon_al-Afghani,,,,
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,46841383.0,0,Tayyab Agha,wikitext,en,en,...,1225662000.0,6346.0,46843786.0,https://en.wikipedia.org/wiki/Tayyab_Agha,https://en.wikipedia.org/w/index.php?title=Tay...,https://en.wikipedia.org/wiki/Tayyab_Agha,,,,
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,Khadija Zahra Ahmadi,71600382.0,0,Khadija Zahra Ahmadi,wikitext,en,en,...,1234742000.0,2569.0,71610138.0,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,https://en.wikipedia.org/w/index.php?title=Kha...,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,,,,
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,Aziza Ahmadyar,47805901.0,0,Aziza Ahmadyar,wikitext,en,en,...,1195651000.0,3790.0,47806200.0,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,https://en.wikipedia.org/w/index.php?title=Azi...,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,,,,
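
A minimal sketch of the left-join behavior on made-up data: every source row survives, and rows without a match get `NaN` in the joined columns. The `validate="many_to_one"` argument is an optional safeguard added here, not something the notebook uses; it makes pandas raise an error if a title appears more than once on the right-hand side, which would otherwise multiply rows.

```python
import pandas as pd

# Toy source table and toy page-info table (made-up values)
toy_politicians = pd.DataFrame({
    "name": ["Article A", "Article B", "Article C"],
    "country": ["X", "X", "Y"],
})
toy_page_info = pd.DataFrame({
    "article_title": ["Article A", "Article B"],
    "lastrevid": [9001, 9002],
})

left_joined = pd.merge(
    toy_politicians,
    toy_page_info,
    left_on="name",
    right_on="article_title",
    how="left",
    validate="many_to_one",  # optional check: right-hand titles must be unique
)

print(left_joined)
print("Rows kept:", len(left_joined), "of", len(toy_politicians))
```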


In [27]:
# Print the number of rows in the merged DataFrame, which represents the total number of Wikipedia articles in the source
print(f"The merged DataFrame contains {len(merged_df_join_with_source)} rows, reflecting the total number of Wikipedia articles in the source.")

The merged DataFrame contains 7155 rows, reflecting the total number of Wikipedia articles in the source.


In [54]:
# Select the names of articles whose 'lastrevid' is missing, then report and display them
articles_without_revid = merged_df_join_with_source[merged_df_join_with_source['lastrevid'].isna()]['name']
print("The following is a list of Wikipedia articles that do not have a revision id associated with them.")
print(f"There are {len(articles_without_revid)} of these Wikipedia articles.")
articles_without_revid

The following is a list of Wikipedia articles that do not have a revision id associated with them.
There are 8 of these Wikipedia articles


Unnamed: 0,name
430,Barbara Eibinger-Miedl
516,Mehrali Gasimov
1200,Kyaw Myint
1342,André Ngongang Ouandji
1955,Tomás Pimentel
2427,Richard Sumah
4496,Segun ''Aeroland'' Adewale
5719,Bashir Bililiqo


In [28]:
# Perform a left join between the merged DataFrame containing politician data
# and the DataFrame with ORES scores, matching 'lastrevid' from the politician data
# to 'revision_id' from the ORES scores. The left join ensures we keep all rows from
# merged_df_join_with_source while adding matching ORES scores.
merged_df_source_with_scores = pd.merge(
    merged_df_join_with_source,
    df_ores_scores,
    left_on='lastrevid',
    right_on='revision_id',
    how='left'
)

# Select only the relevant columns to retain in the final DataFrame
# This includes the article title, country, page ID, title, revision ID, and quality prediction.
merged_df_source_with_scores = merged_df_source_with_scores[[
    'article_title',
    'country',
    'pageid',
    'title',
    'revision_id',
    'quality_prediction'
]]

merged_df_source_with_scores.head()

Unnamed: 0,article_title,country,pageid,title,revision_id,quality_prediction
0,Majah Ha Adrif,Afghanistan,10483286.0,Majah Ha Adrif,1233203000.0,Start
1,Haroon al-Afghani,Afghanistan,11966231.0,Haroon al-Afghani,1230460000.0,B
2,Tayyab Agha,Afghanistan,46841383.0,Tayyab Agha,1225662000.0,Start
3,Khadija Zahra Ahmadi,Afghanistan,71600382.0,Khadija Zahra Ahmadi,1234742000.0,Stub
4,Aziza Ahmadyar,Afghanistan,47805901.0,Aziza Ahmadyar,1195651000.0,Start


In [29]:
print(f"The merged DataFrame contains {len(merged_df_source_with_scores)} rows, reflecting the total number of Wikipedia articles in the source.")

The merged DataFrame contains 7155 rows, reflecting the total number of Wikipedia articles in the source.


The next cell lists the articles that have no quality prediction after the join. The count of 12 is consistent with the 8 articles that have no revision ID plus the 4 revisions for which no ORES score was retrieved.

In [55]:
print("The following is a list of Wikipedia articles that do not have a quality prediction associated with them.")
print(f"There are {len(merged_df_source_with_scores[merged_df_source_with_scores['quality_prediction'].isna()]['article_title'])} of these Wikipedia articles.")
merged_df_source_with_scores[merged_df_source_with_scores['quality_prediction'].isna()]['article_title']

The following is a list of Wikipedia articles that do not have a quality prediction associated with them.
There are 12 of these Wikipedia articles.


Unnamed: 0,article_title
74,Mohammad Hashem Taufiqui
430,Barbara Eibinger-Miedl
516,Mehrali Gasimov
653,Abdul Halim Ghaznavi
1200,Kyaw Myint
1342,André Ngongang Ouandji
1955,Tomás Pimentel
2129,Taye Atskeselassie
2267,Hélène Pelosse
2427,Richard Sumah


The following cell maps each country to its region using the hierarchical layout of the `population_by_country_AUG.2024.csv` dataset. Regions are written in uppercase letters and countries in mixed case. Each country is assigned to exactly one region: the closest (lowest-level) region heading that precedes it in the file. Keeping this mapping one-to-one is what makes the later region-based aggregations consistent.

In [30]:
# Initialize variables to store the current region and a list to hold country-region mappings
current_region = None
country_region_mapping = []

# Iterate through each row in the population DataFrame to map countries to their respective regions
for index, row in population_df.iterrows():
    # Extract the geography and population from the current row
    geography = row['Geography']
    population = row['Population']

    # Check if the geography is in all uppercase letters, indicating it is a region
    if geography.isupper():
        current_region = geography  # Update the current region
    else:
        # If the geography is a country name, map it to the current region
        if current_region:
            # Append a list containing the country, its region, and the population to the mapping list
            country_region_mapping.append([geography, current_region, population])

# Convert the list of country-region mappings into a DataFrame for easier analysis
df_mapped_country_to_region = pd.DataFrame(country_region_mapping, columns=['Country', 'Region', 'Population'])

# Display the first few rows of the resulting DataFrame to verify the mapping
df_mapped_country_to_region.head()

Unnamed: 0,Country,Region,Population
0,Algeria,NORTHERN AFRICA,46.8
1,Egypt,NORTHERN AFRICA,105.2
2,Libya,NORTHERN AFRICA,6.9
3,Morocco,NORTHERN AFRICA,37.0
4,Sudan,NORTHERN AFRICA,48.1
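
The same mapping can also be written without an explicit loop. The sketch below is an alternative to the `iterrows` version, not the notebook's method: it marks the uppercase rows, forward-fills the most recent region name, and keeps only the country rows. It assumes, as the loop does, that each country follows its region heading in the file. The continent rows and the `0.0` placeholders on region rows are made up for illustration.

```python
import pandas as pd

# Toy population table in the same layout: uppercase rows are regions,
# mixed-case rows are countries (region-row populations are placeholders)
toy_population = pd.DataFrame({
    "Geography": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "ASIA", "SOUTH ASIA", "Afghanistan"],
    "Population": [0.0, 0.0, 46.8, 105.2, 0.0, 0.0, 42.4],
})

# True for region rows, False for country rows
is_region = toy_population["Geography"].str.isupper()

# Keep region names only on region rows, then carry each one down to the rows below it
nearest_region = toy_population["Geography"].where(is_region).ffill()

toy_mapping = (
    toy_population.assign(Region=nearest_region)
    .loc[~is_region, ["Geography", "Region", "Population"]]
    .rename(columns={"Geography": "Country"})
    .reset_index(drop=True)
)

print(toy_mapping)
```

Because the forward fill always carries the most recent uppercase row, a country listed under `NORTHERN AFRICA` is mapped to that region and not to the `AFRICA` heading above it.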


An outer join on country name keeps every record from both the Wikipedia dataset and the population dataset, even when a country appears in only one of them. The `_merge` indicator column shows which side each row came from, which makes the unmatched countries easy to identify. The list of unmatched countries will be saved as `wp_countries-no_match.txt`.

In [31]:
# This merge combines the previously merged DataFrame containing article scores with the country-region mapping DataFrame
# The 'outer' join ensures that all records from both DataFrames are included, regardless of whether they match
merged_df_politician_region = pd.merge(
    merged_df_source_with_scores,
    df_mapped_country_to_region,
    left_on="country",
    right_on='Country',
    how='outer',
    indicator=True
)

merged_df_politician_region.head()

Unnamed: 0,article_title,country,pageid,title,revision_id,quality_prediction,Country,Region,Population,_merge
0,Majah Ha Adrif,Afghanistan,10483286.0,Majah Ha Adrif,1233203000.0,Start,Afghanistan,SOUTH ASIA,42.4,both
1,Haroon al-Afghani,Afghanistan,11966231.0,Haroon al-Afghani,1230460000.0,B,Afghanistan,SOUTH ASIA,42.4,both
2,Tayyab Agha,Afghanistan,46841383.0,Tayyab Agha,1225662000.0,Start,Afghanistan,SOUTH ASIA,42.4,both
3,Khadija Zahra Ahmadi,Afghanistan,71600382.0,Khadija Zahra Ahmadi,1234742000.0,Stub,Afghanistan,SOUTH ASIA,42.4,both
4,Aziza Ahmadyar,Afghanistan,47805901.0,Aziza Ahmadyar,1195651000.0,Start,Afghanistan,SOUTH ASIA,42.4,both


In [32]:
print(f"The length of the merged DataFrame after the outer join is: {len(merged_df_politician_region)}")


The length of the merged DataFrame after the outer join is: 7198


In [33]:
# Perform an inner join between the DataFrames on the country names
# This join will only include rows where there are matching countries in both DataFrames.
# The indicator=True parameter adds a column to the result DataFrame that shows
# the source of each row: whether it came from the left, right, or both DataFrames.
merged_df_politician_region_inner_join = pd.merge(
    merged_df_source_with_scores,
    df_mapped_country_to_region,
    left_on="country",
    right_on='Country',
    how='inner',
    indicator=True
)

# Display the first few rows of the resulting DataFrame to verify the merge
merged_df_politician_region_inner_join.head()


Unnamed: 0,article_title,country,pageid,title,revision_id,quality_prediction,Country,Region,Population,_merge
0,Majah Ha Adrif,Afghanistan,10483286.0,Majah Ha Adrif,1233203000.0,Start,Afghanistan,SOUTH ASIA,42.4,both
1,Haroon al-Afghani,Afghanistan,11966231.0,Haroon al-Afghani,1230460000.0,B,Afghanistan,SOUTH ASIA,42.4,both
2,Tayyab Agha,Afghanistan,46841383.0,Tayyab Agha,1225662000.0,Start,Afghanistan,SOUTH ASIA,42.4,both
3,Khadija Zahra Ahmadi,Afghanistan,71600382.0,Khadija Zahra Ahmadi,1234742000.0,Stub,Afghanistan,SOUTH ASIA,42.4,both
4,Aziza Ahmadyar,Afghanistan,47805901.0,Aziza Ahmadyar,1195651000.0,Start,Afghanistan,SOUTH ASIA,42.4,both


In [34]:
print(f"The length of the merged DataFrame after the inner join is: {len(merged_df_politician_region_inner_join)}")

The length of the merged DataFrame after the inner join is: 7013
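
The relationship between the two row counts can be seen on a toy example. The data below is made up (`Sylvania` and `Freedonia` are invented country names); the `_merge` column is the one pandas adds when `indicator=True`.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    "article_title": ["Article A", "Article B", "Article C"],
    "country": ["Afghanistan", "Afghanistan", "Sylvania"],  # 'Sylvania' is made up
})
toy_regions = pd.DataFrame({
    "Country": ["Afghanistan", "Freedonia"],  # 'Freedonia' is made up
    "Region": ["SOUTH ASIA", "MADE-UP REGION"],
})

outer = pd.merge(toy_articles, toy_regions, left_on="country", right_on="Country",
                 how="outer", indicator=True)
inner = pd.merge(toy_articles, toy_regions, left_on="country", right_on="Country",
                 how="inner")

# Count rows by origin: 'both', 'left_only' (articles only), 'right_only' (population only)
merge_counts = outer["_merge"].value_counts()
print(merge_counts)

# The outer join is the inner join plus the rows found on only one side
print(len(outer) == len(inner) + merge_counts["left_only"] + merge_counts["right_only"])
```

The notebook's own counts fit the same identity: 7013 matched rows, plus 142 article rows with no population match (7155 − 7013), plus 43 population rows with no articles, gives the 7198 rows of the outer join.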


The next cell inspects the outer-joined DataFrame for rows that matched on only one side. Rows with a null `article_title` come from the population data alone, so they identify countries with no politician articles; the cell prints the region of each such country. Rows with a null `Region` come from the Wikipedia data alone, so they identify articles whose country has no population entry.

The cell then extracts the unique countries from the population and politician DataFrames, compares the two sets to count the countries present in one file but not in the other, and consolidates the unmatched countries into a single list.



In [35]:
# Rows with a null 'article_title' come only from the population data, i.e. countries
# with no politician articles. Collect the 'Region' of each such row.
regions_with_null_title = merged_df_politician_region[merged_df_politician_region['article_title'].isnull()]['Region']

# Print a header for clarity
print("\nRegions with missing 'article_title':")
# Convert the Series to a list and print it (one entry per country, so regions repeat)
print(regions_with_null_title.tolist())

# Output the number of such rows (one per country without articles)
print(f"Total number of regions with missing 'article_title': {len(regions_with_null_title)}")

# Rows with a null 'Region' come only from the politician data, i.e. articles whose
# country has no match in the population data. Collect their 'article_title' values.
titles_with_null_region = merged_df_politician_region[merged_df_politician_region['Region'].isnull()]['article_title']

# The label below says 'title', but the rows selected are those where 'Region' is null
print("\nNames where 'title' is null:")
# Convert the Series to a list and print the article titles
print(titles_with_null_region.tolist())
# Output the count of article titles with missing regions
print(len(titles_with_null_region))

# Extract unique countries from each dataset
population_df_unique = df_mapped_country_to_region['Country'].unique()  # Unique countries from the population dataset
politician_df_unique = merged_df_source_with_scores['country'].unique()  # Unique countries from the politician dataset

# Calculate and print the number of countries in the population file not in the politician file
print("Number of countries in population file not in politician file:", len(set(population_df_unique) - set(politician_df_unique)))

# Calculate and print the number of countries in the politician file not in the population file
print("Number of countries in politician file not in population file:", len(set(politician_df_unique) - set(population_df_unique)))

# Identify countries that appear in only one of the two datasets
countries_with_no_match = list(set(population_df_unique) - set(politician_df_unique)) + list(set(politician_df_unique) - set(population_df_unique))
# Output the total count of countries with no match
print("Total number of countries with no matches:",len(countries_with_no_match))



Regions with missing 'article_title':
['SOUTHERN EUROPE', 'OCEANIA', 'SOUTHEAST ASIA', 'NORTHERN AMERICA', 'EAST ASIA', 'EAST ASIA', 'CARIBBEAN', 'NORTHERN EUROPE', 'CARIBBEAN', 'OCEANIA', 'SOUTH AMERICA', 'OCEANIA', 'WESTERN ASIA', 'CARIBBEAN', 'OCEANIA', 'WESTERN AFRICA', 'NORTHERN EUROPE', 'NORTHERN EUROPE', 'CARIBBEAN', 'OCEANIA', 'EAST ASIA', 'EAST ASIA', 'WESTERN EUROPE', 'CARIBBEAN', 'EASTERN AFRICA', 'EASTERN AFRICA', 'CENTRAL AMERICA', 'OCEANIA', 'WESTERN EUROPE', 'OCEANIA', 'OCEANIA', 'OCEANIA', 'SOUTHEAST ASIA', 'CARIBBEAN', 'EASTERN AFRICA', 'EASTERN EUROPE', 'SOUTHERN EUROPE', 'MIDDLE AFRICA', 'SOUTH AMERICA', 'NORTHERN EUROPE', 'NORTHERN AMERICA', 'NORTHERN AFRICA', 'SOUTHERN AFRICA']
Total number of regions with missing 'article_title': 43

Names where 'title' is null:
['Botche Candé', 'Juliano Fernandes', 'Teodora Inácia Gomes', 'Desejado Lima da Costa', 'Aristide Menezes', 'Florentino Mendes Pereira', 'Carmelita Pires', 'Agnelo Regalla', 'An Kyung-duk', 'Bae Deok-kwan

The next cell writes the unmatched countries to the output file `wp_countries-no_match.txt`, one country per line. The file lists the 46 countries that appear in only one of the two datasets.

In [36]:
# Create folder 'generated_output' if it doesn't exist
output_folder = 'generated_output'
os.makedirs(output_folder, exist_ok=True)

# Write the list to a .txt file, one element per line
txt_file_path = os.path.join(output_folder, 'wp_countries-no_match.txt')
with open(txt_file_path, 'w') as f:
    for country in countries_with_no_match:
        f.write(f"{country}\n")
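
One detail worth noting: a list built from sets has no guaranteed order, so the file can come out in a different order on each run. The sketch below shows a variant that sorts the names first so the output is stable. The country names are made up, and a temporary directory stands in for `generated_output`.

```python
import os
import tempfile

# Made-up country names standing in for the two datasets
population_countries = ["Afghanistan", "Freedonia", "Ruritania"]
politician_countries = ["Afghanistan", "Sylvania"]

only_in_population = set(population_countries) - set(politician_countries)
only_in_politicians = set(politician_countries) - set(population_countries)

# The union of the two one-sided differences is the symmetric difference;
# sorting gives a stable order, since set iteration order is arbitrary
no_match = sorted(only_in_population | only_in_politicians)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wp_countries-no_match.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(no_match) + "\n")
    with open(path, encoding="utf-8") as f:
        written = f.read().splitlines()

print(written)
```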

With the unmatched countries identified, the final step keeps only the articles whose country matched a population entry (the inner join) and writes them to the consolidated CSV file `wp_politicians_by_country.csv`, which is used for the analysis below.

In [39]:
# Create the final DataFrame with relevant columns for analysis
final_columns = ['article_title', 'quality_prediction', 'country', 'Region', 'Population', 'revision_id']
merged_df_politician_region_final = merged_df_politician_region_inner_join[final_columns].copy()

# Rename columns for clarity and consistency
merged_df_politician_region_final.rename(columns={
    'Region': 'region',
    'Population': 'population',
    'quality_prediction': 'article_quality'
}, inplace=True)

# Convert 'revision_id' to numeric, coercing errors to NaN
merged_df_politician_region_final.loc[:, 'revision_id'] = pd.to_numeric(
    merged_df_politician_region_final['revision_id'],
    errors='coerce'
)

# Rearrange columns in a specific order for analysis
merged_df_politician_region_final = merged_df_politician_region_final[
    ['country', 'region', 'population', 'article_title', 'revision_id', 'article_quality']
]

# Print the number of entries in the final DataFrame
print("The number of entries in the final DataFrame to be used for analysis:", len(merged_df_politician_region_final))

# Save the final DataFrame to a CSV file for further analysis
output_file_path = "/content/generated_output/wp_politicians_by_country.csv"
merged_df_politician_region_final.to_csv(output_file_path, index=False)


The number of entries in the final DataFrame to be used for analysis: 7013
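
The `pd.to_numeric(..., errors='coerce')` call in the cell above leaves `revision_id` as a floating-point column whenever any ID is missing. If integer IDs are preferred in the saved file, one option is pandas' nullable `Int64` dtype. This is an alternative sketch on made-up values, not what the notebook does.

```python
import pandas as pd

# Made-up column with two valid IDs, one missing value, and one unparseable value
toy = pd.DataFrame({"revision_id": ["1233202991", "1230459615", None, "not-a-number"]})

# Same call as the notebook: unparseable entries become NaN, so the result is float64
as_float = pd.to_numeric(toy["revision_id"], errors="coerce")

# Nullable integer dtype: keeps whole-number IDs and represents gaps as <NA>
as_nullable_int = as_float.astype("Int64")

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int)
```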


# Data Analysis

The analysis consists of calculating total-articles-per-capita (the number of articles relative to population) and high-quality-articles-per-capita (the number of high-quality articles relative to population). Because the population column is expressed in millions, the resulting values are articles per million people.

Specifically, we aim to answer the following questions:

1. **Top 10 countries by coverage:** The 10 countries with the highest total articles per capita (in descending order).
2. **Bottom 10 countries by coverage:** The 10 countries with the lowest total articles per capita (in ascending order).
3. **Top 10 countries by high quality:** The 10 countries with the highest high quality articles per capita (in descending order).
4. **Bottom 10 countries by high quality:** The 10 countries with the lowest high quality articles per capita (in ascending order).
5. **Geographic regions by total coverage:** A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. **Geographic regions by high quality coverage:** A rank ordered list of geographic regions (in descending order) by high quality articles per capita.


In [40]:
# Filter out rows where population is 0 for the calculation
filtered_df = merged_df_politician_region_final.loc[merged_df_politician_region_final['population'] > 0]

In [41]:
# Calculate the number of articles per capita for each country.
# Group the data by 'country' and divide the number of articles by the population of each country.
articles_per_capita = filtered_df.groupby('country')['article_title'].count() / filtered_df.groupby('country')['population'].first()

#  Get the top 10 countries based on articles per capita.
# Sort the calculated values in descending order and take the first 10 countries.
top_10_countries = articles_per_capita.sort_values(ascending=False).head(10).reset_index()
top_10_countries.columns = ['Country', 'Articles per Capita']

# Get the bottom 10 countries based on articles per capita.
# Sort the calculated values in ascending order to find countries with the lowest coverage.
bottom_10_countries = articles_per_capita.sort_values(ascending=True).head(10).reset_index()
bottom_10_countries.columns = ['Country', 'Articles per Capita']


In [42]:
top_10_countries

Unnamed: 0,Country,Articles per Capita
0,Antigua and Barbuda,330.0
1,Federated States of Micronesia,140.0
2,Marshall Islands,130.0
3,Tonga,100.0
4,Barbados,83.333333
5,Seychelles,60.0
6,Montenegro,60.0
7,Bhutan,55.0
8,Maldives,55.0
9,Samoa,40.0


In [43]:
bottom_10_countries

Unnamed: 0,Country,Articles per Capita
0,China,0.011337
1,India,0.105698
2,Ghana,0.117302
3,Saudi Arabia,0.135501
4,Zambia,0.148515
5,Norway,0.181818
6,Israel,0.204082
7,Egypt,0.304183
8,Cote d'Ivoire,0.323625
9,Ethiopia,0.347826


For questions 1 and 2 we calculate how many articles exist per capita for each country, allowing us to identify countries with relatively high or low Wikipedia coverage.

### Top 10 Countries by Coverage
The country with the highest number of Wikipedia articles per capita is **Antigua and Barbuda**, with **330 articles per million people**. This indicates a very high level of coverage relative to its population. In contrast, **Samoa** holds the 10th spot in the top 10, with **40 articles per million people**.

### Bottom 10 Countries by Coverage
The country with the lowest Wikipedia articles per capita is **China**, with only **0.011 articles per million people**, reflecting very low coverage relative to its massive population. **Ethiopia** is ranked 10th in the bottom list, with **0.348 articles per million people**, which is still quite low compared to the top-ranked countries.

High per capita values could indicate either a high number of articles or a relatively small population. The top 10 list consists mostly of small nations, where even a few dozen articles produce a large ratio. The bottom 10 list highlights larger countries or those with few articles relative to their population size. Countries with no politician articles at all were dropped by the inner join and therefore do not appear in the bottom 10.
We could hypothesize that countries with high per capita coverage have a high concentration of contributors, while countries with low per capita coverage face barriers such as language, limited access to internet resources, or less emphasis on contributing to platforms like Wikipedia.
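
The per-country calculation can be sketched on made-up data to make the units explicit. The country names and figures are invented, and the `articles_per_person` column is added only to show the conversion from the per-million values reported in the tables.

```python
import pandas as pd

# Made-up data: one row per article, population in millions
toy = pd.DataFrame({
    "country": ["Smallland"] * 3 + ["Bigland"] * 2 + ["Nopopland"],
    "population": [0.5] * 3 + [100.0] * 2 + [0.0],
    "article_title": ["a", "b", "c", "d", "e", "f"],
})

# Drop countries with a population of 0 to avoid dividing by zero
with_population = toy[toy["population"] > 0]

per_country = with_population.groupby("country").agg(
    articles=("article_title", "count"),
    population_millions=("population", "first"),
)
per_country["articles_per_million"] = per_country["articles"] / per_country["population_millions"]
per_country["articles_per_person"] = per_country["articles_per_million"] / 1_000_000

print(per_country.sort_values("articles_per_million", ascending=False))
```

Three articles in a country of half a million people give 6 articles per million, while two articles in a country of 100 million give 0.02. That is the same small-population effect visible in the top 10 table.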

In [44]:
# Filter for high-quality articles.
# High-quality articles are defined as those marked 'FA' (Featured Articles) or 'GA' (Good Articles).
# Use filtered_df (population > 0), as in the coverage calculation, to avoid dividing by zero.
high_quality_articles = filtered_df[filtered_df['article_quality'].isin(['FA', 'GA'])]

# Calculate high-quality articles per capita.
# Group by 'country' and divide the high-quality article count by the country's population.
# Countries with no FA/GA articles have no count, so their result is NaN; NaN values sort last
# and therefore do not appear in the top or bottom 10.
high_quality_articles_per_capita = high_quality_articles.groupby('country')['article_title'].count() / filtered_df.groupby('country')['population'].first()

# Get the top 10 countries by high-quality article coverage.
top_10_high_quality_countries = high_quality_articles_per_capita.sort_values(ascending=False).head(10).reset_index()
top_10_high_quality_countries.columns = ['Country', 'Quality per Capita']

# Get the bottom 10 countries by high-quality article coverage.
top_10_bottom_quality_countries = high_quality_articles_per_capita.sort_values(ascending=True).head(10).reset_index()
top_10_bottom_quality_countries.columns = ['Country', 'Quality per Capita']

top_10_high_quality_countries


Unnamed: 0,Country,Quality per Capita
0,Montenegro,5.0
1,Luxembourg,2.857143
2,Albania,2.592593
3,Kosovo,2.352941
4,Maldives,1.666667
5,Lithuania,1.37931
6,Croatia,1.315789
7,Guyana,1.25
8,Palestinian Territory,1.090909
9,Slovenia,0.952381


In [45]:
top_10_bottom_quality_countries

Unnamed: 0,Country,Quality per Capita
0,Bangladesh,0.005764
1,Egypt,0.009506
2,Ethiopia,0.01581
3,Japan,0.016064
4,Pakistan,0.016632
5,Colombia,0.019157
6,Congo DR,0.01955
7,Vietnam,0.020222
8,Uganda,0.020576
9,Algeria,0.021368


Questions 3 and 4 analyze the number of high-quality Wikipedia articles per capita.

By filtering for 'FA' and 'GA' articles, we consider only articles that ORES predicts to be of Featured or Good quality.
The top 10 list shows the countries with the most high-quality politician articles relative to their population.
The bottom 10 list shows countries with limited high-quality documentation, which could reflect disparities in resources or focus among contributors.

### Top 10 Countries by High-Quality Articles per Capita
**Montenegro** has the highest number of high-quality articles per capita, with **5 articles per million people** in the FA or GA category. This may reflect a focus on well-researched content, although a small population also amplifies the ratio. **Slovenia** ranks 10th in the top list, with **0.95 high-quality articles per million people**.

### Bottom 10 Countries by High-Quality Articles per Capita
**Bangladesh** ranks the lowest among the countries listed, with **0.0058 high-quality articles per million people**, indicating a significant gap in the availability of well-curated content. **Algeria** is 10th in the list, with **0.021 high-quality articles per million people**. Countries with no FA or GA articles at all do not appear in this table: the calculation produces a value only for countries with at least one high-quality article, so the true bottom of the ranking is the set of countries with none.

High-quality articles are often a result of active contributors, a robust editing culture, or a focus on specific topics. Lower coverage might indicate a lack of resources or contributors, or less emphasis on in-depth article curation.
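
Countries with zero high-quality articles fall out of a ratio computed from the filtered rows alone, because they have no row left to count. The sketch below, on made-up data, shows the effect and one possible way to keep such countries in the ranking with a value of 0 by reindexing the counts. It is an illustration, not the method behind the tables above.

```python
import pandas as pd

# Made-up data: country C has no FA/GA article
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "population": [2.0, 2.0, 2.0, 50.0, 50.0, 10.0],  # millions
    "article_quality": ["GA", "Stub", "FA", "Start", "GA", "Stub"],
})

population = toy.groupby("country")["population"].first()
is_high_quality = toy["article_quality"].isin(["FA", "GA"])
hq_counts = toy[is_high_quality].groupby("country").size()

# Dividing directly leaves NaN for C, because C is absent from hq_counts
naive = hq_counts / population

# Reindexing against every country restores C with a count of 0
hq_counts_full = hq_counts.reindex(population.index, fill_value=0)
hq_per_million = hq_counts_full / population

print(naive)
print(hq_per_million.sort_values())
```

With the reindexed counts, country C ranks last with 0 high-quality articles per million instead of disappearing from the table.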

In [46]:
# Calculate articles per capita for each geographic region.
# Group data by 'region' and divide the region's article count by a population figure.
# Note: .first() takes the population of the first country listed in each region,
# so the denominator is that country's population rather than the region's total.
regions_total_coverage = merged_df_politician_region_final.groupby('region')['article_title'].count() / merged_df_politician_region_final.groupby('region')['population'].first()

# Rank regions by total coverage.
# Sort regions by articles per capita in descending order to see which regions have the most coverage.
regions_total_coverage_ranked = regions_total_coverage.sort_values(ascending=False).reset_index()
regions_total_coverage_ranked.columns = ['Region', 'Articles per Capita']

regions_total_coverage_ranked

Unnamed: 0,Region,Articles per Capita
0,CARIBBEAN,2190.0
1,OCEANIA,720.0
2,CENTRAL AMERICA,376.0
3,SOUTHERN EUROPE,295.185185
4,WESTERN ASIA,203.333333
5,NORTHERN EUROPE,136.428571
6,EASTERN EUROPE,77.065217
7,WESTERN EUROPE,54.130435
8,EASTERN AFRICA,50.378788
9,SOUTHERN AFRICA,45.555556


In [47]:

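# Calculate high-quality articles per capita for each geographic region.
# Note: .first() takes the population of the first country listed in each region,
# so the denominator is that country's population rather than the region's total.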
high_quality_articles_region = merged_df_politician_region_final[merged_df_politician_region_final['article_quality'].isin(['FA', 'GA'])]
regions_high_quality_coverage = high_quality_articles_region.groupby('region')['article_title'].count() / merged_df_politician_region_final.groupby('region')['population'].first()

# Rank regions by high-quality article coverage.
# Sort regions by high-quality articles per capita in descending order.
regions_high_quality_coverage_ranked = regions_high_quality_coverage.sort_values(ascending=False).reset_index()
regions_high_quality_coverage_ranked.columns = ['Region', 'High Quality Coverage']

regions_high_quality_coverage_ranked

Unnamed: 0,Region,High Quality Coverage
0,CARIBBEAN,90.0
1,CENTRAL AMERICA,20.0
2,SOUTHERN EUROPE,19.62963
3,OCEANIA,10.0
4,WESTERN ASIA,9.0
5,NORTHERN EUROPE,6.428571
6,EASTERN EUROPE,4.130435
7,SOUTHERN AFRICA,2.962963
8,WESTERN EUROPE,2.282609
9,EASTERN AFRICA,1.287879


Questions 5 and 6 extend the analysis to geographic regions, providing insight into how coverage varies across broader areas.

By comparing total articles and high-quality articles per capita, we can see which regions have the most comprehensive documentation.
This helps identify geographic disparities in coverage and highlights regions with a focus on high-quality contributions.


### Geographic Regions by Total Coverage
By this measure, the region with the highest coverage of Wikipedia articles per capita is the **Caribbean**, with **2,190 articles per million people**. On the lower end, **East Asia** has the fewest articles per capita, with only **0.108 articles per million people**, indicating minimal representation relative to population size.

### Geographic Regions by High-Quality Coverage
The **Caribbean** also leads in high-quality article coverage, with **90 high-quality articles per million people**. **East Asia** again has the lowest high-quality article coverage, with **0.0021 high-quality articles per million people**, highlighting a significant gap in detailed, high-quality content.

These regional figures should be read with caution. The calculation divides each region's article count by the population of the first country listed in that region (the `.first()` call), not by the region's total population, so regions whose first country is small receive inflated values. With that caveat, regions like the Caribbean or Oceania may appear at the top due to a combination of smaller population sizes and active local editing communities. Conversely, regions like East Asia might have a high population but relatively fewer contributors to English Wikipedia, which lowers their per capita metrics.
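
The choice of denominator matters a great deal for the regional tables. The sketch below, on made-up data, contrasts `groupby('region')['population'].first()`, which returns one country's population, with a regional total obtained by counting each country once and summing. The second variant is a suggested alternative rather than the calculation behind the figures above, and it covers only countries present in the article table; a fuller version would sum over the population table instead.

```python
import pandas as pd

# Made-up data: region R1 has a tiny country (A) and a large one (B); R2 has one country (C)
toy = pd.DataFrame({
    "region": ["R1"] * 4 + ["R2"] * 2,
    "country": ["A", "A", "A", "B", "C", "C"],
    "population": [0.5, 0.5, 0.5, 9.5, 50.0, 50.0],  # millions
    "article_title": list("abcdef"),
})

articles = toy.groupby("region")["article_title"].count()

# Denominator 1: population of the first country listed in each region
first_country_population = toy.groupby("region")["population"].first()

# Denominator 2: regional total, counting each country's population exactly once
region_population = toy.drop_duplicates("country").groupby("region")["population"].sum()

per_million_first = articles / first_country_population
per_million_total = articles / region_population

print(pd.DataFrame({"first_country": per_million_first, "region_total": per_million_total}))
```

For R1 the first-country denominator gives 8 articles per million, while the regional total gives 0.4. For R2, which has a single country, the two agree.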
