# Wikipedia Politician Articles: Bias

This notebook explores bias in data using Wikipedia articles about political figures from different countries. The analysis aims to examine the coverage of politicians on Wikipedia and the quality of articles about politicians across nations. 

The notebook uses data from two sources:
- A dataset of Wikipedia articles about politicians, drawn from the Wikipedia [Category:Politicians_by_nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality).
- A dataset of country populations, obtained from the [World Population Data Sheet](https://www.prb.org/international/indicator/population/table/).

Additionally, the machine learning service ORES is used to estimate the quality of each article.

## Article Page Info MediaWiki API

The following sub-section contains code to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

### License

This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/).

Modifications to this code were made by Himanshu Naidu on October 13, 2024.

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

The example relies on some constants that help make the code a bit more readable.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

USER_EMAIL = "hnaidu36@uw.edu"

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': f'<{USER_EMAIL}>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
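    # work on a copy so that the shared module-level template is not modified between calls
    request_template = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template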
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [4]:
print(f"Getting page info data for: {ARTICLE_TITLES[3]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[3])
print(json.dumps(info,indent=4))

Getting page info data for: Chinook salmon
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1212891": {
                "pageid": 1212891,
                "ns": 0,
                "title": "Chinook salmon",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2024-10-12T10:13:54Z",
                "lastrevid": 1234351318,
                "length": 53787,
                "watchers": 109,
                "talkid": 3909817,
                "fullurl": "https://en.wikipedia.org/wiki/Chinook_salmon",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Chinook_salmon&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Chinook_salmon"
            }
        }
    }
}
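
The record of interest sits under `query.pages`, keyed by the page id as a string. As a minimal sketch (the helper name `extract_page_fields` is hypothetical, and the sample dictionary simply copies a few fields from the response above), the title, latest revision id and page length could be pulled out like this:

```python
def extract_page_fields(response):
    """Return (title, lastrevid, length) tuples for every page in a page info response."""
    pages = response.get("query", {}).get("pages", {})
    return [(p.get("title"), p.get("lastrevid"), p.get("length")) for p in pages.values()]

# A trimmed copy of the response shown above
sample_response = {
    "batchcomplete": "",
    "query": {"pages": {"1212891": {
        "pageid": 1212891, "ns": 0, "title": "Chinook salmon",
        "lastrevid": 1234351318, "length": 53787}}}
}

print(extract_page_fields(sample_response))
```

Using `.get()` means a record without one of these fields yields `None` for it instead of raising an error, which may or may not be the behavior you want.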


## Requesting ORES scores through LiftWing ML Service API

This sub-section illustrates how to generate article quality estimates for article revisions using the LiftWing version of ORES. The ORES API documentation can be accessed from the main ORES page. The ORES LiftWing documentation is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).

### License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). 

Modifications to this code were made by Himanshu Naidu on October 13, 2024.

In [5]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_ORES_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_ORES_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_ORES_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

### Get your access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - and you should save all three of these. When you dismiss the box they are gone. If you lose any one of them you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


In [6]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I used the dotenv package to load them from
#   a file called '.env' in the same directory as this notebook. The file should look like this:
#
#   WIKIMEDIA_USERNAME="<your_wikimedia_username>"
#   WIKIMEDIA_CLIENT_ID="<your_wikimedia_client_id>"
#   WIKIMEDIA_CLIENT_SECRET="<your_wikimedia_client_secret>"
#   WIKIMEDIA_ACCESS_TOKEN="<your_wikimedia_provided_access_token_its_a_really_long_string>"
#
#   The repository has a file called '.env.example' that you can copy to '.env' and fill in the values.

# USERNAME = "<your_wikimedia_username>"
# ACCESS_TOKEN = "<your_wikimedia_provided_access_token_its_a_really_long_string>"

# Note: The above properties will be assigned in the "Getting Article Quality Predictions" section

### Define a function to make the ORES API request

The API request will be made using a function that encapsulates the call and makes access reusable in other notebooks. The function is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.
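
To illustrate the "two highly probable levels" case, here is a minimal sketch. It assumes the response carries a dictionary mapping each quality level to a probability; the helper name `top_quality_levels`, the `tie_margin` threshold, and the probabilities themselves are all made up for illustration and are not real ORES output.

```python
def top_quality_levels(probabilities, tie_margin=0.05):
    """Return the most probable level and whether the runner-up is within tie_margin of it."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, (best_p - second_p) < tie_margin

# Made-up probabilities for illustration only, not real ORES output
example = {"Stub": 0.05, "Start": 0.41, "C": 0.39, "B": 0.10, "GA": 0.03, "FA": 0.02}
label, is_close_call = top_quality_levels(example)
print(label, is_close_call)
```

Flagging close calls like this does not change the prediction; it only marks articles whose assigned level might deserve a second look.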

In [7]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
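    #    Work on copies so that the shared module-level templates are not modified between calls
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call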
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_ORES_THROTTLE_WAIT > 0.0:
            time.sleep(API_ORES_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## Getting the Article and Population Data

In [8]:
#########
#
#    IMPORTS
#

import numpy as np
import pandas as pd

In [9]:
#########
#
#    CONSTANTS
#
# Dev environment flag to control the number of requests made to the API for testing
DEV_ENVIRONMENT = False

# The cleaned CSV file path that contains the list of politicians by country
POLITICIANS_BY_COUNTRY_CSV_PATH = "politicians_by_country_AUG.2024.csv"
# The cleaned CSV file path that contains the list of population by country
POPULATION_BY_COUNTRY_CSV_PATH = "population_by_country_AUG.2024.csv"

As stated in the [MediaWiki Action API help](https://www.mediawiki.org/w/api.php?action=help&modules=query):

For titles "Maximum number of values is 50 (500 for clients that are allowed higher limits)."

In [10]:
# To be on the safe side, we'll limit the number of titles to 40
TITLE_LIMIT = 40
# The separator for the titles in the query string
TITLE_SEPARATOR = "|"

In [11]:
# The path to the error list file, that includes the articles whose page info could not be retrieved
PAGE_INFO_ERROR_LIST_PATH = "data/page_info_errors.txt"
# The path to the output file that contains the page info data
WP_POLITICIANS_WITH_PAGE_INFO_PATH = "data/wp_politicians_with_page_info.csv"

There is a way to get the page information for multiple pages at the same time, by separating the page titles with the vertical bar "|" character. However, this approach has limits. You should probably check the API documentation if you want to do multiple pages in a single request - and limit the number of pages in one request reasonably.
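
One way to keep every request within a chosen size is to slice the title list into fixed-size groups before joining them. The sketch below is only an illustration: the helper name `batch_titles` and the placeholder titles are hypothetical, and the default of 40 simply mirrors the conservative limit chosen above.

```python
def batch_titles(titles, batch_size=40, sep="|"):
    """Yield sep-joined strings holding at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield sep.join(titles[start:start + batch_size])

# Placeholder titles, used only to show the batch sizes
example_titles = [f"Article {i}" for i in range(95)]
batches = list(batch_titles(example_titles))
print([len(b.split("|")) for b in batches])
```

With fixed-size slicing only the last batch can be shorter than `batch_size`, and no batch is ever longer.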

In [12]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_page_info_multiple_articles(titles = None, title_sep = '|', error_list = None):
    '''
    This function takes a delimiter-separated string of article titles and returns the page information
    for each article as a list of dictionaries.
    This function also adds the titles that could not be fetched to the error list.

    Parameters:
    ------------
    titles : str
        A string of article titles separated by a delimiter

    title_sep : str
        The delimiter that separates the article titles in the string

    error_list : list
        A list to store the titles that could not be fetched

    Returns:
    ------------
    page_info_list : list
        A list of dictionaries containing the page information for each article
    '''
    page_info_list = []
    page_info = None

    if titles is None:
        titles = ""
    if error_list is None:
        error_list = []
    try:
        json_response = request_pageinfo_per_article(titles)

        page_info_dict = json_response['query']['pages']
        for page_id, page_info in page_info_dict.items():
            try:
                page_info_list.append({
                    "page_id": page_info["pageid"],
                    "title": page_info["title"],
                    "revision_id": page_info["lastrevid"],
                    "page_length": page_info["length"],
                })
            except Exception as e:
                print("Error while fetching data for: ", page_info["title"])
                error_list.append(page_info["title"])
                print(e)
    
    except Exception as e:
        print("Error in API while fetching data")
        error_list.extend(titles.split(title_sep))
        print(e)
    
    return page_info_list

In [13]:
def write_list_to_file(file_path, list_to_write):
    '''
    This function writes a list to a file.

    Parameters:
    ------------
    file_path : str
        The path of the file to write the list to
    
    list_to_write : list
        The list to write to the file
    '''
    with open(file_path, 'w') as file:
        for item in list_to_write:
            file.write("%s\n" % item)

### Fetch Politicians Data

In [14]:
# Load the cleaned CSV file that contains the list of politicians by country
politicians_by_country_df = pd.read_csv(POLITICIANS_BY_COUNTRY_CSV_PATH)

Note: The following loop was executed on a system equipped with a 12th Gen Intel® Core™ i7-12700H processor (2.30 GHz). The total execution time for this loop was under 2 minutes.

The results have already been saved in WP_POLITICIANS_WITH_PAGE_INFO_PATH. 

Skip the next 3 cells to directly utilize the data saved in WP_POLITICIANS_WITH_PAGE_INFO_PATH.

In [184]:
page_info_list = []
page_info_error_list = []
total_len = 0

# Get the page info in batches of roughly TITLE_LIMIT titles
# np.array_split divides the dataframe into len(df) // TITLE_LIMIT nearly equal batches; with several
# thousand rows, each batch holds TITLE_LIMIT or TITLE_LIMIT + 1 titles
# Hence, we use a TITLE_LIMIT that is less than the actual limit to avoid making a bigger request than the limit
for batch in np.array_split(politicians_by_country_df, len(politicians_by_country_df) // TITLE_LIMIT):
    titles = batch["name"].tolist()
    page_info_list_batch = request_page_info_multiple_articles(TITLE_SEPARATOR.join(titles), TITLE_SEPARATOR, page_info_error_list)
    page_info_list.extend(page_info_list_batch)

page_info_df = pd.DataFrame(page_info_list)

# Write the error list to a file
write_list_to_file(PAGE_INFO_ERROR_LIST_PATH, page_info_error_list)



Error while fetching data for:  Barbara Eibinger-Miedl
'pageid'
Error while fetching data for:  Mehrali Gasimov
'pageid'
Error while fetching data for:  Kyaw Myint
'pageid'
Error while fetching data for:  André Ngongang Ouandji
'pageid'
Error while fetching data for:  Tomás Pimentel
'pageid'
Error while fetching data for:  Richard Sumah
'pageid'
Error while fetching data for:  Segun ''Aeroland'' Adewale
'pageid'
Error while fetching data for:  Bashir Bililiqo
'pageid'
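
Each `'pageid'` line above is the message of a `KeyError`: the record returned for that title had no `pageid` field. A likely cause is that no article exists under exactly that title, in which case (this is an assumption about the response shape, not something shown in the output above) the API returns a placeholder record with a negative key and a `missing` field. A minimal sketch of separating the two kinds of record, using hand-written data and a hypothetical helper name:

```python
def split_found_and_missing(pages):
    """Separate page records that carry a 'pageid' from those that do not."""
    found, missing = [], []
    for record in pages.values():
        (found if "pageid" in record else missing).append(record["title"])
    return found, missing

# Hand-written records shaped like the API output; the second one mimics a missing page
example_pages = {
    "1212891": {"pageid": 1212891, "ns": 0, "title": "Chinook salmon",
                "lastrevid": 1234351318, "length": 53787},
    "-1": {"ns": 0, "title": "Kyaw Myint", "missing": ""},
}
found, missing = split_found_and_missing(example_pages)
print("found:", found)
print("missing:", missing)
```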


As it turns out, there are duplicates in the politicians_by_country_df dataframe. We'll drop the duplicates in the page_info_df dataframe, and keep the first occurrence. 

In [17]:
page_info_df = page_info_df.drop_duplicates(subset=['page_id'])
page_info_df.shape

(7103, 4)

In [189]:
# Write the final page_info dataframe to a CSV file
page_info_df.to_csv(WP_POLITICIANS_WITH_PAGE_INFO_PATH, index=False)

In [15]:
# Extract the data in case the above data extraction is skipped.
page_info_df = pd.read_csv(WP_POLITICIANS_WITH_PAGE_INFO_PATH)

In [16]:
page_info_df.head()

Unnamed: 0,page_id,title,revision_id,page_length
0,27428272,Abdul Baqi Turkistani,1231655023,1357
1,29443640,Abdul Ghani Ghani,1227026187,1292
2,44482763,Abdul Rahim Ayoubi,1226326055,7313
3,52438668,Aimal Faizi,1185105938,2791
4,12084570,Amir Muhammad Akhundzada,1247931713,8865


In [17]:
# Calculate Error Rate
TOTAL_POLITICIAN_ARTICLES = politicians_by_country_df['name'].nunique()
TOTAL_PAGE_INFO_ARTICLES = page_info_df['page_id'].nunique()

page_info_error_rate = 1 - (TOTAL_PAGE_INFO_ARTICLES / TOTAL_POLITICIAN_ARTICLES)
print(f"Page Info Error Rate: {page_info_error_rate * 100}%")

Page Info Error Rate: 0.11250175783996674%


## Getting Article Quality Predictions

Once the latest revision ids have been extracted for each article using the Page Info API, the ORES API can be used to get the quality predictions.

In [18]:
#########
#
#    IMPORTS
#
import os

from tqdm import tqdm
from dotenv import load_dotenv

In [19]:
#########
#
#    CONSTANTS
#

# The path to the .env file that contains the Wikimedia credentials
ENV_PATH = ".env"

ORES_ERROR_LIST_PATH = "data/ores_errors.txt"

# The path to the output file that contains the ORES scores
WP_POLITICIANS_WITH_ORES_PATH = "data/wp_politicians_with_ores.csv"

### Loading the Environment

In [20]:
load_dotenv(ENV_PATH)

USERNAME = os.getenv("WIKIMEDIA_USERNAME")
ACCESS_TOKEN = os.getenv("WIKIMEDIA_ACCESS_TOKEN")

### Query the ORES API

Note: The following loop was executed on a system equipped with a 12th Gen Intel® Core™ i7-12700H processor (2.30 GHz). The total execution time for this loop was 141 minutes and 44 seconds.

The results have already been saved in WP_POLITICIANS_WITH_ORES_PATH. 

Skip the next 3 cells to directly utilize the data saved in WP_POLITICIANS_WITH_ORES_PATH.
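
A quick back-of-the-envelope check shows how much of that run time is deliberate throttling. The sketch below re-derives the wait from the constants defined earlier and multiplies it by the 7103 rows in the page info dataframe; the variable names are local to this example.

```python
requests_per_hour = 5000          # rate granted by the access token, per the constants above
latency_assumed = 0.002           # seconds, same assumption as API_ORES_LATENCY_ASSUMED
throttle_wait = (60.0 * 60.0) / requests_per_hour - latency_assumed

article_count = 7103              # rows in page_info_df after dropping duplicates
sleep_minutes = article_count * throttle_wait / 60.0
print(f"wait per request: {throttle_wait:.3f} s, total sleep: {sleep_minutes:.1f} min")
```

Roughly 85 minutes of the run is therefore spent sleeping between requests; the remainder is presumably network and model latency.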

In [None]:
ores_quality_prediction_list = []
ores_quality_prediction_error_list = []


for index,row in tqdm(page_info_df.iterrows(), total=page_info_df.shape[0]):
    try:
        article_score_json_response = request_ores_score_per_article(article_revid=row["revision_id"],
                                           email_address=USER_EMAIL,
                                           access_token=ACCESS_TOKEN)
        
        article_scores = article_score_json_response["enwiki"]["scores"][f'{row["revision_id"]}']
        article_quality_prediction = article_scores["articlequality"]["score"]["prediction"]
    
        row_dict = row.to_dict()
        row_dict["article_quality_prediction"] = article_quality_prediction
        ores_quality_prediction_list.append(row_dict)
    
    except Exception as e:
        print("Error fetching quality prediction for", row["title"], ", Revision:", row["revision_id"])
        print(e)
        ores_quality_prediction_error_list.append(row["title"])

    if DEV_ENVIRONMENT and index > 5:
        break

ores_quality_prediction_df = pd.DataFrame(ores_quality_prediction_list)

# Write the error list to a file
write_list_to_file(ORES_ERROR_LIST_PATH, ores_quality_prediction_error_list)

In [25]:
ores_quality_prediction_df = ores_quality_prediction_df.drop_duplicates(subset=['page_id'])
ores_quality_prediction_df.shape

(7102, 5)

In [210]:
# Write the final ORES quality prediction dataframe to a CSV file
ores_quality_prediction_df.to_csv(WP_POLITICIANS_WITH_ORES_PATH, index=False)

In [21]:
# Extract the data in case the above data extraction is skipped.
ores_quality_prediction_df = pd.read_csv(WP_POLITICIANS_WITH_ORES_PATH)

In [22]:
ores_quality_prediction_df.head()

Unnamed: 0,page_id,title,revision_id,page_length,article_quality_prediction
0,27428272,Abdul Baqi Turkistani,1231655023,1357,Stub
1,29443640,Abdul Ghani Ghani,1227026187,1292,Stub
2,44482763,Abdul Rahim Ayoubi,1226326055,7313,Start
3,52438668,Aimal Faizi,1185105938,2791,Stub
4,12084570,Amir Muhammad Akhundzada,1247931713,8865,Start


In [23]:
# Calculate Error Rate
TOTAL_POLITICIAN_ARTICLES = politicians_by_country_df['name'].nunique()
TOTAL_ORES_PREDICTIONS = ores_quality_prediction_df['page_id'].nunique()

ores_error_rate = 1 - (TOTAL_ORES_PREDICTIONS / TOTAL_POLITICIAN_ARTICLES)
print(f"ORES Error Rate: {ores_error_rate * 100}%")

ORES Error Rate: 0.12656447756995703%


## Combining the Datasets

In [24]:
#########
#
#    CONSTANTS
#
COUNTRIES_NO_MATCH_PATH = "data/wp_countries-no_match.txt"
POLITICIANS_BY_COUNTRY_PATH = "data/wp_politicians_by_country.csv"

### Loading Population and Region Details

We now set up dataframes that contain population details of countries and regions. This will be matched with articles through the countries that they (the respective politicians) belong to. 

In [25]:
# Load the cleaned CSV file that contains the list of population by country
population_by_country_df = pd.read_csv(POPULATION_BY_COUNTRY_CSV_PATH)
population_by_country_df.head()

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2


The population_by_country_AUG.2024.csv represents regions in a hierarchical order. Thus, to get the region that a country belongs to, we simply have to put a country into the closest (lowest in the hierarchy) region. 

In [26]:
current_region = None

population_by_country_region_list = []

for index, row in population_by_country_df.iterrows():
    row_dict = row.to_dict()
    if row["Geography"].isupper():
        current_region = row["Geography"]
        row_dict["region"] = ""
    else:
        row_dict["region"] = current_region
    population_by_country_region_list.append(row_dict)

population_by_country_region_df = pd.DataFrame(population_by_country_region_list)
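
The same region assignment can also be written without an explicit loop. The following is a sketch of an alternative, not the approach used in this notebook: it marks the upper-case rows as regions and forward-fills the most recent region name onto the country rows beneath it. The toy dataframe copies a few rows from the head of the population table shown earlier.

```python
import pandas as pd

# A few rows copied from the head of the population table shown earlier
toy_population_df = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
})

is_region = toy_population_df["Geography"].str.isupper()
# Keep region names only on region rows, then carry the latest one forward onto country rows
toy_population_df["region"] = toy_population_df["Geography"].where(is_region).ffill()
# Drop the region rows themselves, leaving one row per country
toy_countries_df = toy_population_df[~is_region]
print(toy_countries_df)
```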

In [27]:
# Rename the columns 'Geography' and 'Population' to 'country' and 'population' respectively
population_by_country_region_df.rename(columns={"Geography": "country", "Population": "population"}, inplace=True)
# Remove the rows with 'region' as empty
population_by_country_region_df = population_by_country_region_df[population_by_country_region_df["region"] != ""]

print("Total number of countries in the population dataset:", population_by_country_region_df["country"].nunique())

Total number of countries in the population dataset: 209


In [28]:
population_by_country_region_df.head()

Unnamed: 0,country,population,region
3,Algeria,46.8,NORTHERN AFRICA
4,Egypt,105.2,NORTHERN AFRICA
5,Libya,6.9,NORTHERN AFRICA
6,Morocco,37.0,NORTHERN AFRICA
7,Sudan,48.1,NORTHERN AFRICA


### Merging Country-Wise

Merge the relevant data frames to finally have a data frame that contains details of each article, including the ORES API quality prediction, the respective country (along with population) and the respective region.

In [30]:
ores_quality_prediction_df.head()

Unnamed: 0,page_id,title,revision_id,page_length,article_quality_prediction
0,27428272,Abdul Baqi Turkistani,1231655023,1357,Stub
1,29443640,Abdul Ghani Ghani,1227026187,1292,Stub
2,44482763,Abdul Rahim Ayoubi,1226326055,7313,Start
3,52438668,Aimal Faizi,1185105938,2791,Stub
4,12084570,Amir Muhammad Akhundzada,1247931713,8865,Start


In [31]:
politicians_by_country_df.head()

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan


In [32]:
# Merge the ores_quality_prediction_df with politicians_by_country_df to get the country for each article
# The merge needs to be done on ores_quality_prediction_df 'title' and politicians_by_country_df 'name' columns
ores_quality_prediction_country_df = \
    pd.merge(ores_quality_prediction_df, politicians_by_country_df, left_on='title', right_on='name', how='inner')
ores_quality_prediction_country_df.drop(columns=['name', 'url', 'page_length'], inplace=True)
ores_quality_prediction_country_df.head()

Unnamed: 0,page_id,title,revision_id,article_quality_prediction,country
0,27428272,Abdul Baqi Turkistani,1231655023,Stub,Afghanistan
1,29443640,Abdul Ghani Ghani,1227026187,Stub,Afghanistan
2,44482763,Abdul Rahim Ayoubi,1226326055,Start,Afghanistan
3,52438668,Aimal Faizi,1185105938,Stub,Afghanistan
4,12084570,Amir Muhammad Akhundzada,1247931713,Start,Afghanistan


In [35]:
# Correct some country names in population_by_country_region_df to match the country names in ores_quality_prediction_country_df
corrections = {
    "GuineaBissau": "Guinea-Bissau",
    "Korea (South)": "Korea, South"
}

population_by_country_region_df["country"] = population_by_country_region_df["country"].replace(corrections)

In [36]:
# Merge the ores_quality_prediction_country_df with population_by_country_region_df to get the population and region for each article
# The merge needs to be done on ores_quality_prediction_country_df 'country' and population_by_country_region_df 'country' columns
# The merge will be outer, to eventually identify entries that did not merge correctly
ores_quality_prediction_country_population_df = \
    pd.merge(ores_quality_prediction_country_df, population_by_country_region_df, on='country', how='outer', indicator=True)
ores_quality_prediction_country_population_df.head()

Unnamed: 0,page_id,title,revision_id,article_quality_prediction,country,population,region,_merge
0,27428272.0,Abdul Baqi Turkistani,1231655000.0,Stub,Afghanistan,42.4,SOUTH ASIA,both
1,29443640.0,Abdul Ghani Ghani,1227026000.0,Stub,Afghanistan,42.4,SOUTH ASIA,both
2,44482763.0,Abdul Rahim Ayoubi,1226326000.0,Start,Afghanistan,42.4,SOUTH ASIA,both
3,52438668.0,Aimal Faizi,1185106000.0,Stub,Afghanistan,42.4,SOUTH ASIA,both
4,12084570.0,Amir Muhammad Akhundzada,1247932000.0,Start,Afghanistan,42.4,SOUTH ASIA,both


In [37]:
# Identify all countries for which there are invalid merges
country_no_match_list1 = ores_quality_prediction_country_population_df[ores_quality_prediction_country_population_df["_merge"] == "left_only"]["country"].unique()
print("Countries with no population info:\n", country_no_match_list1)

Countries with no population info:
 ['Korean']


In [38]:
# Identify all countries in the population dataset that have no matching articles
country_no_match_list2 = ores_quality_prediction_country_population_df[ores_quality_prediction_country_population_df["_merge"] == "right_only"]["country"].unique()
print("Countries in the population dataset that do not have articles:\n", country_no_match_list2)

Countries in the population dataset that do not have articles:
 ['Andorra' 'Australia' 'Brunei' 'Canada' 'China (Hong Kong SAR)'
 'China (Macao SAR)' 'Curacao' 'Denmark' 'Dominica' 'Fiji' 'French Guiana'
 'French Polynesia' 'Georgia' 'Guadeloupe' 'Guam' 'Iceland' 'Ireland'
 'Jamaica' 'Kiribati' 'Korea (North)' 'Liechtenstein' 'Martinique'
 'Mauritius' 'Mayotte' 'Mexico' 'Nauru' 'Netherlands' 'New Caledonia'
 'New Zealand' 'Palau' 'Philippines' 'Puerto Rico' 'Reunion' 'Romania'
 'San Marino' 'Sao Tome and Principe' 'Suriname' 'United Kingdom'
 'United States' 'Western Sahara' 'eSwatini']
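
The two checks above rely on the `_merge` column that `pd.merge` adds when `indicator=True`. The following is a minimal sketch of that pattern on toy data; the frames `articles` and `population` and all of their values are made up for illustration and are not the notebook's datasets.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (illustrative values only)
articles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["Albania", "Korean", "Albania"],
})
population = pd.DataFrame({
    "country": ["Albania", "Iceland"],
    "population": [2.7, 0.4],
})

# An outer merge keeps unmatched rows from both sides; indicator=True labels each row
merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

# Countries that appear only in the article table (no population info)
left_only = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
# Countries that appear only in the population table (no articles)
right_only = sorted(merged.loc[merged["_merge"] == "right_only", "country"].unique())
# Rows that matched on both sides, with the helper column removed
both = merged[merged["_merge"] == "both"].drop(columns=["_merge"])

print(left_only, right_only, len(both))
```

Rows tagged `left_only` correspond to articles without population data, and rows tagged `right_only` correspond to population entries without articles.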


In [39]:
country_no_match_list = list(set(country_no_match_list1) | set(country_no_match_list2))
write_list_to_file(COUNTRIES_NO_MATCH_PATH, country_no_match_list)

In [40]:
# Finally, keep only the rows that merged correctly to get the final dataframe.
# Using the non-inplace drop returns a new dataframe and avoids pandas' SettingWithCopyWarning.
wp_politicians_by_country_df = (
    ores_quality_prediction_country_population_df[ores_quality_prediction_country_population_df["_merge"] == "both"]
    .drop(columns=["_merge"])
)
wp_politicians_by_country_df.head()


Unnamed: 0,page_id,title,revision_id,article_quality_prediction,country,population,region
0,27428272.0,Abdul Baqi Turkistani,1231655000.0,Stub,Afghanistan,42.4,SOUTH ASIA
1,29443640.0,Abdul Ghani Ghani,1227026000.0,Stub,Afghanistan,42.4,SOUTH ASIA
2,44482763.0,Abdul Rahim Ayoubi,1226326000.0,Start,Afghanistan,42.4,SOUTH ASIA
3,52438668.0,Aimal Faizi,1185106000.0,Stub,Afghanistan,42.4,SOUTH ASIA
4,12084570.0,Amir Muhammad Akhundzada,1247932000.0,Start,Afghanistan,42.4,SOUTH ASIA


In [41]:
# Write the final dataframe to a CSV file
wp_politicians_by_country_df.to_csv(POLITICIANS_BY_COUNTRY_PATH, index=False)

## Analysis and Results

Analysis 1:

Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order). Population is expressed in millions, so the ratio is articles per million people.

In [42]:
# Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
top_10_countries_by_coverage = wp_politicians_by_country_df.groupby(["country", "region", "population"]).size().reset_index(name='total_articles')
top_10_countries_by_coverage["total_articles_per_capita"] = top_10_countries_by_coverage["total_articles"] / top_10_countries_by_coverage["population"]

top_10_countries_by_coverage.sort_values(by="total_articles_per_capita", ascending=False, inplace=True)
print("Top 10 countries by coverage (descending):")
top_10_countries_by_coverage.head(10)

Top 10 countries by coverage (descending):


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
156,Tuvalu,OCEANIA,0.0,1,inf
98,Monaco,WESTERN EUROPE,0.0,10,inf
4,Antigua and Barbuda,CARIBBEAN,0.1,33,330.0
51,Federated States of Micronesia,OCEANIA,0.1,14,140.0
95,Marshall Islands,OCEANIA,0.1,13,130.0
151,Tonga,OCEANIA,0.1,10,100.0
12,Barbados,CARIBBEAN,0.3,25,83.333333
127,Seychelles,EASTERN AFRICA,0.1,6,60.0
100,Montenegro,SOUTHERN EUROPE,0.6,36,60.0
92,Maldives,SOUTH ASIA,0.6,33,55.0
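
The `inf` values for Tuvalu and Monaco in the table above arise because their population is listed as 0.0, and dividing a positive count by zero in pandas yields infinity. Whether such countries belong in the ranking is a judgment call. The sketch below shows one possible way to handle them, assuming a population of 0.0 should be treated as missing; the frame `coverage` reuses four rows from the table above and is not the full dataset.

```python
import numpy as np
import pandas as pd

# Four rows copied from the output above, for illustration
coverage = pd.DataFrame({
    "country": ["Tuvalu", "Monaco", "Antigua and Barbuda", "Barbados"],
    "population": [0.0, 0.0, 0.1, 0.3],
    "total_articles": [1, 10, 33, 25],
})

# Treat a reported population of 0.0 as missing so the ratio becomes NaN instead of inf
safe_population = coverage["population"].replace(0, np.nan)
coverage["total_articles_per_capita"] = coverage["total_articles"] / safe_population

# Drop countries without a usable ratio, then rank the rest in descending order
ranked = (
    coverage.dropna(subset=["total_articles_per_capita"])
    .sort_values(by="total_articles_per_capita", ascending=False)
)
print(ranked)
```

With this approach the countries with a zero population are excluded from the ranking rather than placed at the top of it.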


Analysis 2:

Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).

In [43]:
# Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
bottom_10_countries_by_coverage = wp_politicians_by_country_df.groupby(["country", "region", "population"]).size().reset_index(name='total_articles')
bottom_10_countries_by_coverage["total_articles_per_capita"] = bottom_10_countries_by_coverage["total_articles"] / bottom_10_countries_by_coverage["population"]

bottom_10_countries_by_coverage.sort_values(by="total_articles_per_capita", ascending=True, inplace=True)
print("Bottom 10 countries by coverage (ascending):")
bottom_10_countries_by_coverage.head(10)

Bottom 10 countries by coverage (ascending):


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
31,China,EAST ASIA,1411.3,16,0.011337
57,Ghana,WESTERN AFRICA,34.1,3,0.087977
67,India,SOUTH ASIA,1428.6,151,0.105698
124,Saudi Arabia,WESTERN ASIA,36.9,5,0.135501
166,Zambia,EASTERN AFRICA,20.2,3,0.148515
110,Norway,NORTHERN EUROPE,5.5,1,0.181818
71,Israel,WESTERN ASIA,9.8,2,0.204082
45,Egypt,NORTHERN AFRICA,105.2,32,0.304183
37,Cote d'Ivoire,WESTERN AFRICA,30.9,10,0.323625
50,Ethiopia,EASTERN AFRICA,126.5,44,0.347826
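
Analyses 1 and 2 repeat the same steps (group by country, count rows, divide by population, sort) and differ only in the sort direction. A minimal sketch of a shared helper follows. The function name `per_capita_ranking`, its parameters, and the toy frame are hypothetical and only illustrate the idea; the columns `country`, `region`, and `population` are assumed to match those of `wp_politicians_by_country_df`.

```python
import pandas as pd

def per_capita_ranking(df, count_name, ascending, n=10):
    """Count rows per country and rank countries by count / population."""
    counts = (
        df.groupby(["country", "region", "population"])
        .size()
        .reset_index(name=count_name)
    )
    ratio_name = f"{count_name}_per_capita"
    counts[ratio_name] = counts[count_name] / counts["population"]
    return counts.sort_values(by=ratio_name, ascending=ascending).head(n)

# Toy data: one row per article (made-up values)
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "country": ["X", "X", "Y", "Z"],
    "region": ["R1", "R1", "R1", "R2"],
    "population": [1.0, 1.0, 4.0, 0.25],
})

top = per_capita_ranking(toy, "total_articles", ascending=False, n=2)
bottom = per_capita_ranking(toy, "total_articles", ascending=True, n=2)
print(top)
print(bottom)
```

The same helper could be applied to a dataframe filtered to high quality articles by passing a different `count_name`.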


In [44]:
# Select the high quality articles (predicted as FA or GA), to be used in Analyses 3, 4, and 6
high_quality_articles = wp_politicians_by_country_df[wp_politicians_by_country_df["article_quality_prediction"].isin(["FA", "GA"])]
high_quality_articles.head()

Unnamed: 0,page_id,title,revision_id,article_quality_prediction,country,population,region
20,3056711.0,Masoud Khalili,1246567000.0,GA,Afghanistan,42.4,SOUTH ASIA
43,627316.0,Abdul Salam Zaeef,1243759000.0,GA,Afghanistan,42.4,SOUTH ASIA
54,2202177.0,Fazal Hadi Shinwari,1242412000.0,GA,Afghanistan,42.4,SOUTH ASIA
90,7249978.0,Ali Pasha of Gusinje,1244422000.0,GA,Albania,2.7,SOUTHERN EUROPE
91,41985495.0,Aqif Pasha Elbasani,1243883000.0,GA,Albania,2.7,SOUTHERN EUROPE


Analysis 3:

Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).

In [45]:
# Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
top_10_countries_by_high_quality = high_quality_articles.groupby(["country", "region", "population"]).size().reset_index(name='total_high_quality')
top_10_countries_by_high_quality["total_high_quality_per_capita"] = top_10_countries_by_high_quality["total_high_quality"] / top_10_countries_by_high_quality["population"]

top_10_countries_by_high_quality.sort_values(by="total_high_quality_per_capita", ascending=False, inplace=True)
print("Top 10 countries by high quality (descending):")
top_10_countries_by_high_quality.head(10)

Top 10 countries by high quality (descending):


Unnamed: 0,country,region,population,total_high_quality,total_high_quality_per_capita
64,Montenegro,SOUTHERN EUROPE,0.6,3,5.0
57,Luxembourg,WESTERN EUROPE,0.7,2,2.857143
1,Albania,SOUTHERN EUROPE,2.7,7,2.592593
51,Kosovo,SOUTHERN EUROPE,1.7,4,2.352941
59,Maldives,SOUTH ASIA,0.6,1,1.666667
56,Lithuania,NORTHERN EUROPE,2.9,4,1.37931
25,Croatia,SOUTHERN EUROPE,3.8,5,1.315789
40,Guyana,SOUTH AMERICA,0.8,1,1.25
71,Palestinian Territory,WESTERN ASIA,5.5,6,1.090909
82,Slovenia,SOUTHERN EUROPE,2.1,2,0.952381


Analysis 4:

Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

In [46]:
# Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
bottom_10_countries_by_high_quality = high_quality_articles.groupby(["country", "region", "population"]).size().reset_index(name='total_high_quality')
bottom_10_countries_by_high_quality["total_high_quality_per_capita"] = bottom_10_countries_by_high_quality["total_high_quality"] / bottom_10_countries_by_high_quality["population"]

bottom_10_countries_by_high_quality.sort_values(by="total_high_quality_per_capita", ascending=True, inplace=True)
print("Bottom 10 countries by high quality (ascending):")
bottom_10_countries_by_high_quality.head(10)

Bottom 10 countries by high quality (ascending):


Unnamed: 0,country,region,population,total_high_quality,total_high_quality_per_capita
9,Bangladesh,SOUTH ASIA,173.5,1,0.005764
29,Egypt,NORTHERN AFRICA,105.2,1,0.009506
31,Ethiopia,EASTERN AFRICA,126.5,2,0.01581
46,Japan,EAST ASIA,124.5,2,0.016064
70,Pakistan,SOUTH ASIA,240.5,4,0.016632
22,Colombia,SOUTH AMERICA,52.2,1,0.019157
23,Congo DR,MIDDLE AFRICA,102.3,2,0.01955
101,Vietnam,SOUTHEAST ASIA,98.9,2,0.020222
96,Uganda,EASTERN AFRICA,48.6,1,0.020576
2,Algeria,NORTHERN AFRICA,46.8,1,0.021368
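
Because the grouping above runs on `high_quality_articles`, countries with no FA or GA articles never appear in the result, so the bottom 10 only ranks countries with at least one high quality article. If countries with zero high quality articles should also be ranked, one option is to count a boolean flag over all articles instead of filtering first. This is a sketch on made-up data; the frame `articles` and the column `is_high_quality` are illustrative assumptions.

```python
import pandas as pd

# Toy data: country Y has articles, but none of them is FA or GA
articles = pd.DataFrame({
    "country": ["X", "X", "Y", "Z"],
    "population": [1.0, 1.0, 4.0, 0.5],
    "article_quality_prediction": ["GA", "Stub", "Start", "FA"],
})

# Flag high quality rows, then sum the flag per country so zero counts are kept
flag = articles["article_quality_prediction"].isin(["FA", "GA"])
per_country = (
    articles.assign(is_high_quality=flag)
    .groupby(["country", "population"])["is_high_quality"]
    .sum()
    .reset_index(name="total_high_quality")
)
per_country["total_high_quality_per_capita"] = (
    per_country["total_high_quality"] / per_country["population"]
)
print(per_country.sort_values(by="total_high_quality_per_capita"))
```

Here country Y is retained with a count of 0 and sorts to the bottom of the ranking.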


Analysis 5:

Geographic regions by total coverage: A rank-ordered list of geographic regions (in descending order) by total articles per capita.


For this analysis, we need to get the total counts of articles along with population for each region.

In [47]:
articles_count_per_region = wp_politicians_by_country_df.groupby(['region']).size().reset_index(name='total_articles')
articles_count_per_region = pd.merge(articles_count_per_region, population_by_country_df, left_on=["region"], right_on=["Geography"], how="left")
articles_count_per_region.drop(columns=["Geography"], inplace=True)

# Calculate the total articles per capita for each region
articles_count_per_region["total_articles_per_capita"] = articles_count_per_region["total_articles"] / articles_count_per_region["Population"]
articles_count_per_region.head()

Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,CARIBBEAN,218,44.0,4.954545
1,CENTRAL AMERICA,188,182.0,1.032967
2,CENTRAL ASIA,106,80.0,1.325
3,EAST ASIA,230,1648.0,0.139563
4,EASTERN AFRICA,664,483.0,1.374741


In [48]:
# Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
regions_by_total_coverage_df = articles_count_per_region.sort_values(by="total_articles_per_capita", ascending=False).reset_index(drop=True)
print("Geographic regions by total coverage (descending):")
regions_by_total_coverage_df

Geographic regions by total coverage (descending):


Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,796,152.0,5.236842
1,CARIBBEAN,218,44.0,4.954545
2,WESTERN EUROPE,497,199.0,2.497487
3,EASTERN EUROPE,709,285.0,2.487719
4,WESTERN ASIA,609,299.0,2.036789
5,NORTHERN EUROPE,191,108.0,1.768519
6,SOUTHERN AFRICA,123,70.0,1.757143
7,OCEANIA,72,45.0,1.6
8,EASTERN AFRICA,664,483.0,1.374741
9,SOUTH AMERICA,569,426.0,1.335681


Analysis 6:

Geographic regions by high quality coverage: A rank-ordered list of geographic regions (in descending order) by high quality articles per capita.

For this analysis, we need to get the total counts of high quality articles along with population for each region.

In [49]:
high_quality_articles_count_per_region = high_quality_articles.groupby(['region']).size().reset_index(name='total_high_quality')
high_quality_articles_count_per_region = pd.merge(high_quality_articles_count_per_region, population_by_country_df, left_on=["region"], right_on=["Geography"], how="left")
high_quality_articles_count_per_region.drop(columns=["Geography"], inplace=True)

# Calculate the high quality articles per capita for each region
high_quality_articles_count_per_region["total_high_quality_per_capita"] = high_quality_articles_count_per_region["total_high_quality"] / high_quality_articles_count_per_region["Population"]
high_quality_articles_count_per_region.head()

Unnamed: 0,region,total_high_quality,Population,total_high_quality_per_capita
0,CARIBBEAN,9,44.0,0.204545
1,CENTRAL AMERICA,10,182.0,0.054945
2,CENTRAL ASIA,5,80.0,0.0625
3,EAST ASIA,13,1648.0,0.007888
4,EASTERN AFRICA,17,483.0,0.035197


In [50]:
# Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.
regions_by_high_quality_coverage_df = high_quality_articles_count_per_region.sort_values(by="total_high_quality_per_capita", ascending=False).reset_index(drop=True)
print("Geographic regions by high quality coverage (descending):")
regions_by_high_quality_coverage_df

Geographic regions by high quality coverage (descending):


Unnamed: 0,region,total_high_quality,Population,total_high_quality_per_capita
0,SOUTHERN EUROPE,53,152.0,0.348684
1,CARIBBEAN,9,44.0,0.204545
2,EASTERN EUROPE,38,285.0,0.133333
3,SOUTHERN AFRICA,8,70.0,0.114286
4,WESTERN EUROPE,21,199.0,0.105528
5,WESTERN ASIA,27,299.0,0.090301
6,NORTHERN EUROPE,9,108.0,0.083333
7,NORTHERN AFRICA,17,256.0,0.066406
8,CENTRAL ASIA,5,80.0,0.0625
9,CENTRAL AMERICA,10,182.0,0.054945
