# Data 512: Homework #2 (Considering Bias in Data)
In this homework, we aim to analyze the coverage and quality of Wikipedia articles related to political figures across different countries. By combining datasets of Wikipedia articles with country population data, we can examine how the representation of politicians varies among nations and how this may reflect underlying biases in data collection and presentation.

In this homework, we will work with two datasets:
- **Politicians by Country Dataset**: `politicians_by_country_AUG.2024.csv` contains a list of Wikipedia articles about politicians categorized by their nationality.
- **Population Dataset**: `population_by_country_AUG.2024.csv` includes population data for various countries, sourced from the Population Reference Bureau.

This notebook has two main sub-sections: (1) Data Acquisition and (2) Data Analysis/Results. Before getting into them, let us import the required libraries.

### Import required libraries and constants


In [1]:
# These are standard python modules
import json, time
import numpy as np
from IPython.display import clear_output

# The modules below are not part of the Python standard library. Install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

# KeyManager comes from a user-defined module (apikeys) rather than an installed package
from apikeys.KeyManager import KeyManager



## Step 1: Data Acquisition

This sub-section has four main parts:
- We take the raw input data and make sure it is clean and organized the way we need it: we remove duplicates and irrelevant or missing entries, add the `pageid` and current revision id (`lastrevid`) for each politician's article, and have the dataset ready for the next step.
- We get the predicted quality scores for each article in the Wikipedia dataset using a machine learning system called ORES (Objective Revision Evaluation Service), which classifies articles into quality categories ranging from Featured Article (FA) to Stub (Stub). We read each line from the `revised_politicians_by_country_with_pageinfo_AUG.2024.csv` file, make a request to get the current revision ID of the article page, and then use that information to request a quality score from ORES. Additionally, we calculate and print the error rate, which is the number of articles without a score divided by the total number of articles.
- We merge the Wikipedia politicians dataset (generated in the previous step) with the population data using country names. The list of unmatched countries is saved in `wp_countries-no_match.txt`, and rows for countries with successful matches are stored in a CSV file, `wp_politicians_by_country.csv`.
- We calculate the total articles per capita and high-quality articles per capita (for "FA" or "GA" articles) on both a country and regional basis.
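
The last two parts can be sketched on toy data. This is only an illustration under stated assumptions, not the notebook's actual implementation: the column names (`country`, `region`, `population`, `quality_prediction`), the toy values, and the helper name `per_capita_by_country` are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two datasets; every name and value here is hypothetical
articles = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "country": ["Xland", "Xland", "Yland", "Zland"],
    "quality_prediction": ["FA", "Stub", "GA", "Start"],
})
population = pd.DataFrame({
    "country": ["Xland", "Yland"],
    "region": ["North", "South"],
    "population": [2.0, 4.0],  # assumed unit: millions of people
})

# An outer merge with indicator=True marks rows that exist on only one side
merged = articles.merge(population, on="country", how="outer", indicator=True)
unmatched = sorted(merged.loc[merged["_merge"] != "both", "country"].unique())
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")


def per_capita_by_country(df):
    """Count all and high-quality (FA/GA) articles per country, per unit of population."""
    df = df.assign(high_quality=df["quality_prediction"].isin(["FA", "GA"]))
    summary = df.groupby("country").agg(
        total_articles=("name", "count"),
        high_quality_articles=("high_quality", "sum"),
        population=("population", "first"),
    )
    summary["articles_per_capita"] = summary["total_articles"] / summary["population"]
    summary["high_quality_per_capita"] = (
        summary["high_quality_articles"] / summary["population"]
    )
    return summary


country_summary = per_capita_by_country(matched)
```

Here `unmatched` holds the countries that would go into the no-match file, and `country_summary` holds one row per matched country. A regional version would follow the same pattern, grouping on the region column after summing each region's country populations.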

We define CONSTANTS in the next step to make the code more readable (avoided hardcoding values), maintainable (all the quick updates in a single place), and easy to update.

In [2]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {'User-Agent': '<pj2901@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'}

# Get the list of articles to be crawled
ARTICLE_TITLES = "" # This will be modified in the later part of this notebook

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

# The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

# The throttling rate is a function of the Access token that you are granted when you request the token. The constants
# come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
# Because all LiftWing API requests require some form of authentication, you need to provide your access token
# as part of the header too
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

# This is a template for the parameters that we need to supply in the headers of an API request
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring - this variable will be modified later in the notebook
ARTICLE_REVISIONS = {}

# This is a template of the data required as a payload when making a scoring request of the ORES model
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

# These are used later - defined here so they, at least, have empty values
USERNAME = ""
ACCESS_TOKEN = ""
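
A note on the throttle constants: as the cell above is written, `API_THROTTLE_WAIT` is assigned twice, so any function that reads the global at call time sees the later LiftWing value. The arithmetic behind both values can be checked with a small sketch; the helper name `throttle_wait` is hypothetical and not part of the notebook.

```python
def throttle_wait(requests_allowed, per_seconds, assumed_latency=0.002):
    """Seconds to sleep between requests so the rate limit is not exceeded."""
    return max(0.0, (per_seconds / requests_allowed) - assumed_latency)


# 100 requests per second, as in the first assignment (about 0.008 s)
wikipedia_wait = throttle_wait(100, 1.0)

# 5000 requests per hour, as in the second assignment (about 0.718 s)
liftwing_wait = throttle_wait(5000, 60.0 * 60.0)
```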

### 1.1 Define functions

In this section, we define all the functions we need in this notebook. Using functions makes the code easier to read, reuse, and maintain throughout the notebook.

#### 1.1.1 Request data from an article page

We access the page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). The MediaWiki Action API is a web service that allows access to some wiki features like authentication, page operations, and search. It can provide meta information about the wiki and the logged-in user.

The function below requests the summary 'page info' for a single article page. It sends an HTTP GET request to the Wikipedia API endpoint, which returns the metadata about the specified article.

The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details and can be consulted if required.

**License:** This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

In [3]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


#### 1.1.2 Request data from a chunk of articles

In the function defined below, we get Wikipedia page information for a list of article titles in chunks. We process the list of article titles in smaller batches to avoid exceeding API request limits. We send requests to the Wikipedia API for each chunk and get the page metadata including the page ID and current revision ID.

In [4]:
def request_pageinfo_for_chunks(article_titles, chunk_size=50):
    
    requested_info = []

    # Iterate through the article titles in chunks
    for i in range(0, len(article_titles), chunk_size):
        chunk = article_titles[i:i + chunk_size]    # Get a chunk of titles
        page_titles = '|'.join(chunk)  # Create a pipe-separated string
        print(f"Getting page info data for: {page_titles}...")

        # Prepare the request info
        request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
        request_info['titles'] = page_titles

        # Fetch page info
        info = request_pageinfo_per_article(request_template=request_info)

        # Process the response and append to results
        if info and 'query' in info and 'pages' in info['query']:
            for page in info['query']['pages'].values():
                # Save only pageid and lastrevid
                filtered_info = {
                    'title': page.get('title'),
                    'pageid': page.get('pageid'),
                    'lastrevid': page.get('lastrevid')
                }
                requested_info.append(filtered_info)  # Add the filtered info to results
        else:
            print("No data found for this chunk.")

    return requested_info
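
The batching step used by `request_pageinfo_for_chunks` can be tried offline, without touching the API. The sketch below isolates just that step; the helper name `chunk_titles` and the generated titles are hypothetical.

```python
def chunk_titles(titles, chunk_size=50):
    """Yield pipe-separated strings holding at most chunk_size titles each."""
    for start in range(0, len(titles), chunk_size):
        yield "|".join(titles[start:start + chunk_size])


# 120 made-up titles split into batches of 50, 50, and 20
titles = [f"Article {n}" for n in range(120)]
batches = list(chunk_titles(titles))
```

Each element of `batches` is a string that could be assigned to the `titles` parameter of a single request.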

#### 1.1.3 Make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is `article_revid`. One should be able to simply pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result is a dictionary that contains the probabilities that the specific revision belongs to each of six different article quality levels. Generally, the quality level with the highest probability score is taken as the quality level for the article. This can be tricky when two (or more) quality levels are highly probable.

**License:** This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024

In [5]:
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
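
The "two highly probable quality levels" case mentioned above can be made explicit with a small check. This is a sketch only: the helper name `pick_quality` and the `min_margin` threshold of 0.1 are assumptions, and the two probability dictionaries are copied from the sample scores shown later in this notebook.

```python
def pick_quality(probabilities, min_margin=0.1):
    """Return (top label, is_ambiguous) for a dict of class probabilities.

    is_ambiguous is True when the top two probabilities differ by less
    than min_margin (an arbitrary, hypothetical threshold).
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_label, (best_p - second_p) < min_margin


# Probabilities taken from the sample output for two articles
haroon = {"B": 0.416808, "C": 0.377938, "FA": 0.057959,
          "GA": 0.089012, "Start": 0.052925, "Stub": 0.005359}
majah = {"B": 0.114586, "C": 0.251813, "FA": 0.006372,
         "GA": 0.017664, "Start": 0.548056, "Stub": 0.061509}

haroon_result = pick_quality(haroon)
majah_result = pick_quality(majah)
```

For the first article the B and C probabilities are close, so the prediction is flagged as ambiguous; for the second, Start clearly dominates.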

#### 1.1.4 Extract Score details
When we request an ORES score from the LiftWing ML Service API, the response contains a lot of information, and we do not need all of it. The function below extracts and saves only the quality prediction and the class probabilities.

In [6]:
def extract_score_details(article_title, revision_id, score):
    # extract quality prediction and probabilities
    score_details = score["enwiki"]["scores"].get(str(revision_id), {}).get("articlequality", {}).get("score", {})
    quality_prediction = score_details.get("prediction", "")
    probabilities = score_details.get("probability", {})
    
    # Hold the score data in this dictionary
    score_dict = {
        'article_title': article_title,
        'revision_id': revision_id,
        'quality_prediction': quality_prediction
    }
    
    # Add probabilities to the score data
    score_dict.update({f'Probability {key}': value for key, value in probabilities.items()})

    return score_dict

### 1.2 Load the Politicians by Country Dataset

In [7]:
politicians_df = pd.read_csv('../data/input_data/politicians_by_country_AUG.2024.csv')
politicians_df.head()

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan


#### 1.2.1 Data Quality Checks

First, let us check if there are any missing values:

In [8]:
politicians_df.isnull().sum()

name       0
url        0
country    0
dtype: int64

There are no missing values!

Now let us see if there are any duplicate values:

In [9]:
print(" # Duplicate values in the dataframe: ", len(politicians_df[politicians_df.duplicated()]))

politicians_name_counts = politicians_df['name'].value_counts()
duplicate_politicians_names = politicians_name_counts[politicians_name_counts  > 1]
print(" # Duplicate values in the dataframe (with same article name): ", len(duplicate_politicians_names))

 # Duplicate values in the dataframe:  0
 # Duplicate values in the dataframe (with same article name):  41


It seems there are 41 politicians with multiple entries. Let us look at them in detail to understand why there are duplicates in the data:

In [10]:
duplicate_politicians_names_list = duplicate_politicians_names.index.tolist()
for name in duplicate_politicians_names_list:
    subset_politicians_df = politicians_df[politicians_df['name'] == name]
    display(subset_politicians_df)
    break # Remove to see all the duplicate entries in detail

Unnamed: 0,name,url,country
3451,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Kazakhstan
3704,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Kyrgyzstan
6504,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Tajikistan
6937,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Uzbekistan


We can observe that a single name and URL can appear with several different countries. One possible reason is that some individuals are recognized as politicians in multiple countries, or have affiliations beyond their original nationality.
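
As a compact alternative to displaying each duplicate one by one, the names and their countries can be gathered into a single table. This is a hedged sketch on a three-row sample taken from the rows shown in this notebook; the variable names are hypothetical.

```python
import pandas as pd

# Small sample: one name listed under two countries, one listed once
sample_df = pd.DataFrame({
    "name": ["Torokul Dzhanuzakov", "Torokul Dzhanuzakov", "Tayyab Agha"],
    "country": ["Kazakhstan", "Kyrgyzstan", "Afghanistan"],
})

# Collect the list of countries for every name, then keep names with more than one
countries_per_name = sample_df.groupby("name")["country"].agg(list)
multi_country = countries_per_name[countries_per_name.map(len) > 1]
```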

To resolve this, I manually looked into the backgrounds of all 41 politicians and kept only the row that names the politician's country of nationality. All the other rows are dropped.

In [11]:
indexes_to_delete = [3451, 6504, 6937, 739, 3864, 4780, 3168, 438, 5725, 5518, 151, 5561, 1758, 5546, 6059, 424, 5443, 6134, 3293, 4773, 6591, 2596, 6815, 6254, 6123, 733, 3093, 1903, 6351, 5576, 2869, 6482, 5374, 6267, 5513, 5534, 5632, 2664, 1119, 2113, 6285, 6356, 4853, 6266]
revised_politicians_df = politicians_df.drop(indexes_to_delete) # drop the indexes that's not of the politicians nationality
revised_politicians_df.to_csv("../data/generated_intermediate_data/revised_politicians_by_country_AUG.2024.csv") # save the dataset for quick reference / to be used in further analysis
revised_politicians_df.head()

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan


Just to make sure that we don't have any more duplicate name entries, let's execute the step below:

In [12]:
politicians_name_counts = revised_politicians_df['name'].value_counts()
duplicate_politicians_names = politicians_name_counts[politicians_name_counts  > 1]
print(" # Duplicate values in the dataframe (with same article name): ", len(duplicate_politicians_names))

 # Duplicate values in the dataframe (with same article name):  0


### 1.3 Get required Wikipedia page information for the articles

For further analysis, we need the last revision id (`lastrevid`) and page id (`pageid`) for each of the politician articles. As mentioned in the section where we defined `request_pageinfo_per_article`, we use the MediaWiki Action API to get this page information.

To fetch the information for several pages in a single request, we defined the `request_pageinfo_for_chunks` method.

In [13]:
# Let us first update the ARTICLE_TITLES constant with the revised names of politicians.
ARTICLE_TITLES = revised_politicians_df['name'].tolist()

In [14]:
# Fetch page info for the titles and have it in a dataframe
wiki_politicians_info = request_pageinfo_for_chunks(ARTICLE_TITLES, chunk_size=50)
wiki_politicians_info_df = pd.DataFrame(wiki_politicians_info)

clear_output(wait=True) # Cleared the cell output to save some space before uploading to github and for better clarity

print("Successfully fetched all the additional page info!")
wiki_politicians_info_df.head()

Successfully fetched all the additional page info!


Unnamed: 0,title,pageid,lastrevid
0,Abdul Baqi Turkistani,27428272.0,1231655000.0
1,Abdul Ghani Ghani,29443640.0,1227026000.0
2,Abdul Rahim Ayoubi,44482763.0,1226326000.0
3,Ahmad Wali Massoud,34682634.0,1221721000.0
4,Aimal Faizi,52438668.0,1185106000.0


The assignment instructions ask us to have revision ids for the articles we just queried. Let us check that:

In [15]:
empty_revid_df = wiki_politicians_info_df[wiki_politicians_info_df['lastrevid'].isnull()]
display(empty_revid_df)

Unnamed: 0,title,pageid,lastrevid
400,Barbara Eibinger-Miedl,,
500,Mehrali Gasimov,,
1150,Kyaw Myint,,
1300,André Ngongang Ouandji,,
1900,Tomás Pimentel,,
2400,Richard Sumah,,
4450,Segun ''Aeroland'' Adewale,,
5650,Bashir Bililiqo,,


There was an issue with eight politicians' articles. After digging into it, I found that:
- Mehrali Gasimov and Richard Sumah did not have any associated Wikipedia articles.
- The other six politicians did have Wikipedia articles, but the titles in our dataset did not resolve on English Wikipedia: some articles exist only on a non-English Wikipedia, and others sit under a slightly different English title (see the URLs in the comments of the next cell). The request we made queries only the English Wikipedia endpoint.

One option is to remove the rows for Mehrali Gasimov and Richard Sumah from the dataset and fill in the right `pageid` and `lastrevid` for the other six politicians.

In [16]:
# Manually update the info for politicians whose titles did not resolve on English Wikipedia
wiki_politicians_info_df.loc[400] = {'title': 'Barbara Eibinger-Miedl', 'pageid': 4534118, 'lastrevid': 247199899} # 'fullurl': "https://de.wikipedia.org/wiki/Barbara_Eibinger-Mied"
wiki_politicians_info_df.loc[1150] = {'title': 'Kyaw Myint', 'pageid': 69195914, 'lastrevid': 1177243609} # 'fullurl': "https://en.wikipedia.org/wiki/Michael_Kyaw_Myint"
wiki_politicians_info_df.loc[1300] = {'title': 'André Ngongang Ouandji', 'pageid': 7152978, 'lastrevid': 210595074} # 'fullurl': "https://fr.wikipedia.org/wiki/André_Ngongang_Ouandji"
wiki_politicians_info_df.loc[1900] = {'title': 'Tomás Pimentel', 'pageid': 9321687, 'lastrevid': 151461200} # 'fullurl': "https://es.wikipedia.org/wiki/Tomás_Pimentel"
wiki_politicians_info_df.loc[4450] = {'title': "Segun ''Aeroland'' Adewale", 'pageid': 45496646, 'lastrevid': 1242960131} # 'fullurl': "https://en.wikipedia.org/wiki/Segun_%22Aeroland%22_Adewale",
wiki_politicians_info_df.loc[5650] = {'title': 'Bashir Bililiqo', 'pageid': 18698, 'lastrevid': 65938} # 'fullurl': "https://ff.wikipedia.org/wiki/Bashir_Bililiqo"

In [17]:
politicians_with_pageinfo_df = pd.merge(revised_politicians_df, 
                                        wiki_politicians_info_df, 
                                        left_on='name', 
                                        right_on='title', 
                                        how='inner')
politicians_with_pageinfo_df.drop('title', axis=1, inplace=True)
politicians_with_pageinfo_df.head()

Unnamed: 0,name,url,country,pageid,lastrevid
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,10483286.0,1233203000.0
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,11966231.0,1230460000.0
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,46841383.0,1225662000.0
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan,71600382.0,1234742000.0
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan,47805901.0,1195651000.0


In [18]:
# remove Mehrali Gasimov's and Richard Sumah's title from the dataset
indexes_to_delete = [513, 2418]
revised_politicians_with_pageinfo_df = politicians_with_pageinfo_df.drop(indexes_to_delete)

# Let us also make sure that we don't have any rows with missing entries
print("# Politicians with no lastrevid: ", len(revised_politicians_with_pageinfo_df[revised_politicians_with_pageinfo_df['lastrevid'].isnull()]))

# Politicians with no lastrevid:  0
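
Dropping rows by hard-coded index labels works here, but the labels can shift if an earlier step changes. A hedged alternative is to filter by name instead; the sketch below uses a three-row toy frame, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame: one article with a revision id and the two names without an article
toy_df = pd.DataFrame({
    "name": ["Majah Ha Adrif", "Mehrali Gasimov", "Richard Sumah"],
    "lastrevid": [1233202991, None, None],
})

# Filter by name rather than by positional or label index
names_without_article = ["Mehrali Gasimov", "Richard Sumah"]
cleaned_df = toy_df[~toy_df["name"].isin(names_without_article)].reset_index(drop=True)

# Same sanity check as above: no row should be left without a revision id
missing_revids = int(cleaned_df["lastrevid"].isnull().sum())
```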


In [19]:
# save the results as a CSV for further analysis or quick reference
revised_politicians_with_pageinfo_df.to_csv('../data/generated_intermediate_data/revised_politicians_by_country_with_pageinfo_AUG.2024.csv')

### 1.4 Requesting ORES scores through LiftWing ML Service API

In this section, we will get the predicted quality scores for each article in the Wikipedia dataset using a machine learning system called ORES (Objective Revision Evaluation Service). Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

The Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely in the coming years that all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another ongoing change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which run an ML model over the text of an article page, are an example of a compute-intensive API service.

We will now see how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).

We will need a Wikimedia user account to get access to LiftWing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

#### 1.4.1 Obtain the API key

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment. I created a [Personal API token](https://api.wikimedia.org/wiki/Authentication) following the documentation to create the server-side app key.

A "best practice" for any code that requires an API key is to make sure that the key does not appear in the plain text of the code or notebook. One approach is to use a code-based key manager that stores keys on your local machine. For more information on how to set up the key as an environment variable, see [this guide](https://drive.google.com/file/d/15A8BNED9aJIqw_GiJPstsIuOx7U57adC/view?usp=sharing).

**License:**
The below code was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023


In [20]:
# I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
# track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
# from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 
#
# You should be able to find a zip file containing the apikeys user module. Install this module into the folder where you keep 
# all of your user modules. This is also the folder that your PYTHONPATH variable points to.

keyman = KeyManager()

# This is my Wikipedia/Wikimedia username. They suggest you request your keys using your Wikipedia username, 
# so I also stored the API key using my Wikipedia username.
USERNAME = "Pj2901"
key_info = keyman.findRecord(USERNAME,API_ORES_LIFTWING_ENDPOINT)
ACCESS_TOKEN = key_info[0]['access_token']
print('Key Description: ', key_info[0]['description'])

Key Description:  Wikimedia JWT Access Token
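
For readers without the `apikeys` module, the environment-variable idiom mentioned earlier might look like the sketch below. The variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are assumptions for illustration; they are not used elsewhere in this notebook.

```python
import os


def load_access_token(env_var="WIKIMEDIA_ACCESS_TOKEN"):
    """Read the access token from an environment variable (hypothetical name)."""
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return token
```

With the variable set in the shell that launches Jupyter, `ACCESS_TOKEN = load_access_token()` would replace the key manager lookup.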


#### 1.4.2 Requesting ORES scores

There are many ways to make the API call. Here we call the function `request_ores_score_per_article`, passing in three items: the revision id, the email address, and the access token.

For easy access, let us store the article titles and their revision ids in a dictionary, `ARTICLE_REVISIONS`.

In [21]:
# Convert lastrevid to an integer type before storing it in the dictionary
revised_politicians_with_pageinfo_df['lastrevid'] = revised_politicians_with_pageinfo_df['lastrevid'].fillna(0).astype(int)

ARTICLE_REVISIONS = dict(zip(revised_politicians_with_pageinfo_df['name'],
                            revised_politicians_with_pageinfo_df['lastrevid']))

In [22]:
email_address = "pj2901@uw.edu"
access_token = ACCESS_TOKEN 

score_dict_list = []
failed_score_request_articles = []

# Iterate over each article in ARTICLE_REVISIONS
for i, (article_title, revision_id) in enumerate(ARTICLE_REVISIONS.items(), start=1):
    print(f"({i}/{len(ARTICLE_REVISIONS)}) | Obtaining LiftWing ORES scores for '{article_title}'...")

    # Make the ORES score request
    score = request_ores_score_per_article(article_revid=revision_id,
                                           email_address=email_address,
                                           access_token=access_token)

    # Extract quality prediction and probabilities if available, else, make a note of the article title
    if score and "enwiki" in score and "scores" in score["enwiki"]:
        score_dict = extract_score_details(article_title, revision_id, score)
        score_dict_list.append(score_dict)
        print("Successfully obtained the ORES score!")
    else:
        failed_score_request_articles.append(article_title)
        print("Could not obtain the ORES score.")
    
    clear_output(wait=True) # Cleared the cell output to save some space before uploading to github and for better clarity

scores_df = pd.DataFrame(score_dict_list)
scores_df.head()

Unnamed: 0,article_title,revision_id,quality_prediction,Probability B,Probability C,Probability FA,Probability GA,Probability Start,Probability Stub
0,Majah Ha Adrif,1233202991,Start,0.114586,0.251813,0.006372,0.017664,0.548056,0.061509
1,Haroon al-Afghani,1230459615,B,0.416808,0.377938,0.057959,0.089012,0.052925,0.005359
2,Tayyab Agha,1225661708,Start,0.082645,0.247194,0.005548,0.018469,0.594374,0.05177
3,Khadija Zahra Ahmadi,1234741562,Stub,0.019056,0.034997,0.003352,0.009019,0.264906,0.66867
4,Aziza Ahmadyar,1195651393,Start,0.046899,0.098852,0.004757,0.01937,0.712785,0.117338


In [28]:
print("We failed to fetch ORES scores for the following article(s): ", failed_score_request_articles)

We failed to fetch ORES scores for the following article(s):  []


None of them failed! We successfully obtained the ORES scores for the relevant politician articles.

The error rate is defined as the number of articles for which we were not able to get a score divided by the total number of articles.
- We dropped the requests for Mehrali Gasimov and Richard Sumah in the previous step because there was no matching Wikipedia article.

We also manually dropped rows where a single politician was listed under multiple countries (41 politicians had this issue) and filled in 6 rows that did not have a revision id (using the revision id of the manually located Wikipedia page, in most cases a non-English one). I do not count these as errors because we ultimately obtained a predicted quality for them.

In [45]:
print("# API requests made: ", len(politicians_with_pageinfo_df))

error_rate = (3/len(politicians_with_pageinfo_df))*100
print("Error rate (%): ", error_rate)

# API requests made:  7111
Error rate (%):  0.04218815918998734


The error rate is well below 1%, so we were successful in obtaining ORES scores for nearly all of the politician articles.
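
The hard-coded `3` in the cell above could also be passed through a small function, which makes the definition of the error rate explicit. This is only a minimal sketch: `compute_error_rate` is a hypothetical helper that is not part of the original notebook, and the two input values are simply the ones used above.

```python
def compute_error_rate(num_without_score, num_total):
    """Return the error rate in percent: articles without a score / total articles."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return (num_without_score / num_total) * 100


# Values taken from the cell above: 3 articles without a score, 7111 API requests
error_rate = compute_error_rate(3, 7111)
print(f"Error rate (%): {error_rate:.4f}")
```

In the notebook itself, the first argument could come from the length of whichever list tracks the articles without a score.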

In the next step, we merge the scores with the revised politicians dataset and save the result for later reference. It took about 130 minutes to run through the 7111 articles, so saving the output lets us skip this step when we only need the scores for analysis.

In [24]:
# Merge the scores with the politicians dataframe
revised_politicians_with_pageinfo_and_scores_df = revised_politicians_with_pageinfo_df.merge(scores_df, left_on='name', right_on='article_title', how='left')

# Drop columns not required
revised_politicians_with_pageinfo_and_scores_df.drop('article_title', axis=1, inplace=True) # it is repeated twice
revised_politicians_with_pageinfo_and_scores_df.drop(columns=['Probability B', 'Probability C', 'Probability FA', 
                                                              'Probability GA', 'Probability Start', 'Probability Stub'], 
                                                              inplace=True) # I don't want to save these for my further analysis


# Save into a CSV for further reference
temp_output_filepath = '../data/generated_intermediate_data/revised_politicians_by_country_with_pageinfo_and_quality_prediction_AUG.2024.csv'
revised_politicians_with_pageinfo_and_scores_df.to_csv(temp_output_filepath, index=False)
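
Because the scoring loop is slow, a later session could reload the saved CSV instead of calling the API again. The sketch below shows one way to do that under stated assumptions: `load_or_build` and `build_toy_scores` are hypothetical names, and a temporary file stands in for the real intermediate CSV.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Load a cached CSV if it exists; otherwise build the dataframe and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build_fn()
    df.to_csv(csv_path, index=False)
    return df


def build_toy_scores():
    # Stand-in for the expensive ORES scoring loop
    return pd.DataFrame({"name": ["Tayyab Agha"], "quality_prediction": ["Start"]})


demo_path = os.path.join(tempfile.mkdtemp(), "scores_cache.csv")
first = load_or_build(demo_path, build_toy_scores)   # builds the data and writes the file
second = load_or_build(demo_path, build_toy_scores)  # reads the file back instead
print(second)
```

The trade-off is that a cached file can go stale: if the input list of politicians changes, the cache has to be deleted so the scores are fetched again.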

### 1.5 Combining the datasets

In this section, we first need to combine the dataset generated in the previous step (`revised_politicians_by_country_with_pageinfo_and_quality_prediction_AUG.2024.csv`) with the population data (`population_by_country_AUG.2024.csv`) using the common field - `country`. We also need to clean and standardize the country/region names before merging. 

#### 1.5.1 Load the population dataset

In [47]:
population_df = pd.read_csv("../data/input_data/population_by_country_AUG.2024.csv")
population_df.head()

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2


#### 1.5.2 Identify Regions and Countries from the dataset

When the 'Geography' value is entirely UPPER CASE, it denotes a 'region'. Otherwise (mixed case), it denotes a 'country'.

In [48]:
population_df['region'] = population_df['Geography'].apply(lambda x: x if x.isupper() else np.nan).ffill()  # get the region name

# Drop the rows where the Geography is all uppercase (Region)
revised_population_df = population_df[~population_df['Geography'].str.isupper()].reset_index(drop=True)
revised_population_df.rename(columns = {"Geography": "country", "Population": "population"}, inplace=True)
revised_population_df.head()

Unnamed: 0,country,population,region
0,Algeria,46.8,NORTHERN AFRICA
1,Egypt,105.2,NORTHERN AFRICA
2,Libya,6.9,NORTHERN AFRICA
3,Morocco,37.0,NORTHERN AFRICA
4,Sudan,48.1,NORTHERN AFRICA
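
To see why the forward fill assigns the most specific region, here is a small self-contained sketch using the first five rows of the population file shown earlier. The variable names (`toy_population_df`, `is_region`, `toy_countries_df`) are illustrative only.

```python
import pandas as pd

# First five rows of the population dataset, as displayed above
toy_population_df = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
})

# Upper-case rows are regions; keep their names and blank out the rest
is_region = toy_population_df["Geography"].str.isupper()
toy_population_df["region"] = toy_population_df["Geography"].where(is_region).ffill()

# Country rows inherit the nearest region row above them
toy_countries_df = toy_population_df[~is_region].reset_index(drop=True)
print(toy_countries_df)
```

When several upper-case rows appear in a row (WORLD, AFRICA, NORTHERN AFRICA), the forward fill carries only the last one down, so Algeria and Egypt are labelled NORTHERN AFRICA rather than AFRICA or WORLD.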


#### 1.5.3 Merge the population dataset with the politicians dataset

In [49]:
politicians_population_df = pd.merge(revised_politicians_with_pageinfo_and_scores_df, 
                                     revised_population_df, on=["country"], 
                                     how="outer", indicator=True)
politicians_population_df = politicians_population_df[["country", "region", "population", "name", "revision_id", "quality_prediction", "_merge"]]
# Note: 'quality_prediction' keeps its name, since the later cells reference it
politicians_population_df.rename(columns = {"name": "article_title", "_merge": "join_type"}, inplace=True)
politicians_population_df.head()

Unnamed: 0,country,region,population,article_title,revision_id,quality_prediction,join_type
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233203000.0,Start,both
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230460000.0,B,both
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225662000.0,Start,both
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234742000.0,Stub,both
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651000.0,Start,both
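
The `indicator=True` argument is what makes the next step possible, so a toy illustration may help. In this sketch the dataframes are deliberately tiny, and "Example Politician" is a made-up placeholder row rather than an entry from the dataset.

```python
import pandas as pd

toy_politicians = pd.DataFrame({
    "name": ["Tayyab Agha", "Example Politician"],  # second name is a placeholder
    "country": ["Afghanistan", "Korea, South"],
})
toy_population = pd.DataFrame({
    "country": ["Afghanistan", "Algeria"],
    "population": [42.4, 46.8],
})

# An outer merge keeps unmatched rows from both sides; `_merge` records where each row came from
toy_merged = pd.merge(toy_politicians, toy_population, on="country", how="outer", indicator=True)

left_only = toy_merged.loc[toy_merged["_merge"] == "left_only", "country"].tolist()
right_only = toy_merged.loc[toy_merged["_merge"] == "right_only", "country"].tolist()
print("No population data:", left_only)
print("No politician articles:", right_only)
```

Rows tagged `both` are the successful matches, `left_only` rows are politicians whose country has no population entry, and `right_only` rows are population entries with no politician articles.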


#### 1.5.4 Obtain the list of countries with no match

The homework defines "countries with no match" as countries for which either the population dataset has no entry matching the Wikipedia country, or the Wikipedia dataset has no articles for a country that appears in the population data.
- We will save the list of countries with no match in the path: `data/generated_output_data/wp_countries-no_match.txt`.


In [50]:
# Countries with no population data associated with them
countries_missing_population = politicians_population_df[politicians_population_df["join_type"] == "left_only"]
countries_missing_population_list = countries_missing_population["country"].unique().tolist()
print("Countries with no population data associated with them:", countries_missing_population_list)

# Countries with no politicians associated with them (population entries with no politician articles)
countries_missing_politicians = politicians_population_df[politicians_population_df["join_type"] == "right_only"]
countries_missing_politicians_list = countries_missing_politicians["country"].unique().tolist()
print("\nCountries with no politicians associated with them:", countries_missing_politicians_list)

# To obtain the countries with no match, we need to combine both the lists above,
countries_no_match = countries_missing_population_list + countries_missing_politicians_list
# Write to a file called wp_countries-no_match.txt
with open('../data/generated_output_data/wp_countries-no_match.txt', 'w+') as file:
    for country in countries_no_match:
        file.write(country + '\n')

Countries with no population data associated with them: ['Guinea-Bissau', 'Korea, South', 'Korean']

Countries with no politicians associated with them: ['Andorra', 'Australia', 'Brunei', 'Canada', 'China (Hong Kong SAR)', 'China (Macao SAR)', 'Curacao', 'Denmark', 'Dominica', 'Fiji', 'French Guiana', 'French Polynesia', 'Georgia', 'Guadeloupe', 'Guam', 'GuineaBissau', 'Iceland', 'Ireland', 'Jamaica', 'Kiribati', 'Korea (North)', 'Korea (South)', 'Liechtenstein', 'Martinique', 'Mauritius', 'Mayotte', 'Mexico', 'Nauru', 'Netherlands', 'New Caledonia', 'New Zealand', 'Palau', 'Philippines', 'Puerto Rico', 'Reunion', 'Romania', 'San Marino', 'Sao Tome and Principe', 'Suriname', 'United Kingdom', 'United States', 'Western Sahara', 'eSwatini']
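
Some of the unmatched names above look like spelling variants of the same country ('Guinea-Bissau' vs 'GuineaBissau', 'Korea, South' vs 'Korea (South)'). If one wanted to reconcile them before merging, a name-standardization step could look like the sketch below. This is an assumption on my part rather than something the notebook does: `COUNTRY_NAME_FIXES` and `standardize_country` are hypothetical, and the assignment may intend these countries to remain in the no-match list.

```python
import pandas as pd

# Hypothetical mapping from politician-dataset spellings to population-dataset spellings
COUNTRY_NAME_FIXES = {
    "Guinea-Bissau": "GuineaBissau",
    "Korea, South": "Korea (South)",
}


def standardize_country(name):
    """Strip surrounding whitespace and apply the known spelling fixes."""
    cleaned = name.strip()
    return COUNTRY_NAME_FIXES.get(cleaned, cleaned)


toy_countries = pd.Series(["Guinea-Bissau", " Korea, South", "Korean", "Afghanistan"])
standardized = toy_countries.map(standardize_country)
print(standardized.tolist())
```

'Korean' is left untouched on purpose, since it is ambiguous which population entry it should map to.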


#### 1.5.5 Obtain data with successful matches

For countries with successful matches, the merged dataset will have Wikipedia politicians article data (with ORES prediction) and population data. 
- We will save this data in the path: `data/generated_output_data/wp_politicians_by_country.csv` and it can be used for further analysis.

In [51]:
countries_with_match_df = politicians_population_df[politicians_population_df["join_type"] == "both"].copy()
countries_with_match_df.drop(columns=["join_type"], inplace=True)
countries_with_match_df.to_csv("../data/generated_output_data/wp_politicians_by_country.csv", index=False)# save the dataframe for further analysis
countries_with_match_df.head()

Unnamed: 0,country,region,population,article_title,revision_id,quality_prediction
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233203000.0,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230460000.0,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225662000.0,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234742000.0,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651000.0,Start


### 1.6 Analysis

In this section, we will calculate two important metrics: `total-articles-per-capita` and `high-quality-articles-per-capita`. These metrics tell us about the availability and quality of Wikipedia articles relative to the population of each country and region.

First, let us make sure that we only keep rows where the population is > 0.

In [52]:
countries_with_match_df = countries_with_match_df[countries_with_match_df['population'] > 0].copy()  # copy, so that adding a column next does not write to a slice

Let us then identify the "high quality" articles (articles whose `quality_prediction` is "FA" (featured article) or "GA" (good article)).

In [53]:
# Define high-quality articles
countries_with_match_df['is_high_quality'] = countries_with_match_df['quality_prediction'].isin(['FA', 'GA'])

Now, we count the total articles and the high-quality articles for each country and region.

In [54]:
# Get total articles and high-quality articles by country and region
aggregated_politicians_population_df = (countries_with_match_df.groupby(['country', 'region']).agg(total_articles=('article_title', 'count'),
                                                                            high_quality_articles=('is_high_quality', 'sum'),
                                                                            population=('population', 'first')).reset_index())
aggregated_politicians_population_df['population'] *= 1_000_000  # Let us also make sure that we represent the population count as it is and not in millions
aggregated_politicians_population_df.head()

Unnamed: 0,country,region,total_articles,high_quality_articles,population
0,Afghanistan,SOUTH ASIA,85,3,42400000.0
1,Albania,SOUTHERN EUROPE,69,7,2700000.0
2,Algeria,NORTHERN AFRICA,71,1,46800000.0
3,Angola,MIDDLE AFRICA,58,2,36700000.0
4,Antigua and Barbuda,CARIBBEAN,33,0,100000.0


Next, let us calculate the articles per capita,

- `total_articles_per_capita`: Represents the number of articles available for each person in a given country or region.
- `high_quality_articles_per_capita`: Represents the number of high-quality articles available per person.

In [55]:
aggregated_politicians_population_df['total_articles_per_capita'] = aggregated_politicians_population_df['total_articles'] / aggregated_politicians_population_df['population']
aggregated_politicians_population_df['high_quality_articles_per_capita'] = aggregated_politicians_population_df['high_quality_articles'] / aggregated_politicians_population_df['population']

# For better readability, round to 9 decimal places (kept numeric, so that sorting and arithmetic still work)
aggregated_politicians_population_df['total_articles_per_capita'] = aggregated_politicians_population_df['total_articles_per_capita'].round(9)
aggregated_politicians_population_df['high_quality_articles_per_capita'] = aggregated_politicians_population_df['high_quality_articles_per_capita'].round(9)

aggregated_politicians_population_df.to_csv('../data/generated_intermediate_data/articles_per_capita_analysis.csv', index=False) # Save for further analysis/quick reference
aggregated_politicians_population_df.head()

Unnamed: 0,country,region,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
0,Afghanistan,SOUTH ASIA,85,3,42400000.0,2.005e-06,7.1e-08
1,Albania,SOUTHERN EUROPE,69,7,2700000.0,2.5556e-05,2.593e-06
2,Algeria,NORTHERN AFRICA,71,1,46800000.0,1.517e-06,2.1e-08
3,Angola,MIDDLE AFRICA,58,2,36700000.0,1.58e-06,5.4e-08
4,Antigua and Barbuda,CARIBBEAN,33,0,100000.0,0.00033,0.0
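
As a quick sanity check of the per-capita arithmetic, the Afghanistan row can be recomputed by hand from the values in the tables above (85 articles, 3 high-quality articles, population 42.4 million). The variable names in this sketch are illustrative only.

```python
total_articles = 85
high_quality_articles = 3
population = 42.4 * 1_000_000  # the population file reports millions

total_per_capita = total_articles / population
high_quality_per_capita = high_quality_articles / population

# Show nine decimal places, as in the notebook
print(f"{total_per_capita:.9f}")
print(f"{high_quality_per_capita:.9f}")
```

These agree with the 2.005e-06 and 7.1e-08 shown for Afghanistan in the table.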


## Step 2: Data Analysis/Results

In this sub-section, we summarize the results of the analysis by creating 6 tables,
- **Top 10 countries by coverage**
- **Bottom 10 countries by coverage** 
- **Top 10 countries by high quality**
- **Bottom 10 countries by high quality** 
- **Geographic regions by total coverage** 
- **Geographic regions by high quality coverage** 

### 2.1 Top 10 countries by coverage

In this section, we calculate the number of articles per person for each country and rank them in descending order. The top 10 countries with the highest total articles per capita will be displayed.

In [56]:
top_10_by_coverage = aggregated_politicians_population_df.sort_values(by='total_articles_per_capita', ascending=False).head(10)
display(top_10_by_coverage)

Unnamed: 0,country,region,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
4,Antigua and Barbuda,CARIBBEAN,33,0,100000.0,0.00033,0.0
51,Federated States of Micronesia,OCEANIA,14,0,100000.0,0.00014,0.0
93,Marshall Islands,OCEANIA,13,0,100000.0,0.00013,0.0
148,Tonga,OCEANIA,10,0,100000.0,0.0001,0.0
12,Barbados,CARIBBEAN,25,0,300000.0,8.3333e-05,0.0
124,Seychelles,EASTERN AFRICA,6,0,100000.0,6e-05,0.0
97,Montenegro,SOUTHERN EUROPE,36,3,600000.0,6e-05,5e-06
17,Bhutan,SOUTH ASIA,44,0,800000.0,5.5e-05,0.0
90,Maldives,SOUTH ASIA,33,1,600000.0,5.5e-05,1.667e-06
120,Samoa,OCEANIA,8,0,200000.0,4e-05,0.0


**Observations:**

A noticeable trend above is that small countries like Antigua and Barbuda and the Federated States of Micronesia dominate the list. The populations of these countries are very small, so even with a small number of articles, their articles per person are quite high.

### 2.2 Bottom 10 countries by coverage
We rank the countries in ascending order and display the 10 countries with the lowest number of total articles per capita.

In [57]:
bottom_10_by_coverage = aggregated_politicians_population_df.sort_values(by='total_articles_per_capita', ascending=True).head(10)
display(bottom_10_by_coverage)

Unnamed: 0,country,region,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
31,China,EAST ASIA,16,0,1411300000.0,1.1e-08,0.0
57,Ghana,WESTERN AFRICA,3,1,34100000.0,8.8e-08,2.9e-08
66,India,SOUTH ASIA,151,0,1428600000.0,1.06e-07,0.0
121,Saudi Arabia,WESTERN ASIA,5,2,36900000.0,1.36e-07,5.4e-08
162,Zambia,EASTERN AFRICA,3,0,20200000.0,1.49e-07,0.0
107,Norway,NORTHERN EUROPE,1,0,5500000.0,1.82e-07,0.0
70,Israel,WESTERN ASIA,2,0,9800000.0,2.04e-07,0.0
45,Egypt,NORTHERN AFRICA,32,1,105200000.0,3.04e-07,1e-08
37,Cote d'Ivoire,WESTERN AFRICA,10,0,30900000.0,3.24e-07,0.0
50,Ethiopia,EASTERN AFRICA,43,2,126500000.0,3.4e-07,1.6e-08


**Observations:**

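Here we see very populous countries like China and India. Their total articles per capita are very low because their large populations dwarf their article counts, even where the count itself is decent (India has 151 articles). The list also includes smaller countries such as Norway (1 article), Israel (2), Ghana (3), and Zambia (3), where the low ranking comes from the dataset containing very few articles rather than from population size.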

### 2.3 Top 10 countries by high quality
Here we have the countries with the highest number of high-quality articles (FA or GA class) per capita, ranked in descending order.

In [58]:
top_10_by_high_quality = aggregated_politicians_population_df.sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)
display(top_10_by_high_quality)

Unnamed: 0,country,region,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
97,Montenegro,SOUTHERN EUROPE,36,3,600000.0,6e-05,5e-06
86,Luxembourg,WESTERN EUROPE,27,2,700000.0,3.8571e-05,2.857e-06
1,Albania,SOUTHERN EUROPE,69,7,2700000.0,2.5556e-05,2.593e-06
76,Kosovo,SOUTHERN EUROPE,26,4,1700000.0,1.5294e-05,2.353e-06
90,Maldives,SOUTH ASIA,33,1,600000.0,5.5e-05,1.667e-06
38,Croatia,SOUTHERN EUROPE,64,5,3800000.0,1.6842e-05,1.316e-06
62,Guyana,SOUTH AMERICA,17,1,800000.0,2.125e-05,1.25e-06
110,Palestinian Territory,WESTERN ASIA,61,6,5500000.0,1.1091e-05,1.091e-06
85,Lithuania,NORTHERN EUROPE,57,3,2900000.0,1.9655e-05,1.034e-06
128,Slovenia,SOUTHERN EUROPE,38,2,2100000.0,1.8095e-05,9.52e-07


**Observations:**

Similar to what we observed in 2.1, many countries with small populations rank high here too, even though they have only a few high-quality articles.

### 2.4 Bottom 10 countries by high quality
Here we will show the countries with the fewest high-quality articles per capita, ranked in ascending order.

In [59]:
bottom_10_by_high_quality = aggregated_politicians_population_df.sort_values(by='high_quality_articles_per_capita', ascending=True).head(10)
display(bottom_10_by_high_quality)

Unnamed: 0,country,region,total_articles,high_quality_articles,population,total_articles_per_capita,high_quality_articles_per_capita
163,Zimbabwe,EASTERN AFRICA,69,0,16700000.0,4.132e-06,0.0
117,Qatar,WESTERN ASIA,5,0,2700000.0,1.852e-06,0.0
59,Grenada,CARIBBEAN,2,0,100000.0,2e-05,0.0
120,Samoa,OCEANIA,8,0,200000.0,4e-05,0.0
55,Gambia,WESTERN AFRICA,18,0,2800000.0,6.429e-06,0.0
122,Senegal,WESTERN AFRICA,31,0,18300000.0,1.694e-06,0.0
124,Seychelles,EASTERN AFRICA,6,0,100000.0,6e-05,0.0
51,Federated States of Micronesia,OCEANIA,14,0,100000.0,0.00014,0.0
49,Estonia,NORTHERN EUROPE,15,0,1400000.0,1.0714e-05,0.0
48,Eritrea,EASTERN AFRICA,15,0,3700000.0,4.054e-06,0.0


**Observations:**

Unlike in section 2.2, this table is not very informative: all of the bottom 10 countries have 0 high-quality articles, so they are tied at 0 and the order among them is arbitrary (it comes from the sort, not from any real difference). The list mixes small countries like Grenada and Samoa with larger ones like Zimbabwe and Senegal, which have many articles (69 and 31) but none of high quality.
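
One way to make the ordering among tied countries deterministic is to count the ties and add a secondary sort key. The sketch below uses four rows copied from the tables above. Breaking ties by `total_articles_per_capita` is my own assumption, not something the assignment specifies.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Zimbabwe", "Qatar", "Grenada", "Albania"],
    "total_articles": [69, 5, 2, 69],
    "high_quality_articles": [0, 0, 0, 7],
    "population": [16_700_000, 2_700_000, 100_000, 2_700_000],
})
toy["total_articles_per_capita"] = toy["total_articles"] / toy["population"]
toy["high_quality_articles_per_capita"] = toy["high_quality_articles"] / toy["population"]

# How many countries share the lowest possible value?
num_tied_at_zero = int((toy["high_quality_articles"] == 0).sum())

# Sort by the main metric first, then break ties by overall coverage (lowest first)
ranked = toy.sort_values(
    by=["high_quality_articles_per_capita", "total_articles_per_capita"],
    ascending=[True, True],
)
print(num_tied_at_zero)
print(ranked["country"].tolist())
```

Reporting the number of tied countries alongside the table also makes it clear that a "bottom 10" is only a sample of a larger group.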

### 2.5 Geographic regions by total coverage
Instead of individual countries, this table aggregates the data by geographic regions. It ranks regions based on the total number of articles per capita, in descending order.

In [60]:
# Compute the total coverage for a geographical region
regions_total_coverage = aggregated_politicians_population_df.groupby('region').agg({'total_articles': 'sum',
                                                                                     'population': 'sum'}).reset_index()

# Calculate the total articles per capita for each region
regions_total_coverage['total_articles_per_capita'] = regions_total_coverage['total_articles'] / regions_total_coverage['population']

# Sort by total articles per capita in descending order
regions_total_coverage_sorted = regions_total_coverage.sort_values(by='total_articles_per_capita', ascending=False)
display(regions_total_coverage_sorted)


Unnamed: 0,region,total_articles,population,total_articles_per_capita
8,NORTHERN EUROPE,188,27800000.0,6.76259e-06
9,OCEANIA,71,11100000.0,6.396396e-06
0,CARIBBEAN,219,36600000.0,5.983607e-06
14,SOUTHERN EUROPE,785,151500000.0,5.181518e-06
1,CENTRAL AMERICA,186,51300000.0,3.625731e-06
17,WESTERN EUROPE,486,181300000.0,2.68064e-06
5,EASTERN EUROPE,701,266200000.0,2.633358e-06
16,WESTERN ASIA,607,295400000.0,2.054841e-06
13,SOUTHERN AFRICA,123,68300000.0,1.800878e-06
4,EASTERN AFRICA,663,480900000.0,1.378665e-06


**Observations:**

- Northern Europe has the highest number of articles per person, i.e., relative to its population, a lot of political content is available for it.
- Oceania and the Caribbean are just behind.
- On the other hand, East Asia has the lowest articles per person. Compared to its large population, it has very few politician articles.
- Though South Asia has a total of 667 articles, its population is over 2 billion, so the number of articles per person is quite low.
- Similarly, Eastern Africa and Western Africa have many articles, but their large populations leave them with low per-person coverage.
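
The regional figures are computed as the sum of articles divided by the sum of populations, which is not the same as averaging the per-country ratios. The sketch below shows the difference on just two Caribbean rows taken from the tables above, so the numbers do not represent the full region.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Antigua and Barbuda", "Barbados"],
    "region": ["CARIBBEAN", "CARIBBEAN"],
    "total_articles": [33, 25],
    "population": [100_000, 300_000],
})

# Ratio of sums: what the notebook computes for each region
by_region = toy.groupby("region").agg(
    total_articles=("total_articles", "sum"),
    population=("population", "sum"),
)
ratio_of_sums = by_region["total_articles"] / by_region["population"]

# Mean of ratios: gives every country equal weight, regardless of size
country_ratios = toy["total_articles"] / toy["population"]
mean_of_ratios = country_ratios.groupby(toy["region"]).mean()

print(ratio_of_sums["CARIBBEAN"], mean_of_ratios["CARIBBEAN"])
```

The ratio of sums weights each country by its population, so a region's value is dominated by its largest countries. Note also that the regional population in the cell above is summed over the countries that survived the merge, so countries without matched articles do not contribute to the denominator.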

### 2.6 Geographic regions by high quality coverage
This table is similar to the previous one, but it ranks regions by high-quality articles per capita, in descending order.

In [61]:
# Compute the total coverage for high-quality articles for a geographical region
regions_high_quality_coverage = aggregated_politicians_population_df.groupby('region').agg({'high_quality_articles': 'sum',
                                                                                            'population': 'sum'}).reset_index()

# Calculate high-quality articles per capita for each region
regions_high_quality_coverage['high_quality_articles_per_capita'] = regions_high_quality_coverage['high_quality_articles'] / regions_high_quality_coverage['population']

# Sort by high-quality articles per capita in descending order
regions_high_quality_coverage_sorted = regions_high_quality_coverage.sort_values(by='high_quality_articles_per_capita', ascending=False)
display(regions_high_quality_coverage_sorted)


Unnamed: 0,region,high_quality_articles,population,high_quality_articles_per_capita
14,SOUTHERN EUROPE,53,151500000.0,3.49835e-07
8,NORTHERN EUROPE,8,27800000.0,2.877698e-07
0,CARIBBEAN,9,36600000.0,2.459016e-07
1,CENTRAL AMERICA,10,51300000.0,1.949318e-07
5,EASTERN EUROPE,37,266200000.0,1.389932e-07
13,SOUTHERN AFRICA,8,68300000.0,1.171303e-07
17,WESTERN EUROPE,21,181300000.0,1.158301e-07
16,WESTERN ASIA,27,295400000.0,9.140149e-08
9,OCEANIA,1,11100000.0,9.009009e-08
2,CENTRAL ASIA,5,80400000.0,6.218905e-08


**Observations:**

Mostly similar to what we observed in the last section (2.5), with some changes in the order:
- Southern Europe is ranked 1 for having the highest number of high-quality articles per person, with Northern Europe second. Relative to their population size, these regions have many high-quality articles.
- The Caribbean also ranks high (third), but Oceania drops to ninth because it has only 1 high-quality article.
- Regions like East Asia and South Asia continue to show a low number of high-quality articles per person because of their larger populations. This could indicate that political content about these regions is less accessible on English Wikipedia.
- Similarly, although regions like Eastern Africa and Western Africa have many articles overall, their large populations keep them from ranking highly here.

## 3. Conclusion

We were able to generate the tables and observations above; however, we did observe some bias.

### 3.1 Bias 

The observations in Section 2 indicate a bias in how politicians from different countries and regions are covered on Wikipedia.

The majority of articles analyzed came from English Wikipedia. To mitigate this, I included six additional articles from different languages to help reduce bias related to language. Since English is not the primary language in many Eastern countries, English Wikipedia may lean more towards Western perspectives, leading to a skewed representation of political information.

Before I began my analysis, I expected to find a disproportionate distribution of articles produced by wealthier countries compared to others. I suspected that they would have more articles, and to a certain extent, I was right. However, it was only when we examined the per-capita comparisons that we encountered a different set of bias complications.

Another issue is the "data gap." If someone relies solely on Wikipedia to analyze political engagement, there’s a significant chance they will overlook many viewpoints and events, as a single platform can never portray 100% of what is happening globally. While Wikipedia can definitely serve as a starting point or be useful in the early stages of a statistical analysis, we must be mindful of its limitations.

### 3.2 Solution

We could include more data from other language versions of Wikipedia or even from different platforms that focus on political discussions in underrepresented regions, particularly in Eastern countries. This would enhance reliability by reducing the bias that was previously introduced.