# Homework 2: Considering Bias in Data

In this project, we look at the Wikipedia articles of politicians from around the world. We combine a dataset of country/region populations with a dataset of politician articles and their predicted quality, then analyze how article coverage and quality vary by country and region.

# Step 1: Getting the Article and Population Data

This data has already been collected for us and can be found in politicians_by_country_AUG.2024.csv and population_by_country_AUG.2024.csv. The first file contains a list of politicians, their article link, and their country while the second file contains a list of countries/regions and their respective populations (in millions).

# Step 2: Getting Article Quality Predictions

In this step, we will use the ORES machine learning system to get a quality prediction for each politician's article. To do this, we first need the most recent revision of each article, identified by a revision id. We then pass that revision id to the ORES system to get a quality prediction. We will also keep track of all articles that could not be given a quality score for any reason.
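
As a minimal sketch of the first half of that flow, the helper below pulls the latest revision id out of a page info response. The function name and the sample responses are assumptions made for illustration. The sample is trimmed to the fields used later in this notebook ('query', 'pages', 'lastrevid'), and a page key of '-1' is treated as a missing article, as in the data collection cell further down.

```python
def latest_revision_id(pageinfo_response):
    """Return the 'lastrevid' of the single page in a page info response, or None."""
    if not pageinfo_response:
        return None
    pages = pageinfo_response.get("query", {}).get("pages", {})
    # A page key of '-1' is how a missing article shows up in the response
    if not pages or "-1" in pages:
        return None
    page = next(iter(pages.values()))
    return page.get("lastrevid")


# Hypothetical responses, trimmed to the fields used here
sample = {"query": {"pages": {"12345": {"title": "Bison", "lastrevid": 1085687913}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(latest_revision_id(sample))
print(latest_revision_id(missing))
```

Returning None for every failure case lets the caller decide whether to skip the article or record it as unscored.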

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

import pandas as pd

The following code block is from the wp_page_info_example notebook.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'ashwin19@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
# PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The following code block is also from the wp_page_info_example notebook and defines a function to make the API call to the page info API.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
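
One thing to be aware of in the function above: it writes the title into request_template, which by default is the shared PAGEINFO_PARAMS_TEMPLATE dictionary, so the title from one call stays in the template for the next call. The sketch below shows one way to avoid that by copying the template first. The helper name build_pageinfo_params is hypothetical and not part of the example notebook.

```python
import copy

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "",
}


def build_pageinfo_params(article_title, template=PAGEINFO_PARAMS_TEMPLATE):
    """Return request parameters for one title without touching the shared template."""
    if not article_title:
        raise ValueError("Must supply an article title to make a pageinfo request.")
    params = copy.deepcopy(template)  # work on a copy so the template stays unchanged
    params["titles"] = article_title
    return params


params = build_pageinfo_params("Bison")
print(params["titles"])
print(repr(PAGEINFO_PARAMS_TEMPLATE["titles"]))
```

The resulting dictionary could be passed as the params argument of requests.get in the same way the function above passes request_template.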


The following code block defines constants used for making ORES API calls.

In [4]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
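
The throttle arithmetic in the two CONSTANTS cells can be checked with a small sketch. The function name throttle_wait is an assumption for illustration; the numbers are the ones used above (100 requests per second for the page info API, 5000 requests per hour for the ORES access token, 2 ms assumed latency). Note that this cell reassigns API_THROTTLE_WAIT, and both request functions read that global when they are called, so page info requests made after this cell runs use the longer ORES wait as well.

```python
def throttle_wait(requests_allowed, period_seconds, assumed_latency=0.002):
    """Seconds to sleep before each request so the request rate stays within the allowance."""
    return max(0.0, period_seconds / requests_allowed - assumed_latency)


# Page info API: 100 requests per second, roughly 0.008 seconds between requests
print(throttle_wait(100, 1.0))

# ORES via LiftWing: 5000 requests per hour, roughly 0.718 seconds between requests
print(throttle_wait(5000, 60.0 * 60.0))
```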

The following cell needs to be updated with Wikipedia API credentials!

In [5]:
# Add your credentials into this code block and uncomment it, otherwise the API calls cannot be made

# USERNAME = <Put username here>
# ACCESS_TOKEN = <Put access token here>

The following code block defines the function that makes the ORES API calls.

In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
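
The JSON returned by this function is nested several levels deep. As a sketch, the helper below extracts only the predicted quality class, following the same key path that the data collection cell uses ('enwiki', 'scores', the revision id as a string, 'articlequality', 'score', 'prediction'). The helper name and the sample response, including its "GA" value, are made up for illustration.

```python
def extract_quality_prediction(ores_response, rev_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class for rev_id, or None if the response has no score."""
    try:
        return ores_response[wiki]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: the response has a different shape (for example an error payload)
        # TypeError: the response is None because the request itself failed
        return None


# Hypothetical response, trimmed to the fields used here
sample = {
    "enwiki": {
        "scores": {
            "1085687913": {"articlequality": {"score": {"prediction": "GA"}}}
        }
    }
}

print(extract_quality_prediction(sample, 1085687913))
print(extract_quality_prediction({"error": "something went wrong"}, 1085687913))
```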


This code block reads the politicians file, and for each article, looks up the corresponding ORES score for the most recent revision. This data is then stored into a file called article_quality_no_region.txt in csv format. 

NOTE: This code block takes a very long time to run so the article_quality_no_region.txt file has been already created and is ready for use in future code blocks. This code block can be skipped.

In [None]:
# SKIP THIS CODE BLOCK, THE OUTPUT HAS ALREADY BEEN CREATED AND IS READY FOR USE
df_politicians = pd.read_csv('politicians_by_country_AUG.2024.csv')
with open('article_quality_no_region.txt', 'a', encoding='utf-8') as output:
    output.write('article_title,revision_id,article_quality,country,population,region\n')
    for index, row in df_politicians.iterrows():
        name = row['name']
        print(name)
        country = row['country']

        try:
            # Request page data
            info = request_pageinfo_per_article(name)['query']['pages']
            if '-1' in info:
                continue

            rev_id = info[next(iter(info))]['lastrevid']
            # Use the revision id to get ORES rating
            score = request_ores_score_per_article(article_revid=rev_id,
                   email_address="ashwin19@uw.edu",
                   access_token=ACCESS_TOKEN)
            if 'enwiki' in score:
                # Write output line to file
                quality = score['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']
                output.write('"' + name + '",' + str(rev_id) + "," + quality + "," + '"' + country + '",' + "," + "\n")
                output.flush()
        except Exception as e:
            # Report which article failed and why, then move on to the next one
            print("ERROR", name, e)
print('FINISHED')
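
The cell above builds each output line by string concatenation, which works as long as a name or country contains no double quote character. A minimal alternative sketch uses Python's csv module, which quotes and escapes fields as needed. The function name and the sample row (including the made-up politician name) are assumptions for illustration; the column names match the header written above.

```python
import csv
import io

FIELDS = ["article_title", "revision_id", "article_quality", "country", "population", "region"]


def write_quality_rows(rows, output):
    """Write the header and one CSV line per (name, rev_id, quality, country) tuple."""
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(FIELDS)
    for name, rev_id, quality, country in rows:
        # population and region are left empty here; Step 3 fills them in
        writer.writerow([name, rev_id, quality, country, "", ""])


# Write to an in-memory buffer, then read it back to check the round trip
buffer = io.StringIO()
write_quality_rows([('Jane "JD" Doe', 1, "Stub", "Korea, South")], buffer)
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1])
```

Reading the buffer back returns the original name and country even though they contain a double quote and a comma.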

The following code block computes the list of articles that could not be paired with an ORES quality prediction. This is done by comparing the politicians csv file with the article_quality csv file produced in the prior code block. The displayed error rate is the number of articles for which we could not get a score divided by the total number of articles, expressed as a percentage.

In [7]:
df1 = pd.read_csv('politicians_by_country_AUG.2024.csv')
df2 = pd.read_csv('article_quality_no_region.txt')

not_in_df2 = df1[~df1['name'].isin(df2['article_title'])]

print("List of Articles without ORES quality:")
print(not_in_df2['name'])

error_rate = round(not_in_df2.shape[0] / df1.shape[0], 5) * 100
print("\nError Rate: " + str(error_rate) + " %")


List of Articles without ORES quality:
187                   Abderrahmane Meziane Chérif
430                        Barbara Eibinger-Miedl
516                               Mehrali Gasimov
1200                                   Kyaw Myint
1342                       André Ngongang Ouandji
1955                               Tomás Pimentel
2427                                Richard Sumah
2430    Sofoklis Avraam Choudaverdoglou-Theodotos
2431                           Christos Daralexis
2899                                      S. Kabo
3088                                   Tariq Najm
3878                      Grzegorz Antoni Ogiński
4496                   Segun ''Aeroland'' Adewale
5719                              Bashir Bililiqo
Name: name, dtype: object

Error Rate: 0.196 %
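
The same calculation can be sketched without pandas on a toy example. The titles below are placeholders and the function name is hypothetical. The percent format specifier does the multiplication by 100 and the rounding in one step, which avoids stray floating point digits in the printed value.

```python
def missing_rate(all_titles, scored_titles):
    """Return the titles that have no score and their share of all titles."""
    scored = set(scored_titles)
    missing = [title for title in all_titles if title not in scored]
    return missing, len(missing) / len(all_titles)


# Placeholder titles: four articles requested, three scored
all_titles = ["A", "B", "C", "D"]
scored_titles = ["A", "B", "D"]

missing, rate = missing_rate(all_titles, scored_titles)
print(missing)
print(f"Error Rate: {rate:.3%}")
```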


# Step 3: Combining the Datasets

In this step, we combine the region and population data with the politician data. The population and region data is found in the following file: population_by_country_AUG.2024.csv. Each politician row gets a population value and a region. Countries that don't have matches are recorded and saved in the following file: wp_countries-no_match.txt. The new data frame containing all the joined data is saved in csv format to the following file: wp_politicians_by_country.csv. The later analysis sections rely on this joined data file, which has already been created in this repository.

The following code block parses the population file and helps keep track of which countries belong to each region.

In [13]:
region_map = {}
country_map = {}
with open('population_by_country_AUG.2024.csv', 'r') as country_file:
    # Iterate through each line in the file
    current_region = ""
    for line in country_file:
        if 'Geography' in line:
            continue
            
        # remove white space and isolate each component
        terms = line.strip().split(',')
        place = terms[0]
        pop = terms[1]
        
        if place.isupper(): # Region
            current_region = place
            region_map[place] = (pop, False)
        else: # Country
            country_map[place] = (pop, current_region, False)
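
The parsing rule above is that a row whose name is all uppercase starts a new region, and every following row belongs to that region until the next uppercase row. A minimal sketch of the same rule, using the csv module and an inline sample, is shown below. The function name is hypothetical, and the sample rows and header are illustrative rather than copied from the real population file.

```python
import csv


def parse_population(lines):
    """Split population rows into region totals and per-country (population, region) entries."""
    region_population = {}
    country_info = {}
    current_region = ""
    for row in csv.reader(lines):
        # Skip the header row and any blank or incomplete rows
        if len(row) < 2 or row[0] == "Geography":
            continue
        place, population = row[0].strip(), float(row[1])
        if place.isupper():
            # All-uppercase names mark a region; later countries belong to it
            current_region = place
            region_population[place] = population
        else:
            country_info[place] = (population, current_region)
    return region_population, country_info


# Illustrative sample; values are placeholders, in millions
sample = """Geography,Population
CARIBBEAN,44
Antigua and Barbuda,0.1
Barbados,0.3
SOUTHERN EUROPE,152
Montenegro,0.6
"""

regions, countries = parse_population(sample.splitlines())
print(regions)
print(countries["Montenegro"])
```

Converting the population to a float while parsing means later cells do not need to cast it again before dividing.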

            

The following code block iterates through the data frame and adds the corresponding region and population data to each row. Countries without matches are stored and sent to another file as output. The newly joined data frame is then written to a file as csv output.

In [14]:
countries_not_mapped_1 = []
countries_not_mapped_2 = []
df_quality = pd.read_csv('article_quality_no_region.txt')

# Iterate through the data frame and add in population and region data
for index, row in df_quality.iterrows():
    country = row['country']
    if country in country_map:
        mapping = country_map[country]
        df_quality.loc[index, 'population'] = mapping[0]
        df_quality.loc[index, 'region'] = mapping[1]
        country_map[country] = (mapping[0], mapping[1], True)
    else:
        # Keep track of countries with no matches
        if country not in countries_not_mapped_1:
            countries_not_mapped_1.append(country)

# Keep track of countries in the dictionary with no matches
for key in country_map:
    if not country_map[key][2] and key not in countries_not_mapped_2:
        countries_not_mapped_2.append(key)
    
# Print and write the non-matches to a file
print(countries_not_mapped_1)
print(countries_not_mapped_2)

with open('wp_countries-no_match.txt', 'w') as file:
    # Write each element on a new line
    for element in countries_not_mapped_1:
        file.write(f"{element}\n")
    for element in countries_not_mapped_2:
        file.write(f"{element}\n")
        
# Write the updated data frame to a file
df_quality.to_csv('wp_politicians_by_country.csv', index=False)

['Guinea-Bissau', 'Korean', 'Korea, South']
['Western Sahara', 'GuineaBissau', 'Mauritius', 'Mayotte', 'Reunion', 'Sao Tome and Principe', 'eSwatini', 'Canada', 'United States', 'Mexico', 'Curacao', 'Dominica', 'Guadeloupe', 'Jamaica', 'Martinique', 'Puerto Rico', 'French Guiana', 'Suriname', 'Georgia', 'Brunei', 'Philippines', 'China (Hong Kong SAR)', 'China (Macao SAR)', 'Korea (North)', 'Korea (South)', 'Denmark', 'Iceland', 'Ireland', 'United Kingdom', 'Liechtenstein', 'Netherlands', 'Romania', 'Andorra', 'San Marino', 'Australia', 'Fiji', 'French Polynesia', 'Guam', 'Kiribati', 'Nauru', 'New Caledonia', 'New Zealand', 'Palau']
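
The first list printed above holds countries that appear in the article data but not in the population file, and the second holds countries in the population file that have no articles. Some entries differ only in spelling across the two sources, for example 'Guinea-Bissau' and 'GuineaBissau', or 'Korea, South' and 'Korea (South)'. The two-way comparison can be sketched with set differences; the function name is hypothetical and the inputs are a small subset chosen for illustration.

```python
def unmatched_both_ways(article_countries, population_countries):
    """Return (names only in the article data, names only in the population data)."""
    articles = set(article_countries)
    population = set(population_countries)
    return sorted(articles - population), sorted(population - articles)


# Small illustrative subset of the country names in each source
article_countries = ["Guinea-Bissau", "Korea, South", "Montenegro"]
population_countries = ["GuineaBissau", "Korea (South)", "Montenegro", "Canada"]

only_in_articles, only_in_population = unmatched_both_ways(article_countries, population_countries)
print(only_in_articles)
print(only_in_population)
```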


# Step 4: Analysis

Now that we have constructed our data set (found in wp_politicians_by_country.csv), we will create a series of tables in Step 5. The tables are all based on per capita metrics, but since our population data is provided in millions, we calculate "per 1 million people" metrics instead. Some of these tables also involve the concept of "high quality" articles. In this project, we define high quality articles as those receiving an ORES prediction of "FA" (Featured Article) or "GA" (Good Article). Pandas data frames support many table operations that are helpful in calculating these metrics.
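
As a small worked sketch of these two definitions, the functions below compute a per-million metric and count high quality articles. The function names and the sample numbers are made up for illustration. Pandas returns inf when a count is divided by a population of zero, which is presumably what happens for very small countries whose population rounds to 0.0 million; plain Python raises an error instead, so the sketch handles that case explicitly.

```python
import math

HIGH_QUALITY = {"FA", "GA"}


def per_million(article_count, population_millions):
    """Articles per 1 million people; inf when the population is recorded as 0."""
    if population_millions == 0:
        return math.inf
    return article_count / population_millions


def count_high_quality(predictions):
    """Number of predictions that are 'FA' or 'GA'."""
    return sum(1 for prediction in predictions if prediction in HIGH_QUALITY)


# Hypothetical country with 5 articles and a population of 2.5 million
predictions = ["FA", "Stub", "GA", "C", "Start"]

print(per_million(len(predictions), 2.5))                 # 5 / 2.5 = 2.0
print(per_million(count_high_quality(predictions), 2.5))  # 2 / 2.5 = 0.8
print(per_million(3, 0.0))                                # inf
```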

# Step 5: Results

In this section, we will be creating 6 tables from the data we collected and cleaned in the prior steps. 

Table 1: The 10 countries with the highest total articles per capita (in descending order).

Table 2: The 10 countries with the lowest total articles per capita (in ascending order).

Table 3: The 10 countries with the highest high quality articles per capita (in descending order).

Table 4: The 10 countries with the lowest high quality articles per capita (in ascending order).

Table 5: A rank ordered list of geographic regions (in descending order) by total articles per capita.

Table 6: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

Read in the dataset we created in the prior steps and remove the rows without population data.

In [15]:
df = pd.read_csv('wp_politicians_by_country.csv')
df_filtered = df.dropna(subset=['population'])

Create table 1 and table 2 (See Step 5 description for table details).

In [16]:
# Group and collapse the data by country
articles_per_country = df_filtered.groupby('country').size().reset_index(name='total_articles')
articles_per_country = articles_per_country.merge(df_filtered[['country', 'population']].drop_duplicates(), on='country')

# Calculate the articles per capita metric and make a new column for it
articles_per_country['articles_per_million'] = articles_per_country['total_articles'] / articles_per_country['population']

# Display table 1
top_10_countries = articles_per_country.sort_values(by='articles_per_million', ascending=False).head(10)
print("10 highest ranking countries by Articles per 1 Million People (Inf indicates that country population is significantly less than 1 million)")
print(top_10_countries[['country', 'articles_per_million']].to_string(index=False))

print()

# Display table 2 
bottom_10_countries = articles_per_country.sort_values(by='articles_per_million', ascending=True).head(10)
print("10 lowest ranking countries by Articles per 1 Million People")
print(bottom_10_countries[['country', 'articles_per_million']].to_string(index=False))

10 highest ranking countries by Articles per 1 Million People (Inf indicates that country population is significantly less than 1 million)
                       country  articles_per_million
                        Monaco                   inf
                        Tuvalu                   inf
           Antigua and Barbuda            330.000000
Federated States of Micronesia            140.000000
              Marshall Islands            130.000000
                         Tonga            100.000000
                      Barbados             83.333333
                    Montenegro             60.000000
                    Seychelles             60.000000
                      Maldives             55.000000

10 lowest ranking countries by Articles per 1 Million People
      country  articles_per_million
        China              0.011337
        Ghana              0.087977
        India              0.105698
 Saudi Arabia              0.135501
       Zambia              0.148515


Create table 3 and table 4 (See Step 5 description for table details).

In [17]:
# Filter the data to just the high quality articles
high_quality_df = df_filtered[df_filtered['article_quality'].isin(['FA', 'GA'])]

# Group and collapse the data by country.
# Calculate the high quality articles per capita metric and make a new column for it.
high_quality_per_country = high_quality_df.groupby('country').size().reset_index(name='high_quality_articles')
high_quality_per_country = high_quality_per_country.merge(df_filtered[['country', 'population']].drop_duplicates(), on='country')
high_quality_per_country['high_quality_articles_per_million'] = high_quality_per_country['high_quality_articles'] / high_quality_per_country['population']

# Display table 3
top_10_high_quality_countries = high_quality_per_country.sort_values(by='high_quality_articles_per_million', ascending=False).head(10)
print("10 highest ranking countries by High Quality Articles per 1 Million People")
print(top_10_high_quality_countries[['country', 'high_quality_articles_per_million']].to_string(index=False))

print()

# Display table 4
bottom_10_high_quality_countries = high_quality_per_country.sort_values(by='high_quality_articles_per_million', ascending=True).head(10)
print("10 lowest ranking countries by High Quality Articles per 1 Million People")
print(bottom_10_high_quality_countries[['country', 'high_quality_articles_per_million']].to_string(index=False))

10 highest ranking countries by High Quality Articles per 1 Million People
              country  high_quality_articles_per_million
           Montenegro                           5.000000
           Luxembourg                           2.857143
              Albania                           2.592593
               Kosovo                           2.352941
             Maldives                           1.666667
            Lithuania                           1.379310
              Croatia                           1.315789
               Guyana                           1.250000
Palestinian Territory                           1.090909
             Slovenia                           0.952381

10 lowest ranking countries by High Quality Articles per 1 Million People
   country  high_quality_articles_per_million
Bangladesh                           0.005764
     Egypt                           0.009506
  Ethiopia                           0.015810
     Japan                           0.016064
  Paki

Create table 5 and table 6 (See Step 5 description for table details).

In [18]:
# Group the articles by region. Add up the article counts in each region using aggregation.
articles_per_region = df_filtered.groupby('region').agg(
    total_articles=('article_title', 'size')).reset_index()

# Add in the region population data
for index, row in articles_per_region.iterrows():
    articles_per_region.loc[index, 'population_in_millions'] = int(region_map[row['region']][0])

# Calculate articles per million
articles_per_region['articles_per_million'] = articles_per_region['total_articles'] / articles_per_region['population_in_millions']

# Sort
articles_per_region_sorted = articles_per_region.sort_values(by='articles_per_million', ascending=False)

# Display table 5
print("Ranked ordering of Geographic Regions by Articles per 1 Million People")
print(articles_per_region_sorted[['region', 'total_articles', 'population_in_millions', 'articles_per_million']].to_string(index=False))


print()

# Remove articles that are not "high quality"
high_quality_filtered = df_filtered[df_filtered['article_quality'].isin(['FA', 'GA'])]

# Group the articles by region. Add up the article counts in each region using aggregation.
high_quality_per_region = high_quality_filtered.groupby('region').agg(
    high_quality_articles=('article_title', 'size')).reset_index()

# Add in the region population data
for index, row in high_quality_per_region.iterrows():
    high_quality_per_region.loc[index, 'population_in_millions'] = int(region_map[row['region']][0])

# Calculate high-quality articles per million
high_quality_per_region['high_quality_articles_per_million'] = high_quality_per_region['high_quality_articles'] / high_quality_per_region['population_in_millions']

# Sort
high_quality_per_region_sorted = high_quality_per_region.sort_values(by='high_quality_articles_per_million', ascending=False)

# Display table 6
print("Ranked ordering of Geographic Regions by High Quality Articles per 1 Million People")
print(high_quality_per_region_sorted[['region', 'high_quality_articles', 'population_in_millions', 'high_quality_articles_per_million']].to_string(index=False))
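
Note that the regional metric divides by the region's own population row from the population file, not by the sum of the populations of the countries that have articles. A minimal sketch of that calculation without pandas is shown below, using the article counts and populations reported for two regions in the Table 5 output. The function name is hypothetical.

```python
from collections import Counter


def articles_per_million_by_region(article_regions, region_population_millions):
    """Return (region, count, population, per_million) rows sorted by per_million, descending."""
    counts = Counter(article_regions)
    rows = [
        (region, count, region_population_millions[region],
         count / region_population_millions[region])
        for region, count in counts.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# One list entry per article, labelled with the article's region
article_regions = ["CARIBBEAN"] * 218 + ["SOUTHERN EUROPE"] * 795
region_population = {"SOUTHERN EUROPE": 152, "CARIBBEAN": 44}

table = articles_per_million_by_region(article_regions, region_population)
for region, count, population, rate in table:
    print(f"{region:>16} {count:>6} {population:>6} {rate:.6f}")
```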


Ranked ordering of Geographic Regions by Articles per 1 Million People
         region  total_articles  population_in_millions  articles_per_million
SOUTHERN EUROPE             795                   152.0              5.230263
      CARIBBEAN             218                    44.0              4.954545
 WESTERN EUROPE             497                   199.0              2.497487
 EASTERN EUROPE             709                   285.0              2.487719
   WESTERN ASIA             608                   299.0              2.033445
NORTHERN EUROPE             190                   108.0              1.759259
SOUTHERN AFRICA             123                    70.0              1.757143
        OCEANIA              72                    45.0              1.600000
 EASTERN AFRICA             664                   483.0              1.374741
  SOUTH AMERICA             569                   426.0              1.335681
   CENTRAL ASIA             106                    80.0              1.