# DATA 512: Human-Centered Data Science
## Homework 2: Considering Bias in Data

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary among countries. Your analysis will consist of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.


# 0. Setup

Importing required packages, and setting up functions required for analysis

In [1]:
import json, time, requests, urllib.parse
import pandas as pd
from tqdm.autonotebook import tqdm

import logging
logging.basicConfig(filename="errors.log", level=logging.ERROR)



## Defining Article Page Info MediaWiki API parameters and function

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<hbaghar@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    # Copy the template so the shared default dictionary is not modified between calls
    request_params = dict(request_template)
    request_params['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        # Record the exception itself so failures can be diagnosed from errors.log
        logging.error("Error making request for article title: {}. Error raised: {}".format(article_title, e))
        json_response = None
    return json_response

## Defining ORES API parameters and function

In [4]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<hbaghar@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
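    except Exception as e:
        # Log the failure instead of silently discarding it
        logging.error("Error making ORES request for revision: {}. Error raised: {}".format(article_revid, e))
        json_response = None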
    return json_response

# 1. Getting Article and Population Data

The first step is getting the data, which lives in several different places. You will need data that lists Wikipedia articles of politicians and data for country populations.
The Wikipedia Category:Politicians by nationality was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as [`politicians_by_country.SEPT.2022.csv`](https://docs.google.com/spreadsheets/u/0/d/1Y4vSTYENgNE5KltqKZqnRQQBQZN5c8uKbSM4QTt8QGg/edit).

The population data is available in CSV format as [`population_by_country_2022.csv`](https://docs.google.com/spreadsheets/u/0/d/1POuZDfA1sRooBq9e1RNukxyzHZZ-nQ2r6H5NcXhsMPU/edit) from the homework folder. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

**Some Considerations**

You should be a little careful with the data. Crawling Wikipedia categories to identify relevant page subsets can result in misleading and/or duplicate category labels. Naturally, the data crawl attempted to resolve these, but not all may have been caught. You should document how you handle any data inconsistencies.

The population_by_country_2022.csv contains some rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in `politicians_by_country.SEPT.2022.csv`, but you will want to retain some of them so that you can report coverage and quality by region as specified in the analysis section below.
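One way to surface the duplicate labels mentioned above is to flag every article title that appears more than once before deciding how to handle it. The following is a minimal sketch on a made-up table; the names `toy_articles`, `duplicated_rows`, and `deduplicated` are hypothetical, and keeping the first occurrence is only one possible policy.

```python
import pandas as pd

# Toy stand-in for the politicians table; all rows are made up for illustration.
toy_articles = pd.DataFrame(
    {
        "name": ["Person A", "Person B", "Person A", "Person C"],
        "country": ["Chad", "Chad", "Niger", "Peru"],
    }
)

# keep=False flags every row whose title appears more than once,
# which makes the conflicting country labels easy to inspect.
duplicated_rows = toy_articles[toy_articles.duplicated(subset="name", keep=False)]
print(duplicated_rows)

# One possible policy (an assumption, not the assignment's rule):
# keep only the first occurrence of each title.
deduplicated = toy_articles.drop_duplicates(subset="name", keep="first")
print(deduplicated)
```

Whichever policy is chosen, recording the number of affected rows makes the decision easy to document.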


## 1.1 Loading list of articles that need to be scored

In [6]:
articles = pd.read_csv("https://docs.google.com/spreadsheets/d/1Y4vSTYENgNE5KltqKZqnRQQBQZN5c8uKbSM4QTt8QGg/export?gid=1672307727&format=csv")
articles.head()

index,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


## 1.2 Loading population dataset and processing it

The population file lists regions and countries in the same column, one per row. We want to map each country to its region in a separate column instead.

In [7]:
population = pd.read_csv("https://docs.google.com/spreadsheets/d/1POuZDfA1sRooBq9e1RNukxyzHZZ-nQ2r6H5NcXhsMPU/export?gid=1154770218&format=csv")
population.head()

index,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [8]:
population["Region"] = population["Geography"]
population.loc[~(population["Region"].str.isupper()), "Region"] = pd.NA
# Forward-fill region names down to the country rows (fillna(method="ffill") is deprecated in recent pandas)
population["Region"] = population["Region"].ffill()
population.rename({"Geography": "Country"}, axis=1, inplace=True)
population_totals = population.loc[(population["Region"] == population["Country"]), ["Region", "Population (millions)"]]
population = population.loc[~((population["Region"] == population["Country"]))]
population = population[["Region","Country","Population (millions)"]]
population.head()

index,Region,Country,Population (millions)
3,NORTHERN AFRICA,Algeria,44.9
4,NORTHERN AFRICA,Egypt,103.5
5,NORTHERN AFRICA,Libya,6.8
6,NORTHERN AFRICA,Morocco,36.7
7,NORTHERN AFRICA,Sudan,46.9


In [9]:
population_totals.head()

index,Region,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
10,WESTERN AFRICA,430.0
27,EASTERN AFRICA,473.0
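The forward fill used above assigns each country to the nearest ALL CAPS row that precedes it, which is the subregion rather than the continent. A minimal sketch on a toy table illustrates this; the first six rows reuse values shown in the outputs above, while "Country X" and its population are hypothetical placeholders.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Geography": [
            "WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
            "WESTERN AFRICA", "Country X",  # "Country X" is a hypothetical placeholder
        ],
        "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5, 430.0, 1.0],
    }
)

# Region rows are the ones written entirely in upper case.
is_region = toy["Geography"].str.isupper()

# Keep the name only on region rows, then carry it down to the rows below.
toy["Region"] = toy["Geography"].where(is_region).ffill()

countries = toy[~is_region]
regions = toy[is_region]
print(countries[["Region", "Geography"]])
```

Because WORLD and AFRICA are immediately followed by another ALL CAPS row, no country is assigned to them directly; their totals are only available from the region rows.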


## 1.3 Getting the Most Recent Revision ID From The Article Page API

In [10]:
for name in tqdm(articles["name"]):
    try:
        page = request_pageinfo_per_article(name)
        pageid = str(list(page["query"]["pages"].keys())[0])
        revid = str(page["query"]["pages"][pageid]["lastrevid"])
        articles.loc[articles["name"] == name, "pageid"] = pageid
        articles.loc[articles["name"] == name, "revid"] = revid
    except KeyError:
        logging.error("Revision ID not found for article: {}".format(name))
    except Exception as e:
        # Log the exception itself; e.__cause__ is None unless the error was raised with "raise ... from ..."
        logging.error("Error making request for article: {}. Error raised: {}".format(name, e))

100%|██████████| 7584/7584 [28:59<00:00,  4.36it/s]
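The lookup in the loop above can also be written as a small helper that makes the "page not found" case explicit. This is a sketch under the assumption that the API reports an unknown title under a placeholder key such as "-1" with a "missing" field and no `lastrevid`; the function name `extract_page_and_revision` and the page ID in the sample are hypothetical.

```python
def extract_page_and_revision(page_json):
    """Return (pageid, lastrevid) as strings, or (None, None) if unavailable."""
    if not page_json:
        return None, None
    pages = page_json.get("query", {}).get("pages", {})
    for pageid, info in pages.items():
        # Assumed shape for unknown titles: a "missing" marker and no "lastrevid".
        if "missing" in info or "lastrevid" not in info:
            return None, None
        # Only one title is requested at a time, so the first entry is the answer.
        return str(pageid), str(info["lastrevid"])
    return None, None


# Hand-written sample responses (the page ID 12345 is made up for illustration).
found = {"query": {"pages": {"12345": {"title": "Bison", "lastrevid": 1085687913}}}}
not_found = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_page_and_revision(found))
print(extract_page_and_revision(not_found))
```

Returning `None` values instead of raising keeps the failure visible in the table as missing data, which matches how the missing rows are counted later.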


# 2. Get Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using ORES, a machine learning system that estimates Wikipedia article quality. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article
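The ranking above can be encoded as an ordered categorical so that labels can be compared directly. The sketch below also shows one way to pull a prediction out of a response shaped like the one the scoring cell indexes into; the helper `extract_prediction`, the constant `QUALITY_ORDER`, and the "GA" value in the sample response are hypothetical and only illustrate the structure.

```python
import pandas as pd

# Quality classes from worst to best, following the list above.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]


def extract_prediction(ores_json, revid, model="articlequality", context="enwiki"):
    """Return the predicted class, or None if the response lacks it."""
    try:
        return ores_json[context]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        return None


# Hand-written sample response; the "GA" prediction is a placeholder, not a real score.
sample = {
    "enwiki": {
        "scores": {"1085687913": {"articlequality": {"score": {"prediction": "GA"}}}}
    }
}
print(extract_prediction(sample, 1085687913))

# With an ordered categorical, "high quality" becomes a simple comparison.
labels = pd.Series(["Stub", "GA", "C"])
quality = pd.Categorical(labels, categories=QUALITY_ORDER, ordered=True)
is_high_quality = quality >= "GA"
print(list(is_high_quality))
```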


In [11]:
for name,revid in tqdm(zip(articles["name"],articles["revid"])):
    try:
        ores = request_ores_score_per_article(revid)
        ores = ores["enwiki"]["scores"][revid]["articlequality"]["score"]["prediction"]
        articles.loc[articles["revid"] == revid, "ores"] = ores
    except Exception as e:
        # Log the exception itself; e.__cause__ is None unless the error was raised with "raise ... from ..."
        logging.error("ORES score not found for article: {}. Error raised: {}".format(name, e))

articles.to_csv("articles.csv", index=False)

7584it [25:56,  4.87it/s]


In [13]:
articles.isna().mean()*100

name       0.0000
url        0.0000
country    0.0000
pageid     0.0923
revid      0.0923
ores       0.0923
dtype: float64

About 0.0923% of the articles (7 of 7,584) have no page ID, revision ID, or ORES score. Because this is such a small set of pages, we leave these errors unresolved.

# 3. Combining Datasets

In [64]:
articles = pd.read_csv("articles.csv")

In [65]:
combined_data = pd.merge(articles, population, how="outer", left_on="country", right_on="Country")
combined_data.shape

(7609, 9)

In [66]:
combined_data.isna().sum()

name                     25
url                      25
country                  25
pageid                   32
revid                    32
ores                     32
Region                   70
Country                  70
Population (millions)    70
dtype: int64

In [67]:
# In right table (population) only: rows where the articles fields are missing but Country is present
left_unmatched = combined_data.loc[combined_data["country"].isna(), "Country"].unique()

# In left table (articles) only: rows where the population fields are missing but country is present
right_unmatched = combined_data.loc[combined_data["Country"].isna(), "country"].unique()

In [68]:
import numpy as np

with open('wp_countries-no_match.txt', 'w') as f:
    for country in np.hstack((left_unmatched, right_unmatched)):
        f.write(country+"\n")
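An alternative way to find the unmatched countries is the `indicator=True` option of `pandas.merge`, which adds a `_merge` column saying which table each row came from. This is a minimal sketch on made-up tables, not the approach used above; the country spellings and population values are hypothetical.

```python
import pandas as pd

# Toy tables; names and numbers are placeholders for illustration only.
toy_articles = pd.DataFrame(
    {"name": ["A", "B", "C"], "country": ["Chad", "Chad", "Korean"]}
)
toy_population = pd.DataFrame(
    {"Country": ["Chad", "Korea (South)"], "Population (millions)": [1.0, 2.0]}
)

merged = toy_articles.merge(
    toy_population,
    how="outer",
    left_on="country",
    right_on="Country",
    indicator=True,  # adds a "_merge" column: left_only, right_only, or both
)

# Countries that appear only in the articles table.
articles_only = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
# Countries that appear only in the population table.
population_only = sorted(merged.loc[merged["_merge"] == "right_only", "Country"].unique())

print(articles_only)
print(population_only)
```

Reading the source of each row from `_merge` avoids inferring it from which columns happen to be empty.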


In [69]:
combined_data = combined_data.dropna(axis=0)
combined_data.rename(
    {
        "Population (millions)": "population",
        "Region":"region", 
        "revid":"revision_id", 
        "ores":"article_quality",
        "name":"article_title"
    }, axis=1, inplace=True)
combined_data[["country", "region", "population", "article_title", "revision_id", "article_quality"]].to_csv("wp_politicians_by_country.csv", index=False)

# 4. Analysis

## 4.1 Total Articles per Capita

In [120]:
combined_data = combined_data[["country", "region", "population", "article_title", "revision_id", "article_quality"]]
total_articles_by_country = combined_data.groupby("country").agg({"article_title":"count", "population":"max"}).reset_index()
total_articles_by_country.set_index("country", inplace=True)
total_articles_by_country

country,article_title,population
Afghanistan,118,41.1
Albania,83,2.8
Algeria,34,44.9
Andorra,10,0.1
Angola,42,35.6
...,...,...
Venezuela,62,28.3
Vietnam,27,99.4
Yemen,61,33.7
Zambia,13,20.0


In [121]:
total_articles_by_region = combined_data.groupby("region").agg({"article_title":"count"}).merge(population_totals, how="left", left_index=True, right_on="Region")
total_articles_by_region.set_index("Region", inplace=True)
total_articles_by_region

Region,article_title,Population (millions)
CARIBBEAN,201,44.0
CENTRAL AMERICA,195,178.0
CENTRAL ASIA,106,78.0
EAST ASIA,245,1674.0
EASTERN AFRICA,650,473.0
EASTERN EUROPE,735,287.0
MIDDLE AFRICA,203,196.0
NORTHERN AFRICA,227,251.0
NORTHERN EUROPE,262,107.0
OCEANIA,86,44.0


## 4.2 High Quality Articles per Capita

In [122]:
high_qual_counts = lambda x: x[x.isin(["FA","GA"])].shape[0]
high_quality_articles_by_country = combined_data.groupby("country").agg({"article_quality":high_qual_counts, "population":"max"})
high_quality_articles_by_country

country,article_quality,population
Afghanistan,6,41.1
Albania,6,2.8
Algeria,0,44.9
Andorra,2,0.1
Angola,0,35.6
...,...,...
Venezuela,0,28.3
Vietnam,2,99.4
Yemen,2,33.7
Zambia,0,20.0


In [123]:
high_quality_articles_by_region = combined_data.groupby("region").agg({"article_quality":high_qual_counts}).merge(population_totals, how="left", left_index=True, right_on="Region")
high_quality_articles_by_region.set_index("Region", inplace=True)
high_quality_articles_by_region

Region,article_quality,Population (millions)
CARIBBEAN,8,44.0
CENTRAL AMERICA,10,178.0
CENTRAL ASIA,3,78.0
EAST ASIA,16,1674.0
EASTERN AFRICA,15,473.0
EASTERN EUROPE,39,287.0
MIDDLE AFRICA,5,196.0
NORTHERN AFRICA,6,251.0
NORTHERN EUROPE,8,107.0
OCEANIA,2,44.0
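The assignment overview also describes quality as the proportion of a country's articles that are high quality, which is a different measure from the per capita counts computed here. A minimal sketch of that alternative on made-up data follows; the countries "X" and "Y" and their labels are hypothetical.

```python
import pandas as pd

# Toy article table; country names and quality labels are placeholders.
toy = pd.DataFrame(
    {
        "country": ["X", "X", "X", "X", "Y", "Y"],
        "article_quality": ["FA", "Stub", "GA", "C", "Start", "Stub"],
    }
)

# True for articles predicted as FA or GA, matching the definition used above.
is_high = toy["article_quality"].isin(["FA", "GA"])

# The mean of a boolean column is the share of True values,
# i.e. high quality articles divided by all articles per country.
proportion = is_high.groupby(toy["country"]).mean()
print(proportion)
```

Unlike the per capita figure, this proportion does not depend on population, so it cannot produce infinite values for countries whose population rounds to zero.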


# 5. Results

## 5.1 Top 10 countries by coverage

The 10 countries with the most articles per capita, in descending order. Population is recorded in millions, so the values are articles per million people.

In [124]:
total_articles_by_country["article_per_capita"] = total_articles_by_country["article_title"]/total_articles_by_country["population"]
total_articles_by_country = total_articles_by_country.loc[total_articles_by_country["article_per_capita"] != np.inf,:]
total_articles_by_country["article_per_capita"].sort_values(ascending=False).head(10)

country
Antigua and Barbuda               170.000000
Federated States of Micronesia    130.000000
Andorra                           100.000000
Barbados                           93.333333
Marshall Islands                   90.000000
Montenegro                         60.000000
Seychelles                         60.000000
Luxembourg                         52.857143
Bhutan                             51.250000
Grenada                            50.000000
Name: article_per_capita, dtype: float64
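The figures in this table follow directly from dividing the article count by the population in millions. The sketch below reproduces the calculation for Andorra and Albania using the counts and populations shown in section 4.1, and adds a hypothetical row with a population of 0.0 to show why infinite values are filtered out; the column name `articles_per_million` is an illustrative choice.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "article_title": [10, 83, 5],
        "population": [0.1, 2.8, 0.0],  # population in millions
    },
    # The last row is a hypothetical country whose population rounds to 0.0 million.
    index=["Andorra", "Albania", "Hypothetical microstate"],
)

# Articles per million people: 10 / 0.1 = 100 for Andorra.
toy["articles_per_million"] = toy["article_title"] / toy["population"]

# Dividing by a population of 0.0 gives inf, so keep only finite values.
finite = toy[np.isfinite(toy["articles_per_million"])]
print(finite["articles_per_million"].sort_values(ascending=False))
```

A value of 100 articles per million people corresponds to one article per 10,000 people.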

## 5.2 Bottom 10 countries by coverage

The 10 countries with the lowest total articles per capita (in ascending order) 

In [125]:
total_articles_by_country["article_per_capita"].sort_values(ascending=True).head(10)

country
China           0.001392
Mexico          0.007843
Saudi Arabia    0.081744
Romania         0.105263
India           0.125600
Sri Lanka       0.133929
Egypt           0.135266
Ethiopia        0.202593
Taiwan          0.215517
Vietnam         0.271630
Name: article_per_capita, dtype: float64

## 5.3 Top 10 countries by high quality

The 10 countries with the most high-quality (FA or GA) articles per million people (in descending order)

In [126]:
high_quality_articles_by_country["article_per_capita"] = high_quality_articles_by_country["article_quality"]/high_quality_articles_by_country["population"]
high_quality_articles_by_country = high_quality_articles_by_country.loc[high_quality_articles_by_country["article_per_capita"] != np.inf,:]
high_quality_articles_by_country["article_per_capita"].sort_values(ascending=False).head(10)

country
Andorra                  20.000000
Montenegro                5.000000
Albania                   2.142857
Suriname                  1.666667
Bosnia-Herzegovina        1.470588
Lithuania                 1.071429
Croatia                   1.052632
Slovenia                  0.952381
Palestinian Territory     0.925926
Gabon                     0.833333
Name: article_per_capita, dtype: float64

## 5.4 Bottom 10 countries by high quality

The 10 countries with the lowest high quality articles per capita (in ascending order)

In [127]:
high_quality_articles_by_country["article_per_capita"].sort_values(ascending=True).head(10)

country
Zimbabwe                          0.0
Zambia                            0.0
Kuwait                            0.0
Sri Lanka                         0.0
St. Kitts-Nevis                   0.0
Jamaica                           0.0
Italy                             0.0
Israel                            0.0
Iceland                           0.0
St. Vincent and the Grenadines    0.0
Name: article_per_capita, dtype: float64
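Every country in this table is tied at 0.0, and other countries with no high quality articles (for example Algeria and Angola in section 4.2) do not appear, so which ten are shown depends on how the sort breaks ties. A minimal sketch of one way to make the result reproducible is to count the ties and break them alphabetically; it uses five countries from section 4.2, and the names `rate`, `n_zero`, and `bottom` are hypothetical.

```python
import pandas as pd

# High quality article counts divided by population in millions (values from section 4.2).
rate = pd.Series(
    {
        "Zambia": 0 / 20.0,
        "Algeria": 0 / 44.9,
        "Vietnam": 2 / 99.4,
        "Angola": 0 / 35.6,
        "Albania": 6 / 2.8,
    }
)

# Report how many countries are tied at zero instead of listing an arbitrary subset.
n_zero = int((rate == 0).sum())
print(n_zero)

# Sort by rate first, then by country name, so tied rows come out in a fixed order.
bottom = (
    rate.rename_axis("country")
    .reset_index(name="rate")
    .sort_values(["rate", "country"])
    .head(3)
)
print(bottom)
```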

## 5.5 Geographic regions by total coverage

A rank ordered list of geographic regions (in descending order) by total articles per capita

In [128]:
total_articles_by_region["article_per_capita"] = total_articles_by_region["article_title"]/total_articles_by_region["Population (millions)"]
total_articles_by_region = total_articles_by_region.loc[total_articles_by_region["article_per_capita"] != np.inf,:]
total_articles_by_region["article_per_capita"].sort_values(ascending=False).head(10)

Region
SOUTHERN EUROPE    5.894040
CARIBBEAN          4.568182
WESTERN EUROPE     3.548223
EASTERN EUROPE     2.560976
NORTHERN EUROPE    2.448598
WESTERN ASIA       2.333333
OCEANIA            1.954545
SOUTHERN AFRICA    1.695652
EASTERN AFRICA     1.374207
CENTRAL ASIA       1.358974
Name: article_per_capita, dtype: float64

## 5.6 Geographic regions by high quality coverage

Rank ordered list of geographic regions (in descending order) by high quality articles per capita

In [130]:
high_quality_articles_by_region["article_per_capita"] = high_quality_articles_by_region["article_quality"]/high_quality_articles_by_region["Population (millions)"]
high_quality_articles_by_region = high_quality_articles_by_region.loc[high_quality_articles_by_region["article_per_capita"] != np.inf,:]
high_quality_articles_by_region["article_per_capita"].sort_values(ascending=False).head(10)

Region
SOUTHERN EUROPE    0.304636
CARIBBEAN          0.181818
EASTERN EUROPE     0.135889
WESTERN EUROPE     0.111675
WESTERN ASIA       0.095238
NORTHERN EUROPE    0.074766
SOUTHERN AFRICA    0.057971
CENTRAL AMERICA    0.056180
OCEANIA            0.045455
CENTRAL ASIA       0.038462
Name: article_per_capita, dtype: float64