# Considering Bias in Data
This notebook explores bias in Wikipedia's page rating system. Using article data for political leaders from around the world, we query the [ORES system](https://ores.wikimedia.org/) to obtain rating estimates.

## License
This code example was developed by Nathan Grant for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022

## Imports

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse, os
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
# The 'pandas' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import pandas as pd

## Tabular Data Parsing

In [2]:
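# Load the politician data; inconsistencies such as duplicate names are handled in the cells below
# (os.path.join keeps the path portable across operating systems)
politicians_dataframe = pd.read_csv(os.path.join(os.getcwd(), 'data', 'politicians_by_country_SEPT2022.csv'))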
politicians_dataframe.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [3]:
# Check for duplicate records
politicians_dataframe['name'].value_counts()

Torokul Dzhanuzakov      4
Alexandra Benado         2
Konstantin Jireček       2
Zahur Ahmad Chowdhury    2
Visar Ymeri              2
                        ..
Jhajaira Urresta         1
Ferenc Pulszky           1
Chun Yung-woo            1
Pēteris Pētersons        1
Hassan Adan Wadadid      1
Name: name, Length: 7534, dtype: int64

In [4]:
politicians_dataframe = politicians_dataframe.drop_duplicates(subset='name')

In [5]:
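# Load the population data (os.path.join keeps the path portable across operating systems)
population_dataframe = pd.read_csv(os.path.join(os.getcwd(), 'data', 'population_by_country_2022.csv'))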
population_dataframe.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [6]:
# Get rows that are regions instead of countries
region_mask = population_dataframe['Geography'].str.isupper()
regions = population_dataframe[region_mask]

# Isolate countries
countries = population_dataframe[~region_mask]

## Obtaining revision ids

In [7]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.
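
To make the expected response shape concrete, here is a minimal sketch that pulls the latest revision ID out of a page info response. The helper name `extract_lastrevid` and both sample payloads are assumptions written by hand for illustration, not output captured from the API; the nesting (`query` → `pages` → page id) mirrors the structure this notebook parses after the requests are made.

```python
def extract_lastrevid(response):
    """Return (title, lastrevid) from a page info response, or None if unavailable."""
    # A failed request is represented as None
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    # With one title per request, the pages dictionary holds a single entry
    for page in pages.values():
        if "lastrevid" in page:
            return page["title"], page["lastrevid"]
    return None

# Hand-written sample payloads (illustrative values only)
sample_found = {"query": {"pages": {"123": {"title": "Example Person", "lastrevid": 456}}}}
sample_missing = {"query": {"pages": {"-1": {"title": "No Such Page", "missing": ""}}}}

print(extract_lastrevid(sample_found))
print(extract_lastrevid(sample_missing))
```

A page that cannot be resolved carries no `lastrevid`, so the helper returns `None` and the caller can record the title as unsuccessful.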

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    # Work on a copy so the shared template dictionary is not mutated between calls
    request_params = dict(request_template)
    request_params['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [9]:
# Gather wiki responses for each politician and add them to a list
responses = []
for name in politicians_dataframe['name']:
    responses.append(request_pageinfo_per_article(name))


In [None]:
# Filter responses and record unsuccessful queries
unsuccessful_queries = []
successful_queries = []
for r in responses:
    # Skip requests that failed outright (the request function returns None on error)
    if r is None:
        continue
    # Each response holds a single page, keyed by its page id
    data = next(iter(r['query']['pages'].values()))
    if "lastrevid" not in data:
        unsuccessful_queries.append(data['title'])
    else:
        successful_queries.append((data['title'], data['lastrevid']))

In [None]:
unsuccessful_queries

## Obtaining page ratings

We use the wp10 model because it is the only model with a wiki page outlining its [performance](https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/wp10) on the test data.
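
As a small sketch of how a score request for this model is addressed, the snippet below builds the ORES URL for one revision and reads the prediction out of a sample response. The helper names `build_ores_url` and `extract_prediction` are assumptions for illustration, and the sample payload is hand-written (its "GA" label is a placeholder, not a real score). The nesting follows the response structure that this notebook parses, and the revision ID is the sample 'Bison' revision from the notebook's constants.

```python
ORES_ENDPOINT = "https://ores.wikimedia.org/v3"
ORES_PATH_TEMPLATE = "/scores/{context}/{revid}/{model}"

def build_ores_url(revid, context="enwiki", model="wp10"):
    """Combine the endpoint with the context, revision id and model name."""
    return ORES_ENDPOINT + ORES_PATH_TEMPLATE.format(context=context, revid=revid, model=model)

def extract_prediction(response, context="enwiki", model="wp10"):
    """Return (revid, prediction) or None when no prediction is present."""
    scores = (response or {}).get(context, {}).get("scores", {})
    for revid, models in scores.items():
        score = models.get(model, {}).get("score", {})
        if "prediction" in score:
            return revid, score["prediction"]
    return None

# Hand-written sample response (placeholder prediction, not real ORES output)
sample_response = {"enwiki": {"scores": {"1085687913": {"wp10": {"score": {"prediction": "GA"}}}}}}

print(build_ores_url(1085687913))
print(extract_prediction(sample_response))
```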

In [None]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "wp10"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template so the shared dictionary is not mutated
    request_params = dict(request_template)
    request_params['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [None]:
# Iterate through all of the revids and get the article scores
score_responses = []
for name, revid in successful_queries:
    response = request_ores_score_per_article(revid)
    score_responses.append((name, response))

In [None]:
# Example score response
score_responses[0]

In [None]:
# Parse score responses

In [None]:
# Filter through the responses to determine which ones were valid
unsuccessful_score_queries = []
successful_score_queries = []
for name, res in score_responses:
    # A failed request returns None, so record it as unsuccessful
    if res is None:
        unsuccessful_score_queries.append(name)
        continue
    revid = list(res['enwiki']['scores'].keys())[0]
    # If the wp10 element is in the response dictionary
    if 'wp10' in res['enwiki']['scores'][revid]:
        # An error entry has no 'score' key, so fall back to an empty dictionary
        score = res['enwiki']['scores'][revid]['wp10'].get('score', {})
        # Keep the result only if a usable prediction is present
        if 'prediction' in score and score['prediction'] != "NaN":
            successful_score_queries.append((name,revid,score['prediction']))
        else:
            unsuccessful_score_queries.append(name)
    else:
        unsuccessful_score_queries.append(name)

In [None]:
# No unsuccessful queries
len(unsuccessful_score_queries)

In [None]:
# Create a dataframe of the successful queries with their scores
score_df = pd.DataFrame(data=successful_score_queries,columns=['name','revid','score'])
score_df.head()

In [None]:
# Remove politicians whose records could not be resolved
politicians_dataframe = politicians_dataframe[~politicians_dataframe["name"].isin(unsuccessful_queries)]
politicians_dataframe.head(2)

In [None]:
# Combine the politician table with the articles scores
merged_df = pd.merge(score_df,politicians_dataframe, how='inner', on='name')
merged_df.head()

In [None]:
## scheme to generate regions from the countries
geography_map = []
levels = []

# Walk through the rows in file order; region rows (upper case) precede their countries
for g,pop in population_dataframe[["Geography","Population (millions)"]].values:
    # If it's upper case then add it to the stack of region names
    if g.isupper():
        levels.append(g)
    # If it's not then pair the country with the most recent region on the stack
    else:
        geography_map.append([g,levels[-1],pop])

In [None]:
# Example country tuple
geography_map[0], geography_map[50]

In [None]:
# Create a dataframe of the countries with their regions and populations
region_df = pd.DataFrame(data=geography_map,columns=["country","region","population"])
region_df.head()

In [None]:
#Merge the regions and populations with the article ratings table
final_df = pd.merge(region_df,merged_df, how='outer', on='country')
# Select necessary columns
final_df = final_df[["country","region","population","name","revid","score"]]
# Reset the column names to the desired table output
final_df.columns = ["country","region","population","article_title","revision_id","article_quality"]
# Set aside rows with no population match so the unmatched countries can be reported later
has_nans = final_df[final_df['population'].isna()]
# Final df keeps only the rows whose country has a population match
final_df = final_df[~final_df['population'].isna()]
final_df.head()

In [None]:
# Save to disk in the working folder
final_df.to_csv(os.path.join(os.getcwd(), "wp_politicians_by_country.csv"),index=False)

In [None]:
# Find out which countries couldn't be resolved
all_nans = has_nans['country'].unique()
all_nans

Here there is no data for Korean, because the population CSV represents these countries as "Korea, North" and "Korea, South", so the merge finds no matching keys. For this reason, we do not include these two countries in the analysis.
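
As an illustration of why such labels drop out, the sketch below compares two lists of country keys and reports the ones with no exact match. The helper name `unmatched_keys` and the short sample lists are assumptions for illustration only; they are not the contents of the actual CSV files.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys in left_keys that have no exact match in right_keys."""
    return sorted(set(left_keys) - set(right_keys))

# Small hand-written samples (illustrative, not the real files)
politician_countries = ["Algeria", "Egypt", "Korean"]
population_countries = ["Algeria", "Egypt", "Korea, North", "Korea, South"]

# A merge on 'country' only pairs rows whose keys are identical strings,
# so any label listed here ends up without a population value.
print(unmatched_keys(politician_countries, population_countries))
```

Running a check like this before merging makes the mismatches visible up front, rather than discovering them as missing values afterwards.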

In [None]:
# Save countries that don't have a match to a text file
# (mode 'w' overwrites the file, so re-running the cell does not duplicate entries)
with open(os.path.join(os.getcwd(), "wp_countries-no_match.txt"),'w') as f:
    for c in all_nans:
        f.write(c+"\n")

## Analysis

Here we calculate total-articles-per-population (the number of articles per person) and high-quality-articles-per-population (the number of high-quality articles per person) on a country-by-country and regional basis. All ratios are normalized to be “per capita”; because the population figures are given in millions, the values computed below are articles per million people.
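
The normalization can be sketched with plain arithmetic. The function names and the article count of 207 are hypothetical; the population of 103.5 million is the Egypt figure shown in the population table above. Returning `None` for a population of zero is an assumption about how one might guard against division by zero, not something the notebook does.

```python
def articles_per_million(article_count, population_millions):
    """Articles per million people; None when the population is recorded as zero."""
    if population_millions == 0:
        return None
    return article_count / population_millions

def articles_per_person(article_count, population_millions):
    """Convert the per-million ratio to a true per-person ratio."""
    per_million = articles_per_million(article_count, population_millions)
    return None if per_million is None else per_million / 1_000_000

# Hypothetical count of 207 articles for a population of 103.5 million
print(articles_per_million(207, 103.5))   # 207 / 103.5 = 2.0 articles per million people
print(articles_per_person(207, 103.5))
```

Both ratios rank countries identically, since one is a constant multiple of the other; the per-million form simply yields more readable numbers.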

Top 10 countries by coverage: 
- The 10 countries with the highest total articles per capita (in descending order):

In [None]:
# Group articles together by country and get a count of articles
total_articles_per_country = final_df.groupby("country")['article_title'].count()
total_articles_per_country.head()

In [None]:
# Merge together the articles for each country with countries populations
total_articles_per_pop_country = pd.merge(total_articles_per_country,countries, 
                                          how="inner",left_index=True ,right_on="Geography")
# Create a new column that is the number of articles divided by the normalized population
total_articles_per_pop_country["articles_per_population_mil"] = total_articles_per_pop_country["article_title"]/\
                                                        total_articles_per_pop_country["Population (millions)"]
# Select the desired columns and sort in descending order
total_articles_per_pop_country_highest = total_articles_per_pop_country[["Geography","articles_per_population_mil"]].sort_values(ascending=False, 
                                        by ="articles_per_population_mil")
# Display the top 10
total_articles_per_pop_country_highest.head(10)

Bottom 10 countries by coverage: 
- The 10 countries with the lowest total articles per capita (in ascending order):

In [None]:
# Select the desired columns and sort in ascending order (lowest first)
total_articles_per_pop_country_lowest = total_articles_per_pop_country[["Geography","articles_per_population_mil"]]. \
                                        sort_values(ascending=True, 
                                        by ="articles_per_population_mil")
total_articles_per_pop_country_lowest.head(10)

Top 10 countries by high quality:
- The 10 countries with the highest high quality articles per capita (in descending order)

In [None]:
# Filter articles that are rated FA or GA by the model
# (the score column was renamed to 'article_quality' when final_df was built)
only_quality_articles = final_df[final_df['article_quality'].isin(["FA", "GA"])]
only_quality_articles.head()

In [None]:
# Count the high-quality articles in each country
qual_articles_per_country = only_quality_articles.groupby("country")['article_title'].count()
# Merge together the article counts for each country with countries populations
total_qual_articles_per_pop_country = pd.merge(qual_articles_per_country,countries, 
                                          how="inner",left_index=True ,right_on="Geography")
# Create a new column that is the number of articles divided by the normalized population
total_qual_articles_per_pop_country["articles_per_population_mil"] = total_qual_articles_per_pop_country["article_title"]/\
                                                        total_qual_articles_per_pop_country["Population (millions)"]
# Select the desired columns and sort in descending order
total_qual_articles_per_pop_country_high = total_qual_articles_per_pop_country[["Geography",
                                        "articles_per_population_mil"]].sort_values(ascending=False, 
                                        by ="articles_per_population_mil")
total_qual_articles_per_pop_country_high.head(10)

Bottom 10 countries by high quality: 
- The 10 countries with the lowest high quality articles per capita (in ascending order).

In [None]:
# Select the desired columns and sort in ascending order (lowest first)
total_qual_articles_per_pop_country_low = total_qual_articles_per_pop_country[["Geography",
                                        "articles_per_population_mil"]].sort_values(ascending=True, 
                                        by ="articles_per_population_mil")
total_qual_articles_per_pop_country_low.head(10)

Geographic regions by total coverage: 
- A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [None]:
# Aggregate counts together by region
article_counts_region = final_df.groupby("region")['article_quality'].count()
article_counts_region.head()

In [None]:
# Merge together the articles for each region with regions populations
total_articles_per_pop_reg = pd.merge(article_counts_region,regions, 
                                          how="inner",left_index=True ,right_on="Geography")
# Create a new column that is the number of articles divided by the normalized population
total_articles_per_pop_reg["articles_per_population_mil"] = total_articles_per_pop_reg["article_quality"]/\
                                                        total_articles_per_pop_reg["Population (millions)"]
# Select desired columns and sort values in descending order
total_articles_per_pop_reg = total_articles_per_pop_reg[["Geography",
                                        "articles_per_population_mil"]].sort_values(ascending=False, 
                                        by ="articles_per_population_mil")
# Set column names (these rows are regions, not countries)
total_articles_per_pop_reg.columns = ["region","articles_per_population_mil"]
# Display
total_articles_per_pop_reg

Geographic regions by high quality coverage: 
- Rank ordered list of geographic regions (in descending order) by high quality articles per capita.


In [None]:
# Aggregate counts together by region
article_counts_region_good = only_quality_articles.groupby("region")['article_quality'].count()
# Display
article_counts_region_good.head()

In [None]:
# Merge together the articles for each region with regions populations
total_articles_per_pop_reg_good = pd.merge(article_counts_region_good,regions, 
                                          how="inner",left_index=True ,right_on="Geography")
# Create a new column that is the number of articles divided by the normalized population
total_articles_per_pop_reg_good["articles_per_population_mil"] = total_articles_per_pop_reg_good["article_quality"]/\
                                                        total_articles_per_pop_reg_good["Population (millions)"]
# Select desired columns and sort values in descending order
total_articles_per_pop_reg_good = total_articles_per_pop_reg_good[["Geography",
                                        "articles_per_population_mil"]].sort_values(ascending=False, 
                                        by ="articles_per_population_mil")
# Set column names (these rows are regions, not countries)
total_articles_per_pop_reg_good.columns = ["region","articles_per_population_mil"]
# Display the high-quality regional table
total_articles_per_pop_reg_good