# Assignment 2 - Bias on Wikipedia

#### Ian Kirkman, 11/1/2017

The goal of this assignment is to explore the ramifications of bias in data. Given the known demographics of English Wikipedia editors (see "Nationality" in https://en.wikipedia.org/wiki/Wikipedia:Wikipedians), we anticipate a bias that affects both the scope and quality of English Wikipedia articles about political figures from various countries. We analyze coverage and quality metrics for these articles and reflect on our findings.

*Assignment source: https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data*

## 1. Data Prerequisites

Section 1 covers all the information needed to gather data, import the required libraries, and set user inputs. We start below with libraries and parameters. Gathering data will be broken into precomputed source data ([Section 1.2](#sec1.2)) and data pulled from an API ([Section 1.3](#sec1.3)).

### 1.1. Importing Libraries and Setting Parameters

This package includes the following libraries for processing and analysis:
 - `requests`: This is used to pull data from the ORES API.
 - `json`: This is used to format, save, and load raw data after it's pulled from the source.
 - `csv`: This is used to load raw data from csv files, and to write processed data to a csv. 
 - `math`: This is used to split the Revision IDs into batches for ORES API calls.
 - `copy`: The `deepcopy` function is used when processing and analyzing data (Sections 2 and 3). 
 - `operator`: The `itemgetter` function is used when sorting a list of lists (Section 3).
 - `IPython`: The `display` and `Markdown` functions are used to embed the final ranking tables in the notebook.
 
User inputs are also set in this section, and referenced throughout the later processing steps. Inputs are split into categories that correspond to later notebook Sections. 

 - [Section 1.2](#sec1.2) of this notebook covers the raw data CSV files that need to be uploaded to the project directory. The inputs in this section represent the filepaths of those uploaded CSVs.

 - [Section 1.3](#sec1.3) of this notebook will cover the ORES API calls to collect the raw ORES data. This section of inputs contains the parameters and endpoint used for the ORES API calls, as well as the file location of where to write the raw API call results.

 - [Section 2](#sec2) of this notebook contains the data processing steps for our project. In that section, we create the dataset of merged data from our 3 sources that is required as assignment output. Below, we enter the file location of where to save the merged data as a CSV file.

**Notes and Assumptions:**
- For all data paths it is assumed that this notebook lives in the project root directory. All paths should be written from the root.
- Project folders are currently split into DATA (all raw data) and OUTPUT (all processing and analysis output).
- Raw data files use the naming convention: `source_description_accessdate`.
- Raw data must be saved in a json format, and our processed data output must be saved as a CSV. Changing the file extensions in the paths will require updating code in the related sections.

In [1]:
import requests
import json
import csv
import math
import copy
from operator import itemgetter
from IPython.display import display, Markdown

############ BEGIN USER INPUTS ###

github_username = 'iankirkman'
uw_email = 'ikirkman@uw.edu'
headers={'User-Agent' : 'https://github.com/%s'%github_username, 'From' : '%s'%uw_email}

# Raw Data Upload Location (from root dir) -- See data source notes in Section 1.2.
raw_wp_data_path = 'DATA/wp_page_data_20171101.csv'
raw_prb_data_path = 'DATA/prb_population_mid2015_20171101.csv'

# ORES Parameters for API calls -- See usage in Section 1.3.
ores_endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
ores_params = {'project' : 'enwiki',
               'model'   : 'wp10'}
raw_ores_data_path = 'DATA/raw_ores_data_20171101.json'
    
# Filepaths of Notebook Output (from root dir) - See usage in Section 2.
merged_wp_prb_ores_data_path = 'OUTPUT/processed_wp_prb_ores_data.csv' 

############ END USER INPUTS ###

<a id='sec1.2'></a>
### 1.2. Uploading Raw Data from Outside Sources

We have three data sources available for this assignment. This section will cover the first two, which are available publicly in CSV format. To use these datasets in our project, we need to pull the CSV files from the online sources and upload them to our data directory.

The [Population Reference Bureau (PRB)](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14) website contains a dataset with population counts by country circa mid-2015. A CSV file can be downloaded directly from the link by clicking the Excel icon in the top right side of the page. The CSV file must then be uploaded to the project data directory at the path specified in the inputs above. The only fields we use from this data are `Location` and `Data`, which correspond to 'Country' and 'Population' on our final dataset, respectively.

The English Wikipedia page data for political figures by country was provided by Oliver Keyes on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The CSV file can be downloaded via the download button on the top left, and then uploaded to the project data directory at the path specified in the inputs above. See code and data notes at the link. We will use each field from this dataset, with the following mapping to our final output: 'country' to 'Country', 'page' to 'Article_Name', and 'rev_id' to 'Revision_ID'.

**Notes and Assumptions:**
- The column order is assumed to be consistent for any data download from these sources. Passive header checking has been added as print statements in Sections 1.3 and 2.
- Note that the Wikipedia page data on Figshare has updated a field name from 'last_edit' to 'rev_id'. This change is not currently reflected in the page data documentation.

<a id='sec1.3'></a>
### 1.3. Pulling Raw Data from the ORES API

For our third data source, we access the [Objective Revision Evaluation Service (ORES)](https://www.mediawiki.org/wiki/ORES) API to collect quality score predictions by article (matched on Revision ID). We use the endpoint and parameters specified in Section 1.1 to call the API with multiple Revision IDs joined by a vertical bar (`|`) delimiter. The user-input parameters simply specify the project and model for the API call. Version 3 of the API is assumed and hard-coded into the endpoint. Revision IDs are added to the parameters after some initial processing of the Figshare data.

API calls return a nested dictionary. To access the score prediction for a given article (using Rev_ID_001 as an example), we pull: `api_results[ores_params['project']]['scores'][Rev_ID_001][ores_params['model']]['score']['prediction']`. 

ORES score predictions are classified as (ordered from best to worst):
- `FA`: Featured article
- `GA`: Good article
- `B`: B-class article
- `C`: C-class article
- `Start`: Start-class article
- `Stub`: Stub-class article

See the ORES API linked above for further details.
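As a minimal sketch of the lookup path described above, the following uses a hand-built dictionary shaped like the nested response (the revision IDs, prediction, and error message are made up for illustration and are not real API output). The helper name `get_prediction` is hypothetical; it returns `None` when an entry holds an error dictionary instead of a score.

```python
# Hypothetical response fragment shaped like the nested dictionary described
# above. Revision IDs and values are illustrative only.
sample_response = {
    'enwiki': {
        'scores': {
            '1001': {'wp10': {'score': {'prediction': 'Stub'}}},
            '1002': {'wp10': {'error': {'message': 'illustrative error'}}},
        }
    }
}

# Quality classes ordered from best to worst, as listed above.
QUALITY_ORDER = ['FA', 'GA', 'B', 'C', 'Start', 'Stub']


def get_prediction(response, rev_id, project='enwiki', model='wp10'):
    """Return the predicted class for rev_id, or None for an error entry."""
    entry = response[project]['scores'][str(rev_id)][model]
    if 'error' in entry:
        return None
    return entry['score']['prediction']


print(get_prediction(sample_response, 1001))
print(get_prediction(sample_response, 1002))
```

Returning `None` for error entries is one possible design choice; the notebook itself keeps the error dictionaries in the raw data and filters them in Section 2.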

We first build a simple get function to return the API call results for a list of Revision IDs. We then batch groups of 50 Revision IDs at a time from the Wikipedia page data, and add the results of each call to our raw ORES data. The combined raw data is exported to a json file in the project data directory, at the path specified in Section 1.1.
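The batching step can be checked without any network access. The sketch below assumes a hypothetical generator named `batch` (not part of the notebook) and placeholder IDs; it only shows how a list splits into groups of at most 50, including a final partial group.

```python
def batch(items, size=50):
    """Yield consecutive slices of items with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs for illustration; real Revision IDs come from the page data.
placeholder_ids = [str(n) for n in range(120)]
batch_sizes = [len(group) for group in batch(placeholder_ids)]
print(batch_sizes)
```

A generator like this also handles the cases where the list is empty or its length is an exact multiple of the batch size.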

**Notes and Assumptions:**
- Some of the API calls return an error dictionary instead of returning a score prediction. Those error dictionaries are saved in place on the raw data, and dealt with in our processing steps of Section 2.
- The pull of Revision IDs from the Figshare data assumes the column ordering is consistent with this download. 
- See lines marked with `## TEST ##` for passive error checking below.

In [2]:
def get_ores_data(revision_ids):
    '''
    Returns a json-formatted dictionary of ORES API results for list of 
    (up to 50) Wikipedia article Revision IDs. 
    
    DEPENDENCIES:
     - Requires Wikipedia page data from figshare uploaded to project
       data directory specified in Section 1.1.
     - Requires ORES endpoint and parameters specified in Sec 1.1.
        
    INPUTS: 
     - revision_ids: list of up to 50 revision ids to pull ORES data on
    
    RETURNS: 
     - json-formatted nested dictionary 
     - See ORES API documentation: https://www.mediawiki.org/wiki/ORES
    '''
    params = {'revids'  : '|'.join(str(x) for x in revision_ids)}
    params.update(ores_params)
    # Pass the identifying headers defined in the user inputs of Section 1.1
    return requests.get(ores_endpoint.format(**params), headers=headers).json()

# Read uploaded raw CSV of Wikipedia page data
# This is needed to collect Revision IDs for ORES API calls
# Assumes header row = ['page','country','rev_id'] 
wp_data = []
with open(raw_wp_data_path) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        wp_data.append([row[0],row[1],row[2]])

## TEST ##
# Check Wikipedia data header row:
print('Check WP data headers: %r'%(wp_data[0]==['page','country','rev_id']))
     
# Consolidate list of Revision IDs for ORES API calls
rev_ids = [wp_data[i][2] for i in range(1,len(wp_data))]

# Batch groups of 50 Revision IDs for ORES API calls.
# Start from an empty scores dictionary so that every batch, including a
# final partial batch, is handled by the same loop.
ores_data = {ores_params['project'] : {'scores' : {}}}
for i in range(math.ceil(len(rev_ids)/50)):
    ores_data[ores_params['project']]['scores']. \
        update(get_ores_data(rev_ids[i*50:(i+1)*50])[ores_params['project']]['scores'])

## TEST ##
# Check that all rows have been added to ores_data dictionary.
print('Check ores_data for completeness (2):')
# Check total number of rev_ids in ores_data versus wp_data:
print('* 1/2: %r'%(len(ores_data[ores_params['project']]['scores']) == \
                   len(wp_data)-1))# -1 because wp_data has a header row to ignore
# Check ores_data contains last row of wp_data:
print('* 2/2: %r'%(wp_data[-1][2] in \
                   ores_data[ores_params['project']]['scores']))

# Save raw ORES data as json file
with open(raw_ores_data_path, 'w') as outfile:
    json.dump(ores_data, outfile)

Check WP data headers: True
Check ores_data for completeness (2):
* 1/2: True
* 2/2: True


<a id='sec2'></a>
## 2. Processing Data

Processing our data requires two merges, which we have broken into two steps below.   

### 2.1 Merge Wikipedia and Population Data

First we merge the Wikipedia page data with the PRB Population data. Both datasets have a country feature that we can join on. We remove all countries that do not have an exact match in both datasets.

Since we create the merged set by iterating over the Wikipedia page data, the countries in the PRB data that are missing are implicitly removed from our result. However, we added some exclusion tracking below so we can reconcile our data. 
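The exact-match join and its exclusion tracking can be illustrated with set operations. This is only a sketch with made-up country names and populations, not the notebook's data. It shows how a near-miss spelling lands in both exclusion sets.

```python
# Illustrative rows only; names and numbers are made up for the sketch.
wp_rows = [('Page A', 'Freedonia'),
           ('Page B', 'Freedonia'),
           ('Page C', 'Sylvanian')]   # near-miss spelling, will not match
population = {'Freedonia': 1000, 'Sylvania': 2000}

wp_countries = {country for _, country in wp_rows}
prb_countries = set(population)

matched = wp_countries & prb_countries    # kept in the merged data
wp_only = wp_countries - prb_countries    # excluded from the Wikipedia side
prb_only = prb_countries - wp_countries   # excluded from the PRB side

# Keep only the article rows whose country matched exactly.
merged = [(country, page, population[country])
          for page, country in wp_rows if country in matched]

print(sorted(matched), sorted(wp_only), sorted(prb_only))
```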

A list of lists is created, called `dsmerge_wp_prb`, with **ordered** headers that correspond to each source value (where WP represents Wikipedia page data and PRB represents PRB population data):

| Column | Value Source |
| :--- | :--- |
| Country	| WP.country & PRB.Location  |
| Article_Name	| WP.page |
| Revision_ID	| WP.rev_id |
| Article_Quality	| '' |
| Population	| PRB.Data |

**Notes and Assumptions:**
- Note that the `Article_Quality` field is an empty string placeholder for the merge with ORES data in Section 2.2.
- The ordering of the list `dsmerge_wp_prb` is assumed in later processing steps.
- Passive checks are added in lines marked by `## TEST ##`.

In [3]:
# Recall the Wikipedia page data was read from CSV in Section 1.3.
# This can be repeated here if necessary:
wp_data = []
with open(raw_wp_data_path) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        wp_data.append([row[0],row[1],row[2]]) # Assumes header row = ['page','country','rev_id'] 

## TEST ##        
# Check Wikipedia data header row:
print('WP Header Check: %r'%(wp_data[0]==['page','country','rev_id']))
        
# Read uploaded raw CSV of PRB Population data into dictionary pop_data
pop_data = {} # Dict format is: {'country':'population'}
hdr = True # Data has header row
with open(raw_prb_data_path) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if len(row)==6: # Ignore title rows, include column headers
            ## TEST ##
            if hdr: # Check list order in header row
                print('PRB Header Check: %r'%(row[0]=='Location' and row[4]=='Data'))
                hdr = False
            else: # Add data rows (non-header) to dictionary
                pop_data[row[0]] = row[4]

# Merge PRB population data with Wikipedia data
# Note that this leaves an empty column to join Article_Quality from ores_data
dsmerge_wp_prb = [['Country','Article_Name','Revision_ID','Article_Quality','Population']]

# Tracking exclusions:
# By iterating over wp_data below, we are implicitly skipping countries in pop_data 
    # that are not in wp_data. 
# Therefore, to keep track of excluded countries in each source, we use the dictionaries:
wp_excl,wp_incl = {},{} # add keys as wp countries are skipped
pop_excl = copy.deepcopy(pop_data) # remove keys as prb countries are used

# Iterate over wp_data to construct merged list:
for row in wp_data[1:]: # Skip header row
    # We must also skip rows with countries that are not included in pop_data:
    if row[1] in pop_data:
        dsmerge_wp_prb.append([row[1],row[0],row[2],'',int(pop_data[row[1]].replace(',',''))])
        pop_excl.pop(row[1],None)
        wp_incl.update({row[1] : 1})
    else:
        wp_excl.update({row[1] : 1})
        
# Track totals for data reconciliation:
wp_art_incl_ct = len(dsmerge_wp_prb)-1 # -1 to ignore the header row
wp_art_excl_ct = len(wp_data)-len(dsmerge_wp_prb)
wp_ctry_incl_ct = len(wp_incl.keys())
wp_ctry_excl_ct = len(wp_excl.keys())
prb_pop_incl_ct = sum([int(v.replace(',','')) for v in pop_data.values()])  - \
                        sum([int(v.replace(',','')) for v in pop_excl.values()]) 
prb_pop_excl_ct = sum([int(v.replace(',','')) for v in pop_excl.values()]) 
prb_ctry_incl_ct = len([k for k in pop_data.keys() if k not in pop_excl.keys()])
prb_ctry_excl_ct = len(pop_excl.keys())

## TEST ##
# Check that number of included countries is the same in both sources
print('Merged Country Check: %s; Match: %r'%(format(wp_ctry_incl_ct,','), \
                                             (wp_ctry_incl_ct == prb_ctry_incl_ct)))

WP Header Check: True
PRB Header Check: True
Merged Country Check: 187; Match: True


#### Exclusion Reconciliation for WP and PRB Data by Country

We can use the tracking values computed above to print the number of excluded countries, articles, and people in each applicable dataset. This lets us confirm that we are not excluding a greater proportion than expected, and track our data totals throughout all processing steps.

In [4]:
## DATA RECONCILIATION ##
print('WIKIPEDIA DATA RECONCILIATION')
print('------------------------------------')
print('* Excluded Articles:')
print('\tNumber: %s'%format(wp_art_excl_ct,","))
print('\tPercent: %s'%format((wp_art_excl_ct/(wp_art_excl_ct+wp_art_incl_ct)),".2%"))
print('* Excluded Countries:')
print('\tNumber: %s'%format(wp_ctry_excl_ct,","))
print('\tPercent: %s'%format((wp_ctry_excl_ct/(wp_ctry_excl_ct+wp_ctry_incl_ct)),".2%"))
print('* Excluded Country List:')
for k in wp_excl.keys():
    print('\t%s'%k)
print() # Add whitespace
print('PRB POPULATION DATA RECONCILIATION')
print('------------------------------------')
print('* Excluded Population:')
print('\tRaw Count: %s'%format(prb_pop_excl_ct,","))
print('\tPercent of Total Pop.: %s'%format((prb_pop_excl_ct/(prb_pop_excl_ct+prb_pop_incl_ct)),".2%"))
print('* Excluded Countries:')
print('\tNumber: %s'%format(prb_ctry_excl_ct,","))
print('\tPercent: %s'%format((prb_ctry_excl_ct/(prb_ctry_excl_ct+prb_ctry_incl_ct)),".2%"))
print('* Excluded Country List:')
for k in pop_excl.keys():
    print('\t%s'%k)

WIKIPEDIA DATA RECONCILIATION
------------------------------------
* Excluded Articles:
	Number: 1,398
	Percent: 2.96%
* Excluded Countries:
	Number: 32
	Percent: 14.61%
* Excluded Country List:
	Hondura
	Salvadoran
	Saint Kitts and Nevis
	Palauan
	Ivorian
	Saint Vincent and the Grenadines
	Rhodesian
	Omani
	Niuean
	East Timorese
	Faroese
	Cape Colony
	South Korean
	Samoan
	Montserratian
	Pitcairn Islands
	Abkhazia
	Carniolan
	Saint Lucian
	South African Republic
	Incan
	Chechen
	Jersey
	Guernsey
	South Ossetian
	Cook Island
	Tokelauan
	Dagestani
	Greenlandic
	Ossetian
	Somaliland
	Rojava

PRB POPULATION DATA RECONCILIATION
------------------------------------
* Excluded Population:
	Raw Count: 62,366,406
	Percent of Total Pop.: 0.85%
* Excluded Countries:
	Number: 23
	Percent: 10.95%
* Excluded Country List:
	Brunei
	Channel Islands
	Cote d'Ivoire
	Curacao
	El Salvador
	French Polynesia
	Georgia
	Guam
	Honduras
	Hong Kong, SAR
	Macao, SAR
	Mayotte
	New Caledonia
	Oman
	Palau
	Puerto R

### 2.2 Merge ORES Data with the WP/PRB Merged Dataset

Now we merge the ORES data with the dataset merged in Section 2.1 from the Wikipedia page data and PRB population data. In this step, we need to remove articles where the ORES API call returned an error dictionary for the Revision ID instead of a score prediction. We also track our exclusions due to ORES errors to allow for a full data reconciliation.

We start with a deep copy of our previously merged data, named `dsmerge_wp_prb_ores`. We match the ORES data on the `Revision_ID` column. If the ORES API returned an error dictionary for that Revision ID, then the row is removed from our list and added to our exclusion tracking. If it returned a score dictionary, then we add the score prediction to the `Article_Quality` column.
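The sketch below shows the same partitioning idea on made-up rows, building separate kept and excluded lists rather than removing rows from a copy. The row values, IDs, and error message are illustrative assumptions, and the column order follows the table in Section 2.1.

```python
# Illustrative inputs only; IDs and values are made up.
merged_rows = [['Freedonia', 'Page A', '1001', '', 1000],
               ['Freedonia', 'Page B', '1002', '', 1000]]
scores = {'1001': {'wp10': {'score': {'prediction': 'GA'}}},
          '1002': {'wp10': {'error': {'message': 'illustrative error'}}}}

kept, excluded = [], []
for row in merged_rows:
    entry = scores[row[2]]['wp10']      # match on the Revision_ID column
    if 'error' in entry:
        excluded.append(row)            # no quality data for this article
    else:
        # Fill the Article_Quality placeholder (column index 3)
        kept.append(row[:3] + [entry['score']['prediction']] + row[4:])

# Every input row ends up in exactly one of the two lists.
print(len(kept), len(excluded))
```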

The final `dsmerge_wp_prb_ores` list of lists dataset has **ordered** headers that correspond to each source value:

| Column | Value Source |
| :--- | :--- |
| Country	| WP.country & PRB.Location  |
| Article_Name	| WP.page |
| Revision_ID	| WP.rev_id & ORES.revid |
| Article_Quality	| ORES.prediction |
| Population	| PRB.Data |

*__This dataset is a requirement of the assignment, and is output to the location specified in the user inputs of Section 1.1.__*

**Notes and Assumptions:**
- The ordering of the list `dsmerge_wp_prb_ores` is assumed in later processing steps.
- Passive checks are added in lines marked by `## TEST ##`.

In [5]:
# If running apart from Section 1, load ores_data from raw json:
with open('%s'%(raw_ores_data_path), 'r') as infile:
    ores_data = json.load(infile)

# We will also need to exclude articles that did not have ORES data.
# The list wp_prb_excl will collect any data from the merged wikipedia and prb sources
#     that is excluded for not having ORES data.
wp_prb_excl = []

# Pull raw ORES API data to add quality prediction to merged list:
dsmerge_wp_prb_ores = copy.deepcopy(dsmerge_wp_prb)
for row in dsmerge_wp_prb_ores[1:]: # skip header row
    if 'error' in ores_data[ores_params['project']]['scores'][row[2]][ores_params['model']]:
        # No quality data for this article-- remove it from the merged set
        wp_prb_excl.append(row)
        dsmerge_wp_prb_ores.remove(row)
    else:
        # Add the quality prediction to the merged data
        row[3] = ores_data[ores_params['project']]['scores'][row[2]][ores_params['model']]['score']['prediction']
        
## TEST ##
print('Check merged row totals: %r'%(len(wp_prb_excl)+len(dsmerge_wp_prb_ores)==len(dsmerge_wp_prb)))

# Write Merged Dataset to Output CSV file
with open(merged_wp_prb_ores_data_path,'w',newline='') as csvfile: # newline='' per csv module docs
    writer = csv.writer(csvfile)
    for row in dsmerge_wp_prb_ores:
        writer.writerow(row)

Check merged row totals: True


#### Exclusion Reconciliation for ORES data with merged WP/PRB data by Revision ID

We can use the tracking values computed above to print some information about the articles excluded by this merge. This lets us confirm that we are not excluding a greater proportion than expected, and track our data totals throughout all processing steps.

In [6]:
## DATA RECONCILIATION ##
print('MERGED WP/PRB/ORES DATA RECONCILIATION')
print('------------------------------------')
print('* Articles from WP/PRB merged set excluded by ORES error:')
for k in wp_prb_excl:
    print('\t%s (%s, Rev ID: %s, Pop: %s)'%(k[1],k[0],k[2],format(k[4],',')))

MERGED WP/PRB/ORES DATA RECONCILIATION
------------------------------------
* Articles from WP/PRB merged set excluded by ORES error:
	Olajide Awosedo (Nigeria, Rev ID: 806811023, Pop: 181,839,400)
	Jalal Movaghar (Iran, Rev ID: 807367030, Pop: 78,483,446)
	Mohsen Movaghar (Iran, Rev ID: 807367166, Pop: 78,483,446)
	Ajay Kannoujiya (India, Rev ID: 807484325, Pop: 1,314,097,616)


## 3. Data Analysis

To analyze the bias in English Wikipedia articles, we compute two metrics for each country in the combined data. To assess coverage of articles in a country, we compute an articles-per-population proportion (reported as a percentage). To assess the quality of articles in a given country, we compute the proportion of articles that are high quality (those classified as 'FA' or 'GA', also reported as a percentage).
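As a worked example of the two metrics, the following uses hypothetical figures for a single made-up country (none of these numbers come from the datasets). It also shows why a very large population can display as `0.00%` coverage at two decimal places even when the country has many articles.

```python
# Hypothetical figures for a single made-up country.
population = 20000
total_articles = 50
high_quality_articles = 2   # articles predicted as 'FA' or 'GA'

coverage = total_articles / population            # articles per person
quality = high_quality_articles / total_articles  # share of high-quality articles

coverage_pct = format(coverage, '.2%')
quality_pct = format(quality, '.1%')
print(coverage_pct, quality_pct)

# A large (hypothetical) country: 1,000 articles for one billion people
# rounds to zero when shown with two decimal places.
large_country_pct = format(1000 / 1_000_000_000, '.2%')
print(large_country_pct)
```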

### 3.1 Developing Metrics for Country Ranking

We use a `countries` dictionary with country names as key. Each value is a dictionary that includes values for the country's population, total articles, and high-quality articles. After counting all the articles from our final merged dataset into the countries dictionary, we can create a simple table containing each of our two (coverage and quality) metrics for each country row. This table is called `countries_all_pcov_pqual` in the code section below. 

We can use this `countries_all_pcov_pqual` table of combined metrics along with some simple sorts to obtain the following country-ranking visualizations:
- Top Ten Countries by Coverage Proportion (Articles-to-Population)
- Bottom Ten Countries by Coverage Proportion (Articles-to-Population)
- Top Ten Countries by Proportion of High Quality Articles
- Bottom Ten Countries by Proportion of High Quality Articles

**Notes and Assumptions:**
- Tie-breakers for equal proportions will be based on previous data sort.
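The tie-breaking note relies on Python's `sorted` being stable: rows with equal keys keep their prior relative order, including when `reverse=True` is used. A small sketch with made-up rows:

```python
from operator import itemgetter

# Made-up rows: [name, proportion]; A/C tie at 0.0 and B/D tie at 0.5.
rows = [['A', 0.0], ['B', 0.5], ['C', 0.0], ['D', 0.5]]

ascending = sorted(rows, key=itemgetter(1))
descending = sorted(rows, key=itemgetter(1), reverse=True)

# Within each tie, the original order (A before C, B before D) is preserved.
print([r[0] for r in ascending])
print([r[0] for r in descending])
```

This is why many countries tied at a 0.0% quality proportion appear in the bottom-ten table in whatever order they first entered the `countries` dictionary.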

In [7]:
# Compile Article counts in dict of dicts with country names as keys
# e.g.: {'country': {'population': [population],
#                    'tot_articles': [article count],
#                    'hq_articles': [high-quality article count]}}
countries = {}
for row in dsmerge_wp_prb_ores[1:]:
    if row[0] in countries:
        # add to article counts only
        countries[row[0]]['tot_articles'] += 1
        if row[3] in ['GA','FA']:
            countries[row[0]]['hq_articles'] += 1
    else:
        # create new dict entry
        countries[row[0]] = {'population' : row[4],
                             'tot_articles' : 1,
                             'hq_articles' : int(row[3] in ['GA','FA'])}
        
# Create table of all countries with article-per-pop and hq-per-article values
countries_all_pcov_pqual = [['country','prop_coverage','prop_quality']] + \
              [[c, \
                countries[c]['tot_articles']/countries[c]['population'], \
                countries[c]['hq_articles']/countries[c]['tot_articles']] \
                for c in countries.keys()]

# Pull top/bottom 10 country lists from countries_all_pcov_pqual list
# Reference (use of itemgetter): https://stackoverflow.com/questions/10695139/sort-a-list-of-tuples-by-2nd-item-integer-value
countries_top10_pcov = [r for r in sorted(countries_all_pcov_pqual[1:],key=itemgetter(1),reverse=True)[:10]]
countries_bot10_pcov = [r for r in sorted(countries_all_pcov_pqual[1:],key=itemgetter(1),reverse=False)[:10]]
countries_top10_pqual = [r for r in sorted(countries_all_pcov_pqual[1:],key=itemgetter(2),reverse=True)[:10]]
countries_bot10_pqual = [r for r in sorted(countries_all_pcov_pqual[1:],key=itemgetter(2),reverse=False)[:10]]

### Display Rankings Visualizations

We create a simple get function to construct a string that works with the IPython `Markdown` and `display` functions. That function is called for each rankings table we wish to show.

In [8]:
def get_embedstr_ranktab(title,rank_table):
    '''
    Creates an embedding string for a country-ranking table to be 
    rendered via the IPython Markdown class.
    
    INPUT:
        - title: the name of the table to display
        - rank_table: the rankings table to display
        
    RETURNS:
        - the string used by IPython display(Markdown()) 
          to embed the country-rankings table
    '''
    embstr = '%s\n----\n'%title + \
             '|Country|Population|Article-per-Population|High Quality-per-Article\n' + \
             '|:-------------|-------------:|-----:|-----:|\n'
    for c in rank_table:
        embstr += '|%s|%s|%s|%s|\n'%(c[0],format(countries[c[0]]['population'],','),format(c[1],'.2%'),format(c[2],'.1%'))
    return embstr + '\n\n\n'

# Display Country Ranking tables in markdown.
# Reference: https://stackoverflow.com/questions/36288670/jupyter-notebook-output-in-markdown
mkdwn_str = get_embedstr_ranktab('Top Ten Countries by Coverage Proportion (Articles-to-Population)', \
                                 countries_top10_pcov) + \
            get_embedstr_ranktab('Bottom Ten Countries by Coverage Proportion (Articles-to-Population)', \
                                 countries_bot10_pcov) + \
            get_embedstr_ranktab('Top Ten Countries by Proportion of High Quality Articles', \
                                 countries_top10_pqual) + \
            get_embedstr_ranktab('Bottom Ten Countries by Proportion of High Quality Articles', \
                                 countries_bot10_pqual)
            
display(Markdown(mkdwn_str))

Top Ten Countries by Coverage Proportion (Articles-to-Population)
----
|Country|Population|Article-per-Population|High Quality-per-Article
|:-------------|-------------:|-----:|-----:|
|Nauru|10,860|0.49%|0.0%|
|Tuvalu|11,800|0.47%|5.5%|
|San Marino|33,000|0.25%|0.0%|
|Monaco|38,088|0.11%|0.0%|
|Liechtenstein|37,570|0.08%|0.0%|
|Marshall Islands|55,000|0.07%|0.0%|
|Iceland|330,828|0.06%|1.0%|
|Tonga|103,300|0.06%|0.0%|
|Andorra|78,000|0.04%|0.0%|
|Federated States of Micronesia|103,000|0.04%|0.0%|



Bottom Ten Countries by Coverage Proportion (Articles-to-Population)
----
|Country|Population|Article-per-Population|High Quality-per-Article
|:-------------|-------------:|-----:|-----:|
|India|1,314,097,616|0.00%|1.3%|
|China|1,371,920,000|0.00%|3.1%|
|Indonesia|255,741,973|0.00%|3.7%|
|Uzbekistan|31,290,791|0.00%|10.3%|
|Ethiopia|98,148,000|0.00%|2.9%|
|Korea, North|24,983,000|0.00%|23.1%|
|Zambia|15,473,900|0.00%|0.0%|
|Thailand|65,121,250|0.00%|2.7%|
|Congo, Dem. Rep. of|73,340,200|0.00%|5.6%|
|Bangladesh|160,411,000|0.00%|0.9%|



Top Ten Countries by Proportion of High Quality Articles
----
|Country|Population|Article-per-Population|High Quality-per-Article
|:-------------|-------------:|-----:|-----:|
|Korea, North|24,983,000|0.00%|23.1%|
|Saudi Arabia|31,565,109|0.00%|11.8%|
|Uzbekistan|31,290,791|0.00%|10.3%|
|Central African Republic|5,551,900|0.00%|10.3%|
|Romania|19,838,662|0.00%|9.8%|
|Guinea-Bissau|1,788,000|0.00%|9.5%|
|Bhutan|757,000|0.00%|9.1%|
|Vietnam|91,714,080|0.00%|8.4%|
|Dominica|68,000|0.02%|8.3%|
|Mauritania|3,641,288|0.00%|7.7%|



Bottom Ten Countries by Proportion of High Quality Articles
----
|Country|Population|Article-per-Population|High Quality-per-Article
|:-------------|-------------:|-----:|-----:|
|Zambia|15,473,900|0.00%|0.0%|
|Solomon Islands|641,900|0.02%|0.0%|
|Nepal|28,039,000|0.00%|0.0%|
|Costa Rica|4,832,000|0.00%|0.0%|
|Moldova|4,109,000|0.01%|0.0%|
|Finland|5,476,031|0.01%|0.0%|
|Switzerland|8,292,851|0.00%|0.0%|
|Belgium|11,211,064|0.00%|0.0%|
|San Marino|33,000|0.25%|0.0%|
|Turkmenistan|5,373,000|0.00%|0.0%|





## 4. Reflections

By completing this assignment, I learned more about data processing and reconciliation steps through experience. While I have some career experience reconciling health data for actuarial analysis, returning to Python and working within an open science framework is new to me. I found that I am not very comfortable with the ORES API documentation, and I'm glad we were provided an example to work with.

#### The English Wikipedia Bias

The remainder of my analysis assumes the nationality bias in English Wikipedia reported on the [Wikipedia Editors Page](https://en.wikipedia.org/wiki/Wikipedia:Wikipedians). The source states that about 20% of English Wikipedia editors are from the United States. India is indicated as the only country outside Europe and North America among the top 10 countries by number of English Wikipedia editors.

#### Coverage Findings

The ten countries with the highest coverage proportions (articles per population) all have very small populations, so I do not consider this a strong metric for representation. However, the table surfaced a surprise: Tuvalu has both a high coverage proportion (0.47%) and a relatively high quality proportion (5.5%). While the small population makes the coverage proportion volatile and unreliable, the quality proportion is an interesting outlier that could be explored further.

Similarly, the massive populations in the bottom 10 coverage rankings obscure the result. India's large population could explain why it appears among the top 10 countries by number of English Wikipedia editors. Since we consider article coverage in proportion to a country's population, it might also have been beneficial to normalize the editor nationality counts by each nation's share of the world population. This may create a stronger link than using raw editor counts for each country.

#### Quality Findings

I expected the bias in Wikipedia editors to lead to more self-reflexive content, e.g., the United States having the highest proportion of high-quality articles. However, with North Korea and Saudi Arabia at the top of the list, it seems that the higher-quality articles may be concentrated on countries of high interest in the home countries of the majority of editors. This theory assumes the ORES algorithm is grading high-quality articles accurately, and that no other component of interest is inflating their scores. The bottom 10 by proportion of high-quality articles did not surprise me, as the list contains countries that neither match the predominant nationalities of English Wikipedia editors nor are in conflict with, or of particular interest to, those nationalities.