# A2 - Bias in Data

Patrick Peng (ID:2029888)  
DATA 512 AU 2021

In [1]:
import pandas as pd
import numpy as np

## Step 1: Getting the article and population data

The "Politicians by Country" dataset was downloaded from [Figshare](https://figshare.com/articles/dataset/Untitled_Item/5513449) and is licensed [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

In [2]:
pols_by_country_raw = pd.read_csv('page_data.csv')

The world population data is drawn from the [World Population Data Sheet](https://www.prb.org/international/indicator/population/table/) compiled by the Population Reference Bureau.

In [3]:
country_pop_raw = pd.read_csv('WPDS_2020_data.csv')

## Step 2: Cleaning the data

The "Politicians by Country" dataset contains pages that are not articles. These include templates (pages that start with the string "Template:") and lists (pages that start with "List of") that we want to remove from the dataset. We'll do that here.

In [4]:
pols_by_country = pols_by_country_raw[
    ~pols_by_country_raw['page'].str.startswith('Template:') & 
    ~pols_by_country_raw['page'].str.startswith('List of')
]
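As a quick sanity check, the same filter can be tried on a tiny hand-made frame. This is only a sketch: the page titles below are invented rather than taken from `page_data.csv`, and it uses a single `str.match` pattern (which anchors at the start of the string) in place of the two `startswith` calls.

```python
import pandas as pd

# Toy stand-in for page_data.csv; these rows are illustrative only.
toy_pages = pd.DataFrame({
    'page': ['Template:Example politicians', 'List of example ministers',
             'Example Person', 'Another Person'],
    'country': ['A', 'B', 'A', 'B'],
    'rev_id': [1, 2, 3, 4],
})

# str.match anchors at the start of the string, so this flags both prefixes.
is_non_article = toy_pages['page'].str.match(r'(?:Template:|List of)')
toy_articles = toy_pages[~is_non_article]
print(toy_articles['page'].tolist())
```

Only the two ordinary article rows should survive the filter.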

In [5]:
pols_by_country

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


Next, we will split the world population data into two DataFrames: one for country population counts and one for sub-regional population counts. The two kinds of rows are distinguished in the dataset by whether the name is printed in all caps.

In [6]:
country_pop = country_pop_raw[~country_pop_raw['Name'].str.isupper()]
subregion_pop = country_pop_raw[country_pop_raw['Name'].str.isupper()]

In [7]:
country_pop

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000


In [8]:
subregion_pop

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000


Before we go any further, we need to associate each country with one or more sub-regions, for performing analysis at the regional level later.  
The `country_pop_raw` dataset is arranged hierarchically. Each row containing a sub-region is followed by rows containing data for the countries within that subregion, repeated for all sub-regions. There is also a higher level of sub-region that contains other sub-regions. Since the `country_pop` and `subregion_pop` dataframes we created preserve the original indices from `country_pop_raw`, we can use the relative position of a country's index to identify its sub-region (basically, the last sub-region entry that appears above the location of the country entry).

In [9]:
subregion_index = subregion_pop.index  # for minor divisions like "Eastern Europe"
subregion2_index = np.array([1,64,67,109,166,216])  # for major divisions like "Europe"
country_index = country_pop.index

country_name = []
country_pop_list = []
subregion_name = []
subregion2_name = []
subregion_pop_list = []
subregion2_pop_list = []
for i in country_index:
    j = subregion_index[int(np.sum(i > subregion_index))-1]
    k = subregion2_index[int(np.sum(i > subregion2_index))-1]
    country_name.append(country_pop['Name'][i])
    country_pop_list.append(country_pop['Population'][i])
    subregion_name.append(subregion_pop['Name'][j])
    subregion_pop_list.append(subregion_pop['Population'][j])
    subregion2_name.append(subregion_pop['Name'][k])
    subregion2_pop_list.append(subregion_pop['Population'][k])
    
country_and_subregions = pd.DataFrame(data={'country':country_name,
                                            'country_pop': country_pop_list,
                                            'subregion':subregion_name,
                                            'subregion_pop':subregion_pop_list,
                                            'subregion2':subregion2_name,
                                            'subregion2_pop':subregion2_pop_list})

Now we have a neat table listing each country, its population, and the subregions it belongs to (along with the subregional populations).

In [10]:
country_and_subregions

Unnamed: 0,country,country_pop,subregion,subregion_pop,subregion2,subregion2_pop
0,Algeria,44357000,NORTHERN AFRICA,244344000,AFRICA,1337918000
1,Egypt,100803000,NORTHERN AFRICA,244344000,AFRICA,1337918000
2,Libya,6891000,NORTHERN AFRICA,244344000,AFRICA,1337918000
3,Morocco,35952000,NORTHERN AFRICA,244344000,AFRICA,1337918000
4,Sudan,43849000,NORTHERN AFRICA,244344000,AFRICA,1337918000
...,...,...,...,...,...,...
205,Samoa,200000,OCEANIA,43155000,OCEANIA,43155000
206,Solomon Islands,715000,OCEANIA,43155000,OCEANIA,43155000
207,Tonga,99000,OCEANIA,43155000,OCEANIA,43155000
208,Tuvalu,10000,OCEANIA,43155000,OCEANIA,43155000
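The positional lookup used above can also be expressed as a forward fill, which avoids the explicit loop. The following is a minimal sketch on a toy table, not a drop-in replacement: the rows, the populations, and the `major_regions` list are assumptions made for illustration, and the real notebook hard-codes the major divisions by index instead.

```python
import pandas as pd

# Toy table mimicking the WPDS layout: region rows in all caps, each followed
# by its member countries. Names and populations here are illustrative.
toy = pd.DataFrame({
    'Name': ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
             'WESTERN AFRICA', 'Benin', 'OCEANIA', 'Fiji'],
    'Population': [1000, 400, 40, 100, 600, 12, 43, 1],
})
major_regions = ['AFRICA', 'OCEANIA']  # assumed list of top-level divisions

is_region = toy['Name'].str.isupper()
is_major = toy['Name'].isin(major_regions)

# Keep the name only on region rows, then carry it down to the rows below.
toy['subregion'] = toy['Name'].where(is_region).ffill()
toy['subregion2'] = toy['Name'].where(is_major).ffill()

toy_countries = toy[~is_region].reset_index(drop=True)
print(toy_countries)
```

As in the table above, a region with no finer subdivision (OCEANIA in this toy data) ends up as both the minor and the major division of its countries.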


## Step 3: Getting article quality predictions

We'll use the REST API endpoint for ORES to get article quality predictions. We'll set it up here.

In [11]:
import json
import requests

In [12]:
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/?models=articlequality&revids={revid}'

headers = {
    'User-Agent': 'https://github.com/ppeng2',
    'From': 'ppeng2@uw.edu'
}

The maximum number of revids in a single API request appears to be 50, so we will have to batch our revids and make sequential API calls. I've written a function `batch_revids` below to perform this.

In [13]:
def batch_revids(batch_sz, revid_input):
    batch_list = []
    count = 0
    while count < len(revid_input):
        start_ind = count
        if count + batch_sz < len(revid_input):
            end_ind = count + batch_sz
        else:
            end_ind = len(revid_input)
        # 0-50 (0-49), 50-100 (50-99), 100-150 (100-149) ... 46200-46250 (46200-46249), 46250-46291
        batch_list.append('|'.join(str(x) for x in revid_input[start_ind:end_ind]))
        count = end_ind
    return batch_list
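For comparison, the same batching can be written with a stepped `range` and slicing. This is a sketch of an equivalent helper (the name `batch_revids_sliced` and the sample IDs are made up for illustration), which also makes it easy to check the behaviour of the last, shorter batch.

```python
def batch_revids_sliced(batch_sz, revid_input):
    """Join consecutive chunks of at most batch_sz revision IDs with '|'."""
    return ['|'.join(str(x) for x in revid_input[start:start + batch_sz])
            for start in range(0, len(revid_input), batch_sz)]

# Quick check on seven made-up IDs with a batch size of three.
example_batches = batch_revids_sliced(3, [101, 102, 103, 104, 105, 106, 107])
print(example_batches)  # ['101|102|103', '104|105|106', '107']
```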

In [14]:
all_revids = pols_by_country['rev_id'].to_list()
batch_list = batch_revids(50,all_revids)

Next, I wrote some functions to perform the ORES API call (`get_data`) and parse the resulting JSON structure (`parse_json`) to pull out the features of interest, namely the revid and the predicted score. `parse_json` also compiles a list of all revids that ORES couldn't retrieve a score for.

In [15]:
def get_data(revids):
    call = requests.get(endpoint.format(revid = revids), headers=headers)
    response = call.json()
    return response

def parse_json(response):
    revid_list = []
    score_list = []
    unscored_revids = []
    for i in response['enwiki']['scores']:
        try:
            score_list.append(response['enwiki']['scores'][i]['articlequality']['score']['prediction'])
            revid_list.append(i)
        except KeyError:
            unscored_revids.append(i)

    #score_data = pd.DataFrame({'rev_id': revid_list, 'score': score_list})
    return (revid_list,score_list,unscored_revids)
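The parsing logic can be exercised without touching the network by feeding it a hand-written dictionary. The sketch below assumes only the nesting that `parse_json` itself walks through; the revision IDs, the prediction, and the contents of the `error` entry are invented, and `parse_scores` is a hypothetical condensed variant written for this example.

```python
# Hand-written response in the nested shape that parse_json walks through.
# The revision IDs, the prediction, and the 'error' entry are invented.
mock_response = {
    'enwiki': {
        'scores': {
            '111': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '222': {'articlequality': {'error': {'message': 'not found'}}},
        }
    }
}

def parse_scores(response):
    """Split a response into (revid, prediction) pairs and unscored revids."""
    scored, unscored = [], []
    for revid, models in response['enwiki']['scores'].items():
        try:
            scored.append((revid, models['articlequality']['score']['prediction']))
        except KeyError:
            # No 'score' entry for this revision, so record it as unscored.
            unscored.append(revid)
    return scored, unscored

scored, unscored = parse_scores(mock_response)
print(scored)
print(unscored)
```

Note that the revision IDs come back as strings, because they are dictionary keys in the JSON.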

Now, we sequentially call `get_data` and `parse_json` on each batch we prepared. This takes a little bit of time. As each batch completes, we'll add its results to a set of `big_<parameter>_list`s. We'll convert them to a DataFrame once all the batches are done running (it's faster to do it this way rather than create a DataFrame for each batch then concatenate them).

In [16]:
big_revid_list = []
big_score_list = []
big_unscored_revid_list = []
for batch in batch_list:
    response = get_data(batch)
    (revid_list,score_list,unscored_revids) = parse_json(response)
    big_revid_list.extend(revid_list)
    big_score_list.extend(score_list)
    big_unscored_revid_list.extend(unscored_revids)
    
score_data = pd.DataFrame({'rev_id': big_revid_list, 'score': big_score_list})

In [17]:
score_data

Unnamed: 0,rev_id,score
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub
...,...,...
46014,807481636,C
46015,807482007,GA
46016,807483006,C
46017,807483153,GA


Before we continue, let's save a list of the pages that we couldn't retrieve scores for.

In [18]:
unscored_pages = pols_by_country[pols_by_country['rev_id'].isin(big_unscored_revid_list)]
unscored_pages.to_csv(path_or_buf='unscored_pages.csv',index=False)

## Step 4: Combining the datasets

We can do a database-style inner join of our `pols_by_country` and `score_data` dataframes using `.merge()` with `rev_id` as the key. Since it's an inner join, pages that we couldn't get scores for will not show up in the resulting dataframe.  
But before we do this we have to cast the `rev_id` column of `score_data` to int (currently str) so that it's consistent with that of `pols_by_country`.

In [19]:
score_data['rev_id'] = score_data['rev_id'].astype(int)
combined_dataset = pols_by_country.merge(score_data, on='rev_id')
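To illustrate the cast and the effect of the inner join, here is a small sketch on two made-up frames (the page names, revision IDs, and scores are invented). One revision ID appears only in the page list and one only in the score list, so a single row should remain.

```python
import pandas as pd

# Toy frames; all values are illustrative.
pages = pd.DataFrame({'page': ['P1', 'P2'], 'rev_id': [111, 222]})
scores = pd.DataFrame({'rev_id': ['111', '333'], 'score': ['Stub', 'GA']})

# Align the key dtype (str -> int) before joining.
scores['rev_id'] = scores['rev_id'].astype(int)

# merge() performs an inner join by default, so unmatched keys are dropped.
joined = pages.merge(scores, on='rev_id')
print(joined)
```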

We still need to add another column, for population, so we do another inner join, this time with the `country_and_subregions` dataframe.

In [20]:
combined_dataset2 = combined_dataset.merge(country_and_subregions, on='country')

This is the size of the dataset after we do the second inner join. In the process we lose about 1,800 pages whose `country` value has no match in the population data.

In [21]:
combined_dataset2.shape[0]

44201

Let's take a look at the pages that couldn't be matched on country and see which `country` values are causing problems.

In [22]:
no_match = pols_by_country[~pols_by_country['rev_id'].isin(combined_dataset2['rev_id'])]
no_match2 = no_match[~no_match['rev_id'].isin(unscored_pages['rev_id'])] # remove those that didn't have scores
no_match2['country'].unique()

array(['Czech Republic', 'Salvadoran', 'Rhodesian', 'Congo, Dem. Rep. of',
       'Cape Colony', 'Samoan', 'Montserratian', 'Pitcairn Islands',
       'Saint Kitts and Nevis', 'Abkhazia', 'East Timorese', 'Faroese',
       'Niuean', 'Ivorian', 'Carniolan', 'South Korean', 'Saint Lucian',
       'South African Republic', 'Hondura', 'Incan', 'Chechen', 'Jersey',
       'Guernsey', 'Macedonia', 'Saint Vincent and the Grenadines',
       'South Ossetian', 'Cook Island', 'Omani', 'Tokelauan', 'Swaziland',
       'Dagestani', 'Greenlandic', 'Ossetian', 'Palauan', 'Somaliland',
       'Rojava'], dtype=object)
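An alternative way to surface unmatched keys, sketched here on toy data, is the `indicator` option of `merge`, which labels each row by where its key was found. The frames below are illustrative stand-ins, not the notebook's actual tables.

```python
import pandas as pd

# Toy frames; rows are illustrative only.
articles = pd.DataFrame({
    'page': ['P1', 'P2', 'P3'],
    'country': ['Chad', 'Rhodesian', 'Samoan'],
})
populations = pd.DataFrame({
    'country': ['Chad', 'Samoa'],
    'country_pop': [16877000, 200000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
checked = articles.merge(populations, on='country', how='left', indicator=True)
unmatched_countries = checked.loc[checked['_merge'] == 'left_only', 'country'].unique()
print(list(unmatched_countries))
```

Here the demonym "Samoan" fails to match "Samoa", which is the same kind of mismatch seen in the real data.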

It looks like there are some typos and errors in the `country` field, most commonly the use of the demonym rather than the country name or the use of an outdated name. Just for the heck of it, I'll try to fix some of them and see if we can reduce the number of no-match pages.  

In [23]:
correction_dict = {'Czech Republic': 'Czechia', 
                   'Salvadoran': 'El Salvador', 
                   'Congo, Dem. Rep. of': 'Congo, Dem. Rep.',
                   'Samoan': 'Samoa',
                   'Saint Kitts and Nevis':'St. Kitts-Nevis',
                   'Ivorian': "Cote d'Ivoire",
                   'South Korean': 'Korea, South',
                   'Saint Lucian': 'Saint Lucia',
                   'Hondura': 'Honduras',
                   'Jersey': 'Channel Islands',
                   'Guernsey': 'Channel Islands',
                   'Macedonia': 'North Macedonia',
                   'Saint Vincent and the Grenadines': 'St. Vincent and the Grenadines',
                   'Omani': 'Oman',
                   'Swaziland': 'eSwatini',
                   'Palauan': 'Palau'
                  }

combined_dataset['country'] = combined_dataset['country'].replace(to_replace=correction_dict)

Having done that, we'll do the join again. This time we use an outer join, because we want the result to include both "countries with no matching articles" and "articles with no matching country", which we can pull out and save to a file later.

In [24]:
combined_dataset3 = combined_dataset.merge(country_and_subregions, how='outer', on='country')
combined_dataset3

Unnamed: 0,page,country,rev_id,score,country_pop,subregion,subregion_pop,subregion2,subregion2_pop
0,Bir I of Kanem,Chad,355319463.0,Stub,16877000.0,MIDDLE AFRICA,1.797570e+08,AFRICA,1.337918e+09
1,Abdullah II of Kanem,Chad,498683267.0,Stub,16877000.0,MIDDLE AFRICA,1.797570e+08,AFRICA,1.337918e+09
2,Salmama II of Kanem,Chad,565745353.0,Stub,16877000.0,MIDDLE AFRICA,1.797570e+08,AFRICA,1.337918e+09
3,Kuri I of Kanem,Chad,565745365.0,Stub,16877000.0,MIDDLE AFRICA,1.797570e+08,AFRICA,1.337918e+09
4,Mohammed I of Kanem,Chad,565745375.0,Stub,16877000.0,MIDDLE AFRICA,1.797570e+08,AFRICA,1.337918e+09
...,...,...,...,...,...,...,...,...,...
46027,,"China, Hong Kong SAR",,,7494000.0,EAST ASIA,1.641063e+09,ASIA,4.625927e+09
46028,,"China, Macao SAR",,,686000.0,EAST ASIA,1.641063e+09,ASIA,4.625927e+09
46029,,French Polynesia,,,280000.0,OCEANIA,4.315500e+07,OCEANIA,4.315500e+07
46030,,Guam,,,175000.0,OCEANIA,4.315500e+07,OCEANIA,4.315500e+07


Before we save anything to file, we'll rename and reorder some columns.

In [25]:
combined_dataset3 = combined_dataset3[['country','page','rev_id','score','country_pop','subregion','subregion_pop','subregion2','subregion2_pop']]
combined_dataset3 = combined_dataset3.rename(columns={'page':'article_name','score':'article_quality_est'})

Let's pull out the rows where no match for either country or article could be found. We can identify these because they have NaN for one or more columns.

In [26]:
missing_data = combined_dataset3[combined_dataset3.isnull().any(axis=1)]
matched_data = combined_dataset3[~combined_dataset3.isnull().any(axis=1)]

In [27]:
matched_data

Unnamed: 0,country,article_name,rev_id,article_quality_est,country_pop,subregion,subregion_pop,subregion2,subregion2_pop
0,Chad,Bir I of Kanem,355319463.0,Stub,16877000.0,MIDDLE AFRICA,179757000.0,AFRICA,1.337918e+09
1,Chad,Abdullah II of Kanem,498683267.0,Stub,16877000.0,MIDDLE AFRICA,179757000.0,AFRICA,1.337918e+09
2,Chad,Salmama II of Kanem,565745353.0,Stub,16877000.0,MIDDLE AFRICA,179757000.0,AFRICA,1.337918e+09
3,Chad,Kuri I of Kanem,565745365.0,Stub,16877000.0,MIDDLE AFRICA,179757000.0,AFRICA,1.337918e+09
4,Chad,Mohammed I of Kanem,565745375.0,Stub,16877000.0,MIDDLE AFRICA,179757000.0,AFRICA,1.337918e+09
...,...,...,...,...,...,...,...,...,...
46009,Seychelles,Rita Sinon,800323154.0,Stub,98000.0,EASTERN AFRICA,444970000.0,AFRICA,1.337918e+09
46010,Seychelles,Sylvette Frichot,800323798.0,Stub,98000.0,EASTERN AFRICA,444970000.0,AFRICA,1.337918e+09
46011,Seychelles,May De Silva,800969960.0,Start,98000.0,EASTERN AFRICA,444970000.0,AFRICA,1.337918e+09
46012,Seychelles,Vincent Meriton,802051093.0,Stub,98000.0,EASTERN AFRICA,444970000.0,AFRICA,1.337918e+09


In [28]:
missing_data

Unnamed: 0,country,article_name,rev_id,article_quality_est,country_pop,subregion,subregion_pop,subregion2,subregion2_pop
8359,Rhodesian,Gervas Clay,574571582.0,Stub,,,,,
8360,Rhodesian,Harry Davies (politician),669387487.0,Start,,,,,
8361,Rhodesian,William Fairbridge,682576258.0,Start,,,,,
8362,Rhodesian,Washington Malianga,711506001.0,Stub,,,,,
8363,Rhodesian,Ronald Snapper,712233907.0,Stub,,,,,
...,...,...,...,...,...,...,...,...,...
46027,"China, Hong Kong SAR",,,,7494000.0,EAST ASIA,1.641063e+09,ASIA,4.625927e+09
46028,"China, Macao SAR",,,,686000.0,EAST ASIA,1.641063e+09,ASIA,4.625927e+09
46029,French Polynesia,,,,280000.0,OCEANIA,4.315500e+07,OCEANIA,4.315500e+07
46030,Guam,,,,175000.0,OCEANIA,4.315500e+07,OCEANIA,4.315500e+07


Finally, we'll save the matched and no-match datasets to file. By manually fixing some of the country encodings we've reduced the number of lost pages to about 500.

In [29]:
matched_data.to_csv(path_or_buf='wp_wpds_politicians_by_country.csv',index=False)
missing_data.to_csv(path_or_buf='wp_wpds_countries-no_match.csv',index=False)

## Step 5: Analysis

We will perform some pivots on `matched_data` to obtain our desired insights. First, we get a measure of coverage: the total number of articles for each country.

In [30]:
total_articles = pd.pivot_table(data=matched_data,index=['country','country_pop'],values='article_name',aggfunc='count')
total_articles.reset_index(inplace=True)
total_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

Next, we'll get a count of GA or FA articles for each country.

In [31]:
# First pivot: for each country, how many articles are in each score class
quality_articles = pd.pivot_table(data=matched_data,index=['country','country_pop','article_quality_est'],values='article_name',aggfunc='count')
quality_articles.reset_index(inplace=True)
quality_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

# Filter: GA or FA scores only
quality_articles = quality_articles[(quality_articles['article_quality_est']=='GA') | (quality_articles['article_quality_est']=='FA')]

# Second pivot: For each country, sum up the number of GA and FA scores
quality_articles2 = pd.pivot_table(data=quality_articles,index=['country','country_pop'],values='total_articles_count',aggfunc='sum')
quality_articles2.rename(columns={'total_articles_count':'quality_articles_count'},inplace=True)
quality_articles2.reset_index(inplace=True)

Now we'll perform a right outer join with `total_articles` to bring in the `total_articles_count` attribute so we can calculate a proportion. We use a right join, which keeps every country in `total_articles`, because some countries have no GA or FA articles and would otherwise be dropped. Their missing counts are then filled with 0.

In [32]:
country_data = quality_articles2.merge(total_articles,on=['country','country_pop'],how='right')
country_data.fillna(value=0,inplace=True)

We now have everything we need to calculate the proportions at the country level. We'll do those calculations now.

In [33]:
country_data['articles_per_capita'] = country_data['total_articles_count']/country_data['country_pop']
country_data['quality_fraction'] = country_data['quality_articles_count']/country_data['total_articles_count']
country_data

Unnamed: 0,country,country_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
0,Afghanistan,38928000.0,13.0,313,0.000008,0.041534
1,Albania,2838000.0,3.0,454,0.000160,0.006608
2,Algeria,44357000.0,2.0,112,0.000003,0.017857
3,Andorra,82000.0,0.0,33,0.000402,0.000000
4,Angola,32522000.0,0.0,106,0.000003,0.000000
...,...,...,...,...,...,...
192,Vietnam,96209000.0,13.0,185,0.000002,0.070270
193,Yemen,29826000.0,3.0,114,0.000004,0.026316
194,Zambia,18384000.0,0.0,25,0.000001,0.000000
195,Zimbabwe,14863000.0,2.0,160,0.000011,0.012500
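The pivot, filter, pivot, and join sequence above can also be condensed into one grouped aggregation. The following is a sketch under assumptions, not the method used in this notebook: `summarize` is a hypothetical helper, and the toy input only borrows the column names of `matched_data`.

```python
import pandas as pd

def summarize(df, name_col, pop_col):
    """Count all articles and GA/FA articles per group, then derive ratios."""
    flagged = df.assign(is_quality=df['article_quality_est'].isin(['GA', 'FA']))
    out = (flagged.groupby([name_col, pop_col])
                  .agg(total_articles_count=('article_name', 'count'),
                       quality_articles_count=('is_quality', 'sum'))
                  .reset_index())
    out['articles_per_capita'] = out['total_articles_count'] / out[pop_col]
    out['quality_fraction'] = out['quality_articles_count'] / out['total_articles_count']
    return out

# Toy input with the same column names as matched_data; values are made up.
toy_matched = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'country_pop': [1000.0, 1000.0, 1000.0, 500.0, 500.0],
    'article_name': ['a1', 'a2', 'a3', 'b1', 'b2'],
    'article_quality_est': ['GA', 'Stub', 'FA', 'Start', 'C'],
})
toy_summary = summarize(toy_matched, 'country', 'country_pop')
print(toy_summary)
```

Because groups with no GA or FA articles simply sum to zero, no join or `fillna` is needed, and the same helper could in principle be called with `('subregion', 'subregion_pop')` or `('subregion2', 'subregion2_pop')`.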


Now we'll repeat the same analysis at the sub-regional level. Since there are actually two levels of sub-regions, but they all have the same tag, we have to do this twice and then take the union of the two tables.

In [34]:
# For the lower level of subregion e.g "eastern europe"
# coverage (count of all articles for a sub-region)
total_articles = pd.pivot_table(data=matched_data,index=['subregion','subregion_pop'],values='article_name',aggfunc='count')
total_articles.reset_index(inplace=True)
total_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

# Quality proportion
# First pivot: for each sub-region, how many articles are in each score class
quality_articles = pd.pivot_table(data=matched_data,index=['subregion','subregion_pop','article_quality_est'],values='article_name',aggfunc='count')
quality_articles.reset_index(inplace=True)
quality_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

# Filter: GA or FA scores only
quality_articles = quality_articles[(quality_articles['article_quality_est']=='GA') | (quality_articles['article_quality_est']=='FA')]

# Second pivot: For each sub-region, sum up the number of GA and FA scores
quality_articles2 = pd.pivot_table(data=quality_articles,index=['subregion','subregion_pop'],values='total_articles_count',aggfunc='sum')
quality_articles2.rename(columns={'total_articles_count':'quality_articles_count'},inplace=True)
quality_articles2.reset_index(inplace=True)

region_data = quality_articles2.merge(total_articles,on=['subregion','subregion_pop'],how='right')
region_data.fillna(value=0,inplace=True)

region_data['articles_per_capita'] = region_data['total_articles_count']/region_data['subregion_pop']
region_data['quality_fraction'] = region_data['quality_articles_count']/region_data['total_articles_count']
region_data

Unnamed: 0,subregion,subregion_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
0,CARIBBEAN,43233000.0,14,782,1.8e-05,0.017903
1,CENTRAL AMERICA,178611000.0,25,1839,1e-05,0.013594
2,CENTRAL ASIA,74961000.0,7,241,3e-06,0.029046
3,EAST ASIA,1641063000.0,76,2543,2e-06,0.029886
4,EASTERN AFRICA,444970000.0,35,2472,6e-06,0.014159
5,EASTERN EUROPE,291902000.0,119,3968,1.4e-05,0.02999
6,MIDDLE AFRICA,179757000.0,22,788,4e-06,0.027919
7,NORTHERN AFRICA,244344000.0,19,882,4e-06,0.021542
8,NORTHERN AMERICA,368193000.0,102,1843,5e-06,0.055345
9,NORTHERN EUROPE,105990000.0,104,3834,3.6e-05,0.027126


In [35]:
# for the higher level of subregion e.g. "europe"
# coverage (count of all articles for a sub-region)
total_articles = pd.pivot_table(data=matched_data,index=['subregion2','subregion2_pop'],values='article_name',aggfunc='count')
total_articles.reset_index(inplace=True)
total_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

# Quality proportion
# First pivot: for each sub-region, how many articles are in each score class
quality_articles = pd.pivot_table(data=matched_data,index=['subregion2','subregion2_pop','article_quality_est'],values='article_name',aggfunc='count')
quality_articles.reset_index(inplace=True)
quality_articles.rename(columns={'article_name':'total_articles_count'},inplace=True)

# Filter: GA or FA scores only
quality_articles = quality_articles[(quality_articles['article_quality_est']=='GA') | (quality_articles['article_quality_est']=='FA')]

# Second pivot: For each sub-region, sum up the number of GA and FA scores
quality_articles2 = pd.pivot_table(data=quality_articles,index=['subregion2','subregion2_pop'],values='total_articles_count',aggfunc='sum')
quality_articles2.rename(columns={'total_articles_count':'quality_articles_count'},inplace=True)
quality_articles2.reset_index(inplace=True)

region2_data = quality_articles2.merge(total_articles,on=['subregion2','subregion2_pop'],how='right')
region2_data.fillna(value=0,inplace=True)

region2_data['articles_per_capita'] = region2_data['total_articles_count']/region2_data['subregion2_pop']
region2_data['quality_fraction'] = region2_data['quality_articles_count']/region2_data['total_articles_count']
region2_data.rename(columns={'subregion2':'subregion','subregion2_pop':'subregion_pop'},inplace=True)
region2_data

Unnamed: 0,subregion,subregion_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
0,AFRICA,1337918000.0,126,6979,5e-06,0.018054
1,ASIA,4625927000.0,315,11697,3e-06,0.02693
2,EUROPE,746622000.0,354,16079,2.2e-05,0.022016
3,LATIN AMERICA AND THE CARIBBEAN,651036000.0,79,5641,9e-06,0.014005
4,NORTHERN AMERICA,368193000.0,102,1843,5e-06,0.055345
5,OCEANIA,43155000.0,63,3206,7.4e-05,0.019651


Now to take the union of the two subregion tables and get them all in one. Since "oceania" and "northern america" occur in both sets, we will make sure to drop duplicates from the final table.

In [36]:
region_data_all = pd.concat([region_data,region2_data]).drop_duplicates().reset_index(drop=True)
region_data_all

Unnamed: 0,subregion,subregion_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
0,CARIBBEAN,43233000.0,14,782,1.8e-05,0.017903
1,CENTRAL AMERICA,178611000.0,25,1839,1e-05,0.013594
2,CENTRAL ASIA,74961000.0,7,241,3e-06,0.029046
3,EAST ASIA,1641063000.0,76,2543,2e-06,0.029886
4,EASTERN AFRICA,444970000.0,35,2472,6e-06,0.014159
5,EASTERN EUROPE,291902000.0,119,3968,1.4e-05,0.02999
6,MIDDLE AFRICA,179757000.0,22,788,4e-06,0.027919
7,NORTHERN AFRICA,244344000.0,19,882,4e-06,0.021542
8,NORTHERN AMERICA,368193000.0,102,1843,5e-06,0.055345
9,NORTHERN EUROPE,105990000.0,104,3834,3.6e-05,0.027126
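Since `drop_duplicates()` with no arguments compares every column, it only removes a repeated region when all of its values agree exactly. A slightly more explicit variant, sketched on toy tables with invented counts, deduplicates on the region name alone.

```python
import pandas as pd

# Toy tables; the OCEANIA row is repeated on purpose, values are illustrative.
minor = pd.DataFrame({'subregion': ['CARIBBEAN', 'OCEANIA'],
                      'total_articles_count': [10, 20]})
major = pd.DataFrame({'subregion': ['EUROPE', 'OCEANIA'],
                      'total_articles_count': [30, 20]})

# Deduplicate on the region name alone, keeping the first occurrence.
combined = (pd.concat([minor, major])
              .drop_duplicates(subset='subregion')
              .reset_index(drop=True))
print(combined['subregion'].tolist())
```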


## Step 6: Results

### 6.1 Top 10 countries by coverage

In [37]:
country_data.nlargest(10,'articles_per_capita')

Unnamed: 0,country,country_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
182,Tuvalu,10000.0,4.0,50,0.005,0.08
123,Nauru,11000.0,0.0,52,0.004727,0.0
149,San Marino,34000.0,0.0,77,0.002265,0.0
134,Palau,18000.0,1.0,21,0.001167,0.047619
116,Monaco,38000.0,0.0,39,0.001026,0.0
101,Liechtenstein,39000.0,0.0,27,0.000692,0.0
177,Tonga,99000.0,0.0,63,0.000636,0.0
110,Marshall Islands,57000.0,0.0,36,0.000632,0.0
76,Iceland,368000.0,2.0,201,0.000546,0.00995
165,St. Kitts-Nevis,54000.0,0.0,29,0.000537,0.0


### 6.2 Bottom 10 countries by coverage

In [38]:
country_data.nsmallest(10,'articles_per_capita')

Unnamed: 0,country,country_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
77,India,1400100000.0,13.0,964,6.885222e-07,0.013485
78,Indonesia,271739000.0,9.0,208,7.654404e-07,0.043269
35,China,1402385000.0,40.0,1124,8.014917e-07,0.035587
189,Uzbekistan,34174000.0,3.0,28,8.193363e-07,0.107143
56,Ethiopia,114916000.0,2.0,96,8.353928e-07,0.020833
90,"Korea, North",25779000.0,8.0,35,1.357694e-06,0.228571
194,Zambia,18384000.0,0.0,25,1.359878e-06,0.0
39,"Congo, Dem. Rep.",89568000.0,8.0,139,1.551894e-06,0.057554
175,Thailand,66534000.0,3.0,111,1.66832e-06,0.027027
120,Mozambique,31166000.0,0.0,57,1.828916e-06,0.0


### 6.3 Top 10 countries by relative quality

In [39]:
country_data.nlargest(10,'quality_fraction')

Unnamed: 0,country,country_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
90,"Korea, North",25779000.0,8.0,35,1.357694e-06,0.228571
151,Saudi Arabia,35041000.0,15.0,116,3.310408e-06,0.12931
144,Romania,19241000.0,42.0,338,1.756665e-05,0.12426
112,Mauritania,4650000.0,5.0,46,9.892473e-06,0.108696
189,Uzbekistan,34174000.0,3.0,28,8.193363e-07,0.107143
31,Central African Republic,4830000.0,6.0,64,1.325052e-05,0.09375
69,Guatemala,18066000.0,7.0,83,4.594265e-06,0.084337
48,Dominica,72000.0,1.0,12,0.0001666667,0.083333
182,Tuvalu,10000.0,4.0,50,0.005,0.08
171,Syria,19398000.0,10.0,127,6.547067e-06,0.07874


### 6.4 Bottom 10 countries by relative quality

In [40]:
country_data.nsmallest(10,'quality_fraction')

Unnamed: 0,country,country_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
3,Andorra,82000.0,0.0,33,0.000402,0.0
4,Angola,32522000.0,0.0,106,3e-06,0.0
5,Antigua and Barbuda,98000.0,0.0,24,0.000245,0.0
11,Bahamas,393000.0,0.0,20,5.1e-05,0.0
12,Bahrain,1465000.0,0.0,42,2.9e-05,0.0
14,Barbados,287000.0,0.0,14,4.9e-05,0.0
17,Belize,419000.0,0.0,16,3.8e-05,0.0
30,Cape Verde,556000.0,0.0,34,6.1e-05,0.0
37,Comoros,870000.0,0.0,49,5.6e-05,0.0
40,Costa Rica,5111000.0,0.0,143,2.8e-05,0.0


Note that these are just the first 10 entries in an alphabetized list of all countries with 0 GA or FA ranked articles. Not a particularly interesting result.
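To see the whole tied group rather than an arbitrary first 10, one option is `nsmallest(..., keep='all')`, or simply filtering on a zero fraction. The sketch below uses a made-up four-row table to show both.

```python
import pandas as pd

# Toy table; names and fractions are made up.
toy_quality = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'quality_fraction': [0.0, 0.0, 0.0, 0.05],
})

# keep='all' returns every row tied with the n-th smallest value.
tied_bottom = toy_quality.nsmallest(2, 'quality_fraction', keep='all')

# Equivalent for this purpose: select every row with no GA or FA articles.
zero_quality = toy_quality[toy_quality['quality_fraction'] == 0]
print(len(tied_bottom), len(zero_quality))
```

Applied to `country_data`, the length of the zero-fraction selection would give the number of countries with no GA or FA articles.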

### 6.5 Geographic regions by coverage

In [41]:
region_data_all.sort_values(by=['articles_per_capita'],ascending=False)

Unnamed: 0,subregion,subregion_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
10,OCEANIA,43155000.0,63,3206,7.4e-05,0.019651
9,NORTHERN EUROPE,105990000.0,104,3834,3.6e-05,0.027126
15,SOUTHERN EUROPE,153251000.0,75,3731,2.4e-05,0.020102
18,WESTERN EUROPE,195479000.0,56,4546,2.3e-05,0.012319
21,EUROPE,746622000.0,354,16079,2.2e-05,0.022016
0,CARIBBEAN,43233000.0,14,782,1.8e-05,0.017903
5,EASTERN EUROPE,291902000.0,119,3968,1.4e-05,0.02999
1,CENTRAL AMERICA,178611000.0,25,1839,1e-05,0.013594
14,SOUTHERN AFRICA,67732000.0,9,659,1e-05,0.013657
17,WESTERN ASIA,280927000.0,89,2563,9e-06,0.034725


### 6.6 Geographic regions by relative quality

In [42]:
region_data_all.sort_values(by=['quality_fraction'],ascending=False)

Unnamed: 0,subregion,subregion_pop,quality_articles_count,total_articles_count,articles_per_capita,quality_fraction
8,NORTHERN AMERICA,368193000.0,102,1843,5e-06,0.055345
13,SOUTHEAST ASIA,661845000.0,72,2004,3e-06,0.035928
17,WESTERN ASIA,280927000.0,89,2563,9e-06,0.034725
5,EASTERN EUROPE,291902000.0,119,3968,1.4e-05,0.02999
3,EAST ASIA,1641063000.0,76,2543,2e-06,0.029886
2,CENTRAL ASIA,74961000.0,7,241,3e-06,0.029046
6,MIDDLE AFRICA,179757000.0,22,788,4e-06,0.027919
9,NORTHERN EUROPE,105990000.0,104,3834,3.6e-05,0.027126
20,ASIA,4625927000.0,315,11697,3e-06,0.02693
21,EUROPE,746622000.0,354,16079,2.2e-05,0.022016


## Step 7: Reflection and writeup

Many of the results of this analysis were unsurprising. The Coverage metric, that is, articles per capita, predictably privileges small, low-population countries while penalizing large, populous ones. The top 10 countries on the Coverage metric are exclusively island nations and European microstates (and sparsely populated Oceania tops the regional Coverage standings), while the bottom of the league table is occupied by population heavyweights like India and China (which also drag down their respective regions in the regional standings). I do not believe this is due to bias so much as to the simple fact that having 4-5 orders of magnitude more citizens does not necessarily mean that a country will have 4-5 orders of magnitude more notable political figures than another.

Far more interesting to me is the Relative Quality metric, which measures the proportion of articles that the ORES model predicts would be classified as "Good Articles" (GA) or "Featured Articles" (FA). While it would not be difficult to theorize why North Korea performs well on this metric (this famously secretive country has few known political figures, and those that are known are fairly notorious), there are some baffling inclusions, such as Romania's whopping 42 quality articles, which represent over a third of Eastern Europe's 119 quality articles. Upon further investigation, many Romania-related articles that ORES predicts would be GA or FA actually receive lower ratings, such as B or C, from human reviewers. As ORES' ML model makes its predictions based largely on the structural characteristics of the article rather than the quality of the writing, it's possible that Romanian editors on English Wikipedia (I can't imagine there are that many of them) are simply very good at laying out articles in a way that fools ORES' algorithm.
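As a quick check of the Romania figure, the share can be recomputed from the two counts reported in the tables above (42 for Romania in section 6.3, 119 for EASTERN EUROPE in Step 5). The variable names are just for this example.

```python
romania_quality = 42          # quality_articles_count for Romania (section 6.3)
eastern_europe_quality = 119  # quality_articles_count for EASTERN EUROPE (Step 5)

romania_share = romania_quality / eastern_europe_quality
print(round(romania_share, 3))
```

This works out to roughly 35% of the sub-region's GA and FA predictions.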

Before beginning this analysis, I expected wealthier, English-speaking countries to perform better on both the Coverage and Relative Quality metrics. This being English Wikipedia, I guessed that most editors would hail from such countries and be interested in writing about them. At the country level, any bias in this direction appears to have been washed out by countries that have either low populations or a small number of articles, but it is partially reflected in Northern America's ranking at the top of the table for Relative Quality. As this sub-region consists solely of the United States and Canada, two wealthy, English-speaking countries, this result could reflect homer bias on the part of English Wikipedia editors. However, the rest of the data do not particularly suggest a high degree of Anglo bias on English Wikipedia.

I think a useful improvement to this analysis would be to use the number of unique editors who have worked on politician articles associated with each country, rather than the country's population, as the denominator in the articles-per-capita calculation. This would help mitigate the Coverage benefit or penalty associated with very low- or very high-population countries, since I would assume that China-related articles do not have 10,000 times as many editors as Vanuatu-related ones. We would also be able to see the extent to which a homer bias exists on the part of editors.
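The proposed metric could be computed along the following lines. This is purely a sketch: the `unique_editors` column is hypothetical (no such data was collected in this notebook) and every number in the toy table is made up.

```python
import pandas as pd

# Hypothetical input: 'unique_editors' is NOT in the dataset used here and
# would have to be collected separately. All numbers below are made up.
toy_editors = pd.DataFrame({
    'country': ['A', 'B'],
    'total_articles_count': [1000, 50],
    'unique_editors': [400, 25],
})

# Replace the population denominator with the editor count.
toy_editors['articles_per_editor'] = (
    toy_editors['total_articles_count'] / toy_editors['unique_editors']
)
print(toy_editors)
```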