# Exploring Bias In the Quality of Political Wikipedia Articles Per Country
**Joshua Bone**  
**November 1, 2018**  
**UW DATA 512, Assignment 2**  

### Overview
The purpose of this assignment is to explore bias in the quality of English Wikipedia articles about politicians in various countries. 

Article quality is judged by a machine learning system known as the "[Objective Revision Evaluation Service](https://www.mediawiki.org/wiki/ORES)", or ORES, which has a publicly available API for assessing the quality of Wikipedia articles. Article quality is rated on the following scale (from highest to lowest quality):

- FA, or "Featured Article"
- GA, or "Good Article"
- B, or "B-class Article"
- C, or "C-class Article"
- Start, or "Start-class Article"
- Stub, or "Stub-class Article"

For the purposes of this study, we define **FA** and **GA** articles as high quality, and the rest as low quality.

### Data
#### Political Articles By Country
This dataset lists English Wikipedia articles on various politicians along with their nationality. It has been made available by Os Keyes on [Figshare](https://figshare.com/articles/Untitled_Item/5513449) under the [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

#### Population Data By Country
The original version of the data comes from the "Population mid-2018" dataset available for download at [PRB](https://www.prb.org/international/indicator/population/table/). This data has been modified and is available on [Dropbox](https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0). This notebook uses the modified version.

#### Population Data Supplements
Population data for 4 additional countries not represented in the PRB dataset were gathered from the [World Population API](https://www.programmableweb.com/api/world-population).

### Summary
<p>Overall, the study was inconclusive. Some results suggested bias. For example, 3 of the 4 most populous countries in the world occupied the bottom 3 spots for number of articles per capita, while the United States (the 3rd most populous country) did not appear among the bottom 10 at all. The two countries at the top of the list for percentage of all articles that are high quality, North Korea and Saudi Arabia, could plausibly hold those spots because of their disproportionately high level of interest to the West (national security in the first case, and oil in the second). 
<p>However, I found the rest of the results completely surprising, even baffling. The factors I would have predicted to lead to a high percentage of high quality English articles would have been:
    
- GDP
- So-called 'first-world' status (i.e. the U.S., Canada, and the E.U.)
- English speaking countries

<p> I did find some support for my theories. The U.S. appeared on the top 10 list for percentage of articles that are high quality, although only at number 9. I also found that 17 of the 46 countries having no high quality political articles were African nations, including 7 of the 9 most populous countries in that category.
    
<p> Central African Republic, consistently ranked as one of the world's poorest and least developed countries, appeared at number 3, blowing all of my theories out of the water. In fact, 3 African countries appeared on that top 10 list, all of them ranking in the bottom half by GDP and GDP per capita of countries on that continent. The only European Union country to make the list was Romania, at number 4. I am unable to explain these results. 
    
<p> Additionally, there were surprising contrasts in the data. Bhutan, a landlocked country in the Himalayas, ranked number 6 for percentage of articles that are high quality. However, its neighbor Nepal, 40 miles to the west, was tied with 45 other countries for last place, having no high quality political articles. Likewise, 17 African countries had zero high quality articles, yet 3 others from the same continent made the top 10 list. 
    
<p> In summary, more research would be needed to draw any strong conclusions from this data. It would be a mistake to accept the results where they appear to confirm the hypothesized factors, without having an explanation for the strange outliers discovered in this study.

We begin by importing the Python modules we will need to run our notebook.

In [157]:
import csv
import json
import numpy as np
import pandas as pd
import requests

### Notebook Configuration
We can edit this variable to change the way the notebook runs. If `USE_CACHED_API_RESULTS` is set to `False`, the notebook will attempt to fetch the raw data from the API, which may take up to 10 minutes. It is recommended to set it to `True` instead, to use the cached raw data which is stored in this notebook's directory.

In [179]:
USE_CACHED_API_RESULTS = True

### File, Data, and API Configuration
We define the file names, column names, and API information as constants here.

In [159]:
POPULATION_FILENAME = "WPDS_2018_data.csv"
PAGEDATA_FILENAME = "page_data.csv"
CSV_FILENAME = "cleaned.csv"

COUNTRY = "country"
POPULATION = "population"
REV_ID = "revision_id"
RATING = "article_quality"
ARTICLE = "article_name"
ARTICLE_CT = "article_ct"
HQ_CT = "hq_article_ct"
ARTICLES_PER_CAP = "articles_per_person_pct"
HQ_PCT = "hq_article_pct"

API_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}/?models={model}&revids={revids}"
API_HEADERS = {'User-Agent' : 'https://github.com/joshua-bone', 'From' : 'joshbone@uw.edu'}
API_CONTEXT = 'enwiki'
API_MODEL = 'wp10'

### Population Data By Country
Read in the population data from the CSV file, and rename the columns to the constants defined above. Check the first 5 rows to make sure the format looks OK.

In [160]:
pop_data = pd.read_csv(POPULATION_FILENAME)
pop_data = pop_data.rename(index=str, columns={"Geography": COUNTRY, "Population mid-2018 (millions)": POPULATION})
pop_data.head()

Unnamed: 0,country,population
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


### Page Data (Political Wikipedia Articles By Country)
Read in the page data from the CSV file, and rename the columns to the constants defined above. Check the first 5 rows to make sure the format looks OK.

In [161]:
page_data = pd.read_csv(PAGEDATA_FILENAME)
page_data = page_data.rename(index=str, columns={"page": ARTICLE, "rev_id": REV_ID})
page_data.head()

Unnamed: 0,article_name,country,revision_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### Merge Population Data With Page Data

Our goal is to merge `pop_data` with `page_data`, joining on the `country` column. Before we do this we should first compare the two tables to see how well they match on this column. 

In [162]:
pop_countries = set(pop_data[COUNTRY])
page_countries = set(page_data[COUNTRY])
print("These countries have population data, but no page data:")
print(pop_countries - page_countries)
print("\nThese countries have page data, but no population data:")
print(page_countries - pop_countries)

These countries have population data, but no page data:
{"Cote d'Ivoire", 'Honduras', 'NORTHERN AMERICA', 'Timor-Leste', 'Georgia', 'Congo, Dem. Rep.', 'Western Sahara', 'eSwatini', 'Curacao', 'Oman', 'St. Kitts-Nevis', 'St. Vincent and the Grenadines', 'Puerto Rico', 'Czechia', 'El Salvador', 'AFRICA', 'Samoa', 'LATIN AMERICA AND THE CARIBBEAN', 'Palau', 'ASIA', 'EUROPE', 'New Caledonia', 'Brunei', 'OCEANIA', 'Guam', 'French Polynesia', 'Saint Lucia'}

These countries have page data, but no population data:
{'Saint Vincent and the Grenadines', 'Rhodesian', 'Incan', 'Palestinian Territory', 'Cape Colony', 'Saint Lucian', 'Salvadoran', 'Jersey', 'Omani', 'Greenlandic', 'Swaziland', 'Congo, Dem. Rep. of', 'Palauan', 'Ossetian', 'Dagestani', 'South Korean', 'Chechen', 'Guadeloupe', 'Carniolan', 'Hondura', 'Pitcairn Islands', 'Abkhazia', 'South African Republic', 'Guernsey', 'French Guiana', 'Rojava', 'Somaliland', 'Niuean', 'Czech Republic', 'Saint Kitts and Nevis', 'Montserratian', 'Faro

We can see that many of the countries actually do correspond, but are spelled differently. We fix this by updating the mapping manually.

In [163]:
update_page_data={"Ivorian":"Cote d'Ivoire","Congo, Dem. Rep. of":"Congo, Dem. Rep.","Salvadoran":"El Salvador",
                 "Hondura":"Honduras","Saint Kitts and Nevis":"St. Kitts-Nevis","Saint Lucian":"Saint Lucia",
                 "Saint Vincent and the Grenadines":"St. Vincent and the Grenadines","Omani":"Oman",
                  "Samoan":"Samoa","Swaziland":"eSwatini","Czech Republic":"Czechia","South Korean":"Korea, South",
                  "Palauan":"Palau","East Timorese":"Timor-Leste","South African Republic":"South Africa"}
for country in update_page_data:
  print("Replacing '%s' with '%s'." % (country, update_page_data[country]))
  page_data.loc[page_data['country'] == country, 'country'] \
    = update_page_data[country]


Replacing 'Saint Kitts and Nevis' with 'St. Kitts-Nevis'.
Replacing 'Hondura' with 'Honduras'.
Replacing 'Samoan' with 'Samoa'.
Replacing 'Congo, Dem. Rep. of' with 'Congo, Dem. Rep.'.
Replacing 'Palauan' with 'Palau'.
Replacing 'Saint Vincent and the Grenadines' with 'St. Vincent and the Grenadines'.
Replacing 'South Korean' with 'Korea, South'.
Replacing 'East Timorese' with 'Timor-Leste'.
Replacing 'South African Republic' with 'South Africa'.
Replacing 'Swaziland' with 'eSwatini'.
Replacing 'Saint Lucian' with 'Saint Lucia'.
Replacing 'Salvadoran' with 'El Salvador'.
Replacing 'Omani' with 'Oman'.
Replacing 'Ivorian' with 'Cote d'Ivoire'.
Replacing 'Czech Republic' with 'Czechia'.
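
As an aside, the same renaming can be expressed without an explicit loop. The following is a minimal sketch, not the method used above: it assumes a toy frame (`toy_pages`, with made-up article names) and a shortened mapping (`name_fixes`), and relies on `Series.replace` accepting a dictionary of exact-match substitutions.

```python
import pandas as pd

# Toy stand-in for page_data; the article names are made up for illustration.
toy_pages = pd.DataFrame({
    "article_name": ["Article A", "Article B", "Article C"],
    "country": ["Hondura", "Czech Republic", "Zambia"],
})

# A shortened version of the manual mapping used in the notebook.
name_fixes = {"Hondura": "Honduras", "Czech Republic": "Czechia"}

# Series.replace with a dict swaps exact matches and leaves other values alone.
toy_pages["country"] = toy_pages["country"].replace(name_fixes)
print(toy_pages["country"].tolist())
```

Values that are not keys in the mapping (here, `Zambia`) pass through unchanged, which matches the behavior of the loop above.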


There are still a few countries or regions that are missing population data. We could choose to ignore these, but one region in particular, the Palestinian Territories, stands out to me as interesting for this study given world events. We can fill in some of this missing data from another resource. The [World Population API](https://www.programmableweb.com/api/world-population) returns the population of a given region at a given date. To stay consistent with the other population data, which is labeled as being from "Mid-2018", we call the World Population API for four more regions using a date of June 1, 2018.

In [164]:
#Supplement from World Population API (https://www.programmableweb.com/api/world-population)

country_set={
    "West Bank and Gaza",
    "Martinique",
    "Guadeloupe",  
    "French Guiana",
}
DATE = "2018-06-01"
POP_ENDPOINT = "http://api.population.io:80/1.0/population/{country}/{date}/"
for country in country_set:
  params = {'country': country, 'date':DATE}
  # Pass the headers by keyword: the second positional argument of requests.get is `params`.
  result = requests.get(POP_ENDPOINT.format(**params), headers=API_HEADERS).json()
  pop = round(result['total_population']['population'] / 1000000, 1)
  # DataFrame.append was removed in pandas 2.0; pd.concat adds the same row.
  pop_data = pd.concat([pop_data, pd.DataFrame([{COUNTRY:country, POPULATION:pop}])], ignore_index=True)
  print("Supplementing pop_data for %s: Population %.1fM" % (country, pop))

#Rename 'West Bank and Gaza' to match 'Palestinian Territory' in page_data
pop_data.loc[pop_data['country'] == 'West Bank and Gaza', 'country'] = "Palestinian Territory"

Supplementing pop_data for West Bank and Gaza: Population 5.1M
Supplementing pop_data for Guadeloupe: Population 0.5M
Supplementing pop_data for Martinique: Population 0.4M
Supplementing pop_data for French Guiana: Population 0.3M


We do one final check for the mismatched countries. We could try to look up more of the missing data elsewhere, but this looks good enough for our purposes.

In [165]:
pop_countries = set(pop_data[COUNTRY])
page_countries = set(page_data[COUNTRY])
print("These countries have population data, but no page data:")
print(pop_countries - page_countries)
print("\nThese countries have page data, but no population data:")
print(page_countries - pop_countries)

These countries have population data, but no page data:
{'AFRICA', 'EUROPE', 'NORTHERN AMERICA', 'Brunei', 'Georgia', 'OCEANIA', 'Guam', 'Western Sahara', 'French Polynesia', 'Curacao', 'LATIN AMERICA AND THE CARIBBEAN', 'Puerto Rico', 'ASIA', 'New Caledonia'}

These countries have page data, but no population data:
{'Montserratian', 'Greenlandic', 'Faroese', 'Ossetian', 'Pitcairn Islands', 'Abkhazia', 'Dagestani', 'Rhodesian', 'Incan', 'South Ossetian', 'Tokelauan', 'Cape Colony', 'Carniolan', 'Guernsey', 'Chechen', 'Rojava', 'Cook Island', 'Jersey', 'Somaliland', 'Niuean'}


We merge the page_data with the pop_data, using the default inner join (drops rows from either set that do not match a country in the other set), and check the first 5 rows to make sure the format is OK.

In [166]:
merged_page_pop = page_data.merge(pop_data, on=COUNTRY)
merged_page_pop.head()

Unnamed: 0,article_name,country,revision_id,population
0,Template:ZambiaProvincialMinisters,Zambia,235107991,17.7
1,Gladys Lundwe,Zambia,757566606,17.7
2,Mwamba Luchembe,Zambia,764848643,17.7
3,Thandiwe Banda,Zambia,768166426,17.7
4,Sylvester Chisembele,Zambia,776082926,17.7
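
Because an inner join drops unmatched rows silently, it can be useful to count what was lost. The sketch below is one possible audit, not part of the original analysis: it assumes toy frames (`toy_pages`, `toy_pop`) with made-up article names and a placeholder population for Guam, and uses the `indicator=True` option of `DataFrame.merge` to label each row's origin.

```python
import pandas as pd

# Toy stand-ins for page_data and pop_data; article names are made up.
toy_pages = pd.DataFrame({
    "article_name": ["Article A", "Article B", "Article C"],
    "country": ["Zambia", "Rojava", "Zambia"],
})
toy_pop = pd.DataFrame({
    "country": ["Zambia", "Guam"],
    "population": [17.7, 0.2],  # the Guam figure is a placeholder, not real data
})

# An outer join keeps every row; the _merge column records where it came from.
audit = toy_pages.merge(toy_pop, on="country", how="outer", indicator=True)
print(audit["_merge"].value_counts())

# Rows marked 'left_only' are articles an inner join would drop.
dropped_pages = audit.loc[audit["_merge"] == "left_only", "article_name"].tolist()
print(dropped_pages)
```

Rows labeled `both` are exactly the ones the inner join keeps, so filtering on that label reproduces the merged table.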


### Call the ORES API to Rank Article Quality (Or Use Cached Results)
The default configuration of this notebook skips the API calls entirely and uses the cached raw results. To call the API when the notebook is run, set `USE_CACHED_API_RESULTS = False` in the configuration cell near the beginning of this notebook. Be advised that this sequence of API calls may take 5-10 minutes to complete.

In [167]:
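def call_api(revision_id_list):
  # Multiple revision ids are sent in one request, separated by '|'.
  revision_id_string = "|".join(str(rev_id) for rev_id in revision_id_list)
  params = {'context' : API_CONTEXT,
            'revids' : revision_id_string,
            'model' : API_MODEL}
  # Pass the headers by keyword: the second positional argument of requests.get is `params`.
  return requests.get(API_ENDPOINT.format(**params), headers=API_HEADERS).json()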

# Split the revision ids up into chunks of 100
CHUNK_SIZE = 100
rev_ids = list(page_data[REV_ID])
chunks = [rev_ids[i:i+CHUNK_SIZE] for i in range(0, len(rev_ids), CHUNK_SIZE)]

# Only call the API if we are not using the cached raw results.
if not USE_CACHED_API_RESULTS:
  results, calls_made, one_percent = {}, 0, len(chunks) // 100 + 1
  print("Calling API %d times. This may take a while." % len(chunks))
  for chunk in chunks:
    #Update the status so user knows that progress is being made.
    if (calls_made % one_percent == 0): print("%02d%% " % (calls_made/one_percent), end='')
    if ((calls_made + 1) % (one_percent * 10) == 0): print()
    results.update(call_api(chunk)[API_CONTEXT]['scores'])
    calls_made += 1
  print("[FINISHED]")
  #save the raw api results
  with open('raw_api_results.json', 'w') as f:
    json.dump(results, f)

#Read in the raw API results from file.
with open('raw_api_results.json', 'r') as f:
  raw_data = json.load(f)

Calling API 472 times. This may take a while.
00% 01% 02% 03% 04% 05% 06% 07% 08% 09% 
10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 
20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 
30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 
40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 
50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 
60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 
70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 
80% 81% 82% 83% 84% 85% 86% 87% 88% 89% 
90% 91% 92% 93% 94% [FINISHED]
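
The batching step can be checked in isolation. This is a small sketch under stated assumptions: `chunked` is a hypothetical helper name, and the ids are stand-in integers rather than real revision ids. The count of 47,197 is the number of rows reported in the API results in this notebook.

```python
def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Stand-in ids; only the count matters for this check.
fake_rev_ids = list(range(47197))
chunks = chunked(fake_rev_ids, 100)

# Same progress step as the notebook: chunk count divided by 100, rounded up by one.
one_percent = len(chunks) // 100 + 1

print(len(chunks), len(chunks[-1]), one_percent)  # prints: 472 97 5
```

This also explains why the progress output stops at 94%: with 472 chunks the step is 5 calls, and the last multiple of 5 below 472 is 470, which prints as `94%`.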


### Clean the ORES API Data and Merge With Page and Population Data
First we remove the rows for which the ORES API returned an error message.

In [168]:
num_skipped = 0
ids, ratings = [], []
for rev_id in list(raw_data):
  if 'error' in raw_data[rev_id][API_MODEL]:
    num_skipped += 1
  else:
    ids.append(int(rev_id))
    ratings.append(raw_data[rev_id][API_MODEL]['score']['prediction'])
print("Skipped %d rows out of %d due to errors in API results.\n" % (num_skipped, len(ids) + num_skipped))

Skipped 105 rows out of 47197 due to errors in API results.
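
To make the filtering logic easier to follow, here is a sketch that runs the same check on a tiny mock dictionary. The dictionary mirrors only the keys the loop above reads (`error`, `score`, `prediction`). The revision ids, ratings, error contents, and the helper name `extract_predictions` are all made up for illustration.

```python
# Hypothetical fragment shaped like the entries the loop above reads.
mock_scores = {
    "1001": {"wp10": {"score": {"prediction": "Stub"}}},
    "1002": {"wp10": {"error": {"type": "ExampleError", "message": "placeholder"}}},
    "1003": {"wp10": {"score": {"prediction": "GA"}}},
}

def extract_predictions(scores, model="wp10"):
    """Return ({rev_id: rating}, [skipped rev_ids]) from a scores dictionary."""
    ratings, skipped = {}, []
    for rev_id, models in scores.items():
        result = models[model]
        if "error" in result:
            # No prediction is available for this revision, so set it aside.
            skipped.append(rev_id)
        else:
            ratings[int(rev_id)] = result["score"]["prediction"]
    return ratings, skipped

ratings_by_id, skipped_ids = extract_predictions(mock_scores)
print(ratings_by_id)
print(skipped_ids)
```

Keeping the skipped ids, rather than only counting them, makes it possible to inspect later which articles were excluded.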



Next, we create a new data frame and check the first 5 rows to make sure the format looks OK.

In [169]:
ores_df=pd.DataFrame({REV_ID:ids, RATING:ratings})
print(ores_df.head())

  article_quality  revision_id
0            Stub    726608422
1            Stub    719625016
2            Stub    723097428
3            Stub    784881283
4           Start    718621498


Finally, we merge the ORES dataframe with the previously merged population and page dataframes. We check the first 5 rows to make sure the format looks OK, and save to CSV.

In [170]:
final_merge = merged_page_pop.merge(ores_df, on=REV_ID)
final_merge.to_csv(CSV_FILENAME, index=False)
final_merge.head()

Unnamed: 0,article_name,country,revision_id,population,article_quality
0,Gladys Lundwe,Zambia,757566606,17.7,Stub
1,Mwamba Luchembe,Zambia,764848643,17.7,Stub
2,Thandiwe Banda,Zambia,768166426,17.7,Start
3,Sylvester Chisembele,Zambia,776082926,17.7,C
4,Victoria Kalima,Zambia,776530837,17.7,Start


### Calculate the Derived Fields
We need four fields:
<ul>
    <li>The total article count per country</li>
    <li>The high quality article count per country</li>
    <li>The percentage representing the number of articles per capita, per country</li>
    <li>The percentage representing the fraction of articles that are high quality, per country</li>
</ul>
First, we read the merged data back in from the CSV.

In [171]:
merged_data = pd.read_csv(CSV_FILENAME)

We get the total article count by grouping by country (we also group by population just to keep that data in the dataframe).

In [172]:
tot_ct = merged_data.groupby([COUNTRY, POPULATION]).size().to_frame(ARTICLE_CT).reset_index()
tot_ct.head()

Unnamed: 0,country,population,article_ct
0,Afghanistan,36.5,326
1,Albania,2.9,460
2,Algeria,42.7,119
3,Andorra,0.08,34
4,Angola,30.4,110


We get the high quality article count by adding a new boolean field and then grouping by country. Inspect the first 5 rows.

In [173]:
merged_data['is_hq'] = merged_data[RATING].isin(('FA', 'GA'))
hq_ct = merged_data.groupby([COUNTRY, 'is_hq']).size().to_frame(HQ_CT).reset_index()
hq_ct.head()

Unnamed: 0,country,is_hq,hq_article_ct
0,Afghanistan,False,316
1,Afghanistan,True,10
2,Albania,False,456
3,Albania,True,4
4,Algeria,False,117


Next, we merge the two derived dataframes together and calculate the required percentages. Inspect the first 5 rows.

In [174]:
#Left join, retaining only counts where 'is_hq'==True.
derived = tot_ct.merge(hq_ct.loc[hq_ct['is_hq']], on=COUNTRY, how='left').drop(columns='is_hq')
#Countries having zero high quality articles will get NaN for the new column, so we set these to zero.
derived[HQ_CT] = derived[HQ_CT].fillna(0)
#Convert back to integer since column was cast to floating point during the join.
derived[HQ_CT] = derived[HQ_CT].astype(int)
#Remove commas from population strings and cast to floating point.
#(astype(str) keeps this working if the column was already parsed as numeric.)
derived[POPULATION] = derived[POPULATION].astype(str).str.replace(',', '').astype(float)
#Calculate the number of articles per capita, per country as a percentage.
derived[ARTICLES_PER_CAP] = derived[ARTICLE_CT]*100/(derived[POPULATION]*1000000)
#Calculate the percentage of articles that are high quality, per country.
derived[HQ_PCT] = derived[HQ_CT]*100/derived[ARTICLE_CT]
derived.head()

Unnamed: 0,country,population,article_ct,hq_article_ct,articles_per_person_pct,hq_article_pct
0,Afghanistan,36.5,326,10,0.000893,3.067485
1,Albania,2.9,460,4,0.015862,0.869565
2,Algeria,42.7,119,2,0.000279,1.680672
3,Andorra,0.08,34,0,0.0425,0.0
4,Angola,30.4,110,0,0.000362,0.0
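
As a cross-check, both counts can also be computed in a single grouped aggregation, because summing a boolean column counts its `True` values. The sketch below is an alternative formulation, not the approach used above. It rebuilds two countries from the counts shown in the tables (Afghanistan: 326 articles, 10 high quality; Andorra: 34 articles, none high quality). The individual rating labels are placeholders, since only the FA/GA split matters here.

```python
import pandas as pd

# Counts come from the tables in this notebook; the per-row rating labels
# are placeholders chosen only to give the right FA/GA vs. other split.
rows = (
    [("Afghanistan", 36.5, "GA")] * 10
    + [("Afghanistan", 36.5, "Stub")] * 316
    + [("Andorra", 0.08, "Stub")] * 34
)
toy = pd.DataFrame(rows, columns=["country", "population", "article_quality"])
toy["is_hq"] = toy["article_quality"].isin(["FA", "GA"])

# size counts all rows per group; sum counts only the True values of is_hq.
summary = (
    toy.groupby(["country", "population"])
    .agg(article_ct=("is_hq", "size"), hq_article_ct=("is_hq", "sum"))
    .reset_index()
)
summary["articles_per_person_pct"] = (
    summary["article_ct"] * 100 / (summary["population"] * 1_000_000)
)
summary["hq_article_pct"] = summary["hq_article_ct"] * 100 / summary["article_ct"]
print(summary)
```

Countries with no high quality articles get a count of 0 directly, so no left join or `fillna` step is needed in this formulation.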


# Results
### 10 Countries With Highest Percentages Of High Quality Articles
There is no clear pattern among the countries with the highest percentages of high quality articles. North Korea leads the pack with nearly 18% of its articles being high quality, which is perhaps not surprising given its high level of political tension with the West. Saudi Arabia comes in at \#2, which could make sense given its historical importance as an oil-producing ally of the West. The United States unsurprisingly appears on the list, but only at \#9. Other than those three, I think the results are quite surprising. Three of the top 10 (Central African Republic, Mauritania, and Benin) are relatively impoverished African nations (see Wikipedia article, [List of African countries by GDP](https://en.wikipedia.org/wiki/List_of_African_countries_by_GDP_(nominal))). Two are tiny island nations (Tuvalu and Dominica). The landlocked Himalayan nation of Bhutan makes the list as well. The only member of the European Union present in the top 10 is Romania.

In [183]:
derived.loc[:, [COUNTRY, POPULATION, ARTICLE_CT, HQ_CT, HQ_PCT]] \
  .sort_values(by=HQ_PCT, ascending=False).head(10)

Unnamed: 0,country,population,article_ct,hq_article_ct,hq_article_pct
89,"Korea, North",25.6,39,7,17.948718
150,Saudi Arabia,33.4,119,16,13.445378
31,Central African Republic,4.7,68,8,11.764706
143,Romania,19.5,348,40,11.494253
112,Mauritania,4.5,52,5,9.615385
19,Bhutan,0.8,33,3,9.090909
182,Tuvalu,0.01,55,5,9.090909
47,Dominica,0.07,12,1,8.333333
187,United States,328.0,1092,82,7.509158
18,Benin,11.5,94,7,7.446809


### Countries With Lowest Percentages Of High Quality Articles
As it turns out, there are 46 countries that do not have any high quality articles (about 23% of all countries in the dataset). It would be meaningless to make a table of the bottom 10, so instead we simply list the countries that are tied for lowest in this category. We include their populations to make it easier to identify trends.

We find that there is a clear trend here, with 17 of the 46 countries on the list (and 7 of the 9 most populous) being African nations. Nepal (in South Asia) and Kazakhstan (in Central Asia) are not surprising among the 10 most populous countries on the list, being about as remote, both geographically and politically, from the English speaking world as one could get. 

What is very surprising is the geographical contrast with the top 10 countries by percentage of high quality articles. Nepal has zero high quality articles, but its neighbor Bhutan (just 40 miles away) is \#6 in the world for percentage of high quality articles. Cameroon has zero high quality articles, but its neighbor Central African Republic, ranked as the poorest nation in the world by [Business Insider](https://www.businessinsider.com/the-25-poorest-countries-in-the-world-2016-4?r=UK&IR=T#2-democratic-republic-of-congo--gdp-per-capita-753-525-24), is \#3 in the world for percentage of high quality articles.

In [176]:
zero_hq = derived[derived[HQ_CT]==0]
print("There are %d countries that do not have any high quality articles:" % len(zero_hq))
zero_hq.loc[:, [COUNTRY, POPULATION]].sort_values(by=POPULATION, ascending=False)

There are 46 countries that do not have any high quality articles:


Unnamed: 0,country,population
183,Uganda,44.1
120,Mozambique,30.5
4,Angola,30.4
124,Nepal,29.7
28,Cameroon,25.6
40,Cote d'Ivoire,24.9
86,Kazakhstan,18.4
194,Zambia,17.7
179,Tunisia,11.6
16,Belgium,11.4


### Countries With The Highest Number Of Articles Per Capita
It makes sense that countries with very low populations will have a higher number of articles per capita. The only outlier here is Iceland. Despite having roughly 10x the population of most of the other countries on the list, and despite its [official language being Icelandic](https://en.wikipedia.org/wiki/Languages_of_Iceland), it still makes the list with a whopping 206 political articles on English Wikipedia. 

In [177]:
derived.loc[:, [COUNTRY, POPULATION, ARTICLE_CT, ARTICLES_PER_CAP]] \
  .sort_values(by=ARTICLES_PER_CAP, ascending=False).head(10)

Unnamed: 0,country,population,article_ct,articles_per_person_pct
182,Tuvalu,0.01,55,0.55
123,Nauru,0.01,53,0.53
148,San Marino,0.03,82,0.273333
133,Palau,0.02,23,0.115
116,Monaco,0.04,40,0.1
100,Liechtenstein,0.04,29,0.0725
164,St. Kitts-Nevis,0.05,32,0.064
177,Tonga,0.1,63,0.063
110,Marshall Islands,0.06,37,0.061667
75,Iceland,0.4,206,0.0515


### Countries With The Lowest Number Of Articles Per Capita
Following the same trend as we saw above, countries with higher populations tend to have a lower number of articles per capita. In this case, 3 of the 4 most populous countries (China, India, and Indonesia) occupy the 3 lowest spots. The United States (the 3rd most populous country in the world) is conspicuously absent from the bottom 10. It is also interesting to note that all of the countries in this table are African or Asian. 

In [185]:
derived.loc[:, [COUNTRY, POPULATION, ARTICLE_CT, ARTICLES_PER_CAP]] \
  .sort_values(by=ARTICLES_PER_CAP, ascending=True).head(10)

Unnamed: 0,country,population,article_ct,articles_per_person_pct
76,India,1371.3,986,7.2e-05
77,Indonesia,265.2,214,8.1e-05
34,China,1393.8,1135,8.1e-05
189,Uzbekistan,32.9,29,8.8e-05
55,Ethiopia,107.5,105,9.8e-05
194,Zambia,17.7,25,0.000141
89,"Korea, North",25.6,39,0.000152
38,"Congo, Dem. Rep.",84.3,142,0.000168
174,Thailand,66.2,112,0.000169
13,Bangladesh,166.4,323,0.000194
