# Considering Bias in Data


The goal of this notebook is to explore the concept of bias in data using Wikipedia articles. As part of this analysis, we consider articles on political figures from different countries. We will use a machine learning service called ORES to estimate the quality of each article. Our analysis will consist of a series of tables that show:
 - The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
 - The countries with the highest and lowest proportion of high quality articles about politicians.
 - A ranking of geographic regions by articles-per-person and proportion of high quality articles.
 
The notebook ends with a short reflection on our findings. The code cells below document every step that led to those results, so that the causes and consequences of biased data in large, complex data science projects can be examined.
 

In [1]:
# importing the necessary Python modules

import json, time, urllib.parse
import requests
import os
import pandas as pd
import numpy as np
from tqdm import tqdm 

## Step 1: Getting the Article and Population Data

The first step is getting the data, which comes from more than one source. We need data that lists Wikipedia articles about politicians and data on country populations.<br />
The Wikipedia Category:Politicians by nationality was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries.<br />
The population data is drawn from the World Population Data Sheet published by the Population Reference Bureau. This CSV file has only two columns, Geography and Population.


We move one directory up so that the data files can be accessed relative to the project root in the subsequent cells.

In [2]:
#defining the absolute path for the project

os.chdir("../")
DATA_PATH = os.path.abspath(os.curdir)

Reading the politicians_by_country_SEPT.2022.csv file, which lists politicians together with their respective Wikipedia articles.

In [3]:
CSV_PATH_POLITICIAN = DATA_PATH +'/data/csv_files/politicians_by_country_SEPT.2022.csv'

df_politician = pd.read_csv(CSV_PATH_POLITICIAN)
df_politician.head(2)

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan


The population_by_country_2022.csv file contains some rows that provide cumulative regional population counts. These rows are distinguished by ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA), and they will not match the country values in politicians_by_country_SEPT.2022.csv. To support the regional analysis, we reorganized the data into four columns: continent, region, country and population (in millions).
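One way to carry out this kind of reorganization is sketched below. It is a minimal illustration, not the exact cleaning that produced the file: it assumes each ALL CAPS row introduces the region for the country rows that follow it, the function name `attach_regions` is hypothetical, and the regional totals in the sample are placeholders. It assigns only a region; deriving the continent column would need an extra mapping that is not shown.

```python
# Sample rows in the assumed layout of the raw file: (Geography, Population in millions).
# The ALL CAPS rows stand for regional totals; their values here are placeholders.
raw_rows = [
    ("NORTHERN AFRICA", 0.0),
    ("Algeria", 44.9),
    ("Egypt", 103.5),
    ("OCEANIA", 0.0),
    ("Tonga", 0.1),
    ("Vanuatu", 0.3),
]


def attach_regions(rows):
    """Drop the ALL CAPS rows and tag each country with the latest region seen."""
    current_region = None
    cleaned = []
    for geography, population in rows:
        if geography.isupper():
            # an ALL CAPS value marks the start of a new region
            current_region = geography
        else:
            cleaned.append(
                {"region": current_region, "country": geography, "population": population}
            )
    return cleaned


cleaned_rows = attach_regions(raw_rows)
for row in cleaned_rows:
    print(row)
```

The regional total rows are dropped because they would otherwise be treated as countries that never match an article.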


In [4]:
# Load reorganized country population data from the cleaned directory

CSV_PATH_POPULATION = DATA_PATH +'/data/Cleaned_data/population_by_country_2022_cleaned.csv'

df_population = pd.read_csv(CSV_PATH_POPULATION)
df_population.head()

Unnamed: 0,continent,region,country,population
0,AFRICA,NORTHERN AFRICA,Algeria,44.9
1,AFRICA,NORTHERN AFRICA,Egypt,103.5
2,AFRICA,NORTHERN AFRICA,Libya,6.8
3,AFRICA,NORTHERN AFRICA,Morocco,36.7
4,AFRICA,NORTHERN AFRICA,Sudan,46.9


## Step 2: Getting Article Quality Predictions

In this step, we need to get the predicted quality scores for each article in the Wikipedia dataset using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article 

<br/>
These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors.<br/>
ORES requires a specific revision ID of a specific article to be able to make a label prediction. We can use the API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.
<br/><br/>
The next few cells try to get a Wikipedia page quality prediction from ORES for each politician’s article page by:

- reading each line of politicians_by_country_SEPT.2022.csv,
- making a page info request to get the current page revision, and
- making an ORES request using the current revision ID.
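To make the second step concrete, here is a small sketch of pulling the current revision ID out of a page info response. The two sample dictionaries are trimmed, illustrative stand-ins for what the API returns (the page ID is made up), and `extract_lastrevid` is a hypothetical helper, not part of the notebook's code.

```python
# Trimmed examples of the JSON returned for a prop=info query; values are illustrative.
found_response = {
    "query": {
        "pages": {
            "1234": {"pageid": 1234, "title": "Ahmed Aw Dahir", "lastrevid": 1062918876}
        }
    }
}
# A title with no page comes back under a negative key and has no 'lastrevid'.
missing_response = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}


def extract_lastrevid(response):
    """Return the latest revision ID from a page info response, or None if there is none."""
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # only one title is requested per call, so the first entry is the one we want
        return page.get("lastrevid")
    return None


print(extract_lastrevid(found_response))
print(extract_lastrevid(missing_response))
```

Returning `None` for a missing page or a failed request lets the caller record the title as having no revision instead of raising an error.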


In [5]:
# This code will access page info data using the 
# [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page).
#
#    CONSTANTS
#
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

REQUEST_HEADERS = {
    'User-Agent': '<sdawark@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string.
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",     
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.
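As an illustration of how a parameter template turns into a request, the sketch below fills in a title and encodes the query string with the standard library. The names `ENDPOINT`, `PARAMS_TEMPLATE` and `build_pageinfo_url` are hypothetical stand-ins that mirror the constants above; the result is roughly the URL that an HTTP client would build from the same parameters, and no request is sent.

```python
from urllib.parse import urlencode

ENDPOINT = "https://en.wikipedia.org/w/api.php"
PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}


def build_pageinfo_url(article_title, template=PARAMS_TEMPLATE):
    """Return the request URL for one title without modifying the shared template."""
    # dict(template, titles=...) makes a copy, so the template keeps its empty title
    params = dict(template, titles=article_title)
    return ENDPOINT + "?" + urlencode(params)


url = build_pageinfo_url("Ahmed Aw Dahir")
print(url)
```

Copying the template on each call keeps the shared constant unchanged, which matters when the same template is reused for thousands of titles.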

In [6]:
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
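    # work on a copy so the shared template (a default argument) is not modified between calls
    request_template = dict(request_template)
    request_template['titles'] = article_title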
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


The functions below generate quality scores for article revisions using [ORES](https://www.mediawiki.org/wiki/ORES). The API documentation can be accessed from the main [ORES](https://ores.wikimedia.org) page.

In [7]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there

REQUEST_HEADERS = {
    'User-Agent': '<sdawark@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article revisions. Therefore, the main parameter is article_revid.
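The way the endpoint and the path template combine can be seen in a short sketch. `ORES_ENDPOINT`, `ORES_PATH` and `build_ores_url` are hypothetical names that mirror the constants above; the code only builds the URL string and sends nothing.

```python
ORES_ENDPOINT = "https://ores.wikimedia.org/v3"
ORES_PATH = "/scores/{context}/{revid}/{model}"


def build_ores_url(revid, context="enwiki", model="articlequality", features=False):
    """Return the scoring URL for one revision ID."""
    url = ORES_ENDPOINT + ORES_PATH.format(context=context, revid=revid, model=model)
    if features:
        # optionally ask for the model's input features along with the score
        url += "?features=true"
    return url


ores_url = build_ores_url(1062918876)
print(ores_url)
```

Only the revision ID changes from call to call; the context and model stay fixed for this analysis.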

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template, leaving the shared default untouched
    request_template = dict(request_template, revid=article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


This process loops through the articles (politicians by country) and tries to retrieve the latest revision ID of each one. If an article has no revision, its title is saved in a list called ARTICLE_NO_REVISION. All articles with a revision are saved in a dictionary called ARTICLE_REVISIONS, keyed by article title.

In [9]:
#Function to dump json response

def outputToJson(filename,data):
    out_path = filename + '.json'
    with open(out_path, 'w') as f:
        json.dump(data, f) # dumping data to disk

In [10]:
ARTICLE_NO_REVISION = []
ARTICLE_REVISIONS = {}

ARTICLE_TITLES = df_politician['name']

# NOTE: the loop starts at row 6289, so this run covers only the last 1295 rows of the file
for i in tqdm(range(6289, len(ARTICLE_TITLES))):
    info = request_pageinfo_per_article(ARTICLE_TITLES[i])
    # a failed request returns None; fall back to an empty page set in that case
    pages = (info or {}).get('query', {}).get('pages', {})
    info_key = next(iter(pages), None)
    # 'lastrevid' is only present for pages that exist
    revision_id = pages[info_key].get('lastrevid', 0) if info_key is not None else 0
    # Check if the article has a revision
    if revision_id > 0:
        # Update ARTICLE_REVISIONS dict with the article and last revision number
        ARTICLE_REVISIONS[ARTICLE_TITLES[i]] = revision_id
    else:
        # update the list of articles with no revision
        ARTICLE_NO_REVISION.append(ARTICLE_TITLES[i])
        
outputToJson("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/article_revisions_test", ARTICLE_REVISIONS)
print('Completed')

100%|██████████| 1295/1295 [04:51<00:00,  4.44it/s]

Completed





In [11]:
#Function to read the JSON file

def read_json(filename):
    data = {}
    with open(filename, "r") as f:
        data = json.loads(f.read())
    return data

The cell below reads the JSON file generated in the previous cell and requests the predicted quality of each article revision from ORES.
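A score response is nested by wiki, revision ID and model, so reading the prediction takes several lookups. The sketch below shows one defensive way to do it. The sample dictionaries are trimmed and illustrative, the shape of the error entry is an assumption, and `extract_prediction` is a hypothetical helper.

```python
# Trimmed example of a successful response for one revision.
sample_score = {
    "enwiki": {
        "scores": {
            "1062918876": {"articlequality": {"score": {"prediction": "Start"}}}
        }
    }
}
# Assumed shape of a response where the revision could not be scored.
sample_error = {
    "enwiki": {
        "scores": {"123": {"articlequality": {"error": {"type": "RevisionNotFound"}}}}
    }
}


def extract_prediction(score_response, context="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response carries no score."""
    if not score_response:
        return None
    scores = score_response.get(context, {}).get("scores", {})
    for revision in scores.values():
        # one revision is requested per call, so the first entry is the one we want
        return revision.get(model, {}).get("score", {}).get("prediction")
    return None


print(extract_prediction(sample_score))
print(extract_prediction(sample_error))
```

Using `.get` at each level means a single unscored revision yields `None` rather than a `KeyError` that would stop a long-running loop.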

In [14]:
ARTICLE_REVISIONS = read_json("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/article_revisions_test.json")

ARTICLE_QUALITY = {}
for ARTICLE in tqdm(ARTICLE_REVISIONS):
    score = request_ores_score_per_article(ARTICLE_REVISIONS[ARTICLE])
    obj = score['enwiki']['scores']
    info_key = list(obj.keys())[0]
    quality = score['enwiki']['scores'][info_key]['articlequality']['score']['prediction']
    ARTICLE_QUALITY.update({ARTICLE:quality})
outputToJson("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/article_quality", ARTICLE_QUALITY)
print('Quality Data Retreived Succesfully from the JSON file')

100%|██████████| 1293/1293 [04:21<00:00,  4.95it/s]

Quality Data Retreived Succesfully from the JSON file





## Step 3: Combining the Datasets

After retrieving the ORES data for each article, we merge the Wikipedia data and the population data. Both have fields containing country names for just that purpose. Some entries cannot be merged because the population dataset has no entry for the corresponding Wikipedia country, or vice versa.

The remaining data is consolidated into a single CSV file called:

wp_politicians_by_country.csv
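One way to see which entries fail to merge in either direction is an outer merge with pandas' `indicator` option, sketched below on a tiny made-up sample ("Example Politician" and "Atlantis" are invented for illustration). This is an alternative illustration, not the approach used in the cells that follow.

```python
import pandas as pd

# Tiny illustrative inputs; the last article row has a country with no population entry.
articles = pd.DataFrame(
    {
        "article_title": ["Ahmed Aw Dahir", "Example Politician"],
        "country": ["Somalia", "Atlantis"],
    }
)
population = pd.DataFrame(
    {
        "country": ["Somalia", "Tonga"],
        "region": ["EAST AFRICA", "OCEANIA"],
        "population": [17.6, 0.1],
    }
)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = articles.merge(population, how="outer", on="country", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched_countries = sorted(merged.loc[merged["_merge"] != "both", "country"])

print(matched)
print(unmatched_countries)
```

Rows marked `left_only` are countries that appear only in the article list, and rows marked `right_only` appear only in the population data.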

In [15]:
ARTICLE_QUALITY = read_json("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/article_quality.json")
df_article_quality = pd.DataFrame(ARTICLE_QUALITY.items(), columns=['article_title', 'article_quality'])

ARTICLE_REVISIONS = read_json("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/article_revisions_test.json")
df_article_revisions = pd.DataFrame(ARTICLE_REVISIONS.items(), columns=['article_title', 'revision_Id'])

In [17]:
df_article_quality.head(5)

Unnamed: 0,article_title,article_quality
0,Suleiman Mohamoud Adan,Stub
1,Zamzam Abdi Adan,Stub
2,Ahmed Aw Dahir,Start
3,Mohammed Ahmed Alin,Stub
4,Mohamed Nour Arrale,Stub


In [18]:
df_article_revisions.head(5)

Unnamed: 0,article_title,revision_Id
0,Suleiman Mohamoud Adan,1100648418
1,Zamzam Abdi Adan,1104180941
2,Ahmed Aw Dahir,1062918876
3,Mohammed Ahmed Alin,1067230288
4,Mohamed Nour Arrale,1010517655


Merging the df_article_revisions and df_article_quality dataframes to get a single dataframe containing article_title, revision_Id and article_quality.

In [19]:
df_revision_quality = pd.merge(df_article_revisions,df_article_quality, how='left', on='article_title')
df_revision_quality.head(5)

Unnamed: 0,article_title,revision_Id,article_quality
0,Suleiman Mohamoud Adan,1100648418,Stub
1,Zamzam Abdi Adan,1104180941,Stub
2,Ahmed Aw Dahir,1062918876,Start
3,Mohammed Ahmed Alin,1067230288,Stub
4,Mohamed Nour Arrale,1010517655,Stub


In the next cell we merge df_revision_quality with df_politician to get the df dataframe, which has the columns article_title, revision_Id, article_quality and country.

In [21]:
df_politician.rename(columns = {'name':'article_title'}, inplace = True)
df = df_revision_quality.merge(df_politician[['article_title', 'country']], how='left', on="article_title")
df.head(5)

Unnamed: 0,article_title,revision_Id,article_quality,country
0,Suleiman Mohamoud Adan,1100648418,Stub,Somalia
1,Zamzam Abdi Adan,1104180941,Stub,Somalia
2,Ahmed Aw Dahir,1062918876,Start,Somalia
3,Mohammed Ahmed Alin,1067230288,Stub,Somalia
4,Mohamed Nour Arrale,1010517655,Stub,Somalia


In order to get the final dataframe for our analysis, we merge the df dataframe with the df_population dataframe so that the final dataframe has the columns mentioned below:
- country
- region
- population
- article_title
- revision_Id
- article_quality

In [44]:
df_final = df.merge(df_population[['country', 'region', 'population']], how='left', on="country")

df_final.to_csv("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/wp_politicians_by_country.csv", encoding='utf-8', index=False)


In [45]:
df_final.head(2)

Unnamed: 0,article_title,revision_Id,article_quality,country,region,population
0,Suleiman Mohamoud Adan,1100648418,Stub,Somalia,EAST AFRICA,17.6
1,Zamzam Abdi Adan,1104180941,Stub,Somalia,EAST AFRICA,17.6


We identified all countries for which there are no matches and wrote them, one country per line, to a file named:

wp_countries-no_match.txt

In [25]:
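# List of countries that appear in only one of the two datasets (checked in both directions)

pop_only = df_population.loc[~df_population['country'].isin(df_final['country']), 'country']
wp_only = df_final.loc[~df_final['country'].isin(df_population['country']), 'country']
no_match = sorted(set(pop_only.dropna()) | set(wp_only.dropna()))
# write one country name per line
with open("/Users/qwert/Documents/UW_Data_Science/Human_Centered_Data_Science/Homeworks/data-512-homework_2/data/output/wp_countries-no_match.txt", 'w', encoding='utf-8') as f:
    f.write('\n'.join(no_match))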


In [26]:
df_final.head(5)

Unnamed: 0,article_title,revision_Id,article_quality,country,region,population
0,Suleiman Mohamoud Adan,1100648418,Stub,Somalia,EAST AFRICA,17.6
1,Zamzam Abdi Adan,1104180941,Stub,Somalia,EAST AFRICA,17.6
2,Ahmed Aw Dahir,1062918876,Start,Somalia,EAST AFRICA,17.6
3,Mohammed Ahmed Alin,1067230288,Stub,Somalia,EAST AFRICA,17.6
4,Mohamed Nour Arrale,1010517655,Stub,Somalia,EAST AFRICA,17.6


## Step 4: Analysis

The dataframe built in the previous section is now ready. We start the analysis by counting the articles for each country.


In [29]:
df_by_country = df_final.groupby(by=['region','country','population'])\
                .agg({'article_title':'count'}).reset_index()\
                .rename(columns={'article_title':'article_count', 'population':'population (millions)'})

Calculating the ratio of article_count to population as a new column, total_articles_per_population. The population is recorded in millions, so it is multiplied by 1,000,000 to give articles per person.
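As a quick check of the arithmetic, the sketch below computes the ratio for two rows using the article counts and populations that appear in this run's country table. The helper name `articles_per_person` is hypothetical.

```python
def articles_per_person(article_count, population_millions):
    """Articles divided by the number of people (population is given in millions)."""
    return article_count / (population_millions * 1_000_000)


cuba = articles_per_person(1, 11.1)       # 1 article, 11.1 million people
trinidad = articles_per_person(16, 1.4)   # 16 articles, 1.4 million people

print(f"Cuba: {cuba:.6e}")                 # prints Cuba: 9.009009e-08
print(f"Trinidad and Tobago: {trinidad:.6e}")  # prints Trinidad and Tobago: 1.142857e-05
```

A population recorded as 0.0 would make this division fail in plain Python and produce `inf` in pandas, so very small countries need separate handling.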

In [30]:
df_by_country['total_articles_per_population'] = df_by_country['article_count']/(df_by_country['population (millions)'] * 1000000)

In [31]:
df_by_country.head(5)

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population
0,CARIBBEAN,Cuba,11.1,1,9.009009e-08
1,CARIBBEAN,St. Kitts-Nevis,0.1,3,3e-05
2,CARIBBEAN,St. Vincent and the Grenadines,0.1,3,3e-05
3,CARIBBEAN,Trinidad and Tobago,1.4,16,1.142857e-05
4,CENTRAL AMERICA,El Salvador,6.3,1,1.587302e-07


Filtering the records whose predicted quality is FA or GA, which we treat as high quality, and counting them per country.
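A compact way to see the filter-and-count step, including countries that end up with no high quality articles, is sketched below. The sample rows are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up sample: one row per article with its predicted quality class.
sample = pd.DataFrame(
    {
        "country": ["Somalia", "Somalia", "Spain", "Spain", "Cuba"],
        "article_quality": ["Stub", "GA", "FA", "GA", "Start"],
    }
)

# keep only the rows whose predicted class is FA or GA
is_high_quality = sample["article_quality"].isin(["FA", "GA"])

# count per country, then add back countries with no high quality article as 0
quality_count = (
    sample[is_high_quality]
    .groupby("country")
    .size()
    .reindex(sample["country"].unique(), fill_value=0)
)

print(quality_count)
```

Filtering first drops countries without any FA or GA article from the grouped result, which is why they have to be restored with a zero count before computing per-capita figures.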

In [32]:
quality = ['FA','GA']

df_quality_articles = df_final.loc[df_final['article_quality'].isin(quality)]
df_quality_by_country = df_quality_articles.groupby(by=['country'])\
                        .agg({'article_title':'count'}).reset_index()\
                        .rename(columns={'article_title':'quality_count', 'population':'population (millions)'})

In [33]:
df_quality_by_country.head(5)

Unnamed: 0,country,quality_count
0,Netherlands,1
1,Somalia,1
2,South Africa,4
3,South Sudan,2
4,Spain,17


In [34]:
df_country_final = pd.merge(df_by_country, df_quality_by_country, how='left', on="country")
df_country_final['quality_count'] = df_country_final['quality_count'].fillna(0)
df_country_final['high_quality_per_person'] = df_country_final['quality_count']/(df_country_final['population (millions)'] * 1000000)
df_country_final.head(5)

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population,quality_count,high_quality_per_person
0,CARIBBEAN,Cuba,11.1,1,9.009009e-08,0.0,0.0
1,CARIBBEAN,St. Kitts-Nevis,0.1,3,3e-05,0.0,0.0
2,CARIBBEAN,St. Vincent and the Grenadines,0.1,3,3e-05,0.0,0.0
3,CARIBBEAN,Trinidad and Tobago,1.4,16,1.142857e-05,0.0,0.0
4,CENTRAL AMERICA,El Salvador,6.3,1,1.587302e-07,0.0,0.0


## Step 5: Results
Our results from this analysis will be produced in the form of data tables.

####  - Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) 

In [35]:
df_top_10_total_articles = df_country_final.copy()
df_top_10_total_articles.replace([np.inf, -np.inf], np.nan, inplace=True)
# Drop rows with NaN
df_top_10_total_articles.dropna(inplace=True)
df_top_10_total_articles.nlargest(10, 'total_articles_per_population')

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population,quality_count,high_quality_per_person
27,SOUTH AMERICA,Suriname,0.6,23,3.8e-05,1.0,1.666667e-06
1,CARIBBEAN,St. Kitts-Nevis,0.1,3,3e-05,0.0,0.0
2,CARIBBEAN,St. Vincent and the Grenadines,0.1,3,3e-05,0.0,0.0
23,OCEANIA,Tonga,0.1,3,3e-05,0.0,0.0
25,OCEANIA,Vanuatu,0.3,5,1.7e-05,0.0,0.0
34,SOUTHEAST ASIA,Timor-Leste,1.3,17,1.3e-05,0.0,0.0
3,CARIBBEAN,Trinidad and Tobago,1.4,16,1.1e-05,0.0,0.0
28,SOUTH AMERICA,Uruguay,3.6,39,1.1e-05,1.0,2.777778e-07
11,EAST AFRICA,South Sudan,10.9,92,8e-06,2.0,1.834862e-07
46,WESTERN EUROPE,Switzerland,8.8,63,7e-06,3.0,3.409091e-07


####  -  Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) 

In [36]:
df_bottom_10_total_articles = df_country_final.copy()
df_bottom_10_total_articles.nsmallest(10, 'total_articles_per_population')

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population,quality_count,high_quality_per_person
32,SOUTHEAST ASIA,Indonesia,275.5,1,3.629764e-09,0.0,0.0
17,EASTERN EUROPE,Russia,144.3,1,6.930007e-09,0.0,0.0
37,SOUTHERN EUROPE,Italy,58.9,1,1.697793e-08,0.0,0.0
26,SOUTH AMERICA,Chile,19.8,1,5.050505e-08,0.0,0.0
5,CENTRAL ASIA,Kazakhstan,19.2,1,5.208333e-08,0.0,0.0
45,WESTERN EUROPE,Netherlands,17.7,1,5.649718e-08,1.0,5.649718e-08
31,SOUTHEAST ASIA,Cambodia,16.8,1,5.952381e-08,0.0,0.0
0,CARIBBEAN,Cuba,11.1,1,9.009009e-08,0.0,0.0
30,SOUTH ASIA,Sri Lanka,22.4,3,1.339286e-07,0.0,0.0
6,CENTRAL ASIA,Kyrgyzstan,6.8,1,1.470588e-07,0.0,0.0


####  -  Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) 

In [37]:
df_top_10_total_quality = df_country_final.copy()
df_top_10_total_quality.nlargest(10, 'high_quality_per_person')

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population,quality_count,high_quality_per_person
24,OCEANIA,Tuvalu,0.0,11,inf,1.0,inf
27,SOUTH AMERICA,Suriname,0.6,23,3.8e-05,1.0,1.666667e-06
43,WESTERN ASIA,United Arab Emirates,9.4,37,4e-06,4.0,4.255319e-07
38,SOUTHERN EUROPE,Spain,47.4,152,3e-06,17.0,3.586498e-07
46,WESTERN EUROPE,Switzerland,8.8,63,7e-06,3.0,3.409091e-07
22,NORTHERN EUROPE,Sweden,10.5,63,6e-06,3.0,2.857143e-07
28,SOUTH AMERICA,Uruguay,3.6,39,1.1e-05,1.0,2.777778e-07
20,NORTHERN AFRICA,Tunisia,11.8,54,5e-06,3.0,2.542373e-07
11,EAST AFRICA,South Sudan,10.9,92,8e-06,2.0,1.834862e-07
41,WESTERN ASIA,Syria,22.1,36,2e-06,3.0,1.357466e-07
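The `inf` values for Tuvalu arise because its population is recorded as 0.0 million, presumably after rounding, so the division has a zero denominator. A minimal sketch of one way to keep such rows out of a ranking is shown below; it treats a zero population as missing and is an illustration, not the cell that produced the table above.

```python
import numpy as np
import pandas as pd

# Three rows taken from the table above: country, population in millions, high quality count.
df_sample = pd.DataFrame(
    {
        "country": ["Tuvalu", "Suriname", "Spain"],
        "population (millions)": [0.0, 0.6, 47.4],
        "quality_count": [1, 1, 17],
    }
)

# treat a population of zero as missing so the ratio becomes NaN instead of inf
people = df_sample["population (millions)"].replace(0, np.nan) * 1_000_000
df_sample["high_quality_per_person"] = df_sample["quality_count"] / people

# drop rows without a usable ratio before ranking
ranked = df_sample.dropna(subset=["high_quality_per_person"]).nlargest(
    10, "high_quality_per_person"
)
print(ranked)
```

Whether to drop such countries or to look up an unrounded population is a judgment call; dropping them is simply the least invasive option for a ranking.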


####  -  Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [38]:
df_bottom_10_total_quality = df_country_final.copy()
# treat zeros as missing so that countries with no high quality articles are left out of the ranking
df_bottom_10_total_quality.replace([0.0], np.nan, inplace=True)
# Drop rows with NaN
df_bottom_10_total_quality.dropna(inplace=True)
df_bottom_10_total_quality.nsmallest(10, 'high_quality_per_person')

Unnamed: 0,region,country,population (millions),article_count,total_articles_per_population,quality_count,high_quality_per_person
33,SOUTHEAST ASIA,Thailand,66.8,28,4.191617e-07,1.0,1.497006e-08
35,SOUTHEAST ASIA,Vietnam,99.4,27,2.716298e-07,2.0,2.012072e-08
13,EAST AFRICA,Uganda,47.2,44,9.322034e-07,1.0,2.118644e-08
19,NORTHERN AFRICA,Sudan,46.9,33,7.036247e-07,1.0,2.132196e-08
45,WESTERN EUROPE,Netherlands,17.7,1,5.649718e-08,1.0,5.649718e-08
10,EAST AFRICA,Somalia,17.6,44,2.5e-06,1.0,5.681818e-08
44,WESTERN ASIA,Yemen,33.7,61,1.810089e-06,2.0,5.934718e-08
36,SOUTHERN AFRICA,South Africa,60.6,85,1.40264e-06,4.0,6.60066e-08
18,EASTERN EUROPE,Ukraine,41.0,73,1.780488e-06,4.0,9.756098e-08
41,WESTERN ASIA,Syria,22.1,36,1.628959e-06,3.0,1.357466e-07


In [39]:
df_by_region = df_country_final.groupby(by=['region'])\
                        .agg({'article_count':'sum', 'population (millions)':'sum', 'quality_count':'sum'}).reset_index()

In [40]:
df_by_region['total_articles_per_population'] = df_by_region['article_count']/(df_by_region['population (millions)'] * 1000000)

In [41]:
df_by_region['high_quality_per_person'] = df_by_region['quality_count']/(df_by_region['population (millions)'] * 1000000)

####  -  Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita

In [42]:
df_region_by_coverage = df_by_region.copy()
df_region_by_coverage.nlargest(18, 'total_articles_per_population')

Unnamed: 0,region,article_count,population (millions),quality_count,total_articles_per_population,high_quality_per_person
8,OCEANIA,19,0.4,1.0,4.75e-05,2.5e-06
7,NORTHERN EUROPE,64,16.1,3.0,3.975155e-06,1.863354e-07
16,WESTERN EUROPE,64,26.5,4.0,2.415094e-06,1.509434e-07
9,SOUTH AMERICA,125,52.3,2.0,2.390057e-06,3.824092e-08
0,CARIBBEAN,23,12.7,0.0,1.811024e-06,0.0
14,WESTERN AFRICA,15,8.8,0.0,1.704545e-06,0.0
3,EAST AFRICA,283,177.5,4.0,1.594366e-06,2.253521e-08
6,NORTHERN AFRICA,87,58.7,4.0,1.482112e-06,6.81431e-08
13,SOUTHERN EUROPE,153,106.3,17.0,1.439323e-06,1.599247e-07
12,SOUTHERN AFRICA,85,60.6,4.0,1.40264e-06,6.60066e-08


####  -  Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita

In [43]:
df_region_by_quality = df_by_region.copy()
df_region_by_quality.nlargest(18, 'high_quality_per_person')

Unnamed: 0,region,article_count,population (millions),quality_count,total_articles_per_population,high_quality_per_person
8,OCEANIA,19,0.4,1.0,4.75e-05,2.5e-06
7,NORTHERN EUROPE,64,16.1,3.0,3.975155e-06,1.863354e-07
13,SOUTHERN EUROPE,153,106.3,17.0,1.439323e-06,1.599247e-07
16,WESTERN EUROPE,64,26.5,4.0,2.415094e-06,1.509434e-07
6,NORTHERN AFRICA,87,58.7,4.0,1.482112e-06,6.81431e-08
12,SOUTHERN AFRICA,85,60.6,4.0,1.40264e-06,6.60066e-08
15,WESTERN ASIA,191,153.4,9.0,1.245111e-06,5.867014e-08
9,SOUTH AMERICA,125,52.3,2.0,2.390057e-06,3.824092e-08
3,EAST AFRICA,283,177.5,4.0,1.594366e-06,2.253521e-08
5,EASTERN EUROPE,74,185.3,4.0,3.993524e-07,2.158662e-08
