## A2 - Bias in Data

The goal of this assignment is to explore the concept of bias in data through an analysis of English Wikipedia article quality on political figures from a variety of countries. This analysis uses datasets of Wikipedia articles and country populations and employs the ORES machine learning service to estimate article quality.

Before we begin, we must import any packages needed:

In [1]:
import pandas as pd
import numpy as np
import json
import requests

### Step 1: Getting the Article and Population Data

Data was collected from two different sources. The [Politicians by Country](https://figshare.com/articles/Untitled_Item/5513449) dataset comes from the _page_data.csv_ file on Figshare. The population data comes from the [attached spreadsheet](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing), which is drawn from the [World Population Data Sheet (WPDS)](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

The data was read from the _/raw_data/_ directory into two pandas dataframes, shown below. We also previewed the data and inspected its shape.

In [2]:
politicians_df = pd.read_csv('raw_data/page_data.csv')
population_df = pd.read_csv('raw_data/WPDS_2020_data.csv')

print('politicians shape:', politicians_df.shape)
print('population shape:', population_df.shape)

politicians shape: (47197, 3)
population shape: (234, 6)


In [3]:
# preview Politicians by Country
politicians_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
# preview WPDS population data
population_df.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### Step 2: Cleaning the Data

First, we will clean `politicians_df` which contains some _pages_ that are NOT Wikipedia article names. These start with the string _'Template:'_ and will be excluded from future analysis.

In [5]:
politicians_df = politicians_df[~politicians_df['page'].str.startswith('Template:')]

print('politicians shape:', politicians_df.shape)
politicians_df.head()

politicians shape: (46701, 3)


Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


Next, we will clean `population_df` which has some rows with cumulative (region or world) population counts. These are indicated by ALL CAPS in the _Name_ field.

__Note:__ If this cleaning is done by filtering the _Type_ field for `Type = 'Country'` this will exclude the _Channel Islands_. Although the _Channel Islands_ are not a country, they are also not a region and will be treated like a country for this analysis.

In [6]:
country_df = population_df[~population_df['Name'].str.isupper()]

print('country shape:', country_df.shape)
country_df.head() 

country shape: (210, 6)


Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
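
To make the note above concrete, here is a minimal sketch comparing the two filters on a toy table. The rows are made up, and the `Type` value given to the Channel Islands row (`'Other'`) is only an assumption for illustration; the real label in the WPDS file may differ.

```python
import pandas as pd

# Toy stand-in for the WPDS table. The 'Type' label on the Channel Islands
# row is an illustrative assumption, not the value from the real file.
toy_population = pd.DataFrame({
    'Name': ['WORLD', 'NORTHERN EUROPE', 'Channel Islands', 'Denmark'],
    'Type': ['World', 'Sub-Region', 'Other', 'Country'],
})

# Filter used in this notebook: drop rows whose name is written in ALL CAPS
by_name = toy_population[~toy_population['Name'].str.isupper()]

# Alternative filter: keep only rows explicitly typed as countries
by_type = toy_population[toy_population['Type'] == 'Country']

print(by_name['Name'].tolist())
print(by_type['Name'].tolist())
```

Under these assumptions, the name-based filter keeps the Channel Islands row while the type-based filter drops it.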


Finally, we will create a new dataframe that maps each country to its appropriate region.

__Note:__ We made the assumption that each of the 24 regions was at the same level and that the ordering of the WPDS dataset matters (i.e. the nearest region above a given country in the original _WPDS_2020_data.csv_ is its region).

- Create a dataframe of all the regional data we discarded to create _country_df_

In [7]:
region_df = population_df[population_df['Name'].str.isupper()]

print('region shape:', region_df.shape)
region_df.head() 

region shape: (24, 6)


Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000


- Get the indices of each region in the original WPDS dataset

In [8]:
region_index_df = pd.DataFrame(columns = ['region_index','region', 'region_population'])
region_index_df['region'] = region_df['Name']
region_index_df['region_population'] = region_df['Population']
region_index_df['region_index'] = region_df.index

region_indices = region_index_df['region_index'].unique()
print(region_indices)

[  0   1   2  10  27  48  58  64  67  68  77  95 109 110 129 135 145 157
 166 167 179 189 200 216]


- Initialize _country_region_df_ to serve as the mapping between country and region (and their respective populations) by index

In [9]:
country_region_df = pd.DataFrame(columns = ['country_index' ,'country', 'country_population', 'region_index'])
country_region_df['country'] = country_df['Name']
country_region_df['country_population'] = country_df['Population']
country_region_df['country_index'] = country_df.index

country_region_df.head()

Unnamed: 0,country_index,country,country_population,region_index
3,3,Algeria,44357000,
4,4,Egypt,100803000,
5,5,Libya,6891000,
6,6,Morocco,35952000,
7,7,Sudan,43849000,


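- For each element of _region_indices_ (in ascending order), assign that region to every country whose _country_index_ is greater than the region's index. Later regions overwrite earlier ones, so each country ends up with the nearest region above it. Then join to also map populations at the country and regional level.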

__Note:__ If there are no countries in a given region, the region will be dropped here from future analysis.

In [10]:
country_region_df.shape

(210, 4)

In [11]:
for i in region_indices:
    country_region_df.loc[country_region_df['country_index'] > i, 'region_index'] = i
    
country_region_df = pd.merge(country_region_df, region_index_df, how = 'inner', on = 'region_index')
country_region_df = country_region_df.drop(columns = ['country_index', 'region_index'])

country_region_df.head()

Unnamed: 0,country,country_population,region,region_population
0,Algeria,44357000,NORTHERN AFRICA,244344000
1,Egypt,100803000,NORTHERN AFRICA,244344000
2,Libya,6891000,NORTHERN AFRICA,244344000
3,Morocco,35952000,NORTHERN AFRICA,244344000
4,Sudan,43849000,NORTHERN AFRICA,244344000
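
The same "nearest region above" assumption can also be expressed with a forward fill. The sketch below is an alternative formulation on a made-up toy table, not the method used in this notebook.

```python
import pandas as pd

# Toy rows in WPDS order: region rows (ALL CAPS) precede their countries
toy = pd.DataFrame({
    'Name': ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
             'WESTERN AFRICA', 'Benin'],
})

is_region = toy['Name'].str.isupper()

# Keep the name only on region rows, then carry the last seen region forward
toy['region'] = toy['Name'].where(is_region).ffill()

# Drop the region rows themselves to leave a country -> region mapping
toy_mapping = (toy.loc[~is_region, ['Name', 'region']]
                  .rename(columns={'Name': 'country'})
                  .reset_index(drop=True))

print(toy_mapping)
```

As in the loop above, a country directly below two stacked region rows (here _AFRICA_ then _NORTHERN AFRICA_) is assigned to the closer one.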


Write the 4 cleaned dataframes to _/prelim_data/_ for investigation later.

In [12]:
politicians_df.to_csv('prelim_data/politicians_prelim.csv', index = False)
country_df.to_csv('prelim_data/country_prelim.csv', index = False)
region_df.to_csv('prelim_data/region_prelim.csv', index = False)
country_region_df.to_csv('prelim_data/country_region_mapping.csv', index = False)

### Step 3: Getting Article Quality Predictions

As mentioned in the introduction, the article quality predictions are generated using ORES. This is a machine learning system that learned scoring based on articles in Wikipedia that had been peer-reviewed using the [Wikipedia content assessment](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment) guidelines and grouped into a subset of 6 categories.

The 6 categories for article quality (from best-to-worst) are:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

There are two options for calling this API. Option 1 uses the _ORES client_; see some demo code under __Step 3: Option 1:__ [here](https://docs.google.com/document/d/11eswL84T-H6bli8aX_-XndCN6tAZ4bIb9Z2ywiIf2fE/edit#).

We will be using the second recommended method, the ORES REST API endpoint (see the [documentation](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model)). This API expects a project (_enwiki_ for English Wikipedia), a model (we will be using _articlequality_), and a revision ID (aka _{rev_id}_) as parameters.

__Note:__ The ORES REST API can take up to 50 rev_ids at a time (format: _{rev_id_1}|{rev_id_2}|...|{rev_id_50}_).

It is possible that a given _rev_id_ will NOT have a score. If this occurs, the _rev_id_ will be logged in a separate list.

- First, we need a function that groups the _rev_ids_ into batches of 50. See [here](https://www.geeksforgeeks.org/break-list-chunks-size-n-python/) for some documentation on how to create the `batches` function.

In [13]:
def batches(rev_ids, n):
    for i in range(0, len(rev_ids), n):
        yield rev_ids[i:i + n]

In [14]:
rev_ids = politicians_df['rev_id']
batch_list = list(batches(rev_ids, 50))

print('batch count:', len(batch_list))

batch count: 935
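
As a small illustration of how the batching and the pipe-delimited format fit together, the sketch below runs the same `batches` generator on a handful of made-up revision IDs with a batch size of 2 instead of 50.

```python
def batches(rev_ids, n):
    # Yield successive slices of length n (the last one may be shorter)
    for i in range(0, len(rev_ids), n):
        yield rev_ids[i:i + n]

toy_ids = [101, 102, 103, 104, 105]  # made-up revision IDs
toy_batches = list(batches(toy_ids, 2))

# Join each batch into the 'id|id|...' string expected in the revids parameter
query_strings = ['|'.join(str(x) for x in batch) for batch in toy_batches]

print(toy_batches)    # [[101, 102], [103, 104], [105]]
print(query_strings)  # ['101|102', '103|104', '105']
```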


- Next, we need a function to call the API that passes in a batch of _rev_ids_ as a parameter.

In [15]:
headers = {
    'User-Agent': 'https://github.com/nriggio',
    'From': 'nriggi2@uw.edu'
}

url = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_ids}'

In [16]:
def api_call(url, rev_ids, headers):
    call = requests.get(url.format(rev_ids = rev_ids), headers=headers)
    response = call.json()
    
    return response

- Then, we loop through the `batch_list` and use `api_call` to pass in a list of _rev_ids_.
- If a score is found, it is stored in the _page_scores_ list. If there is no score, the _rev_id_ is stored in _page_errors_.

__Note:__ It should take ~5-10 minutes to execute all the API calls and store the results.

In [17]:
# initialize lists
page_scores = []
page_errors = []

# loop through each batch list and call the API
for i in range(len(batch_list)):
    
    batch_ids = '|'.join(str(x) for x in batch_list[i])
    scores = api_call(url, batch_ids, headers)

    # for each rev_id, append the score (or the rev_id itself if there is no score) to the appropriate list
    for rev_id in batch_list[i]:
        try:
            page_scores.append((rev_id, scores['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']))
        except KeyError:
            page_errors.append(rev_id)
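
The parsing logic can be tried offline against a hand-written mock. The dictionary below only mirrors the keys that the loop above indexes into; the shape of the `error` entry and the helper name `split_scores` are assumptions made for illustration, and no request is sent.

```python
# Hand-written mock, shaped like the nested keys the loop above reads.
# The contents of the 'error' entry are an illustrative assumption.
mock_response = {
    'enwiki': {
        'scores': {
            '111': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '222': {'articlequality': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

def split_scores(response, rev_ids):
    """Hypothetical helper: separate scored rev_ids from unscored ones."""
    scored, unscored = [], []
    for rev_id in rev_ids:
        try:
            entry = response['enwiki']['scores'][str(rev_id)]
            scored.append((rev_id, entry['articlequality']['score']['prediction']))
        except KeyError:
            # Raised when the rev_id is absent or has no 'score' key
            unscored.append(rev_id)
    return scored, unscored

scored, unscored = split_scores(mock_response, [111, 222, 333])
print(scored)    # [(111, 'Stub')]
print(unscored)  # [222, 333]
```

Catching `KeyError` handles both a missing revision and a revision whose entry has no `score` key, which is why a single `except` clause is enough here.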

- To check our results, we get a count of the _rev_ids_ in each list and validate the total count.

In [18]:
print('pages with ORES scores:', len(page_scores))
print('pages without ORES scores:', len(page_errors))
print('total page count:', len(page_scores) + len(page_errors))

pages with ORES scores: 46425
pages without ORES scores: 276
total page count: 46701


- Finally, the output of each list is turned into a pandas dataframe and stored in a CSV under _/prelim_data/_.

In [19]:
# page_scores processing
ORES_scores_df = pd.DataFrame(page_scores, columns = ['revision_id', 'article_quality_est.'])
ORES_scores_df.to_csv('prelim_data/ORES_scores.csv', index = False)

print('score types:', ORES_scores_df['article_quality_est.'].unique())
ORES_scores_df.head()

score types: ['Stub' 'Start' 'C' 'B' 'GA' 'FA']


Unnamed: 0,revision_id,article_quality_est.
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [20]:
# page_errors processing
no_score_df = pd.DataFrame(page_errors, columns = ['revision_id'])
no_score_df.to_csv('prelim_data/no_ORES_scores.csv', index = False)

no_score_df.head()

Unnamed: 0,revision_id
0,516633096
1,550682925
2,627547024
3,636911471
4,669987106


### Step 4: Combining the Datasets

To set us up for analysis, we will need to merge the Wikipedia data and population data together by country name and write the result to `final_data/wp_wpds_politicians_by_country.csv`.

__Note:__ There will be entries that can't be merged on country name. These rows are removed and stored in `final_data/wp_wpds_countries-no_match.csv`.

- In order to do this, we must first join `ORES_scores_df` to `politicians_df` on _rev_id_ to get a view of each page and its score.

In [21]:
politician_ORES_df = pd.merge(politicians_df, ORES_scores_df, how = 'inner', left_on='rev_id', right_on='revision_id')

print(len(politician_ORES_df))
politician_ORES_df.head()

46425


Unnamed: 0,page,country,rev_id,revision_id,article_quality_est.
0,Bir I of Kanem,Chad,355319463,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,393276188,Stub
2,Yos Por,Cambodia,393822005,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,395526568,Stub


- Now we can outer join `politician_ORES_df` with our `country_df` on country. If the country is in both dataframes, the `_merge` column produced by `indicator = True` will display _both_; otherwise it will display _left_only_ or _right_only_.

In [22]:
final_df = pd.merge(politician_ORES_df, country_df, indicator = True,
                           how = 'outer', left_on = 'country', right_on = 'Name')

print(len(final_df))
final_df.head()

46452


Unnamed: 0,page,country,rev_id,revision_id,article_quality_est.,FIPS,Name,Type,TimeFrame,Data (M),Population,_merge
0,Bir I of Kanem,Chad,355319463.0,355319463.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,both
1,Abdullah II of Kanem,Chad,498683267.0,498683267.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,both
2,Salmama II of Kanem,Chad,565745353.0,565745353.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,both
3,Kuri I of Kanem,Chad,565745365.0,565745365.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,both
4,Mohammed I of Kanem,Chad,565745375.0,565745375.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,both
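
To show what the `_merge` column looks like for unmatched rows, here is a minimal sketch of an outer merge with `indicator = True` on toy data. The unmatched names (`'Unmatched Land'`, `'No Articles Land'`) and their values are made up.

```python
import pandas as pd

# Toy article and population tables; the unmatched names are made up
articles = pd.DataFrame({
    'page': ['Page A', 'Page B', 'Page C'],
    'country': ['Chad', 'Chad', 'Unmatched Land'],
})
populations = pd.DataFrame({
    'Name': ['Chad', 'No Articles Land'],
    'Population': [16877000, 1000],
})

merged = pd.merge(articles, populations, indicator=True,
                  how='outer', left_on='country', right_on='Name')

# Count rows per merge status: 'both', 'left_only', 'right_only'
status_counts = merged['_merge'].astype(str).value_counts().to_dict()
print(status_counts)
```

Rows marked _left_only_ are articles whose country has no population entry, and rows marked _right_only_ are countries with no articles. Both kinds end up in the no-match file.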


- We have to clean up this join a bit to get the desired data format.

In [23]:
final_df = final_df.drop(columns = ['rev_id', 'FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)'])
final_df.rename(columns={'page':'article_name', 'Population':'population'}, inplace = True)
final_df = final_df[['country', 'article_name', 'revision_id', 'article_quality_est.', 'population', '_merge']]
final_df['revision_id'] = final_df['revision_id'].astype('Int64')
final_df['population'] = final_df['population'].astype('Int64')

print(final_df['_merge'].unique())
print(len(final_df))
final_df.head()

['both', 'left_only', 'right_only']
Categories (3, object): ['both', 'left_only', 'right_only']
46452


Unnamed: 0,country,article_name,revision_id,article_quality_est.,population,_merge
0,Chad,Bir I of Kanem,355319463,Stub,16877000,both
1,Chad,Abdullah II of Kanem,498683267,Stub,16877000,both
2,Chad,Salmama II of Kanem,565745353,Stub,16877000,both
3,Chad,Kuri I of Kanem,565745365,Stub,16877000,both
4,Chad,Mohammed I of Kanem,565745375,Stub,16877000,both


- If there is a country match, output to `final_data/wp_wpds_politicians_by_country.csv`. If not, log results in `final_data/wp_wpds_countries-no_match.csv`.

In [24]:
match = final_df[final_df['_merge'] == 'both']
match = match.drop(columns = ['_merge'])
print('match count:', len(match))
match.to_csv('final_data/wp_wpds_politicians_by_country.csv', index = False)

no_match = final_df[final_df['_merge'] != 'both']
no_match = no_match.drop(columns = ['_merge'])
print('no match count:', len(no_match))
no_match.to_csv('final_data/wp_wpds_countries-no_match.csv', index = False)

match count: 44568
no match count: 1884


### Step 5: Analysis

We want to calculate the proportion (as a __percent__) of articles-per-population and high-quality-articles by country AND region.

__Note:__ "High quality" articles refer to those with an ORES score of 'FA' or 'GA'.

- Get an article count by country

In [25]:
article_cnt_c = match.groupby(['country']).size().reset_index(name = 'article_cnt')

print(sum(article_cnt_c['article_cnt']))
print(article_cnt_c.shape)
article_cnt_c.head()

44568
(183, 2)


Unnamed: 0,country,article_cnt
0,Afghanistan,319
1,Albania,456
2,Algeria,116
3,Andorra,34
4,Angola,106


- Get a high-quality article count by country

In [26]:
hq_article_cnt_c = match.groupby(['country', 'article_quality_est.']).size().reset_index(name = 'hq_article_cnt')
hq_article_cnt_c = hq_article_cnt_c[hq_article_cnt_c['article_quality_est.'].str.contains('FA|GA')].reset_index(drop = True)
hq_article_cnt_c = hq_article_cnt_c.groupby(['country']).sum()

print(sum(hq_article_cnt_c['hq_article_cnt']))
print(hq_article_cnt_c.shape)
hq_article_cnt_c.head()

1028
(146, 1)


Unnamed: 0_level_0,hq_article_cnt
country,Unnamed: 1_level_1
Afghanistan,13
Albania,3
Algeria,2
Argentina,16
Armenia,5


- Merge by country and calculate the proportions of _articles-per-population_ and _high-quality-articles_ as a __percent__

__Note:__ Left join to keep all countries (even those without any articles or any high-quality articles)

In [27]:
# merge in article count
by_country = pd.merge(country_region_df, article_cnt_c, how = 'left', on = 'country')

# merge in high-quality article count (and fill missing values with 0)
by_country = pd.merge(by_country, hq_article_cnt_c, how = 'left', on = 'country')
by_country['hq_article_cnt'] = by_country['hq_article_cnt'].fillna(0)

# clean up extra columns
by_country = by_country.drop(columns = ['region', 'region_population'])

In [28]:
by_country['articles_per_pop_pct'] = by_country['article_cnt'] / by_country['country_population'] * 100
by_country['hq_articles_pct'] = by_country['hq_article_cnt'] / by_country['article_cnt'] * 100

by_country.head()

Unnamed: 0,country,country_population,article_cnt,hq_article_cnt,articles_per_pop_pct,hq_articles_pct
0,Algeria,44357000,116.0,2.0,0.000262,1.724138
1,Egypt,100803000,234.0,10.0,0.000232,4.273504
2,Libya,6891000,110.0,4.0,0.001596,3.636364
3,Morocco,35952000,206.0,1.0,0.000573,0.485437
4,Sudan,43849000,95.0,2.0,0.000217,2.105263
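
The two percentages can be checked on a small example. The sketch below uses made-up countries and an exact-match test (`isin`) in place of the `str.contains` filter above; it is an alternative formulation under those assumptions, not a replacement for the cells in this notebook.

```python
import pandas as pd

# Toy article-level data with made-up countries and populations
toy_match = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality_est.': ['FA', 'GA', 'Stub', 'C', 'Start', 'Stub'],
})
toy_pop = pd.DataFrame({'country': ['A', 'B'],
                        'country_population': [1000, 2000]})

# Flag high-quality articles with an exact match on the two labels
toy_match['is_hq'] = toy_match['article_quality_est.'].isin(['FA', 'GA'])

# One pass gives both the total and the high-quality count per country
summary = (toy_match.groupby('country')
           .agg(article_cnt=('is_hq', 'size'), hq_article_cnt=('is_hq', 'sum'))
           .reset_index()
           .merge(toy_pop, on='country'))

summary['articles_per_pop_pct'] = summary['article_cnt'] / summary['country_population'] * 100
summary['hq_articles_pct'] = summary['hq_article_cnt'] / summary['article_cnt'] * 100

print(summary)
```

Counting with a boolean flag keeps countries that have zero high-quality articles in the result, so no separate fill step is needed in this variant.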


Repeat the above analysis for the regional level.

In [29]:
# article count by region

article_cnt_r = pd.merge(match, country_region_df, how = 'left', on = 'country')
article_cnt_r = article_cnt_r.groupby(['region']).size().reset_index(name = 'article_cnt')

print(sum(article_cnt_r['article_cnt']))
article_cnt_r.head()

44568


Unnamed: 0,region,article_cnt
0,CARIBBEAN,695
1,CENTRAL AMERICA,1543
2,CENTRAL ASIA,245
3,EAST ASIA,2473
4,EASTERN AFRICA,2502


In [30]:
# high-quality article count by region

hq_article_cnt_r = pd.merge(match, country_region_df, how = 'left', on = 'country')
hq_article_cnt_r = hq_article_cnt_r.groupby(['region', 'article_quality_est.']).size().reset_index(name = 'hq_article_cnt')
hq_article_cnt_r = hq_article_cnt_r[hq_article_cnt_r['article_quality_est.'].str.contains('FA|GA')].reset_index(drop = True)
hq_article_cnt_r = hq_article_cnt_r.groupby(['region']).sum()

print(sum(hq_article_cnt_r['hq_article_cnt']))
print(hq_article_cnt_r.shape)
hq_article_cnt_r.head()

1028
(19, 1)


Unnamed: 0_level_0,hq_article_cnt
region,Unnamed: 1_level_1
CARIBBEAN,13
CENTRAL AMERICA,23
CENTRAL ASIA,7
EAST ASIA,76
EASTERN AFRICA,35


In [31]:
# Merge by region, calculate articles-per-population and high-quality-articles (as a percent)

# merge in article count
by_region = pd.merge(country_region_df, article_cnt_r, how = 'inner', on = 'region')

# merge in high-quality article count (and fill missing values with 0)
by_region = pd.merge(by_region, hq_article_cnt_r, how = 'left', on = 'region')
by_region['hq_article_cnt'] = by_region['hq_article_cnt'].fillna(0)

# clean up extra columns
by_region = by_region.drop(columns = ['country', 'country_population'])
by_region = by_region.drop_duplicates().reset_index(drop = True)

# calculate proportions
by_region['articles_per_pop_pct'] = by_region['article_cnt'] / by_region['region_population'] * 100
by_region['hq_articles_pct'] = by_region['hq_article_cnt'] / by_region['article_cnt'] * 100

print(by_region.shape)
by_region.head()

(19, 6)


Unnamed: 0,region,region_population,article_cnt,hq_article_cnt,articles_per_pop_pct,hq_articles_pct
0,NORTHERN AFRICA,244344000,899,19,0.000368,2.113459
1,WESTERN AFRICA,401115000,2139,40,0.000533,1.870033
2,EASTERN AFRICA,444970000,2502,35,0.000562,1.398881
3,MIDDLE AFRICA,179757000,665,16,0.00037,2.406015
4,SOUTHERN AFRICA,67732000,634,9,0.000936,1.419558


### Step 6: Results

Produce the following 6 data tables:

1. __Top 10 countries by coverage:__ 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. __Bottom 10 countries by coverage:__ 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. __Top 10 countries by relative quality:__ 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. __Bottom 10 countries by relative quality:__ 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. __Geographic regions by coverage:__ Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. __Geographic regions by relative quality:__ Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality


__Top 10 countries by coverage__

In [32]:
df_1 = by_country.nlargest(n = 10, columns = 'articles_per_pop_pct').reset_index(drop = True)
df_1[['country', 'country_population', 'article_cnt', 'articles_per_pop_pct']]

Unnamed: 0,country,country_population,article_cnt,articles_per_pop_pct
0,Tuvalu,10000,54.0,0.54
1,Nauru,11000,52.0,0.472727
2,San Marino,34000,81.0,0.238235
3,Monaco,38000,40.0,0.105263
4,Liechtenstein,39000,28.0,0.071795
5,Marshall Islands,57000,37.0,0.064912
6,Tonga,99000,63.0,0.063636
7,Iceland,368000,201.0,0.05462
8,Andorra,82000,34.0,0.041463
9,Federated States of Micronesia,106000,36.0,0.033962


__Bottom 10 countries by coverage__

In [33]:
df_2 = by_country.nsmallest(n = 10, columns = 'articles_per_pop_pct').reset_index(drop = True)
df_2[['country', 'country_population', 'article_cnt', 'articles_per_pop_pct']]

Unnamed: 0,country,country_population,article_cnt,articles_per_pop_pct
0,India,1400100000,968.0,6.9e-05
1,Indonesia,271739000,209.0,7.7e-05
2,China,1402385000,1129.0,8.1e-05
3,Uzbekistan,34174000,28.0,8.2e-05
4,Ethiopia,114916000,101.0,8.8e-05
5,Zambia,18384000,25.0,0.000136
6,"Korea, North",25779000,36.0,0.00014
7,Thailand,66534000,112.0,0.000168
8,Mozambique,31166000,58.0,0.000186
9,Bangladesh,169809000,317.0,0.000187


__Top 10 countries by relative quality__

In [34]:
df_3 = by_country.nlargest(n = 10, columns = 'hq_articles_pct').reset_index(drop = True)
df_3[['country','article_cnt', 'hq_article_cnt', 'hq_articles_pct']]

Unnamed: 0,country,article_cnt,hq_article_cnt,hq_articles_pct
0,"Korea, North",36.0,8.0,22.222222
1,Saudi Arabia,117.0,15.0,12.820513
2,Romania,343.0,42.0,12.244898
3,Central African Republic,66.0,8.0,12.121212
4,Uzbekistan,28.0,3.0,10.714286
5,Mauritania,48.0,5.0,10.416667
6,Guatemala,83.0,7.0,8.433735
7,Dominica,12.0,1.0,8.333333
8,Syria,128.0,10.0,7.8125
9,Benin,91.0,7.0,7.692308


__Bottom 10 countries by relative quality__

__Note:__ There were more than 10 countries with no high-quality articles. The table below shows only 10 of them.

In [35]:
df_4 = by_country.nsmallest(n = 10, columns = 'hq_articles_pct').reset_index(drop = True)
df_4[['country','article_cnt', 'hq_article_cnt', 'hq_articles_pct']]

Unnamed: 0,country,article_cnt,hq_article_cnt,hq_articles_pct
0,Tunisia,138.0,0.0,0.0
1,Cape Verde,36.0,0.0,0.0
2,Comoros,51.0,0.0,0.0
3,Djibouti,37.0,0.0,0.0
4,Eritrea,16.0,0.0,0.0
5,Mozambique,58.0,0.0,0.0
6,Seychelles,21.0,0.0,0.0
7,Zambia,25.0,0.0,0.0
8,Angola,106.0,0.0,0.0
9,Sao Tome and Principe,21.0,0.0,0.0
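
Because many countries are tied at 0%, which 10 appear depends on how ties are broken. The sketch below uses toy data to show the `keep` parameter of `nsmallest`; the default `keep='first'` cuts the ties off at `n` rows, while `keep='all'` returns every tied row.

```python
import pandas as pd

# Toy data with three countries tied at 0%
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'hq_articles_pct': [0.0, 0.0, 0.0, 5.0],
})

# Default behaviour: exactly n rows, ties resolved by order of appearance
first_two = toy.nsmallest(n=2, columns='hq_articles_pct')

# keep='all': every row tied with the n-th smallest value is returned
all_ties = toy.nsmallest(n=2, columns='hq_articles_pct', keep='all')

print(first_two['country'].tolist())
print(all_ties['country'].tolist())
```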


__Geographic regions by coverage__

In [36]:
df_5 = by_region.sort_values(by = 'articles_per_pop_pct', ascending = False).reset_index(drop = True)
df_5[['region', 'region_population', 'article_cnt', 'articles_per_pop_pct']]

Unnamed: 0,region,region_population,article_cnt,articles_per_pop_pct
0,OCEANIA,43155000,3126,0.007244
1,NORTHERN EUROPE,105990000,3763,0.00355
2,SOUTHERN EUROPE,153251000,3710,0.002421
3,WESTERN EUROPE,195479000,4560,0.002333
4,CARIBBEAN,43233000,695,0.001608
5,EASTERN EUROPE,291902000,3732,0.001279
6,SOUTHERN AFRICA,67732000,634,0.000936
7,WESTERN ASIA,280927000,2563,0.000912
8,CENTRAL AMERICA,178611000,1543,0.000864
9,SOUTH AMERICA,429191000,3032,0.000706


__Geographic regions by relative quality__

In [37]:
df_6 = by_region.sort_values(by = 'hq_articles_pct', ascending = False).reset_index(drop = True)
df_6[['region','article_cnt', 'hq_article_cnt', 'hq_articles_pct']]

Unnamed: 0,region,article_cnt,hq_article_cnt,hq_articles_pct
0,NORTHERN AMERICA,1901,104,5.470805
1,SOUTHEAST ASIA,2020,73,3.613861
2,WESTERN ASIA,2563,89,3.472493
3,EASTERN EUROPE,3732,118,3.161844
4,EAST ASIA,2473,76,3.07319
5,CENTRAL ASIA,245,7,2.857143
6,NORTHERN EUROPE,3763,102,2.710603
7,MIDDLE AFRICA,665,16,2.406015
8,NORTHERN AFRICA,899,19,2.113459
9,OCEANIA,3126,63,2.015355
