# A2 - Bias in Data Assignment

Exploring the concept of bias through data on Wikipedia articles--specifically, articles on political figures from a variety of countries.

## Step 1: Getting the Article and Population Data

Both source data files, below, are formatted as CSVs in the `data` folder and can be read directly in from the folder. 
1. Wikipedia politicians by country dataset, and
2. World population data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in Wikipedia politicians by country dataset
page_data = pd.read_csv('data/page_data.csv')

# Read in world population data
world_population = pd.read_csv('data/WPDS_2020_data.csv')

In [3]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
world_population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


## Step 2: Cleaning the Data

Even through visual inspection, it can be seen that both datasets contain some rows that will need to be filtered out and/or ignored when before combining the datasets in the next step.

In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should excluded from the analysis.

In [5]:
# Filter out page names that start with the string "Template:"
page_data = page_data[page_data.page.str.contains('^(?!Template:).+')]
page_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by ALL CAPS values in the `Name` field (e.g. AFRICA, OCEANIA). They can retained in a dataset different than those in a country-level. Furthermore, retain the ordering of the subregions as new columns in the country-level dataset.

In [6]:
# Separate cum. regional population rows from world population
regional_population = world_population[world_population.Name.str.isupper()]

# Identify continents by inspection
large_subregions = ['AFRICA', 'LATIN AMERICA AND THE CARIBBEAN', 'ASIA', 'EUROPE']

# Create addition columns to retain continents and more specific subregions
world_population['subregion_0'] = 'WORLD'
world_population['subregion_1'] = np.where(world_population.Name.isin(large_subregions), world_population.Name, np.nan)
world_population['subregion_2'] = np.where(world_population.Name.str.isupper(), world_population.Name, np.nan)
world_population = world_population.fillna(method='ffill')

# Filter out cum. regional population
country_population = world_population[~world_population.Name.isin(regional_population.Name)]

## Step 3: Getting Article Quality Predictions

ORES is a machine learning system that can provide estimates of Wikipedia article quality. It assigns one of these 6 categories to any `rev_id` sent to it.

The article quality estimates are, from best to worst:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

Please review the ORES REST [documentation](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model). In addition to the revision ID, it also expects a model and the name of the context to find model, which is `articlequality` and `enwiki`, respectively.

In [7]:
ores_endpoint = "https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_ids}"

Since article predicts are needed for each article, it's best to do the API call in batches. The `api_call_batch` function will call the API endpoint in batches of 50, such that the `batch = 1` will return the predicted quality scores for the first 50 articles/revision ids, and so on.

In [8]:
import json
import requests

In [9]:
def api_call_batch(endpoint, df, batch):
    
    batch_start = batch*50
    batch_end = (batch + 1)*50 - 1
    
    batch_ids = df.rev_id[batch_start:batch_end]
    rev_id = "|".join(str(x) for x in batch_ids)
    
    call = requests.get(endpoint.format(rev_ids=rev_id))
    response = call.json()
    
    return response

In [10]:
# Calculate total number of batches
batch_total = int(np.floor(len(page_data)/50))
# batch_total = 1
# Create empty data frame for predictions
ores_score = pd.DataFrame()

# Call api until batches covers all pages
for batch in range(batch_total):
    response = api_call_batch(ores_endpoint, page_data, batch)
    
    # Transform json object into dataframe with just predition and rev_id
    scores = pd.json_normalize(response['enwiki']['scores'])
    pred = scores.filter(regex='prediction$', axis=1).transpose()
    pred['rev_id'] = pred.index.str.split('.', 1).str[0].str.strip()
    pred = pred.rename({0: 'prediction'}, axis=1)
    pred = pred.reset_index(drop=True)
    
    # Append json object to article quality predictions df
    ores_score = ores_score.append(pred)
    

It is only normal that some articles do not return a score. Notice how all predictions fall within article quality estimates, list above, and there are fewer predictions than pages. The specific articles with missing ORES scores will be identified when combining datasets in the next step.

In [11]:
ores_score.describe()

Unnamed: 0,prediction,rev_id
count,45495,45495
unique,6,45495
top,Stub,355319463
freq,23715,1


In [12]:
ores_score.prediction.unique()

array(['Stub', 'Start', 'C', 'B', 'GA', 'FA'], dtype=object)

In [13]:
print('There are {} articles is missing ORES scores.'.format(len(page_data) - len(ores_score)))

There are 1206 articles is missing ORES scores.


## Step 4: Combining the Datasets

The analysis needs complete data. Thus, when combining the ORES data for each article, the wikipedia data and popultion data, remove rows that do not have complete data. Since incomplete data can be part of the bias, maintain logs of:
- articles for which an ORES score could not be retrieved
- countries that are not in the population or articles datasets

In [14]:
# Combine ORES, article and population data
page_data = page_data.astype('str')
page_ores_country = page_data.merge(ores_score, how='inner', on='rev_id').merge(country_population, left_on='country', right_on='Name')

# Format schema for analysis
page_ores_country = page_ores_country.filter(items=['country', 'page', 'rev_id', 'prediction', 'Population'])
page_ores_country = page_ores_country.rename(columns={'page': 'article_name', 'rev_id': 'revision_id', 'prediction': 'article_quality_est', 'Population': 'population'})

# Output CSV of those with complete matches
page_ores_country.to_csv('data/wp_wpds_politicians_by_country.csv', index=False, encoding='utf-8')

Combine just the ORES and politician article data to identify all articles for which an ORES score could not be retrieved.

In [15]:
# Combine ORES and article data
page_ores_data = page_data.merge(ores_score, how='outer', on='rev_id')

# Create log articles with missing ORES scores
missing_ores = page_ores_data[page_ores_data.prediction.isna()]

# Save missing ORES scores log as CSV
missing_ores.to_csv('data/articles_missing_ores.csv', index=False, encoding='utf-8')

Similarly, combine just the country population and politician article data and keep any rows that do not have matching data. Since volume of missing information per country influences bias, there is no need to remove duplicates.

In [16]:
# Combine country population and article data
page_population_data = page_data.merge(country_population, how='outer', left_on='country', right_on="Name")

# Create df of non-matching countries
country_no_match = page_population_data[page_population_data.country.isna()| page_population_data.Name.isna()]

# Save non-matching countries as CSV
country_no_match.to_csv('data/wp_wpds_countries-no_match.csv', index=False, encoding='utf-8')

## Step 5: Analysis

The analysis consists of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country and for each geographic region. For this analysis, "high quality" articles includes those with in "FA" (featured article) or "GA" (good article) classes.

In [17]:
high_quality = ['GA', 'FA']

# Summarize population and articles by country
countries = page_ores_country.groupby('country').agg({
    'population':'first', 
    'article_name':'count', 
    'article_quality_est': lambda x: sum(x.isin(high_quality))
})

countries = countries.rename(columns={
    'article_name':'article_count',
    'article_quality_est':'high_quality_count'
})

# Calculate proportions of articles per population
countries['coverage'] = countries.article_count/countries.population

# Calculate percentage of high-quality articles
countries['relative_quality'] = countries.high_quality_count/countries.article_count

countries.describe()

Unnamed: 0,population,article_count,high_quality_count,coverage,relative_quality
count,183.0,183.0,183.0,183.0,183.0
mean,41535210.0,238.693989,5.52459,0.000126888,0.023748
std,151118300.0,279.757203,9.938724,0.0005631424,0.02932
min,10000.0,12.0,0.0,6.806657e-07,0.0
25%,2186500.0,56.5,1.0,6.204482e-06,0.004573
50%,9375000.0,126.0,2.0,1.746271e-05,0.015625
75%,30534500.0,319.5,6.0,4.968669e-05,0.030931
max,1402385000.0,1636.0,80.0,0.0054,0.222222


Recall that cumulative regional population data were retained in a separate dataset and that there is sub-region data within the `country_population` datatset. Using this, create a mapping of country to sub-region and perform proportion of articles-per-population and high-quality articles for each sub-region.

In [18]:
# Create key for country and geographical region
country_geo_map = country_population.melt(id_vars=['FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)', 'Population']).rename(columns = {'Name': 'country', 'value': 'subregion'}).filter(items=['country', 'subregion'])

# Append key to combined dataset
geographic_region = page_ores_country.merge(country_geo_map, on='country')

# Summarize population and articles by subregion
geographic_region = geographic_region.groupby('subregion').agg({
    'population':'sum', 
    'article_name':'count', 
    'article_quality_est': lambda x: sum(x.isin(high_quality))
})

geographic_region = geographic_region.rename(columns={
    'article_name':'article_count',
    'article_quality_est':'high_quality_count'
})

# Calculate proportions of articles per population
geographic_region['coverage'] = geographic_region.article_count/geographic_region.population

# Calculate percentage of high-quality articles
geographic_region['relative_quality'] = geographic_region.high_quality_count/geographic_region.article_count

geographic_region.describe()

Unnamed: 0,population,article_count,high_quality_count,coverage,relative_quality
count,24.0,24.0,24.0,24.0,24.0
mean,644121000000.0,5460.125,126.375,3.096814e-08,0.023527
std,1251351000000.0,9082.291457,211.544386,3.257701e-08,0.009691
min,3399359000.0,242.0,7.0,1.490658e-09,0.012307
25%,56340300000.0,1776.5,32.0,1.179055e-08,0.016243
50%,162364000000.0,2743.0,69.5,2.050642e-08,0.021774
75%,427482500000.0,4330.25,101.0,3.775798e-08,0.02754
max,5152968000000.0,43681.0,1011.0,1.462408e-07,0.055734


Since countries that did not reconciled with both the politician article and popultion datasets were removed, the population totals need to be recalculated. Below are the World and Subregion population statistics summarized.

In [19]:
regional_population.describe()

Unnamed: 0,TimeFrame,Data (M),Population
count,24.0,24.0,24.0
mean,2019.0,954.466792,954466800.0
std,0.0,1754.222964,1754223000.0
min,2019.0,43.155,43155000.0
25%,2019.0,172.271,172271000.0
50%,2019.0,330.0475,330047500.0
75%,2019.0,683.03925,683039200.0
max,2019.0,7772.85,7772850000.0


## Step 6: Results

The results from this analysis will be published in the form of data tables. There are six total, each looking at either,
- coverage: number of politician articles as a proportion of country/geographics population
- relative quality: relative proportion of politician articles that are of GA and FA-quality

In the following tables, the proportion of interest will be highlighted next to the country.

In [20]:
# Order columns for proportion type
cols = countries.columns.tolist()
coverage_cols = cols[-2:] + cols[:-2]
relative_quality_cols = cols[-1:] + cols[-2:-1] + cols[:-2]

### Top 10 countries by coverage
Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [21]:
countries.sort_values(by='coverage', ascending=False).head(10)[coverage_cols]

Unnamed: 0_level_0,coverage,relative_quality,population,article_count,high_quality_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Tuvalu,0.0054,0.074074,10000,54,4
Nauru,0.004727,0.0,11000,52,0
San Marino,0.002353,0.0,34000,80,0
Monaco,0.001,0.0,38000,38,0
Liechtenstein,0.000692,0.0,39000,27,0
Tonga,0.000636,0.0,99000,63,0
Marshall Islands,0.000632,0.0,57000,36,0
Iceland,0.00053,0.005128,368000,195,1
Andorra,0.000402,0.0,82000,33,0
Federated States of Micronesia,0.00034,0.0,106000,36,0


### Bottom 10 countries by coverage

In [22]:
countries.sort_values(by='coverage', ascending=True).head(10)[coverage_cols]

Unnamed: 0_level_0,coverage,relative_quality,population,article_count,high_quality_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
India,6.806657e-07,0.013641,1400100000,953,13
Indonesia,7.507204e-07,0.034314,271739000,204,7
China,7.815258e-07,0.033759,1402385000,1096,37
Uzbekistan,8.193363e-07,0.107143,34174000,28,3
Ethiopia,8.614988e-07,0.020202,114916000,99,2
Zambia,1.359878e-06,0.0,18384000,25,0
"Korea, North",1.396486e-06,0.222222,25779000,36,8
Thailand,1.65329e-06,0.027273,66534000,110,3
Mozambique,1.79683e-06,0.0,31166000,56,0
Bangladesh,1.813803e-06,0.00974,169809000,308,3


### Top 10 countries by relative quality
10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [23]:
countries.sort_values(by='relative_quality', ascending=False).head(10)[relative_quality_cols]

Unnamed: 0_level_0,relative_quality,coverage,population,article_count,high_quality_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Korea, North",0.222222,1.396486e-06,25779000,36,8
Saudi Arabia,0.133929,3.196256e-06,35041000,112,15
Romania,0.125,1.746271e-05,19241000,336,42
Central African Republic,0.123077,1.345756e-05,4830000,65,8
Uzbekistan,0.107143,8.193363e-07,34174000,28,3
Mauritania,0.104167,1.032258e-05,4650000,48,5
Guatemala,0.085366,4.538913e-06,18066000,82,7
Dominica,0.083333,0.0001666667,72000,12,1
Syria,0.08,6.443963e-06,19398000,125,10
Benin,0.077778,7.371611e-06,12209000,90,7


### Bottom 10* countries by relative quality
There are \*37 countires that have tied for the lowest number of politician articles as a proportion of country population. This totals tables is also sorted by descending article count.

In [33]:

countries.sort_values(by=['relative_quality','article_count'], ascending=(True,False)).head(countries.relative_quality.value_counts()[0])[relative_quality_cols]

Unnamed: 0_level_0,relative_quality,coverage,population,article_count,high_quality_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Finland,0.0,0.000102,5529000,562,0
Moldova,0.0,0.000116,3535000,410,0
Namibia,0.0,6.2e-05,2541000,157,0
Estonia,0.0,0.00011,1331000,147,0
Costa Rica,0.0,2.8e-05,5111000,144,0
Tunisia,0.0,1.1e-05,11896000,133,0
Angola,0.0,3e-06,32522000,105,0
Solomon Islands,0.0,0.000134,715000,96,0
San Marino,0.0,0.002353,34000,80,0
Kazakhstan,0.0,4e-06,18732000,78,0


### Geographic regions by coverage
Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [25]:
geographic_region.sort_values(by='coverage', ascending=False)[coverage_cols]

Unnamed: 0_level_0,coverage,relative_quality,population,article_count,high_quality_count
subregion,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
CARIBBEAN,1.462408e-07,0.017647,4649866000,680,12
MIDDLE AFRICA,7.261663e-08,0.024279,9075056000,659,16
CENTRAL ASIA,7.118989e-08,0.028926,3399359000,242,7
OCEANIA,6.865854e-08,0.01964,44495556000,3055,60
NORTHERN EUROPE,5.18419e-08,0.027078,71235819000,3693,100
WESTERN ASIA,4.168287e-08,0.035018,60288553000,2513,88
SOUTHERN EUROPE,3.644968e-08,0.02028,100110620000,3649,74
EUROPE,3.159319e-08,0.021981,586075655000,18516,407
EASTERN AFRICA,2.878411e-08,0.014274,85185877000,2452,35
SOUTHERN AFRICA,2.737475e-08,0.01454,22612075000,619,9


### Geographic regions by relative quality
Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [26]:
geographic_region.sort_values(by='relative_quality', ascending=False)[relative_quality_cols]

Unnamed: 0_level_0,relative_quality,coverage,population,article_count,high_quality_count
subregion,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NORTHERN AMERICA,0.055734,4.981073e-09,374618060000,1866,104
WESTERN ASIA,0.035018,4.168287e-08,60288553000,2513,88
SOUTHEAST ASIA,0.034465,1.20673e-08,163499728000,1973,68
EASTERN EUROPE,0.032329,1.993309e-08,183112638000,3650,118
EAST ASIA,0.030265,1.490658e-09,1618077163000,2412,73
CENTRAL ASIA,0.028926,7.118989e-08,3399359000,242,7
NORTHERN EUROPE,0.027078,5.18419e-08,71235819000,3693,100
ASIA,0.026873,3.216157e-09,3552064856000,11424,307
AFRICA,0.025991,1.235398e-08,694513022000,8580,223
MIDDLE AFRICA,0.024279,7.261663e-08,9075056000,659,16
