# A2: Bias in Wikipedia data
#### Jacob Warwick, October 2018
This work was completed as part of [Human Centered Data Science (Fall 2018)](https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)). See README.md for an overview of the project, the data, and the license.


## Environment
This analysis was performed in Python in October 2018.

In [40]:
from sys import version
import pandas as pd
import numpy as np
import requests
import json
from math import ceil
from typing import List
from time import sleep
from datetime import datetime

print(version)

3.6.2 |Anaconda custom (64-bit)| (default, Sep 21 2017, 18:29:43) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


## Data parsing and collection

### Load page and population data
In this step, I load the Wikipedia page data and the world population data into dataframes.

In [11]:
page_data = pd.read_csv("data/page_data.csv")
print(page_data.shape)
page_data.head()

(47197, 3)


Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [124]:
wpds = pd.read_csv("data/WPDS_2018_data.csv")
wpds.columns = ['geography', 'population_mil']
wpds['population_mil'] = wpds.population_mil.map(lambda x: float(x.replace(',','')))
print(wpds.shape)
wpds.head()

(207, 2)


Unnamed: 0,geography,population_mil
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
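As an alternative to stripping commas with a lambda, pandas can parse thousands separators while reading the file. The following is a minimal sketch on an in-memory sample; the header text and the `wpds_sample` name are assumptions made for illustration, not taken from the real WPDS file.

```python
from io import StringIO

import pandas as pd

# Small in-memory stand-in for the population CSV (header names are assumed)
sample_csv = StringIO(
    'Geography,Population mid-2018 (millions)\n'
    'AFRICA,"1,284"\n'
    'Algeria,42.7\n'
    'Egypt,97\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators
wpds_sample = pd.read_csv(sample_csv, thousands=',')
wpds_sample.columns = ['geography', 'population_mil']
print(wpds_sample.dtypes)
```

With this option the population column should arrive as a numeric dtype, so no separate conversion step is needed.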


### Call the ORES estimates API
Thanks and credit to Os' example notebook for this code, which I have slightly modified:
https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb

To be a responsible consumer of the API, I submit no more than 100 revision IDs per request, with a 0.5-second delay between requests.

In [92]:
def get_ores_data(revision_ids):
    # Identify the caller to the API operators
    headers = {'User-Agent' : 'https://github.com/jacobw124', 'From' : 'jacobw4@uw.edu'}
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {
        'project' : 'enwiki',
        'model'   : 'wp10',
        'revids'  : '|'.join(str(x) for x in revision_ids)
    }
    # Pass the headers so the request actually carries the contact details
    api_call = requests.get(endpoint.format(**params), headers=headers)
    try:
        return api_call.json()
    except ValueError:
        # The body was not valid JSON; surface the raw text for debugging
        raise ValueError(api_call.text)

BATCH_SIZE=100
ores_responses = list()
n_batches = ceil(len(page_data)/BATCH_SIZE)
for i in range(n_batches):
    if i%20 == 0 or i == n_batches-1:
        print(f"{datetime.now()}: {i+1} of {n_batches}")
    revision_ids = list(
        page_data.loc[
            i*BATCH_SIZE : min((i+1)*BATCH_SIZE, len(page_data))-1, 
            "rev_id"
        ].values
    )
    response = get_ores_data(revision_ids)['enwiki']['scores']
    ores_responses.extend(
        [
            response[str(rev)]['wp10']['score']['prediction'] 
            if 'error' not in response[str(rev)]['wp10']
            else None
            for rev in revision_ids
        ]
    )
    sleep(.5)

print("DONE QUERYING ORES")

2018-10-28 15:07:35.279369: 1 of 472
2018-10-28 15:07:52.118688: 21 of 472
2018-10-28 15:08:08.606281: 41 of 472
2018-10-28 15:08:25.047467: 61 of 472
2018-10-28 15:08:41.658304: 81 of 472
2018-10-28 15:08:58.978012: 101 of 472
2018-10-28 15:09:15.407633: 121 of 472
2018-10-28 15:09:32.326903: 141 of 472
2018-10-28 15:09:50.948997: 161 of 472
2018-10-28 15:10:07.224439: 181 of 472
2018-10-28 15:10:24.055653: 201 of 472
2018-10-28 15:10:40.420445: 221 of 472
2018-10-28 15:10:56.649138: 241 of 472
2018-10-28 15:11:12.784431: 261 of 472
2018-10-28 15:11:30.232519: 281 of 472
2018-10-28 15:11:46.450993: 301 of 472
2018-10-28 15:12:02.919601: 321 of 472
2018-10-28 15:12:19.346449: 341 of 472
2018-10-28 15:12:35.726007: 361 of 472
2018-10-28 15:12:52.234633: 381 of 472
2018-10-28 15:13:08.783012: 401 of 472
2018-10-28 15:13:28.842419: 421 of 472
2018-10-28 15:13:45.885048: 441 of 472
2018-10-28 15:14:02.695298: 461 of 472
2018-10-28 15:14:11.697051: 472 of 472
DONE QUERYING ORES
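The batching and response handling above can also be separated into two small helpers, which makes them testable without network access. This is a sketch only: `chunked`, `extract_predictions`, and the `fake_scores` payload are hypothetical names and hand-written data, shaped after the way the loop above indexes the response rather than after real ORES output.

```python
from typing import Dict, Iterator, List, Optional


def chunked(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_predictions(scores: Dict[str, dict],
                        rev_ids: List[int]) -> List[Optional[str]]:
    """Return the wp10 prediction per revision, or None where an error is reported."""
    predictions = []
    for rev in rev_ids:
        wp10 = scores[str(rev)]['wp10']
        predictions.append(None if 'error' in wp10 else wp10['score']['prediction'])
    return predictions


# Hand-written stand-in for one batch of the 'scores' payload (not real API output)
fake_scores = {
    '101': {'wp10': {'score': {'prediction': 'Stub'}}},
    '102': {'wp10': {'error': {'type': 'made-up error'}}},
    '103': {'wp10': {'score': {'prediction': 'GA'}}},
}

batches = list(chunked([101, 102, 103, 104, 105], size=2))
preds = extract_predictions(fake_scores, [101, 102, 103])
print(batches)
print(preds)
```

Slicing a plain list by position avoids relying on the dataframe index being a clean 0..n-1 range, which the label-based `.loc` slice in the loop above assumes.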


Next, I save the data I queried from ORES and link it back up to the page_data table.

In [93]:
with open(datetime.now().strftime("data/ores_responses_%Y-%m-%d.json"), "w") as ores_out:
    ores_out.write(json.dumps(ores_responses, indent=4, sort_keys=True))

In [96]:
page_data['ores_response'] = [np.nan if i is None else i for i in ores_responses]
page_data.head()

Unnamed: 0,page,country,rev_id,ores_response
0,Template:ZambiaProvincialMinisters,Zambia,235107991,
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


### Merge the page and country data
I also create "high_quality", a boolean column that is true for "featured article" (FA) and "good article" (GA) ORES predictions, and save the resulting table.

In [125]:
pages_countries = pd.merge(
    page_data,
    wpds,
    left_on = ['country'],
    right_on = ['geography'],
    how='inner'
)[['country', 'page', 'rev_id', 'ores_response', 'population_mil']]
pages_countries.columns = ['country', 'article', 'rev_id', 'quality', 'population_mil']
pages_countries['high_quality'] = pages_countries.quality.map(
    lambda x: x in ('FA', 'GA')
)
pages_countries.to_csv("data/pages_countries.csv", index=False)

print(pages_countries.high_quality.value_counts())
pages_countries.head()

False    44097
True       980
Name: high_quality, dtype: int64


Unnamed: 0,country,article,rev_id,quality,population_mil,high_quality
0,Zambia,Template:ZambiaProvincialMinisters,235107991,,17.7,False
1,Zambia,Gladys Lundwe,757566606,Stub,17.7,False
2,Zambia,Mwamba Luchembe,764848643,Stub,17.7,False
3,Zambia,Thandiwe Banda,768166426,Start,17.7,False
4,Zambia,Sylvester Chisembele,776082926,C,17.7,False
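The same flag can be built with the vectorized `Series.isin`, which also treats missing predictions as not high quality. A minimal sketch on a toy frame (the `toy` name and its rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy quality labels, including a missing prediction
toy = pd.DataFrame({'quality': ['FA', 'GA', 'Stub', 'C', np.nan]})

# True only for featured (FA) and good (GA) articles; NaN becomes False
toy['high_quality'] = toy['quality'].isin(['FA', 'GA'])
print(toy)
```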


### Analyze join omissions
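To assess the bias I just introduced into the analysis by excluding countries and articles that didn't match, here are the value counts of the mismatches on either side. Note that the first check runs against pages_countries, which is already the result of the inner join, so it is empty by construction. The merged table has 45,077 rows against 47,197 in page_data, so 2,120 pages were in fact dropped by the join.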

In [157]:
all_pages_countries = pages_countries.country.unique()
all_wpds_countries = wpds.geography.unique()

print("COUNTRIES IN pages_countries MISSING FROM WPDS:")
print(
    pages_countries[
        pages_countries.country.map(lambda x: x not in all_wpds_countries)
    ].country.value_counts()
)

print("GEOGRAPHIES IN wpds MISSING FROM pages_countries:")
print(
    wpds[
        wpds.geography.map(lambda x: x not in all_pages_countries)
    ].geography.value_counts()
)

COUNTRIES IN pages_countries MISSING FROM WPDS:
Series([], Name: country, dtype: int64)
GEOGRAPHIES IN wpds MISSING FROM pages_countries:
Czechia                            1
French Polynesia                   1
ASIA                               1
AFRICA                             1
Saint Lucia                        1
Timor-Leste                        1
NORTHERN AMERICA                   1
St. Vincent and the Grenadines     1
OCEANIA                            1
Puerto Rico                        1
Georgia                            1
Guam                               1
EUROPE                             1
Cote d'Ivoire                      1
LATIN AMERICA AND THE CARIBBEAN    1
New Caledonia                      1
Congo, Dem. Rep.                   1
El Salvador                        1
eSwatini                           1
Western Sahara                     1
Honduras                           1
St. Kitts-Nevis                    1
Curacao                            1
Palau      

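Several of these omissions are probably due to differences in naming conventions between the two sources. The all-caps entries (AFRICA, ASIA, EUROPE, and so on) are regional aggregates in the WPDS file rather than countries, so their absence from the page data is expected. Notable country omissions include Honduras, eSwatini (formerly Swaziland), El Salvador, the Democratic Republic of the Congo, Puerto Rico, Czechia, and Georgia.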

I am not going to attempt to fix these join problems but wish to note the potential source of bias.
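One way to audit both sides of the join in a single pass is an outer merge with `indicator=True`, run against the unmerged page table. The sketch below uses toy frames; the `pages` and `pops` names and the eSwatini population value are placeholders for illustration, not data from this analysis.

```python
import pandas as pd

# Toy stand-ins for page_data and wpds
pages = pd.DataFrame({
    'country': ['Zambia', 'Algeria', 'Swaziland'],
    'page': ['A', 'B', 'C'],
})
pops = pd.DataFrame({
    'geography': ['Zambia', 'Algeria', 'eSwatini', 'AFRICA'],
    'population_mil': [17.7, 42.7, 1.4, 1284.0],  # 1.4 is a placeholder value
})

# The outer join keeps unmatched rows; _merge records which side each came from
audit = pages.merge(
    pops, left_on='country', right_on='geography',
    how='outer', indicator=True,
)

# Page countries with no population row, with the number of pages lost
pages_only = audit.loc[audit['_merge'] == 'left_only', 'country'].value_counts()
# Population rows that no page refers to
wpds_only = audit.loc[audit['_merge'] == 'right_only', 'geography'].tolist()

print(pages_only)
print(wpds_only)
```

Because the left side here is the unmerged page table, `pages_only` shows how many articles each unmatched country name costs the analysis.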

## Analysis
In this section, I summarize the articles per capita, and the high quality articles per capita, for each country in the dataset, then print out the requested tables showing the highest and lowest ranked countries by both measures.

In [139]:
# Copy so that adding the helper column does not modify pages_countries
percap = pages_countries.copy()
percap['articles'] = 1
percap = percap.groupby(['country', 'population_mil'])\
    .sum().reset_index()[['country', 'population_mil', 'articles', 'high_quality']]

percap['articles_percap'] = percap.articles/(percap.population_mil*1e6)
percap['hq_articles_percap'] = percap.high_quality/(percap.population_mil*1e6)

percap.to_csv("data/pages_per_capita.csv", index=False)
percap.head()

Unnamed: 0,country,population_mil,articles,high_quality,articles_percap,hq_articles_percap
0,Afghanistan,36.5,327,10.0,9e-06,2.739726e-07
1,Albania,2.9,460,4.0,0.000159,1.37931e-06
2,Algeria,42.7,119,2.0,3e-06,4.683841e-08
3,Andorra,0.08,34,0.0,0.000425,0.0
4,Angola,30.4,110,0.0,4e-06,0.0
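The per-capita summary can also be written with named aggregation, which avoids the helper column of ones and only sums the columns that are needed. This is a sketch on toy rows, assuming pandas 0.25 or later (named aggregation is not available in older releases); the `toy` and `summary` names are illustrative.

```python
import pandas as pd

# Toy article-level rows: one row per article
toy = pd.DataFrame({
    'country': ['Tuvalu', 'Tuvalu', 'Zambia', 'Zambia', 'Zambia'],
    'population_mil': [0.01, 0.01, 17.7, 17.7, 17.7],
    'article': ['a', 'b', 'c', 'd', 'e'],
    'high_quality': [True, False, False, False, False],
})

# Count articles and sum the boolean flag per country
summary = (
    toy.groupby(['country', 'population_mil'])
       .agg(articles=('article', 'size'), high_quality=('high_quality', 'sum'))
       .reset_index()
)

# population_mil is in millions, so scale to individuals before dividing
summary['articles_percap'] = summary['articles'] / (summary['population_mil'] * 1e6)
summary['hq_articles_percap'] = summary['high_quality'] / (summary['population_mil'] * 1e6)
print(summary)
```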


### 10 highest-ranked countries for politician articles per-capita

In [142]:
percap.sort_values('articles_percap', ascending=False)[:10]

Unnamed: 0,country,population_mil,articles,high_quality,articles_percap,hq_articles_percap
166,Tuvalu,0.01,55,5.0,0.0055,0.0005
115,Nauru,0.01,53,0.0,0.0053,0.0
135,San Marino,0.03,82,0.0,0.002733,0.0
108,Monaco,0.04,40,0.0,0.001,0.0
93,Liechtenstein,0.04,29,0.0,0.000725,0.0
161,Tonga,0.1,63,1.0,0.00063,1e-05
103,Marshall Islands,0.06,37,0.0,0.000617,0.0
68,Iceland,0.4,206,2.0,0.000515,5e-06
3,Andorra,0.08,34,0.0,0.000425,0.0
52,Federated States of Micronesia,0.1,38,0.0,0.00038,0.0


### 10 lowest-ranked countries for politician articles per-capita

In [143]:
percap.sort_values('articles_percap', ascending=True)[:10]

Unnamed: 0,country,population_mil,articles,high_quality,articles_percap,hq_articles_percap
69,India,1371.3,990,14.0,7.219427e-07,1.020929e-08
70,Indonesia,265.2,215,8.0,8.107089e-07,3.016591e-08
34,China,1393.8,1138,33.0,8.16473e-07,2.367628e-08
173,Uzbekistan,32.9,29,1.0,8.81459e-07,3.039514e-08
51,Ethiopia,107.5,105,1.0,9.767442e-07,9.302326e-09
178,Zambia,17.7,26,0.0,1.468927e-06,0.0
82,"Korea, North",25.6,39,7.0,1.523437e-06,2.734375e-07
159,Thailand,66.2,112,3.0,1.691843e-06,4.531722e-08
13,Bangladesh,166.4,324,3.0,1.947115e-06,1.802885e-08
112,Mozambique,30.5,60,0.0,1.967213e-06,0.0


### 10 highest-ranked countries for high-quality politician articles per-capita

In [144]:
percap.sort_values('hq_articles_percap', ascending=False)[:10]

Unnamed: 0,country,population_mil,articles,high_quality,articles_percap,hq_articles_percap
166,Tuvalu,0.01,55,5.0,0.0055,0.0005
44,Dominica,0.07,12,1.0,0.000171,1.4e-05
61,Grenada,0.1,36,1.0,0.00036,1e-05
161,Tonga,0.1,63,1.0,0.00063,1e-05
174,Vanuatu,0.3,62,3.0,0.000207,1e-05
100,Maldives,0.4,84,2.0,0.00021,5e-06
68,Iceland,0.4,206,2.0,0.000515,5e-06
73,Ireland,4.9,381,24.0,7.8e-05,5e-06
19,Bhutan,0.8,33,3.0,4.1e-05,4e-06
74,Israel,8.5,498,21.0,5.9e-05,2e-06


### 10 lowest-ranked countries for high-quality politician articles per-capita

In [145]:
percap.sort_values('hq_articles_percap', ascending=True)[:10]

Unnamed: 0,country,population_mil,articles,high_quality,articles_percap,hq_articles_percap
143,Slovakia,5.4,119,0.0,2.2e-05,0.0
90,Lesotho,2.3,30,0.0,1.3e-05,0.0
28,Cameroon,25.6,106,0.0,4e-06,0.0
30,Cape Verde,0.6,37,0.0,6.2e-05,0.0
178,Zambia,17.7,26,0.0,1e-06,0.0
36,Comoros,0.8,51,0.0,6.4e-05,0.0
116,Nepal,29.7,363,0.0,1.2e-05,0.0
154,Switzerland,8.5,407,0.0,4.8e-05,0.0
43,Djibouti,1.0,39,0.0,3.9e-05,0.0
145,Solomon Islands,0.7,98,0.0,0.00014,0.0


## Reflection

These results are of limited usefulness. Among the highest-ranked countries for politician articles per capita, the list is dominated by states with tiny populations: with a denominator that small, even a few dozen articles put them at the top. If we are trying to infer political activism from this metric, I think population size has much more influence here than any actual effect.

Similarly, on the low end, we see countries like India, Indonesia, and China, which have some of the world's largest populations. The rest of the list, however, is more interesting; I think these are countries that truly have significantly fewer pages per capita than the rest.

Considering only high-quality articles, the highest-ranked countries are again almost entirely those with the smallest populations, with the exception of Ireland and Israel. Unfortunately, the lowest-ranked list is composed entirely of countries with 0 high-quality articles, so the "rank" is meaningless: every entry is tied at zero and the order shown is arbitrary. This metric has little resolution at the low end, so I don't think that table is particularly useful.
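To illustrate the tie problem, the sketch below counts the countries tied at zero and ranks only those with at least one high-quality article. It is one possible way to make the low end interpretable, not something the analysis above does; the four rows reuse population and high-quality counts from the tables above, and the variable names are illustrative.

```python
import pandas as pd

# Four countries taken from the tables above
toy = pd.DataFrame({
    'country': ['Tuvalu', 'Ireland', 'Zambia', 'Nepal'],
    'population_mil': [0.01, 4.9, 17.7, 29.7],
    'high_quality': [5, 24, 0, 0],
})
toy['hq_articles_percap'] = toy['high_quality'] / (toy['population_mil'] * 1e6)

# Countries tied at exactly zero cannot be meaningfully ordered
n_zero = int((toy['high_quality'] == 0).sum())

# Rank only the countries with at least one high-quality article
ranked_nonzero = toy[toy['high_quality'] > 0].sort_values('hq_articles_percap')

print(f"{n_zero} countries tied at zero")
print(ranked_nonzero[['country', 'hq_articles_percap']])
```

Reporting the number of zero-count countries alongside a ranking of the rest keeps the information in the table without implying an order among the ties.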