# A study of bias in data on Wikipedia

The purpose of this study is to explore bias in data on Wikipedia by analyzing Wikipedia articles on politicians from various countries with respect to their populations. A further metric used for comparison is the quality of articles on politicians across different countries.

### Import libraries

In [1]:
# For getting data from API
import requests
import json

# For data analysis
import pandas as pd
import numpy as np

### Load datasets

**Data Sources:** 
  
We will combine the below two datasets for our analysis of bias in data on Wikipedia:
  
1) **Wikipedia articles** : This dataset contains information on Wikipedia articles for politicians by country. Details include the article name, revision id (last edit id) and country. This dataset can be downloaded from [figshare](https://figshare.com/articles/Untitled_Item/5513449). A downloaded version "page_data.csv" (downloaded on 28th Oct 2018)  is also uploaded to the [git](https://github.com/priyankam22/DATA-512-Human-Centered-Data-Science/tree/master/data-512-a2) repository.  
  
2) **Country Population** : This dataset contains a list of countries and their populations as of mid-2018, in millions. It is sourced from the [Population Reference Bureau](https://www.prb.org/data/). As the dataset is copyrighted, it is not available in this repository. The data might have changed by the time you extract it from the website. For reproducibility, I have included the intermediate merged file used for the final analysis.  

In [8]:
# Load the Wikipedia articles
wiki_articles = pd.read_csv('page_data.csv')
wiki_articles.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [9]:
# Load the country population
country_pop = pd.read_csv('WPDS_2018_data.csv')
country_pop.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In [10]:
print("Number of records in Wikipedia articles dataset: ", wiki_articles.shape[0])
print("Number of records in country population dataset: ", country_pop.shape[0])

Number of records in Wikipedia articles dataset:  47197
Number of records in country population dataset:  207


### Get the quality of Wikipedia articles

To get the quality score of Wikipedia articles, we will use the machine learning system called [ORES](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of a given Wikipedia article by assigning a series of probabilities that the article belongs to one of the six quality categories and returns the most probable category as the prediction. The quality of an article (from best to worst) can be categorized into six categories as below.
  
1. FA    - Featured article
2. GA    - Good article
3. B     - B-class article
4. C     - C-class article
5. Start - Start-class article
6. Stub  - Stub-class article
  
More details about these categories can be found at [Wikipedia: Content Assessment](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades)
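As a minimal sketch of "most probable category", the snippet below picks the label with the largest probability. The probabilities are copied from the sample ORES response for revision 355319463 shown later in this notebook; the helper name `most_probable_class` is an illustrative assumption, not part of ORES.

```python
# Probabilities copied from the sample ORES response for revision 355319463
probabilities = {
    'B': 0.0037293011286007372,
    'C': 0.003856823065973545,
    'FA': 0.0005009114577946061,
    'GA': 0.0009278080381894021,
    'Start': 0.008398482183096077,
    'Stub': 0.9825866741263456,
}

def most_probable_class(probabilities):
    """Return the class label with the highest probability (hypothetical helper)."""
    return max(probabilities, key=probabilities.get)

predicted = most_probable_class(probabilities)
print(predicted)
```

For this revision the largest probability belongs to `Stub`, which agrees with the `prediction` field in that response.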

We will use a Wikimedia RESTful API endpoint for ORES to get the predictions for each of the Wikipedia articles. Documentation for the API can be found [here](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model).

In [11]:
# Set the headers with your github ID and email address. This will be used for identification while making calls to the API
headers = {'User-Agent' : 'https://github.com/priyankam22', 'From' : 'mhatrep@uw.edu'}

# Function to get the predictions for Wikipedia articles using API calls
def get_ores_predictions(rev_ids, headers):
    '''
    Takes a list of revision ids of Wikipedia articles and returns the quality of each article.
    
    Input:
    rev_ids: A list of revision ids of Wikipedia articles
    headers: a dictionary with identifying information to be passed to the API call
    
    Output: a dictionary of dictionaries storing a final predicted label and probabilities for each of the categories 
            for every revision id passed.
    '''
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters for the endpoint 
    params = {'project' : 'enwiki',
              'model'   : 'wp10',    
              'revids'  : '|'.join(str(x) for x in rev_ids) # A single string with all revision ids separated by '|'
              }
    
    # Make the API call, passing the identifying headers
    api_call = requests.get(endpoint.format(**params), headers=headers)
    
    # Get the response in json format
    response = api_call.json()
    
    return response

Let's look at the output of the API call by calling the function on a sample list of revision ids.

In [12]:
get_ores_predictions(list(wiki_articles['rev_id'])[0:5], headers) 

{'enwiki': {'models': {'wp10': {'version': '0.6.1'}},
  'scores': {'235107991': {'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:235107991)',
      'type': 'RevisionNotFound'}}},
   '355319463': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.0037293011286007372,
       'C': 0.003856823065973545,
       'FA': 0.0005009114577946061,
       'GA': 0.0009278080381894021,
       'Start': 0.008398482183096077,
       'Stub': 0.9825866741263456}}}},
   '391862046': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00752908372935955,
       'C': 0.011698750542107464,
       'FA': 0.001217297276719427,
       'GA': 0.0018271099726449593,
       'Start': 0.12703001272170586,
       'Stub': 0.8506977457574628}}}},
   '391862070': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.007528602399161758,
       'C': 0.011761932099515725,
       'FA': 0.0012172194555714589,
       'GA': 0.00182699316650

We need to extract the prediction for each of the revision ids from the response. Note that the prediction is a key in one of the nested dictionaries.
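The extraction can be sketched as a small function that walks the nested dictionaries. This is a sketch under assumptions: `extract_predictions` is a hypothetical helper, and `sample_response` is a hand-built stand-in trimmed to the fields used here, following the shape of the response shown above.

```python
def extract_predictions(response, project='enwiki', model='wp10'):
    """Map each revision id (int) to its predicted label, or None if ORES returned an error."""
    scores = response[project]['scores']
    predictions = {}
    for rev_id, models in scores.items():
        # Revisions that could not be scored carry an 'error' key instead of 'score'
        score = models[model].get('score')
        predictions[int(rev_id)] = score['prediction'] if score else None
    return predictions

# Hand-built stand-in for an API response, trimmed to the fields used here
sample_response = {
    'enwiki': {
        'models': {'wp10': {'version': '0.6.1'}},
        'scores': {
            '235107991': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
            '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
        },
    }
}

print(extract_predictions(sample_response))
```

Keying the result by revision id means the labels can later be matched back to articles by id rather than by position in the response.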

We will call the API for all the Wikipedia articles in batches of 100 so that we do not overload the server with our requests. The batch size of 100 was chosen by trial and error; larger batches can cause the request to fail.
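The batching idea can be sketched independently of the API. The generator name `batches` is an illustrative assumption, and the revision ids here are placeholder integers rather than real ids.

```python
def batches(items, batch_size=100):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so the last batch is simply shorter
        yield items[start:start + batch_size]

# Placeholder ids, used only to show the batch sizes produced
rev_ids = list(range(250))
sizes = [len(batch) for batch in batches(rev_ids, batch_size=100)]
print(sizes)
```

Because slicing never runs past the end of the list, the final batch needs no special handling.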

In [13]:
# Make calls to the API in batches and append the scores portion of the dictionary response to the scores list.
scores = []
batch_size = 100

for begin_ind in range(0, len(wiki_articles), batch_size):
    
    # Set the end index by adding the batch size, capped at the number of articles for the last batch.
    end_ind = min(begin_ind + batch_size, len(wiki_articles))
    
    # Make the API call
    output = get_ores_predictions(list(wiki_articles['rev_id'])[begin_ind:end_ind], headers)
    
    # Append the scores extracted from the dictionary to the scores list
    scores.append(output['enwiki']['scores'])

Let us now extract the predicted labels for each revision_id from the list of scores.

In [14]:
# A list to store all the predicted labels
prediction = []

# Loop through all the scores dictionaries from the scores list.
for batch_scores in scores:
    # Get the predicted label from the value of each key (revision id)
    for val in batch_scores.values():
        # If ORES returned an error there is no 'score' key, so None is stored instead.
        score = val['wp10'].get('score')
        prediction.append(score['prediction'] if score else None)

In [15]:
print("Number of predictions extracted : " , len(prediction))

Number of predictions extracted :  47197


This matches the number of revision ids we passed earlier.

In [16]:
print("Unique predictions extracted : " , set(prediction))

Unique predictions extracted :  {'C', 'Start', 'FA', None, 'B', 'Stub', 'GA'}


In [17]:
# Merging the predictions with the Wikipedia articles
wiki_articles['quality'] = prediction
wiki_articles.head()

Unnamed: 0,page,country,rev_id,quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


### Merging the Wikipedia Quality data with the country population

In [18]:
# Create separate columns with lowercase country name so that we can join without mismatch
wiki_articles['country_lower'] = wiki_articles['country'].apply(lambda x: x.lower())
country_pop['Geography_lower'] = country_pop['Geography'].apply(lambda x: x.lower())

# Merge the two datasets on lowercase country name. The inner join removes any countries that do not have matching rows in both datasets
dataset = wiki_articles.merge(country_pop, how='inner', left_on='country_lower', right_on='Geography_lower')

In [19]:
dataset.head()

Unnamed: 0,page,country,rev_id,quality,country_lower,Geography,Population mid-2018 (millions),Geography_lower
0,Template:ZambiaProvincialMinisters,Zambia,235107991,,zambia,Zambia,17.7,zambia
1,Gladys Lundwe,Zambia,757566606,Stub,zambia,Zambia,17.7,zambia
2,Mwamba Luchembe,Zambia,764848643,Stub,zambia,Zambia,17.7,zambia
3,Thandiwe Banda,Zambia,768166426,Start,zambia,Zambia,17.7,zambia
4,Sylvester Chisembele,Zambia,776082926,C,zambia,Zambia,17.7,zambia
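
An inner join silently discards unmatched rows, so it can be useful to see what was dropped. The sketch below is one possible check, not part of the original analysis: it uses toy stand-ins for the two datasets (the `Atlantis` row is invented for illustration) and pandas' `indicator=True` option on an outer merge.

```python
import pandas as pd

# Toy stand-ins for the two datasets; 'Atlantis' is an invented, unmatched country
articles = pd.DataFrame({
    'page': ['Gladys Lundwe', 'Example Page'],
    'country': ['Zambia', 'Atlantis'],
})
population = pd.DataFrame({
    'Geography': ['AFRICA', 'Zambia'],
    'Population mid-2018 (millions)': [1284.0, 17.7],
})

articles['country_lower'] = articles['country'].str.lower()
population['Geography_lower'] = population['Geography'].str.lower()

# An outer join with indicator=True adds a `_merge` column naming the source of each row
merged = articles.merge(population, how='outer', left_on='country_lower',
                        right_on='Geography_lower', indicator=True)

# Countries with articles but no population row, and population rows with no articles
no_population = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()
no_articles = merged.loc[merged['_merge'] == 'right_only', 'Geography'].tolist()
print(no_population)
print(no_articles)
```

In this toy data the regional aggregate `AFRICA` has no matching articles, which is the kind of row the inner join above would drop.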


### Data cleaning

In [20]:
# Drop the extra country columns.
dataset.drop(['country_lower','Geography','Geography_lower'], axis=1, inplace=True)

# Rename the remaining columns
dataset.columns = ['article_name','country','revision_id','article_quality','population']

# Remove rows where quality is None (not found from ORES)
quality_none_idx = dataset[dataset['article_quality'].isnull()].index

print("%d rows removed as ORES could not return the quality of the article" % len(quality_none_idx))
dataset.drop(quality_none_idx, inplace=True)

104 rows removed as ORES could not return the quality of the article


In [21]:
# Check the datatypes of the columns
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44973 entries, 1 to 45076
Data columns (total 5 columns):
article_name       44973 non-null object
country            44973 non-null object
revision_id        44973 non-null int64
article_quality    44973 non-null object
population         44973 non-null object
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


In [22]:
# Population is stored as text. Let us remove the commas used for separation and convert it to float 
dataset['population'] = dataset['population'].apply(lambda x: float(x.replace(',','')))

In [23]:
dataset.shape

(44973, 5)

In [24]:
# Save the final dataset as a csv file for future reproducibility (index=False avoids writing the row index as an extra column)
dataset.to_csv('wiki_articles_country_pop.csv', index=False)

### Data Analysis

We will now analyze the number of articles on politicians relative to each country's population, and the proportion of those articles that are of good quality. Comparing the highest and lowest ranking countries gives an indication of bias in the data on Wikipedia. Ideally, we would expect to see similar proportions in all countries.

In [25]:
# If you are skipping all the above steps then the prepared dataset can be loaded.
dataset = pd.read_csv('wiki_articles_country_pop.csv')

In [26]:
# Add a new binary column to classify the articles as good quality or not where good quality is defined as either FA or GA.
dataset['is_good_quality'] = dataset['article_quality'].apply(lambda x: 1 if x == 'FA' or x == 'GA' else 0)

To get an idea of the overall political coverage in Wikipedia by country, let us aggregate the data by country. We are interested in the total number of articles per country, the population of each country and the number of good articles per country.
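One way to express this aggregation is pandas' named aggregation, which produces flat, already-named columns. This is a sketch on a toy frame with invented values, assuming a pandas version that supports named aggregation (0.25 or later); it is an alternative formulation, not the notebook's own cell.

```python
import pandas as pd

# Toy frame with invented values, using the same column names as the dataset
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'population': [2.0, 2.0, 2.0, 5.0, 5.0],
    'is_good_quality': [1, 0, 0, 0, 0],
})

# Each keyword names an output column as (input column, aggregation function)
summary = (
    toy.groupby('country', as_index=False)
       .agg(population=('population', 'max'),
            total_articles=('is_good_quality', 'count'),
            quality_articles=('is_good_quality', 'sum'))
)
print(summary)
```

Population is constant within a country, so `max` simply recovers that single value, while `count` and `sum` over the binary flag give the total and good quality article counts.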

In [27]:
output = dataset[['country','population','is_good_quality']].groupby(['country'], as_index=False).agg(['count','max','sum']).reset_index()

In [28]:
output.head()

Unnamed: 0_level_0,country,population,population,population,is_good_quality,is_good_quality,is_good_quality
Unnamed: 0_level_1,Unnamed: 1_level_1,count,max,sum,count,max,sum
0,Afghanistan,326,36.5,11899.0,326,1,10
1,Albania,460,2.9,1334.0,460,1,4
2,Algeria,119,42.7,5081.3,119,1,2
3,Andorra,34,0.08,2.72,34,0,0
4,Angola,110,30.4,3344.0,110,0,0


In [29]:
# Drop the columns we don't need for the analysis.
output.drop(('population','count'), axis=1, inplace=True)
output.drop(('population','sum'), axis=1, inplace=True)
output.drop(('is_good_quality','max'), axis=1, inplace=True)

# Rename the useful columns
output.columns = ['country','population','total_articles','quality_articles']

In [30]:
output.head()

Unnamed: 0,country,population,total_articles,quality_articles
0,Afghanistan,36.5,326,10
1,Albania,2.9,460,4
2,Algeria,42.7,119,2
3,Andorra,0.08,34,0
4,Angola,30.4,110,0


To be able to compare different countries, let us calculate the proportion of articles by unit population and the proportion of good quality articles.
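The two ratios can be worked through by hand as a sanity check. Population is given in millions, so dividing the article count by `population * 10**6` gives articles per person, and scaling by `10**4` gives articles per 10,000 persons. The helper names below are illustrative assumptions; the input values are taken from the result tables in this notebook.

```python
def articles_per_10k(total_articles, population_millions):
    """Articles per 10,000 persons, with population given in millions."""
    persons = population_millions * 10**6
    return total_articles / persons * 10**4

def quality_share_pct(quality_articles, total_articles):
    """Percentage of a country's articles predicted as FA or GA."""
    return quality_articles / total_articles * 100

# Values taken from the result tables in this notebook
print(round(articles_per_10k(55, 0.01), 6))     # Tuvalu
print(round(articles_per_10k(986, 1371.3), 6))  # India
print(round(quality_share_pct(7, 39), 6))       # North Korea
```

Algebraically, `total / (population * 10**6) * 10**4` equals `total / (population * 10**4) * 100`, so this is the same quantity computed for `article_prop`.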

In [31]:
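# Create a new column with the number of articles per 10,000 persons.
# Population is in millions, so total_articles / (population * 10**6) * 10**4 reduces to the expression below.
output['article_prop'] = np.round(output['total_articles']/(output['population']*10**4)*100,6)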

# Create a new column for proportion of good quality articles 
output['quality_prop'] = output['quality_articles']/output['total_articles']*100

### Results

### 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [32]:
# Sort by article_prop and extract top 10 countries
high_art_prop = output.sort_values(by='article_prop',ascending=False)[0:10].drop(['quality_articles','quality_prop'], axis=1).reset_index(drop=True)

# Rename the columns
high_art_prop.columns = ['Country', 'Population till mid-2018 (in millions)', 'Total Articles', 'Articles Per 10,000 Persons']

high_art_prop

Unnamed: 0,Country,Population till mid-2018 (in millions),Total Articles,Articles Per 10,000 Persons
0,Tuvalu,0.01,55,55.0
1,Nauru,0.01,53,53.0
2,San Marino,0.03,82,27.333333
3,Monaco,0.04,40,10.0
4,Liechtenstein,0.04,29,7.25
5,Tonga,0.1,63,6.3
6,Marshall Islands,0.06,37,6.166667
7,Iceland,0.4,206,5.15
8,Andorra,0.08,34,4.25
9,Federated States of Micronesia,0.1,38,3.8


### 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [33]:
# Sort by article_prop and extract lowest 10 countries
low_art_prop = output.sort_values(by='article_prop',ascending=True)[0:10].drop(['quality_articles','quality_prop'], axis=1).reset_index(drop=True)

# Rename the columns
low_art_prop.columns = ['Country', 'Population till mid-2018 (in millions)', 'Total Articles', 'Articles Per 10,000 Persons']

low_art_prop

Unnamed: 0,Country,Population till mid-2018 (in millions),Total Articles,Articles Per 10,000 Persons
0,India,1371.3,986,0.00719
1,Indonesia,265.2,214,0.008069
2,China,1393.8,1135,0.008143
3,Uzbekistan,32.9,29,0.008815
4,Ethiopia,107.5,105,0.009767
5,Zambia,17.7,25,0.014124
6,"Korea, North",25.6,39,0.015234
7,Thailand,66.2,112,0.016918
8,Bangladesh,166.4,323,0.019411
9,Mozambique,30.5,60,0.019672


As seen in the above tables, there is a huge variation in the number of Wikipedia articles on politicians relative to the population of the country. The highest ranking country is Tuvalu, with a population of 0.01 million and 55 Wikipedia articles on politicians (55 articles per 10,000 persons, since 0.01 million is 10,000 people), whereas the lowest ranking country is India, with a population of 1371.3 million and only 986 Wikipedia articles on politicians (about 0.007 articles per 10,000 persons). One important trend is that all the highest ranking countries except Iceland have extremely small populations (100K or fewer), while every low ranking country has a population above 17 million. Most of the low ranking countries are developing countries, which may partly explain the bias seen in the data. 

### 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [34]:
# Sort by quality_prop and extract highest 10 countries
high_qual_art = output.sort_values(by='quality_prop',ascending=False)[0:10].drop(['population','article_prop'], axis=1).reset_index(drop=True)

# Rename the columns
high_qual_art.columns = ['Country', 'Total Articles', 'Good Quality Articles', 'Proportion Of Good Quality Articles (%)']

high_qual_art

Unnamed: 0,Country,Total Articles,Good Quality Articles,Proportion Of Good Quality Articles (%)
0,"Korea, North",39,7,17.948718
1,Saudi Arabia,119,16,13.445378
2,Central African Republic,68,8,11.764706
3,Romania,348,40,11.494253
4,Mauritania,52,5,9.615385
5,Bhutan,33,3,9.090909
6,Tuvalu,55,5,9.090909
7,Dominica,12,1,8.333333
8,United States,1092,82,7.509158
9,Benin,94,7,7.446809


### 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [35]:
# Sort by quality_prop and extract lowest 10 countries
low_qual_art = output.sort_values(by='quality_prop',ascending=True)[0:10].drop(['population','article_prop'], axis=1).reset_index(drop=True)

# Rename the columns
low_qual_art.columns = ['Country', 'Total Articles', 'Good Quality Articles', 'Proportion Of Good Quality Articles (%)']

low_qual_art

Unnamed: 0,Country,Total Articles,Good Quality Articles,Proportion Of Good Quality Articles (%)
0,Sao Tome and Principe,22,0,0.0
1,Mozambique,60,0,0.0
2,Cameroon,105,0,0.0
3,Guyana,20,0,0.0
4,Turkmenistan,33,0,0.0
5,Monaco,40,0,0.0
6,Moldova,426,0,0.0
7,Comoros,51,0,0.0
8,Marshall Islands,37,0,0.0
9,Costa Rica,150,0,0.0


As seen in the above two tables, the proportion of good quality articles is highest in North Korea at 17.95% and is 0% in many countries, such as Sao Tome and Principe, Mozambique and Cameroon. Since all ten of the lowest ranking countries are at 0%, there appear to be many countries with no good quality articles. Let's find all such countries.

In [41]:
no_good_quality_articles = list(output[output['quality_articles'] == 0]['country'])

In [43]:
len(no_good_quality_articles)

37

There are 37 countries with no good quality articles. All the countries are listed below.

In [42]:
no_good_quality_articles

['Andorra',
 'Angola',
 'Antigua and Barbuda',
 'Bahamas',
 'Barbados',
 'Belgium',
 'Belize',
 'Cameroon',
 'Cape Verde',
 'Comoros',
 'Costa Rica',
 'Djibouti',
 'Federated States of Micronesia',
 'Finland',
 'Guyana',
 'Kazakhstan',
 'Kiribati',
 'Lesotho',
 'Liechtenstein',
 'Macedonia',
 'Malta',
 'Marshall Islands',
 'Moldova',
 'Monaco',
 'Mozambique',
 'Nauru',
 'Nepal',
 'San Marino',
 'Sao Tome and Principe',
 'Seychelles',
 'Slovakia',
 'Solomon Islands',
 'Switzerland',
 'Tunisia',
 'Turkmenistan',
 'Uganda',
 'Zambia']