## Exploring Bias in Wikipedia's Political Articles by Country

Here, I retrieve the names of English Wikipedia's political articles by country and merge them with world population data. Then, I analyze the coverage and quality of Wikipedia's political articles by country.

Our definitions of coverage and quality are:
> <p>__Coverage__: The number of political articles as a percentage of the country's population.</p>
<p>__Quality__: The percentage of a country's political articles that are high quality. A high-quality article is one considered either a _Featured Article (FA)_ or a _Good Article (GA)_ (the highest 2 of Wikipedia's 6 article quality options).</p>

### Setup
First, we will import the packages necessary to run the following code.

In [28]:
import csv
import requests
import numpy as np
import pandas as pd

### Data Acquisition

#### World Population Data
The world population data can be found [here](https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0). 

Here we:
* Import the world population data into a dataframe.
* Rename the columns for simplicity and convert the population from millions to a count of people.

In [34]:
# import population data
countries_df = pd.read_csv('./data/WPDS_2018_data.csv')

# rename columns
countries_df.columns = ['country', 'population']

# convert population from millions
countries_df['population'] = [float(c.replace(",", "")) * 1000000 for c in countries_df['population']]

# display
countries_df.head()

Unnamed: 0,country,population
0,AFRICA,1284000000.0
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
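
The list comprehension above can also be written with pandas' vectorized string methods. The sketch below is an illustrative alternative, not the code used in this notebook. The `demo` frame is a hypothetical stand-in for the raw file, and it assumes the raw column holds strings in millions, such as `'1,284'`.

```python
import pandas as pd

# Hypothetical stand-in for the raw file: population as strings, in millions
demo = pd.DataFrame({
    'country': ['AFRICA', 'Algeria', 'Egypt'],
    'population': ['1,284', '42.7', '97'],
})

# Strip thousands separators, convert to float, and scale from millions
demo['population'] = (
    demo['population'].str.replace(',', '', regex=False).astype(float) * 1_000_000
)

print(demo)
```

Passing `regex=False` treats the comma as a literal character rather than a pattern.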


#### Wikipedia's Political Article Data


##### Article Name Data:
<p>The article data can be found [here](https://figshare.com/articles/Untitled_Item/5513449), along the filepath country/data/page_data.csv.</p>

Here we:
* Import the article data into a dataframe.

In [3]:
# import article data
page_data_df = pd.read_csv('./data/page_data.csv')

# display
page_data_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


##### Article Quality Data:

Next, we use Wikimedia's API to call their _Objective Revision Evaluation Service_ ([ORES](https://www.mediawiki.org/wiki/ORES)), which outputs the predicted quality for each article.

To get started, here we:
* Create parameters called headers and model to pass to the API calls.
* Define the API endpoint URL.
* Convert the article revision IDs from our article name dataset into a list we will loop over in the next step.

In [4]:
# Parameters for the API call. Customize "headers" with your own information
headers = {'User-Agent': 'https://github.com/mag3141592',
           'From': 'starkm5@uw.edu'}
model = 'wp10'

# API Endpoint
url = 'https://ores.wikimedia.org/v3/scores/enwiki/'

# Make list of article revision ids
rev_ids = list(page_data_df['rev_id'])

Here we loop over our 47198 revision IDs. Documentation for ORES can be found [here](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model). Through some experimentation, I found that sending 100 revision IDs per request kept the calls from crashing. Each request returns JSON, from which we can extract the quality prediction. I set up a try/except block because several revision IDs have no available prediction. After parsing each prediction value, we save it to our _predictions_ list. The loop below keeps iterating over 100 revision IDs at a time until all 47198 revision IDs have been processed.

In [5]:
# Empty list to store returned prediction values and revision IDs without predictions, respectively
predictions = []
missing = []

# Initiate index and define stop index
idx = 0
pages = len(rev_ids)

# Define number of revision IDs to send to the API at a time
threshold = 100

while idx < pages:
    
    # Define end index to never be larger than stopping index
    end_idx = min(idx + threshold, pages)
    
    # Subsets the revision IDs
    rev_param = '|'.join(str(x) for x in rev_ids[idx:end_idx])
    params = {'model' : model,
              'revids': rev_param}
    
    # Calls API and stores the response JSON 
    call = requests.get(url, params, headers = headers)
    response = call.json()   
    
    # Try to retrieve a quality prediction from the returned JSON. If that fails,
    # store NaN as the prediction and add the revision ID to our missing list.
    for rev in response['enwiki']['scores']:
        try:
            predict = response['enwiki']['scores'][rev][model]['score']['prediction']
        except KeyError:
            missing.append(rev)
            predict = np.nan
        
        predictions.append(predict)
    
    # Print statement to see iteration progress
    if end_idx%5000 == 0:
        print(end_idx, ' processed')
        
    # Updates the starting index 
    idx += threshold

5000  processed
10000  processed
15000  processed
20000  processed
25000  processed
30000  processed
35000  processed
40000  processed
45000  processed
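
The loop above appends predictions in the order the response lists them, which only lines up with `rev_ids` if the API returns scores in request order. A minimal sketch of an order-independent variant follows: it parses one response into a dictionary keyed by revision ID. The helper names `chunked` and `extract_predictions` and the mock response are illustrative, and the shape of a failed entry (an `error` key in place of `score`) is an assumption based on the parsing code above.

```python
import numpy as np

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def extract_predictions(response, model='wp10', context='enwiki'):
    """Map each revision ID (as a string) to its prediction, or NaN if absent."""
    result = {}
    for rev, entry in response[context]['scores'].items():
        try:
            result[rev] = entry[model]['score']['prediction']
        except KeyError:
            result[rev] = np.nan
    return result

# Mock response mimicking the structure parsed above (not real API output)
mock = {'enwiki': {'scores': {
    '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
    '235107991': {'wp10': {'error': {'message': 'hypothetical failure'}}},
}}}

by_rev = extract_predictions(mock)
batches = list(chunked([1, 2, 3, 4, 5], 2))
print(by_rev, batches)
```

With such a dictionary, predictions can be attached by key, for example with `page_data_df['rev_id'].astype(str).map(by_rev)`, instead of by position.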


The results above show that 113 article revision IDs failed to return a prediction. Those revision IDs are listed below.

In [6]:
print(missing)

['235107991', '550682925', '671484594', '684023803', '684023859', '698572327', '703773782', '712872338', '712872421', '712872473', '712872531', '712873183', '712873308', '712873386', '712878000', '712878267', '712878343', '712878396', '712881543', '712881676', '712881741', '712881882', '712889562', '712889594', '712889683', '712889781', '712889809', '712891291', '712891354', '712891378', '712891476', '713368646', '715273866', '717927381', '719581803', '720054719', '720356159', '720688837', '721509220', '726600165', '730950147', '734957625', '738514517', '738984692', '745915558', '747688056', '749326717', '755180326', '756697478', '757313957', '757961591', '763558111', '765662083', '768013050', '768871687', '769271454', '771213598', '771642775', '774023957', '777163201', '779101752', '779135011', '779954797', '779957437', '782170063', '783382630', '787181453', '787398581', '788310383', '788722110', '789281061', '789285762', '789286413', '790028876', '790147995', '791866288', '792400552'

### Combining Datasets
Now that we have all the data (population, article names, and quality prediction), below we add a quality predictions column onto our article name dataframe (page_data_df).

In [37]:
page_data_df['prediction'] = predictions
page_data_df.head(10)

Unnamed: 0,page,country,rev_id,prediction
0,Template:ZambiaProvincialMinisters,Zambia,235107991,
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub
5,Template:Nigeria-politician-stub,Nigeria,391862819,Stub
6,Template:Colombia-politician-stub,Colombia,391863340,Stub
7,Template:Chile-politician-stub,Chile,391863361,Stub
8,Template:Fiji-politician-stub,Fiji,391863617,Stub
9,Template:Solomons-politician-stub,Solomon Islands,391863809,Stub


Now we merge the population dataframe with the article (name and prediction) dataframe. I use an inner join below to remove countries that exist in one file but not the other. Then we write our new combined dataset to a CSV file.

In [38]:
# Inner joins population dataframe with the article dataframe
csv_df = countries_df.join(page_data_df.set_index('country'), how = 'inner', on = 'country')

# Rename and reorder columns
csv_df.columns = ['country', 'population', 'article_name', 'revision_id', 'article_quality']
csv_df = csv_df[['country', 'article_name', 'revision_id', 'article_quality', 'population']]

# Outputs CSV of merged data
csv_df.to_csv('./data/hcds-a2-bias-data.csv', index = False)

# Displays
csv_df.head(10)

Unnamed: 0,country,article_name,revision_id,article_quality,population
1,Algeria,Template:Algeria-politician-stub,544347736,Stub,42700000.0
1,Algeria,Template:Algeria-diplomat-stub,567620838,Stub,42700000.0
1,Algeria,Template:AlgerianPres,665948270,Stub,42700000.0
1,Algeria,Ali Fawzi Rebaine,686269631,Stub,42700000.0
1,Algeria,Ahmed Attaf,705910185,Stub,42700000.0
1,Algeria,Ahmed Djoghlaf,707427823,Stub,42700000.0
1,Algeria,Hammi Larouissi,708060571,Stub,42700000.0
1,Algeria,Salah Goudjil,708980561,Stub,42700000.0
1,Algeria,Yazid Zerhouni,711888752,Stub,42700000.0
1,Algeria,Saad Dahlab,712810569,Stub,42700000.0
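
An inner join drops unmatched rows without reporting them, so it can help to check which countries fall out. The sketch below is one possible check, run on toy frames rather than the real files (`pop` and `pages` are hypothetical stand-ins). It uses `pd.merge` with `indicator=True`, which labels each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Toy stand-ins for countries_df and page_data_df
pop = pd.DataFrame({'country': ['AFRICA', 'Algeria', 'Egypt'],
                    'population': [1284000000.0, 42700000.0, 97000000.0]})
pages = pd.DataFrame({'page': ['Ahmed Attaf', 'Bir I of Kanem'],
                      'country': ['Algeria', 'Chad']})

# An outer merge keeps every row; `_merge` records where each row came from
check = pop.merge(pages, on='country', how='outer', indicator=True)

# Countries present in only one of the two sources
unmatched = check.loc[check['_merge'] != 'both', ['country', '_merge']]
print(unmatched)
```

Rows marked `left_only` or `right_only` are the ones an inner join would discard, which makes name mismatches between the two files easier to spot.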


### Analysis

#### Coverage Analysis
First, I will focus on calculating our _coverage_ metric. To do so, we need to find the total number of political articles by country. This is done below by grouping by country and counting article names.

In [39]:
# Group and count articles over group
articles_per_country = csv_df.groupby(['country']).count()[['article_name']]

# Format and display results
articles_per_country.columns = ['total_article_count']
articles_per_country.head(10)

Unnamed: 0_level_0,total_article_count
country,Unnamed: 1_level_1
Afghanistan,327
Albania,460
Algeria,119
Andorra,34
Angola,110
Antigua and Barbuda,25
Argentina,496
Armenia,199
Australia,1566
Austria,340


Next, we inner join our total articles per country dataframe with our countries data. To calculate _coverage_, we divide the articles per country by the respective population and convert the result to a percent.

In [40]:
# Join article count and population datasets
apc_df = countries_df.join(articles_per_country, how = 'inner', on = 'country')

# Calculate coverage
apc_df['articles_per_population_%']= apc_df['total_article_count']/apc_df['population'] * 100

# Format and display results
apc_df = apc_df.set_index('country')
apc_df.head(10)

Unnamed: 0_level_0,population,total_article_count,articles_per_population_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Algeria,42700000.0,119,0.000279
Egypt,97000000.0,239,0.000246
Libya,6500000.0,111,0.001708
Morocco,35200000.0,208,0.000591
Sudan,41700000.0,98,0.000235
Tunisia,11600000.0,140,0.001207
Benin,11500000.0,94,0.000817
Burkina Faso,20300000.0,97,0.000478
Cape Verde,600000.0,37,0.006167
Gambia,2200000.0,82,0.003727


Now that we've finished our coverage metric, we will display the 10 highest and the 10 lowest ranking countries in terms of political article coverage.

In [41]:
# 1. Sort merged dataframe by descending coverage
apc_df.sort_values('articles_per_population_%', ascending = False).head(10)

Unnamed: 0_level_0,population,total_article_count,articles_per_population_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tuvalu,10000.0,55,0.55
Nauru,10000.0,53,0.53
San Marino,30000.0,82,0.273333
Monaco,40000.0,40,0.1
Liechtenstein,40000.0,29,0.0725
Tonga,100000.0,63,0.063
Marshall Islands,60000.0,37,0.061667
Iceland,400000.0,206,0.0515
Andorra,80000.0,34,0.0425
Federated States of Micronesia,100000.0,38,0.038


In [42]:
# 2. Sort merged dataframe by ascending coverage
apc_df.sort_values('articles_per_population_%', ascending = True).head(10)

Unnamed: 0_level_0,population,total_article_count,articles_per_population_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
India,1371300000.0,990,7.2e-05
Indonesia,265200000.0,215,8.1e-05
China,1393800000.0,1138,8.2e-05
Uzbekistan,32900000.0,29,8.8e-05
Ethiopia,107500000.0,105,9.8e-05
Zambia,17700000.0,26,0.000147
"Korea, North",25600000.0,39,0.000152
Thailand,66200000.0,112,0.000169
Bangladesh,166400000.0,324,0.000195
Mozambique,30500000.0,60,0.000197
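
The two sorts above can also be expressed with `nlargest` and `nsmallest`. This is a small sketch on a handful of rows copied from the tables above, not a replacement for the full dataframe.

```python
import pandas as pd

# A few rows copied from the tables above
demo = pd.DataFrame(
    {'articles_per_population_%': [0.55, 0.53, 0.000279, 7.2e-05, 8.1e-05]},
    index=['Tuvalu', 'Nauru', 'Algeria', 'India', 'Indonesia'],
)

# Highest and lowest coverage without sorting the whole frame explicitly
top2 = demo.nlargest(2, 'articles_per_population_%')
bottom2 = demo.nsmallest(2, 'articles_per_population_%')

print(top2.index.tolist(), bottom2.index.tolist())
```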


#### Quality Analysis
Next, we focus on calculating our _quality_ metric. To do so, we subset our joint population and article quality dataframe to just the articles with high-quality ratings (FA and GA). Then we will, again, group by country and count the articles for each country.

In [43]:
# Subset our population and article quality dataframe into only articles of FA and GA quality
hq_df = csv_df[(csv_df['article_quality'] == 'FA')|(csv_df['article_quality'] == 'GA')]

# Group by country and count over articles
hq_df = hq_df.groupby(['country']).count()[['article_name']]

# Format and display
hq_df.columns = ['hq_article_count']
hq_df.head(10)

Unnamed: 0_level_0,hq_article_count
country,Unnamed: 1_level_1
Afghanistan,10
Albania,4
Algeria,2
Argentina,15
Armenia,5
Australia,42
Austria,3
Azerbaijan,2
Bahrain,1
Bangladesh,3


Then we join our total articles per country dataframe with our total high-quality articles per country. I use a left join (the default) here in order to preserve the countries that had articles but no high-quality articles.

In [44]:
# Left join total articles per country with total high-quality articles per country
hq_df = articles_per_country.join(hq_df)

# Replace NaN with 0; after the left join, NaN marks countries with 0 high-quality articles
hq_df = hq_df.fillna(0)

# Display
hq_df.head(10)

Unnamed: 0_level_0,total_article_count,hq_article_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,327,10.0
Albania,460,4.0
Algeria,119,2.0
Andorra,34,0.0
Angola,110,0.0
Antigua and Barbuda,25,0.0
Argentina,496,15.0
Armenia,199,5.0
Australia,1566,42.0
Austria,340,3.0


Finally, we will calculate our quality metric. Below we take the total high-quality articles per country and divide them by the total articles per country and convert to a percent.

In [45]:
# Calculate quality metric
hq_df['hq_article_%'] = hq_df['hq_article_count']/hq_df['total_article_count'] * 100
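
The subset, count, join, and fill steps above can also be collapsed into one pass, since the mean of a boolean column is the share of `True` values. The sketch below shows this on a toy frame. `demo` is a hypothetical stand-in for `csv_df`, and its rows are illustrative rather than real data.

```python
import pandas as pd

# Toy stand-in for csv_df; countries and labels are illustrative only
demo = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality': ['FA', 'Stub', 'GA', 'Start', 'Stub', None],
})

# True where the predicted class is FA or GA (missing predictions count as False)
is_hq = demo['article_quality'].isin(['FA', 'GA'])

# The mean of a boolean column is the share of True values in each group
quality = is_hq.groupby(demo['country']).mean() * 100

print(quality)
```

Countries with no high-quality articles come out as 0 directly, so no join or `fillna` is needed in this variant.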

Now that we've finished our quality metric, we will display the 10 highest and the 10 lowest ranking countries in terms of political article quality.

In [46]:
# 3. Sort merged dataframe by descending quality
hq_df.sort_values('hq_article_%', ascending = False).head(10)

Unnamed: 0_level_0,total_article_count,hq_article_count,hq_article_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Korea, North",39,7.0,17.948718
Saudi Arabia,119,16.0,13.445378
Central African Republic,68,8.0,11.764706
Romania,348,40.0,11.494253
Mauritania,52,5.0,9.615385
Bhutan,33,3.0,9.090909
Tuvalu,55,5.0,9.090909
Dominica,12,1.0,8.333333
United States,1098,82.0,7.468124
Benin,94,7.0,7.446809


In [47]:
# 4. Sort merged dataframe by ascending quality
hq_df.sort_values('hq_article_%', ascending = True).head(10)

Unnamed: 0_level_0,total_article_count,hq_article_count,hq_article_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Sao Tome and Principe,22,0.0,0.0
Mozambique,60,0.0,0.0
Cameroon,106,0.0,0.0
Guyana,20,0.0,0.0
Turkmenistan,33,0.0,0.0
Monaco,40,0.0,0.0
Moldova,426,0.0,0.0
Comoros,51,0.0,0.0
Marshall Islands,37,0.0,0.0
Costa Rica,150,0.0,0.0


Above, we see that the 10 lowest ranked countries in terms of quality all have a quality value of 0. Below I will show the total 28 countries with a quality of 0.

In [49]:
zeros = hq_df[(hq_df['hq_article_count'] == 0)].reset_index()
zeros

Unnamed: 0,country,total_article_count,hq_article_count,hq_article_%
0,Andorra,34,0.0,0.0
1,Angola,110,0.0,0.0
2,Antigua and Barbuda,25,0.0,0.0
3,Bahamas,20,0.0,0.0
4,Barbados,14,0.0,0.0
5,Belgium,523,0.0,0.0
6,Belize,16,0.0,0.0
7,Cameroon,106,0.0,0.0
8,Cape Verde,37,0.0,0.0
9,Comoros,51,0.0,0.0


Now we will exclude all countries with quality = 0 just to see the new 10 lowest ranking countries. 

In [50]:
# Exclude countries with quality = 0 and sort by ascending quality
hq_df[(hq_df['hq_article_count'] > 0)].sort_values('hq_article_%', ascending = True).head(10)

Unnamed: 0_level_0,total_article_count,hq_article_count,hq_article_%
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tanzania,408,1.0,0.245098
Peru,354,1.0,0.282486
Lithuania,248,1.0,0.403226
Nigeria,684,3.0,0.438596
Morocco,208,1.0,0.480769
Fiji,199,1.0,0.502513
Bolivia,187,1.0,0.534759
Brazil,556,3.0,0.539568
Luxembourg,180,1.0,0.555556
Sierra Leone,166,1.0,0.60241
