# A2: Investigating Bias in Wikipedia Article Counts by Country

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles, specifically articles about political figures from a variety of countries.
The analysis consists of a series of tables that show:
-  The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
-  The countries with the highest and lowest proportion of high quality articles about politicians.

# Import Necessary Libraries

In [6]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import requests

%matplotlib inline

# Data collection

The analysis requires data from three different sources, which are then combined on their common attributes.
The primary data sources are:
   -  Wikipedia.
   -  Population dataset.
   -  Article quality using ORES dataset.


## Wikipedia

This section uses a Wikipedia dataset of articles about politicians, which lists each page's country and identifies the page by a unique revision ID (rev_id).
The data can be downloaded from [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The Figshare link provides a
country.zip file that contains page_data.csv, which has the data required for our analysis.
We'll save this file in our data folder for easy access.

In [7]:
# Load page_data from data folder
page_data = pd.read_csv('data/page_data.csv')
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


A quick glance at the dataset reveals three attributes: page (the page title), country, and rev_id, a unique revision identifier for each page.

## Population dataset.

Population data for each country can be downloaded from the [Population Reference Bureau website](https://www.prb.org/data/).

In [8]:
# Load population from data folder
population_data = pd.read_csv('data/wiki_population_2018_data.csv')
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


A quick glance at the dataset reveals two attributes: Geography and Population mid-2018 (millions). Each row gives the mid-2018 population, in millions, of a country or of a region such as AFRICA.

## Article quality using ORES dataset.

The next step uses the ORES API to get a quality prediction for each article/page. ORES assigns each article one of the following classes:
 - FA - Featured article
 - GA - Good article
 - B - B-class article
 - C - C-class article
 - Start - Start-class article
 - Stub - Stub-class article
 
I am using this notebook as a reference for making the API calls and getting the predictions: [ORES](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb)

In [9]:
HEADERS = {'User-Agent': 'https://github.com/pshivraj', 'From': 'pshivraj@uw.edu'}

def get_ores_data(revision_ids):
    """
    Utility function to get page quality predictions from the ORES API.
    param :
        revision_ids: list of revision ids to score in a single request.
    outputs :
        preds : list of dicts with keys 'rev_id' and 'prediction';
                revisions that ORES could not score are skipped.
    """
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
 
    params = {
        'project': 'enwiki',
        'model': 'wp10',
        'revids': '|'.join(str(x) for x in revision_ids)
    }
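    # Pass HEADERS by keyword; the second positional argument of requests.get is `params`, not `headers`
    json_response = requests.get(endpoint.format(**params), headers=HEADERS).json()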
    preds = []
    
    # Unpack predictions according to the response structure, which can be found in the reference notebook
    for key, value in json_response["enwiki"]["scores"].items():
        result_dict = value["wp10"]
        if "error" not in result_dict:
            prediction = {
                'rev_id': int(key),
                'prediction': result_dict["score"]["prediction"]
            }
            preds.append(prediction)
    
    return preds
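
To make the unpacking step concrete, here is a minimal sketch that applies the same logic to a hand-written response. The nesting mirrors the keys that `get_ores_data` reads. The helper name `extract_predictions` and the error payload are illustrative assumptions, not actual ORES output.

```python
# Hand-written sample that mirrors the keys read by get_ores_data.
# The error payload is a placeholder, not a real ORES message.
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "235107991": {"wp10": {"error": {"message": "placeholder error"}}},
        }
    }
}


def extract_predictions(response, project="enwiki", model="wp10"):
    """Return [{'rev_id': int, 'prediction': str}] for revisions that were scored."""
    preds = []
    for rev_id, models in response[project]["scores"].items():
        result = models[model]
        if "error" in result:
            continue  # no prediction is available for this revision
        preds.append({"rev_id": int(rev_id),
                      "prediction": result["score"]["prediction"]})
    return preds


predictions = extract_predictions(sample_response)
print(predictions)
```

Revisions whose entry contains an "error" key are skipped, which is why the number of predictions can be smaller than the number of revision IDs sent.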

Since the API limits the number of requests, we split the revision IDs into 500 batches (roughly 95 IDs each) and send one request per batch to avoid being rate limited.

In [10]:
revision_score_arr = []
# Split the revision ids into 500 roughly equal batches, one API request per batch
for rev_ids in np.array_split(page_data['rev_id'].to_numpy(), 500):
    # Collect the predictions returned for this batch
    rev_score = get_ores_data(rev_ids.tolist())
    revision_score_arr.extend(rev_score)
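
As a quick check on the batch sizes, the sketch below splits a range of the same length as page_data (47197 rows, the count this notebook reports) into 500 parts. `np.array_split` does not require the length to divide evenly, so the batches differ in size by at most one.

```python
import numpy as np

n_articles = 47197  # row count reported for page_data in this notebook
batches = np.array_split(np.arange(n_articles), 500)
sizes = [len(batch) for batch in batches]

# Number of batches, smallest batch, largest batch
print(len(batches), min(sizes), max(sizes))
```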

Merging these predictions with the article data may drop rows, because not every article has a prediction. We can check this by comparing the number of rows in the page data with the number of predictions returned.

In [11]:
print('Total articles in wiki data {} vs total articles with prediction {}'.format(len(page_data), len(revision_score_arr)))

Total articles in wiki data 47197 vs total articles with prediction 47092
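
The counts differ by 105 articles. One way to see which pages are missing is to filter on the revision IDs that came back. The sketch below uses toy stand-ins (`toy_pages`, `toy_scores`) built from rows shown earlier. With the real data, the same filter would be applied to `page_data` and `revision_score_arr`.

```python
import pandas as pd

# Toy stand-ins for page_data and revision_score_arr
toy_pages = pd.DataFrame({
    "page": ["Template:ZambiaProvincialMinisters", "Bir I of Kanem",
             "Template:Zimbabwe-politician-stub"],
    "country": ["Zambia", "Chad", "Zimbabwe"],
    "rev_id": [235107991, 355319463, 391862046],
})
toy_scores = [
    {"rev_id": 355319463, "prediction": "Stub"},
    {"rev_id": 391862046, "prediction": "Stub"},
]

# Revision ids that received a prediction
scored_ids = {row["rev_id"] for row in toy_scores}
# Pages whose revision id is absent from the predictions
unscored = toy_pages[~toy_pages["rev_id"].isin(scored_ids)]
print(unscored["page"].tolist())
```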


In [12]:
revision_score_df = pd.DataFrame(revision_score_arr)
article_df = page_data.merge(revision_score_df, on='rev_id')
article_df.head(5)

Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
2,Template:Uganda-politician-stub,Uganda,391862070,Stub
3,Template:Namibia-politician-stub,Namibia,391862409,Stub
4,Template:Nigeria-politician-stub,Nigeria,391862819,Stub


Let's check that the predictions contain no anomalies. As shown below, every prediction is one of the classes we expected, and there are no missing values.

In [13]:
article_df['prediction'].value_counts()

Stub     24633
Start    14819
C         5855
B          762
GA         732
FA         291
Name: prediction, dtype: int64
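
The same check can be written as explicit assertions. This sketch copies the counts from the output above into a plain dictionary (the variable names are illustrative), confirms that no unexpected class appears, and confirms that the counts add up to the 47092 predictions returned.

```python
# Class counts copied from the value_counts() output above
class_counts = {"Stub": 24633, "Start": 14819, "C": 5855,
                "B": 762, "GA": 732, "FA": 291}
expected_classes = {"FA", "GA", "B", "C", "Start", "Stub"}

unexpected = set(class_counts) - expected_classes
total = sum(class_counts.values())
# Share of all scored articles that ORES rates FA or GA
high_quality_share = (class_counts["FA"] + class_counts["GA"]) / total

print(unexpected, total, round(high_quality_share, 4))
```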

In [14]:
# save the page_data with score in our data folder.
article_df.to_csv('data/page_data_with_scores.csv', index=False)

In [15]:
# merging population and score data to get a merged data frame
population_data.columns = ['country', 'population']
merged_data = article_df.merge(population_data, on='country')
merged_data.head(5)

Unnamed: 0,page,country,rev_id,prediction,population
0,Bir I of Kanem,Chad,355319463,Stub,15.4
1,Abdullah II of Kanem,Chad,498683267,Stub,15.4
2,Salmama II of Kanem,Chad,565745353,Stub,15.4
3,Kuri I of Kanem,Chad,565745365,Stub,15.4
4,Mohammed I of Kanem,Chad,565745375,Stub,15.4
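
The default inner merge silently drops any article whose country name has no exact match in the population table, and any population row (such as the AFRICA region) with no articles. A minimal sketch for listing both kinds of mismatch uses an outer merge with `indicator=True`. The toy frames below are illustrative, and the toy population table deliberately omits Zambia.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Template:ZambiaProvincialMinisters"],
    "country": ["Chad", "Zambia"],
})
toy_population = pd.DataFrame({
    "country": ["AFRICA", "Chad"],
    "population": [1284.0, 15.4],
})

# indicator=True adds a _merge column: left_only, right_only, or both
check = toy_articles.merge(toy_population, on="country",
                           how="outer", indicator=True)
articles_without_population = (
    check.loc[check["_merge"] == "left_only", "country"].unique().tolist())
population_without_articles = (
    check.loc[check["_merge"] == "right_only", "country"].unique().tolist())

print(articles_without_population, population_without_articles)
```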


In [16]:
#save to local directory data folder.
merged_data.to_csv('data/article_quality_with_population.csv', index=False)

# Analysis of articles

We will try to answer the following questions using the merged data generated above:
  - 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
  - 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
  - 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
  - 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Let's first count the articles for each country, which we need to answer the per capita questions.

In [17]:
article_count_by_country = pd.DataFrame(merged_data.groupby('country').count()['rev_id']).reset_index()
article_count_by_country.columns = ['country','article_count']

We'll now merge the article count per country with the parent data frame, merged_data, to add the new feature article_count.

In [18]:
final_df = pd.merge(merged_data, article_count_by_country, on='country')

We also need to answer questions about high-quality articles, defined as articles for which the ORES API predicts either 'FA' or 'GA'. We'll subset the existing data frame to keep only the articles with these predictions.
We then count the high-quality articles per country, as we did for all articles in the cell above.

In [19]:
high_quality_articles = merged_data.loc[merged_data['prediction'].isin(['FA','GA'])]
high_quality_article_count_by_country = pd.DataFrame(high_quality_articles.groupby('country').count()['rev_id']).reset_index()
high_quality_article_count_by_country.columns = ['country','high_quality_article_count']

Let's merge the newly generated high-quality article counts with our parent data frame, final_df.

In [20]:
final_df = pd.merge(final_df, high_quality_article_count_by_country, on='country')
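
Because `pd.merge` defaults to an inner join, a country with no FA or GA articles has no row in high_quality_article_count_by_country and drops out of final_df at this step. If such countries should be kept with a count of zero, one option is to compute both counts in a single groupby. This is a sketch on hypothetical countries "A" and "B", not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical data: country B has no FA or GA articles
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "prediction": ["GA", "Stub", "Start", "Stub", "C"],
})

counts = (
    toy.assign(is_high_quality=toy["prediction"].isin(["FA", "GA"]))
       .groupby("country")["is_high_quality"]
       # size counts all articles; sum counts the True flags
       .agg(article_count="size", high_quality_article_count="sum")
       .reset_index()
)
print(counts)
```

Here country B stays in the result with a high-quality count of 0, so it would appear at the bottom of a ranking by proportion instead of disappearing.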

Since our population field is stored as a string in millions, we need to strip the thousands separators, convert it to a float, and rescale it from millions to units.

In [21]:
# Convert population to float (removing any thousands separators) and rescale from millions to units
final_df["population"] = final_df["population"].astype(str).str.replace(",", "", regex=False).astype(float) * 1000000
# Articles per person for each country
final_df['articles_per_population'] = final_df['article_count'] / final_df['population']
# Despite its name, this column is the share of a country's articles rated FA or GA, not a per capita value
final_df["high_quality_articles_per_population"] = final_df.high_quality_article_count / final_df.article_count

In [22]:
final_df.head(10)

Unnamed: 0,page,country,rev_id,prediction,population,article_count,high_quality_article_count,articles_per_population,high_quality_articles_per_population
0,Bir I of Kanem,Chad,355319463,Stub,15400000.0,100,2,6e-06,0.02
1,Abdullah II of Kanem,Chad,498683267,Stub,15400000.0,100,2,6e-06,0.02
2,Salmama II of Kanem,Chad,565745353,Stub,15400000.0,100,2,6e-06,0.02
3,Kuri I of Kanem,Chad,565745365,Stub,15400000.0,100,2,6e-06,0.02
4,Mohammed I of Kanem,Chad,565745375,Stub,15400000.0,100,2,6e-06,0.02
5,Kuri II of Kanem,Chad,669719757,Stub,15400000.0,100,2,6e-06,0.02
6,Bir II of Kanem,Chad,670893206,Stub,15400000.0,100,2,6e-06,0.02
7,Mahamat Hissene,Chad,693055898,Stub,15400000.0,100,2,6e-06,0.02
8,Othman I,Chad,705432607,Stub,15400000.0,100,2,6e-06,0.02
9,Alphonse Kotiga,Chad,707593108,Stub,15400000.0,100,2,6e-06,0.02
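
As a worked example, the two derived values for Chad can be reproduced by hand from the numbers in the table above (the variable names are illustrative):

```python
# Values for Chad taken from the tables above
population_millions = 15.4
article_count = 100
high_quality_article_count = 2

population = population_millions * 1_000_000
articles_per_population = article_count / population
high_quality_share = high_quality_article_count / article_count

print(f"{articles_per_population:.2e}", high_quality_share)
```

The first value is about 6.49e-06, which the table displays rounded as 6e-06. The second is 0.02, meaning 2% of Chad's politician articles are rated FA or GA.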


We now have a data frame with the information required to answer all our questions. We just need to group by country and sort on the appropriate column for each of the questions listed above.

  - 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [23]:
final_df.groupby('country', as_index=False)['articles_per_population'].mean().sort_values("articles_per_population", ascending=False).head(10).reset_index(drop=True)

Unnamed: 0,country,articles_per_population
0,Tuvalu,0.0055
1,Tonga,0.00063
2,Iceland,0.000515
3,Grenada,0.00036
4,Luxembourg,0.0003
5,Fiji,0.000221
6,Maldives,0.00021
7,Vanuatu,0.0002
8,Dominica,0.000171
9,New Zealand,0.000161


  - 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [24]:
final_df.groupby('country', as_index=False)['articles_per_population'].mean().sort_values("articles_per_population", ascending=True).head(10).reset_index(drop=True)

Unnamed: 0,country,articles_per_population
0,India,7.190257e-07
1,Indonesia,8.069382e-07
2,China,8.143206e-07
3,Uzbekistan,8.81459e-07
4,Ethiopia,9.767442e-07
5,"Korea, North",1.523437e-06
6,Thailand,1.691843e-06
7,Bangladesh,1.941106e-06
8,Vietnam,2.016895e-06
9,Sudan,2.35012e-06


  - 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [25]:
final_df.groupby('country', as_index=False)['high_quality_articles_per_population'].mean().sort_values("high_quality_articles_per_population", ascending=False).head(10).reset_index(drop=True)

Unnamed: 0,country,high_quality_articles_per_population
0,"Korea, North",0.179487
1,Saudi Arabia,0.134454
2,Central African Republic,0.117647
3,Romania,0.114943
4,Mauritania,0.096154
5,Bhutan,0.090909
6,Tuvalu,0.090909
7,Dominica,0.083333
8,United States,0.075092
9,Benin,0.074468


 - 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [26]:
final_df.groupby('country', as_index=False)['high_quality_articles_per_population'].mean().sort_values("high_quality_articles_per_population", ascending=True).head(10).reset_index(drop=True)

Unnamed: 0,country,high_quality_articles_per_population
0,Tanzania,0.002451
1,Peru,0.002825
2,Lithuania,0.004032
3,Nigeria,0.004399
4,Morocco,0.004808
5,Fiji,0.005025
6,Bolivia,0.005348
7,Brazil,0.005445
8,Luxembourg,0.005556
9,Sierra Leone,0.006024
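
Since final_df repeats the same per-country values on every article row, the `groupby(...).mean()` in the cells above only collapses duplicates. An equivalent, shorter pattern is sketched below on a toy frame. The ratios are copied from the tables above, while the number of rows per country is arbitrary.

```python
import pandas as pd

# One row per article, as in final_df; rows per country are arbitrary here
toy_final = pd.DataFrame({
    "country": ["Tuvalu", "Tuvalu", "Tonga", "India", "India", "India"],
    "articles_per_population": [0.0055, 0.0055, 0.00063,
                                7.190257e-07, 7.190257e-07, 7.190257e-07],
})

# Keep one row per country, then rank
per_country = toy_final.drop_duplicates(subset="country").reset_index(drop=True)
top = per_country.nlargest(2, "articles_per_population")
bottom = per_country.nsmallest(1, "articles_per_population")

print(top["country"].tolist(), bottom["country"].tolist())
```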
