# Bias in Data
### Kenten Danas
#### An exploration of bias in data on Wikipedia, originally completed for University of Washington's DATA 512 class in Autumn 2018

The purpose of this notebook is to explore bias in data by looking at English Wikipedia pages on politicians from a variety of countries. I combine the article data with data on the countries' populations and with article quality predictions obtained from ORES (more information on this below), and explore how coverage of politicians, both the overall number of articles and the number of higher quality articles, varies across countries.

This notebook is used to process the downloaded data, get article quality prediction data from ORES, munge the data from the different sources, and complete an analysis of the combined data. A discussion of the results can be found in the ReadMe of this repository.

In [35]:
#Import necessary packages and initialize desired notebook settings
import json
import numpy as np
import pandas as pd
import requests

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Getting Data

Two sources of data were used for this analysis; both are described in separate sections below. Both data sets are publicly available and easily downloaded.

### Article Data

The first data set contains data about English Wikipedia articles on politicians, by country. It can be found at this link: 

https://figshare.com/articles/Untitled_Item/5513449

This data is released under the CC-BY-SA 4.0 license, and so can be included here in this repo. 

From this page you can download a zip file containing the data, as well as the R code used to generate the data. For the purpose of this analysis, only page_data.csv is needed, so I have extracted that file from the download and included it in the 'data_raw' folder of this repository.

In [15]:
#Read page data csv from local data and look at first couple of rows
pages = pd.read_csv('data_raw/page_data.csv')
pages.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


Consistent with what is described on the figshare page for this data, page_data.csv contains the following columns:

 - page: the name of the English Wikipedia page (not cleaned)
 - country: the name of the country the politician was from
 - rev_id: the id of the last revision made to the page

### Population Data

The second data set used for this analysis contains world population by country, in millions of people, as of 2018. The data set can be found here: 

https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0

This data originates from the Population Reference Bureau; more information on the data can be found on their website, here: https://www.prb.org/2018-world-population-data-sheet-with-focus-on-changing-age-structures/

Since this data is not licensed, I have not included it in this repository. The code below loads the data locally. If you want to replicate this analysis, you can download the data from the link above and update the filepath used to read in the csv in the code below.

Note that here I also do some minor data processing to make the analysis below easier, namely renaming the columns and converting the population to the full value rather than millions.

In [136]:
#Read population data from local csv and review first couple of rows
pop = pd.read_csv('C:/Users/kentd/OneDrive/Documents/School-Grad/DATA_512/WPDS_2018_data.csv')

#Rename columns to make future analysis easier
pop.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)': 'Population'}, inplace=True)

#Convert population to numeric (entries that do not parse become NaN), and then to raw value
pop['Population'] = pd.to_numeric(pop['Population'], errors='coerce')
pop['Population'] = pop['Population'] * 1e6
pop.head()

Unnamed: 0,country,Population
0,AFRICA,
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
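
As a small illustration of the conversion step above, the sketch below shows how `pd.to_numeric` with `errors='coerce'` behaves on text values. The sample values are hypothetical and are not taken from the population file; in particular, whether the real file contains thousands separators is an assumption to check against your own download.

```python
import pandas as pd

# Hypothetical populations in millions, stored as text as they might appear in a CSV
raw = pd.Series(["42.7", "97", "1,284", "n/a"])

# errors='coerce' turns anything that does not parse as a number into NaN
millions = pd.to_numeric(raw, errors="coerce")
people = millions * 1e6

# One option for keeping values with thousands separators:
# strip the commas before converting
no_commas = raw.str.replace(",", "", regex=False)
people_clean = pd.to_numeric(no_commas, errors="coerce") * 1e6

print(people.isna().tolist())
print(people_clean.tolist())
```

Rows that end up as NaN here would later be carried into the merged data with a missing population, so it is worth counting them with `isna().sum()` before moving on.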


## Article Quality Predictions Using ORES

For this analysis, I get the prediction of article quality from ORES, an API for machine learning developed by Wikimedia. See the following link for more information:

https://www.mediawiki.org/wiki/ORES

ORES takes a Wikipedia revision ID and assigns probabilities that the quality of the article at that revision falls into each of six categories. The category with the highest probability is the prediction assigned to the article. The categories are (from best to worst quality):

 - FA - Featured article
 - GA - Good article
 - B - B-class article
 - C - C-class article
 - Start - Start-class article
 - Stub - Stub-class article
 
For this analysis, I consider articles classified as 'FA' and 'GA' to be "high quality" articles.
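
The rule described above (take the category with the highest probability, then treat 'FA' and 'GA' as high quality) can be sketched as follows. The function names and the probabilities are made up for illustration; they are not part of ORES.

```python
# Categories treated as "high quality" in this analysis
HIGH_QUALITY = {"FA", "GA"}

def predicted_class(probabilities):
    """Return the label whose probability is highest."""
    return max(probabilities, key=probabilities.get)

def is_high_quality(label):
    """True if the label is one of the high quality categories."""
    return label in HIGH_QUALITY

# Hypothetical probabilities for a single revision
example = {"FA": 0.02, "GA": 0.05, "B": 0.10, "C": 0.18, "Start": 0.40, "Stub": 0.25}

label = predicted_class(example)
print(label, is_high_quality(label))
```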

To get the predictions, I feed the pages dataset to the ORES system using the code below. Since the API documentation recommends batching 50 revision IDs per request, I have split the data into chunks of 50 rather than sending the entire dataframe. The function 'ores' I used to make the API requests is based on the example provided here: https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
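
The batching idea can be shown on its own with a small helper. This is a minimal sketch; the helper name `chunks` and the sample IDs are hypothetical. Because Python slice ends are exclusive, `items[start:start + size]` returns exactly `size` items and no ID is skipped between batches.

```python
def chunks(items, size=50):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical revision ids, used only to show the batch sizes
sample_rev_ids = list(range(1, 121))
batches = list(chunks(sample_rev_ids))

print([len(batch) for batch in batches])
```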

In [30]:
#Start with some housekeeping - define headers & API endpoint
headers = {'User-Agent' : 'https://github.com/kentdanas', 'From' : 'kdanas@uw.edu'}
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

#Now define function to make the API call for a given list of revision ids
def ores(rev_ids, headers=headers, endpoint=endpoint):
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in rev_ids)
              }
    #Pass the headers so the request identifies its sender to the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [101]:
#Use the ORES function above to get predictions for each article by looping through 50 articles at a time. 
# Note, this chunk will take a few minutes to run; this is a good time to grab a cup of coffee!

#Define chunk size (number of revision ids sent per request)
chunk_size = 50

#Create empty list to store results; it is converted to a dataframe after the loop
rows = []

for index_start in range(0, len(pages), chunk_size):
    #Pull chunk of revision ids. The slice end is exclusive, so this takes exactly
    # chunk_size rows and no article is skipped between chunks
    rev_ids = list(pages['rev_id'][index_start:index_start + chunk_size])

    #Feed revision ids to ORES and get predictions
    response = ores(rev_ids)

    #Pull out prediction for each rev_id from resulting json dump. Note that if the article was not found by ORES
    # it will not have a prediction, hence the try/except statement. For articles with no prediction I filled the
    # results with 'nan'
    for rev_id in rev_ids:
        rev_id = str(rev_id)

        try:
            prediction = response['enwiki']['scores'][rev_id]['wp10']['score']['prediction']
        except (KeyError, TypeError):
            prediction = np.nan

        #Store results for this revision
        rows.append({'rev_id': rev_id, 'article_quality': prediction})

#Build the dataframe once, which is faster than appending row by row
page_quality = pd.DataFrame(rows, columns=['rev_id', 'article_quality'])
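
To make the lookup path into the response easier to follow, here is a sketch that extracts predictions from a hand-written dictionary shaped like the path used above (`enwiki` → `scores` → revision id → `wp10` → `score` → `prediction`). The helper name, the sample response, and especially the shape of the error entry are assumptions for illustration, not a record of what ORES returns.

```python
import math

def extract_predictions(response, rev_ids, project="enwiki", model="wp10"):
    """Map each revision id to its predicted class, or NaN if none is present."""
    scores = response.get(project, {}).get("scores", {})
    results = {}
    for rev_id in rev_ids:
        # Keys in the response are strings, so convert before looking up
        entry = scores.get(str(rev_id), {}).get(model, {})
        results[rev_id] = entry.get("score", {}).get("prediction", math.nan)
    return results

# Hypothetical response: one scored revision, one error entry, one missing id
fake_response = {
    "enwiki": {
        "scores": {
            "757566606": {"wp10": {"score": {"prediction": "Stub"}}},
            "235107991": {"wp10": {"error": {"type": "placeholder error"}}},
        }
    }
}

preds = extract_predictions(fake_response, [757566606, 235107991, 1])
print(preds)
```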

## Combining Datasets

Now that I have the article quality prediction data from ORES, I combine the three data sets into one so they can be used for the analysis. The final dataframe has the following columns:

 - article_name: the (uncleaned) article name
 - country: the country of the politician in the article
 - revision_id: the id of the last revision of the article
 - article_quality: the ORES prediction of the article quality
 - Population: the population of the country

In [135]:
#First combine the ORES predictions and the page data (note that I need to convert the rev_id type to an int first)
page_quality['rev_id'] = pd.to_numeric(page_quality['rev_id'], errors='coerce')
final_dataset = pages.merge(page_quality, how='left', on=['rev_id'])

#Next, merge with population data. This is done as an inner join since I only want to keep articles for which 
#there is a population, and I don't need the population for any countries without any articles
final_dataset = final_dataset.merge(pop, how='inner', on=['country'])

#Rename a couple of columns
final_dataset.rename(columns={'page': 'article_name', 'rev_id': 'revision_id'}, inplace=True)

#Save to CSV
final_dataset.to_csv('data_clean/final_article_dataset.csv', index = False)

#Take a look at the resulting dataframe
final_dataset.head()

Unnamed: 0,article_name,country,revision_id,article_quality,Population
0,Template:ZambiaProvincialMinisters,Zambia,235107991,,17700000.0
1,Gladys Lundwe,Zambia,757566606,Stub,17700000.0
2,Mwamba Luchembe,Zambia,764848643,Stub,17700000.0
3,Thandiwe Banda,Zambia,768166426,Start,17700000.0
4,Sylvester Chisembele,Zambia,776082926,C,17700000.0
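
Because the inner join silently drops rows whose country name has no match, it can be useful to check what was lost. The sketch below uses toy frames (the country "Atlantis" and the page names are hypothetical) and the `indicator=True` option of `DataFrame.merge` to list unmatched rows on either side.

```python
import pandas as pd

# Toy article data; "Atlantis" is a made-up country with no population entry
articles = pd.DataFrame({
    "page": ["Page A", "Page B", "Page C"],
    "country": ["Zambia", "Algeria", "Atlantis"],
    "rev_id": [1, 2, 3],
})

# Toy population data; Morocco has no articles in the toy article frame
population = pd.DataFrame({
    "country": ["Zambia", "Algeria", "Morocco"],
    "Population": [17700000.0, 42700000.0, 35200000.0],
})

# Inner join keeps only countries present in both frames
inner = articles.merge(population, how="inner", on="country")

# Outer join with indicator=True adds a _merge column marking where each row came from
audit = articles.merge(population, how="outer", on="country", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))
print(unmatched[["country", "_merge"]])
```

Rows marked `left_only` are articles without a population, and rows marked `right_only` are countries without any articles.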


## Analysis

The analysis of this data set calculates two values for each country: the number of articles as a proportion of the country's population, and the proportion of the country's articles that are high quality. A detailed discussion of the results of this analysis can be found in the ReadMe of this repository.
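
One way these two proportions could be computed is sketched below on toy data. The countries "X" and "Y", the derived column names, and the choice to count articles without a prediction in the denominator are all assumptions for illustration, not necessarily the approach used for the reported results.

```python
import pandas as pd

# Toy data using the same column names as the combined dataframe
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "article_quality": ["FA", "Stub", "GA", "Stub", None],
    "Population": [1e6, 1e6, 1e6, 4e6, 4e6],
})

# Flag articles predicted as FA or GA; missing predictions count as not high quality
df["high_quality"] = df["article_quality"].isin(["FA", "GA"])

summary = df.groupby("country").agg(
    articles=("article_quality", "size"),     # all rows, including missing predictions
    high_quality=("high_quality", "sum"),     # number of FA/GA articles
    population=("Population", "first"),       # population is constant within a country
)

# Express both proportions as percentages
summary["articles_per_population_pct"] = 100 * summary["articles"] / summary["population"]
summary["high_quality_pct"] = 100 * summary["high_quality"] / summary["articles"]

print(summary)
```

Sorting `summary` by either percentage column then gives the highest and lowest ranked countries.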