Peter Meleney  
DATA 512  
Assignment A2  
10/17/19

## Assignment A2: Bias in Data

### Objective of Notebook

The objective of this notebook is to satisfy the requirements of assignment A2 for DATA 512 in Autumn quarter of 2019, part of the University of Washington Master of Science in Data Science (MSDS) program.  

### Objective of Assignment 

The stated objective of this assignment "is to explore the concept of bias through data on Wikipedia articles." [1]  I will analyze the coverage of politicians on Wikipedia and how the quality of coverage varies between countries and regions.  I will also write a reflection, focusing on how my understanding of bias in data improved through the completion of this project.  

### Data Provenance

The Wikipedia politicians by country dataset is hosted on Figshare: https://figshare.com/articles/Untitled_Item/5513449, accessed October 13, 2019.  
I downloaded the population data from a file hosted on Canvas (A2: Bias in data/WPDS 2018 data.csv): https://canvas.uw.edu/courses/1319253/files/, accessed October 13, 2019.  These data are drawn from the World Population Data Sheet for 2018: https://www.prb.org/international/indicator/population/table/.  The article quality ratings were retrieved through the ORES API, discussed in the relevant section below.

### Imports

In [566]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re
import time

# from ores import api

import requests
import json

%matplotlib inline

### Issues with ORES

I ran into an issue installing ORES.  The error indicated that:  

```Command "python setup.py egg_info" failed with error code 1```

This was preceded by an indication that the 'enchant' C library was not found:

```ImportError: The 'enchant' C library was not found. Please install it via your OS package manager, or use a pre-built binary wheel from PyPI.```

Solving this issue is surely tractable but would take considerable time, so I chose to follow **Option 2** in the instructions: I will use the REST API endpoint to score the quality of the applicable pages.

## Make Directories to hold Raw and Clean Data

In [567]:
# Make the data, raw_data, and clean_data directories if they do not already exist.
# os.makedirs creates missing parent directories, and exist_ok=True makes the cell safe to re-run.
os.makedirs('data/raw_data', exist_ok=True)
os.makedirs('data/clean_data', exist_ok=True)

### Import raw data

In [568]:
pop_df = pd.read_csv('data/raw_data/WPDS_2018_data.csv')
wikipedia_df = pd.read_csv('data/raw_data/page_data.csv')

### Brief exploration of raw data tables

#### Population dataframe

In [569]:
pop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 2 columns):
Geography                         207 non-null object
Population mid-2018 (millions)    207 non-null object
dtypes: object(2)
memory usage: 3.3+ KB


In [570]:
pop_df.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


#### Wikipedia dataframe

In [571]:
wikipedia_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
page       47197 non-null object
country    47197 non-null object
rev_id     47197 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [572]:
wikipedia_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


#### Discussion

I will need to clean these tables a bit.  The population dataframe contains regions in all caps (e.g. AFRICA).  These should be moved to a regions dataframe for further analysis but not kept in the country population dataframe, as the regions will not match any particular country.

The wikipedia dataframe contains pages that start with the word "Template."  These pages should not be included in our analysis.

### Data Cleaning

#### Population DataFrame

I will remove all the region headings from the population dataframe and instead list them as an additional column.  This will come in handy during analysis.

In [573]:
#First we find all the rows with only uppercase characters.
uppercase_finder = re.compile(r"^([A-Z ':]+$)", re.M)

uppercase_rows = []
for i in range(0,pop_df.shape[0]):
    uppercase_found = []
    uppercase_found = uppercase_finder.findall(pop_df.iloc[i,0])
    if uppercase_found:
        uppercase_rows.append(i)

# create region_df that stores region IDs and total populations
region_df = pd.DataFrame([])
for j in uppercase_rows:
    region_df = pd.concat([region_df, pd.DataFrame(pop_df.iloc[j]).T], sort=False)

# Append regional geography to population df
region_map = uppercase_rows.copy()
region_map.append(pop_df.shape[0])
list_regions = []
for i in range(0,len(region_map)-1):
    num_regions = region_map[i+1] - region_map[i]
    list_regions.append(num_regions*[pop_df[['Geography', 'Population mid-2018 (millions)']].iloc[region_map[i]]])

l_regions = pd.DataFrame([item for sublist in list_regions for item in sublist])
l_regions.index = pop_df.index

pop_df = pd.concat([l_regions, pop_df], axis = 1)

In [574]:
# Drop uppercase_rows from the pop_df dataframe
for row in uppercase_rows:
    pop_df.drop(row, axis = 0, inplace=True)

pop_df.columns = ['Region', 'Region Population mid-2018 (millions)', 'Country', 'Population mid-2018 (millions)']
# Check that the heading AFRICA was removed from the top of the dataframe
pop_df.sample(10)

Unnamed: 0,Region,Region Population mid-2018 (millions),Country,Population mid-2018 (millions)
50,AFRICA,1284,Sao Tome and Principe,0.2
28,AFRICA,1284,Ethiopia,107.5
24,AFRICA,1284,Burundi,11.8
22,AFRICA,1284,Sierra Leone,7.7
203,OCEANIA,41,Solomon Islands,0.7
122,ASIA,4536,Iran,81.6
96,ASIA,4536,Armenia,3.0
80,LATIN AMERICA AND THE CARIBBEAN,649,Saint Lucia,0.2
163,EUROPE,746,Switzerland,8.5
133,ASIA,4536,Philippines,107.0
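
As a rough alternative sketch of the heading-to-column step, assuming a small made-up frame that mimics the layout of the population table (the rows are illustrative, and `toy_pop` and `toy_countries` are hypothetical names), the region can be carried down to its countries with a forward fill rather than explicit loops:

```python
import pandas as pd

# Toy stand-in for the raw population table; layout mirrors the real file.
toy_pop = pd.DataFrame({
    'Geography': ['AFRICA', 'Algeria', 'Egypt', 'ASIA', 'Armenia'],
    'Population mid-2018 (millions)': ['1,284', '42.7', '97.0', '4,536', '3.0'],
})

# Region headings are the rows whose name contains no lowercase letters.
is_region = toy_pop['Geography'].str.isupper()

# Keep the name only on heading rows, then carry it down to the rows below.
toy_pop['Region'] = toy_pop['Geography'].where(is_region).ffill()

# Drop the heading rows, leaving one row per country with its region attached.
toy_countries = (
    toy_pop[~is_region]
    .rename(columns={'Geography': 'Country'})
    .reset_index(drop=True)
)
print(toy_countries[['Region', 'Country']])
```

This relies on every country row appearing below its region heading, which is the same assumption the loop-based cell makes.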


#### Wikipedia DataFrame

I will remove all the rows whose page name starts with "Template" from the wikipedia dataframe.

In [575]:
wikipedia_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [576]:
# Collect row numbers in list template_rows
template_rows = []
for i in range(0,wikipedia_df.shape[0]):
    if wikipedia_df.iloc[i]['page'][0:8] == "Template":
        template_rows.append(i)

# Drop template_rows from wikipedia dataframe
for row in template_rows:
    wikipedia_df.drop(row, axis = 0, inplace=True)
    
# Print updated info for wikipedia dataframe
wikipedia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46701 entries, 1 to 47196
Data columns (total 3 columns):
page       46701 non-null object
country    46701 non-null object
rev_id     46701 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [577]:
wikipedia_df.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


#### Discussion

Cleaning the wikipedia dataframe resulted in the removal of 496 rows from the dataframe (47197-46701).  These rows were not saved because they are not applicable to our discussion here.  
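
The same filter could also be written as a boolean mask. The following is a minimal sketch on a three-row toy frame (page names taken from the head of the table above; `toy_pages` and `toy_articles` are hypothetical names), not the code that produced the counts reported here:

```python
import pandas as pd

toy_pages = pd.DataFrame({
    'page': ['Template:Uganda-politician-stub', 'Bir I of Kanem', 'Yos Por'],
    'country': ['Uganda', 'Chad', 'Cambodia'],
})

# True for rows whose title begins with "Template", matching the [0:8] slice test.
is_template = toy_pages['page'].str.startswith('Template')

# Keep everything that is not a template page.
toy_articles = toy_pages[~is_template]
removed = int(is_template.sum())
print(removed, list(toy_articles['page']))
```

Counting the mask before dropping gives the number of removed rows directly, without subtracting the two `info()` totals.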

## Save Cleaned Data to clean_data Folder

In [578]:
pop_df.to_csv('data/clean_data/population.csv')
region_df.to_csv('data/clean_data/region.csv')
wikipedia_df.to_csv('data/clean_data/wikipedia.csv')

## ORES REST API

### Split wikipedia_df rev_id Column to Avoid API Call Limits

As per the notes in [2] I will split the rev_id column into 100-element long lists to avoid hitting ORES API call limits.

In [580]:
wikipedia_df.shape[0]//100, wikipedia_df.shape[0]%100 

(467, 1)

We expect a list of 468 lists: 467 full lists of 100 rev_ids and a final list containing a single rev_id.  The following code creates this list of lists to pass to the get_ores_data function defined below.

In [582]:
list_of_rev_id_lists = []

# Create a list of lists, each containing up to 100 rev_ids.
# Stepping through the column in strides of 100 puts any remainder in a final, shorter list.
for start in range(0, wikipedia_df.shape[0], 100):
    rev_id_list = []
    for rev_id in wikipedia_df['rev_id'].iloc[start:start+100]:
        rev_id_list.append(rev_id)
    list_of_rev_id_lists.append(rev_id_list)
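
To check the expected batch count without touching the real data, here is a small sketch using placeholder ids. The helper name `chunk` is hypothetical, and the ids are just the integers 0 to 46,700 standing in for the 46,701 rows of the cleaned wikipedia dataframe:

```python
def chunk(items, size=100):
    """Split a sequence into consecutive lists of at most `size` items."""
    return [list(items[start:start + size]) for start in range(0, len(items), size)]

# Placeholder ids; only the count matters for this check.
placeholder_ids = list(range(46701))
batches = chunk(placeholder_ids)

print(len(batches), len(batches[-1]))  # prints: 468 1
```

Stepping by the batch size handles any remainder, so the number of batches does not have to be computed separately.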

In [583]:
# The code in this cell is modified from https://github.com/jtmorgan/data-512-a2/blob/master/NEW_hcds-a2-bias_demo.ipynb 
# see full citation below.[2]

headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'pmeleney@uw.edu'} 

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    return json.dumps(response, indent=4, sort_keys=True)

### Article Quality Codes from ORES

In [584]:
# Article quality codes and their descriptions.

pd.DataFrame([['FA', 'GA', 'B', 'C', 'Start', 'Stub'], 
              ['Featured Article', 'Good Article', 'B-Class Article', 'C-Class Article', 'Start-Class Article', 'Stub-Class Article']],
            index = ['code', 'description']).T

Unnamed: 0,code,description
0,FA,Featured Article
1,GA,Good Article
2,B,B-Class Article
3,C,C-Class Article
4,Start,Start-Class Article
5,Stub,Stub-Class Article


## WARNING: The code cell below takes a long time to run

This next cell takes quite some time to run (I experienced between 10 minutes and 171 minutes depending on the conditions of the API).  It connects to the ORES API, builds a 2-column dataframe containing the rev_id of each wikipedia article and the ORES predicted article class, and saves the resulting dataframe to a csv file in the raw_data folder.  The code ignores rev_ids that return an error code.  We will identify the rev_ids that errored out by comparing a flattened list_of_rev_id_lists with the first column of the dataframe.
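
The error-skipping logic can be tried offline first. The sketch below assumes a hand-written response in the shape the scoring cell reads (`enwiki` → `scores` → rev_id → `wp10`). It is not real API output: the error entry and its message are placeholders, and `extract_predictions` is a hypothetical helper name.

```python
import json

# Hand-written response in the shape the scoring cell expects; not real API output.
sample_response = json.dumps({
    'enwiki': {
        'scores': {
            '807454176': {'wp10': {'score': {'prediction': 'C'}}},
            '807454234': {'wp10': {'score': {'prediction': 'Stub'}}},
            # Hypothetical failed lookup: an 'error' key appears instead of 'score'.
            '1': {'wp10': {'error': {'message': 'placeholder message',
                                     'type': 'PlaceholderError'}}},
        }
    }
})

def extract_predictions(ores_json, revision_ids):
    """Return (rev_id, prediction) pairs for ids that were scored without error."""
    scores = json.loads(ores_json)['enwiki']['scores']  # parse once per batch
    pairs = []
    for rev_id in revision_ids:
        entry = scores.get(str(rev_id), {}).get('wp10', {})
        if 'score' in entry:
            pairs.append((rev_id, entry['score']['prediction']))
    return pairs

# rev_id 1 has an error entry and rev_id 2 is absent from the response; both are skipped.
pairs = extract_predictions(sample_response, [807454176, 807454234, 1, 2])
print(pairs)
```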

In [585]:
i = 0
predictions = pd.DataFrame([])
for list_of_rev_ids in list_of_rev_id_lists:
    i += 1
    ores_data = get_ores_data(list_of_rev_ids, headers)
    print('Completed the', i, 'th list of 100')
    # Parse the returned JSON string once per batch rather than once per lookup.
    scores = json.loads(ores_data)['enwiki']['scores']
    predictions_temp = []
    returned_score_ids = []
    for rev_id in list_of_rev_ids:
        # Keep only rev_ids that were returned and scored without an error.
        if str(rev_id) in scores and 'error' not in scores[str(rev_id)]['wp10']:
            returned_score_ids.append(str(rev_id))
            predictions_temp.append(scores[str(rev_id)]['wp10']['score']['prediction'])

    predictions = pd.concat([pd.DataFrame(zip(returned_score_ids, predictions_temp)), predictions], axis = 0)
predictions.columns = ['rev_id', 'ORES_predicted class']

predictions.to_csv('data/raw_data/predictions.csv')

Completed the 1 th list of 100
Completed the 2 th list of 100
Completed the 3 th list of 100
Completed the 4 th list of 100
Completed the 5 th list of 100
Completed the 6 th list of 100
Completed the 7 th list of 100
Completed the 8 th list of 100
Completed the 9 th list of 100
Completed the 10 th list of 100
Completed the 11 th list of 100
Completed the 12 th list of 100
Completed the 13 th list of 100
Completed the 14 th list of 100
Completed the 15 th list of 100
Completed the 16 th list of 100
Completed the 17 th list of 100
Completed the 18 th list of 100
Completed the 19 th list of 100
Completed the 20 th list of 100
Completed the 21 th list of 100
Completed the 22 th list of 100
Completed the 23 th list of 100
Completed the 24 th list of 100
Completed the 25 th list of 100
Completed the 26 th list of 100
Completed the 27 th list of 100
Completed the 28 th list of 100
Completed the 29 th list of 100
Completed the 30 th list of 100
Completed the 31 th list of 100
Completed the 32 

KeyboardInterrupt: 

## Load Predictions from Saved File

In [586]:
predictions = pd.read_csv('data/raw_data/predictions.csv', index_col = 0)

## Brief Exploration of predictions Dataframe

In [587]:
predictions.head()

Unnamed: 0,rev_id,ORES_predicted class
0,807454176,C
1,807454234,Stub
2,807454631,Start
3,807454637,GA
4,807454951,Stub


In [588]:
predictions['ORES_predicted class'].unique()

array(['C', 'Stub', 'Start', 'GA', 'B', 'FA'], dtype=object)

In [589]:
predictions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46546 entries, 0 to 96
Data columns (total 2 columns):
rev_id                  46546 non-null int64
ORES_predicted class    46546 non-null object
dtypes: int64(1), object(1)
memory usage: 1.1+ MB


## Identify Which rev_ids Did Not Return Results

In [590]:
# Flatten the list_of_rev_id_lists to a list_of_rev_ids
list_of_rev_ids = []
for sublist in list_of_rev_id_lists:
    for rev_id in sublist:
        list_of_rev_ids.append(rev_id)

In [591]:
len(list_of_rev_ids) - predictions.shape[0]

155

We expect 155 rev_ids not to have returned data from the ORES API.

In [592]:
list_of_rev_ids_not_in_predictions = []
i = 0
list_predictions_rev_id = list(predictions.rev_id)
for rev_id in list_of_rev_ids:
    i +=1
    if i%10000 == 0:
        print(i)
    if rev_id not in list_predictions_rev_id:
        list_of_rev_ids_not_in_predictions.append(rev_id)

10000
20000
30000
40000


In [593]:
# Expected value 155
len(list_of_rev_ids_not_in_predictions)

155
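
The membership test against a list is slow for tens of thousands of ids because each lookup scans the whole list. As a hedged sketch with made-up ids (`requested` and `returned` are hypothetical stand-ins for the flattened request list and the predictions column), the same comparison can be done against a set:

```python
requested = [101, 102, 103, 104, 105]  # made-up rev_ids sent to the API
returned = [101, 103, 104]             # made-up rev_ids that came back with a score

# Set lookups take roughly constant time, so the scan is linear overall.
returned_set = set(returned)

# Filter the original list so the request order is preserved.
missing = [rev_id for rev_id in requested if rev_id not in returned_set]
print(missing)  # prints: [102, 105]
```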

## Save df of list_of_rev_ids_not_in_predictions to clean_data folder

In [594]:
df_rev_ids_not_in_predictions = pd.DataFrame(list_of_rev_ids_not_in_predictions)
df_rev_ids_not_in_predictions.columns = ['rev_id']
df_rev_ids_not_in_predictions.to_csv('data/clean_data/rev_ids_not_in_predictions.csv')

## Combining the datasets

We will combine the datasets to make a final csv file, which will be saved to the clean_data folder as wp_wpds_politicians_by_country.csv per the instructions.  First the wikipedia and predictions datasets are combined into a df_temp dataframe, and the rev_id columns are dropped (because these data are already contained in the index).  Then I merge the population data onto the dataframe to create the final dataset.  Rows with NA values are then removed to create a dense file, which is saved as wp_wpds_politicians_by_country.csv.

In [595]:
# Concat the predictions dataframe with the wikipedia dataframe on rev_id
predictions.index = predictions.rev_id
wikipedia_df.index = wikipedia_df.rev_id
df_temp = pd.concat([wikipedia_df, predictions], axis = 1)
df_temp.drop(['rev_id'], axis=1, inplace = True)
df_temp.columns = ['Page', 'Country', 'ORES_predicted_class']

In [596]:
df_temp.head()

Unnamed: 0_level_0,Page,Country,ORES_predicted_class
rev_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
355319463,Bir I of Kanem,Chad,Stub
393276188,Information Minister of the Palestinian Nation...,Palestinian Territory,Stub
393822005,Yos Por,Cambodia,Stub
395521877,Julius Gregr,Czech Republic,Stub
395526568,Edvard Gregr,Czech Republic,Stub


In [597]:
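# Create final dataframe by merging pop_df on 'Country'.
# reset_index() turns the rev_id index into a column first; a merge on a column would otherwise discard it.
df = df_temp.reset_index().merge(pop_df, how='left', on='Country')
df = df.dropna()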

In [599]:
df.to_csv('data/clean_data/wp_wpds_politicians_by_country.csv')
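
Because `dropna()` silently removes articles whose country has no population row, it can be useful to see which countries fail to match. The following is a minimal sketch on toy frames, not the notebook's data: the 'Atlantis' row and the population figures are placeholders, and `toy_articles`, `toy_population`, `unmatched`, and `final` are hypothetical names. It uses the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

toy_articles = pd.DataFrame({
    'rev_id': [355319463, 393822005, 1],
    'Page': ['Bir I of Kanem', 'Yos Por', 'Hypothetical Page'],
    'Country': ['Chad', 'Cambodia', 'Atlantis'],  # 'Atlantis' has no population row
    'ORES_predicted_class': ['Stub', 'Stub', 'Start'],
})
toy_population = pd.DataFrame({
    'Country': ['Chad', 'Cambodia'],
    'Population mid-2018 (millions)': [1.0, 2.0],  # placeholder values
})

# indicator=True adds a '_merge' column marking each row as 'both' or 'left_only'.
merged = toy_articles.merge(toy_population, how='left', on='Country', indicator=True)

# Countries that appear in the article table but not in the population table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()

# Keep only matched rows; rev_id survives because it is an ordinary column here.
final = merged[merged['_merge'] == 'both'].drop(columns='_merge')
print(unmatched, len(final))
```

Listing the unmatched names separates genuine gaps in the population data from spelling differences between the two sources.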

## Analysis

In the next cell I calculate two measures for each country: coverage, the number of politician articles as a percentage of the country's population, and quality of coverage, the percentage of those articles rated Featured Article (FA) or Good Article (GA) by the ORES classification API.

In [495]:
# Count the number of articles per country regardless of quality.
num_pages = pd.DataFrame(df.groupby('Country').count()['Page'])
num_pages.columns = ['count_pages']

# Count the number of "good" articles per country, i.e. articles of quality 'GA' or 'FA'.
num_good_pages = pd.DataFrame(df[df['ORES_predicted_class'].isin(['GA', 'FA'])].groupby('Country').count()['Page'])
num_good_pages.columns = ['count_good_pages']

# Combine the populations of each country with the number of pages and number of good pages into one dataframe called df_count
pop_df.index = pop_df.Country
df_count = pd.concat([num_pages, num_good_pages, pop_df], sort=False, axis = 1)

# Calculate the "coverage" and "quality coverage" as the percent of number of pages per population and percent of quality pages
# per total pages respectively.
df_count['Population mid-2018 (millions)'] = df_count['Population mid-2018 (millions)'].apply(lambda x: float(x.replace(',','')))
df_count['coverage'] = 100*df_count['count_pages']/(1000000 * df_count['Population mid-2018 (millions)']) 
df_count['quality_coverage'] = 100*df_count['count_good_pages']/df_count['count_pages']
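
As a worked check of the two formulas, the sketch below uses Tuvalu. The counts of 54 articles and 5 good articles are not stated anywhere in the notebook. They are back-calculated from the reported 0.54% coverage at a population of 0.01 million and the reported 9.259259% quality, so treat them as an inference. The function names are hypothetical.

```python
def coverage_percent(count_pages, population_millions):
    """Articles as a percentage of population (population given in millions)."""
    return 100 * count_pages / (1000000 * population_millions)

def quality_percent(count_good_pages, count_pages):
    """GA/FA articles as a percentage of all articles for the country."""
    return 100 * count_good_pages / count_pages

# Inferred counts for Tuvalu: 54 articles, 5 of them rated GA or FA.
tuvalu_coverage = coverage_percent(54, 0.01)
tuvalu_quality = quality_percent(5, 54)
print(round(tuvalu_coverage, 2), round(tuvalu_quality, 6))  # prints: 0.54 9.259259
```

The example shows why very small countries top the coverage ranking: a few dozen articles divided by a population of about ten thousand already gives half a percent.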

## Results

### Top 10 Countries by Coverage

In [501]:
df_count.sort_values(['coverage'], ascending=False)[['coverage', 'Population mid-2018 (millions)']][0:10]

Unnamed: 0,coverage,Population mid-2018 (millions)
Tuvalu,0.54,0.01
Nauru,0.52,0.01
San Marino,0.27,0.03
Monaco,0.1,0.04
Liechtenstein,0.07,0.04
Tonga,0.063,0.1
Marshall Islands,0.061667,0.06
Iceland,0.05025,0.4
Andorra,0.0425,0.08
Grenada,0.036,0.1


### Bottom 10 Countries by Coverage

In [502]:
df_count.sort_values(['coverage'], ascending=True)[['coverage', 'Population mid-2018 (millions)']][0:10]

Unnamed: 0,coverage,Population mid-2018 (millions)
India,7.1e-05,1371.3
Indonesia,7.9e-05,265.2
China,8.1e-05,1393.8
Uzbekistan,8.5e-05,32.9
Ethiopia,9.4e-05,107.5
"Korea, North",0.000141,25.6
Zambia,0.000141,17.7
Thailand,0.000169,66.2
Mozambique,0.00019,30.5
Bangladesh,0.000192,166.4


### Top 10 Countries by Relative Quality

In [503]:
df_count.sort_values(['quality_coverage'], ascending=False)[['quality_coverage', 'Population mid-2018 (millions)']][0:10]

Unnamed: 0,quality_coverage,Population mid-2018 (millions)
"Korea, North",19.444444,25.6
Saudi Arabia,12.711864,33.4
Mauritania,12.5,4.5
Central African Republic,12.121212,4.7
Romania,11.370262,19.5
Tuvalu,9.259259,0.01
Bhutan,9.090909,0.8
Dominica,8.333333,0.07
Syria,7.8125,18.3
Benin,7.692308,11.5


### Bottom 10 Countries by Relative Quality

In [504]:
df_count.sort_values(['quality_coverage'], ascending=True)[['quality_coverage', 'Population mid-2018 (millions)']][0:10]

Unnamed: 0,quality_coverage,Population mid-2018 (millions)
Belgium,0.192308,11.4
Tanzania,0.246914,59.1
Switzerland,0.248756,8.5
Nepal,0.280112,29.7
Peru,0.285714,32.2
Nigeria,0.295421,195.9
Colombia,0.350877,49.8
Lithuania,0.409836,2.8
Fiji,0.507614,0.9
Azerbaijan,0.558659,9.9


### Geographic Regions by Coverage

In [604]:
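region_df.index = region_df['Geography']
# Total article count per region, and each region's population converted from millions to people.
region_pages = df_count.groupby('Region').sum()['count_pages']
region_population = 1000000*region_df['Population mid-2018 (millions)'].apply(lambda x: int(x.replace(',','')))
region_by_coverage_df = pd.DataFrame([100*region_pages/region_population]).T.sort_values(0, ascending=False)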
region_by_coverage_df.columns = ['Coverage (%)']
region_by_coverage_df

Unnamed: 0,Coverage (%)
OCEANIA,0.007629
EUROPE,0.002127
LATIN AMERICA AND THE CARIBBEAN,0.000796
AFRICA,0.000534
NORTHERN AMERICA,0.000526
ASIA,0.000254


### Geographic Regions by Quality of Coverage

In [606]:
region_by_good_coverage_df = pd.DataFrame([100*df_count.groupby('Region').sum()['count_good_pages']/df_count.groupby('Region').sum()['count_pages']]).T
region_by_good_coverage_df.columns = ['Good Coverage (%)']
region_by_good_coverage_df.sort_values('Good Coverage (%)', ascending=False)

Unnamed: 0_level_0,Good Coverage (%)
Region,Unnamed: 1_level_1
NORTHERN AMERICA,5.153566
ASIA,2.688405
OCEANIA,2.109974
EUROPE,2.029753
AFRICA,1.824551
LATIN AMERICA AND THE CARIBBEAN,1.334881


## Discussion

I expected that English Wikipedia would be heavily biased towards coverage of North American and European politicians, and would have less coverage of the other regions.  This is partially because I expected that the database would be biased towards covering English-speaking countries, and partially because I expected there to be more coverage of liberal western democracies that encourage free speech and political discussion.  This was partially borne out by the data.  Northern America dominates the quality of coverage by region table (5.15%, see last table in results section) with nearly double the share of the next highest region (Asia at 2.69%).  However, unexpectedly, Europe is not in second place.  

I did expect that population would dominate coverage, however, with small (in population) countries dominating the "top 10 countries by coverage" list and large countries (China, Indonesia, India) dominating the "bottom 10 countries by coverage" list.  Interestingly, North Korea appears both in the bottom 10 countries in terms of coverage relative to population and in the top 10 countries in terms of quality of coverage.  This means that few North Korean politicians have pages relative to the country's population, but a relatively large share of the pages that do exist (19.4%) are rated as high quality.

I think that exploring only the English Wikipedia results in considerable bias in the data.  We are not seeing the data as people in the country might see it if they speak a language other than English.  These data are applicable to political coverage of the world in English-speaking nations, but it would be better to consider political coverage in the language native to the country being investigated.  This would take considerably more effort in terms of classification and coding, however.  I would be interested in seeing these same results using just the length of articles (as a proxy for quality), normalized by the information density of the various languages.  However, this may introduce issues in nations where multiple languages are spoken by large sections of the population.

### References

[1] Morgan, Jonathan T. (2019, October 3).  Human Centered Data Science (Fall 2019)/Assignments.  Retrieved from: https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments, accessed October 13, 2019.

[2] Morgan, Jonathan T. (2017, October 28).  Bias on Wikipedia - Making ORES requests, jtmorgan/data-512-a2.  Retrieved from: https://github.com/jtmorgan/data-512-a2/blob/master/NEW_hcds-a2-bias_demo.ipynb, accessed October 16, 2019.