# Assignment 2: Bias in Data
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

In [96]:
import pandas as pd
import requests
import numpy as np
import json

## Data cleaning

In [115]:
# load dataframes
page_data_df = pd.read_csv("data/page_data.csv")
wpds_data_df = pd.read_csv("data/wpds_data.csv")

We first remove rows whose `page` value starts with `Template:` from the `page_data` dataset, since these pages are not Wikipedia articles.

In [116]:
page_data_df = page_data_df[~(page_data_df["page"].str.startswith("Template:"))]
page_data_df

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270
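
The filter above relies on `str.startswith`, which is case-sensitive. The following is a minimal sketch on an invented three-row frame (the rows are illustrative and are not taken from `page_data.csv`) showing that only the exact `Template:` prefix is dropped.

```python
import pandas as pd

# Toy stand-in for page_data.csv; these rows are invented for illustration.
toy_pages = pd.DataFrame({
    "page": ["Template:Example-politician-stub", "template:lowercase example", "Yos Por"],
    "country": ["Zambia", "Chad", "Cambodia"],
})

# str.startswith is case-sensitive, so only the exact "Template:" prefix matches.
is_template = toy_pages["page"].str.startswith("Template:")
articles_only = toy_pages[~is_template]
print(articles_only["page"].tolist())
```

If lower-case prefixes could occur in the data, one option would be to compare against `toy_pages["page"].str.lower()` instead.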


Similarly, we exclude rows whose `Name` value is in all caps from the `wpds_data` dataset, since these rows give cumulative regional population counts rather than country-level counts. We keep the cumulative rows in a separate DataFrame for later use.

In [117]:
wpds_data_cumulative_df = wpds_data_df[(wpds_data_df['Name'].str.isupper())] # we still want to retain them
wpds_data_df = wpds_data_df[~(wpds_data_df['Name'].str.isupper())]
wpds_data_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000
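
As a small illustration of the all-caps split, the sketch below partitions a four-row frame with a single boolean mask. The names and populations are copied from the tables in this notebook; the frame itself is a toy stand-in for `wpds_data.csv`.

```python
import pandas as pd

# Toy stand-in for wpds_data.csv with two regional rows and two country rows.
toy_wpds = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "EAST ASIA"],
    "Population": [244344000, 44357000, 100803000, 1641063000],
})

# One mask, used twice, keeps the two partitions complementary.
is_region = toy_wpds["Name"].str.isupper()
regions = toy_wpds[is_region]
countries = toy_wpds[~is_region]

print(regions["Name"].tolist())
print(countries["Name"].tolist())
```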


## Requesting ORES to gather article quality predictions
We are using [ORES](https://ores.wikimedia.org/v3/#/) to get the predicted quality score for each article in the `page_data` dataset. 

In [118]:
# define the request headers and a helper function that calls the ORES API
req_headers = {
    "User-Agent": "https://github.com/jell0wed",
    "From": "jepoisso@uw.edu"
}

def api_call(endpoint, params):
    call = requests.get(endpoint.format(**params), headers=req_headers)
    resp = call.json()
    return resp

In [147]:
cache = {}

In [137]:
ores_articlequality_endpoint = "https://ores.wikimedia.org/v3/scores/{context}/{revid}/{model}"
ores_articlequality_params = {
    "context": "enwiki",
    "revid": None,
    "model": "articlequality"
}

def fetch_ores_score(rev_id):
    # reuse the cached prediction if this revision was already requested
    if rev_id in cache:
        return json.dumps(cache[rev_id]) if cache[rev_id] else None
    ores_articlequality_params["revid"] = rev_id
    resp = api_call(ores_articlequality_endpoint, ores_articlequality_params)
    article_quality_dict = resp['enwiki']['scores'][str(rev_id)]['articlequality']
    # when no 'score' entry is present, cache an empty dict and return None
    cache[rev_id] = {}
    if 'score' in article_quality_dict:
        cache[rev_id] = article_quality_dict['score']['probability']
    print("processed %d" % (len(cache)))
    return json.dumps(cache[rev_id]) if cache[rev_id] else None

# do not uncomment the following lines: they took 50 minutes to run; the pre-processed data can be found in `data/ores_data.csv`
#page_data_df["ores_score"] = page_data_df["rev_id"]
#page_data_df["ores_score"] = page_data_df["ores_score"].apply(fetch_ores_score)
#page_data_df.to_csv("data/ores_data.csv")
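
Requesting one revision per call is the main reason the run above is slow. The sketch below shows one possible way to batch revisions, assuming the ORES v3 scores endpoint accepts a pipe-separated `revids` query parameter together with `models`, and that the batched response has the same `enwiki` / `scores` shape as the single-revision response used above. The helper names (`chunked`, `build_batch_url`, `parse_batch_response`) are hypothetical and not part of the original notebook, and no particular batch size is documented here. No network call is made in this sketch.

```python
from urllib.parse import urlencode

# Assumed batched form of the ORES endpoint used above.
ORES_BATCH_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki"

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_batch_url(rev_ids, model="articlequality"):
    """Build one request URL covering several revision IDs."""
    query = urlencode({"models": model, "revids": "|".join(str(r) for r in rev_ids)})
    return ORES_BATCH_ENDPOINT + "?" + query

def parse_batch_response(resp, model="articlequality"):
    """Map each revision ID to its probability dict, or None if no score is present."""
    parsed = {}
    for rev_id, models in resp["enwiki"]["scores"].items():
        score = models[model].get("score")
        parsed[int(rev_id)] = score["probability"] if score else None
    return parsed

print(build_batch_url([355319463, 393276188]))
```

Each URL could then be passed to `requests.get` with the same headers as `api_call`, and the parsed dictionary merged into `cache`.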

In [202]:
page_data_df = pd.read_csv("data/ores_data.csv", index_col=0)

In [203]:
# filter out rows for which the prediction was not available and log them as a separate file
page_data_df_excl_df = page_data_df[page_data_df["ores_score"].isnull()]
page_data_df_excl_df.to_csv("data/ores_data_excluded.csv")

page_data_df = page_data_df[~(page_data_df["ores_score"].isnull())]
page_data_df.to_csv("data/ores_data_filtered.csv")

ORES predictions were unavailable for some rows. Those rows have been filtered out and saved to the `data/ores_data_excluded.csv` file.

## Merging datasets
Now that we have gathered the ORES data and set aside entries with no available prediction, we can merge the Wikipedia data with the population data.

We first merge the Wikipedia pages and their ORES predictions with the population data. Wikipedia page entries for which no matching country was found are filtered out and saved to `data/wp_wpds_countries-no_match.csv`.

The remaining merged entries (Wikipedia pages with country population) are reshaped to the following schema:

 - country
 - article_name
 - revision_id
 - article_quality_est
 - population

and can be found in `data/wp_wpds_politicians_by_country.csv`.

In [204]:
page_data_df

Unnamed: 0,page,country,rev_id,ores_score
1,Bir I of Kanem,Chad,355319463,"{""B"": 0.005643168767502225, ""C"": 0.00564142487..."
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,"{""B"": 0.008182590350064387, ""C"": 0.00889480723..."
12,Yos Por,Cambodia,393822005,"{""B"": 0.0215499089652523, ""C"": 0.0204201917590..."
23,Julius Gregr,Czech Republic,395521877,"{""B"": 0.006954685153294305, ""C"": 0.00686855097..."
24,Edvard Gregr,Czech Republic,395526568,"{""B"": 0.006954685153294305, ""C"": 0.00686855097..."
...,...,...,...,...
47191,Hal Bidlack,United States,807481636,"{""B"": 0.42610551787151274, ""C"": 0.461833143841..."
47192,Yahya Jammeh,Gambia,807482007,"{""B"": 0.07547654136000348, ""C"": 0.030938036288..."
47193,Lucius Fairchild,United States,807483006,"{""B"": 0.1484472077260258, ""C"": 0.7083028612242..."
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153,"{""B"": 0.1276617385878031, ""C"": 0.0411570565811..."


In [208]:
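# add a `country` column as the join key; assign() returns a new frame, which avoids
# writing into a filtered slice of the original data
wpds_data_df = wpds_data_df.assign(country=wpds_data_df['Name'])
# left-merge the Wikipedia page dataset with the country populations
merged_df = pd.merge(page_data_df, wpds_data_df, how="left", on="country")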

# remove entries for which there was no country match
merged_no_match_df = merged_df[merged_df['Name'].isnull()]
merged_no_match_df.to_csv("data/wp_wpds_countries-no_match.csv")

# keep only the rows with a country match; the re-shaping to the schema happens below
merged_df = merged_df[~(merged_df['Name'].isnull())]
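
The left merge above only reveals articles whose country has no population entry. As a variation, the sketch below uses an outer merge with `indicator=True`, which also surfaces countries that have a population entry but no articles. The article names and the country `Nowhereland` are invented for illustration; the populations are copied from the `wpds_data` table above.

```python
import pandas as pd

# Invented article rows; "Nowhereland" has no population entry.
pages = pd.DataFrame({
    "page": ["Person A", "Person B", "Person C"],
    "country": ["Algeria", "Tuvalu", "Nowhereland"],
})
# Samoa has a population entry but no article in the toy data.
populations = pd.DataFrame({
    "country": ["Algeria", "Tuvalu", "Samoa"],
    "Population": [44357000, 10000, 200000],
})

# indicator=True adds a `_merge` column with values left_only, right_only or both.
merged = pd.merge(pages, populations, how="outer", on="country", indicator=True)
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"]

print(no_match[["country", "_merge"]])
```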

In [209]:
# next we need to extract the best prediction of article quality from the `ores_score` column.
# the article quality prediction is the prediction which has the highest probability
def extract_article_quality(ores_json_score):
    ores_preds = json.loads(ores_json_score)
    quality = max(ores_preds, key=ores_preds.get)
    return quality

merged_df["article_quality_est"] = merged_df["ores_score"].apply(extract_article_quality)

In [210]:
# reshape the data to the correct schema
merged_df = merged_df.rename(columns={
    "page": "article_name",
    "rev_id": "revision_id",
    "Population": "population"
})
merged_df = merged_df.drop(columns=['ores_score', 'FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)'])
merged_df = merged_df[['country', 'article_name', 'revision_id', 'article_quality_est', 'population']]
merged_df.to_csv("data/wp_wpds_politicians_by_country.csv")

## Analysis

In [270]:
# coverage: the number of politician Wikipedia articles as a proportion of the
# country population
coverage_pop_df = merged_df.groupby("country").agg(
    article_count=pd.NamedAgg(column="article_name", aggfunc="count"),
    population=pd.NamedAgg(column="population", aggfunc="first")
)
coverage_pop_df = coverage_pop_df.reset_index()
coverage_pop_df['proportion'] = coverage_pop_df['article_count'] / coverage_pop_df['population']
coverage_pop_df

Unnamed: 0,country,article_count,population,proportion
0,Afghanistan,319,38928000.0,0.000008
1,Albania,456,2838000.0,0.000161
2,Algeria,116,44357000.0,0.000003
3,Andorra,34,82000.0,0.000415
4,Angola,106,32522000.0,0.000003
...,...,...,...,...
178,Venezuela,130,28645000.0,0.000005
179,Vietnam,187,96209000.0,0.000002
180,Yemen,116,29826000.0,0.000004
181,Zambia,25,18384000.0,0.000001
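
The raw proportions are hard to compare by eye because they are so small. One possible follow-up, sketched below on six rows copied from the table above, is to express coverage as articles per million inhabitants and rank the countries. The column name `articles_per_million` and the choice of three rows per ranking are assumptions made for this illustration.

```python
import pandas as pd

# Six rows copied from the coverage table above.
coverage = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Zambia"],
    "article_count": [319, 456, 116, 34, 106, 25],
    "population": [38928000, 2838000, 44357000, 82000, 32522000, 18384000],
})

# Articles per million inhabitants is easier to read than the raw ratio.
coverage["articles_per_million"] = (
    coverage["article_count"] / coverage["population"] * 1_000_000
)

top = coverage.nlargest(3, "articles_per_million")
bottom = coverage.nsmallest(3, "articles_per_million")
print(top[["country", "articles_per_million"]])
print(bottom[["country", "articles_per_million"]])
```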


In [238]:
# high-quality articles: the number of high-quality (FA, GA) politician Wikipedia
# articles as a proportion of the country population
high_quality_article_df = merged_df[merged_df['article_quality_est'].isin(["FA", "GA"])]
hq_pop_df = high_quality_article_df.groupby("country").agg(
    hq_count=pd.NamedAgg(column="article_name", aggfunc="count"),
    population=pd.NamedAgg(column="population", aggfunc="first")
)
hq_pop_df = hq_pop_df.reset_index()
hq_pop_df['proportion'] = hq_pop_df['hq_count'] / hq_pop_df['population']
hq_pop_df

Unnamed: 0,country,hq_count,population,proportion
0,Afghanistan,13,38928000.0,3.339499e-07
1,Albania,3,2838000.0,1.057082e-06
2,Algeria,2,44357000.0,4.508871e-08
3,Argentina,16,45377000.0,3.526015e-07
4,Armenia,5,2956000.0,1.691475e-06
...,...,...,...,...
141,Vanuatu,3,321000.0,9.345794e-06
142,Venezuela,3,28645000.0,1.047303e-07
143,Vietnam,13,96209000.0,1.351225e-07
144,Yemen,3,29826000.0,1.005834e-07
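
Because the frame is filtered before grouping, countries with no FA or GA articles drop out of the result entirely (Andorra, for example, appears in the coverage table but not in the table above). The sketch below shows one way to keep such countries with a count of zero by aggregating a boolean flag instead of filtering first. The article rows and the `is_high_quality` column are invented for this illustration; the populations are copied from the coverage table.

```python
import pandas as pd

# Invented article rows: Albania has two high-quality articles, Andorra has none.
articles = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania", "Andorra", "Andorra"],
    "article_name": ["A1", "A2", "A3", "B1", "B2"],
    "article_quality_est": ["GA", "Stub", "FA", "Start", "Stub"],
    "population": [2838000, 2838000, 2838000, 82000, 82000],
})

# Flag high-quality rows instead of filtering, so every country stays in the groupby.
articles["is_high_quality"] = articles["article_quality_est"].isin(["FA", "GA"])
hq = articles.groupby("country").agg(
    hq_count=("is_high_quality", "sum"),
    article_count=("article_name", "count"),
    population=("population", "first"),
).reset_index()
hq["proportion"] = hq["hq_count"] / hq["population"]
print(hq)
```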


In [271]:
# trick from https://stackoverflow.com/questions/68714674/select-all-below-rows-till-we-get-next-match-in-column-pandas-issue-resolved-bu
df = pd.read_csv("data/wpds_data.csv")
df = df[df['Type'].isin(["Sub-Region", "Country"])]
sub_regions = len(df.loc[df.Type == 'Sub-Region'])

# assign new ID to rows below until we find the next sub-region
df.loc[df.Type == 'Sub-Region', 'new_id'] = [n for n in range(1, sub_regions+1)]
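# forward-fill the sub-region id onto the country rows that follow it
# (fillna(method='ffill') is deprecated in recent pandas versions)
df = df.ffill()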

sub_region_mapping_df = df.groupby("new_id").agg(
    sub_region=pd.NamedAgg(column="Name", aggfunc="first"),
    sub_region_population=pd.NamedAgg("Population", aggfunc="first")
)
sub_region_mapping = sub_region_mapping_df.to_dict()

sub_region_mapping

# map the subregion to each country & sub-region population
df['sub_region'] = df['new_id'].apply(lambda x: sub_region_mapping['sub_region'][x])
df['sub_region_population'] = df['new_id'].apply(lambda x: sub_region_mapping['sub_region_population'][x])

# now merge with the wikipedia articles
merged_df_subregion = pd.merge(merged_df, df, left_on="country", right_on="Name")
merged_df_subregion = merged_df_subregion.drop(columns=["FIPS", "Name", "Type", "TimeFrame", "Data (M)", "Population", "new_id"])
coverage_pop_subregion_df = merged_df_subregion.groupby("sub_region").agg(
    article_count=pd.NamedAgg(column="article_name", aggfunc="count"),
    sub_region_population=pd.NamedAgg(column="sub_region_population", aggfunc="first")
)
coverage_pop_subregion_df['proportion'] = coverage_pop_subregion_df['article_count'] / coverage_pop_subregion_df['sub_region_population']
coverage_pop_subregion_df

sub_region,article_count,sub_region_population,proportion
CARIBBEAN,695,43233000,1.6e-05
CENTRAL AMERICA,1543,178611000,9e-06
CENTRAL ASIA,245,74961000,3e-06
Channel Islands,3763,172000,0.021878
EAST ASIA,2473,1641063000,2e-06
EASTERN AFRICA,2502,444970000,6e-06
EASTERN EUROPE,3732,291902000,1.3e-05
MIDDLE AFRICA,665,179757000,4e-06
NORTHERN AFRICA,899,244344000,4e-06
NORTHERN AMERICA,1901,368193000,5e-06
