# Assignment 2: Bias in Data
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

In [89]:
import pandas as pd
import requests
import numpy as np

## Data cleaning

In [84]:
# load dataframes
page_data_df = pd.read_csv("data/page_data.csv")
wpds_data_df = pd.read_csv("data/wpds_data.csv")

We first need to drop rows whose `page` value starts with `Template:` from the `page_data` dataset, because these pages are not Wikipedia articles.

In [85]:
page_data_df = page_data_df[~(page_data_df["page"].str.startswith("Template:"))]
page_data_df

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270
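The filter above can be checked on a small example. The following is a minimal sketch on a toy DataFrame; the rows and the name `toy_pages` are made up for illustration and do not come from `page_data.csv`. Note that `str.startswith` is case-sensitive, so only the exact prefix `Template:` is matched.

```python
import pandas as pd

# Toy stand-in for page_data.csv; these rows are invented for illustration.
toy_pages = pd.DataFrame({
    "page": ["Template:Example politicians", "Example Person", "Another Person"],
    "country": ["Chad", "Chad", "Cambodia"],
    "rev_id": [1, 2, 3],
})

# Boolean mask: True for rows whose page title starts with "Template:".
is_template = toy_pages["page"].str.startswith("Template:")

# Keep only the rows that are NOT templates.
toy_articles = toy_pages[~is_template].reset_index(drop=True)
print(toy_articles["page"].tolist())
```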


Similarly, we need to exclude the rows of the `wpds_data` dataset whose `Name` value is written in all caps, because these rows hold cumulative regional population counts rather than country-level counts. We still keep the regional rows in a separate DataFrame.

In [86]:
wpds_data_cumulative_df = wpds_data_df[(wpds_data_df['Name'].str.isupper())] # we still want to retain them
wpds_data_df = wpds_data_df[~(wpds_data_df['Name'].str.isupper())]
wpds_data_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000
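As a small illustration of the split above, here is a sketch on a toy DataFrame. The region names and the name `toy_wpds` are assumptions made for the example; only the `Name` column is modeled. `str.isupper` ignores spaces, so multi-word region names are still detected.

```python
import pandas as pd

# Toy stand-in for wpds_data.csv; the region names are assumed for illustration.
toy_wpds = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
})

# True where every cased character in Name is uppercase (regional rows).
is_region = toy_wpds["Name"].str.isupper()

toy_regions = toy_wpds[is_region]      # cumulative regional rows
toy_countries = toy_wpds[~is_region]   # country-level rows
print(toy_regions["Name"].tolist(), toy_countries["Name"].tolist())
```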


## Requesting ORES to gather article quality predictions
We use [ORES](https://www.mediawiki.org/wiki/ORES) to get the predicted quality probabilities for each article in the `page_data` dataset, sending one request per revision ID (`rev_id`).

In [87]:
# Define the request headers and a helper that calls the ORES API and returns the parsed JSON
req_headers = {
    "User-Agent": "https://github.com/jell0wed",
    "From": "jepoisso@uw.edu"
}

def api_call(endpoint, params):
    call = requests.get(endpoint.format(**params), headers=req_headers)
    resp = call.json()
    return resp

In [91]:
ores_articlequality_endpoint = "https://ores.wikimedia.org/v3/scores/{context}/{revid}/{model}"
ores_articlequality_params = {
    "context": "enwiki",
    "revid": None,
    "model": "articlequality"
}

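i = 0  # running count of processed rows
def fetch_ores_score(row):
    global i
    rev_id = row["rev_id"]
    ores_articlequality_params["revid"] = rev_id
    resp = api_call(ores_articlequality_endpoint, ores_articlequality_params)
    article_quality_dict = resp['enwiki']['scores'][str(rev_id)]['articlequality']
    # ORES returns an 'error' entry instead of 'score' when a revision cannot be scored
    if 'score' in article_quality_dict:
        row["ores_scores"] = article_quality_dict['score']['probability']
    else:
        row["ores_scores"] = None
    i += 1
    print("processed %d" % (i))
    return row

# apply() returns a new DataFrame, so the result has to be assigned back
page_data_df = page_data_df.apply(fetch_ores_score, axis=1)
page_data_df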

processed 1
processed 2
processed 3
...
processed 39
processed 40


KeyboardInterrupt: 
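The run above was interrupted after 40 rows, and one request per revision is slow for a dataset of this size. One possible alternative is to score several revisions per request. The sketch below only builds the request URLs and makes no network call. It assumes that ORES accepts a `revids` query parameter with IDs separated by `|` and a `models` query parameter on the `/v3/scores/{context}` path; the helper names `chunked` and `build_batch_url` and the batch size of 50 are hypothetical choices, not something the notebook defines.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_url(rev_ids, context="enwiki", model="articlequality"):
    """Build one ORES URL that asks for several revisions at once.

    Assumption: the service accepts `revids` as a '|'-separated list.
    """
    joined = "|".join(str(rev_id) for rev_id in rev_ids)
    return f"https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={joined}"


# Revision IDs taken from the page_data output shown earlier.
sample_rev_ids = [355319463, 393276188, 393822005, 395521877, 395526568]

# Batch size of 2 here only to keep the printed example short.
batch_urls = [build_batch_url(batch) for batch in chunked(sample_rev_ids, 2)]
for url in batch_urls:
    print(url)
```

Each URL could then be passed to `requests.get` with the same headers as `api_call`, and the per-revision entries read from the `scores` dictionary of the response.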

In [65]:
ores_articlequality_params["revid"] = "3945809923"
test = api_call(ores_articlequality_endpoint, ores_articlequality_params)
test['enwiki']['scores']["3945809923"]['articlequality']['error']

{'message': 'RevisionNotFound: Could not find revision ({revision}:3945809923)',
 'type': 'RevisionNotFound'}
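Since a response can carry either a `score` or an `error` entry, it can help to separate the parsing from the HTTP call. The following is a minimal sketch of such a parser. The function name `extract_quality` is hypothetical, the error payload is copied from the output above, and the probability values in the success payload are made up for illustration.

```python
def extract_quality(resp, rev_id, context="enwiki", model="articlequality"):
    """Return (probabilities, error_type) for one revision of an ORES response.

    Exactly one of the two values is None.
    """
    entry = resp[context]["scores"][str(rev_id)][model]
    if "score" in entry:
        return entry["score"]["probability"], None
    return None, entry.get("error", {}).get("type")


# Error payload as returned for the nonexistent revision above.
error_resp = {"enwiki": {"scores": {"3945809923": {"articlequality": {"error": {
    "message": "RevisionNotFound: Could not find revision ({revision}:3945809923)",
    "type": "RevisionNotFound",
}}}}}}

# Success payload with invented probabilities, for illustration only.
ok_resp = {"enwiki": {"scores": {"355319463": {"articlequality": {"score": {
    "probability": {"Stub": 0.9, "Start": 0.1},
}}}}}}

print(extract_quality(error_resp, 3945809923))
print(extract_quality(ok_resp, 355319463))
```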

In [None]:
page_data_df
i