# Assignment 2: Bias in Data
## Emily Linebarger

### 1. Data extraction

The two datasets I'll be using for this analysis are the Politicians by Country dataset from FigShare (https://figshare.com/articles/dataset/Untitled_Item/5513449) and the World Population Data Sheet (https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit#gid=283125346), from the Population Reference Bureau (https://www.prb.org/international/indicator/population/table/).

All data was downloaded on October 9, 2021 and was placed in the "raw" folder without edits. 

### 2. Data cleaning

In [152]:
import pandas as pd
import numpy as np

# First, clean the data on politicians by country
politicians = pd.read_csv("raw/country/data/page_data.csv")

In [153]:
politicians.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [154]:
politicians.shape

(47197, 3)

In [155]:
# All of the 'page' rows that start with "Template" are not Wikipedia articles, and should be dropped. 
mask = politicians.page.str.contains("^Template")
politicians[mask]

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
5,Template:Nigeria-politician-stub,Nigeria,391862819
...,...,...,...
44916,Template:New Zealand prime minister electoral ...,New Zealand,806286945
44966,Template:Current New Zealand political party l...,New Zealand,806301302
45587,Template:Lists of US Presidents and Vice Presi...,United States,806668141
45823,Template:Prime Ministers of Australia,Australia,806799996


In [156]:
politicians = politicians[~mask]
politicians.shape # This drops 496 rows. 

(46701, 3)

In [157]:
# Next, clean the population data. 
# There are some regional aggregates, which are distinguished by all-caps in the 'geography' field.
# These won't match the country strings in the politicians dataset, but they're important to keep around 
# to get regional aggregates. 
population = pd.read_csv('raw/WPDS_2020_data - WPDS_2020_data.csv.csv')
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [158]:
# Merge and save cleaned data 
population.rename(columns={'Name': 'country'}, inplace=True)
data = politicians.merge(population, on = 'country', how = 'outer')
data.head()

Unnamed: 0,page,country,rev_id,FIPS,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463.0,TD,Country,2019.0,16.877,16877000.0
1,Abdullah II of Kanem,Chad,498683267.0,TD,Country,2019.0,16.877,16877000.0
2,Salmama II of Kanem,Chad,565745353.0,TD,Country,2019.0,16.877,16877000.0
3,Kuri I of Kanem,Chad,565745365.0,TD,Country,2019.0,16.877,16877000.0
4,Mohammed I of Kanem,Chad,565745375.0,TD,Country,2019.0,16.877,16877000.0


In [159]:
# EMILY REVIEW THIS - this doesn't seem right. 
# Print out some summary statistics to show which rows couldn't be joined
# Which rows were only in the population dataset, and not in politicians? 
data.loc[(data.page == '') | (data.page == np.nan)].shape[0]

0

In [160]:
# Which rows were only in the politicians dataset, and not in population? 
data.loc[data.Population == np.nan].shape[0]

0

In [161]:
data.shape

(46752, 8)

### 3. Getting article quality predictions

To get article quality scores, I will use the ORES API, which uses a machine-learning model to attach a quality score to a given revision ID. 
Documentation is here: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model

For each group of revision IDs, I'll need to build up a URL string of the format: 
https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids=355319463%7C498683267
This queries the "enwiki" database (the content parameter), with the "articlequality" model (model parameter). 
From the API documentation, the database errors when more than 200 revision IDs are queried, so I'll query them in batches and write out temporary files. 

In [164]:
import requests
import json

def query_api_batch(start_idx, end_idx, data):
    # Get the revision IDs from the start to the end index
    rev_ids = data.rev_id[start_idx:end_idx].astype('int')
    rev_ids = rev_ids.astype('str')
    rev_ids = '|'.join(rev_ids.to_list())
    
    # Query the API
    r = requests.get(f"https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_ids}")
    
    # Manipulate the data to get the 'prediction' column for each ID
    data = json.loads(r.text)
    # Save this query output
    with open(f'api_queries_raw/{start_idx}_{end_idx}.txt', 'w') as outfile:
        json.dump(data, outfile)
        
    # Extract just the columns you need from the queries - prediction and revision ID
    cleaned_data = dict()
    for rev_id in data['enwiki']['scores'].keys():
        if 'error' in data['enwiki']['scores'][rev_id]['articlequality'].keys():
            score = np.nan
        else:
            score = data['enwiki']['scores'][rev_id]['articlequality']['score']['prediction']
        cleaned_data[rev_id] = score
    cleaned_data = pd.DataFrame({'rev_id': cleaned_data.keys(), 'score': cleaned_data.values()})
    cleaned_data.to_csv(f'cleaned_queries/{start_idx}_{end_idx}.csv')

In [165]:
# Iterate through the entire dataset, and save all query results
step_size = 50
for i in range(0, data.shape[0], step_size):
    start_idx = i # First start index will be 0, then 50, 100, etc.
    end_idx = i + (step_size - 1) # First end index will be 49, then 99, 149, etc. 
    if (end_idx > data.shape[0]):
        print("Reached the end!")
        end_idx = data.shape[0] # If you've reached the end, only query the remaining IDs available

    query_api_batch(start_idx, end_idx, data)
    print(f"Start at idx {start_idx}, end at idx {end_idx}")

Start at idx 0, end at idx 49
Start at idx 50, end at idx 99
Start at idx 100, end at idx 149
Start at idx 150, end at idx 199
Start at idx 200, end at idx 249
Start at idx 250, end at idx 299
Start at idx 300, end at idx 349
Start at idx 350, end at idx 399
Start at idx 400, end at idx 449
Start at idx 450, end at idx 499
Start at idx 500, end at idx 549
Start at idx 550, end at idx 599
Start at idx 600, end at idx 649
Start at idx 650, end at idx 699
Start at idx 700, end at idx 749
Start at idx 750, end at idx 799
Start at idx 800, end at idx 849
Start at idx 850, end at idx 899
Start at idx 900, end at idx 949
Start at idx 950, end at idx 999
Start at idx 1000, end at idx 1049
Start at idx 1050, end at idx 1099
Start at idx 1100, end at idx 1149
Start at idx 1150, end at idx 1199
Start at idx 1200, end at idx 1249
Start at idx 1250, end at idx 1299
Start at idx 1300, end at idx 1349
Start at idx 1350, end at idx 1399
Start at idx 1400, end at idx 1449
Start at idx 1450, end at idx 

Start at idx 11700, end at idx 11749
Start at idx 11750, end at idx 11799
Start at idx 11800, end at idx 11849
Start at idx 11850, end at idx 11899
Start at idx 11900, end at idx 11949
Start at idx 11950, end at idx 11999
Start at idx 12000, end at idx 12049
Start at idx 12050, end at idx 12099
Start at idx 12100, end at idx 12149
Start at idx 12150, end at idx 12199
Start at idx 12200, end at idx 12249
Start at idx 12250, end at idx 12299
Start at idx 12300, end at idx 12349
Start at idx 12350, end at idx 12399
Start at idx 12400, end at idx 12449
Start at idx 12450, end at idx 12499
Start at idx 12500, end at idx 12549
Start at idx 12550, end at idx 12599
Start at idx 12600, end at idx 12649
Start at idx 12650, end at idx 12699
Start at idx 12700, end at idx 12749
Start at idx 12750, end at idx 12799
Start at idx 12800, end at idx 12849
Start at idx 12850, end at idx 12899
Start at idx 12900, end at idx 12949
Start at idx 12950, end at idx 12999
Start at idx 13000, end at idx 13049
S

Start at idx 22800, end at idx 22849
Start at idx 22850, end at idx 22899
Start at idx 22900, end at idx 22949
Start at idx 22950, end at idx 22999
Start at idx 23000, end at idx 23049
Start at idx 23050, end at idx 23099
Start at idx 23100, end at idx 23149
Start at idx 23150, end at idx 23199
Start at idx 23200, end at idx 23249
Start at idx 23250, end at idx 23299
Start at idx 23300, end at idx 23349
Start at idx 23350, end at idx 23399
Start at idx 23400, end at idx 23449
Start at idx 23450, end at idx 23499
Start at idx 23500, end at idx 23549
Start at idx 23550, end at idx 23599
Start at idx 23600, end at idx 23649
Start at idx 23650, end at idx 23699
Start at idx 23700, end at idx 23749
Start at idx 23750, end at idx 23799
Start at idx 23800, end at idx 23849
Start at idx 23850, end at idx 23899
Start at idx 23900, end at idx 23949
Start at idx 23950, end at idx 23999
Start at idx 24000, end at idx 24049
Start at idx 24050, end at idx 24099
Start at idx 24100, end at idx 24149
S

ConnectionError: HTTPSConnectionPool(host='ores.wikimedia.org', port=443): Max retries exceeded with url: /v3/scores/enwiki?models=articlequality&revids=779489192%7C779489233%7C779489239%7C779489268%7C779489346%7C779489369%7C779489378%7C781617534%7C783185183%7C783189113%7C783347992%7C783443340%7C783498374%7C783696632%7C784387002%7C785582787%7C785954264%7C786521863%7C786582103%7C786608437%7C786710381%7C786851549%7C786852045%7C786930391%7C787092748%7C787165337%7C787165351%7C787351865%7C787358624%7C787481225%7C788111011%7C788301751%7C788469417%7C788477044%7C788480658%7C788620926%7C788621006%7C788621059%7C788648865%7C788969473%7C788969579%7C788969663%7C788969900%7C788969912%7C788969952%7C788970550%7C788970575%7C788971356%7C788971401 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffbf1f5bbe0>: Failed to establish a new connection: [Errno 12] Cannot allocate memory'))