# Assignment 2: Bias in Data
## Emily Linebarger

### 1. Data extraction

The two datasets I'll be using for this analysis are the Politicians by Country dataset from FigShare (https://figshare.com/articles/dataset/Untitled_Item/5513449) and the World Population Data Sheet (https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit#gid=283125346), from the Population Reference Bureau (https://www.prb.org/international/indicator/population/table/).

All data was downloaded on October 9, 2021 and was placed in the "raw" folder without edits. 

### 2. Data cleaning

In [185]:
import pandas as pd
import numpy as np

# First, clean the data on politicians by country
politicians = pd.read_csv("raw/country/data/page_data.csv")

In [186]:
politicians.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [187]:
politicians.shape

(47197, 3)

In [188]:
# All of the 'page' rows that start with "Template" are not Wikipedia articles, and should be dropped. 
mask = politicians.page.str.contains("^Template")
politicians[mask]

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
5,Template:Nigeria-politician-stub,Nigeria,391862819
...,...,...,...
44916,Template:New Zealand prime minister electoral ...,New Zealand,806286945
44966,Template:Current New Zealand political party l...,New Zealand,806301302
45587,Template:Lists of US Presidents and Vice Presi...,United States,806668141
45823,Template:Prime Ministers of Australia,Australia,806799996


In [189]:
politicians = politicians[~mask]
politicians.shape # This drops 496 rows. 
politicians.to_csv('clean/politicians.csv')
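
The template filter can be checked on a small hand-made table before running it on the full dataset. This is only a sketch: the `pages` frame below is toy data built from a few titles shown above, and it uses `str.startswith("Template:")` rather than the regular expression in the cell above. The two can differ for a title that begins with "Template" but is not in the Template namespace.

```python
import pandas as pd

# Toy data for illustration; titles are taken from the preview above
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem", "Yos Por"],
    "country": ["Uganda", "Chad", "Cambodia"],
})

# True for rows whose title is in the Template namespace
is_template = pages["page"].str.startswith("Template:")

# Keep only the rows that are actual articles
articles = pages[~is_template]
print(f"Dropped {is_template.sum()} of {len(pages)} rows")
```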

In [190]:
# Next, clean the population data. 
# There are some regional aggregates, which are distinguished by all-caps in the 'geography' field.
# These won't match the country strings in the politicians dataset, but they're important to keep around 
# to get regional aggregates. 
population = pd.read_csv('raw/WPDS_2020_data - WPDS_2020_data.csv.csv')
population = population.rename(columns={'Name':'country'}) # Rename to match politicians schema
population.head()

Unnamed: 0,FIPS,country,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [191]:
population.to_csv('clean/population.csv')

### 3. Getting article quality predictions

To get article quality scores, I will use the ORES API, which uses a machine-learning model to attach a quality score to a given revision ID. 
Documentation is here: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model

For each group of revision IDs, I'll need to build up a URL string of the format: 
https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids=355319463%7C498683267
This queries the "enwiki" wiki (the context parameter) with the "articlequality" model (the models parameter). 
According to the API documentation, the API returns an error when more than 200 revision IDs are queried at once, so I'll query them in batches and write out temporary files. 
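
The URL construction described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: `build_ores_urls` and `batch_size` are hypothetical names, and the default of 50 IDs per batch is an assumption that matches the step size used in the query loop. No request is sent; the function only builds the URL strings.

```python
from urllib.parse import quote

ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki"

def build_ores_urls(rev_ids, batch_size=50):
    """Split rev_ids into batches and return one query URL per batch."""
    urls = []
    for start in range(0, len(rev_ids), batch_size):
        batch = rev_ids[start:start + batch_size]  # slice end is exclusive
        joined = "|".join(str(int(r)) for r in batch)
        # quote() percent-encodes the pipe separator as %7C
        urls.append(f"{ORES_ENDPOINT}?models=articlequality&revids={quote(joined)}")
    return urls

# Three revision IDs from the politicians data, two per batch
urls = build_ores_urls([355319463, 498683267, 565745353], batch_size=2)
print(urls[0])
```

Building the URLs separately from sending them makes it easy to confirm that every revision ID lands in exactly one batch before any requests are made.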

In [192]:
import requests
import json
from datetime import datetime
import os

def query_api_batch(start_idx, end_idx, data, date):
    """Query ORES for rows start_idx (inclusive) to end_idx (exclusive) of `data`, saving results under `date`."""
    # Get the revision IDs for this batch (positional slice; end_idx is excluded)
    rev_ids = data.rev_id.iloc[start_idx:end_idx].astype('int').astype('str')
    rev_ids = '|'.join(rev_ids.to_list())
    
    # Make sure the output folders exist. `date` comes from the caller, so every
    # batch from a single run is saved to the same folder.
    os.makedirs(f"api_queries_raw/{date}", exist_ok = True)
    os.makedirs(f"cleaned_queries/{date}", exist_ok = True)
    
    # Query the API
    r = requests.get(f"https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_ids}")
    
    # Parse the JSON response (kept separate from the `data` argument)
    response = json.loads(r.text)
    # Save the raw query output
    with open(f'api_queries_raw/{date}/{start_idx}_{end_idx}.txt', 'w') as outfile:
        json.dump(response, outfile)
        
    # Extract just the fields needed from the response: revision ID and predicted quality class
    cleaned_data = dict()
    for rev_id, result in response['enwiki']['scores'].items():
        if 'error' in result['articlequality']:
            score = np.nan  # ORES could not score this revision
        else:
            score = result['articlequality']['score']['prediction']
        cleaned_data[rev_id] = score
    cleaned_data = pd.DataFrame({'rev_id': list(cleaned_data.keys()), 'score': list(cleaned_data.values())})
    cleaned_data.to_csv(f'cleaned_queries/{date}/{start_idx}_{end_idx}.csv')
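
The extraction step can be tried offline on a hand-written response. This is a sketch under stated assumptions: `extract_predictions` is a hypothetical helper, and `sample` is made-up data that only mirrors the nesting the function above reads (`enwiki`, then `scores`, then the revision ID, then `articlequality`). The error type and message are placeholders, not real ORES output.

```python
import numpy as np
import pandas as pd

def extract_predictions(response, context="enwiki", model="articlequality"):
    """Return a DataFrame of rev_id and predicted quality class (NaN on error)."""
    rows = []
    for rev_id, models in response[context]["scores"].items():
        result = models[model]
        # Revisions that could not be scored carry an 'error' key instead of 'score'
        prediction = np.nan if "error" in result else result["score"]["prediction"]
        rows.append({"rev_id": int(rev_id), "score": prediction})
    return pd.DataFrame(rows, columns=["rev_id", "score"])

# Made-up sample response, for illustration only
sample = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "516633096": {"articlequality": {"error": {"type": "ExampleError", "message": "example"}}},
        }
    }
}

parsed = extract_predictions(sample)
print(parsed)
```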

In [193]:
# First, read in past results. The API starts to reject requests after a certain number of queries, so I had
# to query in batches and save results to disk. 
# ** Note - for the first two runs on 10/9/2021 and 10/11/2021, I did not save the time. So I've given these 
# folders a time of midnight (00_00_00).
from pathlib import Path
all_dates = [x for x in Path('cleaned_queries').iterdir()]
previous_results = list()
for date in all_dates:
    previous_results.extend([x for x in date.iterdir() if x.is_file()])
print(f"Previous results found: {len(previous_results)}")

Previous results found: 1007


In [194]:
# Glob all of these results together 
wiki_codes = []

for filename in previous_results:
    df = pd.read_csv(filename)
    wiki_codes.append(df)

wiki_codes = pd.concat(wiki_codes, axis=0, ignore_index=True)

In [195]:
wiki_codes.head()

Unnamed: 0.1,Unnamed: 0,rev_id,score
0,0,699260156,
1,1,708813010,
2,2,715457941,
3,3,717369009,Stub
4,4,717927381,


In [196]:
wiki_codes.shape

(47783, 3)

In [197]:
# Merge these results onto data, so you only query lines that are missing 
data = pd.read_csv('clean/politicians.csv')
data['rev_id'] = np.round(data['rev_id'])
data.head()

Unnamed: 0.1,Unnamed: 0,page,country,rev_id
0,1,Bir I of Kanem,Chad,355319463
1,10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,12,Yos Por,Cambodia,393822005
3,23,Julius Gregr,Czech Republic,395521877
4,24,Edvard Gregr,Czech Republic,395526568


In [198]:
wiki_codes = wiki_codes[['rev_id', 'score']]
scored_data = data.merge(wiki_codes, on = 'rev_id', how = 'outer')
scored_data.head()

Unnamed: 0.1,Unnamed: 0,page,country,rev_id,score
0,1,Bir I of Kanem,Chad,355319463,Stub
1,10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,12,Yos Por,Cambodia,393822005,Stub
3,23,Julius Gregr,Czech Republic,395521877,Stub
4,24,Edvard Gregr,Czech Republic,395526568,Stub


In [199]:
# Save this data out
has_scores = scored_data.loc[~scored_data.score.isnull()]
has_scores.to_csv('clean/pages_with_scores.csv')

In [200]:
# Pull out the missing lines, and query the database for their scores. 
missing_scores = scored_data.loc[scored_data.score.isnull()]
missing_scores = missing_scores.drop_duplicates()
missing_scores.head()

Unnamed: 0.1,Unnamed: 0,page,country,rev_id,score
14,126,List of politicians in Poland,Poland,516633096,
25,222,Tingtingru,Vanuatu,550682925,
59,330,Daud Arsala,Afghanistan,627547024,
87,359,Book:Two Political Biographies,India,636911471,
196,514,Dilaver Bey,Turkey,669987106,


In [201]:
missing_scores.shape

(277, 5)

In [202]:
# Save the results you were unable to score to disk
missing_scores.to_csv('clean/unable_to_score_pages.csv')

In [203]:
# Iterate through the rows that are still missing scores, and save all query results
# There are 277 pages that couldn't be scored. Run this loop again if you find more than this. 
if missing_scores.shape[0] > 277:
    # Create a datetime string for data saving
    date = datetime.today().strftime("%Y_%m_%d_%H_%M_%S")
    os.makedirs(f"api_queries_raw/{date}", exist_ok = True)
    os.makedirs(f"cleaned_queries/{date}", exist_ok = True)
    
    # Iterate through missing data
    step_size = 50
    n_missing = missing_scores.shape[0]
    for start_idx in range(0, n_missing, step_size): # Start indices will be 0, then 50, 100, etc.
        # The slice in query_api_batch excludes end_idx, so end indices of 50, 100, 150, etc.
        # cover rows 0-49, 50-99, 100-149 without skipping any row.
        end_idx = min(start_idx + step_size, n_missing)
        if end_idx == n_missing:
            print("Reached the end!") # The last batch only queries the remaining IDs available

        query_api_batch(start_idx, end_idx, missing_scores, date)
        print(f"Start at idx {start_idx}, end at idx {end_idx}")
else:
    print("All pages have been scored!")

All pages have been scored!


### 4. Combining the datasets

Now, I'll merge the scored pages with the population data. 

In [222]:
scored_politicians = pd.read_csv('clean/pages_with_scores.csv')
population = pd.read_csv('clean/population.csv')

# Do an outer merge on the 'country' column, so nonmatching observations are kept.
results = scored_politicians.merge(population, on='country', how='outer')
results = results[['page', 'country', 'rev_id', 'score', 'FIPS', 'Type', 'TimeFrame', 'Data (M)', 'Population']]
results.head()

Unnamed: 0,page,country,rev_id,score,FIPS,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463.0,Stub,TD,Country,2019.0,16.877,16877000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,TD,Country,2019.0,16.877,16877000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,TD,Country,2019.0,16.877,16877000.0
3,Kuri I of Kanem,Chad,565745365.0,Stub,TD,Country,2019.0,16.877,16877000.0
4,Mohammed I of Kanem,Chad,565745375.0,Stub,TD,Country,2019.0,16.877,16877000.0


In [223]:
# Write to disk any rows that did not exist in both datasets 
no_match = results.loc[(results.Population.isnull()) | (results.score.isnull())]
no_match.to_csv("clean/wp_wpds_countries-no_match.csv")

In [225]:
# Save the results that did match.
match = results.loc[~results.rev_id.isin(no_match.rev_id)]
match = match[['country', 'page', 'rev_id', 'score', 'Population']]
match.columns = ['country', 'article_name', 'revision_id', 'article_quality_est', 'population']
match.to_csv("clean/wp_wpds_politicians_by_country.csv")
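
As a cross-check on the matched/unmatched split, pandas can label where each merged row came from. This is a sketch on toy data, not the analysis code: the two small frames are hypothetical stand-ins for the scored articles and the population table, and "Atlantis" is a deliberately fake country. It relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column. Unlike the null checks above, it splits rows by whether the country key appeared in both tables.

```python
import pandas as pd

# Toy stand-ins for the scored articles and the population table
articles = pd.DataFrame({
    "page": ["Page A", "Page B", "Page C"],
    "country": ["Chad", "Chad", "Atlantis"],
    "score": ["Stub", "GA", "Start"],
})
population = pd.DataFrame({
    "country": ["Chad", "Egypt"],
    "Population": [16877000, 100803000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = articles.merge(population, on="country", how="outer", indicator=True)

# Rows whose country appears in both tables
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
# Rows whose country appears in only one of the two tables
unmatched = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(matched)
print(unmatched)
```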

### 5. Analysis

### 6. Results