# A2: Bias In Data

The goal of this repository is to explore the concept of bias through data on Wikipedia articles, specifically articles on political figures from a variety of countries.

#### Table of Contents

  1. [Data Acquisition](#acquisition)
  2. [Data Cleaning and Processing](#cleaning)
  3. [Analysis and Results](#analysis)

In [1]:
import os
import json

import requests

import pandas as pd

from tqdm import tqdm_notebook as tqdm

<a id="acquisition"></a>

## Data Acquisition

We use two local data sources:
  1. The dataset of English Wikipedia articles in the "Category:Politicians by nationality" category (`./data/page_data.csv`)
  2. The 2018 population dataset (`./data/wikipedia_population_2018_data.csv`)

In [2]:
wiki_articles_df = pd.read_csv("./data/page_data.csv")
population_df = pd.read_csv("./data/wikipedia_population_2018_data.csv")

In [3]:
wiki_articles_df.head(2)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463


In [4]:
population_df.head(2)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7


Rename the columns, and convert the population from millions to an absolute count.

In [5]:
population_df.columns = ['country', 'population']
population_df["population"] = population_df["population"].apply(lambda s: s.replace(",", "")).apply(float)*1000000
population_df.head()

Unnamed: 0,country,population
0,AFRICA,1284000000.0
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
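The conversion above can be illustrated on plain strings. This is a minimal sketch: `millions_to_count` is a hypothetical helper name, and it assumes the raw values are strings expressed in millions that may contain thousands separators.

```python
def millions_to_count(raw):
    """Convert a population string in millions (e.g. '1,284') to an absolute count."""
    # Strip thousands separators before parsing, then scale from millions
    return float(raw.replace(",", "")) * 1_000_000


print(millions_to_count("1,284"))  # 1284000000.0
print(millions_to_count("6.5"))    # 6500000.0
```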


### Retrieving Article Quality

We also need a quality estimate for each article, for which we use the [ORES API](https://www.mediawiki.org/wiki/ORES).

This API returns a prediction which is one of the following categories:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article

The following code is adapted from the [A2 reference notebook](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb).

In [6]:
HEADERS = {'User-Agent': 'https://github.com/havanagrawal', 'From': 'agrawh@uw.edu'}

def get_ores_data(revision_ids, headers=HEADERS):
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
 
    params = {
        'project': 'enwiki',
        'model': 'wp10',
        'revids': '|'.join(str(x) for x in revision_ids)
    }
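    # Pass the identifying headers along with the request (previously accepted but unused)
    json_response = requests.get(endpoint.format(**params), headers=headers).json()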
    quality_predictions = []
    
    # Unpack predictions according to the response structure, which can be found in the reference notebook
    for key, value in json_response["enwiki"]["scores"].items():
        result_dict = value["wp10"]
        if "error" not in result_dict:
            prediction = {
                'rev_id': int(key),
                'prediction': result_dict["score"]["prediction"]
            }
            quality_predictions.append(prediction)
    
    return quality_predictions
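To make the unpacking step concrete, the sketch below applies the same logic to a hand-written dictionary. The sample mimics only the fields that `get_ores_data` reads; it is an assumption for illustration, not a real ORES response, and `extract_predictions` is a hypothetical name.

```python
def extract_predictions(json_response, project="enwiki", model="wp10"):
    """Return one {'rev_id', 'prediction'} dict per revision that was scored."""
    predictions = []
    for rev_id, models in json_response[project]["scores"].items():
        result = models[model]
        # Revisions that could not be scored carry an "error" key instead of "score"
        if "error" in result:
            continue
        predictions.append({
            "rev_id": int(rev_id),
            "prediction": result["score"]["prediction"],
        })
    return predictions


# Hand-written sample (an assumption, not actual API output)
sample = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "123": {"wp10": {"error": {"message": "illustrative error"}}},
        }
    }
}

print(extract_predictions(sample))
```

Separating the parsing from the HTTP call in this way would also let the unpacking be checked without network access.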

To minimize the number of API calls, we group revision IDs into chunks of 50 to 100 and call the API once per chunk.

To enable this, we use a quick recipe that yields n items at a time from a list.

In [7]:
def grouper(lst, n):
    """Yield successive chunks of at most n items from lst

    >>> list(grouper('ABCDEFG', 3))
    ['ABC', 'DEF', 'G']
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
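As a quick illustration of how the chunks feed the `revids` parameter, here is a small sketch. The revision IDs are made-up placeholders, and `grouper` is repeated so the block runs on its own.

```python
def grouper(lst, n):
    """Yield successive chunks of at most n items from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# Placeholder revision IDs, chosen only for illustration
chunks = list(grouper([101, 102, 103, 104, 105], 2))
print(chunks)  # [[101, 102], [103, 104], [105]]

# Each chunk becomes one pipe-separated 'revids' value, i.e. one API call
revids_params = ["|".join(str(x) for x in chunk) for chunk in chunks]
print(revids_params)  # ['101|102', '103|104', '105']
```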

If `article_quality.csv` already exists, the predictions are loaded from it instead of being retrieved from ORES, which can take up to 20 minutes depending on your connection speed.

In [8]:
QUALITY_PREDICTION_FILEPATH = "./data/article_quality.csv"
DOWNLOAD_PREDICTIONS = not os.path.exists(QUALITY_PREDICTION_FILEPATH)
DOWNLOAD_PREDICTIONS

False

If the output file doesn't already exist, retrieve all JSON results, concatenate them into a single pandas DataFrame, and save it. Otherwise, load the saved file.

In [9]:
if DOWNLOAD_PREDICTIONS:
    revision_ids = wiki_articles_df.rev_id.tolist()
    
    # Group revision IDs into chunks of 100
    grouped_ids = list(grouper(revision_ids, 100))
    
    # Get the article predictions from ORES in JSON format
    article_quality_json_data = [get_ores_data(subset) for subset in tqdm(grouped_ids)]
    
    # Convert the JSON data into DataFrames
    temp_dfs = [pd.DataFrame.from_records(json_subset) for json_subset in article_quality_json_data]
    
    # Concatenate and save the DataFrames
    article_quality_df = pd.concat(temp_dfs)
    article_quality_df.to_csv(QUALITY_PREDICTION_FILEPATH, index=False)
else:
    article_quality_df = pd.read_csv(QUALITY_PREDICTION_FILEPATH)

In [10]:
article_quality_df.head(2)

Unnamed: 0,prediction,rev_id
0,Stub,355319463
1,Stub,391862046


<a id="cleaning"></a>

## Data Cleaning and Processing

We want to find and report:
  1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
  2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
  3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
  4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

To achieve this, we need to merge the quality predictions with the original dataset. This can drop rows, since not every revision has a prediction (`get_ores_data` skips revisions for which ORES returns an error).

In [11]:
print("DataFrame Shape Before Merging\t", wiki_articles_df.shape)

wiki_articles_df = wiki_articles_df.merge(article_quality_df, on='rev_id')

print("DataFrame Shape After Merging\t", wiki_articles_df.shape)

DataFrame Shape Before Merging	 (47197, 3)
DataFrame Shape After Merging	 (47092, 4)
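To see which articles are lost in a merge like this one, pandas can flag unmatched rows. The following is a sketch on toy data (the page names and revision IDs are invented), using the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for wiki_articles_df and article_quality_df
articles = pd.DataFrame({
    "page": ["A", "B", "C"],
    "rev_id": [1, 2, 3],
})
quality = pd.DataFrame({
    "rev_id": [1, 3],
    "prediction": ["Stub", "GA"],
})

# A left merge keeps every article; '_merge' records where each row was found
merged = articles.merge(quality, on="rev_id", how="left", indicator=True)
unscored = merged[merged["_merge"] == "left_only"]
print(unscored["page"].tolist())  # ['B']
```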


We can now group by country and count:
 1. The total number of articles
 2. The total number of high-quality articles

where high quality is defined as a prediction of either "FA" or "GA".

In [12]:
def is_high_quality(s):
    return s == "FA" or s == "GA"

In [13]:
high_quality_only = wiki_articles_df[wiki_articles_df.prediction.apply(is_high_quality)]
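An equivalent, vectorized way to build the same filter is `Series.isin`. This is only a sketch on a toy series, not a change to the approach above.

```python
import pandas as pd

predictions = pd.Series(["FA", "Stub", "GA", "C"])

# True where the prediction is one of the high-quality classes
mask = predictions.isin(["FA", "GA"])
print(mask.tolist())  # [True, False, True, False]

# Same result as applying the element-wise check
same = predictions.apply(lambda s: s == "FA" or s == "GA")
print(mask.equals(same))  # True
```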

Counting the total number of articles by country:

In [14]:
all_article_counts = pd.DataFrame(wiki_articles_df.groupby('country').count()['rev_id'])
all_article_counts = all_article_counts.reset_index()
all_article_counts.columns = ['country', 'all_article_counts']
all_article_counts.head()

Unnamed: 0,country,all_article_counts
0,Abkhazia,16
1,Afghanistan,326
2,Albania,460
3,Algeria,119
4,Andorra,34


Similarly, counting the total number of high-quality (HQ) articles by country

In [15]:
hq_article_counts = pd.DataFrame(high_quality_only.groupby('country').count()['rev_id'])
hq_article_counts = hq_article_counts.reset_index()
hq_article_counts.columns = ['country', 'hq_article_counts']
hq_article_counts.head()

Unnamed: 0,country,hq_article_counts
0,Abkhazia,1
1,Afghanistan,10
2,Albania,4
3,Algeria,2
4,Argentina,15


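We can now merge the three datasets: population, all-article counts, and high-quality article counts. Both merges are inner joins (the `pd.merge` default), so only countries present in all three datasets remain. In particular, countries with no high-quality articles are dropped, as are countries whose names differ between the datasets.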

In [16]:
temp = pd.merge(hq_article_counts, all_article_counts, on='country')
final_df = pd.merge(temp, population_df, on='country')
final_df.head()

Unnamed: 0,country,hq_article_counts,all_article_counts,population
0,Afghanistan,10,326,36500000.0
1,Albania,4,460,2900000.0
2,Algeria,2,119,42700000.0
3,Argentina,15,496,44500000.0
4,Armenia,5,198,3000000.0
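If countries without any high-quality articles should stay in the analysis, one option is a left merge followed by filling the missing counts with zero. This is a sketch of that alternative on toy data (the country names "X" and "Y" are placeholders), not what the cell above does.

```python
import pandas as pd

all_counts = pd.DataFrame({"country": ["X", "Y"], "all_article_counts": [10, 5]})
hq_counts = pd.DataFrame({"country": ["X"], "hq_article_counts": [2]})

# Keep every country that has articles; a missing HQ count means zero HQ articles
combined = all_counts.merge(hq_counts, on="country", how="left")
combined["hq_article_counts"] = combined["hq_article_counts"].fillna(0).astype(int)
print(combined["hq_article_counts"].tolist())  # [2, 0]
```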


Finally, we compute the number of articles (all and high-quality) per person.

In [17]:
final_df["hq_articles_per_pop"] = final_df.hq_article_counts / final_df.population
final_df["all_articles_per_pop"] = final_df.all_article_counts / final_df.population
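The goals listed earlier also mention high-quality articles as a proportion of all articles from a country. A minimal sketch of that ratio follows, using two rows whose counts are copied from the tables in this notebook; the column name `hq_proportion` is a hypothetical choice.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["India", "Tuvalu"],
    "hq_article_counts": [14, 5],
    "all_article_counts": [986, 55],
})

# Share of a country's politician articles that are predicted FA or GA
df["hq_proportion"] = df["hq_article_counts"] / df["all_article_counts"]
ranked = df.sort_values("hq_proportion", ascending=False)
print(ranked["country"].tolist())  # ['Tuvalu', 'India']
```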

In [26]:
final_df.sample(5, random_state=42)

Unnamed: 0,country,hq_article_counts,all_article_counts,population,hq_articles_per_pop,all_articles_per_pop
117,Spain,34,879,46700000.0,7.280514e-07,1.9e-05
19,Burundi,1,76,11800000.0,8.474576e-08,6e-06
82,Mauritania,5,52,4500000.0,1.111111e-06,1.2e-05
97,Panama,5,109,4200000.0,1.190476e-06,2.6e-05
56,Iran,11,826,81600000.0,1.348039e-07,1e-05


Save the results to a CSV, so that advanced analysis can be performed independently.

In [19]:
final_df.to_csv("./data/article_quality_with_population.csv", index=False)

We can now report the desired metrics.

<a id="analysis"></a>

## Analysis and Results

In each cell below, `.reset_index().drop('index', axis=1)` renumbers the rows from 0 after sorting, so the row label reflects the rank.
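The same renumbering can be written more compactly with `reset_index(drop=True)`. The sketch below, on toy data, checks that the two forms give identical results.

```python
import pandas as pd

df = pd.DataFrame({"country": ["X", "Y", "Z"], "value": [2, 3, 1]})
ranked = df.sort_values("value", ascending=False)

# Form used in this notebook: move the old index to a column, then drop it
a = ranked.reset_index().drop("index", axis=1)
# Shorter equivalent: discard the old index directly
b = ranked.reset_index(drop=True)

print(a.equals(b))            # True
print(b["country"].tolist())  # ['Y', 'X', 'Z']
```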

#### 1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [20]:
final_df.sort_values("all_articles_per_pop", ascending=False).head(10).reset_index().drop('index', axis=1)

Unnamed: 0,country,hq_article_counts,all_article_counts,population,hq_articles_per_pop,all_articles_per_pop
0,Tuvalu,5,55,10000.0,0.0005,0.0055
1,Tonga,1,63,100000.0,1e-05,0.00063
2,Iceland,2,206,400000.0,5e-06,0.000515
3,Grenada,1,36,100000.0,1e-05,0.00036
4,Luxembourg,1,180,600000.0,2e-06,0.0003
5,Fiji,1,199,900000.0,1e-06,0.000221
6,Maldives,2,84,400000.0,5e-06,0.00021
7,Vanuatu,3,60,300000.0,1e-05,0.0002
8,Dominica,1,12,70000.0,1.4e-05,0.000171
9,New Zealand,12,790,4900000.0,2e-06,0.000161


#### 2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population


In [21]:
final_df.sort_values("all_articles_per_pop", ascending=True).head(10).reset_index().drop('index', axis=1)

Unnamed: 0,country,hq_article_counts,all_article_counts,population,hq_articles_per_pop,all_articles_per_pop
0,India,14,986,1371300000.0,1.020929e-08,7.190257e-07
1,Indonesia,8,214,265200000.0,3.016591e-08,8.069382e-07
2,China,33,1135,1393800000.0,2.367628e-08,8.143206e-07
3,Uzbekistan,1,29,32900000.0,3.039514e-08,8.81459e-07
4,Ethiopia,1,105,107500000.0,9.302326e-09,9.767442e-07
5,"Korea, North",7,39,25600000.0,2.734375e-07,1.523437e-06
6,Thailand,3,112,66200000.0,4.531722e-08,1.691843e-06
7,Bangladesh,3,323,166400000.0,1.802885e-08,1.941106e-06
8,Vietnam,13,191,94700000.0,1.372756e-07,2.016895e-06
9,Sudan,1,98,41700000.0,2.398082e-08,2.35012e-06


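#### 3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Note: the cell below sorts by `hq_articles_per_pop`, that is, high-quality articles per capita, rather than by the ratio `hq_article_counts / all_article_counts` named in the heading.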

In [22]:
final_df.sort_values("hq_articles_per_pop", ascending=False).head(10).reset_index().drop('index', axis=1)

Unnamed: 0,country,hq_article_counts,all_article_counts,population,hq_articles_per_pop,all_articles_per_pop
0,Tuvalu,5,55,10000.0,0.0005,0.0055
1,Dominica,1,12,70000.0,1.4e-05,0.000171
2,Vanuatu,3,60,300000.0,1e-05,0.0002
3,Grenada,1,36,100000.0,1e-05,0.00036
4,Tonga,1,63,100000.0,1e-05,0.00063
5,Maldives,2,84,400000.0,5e-06,0.00021
6,Iceland,2,206,400000.0,5e-06,0.000515
7,Ireland,24,381,4900000.0,5e-06,7.8e-05
8,Bhutan,3,33,800000.0,4e-06,4.1e-05
9,Israel,21,497,8500000.0,2e-06,5.8e-05


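#### 4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Note: the cell below sorts by `hq_articles_per_pop`, that is, high-quality articles per capita, rather than by the ratio `hq_article_counts / all_article_counts` named in the heading.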

In [23]:
final_df.sort_values("hq_articles_per_pop", ascending=True).head(10).reset_index().drop('index', axis=1)

Unnamed: 0,country,hq_article_counts,all_article_counts,population,hq_articles_per_pop,all_articles_per_pop
0,Ethiopia,1,105,107500000.0,9.302326e-09,9.767442e-07
1,India,14,986,1371300000.0,1.020929e-08,7.190257e-07
2,Brazil,3,551,209400000.0,1.432665e-08,2.631328e-06
3,Nigeria,3,682,195900000.0,1.531394e-08,3.481368e-06
4,Tanzania,1,408,59100000.0,1.692047e-08,6.903553e-06
5,Bangladesh,3,323,166400000.0,1.802885e-08,1.941106e-06
6,China,33,1135,1393800000.0,2.367628e-08,8.143206e-07
7,Sudan,1,98,41700000.0,2.398082e-08,2.35012e-06
8,Morocco,1,208,35200000.0,2.840909e-08,5.909091e-06
9,Indonesia,8,214,265200000.0,3.016591e-08,8.069382e-07
