# Assignment 2: Bias in Data

## Analyzing Wikipedia articles on political figures from different countries.

### Matthew Blake

The main objective of this notebook is to provide a detailed walkthrough of the steps involved in analysing the quality of Wikipedia articles on political figures from various countries. The data used in the analysis come from two sources:

- Politicians by Country from the English-language Wikipedia dataset (source: Figshare)
- 2020 World Population Data Sheet by the Population Reference Bureau (source: Dropbox)

Article quality is assessed using ORES (Objective Revision Evaluation Service), a web service and API, maintained by the Scoring Platform team, that provides machine learning as a service for Wikimedia projects.

The notebook is organized into the following sections:

1. Getting Article and Population data
2. Getting Article quality scores using the ORES API
3. Creating a final dataset combining the article and population data
4. Analysis and Results

In [59]:
# Loading the necessary packages
import json
import string
import pandas as pd
import numpy as np
import requests

Defining generic functions to read the datasets, access the ORES service, and modify the population dataset.

In [52]:
def get_data(file):
    """
    Function to retrieve the specified file as a dataframe.
    The file passed as a parameter must be a .csv file.
    Args:
        file(str): Path of the target file
    Returns:
        pandas.DataFrame
    Raises:
        FileNotFoundError: If file doesn't exist.
    """
    try:
        df = pd.read_csv(file,thousands=',')
        return df
    except FileNotFoundError:
        raise FileNotFoundError("The file {} does not exist".format(file))
        
HEADERS = {'User-Agent' : 'https://github.com/mcb2016', 'From' : 'mcb2016@uw.edu'}

def get_ores_data(revision_ids):
    """
    Function to retrieve the quality scores from ORES API.
    Args:
        revision_ids (list:int): The revision ids of the articles to score.
        The request headers are taken from the module-level HEADERS constant.
    Returns:
        rev_score_arr(list:dict) : List of dictionaries with revision_id:score pairs.
        error_revs(list:int) : List of revision ids for which we didn't get the score.
    """
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {'project' : 'enwiki',
              'model'   : 'articlequality',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params), headers=HEADERS)
    response = api_call.json()
    
    # Stripping out the scores in the predictions.
    rev_score_arr = []
    error_revs = []
    for rev_id in revision_ids:
        try:
            score = response['enwiki']["scores"][str(rev_id)]["articlequality"]["score"]["prediction"]
            rev_score_arr.append({'rev_id':rev_id,
                                  'score':score})
        except (KeyError, TypeError):
            # Storing the rev_ids for which we couldn't get any score.
            error_revs.append(rev_id)
    return rev_score_arr, error_revs

def is_all_uppercase(a_str):
    for c in a_str:
        if c not in string.ascii_uppercase and c != ' ':
            return False
    return True
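
To make the parsing step in get_ores_data easier to follow, here is a minimal offline sketch of the nested lookup it performs. The sample response is hand-written: the two revision ids come from this notebook, but the error entry (including its "RevisionNotFound" type) and the helper name extract_predictions are assumptions made for illustration, not output captured from the API.

```python
# Hand-written sample mimicking the response layout that get_ores_data expects.
# The error entry below is an illustrative assumption, not real API output.
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
            "516633096": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

def extract_predictions(response, revision_ids, project="enwiki", model="articlequality"):
    """Hypothetical helper: split revision ids into scored and failed ones."""
    scored, failed = [], []
    scores = response.get(project, {}).get("scores", {})
    for rev_id in revision_ids:
        # Revision ids are string keys in the JSON payload.
        entry = scores.get(str(rev_id), {}).get(model, {})
        prediction = entry.get("score", {}).get("prediction")
        if prediction is None:
            # Either the revision is missing or it carries an error instead of a score.
            failed.append(rev_id)
        else:
            scored.append({"rev_id": rev_id, "score": prediction})
    return scored, failed

scored, failed = extract_predictions(sample_response, [355319463, 516633096, 123])
```

Using .get with defaults avoids a try/except around the lookup, at the cost of treating every missing key the same way.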

## Section 1: Getting Article and Population data

The population data used here is obtained from the Population Reference Bureau's 2020 estimates. The link can be found in the project description given above and also in the readme. For convenience, I have downloaded the file and stored it in the raw_data folder under the name WPDS_2020_data.csv.

In [53]:
population_df = get_data('raw_data\\WPDS_2020_data.csv')
# An initial look at the data
population_df.head(20)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
8,TN,Tunisia,Country,2019,11.896,11896000
9,EH,Western Sahara,Country,2019,0.597,597000


From the output above, we see that individual countries are listed under their parent geographical regions, such as their respective continents. Let's check for these parent-level regions. We can filter for them by searching for location names written in all caps.

In [54]:
population_df[population_df["Name"].str.isupper()]

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000


There are 24 regions in total. Let's modify the population data such that each country is associated with its parent region.

In [55]:
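# The following function produces the list of regions associated with each country,
# which we then add to the world population dataset.

def get_regions(wp_regions):
    # Use a local list so that re-running the cell does not keep appending to old results.
    regions = []
    current_region = None
    for name in wp_regions:
        # An all-uppercase name marks the start of a new region block.
        if is_all_uppercase(name):
            current_region = name
        # Region rows map to themselves; country rows inherit the latest region seen.
        regions.append(current_region)
    return regions

population_df['Region'] = get_regions(population_df['Name'])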
population_df.head(20)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.85,7772850000,WORLD
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
5,LY,Libya,Country,2019,6.891,6891000,NORTHERN AFRICA
6,MA,Morocco,Country,2019,35.952,35952000,NORTHERN AFRICA
7,SD,Sudan,Country,2019,43.849,43849000,NORTHERN AFRICA
8,TN,Tunisia,Country,2019,11.896,11896000,NORTHERN AFRICA
9,EH,Western Sahara,Country,2019,0.597,597000,NORTHERN AFRICA
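
The same region assignment can also be sketched without an explicit loop. This is an alternative under the assumption that every region name is entirely uppercase and every country name is not; the helper name assign_regions and the short sample are illustrative only.

```python
import pandas as pd

def assign_regions(names: pd.Series) -> pd.Series:
    """Hypothetical helper: map each row to the most recent all-caps name above it."""
    is_region = names.str.isupper()
    # Keep region names, blank out country names, then carry the last region forward.
    return names.where(is_region).ffill()

sample_names = pd.Series(
    ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"]
)
sample_regions = assign_regions(sample_names)
```

As in the loop version, a region row is assigned its own name, and each country receives the nearest region listed before it.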


### Wikipedia Article Data

The article dataset is the metadata of articles on politicians by country published on Wikipedia. The link is provided in the description above as well as in the readme. The data is also stored locally as page_data.csv in the raw_data folder.

In [56]:
page_df = get_data('raw_data\\page_data.csv')
page_df.head(10)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
5,Template:Nigeria-politician-stub,Nigeria,391862819
6,Template:Colombia-politician-stub,Colombia,391863340
7,Template:Chile-politician-stub,Chile,391863361
8,Template:Fiji-politician-stub,Fiji,391863617
9,Template:Solomons-politician-stub,Solomon Islands,391863809


As seen above, there are pages that are denoted by the 'Template:' substring. These pages are not Wikipedia articles, and should not be included in our analyses.

In [57]:
new_page_df = page_df[~page_df.page.str.startswith("Template:")]

new_page_df.head(10)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
25,Robert Douglas Cook,Canada,401577829
44,List of Grand Viziers of Egypt,Egypt,442937236
105,Sehba Musharraf,Pakistan,448555418
111,Butler-Belmont family,United States,470173494
114,List of Canadian incumbents by year,Canada,477962574


## Section 2: Getting the article scores using ORES.

As mentioned above, ORES is a machine learning service that returns the quality of an article as one of the following levels, from highest to lowest:

- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

Here, the classes FA and GA mark articles that are deemed high quality. To retrieve the scores from ORES, we need to provide a revision ID (the third column in page_data.csv) and the machine learning model (articlequality).
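
Because the quality levels are ordered, they can be represented as an ordered categorical so that "GA or better" becomes a simple comparison. This is a small sketch on made-up scores; the names QUALITY_ORDER and sample_scores are illustrative and not part of the notebook.

```python
import pandas as pd

# Quality levels from lowest to highest, as listed above.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]
quality_dtype = pd.CategoricalDtype(categories=QUALITY_ORDER, ordered=True)

# Made-up predictions, used only to illustrate the comparison.
sample_scores = pd.Series(["Stub", "GA", "B", "FA", "Start"]).astype(quality_dtype)

# True for articles rated GA or FA.
is_high_quality = sample_scores >= "GA"
```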

According to the API docs, ORES handles at most about 50 revision IDs per request, so we need to send the revision IDs in chunks to avoid hitting the rate limiter.
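
One way to guarantee that no request exceeds the limit is to slice the list of revision IDs into fixed-size batches. The sketch below assumes a batch size of 50; the helper name batched_rev_ids is hypothetical.

```python
def batched_rev_ids(rev_ids, batch_size=50):
    """Hypothetical helper: yield consecutive slices of at most batch_size ids."""
    for start in range(0, len(rev_ids), batch_size):
        yield rev_ids[start:start + batch_size]

# 120 placeholder ids split into batches of 50, 50 and 20.
example_ids = list(range(120))
batch_sizes = [len(batch) for batch in batched_rev_ids(example_ids)]
```

Unlike splitting into a fixed number of chunks, this keeps the batch size constant regardless of how many rows the dataset has.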

In [58]:
## There are ~47,000 rows, so we divide the data into 1000 chunks of roughly 47 rows each (below the limit of 50).
revision_score_arr = []
revs_with_error = []
for i, chunk in enumerate(np.array_split(new_page_df, 1000)):
    if (i+1)%100 == 0:
        print("Chunk: {}".format(i))
    rev_ids = chunk['rev_id'].tolist()
    # Getting the scores and storing the results in arrays.
    rev_score_chunk, error_revs = get_ores_data(rev_ids)
    revision_score_arr.extend(rev_score_chunk)
    revs_with_error.extend(error_revs)

Chunk: 99
Chunk: 199
Chunk: 299
Chunk: 399
Chunk: 499
Chunk: 599
Chunk: 699
Chunk: 799
Chunk: 899
Chunk: 999


In [9]:
len(revision_score_arr), len(revs_with_error)

(46425, 276)

We got scores for all but 276 of the articles in our Wikipedia dataset.

We now need to merge this result with the article data and then combine that with the population data to get the final analytical dataset.

In [10]:
# Creating dataframe containing revision id and scores.
revision_score_df = pd.DataFrame(revision_score_arr)

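# Combining this with the article data.
new_page_w_scores_df = new_page_df.merge(revision_score_df, on='rev_id')
# Note: this previews new_page_df, which has no 'score' column; the merged result is new_page_w_scores_df.
new_page_df.head(10)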

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
25,Robert Douglas Cook,Canada,401577829
44,List of Grand Viziers of Egypt,Egypt,442937236
105,Sehba Musharraf,Pakistan,448555418
111,Butler-Belmont family,United States,470173494
114,List of Canadian incumbents by year,Canada,477962574


We should also note the articles that dropped out of this merge because no revision score was assigned to them.

In [11]:
# Creating dataframe containing articles that didn't receive a score
no_score_df = pd.DataFrame(revs_with_error)
no_score_df = no_score_df.rename(columns={ 0 : 'rev_id'})

# Combining this with the article data to determine which articles didn't receive scores
articles_no_score = new_page_df.merge(no_score_df, on='rev_id')
articles_no_score.head()

Unnamed: 0,page,country,rev_id
0,List of politicians in Poland,Poland,516633096
1,Tingtingru,Vanuatu,550682925
2,Daud Arsala,Afghanistan,627547024
3,Book:Two Political Biographies,India,636911471
4,Dilaver Bey,Turkey,669987106


Before merging the population and page data, let's find out which countries in the population data don't have a corresponding article in the page data, then write those countries to a csv file in the final_data folder.

In [50]:
excluded_data = population_df[~population_df.Name.isin(new_page_df.country)]

# We'll write the data set containing countries/regions with no associated page to a csv file
excluded_data.to_csv("final_data\\wp_wpds_countries-no_match.csv")
excluded_data.shape[0]

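AttributeError: 'DataFrame' object has no attribute 'Name'

This error most likely arose because the cell was re-run (In [50]) after the Name column had been renamed to country in the step below. When the notebook is run from top to bottom, population_df.Name still exists at this point.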
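
The mismatch check can also be done in both directions with a single outer merge. The sketch below uses toy data: the country and page names appear in this notebook, but which of them lack a match is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for population_df and new_page_df; the mismatches are invented.
toy_population = pd.DataFrame({"Name": ["Chad", "Egypt", "Western Sahara"]})
toy_pages = pd.DataFrame({
    "page": ["Bir I of Kanem", "List of Grand Viziers of Egypt", "Butler-Belmont family"],
    "country": ["Chad", "Egypt", "United States"],
})

# One row per country that has at least one page.
page_countries = toy_pages[["country"]].drop_duplicates()

# indicator=True adds a '_merge' column saying which side each row came from.
joined = toy_population.merge(
    page_countries, left_on="Name", right_on="country", how="outer", indicator=True
)

# Locations in the population data with no pages, and vice versa.
population_only = joined.loc[joined["_merge"] == "left_only", "Name"].tolist()
pages_only = joined.loc[joined["_merge"] == "right_only", "country"].tolist()
```

The right_only rows are the ones whose articles would be lost in an inner merge on country.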

Now we need to merge the scored article data with the population data.

In [13]:
# First, let's rename the 'Name' column of the population data with 'country'
population_df = population_df.rename(columns={'Name' : 'country'})
population_df.head()

Unnamed: 0,FIPS,country,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.85,7772850000,WORLD
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA


In [14]:
# Now we'll merge the population and article datasets on the 'country' column
combined_data = new_page_w_scores_df.merge(population_df, on='country')
combined_data.head()

Unnamed: 0,page,country,rev_id,score,FIPS,Type,TimeFrame,Data (M),Population,Region
0,Bir I of Kanem,Chad,355319463,Stub,TD,Country,2019,16.877,16877000,MIDDLE AFRICA
1,Abdullah II of Kanem,Chad,498683267,Stub,TD,Country,2019,16.877,16877000,MIDDLE AFRICA
2,Salmama II of Kanem,Chad,565745353,Stub,TD,Country,2019,16.877,16877000,MIDDLE AFRICA
3,Kuri I of Kanem,Chad,565745365,Stub,TD,Country,2019,16.877,16877000,MIDDLE AFRICA
4,Mohammed I of Kanem,Chad,565745375,Stub,TD,Country,2019,16.877,16877000,MIDDLE AFRICA


The final dataset doesn't appear to have any problems. We lost a few hundred articles because several locations had no matching name in the population dataset. Now we should save this data to ensure reproducibility.

In [15]:
combined_data.to_csv('final_data\\wp_wpds_politicians_by_country.csv', index=False)

## Section 3: Analysis of articles

The analysis involves processing the final data to find the number of articles per person for each country and examining the proportion of high-quality articles. The steps for the per-person ranking are:

- Find the number of articles for each country.
- Divide by the population to get the number of articles per person.
- Sort the values in decreasing order.

In [16]:
# Reading the analytical dataset.
score_with_population_data = get_data('final_data\\wp_wpds_politicians_by_country.csv')

# Finding the number of articles by country
articles_by_country = score_with_population_data.groupby(['country','Population']).agg('count')['page'].reset_index()
# Renaming the column to reflect the value it contains.
articles_by_country = articles_by_country.rename(columns={'page':'num_article'})

# Finding the number of articles per person
articles_by_country["article_per_person"] = articles_by_country['num_article'] / articles_by_country['Population']
articles_by_country = articles_by_country.sort_values(by='article_per_person', ascending=False)

Q.1 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [17]:
print(articles_by_country['country'].head(10))

169                            Tuvalu
117                             Nauru
138                        San Marino
110                            Monaco
95                      Liechtenstein
104                  Marshall Islands
164                             Tonga
70                            Iceland
3                             Andorra
52     Federated States of Micronesia
Name: country, dtype: object


Q.2 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population.

In [18]:
print(articles_by_country['country'].tail(10)[::-1])

71            India
72        Indonesia
34            China
176      Uzbekistan
51         Ethiopia
181          Zambia
84     Korea, North
162        Thailand
114      Mozambique
13       Bangladesh
Name: country, dtype: object


Q.3 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [19]:
# Finding number of high-quality articles and total number of articles by country.
def find_article_counts_by_type(group):
    """
    Function to find the number of articles by type, i.e., high quality or not.
    Args:
        group(pandas.DataFrame):  DataFrame for each country.
    Returns:
        df(pandas.DataFrame): DataFrame with counts by article type.
    """
    high_quality_articles = group.query('score == "FA" or score == "GA"')
    return pd.DataFrame([{'hq_article_counts':high_quality_articles.shape[0],
                          'total_article_counts':group.shape[0]}])
    
article_type_by_country = score_with_population_data.groupby(
    'country').apply(find_article_counts_by_type).reset_index(level=1, drop=True)
article_type_by_country.head(5)

Unnamed: 0_level_0,hq_article_counts,total_article_counts
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,13,319
Albania,3,456
Algeria,2,116
Andorra,0,34
Angola,0,106
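
The same counts can be sketched without groupby.apply by flagging high-quality rows first and then aggregating the flag. This is an alternative shown on toy data; the country labels "A" and "B" and their scores are made up.

```python
import pandas as pd

# Toy data: country "A" has 2 high-quality articles out of 4, country "B" has 0 out of 2.
toy_scores = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "score": ["FA", "Stub", "GA", "C", "Stub", "Start"],
})

# Boolean flag: True where the article is rated FA or GA.
flagged = toy_scores.assign(is_hq=toy_scores["score"].isin(["FA", "GA"]))

# Summing the flag counts high-quality articles; counting it gives all articles.
summary = flagged.groupby("country")["is_hq"].agg(
    hq_article_counts="sum", total_article_counts="count"
)
summary["proportion_hq_articles"] = (
    summary["hq_article_counts"] / summary["total_article_counts"]
)
```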


In [20]:
# Finding proportion of hq articles and sorting them.
article_type_by_country['proportion_hq_articles'] = (article_type_by_country['hq_article_counts'] / 
                                                     article_type_by_country['total_article_counts'])
article_type_by_country.sort_values(by='proportion_hq_articles', ascending=False, inplace=True)
# Dropping all countries that have no high-quality articles so far.
article_type_by_country_hq = article_type_by_country.query('proportion_hq_articles != 0')

In [21]:
article_type_by_country_hq.head(10)

Unnamed: 0_level_0,hq_article_counts,total_article_counts,proportion_hq_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Korea, North",8,36,0.222222
Saudi Arabia,15,117,0.128205
Romania,42,343,0.122449
Central African Republic,8,66,0.121212
Uzbekistan,3,28,0.107143
Mauritania,5,48,0.104167
Guatemala,7,83,0.084337
Dominica,1,12,0.083333
Syria,10,128,0.078125
Benin,7,91,0.076923


Q.4 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [22]:
article_type_by_country_hq.tail(10)[::-1]

Unnamed: 0_level_0,hq_article_counts,total_article_counts,proportion_hq_articles
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Belgium,1,519,0.001927
Tanzania,1,404,0.002475
Switzerland,1,402,0.002488
Nepal,1,356,0.002809
Peru,1,350,0.002857
Nigeria,2,676,0.002959
Portugal,1,318,0.003145
Colombia,1,285,0.003509
Lithuania,1,244,0.004098
Morocco,1,206,0.004854


Q.5 10 highest-ranked regions in terms of number of politician articles as a proportion of regional population

In [61]:
# We'll answer this question similar to how we answered Q.1

# Finding the number of articles by region
articles_by_region = score_with_population_data.groupby(['Region']).agg('count')['page'].reset_index()

# Renaming the column to reflect the value it contains.
articles_by_region = articles_by_region.rename(columns={'page':'num_article'})

# Looking up each region's population from the region-level (all-caps) rows of population_df.
# This assumes region names are unique in the population data.
region_rows = population_df[population_df['country'].str.isupper()]
region_population = region_rows.set_index('country')['Population']
articles_by_region['Population'] = articles_by_region['Region'].map(region_population)

# Finding the number of articles per person in each region and sorting in decreasing order
articles_by_region['article_per_person'] = articles_by_region['num_article'] / articles_by_region['Population']
articles_by_region = articles_by_region.sort_values(by='article_per_person', ascending=False)



### Note: Could not finish analyses 5 & 6 before due date. 