# A2 - Bias in Data
### DATA 512
### Laura Thriftwood

The purpose of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. I combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. I then perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries.

In [1]:
import pandas as pd
import numpy as np
import json
import math
import requests

## Step 1: Getting the Article and Population Data

The Wikipedia politicians by country dataset comes from Figshare. The .zip file was downloaded and unzipped where the _page_data.csv_ file was located.

The population data was drawn from the World Population data sheet published by the Population Reference Bureau and was downloaded as a .csv file named _WPDS_2020_data.csv_. 

These files can be located in the __data__ folder.


## Step 2: Cleaning the Data

First let's read in the politicians by country dataset and take a look at the structure of the data.

In [2]:
df_politicians = pd.read_csv('data/country/data/page_data.csv')

In [3]:
df_politicians.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
df_politicians.shape

(47197, 3)

We want to remove/ignore the rows of data that include "Template:" in the string of the page name as these entries are not Wikipedia articles and should not be included in the analysis.

In [5]:
df_politicians = df_politicians[~df_politicians.page.str.contains('Template:')].reset_index(drop=True)

In [6]:
df_politicians.head()

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,Yos Por,Cambodia,393822005
3,Julius Gregr,Czech Republic,395521877
4,Edvard Gregr,Czech Republic,395526568


In [7]:
df_politicians.shape

(46701, 3)

We can see that this removed 496 rows. Now let's read in and take a look at our other dataset with population information, _WPDS_2020_data.csv_.

In [8]:
df_population = pd.read_csv('data/WPDS_2020_data.csv')
df_population.head(20)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000
8,TN,Tunisia,Country,2019,11.896,11896000
9,EH,Western Sahara,Country,2019,0.597,597000


We want to ignore rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the __Name__ field. We can move these into a separate dataframe to reference later when reporting coverage and quality by region in the analysis section.

Note: Initially, I used the designation in the __Type__ field to determine exclusion criteria but there exists an entry for Channel Islands that has a Sub-Region __Type__ but is not displayed in ALL CAPS in the __Name__ field.

As we want to preserve the data we are removing in this step, we first make a copy of the population data.

In [9]:
df_sub_region_population = df_population.copy()
df_sub_region_population = df_sub_region_population[df_sub_region_population['Name'].str.isupper().fillna(False)]
df_sub_region_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000


In [10]:
df_sub_region_population.shape

(24, 6)

Our challenge is to create a new field for each country that indicates the Sub-Region it belongs to. I started by getting a list of the indices in the Sub-Region DataFrame, then creating a list of Sub-Region names that repeats _n_ number of times, with _n_ being calculated based on the different between Sub-Regions. I freely admit this is an inelegant solution, but it worked in this case.

In [11]:
# get a list of indices
sub_region_index = pd.Series(df_sub_region_population.index.values.tolist()) 

#create a a list of repeating index values based on range between list items
rep_items = sub_region_index.diff()

#reset indices
df_sub_region_population_copy = df_sub_region_population.copy().reset_index(drop=True)

#create a column for the number of reps needed for each Sub-Region
df_sub_region_population_copy['reps'] = rep_items

#shift the entries in the reps column up a row
df_sub_region_population_copy['reps'] = df_sub_region_population_copy['reps'].shift(periods = -1, fill_value = 18.0)

#drop the top entries
df_sub_region_population_copy = df_sub_region_population_copy.drop([0,1])

In [12]:
df_sub_region_population_copy

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,reps
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,8.0
3,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000,17.0
4,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000,21.0
5,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000,10.0
6,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000,6.0
7,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000,3.0
8,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000,1.0
9,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000,9.0
10,CARIBBEAN,CARIBBEAN,Sub-Region,2019,43.233,43233000,18.0
11,SOUTH AMERICA,SOUTH AMERICA,Sub-Region,2019,429.191,429191000,14.0


In [13]:
#create a dataframe for the Sub-Region name, and reps needed to assign to our countries data, less 1 so it aligns
sub_reg_reps = df_sub_region_population_copy[['Name', 'reps']]
sub_reg_reps.loc[:, 'reps'] = sub_reg_reps['reps'].apply(lambda x: x - 1).astype(int)
sub_reg_reps

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


Unnamed: 0,Name,reps
2,NORTHERN AFRICA,7
3,WESTERN AFRICA,16
4,EASTERN AFRICA,20
5,MIDDLE AFRICA,9
6,SOUTHERN AFRICA,5
7,NORTHERN AMERICA,2
8,LATIN AMERICA AND THE CARIBBEAN,0
9,CENTRAL AMERICA,8
10,CARIBBEAN,17
11,SOUTH AMERICA,13


In [14]:
repeating = sub_reg_reps.loc[sub_reg_reps.index.repeat(sub_reg_reps.reps)]
repeating.shape

(210, 2)

In [15]:
repeating_series = repeating['Name'].squeeze().reset_index(drop=True)
repeating_series

0      NORTHERN AFRICA
1      NORTHERN AFRICA
2      NORTHERN AFRICA
3      NORTHERN AFRICA
4      NORTHERN AFRICA
            ...       
205            OCEANIA
206            OCEANIA
207            OCEANIA
208            OCEANIA
209            OCEANIA
Name: Name, Length: 210, dtype: object

In [16]:
#drop the Sub-Regions (ALL CAPS) from the original population data
df_population = df_population[~df_population['Name'].str.isupper().fillna(False)].reset_index(drop=True)
df_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,DZ,Algeria,Country,2019,44.357,44357000
1,EG,Egypt,Country,2019,100.803,100803000
2,LY,Libya,Country,2019,6.891,6891000
3,MA,Morocco,Country,2019,35.952,35952000
4,SD,Sudan,Country,2019,43.849,43849000
...,...,...,...,...,...,...
205,WS,Samoa,Country,2019,0.200,200000
206,SB,Solomon Islands,Country,2019,0.715,715000
207,TO,Tonga,Country,2019,0.099,99000
208,TV,Tuvalu,Country,2019,0.010,10000


In [17]:
#add column to df_population that notes the sub-region
df_population['Sub_Region'] = repeating_series
df_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Sub_Region
0,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
1,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
2,LY,Libya,Country,2019,6.891,6891000,NORTHERN AFRICA
3,MA,Morocco,Country,2019,35.952,35952000,NORTHERN AFRICA
4,SD,Sudan,Country,2019,43.849,43849000,NORTHERN AFRICA
...,...,...,...,...,...,...,...
205,WS,Samoa,Country,2019,0.200,200000,OCEANIA
206,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA
207,TO,Tonga,Country,2019,0.099,99000,OCEANIA
208,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA


## Step 3: Getting Article Quality Predictions

We need to get the predicted quality scores for each article in the Wikipedia dataset using a machine learning system called ORES that provides estimates of Wikipedia article quality. The article quality estimates (from best to worst) are:

1.	FA - Featured article
2.	GA - Good article
3.	B - B-class article
4.	C - C-class article
5.	Start - Start-class article
6.	Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. ORES will assign one of these 6 categories to any rev_id we send it.


In order to get article predictions for each article in the Wikipedia dataset, we use the value of each entry in the rev_id column to make an API query.

In [18]:
from ores import api

In [19]:
#API query headers
#headers = {
#    'User-Agent': 'https://github.com/laurathriftwood',
#    'From': 'lwood3@uw.edu'
#}

In [20]:
#API endpoint
#endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/?models=articlequality&revids={rev_id}'

We start by extracting a list of `rev_id`'s from our `df_politicians` dataframe for which we want associated ORES scores. Since the full list is quite long, I've left some test batches commented out just in case. The smaller batch includes a `rev_id` for which there is no associate score so we can verify our error handling methodology.

In [21]:
#full batch of rev_ids
rev_list = df_politicians['rev_id']

#test smaller batch
#rev_list = [502721672, 516633096, 521986779]
    
#test larger batch
#rev_list = df_politicians['rev_id'][0:50]

In [22]:
#ORES session
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 Class project lwood3@uw.edu")

#get the full set of results for all rev_id values included in our dataset
results = ores_session.score("enwiki", ["articlequality"], rev_list)

#create an empty list to load the predictions into
predictions = []

In [23]:
#getting the predictions from our results
for score in results:
    while True:
        #attempt to retrieve the prediction for the rev_id and add it to a Series
        try: 
            predictions.append(score["articlequality"]["score"]["prediction"])
            break
            
        #if no prediction is available, we will see an error in the score
        #we note this error with a "No_Score" string entry
        except KeyError:
            predictions.append(str("No_Score"))
            break

Now that we have a list of `rev_ids` and a list of associated predictions, we merge our results into a single dataframe and verify that all `rev_id`s were processed by comparing the resulting shape against the length of the original `rev_list`. 

In [24]:
data = {'rev_id':rev_list, 'prediction':predictions}
predictions_df = pd.DataFrame(data)
print(predictions_df.shape)

(46701, 2)


Now we merge our predictions with the original politicians dataframe using the `rev_ids`.

In [25]:
merged_df = df_politicians.merge(predictions_df, how = 'inner', on = ['rev_id', 'rev_id'])
merged_df.head()

Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


We check the shape of the merged dataframe to ensure we have not lost any rows of (missing/unmatched) data in the process.

In [26]:
print(merged_df.shape)
merged_df.head()

(46701, 4)


Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


Let's see how many articles came back without associated scores/predictions.

In [27]:
merged_df.loc[merged_df.prediction == 'No_Score', 'prediction'].count()

276

We see that there are 276 `rev_id`s that do not have a prediction. We will extract those rows from our working dataframe and store them in a separate output file for our records.

In [28]:
df_politicians_no_score = merged_df[merged_df['prediction'] == 'No_Score']
print(df_politicians_no_score.shape)
df_politicians_no_score.to_csv(r'output/wp_wpds_politicians_no_score.csv', index = False, header = True)

(276, 4)


In [29]:
#drop rows that have a No_Score prediction value and check the shape to ensure 276 rows were dropped
merged_df = merged_df[~merged_df.prediction.str.contains('No_Score')].reset_index(drop=True)
merged_df.shape

(46425, 4)

In [30]:
df_population.shape

(210, 7)

In [31]:
merged_df.shape

(46425, 4)

## Combining the Datasets

We now need to merge our two datasets - the Wikipedia data in `merged_df` and the population data in `df_population` on their respective __country__ and __Name__ fields. Since we want maintain a record of the subset that does not have matching data, we will use an outer join to retain those rows.

In [32]:
all_data_df = merged_df.merge(df_population, how = 'outer', left_on='country', right_on='Name')
print(all_data_df.shape)
all_data_df

(46452, 11)


Unnamed: 0,page,country,rev_id,prediction,FIPS,Name,Type,TimeFrame,Data (M),Population,Sub_Region
0,Bir I of Kanem,Chad,355319463.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,MIDDLE AFRICA
1,Abdullah II of Kanem,Chad,498683267.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,MIDDLE AFRICA
2,Salmama II of Kanem,Chad,565745353.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,MIDDLE AFRICA
3,Kuri I of Kanem,Chad,565745365.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,MIDDLE AFRICA
4,Mohammed I of Kanem,Chad,565745375.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0,MIDDLE AFRICA
...,...,...,...,...,...,...,...,...,...,...,...
46447,,,,,PF,French Polynesia,Country,2019.0,0.280,280000.0,OCEANIA
46448,,,,,GU,Guam,Country,2019.0,0.175,175000.0,OCEANIA
46449,,,,,NC,New Caledonia,Country,2019.0,0.295,295000.0,OCEANIA
46450,,,,,PW,Palau,Country,2019.0,0.018,18000.0,OCEANIA


We want to extract rows in the __country__ or __Name__ columns where our data doesn't match (contains NaN values) and export it to a .csv file for our records. We are only interested in removing rows with NaN values in the country/Name columns, but as there are NaN values in the FIPS column, we need to be specific in our column operations

In [33]:
df_no_match = all_data_df[(all_data_df['country'].isnull()) | (all_data_df['Name'].isnull())] #1884 rows
df_no_match.to_csv(r'output/wp_wpds_countries-no_match.csv', index = False, header = True)
print(df_no_match.shape)
df_no_match

(1884, 11)


Unnamed: 0,page,country,rev_id,prediction,FIPS,Name,Type,TimeFrame,Data (M),Population,Sub_Region
488,Julius Gregr,Czech Republic,395521877.0,Stub,,,,,,,
489,Edvard Gregr,Czech Republic,395526568.0,Stub,,,,,,,
490,Miroslav Poche,Czech Republic,672862914.0,Stub,,,,,,,
491,Vojtěch Mynář,Czech Republic,673008587.0,Stub,,,,,,,
492,Jan Malypetr,Czech Republic,704424304.0,Stub,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
46447,,,,,PF,French Polynesia,Country,2019.0,0.280,280000.0,OCEANIA
46448,,,,,GU,Guam,Country,2019.0,0.175,175000.0,OCEANIA
46449,,,,,NC,New Caledonia,Country,2019.0,0.295,295000.0,OCEANIA
46450,,,,,PW,Palau,Country,2019.0,0.018,18000.0,OCEANIA


 We will also drop these rows from our working dataframe and check the shape before and after to ensure it matched the number of rows we identified in the previous step.

In [34]:
print(all_data_df.shape)
all_data_df = all_data_df.dropna(subset = ['country', 'Name']).reset_index(drop=True)
print(all_data_df.shape)

(46452, 11)
(44568, 11)


Let's clean up our dataset to match the schema in the assignment instructions by dropping unnecessary columns, renaming the column headers, and reordering the columns. Then we can export our final dataset to a .csv file.

In [35]:
#convert rev_id and Population to integer
all_data_df = all_data_df.astype({"rev_id": int, "Population": int})

In [36]:
all_data_df

Unnamed: 0,page,country,rev_id,prediction,FIPS,Name,Type,TimeFrame,Data (M),Population,Sub_Region
0,Bir I of Kanem,Chad,355319463,Stub,TD,Chad,Country,2019.0,16.877,16877000,MIDDLE AFRICA
1,Abdullah II of Kanem,Chad,498683267,Stub,TD,Chad,Country,2019.0,16.877,16877000,MIDDLE AFRICA
2,Salmama II of Kanem,Chad,565745353,Stub,TD,Chad,Country,2019.0,16.877,16877000,MIDDLE AFRICA
3,Kuri I of Kanem,Chad,565745365,Stub,TD,Chad,Country,2019.0,16.877,16877000,MIDDLE AFRICA
4,Mohammed I of Kanem,Chad,565745375,Stub,TD,Chad,Country,2019.0,16.877,16877000,MIDDLE AFRICA
...,...,...,...,...,...,...,...,...,...,...,...
44563,Rita Sinon,Seychelles,800323154,Stub,SC,Seychelles,Country,2019.0,0.098,98000,EASTERN AFRICA
44564,Sylvette Frichot,Seychelles,800323798,Stub,SC,Seychelles,Country,2019.0,0.098,98000,EASTERN AFRICA
44565,May De Silva,Seychelles,800969960,Start,SC,Seychelles,Country,2019.0,0.098,98000,EASTERN AFRICA
44566,Vincent Meriton,Seychelles,802051093,Stub,SC,Seychelles,Country,2019.0,0.098,98000,EASTERN AFRICA


In [37]:
#convert rev_id and Population to integer
all_data_df = all_data_df.astype({"rev_id": int, "Population": int})

#drop unnecessary columns
all_data_df = all_data_df.drop(all_data_df.columns[[4, 5, 6, 7, 8]], axis=1)
all_data_df

#rename columns
all_data_df = all_data_df.rename(columns={'page': 'article_name', 
                                          'rev_id': 'revision_id', 
                                          'prediction': 'article_quality_est', 
                                          'Population': 'population',
                                         'Sub_Region': 'subregion'})

#reorder columns
all_data_df = all_data_df[['country', 'subregion', 'article_name', 'revision_id', 'article_quality_est', 'population']]

In [38]:
all_data_df

Unnamed: 0,country,subregion,article_name,revision_id,article_quality_est,population
0,Chad,MIDDLE AFRICA,Bir I of Kanem,355319463,Stub,16877000
1,Chad,MIDDLE AFRICA,Abdullah II of Kanem,498683267,Stub,16877000
2,Chad,MIDDLE AFRICA,Salmama II of Kanem,565745353,Stub,16877000
3,Chad,MIDDLE AFRICA,Kuri I of Kanem,565745365,Stub,16877000
4,Chad,MIDDLE AFRICA,Mohammed I of Kanem,565745375,Stub,16877000
...,...,...,...,...,...,...
44563,Seychelles,EASTERN AFRICA,Rita Sinon,800323154,Stub,98000
44564,Seychelles,EASTERN AFRICA,Sylvette Frichot,800323798,Stub,98000
44565,Seychelles,EASTERN AFRICA,May De Silva,800969960,Start,98000
44566,Seychelles,EASTERN AFRICA,Vincent Meriton,802051093,Stub,98000


In [39]:
#make a copy to export that drops the subregion
all_data_df_export = all_data_df.drop(all_data_df.columns[[1]], axis=1)
all_data_df_export

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Chad,Bir I of Kanem,355319463,Stub,16877000
1,Chad,Abdullah II of Kanem,498683267,Stub,16877000
2,Chad,Salmama II of Kanem,565745353,Stub,16877000
3,Chad,Kuri I of Kanem,565745365,Stub,16877000
4,Chad,Mohammed I of Kanem,565745375,Stub,16877000
...,...,...,...,...,...
44563,Seychelles,Rita Sinon,800323154,Stub,98000
44564,Seychelles,Sylvette Frichot,800323798,Stub,98000
44565,Seychelles,May De Silva,800969960,Start,98000
44566,Seychelles,Vincent Meriton,802051093,Stub,98000


In [40]:
#make a copy to export that drops the subregion
all_data_df_export = all_data_df.drop(all_data_df.columns[[1]], axis=1)

#output to file as a .csv
all_data_df_export.to_csv(r'output/wp_wpds_politicians_by_country.csv', index = False, header = True)

Let's take a look at our final dataset.

In [41]:
all_data_df

Unnamed: 0,country,subregion,article_name,revision_id,article_quality_est,population
0,Chad,MIDDLE AFRICA,Bir I of Kanem,355319463,Stub,16877000
1,Chad,MIDDLE AFRICA,Abdullah II of Kanem,498683267,Stub,16877000
2,Chad,MIDDLE AFRICA,Salmama II of Kanem,565745353,Stub,16877000
3,Chad,MIDDLE AFRICA,Kuri I of Kanem,565745365,Stub,16877000
4,Chad,MIDDLE AFRICA,Mohammed I of Kanem,565745375,Stub,16877000
...,...,...,...,...,...,...
44563,Seychelles,EASTERN AFRICA,Rita Sinon,800323154,Stub,98000
44564,Seychelles,EASTERN AFRICA,Sylvette Frichot,800323798,Stub,98000
44565,Seychelles,EASTERN AFRICA,May De Silva,800969960,Start,98000
44566,Seychelles,EASTERN AFRICA,Vincent Meriton,802051093,Stub,98000


## Step 4: Analysis

We'd like to calculate the proportion (as a percentage) of articles-per-population and proportion of all articles of high-quality for each country AND for each geographic region. 

"High quality" here is defined as having ORES quality prediction scores as either "FA" - Featured Article or "GA" - Good Article. As such, we add a column that converts the __article_quality_est__ predictions into binary indicators, with $1$ representing FA or GA scores and $0$ representing all other scores.

We then need to create a dataframe that groups our data by __country__ and by __high_quality__. 

To generate the six tables requested in Step 5, we will need a table with the following columns, with one row for each country in our dataset:
- country
- geographic region
- population
- total number of articles
- total number of high quality articles
- coverage (number of total articles / population)
- relative quality (number of high quality articles / total number of articles)

In [42]:
all_data_df['high_quality'] = np.where(all_data_df.article_quality_est.str.contains('GA' or 'FA'), 1, 0)
all_data_df

Unnamed: 0,country,subregion,article_name,revision_id,article_quality_est,population,high_quality
0,Chad,MIDDLE AFRICA,Bir I of Kanem,355319463,Stub,16877000,0
1,Chad,MIDDLE AFRICA,Abdullah II of Kanem,498683267,Stub,16877000,0
2,Chad,MIDDLE AFRICA,Salmama II of Kanem,565745353,Stub,16877000,0
3,Chad,MIDDLE AFRICA,Kuri I of Kanem,565745365,Stub,16877000,0
4,Chad,MIDDLE AFRICA,Mohammed I of Kanem,565745375,Stub,16877000,0
...,...,...,...,...,...,...,...
44563,Seychelles,EASTERN AFRICA,Rita Sinon,800323154,Stub,98000,0
44564,Seychelles,EASTERN AFRICA,Sylvette Frichot,800323798,Stub,98000,0
44565,Seychelles,EASTERN AFRICA,May De Silva,800969960,Start,98000,0
44566,Seychelles,EASTERN AFRICA,Vincent Meriton,802051093,Stub,98000,0


In [43]:
#create new dataframe that gets a count of high_quality articles per country
all_data_df_quality = all_data_df[(all_data_df['high_quality'] == 1)]
all_data_df_quality

Unnamed: 0,country,subregion,article_name,revision_id,article_quality_est,population,high_quality
82,Chad,MIDDLE AFRICA,Hissène Habré,803166806,GA,16877000,1
199,Palestinian Territory,WESTERN ASIA,Abdullah Rimawi,788953220,GA,5008000,1
204,Palestinian Territory,WESTERN ASIA,Khalida Jarrar,791881528,GA,5008000,1
218,Palestinian Territory,WESTERN ASIA,Ahmed Yassin,797122322,GA,5008000,1
225,Palestinian Territory,WESTERN ASIA,Marwan Barghouti,798913975,GA,5008000,1
...,...,...,...,...,...,...,...
44292,Saudi Arabia,WESTERN ASIA,Mohammad bin Salman,807463170,GA,35041000,1
44293,Saudi Arabia,WESTERN ASIA,Fahd of Saudi Arabia,807483153,GA,35041000,1
44321,Trinidad and Tobago,CARIBBEAN,Jack Warner (football executive),805253461,GA,1369000,1
44368,Dominica,CARIBBEAN,Eugenia Charles,802175384,GA,72000,1


In [44]:
#country, region, and population count table
summary_df = all_data_df.groupby(['country', 'subregion', 'population']).sum('high_quality').reset_index()
summary_df

Unnamed: 0,country,subregion,population,revision_id,high_quality
0,Afghanistan,SOUTH ASIA,38928000,247043956038,12
1,Albania,SOUTHERN EUROPE,2838000,357171119064,3
2,Algeria,NORTHERN AFRICA,44357000,90374847695,2
3,Andorra,SOUTHERN EUROPE,82000,26179556617,0
4,Angola,MIDDLE AFRICA,32522000,80804303350,0
...,...,...,...,...,...
178,Venezuela,SOUTH AMERICA,28645000,100771979914,3
179,Vietnam,SOUTHEAST ASIA,96209000,145278362558,6
180,Yemen,WESTERN ASIA,29826000,89361103584,2
181,Zambia,EASTERN AFRICA,18384000,19659512534,0


In [45]:
summary_df = summary_df.drop(columns = ['revision_id'])
summary_df

Unnamed: 0,country,subregion,population,high_quality
0,Afghanistan,SOUTH ASIA,38928000,12
1,Albania,SOUTHERN EUROPE,2838000,3
2,Algeria,NORTHERN AFRICA,44357000,2
3,Andorra,SOUTHERN EUROPE,82000,0
4,Angola,MIDDLE AFRICA,32522000,0
...,...,...,...,...
178,Venezuela,SOUTH AMERICA,28645000,3
179,Vietnam,SOUTHEAST ASIA,96209000,6
180,Yemen,WESTERN ASIA,29826000,2
181,Zambia,EASTERN AFRICA,18384000,0


In [46]:
#total number of articles by country
article_counts = all_data_df[['country', 'article_name']].groupby("country").count().astype(int).reset_index()
article_counts

Unnamed: 0,country,article_name
0,Afghanistan,319
1,Albania,456
2,Algeria,116
3,Andorra,34
4,Angola,106
...,...,...
178,Venezuela,130
179,Vietnam,187
180,Yemen,116
181,Zambia,25


In [47]:
#merge the above data into a new table that includes population
combined = summary_df.merge(article_counts, how = 'left', on='country')

#rename article_name to total_articles
combined.rename(columns={'article_name': 'total_articles'}, inplace=True)  
combined

Unnamed: 0,country,subregion,population,high_quality,total_articles
0,Afghanistan,SOUTH ASIA,38928000,12,319
1,Albania,SOUTHERN EUROPE,2838000,3,456
2,Algeria,NORTHERN AFRICA,44357000,2,116
3,Andorra,SOUTHERN EUROPE,82000,0,34
4,Angola,MIDDLE AFRICA,32522000,0,106
...,...,...,...,...,...
178,Venezuela,SOUTH AMERICA,28645000,3,130
179,Vietnam,SOUTHEAST ASIA,96209000,6,187
180,Yemen,WESTERN ASIA,29826000,2,116
181,Zambia,EASTERN AFRICA,18384000,0,25


Let's make a copy of this starter table for our subregion analysis and drop the extra columns. We'll also drop the subregion column from the combined table.

In [48]:
combined_sub = combined.copy()
combined_sub = combined_sub.drop(columns = ['country'])
combined_sub = combined_sub.groupby(['subregion']).sum('high_quality').reset_index()
combined = combined.drop(columns = ['subregion'])

Now that we have our columns with basic counts, let's add columns to calculate:
- __%coverage__ which we define as the proportion of all of a country's articles per population
- __%relative_quality__ which we define as the proportion of high_quality articles per total articles

We want to display the actual percentage so we multiply by 100. I've chosen to round to 4 decimal points as the percentages for the __%relative_quality__ values are so small.

We do this for both the combined (by country) table and the combined_sub (by subregion) table.

In [49]:
#coverage (number of total articles/population)
combined['%coverage'] = combined.apply(lambda x: round(((x['total_articles']/x['population'])*100), 4), axis=1)

#relative quality (number of high quality articles/total number of articles)
combined['%relative_quality'] = combined.apply(lambda x: round(((x['high_quality']/x['total_articles'])*100), 4), axis=1)

In [50]:
#coverage (number of high quality articles/population)
combined_sub['%coverage'] = combined_sub.apply(lambda x: round(((x['total_articles']/x['population'])*100), 4), axis=1)

#relative quality (number of high quality articles/total number of articles)
combined_sub['%relative_quality'] = combined_sub.apply(lambda x: round(((x['high_quality']/x['total_articles'])*100), 4), axis=1)

In [51]:
combined

Unnamed: 0,country,population,high_quality,total_articles,%coverage,%relative_quality
0,Afghanistan,38928000,12,319,0.0008,3.7618
1,Albania,2838000,3,456,0.0161,0.6579
2,Algeria,44357000,2,116,0.0003,1.7241
3,Andorra,82000,0,34,0.0415,0.0000
4,Angola,32522000,0,106,0.0003,0.0000
...,...,...,...,...,...,...
178,Venezuela,28645000,3,130,0.0005,2.3077
179,Vietnam,96209000,6,187,0.0002,3.2086
180,Yemen,29826000,2,116,0.0004,1.7241
181,Zambia,18384000,0,25,0.0001,0.0000


In [52]:
combined_sub

Unnamed: 0,subregion,population,high_quality,total_articles,%coverage,%relative_quality
0,CARIBBEAN,39056000,11,695,0.0018,1.5827
1,CENTRAL AMERICA,162267000,16,1543,0.001,1.0369
2,CENTRAL ASIA,74960000,6,245,0.0003,2.449
3,EAST ASIA,1632883000,59,2473,0.0002,2.3858
4,EASTERN AFRICA,443825000,28,2502,0.0006,1.1191
5,EASTERN EUROPE,281186000,68,3732,0.0013,1.8221
6,MIDDLE AFRICA,90189000,11,665,0.0007,1.6541
7,NORTHERN AFRICA,243748000,11,899,0.0004,1.2236
8,NORTHERN AMERICA,368068000,73,1901,0.0005,3.8401
9,NORTHERN EUROPE,105680000,75,3763,0.0036,1.9931


## Step 5: Results

Now that we have the table we need to produce the six results tables requested in the assignment.

### Table 1
#### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population.

We start by sorting our combined country table on the __%coverage__ column in descending order and produce the top 10 rows.

In [53]:
combined.sort_values(['%coverage'], axis=0, ascending=False, inplace = True, kind='quicksort', ignore_index=True)
combined.head(10)

Unnamed: 0,country,population,high_quality,total_articles,%coverage,%relative_quality
0,Tuvalu,10000,4,54,0.54,7.4074
1,Nauru,11000,0,52,0.4727,0.0
2,San Marino,34000,0,81,0.2382,0.0
3,Monaco,38000,0,40,0.1053,0.0
4,Liechtenstein,39000,0,28,0.0718,0.0
5,Marshall Islands,57000,0,37,0.0649,0.0
6,Tonga,99000,0,63,0.0636,0.0
7,Iceland,368000,2,201,0.0546,0.995
8,Andorra,82000,0,34,0.0415,0.0
9,Federated States of Micronesia,106000,0,36,0.034,0.0


### Table 2
#### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

Since we already have our combined country table sorted, we instead output 10 entries from the tail of our sorted dataframe in the previous step.

In [54]:
combined.tail(10)

Unnamed: 0,country,population,high_quality,total_articles,%coverage,%relative_quality
173,Thailand,66534000,3,112,0.0002,2.6786
174,Sudan,43849000,1,95,0.0002,1.0526
175,Egypt,100803000,4,234,0.0002,1.7094
176,China,1402385000,30,1129,0.0001,2.6572
177,Uzbekistan,34174000,2,28,0.0001,7.1429
178,Ethiopia,114916000,2,101,0.0001,1.9802
179,India,1400100000,12,968,0.0001,1.2397
180,Indonesia,271739000,7,209,0.0001,3.3493
181,"Korea, North",25779000,7,36,0.0001,19.4444
182,Zambia,18384000,0,25,0.0001,0.0


### Table 3
#### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality.

We simply need to re-sort our combined country table using the __%relative_quality__ column for this table, and output the top 10 entries.

In [55]:
combined.sort_values(['%relative_quality'], axis=0, ascending=False, inplace = True, kind='quicksort', ignore_index=True)
combined.head(10)

Unnamed: 0,country,population,high_quality,total_articles,%coverage,%relative_quality
0,"Korea, North",25779000,7,36,0.0001,19.4444
1,Saudi Arabia,35041000,13,117,0.0003,11.1111
2,Central African Republic,4830000,6,66,0.0014,9.0909
3,Dominica,72000,1,12,0.0167,8.3333
4,Mauritania,4650000,4,48,0.001,8.3333
5,Tuvalu,10000,4,54,0.54,7.4074
6,Singapore,5769000,5,68,0.0012,7.3529
7,Uzbekistan,34174000,2,28,0.0001,7.1429
8,Bhutan,730000,2,33,0.0045,6.0606
9,Guatemala,18066000,5,83,0.0005,6.0241


### Table 4
#### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

Again, as we already have our combined country table sorted on this value from the previous table, we simply need to output 10 items from the tail.

In [56]:
combined.tail(10)

Unnamed: 0,country,population,high_quality,total_articles,%coverage,%relative_quality
173,Andorra,82000,0,34,0.0415,0.0
174,Antigua and Barbuda,98000,0,24,0.0245,0.0
175,Djibouti,988000,0,37,0.0037,0.0
176,Nicaragua,6596000,0,114,0.0017,0.0
177,Nauru,11000,0,52,0.4727,0.0
178,Guyana,787000,0,20,0.0025,0.0
179,Congo,5518000,0,147,0.0027,0.0
180,Costa Rica,5111000,0,147,0.0029,0.0
181,Bahrain,1465000,0,42,0.0029,0.0
182,Zambia,18384000,0,25,0.0001,0.0


### Table 5
#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

For this table, we simply sort our combined subregion table on the __%coverage__ column and display in descending order. 

In [57]:
combined_sub.sort_values(['%coverage'], axis=0, ascending=False, inplace = True, kind='quicksort', ignore_index=True)
combined_sub

Unnamed: 0,subregion,population,high_quality,total_articles,%coverage,%relative_quality
0,OCEANIA,42031000,49,3126,0.0074,1.5675
1,NORTHERN EUROPE,105680000,75,3763,0.0036,1.9931
2,SOUTHERN EUROPE,151136000,37,3710,0.0025,0.9973
3,WESTERN EUROPE,195479000,44,4560,0.0023,0.9649
4,CARIBBEAN,39056000,11,695,0.0018,1.5827
5,EASTERN EUROPE,281186000,68,3732,0.0013,1.8221
6,CENTRAL AMERICA,162267000,16,1543,0.001,1.0369
7,SOUTHERN AFRICA,66628000,7,634,0.001,1.1041
8,WESTERN ASIA,272499000,70,2563,0.0009,2.7312
9,MIDDLE AFRICA,90189000,11,665,0.0007,1.6541


### Table 6
#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

We re-sort our combined sub-region table on the __%relative_quality__ column and display in descending order. 

In [58]:
combined_sub.sort_values(['%relative_quality'], axis=0, ascending=False, inplace = True, kind='quicksort', ignore_index=True)
combined_sub

Unnamed: 0,subregion,population,high_quality,total_articles,%coverage,%relative_quality
0,NORTHERN AMERICA,368068000,73,1901,0.0005,3.8401
1,WESTERN ASIA,272499000,70,2563,0.0009,2.7312
2,SOUTHEAST ASIA,660056000,54,2020,0.0003,2.6733
3,CENTRAL ASIA,74960000,6,245,0.0003,2.449
4,EAST ASIA,1632883000,59,2473,0.0002,2.3858
5,NORTHERN EUROPE,105680000,75,3763,0.0036,1.9931
6,EASTERN EUROPE,281186000,68,3732,0.0013,1.8221
7,MIDDLE AFRICA,90189000,11,665,0.0007,1.6541
8,CARIBBEAN,39056000,11,695,0.0018,1.5827
9,OCEANIA,42031000,49,3126,0.0074,1.5675


## Writeup: Reflections and Implications

1.	What biases did you expect to find in the data (before you started working with it), and why?

When considering geographic regions, I strongly suspected that there would be a bias regarding the “coverage” (number of articles as a proportion of population) in favor of countries in North America, and specifically the United States when broken down by country. This assumption was simply because our analysis would use English Wikipedia pages so there would exist an overrepresentation of articles about politicians from primarily English-speaking countries in North America. Regarding the “relative quality” assessment, I suspected a negative bias toward countries in Africa based on the overwhelming prevalence of anti-black bias in the world that would affect the amount attention paid to articles written about politicians in African countries which would reduce the quality ratings.

2.	What (potential) sources of bias did you discover in the course of your data processing and analysis?

I was initially surprised to see that North Korea was the highest-ranked country in terms of relative quality of articles about its politicians. After some reflection, I concluded that the relative quality was likely an indicator of the amount of news/media attention paid to politicians in that country. This attention wouldn't necessarily need to be positive, more likely it would be born of an abundance unrest or turmoil the country experienced. This theory is supported by other entries in the top 10 countries in terms of the relative proportion of politician articles that are of GA and FA-quality (Table 3) such as Saudi Arabia and the Central African Republic. 

3. What might your results suggest about the internet and global society in general?
    
These results suggest that politicians from countries that dominate global news coverage generate more interest in the form of Wikipedia contributions to their articles. A larger volume of contributions forces a higher degree of scrutiny of the contents of the article by other contributors, editors, and Wikipedia itself which will likely increase the overall quality ratings.
 
4. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?

I think the most impactful supplement to this dataset would be to include an analysis of Wikipedia articles from non-English speaking pages. As this would obviously result in multiple conflicting quality predictions for the same politicians, there would need to be weights assigned to each of the quality ratings and then aggregated into a single metric for each country. This would normalize the data and allow for a better global comparison between countries and regions, removing the bias introduced by the English language restriction in this analysis. It would also potentially uncover other language-specific biases through comparison of predicted ratings for the same countries between language pages. 