# <center> DATA 512 A2 - Bias in Data Assignment </center>
<center>Madalyn Li <br>
Fall 2021</center>

The purpose of this project is to uncover insights about bias by analyzing English Wikipedia articles on political figures from various countries. We acquire and merge data from three sources (Wikipedia page data, world population data, and article quality predictions) to produce six tables showing the proportion of articles per population and the proportion of high-quality articles, by country and by geographic region. At the end, we reflect on the results and discuss issues and potential future improvements.

#### This notebook is divided into 6 general sections:
1. Acquiring Article and Population Data
2. Cleaning Data
3. Acquiring Article Quality Predictions
4. Combining Datasets
5. Analyzing Data and Displaying Results
6. Reflection

In [1]:
import json
import requests
import pandas as pd
import numpy as np

### 1. Acquiring Article and Population Data

In the first step, our goal is to obtain and download the data from their sources and add them into the notebook as dataframes for further cleaning and analysis in later steps.

The first dataset, *page_data.csv*, is acquired from Figshare and includes data on English Wikipedia articles within the category "Politicians by nationality". The documentation and source can be found [here](https://figshare.com/articles/dataset/Untitled_Item/5513449). This file contains the following information:

1. **page**: Contains the page title of the article
2. **country**: Contains the country name extracted from the category
3. **rev_id**: Contains the revision ID of the last edit made to the page

In [2]:
# Add page_data.csv as data frame

df_page = pd.read_csv("page_data.csv")

The second dataset, *WPDS_2020_data.csv*, is acquired from the Population Reference Bureau and includes world population estimates for mid-2020. The documentation and source of the dataset can be found [here](https://www.prb.org/international/indicator/population/table). This file contains the following information:

1. **FIPS**: Abbreviation of country name
2. **Name**: Name of country or sub-region
3. **Type**: Category of Name (i.e., country or sub-region)
4. **Timeframe**: Year that data was collected
5. **Data(M)**: Population in millions
6. **Population**: Total population

In [3]:
# Add WPDS_2020_data.csv as data frame

df_world = pd.read_csv("WPDS_2020_data.csv")

### 2. Cleaning Data

After acquiring our data, the next step is to clean and process it to remove any unnecessary information not needed for analysis.

**Remove non-Wikipedia articles**<br>

The *page_data.csv* file contains some page names that start with "Template:". Since these are not Wikipedia articles, the code below removes these rows so that they are not included in our analysis.

In [4]:
# Remove rows whose page title starts with "Template:" from the Wikipedia page data frame

df_page = df_page[~df_page["page"].str.startswith("Template:", na=False)]

**Add geographic region**<br>

In our subsequent analysis, we will be looking at article and population proportions grouped by geographic region. For this reason, we need to add a new column to our world population data frame that holds the corresponding geographic region for each country. The *Name* column of the world population data set distinguishes geographic regions from countries by writing region names in UPPERCASE. In addition, the data set is ordered so that each geographic region row is followed by the rows for the countries in that region.

The code below adds a new column called 'geographic region' to the world population data frame and iterates over the values in the *Name* column. Whenever a value is all uppercase (i.e., it is a geographic region), the code saves it in a variable called *current_upper*. Every row is then assigned the current value of *current_upper*, so each country is tied to the geographic region listed above it.

In [5]:
# Add geographic region to world population data frame

df_world['geographic region'] = ""
current_upper = df_world.iloc[0, 1]

for i in range(len(df_world)):
    # Uppercase names are geographic regions; remember the most recent one
    if df_world.iloc[i, 1].isupper():
        current_upper = df_world.iloc[i, 1]
    # Assign the current region to every row (region rows and country rows alike)
    df_world.iloc[i, -1] = current_upper
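
The same "carry the last uppercase name forward" idea can also be written without an explicit loop. The sketch below is only an illustration on an invented toy frame (the name `toy` and its rows are assumptions, not the real WPDS file); it relies on the same ordering assumption as the loop above, namely that each region row precedes its countries.

```python
import pandas as pd

# Toy stand-in for the WPDS layout: region rows are uppercase,
# and each region row is followed by its countries.
toy = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
})

# Keep the name only on region rows (NaN elsewhere),
# then carry it forward to the country rows below.
is_region = toy["Name"].str.isupper()
toy["geographic region"] = toy["Name"].where(is_region).ffill()

print(toy)
```

Whether the loop or the vectorized form is preferable is a matter of taste; the loop makes the row-by-row logic explicit, while the forward-fill version is shorter.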

**Remove geographic region rows from Name**<br>

Now that we have obtained the geographic region for each country, we no longer need the rows containing the geographic region name. The code below removes these rows from the world population data frame. 

In [6]:
# Remove rows containing geographic regions in the Name column from the world population data frame

df_world = df_world[~df_world['Name'].str.isupper()]

**Re-sort and re-name columns**<br>

The code below narrows the world population data frame to only the columns needed for later analysis: *Name*, *geographic region*, and *Population*. In addition, the columns of both data frames are renamed for consistency, which makes the later analysis easier to conduct and follow.

In [7]:
# Select Name, geographic region, and population from world population data frame

df_world = df_world[["Name", "geographic region", "Population"]]


# Rename columns in both data frames

df_page.columns = ["article_name", "country", "revision_id"]
df_world.columns = ["country", "geographic region", "population"]

**Final cleaned data frames** <br>

Below are previews of the finalized clean data frames for reference:

In [8]:
# Preview the wikipedia page data frame

df_page.head()

Unnamed: 0,article_name,country,revision_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [9]:
# Preview the world population data frame

df_world.head()

Unnamed: 0,country,geographic region,population
3,Algeria,NORTHERN AFRICA,44357000
4,Egypt,NORTHERN AFRICA,100803000
5,Libya,NORTHERN AFRICA,6891000
6,Morocco,NORTHERN AFRICA,35952000
7,Sudan,NORTHERN AFRICA,43849000


### 3. Acquiring Article Quality Predictions

In this section, we will be obtaining data on predicted quality scores for each Wikipedia article. We will be utilizing a machine learning tool called ORES (short for Objective Revision Evaluation Service) to retrieve these predictions. Below is a table referencing the predicted scores that ORES will assign to an article; please note that they are listed in order from best to worst:

| Score | Description |
| --- | --- |
| FA | Featured article |
| GA | Good article |
| B | B-class article |
| C | C-class article |
| Start | Start-class article |
| Stub | Stub-class article |
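
Because the labels are ordered, it can be convenient to encode that order once and reuse it. The snippet below is a small sketch, not part of the original analysis; `QUALITY_ORDER`, `QUALITY_RANK`, and the sample `predictions` list are hypothetical names and values chosen for illustration.

```python
# Labels in the order given in the table above, best first.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]

# Hypothetical helper: map each label to its rank (0 = best).
QUALITY_RANK = {label: rank for rank, label in enumerate(QUALITY_ORDER)}

# Invented sample of predictions, sorted from best to worst.
predictions = ["Stub", "GA", "C", "FA"]
ranked = sorted(predictions, key=QUALITY_RANK.__getitem__)
print(ranked)
```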

**Set up API endpoint, header, and parameters**<br>

To obtain the predicted article quality data, we use the ORES REST API. In the code below, we first define the endpoint and headers. Next, we construct a function *api_call* that accepts an endpoint, a list of revision ids, and headers, and returns the API response as a dictionary. The parameters are *context*, which we set to *enwiki* for English Wikipedia articles; *model*, which we set to *articlequality*, the scoring model; and *revid*, the revision id of the article. Note that multiple revision ids can be passed in one request by separating the values with a "|".

The links below were used for reference:<br>
[Documentation for ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model)<br>
[ORES MediaWiki Page](https://www.mediawiki.org/wiki/ORES)

In [10]:
# Set endpoint and headers

endpoint = "https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revid}" 

headers = {'User-Agent': 'https://github.com/madalynli',
           'From': 'mli2324@uw.edu'}


# Define api_call function

def api_call(endpoint, revid, headers):
    # Join the revision ids into one "|"-separated string
    revid_combine = "|".join(str(rev) for rev in revid)

    # Values substituted into the endpoint URL template
    param = {"context": "enwiki",
             "model": "articlequality",
             "revid": revid_combine}

    call = requests.get(endpoint.format(**param), headers=headers)
    response = call.json()

    return response
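
To make the URL construction concrete without contacting the API, the sketch below builds the request URL for two revision ids taken from the page data preview. `build_ores_url` and `ENDPOINT` are hypothetical names introduced only for this illustration; the template string is the same one used for *endpoint* above.

```python
# Same URL template as the notebook's endpoint.
ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revid}"


def build_ores_url(rev_ids, context="enwiki", model="articlequality"):
    """Return the request URL for a batch of revision ids (no network call)."""
    # Multiple revision ids are separated by "|"
    revid_combined = "|".join(str(rev_id) for rev_id in rev_ids)
    return ENDPOINT.format(context=context, model=model, revid=revid_combined)


url = build_ores_url([355319463, 393276188])
print(url)
```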

**Query API data and obtain article quality predictions**<br>

In the code below, we first convert the *revision_id* column in the wikipedia page data frame to a list. This makes it easier to query later when we run *api_call*. Next, since our list has nearly 50,000 different revision ids, we will need to divide this into batches of 50 so that the query will run successfully. To implement this, we run a for loop with *batch_size* = 50 to section off a chunk of the *rev_id* list to run through the *api_call* function. 

It is important to note that the ORES API will not be able to return a predicted score for every article in our list. For these cases, we use an if/else statement to log the revision ids of the articles that return no predicted score. This list is saved to a file in the repository named 'ores_api_no_score.csv'.

The result from api_call returns a nested dictionary, and to obtain the prediction value, we have to parse through various keys to get there. The final result is an appended list of revision_ids with their corresponding predicted scores. 

***Note: the cell below takes an estimated time of 5 minutes to run.*** 

In [11]:
# Convert revision_id to list

rev_id = df_page['revision_id'].to_list()


# Set batch size and create empty list score and no_score to store the results

batch_size = 50

score = []
no_score = []


# Query API data and obtain article quality predictions

for i in range(0, len(rev_id), batch_size):
    rev_id_chunk = rev_id[i:i+batch_size]
    response = api_call(endpoint, rev_id_chunk, headers)

    # Each key of 'scores' is a revision id; its value holds either a score or an error
    for res, result in response['enwiki']['scores'].items():
        quality = result['articlequality']
        if quality.get('score') is None:
            # Log the revision id together with the error message
            no_score.append([res, quality['error']['message']])
        else:
            score.append([res, quality['score']['prediction']])

                
# Output list of revision_ids with no score to ores_api_no_score.csv

df_no_score = pd.DataFrame(no_score)
df_no_score.to_csv("ores_api_no_score.csv")

**Clean and standardize score results** <br>

To make the results easier to use in later analysis, we need to clean and standardize the data. First, we convert the list of scores to a data frame called *df_pred*. Next, we rename the columns to *revision_id* and *article_quality_est*, respectively. Finally, since the values of *revision_id* are stored as strings (pandas object type), we convert them to integers so that they match the type of *revision_id* in the page data when we merge in the next step.

In addition, we output the list of predictions to a csv file named *ores_api_scores.csv*.

In [12]:
# Convert list of score results to a data frame

df_pred = pd.DataFrame(score)


# Re-name columns

df_pred.columns = ["revision_id", "article_quality_est"]


# Convert revision_id type from object to integer

df_pred['revision_id'] = df_pred['revision_id'].astype(str).astype(int)


# Output list of score to ores_api_scores.csv

df_pred.to_csv("ores_api_scores.csv")

### 4. Combining Datasets

In this section, we will combine all our data sets (wikipedia page data, world population data, and article quality prediction data) into one file to make the analysis in step 5 easier. The final combined data set will be saved to a csv file named *wp_wpds_politicians_by_country.csv*. In addition, since not all countries in the world population data frame will match to the countries in the wikipedia page data frame, we isolate these values into a separate csv file named *wp_wpds_countries-no_match.csv*.

**Merge Wikipedia page data with article quality data** <br>

First, we merge *df_pred* (article quality prediction data) with *df_page* (Wikipedia page data) on their common column, *revision_id*.

In [13]:
# Merge df_pred and df_page

df_pred_page = pd.merge(df_pred, df_page, on = "revision_id", how = "left")

**Merge Wikipedia quality page data with world population data** <br>

Next, we merge the result of the previous step with *df_world* (world population data) on their common column, *country*. Here we use an outer join so that we obtain both the matches and the non-matches for each country.

In [14]:
# Merge df_pred_page and df_world 

merge_all = pd.merge(df_pred_page, df_world, on = "country", how = "outer", indicator = True)
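
The effect of `how = "outer"` together with `indicator = True` is easiest to see on a toy example. The frames below are invented for illustration (the country names and populations are placeholders, not real data): one country appears in both frames, one only in the article data, and one only in the population data.

```python
import pandas as pd

# Invented toy data: "Gamma" has articles but no population row,
# "Beta" has a population row but no articles.
articles = pd.DataFrame({
    "article_name": ["A", "B", "C"],
    "country": ["Alpha", "Alpha", "Gamma"],
})
population = pd.DataFrame({
    "country": ["Alpha", "Beta"],
    "population": [1000, 2000],
})

# The outer join keeps every row; _merge records where each row came from.
merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]

print(merged)
```

Rows marked `left_only` or `right_only` in *_merge* are the non-matches; rows marked `both` are the ones that carry over into the combined data set.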

**Obtain and save values for countries with no matches**

In the code below, we filter *merge_all* to rows where *_merge* does not equal *both*. This gives us the rows for countries that appear in only one of the two data frames. Finally, we output the results to a file named *wp_wpds_countries-no_match.csv*.

In [15]:
# Filter results to countries with no matches

no_match = merge_all.query('_merge != "both"')


# Output results to csv file

no_match.to_csv('wp_wpds_countries-no_match.csv')

**Obtain and save values for countries with matches**

Similar to the code above, this time we filter *merge_all* to rows where *_merge* equals *both*, since we want the results for countries that appear in both data frames. Next, we remove the *_merge* column, since it is no longer needed for the final analysis. Finally, we output the results to a file named *wp_wpds_politicians_by_country.csv*.

In [16]:
# Filter results to countries with matches

df_all = merge_all.query('_merge == "both"')


# Remove _merge column 

df_all = df_all.drop('_merge', axis = 1)


# Output results to csv file

df_all.to_csv('wp_wpds_politicians_by_country.csv')

### 5. Analyzing Data and Displaying Results

The goal of this section is to produce a total of 6 tables that display the following information:

| Table # | Table Name | Description |
| :--: | :-- | :-- |
| 1 | Top 10 countries by coverage | 10 highest-ranked countries in terms of number of politician articles as a proportion of country population |
| 2 | Bottom 10 countries by coverage | 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population |
| 3 | Top 10 countries by relative quality | 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality |
| 4 | Bottom 10 countries by relative quality | 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality |
| 5 | Geographic regions by coverage | Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population |
| 6 | Geographic regions by relative quality | Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality |

In order to produce these tables, we will first need to perform a series of processing steps to calculate the proportion values needed. Specifically, we will need to sum up the total number of articles grouped by country. Then, we will need to sum the total number of high quality articles (where scores are either FA or GA) grouped by country. 

**Count number of articles grouped by country** <br>

In the step below, we calculate the total number of politician articles for each country.

In [17]:
# Calculate total number of politician articles grouped by country

article_count = df_all.groupby(['country']).size().to_frame('article_count').reset_index()

**Count number of high quality articles grouped by country**

In the next step, we calculate the total number of high-quality articles grouped by country. We define high-quality articles as those with a predicted quality score of FA or GA.
In the code below, we first add a column called *highquality_count* that flags FA and GA articles: it is 1 if the estimated article quality equals either of these values and 0 otherwise.

Next, now that we have the tallies, we can sum the total number of high quality articles grouped by country. 

In [18]:
# Create new column that equals 1 if article_quality_est is FA or GA and 0 otherwise

df_all['highquality_count'] = df_all['article_quality_est'].isin(['FA', 'GA']).astype(int)


# Calculate total number of high quality articles grouped by country

highquality_count = df_all.groupby(['country'])['highquality_count'].sum().reset_index()
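
As a quick check of the counting logic, the sketch below applies the same flag-then-sum idea to a handful of invented rows. The frame `toy` and its values are assumptions for illustration only; it also computes both counts in a single `groupby`, which is one possible alternative to the two separate steps above.

```python
import pandas as pd

# Invented rows: two articles for "Alpha", three for "Beta".
toy = pd.DataFrame({
    "country": ["Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "article_quality_est": ["FA", "Stub", "GA", "C", "Start"],
})

# Flag FA and GA articles with 1, everything else with 0.
toy["highquality_count"] = toy["article_quality_est"].isin(["FA", "GA"]).astype(int)

# Count all articles and sum the flags per country.
summary = toy.groupby("country").agg(
    article_count=("article_quality_est", "size"),
    highquality_count=("highquality_count", "sum"),
).reset_index()

print(summary)
```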

**Merge tables**

Next, we will be merging the values for total article count, total high quality article count, and total population by country into one table.

In [19]:
# Merge article count value, high quality count value, and population values

merge_all_country = article_count.merge(highquality_count, on='country').merge(df_world, on='country')

**Calculating proportions**

Now that we have our count values per country, we can calculate the proportions needed for the final tables. Specifically, we will add two new columns to hold these calculations: *Percentage of articles per population* & *Percentage of high quality articles*<br>
***Percentage of articles per population*** is calculated by dividing article count by population and multiplying this value by 100. <br>
***Percentage of high quality articles*** is calculated by dividing high quality article count by total article count and multiplying this value by 100. 

In [20]:
# Calculate proportion values 

merge_all_country['Percentage of articles per population'] = (merge_all_country['article_count']/merge_all_country['population']) * 100
merge_all_country['Percentage of high quality articles'] = (merge_all_country['highquality_count']/merge_all_country['article_count']) * 100
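
As a worked example of the two formulas, the snippet below recomputes both percentages by hand for Tuvalu, using the article count, high-quality count, and population reported in Table 1. The variable names are chosen for this illustration only.

```python
# Values taken from the Tuvalu row of Table 1.
article_count = 54
highquality_count = 4
population = 10_000

# Articles per population, as a percentage.
pct_articles_per_population = article_count / population * 100

# High-quality articles as a percentage of all articles.
pct_high_quality = highquality_count / article_count * 100

print(round(pct_articles_per_population, 6))
print(round(pct_high_quality, 6))
```

Both rounded values agree with the last two columns of the Tuvalu row.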

**Sort by percentage of articles per population descending**

For the first two tables, we want to find the top and bottom countries sorted by percentage of articles per population. Thus, we sort percentage of articles per population in descending order.

In [21]:
# Sort by percentage of articles per population

articles_per_pop = merge_all_country.sort_values('Percentage of articles per population',ascending=False)

### Table 1: Top 10 countries by coverage

In [22]:
articles_per_pop.head(10)

Unnamed: 0,country,article_count,highquality_count,geographic region,population,Percentage of articles per population,Percentage of high quality articles
169,Tuvalu,54,4,OCEANIA,10000,0.54,7.407407
117,Nauru,52,0,OCEANIA,11000,0.472727,0.0
138,San Marino,81,0,SOUTHERN EUROPE,34000,0.238235,0.0
110,Monaco,40,0,WESTERN EUROPE,38000,0.105263,0.0
95,Liechtenstein,28,0,WESTERN EUROPE,39000,0.071795,0.0
104,Marshall Islands,37,0,OCEANIA,57000,0.064912,0.0
164,Tonga,63,0,OCEANIA,99000,0.063636,0.0
70,Iceland,201,2,NORTHERN EUROPE,368000,0.05462,0.995025
3,Andorra,34,0,SOUTHERN EUROPE,82000,0.041463,0.0
52,Federated States of Micronesia,36,0,OCEANIA,106000,0.033962,0.0


### Table 2: Bottom 10 countries by coverage

In [23]:
articles_per_pop.tail(10)

Unnamed: 0,country,article_count,highquality_count,geographic region,population,Percentage of articles per population,Percentage of high quality articles
13,Bangladesh,317,3,SOUTH ASIA,169809000,0.000187,0.946372
114,Mozambique,58,0,EASTERN AFRICA,31166000,0.000186,0.0
162,Thailand,112,3,SOUTHEAST ASIA,66534000,0.000168,2.678571
84,"Korea, North",36,8,EAST ASIA,25779000,0.00014,22.222222
181,Zambia,25,0,EASTERN AFRICA,18384000,0.000136,0.0
51,Ethiopia,101,2,EASTERN AFRICA,114916000,8.8e-05,1.980198
176,Uzbekistan,28,3,CENTRAL ASIA,34174000,8.2e-05,10.714286
34,China,1129,40,EAST ASIA,1402385000,8.1e-05,3.542958
72,Indonesia,209,9,SOUTHEAST ASIA,271739000,7.7e-05,4.30622
71,India,968,13,SOUTH ASIA,1400100000,6.9e-05,1.342975


**Sort by percentage of high quality articles descending**

For the next two tables, we want to find the top and bottom countries sorted by percentage of high quality articles. Thus, we sort percentage of high quality articles in descending order.

In [24]:
# Sort by percentage of high quality articles

percent_of_highquality = merge_all_country.sort_values('Percentage of high quality articles',ascending=False)

### Table 3: Top 10 countries by relative quality

In [25]:
percent_of_highquality.head(10)

Unnamed: 0,country,article_count,highquality_count,geographic region,population,Percentage of articles per population,Percentage of high quality articles
84,"Korea, North",36,8,EAST ASIA,25779000,0.00014,22.222222
140,Saudi Arabia,117,15,WESTERN ASIA,35041000,0.000334,12.820513
135,Romania,343,42,EASTERN EUROPE,19241000,0.001783,12.244898
31,Central African Republic,66,8,MIDDLE AFRICA,4830000,0.001366,12.121212
176,Uzbekistan,28,3,CENTRAL ASIA,34174000,8.2e-05,10.714286
106,Mauritania,48,5,WESTERN AFRICA,4650000,0.001032,10.416667
64,Guatemala,83,7,CENTRAL AMERICA,18066000,0.000459,8.433735
44,Dominica,12,1,CARIBBEAN,72000,0.016667,8.333333
158,Syria,128,10,WESTERN ASIA,19398000,0.00066,7.8125
18,Benin,91,7,WESTERN AFRICA,12209000,0.000745,7.692308


### Table 4: Bottom 10 countries by relative quality

In [26]:
percent_of_highquality.tail(10)

Unnamed: 0,country,article_count,highquality_count,geographic region,population,Percentage of articles per population,Percentage of high quality articles
138,San Marino,81,0,SOUTHERN EUROPE,34000,0.238235,0.0
139,Sao Tome and Principe,21,0,MIDDLE AFRICA,210000,0.01,0.0
12,Bahrain,42,0,WESTERN ASIA,1465000,0.002867,0.0
67,Guyana,20,0,SOUTH AMERICA,787000,0.002541,0.0
143,Seychelles,21,0,EASTERN AFRICA,98000,0.021429,0.0
63,Guadeloupe,49,0,CARIBBEAN,375000,0.013067,0.0
148,Solomon Islands,97,0,OCEANIA,715000,0.013566,0.0
62,Grenada,36,0,CARIBBEAN,113000,0.031858,0.0
30,Cape Verde,36,0,WESTERN AFRICA,556000,0.006475,0.0
11,Bahamas,20,0,CARIBBEAN,393000,0.005089,0.0


**Group by geographic region**

For the last two tables, we need to sum *article_count*, *highquality_count*, and *population* by geographic region. Then we recalculate *percentage of articles per population* and *percentage of high quality articles* from the new totals.

In [27]:
# Calculate total article_count, highquality_count, and population by geographic region

groupby_geo = merge_all_country.groupby(['geographic region'])[['article_count', 'highquality_count', 'population']].agg('sum')

In [28]:
# Calculate proportion values by geographic region

groupby_geo['Percentage of articles per population'] = (groupby_geo['article_count']/groupby_geo['population']) * 100
groupby_geo['Percentage of high quality articles'] = (groupby_geo['highquality_count']/groupby_geo['article_count']) * 100
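
The regional percentages are recomputed from summed totals rather than by averaging the per-country percentages, and the two approaches can differ a lot. The sketch below uses an invented region with one very small and one very large country (all numbers are made up) to show the gap.

```python
# Toy region with one very small and one very large country (invented numbers).
countries = [
    {"article_count": 50, "population": 10_000},
    {"article_count": 100, "population": 10_000_000},
]

# Approach used above: sum counts and population first, then divide.
total_articles = sum(c["article_count"] for c in countries)
total_population = sum(c["population"] for c in countries)
pooled_pct = total_articles / total_population * 100

# Alternative: average the per-country percentages, which lets the
# small country dominate the regional figure.
mean_of_pcts = sum(
    c["article_count"] / c["population"] * 100 for c in countries
) / len(countries)

print(pooled_pct, mean_of_pcts)
```

In this toy case the pooled figure is roughly 0.0015 percent while the average of percentages is about 0.25 percent, which illustrates why the totals are summed before dividing.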

### Table 5: Geographic regions by coverage

In [29]:
groupby_geo.sort_values('Percentage of articles per population',ascending=False)

geographic region,article_count,highquality_count,population,Percentage of articles per population,Percentage of high quality articles
OCEANIA,3126,63,42031000,0.007437,2.015355
NORTHERN EUROPE,3763,102,105680000,0.003561,2.710603
SOUTHERN EUROPE,3710,74,151136000,0.002455,1.994609
WESTERN EUROPE,4560,56,195479000,0.002333,1.22807
CARIBBEAN,695,13,39056000,0.001779,1.870504
EASTERN EUROPE,3732,118,281186000,0.001327,3.161844
SOUTHERN AFRICA,634,9,66628000,0.000952,1.419558
CENTRAL AMERICA,1543,23,162267000,0.000951,1.490603
WESTERN ASIA,2563,89,272499000,0.000941,3.472493
MIDDLE AFRICA,665,16,90189000,0.000737,2.406015


### Table 6: Geographic regions by relative quality

In [30]:
groupby_geo.sort_values('Percentage of high quality articles',ascending=False)

geographic region,article_count,highquality_count,population,Percentage of articles per population,Percentage of high quality articles
NORTHERN AMERICA,1901,104,368068000,0.000516,5.470805
SOUTHEAST ASIA,2020,73,660056000,0.000306,3.613861
WESTERN ASIA,2563,89,272499000,0.000941,3.472493
EASTERN EUROPE,3732,118,281186000,0.001327,3.161844
EAST ASIA,2473,76,1632883000,0.000151,3.07319
CENTRAL ASIA,245,7,74960000,0.000327,2.857143
NORTHERN EUROPE,3763,102,105680000,0.003561,2.710603
MIDDLE AFRICA,665,16,90189000,0.000737,2.406015
NORTHERN AFRICA,899,19,243748000,0.000369,2.113459
OCEANIA,3126,63,42031000,0.007437,2.015355


### 6. Reflection

Looking at the results in Tables 1 and 3, I was surprised to see that the U.S. and a lot of other first world countries did not make it to the top. One result in particular raised a red flag: Table 3 shows North Korea as the country with the highest percentage of high-quality articles. Shocked and confused by these results, I took a step back to understand what the analysis was trying to accomplish in the first place. Upon further reflection, I realized that the results include biases stemming from both the proportion calculations used in the analysis and the source of the data. In Table 1, it’s clear that the countries ranked at the top are there largely because of their very small populations. The proportion of articles per population is not a good measure in this case, since it is heavily skewed toward countries with small populations.

Before starting to work with the data, I expected the results to show the U.S. at the top for both proportion of articles per population and proportion of high-quality articles. My initial assumptions came from my belief that Wikipedia is mostly used in the U.S. and from the fact that the politician article data we analyzed were derived only from English Wikipedia. Because of this, it seemed to make sense that there would be more articles on politicians from the U.S., and more highly rated ones, given the wealth of available material on U.S. history and politicians.

After going through the processing and analysis, I realized that other factors contribute to bias as well. Besides the one mentioned above (using data from English Wikipedia articles only), I noticed that the ORES method of scoring articles is a poor indicator of the actual content of an article. The ORES MediaWiki page mentions that the article quality model is based on “predictions on structural characteristics of the article… it doesn’t evaluate the quality of the writing or whether or not there’s a tone problem”. This seems like a major source of inherent bias in the analysis. If the quality ratings are based solely on article structure, that undermines the rankings shown in Tables 3 and 4. Some articles may have fewer sources available to begin with, and some may have fewer sections because of how the information is presented; neither factor has anything to do with the actual quality of the writing.

Overall, this was a great learning experience for coding, querying APIs, and, most importantly, documenting and reflecting on the data and processes. It supported my understanding and practice of human-centered data science by increasing accountability and promoting reproducibility. For future work, I would suggest that researchers looking to correct the biases observed here look into obtaining more reliable data sources. Perhaps the scope of analysis could be expanded to all Wikipedia articles, or more focus could be given to defining and measuring the true quality of the content within the articles using NLP methods.