In [3]:
import csv
import pandas as pd
import numpy as np
import os
os.chdir("/content")

# Goal of the Project
The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. In this project, I combined a dataset of Wikipedia articles (`page_data.csv`) with a dataset of country populations (`WPDS_2020_data.csv`), and used a machine learning service called ORES to estimate the quality of each article. I then analyzed how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The analysis consists of a series of tables that show:
*  the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
*  the countries with the highest and lowest proportion of high quality articles about politicians.
*  a ranking of geographic regions by articles-per-person and proportion of high quality articles.


# Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.
The population data is available in CSV format as `WPDS_2020_data.csv`. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.


In [4]:
page_data = pd.read_csv("page_data.csv")
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [5]:
WPDS_data = pd.read_csv("WPDS_2020_data.csv")
WPDS_data

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.850,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000


# Step 2: Cleaning the Data

In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and are therefore excluded from the analysis.

In [6]:
filtered_page_data = page_data.loc[~page_data.page.str.startswith("Template:")]
filtered_page_data.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [7]:
print("Number of rows and columns in filtered page_data DataFrame", filtered_page_data.shape)

Number of rows and columns in filtered page_data DataFrame (46701, 3)


Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having 'Sub-Region' values in the 'Type' field. These rows won't match the country values in page_data.csv, but they are retained so that we can report coverage and quality by region in the analysis section.

In [8]:
WPDS_data['Type'].value_counts()

Country       209
Sub-Region     24
World           1
Name: Type, dtype: int64

In [9]:
#regional_rows = WPDS_data["Name"].str.isupper()
#country_level_counts = WPDS_data[~regional_rows]
#country_level_counts['Type'].value_counts()
#country_level_counts.loc[country_level_counts['Type'] == 'Sub-Region']
#this still returns 168	Channel Islands	which is a Sub-Region

country_rows = WPDS_data.loc[WPDS_data['Type'] == 'Country']
regional_rows = WPDS_data.loc[(WPDS_data['Type'] == 'Sub-Region')]
regional_rows = regional_rows.drop(columns=['FIPS', 'Type', 'TimeFrame', 'Data (M)'])

For the analysis below, we also need a mapper that maps each country to the name of the region or sub-region it belongs to.

In [10]:
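# Walk the rows in file order: each 'Sub-Region' row sets the current region label,
# and every 'Country' row that follows is assigned to that label.
# Note: 'Channel Islands' is typed 'Sub-Region' in WPDS_2020_data.csv, so it also
# acts as a region label for the country rows that come after it.
region = ""
region_country_mapper = []
for _, row in WPDS_data.iterrows():
    if row["Type"] == 'Sub-Region':
        region = row["Name"]
    elif row["Type"] == 'Country':
        region_country_mapper.append({'region': region, 'country': row['Name']})
region_country_mapper = pd.DataFrame(region_country_mapper)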
region_country_mapper

Unnamed: 0,region,country
0,NORTHERN AFRICA,Algeria
1,NORTHERN AFRICA,Egypt
2,NORTHERN AFRICA,Libya
3,NORTHERN AFRICA,Morocco
4,NORTHERN AFRICA,Sudan
...,...,...
204,OCEANIA,Samoa
205,OCEANIA,Solomon Islands
206,OCEANIA,Tonga
207,OCEANIA,Tuvalu
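
The same country-to-region mapping can also be built without an explicit loop. The following is a minimal sketch on a small made-up table (the `toy` rows and the `toy_mapper` name are illustrative and not taken from `WPDS_2020_data.csv`): it keeps the name only on 'Sub-Region' rows, forward-fills it down the table, and then keeps the 'Country' rows.

```python
import pandas as pd

# Toy stand-in for WPDS_data; the rows are illustrative only.
toy = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name only on 'Sub-Region' rows, then carry it forward to the rows below.
region_label = toy["Name"].where(toy["Type"] == "Sub-Region").ffill()

toy_mapper = (
    toy.assign(region=region_label)
       .loc[toy["Type"] == "Country", ["region", "Name"]]
       .rename(columns={"Name": "country"})
       .reset_index(drop=True)
)
print(toy_mapper)
```

As with the loop, every row typed 'Sub-Region' becomes the label for the country rows below it, so the result depends on the row order and on how the `Type` column is filled in.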


# Step 3: Getting Article Quality Predictions

The goal is to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1.   FA - Featured article
2.   GA - Good article
3.   B - B-class article
4.   C - C-class article
5.   Start - Start-class article
6.   Stub - Stub-class article

These classes were learned from articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. They are a subset of the quality assessment categories developed by Wikipedia editors.
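
Because the classes have a natural order, it can help to store them as an ordered categorical. The sketch below is one possible way to do that in pandas; the `QUALITY_ORDER` and `preds` names and the sample values are illustrative and are not real ORES output.

```python
import pandas as pd

# Quality classes listed from worst to best so comparisons read naturally.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]
quality_dtype = pd.CategoricalDtype(categories=QUALITY_ORDER, ordered=True)

# Illustrative predictions, not real ORES output.
preds = pd.Series(["Stub", "GA", "C", "FA", "Start"]).astype(quality_dtype)

# Ordered categoricals support comparisons against a category label.
is_high_quality = preds >= "GA"
print(is_high_quality.tolist())
print(preds.sort_values().tolist())
```

With this dtype, a threshold such as "GA or better" becomes a single comparison, and sorting follows the quality order rather than the alphabet.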

In this project, I chose to install and run the ORES client (Python only).
Please see the installation instructions here: https://github.com/wikimedia/ores.

In [11]:
pip install ores

Collecting ores
  Downloading ores-1.4.0-py2.py3-none-any.whl (160 kB)
Collecting celery<4.1.999,>=4.1.1
  Downloading celery-4.1.1-py2.py3-none-any.whl (394 kB)
Collecting yamlconf<0.2.999,>=0.2.4
  Downloading yamlconf-0.2.4-py3-none-any.whl (9.1 kB)
Collecting flask-jsonpify<1.5.999,>=1.5.0
  Downloading Flask-Jsonpify-1.5.0.tar.gz (3.0 kB)
Collecting flask-swaggerui<0.0.999,>=0.0.1
  Downloading flask_swaggerui-0.0.1-py2.py3-none-any.whl (740 kB)
Collecting flask<1.0.999,>=1.0.2
  Downloading Flask-1.0.4-py2.py3-none-any.whl (92 kB)
Collecting flask-wikimediaui<0.0.999,>=0.0.1
  Downloading flask_wikimediaui-0.0.1-py2.py3-none-any.whl (2.5 MB)
Collecting pyyaml=

In [12]:
from ores import api
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 A2 <ningjis@uw.edu>")

rev_ids = filtered_page_data.rev_id.values
results = ores_session.score("enwiki", ["articlequality"], rev_ids)

#for score in results:
#    print(score)

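quality_est_results = []
error_revids = []  # revisions for which ORES returned no usable prediction
for rev_id, result in zip(rev_ids, results):
    try:
        prediction = result['articlequality']['score']['prediction']
    except (KeyError, TypeError):
        # no score in the response for this revision; record it and move on
        error_revids.append(rev_id)
    else:
        quality_est_results.append({'rev_id': rev_id, 'article_quality_est': prediction})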
quality_est_results = pd.DataFrame(quality_est_results)
quality_est_results

Unnamed: 0,rev_id,article_quality_est
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub
...,...,...
46420,807481636,C
46421,807482007,GA
46422,807483006,C
46423,807483153,GA
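
The parsing step can be isolated into a small helper that is easy to test without calling the service. This is a sketch under two assumptions: each result has the nested shape used in the cell above (`result['articlequality']['score']['prediction']`), and a failed lookup carries an `error` entry in place of `score`. The dictionaries, the second revision ID, and the `extract_prediction` name are hand-written for illustration.

```python
def extract_prediction(result):
    """Return the predicted class, or None when the result carries no score."""
    model_result = result.get("articlequality", {})
    score = model_result.get("score")
    if score is None:
        return None
    return score.get("prediction")

# Hand-written examples; the error payload is an assumed shape, not captured output.
mock_results = {
    355319463: {"articlequality": {"score": {"prediction": "Stub"}}},
    123456789: {"articlequality": {"error": {"type": "RevisionNotFound"}}},
}

scored, unscored = {}, []
for rev_id, result in mock_results.items():
    prediction = extract_prediction(result)
    if prediction is None:
        unscored.append(rev_id)
    else:
        scored[rev_id] = prediction

print(scored)
print(unscored)
```

Keeping the unscored revision IDs in their own list makes it straightforward to report or save them separately.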


# Step 3: Combining the Datasets

The goal is to merge the Wikipedia data and population data together using country names. After merging the data, I found that some of the entries cannot be merged because either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

All rows that do not have matching data are output to a CSV file called *wp_wpds_countries-no_match.csv*.

The remaining data are consolidated into a single CSV file called *wp_wpds_politicians_by_country.csv*, with the following headers:
*   country
*   article_name
*   revision_id
*   article_quality_est
*   population
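
One way to separate matched from unmatched rows in a single pass is an outer merge with pandas' `indicator=True` option. The sketch below uses toy tables; "Atlantis" and "Narnia" are placeholder names, and only the Chad population figure is taken from the data shown in this notebook.

```python
import pandas as pd

# Toy tables; names and values are illustrative only.
articles = pd.DataFrame({
    "country": ["Chad", "Chad", "Atlantis"],
    "article_name": ["A", "B", "C"],
})
population = pd.DataFrame({
    "Name": ["Chad", "Narnia"],
    "Population": [16877000, 1000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = articles.merge(population, left_on="country", right_on="Name",
                        how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns=["Name", "_merge"])
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(len(matched), len(no_match))
```

Rows tagged `left_only` are articles whose country has no population entry, and rows tagged `right_only` are population entries with no articles.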






In [13]:
page_data_with_preds = filtered_page_data.merge(quality_est_results, on='rev_id')
page_data_with_preds

Unnamed: 0,page,country,rev_id,article_quality_est
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...
46420,Hal Bidlack,United States,807481636,C
46421,Yahya Jammeh,Gambia,807482007,GA
46422,Lucius Fairchild,United States,807483006,C
46423,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA


In [14]:
# Some of the entries cannot be merged
# because either the population dataset does not have an entry for the equivalent Wikipedia country,
# or vice versa.
valid_country_page = page_data_with_preds.country.isin(country_rows.Name.unique())
page_invalid_country = page_data_with_preds[~valid_country_page]

valid_country_wpds = country_rows.Name.isin(page_data_with_preds.country.unique())
wpds_invalid_country = country_rows[~valid_country_wpds]

countries_no_match = wpds_invalid_country.merge(page_invalid_country, left_on='Name', right_on='country', how='outer')
countries_no_match.to_csv('wp_wpds_countries-no_match.csv', index=False)

In [15]:
page_valid_country = page_data_with_preds[valid_country_page]
wpds_valid_country = country_rows[valid_country_wpds]
page_valid_country = page_valid_country.rename(columns={"page": "article_name", "rev_id": "revision_id"})

final_page = page_valid_country.merge(wpds_valid_country, left_on='country', right_on='Name')
final_page = final_page.drop(columns=['FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)'])
final_page.to_csv('wp_wpds_politicians_by_country.csv', index=False)
final_page

Unnamed: 0,article_name,country,revision_id,article_quality_est,Population
0,Bir I of Kanem,Chad,355319463,Stub,16877000
1,Abdullah II of Kanem,Chad,498683267,Stub,16877000
2,Salmama II of Kanem,Chad,565745353,Stub,16877000
3,Kuri I of Kanem,Chad,565745365,Stub,16877000
4,Mohammed I of Kanem,Chad,565745375,Stub,16877000
...,...,...,...,...,...
44563,Rita Sinon,Seychelles,800323154,Stub,98000
44564,Sylvette Frichot,Seychelles,800323798,Stub,98000
44565,May De Silva,Seychelles,800969960,Start,98000
44566,Vincent Meriton,Seychelles,802051093,Stub,98000


# Step 4: Analysis

The analysis consists of calculating, as percentages, the articles-per-population and the proportion of high-quality articles for each country AND for each geographic region. Here, "high-quality" articles are the articles about politicians in a given country that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class.
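
As a worked example of the two percentages described above, the sketch below uses a made-up country "X" with 10 articles, 2 of them FA or GA, and a population of 10,000. The column names in `summary` are illustrative.

```python
import pandas as pd

# Toy data: one country, 10 articles, 2 of them high quality, population 10,000.
toy = pd.DataFrame({
    "country": ["X"] * 10,
    "article_quality_est": ["FA", "GA"] + ["Stub"] * 8,
    "Population": [10_000] * 10,
})
toy["is_high_quality"] = toy["article_quality_est"].isin(["FA", "GA"])

summary = toy.groupby("country").agg(
    total_articles=("is_high_quality", "size"),
    high_quality=("is_high_quality", "sum"),
    population=("Population", "first"),
)

# Share of articles that are FA or GA, as a percentage.
summary["quality_rate_pct"] = summary["high_quality"] * 100.0 / summary["total_articles"]
# Articles per person, as a percentage of the population.
summary["articles_per_population_pct"] = (
    summary["total_articles"] * 100.0 / summary["population"]
)
print(summary)
```

For this toy country the quality rate is 20% (2 of 10) and the articles-per-population figure is 0.1% (10 of 10,000).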

In [16]:
final_page['is_high_quality_article'] = final_page.article_quality_est.isin(['FA', 'GA'])
article_count = pd.crosstab(final_page['country'], final_page['is_high_quality_article'], margins=True, margins_name="Total-articles")
article_country = article_count

### Country level analysis:

In [17]:
# quality rate:
# if a country has 10 articles about politicians, and 2 of them are FA or GA class articles
# then the percentage of high-quality articles would be 20%.

article_country['quality rate'] = article_country[True] * 100.0 / article_country['Total-articles']

# Coverage
# if a country has a population of 10,000 people, 
# and you found 10 FA or GA class articles about politicians from that country,
# then the percentage of articles-per-population would be .1%.
article_country = article_count.merge(wpds_valid_country, left_on='country', right_on='Name')
article_country = article_country.drop(columns=['FIPS', 'Type', 'TimeFrame', 'Data (M)'])
article_country = article_country.rename(columns={"Name": "country", False: "bad-articles", True: "good-articles"})
article_country['Coverage'] = article_country["good-articles"] * 100.0 / article_country['Population']

article_country.insert(0, 'country', article_country.pop('country'))
article_country

Unnamed: 0,country,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,Afghanistan,306,13,319,4.075235,38928000,0.000033
1,Albania,453,3,456,0.657895,2838000,0.000106
2,Algeria,114,2,116,1.724138,44357000,0.000005
3,Andorra,34,0,34,0.000000,82000,0.000000
4,Angola,106,0,106,0.000000,32522000,0.000000
...,...,...,...,...,...,...,...
178,Venezuela,127,3,130,2.307692,28645000,0.000010
179,Vietnam,174,13,187,6.951872,96209000,0.000014
180,Yemen,113,3,116,2.586207,29826000,0.000010
181,Zambia,25,0,25,0.000000,18384000,0.000000


### Region level analysis:


This step sums the good and bad article counts for each region. For ASIA, AFRICA, EUROPE, and LATIN AMERICA AND THE CARIBBEAN, we need to sum the counts of their sub-regions. For example, the counts for LATIN AMERICA AND THE CARIBBEAN are the sum of the counts for CARIBBEAN, CENTRAL AMERICA, and SOUTH AMERICA. This is useful for comparing the coverage or quality rate of a sub-region (e.g. EAST ASIA) with that of its parent region (e.g. ASIA), which serves as an aggregate baseline for all its sub-regions.
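
The roll-up from sub-regions to parent regions can also be expressed with an explicit lookup table. The sketch below is one possible approach on made-up counts; the `parent_of` dictionary is a hypothetical mapping written for this example, not something provided by the population dataset.

```python
import pandas as pd

# Illustrative sub-region totals; the numbers are made up.
sub_totals = pd.DataFrame(
    {"good-articles": [1, 2, 3, 4], "Total-articles": [10, 20, 30, 40]},
    index=["CARIBBEAN", "CENTRAL AMERICA", "SOUTH AMERICA", "EAST ASIA"],
)

# Hypothetical explicit mapping from sub-region to parent region.
parent_of = {
    "CARIBBEAN": "LATIN AMERICA AND THE CARIBBEAN",
    "CENTRAL AMERICA": "LATIN AMERICA AND THE CARIBBEAN",
    "SOUTH AMERICA": "LATIN AMERICA AND THE CARIBBEAN",
    "EAST ASIA": "ASIA",
}

# Group the sub-region rows by their parent label and add up the counts.
parent_totals = sub_totals.groupby(sub_totals.index.map(parent_of)).sum()
parent_totals["quality rate"] = (
    parent_totals["good-articles"] * 100.0 / parent_totals["Total-articles"]
)
print(parent_totals)
```

An explicit mapping avoids relying on substring matches in region names, at the cost of maintaining the lookup table by hand.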

In [18]:
# Use mapper to match countries and regions
article_region = article_count.merge(region_country_mapper, right_on='country', left_on='country')
article_region = article_region.groupby(by=["region"]).sum()

# Sum up sub-regions' counts. E.g. ASIA, AFRICA, EUROPE, and LATIN AMERICA AND THE CARIBBEAN
asia_rows = article_region.index.str.contains('ASIA')
asia_count = {False: [article_region.loc[asia_rows, False].sum()],
              True: [article_region.loc[asia_rows, True].sum()],
              'Total-articles': [article_region.loc[asia_rows, 'Total-articles'].sum()]}
asia_count = pd.DataFrame(asia_count, index=['ASIA'])
article_region = pd.concat([article_region, asia_count])

africa_rows = article_region.index.str.contains('AFRICA')
africa_count = {False: [article_region.loc[africa_rows, False].sum()],
                True: [article_region.loc[africa_rows, True].sum()],
                'Total-articles': [article_region.loc[africa_rows, 'Total-articles'].sum()]}
africa_count = pd.DataFrame(africa_count, index=['AFRICA'])
article_region = pd.concat([article_region, africa_count])

europe_rows = article_region.index.str.contains('EUROPE')
europe_count = {False: [article_region.loc[europe_rows, False].sum()],
                True: [article_region.loc[europe_rows, True].sum()],
                'Total-articles': [article_region.loc[europe_rows, 'Total-articles'].sum()]}
europe_count = pd.DataFrame(europe_count, index=['EUROPE'])
article_region = pd.concat([article_region, europe_count])

latin_rows = ['CARIBBEAN', 'CENTRAL AMERICA', 'SOUTH AMERICA']
latin_count = {False: [article_region.loc[latin_rows, False].sum()],
               True: [article_region.loc[latin_rows, True].sum()],
               'Total-articles': [article_region.loc[latin_rows, 'Total-articles'].sum()]}
latin_count = pd.DataFrame(latin_count, index=['LATIN AMERICA AND THE CARIBBEAN'])
article_region = pd.concat([article_region, latin_count])

# Merge table to get population column
article_region = article_region.merge(regional_rows, left_on=article_region.index, right_on='Name')

# Calculate quality rate
article_region['quality rate'] = article_region[True] * 100.0 / article_region['Total-articles']

# Calculate Coverage
article_region = article_region.rename(columns={"Name": "Region", False: "bad-articles", True: "good-articles"})
article_region['Coverage'] = article_region["good-articles"] * 100.0 / article_region['Population']

article_region.insert(0, 'Region', article_region.pop('Region'))
article_region

Unnamed: 0,Region,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,CARIBBEAN,682,13,695,1.870504,43233000,3e-05
1,CENTRAL AMERICA,1520,23,1543,1.490603,178611000,1.3e-05
2,CENTRAL ASIA,238,7,245,2.857143,74961000,9e-06
3,Channel Islands,3661,102,3763,2.710603,172000,0.059302
4,EAST ASIA,2397,76,2473,3.07319,1641063000,5e-06
5,EASTERN AFRICA,2467,35,2502,1.398881,444970000,8e-06
6,EASTERN EUROPE,3614,118,3732,3.161844,291902000,4e-05
7,MIDDLE AFRICA,649,16,665,2.406015,179757000,9e-06
8,NORTHERN AFRICA,880,19,899,2.113459,244344000,8e-06
9,NORTHERN AMERICA,1797,104,1901,5.470805,368193000,2.8e-05


# Step 5: Results

1.   Top 10 countries by coverage: 10 highest-ranked countries by the `Coverage` column, which is computed above as the number of high-quality (FA or GA) politician articles as a percentage of country population.

In [19]:
article_country.sort_values('Coverage', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,Tuvalu,50,4,54,7.407407,10000,0.04
1,Dominica,11,1,12,8.333333,72000,0.001389
2,Vanuatu,55,3,58,5.172414,321000,0.000935
3,Iceland,199,2,201,0.995025,368000,0.000543
4,Ireland,348,25,373,6.702413,5003000,0.0005
5,Montenegro,70,2,72,2.777778,622000,0.000322
6,Martinique,33,1,34,2.941176,356000,0.000281
7,Bhutan,31,2,33,6.060606,730000,0.000274
8,New Zealand,770,13,783,1.660281,4987000,0.000261
9,Romania,301,42,343,12.244898,19241000,0.000218


2.   Bottom 10 countries by coverage: 10 lowest-ranked countries by the `Coverage` column, which is computed above as the number of high-quality (FA or GA) politician articles as a percentage of country population.

In [20]:
article_country.sort_values('Coverage').reset_index(drop=True).head(10)

Unnamed: 0,country,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,Finland,569,0,569,0.0,5529000,0.0
1,Comoros,51,0,51,0.0,870000,0.0
2,Costa Rica,147,0,147,0.0,5111000,0.0
3,Djibouti,37,0,37,0.0,988000,0.0
4,Eritrea,16,0,16,0.0,3546000,0.0
5,Estonia,148,0,148,0.0,1331000,0.0
6,Federated States of Micronesia,36,0,36,0.0,106000,0.0
7,French Guiana,27,0,27,0.0,294000,0.0
8,Solomon Islands,97,0,97,0.0,715000,0.0
9,Grenada,36,0,36,0.0,113000,0.0
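
Every row in the table above has a `Coverage` of 0.0, so which ten tied countries appear depends on how the sort orders ties. The sketch below shows two possible ways to make that explicit on a toy table: a secondary sort key, or `nsmallest` with `keep="all"`. The data and variable names are illustrative.

```python
import pandas as pd

# Toy table with tied coverage values; the numbers are illustrative.
toy = pd.DataFrame({
    "country": ["C", "A", "B", "D"],
    "Coverage": [0.0, 0.0, 0.0, 0.5],
})

# Break ties on Coverage by country name so repeated runs give the same order.
bottom = toy.sort_values(["Coverage", "country"]).reset_index(drop=True).head(2)

# Alternatively, keep every row tied with the cut-off value.
bottom_with_ties = toy.nsmallest(2, "Coverage", keep="all")

print(bottom["country"].tolist())
print(len(bottom_with_ties))
```

The first option gives a reproducible but still arbitrary cut among ties; the second returns more than the requested number of rows whenever the cut-off value is shared.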


3.   Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [21]:
article_country.sort_values('quality rate', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,"Korea, North",28,8,36,22.222222,25779000,3.1e-05
1,Saudi Arabia,102,15,117,12.820513,35041000,4.3e-05
2,Romania,301,42,343,12.244898,19241000,0.000218
3,Central African Republic,58,8,66,12.121212,4830000,0.000166
4,Uzbekistan,25,3,28,10.714286,34174000,9e-06
5,Mauritania,43,5,48,10.416667,4650000,0.000108
6,Guatemala,76,7,83,8.433735,18066000,3.9e-05
7,Dominica,11,1,12,8.333333,72000,0.001389
8,Syria,118,10,128,7.8125,19398000,5.2e-05
9,Benin,84,7,91,7.692308,12209000,5.7e-05


4.   Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [22]:
article_country.sort_values('quality rate').reset_index(drop=True).head(10)

Unnamed: 0,country,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,Solomon Islands,97,0,97,0.0,715000,0.0
1,Tonga,63,0,63,0.0,99000,0.0
2,Nauru,52,0,52,0.0,11000,0.0
3,Namibia,162,0,162,0.0,2541000,0.0
4,Djibouti,37,0,37,0.0,988000,0.0
5,Mozambique,58,0,58,0.0,31166000,0.0
6,Monaco,40,0,40,0.0,38000,0.0
7,Eritrea,16,0,16,0.0,3546000,0.0
8,Estonia,148,0,148,0.0,1331000,0.0
9,Moldova,421,0,421,0.0,3535000,0.0


5.   Geographic regions by coverage: Ranking of geographic regions (in descending order) by the `Coverage` column, which is computed above as the count of high-quality (FA or GA) politician articles from countries in each region as a percentage of total regional population.

In [23]:
article_region.sort_values('Coverage', ascending=False).reset_index(drop=True)

Unnamed: 0,Region,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,Channel Islands,3661,102,3763,2.710603,172000,0.059302
1,OCEANIA,3063,63,3126,2.015355,43155000,0.000146
2,SOUTHERN EUROPE,3636,74,3710,1.994609,153251000,4.8e-05
3,EASTERN EUROPE,3614,118,3732,3.161844,291902000,4e-05
4,EUROPE,11754,248,12002,2.066322,746622000,3.3e-05
5,WESTERN ASIA,2474,89,2563,3.472493,280927000,3.2e-05
6,CARIBBEAN,682,13,695,1.870504,43233000,3e-05
7,WESTERN EUROPE,4504,56,4560,1.22807,195479000,2.9e-05
8,NORTHERN AMERICA,1797,104,1901,5.470805,368193000,2.8e-05
9,SOUTHERN AFRICA,625,9,634,1.419558,67732000,1.3e-05


6.   Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [24]:
article_region.sort_values('quality rate', ascending=False).reset_index(drop=True)

Unnamed: 0,Region,bad-articles,good-articles,Total-articles,quality rate,Population,Coverage
0,NORTHERN AMERICA,1797,104,1901,5.470805,368193000,2.8e-05
1,SOUTHEAST ASIA,1947,73,2020,3.613861,661845000,1.1e-05
2,WESTERN ASIA,2474,89,2563,3.472493,280927000,3.2e-05
3,EASTERN EUROPE,3614,118,3732,3.161844,291902000,4e-05
4,EAST ASIA,2397,76,2473,3.07319,1641063000,5e-06
5,CENTRAL ASIA,238,7,245,2.857143,74961000,9e-06
6,Channel Islands,3661,102,3763,2.710603,172000,0.059302
7,ASIA,11351,316,11667,2.708494,4625927000,7e-06
8,MIDDLE AFRICA,649,16,665,2.406015,179757000,9e-06
9,NORTHERN AFRICA,880,19,899,2.113459,244344000,8e-06


# Reflections and Implications

Prior to this analysis, I expected the highest-ranked countries and regions, both by the number of politician articles as a proportion of population and by the proportion of politician articles that are of GA or FA quality, to be English-speaking, because the dataset is drawn entirely from English Wikipedia. However, I was surprised to see that the top 10 countries by relative article quality include North Korea and Saudi Arabia. One possible explanation is that these are among the most censored countries in the world, with tightly controlled political systems; my guess is that some of the articles were written by the authorities.

I also expected the countries with the lowest articles-per-population rate to have a relatively low article quality rate as well, and vice versa. This expectation is consistent with the results of my analysis. Moreover, the bottom 10 countries by coverage and the bottom 10 countries by relative quality all have relatively small populations. This is likely because politicians from smaller countries are not well known to people in other countries.



*   **What biases did you expect to find in the data (before you started working with it), and why?**
> The source data, page_data.csv, which is the "Politicians by Country from the English-language Wikipedia" dataset, pulls only English-language Wikipedia pages. Naturally, I expected non-English-speaking countries and regions to have fewer articles written in English, and lower-quality English articles, compared to English-speaking countries.


*   **What might your results suggest about (English) Wikipedia as a data source?**
> If we are using English Wikipedia as the data source, I would suggest supplementing it with the Wikipedia editions in the native languages of the corresponding countries/regions to get more accurate results.


*   **Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite its inherent limitations and biases?**
>  The biases mentioned above are less of a problem if we use these data for analysis within English-speaking countries, such as looking at individual politicians in the United States.