# A2 - Bias in Data Assignment

Chang Xu

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, I combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

I perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. My analysis will consist of a series of tables that show:
1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geog

### import packages

In [2]:
import pandas as pd
import numpy as np
import os
import requests
import json

## Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called page_data.csv.

In [3]:
page_data_raw = pd.read_csv("raw_data/page_data.csv")
print(page_data_raw.head())
print(page_data_raw.shape)

                                 page   country     rev_id
0  Template:ZambiaProvincialMinisters    Zambia  235107991
1                      Bir I of Kanem      Chad  355319463
2   Template:Zimbabwe-politician-stub  Zimbabwe  391862046
3     Template:Uganda-politician-stub    Uganda  391862070
4    Template:Namibia-politician-stub   Namibia  391862409
(47197, 3)


The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

In [4]:
wpds_data_raw = pd.read_csv("raw_data/WPDS_2020_data.csv")
print(wpds_data_raw.head())
print(wpds_data_raw.shape)

              FIPS             Name        Type  TimeFrame  Data (M)  \
0            WORLD            WORLD       World       2019  7772.850   
1           AFRICA           AFRICA  Sub-Region       2019  1337.918   
2  NORTHERN AFRICA  NORTHERN AFRICA  Sub-Region       2019   244.344   
3               DZ          Algeria     Country       2019    44.357   
4               EG            Egypt     Country       2019   100.803   

   Population  
0  7772850000  
1  1337918000  
2   244344000  
3    44357000  
4   100803000  
(234, 6)


## Step 2: Cleaning the Data

Both page_data.csv and WPDS_2020_data.csv contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of page_data.csv, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.

In [5]:
# get the index of pages whose page names starts with "Template:"
template_rows = page_data_raw['page'].str.startswith('Template:')

# drop the slected pages
page_data_clean = page_data_raw.drop(index = page_data_raw[template_rows].index)
print(page_data_clean.head())
print(page_data_clean.shape)

                                                 page                country  \
1                                      Bir I of Kanem                   Chad   
10  Information Minister of the Palestinian Nation...  Palestinian Territory   
12                                            Yos Por               Cambodia   
23                                       Julius Gregr         Czech Republic   
24                                       Edvard Gregr         Czech Republic   

       rev_id  
1   355319463  
10  393276188  
12  393822005  
23  395521877  
24  395526568  
(46701, 3)


In [6]:
# reset index
page_data_clean = page_data_clean.reset_index(drop=True)
print(page_data_clean.head())

                                                page                country  \
0                                     Bir I of Kanem                   Chad   
1  Information Minister of the Palestinian Nation...  Palestinian Territory   
2                                            Yos Por               Cambodia   
3                                       Julius Gregr         Czech Republic   
4                                       Edvard Gregr         Czech Republic   

      rev_id  
0  355319463  
1  393276188  
2  393822005  
3  395521877  
4  395526568  


Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.

In [9]:
wpds_data_clean = wpds_data_raw[wpds_data_raw['Name'].str.isupper() == False]
print(wpds_data_clean.head())
print(wpds_data_clean.shape)

  FIPS     Name     Type  TimeFrame  Data (M)  Population
3   DZ  Algeria  Country       2019    44.357    44357000
4   EG    Egypt  Country       2019   100.803   100803000
5   LY    Libya  Country       2019     6.891     6891000
6   MA  Morocco  Country       2019    35.952    35952000
7   SD    Sudan  Country       2019    43.849    43849000
(210, 6)


In [10]:
# reset index
wpds_data_clean = wpds_data_clean.reset_index(drop=True)
print(wpds_data_clean.head())

  FIPS     Name     Type  TimeFrame  Data (M)  Population
0   DZ  Algeria  Country       2019    44.357    44357000
1   EG    Egypt  Country       2019   100.803   100803000
2   LY    Libya  Country       2019     6.891     6891000
3   MA  Morocco  Country       2019    35.952    35952000
4   SD    Sudan  Country       2019    43.849    43849000


## Step 3: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures.These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these 6 categories to any rev_id you send it.

In order to get article predictions for each article in the Wikipedia dataset, we will first need to read page_data.csv into Python (or R), and then read through the dataset line by line, using the value of the rev_id column to make an API query.

In [11]:
rev_ids = page_data_clean.rev_id
print(rev_ids)

0        355319463
1        393276188
2        393822005
3        395521877
4        395526568
           ...    
46696    807482007
46697    807483006
46698    807483153
46699    807483270
46700    807484325
Name: rev_id, Length: 46701, dtype: int64


In [12]:
# construct the correct url for rev_id API
url = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'

# header info
headers = {
    'User-Agent': 'https://github.com/maruchang',
    'From': 'xuchang@uw.edu'
}

In [13]:
# define the function for single API call
def api_call(rev_ids):
    call = requests.get(url.format(rev_id = rev_ids), headers=headers)
    response = call.json()
    return response

In [14]:
# define the function for batch API call
def api_call_batch(batch_size, rev_ids):
    # list to return
    batches = []
    
    # length of rev_ids
    n = len(rev_ids)
    i = 0
    
    while i < n:
        start = i
        # if do not exceed length of rev_ids
        if i + batch_size < n:
            end = i + batch_size
        else:
            end = n

        batches.append('|'.join(str(x) for x in rev_ids[start : end]))
        # assign end index to i
        i = end

    return batches

In [15]:
batches = api_call_batch(50, rev_ids)
print(batches)

['355319463|393276188|393822005|395521877|395526568|401577829|442937236|448555418|470173494|477962574|492060822|492964343|498683267|502721672|516633096|521986779|532253442|543225630|545936100|546364151|549300521|550682925|550953646|559553872|559788982|560758943|561744402|564873005|565745353|565745365|565745375|566504165|573710096|574571582|576988466|585894477|592289232|595693452|596181202|598819900|601122766|601127343|614786300|623004627|623334577|624468970|625509885|626606789|627001041|627051151', '627432937|627547024|628261896|628268705|628270736|628312759|628379479|628563978|628619000|628766656|628988952|629562076|629818376|630396351|630396786|630704768|631437331|631581752|632008524|632261377|632447328|633612729|634032715|635240253|635814126|636911471|637801253|638214719|638362866|638377138|638566016|638571205|638599355|639021339|639061161|639471171|640014648|640214913|640826254|641422326|643410335|643746000|643932216|643932220|643932225|643932226|643932239|643932242|644024203|64404

In [16]:
def parse_json(response):
    # things to return
    revids = []
    scores = []
    unscored = []
    
    for i in response['enwiki']['scores']:
        try:
            scores.append(response['enwiki']['scores'][i]['articlequality']['score']['prediction'])
            revids.append(i)
            
        except KeyError:
            unscored.append(i)

    return (revids, scores, unscored)

Get the list for rev_ids, scores, and all rev_ids for unscored pages

In [17]:
all_revids = []
all_scores = []
all_unscored = []

for i, batch in enumerate(batches):
    response = api_call(batch)
    revids, scores, unscored = parse_json(response)
    
    all_revids.extend(revids)
    all_scores.extend(scores)
    all_unscored.extend(unscored)

In [18]:
prediction = pd.DataFrame({'rev_id': all_revids, 'score': all_scores})
prediction

Unnamed: 0,rev_id,score
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub
...,...,...
46420,807481636,C
46421,807482007,GA
46422,807483006,C
46423,807483153,GA


It's possible that we will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. The choice is up to you.

In [19]:
# Print rev_ids for articles that cannot be scored
print(all_unscored)

['516633096', '550682925', '627547024', '636911471', '669987106', '671484594', '680981536', '684023803', '684023859', '696608092', '698572327', '699260156', '703773782', '706204833', '706810694', '708482569', '708813010', '709508670', '710135228', '710311600', '710715953', '711224007', '711288191', '711513274', '712411818', '712872338', '712872421', '712872473', '712872531', '712873183', '712873308', '712873386', '712878000', '712878267', '712878343', '712878396', '712881543', '712881676', '712881741', '712881882', '712889562', '712889594', '712889683', '712889781', '712889809', '712891291', '712891354', '712891378', '712891476', '713368646', '713381693', '714352602', '715273866', '715457941', '715534283', '715978328', '717536136', '717891895', '717917231', '717927381', '718090116', '719342595', '719521006', '719581803', '719981739', '720054719', '720154872', '720164068', '720356159', '720688837', '720856841', '720924221', '720953589', '720959757', '720993927', '721106120', '721509220'

## Step 3: Combining the Datasets

Some processing of the data will be necessary! In particular, we'll need to - after retrieving and including the ORES data for each article - merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. 

In [20]:
prediction['rev_id'] = prediction['rev_id'].astype(int)
merged_by_revid = page_data_clean.merge(prediction, on='rev_id')
merged_by_revid

Unnamed: 0,page,country,rev_id,score
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...
46420,Hal Bidlack,United States,807481636,C
46421,Yahya Jammeh,Gambia,807482007,GA
46422,Lucius Fairchild,United States,807483006,C
46423,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA


In [21]:
# merge by country
merged_all = pd.merge(wpds_data_clean, merged_by_revid, how = 'left', left_on = 'Name', right_on = 'country')
merged_all

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,page,country,rev_id,score
0,DZ,Algeria,Country,2019,44.357,44357000,Ali Fawzi Rebaine,Algeria,686269631.0,Stub
1,DZ,Algeria,Country,2019,44.357,44357000,Ahmed Attaf,Algeria,705910185.0,Stub
2,DZ,Algeria,Country,2019,44.357,44357000,Ahmed Djoghlaf,Algeria,707427823.0,Stub
3,DZ,Algeria,Country,2019,44.357,44357000,Hammi Larouissi,Algeria,708060571.0,Stub
4,DZ,Algeria,Country,2019,44.357,44357000,Salah Goudjil,Algeria,708980561.0,Stub
...,...,...,...,...,...,...,...,...,...,...
44590,VU,Vanuatu,Country,2019,0.321,321000,Tallis Obed Moses,Vanuatu,799954279.0,Stub
44591,VU,Vanuatu,Country,2019,0.321,321000,Esmon Saimon,Vanuatu,799954813.0,Start
44592,VU,Vanuatu,Country,2019,0.321,321000,Baldwin Lonsdale,Vanuatu,799955662.0,C
44593,VU,Vanuatu,Country,2019,0.321,321000,Sela Molisa,Vanuatu,800106636.0,C


After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vise versa.

Please remove any rows that do not have matching data, and output them to a CSV file called: <br />
`wp_wpds_countries-no_match.csv`

In [22]:
# DataFrame.merge(right, how='inner', on=None, ...)
# outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
no_match_revid = pd.merge(prediction, page_data_clean, how = 'outer', on = 'rev_id')
no_match_revid

Unnamed: 0,rev_id,score,page,country
0,355319463,Stub,Bir I of Kanem,Chad
1,393276188,Stub,Information Minister of the Palestinian Nation...,Palestinian Territory
2,393822005,Stub,Yos Por,Cambodia
3,395521877,Stub,Julius Gregr,Czech Republic
4,395526568,Stub,Edvard Gregr,Czech Republic
...,...,...,...,...
46696,807336308,,John Rose (Trotskyist),United Kingdom
46697,807367030,,Jalal Movaghar,Iran
46698,807367166,,Mohsen Movaghar,Iran
46699,807479587,,King Gutierrez,Philippines


In [23]:
no_match_population = pd.merge(wpds_data_clean, merged_by_revid, how = 'outer', left_on = 'Name', right_on = 'country')
no_match_population

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,page,country,rev_id,score
0,DZ,Algeria,Country,2019.0,44.357,44357000.0,Ali Fawzi Rebaine,Algeria,686269631.0,Stub
1,DZ,Algeria,Country,2019.0,44.357,44357000.0,Ahmed Attaf,Algeria,705910185.0,Stub
2,DZ,Algeria,Country,2019.0,44.357,44357000.0,Ahmed Djoghlaf,Algeria,707427823.0,Stub
3,DZ,Algeria,Country,2019.0,44.357,44357000.0,Hammi Larouissi,Algeria,708060571.0,Stub
4,DZ,Algeria,Country,2019.0,44.357,44357000.0,Salah Goudjil,Algeria,708980561.0,Stub
...,...,...,...,...,...,...,...,...,...,...
46447,,,,,,,Dahir Riyale Kahin,Somaliland,798692052.0,Start
46448,,,,,,,Adan Ahmed Elmi,Somaliland,804143605.0,Stub
46449,,,,,,,Muhammad Haji Ibrahim Egal,Somaliland,805840190.0,C
46450,,,,,,,Hediya Yousef,Rojava,805873719.0,C


In [24]:
no_match_result = pd.concat([no_match_revid, no_match_population], axis = 0)
no_match_result

Unnamed: 0,rev_id,score,page,country,FIPS,Name,Type,TimeFrame,Data (M),Population
0,355319463.0,Stub,Bir I of Kanem,Chad,,,,,,
1,393276188.0,Stub,Information Minister of the Palestinian Nation...,Palestinian Territory,,,,,,
2,393822005.0,Stub,Yos Por,Cambodia,,,,,,
3,395521877.0,Stub,Julius Gregr,Czech Republic,,,,,,
4,395526568.0,Stub,Edvard Gregr,Czech Republic,,,,,,
...,...,...,...,...,...,...,...,...,...,...
46447,798692052.0,Start,Dahir Riyale Kahin,Somaliland,,,,,,
46448,804143605.0,Stub,Adan Ahmed Elmi,Somaliland,,,,,,
46449,805840190.0,C,Muhammad Haji Ibrahim Egal,Somaliland,,,,,,
46450,805873719.0,C,Hediya Yousef,Rojava,,,,,,


In [25]:
no_match_result.to_csv('wp_wpds_countries-no_match.csv')

Consolidate the remaining data into a single CSV file called: <br />
`wp_wpds_politicians_by_country.csv`

The schema for that file should look something like this:

|  Column    |
| ----------- |
| country     |
| article_name   |
| revision_id   |
| article_quality_est.  |
| population   |


Note: revision_id here is the same thing as `rev_id`, which you used to get scores from ORES.

Select the columns we are interested in. Also change into the order of schema

In [26]:
result_data = merged_all[['country', 'page', 'rev_id', 'score', 'Population']]
result_data

Unnamed: 0,country,page,rev_id,score,Population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000
...,...,...,...,...,...
44590,Vanuatu,Tallis Obed Moses,799954279.0,Stub,321000
44591,Vanuatu,Esmon Saimon,799954813.0,Start,321000
44592,Vanuatu,Baldwin Lonsdale,799955662.0,C,321000
44593,Vanuatu,Sela Molisa,800106636.0,C,321000


Rename the columns according to the schema

In [27]:
# rename the columns
result_data.columns = ['country', 'article_name', 'revision_id', 'article_quality_est.', 'population'] 
# pd.rename(columns={'oldname':'newname',...},inplace=True)
result_data

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Algeria,Ali Fawzi Rebaine,686269631.0,Stub,44357000
1,Algeria,Ahmed Attaf,705910185.0,Stub,44357000
2,Algeria,Ahmed Djoghlaf,707427823.0,Stub,44357000
3,Algeria,Hammi Larouissi,708060571.0,Stub,44357000
4,Algeria,Salah Goudjil,708980561.0,Stub,44357000
...,...,...,...,...,...
44590,Vanuatu,Tallis Obed Moses,799954279.0,Stub,321000
44591,Vanuatu,Esmon Saimon,799954813.0,Start,321000
44592,Vanuatu,Baldwin Lonsdale,799955662.0,C,321000
44593,Vanuatu,Sela Molisa,800106636.0,C,321000


In [28]:
# save file 
result_data.to_csv('wp_wpds_politicians_by_country.csv')

## Step 4: Analysis

The analysis is consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes. <br />

* if a country has a population of 10,000 people, and you found 10 FA or GA class articles about politicians from that country, then the percentage of articles-per-population would be .1%. 
* if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.

Data preparation for calculations

Mark high quality articles with "1"s, otherwise "0"s

In [115]:
merged_all['high_quality'] = np.where(merged_all['score'].str.contains('FA' or 'GA'), 1, 0)
merged_all

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,page,country,rev_id,score,high_quality
0,DZ,Algeria,Country,2019,44.357,44357000,Ali Fawzi Rebaine,Algeria,686269631.0,Stub,0
1,DZ,Algeria,Country,2019,44.357,44357000,Ahmed Attaf,Algeria,705910185.0,Stub,0
2,DZ,Algeria,Country,2019,44.357,44357000,Ahmed Djoghlaf,Algeria,707427823.0,Stub,0
3,DZ,Algeria,Country,2019,44.357,44357000,Hammi Larouissi,Algeria,708060571.0,Stub,0
4,DZ,Algeria,Country,2019,44.357,44357000,Salah Goudjil,Algeria,708980561.0,Stub,0
...,...,...,...,...,...,...,...,...,...,...,...
44590,VU,Vanuatu,Country,2019,0.321,321000,Tallis Obed Moses,Vanuatu,799954279.0,Stub,0
44591,VU,Vanuatu,Country,2019,0.321,321000,Esmon Saimon,Vanuatu,799954813.0,Start,0
44592,VU,Vanuatu,Country,2019,0.321,321000,Baldwin Lonsdale,Vanuatu,799955662.0,C,0
44593,VU,Vanuatu,Country,2019,0.321,321000,Sela Molisa,Vanuatu,800106636.0,C,0


Clean the table

In [116]:
high_quality_country = merged_all.drop('FIPS', 1).drop('Type', 1).drop('TimeFrame', 1).drop('Data (M)', 1)
high_quality_country

Unnamed: 0,Name,Population,page,country,rev_id,score,high_quality
0,Algeria,44357000,Ali Fawzi Rebaine,Algeria,686269631.0,Stub,0
1,Algeria,44357000,Ahmed Attaf,Algeria,705910185.0,Stub,0
2,Algeria,44357000,Ahmed Djoghlaf,Algeria,707427823.0,Stub,0
3,Algeria,44357000,Hammi Larouissi,Algeria,708060571.0,Stub,0
4,Algeria,44357000,Salah Goudjil,Algeria,708980561.0,Stub,0
...,...,...,...,...,...,...,...
44590,Vanuatu,321000,Tallis Obed Moses,Vanuatu,799954279.0,Stub,0
44591,Vanuatu,321000,Esmon Saimon,Vanuatu,799954813.0,Start,0
44592,Vanuatu,321000,Baldwin Lonsdale,Vanuatu,799955662.0,C,0
44593,Vanuatu,321000,Sela Molisa,Vanuatu,800106636.0,C,0


First convert each page with 1, and sum up the count for each country

In [117]:
high_quality_country['count'] = 1
high_quality_country

Unnamed: 0,Name,Population,page,country,rev_id,score,high_quality,count
0,Algeria,44357000,Ali Fawzi Rebaine,Algeria,686269631.0,Stub,0,1
1,Algeria,44357000,Ahmed Attaf,Algeria,705910185.0,Stub,0,1
2,Algeria,44357000,Ahmed Djoghlaf,Algeria,707427823.0,Stub,0,1
3,Algeria,44357000,Hammi Larouissi,Algeria,708060571.0,Stub,0,1
4,Algeria,44357000,Salah Goudjil,Algeria,708980561.0,Stub,0,1
...,...,...,...,...,...,...,...,...
44590,Vanuatu,321000,Tallis Obed Moses,Vanuatu,799954279.0,Stub,0,1
44591,Vanuatu,321000,Esmon Saimon,Vanuatu,799954813.0,Start,0,1
44592,Vanuatu,321000,Baldwin Lonsdale,Vanuatu,799955662.0,C,0,1
44593,Vanuatu,321000,Sela Molisa,Vanuatu,800106636.0,C,0,1


In [118]:
country_articles_num = high_quality_country.groupby(['country', 'Population']).sum('count').drop(columns = ['rev_id']).reset_index()
country_articles_num

Unnamed: 0,country,Population,high_quality,count
0,Afghanistan,38928000,1,319
1,Albania,2838000,0,456
2,Algeria,44357000,0,116
3,Andorra,82000,0,34
4,Angola,32522000,0,106
...,...,...,...,...
178,Venezuela,28645000,0,130
179,Vietnam,96209000,7,187
180,Yemen,29826000,1,116
181,Zambia,18384000,0,25


#### Calculating the proportion (percentage) of articles-per-population for each country

In [119]:
country_articles_num['articles_per_population'] = (country_articles_num['high_quality'] / country_articles_num['Population']) * 100
country_articles_num

Unnamed: 0,country,Population,high_quality,count,articles_per_population
0,Afghanistan,38928000,1,319,0.000003
1,Albania,2838000,0,456,0.000000
2,Algeria,44357000,0,116,0.000000
3,Andorra,82000,0,34,0.000000
4,Angola,32522000,0,106,0.000000
...,...,...,...,...,...
178,Venezuela,28645000,0,130,0.000000
179,Vietnam,96209000,7,187,0.000007
180,Yemen,29826000,1,116,0.000003
181,Zambia,18384000,0,25,0.000000


#### Calculating the proportion (percentage) of  high-quality articles for each country

In [78]:
country_articles_num['percent_high_quality'] = (country_articles_num['high_quality'] / country_articles_num['count']) * 100
country_articles_num

Unnamed: 0,country,Population,high_quality,count,articles_per_population,percent_high_quality
0,Afghanistan,38928000,1,319,0.000003,0.313480
1,Albania,2838000,0,456,0.000000,0.000000
2,Algeria,44357000,0,116,0.000000,0.000000
3,Andorra,82000,0,34,0.000000,0.000000
4,Angola,32522000,0,106,0.000000,0.000000
...,...,...,...,...,...,...
178,Venezuela,28645000,0,130,0.000000,0.000000
179,Vietnam,96209000,7,187,0.000007,3.743316
180,Yemen,29826000,1,116,0.000003,0.862069
181,Zambia,18384000,0,25,0.000000,0.000000


In [79]:
# how to know which countries are for which region

## Step 5: Results

#### 1.Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [112]:
table1 = country_articles_num.copy()
table1['art_num_per_population'] = (table1['count'] / table1['Population']) * 100
table1

Unnamed: 0,country,Population,high_quality,count,articles_per_population,percent_high_quality,art_num_per_population
0,Afghanistan,38928000,1,319,0.000003,0.313480,0.000819
1,Albania,2838000,0,456,0.000000,0.000000,0.016068
2,Algeria,44357000,0,116,0.000000,0.000000,0.000262
3,Andorra,82000,0,34,0.000000,0.000000,0.041463
4,Angola,32522000,0,106,0.000000,0.000000,0.000326
...,...,...,...,...,...,...,...
178,Venezuela,28645000,0,130,0.000000,0.000000,0.000454
179,Vietnam,96209000,7,187,0.000007,3.743316,0.000194
180,Yemen,29826000,1,116,0.000003,0.862069,0.000389
181,Zambia,18384000,0,25,0.000000,0.000000,0.000136


In [121]:
res1 = table1.nlargest(10,'art_num_per_population')
res1 = res1.drop(columns = ['high_quality']).drop(columns = ['articles_per_population']).drop(columns = ['percent_high_quality']).reset_index()
res1

Unnamed: 0,index,country,Population,art_num_per_population
0,169,Tuvalu,10000,0.54
1,117,Nauru,11000,0.472727
2,138,San Marino,34000,0.238235
3,110,Monaco,38000,0.105263
4,95,Liechtenstein,39000,0.071795
5,104,Marshall Islands,57000,0.064912
6,164,Tonga,99000,0.063636
7,70,Iceland,368000,0.05462
8,3,Andorra,82000,0.041463
9,52,Federated States of Micronesia,106000,0.033962


Therefore, the top 10 highest-ranked countries in terms of number of politician articles as a proportion of country population are: Tuvalu, Nauru, San Marino, Monaco,Liechtenstein, Marshall Islands, Tonga, Iceland, Andorra and Federated States of Micronesia. 

#### 2.Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [114]:
res2 = table1.nsmallest(10,'art_num_per_population')
res2 = res2.drop(columns = ['high_quality']).drop(columns = ['articles_per_population']).drop(columns = ['percent_high_quality']).reset_index()
res2

Unnamed: 0,index,country,Population,count,art_num_per_population
0,71,India,1400100000,968,6.9e-05
1,72,Indonesia,271739000,209,7.7e-05
2,34,China,1402385000,1129,8.1e-05
3,176,Uzbekistan,34174000,28,8.2e-05
4,51,Ethiopia,114916000,101,8.8e-05
5,181,Zambia,18384000,25,0.000136
6,84,"Korea, North",25779000,36,0.00014
7,162,Thailand,66534000,112,0.000168
8,114,Mozambique,31166000,58,0.000186
9,13,Bangladesh,169809000,317,0.000187


Therefore, the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population are India, Indonesia, China, Uzbekistan, Ethiopia, Zambia, North Korea, Thailand, Mozambique and Bangladesh.

#### 3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [100]:
res3 = country_articles_num.nlargest(10,'percent_high_quality')
res3 = res3.drop(columns = ['Population']).drop(columns = ['articles_per_population']).reset_index()
res3

Unnamed: 0,index,country,high_quality,count,percent_high_quality
0,135,Romania,28,343,8.163265
1,179,Vietnam,7,187,3.743316
2,176,Uzbekistan,1,28,3.571429
3,18,Benin,3,91,3.296703
4,158,Syria,4,128,3.125
5,31,Central African Republic,2,66,3.030303
6,152,Spain,25,871,2.870264
7,15,Belarus,2,72,2.777778
8,84,"Korea, North",1,36,2.777778
9,47,Egypt,6,234,2.564103


Therefore, the 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality are: Romania, Vietnam, Uzbekistan, Benin, Syria, Central African Republic, Spain, Belarus, North Korea and Egypt.

#### 4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [106]:
table2 = country_articles_num.copy()
table2 = table2.sort_values(['percent_high_quality', 'count'], ascending=[True, False])
table2

Unnamed: 0,country,Population,high_quality,count,articles_per_population,percent_high_quality
125,Pakistan,220940000,0,1019,0.000000,0.000000
123,Nigeria,206140000,0,676,0.000000,0.000000
54,Finland,5529000,0,569,0.000000,0.000000
23,Brazil,211812000,0,545,0.000000,0.000000
16,Belgium,11515000,0,519,0.000000,0.000000
...,...,...,...,...,...,...
158,Syria,19398000,4,128,0.000021,3.125000
18,Benin,12209000,3,91,0.000025,3.296703
176,Uzbekistan,34174000,1,28,0.000003,3.571429
179,Vietnam,96209000,7,187,0.000007,3.743316


In [110]:
res4 = table2.drop(columns = ['Population']).drop(columns = ['articles_per_population']).reset_index()
res4 = res4.head(10)
res4

Unnamed: 0,index,country,high_quality,count,percent_high_quality
0,125,Pakistan,0,1019,0.0
1,123,Nigeria,0,676,0.0
2,54,Finland,0,569,0.0
3,23,Brazil,0,545,0.0
4,16,Belgium,0,519,0.0
5,153,Sri Lanka,0,461,0.0
6,1,Albania,0,456,0.0
7,109,Moldova,0,421,0.0
8,161,Tanzania,0,404,0.0
9,157,Switzerland,0,402,0.0


Therefore, the 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality are: Pakistan, Nigeria, Finland, Brazil, Belgium, Sri Lanka, Albania, Moldova, Tanzania and Switzerland.

## Write-up

Through this assignment, I learned many techniques of how to process multiple data files and put together related datasets to extract useful information and perfomr analysis. These techniques is very important for data scientists. 


Before I started working with the data, I expected to see:
1. countries with larger population are likely to have a larger number of articles about politicians
2. countries with more articles about politicians should also have more good quality articles, AND
3. countries with higher GDP and military power (judged by common sense), such as such as United States, China, and Russia, should have more total number of article pages and more good articles.

As the politicians from those countries generally attract more public attention and usually more powerful globally. 

However, these are not the case when I observed the output I get (I was a bit surprised to see my output). Some countries I did not expect to show up on the top 10 lists are there, and for a lot of them I cannot come up with a reason why they have that rankings. 

But I think my results still make sense in some ways, for example, although having many articles about politician, China is ranked high in Bottom 10 countries by coverage, due to have a much larger population. 

I think language can be a bias. Since most content on Wikipedia is in English, and I think most editors are English-speaking. This disadvantageous authors whose native language is not English. Therefore, if we use (English) Wikipedia as a data source, the conclusion we draw is very likely to be biased.
On the otherhand, I think the data source is also very limited. We only identify articles by revision_id, but in real life, many authors can work on one article and an author can work on multiple articles. The real world situation is just way more complex than the abstraction of our dataset.

I think the internet and global society in general should be more diverse, and hopefully less and less countries will be underepresented. 