## Goal:
The goal of this assignment is to explore the concept of bias using data on Wikipedia articles, specifically articles about political figures from a variety of countries.

To do this, we combine a dataset of Wikipedia articles with country population data, then use a machine learning service (ORES) to estimate the quality of each article.

In [1]:
import pandas as pd
import requests

## Step 1
Getting the article & population data

In [2]:
# Population data
# This dataset is drawn from the World Population Data Sheet published by the Population Reference Bureau: https://www.prb.org/international/indicator/population/table/
WPDS_data = pd.read_csv("data_raw/WPDS_2020_data.csv")

# Article data
# This dataset is available on figshare: https://figshare.com/articles/dataset/Untitled_Item/5513449
# I downloaded the data; a copy can be found in this GitHub repo
page_data = pd.read_csv("data_raw/page_data.csv")

## Step 2
Data Cleaning


The dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles and should not be included in the analysis.


In [3]:
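# Drop pages whose title starts with "Template:" (na=False treats missing titles as non-matching)
page_data = page_data[~page_data["page"].str.startswith("Template:", na=False)].reset_index()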
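
The difference between a prefix test and a substring test can be seen on a toy frame. This is only a sketch: `toy_pages` and its titles are made up for illustration and are not rows from `page_data.csv`.

```python
import pandas as pd

# Toy stand-in for page_data; the titles are invented for illustration.
toy_pages = pd.DataFrame({
    "page": [
        "Template:Example politicians",
        "Jane Example",
        "List using Template: syntax",
    ],
    "country": ["Chad", "Chad", "Chad"],
})

# startswith() drops only titles that begin with the prefix.
by_prefix = toy_pages[~toy_pages["page"].str.startswith("Template:")]

# contains() also drops titles that merely mention the string somewhere.
by_substring = toy_pages[~toy_pages["page"].str.contains("Template:")]

print(len(by_prefix), len(by_substring))
```

On real data the two filters may well agree; the prefix test simply matches the stated rule more closely.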

WPDS_2020_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.


In [4]:
WPDS_data_country = WPDS_data[WPDS_data["Type"]=="Country"].reset_index()
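
The ALL CAPS rule described above can also be applied directly to the name column. The sketch below uses a made-up `toy_wpds` frame (the population figures are invented) and assumes the names live in a `Name` column, as they do in the cell above.

```python
import pandas as pd

# Toy stand-in for the WPDS file; values are invented for illustration.
toy_wpds = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country"],
    "Population": [1000.0, 400.0, 44.0, 100.0],
})

# A name written entirely in capitals marks a regional aggregate row.
is_region = toy_wpds["Name"].str.isupper()

toy_regions = toy_wpds[is_region].reset_index(drop=True)
toy_countries = toy_wpds[~is_region].reset_index(drop=True)

print(toy_regions["Name"].tolist())
print(toy_countries["Name"].tolist())
```

Keeping `toy_regions` as its own frame mirrors the advice to retain the regional rows for the later analysis.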

## Step 3
Getting Article Quality Prediction

In [5]:
page_data

Unnamed: 0,index,page,country,rev_id
0,1,Bir I of Kanem,Chad,355319463
1,10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,12,Yos Por,Cambodia,393822005
3,23,Julius Gregr,Czech Republic,395521877
4,24,Edvard Gregr,Czech Republic,395526568
...,...,...,...,...
46696,47192,Yahya Jammeh,Gambia,807482007
46697,47193,Lucius Fairchild,United States,807483006
46698,47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
46699,47195,Francis Fessenden,United States,807483270


In [6]:
# Iterate through the rev_ids in batches of 50 and call the ORES API to get a quality prediction for each article
url = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}'
batch_size = 50
rev_ids = []
predictions = []
log_rev_ids_missing_predictions = []
# stepping by batch_size avoids sending an empty final batch
for start in range(0, len(page_data), batch_size):
    batch_ids = page_data["rev_id"][start:start + batch_size]
    rev_id = "|".join(str(x) for x in batch_ids)
    call = requests.get(url.format(rev_id=rev_id))
    response = call.json()

    for i in batch_ids:
        try:
            prediction = response["enwiki"]["scores"][str(i)]["articlequality"]["score"]["prediction"]
            rev_ids.append(i)
            predictions.append(prediction)
        except KeyError:
            # no prediction was returned for this revision, so log it
            log_rev_ids_missing_predictions.append(i)
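
The batching logic can be checked without touching the network. In the sketch below, `chunked` and `build_url` are hypothetical helper names introduced for illustration; the URL template is the one used in the cell above, and the ids are made up.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of at most `size` ids (hypothetical helper)."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_url(batch):
    """Format one request URL for a batch of revision ids (hypothetical helper)."""
    base = "https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_id}"
    return base.format(rev_id="|".join(str(x) for x in batch))


# 120 made-up ids split into batches of 50, 50 and 20.
example_ids = list(range(1, 121))
batches = list(chunked(example_ids))

print([len(b) for b in batches])
print(build_url(batches[-1][:3]))
```

Because the step of `range` is the batch size, a list whose length is an exact multiple of 50 produces no empty trailing batch.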
        

In [7]:
article_quality= pd.DataFrame({'rev_id':rev_ids,
                   'article_quality_estimate':predictions})

## Step 4 Combining the datasets

In [8]:
page_data["rev_id"] = page_data["rev_id"].apply(str)
article_quality["rev_id"] = article_quality["rev_id"].apply(str)

In [9]:
page_data = page_data.merge(article_quality,
                how = "inner",
                on = "rev_id"
               )

In [10]:
joined = page_data.merge(WPDS_data_country,
                how = "outer",
                left_on = "country",
                right_on = "Name"
               )

There are a couple of edge cases: either the population dataset has no entry for a country that appears in the Wikipedia data, or vice versa.
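
One way to surface both kinds of mismatch is the `indicator` option of `DataFrame.merge`, which labels each row of an outer join by where it came from. This is a sketch on toy frames: "Atlantis" is a made-up country, and the frame names are illustrative only.

```python
import pandas as pd

# Toy article and population tables; "Atlantis" is invented for illustration.
toy_articles = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Chad", "Atlantis"],
})
toy_population = pd.DataFrame({
    "Name": ["Chad", "Guam"],
    "Population": [16877000.0, 175000.0],
})

# indicator=True adds a `_merge` column: left_only, right_only or both.
toy_joined = toy_articles.merge(
    toy_population,
    how="outer",
    left_on="country",
    right_on="Name",
    indicator=True,
)

unmatched = toy_joined[toy_joined["_merge"] != "both"]
matched = toy_joined[toy_joined["_merge"] == "both"]

print(unmatched[["country", "Name", "_merge"]])
```

Here `left_only` rows are articles with no population entry and `right_only` rows are population entries with no articles.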

In [11]:
joined

Unnamed: 0,index_x,page,country,rev_id,article_quality_estimate,index_y,FIPS,Name,Type,TimeFrame,Data (M),Population
0,1.0,Bir I of Kanem,Chad,355319463,Stub,52.0,TD,Chad,Country,2019.0,16.877,16877000.0
1,122.0,Abdullah II of Kanem,Chad,498683267,Stub,52.0,TD,Chad,Country,2019.0,16.877,16877000.0
2,260.0,Salmama II of Kanem,Chad,565745353,Stub,52.0,TD,Chad,Country,2019.0,16.877,16877000.0
3,261.0,Kuri I of Kanem,Chad,565745365,Stub,52.0,TD,Chad,Country,2019.0,16.877,16877000.0
4,262.0,Mohammed I of Kanem,Chad,565745375,Stub,52.0,TD,Chad,Country,2019.0,16.877,16877000.0
...,...,...,...,...,...,...,...,...,...,...,...,...
46446,,,,,,220.0,PF,French Polynesia,Country,2019.0,0.280,280000.0
46447,,,,,,221.0,GU,Guam,Country,2019.0,0.175,175000.0
46448,,,,,,225.0,NC,New Caledonia,Country,2019.0,0.295,295000.0
46449,,,,,,227.0,PW,Palau,Country,2019.0,0.018,18000.0


In [12]:
wp_wpds_countries_no_match = joined[(joined["country"].isnull())|(joined["Name"].isnull())]

In [13]:
wp_wpds_countries_no_match.to_csv("data_clean/wp_wpds_countries-no_match.csv")

In [14]:
# keep only rows matched on both sides; .copy() so later column assignments act on an independent frame
wp_wpds_politicians_by_country = joined[joined["country"].notnull() & joined["Name"].notnull()].copy()

In [15]:
page_data

Unnamed: 0,index,page,country,rev_id,article_quality_estimate
0,1,Bir I of Kanem,Chad,355319463,Stub
1,10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,12,Yos Por,Cambodia,393822005,Stub
3,23,Julius Gregr,Czech Republic,395521877,Stub
4,24,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...,...
46420,47191,Hal Bidlack,United States,807481636,C
46421,47192,Yahya Jammeh,Gambia,807482007,GA
46422,47193,Lucius Fairchild,United States,807483006,C
46423,47194,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA


In [16]:
wp_wpds_politicians_by_country.columns = ['index_x', 'article_name', 'country', 'revision_id', 'article_quality_est.',
       'index_y', 'FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)',
       'population']

In [17]:
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[["country","article_name","revision_id","article_quality_est.","population"]]
wp_wpds_politicians_by_country.to_csv("wp_wpds_politicians_by_country.csv")

In [18]:
wp_wpds_politicians_by_country

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Chad,Bir I of Kanem,355319463,Stub,16877000.0
1,Chad,Abdullah II of Kanem,498683267,Stub,16877000.0
2,Chad,Salmama II of Kanem,565745353,Stub,16877000.0
3,Chad,Kuri I of Kanem,565745365,Stub,16877000.0
4,Chad,Mohammed I of Kanem,565745375,Stub,16877000.0
...,...,...,...,...,...
46414,Seychelles,Rita Sinon,800323154,Stub,98000.0
46415,Seychelles,Sylvette Frichot,800323798,Stub,98000.0
46416,Seychelles,May De Silva,800969960,Start,98000.0
46417,Seychelles,Vincent Meriton,802051093,Stub,98000.0


## Step 5 Analysis

In this step we compute articles-per-population and the share of high-quality articles, for each country and for each geographic region. By "high-quality" we mean articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.

In [19]:
wp_wpds_politicians_by_country["article_quality_est."].value_counts()

Stub     23325
Start    13857
C         5662
GA         747
B          696
FA         281
Name: article_quality_est., dtype: int64

#### Articles per population
Number of articles for each country divided by that country's population, expressed as a percentage

In [20]:
articles_per_population = wp_wpds_politicians_by_country.groupby(["country"]).apply(lambda s: (s.article_name.count()/s.population.max())*100)
articles_per_population

country
Afghanistan    0.000819
Albania        0.016068
Algeria        0.000262
Andorra        0.041463
Angola         0.000326
                 ...   
Venezuela      0.000454
Vietnam        0.000194
Yemen          0.000389
Zambia         0.000136
Zimbabwe       0.001097
Length: 183, dtype: float64
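
The same ratio can be written with named aggregations, which keeps the article count and the population visible next to the percentage. This is a sketch on a three-row toy frame; the article names are made up, and the column names follow the ones used above.

```python
import pandas as pd

# Toy rows; article names are invented for illustration.
toy = pd.DataFrame({
    "country": ["Chad", "Chad", "Seychelles"],
    "article_name": ["a", "b", "c"],
    "population": [16877000.0, 16877000.0, 98000.0],
})

# Count articles and take the (constant) population for each country.
per_country = toy.groupby("country").agg(
    articles=("article_name", "count"),
    population=("population", "max"),
)

# Articles as a percentage of population.
per_country["articles_per_pop_pct"] = (
    per_country["articles"] / per_country["population"] * 100
)

print(per_country)
```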

#### High Quality Articles 

In [21]:
wp_wpds_politicians_by_country["hq"] = wp_wpds_politicians_by_country["article_quality_est."].isin(["FA", "GA"])

In [22]:
hq_articles = wp_wpds_politicians_by_country.groupby(["country"]).apply(lambda s: (s.hq.sum()/s.article_name.count())*100)
hq_articles

country
Afghanistan    4.075235
Albania        0.657895
Algeria        1.724138
Andorra        0.000000
Angola         0.000000
                 ...   
Venezuela      2.307692
Vietnam        6.951872
Yemen          2.586207
Zambia         0.000000
Zimbabwe       1.226994
Length: 183, dtype: float64
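
Since the high-quality flag is boolean, its group mean is already the share of high-quality articles. The sketch below shows this on made-up rows; the countries "X" and "Y" are placeholders.

```python
import pandas as pd

# Toy rows; "X" and "Y" are placeholder countries.
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Y"],
    "article_quality_est.": ["FA", "Stub", "GA", "C", "Stub", "Start"],
})

# Flag articles predicted as FA or GA.
toy["hq"] = toy["article_quality_est."].isin(["FA", "GA"])

# The mean of a boolean column is the fraction of True values.
hq_pct = toy.groupby("country")["hq"].mean() * 100

print(hq_pct)
```

This gives the same number as dividing the sum of the flag by the article count, as long as no article name is missing.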

## Step 6: Results


#### 1) Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [23]:
articles_per_population.to_frame("articles_per_pop_pct").reset_index().sort_values("articles_per_pop_pct",ascending = False).head(10)

Unnamed: 0,country,articles_per_pop_pct
169,Tuvalu,0.54
117,Nauru,0.472727
138,San Marino,0.238235
110,Monaco,0.105263
95,Liechtenstein,0.071795
104,Marshall Islands,0.064912
164,Tonga,0.063636
70,Iceland,0.05462
3,Andorra,0.041463
52,Federated States of Micronesia,0.033962


#### 2) Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [24]:
articles_per_population.to_frame("articles_per_pop_pct").reset_index().sort_values("articles_per_pop_pct",ascending = True).head(10)

Unnamed: 0,country,articles_per_pop_pct
71,India,6.9e-05
72,Indonesia,7.7e-05
34,China,8.1e-05
176,Uzbekistan,8.2e-05
51,Ethiopia,8.8e-05
181,Zambia,0.000136
84,"Korea, North",0.00014
162,Thailand,0.000168
114,Mozambique,0.000186
13,Bangladesh,0.000187
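
The two coverage rankings can also be expressed with `nlargest` and `nsmallest`, which avoid a full sort. The sketch below reuses five values from the tables above purely as sample input.

```python
import pandas as pd

# A handful of values copied from the coverage tables, used as sample input.
sample = pd.Series(
    {
        "Tuvalu": 0.54,
        "Nauru": 0.472727,
        "Iceland": 0.05462,
        "India": 0.000069,
        "Indonesia": 0.000077,
    },
    name="articles_per_pop_pct",
)

top2 = sample.nlargest(2)      # highest coverage first
bottom2 = sample.nsmallest(2)  # lowest coverage first

print(top2)
print(bottom2)
```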


#### 3) Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [25]:
hq_articles.to_frame("hq_articles_pct").reset_index().sort_values("hq_articles_pct",ascending = False).head(10)

Unnamed: 0,country,hq_articles_pct
84,"Korea, North",22.222222
140,Saudi Arabia,12.820513
135,Romania,12.244898
31,Central African Republic,12.121212
176,Uzbekistan,10.714286
106,Mauritania,10.416667
64,Guatemala,8.433735
44,Dominica,8.333333
158,Syria,7.8125
18,Benin,7.692308


#### 4) Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [26]:
hq_articles.to_frame("hq_articles_pct").reset_index().sort_values("hq_articles_pct",ascending = True).head(10)

Unnamed: 0,country,hq_articles_pct
148,Solomon Islands,0.0
164,Tonga,0.0
117,Nauru,0.0
116,Namibia,0.0
43,Djibouti,0.0
114,Mozambique,0.0
110,Monaco,0.0
49,Eritrea,0.0
50,Estonia,0.0
109,Moldova,0.0


A lot of the countries have zero high-quality articles,
so we can also look at the bottom 10 among countries that have at least one high-quality article.

In [27]:
temp = hq_articles.to_frame("hq_articles_pct").reset_index()
temp[temp["hq_articles_pct"]>0].sort_values("hq_articles_pct",ascending = True).head(10)

Unnamed: 0,country,hq_articles_pct
16,Belgium,0.192678
161,Tanzania,0.247525
157,Switzerland,0.248756
118,Nepal,0.280899
130,Peru,0.285714
123,Nigeria,0.295858
133,Portugal,0.314465
35,Colombia,0.350877
96,Lithuania,0.409836
113,Morocco,0.485437


In [30]:
#To answer the next 2 questions, we need a country region mapping
WPDS_data = pd.read_csv("data_raw/WPDS_2020_data.csv")

In [31]:
#create region - country mapping
region = WPDS_data["Type"]
name = WPDS_data["Name"]
population = WPDS_data["Population"]
regions_country = {}
regions_population ={}

In [32]:
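# Build the region -> countries mapping by walking the rows in file order:
# the loop assumes each "Sub-Region" row is followed by the "Country" rows that belong to it
current_region = None
for r,n,p in zip(region,name,population):
    if r=="Sub-Region":
        regions_country[n]=[]
        current_region = n
        regions_population[n]=p
    if r=="Country":
        if current_region is not None:
            regions_country[current_region].append(n)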

In [33]:
# create country -> region mapping
country_region ={}
for region,countries in regions_country.items():
    for country in countries:
        country_region[country]=region
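
The same country-to-region mapping can be sketched without explicit loops by forward-filling the most recent sub-region name. Like the loops above, this assumes each "Sub-Region" row precedes its countries; `toy_wpds` and `country_region_toy` are illustrative names, and the rows are a made-up excerpt.

```python
import pandas as pd

# Toy excerpt in the assumed file order: a sub-region row, then its countries.
toy_wpds = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Type": ["Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name only on sub-region rows, then carry it down to the rows below.
toy_wpds["region"] = toy_wpds["Name"].where(toy_wpds["Type"] == "Sub-Region").ffill()

# Country rows now carry the sub-region they fall under.
countries = toy_wpds[toy_wpds["Type"] == "Country"]
country_region_toy = dict(zip(countries["Name"], countries["region"]))

print(country_region_toy)
```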
    

In [34]:
#create a column for region
wp_wpds_politicians_by_country["region"] = wp_wpds_politicians_by_country["country"].replace(country_region)

In [35]:
#create a column for region population
wp_wpds_politicians_by_country["region_population"] = wp_wpds_politicians_by_country["region"].replace(regions_population)
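
One detail worth knowing when building such columns: `Series.replace` leaves any value without a dictionary entry unchanged, whereas `Series.map` turns it into `NaN`. The sketch below shows the difference on two made-up rows; whether any real row is affected is not checked here.

```python
import pandas as pd

# Two sample labels and a mapping that covers only one of them.
labels = pd.Series(["Algeria", "Channel Islands"])
mapping = {"Algeria": "NORTHERN AFRICA"}

replaced = labels.replace(mapping)  # unmatched label passes through unchanged
mapped = labels.map(mapping)        # unmatched label becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```

With `map`, unmatched countries can be found with `isna()` instead of appearing as if they were regions.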

#### 5) Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [37]:
wp_wpds_politicians_by_country.groupby(["region"]).apply(lambda s: (s.article_name.count()/s.population.max())*100).to_frame("articles_per_regional_pop_pct").sort_values("articles_per_regional_pop_pct",ascending = False)

region,articles_per_regional_pop_pct
OCEANIA,0.012138
SOUTHERN EUROPE,0.006153
CARIBBEAN,0.006095
Channel Islands,0.005603
WESTERN EUROPE,0.005474
WESTERN ASIA,0.003061
EASTERN EUROPE,0.002543
EASTERN AFRICA,0.002177
MIDDLE AFRICA,0.002045
SOUTH AMERICA,0.001431
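
As a hedged sketch on made-up rows, a regional coverage ratio can be formed by dividing each region's article count by the region's own population total, here taken from a `region_population` column like the one created earlier. The regions "R1" and "R2" and all numbers are invented for illustration.

```python
import pandas as pd

# Toy rows; regions and populations are invented for illustration.
toy = pd.DataFrame({
    "region": ["R1", "R1", "R1", "R2"],
    "article_name": ["a", "b", "c", "d"],
    "population": [10.0, 10.0, 30.0, 50.0],
    "region_population": [40.0, 40.0, 40.0, 50.0],
})

# Count articles per region and take the region-level population figure.
by_region = toy.groupby("region").agg(
    articles=("article_name", "count"),
    region_population=("region_population", "first"),
)

# Articles as a percentage of regional population, ranked in descending order.
by_region["articles_per_regional_pop_pct"] = (
    by_region["articles"] / by_region["region_population"] * 100
)
by_region = by_region.sort_values("articles_per_regional_pop_pct", ascending=False)

print(by_region)
```

Note that the denominator here is the regional total, not the population of the largest country in the region.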


#### 6) Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [38]:
wp_wpds_politicians_by_country.groupby(["region"]).apply(lambda s: (s.hq.sum()/s.article_name.count())*100).to_frame("hq_articles_regional_pct").sort_values("hq_articles_regional_pct",ascending = False)


region,hq_articles_regional_pct
NORTHERN AMERICA,5.470805
SOUTHEAST ASIA,3.613861
WESTERN ASIA,3.472493
EASTERN EUROPE,3.161844
EAST ASIA,3.07319
CENTRAL ASIA,2.857143
Channel Islands,2.710603
MIDDLE AFRICA,2.406015
NORTHERN AFRICA,2.113459
OCEANIA,2.015355
