## A2 - Bias in Data

### Project Overview

The goal of this project is to explore the concept of 'bias' using data about Wikipedia articles on political figures from different countries.

The data comes from three sources: the set of Wikipedia articles about politicians, the predicted article quality scores for those articles, and a dataset of country populations.

The analysis produces tables showing:  
1) the countries with the greatest and least coverage of politicians on Wikipedia relative to their population.  
2) the countries with the highest and lowest proportion of high-quality articles about politicians.

## Part 1 - Getting the data

### The Wikipedia dataset

The Wikipedia dataset is available on Figshare. It was extracted using the Wikimedia API and saved as a CSV file named page_data.csv. The columns in this file are:  
1) page: the Wikipedia article title  
2) country: the name of the country  
3) rev_id: the revision ID of the last edit to the page

In [86]:
"""
    Read page_data.csv into the dataframe "data".
    The shape of this dataframe is (47197, 3).
"""
import pandas as pd

data = pd.read_csv('page_data.csv')

### The Population dataset

The population data comes from the Population Reference Bureau website (http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14). The dataset has six columns: Location, Location Type, TimeFrame, Data Type, Data, and Footnotes. I removed the columns that the analysis does not need, keeping only Location and Data.

In [11]:
pop = pd.read_csv('Population_Mid_2015.csv', header=1)
# Keep only Location and Data; drop the columns the analysis does not use
pop = pop.drop(columns=['Location Type', 'TimeFrame', 'Data Type', 'Footnotes'])

To make merging the two dataframes easier, I renamed the 'Location' column to 'country' and the 'Data' column to 'population'.

In [12]:
pop.columns = ['country','population']

In [13]:
# Remove the thousands separators so the values can be converted to integers
pop['population'] = pop['population'].str.replace(',', '')

In [14]:
# Convert the cleaned strings to integers
pop['population'] = pop['population'].astype('int64')
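
The separator removal and integer conversion above can also be done in one step. The sketch below is only an illustration: it parses a two-row in-memory CSV (the figures for India and Nauru are the ones that appear later in this notebook) instead of the PRB file, and the names `raw_csv`, `pop_demo` and `pop_demo2` are made up for the example.

```python
import io
import pandas as pd

# Two illustrative rows shaped like the trimmed PRB data (Location, Data).
# The real Population_Mid_2015.csv file is not read here.
raw_csv = 'Location,Data\nIndia,"1,314,097,616"\nNauru,"10,860"\n'

# Option 1: strip the separators and cast in one chained expression.
pop_demo = pd.read_csv(io.StringIO(raw_csv))
pop_demo.columns = ['country', 'population']
pop_demo['population'] = (
    pop_demo['population'].str.replace(',', '', regex=False).astype('int64')
)

# Option 2: let read_csv remove the thousands separators while parsing.
pop_demo2 = pd.read_csv(io.StringIO(raw_csv), thousands=',')
pop_demo2.columns = ['country', 'population']

print(pop_demo)
print(pop_demo2.dtypes)
```

Both options should give an integer `population` column; the second avoids handling the strings at all.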

### Article quality prediction

The predicted quality score for each article in the Wikipedia dataset comes from a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time and assigns a probability to each of the categories listed below; the category it predicts is used as the article's quality score.


The quality classes are, from best to worst:  
FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  
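
To make the relationship between the probabilities and the predicted class concrete, the sketch below works on the sample ORES response quoted in this notebook (revision 235107991). It only inspects that one sample: in it, the `prediction` field matches the class with the highest probability. The variable names are illustrative.

```python
# Sample ORES response copied from the docstring in this notebook.
response = {
    'enwiki': {
        'models': {'wp10': {'version': '0.5.0'}},
        'scores': {
            '235107991': {'wp10': {'score': {
                'prediction': 'Stub',
                'probability': {
                    'B': 0.00617021706415532,
                    'C': 0.01705290459462909,
                    'FA': 0.0015941304170005732,
                    'GA': 0.0012422843354764665,
                    'Start': 0.024596904658825667,
                    'Stub': 0.9493435589299127,
                },
            }}},
        },
    },
}

# Quality classes from best to worst, as listed above.
QUALITY_ORDER = ['FA', 'GA', 'B', 'C', 'Start', 'Stub']

score = response['enwiki']['scores']['235107991']['wp10']['score']
probabilities = score['probability']

# Class with the highest probability in this sample.
most_likely = max(probabilities, key=probabilities.get)
total_probability = sum(probabilities.values())

print(score['prediction'], most_likely)
for quality_class in QUALITY_ORDER:
    print(quality_class, probabilities[quality_class])
```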


In [164]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}
"""
    Adapted from the example Jupyter notebook. The function get_ores_data (defined in the next cell)
    requests the ORES data for the articles whose revision IDs are passed as 'revids'.
    An example of the JSON object returned is as follows:
    {'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'235107991': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00617021706415532,
       'C': 0.01705290459462909,
       'FA': 0.0015941304170005732,
       'GA': 0.0012422843354764665,
       'Start': 0.024596904658825667,
       'Stub': 0.9493435589299127}}}}}}}
"""



In [205]:
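def get_ores_data(revision_ids, headers):
    """Request wp10 predictions for a batch of revision IDs (a pandas Series).

    Returns the predicted classes for the IDs that were scored and the
    dataframe index labels of the IDs that were not.
    """
    missing_index = []
    scores = []  # predicted quality class for each scored revision
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    params = {"project" : "enwiki",
              "model" : "wp10",
              "revids" : "|".join(str(x) for x in revision_ids)
             }

    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    for idx, rev_id in revision_ids.items():
        try:
            scores.append(response["enwiki"]["scores"][str(rev_id)]["wp10"]["score"]["prediction"])
        except KeyError:
            # Record the row label (not the batch offset) so that the right row is dropped later
            missing_index.append(idx)
            print("no rev_id found")

    return scores, missing_index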
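
The parsing step can be exercised without calling the API. The sketch below is a hypothetical helper (`split_scores` is not part of the notebook) run against a hand-written mock response. It assumes that a revision ORES cannot score has no `score` key under the model; the `error` entry shown is an assumed shape, not a payload captured from ORES.

```python
def split_scores(response, revision_ids, project="enwiki", model="wp10"):
    """Return ({rev_id: prediction}, [rev_ids that have no score])."""
    found, missing = {}, []
    all_scores = response.get(project, {}).get("scores", {})
    for rev_id in revision_ids:
        entry = all_scores.get(str(rev_id), {}).get(model, {})
        if "score" in entry:
            found[rev_id] = entry["score"]["prediction"]
        else:
            missing.append(rev_id)
    return found, missing


# Hand-written mock response; only the first entry mirrors the notebook's sample.
mock_response = {
    "enwiki": {"scores": {
        "235107991": {"wp10": {"score": {"prediction": "Stub"}}},
        # Hypothetical failure entry: the exact error payload is an assumption.
        "111": {"wp10": {"error": {"type": "RevisionNotFound"}}},
    }}
}

found, missing = split_scores(mock_response, [235107991, 111])
print(found)
print(missing)
```

Keying the predictions by revision ID, rather than collecting them in a list, keeps each score attached to the row it belongs to even when some revisions are missing.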

In [206]:
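aq = []
missing_index = []
# Query ORES in batches of 100 revision IDs
for i in range(0, len(data["rev_id"]), 100):
    revision_ids = data["rev_id"][i:(i + 100)]
    scores, batch_missing = get_ores_data(revision_ids, headers)
    aq.extend(scores)
    # Accumulate across batches instead of overwriting on every iteration
    missing_index.extend(batch_missing)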

no rev_id found


Drop the rows whose revision IDs returned no ORES score, so that the remaining rows line up with the scores collected in `aq`.

In [207]:
# Drop every row that has no score, not only the first one
data.drop(missing_index, inplace=True)

In [208]:
data['article_quality'] = aq
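
Assigning the list `aq` as a column relies on the scores being in exactly the same order as the remaining rows. A less position-dependent alternative, sketched below on made-up data, is to keep a `rev_id -> prediction` lookup and map it onto the dataframe. The names `demo` and `score_by_rev` and the revision IDs are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the page data.
demo = pd.DataFrame({
    'page': ['A', 'B', 'C'],
    'rev_id': [101, 102, 103],
})

# Hypothetical lookup of predictions; rev_id 102 received no score.
score_by_rev = {101: 'Stub', 103: 'GA'}

# map() aligns on the rev_id value, so unscored rows simply become NaN.
demo['article_quality'] = demo['rev_id'].map(score_by_rev)

# Drop the unscored rows in a separate, explicit step.
scored = demo.dropna(subset=['article_quality']).reset_index(drop=True)
print(scored)
```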

### Merging the datasets

In the following cell, I merge the data and population dataframes into the final dataframe used for the analysis. The inner join keeps only countries that appear in both datasets.

In [210]:
final_df = data.merge(pop, on = 'country', how='inner')
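
Because an inner merge silently drops unmatched countries, it can help to check which names fail to match before merging. The sketch below uses the `indicator` option of `merge` on made-up frames; 'Atlantis' is an invented country name and `pages` and `populations` are illustrative stand-ins for `data` and `pop`.

```python
import pandas as pd

# Made-up stand-ins for the two dataframes being merged.
pages = pd.DataFrame({'country': ['Zambia', 'Zambia', 'Atlantis'],
                      'page': ['P1', 'P2', 'P3']})
populations = pd.DataFrame({'country': ['Zambia', 'Nauru'],
                            'population': [15473900, 10860]})

# An outer merge with indicator=True labels each row with where it came from.
check = pages.merge(populations, on='country', how='outer', indicator=True)

only_in_pages = sorted(check.loc[check['_merge'] == 'left_only', 'country'].unique())
only_in_pop = sorted(check.loc[check['_merge'] == 'right_only', 'country'].unique())

print("Countries with articles but no population:", only_in_pages)
print("Countries with population but no articles:", only_in_pop)
```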

In [211]:
"""
    Renaming the columns of the dataframe to make it easier to perform analysis
"""
final_df.columns= ['article_name','country','rev_id','article_quality','population']

In [212]:
final_df.head()

| | article_name | country | rev_id | article_quality | population |
|---|---|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 | Stub | 15473900 |
| 1 | Gladys Lundwe | Zambia | 757566606 | Stub | 15473900 |
| 2 | Mwamba Luchembe | Zambia | 764848643 | Stub | 15473900 |
| 3 | Thandiwe Banda | Zambia | 768166426 | Start | 15473900 |
| 4 | Sylvester Chisembele | Zambia | 776082926 | C | 15473900 |
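
The first row of the preview, `Template:ZambiaProvincialMinisters`, looks like a Wikipedia template page rather than an article about a politician. The notebook does not filter such pages; if they should be excluded (an assumption), a minimal sketch of the filter on two rows copied from the preview could look like this:

```python
import pandas as pd

# Two rows copied from the preview above.
sample = pd.DataFrame({
    'article_name': ['Template:ZambiaProvincialMinisters', 'Gladys Lundwe'],
    'country': ['Zambia', 'Zambia'],
})

# Flag page titles that start with the "Template:" namespace prefix.
is_template = sample['article_name'].str.startswith('Template:')
articles_only = sample[~is_template]
print(articles_only)
```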


## Part 2 - Analysis


The analysis calculates two percentages for each country: the number of politician articles relative to the country's population, and the share of those articles that are high quality. "High-quality" articles are those that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class.

Examples:  
- if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.  
- if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
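
The two examples above reduce to the same calculation. The short sketch below reproduces them; the helper name `percentage` is illustrative.

```python
def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return part / whole * 100


# 10 articles for a population of 10,000 people
articles_per_population = percentage(10, 10_000)

# 2 FA/GA articles out of 10 articles
high_quality_share = percentage(2, 10)

print(round(articles_per_population, 4), round(high_quality_share, 4))
```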

In [213]:
#Getting total number of articles for each country
df_temp = final_df[['country','article_name']].groupby('country').count()

In [214]:
df_temp = df_temp.reset_index()

In [215]:
#Since population is an important factor in the analysis, we merge the population dataframe with the above data frame
article_pop = df_temp.merge(pop,on='country',how='inner')

In [216]:
article_pop.columns = ['country','total_articles','population']

In [217]:
#Calculating the number of articles as a percentage of each country's population
article_pop['perc_articles_pop'] = article_pop['total_articles'] / article_pop['population'] * 100

In [218]:
# Selecting the high-quality ("FA" or "GA") articles; the proportion per country is calculated in the cells below
hq = final_df[final_df['article_quality'].isin(['FA', 'GA'])]

In [219]:
hq_count = hq.groupby('country').count()['article_name']

In [220]:
hq_count = hq_count.to_frame().reset_index()

In [221]:
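# Inner merge: countries with no FA/GA articles are absent from hq_count and are therefore dropped here
hq_perc = pd.merge(hq_count,article_pop,on='country',how='inner')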

In [222]:
hq_perc['high_quality_perc'] = hq_perc['article_name']/hq_perc['total_articles']*100

In [223]:
hq_plot = hq_perc[['country','high_quality_perc']]
hq_plot.head()

| | country | high_quality_perc |
|---|---|---|
| 0 | Afghanistan | 4.587156 |
| 1 | Albania | 1.304348 |
| 2 | Algeria | 1.680672 |
| 3 | Angola | 0.909091 |
| 4 | Argentina | 3.427419 |
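
Since `hq_perc` is built with an inner merge on the countries that have at least one FA or GA article, countries with none never appear in `hq_plot`. If those countries should be ranked with a value of 0% (an assumption about the intended analysis), one option is to count both totals in a single `groupby`. The sketch below uses made-up data, and the names `toy`, `is_hq` and `summary` are illustrative.

```python
import pandas as pd

# Made-up articles: country X has two high-quality articles, country Y has none.
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'article_quality': ['FA', 'Stub', 'GA', 'C', 'Stub', 'Start'],
})

# Flag high-quality rows, then count all rows and flagged rows per country.
toy['is_hq'] = toy['article_quality'].isin(['FA', 'GA'])
summary = toy.groupby('country').agg(
    total_articles=('article_quality', 'count'),
    hq_articles=('is_hq', 'sum'),
).reset_index()

summary['high_quality_perc'] = summary['hq_articles'] / summary['total_articles'] * 100
print(summary)
```

With this approach country Y stays in the result with a share of 0 instead of disappearing.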


The following code generates four tables:  
1) 10 highest-ranked countries in terms of number of politician articles as a proportion of country population  
2) 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population  
3) 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country  
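4) 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country (countries with no GA or FA articles are excluded, because the inner merge that builds hq_perc drops them)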
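
The cells below sort the full dataframe and take the first ten rows. An equivalent shortcut, sketched here on made-up values, is `nlargest` and `nsmallest`; the frame `toy_rank` and its numbers are illustrative only.

```python
import pandas as pd

# Made-up percentages for four hypothetical countries.
toy_rank = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'perc_articles_pop': [0.4, 0.1, 0.3, 0.2],
})

# Highest and lowest two rows by the chosen column (use 10 for the real tables).
top2 = toy_rank.nlargest(2, 'perc_articles_pop')
bottom2 = toy_rank.nsmallest(2, 'perc_articles_pop')

print(top2)
print(bottom2)
```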

In [233]:
article_pop_asc = article_pop.sort_values(['perc_articles_pop'],ascending=True, inplace=False, axis=0)

In [237]:
print("10 lowest-ranked countries in terms of number of politician articles as a proportion of country population")
article_pop_asc.head(10)

10 lowest-ranked countries in terms of number of politician articles as a proportion of country population


| | country | total_articles | population | perc_articles_pop |
|---|---|---|---|---|
| 73 | India | 990 | 1314097616 | 7.5e-05 |
| 34 | China | 1138 | 1371920000 | 8.3e-05 |
| 74 | Indonesia | 215 | 255741973 | 8.4e-05 |
| 180 | Uzbekistan | 29 | 31290791 | 9.3e-05 |
| 53 | Ethiopia | 105 | 98148000 | 0.000107 |
| 86 | Korea, North | 39 | 24983000 | 0.000156 |
| 185 | Zambia | 26 | 15473900 | 0.000168 |
| 166 | Thailand | 112 | 65121250 | 0.000172 |
| 38 | Congo, Dem. Rep. of | 142 | 73340200 | 0.000194 |
| 13 | Bangladesh | 324 | 160411000 | 0.000202 |


In [238]:
article_pop_desc = article_pop.sort_values(['perc_articles_pop'],ascending=False, inplace=False, axis=0)

In [239]:
print("10 highest-ranked countries in terms of number of politician articles as a proportion of country population")
article_pop_desc.head(10)

10 highest-ranked countries in terms of number of politician articles as a proportion of country population


| | country | total_articles | population | perc_articles_pop |
|---|---|---|---|---|
| 120 | Nauru | 53 | 10860 | 0.488029 |
| 173 | Tuvalu | 55 | 11800 | 0.466102 |
| 141 | San Marino | 82 | 33000 | 0.248485 |
| 113 | Monaco | 40 | 38088 | 0.10502 |
| 97 | Liechtenstein | 29 | 37570 | 0.077189 |
| 107 | Marshall Islands | 37 | 55000 | 0.067273 |
| 72 | Iceland | 206 | 330828 | 0.062268 |
| 168 | Tonga | 63 | 103300 | 0.060987 |
| 3 | Andorra | 34 | 78000 | 0.04359 |
| 54 | Federated States of Micronesia | 38 | 103000 | 0.036893 |


In [224]:
#First we sort the countries according to their high quality percentage values
hq_asc = hq_plot.sort_values(['high_quality_perc'],ascending=True, inplace=False, axis=0)

In [228]:
print("10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country")
hq_asc.head(10)

10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


| | country | high_quality_perc |
|---|---|---|
| 33 | Czech Republic | 0.393701 |
| 77 | Lithuania | 0.403226 |
| 90 | Morocco | 0.480769 |
| 128 | Tanzania | 0.490196 |
| 135 | Uganda | 0.531915 |
| 13 | Bolivia | 0.534759 |
| 78 | Luxembourg | 0.555556 |
| 104 | Peru | 0.564972 |
| 115 | Sierra Leone | 0.60241 |
| 92 | Namibia | 0.606061 |


In [226]:
hq_desc = hq_plot.sort_values(['high_quality_perc'],ascending=False, inplace=False, axis=0)

In [229]:
print("10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country")
hq_desc.head(10)

10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


| | country | high_quality_perc |
|---|---|---|
| 67 | Korea, North | 20.512821 |
| 141 | Uzbekistan | 10.344828 |
| 23 | Central African Republic | 10.294118 |
| 112 | Saudi Arabia | 10.084034 |
| 109 | Romania | 10.057471 |
| 51 | Guinea-Bissau | 9.52381 |
| 12 | Bhutan | 9.090909 |
| 144 | Vietnam | 8.376963 |
| 35 | Dominica | 8.333333 |
| 85 | Mauritania | 7.692308 |
