## A2 - Bias in Data

### Project Overview

The goal of this project is to explore the concept of 'bias' using data about Wikipedia articles on political figures from different countries. 

The dataset includes the set of political articles on Wikipedia, the predicted article quality scores for those articles, and a dataset of country populations.

The series of plots will be as follows:  
1) the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
2) the countries with the highest and lowest proportion of high quality articles about politicians.

## Part 1 - Getting the data

### The Wikipedia dataset

The Wikipedia dataset can be found on Figshare. It was extracted using the Wikimedia API and saved as a CSV file named page_data.csv. The columns in this file are:    
1) page: the Wikipedia article title  
2) country: the name of the country  
3) rev_id: the revision id for the last edit of the page

In [186]:
"""
    This cell reads page_data.csv into the dataframe "data".
    The shape of this dataframe is (47197, 3).
"""
import pandas as pd

data = pd.read_csv('page_data.csv')

### The Population dataset

The population data is from the Population Reference Bureau website (http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14). The dataset has six columns: Location, Location Type, TimeFrame, Data Type, Data, and Footnotes. I removed the columns that the analysis does not need, keeping only Location and Data.

In [205]:
pop = pd.read_csv('Population_Mid_2015.csv',header=1)
del pop['Location Type']
del pop['TimeFrame']
del pop['Data Type']
del pop['Footnotes']

To make merging the two dataframes easier, I renamed the 'Location' column to 'country'. I also renamed the 'Data' column to 'population'.

In [206]:
pop.columns = ['country','population']

In [253]:
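# The population values are strings with thousands separators (e.g. "15,473,900"); strip the commas
pop['population'] = pop['population'].str.replace(',', '')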

In [254]:
# Convert the cleaned strings to integers
pop['population'] = pop['population'].astype(int)
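As a possible alternative to the two-step cleanup above, `pandas.read_csv` accepts a `thousands` argument that parses comma-separated numbers while the file is read. The sketch below uses a small in-memory CSV with made-up rows as a stand-in for Population_Mid_2015.csv; the names `sample_csv` and `sample_pop` and the row values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for Population_Mid_2015.csv (values are illustrative only).
sample_csv = io.StringIO(
    'Location,Data\n'
    '"Examplestan","1,234,567"\n'
    '"Sampleland","89,000"\n'
)

# thousands=',' asks pandas to treat commas inside numbers as digit separators.
sample_pop = pd.read_csv(sample_csv, thousands=',')
sample_pop.columns = ['country', 'population']
```

With this approach the population column arrives as an integer column, so no separate string replacement or conversion step is needed.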

### Article quality prediction

The predicted quality scores for each article in the Wikipedia dataset come from a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time, and assigns a series of probabilities that the article is best described by one of the categories listed below.


The quality classes are, from best to worst:  
FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  
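One way to make this ordering usable in code is an ordered pandas categorical, which allows comparisons such as "GA or better". This is a minimal sketch rather than part of the original notebook; `QUALITY_ORDER`, `sample_quality`, and the sample values are assumptions for illustration.

```python
import pandas as pd

# The quality classes listed above, arranged from worst to best.
QUALITY_ORDER = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']

# Illustrative predictions, not taken from the dataset.
sample_quality = pd.Series(
    pd.Categorical(['Stub', 'GA', 'FA', 'C'], categories=QUALITY_ORDER, ordered=True)
)

# With an ordered categorical, "GA or better" is a single comparison.
is_high_quality = sample_quality >= 'GA'
```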


In [105]:
import requests
import json

# Identify the caller to the Wikimedia API; replace the placeholders with your own details
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

def get_ores_data(revision_ids, headers):
    """
    Taken from the example jupyter notebook, this function returns a json object containing the ORES data 
    for each article whose revision id is in 'revision_ids'. 
    An example of the json object is as follows:
    {'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'235107991': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00617021706415532,
       'C': 0.01705290459462909,
       'FA': 0.0015941304170005732,
       'GA': 0.0012422843354764665,
       'Start': 0.024596904658825667,
       'Stub': 0.9493435589299127}}}}}}}
    """
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Revision ids are sent as a single '|'-separated string
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

When I ran the above function on all the revision ids in the data, I found that the ORES API could not find some of them. I set those revision ids aside and collected the article quality predictions for the rest.

In [165]:
"""
    This bit of code extracts the value of the prediction from the JSON and puts it in a list aq (article qualities) 
    In this for loop, I call the get_ores_data function for a group of 100 revision ids at a time.
"""

aq = [] #aq = article qualities
for i in range(0,46700,100):
    ores_data = get_ores_data(data["rev_id"][i:(i + 100)] ,headers)
    for a in ores_data['enwiki']['scores']:
        aq.append(ores_data['enwiki']['scores'][a]['wp10']['score']['prediction'])
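The batching idea in the loop above can also be written as a small helper that splits any sequence into groups of at most 100, including a final partial group. This is only a sketch; `chunked`, `sample_rev_ids`, and `batches` are hypothetical names, and the ids are placeholders rather than real revision ids.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Illustrative ids; in the notebook this could be data["rev_id"].tolist().
sample_rev_ids = list(range(1, 251))
batches = list(chunked(sample_rev_ids, 100))
```

Each element of `batches` could then be passed to `get_ores_data` in turn, so the loop bounds do not have to be typed by hand.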


In [166]:
"""
    This cell scores the remaining rows (index 46700 onward) and appends the predictions to aq, 
    which will then be added as a column to the 'data' dataframe. 
    Rows 46862 and 46863 are left out because ORES cannot find their revisions.
"""

# (start, stop) index ranges for the remaining batches, skipping rows 46862 and 46863
remaining = [(46700, 46800), (46800, 46861), (46861, 46862), (46864, 46865),
             (46865, 46900), (46900, 47000), (47000, 47100), (47100, 47200)]

for start, stop in remaining:
    ores_data = get_ores_data(data["rev_id"][start:stop] ,headers)
    for a in ores_data['enwiki']['scores']:
        aq.append(ores_data['enwiki']['scores'][a]['wp10']['score']['prediction'])


In [167]:
get_ores_data([data["rev_id"][46862]] ,headers)


{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'807367030': {'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367030)',
      'type': 'RevisionNotFound'}}}}}}

In [168]:
get_ores_data([data["rev_id"][46863]] ,headers)

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'807367166': {'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367166)',
      'type': 'RevisionNotFound'}}}}}}
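Because a failed revision carries an 'error' key where a successful one carries 'score', the response can be parsed defensively so that a missing revision does not raise a `KeyError`. The sketch below is one possible way to do that; `extract_predictions` is a hypothetical helper, and `sample_response` is a trimmed-down combination of the two response shapes shown in this notebook.

```python
def extract_predictions(response, project='enwiki', model='wp10'):
    """Map each revision id (as a string) to its predicted class, or None on error."""
    predictions = {}
    for rev_id, result in response[project]['scores'].items():
        score = result[model].get('score')
        # Entries such as RevisionNotFound carry an 'error' key instead of 'score'.
        predictions[rev_id] = score['prediction'] if score else None
    return predictions

# Trimmed-down response combining the success and error shapes.
sample_response = {'enwiki': {'scores': {
    '235107991': {'wp10': {'score': {'prediction': 'Stub'}}},
    '807367030': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
}}}
sample_predictions = extract_predictions(sample_response)
```

Keeping the revision id as the dictionary key also records which prediction belongs to which article.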

ORES has no score for these two revisions, so I drop the rows with indices 46862 and 46863 to keep the dataframe aligned with the list of predictions.

In [187]:
data.drop([46862, 46863], inplace=True)

In [189]:
data['article_quality'] = aq
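Assigning a list as a column relies on the list being in exactly the same order as the rows. A more defensive option, sketched here under the assumption that predictions are stored in a dictionary keyed by revision id, is to match on `rev_id` and then drop the rows without a score. The names `sample_data`, `sample_lookup`, and `sample_scored` and all values are illustrative.

```python
import pandas as pd

# Illustrative rows; the middle revision id stands for one ORES could not find.
sample_data = pd.DataFrame({
    'page': ['Article A', 'Article B', 'Article C'],
    'rev_id': [235107991, 807367030, 355319463],
})

# Hypothetical lookup keyed by string revision id, with None for failures.
sample_lookup = {'235107991': 'Stub', '807367030': None, '355319463': 'Start'}

# Match on rev_id instead of list position, then drop unscored rows.
sample_data['article_quality'] = sample_data['rev_id'].astype(str).map(sample_lookup)
sample_scored = sample_data.dropna(subset=['article_quality']).reset_index(drop=True)
```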

In [191]:
data.head()

| | page | country | rev_id | article_quality |
|---|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 | Stub |
| 1 | Bir I of Kanem | Chad | 355319463 | Stub |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 | Stub |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 | Stub |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 | Stub |


### Merging the datasets

In the following cell, I merge the article dataframe and the population dataframe into the final dataframe that I will use for analysis. Because this is an inner merge, only countries that appear in both datasets are kept.

In [207]:
final_df = data.merge(pop, on = 'country', how='inner')
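An inner merge silently discards rows whose country name has no exact match in the other table. One way to see what would be lost is an outer merge with `indicator=True`, which pandas supports on `merge`. The frames below are made up for illustration ('Atlantis' and 'Oceania' are placeholder names), so this is a sketch of the check rather than a result from the real data.

```python
import pandas as pd

# Illustrative frames; 'Atlantis' and 'Oceania' are made-up names that match nothing.
left = pd.DataFrame({'country': ['Zambia', 'Chad', 'Atlantis'], 'page': ['p1', 'p2', 'p3']})
right = pd.DataFrame({'country': ['Zambia', 'Chad', 'Oceania'], 'population': [1, 2, 3]})

# An outer merge with indicator=True adds a '_merge' column naming each row's source.
check = left.merge(right, on='country', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['country', '_merge']]
```

Rows marked `left_only` or `right_only` in `unmatched` are the ones an inner merge would drop.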

In [208]:
"""
    Renaming the columns of the dataframe to make it easier to perform analysis
"""
final_df.columns= ['article_name','country','rev_id','article_quality','population']

In [209]:
final_df.head()

| | article_name | country | rev_id | article_quality | population |
|---|---|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 | Stub | 15473900 |
| 1 | Gladys Lundwe | Zambia | 757566606 | Stub | 15473900 |
| 2 | Mwamba Luchembe | Zambia | 764848643 | Stub | 15473900 |
| 3 | Thandiwe Banda | Zambia | 768166426 | Start | 15473900 |
| 4 | Sylvester Chisembele | Zambia | 776082926 | C | 15473900 |


## Part 2 - Analysis


The analysis calculates two percentages for each country: the number of politician articles relative to the country's population, and the share of those articles that are high quality. "High-quality" articles are defined as articles about politicians that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.

Examples:  
- If a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.  
- If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
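The two examples can be checked with a few lines of Python. The helper name `percentage` is an assumption made for this sketch; the numbers are the ones used in the examples.

```python
def percentage(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole

articles_per_population = percentage(10, 10_000)  # 10 articles, 10,000 people
high_quality_share = percentage(2, 10)            # 2 FA/GA articles out of 10
```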

In [288]:
#Getting total number of articles for each country
df_temp = final_df[['country','article_name']].groupby('country').count()

In [289]:
df_temp = df_temp.reset_index()

In [303]:
#Since population is an important factor in the analysis, we merge the population dataframe with the above data frame
article_pop = df_temp.merge(pop,on='country',how='inner')

In [304]:
article_pop.columns = ['country','total_articles','population']

In [305]:
#Calculating the number of articles as a percentage of each country's population
article_pop['perc_articles_pop'] = article_pop['total_articles'] / article_pop['population'] * 100

In [306]:
# Selecting the high-quality ("FA" or "GA") articles, used below to calculate their proportion for each country.
hq = final_df[final_df['article_quality'].isin(['FA', 'GA'])]

In [307]:
hq_count = hq.groupby('country').count()['article_name']

In [308]:
hq_count = hq_count.to_frame().reset_index()

In [311]:
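# hq_count only contains countries with at least one FA or GA article,
# so this inner merge leaves out countries that have no high-quality articles
hq_perc = pd.merge(hq_count,article_pop,on='country',how='inner')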

In [316]:
hq_perc['high_quality_perc'] = hq_perc['article_name']/hq_perc['total_articles']*100

In [317]:
hq_plot = hq_perc[['country','high_quality_perc']]
hq_plot.head()

| | country | high_quality_perc |
|---|---|---|
| 0 | Afghanistan | 5.810398 |
| 1 | Albania | 1.086957 |
| 2 | Algeria | 2.521008 |
| 3 | Angola | 1.818182 |
| 4 | Argentina | 3.225806 |
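The same percentage can be computed in one step by flagging FA/GA rows and averaging the flag per country, since the mean of a boolean column is the share of `True` values. Unlike the merge-based approach, this keeps countries whose share is 0%. The sketch uses made-up rows; `sample_df`, `is_hq`, and `hq_share` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Illustrative rows; country names and classes are made up for the example.
sample_df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality': ['FA', 'Stub', 'GA', 'C', 'Stub', 'Start'],
})

# Flag FA/GA rows, then average the flag per country and scale to a percentage.
is_hq = sample_df['article_quality'].isin(['FA', 'GA'])
hq_share = is_hq.groupby(sample_df['country']).mean() * 100
```

Whether countries with no high-quality articles should appear in the rankings is a design choice; this variant simply makes them visible.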


### Visualizations

In [318]:
#First we sort the countries according to their high quality percentage values
hq_asc = hq_plot.sort_values(['high_quality_perc'],ascending=True, inplace=False, axis=0)

In [322]:
#The 10 countries with the lowest percentage of high quality articles
#(only countries with at least one FA or GA article appear in hq_plot)
hq_asc.head(10)

| | country | high_quality_perc |
|---|---|---|
| 42 | Finland | 0.174825 |
| 131 | Tanzania | 0.245098 |
| 95 | Nepal | 0.275482 |
| 107 | Peru | 0.282486 |
| 77 | Lithuania | 0.403226 |
| 89 | Moldova | 0.469484 |
| 41 | Fiji | 0.502513 |
| 137 | Uganda | 0.531915 |
| 78 | Luxembourg | 0.555556 |
| 100 | Nigeria | 0.584795 |


In [323]:
hq_desc = hq_plot.sort_values(['high_quality_perc'],ascending=False, inplace=False, axis=0)

In [324]:
#Top 10 countries with the highest percentage of high quality articles are
hq_desc.head(10)

| | country | high_quality_perc |
|---|---|---|
| 67 | Korea, North | 23.076923 |
| 112 | Romania | 12.931034 |
| 115 | Saudi Arabia | 12.605042 |
| 22 | Central African Republic | 10.294118 |
| 111 | Qatar | 9.803922 |
| 52 | Guinea-Bissau | 9.52381 |
| 146 | Vietnam | 9.424084 |
| 12 | Bhutan | 9.090909 |
| 60 | Ireland | 8.136483 |
| 141 | United States | 7.832423 |
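For rankings like the two above, pandas also offers `nsmallest` and `nlargest`, which return the bottom or top rows by a column without a separate sort. A minimal sketch with made-up values follows; `sample_plot`, `lowest`, and `highest` are illustrative names.

```python
import pandas as pd

# Illustrative percentages, not taken from the results above.
sample_plot = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'high_quality_perc': [3.0, 0.5, 12.0, 7.5],
})

# Bottom two and top two rows by high_quality_perc.
lowest = sample_plot.nsmallest(2, 'high_quality_perc')
highest = sample_plot.nlargest(2, 'high_quality_perc')
```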
