## A2 - Bias in Data

### Project Overview

The goal of this project is to explore the concept of 'bias' using data on Wikipedia articles about political figures from different countries. 

The data consists of the set of Wikipedia articles about politicians, the predicted quality scores for those articles, and a dataset of country populations.

The series of plots will be as follows:  
1) the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
2) the countries with the highest and lowest proportion of high quality articles about politicians.

## Part 1 - Getting the data

### The Wikipedia dataset

The Wikipedia dataset can be found on Figshare. It was extracted using the Wikimedia API and saved as a CSV file named page_data.csv. The columns in this file are:  
1) country: the name of the country  
2) page: the Wikipedia article title  
3) rev_id: the revision id for the last edit of the page

In [86]:
"""
    This cell reads the file page_data.csv into the dataframe "data".
    The shape of this dataframe is (47197, 3).
    
"""
import pandas as pd

data = pd.read_csv('page_data.csv')

### The Population dataset

The population data is from the Population Reference Bureau website: http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14. The dataset has 6 columns: Location, Location Type, TimeFrame, Data Type, Data, and Footnotes. I removed the columns that the analysis does not need.

In [11]:
pop = pd.read_csv('Population_Mid_2015.csv',header=1)
# Drop the columns that are not needed; only 'Location' and 'Data' remain
pop = pop.drop(columns=['Location Type', 'TimeFrame', 'Data Type', 'Footnotes'])

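To make merging the two dataframes easier, I renamed the 'Location' column to 'country' and the 'Data' column to 'population'. The population values are stored as strings with thousands separators, so I also removed the commas and converted the values to integers.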

In [12]:
pop.columns = ['country','population']

In [13]:
pop['population'] = pop['population'].str.replace(',', '')

In [14]:
pop['population'] = pop['population'].astype(int)
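
The cleanup steps above can also be done while reading the file. The following is a minimal sketch, assuming the file has one title line followed by the header row described above; the rows in `SAMPLE_CSV` are made up for illustration, and the names `SAMPLE_CSV` and `pop_sketch` are hypothetical.

```python
import io

import pandas as pd

# Made-up sample that mimics the described layout: a title line, then the header row.
SAMPLE_CSV = '''Population mid-2015 illustrative sample
Location,Location Type,TimeFrame,Data Type,Data,Footnotes
Aland,Country,Mid-2015,Number,"1,234,000",
Borduria,Country,Mid-2015,Number,"56,000",
'''

pop_sketch = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    header=1,                      # the header row is the second line, as in the cell above
    usecols=['Location', 'Data'],  # keep only the columns the analysis needs
    thousands=',',                 # parse "1,234,000" as the integer 1234000
).rename(columns={'Location': 'country', 'Data': 'population'})

print(pop_sketch)
```

Reading only the needed columns and parsing the separators up front leaves nothing to delete or convert afterwards.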

### Article quality prediction

The predicted quality scores for each article in the Wikipedia dataset come from a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time, and assigns a series of probabilities that the article is best described by one of the categories listed below.


The quality classes are, from best to worst:  
FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article  


In [164]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

def get_ores_data(revision_ids, headers):
    """
    Taken from the example jupyter notebook, this function returns a json object containing the ORES data
    for each article whose revision id is in 'revision_ids'.
    An example of the json object is as follows:
    {'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'235107991': {'wp10': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.00617021706415532,
       'C': 0.01705290459462909,
       'FA': 0.0015941304170005732,
       'GA': 0.0012422843354764665,
       'Start': 0.024596904658825667,
       'Stub': 0.9493435589299127}}}}}}}
    """
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [197]:
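def get_ores_data(revision_ids,headers):
    # Redefines the function above: returns the predicted class for each revision id,
    # plus the revision ids for which the response contains no score.
    missing_ids = []
    scores = [] # predicted article quality classes
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"    
    params = {"project" : "enwiki",
              "model" : "wp10",
              "revids" : "|".join(str(x) for x in revision_ids)
             }

    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    
    response = api_call.json()
    for rev_id in revision_ids:
        try:
            scores.append(response["enwiki"]["scores"][str(rev_id)]["wp10"]["score"]["prediction"])
        except KeyError:
            # No score in the response for this revision id
            missing_ids.append(rev_id)
            print("no score found for rev_id", rev_id)
    
    return scores, missing_ids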

In [None]:
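aq = []           # predicted article quality classes
all_missing = []  # revision ids for which no score was found
for i in range(0, len(data["rev_id"]), 100):
    # Request the scores in batches of 100 revision ids
    revision_ids = data["rev_id"][i:(i + 100)]

    scores, missing_ids = get_ores_data(revision_ids, headers)
    aq.extend(scores)
    all_missing.extend(missing_ids)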
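
The batching loop above can be sketched without network access by passing the fetch step in as a function. This is an illustration only: `chunked`, `collect_predictions`, and `fake_fetch` are hypothetical names, and `fake_fetch` assumes that a revision without a score has no `score` key under `wp10`, which may not match every real ORES response.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_predictions(rev_ids, fetch, batch_size=100):
    """Return ({rev_id: prediction}, [rev_ids without a score])."""
    predictions, missing = {}, []
    for batch in chunked(list(rev_ids), batch_size):
        scores = fetch(batch)['enwiki']['scores']
        for rev_id in batch:
            try:
                predictions[rev_id] = scores[str(rev_id)]['wp10']['score']['prediction']
            except KeyError:
                missing.append(rev_id)
    return predictions, missing


def fake_fetch(batch):
    """Hypothetical stand-in for the ORES call: odd ids get a score, even ids do not."""
    return {'enwiki': {'scores': {
        str(r): {'wp10': {'score': {'prediction': 'Stub'}} if r % 2 else {'error': {}}}
        for r in batch
    }}}


predictions, missing = collect_predictions(range(1, 251), fake_fetch)
print(len(predictions), len(missing))
```

Keying the predictions by revision id, rather than appending them to a list, keeps each score tied to its article even when some revisions have no score.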

Drop the rows for which no score was found (row labels 46862, 46863, and 45837):

In [109]:
data.drop([46862, 46863, 45837], inplace=True)

In [111]:
data.shape

(47194, 3)

In [113]:
len(aq)

47230
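
One way to attach scores to articles without relying on list position or hard-coded row labels is to map them by `rev_id`. The sketch below uses made-up rows, and the `predictions` dictionary is a hypothetical stand-in for scores collected from ORES; it is not the method this notebook used.

```python
import pandas as pd

# Made-up rows with the same three columns as the `data` dataframe.
data_sketch = pd.DataFrame({
    'country': ['Aland', 'Aland', 'Borduria'],
    'page': ['Politician A', 'Politician B', 'Politician C'],
    'rev_id': [101, 102, 103],
})

# Hypothetical mapping from rev_id to predicted class; rev_id 102 has no score.
predictions = {101: 'GA', 103: 'Stub'}

# Rows whose rev_id is not in the mapping get NaN and are then dropped.
data_sketch['article_quality'] = data_sketch['rev_id'].map(predictions)
scored = data_sketch.dropna(subset=['article_quality']).reset_index(drop=True)

print(scored)
```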

### Merging the datasets

In the following cell, I merge the article data and the population dataframes into the final dataframe that I will use for the analysis.

In [None]:
final_df = data.merge(pop, on = 'country', how='inner')
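
An inner merge silently drops every country whose name does not appear in both dataframes. The following sketch, using made-up country names, shows one way to list the unmatched names with the `indicator` option of `merge`; the dataframe names are illustrative.

```python
import pandas as pd

articles = pd.DataFrame({'country': ['Aland', 'Borduria', 'Cimmeria'],
                         'page': ['A', 'B', 'C']})
population = pd.DataFrame({'country': ['Aland', 'Borduria', 'Dorne'],
                           'population': [1000, 2000, 3000]})

# An outer merge keeps every row; the _merge column records where each row came from.
check = articles.merge(population, on='country', how='outer', indicator=True)

only_articles = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
only_population = check.loc[check['_merge'] == 'right_only', 'country'].tolist()

print(only_articles, only_population)
```

Inspecting these two lists before the inner merge shows whether a country is missing because of a spelling difference rather than a real absence of data.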

In [None]:
"""
    Renaming the columns of the dataframe to make it easier to perform analysis
"""
final_df.columns= ['article_name','country','rev_id','article_quality','population']

In [None]:
final_df.head()

## Part 2 - Analysis


The analysis calculates two percentages for each country: the number of articles as a percentage of the population, and the percentage of articles that are high quality. "High quality" articles are defined as articles about politicians that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.

Examples:  
- if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.  
- if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
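
The two examples above work out as follows. `percentage` is a hypothetical helper used only for this illustration.

```python
def percentage(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


# 10 articles for a population of 10,000 people
articles_per_population = percentage(10, 10_000)  # 0.1

# 2 high-quality articles out of 10 articles
high_quality_share = percentage(2, 10)  # 20.0

print(articles_per_population, high_quality_share)
```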

In [None]:
#Getting total number of articles for each country
df_temp = final_df[['country','article_name']].groupby('country').count()

In [None]:
df_temp = df_temp.reset_index()

In [None]:
#Since population is an important factor in the analysis, we merge the population dataframe with the above data frame
article_pop = df_temp.merge(pop,on='country',how='inner')

In [None]:
article_pop.columns = ['country','total_articles','population']

In [None]:
#Calculating the number of articles as a percentage of each country's population
article_pop['perc_articles_pop']=article_pop['total_articles']/article_pop['population']
article_pop['perc_articles_pop'] = article_pop['perc_articles_pop']*100

In [None]:
# calculating the proportion of high-quality ("FA" or "GA") articles for each country.
hq = final_df.loc[final_df['article_quality'].isin(['FA', 'GA'])]

In [None]:
hq_count = hq.groupby('country').count()['article_name']

In [None]:
hq_count = hq_count.to_frame().reset_index()

In [None]:
hq_perc = pd.merge(hq_count,article_pop,on='country',how='inner')

In [None]:
hq_perc['high_quality_perc'] = hq_perc['article_name']/hq_perc['total_articles']*100

In [None]:
hq_plot = hq_perc[['country','high_quality_perc']]
hq_plot.head()
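
Because the high-quality counts are joined with an inner merge, a country with no FA or GA articles does not appear in `hq_perc` at all. The sketch below shows one way to keep such countries with a value of 0%, using made-up rows; it is an alternative formulation, not the computation used above.

```python
import pandas as pd

# Made-up rows: Aland has 2 high-quality articles out of 4, Borduria has none.
final_sketch = pd.DataFrame({
    'country': ['Aland', 'Aland', 'Aland', 'Aland', 'Borduria', 'Borduria'],
    'article_quality': ['FA', 'GA', 'Stub', 'C', 'Stub', 'Start'],
})

# 1 for a high-quality article, 0 otherwise.
is_hq = final_sketch['article_quality'].isin(['FA', 'GA']).astype(int)

# The mean of the 0/1 flag per country is the share of high-quality articles.
high_quality_perc = is_hq.groupby(final_sketch['country']).mean() * 100

print(high_quality_perc)
```

With this formulation the countries at the bottom of the ranking are those with 0%, which the inner merge would otherwise leave out.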

### Visualizations

In [None]:
#First we sort the countries according to their high quality percentage values
hq_asc = hq_plot.sort_values(['high_quality_perc'],ascending=True, inplace=False, axis=0)

In [None]:
#The 10 countries with the lowest percentage of high quality articles are
hq_asc.head(10)

In [None]:
hq_desc = hq_plot.sort_values(['high_quality_perc'],ascending=False, inplace=False, axis=0)

In [None]:
#The 10 countries with the highest percentage of high quality articles are
hq_desc.head(10)
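
The sort-then-head pattern above can also be written with `nlargest` and `nsmallest`. This is a small sketch on made-up values; `hq_sketch` is a hypothetical stand-in for `hq_plot`.

```python
import pandas as pd

hq_sketch = pd.DataFrame({
    'country': ['Aland', 'Borduria', 'Cimmeria', 'Dorne'],
    'high_quality_perc': [50.0, 0.0, 12.5, 3.0],
})

# The two countries with the highest and the lowest percentages.
top2 = hq_sketch.nlargest(2, 'high_quality_perc')
bottom2 = hq_sketch.nsmallest(2, 'high_quality_perc')

print(top2)
print(bottom2)
```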