# Bias on Wikipedia

The goal of this project is to explore the concept of 'bias' by analyzing data on Wikipedia articles about politicians from a variety of countries. By analyzing the coverage and quality of these articles, we expect to gain a deeper understanding of bias in Wikipedia's content.

### Step 1: Data acquisition
#### Getting the article and population data
The Wikipedia dataset is downloaded from Figshare (https://figshare.com/articles/Untitled_Item/5513449).

The population data comes from the Population Reference Bureau website (http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14).

Both files were downloaded on 10/28/2017. Read the CSV files into dataframes.

In [1]:
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
import copy
from datetime import datetime
import plotly
import plotly.graph_objs as go
from plotly import tools

plotly.__version__
plotly.offline.init_notebook_mode(connected=True)
plotly.plotly.sign_in('45220Zmb', '9EywutMKCyDsD5WpWSp9')

# endpoint = 'https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revid}'
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers={'User-Agent' : 'https://github.com/mbzhuang', 'From' : 'mbzhuang@uw.edu'}

In [2]:
pagedata = pd.read_csv("page_data.csv")
pagedata.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [3]:
population = pd.read_csv("Population Mid-2015.csv")
population.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Population Mid-2015
Location,Location Type,TimeFrame,Data Type,Data,Footnotes
Afghanistan,Country,Mid-2015,Number,32247000,
Albania,Country,Mid-2015,Number,2892000,
Algeria,Country,Mid-2015,Number,39948000,
Andorra,Country,Mid-2015,Number,78000,


The population file appears to have been exported from Excel, so its first row holds a title rather than the column names. Re-read the file using the second row as the header.

In [4]:
population = pd.read_csv("Population Mid-2015.csv", header = 1)
population.head()

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


#### Getting article quality predictions

Split the rev_id column into batches of 50 and join each batch into a single pipe-delimited string. Requesting the quality scores from ORES in batches of 50 is much more efficient than sending one request per revision.

In [5]:
# Split the rev_ids into consecutive chunks of 50.
composite_list = [pagedata['rev_id'][x:x+50] for x in range(0, len(pagedata['rev_id']), 50)]
# Join each chunk into one pipe-delimited string, e.g. "123|456|789".
str_list = ["|".join(str(i) for i in chunk) for chunk in composite_list]

Take a look at the first element of the concatenated revids.

In [6]:
str_list[0]

'235107991|355319463|391862046|391862070|391862409|391862819|391863340|391863361|391863617|391863809|393276188|393298432|393822005|394482629|394482891|394580295|394580630|394580939|394580993|394581284|394581557|394587483|394587547|395521877|395526568|401577829|413885084|433871129|433871165|435008715|437454659|437735138|438305657|439671509|439708117|440397578|440594068|440598656|441172886|441186581|441771813|441995465|442411422|442913438|442937236|443468553|443469862|443470532|443496992|443497423'
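The batching and URL construction can also be sketched without pandas. This is only an illustration: the helper name `batch_urls` and the `batch_size=2` demo value are assumptions made here, while the endpoint template and the three revids are taken from the notebook.

```python
# Endpoint template copied from the notebook's first cell.
ENDPOINT = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"


def batch_urls(rev_ids, batch_size=50, project="enwiki", model="wp10"):
    """Hypothetical helper: build one request URL per batch of revision IDs."""
    urls = []
    for start in range(0, len(rev_ids), batch_size):
        batch = rev_ids[start:start + batch_size]
        # ORES expects the revids of one request joined with "|".
        revids = "|".join(str(r) for r in batch)
        urls.append(ENDPOINT.format(project=project, model=model, revids=revids))
    return urls


# Tiny demo with a batch size of 2 so the split is easy to see.
urls = batch_urls([235107991, 355319463, 391862046], batch_size=2)
for url in urls:
    print(url)
```

With three revids and a batch size of 2, the helper produces two URLs: the first carries two revids and the second carries the remaining one.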

Get the article quality through ORES, save to JSON files, read in JSON files, and save the data into one dataframe.

In [7]:
df_quality = pd.DataFrame()

# Loop over all the elements of str_list, which contains all revids from pagedata, to get their quality information.
for j in range(len(str_list)):
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revids' : str_list[j]
              }
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()
    
    with open('JSON_Data/ORES_quality_data_'+ str(j) + '.json', 'w') as outfile:
        json.dump(response, outfile)

    with open('JSON_Data/ORES_quality_data_'+ str(j) + '.json', 'r') as infile:
        json_content = json.load(infile)
            
    DataFrame = pd.DataFrame.transpose(json_normalize({(i): json_content['enwiki']['scores'][i]['wp10']['score']['prediction'] 
                                                       for i in json_content['enwiki']['scores'].keys() 
                                                       if 'score' in json_content['enwiki']['scores'][i]['wp10'].keys()}))
    df_quality = df_quality.append(DataFrame)

df_quality['revid'] = df_quality.index
df_quality = pd.DataFrame(df_quality)
df_quality['revid'] = df_quality['revid'].astype(int)
df_quality['prediction'] = df_quality[0].astype(str)

# Test whether df_quality has the same number of rows as pagedata.
pagedata.shape[0] == df_quality.shape[0]

False

In [8]:
[x for x in list(pagedata.rev_id.astype(str)) if x not in list(df_quality.index)]

['807367030', '807367166']

ORES did not return a quality prediction for the two revids above, so they are absent from df_quality.
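One way to make such gaps visible while parsing, rather than discovering them afterwards, is to collect failed revids alongside the predictions. The sketch below is hedged: the function name `split_ores_scores` is hypothetical, the sample response is hand-written with made-up revids (it is not real ORES output), and the `error`/`message` keys are an assumption about how a failed score is reported. Only the `score`/`prediction` path is confirmed by the parsing code in the notebook.

```python
def split_ores_scores(response, project="enwiki", model="wp10"):
    """Hypothetical helper: separate scored revids from failed ones."""
    predictions, failures = {}, {}
    for revid, models in response[project]["scores"].items():
        result = models[model]
        if "score" in result:
            predictions[int(revid)] = result["score"]["prediction"]
        else:
            # Assumed shape: a failed revid carries an "error" entry instead of "score".
            failures[int(revid)] = result.get("error", {}).get("message", "unknown")
    return predictions, failures


# Hand-written sample with made-up revids, for illustration only.
sample = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"message": "revision not found"}}},
        }
    }
}

predictions, failures = split_ores_scores(sample)
print(predictions)
print(failures)
```

Keeping the failures in their own dictionary makes it straightforward to log or report which revisions were excluded from the analysis.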

### Step 2: Data processing
Now we have three dataframes, pagedata, population, and df_quality. Merge them together to get the final dataframe that will contain country, article_name, revision_id, article_quality, and population.

In [9]:
# Merge df_quality and pagedata.
quality_pagedata = pd.merge(df_quality, pagedata, left_on='revid', right_on='rev_id', how = 'left')
# Merge in population.
df_final = pd.merge(quality_pagedata, population, left_on='country', right_on='Location')
# Select needed columns from the final dataset and rename them.
df_final = df_final[['country', 'page', 'revid', 'prediction', 'Data']]
df_final.columns = ['country', 'article_name', 'revision_id', 'article_quality', 'population']
# Look at the data types of the dataframe.
df_final.dtypes

country            object
article_name       object
revision_id         int64
article_quality    object
population         object
dtype: object

In [10]:
# Convert population variable to integer.
df_final['population'] = [int(element.replace(',', '')) for element in df_final['population']]
# Save the dataframe to csv file.
df_final.to_csv("bias_analysis_processed_data.csv", index=False)
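Note that the merge with the population table uses the default inner join, so any article whose country name has no exact match in the `Location` column is dropped without notice. A minimal sketch, using made-up toy rows rather than the project data, of how `indicator=True` could surface the unmatched names and how the comma-formatted population strings could be converted in one vectorized step:

```python
import pandas as pd

# Toy stand-ins for quality_pagedata and population (values are invented).
pages = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Zambia", "Atlantis"],
})
pop = pd.DataFrame({
    "Location": ["Chad", "Zambia"],
    "Data": ["1,000", "2,500"],
})

# A left join with indicator=True adds a "_merge" column that records
# whether each row found a partner in the population table.
merged = pd.merge(pages, pop, left_on="country", right_on="Location",
                  how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)

# Keep only the matched rows and convert "1,000"-style strings to integers.
matched = merged[merged["_merge"] == "both"].copy()
matched["population"] = matched["Data"].str.replace(",", "", regex=False).astype(int)
print(matched[["country", "population"]])
```

Listing the unmatched names first makes it possible to decide whether they are genuine gaps in the population data or just spelling differences between the two sources.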

### Step 3: Data analysis and visualization

#### The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.

Analysis of the 10 highest-ranked countries by number of politician articles as a proportion of country population:

In [11]:
df_final = pd.read_csv("bias_analysis_processed_data.csv")
# Aggregate df_final dataframe, group by country and population, get the count of article names.
article_by_population = df_final.groupby(['country', 'population'])['article_name'].count()
article_by_population = pd.DataFrame(article_by_population)
article_by_population = pd.melt(article_by_population.reset_index(), id_vars=['country', 'population'], value_name='article_count')
# Calculate the proportion of politician articles by country population
article_by_population['article_by_population'] = article_by_population['article_count']/article_by_population['population']
# Get the top 10 highest-ranked countries
top_article_by_population = article_by_population.sort_values(by = 'article_by_population', ascending=False)[0:10]
top_article_by_population

Unnamed: 0,country,population,variable,article_count,article_by_population
120,Nauru,10860,article_name,53,0.00488
173,Tuvalu,11800,article_name,55,0.004661
141,San Marino,33000,article_name,82,0.002485
113,Monaco,38088,article_name,40,0.00105
97,Liechtenstein,37570,article_name,29,0.000772
107,Marshall Islands,55000,article_name,37,0.000673
72,Iceland,330828,article_name,206,0.000623
168,Tonga,103300,article_name,63,0.00061
3,Andorra,78000,article_name,34,0.000436
54,Federated States of Micronesia,103000,article_name,38,0.000369
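As a check on the arithmetic, Nauru's value is 53 articles / 10,860 people ≈ 0.00488. The same per-capita table can also be built more directly with `groupby(...).size()`, which avoids the `melt` step and the leftover `variable` column. This is only a sketch on toy rows: the two populations are taken from the tables in this notebook, but the article rows are invented.

```python
import pandas as pd

# Toy rows: populations come from the notebook, the articles are invented.
toy = pd.DataFrame({
    "country": ["Nauru", "Nauru", "India"],
    "article_name": ["a", "b", "c"],
    "population": [10860, 10860, 1314097616],
})

# size() counts rows per group; reset_index(name=...) names the count column.
per_capita = (
    toy.groupby(["country", "population"])
       .size()
       .reset_index(name="article_count")
)
per_capita["article_by_population"] = (
    per_capita["article_count"] / per_capita["population"]
)
per_capita = per_capita.sort_values("article_by_population", ascending=False)
print(per_capita)
```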


As shown below, most of the countries at the top of the list by articles per capita are also among the countries with the smallest populations.

In [12]:
bottom_by_population = df_final.groupby(['country'])['population'].mean().sort_values(ascending=True)[0:10]
bottom_by_population

country
Nauru                  10860
Tuvalu                 11800
San Marino             33000
Liechtenstein          37570
Monaco                 38088
Marshall Islands       55000
Dominica               68000
Andorra                78000
Antigua and Barbuda    90000
Seychelles             92833
Name: population, dtype: int64

Next, use plotly to visualize the 10 highest-ranked countries.

In [13]:
trace1 = go.Bar(
    x=top_article_by_population.country,
    y=top_article_by_population.article_by_population,
    marker=dict(
#         color='rgb(158,202,225)',
        line=dict(
#             color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title='Number of politician articles by population, 10 highest-ranked countries',
    height= 600, width = 1000,
    margin=go.Margin(
        l=100,
        r=150,
        b=150,
        t=100,
        pad=4
    ),
    xaxis = dict(title = 'Country'),
    yaxis = dict(title = 'Number of Articles Divided By Country Population', range=[0, 0.005]),
)

fig = go.Figure(data=data, layout=layout)
# Show image in jupyter notebook
plotly.offline.iplot(fig, filename='HighestCountryArticlebyPopulation')

# Save the image
plotly.plotly.image.save_as(fig, filename='HighestCountryArticlebyPopulation.png')

Analysis of the 10 lowest-ranked countries by number of politician articles as a proportion of country population:

In [14]:
# Get the 10 lowest-ranked countries
bottom_article_by_population = article_by_population.sort_values(by = 'article_by_population', ascending=True)[0:10]
bottom_article_by_population

Unnamed: 0,country,population,variable,article_count,article_by_population
73,India,1314097616,article_name,990,7.533687e-07
34,China,1371920000,article_name,1138,8.294944e-07
74,Indonesia,255741973,article_name,215,8.406911e-07
180,Uzbekistan,31290791,article_name,29,9.267902e-07
53,Ethiopia,98148000,article_name,105,1.069813e-06
86,"Korea, North",24983000,article_name,39,1.561062e-06
185,Zambia,15473900,article_name,26,1.680249e-06
166,Thailand,65121250,article_name,112,1.719869e-06
38,"Congo, Dem. Rep. of",73340200,article_name,142,1.936182e-06
13,Bangladesh,160411000,article_name,324,2.019812e-06


Similarly, several of the countries at the bottom of the list by articles per capita are among the most populous in the world, for instance China, India, Indonesia, and Bangladesh.

In [15]:
top_by_population = df_final.groupby(['country'])['population'].mean().sort_values(ascending=False)[0:10]
top_by_population

country
China            1371920000
India            1314097616
United States     321234172
Indonesia         255741973
Brazil            204519398
Pakistan          199047300
Nigeria           181839400
Bangladesh        160411000
Russia            144302000
Mexico            127017000
Name: population, dtype: int64

Use plotly to visualize the 10 lowest-ranked countries.

In [16]:
trace1 = go.Bar(
    x=bottom_article_by_population.country,
    y=bottom_article_by_population.article_by_population,
    marker=dict(
#         color='rgb(158,202,225)',
        line=dict(
#             color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title='Number of politician articles by population, 10 lowest-ranked countries',
    height= 600, width = 1000,
    margin=go.Margin(
        l=100,
        r=150,
        b=150,
        t=100,
        pad=4
    ),
    xaxis = dict(title = 'Country'),
    yaxis = dict(title = 'Number of Articles Divided By Country Population'),
)

fig = go.Figure(data=data, layout=layout)
# Show image in jupyter notebook
plotly.offline.iplot(fig, filename='LowestCountryArticlebyPopulation')

# Save the image
plotly.plotly.image.save_as(fig, filename='LowestCountryArticlebyPopulation.png')


#### The countries with the highest and lowest proportion of high quality articles about politicians.
Analysis of the 10 highest-ranked countries by number of GA- and FA-quality articles as a proportion of all articles about politicians from that country:

In [23]:
# Get the total article count from the dataframe above, article_by_population
article_count_df = article_by_population[['country', 'article_count']]
# Get the count of GA and FA articles for each country
FA_df = df_final[df_final.article_quality == 'FA']
FA_df = pd.DataFrame(FA_df.groupby(['country'])['article_quality'].count())
FA_df = pd.melt(FA_df.reset_index(), id_vars=['country'], value_name='FA_count')
GA_df = df_final[df_final.article_quality == 'GA']
GA_df = pd.DataFrame(GA_df.groupby(['country'])['article_quality'].count())
GA_df = pd.melt(GA_df.reset_index(), id_vars=['country'], value_name='GA_count')
# Merge article_count_df, FA_df, and GA_df
article_high_quality = pd.merge(article_count_df, FA_df, on='country', how ='left')
article_high_quality = pd.merge(article_high_quality, GA_df, on='country', how ='left')
# Fill NA with 0
article_high_quality = article_high_quality.fillna(value = 0)
# Calculate proportion of high quality articles
article_high_quality['high_quality_proportion'] = (article_high_quality['FA_count'] + article_high_quality['GA_count'])/article_high_quality['article_count']
article_high_quality = article_high_quality[['country', 'high_quality_proportion']]
# Get the top 10 highest-ranked countries
top_high_quality = article_high_quality.sort_values(by = 'high_quality_proportion', ascending=False)[0:10]
top_high_quality

Unnamed: 0,country,high_quality_proportion
86,"Korea, North",0.230769
138,Romania,0.12931
143,Saudi Arabia,0.12605
31,Central African Republic,0.117647
137,Qatar,0.098039
68,Guinea-Bissau,0.095238
183,Vietnam,0.094241
19,Bhutan,0.090909
77,Ireland,0.081365
178,United States,0.078324
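The same proportion can be computed more compactly by flagging each article as high quality and taking the group mean, since the mean of a boolean column is the share of `True` values. This is a sketch on invented toy rows, and the `is_high_quality` column name is an assumption made here rather than part of the notebook.

```python
import pandas as pd

# Invented toy rows: country X has 2 of 4 high-quality articles, Y has 0 of 2.
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Y"],
    "article_quality": ["FA", "GA", "Stub", "Start", "Stub", "C"],
})

# Flag FA and GA articles, then average the flag within each country.
toy["is_high_quality"] = toy["article_quality"].isin(["FA", "GA"])
high_quality = (
    toy.groupby("country")["is_high_quality"]
       .mean()
       .reset_index(name="high_quality_proportion")
       .sort_values("high_quality_proportion", ascending=False)
)
print(high_quality)
```

Countries with no FA or GA articles come out as 0.0 automatically, so no separate merge or `fillna` step is needed in this formulation.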


Use plotly to visualize the 10 highest-ranked countries.

In [26]:
trace1 = go.Bar(
    x=top_high_quality.country,
    y=top_high_quality.high_quality_proportion,
    marker=dict(
#         color='rgb(158,202,225)',
        line=dict(
#             color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1]
layout = go.Layout(
    title='High quality article proportion, 10 highest-ranked countries',
    height= 600, width = 1000,
    margin=go.Margin(
        l=100,
        r=150,
        b=150,
        t=100,
        pad=4
    ),
    xaxis = dict(title = 'Country'),
    yaxis = dict(title = 'Percentage of FA and GA grade articles', tickformat=".0%"),
)

fig = go.Figure(data=data, layout=layout)
# Show image in jupyter notebook
plotly.offline.iplot(fig, filename='HighestArticleQaulity')

# Save the image
plotly.plotly.image.save_as(fig, filename='HighestArticleQaulity.png')

This is an interesting finding: North Korea not only has the highest proportion of high-quality politician articles (about 23%, or 9 of its 39 articles), but its rate is also roughly 1.8 times that of second-ranked Romania (about 13%).

Analysis of the 10 lowest-ranked countries by number of GA- and FA-quality articles as a proportion of all articles about politicians from that country:

In [19]:
# Get the 10 lowest-ranked countries
bottom_high_quality = article_high_quality.sort_values(by = 'high_quality_proportion', ascending=True)[0:10]
bottom_high_quality

Unnamed: 0,country,high_quality_proportion
142,Sao Tome and Principe,0.0
172,Turkmenistan,0.0
107,Marshall Islands,0.0
69,Guyana,0.0
36,Comoros,0.0
170,Tunisia,0.0
45,Djibouti,0.0
46,Dominica,0.0
100,Macedonia,0.0
168,Tonga,0.0


All 10 of these countries have a high-quality proportion of 0, so a bar chart would show nothing and no visualization is made.
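Because every row in this bottom-10 table is tied at 0, which ten countries appear depends only on sort order. A hedged sketch of how all tied countries could be listed instead, shown on an invented toy frame that mirrors the columns of `article_high_quality`:

```python
import pandas as pd

# Invented toy frame with the same columns as article_high_quality.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "high_quality_proportion": [0.0, 0.1, 0.0, 0.0],
})

# Select every country with no FA or GA articles, in alphabetical order.
zero_quality = (
    toy.loc[toy["high_quality_proportion"] == 0, "country"]
       .sort_values()
       .tolist()
)
print(len(zero_quality), zero_quality)
```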

In [20]:
# df = pd.DataFrame.from_dict({(i): response['enwiki']['scores'][i]['wp10']['score']['prediction'] 
#                                   for i in response['enwiki']['scores'].keys()},
#                               orient='index')
# df.shape