In [60]:
import requests
import json
import csv
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

### Assignment 2: Bias in Data
#### Project Overview

The goal of this assignment is to explore the concept of 'bias' through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. First, we use a machine learning service called [ORES](https://ores.wikimedia.org) to estimate the quality of each article in the Wikipedia dataset. Then, we combine the dataset of Wikipedia articles with a dataset of country populations. Finally, we build four tables from this combined dataset to address the following analyses:

1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


#### Getting the article and population data
The first step is getting the data, which lives in several different places.
1. The Wikipedia dataset can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449).
2. The population data is on the [Population Reference Bureau website](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14).

#### Getting article quality predictions
We need to get the predicted quality scores for each article in the Wikipedia dataset. 
For this step, we're using a Wikimedia API endpoint for a machine learning system called $ORES$ ("Objective Revision Evaluation Service"). 

$ORES$ estimates the quality of an article and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

See the documentation [here](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model).

Below is a function that gets each article's quality by using $ORES$. It returns a list containing all the quality predictions.

In [61]:
def get_article_quality(revids):
    """
    Get ORES quality predictions for a batch of revision ids.

    Args:
        revids (list): revision ids (str or int) to score in one request

    Returns:
        list of str: the predicted quality class for each revision id, in the
        same order as revids; 'NA' where no prediction is available
    """
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revids)
              }
    api_call = requests.get(endpoint.format(**params))
    response = api_call.json()

    # keep a copy of the most recent raw response on disk for inspection
    with open('data.json', 'w') as f:
        json.dump(response, f)

    prediction = response['enwiki']['scores']
    quality = []
    # look up each id explicitly so the output order matches the input order
    for revid in revids:
        result = prediction.get(str(revid), {}).get('wp10', {})
        if 'score' in result:
            quality.append(result['score']['prediction'])
        else:
            # ORES returned an 'error' entry (or nothing) for this revision
            quality.append('NA')
    return quality
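
To make the response handling easier to follow, here is a minimal sketch that runs the same kind of lookup against a hand-written response. The sample dictionary is an assumption modelled only on the keys the function above reads (`enwiki`, `scores`, `wp10`, `score`, `prediction`, `error`). The revision ids, the `probability` values and the error message are invented for illustration, and `extract_predictions` is a hypothetical helper name; a real ORES response contains more fields.

```python
# Hand-written stand-in for an ORES reply (assumed shape, invented values).
sample_response = {
    'enwiki': {
        'scores': {
            '1001': {'wp10': {'score': {'prediction': 'Stub',
                                        'probability': {'Stub': 0.9, 'Start': 0.1}}}},
            '1002': {'wp10': {'error': {'message': 'revision not found'}}},
            '1003': {'wp10': {'score': {'prediction': 'GA',
                                        'probability': {'GA': 0.6, 'B': 0.4}}}},
        }
    }
}


def extract_predictions(response, revids, project='enwiki', model='wp10'):
    """Return one label per revision id, in the order the ids were given."""
    scores = response[project]['scores']
    labels = []
    for revid in revids:
        result = scores.get(str(revid), {}).get(model, {})
        # A missing entry and an 'error' entry both count as "no prediction".
        if 'score' in result:
            labels.append(result['score']['prediction'])
        else:
            labels.append('NA')
    return labels


print(extract_predictions(sample_response, [1003, 1002, 1001]))
```

Looking each id up by key, rather than iterating over the response, keeps the predictions aligned with the rows of the CSV file even if the service returns the scores in a different order.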

#### Get the Prediction
In order to get a prediction for each article in the Wikipedia dataset, we need to read page_data.csv into Python and go through the dataset line by line. We use the value of the 'rev_id' column (the third column) in the API query.

I call the get_article_quality function every time I have collected 100 ids, and it gives back the article quality prediction corresponding to each id. $ORES$ allows you to submit multiple revision ids in one request and returns a prediction for each of them, so this approach speeds up the process a lot!

Once this procedure is done, we can print the list of predictions to inspect the output.

Note that there are 4 articles that don't have prediction values.

In [62]:
revids = []
quality = []
with open("page_data.csv", 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        # add the id (third column) into the revids list
        revids.append(row[2])
        # send one request for every 100 ids collected
        if len(revids) == 100:
            quality = quality + get_article_quality(revids)
            revids = []
    # score the ids left over from the last partial batch
    if revids:
        quality = quality + get_article_quality(revids)
#print(quality)
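
The batching could also be separated from the CSV reading. The sketch below is one hypothetical way to do that: `batches` is a helper name introduced here, not part of the notebook, and the default size of 100 is simply the batch size used above. The revision ids are made up.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example with made-up revision ids: 250 ids are split into three batches.
fake_revids = list(range(250))
print([len(batch) for batch in batches(fake_revids)])
```

With such a helper, the main loop would reduce to one call to get_article_quality per batch, and an empty input simply produces no requests.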

#### Getting the article data
Now we need to use pandas to read the Wikipedia data into a data frame. Then, we create a new column called 'article_quality' in page_data which contains every article's prediction.

Let's see the dimension and first 5 rows of the data.

In [63]:
page_data = pd.read_csv("page_data.csv")
page_data['article_quality'] = quality
print(page_data.count())
page_data.head()

page               47197
country            47197
rev_id             47197
article_quality    47197
dtype: int64


Unnamed: 0,page,country,rev_id,article_quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


Let's take a look at the counts for these 6 quality categories.
We can see there are 4 'NA' values in the 'article_quality' column. That means there are four rows without article quality predictions. Thus, we need to remove those four rows before moving forward.

In [64]:
print(page_data['article_quality'].value_counts())
page_data = page_data.drop(page_data.index[page_data.article_quality == 'NA'])
print()
print(page_data['article_quality'].value_counts())
print()
print(page_data.count())

Stub     24666
Start    14873
C         5851
GA         774
B          737
FA         292
NA           4
Name: article_quality, dtype: int64

Stub     24666
Start    14873
C         5851
GA         774
B          737
FA         292
Name: article_quality, dtype: int64

page               47193
country            47193
rev_id             47193
article_quality    47193
dtype: int64


#### Population Dataset
Next, download the population data as a CSV file from the Population Reference Bureau website and use pandas to read it into a data frame.

Since the header starts in row 3 of Population Mid-2015.csv, remember to skip the first two rows when reading the file. I also change the headers to lowercase.

Let's see the dimension and first 5 rows of the data.

In [65]:
population = pd.read_csv("Population Mid-2015.csv", skiprows=2)
population.columns = population.columns.str.lower()
print(population.count())
population.head()

location         210
location type    210
timeframe        210
data type        210
data             210
footnotes          0
dtype: int64


Unnamed: 0,location,location type,timeframe,data type,data,footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


The 'footnotes' column is irrelevant here, so we can drop it. We also need to merge the population data with the Wikipedia data by country name, so it is better to rename the 'location' column to 'country'.

Moreover, we need to remove all the ',' characters from the 'data' column and convert it to int type.

Let's see the first 5 rows of the data.

In [66]:
population.drop('footnotes', axis=1, inplace=True)
population.rename(columns={'location':'country'}, inplace=True)
population['data'] = [int(x.replace(',', '')) for x in population['data']]
print(population.info())
population.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 5 columns):
country          210 non-null object
location type    210 non-null object
timeframe        210 non-null object
data type        210 non-null object
data             210 non-null int64
dtypes: int64(1), object(4)
memory usage: 8.3+ KB
None


Unnamed: 0,country,location type,timeframe,data type,data
0,Afghanistan,Country,Mid-2015,Number,32247000
1,Albania,Country,Mid-2015,Number,2892000
2,Algeria,Country,Mid-2015,Number,39948000
3,Andorra,Country,Mid-2015,Number,78000
4,Angola,Country,Mid-2015,Number,25000000


#### Combining the datasets
Now we can combine these two datasets. We need to merge the Wikipedia data and the population data on their common attribute: both datasets have a field containing country names that we can use for the merge.

After merging the data, we are going to drop the rows that cannot be matched, either because the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Therefore, we use an inner join, which keeps only the country names that appear in both datasets. Then, we rename some of the columns and drop the irrelevant ones in order to construct our final dataset. It has the following columns:
1. country
2. article_name
3. revision_id
4. article_quality
5. population

Let's see the dimension and first 5 rows of the data, and then save the data frame to a csv file named 'final_data.csv'.

In [67]:
final = pd.merge(page_data, population, how='inner', on=['country'])
# rename the columns we keep and drop the ones we do not need
final.rename(columns={'page': 'article_name',
                      'rev_id': 'revision_id',
                      'data': 'population'}, inplace=True)
final.drop(['location type', 'timeframe', 'data type'], axis=1, inplace=True)
print(final.info())
final.to_csv("final_data.csv", index=False)
final.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45795 entries, 0 to 45794
Data columns (total 5 columns):
article_name       45795 non-null object
country            45795 non-null object
revision_id        45795 non-null int64
article_quality    45795 non-null object
population         45795 non-null int64
dtypes: int64(2), object(3)
memory usage: 2.1+ MB
None


Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub,15473900
1,Gladys Lundwe,Zambia,757566606,Stub,15473900
2,Mwamba Luchembe,Zambia,764848643,Stub,15473900
3,Thandiwe Banda,Zambia,768166426,Start,15473900
4,Sylvester Chisembele,Zambia,776082926,C,15473900
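
The inner join silently discards the unmatched rows, so it can be useful to see which country names failed to match on each side. Below is a minimal sketch on toy data, assuming an outer join with the `indicator=True` option of `pd.merge`. The frames `pages` and `pops` and most of their rows are invented stand-ins for page_data and population; only the Zambia population is taken from the table above.

```python
import pandas as pd

# Toy stand-ins for page_data and population (mostly invented rows).
pages = pd.DataFrame({'page': ['A', 'B', 'C'],
                      'country': ['Zambia', 'Chad', 'Atlantis']})
pops = pd.DataFrame({'country': ['Zambia', 'Chad', 'Narnia'],
                     'data': [15473900, 1000000, 1]})

# An outer join keeps every row; the _merge column records which side it came from.
check = pd.merge(pages, pops, how='outer', on='country', indicator=True)

only_in_pages = sorted(check.loc[check['_merge'] == 'left_only', 'country'].unique())
only_in_pops = sorted(check.loc[check['_merge'] == 'right_only', 'country'].unique())
print(only_in_pages)
print(only_in_pops)
```

Running the same check on the real data frames would list the countries that account for the drop from 47193 to 45795 rows.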


### Analysis
Now we are going to analyze the proportion (as a percentage) of articles-per-population and of high-quality articles for each country. By "high-quality" articles we mean the articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.

* If a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.

Therefore, we will calculate the percentage of articles per population for each country. 

Let's see the first 5 rows of the data.

In [68]:
article_count = final[['country', 'article_name']].groupby('country').count().reset_index(level=0)
population_count = final[['country', 'population']].groupby('country').mean().reset_index(level=0)
article_population = article_count.merge(population_count, how='inner', on=['country'])
article_population.rename(columns={'article_name':'article_total'}, inplace=True)
article_population['article_population_percent'] = round(article_population['article_total'] 
                                                         / article_population['population'] * 100, 6)
article_population.head(5)

Unnamed: 0,country,article_total,population,article_population_percent
0,Afghanistan,327,32247000,0.001014
1,Albania,460,2892000,0.015906
2,Algeria,119,39948000,0.000298
3,Andorra,34,78000,0.04359
4,Angola,110,25000000,0.00044
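
As a quick sanity check of the arithmetic, the sketch below recomputes the worked example from the text and one row of the table above. `percent` is a hypothetical helper introduced here for illustration only.

```python
def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return part / whole * 100


# The worked example from the text: 10 articles for 10,000 people.
print(percent(10, 10000))

# Andorra's row from the table above: 34 articles for 78,000 people.
print(round(percent(34, 78000), 6))
```

Both values agree with the notebook: roughly 0.1% for the worked example and 0.04359% for Andorra.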


* If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.

Therefore, we will select all the high-quality articles from the final dataset and then count them for each country by grouping on the country name.

Note: We need to use a left join between the two tables in order to keep every country. Otherwise, we would lose the countries that have 0 high-quality articles, since they no longer appear once we filter down to FA or GA class articles.

Let's see the first 5 rows of the data.

In [69]:
# keep only the articles predicted to be FA or GA
high_quality_article = final.loc[final['article_quality'].isin(['FA', 'GA'])]
high_quality_article_count = high_quality_article[['country', 'article_name']].groupby('country').count().reset_index(level=0)
high_quality_article_count.rename(columns={'article_name':'high_quality_total'}, inplace=True)
# left join so that countries with no FA/GA articles are kept
high_quality = article_count.merge(high_quality_article_count, how='left', on=['country'])
# countries without a match get NaN; treat that as 0 high-quality articles
high_quality['high_quality_total'] = high_quality['high_quality_total'].fillna(0).astype(int)
high_quality.rename(columns={'article_name':'article_total'}, inplace=True)
high_quality['high_quality_percent'] = round(high_quality['high_quality_total']
                                             / high_quality['article_total'] * 100, 6)
high_quality.head()

Unnamed: 0,country,article_total,high_quality_total,high_quality_percent
0,Afghanistan,327,15,4.587156
1,Albania,460,5,1.086957
2,Algeria,119,2,1.680672
3,Andorra,34,0,0.0
4,Angola,110,1,0.909091
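
An alternative that avoids the left join is to flag each article first and aggregate once, so countries with no FA/GA articles stay in the result automatically. This is only a sketch on invented toy data (countries X and Y), not the approach used in the cell above; it mirrors the 2-out-of-10 example from the text.

```python
import pandas as pd

# Toy data (invented): country X has 10 articles, 2 of them GA/FA;
# country Y has 3 articles, none of them GA/FA.
toy = pd.DataFrame({
    'country': ['X'] * 10 + ['Y'] * 3,
    'article_quality': ['FA', 'GA'] + ['Stub'] * 8 + ['Start', 'C', 'B'],
})

# True where the predicted class counts as high quality.
toy['is_high_quality'] = toy['article_quality'].isin(['FA', 'GA'])

# Count all articles and sum the True flags per country in one pass.
summary = toy.groupby('country')['is_high_quality'].agg(
    article_total='count',
    high_quality_total='sum',
).reset_index()
summary['high_quality_percent'] = (summary['high_quality_total'] * 100
                                   / summary['article_total'])
print(summary)
```

Country Y keeps its row with a count of 0, which is the behaviour the left join and `fillna(0)` were used to obtain.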


With the two new data frames constructed above, we can show the following tables:

1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

I am going to use the two dataframes *article_population* and *high_quality* to produce these tables. The idea is simply to sort each dataframe in ascending or descending order and keep the first 10 rows.

You can also go to my github and see the tables I have provided [here](https://github.com/lzctony/data-512-a2/blob/master/README.md).

#### 1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [70]:
table_1 = article_population.sort_values(['article_population_percent'], ascending=False)[:10]
table_1

Unnamed: 0,country,article_total,population,article_population_percent
120,Nauru,53,10860,0.488029
173,Tuvalu,55,11800,0.466102
141,San Marino,82,33000,0.248485
113,Monaco,40,38088,0.10502
97,Liechtenstein,29,37570,0.077189
107,Marshall Islands,37,55000,0.067273
72,Iceland,206,330828,0.062268
168,Tonga,63,103300,0.060987
3,Andorra,34,78000,0.04359
54,Federated States of Micronesia,38,103000,0.036893


#### 2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [54]:
table_2 = article_population.sort_values(['article_population_percent'], ascending=True)[:10]
table_2

Unnamed: 0,country,article_total,population,article_population_percent
73,India,989,1314097616,7.5e-05
34,China,1138,1371920000,8.3e-05
74,Indonesia,215,255741973,8.4e-05
180,Uzbekistan,29,31290791,9.3e-05
53,Ethiopia,105,98148000,0.000107
86,"Korea, North",39,24983000,0.000156
185,Zambia,26,15473900,0.000168
166,Thailand,112,65121250,0.000172
38,"Congo, Dem. Rep. of",142,73340200,0.000194
13,Bangladesh,324,160411000,0.000202


#### 3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [55]:
table_3 = high_quality.sort_values(['high_quality_percent'], ascending=False)[:10]
table_3

Unnamed: 0,country,article_total,high_quality_total,high_quality_percent
86,"Korea, North",39,9,23.076923
143,Saudi Arabia,119,14,11.764706
180,Uzbekistan,29,3,10.344828
31,Central African Republic,68,7,10.294118
138,Romania,348,34,9.770115
68,Guinea-Bissau,21,2,9.52381
19,Bhutan,33,3,9.090909
183,Vietnam,191,16,8.376963
46,Dominica,12,1,8.333333
109,Mauritania,52,4,7.692308


#### 4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [56]:
table_4 = high_quality.sort_values(['high_quality_percent'], ascending=True)[:10]
table_4

Unnamed: 0,country,article_total,high_quality_total,high_quality_percent
172,Turkmenistan,33,0,0.0
164,Tajikistan,40,0,0.0
113,Monaco,40,0,0.0
117,Mozambique,60,0,0.0
120,Nauru,53,0,0.0
168,Tonga,63,0,0.0
30,Cape Verde,37,0,0.0
65,Guadeloupe,49,0,0.0
83,Kazakhstan,79,0,0.0
158,Suriname,40,0,0.0
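
Because every row in this last table is tied at 0.0, which ten countries appear depends only on how the sort happens to order the ties. A minimal sketch of how the ties could be made visible, assuming `DataFrame.nsmallest` with `keep='all'` and using invented toy values:

```python
import pandas as pd

# Toy data (invented values) with a three-way tie at 0.0.
toy = pd.DataFrame({
    'country': ['P', 'Q', 'R', 'S', 'T'],
    'high_quality_percent': [0.0, 5.0, 0.0, 0.0, 2.5],
})

# Taking the first 2 rows after sorting hides the third tied country.
first_two = toy.sort_values('high_quality_percent').head(2)

# keep='all' returns every row tied with the last one that makes the cut.
with_ties = toy.nsmallest(2, 'high_quality_percent', keep='all')

print(len(first_two), len(with_ties))
```

Applied to *high_quality*, the same call would return every country with no GA or FA articles rather than an arbitrary ten of them.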


### Writeup

In this assignment, when I used the ORES API to get the article quality predictions, four observations in the Wikipedia data did not get a prediction value. Moreover, after merging the *Wikipedia dataset* and the *Population dataset*, the observations whose country names did not match were dropped.

The analysis is divided into two parts:

1. countries in terms of number of politician articles as a proportion of country population
2. countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In table 1 (10 highest-ranked countries in terms of number of politician articles as a proportion of country population), we can see that very small countries have the highest proportion of politician articles per population.

In table 2 (10 lowest-ranked countries in terms of number of politician articles as a proportion of country population), we can see that India and China have the lowest proportion of politician articles per population, which makes sense because China has the largest population in the world and India the second-largest.

| country | article_total | population | article_population_percent |
| --- | --- | --- | --- |
| India | 989 | 1314097616 | 0.000075 |
| China | 1138 | 1371920000 | 0.000083 |

In table 3 (10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country), I am very surprised to see that North Korea has the highest proportion of high-quality articles (9 of its 39 articles). This makes me wonder whether there is bias in the analysis, given what we know about conditions in North Korea and given that it is not an English-speaking country. I would like to re-examine the article quality predictions for North Korea, or explore the data further.

| country | high_quality_total | article_total |
| --- | --- | --- |
| United States | 79 | 1098 |
| United Kingdom | 48 | 867 |
| Australia | 44 | 1566 |
| Spain | 37 | 881 |
| China | 35 | 1138 |
| Romania | 34 | 348 |
| Russia | 32 | 882 |
| Canada | 29 | 852 |
| Ireland | 28 | 381 |
| France | 22 | 168 |

Moreover, the table above (**10 highest-ranked countries in terms of number of high-quality articles**) shows that the United States, the United Kingdom and Australia have the highest total counts of high-quality articles in the world. I think this makes sense because English is the native language of most people in these three countries. I am surprised to see Spain and China in the top 5, because they are not English-speaking countries.
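
The notebook does not include the cell that produced the ranking by count, so the following is only a sketch of one plausible way to obtain it: sort by `high_quality_total` instead of by the percentage. It uses a handful of rows copied from the tables in this notebook rather than the full *high_quality* data frame, and it also shows how the two rankings differ.

```python
import pandas as pd

# A few rows copied from the tables in this notebook.
rows = pd.DataFrame({
    'country': ['Korea, North', 'Saudi Arabia', 'Romania', 'Vietnam', 'United States'],
    'article_total': [39, 119, 348, 191, 1098],
    'high_quality_total': [9, 14, 34, 16, 79],
})
rows['high_quality_percent'] = rows['high_quality_total'] * 100 / rows['article_total']

# Rank once by raw count and once by share of the country's articles.
by_count = rows.sort_values('high_quality_total', ascending=False)
by_percent = rows.sort_values('high_quality_percent', ascending=False)

print(by_count[['country', 'high_quality_total', 'article_total']])
print(by_percent['country'].tolist())
```

On these rows the United States leads by count while North Korea leads by percentage, which is the contrast discussed in this writeup.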

In table 4 (10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country), we can see they are all 0%. Moreover, there are 39 countries that have 0 high-quality politician articles.

In my opinion, based on the analysis and tables above, there are biases in both Wikipedia's articles and the ORES API. The articles are collected from English Wikipedia, and many countries are not English-speaking, which might affect the quality of the articles written about them in English.