# A2 - Bias in English Wikipedia Articles

This assignment is to assess the biases in English Wikipedia. More information on this assignment can be found here: https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)/Assignments#A2:_Bias_in_data

In [1]:
# Imports for code
import requests
import json
# import csv
import pandas as pd

## Gather the Data

To use the ORES API, I used the code below. I got this code from the repository here: https://github.com/Ironholds/data-512-a2

The code is from this Python Notebook: https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb

In [3]:
# Customize these with your own information by replacing "hmurph3"
headers = {
    'User-Agent': 'https://github.com/hmurph3',
    'From': 'hmurph3@uw.edu'
}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so the request actually identifies who is calling the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    #print(json.dumps(response, indent=4, sort_keys=True)) 
    
    # I commented out the print so the function returns the response instead of printing it
    return response


# So if we grab some example revision IDs and turn them into a list and then call get_ores_data...
example_ids = [783381498, 807355596, 757539710]
example_call= get_ores_data(example_ids, headers)

In [4]:
print(example_call)

{'enwiki': {'models': {'wp10': {'version': '0.6.1'}}, 'scores': {'757539710': {'wp10': {'score': {'prediction': 'Start', 'probability': {'B': 0.05635270475191951, 'C': 0.17635417131683803, 'FA': 0.001919869734464717, 'GA': 0.005517075264277984, 'Start': 0.732764644204933, 'Stub': 0.027091534727566813}}}}, '783381498': {'wp10': {'score': {'prediction': 'Start', 'probability': {'B': 0.039498449850621085, 'C': 0.06068466061111685, 'FA': 0.0029057427468351755, 'GA': 0.007477221115409147, 'Start': 0.5674464916024892, 'Stub': 0.3219874340735285}}}}, '807355596': {'wp10': {'score': {'prediction': 'Start', 'probability': {'B': 0.04566408685167919, 'C': 0.10144128886317841, 'FA': 0.002651239009002438, 'GA': 0.006433022662730785, 'Start': 0.7675063182740381, 'Stub': 0.07630404433937113}}}}}}}
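
The request URL can be inspected without touching the network. The sketch below is not part of the original notebook: `build_ores_url` and `ORES_ENDPOINT` are hypothetical names, and the function only reproduces the string formatting that `get_ores_data` performs.

```python
# Hypothetical helper (an assumption, not from the notebook): build the ORES v3
# URL for a batch of revision IDs so it can be checked before sending a request.
ORES_ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='wp10'):
    # Join the revision IDs with | marks, exactly as get_ores_data does
    revids = '|'.join(str(x) for x in revision_ids)
    return ORES_ENDPOINT.format(project=project, model=model, revids=revids)

example_url = build_ores_url([783381498, 807355596, 757539710])
print(example_url)
```

Printing the URL first makes it easier to spot a malformed `revids` string before spending time on roughly a thousand real calls.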


I now need to read the CSV files `page_data.csv` and `WPDS_2018_data.csv` in as tables. I used **pandas.read_csv()** to do this.

In [70]:
page_data = pd.read_csv('page_data.csv', sep = ',', header = 0)
wpds_2018 = pd.read_csv('WPDS_2018_data.csv', sep = ',', thousands=',', header = 0) # population has commas

In [71]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [72]:
wpds_2018.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


I now need to convert the *rev_id* column in the `page_data` table into a list in order to use the **ORES API**.

In [8]:
rev_ids = page_data.iloc[:, 2].tolist()

In [9]:
print(rev_ids[0:10])

[235107991, 355319463, 391862046, 391862070, 391862409, 391862819, 391863340, 391863361, 391863617, 391863809]


**ORES** only allows about 50 to 100 *rev_id* values to be passed into a single API query, so I created a for-loop that sends the *rev_id* values in batches and saves the responses as a list of dictionaries.

In [12]:
# This part will take some time. I had to do ~1000 calls to the API.
ores_query = []
inc = 50  # number of rev_ids sent per request

# Step through rev_ids in batches of `inc`; slicing handles the shorter final batch,
# and no empty request is sent when len(rev_ids) is an exact multiple of `inc`
for start in range(0, len(rev_ids), inc):
    temp = get_ores_data(rev_ids[start: start + inc], headers)
    ores_query.append(temp)

In [13]:
print(ores_query[0])

{'enwiki': {'models': {'wp10': {'version': '0.6.1'}}, 'scores': {'235107991': {'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:235107991)', 'type': 'RevisionNotFound'}}}, '355319463': {'wp10': {'score': {'prediction': 'Stub', 'probability': {'B': 0.0037293011286007372, 'C': 0.003856823065973545, 'FA': 0.0005009114577946061, 'GA': 0.0009278080381894021, 'Start': 0.008398482183096077, 'Stub': 0.9825866741263456}}}}, '391862046': {'wp10': {'score': {'prediction': 'Stub', 'probability': {'B': 0.00752908372935955, 'C': 0.011698750542107464, 'FA': 0.001217297276719427, 'GA': 0.0018271099726449593, 'Start': 0.12703001272170586, 'Stub': 0.8506977457574628}}}}, '391862070': {'wp10': {'score': {'prediction': 'Stub', 'probability': {'B': 0.007528602399161758, 'C': 0.011761932099515725, 'FA': 0.0012172194555714589, 'GA': 0.0018269931665054447, 'Start': 0.1270218917625896, 'Stub': 0.8506433611166563}}}}, '391862409': {'wp10': {'score': {'prediction': 'Stub', 'pr
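
The batching used for the API calls can be checked on its own, without calling the API. This is a minimal sketch: `chunked` is a hypothetical helper name and the input is a plain list of integers standing in for `rev_ids`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 120 stand-in IDs split into batches of 50
batch_sizes = [len(batch) for batch in chunked(list(range(120)), 50)]
print(batch_sizes)  # [50, 50, 20]
```

Because `range` stops before `len(items)`, a list whose length is an exact multiple of the batch size produces no empty trailing batch.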

## Clean the data

Here is where I start to break out the dictionaries of dictionaries that the API query gave me. I am interested in the *rev_id* (which is a key within the **scores** dictionary) and the *prediction* (which is within the **score** dictionary).  
For this part, I used a list comprehension. More information can be found here: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

In [14]:
# Here I drill down into the "enwiki" dictionary
new_list = [i["enwiki"] for i in ores_query] # list comprehension
temp_data_frame = pd.DataFrame(new_list) # create a dataframe of the data because I like dataframes better

In [15]:
temp_data_frame.head()

Unnamed: 0,models,scores
0,{'wp10': {'version': '0.6.1'}},{'235107991': {'wp10': {'error': {'message': '...
1,{'wp10': {'version': '0.6.1'}},{'443497605': {'wp10': {'score': {'prediction'...
2,{'wp10': {'version': '0.6.1'}},{'446222994': {'wp10': {'score': {'prediction'...
3,{'wp10': {'version': '0.6.1'}},{'535414570': {'wp10': {'score': {'prediction'...
4,{'wp10': {'version': '0.6.1'}},{'541004175': {'wp10': {'score': {'prediction'...


Now I need to drill down into the **scores** column

In [16]:
# Here I get the first batch of the API query results
scores = pd.DataFrame.from_dict(ores_query[0]['enwiki']['scores']).T

In [17]:
scores.head()

Unnamed: 0,wp10
235107991,{'error': {'message': 'RevisionNotFound: Could...
355319463,"{'score': {'prediction': 'Stub', 'probability'..."
391862046,"{'score': {'prediction': 'Stub', 'probability'..."
391862070,"{'score': {'prediction': 'Stub', 'probability'..."
391862409,"{'score': {'prediction': 'Stub', 'probability'..."


Because my for-loop created a list of dictionaries, I need a way to pull only the information I want out of each query result and combine it into one table. I created another for-loop to do this. I start with the first chunk of the API query, and append each remaining chunk as I loop through the list.

In [18]:
new_table = scores.reset_index() # Start with the first chunk from the API query

# Build the remaining chunks, then concatenate once
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
chunks = [pd.DataFrame.from_dict(q['enwiki']['scores']).T.reset_index() for q in ores_query[1:]]
new_table = pd.concat([new_table] + chunks, ignore_index = True)

In [19]:
new_table.head()

Unnamed: 0,index,wp10
0,235107991,{'error': {'message': 'RevisionNotFound: Could...
1,355319463,"{'score': {'prediction': 'Stub', 'probability'..."
2,391862046,"{'score': {'prediction': 'Stub', 'probability'..."
3,391862070,"{'score': {'prediction': 'Stub', 'probability'..."
4,391862409,"{'score': {'prediction': 'Stub', 'probability'..."


I now need to create a column that has just the *prediction*, and append it to my current data frame.

In [20]:
# Create a table of just the "wp10" column, keeping the indices the same. Will need to use this for combining data sets later
new_table_2 = pd.DataFrame(new_table['wp10'])

In [21]:
new_table_2.head()

Unnamed: 0,wp10
0,{'error': {'message': 'RevisionNotFound: Could...
1,"{'score': {'prediction': 'Stub', 'probability'..."
2,"{'score': {'prediction': 'Stub', 'probability'..."
3,"{'score': {'prediction': 'Stub', 'probability'..."
4,"{'score': {'prediction': 'Stub', 'probability'..."


In [22]:
# split the 'wp10' into columns based on the dictionary key
temp_scores = new_table_2['wp10'].apply(pd.Series)

In [23]:
temp_scores.head()

Unnamed: 0,error,score
0,{'message': 'RevisionNotFound: Could not find ...,
1,,"{'prediction': 'Stub', 'probability': {'B': 0...."
2,,"{'prediction': 'Stub', 'probability': {'B': 0...."
3,,"{'prediction': 'Stub', 'probability': {'B': 0...."
4,,"{'prediction': 'Stub', 'probability': {'B': 0...."


I am only really interested in the *prediction* value under the *score* column. As you can see above, it will show NaN if there was an error in finding the *rev_id*.

In [24]:
# split *score* dictionary into its values
pred_list = temp_scores['score'].apply(pd.Series)


In [25]:
pred_list.head()

Unnamed: 0,0,prediction,probability
0,,,
1,,Stub,"{'B': 0.0037293011286007372, 'C': 0.0038568230..."
2,,Stub,"{'B': 0.00752908372935955, 'C': 0.011698750542..."
3,,Stub,"{'B': 0.007528602399161758, 'C': 0.01176193209..."
4,,Stub,"{'B': 0.007958430970874009, 'C': 0.01225332170..."


Now that I have the **ORES** predictions of the quality of the articles, I can add them to `new_table` so that I have the *rev_id* and the *prediction* together.

In [26]:
new_table['prediction'] = pred_list['prediction']

In [27]:
new_table.head()

Unnamed: 0,index,wp10,prediction
0,235107991,{'error': {'message': 'RevisionNotFound: Could...,
1,355319463,"{'score': {'prediction': 'Stub', 'probability'...",Stub
2,391862046,"{'score': {'prediction': 'Stub', 'probability'...",Stub
3,391862070,"{'score': {'prediction': 'Stub', 'probability'...",Stub
4,391862409,"{'score': {'prediction': 'Stub', 'probability'...",Stub


Below I create a table of just the values I need (*rev_id* and *prediction*).

In [28]:
predictions = pd.DataFrame(new_table["index"]) # this creates the rev_id column
predictions['prediction'] = new_table['prediction'] # this creates the prediction column, remember NaN means the rev_id was not found in the query
predictions = predictions.rename(columns = {'index': 'rev_id'}) # rename the 'index' column to its proper title of 'rev_id'

In [29]:
predictions.head()

Unnamed: 0,rev_id,prediction
0,235107991,
1,355319463,Stub
2,391862046,Stub
3,391862070,Stub
4,391862409,Stub
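
The same *rev_id* and *prediction* pairs can also be pulled straight out of the nested response with a plain loop. This is a sketch under stated assumptions rather than the notebook's method: `extract_predictions` is a hypothetical helper, and `sample` is a trimmed, hand-written response that follows the structure printed earlier.

```python
def extract_predictions(response, project='enwiki', model='wp10'):
    """Return {rev_id: prediction or None} for one ORES response dictionary."""
    predictions = {}
    for rev_id, models in response[project]['scores'].items():
        result = models[model]
        # A failed lookup carries an 'error' key instead of a 'score' key
        score = result.get('score')
        predictions[int(rev_id)] = score['prediction'] if score else None
    return predictions

# Trimmed sample with one missing revision and one scored revision
sample = {'enwiki': {'scores': {
    '235107991': {'wp10': {'error': {'type': 'RevisionNotFound', 'message': 'trimmed'}}},
    '355319463': {'wp10': {'score': {'prediction': 'Stub', 'probability': {}}}},
}}}
sample_predictions = extract_predictions(sample)
print(sample_predictions)
```

Converting the keys with `int` at this stage would also avoid the string versus integer mismatch that otherwise has to be fixed before merging on *rev_id*.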


I now need to merge the three data sources `page_data`, `wpds_2018` and `predictions`. I need to do some more cleanup of column titles in order to merge the tables properly by *country* and *rev_id*.

For more information on merging dataframes, see https://pandas.pydata.org/pandas-docs/stable/merging.html

In [30]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [73]:
wpds_2018.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In [32]:
predictions.head()

Unnamed: 0,rev_id,prediction
0,235107991,
1,355319463,Stub
2,391862046,Stub
3,391862070,Stub
4,391862409,Stub


Here I merge the `predictions` and `page_data` tables on *rev_id*. First I need to check that the types of the two *rev_id* columns are the same.

In [33]:
type(page_data['rev_id'][0])

numpy.int64

In [34]:
type(predictions['rev_id'][0])

str

Since they are not, I need to change the type of *rev_id* in `predictions` from a string to an integer.

In [35]:
predictions['rev_id'] = predictions['rev_id'].astype('int64')
type(predictions['rev_id'][0])

numpy.int64

In [78]:
en_wikipedia_bias_data = pd.merge(page_data, predictions, on = 'rev_id', how = 'outer')

In [79]:
en_wikipedia_bias_data.head()

Unnamed: 0,page,country,rev_id,prediction
0,Template:ZambiaProvincialMinisters,Zambia,235107991,
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub
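
One way to audit an outer merge is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording whether each row came from the left table, the right table, or both. The sketch below uses made-up toy frames (`pages`, `preds`) in place of `page_data` and `predictions`.

```python
import pandas as pd

# Toy stand-ins for page_data and predictions (values are made up)
pages = pd.DataFrame({'page': ['a', 'b', 'c'], 'rev_id': [1, 2, 3]})
preds = pd.DataFrame({'rev_id': [2, 3, 4], 'prediction': ['Stub', 'C', 'GA']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit = pd.merge(pages, preds, on='rev_id', how='outer', indicator=True)
matched = int((audit['_merge'] == 'both').sum())
unmatched = int((audit['_merge'] != 'both').sum())
print(matched, unmatched)
```

A large `left_only` or `right_only` count is a quick signal that keys differ in type or spelling between the two tables.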


I need to rename `wpds_2018` *Geography* column to *country* to merge this table with the table created above.

In [80]:
wpds_2018 = wpds_2018.rename(columns = {'Geography': 'country'})

In [81]:
wpds_2018.head()

Unnamed: 0,country,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


Now I can merge the `wpds_2018` table with `en_wikipedia_bias_data` using *country*

In [82]:
en_wikipedia_bias_data = pd.merge(en_wikipedia_bias_data, wpds_2018, on = 'country', how = 'outer')

In [83]:
en_wikipedia_bias_data.head()

Unnamed: 0,page,country,rev_id,prediction,Population mid-2018 (millions)
0,Template:ZambiaProvincialMinisters,Zambia,235107991.0,,17.7
1,Gladys Lundwe,Zambia,757566606.0,Stub,17.7
2,Mwamba Luchembe,Zambia,764848643.0,Stub,17.7
3,Thandiwe Banda,Zambia,768166426.0,Start,17.7
4,Sylvester Chisembele,Zambia,776082926.0,C,17.7


Now that I have the data in one source, I need to remove the rows with *NaN* values.

In [84]:
en_wikipedia_bias_data = en_wikipedia_bias_data.dropna().reset_index(drop = True)

In [85]:
en_wikipedia_bias_data.head()

Unnamed: 0,page,country,rev_id,prediction,Population mid-2018 (millions)
0,Gladys Lundwe,Zambia,757566606.0,Stub,17.7
1,Mwamba Luchembe,Zambia,764848643.0,Stub,17.7
2,Thandiwe Banda,Zambia,768166426.0,Start,17.7
3,Sylvester Chisembele,Zambia,776082926.0,C,17.7
4,Victoria Kalima,Zambia,776530837.0,Start,17.7
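
The *rev_id* values display as floats (for example `757566606.0`) after the second merge, most likely because the outer merge introduced rows with a missing *rev_id*, which makes pandas store the column as floating point. Once the *NaN* rows are dropped, the column can be cast back to integers. A minimal sketch on a toy frame, where the second row is a made-up country that exists only in the population table:

```python
import pandas as pd

# Toy frame: 'Atlantis' is a made-up row with no article, so rev_id is NaN
merged = pd.DataFrame({
    'country': ['Zambia', 'Atlantis'],
    'rev_id': [757566606.0, float('nan')],
    'prediction': ['Stub', None],
})

cleaned = merged.dropna().reset_index(drop=True)
# With no NaN left, the float column converts cleanly to int64
cleaned['rev_id'] = cleaned['rev_id'].astype('int64')
print(cleaned.dtypes['rev_id'])
```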


Now, to match the column title requirements, I need to rename *page* and *prediction*.

In [94]:
en_wikipedia_bias_data = en_wikipedia_bias_data.rename(columns ={'page' : 'article_name', 'prediction' : 'article_quality'})

In [95]:
en_wikipedia_bias_data.head()

Unnamed: 0,article_name,country,rev_id,article_quality,Population mid-2018 (millions)
0,Gladys Lundwe,Zambia,757566606.0,Stub,17.7
1,Mwamba Luchembe,Zambia,764848643.0,Stub,17.7
2,Thandiwe Banda,Zambia,768166426.0,Start,17.7
3,Sylvester Chisembele,Zambia,776082926.0,C,17.7
4,Victoria Kalima,Zambia,776530837.0,Start,17.7


Here I save `en_wikipedia_bias_data` to a **.csv** file.

In [96]:
en_wikipedia_bias_data.to_csv('en-wikipedia_bias_data.csv')

## Calculate the proportions of articles by population of country and proportions of high quality articles by country

Here I find the number of politician articles as a proportion of the country's population, and the number of high quality articles as a proportion of the country's politician articles. High quality articles are articles that ORES predicted as **FA** or **GA** (*featured article* and *good article*, respectively). To do this, I used `DataFrame.groupby`.  

See for more details: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

In [98]:
politician_articles_by_population = en_wikipedia_bias_data.groupby(['country', 'Population mid-2018 (millions)'], as_index=False)[['article_quality']].count()

In [99]:
politician_articles_by_population.head()

Unnamed: 0,country,Population mid-2018 (millions),article_quality
0,Afghanistan,36.5,326
1,Albania,2.9,460
2,Algeria,42.7,119
3,Andorra,0.08,34
4,Angola,30.4,110


Here I calculate the proportion of articles as a function of the population. First I need to rename *article_quality* to *count of articles*, because grouping the rows produced the count of articles by country.

In [100]:
politician_articles_by_population = politician_articles_by_population.rename(columns = {'article_quality' : "count of articles"})

In [101]:
politician_articles_by_population.head()

Unnamed: 0,country,Population mid-2018 (millions),count of articles
0,Afghanistan,36.5,326
1,Albania,2.9,460
2,Algeria,42.7,119
3,Andorra,0.08,34
4,Angola,30.4,110


Now I need to calculate the proportion of articles.

In [103]:
politician_articles_by_population['proportion of articles to population (millions)'] = politician_articles_by_population['count of articles'] / politician_articles_by_population['Population mid-2018 (millions)'].astype('float')

In [104]:
politician_articles_by_population.head()

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions)
0,Afghanistan,36.5,326,8.931507
1,Albania,2.9,460,158.62069
2,Algeria,42.7,119,2.786885
3,Andorra,0.08,34,425.0
4,Angola,30.4,110,3.618421
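
The value in the new column is simply the article count divided by the population in millions, so it reads as articles per million people. A small worked check using two rows from the table above (`articles_per_million` is a hypothetical helper name):

```python
def articles_per_million(article_count, population_millions):
    """Articles per million inhabitants; population is given in millions."""
    return article_count / population_millions

afghanistan = articles_per_million(326, 36.5)   # Afghanistan row
andorra = articles_per_million(34, 0.08)        # Andorra row
print(round(afghanistan, 6), round(andorra, 6))
```

Very small populations, such as 0.08 million, inflate this ratio sharply, which is worth keeping in mind when reading the ranked tables.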


Now I need to calculate the proportion of high quality articles. Remember, high quality articles means they have a rating of **FA** or **GA**.

In [129]:
en_wikipedia_bias_data.head()

Unnamed: 0,article_name,country,rev_id,article_quality,Population mid-2018 (millions)
0,Gladys Lundwe,Zambia,757566606.0,Stub,17.7
1,Mwamba Luchembe,Zambia,764848643.0,Stub,17.7
2,Thandiwe Banda,Zambia,768166426.0,Start,17.7
3,Sylvester Chisembele,Zambia,776082926.0,C,17.7
4,Victoria Kalima,Zambia,776530837.0,Start,17.7


I need to get the counts of **FA** and **GA** articles. 

In [171]:
fa = en_wikipedia_bias_data['country'][en_wikipedia_bias_data['article_quality'] == "FA"].value_counts()
ga = en_wikipedia_bias_data['country'][en_wikipedia_bias_data['article_quality'] == "GA"].value_counts()

In [172]:
fa = pd.DataFrame(fa).reset_index()
ga = pd.DataFrame(ga).reset_index()

In [173]:
fa.head()

Unnamed: 0,index,country
0,Spain,26
1,Romania,25
2,United States,23
3,United Kingdom,15
4,Australia,12


In [174]:
ga.head()

Unnamed: 0,index,country
0,United States,59
1,United Kingdom,42
2,Australia,30
3,China,25
4,Russia,24


I now need to rename the column titles of the `fa` and `ga` tables so they make sense.

In [175]:
fa = fa.rename(columns = {'country' : 'count of FA'})
ga = ga.rename(columns ={'country' : 'count of GA'})

fa = fa.rename(columns = {'index' : 'country'})
ga = ga.rename(columns = {'index' : 'country'})

In [176]:
fa.head()

Unnamed: 0,country,count of FA
0,Spain,26
1,Romania,25
2,United States,23
3,United Kingdom,15
4,Australia,12


In [177]:
ga.head()

Unnamed: 0,country,count of GA
0,United States,59
1,United Kingdom,42
2,Australia,30
3,China,25
4,Russia,24


Now I will create a table of quality articles by country. First I will merge `fa` and `ga` by country.

In [181]:
merge = pd.merge(fa, ga, on = 'country', how = 'inner')

In [182]:
merge.head()

Unnamed: 0,country,count of FA,count of GA
0,Spain,26,8
1,Romania,25,15
2,United States,23,59
3,United Kingdom,15,42
4,Australia,12,30
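
Note that an inner merge keeps only countries that appear in both tables, so a country with **FA** articles but no **GA** articles (or the reverse) drops out. As a hedged alternative, not what the notebook does, an outer merge with missing counts filled with zero keeps those countries. The frames and country names below are made up for illustration.

```python
import pandas as pd

# Toy counts: 'Country B' has only FA articles, 'Country C' has only GA articles
fa_toy = pd.DataFrame({'country': ['Country A', 'Country B'], 'count of FA': [26, 25]})
ga_toy = pd.DataFrame({'country': ['Country A', 'Country C'], 'count of GA': [8, 25]})

# An outer merge keeps every country; a missing count becomes NaN, then 0
combined = pd.merge(fa_toy, ga_toy, on='country', how='outer').fillna(0)
combined['count of high quality articles'] = (
    combined['count of FA'] + combined['count of GA']
).astype('int64')

totals = combined.set_index('country')['count of high quality articles'].to_dict()
print(totals)
```

With `how='inner'` the same toy data would return only 'Country A'.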


Now I will create a table of the counts of quality articles by country.

In [186]:
high_quality_article_counts = pd.DataFrame(merge['country'])

In [187]:
high_quality_article_counts.head()

Unnamed: 0,country
0,Spain
1,Romania
2,United States
3,United Kingdom
4,Australia


In [188]:
high_quality_article_counts['count of high quality articles'] = merge['count of FA'] + merge['count of GA']

In [189]:
high_quality_article_counts.head()

Unnamed: 0,country,count of high quality articles
0,Spain,34
1,Romania,40
2,United States,82
3,United Kingdom,57
4,Australia,42


Now I need to make a table of countries, populations, and the count of high quality articles. I will do this by merging the `politician_articles_by_population` data with `high_quality_article_counts`.

In [192]:
high_quality_politician_articles_by_population = pd.merge(politician_articles_by_population, high_quality_article_counts, on = 'country', how = 'inner')

In [193]:
high_quality_politician_articles_by_population.head()

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions),count of high quality articles
0,Afghanistan,36.5,326,8.931507,10
1,Argentina,44.5,496,11.146067,15
2,Armenia,3.0,198,66.0,5
3,Australia,24.1,1566,64.979253,42
4,Benin,11.5,94,8.173913,7


I now need to calculate the proportion of high quality articles by country

In [194]:
high_quality_politician_articles_by_population['proportion of high quality articles'] = high_quality_politician_articles_by_population['count of high quality articles'] / high_quality_politician_articles_by_population['count of articles']

In [195]:
high_quality_politician_articles_by_population.head()

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions),count of high quality articles,proportion of high quality articles
0,Afghanistan,36.5,326,8.931507,10,0.030675
1,Argentina,44.5,496,11.146067,15,0.030242
2,Armenia,3.0,198,66.0,5,0.025253
3,Australia,24.1,1566,64.979253,42,0.02682
4,Benin,11.5,94,8.173913,7,0.074468


## Final Deliverables

### Top 10 ranked countries of proportion of articles by population

In [202]:
politician_articles_by_population.sort_values(by = 'proportion of articles to population (millions)', ascending = False).head(10)

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions)
166,Tuvalu,0.01,55,5500.0
115,Nauru,0.01,53,5300.0
135,San Marino,0.03,82,2733.333333
108,Monaco,0.04,40,1000.0
93,Liechtenstein,0.04,29,725.0
161,Tonga,0.1,63,630.0
103,Marshall Islands,0.06,37,616.666667
68,Iceland,0.4,206,515.0
3,Andorra,0.08,34,425.0
52,Federated States of Micronesia,0.1,38,380.0


### Bottom 10 ranked countries of proportion of articles by population

In [203]:
politician_articles_by_population.sort_values(by = 'proportion of articles to population (millions)', ascending = True).head(10)

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions)
69,India,1371.3,986,0.719026
70,Indonesia,265.2,214,0.806938
34,China,1393.8,1135,0.814321
173,Uzbekistan,32.9,29,0.881459
51,Ethiopia,107.5,105,0.976744
178,Zambia,17.7,25,1.412429
82,"Korea, North",25.6,39,1.523438
159,Thailand,66.2,112,1.691843
13,Bangladesh,166.4,323,1.941106
112,Mozambique,30.5,60,1.967213


### Top 10 ranked countries of high quality articles

In [204]:
high_quality_politician_articles_by_population.sort_values(by = 'proportion of high quality articles', ascending = False).head(10)

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions),count of high quality articles,proportion of high quality articles
31,"Korea, North",25.6,39,1.523438,7,0.179487
53,Saudi Arabia,33.4,119,3.562874,16,0.134454
9,Central African Republic,4.7,68,14.468085,8,0.117647
51,Romania,19.5,348,17.846154,40,0.114943
38,Mauritania,4.5,52,11.555556,5,0.096154
64,United States,328.0,1092,3.329268,82,0.075092
4,Benin,11.5,94,8.173913,7,0.074468
65,Vietnam,94.7,191,2.016895,13,0.068063
63,United Kingdom,66.4,865,13.027108,57,0.065896
26,Ireland,4.9,381,77.755102,24,0.062992


### Bottom 10 ranked countries of high quality articles

In [206]:
high_quality_politician_articles_by_population.sort_values(by = 'proportion of high quality articles', ascending = True).head(10)

Unnamed: 0,country,Population mid-2018 (millions),count of articles,proportion of articles to population (millions),count of high quality articles,proportion of high quality articles
43,Nigeria,195.9,682,3.481368,3,0.004399
28,Italy,60.6,828,13.663366,6,0.007246
39,Mexico,130.8,1081,8.264526,9,0.008326
44,Norway,5.3,658,124.150943,6,0.009119
41,Netherlands,17.2,702,40.813953,8,0.011396
14,France,65.1,1689,25.9447,20,0.011841
66,Zimbabwe,14.0,167,11.928571,2,0.011976
50,Portugal,10.3,323,31.359223,4,0.012384
59,Sweden,10.2,379,37.156863,5,0.013193
24,Iran,81.6,826,10.122549,11,0.013317


# Writeup

While performing this analysis, I noticed that some page revisions did not have ratings. Essentially, page_data.csv had rev_ids that the **ORES** system did not have page quality predictions for. There were about 47K articles listed in page_data.csv, but the **ORES** system only output about 45K predictions from that list. I originally assumed that the **ORES** system would find all rev_ids; however, this was not the case. This could introduce a form of bias in the results, because **ORES** was not able to predict the article quality for about 2K pages. What could have happened is that the rev_id (which is described as the revision id of the latest revision) changed in such a way that the article couldn't be found, or maybe the revision was the article being removed.

One result I expected to see was that the proportion and quality of articles would be higher for English-speaking countries, considering the analysis is on English Wikipedia. The tables, however, show that the highest proportions of articles to population belong to countries with populations of less than one million, and I don't believe that English is the national language in those countries. However, English-speaking countries do show up in the table of highest quality articles (the US and UK fall within the top 10).

I also expected that the number of articles would be proportional to the population, but the tables suggest this might not be true. Compare the number of articles per million people for France (25.9), Mexico (8.26) and the United States (3.33). From the tables you can also see that the United States has a much higher population than France (328.0 million versus 65.1 million), yet France has more articles (1689 versus 1092). These could be very specific edge cases, and I would need to do more analysis to prove or disprove this.

Another result I expected to see is that countries with more articles would also have a higher proportion of high quality articles. The tables show that this isn't true. The highest ranked country, North Korea, has a high quality article count of 7 out of 39 articles (about 18%), whereas the United States has a high quality article count of 82 out of a total article count of 1092 (about 7.5%). I would have expected the United States to be ranked higher than North Korea.

Some sources of bias that I would guess are buried within this data are the accuracy of population counts, access to computers, willingness to publish articles about politicians given social or cultural pressures, how much English Wikipedia editors know about world politicians, mislabeling of the categories of the articles, and articles being deleted. These all affect the number of articles published, the quality of the articles published, and the proportions (especially if the population is inaccurately reported).