# <center> HW Assignment 2</center>
<center> Data Curation </center>
<center> Lauren Heintz </center>
<center> DATA 512, Fall 2019 </center>
<center> Due 10/17/19 </center>

## 0. The Goal
The goal of this analysis is to observe how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. Through this analysis, we will explore the concept of bias. The analysis will focus on tables that show:

"__The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
The countries with the highest and lowest proportion of high quality articles about politicians.  
A ranking of geographic regions by articles-per-person and proportion of high quality articles.__"

Ref: [[DATA 512 A2]](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data)

## I. Data Acquisition
Two data sets were used for this analysis: the Wikipedia politicians-by-country data set and the world population data set.  
[Politicians by country data set found here](https://figshare.com/articles/Untitled_Item/5513449).   
[World population dataset found here](https://www.prb.org/international/indicator/population/table/).   

CSVs of both were saved locally and then loaded in the steps below.

In [2]:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize  


%cd ~/Docs/MSDS/Fa2019/data512/data-512-a2

/Users/laurenheintz/Docs/MSDS/Fa2019/data512/data-512-a2


In [3]:
polData = pd.read_csv('data_raw/page_data.csv', sep=',', header=0) # polData is politician by country data
popData = pd.read_csv('data_raw/WPDS_2018_data.csv', sep=',', header=0) # popData is population by country data
polData.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


Ultimately, the [pages/politician](https://figshare.com/articles/Untitled_Item/5513449) data was narrowed down to the above 3 columns: the name of the page (which is also the name of the politician), the country the respective politician is from, and the revision ID associated with that page.

In [4]:
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


The [population](https://www.prb.org/international/indicator/population/table/) data was narrowed down to the above 2 columns: the name of the country and the size of the population in millions.

## II. Data Processing
The following section outlines the data processing that was done to prepare for this analysis.  

### Politician Pages By Country
First, we will check to see if there are any missing values. If there are, we will replace them with 0.  
This data set contains rows which start with the word "template" which are not actually pages, and must be removed from the dataset.  

In [5]:
# Look for missing or null & fill with zero
if (polData.isnull().values.any()):
    polData = polData.fillna(0)

# Filter out rows whose page title starts with "Template:" (these are not articles)
polData = polData[~polData.page.str.startswith("Template:")]
polData = polData.sort_values(by=['country'])
polData = polData.reset_index(drop=True)
polData.head(10)

Unnamed: 0,page,country,rev_id
0,Raul Eshba,Abkhazia,789039267
1,Zakan Jugelia,Abkhazia,786203824
2,Zurab Achba,Abkhazia,721094337
3,Sumbat Saakian,Abkhazia,755193428
4,Gennadi Berulava,Abkhazia,805063877
5,Efrem Eshba,Abkhazia,798644673
6,Yuri Voronov,Abkhazia,803018106
7,Zaur Avidzba,Abkhazia,694519009
8,Bagrat Shinkuba,Abkhazia,789818648
9,Nestor Lakoba,Abkhazia,805967589


### Population By Country
First, we will check to see if there are any missing values. If there are, we will replace them with 0.  
This data set contains entries in the country column which are not countries, but regions. These are in all caps and will not have matching data in the politician pages data set. So for now, we will filter out this data and save it offline to a CSV so we can analyze it later. 

In [6]:
# Look for missing or null & fill with zero
if (popData.isnull().values.any()):
    popData = popData.fillna(0)

# Locate rows with ALL CAPS, save this regional roll up data elsewhere
popData[popData.Geography.str.isupper()].to_csv('data_clean/GeographyRollUp.csv', index=False)

# Filter out the non-country data (all caps)
popData = popData[~popData.Geography.str.isupper()]
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7


In [7]:
# Rename Geography to country, sort by country, re index
popData = popData.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)':'population'})

In [8]:
popData = popData.sort_values(by=['country'])
popData = popData.reset_index(drop=True)
popData.head(10)

Unnamed: 0,country,population
0,Afghanistan,36.5
1,Albania,2.9
2,Algeria,42.7
3,Andorra,0.08
4,Angola,30.4
5,Antigua and Barbuda,0.1
6,Argentina,44.5
7,Armenia,3.0
8,Australia,24.1
9,Austria,8.8


### Process ORES Scores from ORES Rest API
Next, we need one additional data set, the ORES predictions, for our bias analysis. The ORES data gives us "the predicted quality scores for each article in the Wikipedia dataset...the machine learning system is called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:  
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article"

Ref: [[DATA 512 A2]](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data)

We can access these predictions from the [ORES REST API Endpoint](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model). 

Once this data is processed, it must be joined to the politician pages data set. The key we will use to query the API is the revision ID from the politician pages data set.

So first, we turn the column of revision IDs in the pandas data frame into a list.

In [143]:
# Turn the column of revision IDs in the pandas data frame in to a list
rev_list = polData['rev_id'].tolist()

Now we use the code (the two cells below) [provided by Jonathan](https://github.com/Ironholds/data-512-a2) to access the API call.

In [30]:
headers = {'User-Agent' : 'https://github.com/lheintz', 'From' : 'heintzl@uw.edu'}

In [31]:
def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks. 
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    return response

Let's create some code that interprets the JSON. The API recommends not trying to grab too many revisions in one call, so we will end up iterating through the ~47,000 revisions by requesting 100 at a time.

In [91]:
import os
from pprint import pprint
import pandas
from json import load


class ModelScore:
    __slots__ = ("identifier", "prediction", "probabilities", "stub")
    # slots used to choose which values will be a part of the object 

    def __init__(
        self, identifier: int = None, prediction: str = None, probabilities: dict = None
    ):
        self.identifier = identifier
        self.prediction = prediction
        self.probabilities = probabilities
        # Guard against the default of None so the constructor does not fail
        self.stub = probabilities.get("Stub") if probabilities else None

    # Method to print outputs nicely if desired for debugging
    def __repr__(self):
        string = "{"
        for index, key in enumerate(self.__slots__):
            string += "{"
            string += f"{key}: {str(getattr(self, key, None))}"
            if index < len(self.__slots__) - 1:
                string += "}, "
            else:
                string += "}"
        return string

class JSONParser:
    @staticmethod
    def parse(path_to_json_file):
        """ 
        The sub-function to turn the json file of the raw
        outputs from the ORES API call in to a dictionary. 
  
        Parameters: 
            path_to_json_file (string): The path to the raw json file returned from the API call
          
        Returns: 
            json_as_dictionary (dictionary): A python dictionary object of the original json
        """

        # Returns json as a dictionary

        with open(path_to_json_file) as json_file:
            json_as_dictionary = load(json_file)
        return json_as_dictionary

def json_path_to_dataframe(path_to_json_file):
    """ 
        The function to turn the json file of the raw
        outputs from the ORES API call in to a clean df
        in the format that we want. 
  
        Parameters: 
            path_to_json_file (string): The path to the raw json file returned from the API call
          
        Returns: 
            model_scores (dataframe): Returns the predicted scores from the model for each Revision Id queried
    """
    # Use json parser class to parse json in to a dictionary

    json_parser = JSONParser()
    json_as_dictionary = json_parser.parse(path_to_json_file)
    
    all_scores = json_as_dictionary.get("enwiki").get("scores")
    
    all_score_ids_mapped_to_info = {}
    for score in all_scores:
        all_score_ids_mapped_to_info[score] = (
            all_scores.get(score).get("wp10").get("score")
        )
    
    # Collect [revision ID, prediction] pairs; revisions with no score get a 0 placeholder

    model_scores = []
    for key, value in all_score_ids_mapped_to_info.items():
        try:
            model_scores.append([key, value.get("prediction")])
        except AttributeError:
            model_scores.append([key, 0])
    
    return pandas.DataFrame(model_scores)
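
The nested lookup that the parser above performs can be illustrated without writing to disk. The sketch below uses a hand-made dictionary shaped like the path the code reads (`enwiki` → `scores` → revision ID → `wp10` → `score` → `prediction`). The revision IDs, the values, and the `error` entry are invented for illustration, and `extract_predictions` is a hypothetical helper, not part of the provided code.

```python
# Hand-made stand-in for an API response; all values are illustrative only
sample_response = {
    "enwiki": {
        "scores": {
            "111": {"wp10": {"score": {"prediction": "Stub"}}},
            # Assumed shape for a revision with no score available
            "222": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}


def extract_predictions(response):
    """Return (rev_id, prediction) pairs, with None where no score exists."""
    rows = []
    for rev_id, models in response["enwiki"]["scores"].items():
        score = models.get("wp10", {}).get("score")
        rows.append((int(rev_id), score["prediction"] if score else None))
    return rows


rows = extract_predictions(sample_response)
```

Using `None` rather than 0 for a missing score is a design choice in this sketch; the notebook itself uses 0 as the placeholder.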


Let's run our functions from above to grab 46,700 data points in increments of 100.

In [157]:
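df_clean = pd.DataFrame([])

for i in range(467):
    # Take the i-th block of 100 revision IDs (0-99, 100-199, ...)
    df = get_ores_data(rev_list[i*100:(i+1)*100], headers)
    
    with open('data_raw/ores-json-data-raw.json', 'w') as json_file:
        json.dump(df, json_file)
        
    df = json_path_to_dataframe('data_raw/ores-json-data-raw.json')
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_clean = pd.concat([df_clean, df])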

Since there were 46,701 revision IDs, we will run it once more to grab the last one and append these results all together.

In [106]:
df = get_ores_data(rev_list[46700:46701], headers)
    
with open('data_raw/ores-json-data-raw.json', 'w') as json_file:
    json.dump(df, json_file)
        
df = json_path_to_dataframe('data_raw/ores-json-data-raw.json')
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_clean = pd.concat([df_clean, df])
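
The batching described above (100 revision IDs per call, followed by a final short batch) can also be expressed as a single helper that handles the remainder automatically. This is a minimal sketch; `chunked` is a hypothetical helper name and the IDs are placeholders rather than real revision IDs.

```python
def chunked(items, size):
    """Yield consecutive, non-overlapping slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for the real revision ID list
example_ids = list(range(250))
batches = list(chunked(example_ids, 100))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be passed to `get_ores_data` in turn, which would remove the need for a separate call for the last ID.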

In [107]:
df_clean = df_clean.rename(columns={0:'rev_id', 1:'prediction'})
df_clean

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


After processing these from JSON into an appropriately formatted dataframe, we save these initial prediction scores to our raw data folder. This is because some revisions will be missing data and will need to be removed.

In [108]:
df_clean.to_csv('data_raw/pred_scores.csv', index=False)

In this CSV of raw prediction results, there are some 0 values, which signify that the API returned no score for that revision.

In [112]:
df_clean = df_clean.reset_index(drop = True)
df_clean.head(10)

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


Now to clean things up, we drop the items that had a zero for the prediction and then save this file to our clean data folder. Now we should only have rev IDs that have non-zero predictions.

In [117]:
# Filter out the data that did not have ORES prediction scores available and save to CSV

missing = df_clean[(df_clean[['prediction']] == 0).any(axis=1)]
missing = missing.drop(columns=["prediction"])
missing.to_csv('data_clean/ores_no_score.csv', index = False)

df_clean = df_clean[~(df_clean[['prediction']] == 0).any(axis=1)]
df_clean.to_csv('data_clean/pred_scores.csv', index=False)

I saved these results in a CSV in the data_clean folder so that I did not have to run the API call and data cleaning again. Now I reload this saved CSV and join it to the politician pages data set it came from. The only columns are the revision ID and the prediction.

In [9]:
predictions = pd.read_csv('data_clean/pred_scores.csv')
predictions.head(5)

Unnamed: 0,rev_id,prediction
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


Now we merge it with the politician page data on the revision ID.

In [10]:
# Merge to pol data by rev_id
polDataScore = pd.merge(polData, predictions, how = "outer", on= "rev_id")
polDataScore = polDataScore.sort_values(by=['country']).reset_index(drop=True)

In [11]:
polDataScore

Unnamed: 0,page,country,rev_id,prediction
0,Raul Eshba,Abkhazia,789039267,Stub
1,Zhiuli Shartava,Abkhazia,802029007,Start
2,Zaur Ardzinba,Abkhazia,704938340,Start
3,Garri Aiba,Abkhazia,799618550,Start
4,Guram Gabiskiria,Abkhazia,805775169,Start
5,Shota Shamatava,Abkhazia,723736482,Stub
6,Nestor Lakoba,Abkhazia,805967589,GA
7,Bagrat Shinkuba,Abkhazia,789818648,Start
8,Samson Chanba,Abkhazia,789818730,Start
9,Yuri Voronov,Abkhazia,803018106,Stub


### Join Pages and Population Data
The data on politician pages and population must now be joined. The data sets will not match up exactly. Countries which are missing either pages or population data will be excluded from our analysis data set, but saved to `wp_wpds_countries_no_match.csv`.

The complete data set with no missing values will be saved as `wp_wpds_politicians_by_country.csv`.  

In [12]:
# Merge columns by the country name, do an outer join
allData = pd.merge(popData, polDataScore, how = "outer", on = "country")

# Clean data by sorting, filling NAs with zero, and resetting the index
allData = allData.sort_values(by=['country']).fillna(0).reset_index(drop=True)

In [13]:
allData

Unnamed: 0,country,population,page,rev_id,prediction
0,Abkhazia,0,Raul Eshba,789039267.0,Stub
1,Abkhazia,0,Zhiuli Shartava,802029007.0,Start
2,Abkhazia,0,Zaur Avidzba,694519009.0,Start
3,Abkhazia,0,Zakan Jugelia,786203824.0,Start
4,Abkhazia,0,Zurab Achba,721094337.0,Stub
5,Abkhazia,0,Gennadi Berulava,805063877.0,Stub
6,Abkhazia,0,Efrem Eshba,798644673.0,Stub
7,Abkhazia,0,Yuri Voronov,803018106.0,Stub
8,Abkhazia,0,Sumbat Saakian,755193428.0,Stub
9,Abkhazia,0,Bagrat Shinkuba,789818648.0,Start


In [28]:
# Identify countries which do not have a full matching set of data and save in a separate csv
rejects = allData[(allData[['population','page']] == 0).any(axis=1)]
rejects.to_csv('data_clean/wp_wpds_countries_no_match.csv', index = False)

# Create clean data set which has filtered out any rows with missing values
desired = allData[~(allData[['population','page']] == 0).any(axis=1)]
desired.to_csv('data_clean/wp_wpds_politicians_by_country.csv', index = False)
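
As an alternative to filling missing values with 0 and then testing for zeros, `pd.merge` accepts `indicator=True`, which adds a `_merge` column recording whether each row came from the left table, the right table, or both. The sketch below uses small made-up tables (the country names, populations, and page titles are placeholders) to show how matched and unmatched rows could be separated that way.

```python
import pandas as pd

# Made-up stand-ins for the population and politician page tables
pop = pd.DataFrame({"country": ["Algeria", "Egypt", "Oman"],
                    "population": [42.7, 97.0, 4.7]})
pages = pd.DataFrame({"country": ["Algeria", "Abkhazia", "Algeria"],
                      "page": ["A", "B", "C"]})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
merged = pd.merge(pop, pages, how="outer", on="country", indicator=True)

# Rows present in both tables form the analysis set; the rest are rejects
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged[merged["_merge"] != "both"]
```

This avoids relying on 0 as a sentinel, which could otherwise collide with a legitimate zero value.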

In [29]:
desired = desired.sort_values(by=['country']).reset_index(drop=True)

The final cleaned data set (seen below) contains 44,618 rows and 5 columns. The 5 columns are country, population, page title (or politician name), revision ID, and predicted article quality.

In [18]:
desired

Unnamed: 0,country,population,page,rev_id,prediction
0,Afghanistan,36.5,Mohammad Ghous Bashiri,723709911.0,Stub
1,Afghanistan,36.5,Amanullah Khan,797048177.0,C
2,Afghanistan,36.5,Abdul Wahed Sorabi,764743041.0,Stub
3,Afghanistan,36.5,Hafizullah Shabaz Khail,807426425.0,Start
4,Afghanistan,36.5,Said Mohammad Ali Jawid,748594441.0,Stub
5,Afghanistan,36.5,Mohammad Hashim Zare,781002810.0,Stub
6,Afghanistan,36.5,Presidency of Hamid Karzai,802697878.0,B
7,Afghanistan,36.5,Alhaj Mutalib Baig,722893996.0,Start
8,Afghanistan,36.5,Mohammed Zaman,789359940.0,C
9,Afghanistan,36.5,Rouh Gul Khairzad,788030456.0,Stub


## III. Data Analysis

Now, we begin our analysis of the bias by evaluating several different metrics.

1. __Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population__  
2. __Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population__
3. __Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality__
4. __Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality__
5. __Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population__
6. __Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality__

First we must convert population to a numeric type and multiply by 1 million, so that the column holds a count of people rather than millions of people.

In [31]:
data = desired.copy()
data.dtypes

country        object
population     object
page           object
rev_id        float64
prediction     object
dtype: object

In [32]:
data['population'] = data['population'].str.replace(',', '')
data['population'] = data['population'].astype(float)

In [None]:
data.loc[:,'population'] *= 1000000

In [34]:
data.head(10)

Unnamed: 0,country,population,page,rev_id,prediction
0,Afghanistan,36500000.0,Mohammad Ghous Bashiri,723709911.0,Stub
1,Afghanistan,36500000.0,Amanullah Khan,797048177.0,C
2,Afghanistan,36500000.0,Abdul Wahed Sorabi,764743041.0,Stub
3,Afghanistan,36500000.0,Hafizullah Shabaz Khail,807426425.0,Start
4,Afghanistan,36500000.0,Said Mohammad Ali Jawid,748594441.0,Stub
5,Afghanistan,36500000.0,Mohammad Hashim Zare,781002810.0,Stub
6,Afghanistan,36500000.0,Presidency of Hamid Karzai,802697878.0,B
7,Afghanistan,36500000.0,Alhaj Mutalib Baig,722893996.0,Start
8,Afghanistan,36500000.0,Mohammed Zaman,789359940.0,C
9,Afghanistan,36500000.0,Rouh Gul Khairzad,788030456.0,Stub


In [54]:
popData['population'] = popData['population'].str.replace(',', '')
popData['population'] = popData['population'].astype(float)
popData.loc[:,'population'] *= 1000000
popData = popData.set_index('country', drop = True)

In [51]:
counts = data.groupby(['country']).count()
page_count_by_country = pd.DataFrame(counts['rev_id'])
page_count_by_country = page_count_by_country.rename(columns = {'rev_id': 'count'})
page_count_by_country.head(10)

country,count
Afghanistan,322
Albania,457
Algeria,116
Andorra,34
Angola,106
Antigua and Barbuda,24
Argentina,491
Armenia,196
Australia,1561
Austria,336


### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [144]:
# Concatenate together the page count data with population data
table_1 = pd.concat([page_count_by_country, popData], axis=1)

# Create metric for coverage which is the count of pages proportional to the population
table_1['coverage'] = table_1['count'] / table_1['population']

# Order from greatest to least
df = table_1.sort_values(by=['coverage'], ascending = False).head(10)
df.to_csv('results/table1.csv')
df

Unnamed: 0,count,population,coverage
Tuvalu,54.0,10000.0,0.0054
Nauru,52.0,10000.0,0.0052
San Marino,81.0,30000.0,0.0027
Monaco,40.0,40000.0,0.001
Liechtenstein,28.0,40000.0,0.0007
Tonga,63.0,100000.0,0.00063
Marshall Islands,37.0,60000.0,0.000617
Iceland,202.0,400000.0,0.000505
Andorra,34.0,80000.0,0.000425
Grenada,36.0,100000.0,0.00036


### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [145]:
# Flip order to least to greatest
table_2 = table_1.copy().sort_values(by=['coverage'], ascending = True).head(10)
table_2.to_csv('results/table2.csv')
table_2

Unnamed: 0,count,population,coverage
India,985.0,1371300000.0,7.182965e-07
Indonesia,211.0,265200000.0,7.956259e-07
China,1133.0,1393800000.0,8.128856e-07
Uzbekistan,28.0,32900000.0,8.510638e-07
Ethiopia,101.0,107500000.0,9.395349e-07
"Korea, North",36.0,25600000.0,1.40625e-06
Zambia,25.0,17700000.0,1.412429e-06
Thailand,112.0,66200000.0,1.691843e-06
Mozambique,58.0,30500000.0,1.901639e-06
Bangladesh,321.0,166400000.0,1.929087e-06
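
Coverage values this small are hard to compare at a glance. One option, sketched below, is to rescale the same ratio to articles per million people. The two rows reuse the count and population figures from the tables above; the `articles_per_million` column name is an assumption introduced for this example.

```python
import pandas as pd

# Counts and populations copied from the coverage tables above
table = pd.DataFrame(
    {"count": [985.0, 54.0], "population": [1371300000.0, 10000.0]},
    index=["India", "Tuvalu"],
)

# Same coverage metric as in the analysis: articles divided by population
table["coverage"] = table["count"] / table["population"]

# Rescale to articles per million people for readability
table["articles_per_million"] = table["coverage"] * 1_000_000
```

On this scale India has under one article per million people, while Tuvalu has several thousand, which reflects the very small populations of the top-ranked countries.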


For the next few tables, we will make a subset of the data that only contains "good" articles, meaning articles predicted to be FA or GA quality, and do the same analysis.

In [90]:
# First we copy the main data set to a new table
good_data = data.copy()

# Define what is a good score
good_scores = ['FA','GA']

# Filter to keep only items with a good score
good_data = good_data[good_data.prediction.isin(good_scores)]

In [93]:
# Count the good articles per country
good_counts = good_data.groupby(['country']).count()
good_page_count_by_country = pd.DataFrame(good_counts['rev_id'])
good_page_count_by_country = good_page_count_by_country.rename(columns = {'rev_id': 'good_count'})


### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [146]:
# Concatenate the total article counts with the good article counts
table_3 = pd.concat([page_count_by_country, good_page_count_by_country], axis=1)

# Fill NAs with zeros
table_3 = table_3.fillna(0)

# Create metric of relative number of good articles to total articles
table_3['relative'] = table_3['good_count'] / table_3['count']

# Order from greatest to least
table_3 = table_3.sort_values(by=['relative'], ascending = False)
table_3.head(10).to_csv('results/table3.csv')
table_3.head(10)

Unnamed: 0,count,good_count,relative
"Korea, North",36,7.0,0.194444
Saudi Arabia,118,15.0,0.127119
Mauritania,48,6.0,0.125
Central African Republic,66,8.0,0.121212
Romania,343,39.0,0.113703
Tuvalu,54,5.0,0.092593
Bhutan,33,3.0,0.090909
Dominica,12,1.0,0.083333
Syria,129,10.0,0.077519
Benin,91,7.0,0.076923
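
The relative quality metric above is the share of a country's articles predicted as FA or GA. As a compact cross-check, the same proportion can be computed in one step by averaging a boolean flag per country. The sketch below uses a tiny made-up table (countries `A` and `B` are placeholders) and is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# Made-up article-level data: one row per article
articles = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "prediction": ["GA", "Stub", "FA", "Start", "C"],
})

# Flag articles whose predicted class is FA or GA
is_good = articles["prediction"].isin(["FA", "GA"])

# The mean of a boolean flag per country is the proportion of good articles
relative = is_good.groupby(articles["country"]).mean()
```

Countries with no good articles come out as 0 directly, so no separate fill step is needed in this formulation.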


### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [147]:
# Flip order to least to greatest
table_4 = table_3.copy().sort_values(by=['relative'], ascending = True)
# Exclude countries with no GA/FA articles, which would otherwise all tie at 0
table_4 = table_4[~(table_4.good_count == 0)]
table_4.head(10).to_csv('results/table4.csv')
table_4.head(10)

Unnamed: 0,count,good_count,relative
Belgium,520,1.0,0.001923
Tanzania,405,1.0,0.002469
Switzerland,403,1.0,0.002481
Nepal,361,1.0,0.00277
Peru,350,1.0,0.002857
Nigeria,679,2.0,0.002946
Colombia,285,1.0,0.003509
Lithuania,244,1.0,0.004098
Fiji,198,1.0,0.005051
Azerbaijan,179,1.0,0.005587


For this final analysis, we will need the data above on the good articles versus total articles, as well as the population information and geographic information. First we will join the population information to table 3.

In [111]:
# Join together the population data with the data on good pages
table_5 = pd.concat([table_3, popData], axis=1)
table_5 = table_5.dropna()
table_5 = table_5.reset_index()
table_5 = table_5.rename(columns = {'index': 'country'})
table_5

Unnamed: 0,country,count,good_count,relative,population
0,Afghanistan,322.0,12.0,0.037267,36500000.0
1,Albania,457.0,3.0,0.006565,2900000.0
2,Algeria,116.0,2.0,0.017241,42700000.0
3,Andorra,34.0,0.0,0.000000,80000.0
4,Angola,106.0,0.0,0.000000,30400000.0
5,Antigua and Barbuda,24.0,0.0,0.000000,100000.0
6,Argentina,491.0,12.0,0.024440,44500000.0
7,Armenia,196.0,5.0,0.025510,3000000.0
8,Australia,1561.0,39.0,0.024984,24100000.0
9,Austria,336.0,3.0,0.008929,8800000.0


In [120]:
# Now let's load in the geography roll up information and join it 
roll_up = pd.read_csv('data_clean/GeographyRollUp1.csv')
roll_up = roll_up.rename(columns = {'Country': 'country'})
roll_up.head(5)

Unnamed: 0,country,Region
0,Algeria,AFRICA
1,Egypt,AFRICA
2,Libya,AFRICA
3,Morocco,AFRICA
4,Sudan,AFRICA
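
The notebook does not show how a country-to-region table like the one above is built. One possible way, sketched here, assumes that in the raw population table each all-caps region row is immediately followed by the countries belonging to it, so the region name can be carried forward onto the rows beneath. The toy rows below are placeholders and the approach is an assumption, not the procedure actually used.

```python
import pandas as pd

# Toy stand-in for the raw population table: region rows are in ALL CAPS
raw = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "China", "India"],
})

# Identify the region rows by their all-caps names
is_region = raw["Geography"].str.isupper()

# Keep region names only on region rows, then carry them down to the countries
raw["Region"] = raw["Geography"].where(is_region).ffill()

# Drop the region rows themselves, leaving one row per country
country_region = (
    raw.loc[~is_region, ["Geography", "Region"]]
    .rename(columns={"Geography": "country"})
    .reset_index(drop=True)
)
```

This only works if the row order of the raw file is preserved, so it would need to run before any sorting by country.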


In [131]:
# Merge region data with existing data table
table_5A = pd.merge(table_5, roll_up, how = 'outer', on = 'country')
table_5A = table_5A.dropna()

### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [148]:
# Create groups by the region
table_5B = table_5A.groupby('Region').sum()

# Recalculate metric for relative amount of good pages, since the sum will have distorted the percentage
table_5B['relative'] = table_5B['good_count'] / table_5B['count']

# Calculate metric of overall coverage for region
table_5B['coverage'] = table_5B['count'] / table_5B['population']

# Order from greatest to least
table_5B = table_5B.sort_values(by = ['coverage'], ascending = False)
table_5B.to_csv('results/table5.csv')
table_5B

Region,count,good_count,relative,population,coverage
OCEANIA,3132.0,66.0,0.021073,39780000.0,7.9e-05
EUROPE,15923.0,322.0,0.020222,734590000.0,2.2e-05
LATIN AMERICA AND THE CARIBBEAN,5174.0,69.0,0.013336,628270000.0,8e-06
AFRICA,6861.0,125.0,0.018219,1172400000.0,6e-06
NORTHERN AMERICA,1940.0,99.0,0.051031,365200000.0,5e-06
ASIA,11588.0,310.0,0.026752,4513100000.0,3e-06
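
The recalculation of `relative` after grouping matters because summing per-country ratios does not give the regional ratio. The small sketch below, using made-up numbers for a hypothetical region `X`, compares the summed ratio with the ratio of the summed counts.

```python
import pandas as pd

# Two made-up countries in one hypothetical region
countries = pd.DataFrame({
    "Region": ["X", "X"],
    "count": [100, 300],
    "good_count": [10, 0],
})
countries["relative"] = countries["good_count"] / countries["count"]

summed = countries.groupby("Region").sum()

# Summing the per-country ratios gives 0.1 + 0.0
sum_of_ratios = summed.loc["X", "relative"]

# Dividing the summed counts gives 10 / 400
ratio_of_sums = summed.loc["X", "good_count"] / summed.loc["X", "count"]
```

Only the ratio of sums weights each country by its number of articles, which is why the regional table recomputes the metric from the summed counts.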


### Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [149]:
# Reorder greatest to least based on the relative 
table_6 = table_5B.sort_values(by = ['relative'], ascending = False)
table_6.to_csv('results/table6.csv')
table_6

Region,count,good_count,relative,population,coverage
NORTHERN AMERICA,1940.0,99.0,0.051031,365200000.0,5e-06
ASIA,11588.0,310.0,0.026752,4513100000.0,3e-06
OCEANIA,3132.0,66.0,0.021073,39780000.0,7.9e-05
EUROPE,15923.0,322.0,0.020222,734590000.0,2.2e-05
AFRICA,6861.0,125.0,0.018219,1172400000.0,6e-06
LATIN AMERICA AND THE CARIBBEAN,5174.0,69.0,0.013336,628270000.0,8e-06


## IV. Conclusion