# <center> HW Assignment 2</center>
<center> Data Curation </center>
<center> Lauren Heintz </center>
<center> DATA 512, Fall 2019 </center>
<center> Due 10/17/19 </center>

## 0. The Goal
The goal of this analysis is to observe how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The analysis will focus on tables that show:

__The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
The countries with the highest and lowest proportion of high quality articles about politicians.  
A ranking of geographic regions by articles-per-person and proportion of high quality articles.__

## I. Data Acquisition
Two data sets were used for this analysis: the Wikipedia politicians by country data set and the world population data set.    
[Politicians by country data set found here](https://figshare.com/articles/Untitled_Item/5513449).   
[World population dataset found here](https://www.prb.org/international/indicator/population/table/).   

CSVs of both were saved locally and then loaded in the steps below.

In [14]:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize  

%cd ~/Docs/MSDS/Fa2019/data512/data-512-a2

/Users/laurenheintz/Docs/MSDS/Fa2019/data512/data-512-a2


In [137]:
polData = pd.read_csv('data_raw/page_data.csv', sep=',', header=0) # polData is politician by country data
popData = pd.read_csv('data_raw/WPDS_2018_data.csv', sep=',', header=0) # popData is population by country data
polData.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [138]:
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


## II. Data Processing
The following section outlines the data processing that was done to prepare for this analysis.  

### Politician Pages By Country
First, we will check to see if there are any missing values. If there are, we will replace them with zero.  
This data set contains rows whose page name contains the word "Template". These are not actual article pages and must be removed from the data set.  

In [139]:
# Look for missing or null & fill with zero
if (polData.isnull().values.any()):
    polData = polData.fillna(0)

# Filter out data containing template
polData = polData[~polData.page.str.contains("Template")]
polData = polData.sort_values(by=['country'])
polData = polData.reset_index(drop=True)
polData.head(10)

Unnamed: 0,page,country,rev_id
0,Raul Eshba,Abkhazia,789039267
1,Zakan Jugelia,Abkhazia,786203824
2,Zurab Achba,Abkhazia,721094337
3,Sumbat Saakian,Abkhazia,755193428
4,Gennadi Berulava,Abkhazia,805063877
5,Efrem Eshba,Abkhazia,798644673
6,Yuri Voronov,Abkhazia,803018106
7,Zaur Avidzba,Abkhazia,694519009
8,Bagrat Shinkuba,Abkhazia,789818648
9,Nestor Lakoba,Abkhazia,805967589
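
A substring test such as `str.contains("Template")` also drops any article whose title merely contains that word, so a prefix test may be a slightly stricter way to express the same filter. The sketch below is only an illustration on made-up rows (the last title is hypothetical, not taken from the data set):

```python
import pandas as pd

# Toy rows for illustration only; the last title is hypothetical.
pages = pd.DataFrame({
    "page": [
        "Template:Uganda-politician-stub",
        "Bir I of Kanem",
        "Minister of Template Affairs",
    ],
    "country": ["Uganda", "Chad", "Chad"],
})

# Keep rows whose title does not start with the "Template:" namespace prefix.
is_template = pages["page"].str.startswith("Template:")
articles = pages[~is_template].reset_index(drop=True)
print(articles["page"].tolist())
```

Whether the stricter test changes anything depends on the titles actually present in `page_data.csv`.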


### Population By Country
First, we will check to see if there are any missing values. If there are, we will replace them with zero.  
This data set contains entries in the country column which are not countries, but regions. These are in all caps and will not have matching data in the pages data set. For now, we will filter out these rows and save them to a CSV so we can analyze them later. 

In [140]:
# Look for missing or null & fill with zero
if (popData.isnull().values.any()):
    popData = popData.fillna(0)

# Locate rows with ALL CAPS, save this regional roll up data elsewhere
popData[popData.Geography.str.isupper()].to_csv('data_clean/GeographyRollUp.csv', index=False)

# Filter out the non-country data (all caps)
popData = popData[~popData.Geography.str.isupper()]
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7
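
The regional analysis later needs to know which region each country belongs to. One possible approach, assuming the countries are listed directly beneath their ALL CAPS region header (as the first rows of the table suggest), is to carry the header name down onto the country rows before filtering. The column name `region` and the toy rows are assumptions for this sketch:

```python
import pandas as pd

# Toy excerpt shaped like the population table: region headers are in ALL CAPS.
pop = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "Armenia"],
})

is_region = pop["Geography"].str.isupper()

# Keep the name on header rows, NaN elsewhere, then fill downward so every
# country row inherits the most recent header above it.
pop["region"] = pop["Geography"].where(is_region).ffill()

countries = pop[~is_region].reset_index(drop=True)
print(countries)
```

This only works if the row order of the raw file is preserved, so it would have to run before any sorting by country.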


In [141]:
# Rename Geography to country, sort by country, re index
popData = popData.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)':'population'})

In [142]:
popData = popData.sort_values(by=['country'])
popData = popData.reset_index(drop=True)
popData.head(10)

Unnamed: 0,country,population
0,Afghanistan,36.5
1,Albania,2.9
2,Algeria,42.7
3,Andorra,0.08
4,Angola,30.4
5,Antigua and Barbuda,0.1
6,Argentina,44.5
7,Armenia,3.0
8,Australia,24.1
9,Austria,8.8


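### Acquire ORES Scores from the ORES REST API
ORES is a Wikimedia web service that scores revisions with machine learning models. Its `wp10` model predicts an article quality class (FA, GA, B, C, Start, or Stub) for a given revision ID.   
The predictions must then be joined to the pages data.

Turn the column of revision IDs in the pandas data frame into a list.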

In [143]:
# Turn the column of revision IDs in the pandas data frame in to a list
rev_list = polData['rev_id'].tolist()

Now we use the code below, provided by Jonathan, to call the API.

In [30]:
headers = {'User-Agent' : 'https://github.com/lheintz', 'From' : 'heintzl@uw.edu'}

In [31]:
def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so the request identifies who is calling the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
#     print(json.dumps(response, indent=4, sort_keys=True))
    return response

Let's create some code that interprets the JSON.

In [91]:
import os
from pprint import pprint
import pandas as pandas
from json import load

class ModelScore:
    __slots__ = ("identifier", "prediction", "probabilities", "stub")
    # slots used to choose which values will be a part of the object 

    def __init__(
        self, identifier: int = None, prediction: str = None, probabilities: dict = None
    ):
        self.identifier = identifier
        self.prediction = prediction
        self.probabilities = probabilities
        # Guard against the default probabilities=None
        self.stub = probabilities.get("Stub") if probabilities else None

    # Method to print outputs nicely if desired for debugging
    def __repr__(self):
        string = "{"
        for index, key in enumerate(self.__slots__):
            string += "{"
            string += f"{key}: {str(getattr(self, key, None))}"
            if index < len(self.__slots__) - 1:
                string += "}, "
            else:
                string += "}"
        return string

class JSONParser:
    @staticmethod
    def parse(path_to_json_file):
        """ 
        The sub-function to turn the json file of the raw
        outputs from the ORES API call in to a dictionary. 
  
        Parameters: 
            path_to_json_file (string): The path to the raw json file returned from the API call
          
        Returns: 
            json_as_dictionary (dictionary): A python dictionary object of the original json
        """

        # Returns json as a dictionary

        with open(path_to_json_file) as json_file:
            json_as_dictionary = load(json_file)
        return json_as_dictionary

def json_path_to_dataframe(path_to_json_file):
    """ 
        The function to turn the json file of the raw
        outputs from the ORES API call in to a clean df
        in the format that we want. 
  
        Parameters: 
            path_to_json_file (string): The path to the raw json file returned from the API call
          
        Returns: 
            model_scores (dataframe): Returns the predicted scores from the model for each Revision Id queried
    """
    # Use json parser class to parse json in to a dictionary

    json_parser = JSONParser()
    json_as_dictionary = json_parser.parse(path_to_json_file)
    
    all_scores = json_as_dictionary.get("enwiki").get("scores")
    
    all_score_ids_mapped_to_info = {}
    for score in all_scores:
        all_score_ids_mapped_to_info[score] = (
            all_scores.get(score).get("wp10").get("score")
        )
    
    # Use model score class to choose values we want to add to a df

    model_scores = []
    for key, value in all_score_ids_mapped_to_info.items():
        try:
            model_scores.append([key, value.get("prediction")])
        except AttributeError:
            model_scores.append([key, 0])
    
    return pandas.DataFrame(model_scores)
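
The parsing above can also be done directly on the response dictionary, without writing it to disk first. The following is a minimal sketch, assuming the response has the shape the code above relies on (`enwiki` → `scores` → revision ID → `wp10` → `score` → `prediction`) and that a revision without a score carries an `error` entry instead. The helper name `scores_to_frame`, the revision IDs, and the error content are made up for illustration:

```python
import pandas as pd

# Hand-written response in the assumed shape; IDs and values are illustrative.
response = {
    "enwiki": {
        "scores": {
            "111": {"wp10": {"score": {"prediction": "Stub"}}},
            "222": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

def scores_to_frame(response, project="enwiki", model="wp10"):
    """Return one row per revision ID; prediction is None when no score exists."""
    rows = []
    for rev_id, models in response[project]["scores"].items():
        score = models.get(model, {}).get("score")
        rows.append({
            "rev_id": int(rev_id),
            "prediction": score["prediction"] if score else None,
        })
    return pd.DataFrame(rows, columns=["rev_id", "prediction"])

frame = scores_to_frame(response)
print(frame)
```

Using `None` rather than `0` for a missing prediction keeps the column free of mixed types, and `frame["prediction"].isna()` then identifies the unscored revisions.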


In [88]:
len(rev_list)

46701

In [165]:
len(set(rev_list))

46701

In [157]:
df_clean = pd.DataFrame([])

for i in range(467):
    df = get_ores_data(rev_list[i:i+100], headers)
    
    with open('data_raw/ores-json-data-raw.json', 'w') as json_file:
        json.dump(df, json_file)
        
    df = json_path_to_dataframe('data_raw/ores-json-data-raw.json')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df_clean = pd.concat([df_clean, df])

In [159]:
df_clean[0].nunique()

566
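
A count of 566 unique revision IDs out of 46,701 is what one would expect if consecutive batches overlapped: when the start index advances by 1 while each slice spans 100 IDs, 467 requests only reach the first 566 IDs. A minimal sketch of non-overlapping batching follows. The helper name `chunked` is hypothetical, and a list of integers stands in for the revision IDs:

```python
def chunked(items, size):
    """Yield consecutive, non-overlapping slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = list(range(250))  # stand-in for the list of revision IDs
batches = list(chunked(ids, 100))

# Every ID appears in exactly one batch; the last batch may be shorter.
print([len(batch) for batch in batches])
```

With this pattern each batch would be passed to `get_ores_data` in turn, and the final partial batch is picked up by the same loop rather than by a separate call.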

In [105]:
df_clean

Unnamed: 0,0,1
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


In [106]:
df = get_ores_data(rev_list[46700:46701], headers)
    
with open('data_raw/ores-json-data-raw.json', 'w') as json_file:
    json.dump(df, json_file)
        
df = json_path_to_dataframe('data_raw/ores-json-data-raw.json')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_clean = pd.concat([df_clean, df])

In [107]:
df_clean = df_clean.rename(columns={0:'rev_id', 1:'prediction'})
df_clean

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


We have now parsed all the predictions from the JSON responses and will save them as a CSV. In this CSV, a prediction of 0 signifies that the API returned no value for that revision.

In [108]:
df_clean.to_csv('data_raw/pred_scores.csv', index=False)

In [112]:
# reset_index returns a new data frame, so assign the result back
df_clean = df_clean.reset_index(drop=True)
df_clean.head(10)

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


Now to clean things up, we drop the items that had a zero for the prediction and then save this file to our clean data folder. Now we should only have rev IDs that have non-zero predictions.

In [117]:
# Filter out the data that did not have ORES prediction scores available and save to CSV

missing = df_clean[(df_clean[['prediction']] == 0).any(axis=1)]
missing = missing.drop(columns=["prediction"])
missing.to_csv('results/ores-no-score.csv', index = False)

df_clean = df_clean[~(df_clean[['prediction']] == 0).any(axis=1)]
df_clean.to_csv('data_clean/pred_scores.csv', index=False)

I saved these results in a CSV so that I did not have to run the API call and data cleaning again. Now I reload the saved CSV from above and will join it to the politician pages data set it came from.

In [146]:
predictions = pd.read_csv('data_clean/pred_scores.csv')
predictions.head(5)

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub


Let's check the data type of the column we would like to merge on in both data frames: the revision ID.

In [148]:
predictions.dtypes

rev_id         int64
prediction    object
dtype: object

In [149]:
polData.dtypes

page       object
country    object
rev_id      int64
dtype: object

In [154]:
predictions['rev_id'].nunique()

565

In [151]:
polDataPredictions

Unnamed: 0,page,country,rev_id,prediction
0,Raul Eshba,Abkhazia,789039267,Stub
1,Zakan Jugelia,Abkhazia,786203824,Start
2,Zakan Jugelia,Abkhazia,786203824,Start
3,Zurab Achba,Abkhazia,721094337,Stub
4,Zurab Achba,Abkhazia,721094337,Stub
5,Zurab Achba,Abkhazia,721094337,Stub
6,Sumbat Saakian,Abkhazia,755193428,Stub
7,Sumbat Saakian,Abkhazia,755193428,Stub
8,Sumbat Saakian,Abkhazia,755193428,Stub
9,Sumbat Saakian,Abkhazia,755193428,Stub
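
The repeated rows in the output above (the same page and revision ID appearing several times) are the pattern a merge produces when one table holds duplicate keys. A minimal sketch, on made-up rows, of deduplicating before the merge and asking pandas to check key uniqueness through the `validate` argument of `pd.merge`:

```python
import pandas as pd

pages = pd.DataFrame({"page": ["A", "B"], "rev_id": [1, 2]})

# Scores table with rev_id 2 repeated, as repeated API requests could produce.
scores = pd.DataFrame({
    "rev_id": [1, 2, 2],
    "prediction": ["Stub", "Start", "Start"],
})

# A plain merge repeats page "B" once per duplicate score row.
naive = pd.merge(pages, scores, on="rev_id")

# Drop duplicate keys first; validate raises MergeError if any remain.
deduped = scores.drop_duplicates(subset="rev_id")
merged = pd.merge(pages, deduped, on="rev_id", validate="one_to_one")

print(len(naive), len(merged))
```

Checking that the merged table has as many rows as the pages table is a quick way to confirm that no rows were multiplied.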


In [129]:
# Merge to pol data by rev_id
# polData['rev_id'] = polData['rev_id'].astype(int)
# df_clean['rev_id'] = df_clean['rev_id'].astype(int)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [128]:
df_clean

Unnamed: 0,rev_id,prediction
0,694519009,Start
1,704938340,Start
2,706112927,Stub
3,713113246,Stub
4,715926834,Stub
5,715926905,Stub
6,718250607,Stub
7,718362010,Start
8,718362588,Stub
9,718364221,Stub


In [174]:
%cd ~/Docs/MSDS/Fa2019/
pred = pd.read_csv('prediction.csv')
pred.head(5)

/Users/laurenheintz/Docs/MSDS/Fa2019


Unnamed: 0,rev_id,prediction
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [184]:
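# Merge to pol data by rev_id
polDataScore = pd.merge(polData, pred, how = "outer", on= "rev_id")
# Drop rows whose prediction contains "Rev" or "Text", i.e. values that are not a quality class
# (na=False treats a missing prediction as "no match" so the mask stays boolean)
polDataScore = polDataScore[~polDataScore["prediction"].str.contains("Rev", na=False)]
polDataScore = polDataScore[~polDataScore['prediction'].str.contains("Text", na=False)]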
polDataScore = polDataScore.sort_values(by=['country']).reset_index(drop=True)

In [187]:
polDataScore

Unnamed: 0,page,country,rev_id,prediction
0,Raul Eshba,Abkhazia,789039267,Stub
1,Zhiuli Shartava,Abkhazia,802029007,Start
2,Zaur Ardzinba,Abkhazia,704938340,Start
3,Garri Aiba,Abkhazia,799618550,Start
4,Guram Gabiskiria,Abkhazia,805775169,Start
5,Shota Shamatava,Abkhazia,723736482,Stub
6,Nestor Lakoba,Abkhazia,805967589,GA
7,Bagrat Shinkuba,Abkhazia,789818648,Start
8,Samson Chanba,Abkhazia,789818730,Start
9,Yuri Voronov,Abkhazia,803018106,Stub


Entries for which no ORES prediction was available have now been removed.

### Join Pages and Population Data
The data on politician pages and population must now be joined. The two data sets will not match up exactly. Countries which are missing either pages or population data will be excluded from our analysis data set, but saved to `wp_wpds_countries-no_match.csv`.

The complete data set with no missing values will be saved as `wp_wpds_politicians_by_country.csv`.  

In [188]:
# Merge columns by the country name, do an outer join
allData = pd.merge(popData, polDataScore, how = "outer", on = "country")

# Clean data by sorting, filling NAs with zero, and resetting the index
allData = allData.sort_values(by=['country']).fillna(0).reset_index(drop=True)

In [189]:
allData

Unnamed: 0,country,population,page,rev_id,prediction
0,Abkhazia,0,Raul Eshba,789039267.0,Stub
1,Abkhazia,0,Zhiuli Shartava,802029007.0,Start
2,Abkhazia,0,Zaur Avidzba,694519009.0,Start
3,Abkhazia,0,Zakan Jugelia,786203824.0,Start
4,Abkhazia,0,Zurab Achba,721094337.0,Stub
5,Abkhazia,0,Gennadi Berulava,805063877.0,Stub
6,Abkhazia,0,Efrem Eshba,798644673.0,Stub
7,Abkhazia,0,Yuri Voronov,803018106.0,Stub
8,Abkhazia,0,Sumbat Saakian,755193428.0,Stub
9,Abkhazia,0,Bagrat Shinkuba,789818648.0,Start


In [191]:
# Identify countries which do not have a full matching set of data and save in a separate csv
rejects = allData[(allData[['population','page']] == 0).any(axis=1)]
rejects.to_csv('wp_wpds_countries-no_match.csv')

# Create clean data set which has filtered out any rows with missing values
desired = allData[~(allData[['population','page']] == 0).any(axis=1)]
desired.to_csv('wp_wpds_politicians_by_country.csv')
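
Filling missing values with 0 and then testing for 0 works as long as 0 never occurs as a real value. An alternative, sketched here on a few rows taken from the tables above, is to let `pd.merge` label each row with its origin through `indicator=True` and split on that label. The variable names `matched` and `no_match` are assumptions for this sketch:

```python
import pandas as pd

pop = pd.DataFrame({
    "country": ["Afghanistan", "Andorra"],
    "population": [36.5, 0.08],
})
pages = pd.DataFrame({
    "country": ["Afghanistan", "Abkhazia"],
    "page": ["Seema Jowenda", "Raul Eshba"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
both = pd.merge(pop, pages, how="outer", on="country", indicator=True)

matched = both[both["_merge"] == "both"].drop(columns="_merge")
no_match = both[both["_merge"] != "both"].drop(columns="_merge")

print(matched["country"].tolist(), sorted(no_match["country"]))
```

Here `matched` plays the role of `desired` and `no_match` the role of `rejects`, with missing values left as NaN instead of 0.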

In [194]:
desired

Unnamed: 0,country,population,page,rev_id,prediction
16,Afghanistan,36.5,Nur ul-Haq Ulumi,779084312.0,C
17,Afghanistan,36.5,Al-Haj Suliman Yari,723481980.0,Stub
18,Afghanistan,36.5,Ghulam Qawis Abubaker,752026068.0,Stub
19,Afghanistan,36.5,Mohammad Fahim Dashty,706112927.0,Stub
20,Afghanistan,36.5,Seema Jowenda,802286319.0,Start
21,Afghanistan,36.5,Dost Mohammad Khan (Emir of Afghanistan),806499548.0,C
22,Afghanistan,36.5,Hashmat Ghani Ahmadzai,794028379.0,Start
23,Afghanistan,36.5,Humayun Azizi,800555876.0,Start
24,Afghanistan,36.5,Abdul Ahad Karzai,796975361.0,Start
25,Afghanistan,36.5,Abdullah Abdullah,806496321.0,C


## III. Data Analysis
1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population  
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
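
The per-country measures in the list above could be computed along the following lines. This is a sketch on toy rows shaped like the merged table, not the analysis itself. The country names, the column names `articles_per_million` and `pct_high_quality`, and the values are all made up for illustration:

```python
import pandas as pd

# Toy rows shaped like the merged table; population is in millions.
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "population": [2.0, 2.0, 2.0, 10.0, 10.0],
    "prediction": ["GA", "Stub", "FA", "Stub", "Start"],
})

# Flag articles predicted to be of GA or FA quality.
df["high_quality"] = df["prediction"].isin(["GA", "FA"])

summary = df.groupby("country").agg(
    articles=("prediction", "size"),
    population=("population", "first"),
    high_quality=("high_quality", "sum"),
)
summary["articles_per_million"] = summary["articles"] / summary["population"]
summary["pct_high_quality"] = 100 * summary["high_quality"] / summary["articles"]

# Top 10 by coverage; use ascending=True for the bottom 10.
top_coverage = summary.sort_values("articles_per_million", ascending=False).head(10)
print(top_coverage)
```

The regional rankings would follow the same pattern with a region column as the grouping key and the regional population as the denominator.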

## VI. Conclusion