# <center> HW Assignment 2</center>
<center> Data Curation </center>
<center> Lauren Heintz </center>
<center> DATA 512, Fall 2019 </center>
<center> Due 10/17/19 </center>

## 0. The Goal
The goal of this analysis is to measure how the coverage of politicians on Wikipedia, and the quality of articles about them, vary between countries. The analysis will produce tables that show:

__The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.  
The countries with the highest and lowest proportion of high quality articles about politicians.  
A ranking of geographic regions by articles-per-person and proportion of high quality articles.__

## I. Data Acquisition
Two data sets were used for this analysis: the Wikipedia politicians-by-country data set and the world population data set.    
[Politicians by country data set found here](https://figshare.com/articles/Untitled_Item/5513449).   
[World population dataset found here](https://www.prb.org/international/indicator/population/table/).   

CSVs of both were saved locally and then loaded in the steps below.

In [122]:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize  

%cd ~/Docs/MSDS/Fa2019/Week2

/Users/laurenheintz/Docs/MSDS/Fa2019/Week2


In [93]:
polData = pd.read_csv('page_data.csv', sep=',', header=0) # polData is politician by country data
popData = pd.read_csv('WPDS_2018_data.csv', sep=',', header=0) # popData is population by country data
polData.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [94]:
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


## II. Data Processing
The following section outlines the data processing that was done to prepare for this analysis.  

### Politician Pages By Country
First, we check for missing values and, if there are any, replace them with zero.  
This data set contains rows whose page name starts with "Template". These are not actual articles and must be removed from the data set.  

In [95]:
# Look for missing or null & fill with zero
if (polData.isnull().values.any()):
    polData = polData.fillna(0)

# Filter out rows whose page name starts with "Template:" (these are not articles)
polData = polData[~polData.page.str.startswith("Template:")]
polData = polData.sort_values(by=['country'])
polData = polData.reset_index(drop=True)
polData.head(10)

Unnamed: 0,page,country,rev_id
0,Raul Eshba,Abkhazia,789039267
1,Zakan Jugelia,Abkhazia,786203824
2,Zurab Achba,Abkhazia,721094337
3,Sumbat Saakian,Abkhazia,755193428
4,Gennadi Berulava,Abkhazia,805063877
5,Efrem Eshba,Abkhazia,798644673
6,Yuri Voronov,Abkhazia,803018106
7,Zaur Avidzba,Abkhazia,694519009
8,Bagrat Shinkuba,Abkhazia,789818648
9,Nestor Lakoba,Abkhazia,805967589


### Population By Country
First, we check for missing values and, if there are any, replace them with zero.  
This data set contains entries in the country column that are not countries but regions, written in all caps. They have no matching rows in the pages data set, so for now we filter them out and save them to a separate CSV for later analysis. 

In [96]:
# Look for missing or null & fill with zero
if (popData.isnull().values.any()):
    popData = popData.fillna(0)

# Locate rows with ALL CAPS, save this regional roll up data elsewhere
popData[popData.Geography.str.isupper()].to_csv('GeographyRollUp.csv', index=False)

# Filter out the non-country data (all caps)
popData = popData[~popData.Geography.str.isupper()]
popData.head(5)

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7
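
For the regional tables, each country will eventually need a region label. The following is a minimal sketch, assuming (as the first rows of the file suggest) that each all-caps region row precedes the countries that belong to it. The toy frame, the `region` column name, and the `ASIA` total are illustrative placeholders, not part of the original data.

```python
import pandas as pd

# Toy stand-in for the population file: region rows are in ALL CAPS and are
# assumed to precede the countries that belong to them.
toy = pd.DataFrame({
    'Geography': ['AFRICA', 'Algeria', 'Egypt', 'ASIA', 'Afghanistan'],
    # The ASIA total is a placeholder value, not a real figure
    'Population mid-2018 (millions)': [1284.0, 42.7, 97.0, 1.0, 36.5],
})

is_region = toy['Geography'].str.isupper()

# Keep the name on region rows only, then carry it forward onto the countries below
toy['region'] = toy['Geography'].where(is_region).ffill()

# Drop the region rows themselves, leaving each country tagged with its region
countries_with_region = toy[~is_region].reset_index(drop=True)
print(countries_with_region)
```

This only holds if the file order is preserved, so the label would have to be assigned before any sorting by country.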


In [104]:
# Rename Geography to country and the population column to Population
popData = popData.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)':'Population'})

In [105]:
popData = popData.sort_values(by=['country'])
popData = popData.reset_index(drop=True)
popData.head(10)

Unnamed: 0,country,Population
0,Afghanistan,36.5
1,Albania,2.9
2,Algeria,42.7
3,Andorra,0.08
4,Angola,30.4
5,Antigua and Barbuda,0.1
6,Argentina,44.5
7,Armenia,3.0
8,Australia,24.1
9,Austria,8.8


### Acquire ORES Scores from ORES REST API
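ORES is a Wikimedia web service that predicts the quality class of an article revision. For each revision ID, the `wp10` model used below returns a predicted class (FA, GA, B, C, Start, or Stub) along with a probability for each class.   
These predictions must then be joined to the pages data on `rev_id`.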
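
The page data holds one revision ID per article, which is likely too many to send in a single request, so the IDs would need to be sent in batches. Below is a minimal sketch of the batching step. The helper name `chunk_ids` and the default batch size of 50 are assumptions for illustration, not values taken from the ORES documentation.

```python
def chunk_ids(revision_ids, batch_size=50):
    """Yield successive lists of at most batch_size revision IDs."""
    for start in range(0, len(revision_ids), batch_size):
        yield revision_ids[start:start + batch_size]


# Small example with a batch size of 2 so the split is visible
example_ids = [783381498, 807355596, 757539710]
batches = list(chunk_ids(example_ids, batch_size=2))
print(batches)
```

Each batch could then be passed to the request function in turn and the results combined.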

In [120]:
headers = {'User-Agent' : 'https://github.com/lheintz', 'From' : 'heintzl@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    print(json.dumps(response, indent=4, sort_keys=True))
    return response


# So if we grab some example revision IDs and turn them into a list and then call get_ores_data...
example_ids = [783381498, 807355596, 757539710]
df = get_ores_data(example_ids, headers)

{
    "enwiki": {
        "models": {
            "wp10": {
                "version": "0.8.1"
            }
        },
        "scores": {
            "757539710": {
                "wp10": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.06907655349650586,
                            "C": 0.1730497923608886,
                            "FA": 0.003738253691275387,
                            "GA": 0.007083489019420698,
                            "Start": 0.7205318510650603,
                            "Stub": 0.02652006036684928
                        }
                    }
                }
            },
            "783381498": {
                "wp10": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.02903486686501717,
                            "C": 0.06807603083007

In [128]:
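# The predictions are nested under each revision ID, so build the table directly from the response
norm = pd.DataFrame([{'rev_id': int(r), 'prediction': s['wp10'].get('score', {}).get('prediction')}
                     for r, s in df['enwiki']['scores'].items()])
norm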

In [None]:
# Read the page data from the CSV file as rows of [page, country, rev_id]
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
        data.append([row[0],row[1],row[2]])

Remove or document entries for which no ORES prediction was available.
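
One way to separate scored revisions from unscored ones is to walk the nested response. This is a minimal sketch, assuming the response has the shape shown in the output earlier and that a revision without a prediction has no `score` key under the model. The function name `split_ores_response` and the sample `error` entry are hypothetical.

```python
def split_ores_response(response, project='enwiki', model='wp10'):
    """Return (predictions, missing): rev_id -> class, and rev_ids with no score."""
    predictions, missing = {}, []
    for rev_id, result in response[project]['scores'].items():
        score = result.get(model, {}).get('score')
        if score is None:
            missing.append(int(rev_id))
        else:
            predictions[int(rev_id)] = score['prediction']
    return predictions, missing


# Hand-written response in the same shape as the API output;
# the "error" entry is an assumed stand-in for a revision that could not be scored.
sample = {'enwiki': {'scores': {
    '757539710': {'wp10': {'score': {'prediction': 'Start'}}},
    '783381498': {'wp10': {'score': {'prediction': 'Start'}}},
    '1': {'wp10': {'error': {'message': 'placeholder'}}},
}}}
predictions, missing = split_ores_response(sample)
print(predictions, missing)
```

The `missing` list could be written to a file so the dropped entries stay documented.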

### Join Pages and Population Data
The data on politician pages and population must now be joined. The two data sets will not match up exactly. Countries that are missing either pages or population data will be excluded from the analysis data set, but saved to `wp_wpds_countries-no_match.csv`.

The complete data set with no missing values will be saved as `wp_wpds_politicians_by_country.csv`.  

In [113]:
# Merge columns by the country name, do an outer join
allData = pd.merge(popData, polData, how = "outer", on="country")

# Clean data by sorting, filling NAs with zero, and resetting the index
allData = allData.sort_values(by=['country']).fillna(0).reset_index(drop=True)
allData

Unnamed: 0,country,Population,page,rev_id
0,Abkhazia,0,Raul Eshba,789039267.0
1,Abkhazia,0,Zakan Jugelia,786203824.0
2,Abkhazia,0,Zhiuli Shartava,802029007.0
3,Abkhazia,0,Zaur Ardzinba,704938340.0
4,Abkhazia,0,Garri Aiba,799618550.0
5,Abkhazia,0,Samson Chanba,789818730.0
6,Abkhazia,0,Shota Shamatava,723736482.0
7,Abkhazia,0,Nestor Lakoba,805967589.0
8,Abkhazia,0,Guram Gabiskiria,805775169.0
9,Abkhazia,0,Zaur Avidzba,694519009.0


In [117]:
# Identify countries which do not have a full matching set of data and save in a separate csv
rejects = allData[(allData[['Population','page']] == 0).any(axis=1)]
rejects.to_csv('wp_wpds_countries-no_match.csv')

# Create clean data set which has filtered out any rows with missing values
desired = allData[~(allData[['Population','page']] == 0).any(axis=1)]
desired.to_csv('wp_wpds_politicians_by_country.csv')
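
Using zero as the missing-value marker works here, but it relies on no real value being zero. An alternative sketch uses the `indicator` option of `pd.merge` to make the match status explicit. The toy frames stand in for `popData` and `polData`, and `Example Page` with `rev_id` 1 is a placeholder row.

```python
import pandas as pd

# Toy stand-ins for popData and polData
pop = pd.DataFrame({'country': ['Algeria', 'Egypt'], 'Population': [42.7, 97.0]})
pol = pd.DataFrame({'country': ['Abkhazia', 'Algeria'],
                    'page': ['Raul Eshba', 'Example Page'],  # second page is a placeholder
                    'rev_id': [789039267, 1]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(pop, pol, how='outer', on='country', indicator=True)

# Rows present in both inputs form the analysis set; the rest are the non-matches
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
no_match = merged[merged['_merge'] != 'both'].drop(columns='_merge')
print(matched)
print(no_match)
```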

## III. Data Analysis
1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population  
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
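
The country-level tables come down to two ratios per country. The following is a minimal sketch, assuming a joined table with one row per article and hypothetical columns `country`, `population` (in millions, as in the source file) and `prediction`. The toy values are placeholders, not results.

```python
import pandas as pd

# Toy joined table: one row per article (placeholder values)
articles = pd.DataFrame({
    'country':    ['A', 'A', 'A', 'A', 'B', 'B'],
    'population': [2.0, 2.0, 2.0, 2.0, 10.0, 10.0],  # millions
    'prediction': ['FA', 'Stub', 'GA', 'Start', 'Stub', 'C'],
})

# Flag articles predicted to be featured (FA) or good (GA)
articles['high_quality'] = articles['prediction'].isin(['FA', 'GA'])

# One row per country: article count, high-quality count, and population
summary = articles.groupby('country').agg(
    article_count=('prediction', 'size'),
    high_quality_count=('high_quality', 'sum'),
    population=('population', 'first'),
)

# Coverage: articles per million people. Quality: share of GA/FA articles.
summary['articles_per_million'] = summary['article_count'] / summary['population']
summary['high_quality_share'] = summary['high_quality_count'] / summary['article_count']

top_by_coverage = summary.sort_values('articles_per_million', ascending=False).head(10)
print(top_by_coverage)
```

Sorting with `ascending=True`, or sorting on `high_quality_share`, would give the other country-level tables. Grouping on a region column instead of `country` would give the regional ones.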

## IV. Conclusion