Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data
Please follow the reproducibility workflow as practiced during the last exercise.

## Step 1⃣ | Data acquisition

You will use two data sources: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `_data` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

First of all, we import all needed libraries.

In [None]:
import requests 
import zipfile
import shutil
import os
import pandas as pd
import numpy as np
import json

Next, we define some helper functions that are needed for the following steps.

In [None]:
def download_zip_file(url, save_path, chunk_size=128):
    '''
    Downloads a zip from a given url. 

    Parameters
    ----------
    url : str
        Url of zip file
    save_path : str
        Save path of zip file
    chunk_size : int
        Chunk size in which the zip file is downloaded
    '''
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
            
def extract_zip_file(path):
    '''
    Extracts a zip.

    Parameters
    ----------
    
    path : str
        Path of zip file
    '''
    with zipfile.ZipFile(path, "r") as zip_ref:
        zip_ref.extractall(path.replace('.zip', ''))
    
def move_to_data_raw(dir_path, filename):
    '''
    Copies a file into the data_raw directory.

    Parameters
    ----------
    
    dir_path : str
        Directory path of file
    filename : str
        Filename of file
    '''
    shutil.copyfile(f"{dir_path}{filename}", f"../data_raw/{filename}")
    
def remove_zip_files(path):
    '''
    Removes all zip file dependent files and directories.

    Parameters
    ----------
    
    path : str
        Path to zip file
    '''
    os.remove(path)
    shutil.rmtree(path.replace('.zip', ''))

def save_data_raw(data, filename):
    '''
    Saves data to the data raw directory.

    Parameters
    ----------
    
    data : dict, pandas.core.frame.DataFrame 
        Data which should be saved
    filename : str
        Resulting filename
    '''
    __save(data, '../data_raw/', filename)
    
def save_data_clean(data, filename):
    '''
    Saves data to the data clean directory.

    Parameters
    ----------
    
    data : dict, pandas.core.frame.DataFrame
        Data which should be saved
    filename : str
        Resulting filename
    '''
    __save(data, '../data_clean/', filename)
    
def save_results(data, filename):
    '''
    Saves data to the results directory.

    Parameters
    ----------
    
    data : dict, pandas.core.frame.DataFrame
        Data which should be saved
    filename : str
        Resulting filename
    '''
    __save(data, '../results/', filename)
    
def __save(data, path, filename):
    '''
    Saves data to a given directory.

    Parameters
    ----------
    
    data : dict, pandas.core.frame.DataFrame
        Data which should be saved
    path : str
        Directory path
    filename : str
        Resulting filename
    '''
    data.to_csv(f'{path}{filename}', index=False)
    
def read_data_raw(filename, sep=','):
    '''
    Reads data from the data raw directory.

    Parameters
    ----------
    filename : str
        Filename of file which should be read
    sep : str
        Separation character of file
        
    Returns
    -------
    pandas.core.frame.DataFrame
        Read data
    '''
    return __read('../data_raw/', filename, sep)
    
def read_data_clean(filename, sep=','):
    '''
    Reads data from the data clean directory.

    Parameters
    ----------
    filename : str
        Filename of file which should be read
    sep : str
        Separation character of file
        
    Returns
    -------
    pandas.core.frame.DataFrame
        Read data
    '''
    return __read('../data_clean/', filename, sep)

def read_results(filename, sep=','):
    '''
    Reads data from the results directory.

    Parameters
    ----------
    filename : str
        Filename of file which should be read
    sep : str
        Separation character of file
        
    Returns
    -------
    pandas.core.frame.DataFrame
        Read data
    '''
    return __read('../results/', filename, sep)

def __read(path, filename, sep=','):
    '''
    Reads data from a given directory.

    Parameters
    ----------
    path : str
        Directory path
    filename : str
        Filename of file which should be read
    sep : str
        Separation character of file
        
    Returns
    -------
    pandas.core.frame.DataFrame
        Read data
    '''
    return pd.read_csv(f'{path}{filename}', sep=sep)

Now we download and extract the **Wikipedia articles** zip file. The **Population data** can already be found in the folder: `_data/export_2019.csv`.

In [None]:
url = 'https://ndownloader.figshare.com/files/9614893'
zip_path = '_data/country.zip'

# download
download_zip_file(url, zip_path)
#extract
extract_zip_file(zip_path)

The files needed for the analysis, `_data/country/country/data/page_data.csv` and `_data/export_2019.csv`, are copied to the `data_raw` folder so that we have a fixed starting point in terms of data. All zip-related files that are no longer needed are removed afterwards to keep the directories clean.

In [None]:
# move
move_to_data_raw('_data/country/country/data/', 'page_data.csv')
move_to_data_raw('_data/', 'export_2019.csv')

# remove
remove_zip_files(zip_path)

## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contains some rows that you will need to filter out: page names that start with the string `"Template:"`. These pages are not Wikipedia articles and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.

***

| | `page_data.csv` | | |
|-|------|---------|--------|
| | **page** | **country** | **rev_id** |
|0|	Template:ZambiaProvincialMinisters | Zambia | 235107991 |
|1|	Bir I of Kanem | Chad | 355319463 |

***

| | `export_2019.csv` | | |
|-|------|---------|--------|
| | **country** | **population** | **region** |
|0|	Algeria | 44.357 | AFRICA |
|1|	Egypt | 100.803 | AFRICA |

***

To process the data we first load both files from the `data_raw` folder as data frames. From the page data we then remove all entries that are templates and thus not needed.

In [None]:
# load
page_data = read_data_raw('page_data.csv')
country_data = read_data_raw('export_2019.csv', ';')

# remove template pages
page_data = page_data[~page_data['page'].str.startswith('Template:')]

### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |

For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can [read more](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) about what these assessment classes mean on English Wikipedia. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these six categories to any `rev_id`. You need to extract all `rev_id`s in the `page_data.csv` file and use the ORES API to get the predicted quality score for that specific article revision.

### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.
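As a minimal sketch of how the `prediction` value could be pulled out of such a response, the example below works on a hand-written dictionary rather than on real API output. The nesting follows the access path used in this notebook; the shape of the error entry and the helper name `extract_prediction` are assumptions made for illustration.

```python
# Hand-written sample that mimics the nesting of an ORES response.
# The predictions and the error entry are illustrative, not real API output.
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
            '235107991': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

def extract_prediction(response, rev_id):
    '''Returns the predicted quality class for rev_id, or None if it is missing.'''
    try:
        return response['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
    except KeyError:
        # No 'score' entry or unknown rev id: treat as "no score available"
        return None

predictions = {
    rev_id: extract_prediction(sample_response, rev_id)
    for rev_id in (355319463, 235107991)
}
```

Revisions that map to `None` here are the ones that would end up in the `ORES_no_scores.csv` log.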

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).

You can use the following **sample code for API calls**:

We then define functions and header information to fetch data from the **ORES REST API** and process the fetched data so that we get quality predictions for each page.

In [None]:
# Header information for the ORES API call
headers = {
    'User-Agent': 'https://github.com/marisanest',
    'From': 'marisa.f.nest@fu-berlin.de'
}

def get_ores_data(rev_ids, headers):
    '''
    Fetches ORES scores for given rev ids. 

    Parameters
    ----------
    rev_ids : list
        List of rev ids.
    headers : dict
        headers for ORES call
    
    Returns
    -------
    dict
        ORES scores as dict
    '''
    
    # Define the endpoint
    # Example: https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    # Convert the rev ids to valid parameters
    if len(rev_ids) > 1:
        rev_ids = '|'.join(str(rev_id) for rev_id in rev_ids)
    else:
        rev_ids = str(rev_ids[0])
    
    # Define parameters
    params = {
        'project' : 'enwiki',
        'model'   : 'wp10',
        'revids'  : rev_ids
    }
    
    # Call API (pass the headers so that the request identifies its sender)
    api_call = requests.get(endpoint.format(**params), headers=headers)
    # Convert response to a dict
    response = api_call.json()
    
    return response

def get_batches(data, batch_size=50):
    '''
    Converts given data into batches with given size.

    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        Data which should be converted to batches
    batch_size : int
        Size of each batch
    
    Returns
    -------
    pandas.core.groupby.generic.DataFrameGroupBy
        Batches
    '''
    return data.groupby(np.arange(len(data)) // batch_size)

def get_quality_scores(page_data):
    '''
    Fetches ORES quality scores for given pages.

    Parameters
    ----------
    page_data : pandas.core.frame.DataFrame
        Pages for which quality scores should be fetched.
    
    Returns
    -------
    pandas.core.frame.DataFrame
        Valid quality scores for each page
    pandas.core.frame.DataFrame
        Pages for which no valid quality scores could be fetched
    '''
    
    # Collect rows in plain lists and build the data frames once at the end
    # (DataFrame.append was removed in pandas 2.0)
    quality_score_rows = []
    error_rows = []
    
    # Get page data as batches with batch_size = 50
    batches = get_batches(page_data)

    # Iterate over all batches
    for batch_index, batch in batches:
        print(f'Batch {batch_index + 1}/{len(batches)}', end='\r')
        
        # Extract all batch rev ids as list
        batch_rev_ids = batch.rev_id.values
        
        # Get ORES data for all batch rev ids
        ores_data = get_ores_data(batch_rev_ids, headers)
    
        # Iterate over all pages within the batch
        for _, page in batch.iterrows():
            try:
                # Extract quality score from ORES data
                quality_score = ores_data['enwiki']['scores'][str(page.rev_id)]['wp10']['score']['prediction']
                
                # No KeyError occurred, so keep the page together with its quality score
                quality_score_rows.append({**dict(page), 'quality_score': quality_score})
            except KeyError:
                # KeyError due to missing quality score: log the page as an error
                error_rows.append(dict(page))

    # Build the data frames with a fixed column order
    quality_score_data = pd.DataFrame(
        quality_score_rows, columns=['page', 'country', 'rev_id', 'quality_score']
    )
    error_data = pd.DataFrame(error_rows, columns=['page', 'country', 'rev_id'])

    # Convert rev id columns to int
    quality_score_data.rev_id = quality_score_data.rev_id.astype(int)
    error_data.rev_id = error_data.rev_id.astype(int)
    
    # Return data
    return quality_score_data, error_data

Sending one request for each `rev_id` might take some time. If you want to send batches you can use `'|'.join(str(x) for x in revision_ids)` to put your ids together. Please make sure to deal with [exception handling](https://www.w3schools.com/python/python_try_except.asp) of the `KeyError` exception when extracting the `prediction` from the `JSON` response.
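The batching idea can be sketched as follows. The helper name `build_revid_batches` is hypothetical, and the batch size of 50 simply mirrors the default used in this notebook; check the API usage documentation for the limits that actually apply.

```python
def build_revid_batches(revision_ids, batch_size=50):
    '''Splits revision ids into '|'-joined strings of at most batch_size ids each.'''
    return [
        '|'.join(str(x) for x in revision_ids[start:start + batch_size])
        for start in range(0, len(revision_ids), batch_size)
    ]

# Three ids with a batch size of two yield one full and one partial batch
batches = build_revid_batches([235107991, 355319463, 807420979], batch_size=2)
```

Each resulting string can be used directly as the `revids` parameter of one request.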

### Combining the datasets

Now you need to combine both datasets: (1) the Wikipedia articles and their ORES quality scores and (2) the population data. Both have columns named `country`. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a `CSV` file called `countries_no_match.csv`. Consolidate the remaining data into a single `CSV` file called `politicians_by_country.csv`.

The schema for that file should look like the following table:


| article_name | country | region | revision_id | article_quality | population |
|--------------|---------|--------|-------------|-----------------|------------|
| Bir I of Kanem | Chad  | AFRICA | 807422778 | Stub | 16877000 |
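One possible way to separate matched from unmatched rows is the `indicator` option of `pandas.merge`, sketched here on toy data. The frame contents (including the country `Atlantis`) are made up for illustration, and this is an alternative to checking for `NaN` values, not necessarily the approach this notebook takes.

```python
import pandas as pd

# Toy data for illustration only
articles = pd.DataFrame({
    'page': ['A', 'B', 'C'],
    'country': ['Chad', 'Chad', 'Atlantis'],
})
population = pd.DataFrame({
    'country': ['Chad', 'Egypt'],
    'population': [16.877, 100.803],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = pd.merge(articles, population, on='country', how='outer', indicator=True)

# Rows present in both sources form the final dataset
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
# Rows found in only one source go to the no-match log
no_match = merged[merged['_merge'] != 'both'].drop(columns='_merge')
```

Compared with filtering on `NaN`, the indicator column states explicitly which side a row came from, so rows that carry a legitimate missing value in another column are not misclassified.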

Now, we fetch ORES quality scores for each page. All pages for which no quality score could be found are saved in the `data_clean` directory to keep track of all errors that occurred.

In [None]:
# Get quality score data
quality_score_data, quality_score_error_data = get_quality_scores(page_data)

# Save error data
save_data_clean(quality_score_error_data, 'ORES_no_scores.csv')

The fetched quality scores (including all page info) and the country data are then merged, processed and saved to the `data_clean` directory to get a final processed data file for the following analysis step. Again, error data (rows for which no matching country could be found) is saved as well.

In [None]:
# Merge quality score data and country data by country
page_country_data = pd.merge(quality_score_data, country_data, on=['country'], how='outer')

# Extract error data from merged data (all rows with at least one value = NaN)
page_country_error_data = page_country_data[page_country_data.isnull().any(axis=1)]

# Save error data
save_data_clean(page_country_error_data, 'countries_no_match.csv')

# Extract valid data from merged data (all rows without any value = NaN)
page_country_data = page_country_data[~page_country_data.isnull().any(axis=1)]

# Save merged data
save_data_clean(page_country_data, 'politicians_by_country.csv')

## Step 3⃣ | Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population (we can also call it `coverage`) and high-quality articles (we can also call it `relative-quality`) for **each country** and for **each region**. By `"high quality"` article we mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
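The two examples above can be reproduced with a few lines of arithmetic. The helper name `as_percentage` is hypothetical and only serves this illustration.

```python
def as_percentage(part, whole):
    '''Returns part / whole expressed as a percentage.'''
    return 100 * part / whole

# 10 articles for a population of 10,000 people
coverage = as_percentage(10, 10_000)
# 2 high-quality articles out of 10 articles
relative_quality = as_percentage(2, 10)
```

Here `coverage` corresponds to the `0.1%` and `relative_quality` to the `20%` from the examples.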

### Results format

The results from this analysis are six `data tables`. Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment. The tables will show:

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
1. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

**❗Hint:** You will find what country belongs to which region (e.g. `ASIA`) also in `export_2019.csv`. You need to calculate the total population per region. For that you could use `groupby` and also check out `apply`.
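A minimal sketch of the regional total, assuming a frame with one row per country in the shape of `export_2019.csv`. The first two rows reuse the sample values shown earlier; the third row (`Exampleland`) is made up for illustration.

```python
import pandas as pd

# One row per country, population in millions
countries = pd.DataFrame({
    'country': ['Algeria', 'Egypt', 'Exampleland'],
    'population': [44.357, 100.803, 10.0],
    'region': ['AFRICA', 'AFRICA', 'ASIA'],
})

# Total population per region (still in millions)
region_population = countries.groupby('region')['population'].sum()
```

The sum has to run over one row per country. Summing the `population` column of a table that holds one row per article would count each country's population once for every article.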

To start the analysis, we load the resulting data from step 2 as data frame.

In [None]:
politician_country_data = read_data_clean('politicians_by_country.csv')

First, we run the coverage analysis. For this we define a helper function that prepares the data for the desired tables.

In [None]:
def get_coverage_data(data, groupby, agg):
    '''
    Generates coverage data for given groupby and aggregation parameters.

    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        Data for which coverage data should be generated
    groupby : list
        Column names which should be grouped
    agg : dict
        Parameters which define the aggregation method
    
    Returns
    -------
    pandas.core.frame.DataFrame
        Coverage data
    '''
    # Group
    coverage_data = data.groupby(groupby)
    # Aggregate
    coverage_data = coverage_data.agg(**agg)
    # Reset index
    coverage_data = coverage_data.reset_index()
    # Calculate coverage
    coverage_data['coverage'] = coverage_data.politicians / (coverage_data.population * 1e6)
    # Sort coverage descending
    coverage_data = coverage_data.sort_values('coverage', ascending=False)
    # Drop not needed columns
    coverage_data = coverage_data.drop(columns=['population', 'politicians'])
    # Reset index and drop old index
    coverage_data = coverage_data.reset_index(drop=True)
    
    return coverage_data

Then we generate the needed data for the coverage dependent analysis: once per country and once per region.

In [None]:
# Country dependent coverage data
country_coverage_data = get_coverage_data(
    politician_country_data, 
    ['country', 'population'], 
    {
        'politicians':('population', 'count')
    }
)

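# Region dependent coverage data
# Aggregate per country first so that each country's population is counted
# once per region instead of once per article
country_level_data = (
    politician_country_data
    .groupby(['region', 'country', 'population'])
    .size()
    .reset_index(name='politicians')
)
region_coverage_data = get_coverage_data(
    country_level_data, 
    ['region'], 
    {
        'population':('population', 'sum'), 
        'politicians':('politicians', 'sum')
    }
)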

Each output table is saved to the `results` directory and afterward loaded and shown within the notebook.

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [None]:
save_results(country_coverage_data.head(10), 'country_coverage_data_top_10.csv')

In [None]:
read_results('country_coverage_data_top_10.csv')

2. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [None]:
save_results(country_coverage_data.tail(10), 'country_coverage_data_bottom_10.csv')

In [None]:
read_results('country_coverage_data_bottom_10.csv')

5. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [None]:
save_results(region_coverage_data, 'region_coverage_data.csv')

In [None]:
read_results('region_coverage_data.csv')

Second, we run the relative quality analysis. For this we define helper functions that prepare the data for the desired tables.

In [None]:
def absolut_quality_count(pages):
    '''
    Counts all pages with a high quality.

    Parameters
    ----------
    pages : pandas.core.series.Series
        Quality predictions of the pages for which high quality should be counted
    
    
    Returns
    -------
    int
        Counter
    '''
    
    # Count all entries predicted as good article (GA) or featured article (FA)
    return int(pages.isin(['GA', 'FA']).sum())

def get_relative_quality_data(data, groupby, agg):
    '''
    Generates relative quality data for given groupby and aggregation parameters.

    Parameters
    ----------
    data : pandas.core.frame.DataFrame
        Data for which relative quality data should be generated
    groupby : list
        Column names which should be grouped
    agg : dict
        Parameters which define the aggregation method
    
    Returns
    -------
    pandas.core.frame.DataFrame
        Relative quality data
    '''
    # Group
    relative_quality_data = data.groupby(groupby)
    # Aggregate
    relative_quality_data = relative_quality_data.agg(**agg)
    # Reset index
    relative_quality_data = relative_quality_data.reset_index()
    # Calculate relative quality
    relative_quality_data['relative_quality'] = relative_quality_data.absolut_quality / relative_quality_data.politicians
    # Sort relative quality descending
    relative_quality_data = relative_quality_data.sort_values('relative_quality', ascending=False)
    # Drop not needed columns
    relative_quality_data = relative_quality_data.drop(columns=['politicians', 'absolut_quality'])
    # Reset index and drop old index
    relative_quality_data = relative_quality_data.reset_index(drop=True)
    
    return relative_quality_data

Then we generate the needed data for the relative quality dependent analysis: once per country and once per region.

In [None]:
# Country dependent Relative quality data
country_relative_quality_data = get_relative_quality_data(
    politician_country_data, 
    ['country'], 
    {
        'politicians': ('quality_score', 'count'), 
        'absolut_quality': ('quality_score', absolut_quality_count)
    }
)

# Region dependent Relative quality data
region_relative_quality_data = get_relative_quality_data(
    politician_country_data, 
    ['region'], 
    {
        'politicians': ('quality_score', 'count'), 
        'absolut_quality': ('quality_score', absolut_quality_count)
    }
)

Each output table is saved to the `results` directory and afterward loaded and shown within the notebook.

3. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [None]:
save_results(country_relative_quality_data.head(10), 'country_relative_quality_data_top_10.csv')

In [None]:
read_results('country_relative_quality_data_top_10.csv')

4. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [None]:
save_results(country_relative_quality_data.tail(10), 'country_relative_quality_data_bottom_10.csv')

In [None]:
read_results('country_relative_quality_data_bottom_10.csv')

6. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [None]:
save_results(region_relative_quality_data, 'region_relative_quality_data.csv')

In [None]:
read_results('region_relative_quality_data.csv')

***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).