Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data
Please follow the reproducibility workflow as practiced during the last exercise.

In [19]:
# import libraries
import pandas as pd

## Step 1⃣ | Data acquisition

You will use two data sources: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `_data` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

In [20]:
# import data
df_articles = pd.read_csv('data_raw/page_data.csv')
df_population = pd.read_csv('data_raw/export_2019.csv', delimiter=';')

# View data
df_articles.head()
df_population.head()

Unnamed: 0,country,population,region
0,Algeria,44.357,AFRICA
1,Egypt,100.803,AFRICA
2,Libya,6.891,AFRICA
3,Morocco,35.952,AFRICA
4,Sudan,43.849,AFRICA


## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contains some rows that you will need to filter out: rows whose page name starts with the string `"Template:"`. These pages are not Wikipedia articles and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.
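A minimal sketch of this filtering step, assuming the column layout of `page_data.csv` (`page`, `country`, `rev_id`). The two rows are the example rows shown for `page_data.csv` in this step; the variable names are only illustrative.

```python
import pandas as pd

# Toy rows mirroring the layout of page_data.csv
sample = pd.DataFrame({
    'page': ['Template:ZambiaProvincialMinisters', 'Bir I of Kanem'],
    'country': ['Zambia', 'Chad'],
    'rev_id': [235107991, 355319463],
})

# Flag rows whose page name starts with the "Template:" prefix
is_template = sample['page'].str.startswith('Template:')

# Keep only the real articles
articles_only = sample[~is_template].reset_index(drop=True)
print(articles_only)
```

Using `str.startswith` rather than `str.contains` matches the wording of the task: only a leading `Template:` marks a non-article page.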

***

| | `page_data.csv` | | |
|-|------|---------|--------|
| | **page** | **country** | **rev_id** |
|0|	Template:ZambiaProvincialMinisters | Zambia | 235107991 |
|1|	Bir I of Kanem | Chad | 355319463 |

***

| | `export_2019.csv` | | |
|-|------|---------|--------|
| | **country** | **population** | **region** |
|0|	Algeria | 44.357 | AFRICA |
|1|	Egypt | 100.803 | AFRICA |

***

In [21]:
# filter out rows whose page name starts with "Template:"
df_articles = df_articles[~df_articles['page'].str.startswith('Template:')]
df_articles.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |

For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can [read more](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) about what these assessment classes mean on English Wikipedia. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these six categories to any `rev_id`. You need to extract all `rev_id`s in the `page_data.csv` file and use the ORES API to get the predicted quality score for that specific article revision.

### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.
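The following sketch shows how the `prediction` value could be pulled out of a parsed response. The dictionary is a hand-written stand-in, not real API output: its nesting follows the access path `['enwiki']['scores'][rev_id]['wp10']['score']['prediction']`, the probability values are made up, and the shape of the `error` entry is an assumption. The helper name `extract_predictions` is hypothetical.

```python
# Hand-written stand-in for a parsed ORES response (values are illustrative)
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {'wp10': {'score': {
                'prediction': 'Stub',
                'probability': {'FA': 0.01, 'GA': 0.01, 'B': 0.02,
                                'C': 0.03, 'Start': 0.13, 'Stub': 0.80},
            }}},
            # Assumed shape of a revision without a score: no 'score' key
            '516633096': {'wp10': {'error': {'message': 'placeholder',
                                             'type': 'placeholder'}}},
        }
    }
}


def extract_predictions(response, project='enwiki', model='wp10'):
    """Return (predictions, failed_ids) from a parsed ORES-style response."""
    predictions, failed = {}, []
    for rev_id, result in response[project]['scores'].items():
        try:
            predictions[rev_id] = result[model]['score']['prediction']
        except KeyError:
            # No score available for this revision: remember it for the log
            failed.append(rev_id)
    return predictions, failed


predictions, failed = extract_predictions(sample_response)
print(predictions, failed)
```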

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).
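One way to write such a log is sketched below. The list `failed_rev_ids` is a hypothetical stand-in for whatever your own request loop collects, and the temporary output directory is only used to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy article table with the same columns as page_data.csv
articles = pd.DataFrame({
    'page': ['Bir I of Kanem', 'Tingtingru'],
    'country': ['Chad', 'Vanuatu'],
    'rev_id': [355319463, 550682925],
})

# Hypothetical: revision ids for which no score could be retrieved
failed_rev_ids = [550682925]

# Select the affected articles and keep the columns required for the log
no_scores = articles[articles['rev_id'].isin(failed_rev_ids)]
no_scores = no_scores[['page', 'country', 'rev_id']]

# Write the log (a temporary directory stands in for your project folder)
out_path = os.path.join(tempfile.mkdtemp(), 'ORES_no_scores.csv')
no_scores.to_csv(out_path, index=False)
```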

You can use the following **sample code for API calls**:

In [22]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/malina-scheuer',
    'From': 'm.scheuer@campus.tu-berlin.de'
}

def get_ores_data(rev_id, headers):
    
    # Define the endpoint
    # https://ores.wikimedia.org/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : rev_id
              }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data

Sending one request for each `rev_id` might take some time. If you want to send batches, you can use `'|'.join(str(x) for x in revision_ids)` to put your ids together. Please make sure to [handle](https://www.w3schools.com/python/python_try_except.asp) the `KeyError` exception when extracting the `prediction` from the `JSON` response.
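As a sketch of the batching idea, the helper below splits a list of revision ids into pipe-separated strings. The name `make_batches` and the batch sizes are assumptions for illustration; check the API usage notes for a sensible size.

```python
def make_batches(revision_ids, batch_size=25):
    """Yield pipe-separated strings of at most batch_size revision ids."""
    # Stepping by batch_size never produces an empty trailing batch
    for start in range(0, len(revision_ids), batch_size):
        chunk = revision_ids[start:start + batch_size]
        yield '|'.join(str(x) for x in chunk)


revision_ids = [235107991, 355319463, 393276188, 393822005, 395521877]
batches = list(make_batches(revision_ids, batch_size=2))
print(batches)
```

Each string in `batches` can be passed as the `revids` parameter of one request.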

In [23]:
# extract rev ids
rev_ids = [i for i in df_articles['rev_id']]

# requesting quality scores in batches
quality = []
errors = []

# step through the ids in chunks of 25 (avoids an empty final batch)
for start in range(0, len(rev_ids), 25):

    rev_id_batch = '|'.join(str(x) for x in rev_ids[start:start + 25])
    data = get_ores_data(rev_id_batch, headers)
    data = json.loads(data)

    for key in data['enwiki']['scores'].keys():
        try:
            prediction = data['enwiki']['scores'][str(key)]['wp10']['score']['prediction']
            quality.append([key, prediction])
        except KeyError:
            errors.append(key)

# assert len(rev_ids) == len(quality) + len(errors), 'Not working'

In [60]:
# dataframe with quality scores
df_quality = pd.DataFrame(quality, columns=['rev_id', 'article_quality'])
df_quality['rev_id'] = df_quality['rev_id'].astype(int)

df_quality.head()

Unnamed: 0,rev_id,article_quality
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [24]:
# dataframe with errors
df_errors = pd.DataFrame(errors, columns=['rev_id'])
df_errors['rev_id'] = df_errors['rev_id'].astype(int)
df_errors = pd.merge(df_errors, df_articles, on='rev_id', how='left')

df_errors.head()

Unnamed: 0,rev_id,page,country
0,516633096,List of politicians in Poland,Poland
1,550682925,Tingtingru,Vanuatu
2,627547024,Daud Arsala,Afghanistan
3,671484594,Bharat Saud,Nepal
4,684023803,Robert Sych,Poland


### Combining the datasets

Now you need to combine both datasets: (1) the Wikipedia articles and their ORES quality scores and (2) the population data. Both have columns named `country`. After merging the data, you'll invariably run into entries which cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a `CSV` file called `countries-no_match.csv`. Consolidate the remaining data into a single `CSV` file called `politicians_by_country.csv`.

The schema for that file should look like the following table:


| article_name | country | region | revision_id | article_quality | population |
|--------------|---------|--------|-------------|-----------------|------------|
| Bir I of Kanem | Chad  | AFRICA | 807422778 | Stub | 16877000 |
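One possible way to separate matching from non-matching rows is the `indicator` option of `pd.merge`, sketched here on toy data. The article `Example Politician`, the country `Atlantis`, and the revision id `1` are made up for illustration.

```python
import pandas as pd

articles = pd.DataFrame({
    'article_name': ['Bir I of Kanem', 'Example Politician'],
    'country': ['Chad', 'Atlantis'],          # 'Atlantis' is a made-up country
    'revision_id': [355319463, 1],
    'article_quality': ['Stub', 'Stub'],
})
population = pd.DataFrame({
    'country': ['Chad', 'Algeria'],
    'region': ['AFRICA', 'AFRICA'],
    'population': [16.877, 44.357],           # in millions
})

# The outer merge keeps every row; '_merge' records where each row came from
merged = pd.merge(articles, population, on='country', how='outer', indicator=True)

# Rows present in both tables form the final dataset
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')

# Rows present in only one table go to the no-match file
no_match = merged[merged['_merge'] != 'both'].drop(columns='_merge')
```

Here `matched` contains only Chad, while `no_match` holds both the article without population data and the country without articles.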

In [86]:
# merge dataframes
df_politicians = pd.merge(df_articles, df_quality, on='rev_id', how='outer')
df_politicians = pd.merge(df_politicians, df_population, on='country', how='outer')

# politicians dataframe
df_politicians = df_politicians[['page','country','region','rev_id','article_quality','population']]
df_politicians = df_politicians.rename(columns={'page':'article_name','rev_id':'revision_id'})

# dataframe with non matching
df_countries_no_match = df_politicians[df_politicians['population'].isnull()]

df_politicians.head()

Unnamed: 0,article_name,country,region,revision_id,article_quality,population
0,Bir I of Kanem,Chad,AFRICA,355319463.0,Stub,16.877
1,Abdullah II of Kanem,Chad,AFRICA,498683267.0,Stub,16.877
2,Salmama II of Kanem,Chad,AFRICA,565745353.0,Stub,16.877
3,Kuri I of Kanem,Chad,AFRICA,565745365.0,Stub,16.877
4,Mohammed I of Kanem,Chad,AFRICA,565745375.0,Stub,16.877


In [87]:
# save final datasets
df_politicians.to_csv('data_clean/politicians_by_country.csv', index=False)
df_countries_no_match.to_csv('data_clean/countries_no_match.csv', index=False)

## Step 3⃣ | Analysis

Your analysis will consist of calculating, for **each country** and for **each region**, the proportion (as a percentage) of articles per population (we can also call it `coverage`) and the proportion of high-quality articles (we can also call it `relative-quality`). By a `"high quality"` article we mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
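The two examples above can be checked with a few lines of arithmetic. The helper name `percentage` is only illustrative.

```python
def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100 * part / whole


# 10 articles for a population of 10,000 people
coverage = percentage(10, 10_000)

# 2 FA/GA articles out of 10 articles in total
relative_quality = percentage(2, 10)

print(coverage, relative_quality)
```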

### Results format

The results from this analysis are six `data tables`. Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment. The tables will show:

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
1. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

**❗Hint:** You will also find which country belongs to which region (e.g. `ASIA`) in `export_2019.csv`. You need to calculate the total population per region. For that you could use `groupby` and also check out `apply`.
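A minimal sketch of the regional aggregation, assuming a country-level table with a `region` column and population in millions. The article counts are hypothetical and the column name `coverage_pct` is an assumption.

```python
import pandas as pd

# Toy country-level table; population in millions, as in export_2019.csv
countries = pd.DataFrame({
    'country': ['Algeria', 'Egypt', 'Cyprus'],
    'region': ['AFRICA', 'AFRICA', 'ASIA'],
    'population': [44.357, 100.803, 1.207],
    'articles': [10, 20, 5],                  # hypothetical article counts
})

# Sum population and article counts per region
regions = countries.groupby('region')[['population', 'articles']].sum()

# Convert millions to absolute numbers, then compute coverage as a percentage
regions['population'] = regions['population'] * 1_000_000
regions['coverage_pct'] = 100 * regions['articles'] / regions['population']

# Rank regions in descending order of coverage
regions = regions.sort_values('coverage_pct', ascending=False)
print(regions)
```

Summing first and dividing afterwards gives the coverage of the region as a whole, which is different from averaging the per-country percentages.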

### Coverage

In [190]:
# coverage dataframe (articles per population)
df_coverage = df_articles.groupby('country')['page'].count()
df_coverage = pd.DataFrame(df_coverage)
df_coverage = df_coverage.rename(columns={'page':'articles'})
df_coverage = pd.merge(df_coverage, df_population, on='country',how='left')
df_coverage['population'] = df_coverage['population']*1000000

#### 1. Top 10 countries by coverage

In [116]:
# Print Top 10 countries 
df_coverage['articles_per_population'] = df_coverage['articles']/df_coverage['population']
df_coverage = df_coverage.sort_values('articles_per_population', ascending=False)

df_coverage.head(10)

Unnamed: 0,country,articles,population,region,articles_per_population
205,Tuvalu,54,10000.0,OCEANIA,0.0054
2,Albania,457,2838000.0,EUROPE,0.000161
138,New Zealand,784,4987000.0,OCEANIA,0.000157
143,Norway,656,5387000.0,EUROPE,0.000122
126,Moldova,424,3535000.0,EUROPE,0.00012
59,Estonia,149,1331000.0,EUROPE,0.000112
64,Finland,570,5529000.0,EUROPE,0.000103
169,Sao Tome and Principe,21,210000.0,AFRICA,0.0001
112,Lithuania,244,2794000.0,EUROPE,8.7e-05
47,Cyprus,98,1207000.0,ASIA,8.1e-05


#### 2. Bottom 10 countries by coverage

In [191]:
# Print Bottom 10 countries 
df_coverage['articles_per_population'] = df_coverage['articles']/df_coverage['population']
df_coverage = df_coverage.sort_values('articles_per_population', ascending=True)

df_coverage.head(10)

Unnamed: 0,country,articles,population,region,articles_per_population
79,Guyana,20,787000000.0,LATIN AMERICA AND THE CARIBBEAN,2.541296e-08
51,Djibouti,37,988000000.0,AFRICA,3.744939e-08
18,Belize,16,419000000.0,LATIN AMERICA AND THE CARIBBEAN,3.818616e-08
15,Barbados,14,287000000.0,LATIN AMERICA AND THE CARIBBEAN,4.878049e-08
12,Bahamas,20,393000000.0,LATIN AMERICA AND THE CARIBBEAN,5.089059e-08
189,Suriname,40,605000000.0,LATIN AMERICA AND THE CARIBBEAN,6.61157e-08
32,Cape Verde,37,556000000.0,AFRICA,6.654676e-08
66,French Guiana,27,294000000.0,LATIN AMERICA AND THE CARIBBEAN,9.183673e-08
122,Martinique,34,356000000.0,LATIN AMERICA AND THE CARIBBEAN,9.550562e-08
129,Montenegro,72,622000000.0,EUROPE,1.157556e-07


### Relative quality

In [237]:
# dataframe with high quality articles
df_high_quality = df_politicians[(df_politicians.article_quality =='FA') | (df_politicians.article_quality =='GA')]

# high quality articles per country
df_high_quality_c = pd.DataFrame(df_high_quality.groupby('country')['article_name'].count())

# total articles per country
df_articles_c = pd.DataFrame(df_articles.groupby('country')['page'].count())

# merge
df_rel_quality_c = pd.merge(df_high_quality_c, df_articles_c, on='country',how='left')
df_rel_quality_c = df_rel_quality_c.rename(
    columns={'article_name':'high_quality',
            'page':'total'})

df_rel_quality_c['relative_quality'] = df_rel_quality_c['high_quality']/df_rel_quality_c['total']

#### 3. Top 10 countries by relative quality

In [238]:
# Print Top 10 
df_rel_quality_c = df_rel_quality_c.sort_values('relative_quality', ascending=False)

df_rel_quality_c.head(10)

Unnamed: 0_level_0,high_quality,total,relative_quality
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Korea, North",8,36,0.222222
Rhodesian,10,75,0.133333
Saudi Arabia,15,118,0.127119
Romania,42,343,0.122449
Central African Republic,8,66,0.121212
Uzbekistan,3,28,0.107143
Mauritania,5,48,0.104167
Guatemala,7,83,0.084337
Dominica,1,12,0.083333
Syria,10,129,0.077519


#### 4. Bottom 10 countries by relative quality

In [239]:
# Print Bottom 10 
df_rel_quality_c = df_rel_quality_c.sort_values('relative_quality', ascending=True)

df_rel_quality_c.head(10)

Unnamed: 0_level_0,high_quality,total,relative_quality
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Belgium,1,520,0.001923
Tanzania,1,405,0.002469
Switzerland,1,403,0.002481
Nepal,1,361,0.00277
Peru,1,350,0.002857
Nigeria,2,679,0.002946
Portugal,1,320,0.003125
Colombia,1,285,0.003509
Czech Republic,1,251,0.003984
Lithuania,1,244,0.004098


### Regions

In [217]:
# dataframe articles per region
df_articles_region = pd.merge(df_articles, df_population, on='country', how='left')

df_region_articles = pd.DataFrame(df_articles_region.groupby('region')['page'].count())
df_region_articles = df_region_articles.rename(columns={'page':'articles'})

# dataframe population per region
df_region_pop = pd.DataFrame(df_population.groupby('region')['population'].sum())

# merged
df_region_coverage = pd.merge(df_region_articles, df_region_pop, on='region',how='left')

# coverage percentage
df_region_coverage['population'] = df_region_coverage['population']*1000000
df_region_coverage['coverage'] = df_region_coverage['articles']/df_region_coverage['population']

#### 5. Regions by coverage

In [244]:
# regions by coverage
df_region_coverage = df_region_coverage.sort_values('coverage', ascending=False)

df_region_coverage.head(6)

Unnamed: 0_level_0,articles,population,coverage
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NORTHERN AMERICA,1940,368068000.0,5e-06
EUROPE,15858,3252940000.0,5e-06
ASIA,11767,6320228000.0,2e-06
AFRICA,6861,4718528000.0,1e-06
OCEANIA,3132,2858181000.0,1e-06
LATIN AMERICA AND THE CARIBBEAN,5284,4947246000.0,1e-06


#### 6. Regions by quality

In [240]:
# dataframe with high quality articles
df_high_quality = df_politicians[(df_politicians.article_quality =='FA') | (df_politicians.article_quality =='GA')]

# high quality articles per region
df_high_quality_r = pd.DataFrame(df_high_quality.groupby('region')['article_name'].count())

# total articles per region
df_articles_region = pd.merge(df_articles, df_population, on='country', how='left')
df_articles_region = pd.DataFrame(df_articles_region.groupby('region')['page'].count())

# merge
df_rel_quality_r = pd.merge(df_high_quality_r, df_articles_region, on='region',how='left')
df_rel_quality_r = df_rel_quality_r.rename(
    columns={'article_name':'high_quality',
            'page':'total'})

df_rel_quality_r['relative_quality'] = df_rel_quality_r['high_quality']/df_rel_quality_r['total']


# sort
df_rel_quality_r = df_rel_quality_r.sort_values('relative_quality', ascending=False)

df_rel_quality_r.head(6)

Unnamed: 0_level_0,high_quality,total,relative_quality
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NORTHERN AMERICA,104,1940,0.053608
ASIA,316,11767,0.026855
EUROPE,350,15858,0.022071
OCEANIA,63,3132,0.020115
AFRICA,119,6861,0.017344
LATIN AMERICA AND THE CARIBBEAN,76,5284,0.014383


***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).