Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data
Please follow the reproducibility workflow as practiced during the last exercise.

## Step 1⃣ | Data acquisition

You will use two data sources: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `_data` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

#### 1. Let's import the raw data as a pandas dataframe and look at it

In [1]:
import pandas as pd 
page_data = pd.read_csv("data_raw/page_data.csv") 
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [2]:
export_2019 = pd.read_csv("data_raw/export_2019.csv", sep=";")
export_2019.head()

Unnamed: 0,country,population,region
0,Algeria,44.357,AFRICA
1,Egypt,100.803,AFRICA
2,Libya,6.891,AFRICA
3,Morocco,35.952,AFRICA
4,Sudan,43.849,AFRICA


## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contains some rows that you will need to filter out: page names that start with the string `"Template:"` are not Wikipedia articles and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.

***

| | `page_data.csv` | | |
|-|------|---------|--------|
| | **page** | **country** | **rev_id** |
|0|	Template:ZambiaProvincialMinisters | Zambia | 235107991 |
|1|	Bir I of Kanem | Chad | 355319463 |

***

| | `export_2019.csv` | | |
|-|------|---------|--------|
| | **country** | **population** | **region** |
|0|	Algeria | 44.357 | AFRICA |
|1|	Egypt | 100.803 | AFRICA |

***

In [3]:
# Drop pages whose name starts with "Template:" (these are not articles)
page_data = page_data[~page_data.page.str.startswith("Template:")]
page_data

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |

For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can [read more](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) about what these assessment classes mean on English Wikipedia. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these six categories to any `rev_id`. You need to extract all `rev_id`s in the `page_data.csv` file and use the ORES API to get the predicted quality score for that specific article revision.

### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.
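As a rough illustration of pulling out only the prediction, the sketch below walks a hand-written response. The nesting (`enwiki` → `scores` → rev_id → `wp10` → `score`) follows the keys used by the code in this notebook; the probability values and the shape of the `error` entry are assumptions for illustration, not real ORES output, and `extract_predictions` is a hypothetical helper.

```python
# Hypothetical response; probabilities and the error entry are illustrative only.
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {
                "wp10": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "FA": 0.01, "GA": 0.02, "B": 0.03,
                            "C": 0.04, "Start": 0.10, "Stub": 0.80,
                        },
                    }
                }
            },
            # Assumed shape for a revision that could not be scored
            "123": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}


def extract_predictions(response, project="enwiki", model="wp10"):
    """Map each rev_id in the response to its predicted class, or None if unscored."""
    predictions = {}
    for rev_id, models in response[project]["scores"].items():
        score = models.get(model, {}).get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


predictions = extract_predictions(sample_response)
print(predictions)
```

Returning `None` for unscored revisions keeps them distinguishable from the six valid class names.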

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).
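One possible way to build that log is sketched below on a toy dataframe. The article names and rev_ids are made up, and `-1` as the "no score" marker is an assumption that mirrors the convention of `get_ores_prediction` in this notebook. The sketch writes to an in-memory buffer; in the notebook the target would be `ORES_no_scores.csv`.

```python
import io

import pandas as pd

# Toy stand-in for page_data joined with predictions (all values hypothetical)
scored = pd.DataFrame({
    "page": ["Example Article A", "Example Article B", "Example Article C"],
    "country": ["Chad", "Chad", "Cambodia"],
    "rev_id": [1, 2, 3],
    "prediction": ["Stub", -1, "GA"],
})

# Rows whose prediction is the -1 marker have no ORES score
no_score_mask = scored["prediction"].astype(str) == "-1"
no_scores = scored.loc[no_score_mask, ["page", "country", "rev_id"]]
with_scores = scored.loc[~no_score_mask]

# In the notebook: no_scores.to_csv("ORES_no_scores.csv", index=False)
buffer = io.StringIO()
no_scores.to_csv(buffer, index=False)
print(buffer.getvalue())
```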

You can use the following **sample code for API calls**:

In [4]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/mvrcx',
    'From': 'm.oprisiu@fu-berlin.de'
}

def get_ores_data(rev_id, headers):
    """
    queries ORES for one rev_id (or several joined by '|') and returns the response as a JSON string
    """
    # Define the endpoint, e.g.
    # https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : rev_id
              }

    # Pass the headers along so that the request identifies its sender
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data

#### Defining functions that return the score and probability for a given rev_id

In [5]:
def get_ores_prediction(rev_id, headers):
    """
    returns score for given rev_id. If no score is available for rev_id the function returns -1 
    """
    try:
        res = json.loads(get_ores_data(rev_id, headers))
        prediction = res['enwiki']['scores'][str(rev_id)]['wp10']['score']['prediction']
        return prediction
    except KeyError:
        return -1

def get_ores_prediction_value(rev_id, headers):
    """
    might need this later, not sure yet
    returns the probability of the predicted class for given rev_id. If no probability is available for rev_id the function returns -1 
    """
    try:
        res = json.loads(get_ores_data(rev_id, headers))
        score = res['enwiki']['scores'][str(rev_id)]['wp10']['score']
        # Look up the probability that belongs to the predicted class
        value = score['probability'][score['prediction']]
        return value
    except KeyError:
        return -1

dataframe = pd.DataFrame(columns = ["rev_id"]) #Create new dataframe
dataframe["rev_id"] = page_data['rev_id'] #Initialize with rev_ids
dataframe

Unnamed: 0,rev_id
1,355319463
10,393276188
12,393822005
23,395521877
24,395526568
...,...
47192,807482007
47193,807483006
47194,807483153
47195,807483270


Sending one request for each `rev_id` might take some time. If you want to send batches, you can use `'|'.join(str(x) for x in revision_ids)` to put your ids together. Please make sure to handle the `KeyError` exception (see [exception handling](https://www.w3schools.com/python/python_try_except.asp)) when extracting the `prediction` from the `JSON` response.
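A minimal sketch of batching is shown here. `chunked` and `build_batch_url` are hypothetical helpers, not part of the assignment code, and the batch size of 2 is only there to keep the demo readable; the size to use in practice is an assumption that should be taken from the API usage documentation linked above. The sketch only builds URLs and sends no requests.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_url(rev_ids, project="enwiki", model="wp10"):
    """Build one ORES v3 request URL covering all given revision ids."""
    joined = "|".join(str(rev_id) for rev_id in rev_ids)
    return f"https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={joined}"


revision_ids = [355319463, 393276188, 393822005, 395521877, 395526568]
batches = list(chunked(revision_ids, size=2))  # size=2 only for the demo
urls = [build_batch_url(batch) for batch in batches]
print(urls[0])
```

Each URL can then be requested once, and the predictions for all ids in the batch can be read from the single response.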

Ok, so it takes roughly 2h to compute a dataframe with 10,000 rows, so I'll process the data in chunks and save each chunk to a CSV file so that I don't have to compute it every single time. This might not be the most efficient algorithm :))))
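The caching idea can be sketched as a small helper that only computes a chunk when its CSV does not exist yet. `load_or_compute` and `slow_scoring` are hypothetical names, and `slow_scoring` merely stands in for the ORES requests; the `sep=";"` and `index_col=0` settings match the way the chunks are written and read in this notebook.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Return the cached CSV at `path` if it exists, otherwise compute and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path, sep=";", index_col=0)
    frame = compute()
    frame.to_csv(path, sep=";", index=True, header=True)
    return frame


calls = []


def slow_scoring():
    # Stand-in for the ORES requests; records each time it actually runs
    calls.append(1)
    return pd.DataFrame({"rev_id": [355319463], "prediction": ["Stub"]}, index=[1])


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "chunk.csv")
    first = load_or_compute(cache_path, slow_scoring)   # computes and writes the CSV
    second = load_or_compute(cache_path, slow_scoring)  # served from the CSV

print(len(calls))
```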

In [6]:
# Creating directories, 
# Source: https://stackoverflow.com/questions/11373610/save-matplotlib-file-to-a-directory
def mkdir_p(mypath):
    '''Creates a directory. equivalent to using mkdir -p on the command line'''

    from errno import EEXIST
    from os import makedirs,path

    try:
        makedirs(mypath)
    except OSError as exc: # Python >2.5
        if exc.errno == EEXIST and path.isdir(mypath):
            pass
        else: raise

# Creating directories
mkdir_p('data_clean')
mkdir_p('data_clean/chunks')

In [None]:
# Apply get_ores_prediction function on first 10,000 rows
df_1_10000 = dataframe[:10000].copy()
df_1_10000["prediction"] = df_1_10000["rev_id"].apply(lambda x: get_ores_prediction(int(x), headers))
df_1_10000.to_csv(r"data_clean/chunks/df_1_10000.csv",sep=";", index = True, header = True)

In [None]:
# Apply get_ores_prediction function on next 10,000 rows
df_9999_20000 = dataframe[10000:20000].copy()
df_9999_20000["prediction"] = df_9999_20000["rev_id"].apply(lambda x: get_ores_prediction(int(x), headers))
df_9999_20000.to_csv(r"data_clean/chunks/df_9999_20000.csv",sep=";", index = True, header = True)

In [None]:
# Apply get_ores_prediction function on next 10,000 rows
df_19999_30000 = dataframe[20000:30000].copy()
df_19999_30000["prediction"] = df_19999_30000["rev_id"].apply(lambda x: get_ores_prediction(int(x), headers))
df_19999_30000.to_csv(r"data_clean/chunks/df_19999_30000.csv",sep=";", index = True, header = True)

In [None]:
# Apply get_ores_prediction function on next 10,000 rows
df_29999_40000 = dataframe[30000:40000].copy()
df_29999_40000["prediction"] = df_29999_40000["rev_id"].apply(lambda x: get_ores_prediction(int(x), headers))
df_29999_40000.to_csv(r"data_clean/chunks/df_29999_40000.csv",sep=";", index = True, header = True)

In [None]:
# Apply get_ores_prediction function on remaining rows
df_39999_end = dataframe[40000:].copy()
df_39999_end["prediction"] = df_39999_end["rev_id"].apply(lambda x: get_ores_prediction(int(x), headers))
df_39999_end.to_csv(r"data_clean/chunks/df_39999_end.csv", sep=";", index = True, header = True)

In [7]:
# Parsing all csv's
df1 = pd.read_csv("data_clean/chunks/df_1_10000.csv", sep=";", index_col=0) 
df2 = pd.read_csv("data_clean/chunks/df_9999_20000.csv", sep=";", index_col=0)
df3 = pd.read_csv("data_clean/chunks/df_19999_30000.csv", sep=";", index_col=0)
df4 = pd.read_csv("data_clean/chunks/df_29999_40000.csv", sep=";", index_col=0)
df5 = pd.read_csv("data_clean/chunks/df_39999_end.csv", sep=";", index_col=0)

# Concatenating dataframe chunks into one dataframe
ores_scores = pd.concat([df1, df2, df3, df4, df5])
ores_scores

# Saving appended dataframe chunks to a big df

Unnamed: 0,rev_id,prediction
1,355319463,Stub
10,393276188,Stub
12,393822005,Stub
23,395521877,Stub
24,395526568,Stub
...,...,...
47192,807482007,GA
47193,807483006,C
47194,807483153,GA
47195,807483270,C


### Combining the datasets

Now you need to combine both datasets: (1) the Wikipedia articles and their ORES quality scores and (2) the population data. Both have columns named `country`. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a `CSV` file called `countries-no_match.csv`. Consolidate the remaining data into a single `CSV` file called `politicians_by_country.csv`.

The schema for that file should look like the following table:


| article_name | country | region | revision_id | article_quality | population |
|--------------|---------|--------|-------------|-----------------|------------|
| Bir I of Kanem | Chad  | AFRICA | 807422778 | Stub | 16877000 |
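One way to separate matching from non-matching rows is an outer merge with pandas' `indicator=True` option, sketched below on toy data. The article in "Atlantis" is made up to force a non-match on the Wikipedia side; the population figures are taken from `export_2019.csv` and are in millions. This is an illustration under those assumptions, not the required solution.

```python
import pandas as pd

articles = pd.DataFrame({
    "article_name": ["Bir I of Kanem", "Hypothetical Person"],
    "country": ["Chad", "Atlantis"],  # "Atlantis" is a made-up non-match
    "revision_id": [355319463, 1],
    "article_quality": ["Stub", "Stub"],
})
population = pd.DataFrame({
    "country": ["Chad", "French Polynesia"],
    "population": [16.877, 0.280],  # in millions, as in export_2019.csv
    "region": ["AFRICA", "OCEANIA"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")

# Arrange the matched rows according to the target schema
matched = matched[["article_name", "country", "region",
                   "revision_id", "article_quality", "population"]]
print(matched)
print(no_match)
```

Compared with filtering on `NaN` values afterwards, the indicator column records explicitly which side of the merge a row came from.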

In [8]:
# Merging the page_data and ores_scores dataframes (inner join on rev_id as key)
df = pd.merge(page_data, ores_scores, on ='rev_id')

# Renaming columns
df = df.rename(columns={'page': 'article_name','rev_id': 'revision_id', 'prediction': 'article_quality'})

In [9]:
import numpy as np
# Merging the above result dataframe with export_2019 (full outer join on country as key)
result = pd.merge(df, export_2019, on='country', how='outer')

# Reindexing columns
result = result.reindex(columns=['article_name', 'country', 'region', 'revision_id', 'article_quality', 'population'])

# replacing -1 (initially meant for "no prediction available") with empty string
result = result.replace('-1', '')
result


Unnamed: 0,article_name,country,region,revision_id,article_quality,population
0,Bir I of Kanem,Chad,AFRICA,355319463.0,Stub,16.877
1,Abdullah II of Kanem,Chad,AFRICA,498683267.0,Stub,16.877
2,Salmama II of Kanem,Chad,AFRICA,565745353.0,Stub,16.877
3,Kuri I of Kanem,Chad,AFRICA,565745365.0,Stub,16.877
4,Mohammed I of Kanem,Chad,AFRICA,565745375.0,Stub,16.877
...,...,...,...,...,...,...
46723,,French Polynesia,OCEANIA,,,0.280
46724,,Guam,OCEANIA,,,175.000
46725,,New Caledonia,OCEANIA,,,295.000
46726,,Palau,OCEANIA,,,18.000


In [10]:
# Filtering out rows that have NaN values after merging
# https://www.kite.com/python/answers/how-to-find-rows-with-nan-values-in-a-pandas-dataframe-in-python
is_NaN = result.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = result[row_has_NaN]

# Saving rows with NaN to "result/countries-no_match.csv"
mkdir_p('result')
rows_with_NaN.to_csv(r"result/countries-no_match.csv", sep=";", index = True, header = True)


In [11]:
# Consolidate the remaining data into a single CSV file called "result/politicians_by_country.csv"
result = result.dropna()
result.to_csv(r"result/politicians_by_country.csv", sep=";", index=True, header=True)
result

Unnamed: 0,article_name,country,region,revision_id,article_quality,population
0,Bir I of Kanem,Chad,AFRICA,355319463.0,Stub,16.877
1,Abdullah II of Kanem,Chad,AFRICA,498683267.0,Stub,16.877
2,Salmama II of Kanem,Chad,AFRICA,565745353.0,Stub,16.877
3,Kuri I of Kanem,Chad,AFRICA,565745365.0,Stub,16.877
4,Mohammed I of Kanem,Chad,AFRICA,565745375.0,Stub,16.877
...,...,...,...,...,...,...
46690,Rita Sinon,Seychelles,AFRICA,800323154.0,Stub,98.000
46691,Sylvette Frichot,Seychelles,AFRICA,800323798.0,Stub,98.000
46692,May De Silva,Seychelles,AFRICA,800969960.0,Start,98.000
46693,Vincent Meriton,Seychelles,AFRICA,802051093.0,Stub,98.000


## Step 3⃣ | Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population (we can also call it `coverage`) and high-quality articles (we can also call it `relative-quality`) for **each country** and for **each region**. By a `"high quality"` article we mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
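
The two examples can be written out as a short sketch. The function names are hypothetical, and the population passed to `articles_per_population` is assumed to be a count of individual people, so values from `export_2019.csv` (given in millions) would need converting first.

```python
def articles_per_population(article_count, population):
    """Articles as a percentage of population (population in individual people)."""
    return 100 * article_count / population


def high_quality_share(high_quality_count, article_count):
    """FA/GA articles as a percentage of all articles."""
    return 100 * high_quality_count / article_count


print(articles_per_population(10, 10_000))  # first example above
print(high_quality_share(2, 10))            # second example above

# export_2019.csv lists population in millions, so convert before dividing
population_millions = 16.877  # Chad, from export_2019.csv
chad_people = population_millions * 1_000_000
```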

### Results format

The results from this analysis are six `data tables`. Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment. The tables will show:

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
1. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

**❗Hint:** You will also find which country belongs to which region (e.g. `ASIA`) in `export_2019.csv`. You need to calculate the total population per region. For that you could use `groupby` and also check out `apply`.
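
A small sketch of the `groupby` approach follows. The population rows are taken from `export_2019.csv` (in millions), while the article counts are made up purely for illustration; the variable names are assumptions as well.

```python
import pandas as pd

population = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya", "French Polynesia"],
    "population": [44.357, 100.803, 6.891, 0.280],  # millions
    "region": ["AFRICA", "AFRICA", "AFRICA", "OCEANIA"],
})
# Made-up article counts per country, for illustration only
articles = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya", "French Polynesia"],
    "article_count": [100, 200, 50, 7],
})

# Total population per region
region_population = population.groupby("region")["population"].sum()

# Attach each country's region, then total the article counts per region
region_articles = (
    articles.merge(population[["country", "region"]], on="country")
    .groupby("region")["article_count"]
    .sum()
)

# Articles as a percentage of regional population (converted from millions)
region_coverage = 100 * region_articles / (region_population * 1_000_000)
region_coverage = region_coverage.sort_values(ascending=False)
print(region_coverage)
```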

# Results of analysis

In order to analyse the resulting dataset, the following functions needed for computation are defined:

In [12]:
# Parse CSV
data_table = pd.read_csv("result/politicians_by_country.csv", sep=";", index_col=0) 


def get_country_population(country):
    """
    returns a country's population in Millions
    """
    return float(data_table.loc[data_table['country'] == str(country), "population"].iloc[0])
    
def get_region_population(region):
    """
    returns the summed population of a given region by adding all country's population in this region
    """
    # Sum only the population column, then look up the requested region
    region_populations = export_2019.groupby('region')['population'].sum()
    return region_populations[str(region).upper()]

def get_article_num(region):
    """
    returns the number of articles by a given region
    """
    data = data_table[['article_name','region']].groupby('region').count().to_dict()['article_name']
    return data[str(region).upper()]
    
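def get_hq_article_num(region):
    """
    returns the number of good articles (GA+FA) by given region
    """
    hq_articles = data_table.loc[data_table['article_quality'].isin(["GA", "FA"])]
    return int((hq_articles['region'] == str(region).upper()).sum())
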
    
def get_number_of_articles(country):
    """
    returns the number of articles grouped by country
    """
    return (data_table.groupby(['country']).size())[str(country)]
    
    
def get_number_of_hq_articles(country):
    """
    returns the number of high quality articles grouped by country
    """
    try:
        result = (data_table.loc[data_table['article_quality'].isin(["GA", "FA"])].groupby(['country']).size()[str(country)])
        
    except KeyError:  # country has no GA/FA articles
        result = 0
    # This literally took me forever to write, but it somehow works
    return result


def coverage(number_of_articles, country_population):
    """
    returns the coverage as the proportion (as a percentage) of articles-per-population
    """
    return (country_population/number_of_articles)*100


def relative_quality(number_of_hq_articles, number_of_articles):
    """
    returns the relative_quality as the proportion (as a percentage) of high-quality articles
    """
    return (number_of_hq_articles/number_of_articles)*100

# Test functions
#get_country_population("Germany")
#get_number_of_articles("Germany")
#coverage(get_number_of_articles("Germany"), get_country_population("Germany"))

countrys = data_table["country"].unique()
regions = data_table["region"].unique()
region_country = data_table.groupby('region').count()

# Compute coverage for each country
coverages = {}
for country in countrys:
    coverages[country] = coverage(get_number_of_articles(country), get_country_population(country))

# Compute relative quality for each country
rq = {}
for country in countrys:
    try:
        rq[country] = relative_quality(get_number_of_hq_articles(country), get_number_of_articles(country))
    except KeyError:
        pass




1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population

**Note: A coverage of 3935% is not plausible. Two things contribute: `coverage()` divides the population by the number of articles instead of the other way around, and not every value in `export_2019.csv` is given in millions (e.g. Montenegro is listed as 622, which would mean 622 million, although its population is about 0.622 million).**

In [13]:
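# Named coverage_df so that the coverage() function defined above is not overwritten
coverage_df = pd.DataFrame(data=coverages, index=[0]) 
coverage_df = coverage_df.transpose()
coverage_df.columns = ['coverage']
dt1 = coverage_df.sort_values(['coverage'],ascending=False)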
dt1.head(10)

Unnamed: 0,coverage
Guyana,3935.0
Djibouti,2670.27027
Belize,2618.75
Barbados,2050.0
Bahamas,1965.0
Suriname,1512.5
Cape Verde,1502.702703
French Guiana,1088.888889
Martinique,1047.058824
Montenegro,863.888889


2. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [14]:
dt2 = dt1.copy()
dt2 = dt2.tail(10)
dt2.iloc[::-1]

Unnamed: 0,coverage
Tuvalu,0.018519
Albania,0.621007
New Zealand,0.636097
Norway,0.821189
Moldova,0.833726
Estonia,0.893289
Finland,0.97
Sao Tome and Principe,1.0
Lithuania,1.145082
Cyprus,1.231633


3. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [15]:
dt3 = pd.DataFrame(data=rq, index=[0]) 
dt3 = dt3.transpose()
dt3.columns = ['relative_quality']
dt3 = dt3.sort_values(['relative_quality'], ascending=False)
dt3.head(10)

Unnamed: 0,relative_quality
"Korea, North",22.222222
Saudi Arabia,12.711864
Romania,12.244898
Central African Republic,12.121212
Uzbekistan,10.714286
Mauritania,10.416667
Guatemala,8.433735
Dominica,8.333333
Syria,7.751938
Benin,7.692308


4. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [16]:
dt4 = dt3.copy()
dt4 = dt4.sort_values(['relative_quality'], ascending=True)
dt4.head(10)

Unnamed: 0,relative_quality
Seychelles,0.0
Comoros,0.0
Zambia,0.0
Djibouti,0.0
Belize,0.0
Barbados,0.0
Bahamas,0.0
French Guiana,0.0
Federated States of Micronesia,0.0
Angola,0.0


5. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [17]:
# not sure if im doing the right thing here, but I'll calculate the following:
# proportion = number of articles / total regional population

dt5 = pd.DataFrame(columns = ['region'], index=[1,2,3,4,5,6])
dt5['region'] = regions
dt5['proportion'] = dt5['region'].apply(lambda x: get_article_num(str(x))/get_region_population(str(x)))
dt5 = dt5.set_index('region')
dt5 = dt5.sort_values(['proportion'], ascending = False)
dt5

region,proportion
NORTHERN AMERICA,5.270765
EUROPE,4.874975
ASIA,1.8618
AFRICA,1.454055
OCEANIA,1.095802
LATIN AMERICA AND THE CARIBBEAN,1.068069


6. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [20]:
# calculating: number of good articles by region / regions population

def proportion(x,y):
    return x/y

data = data_table.groupby(['region','country']).sum()

data['number of good articles'] = ""
data = data.reset_index()
data['number of good articles'] = data['country'].apply(lambda x: get_number_of_hq_articles(str(x)))
good_articles_by_region = data[['region', 'number of good articles']].groupby('region').sum()
good_articles_by_region = good_articles_by_region.reset_index()
good_articles_by_region['population'] = good_articles_by_region['region'].apply(lambda x: get_region_population(x))
good_articles_by_region['proportion'] = good_articles_by_region['number of good articles']/good_articles_by_region['population']
dt6 = good_articles_by_region.groupby('region').sum()
dt6 = dt6.sort_values('proportion', ascending = False)
dt6.drop(['number of good articles', 'population'], axis = 1)
# Sorted by proportion:

region,proportion
NORTHERN AMERICA,0.282556
EUROPE,0.107595
ASIA,0.049998
AFRICA,0.02522
OCEANIA,0.022042
LATIN AMERICA AND THE CARIBBEAN,0.015362


***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).