In [1]:
import pandas as pd
import requests
import json
import csv

## A2: Bias in data

This notebook explores biases in English Wikipedia's coverage of politicians. We use a publicly available dataset of Wikipedia articles about politicians from various countries, together with [ORES](https://www.mediawiki.org/wiki/ORES), a machine learning web service that rates each article and returns a list of probabilities, one per [quality](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) class, along with a predicted class. We combine these predictions with another publicly available dataset, the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau, to put the article counts in context.

### Notebook Workflow:

1. Data Acquisition and preparation
2. Getting article quality predictions
3. Fetching quality score predictions
4. Data Merge
5. Analysis
6. Reflection

----

## Data Acquisition and preparation

The Wikipedia politicians by country dataset is gathered from [Figshare](https://figshare.com/articles/Untitled_Item/5513449). Originally, the data was extracted via the Wikimedia API using the associated code. The fields in the data are:

1. "country", containing the sanitised country name, extracted from the category name;
2. "page", the unsanitised page title;
3. "rev_id", the unique identifier (edit ID) of the last edit to the page.

In [2]:
page_data = pd.read_csv('source-data/page_data.csv')

page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [3]:
# Dropping the wiki pages starting with 'Template:', as they are not real article pages.

page_data = page_data[~page_data.page.str.startswith('Template:')].reset_index(drop=True)

Next we grab the population dataset, which is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

In [4]:
WPDS_2018_data = pd.read_csv('source-data/WPDS_2018_data.csv')

In [5]:
WPDS_2018_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In [6]:
# Since the dataset contains regions/continents as well, we separate Countries and region data from WPDS data

WPDS_2018_data_region = WPDS_2018_data[WPDS_2018_data.Geography.str.isupper()].reset_index(drop=True)
# Converting values to float
WPDS_2018_data_region['Population mid-2018 (millions)'] = \
WPDS_2018_data_region['Population mid-2018 (millions)'].str.replace(',', '').astype(float)

In [7]:
WPDS_2018_data_country = WPDS_2018_data[~WPDS_2018_data.Geography.str.isupper()].reset_index(drop=True)

# Since the values for population are in string, we need to convert them to float for later use
WPDS_2018_data_country['Population mid-2018 (millions)'] = \
WPDS_2018_data_country['Population mid-2018 (millions)'].str.replace(',', '').astype(float)

---

### Getting article quality predictions

The article quality predictions are gathered through [ORES](https://www.mediawiki.org/wiki/ORES), a machine learning system that estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article belongs to each of the following 6 quality categories:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

The exact assessment details can be read [here](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades).

The 'rev_id' column in the page data holds the unique identifier used to fetch the quality score from the API.

In [8]:
# Sample data fetch:

revision_ids = page_data.rev_id[-1890:-1888]
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
params = {'project' : 'enwiki',
          'model'   : 'wp10',
          'revids'  : '|'.join(str(x) for x in revision_ids)
          }
api_call = requests.get(endpoint.format(**params))
response = api_call.json()

response

{'enwiki': {'models': {'wp10': {'version': '0.8.1'}},
  'scores': {'806503132': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.08366287323596852,
       'C': 0.17744300684057873,
       'FA': 0.005809853264138301,
       'GA': 0.01580210415313383,
       'Start': 0.6655620827086284,
       'Stub': 0.05172007979755225}}}},
   '806503196': {'wp10': {'score': {'prediction': 'Start',
      'probability': {'B': 0.030833538536049584,
       'C': 0.05191350351802135,
       'FA': 0.0039232070356563,
       'GA': 0.006332520745605151,
       'Start': 0.7767308522691304,
       'Stub': 0.1302663778955373}}}}}}}
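
The nested response above can be flattened into a simple revision-to-prediction mapping. The following is a minimal sketch, assuming the response keeps the shape shown above; `extract_predictions` is a hypothetical helper name, and the second entry in `sample` is a made-up revision with an illustrative error payload, not real API output.

```python
def extract_predictions(response, project="enwiki", model="wp10"):
    """Map each revision ID to its predicted class, or 'NA' if scoring failed."""
    predictions = {}
    for rev_id, models in response[project]["scores"].items():
        result = models.get(model, {})
        if "error" in result:
            # Revisions that could not be scored carry an 'error' entry instead of a score
            predictions[rev_id] = "NA"
        else:
            predictions[rev_id] = result.get("score", {}).get("prediction", "NA")
    return predictions


# Trimmed-down sample in the same shape as the response shown above.
# Revision "1" and its error payload are hypothetical, for illustration only.
sample = {
    "enwiki": {
        "scores": {
            "806503132": {"wp10": {"score": {"prediction": "Start"}}},
            "1": {"wp10": {"error": {"message": "illustrative error"}}},
        }
    }
}
print(extract_predictions(sample))  # {'806503132': 'Start', '1': 'NA'}
```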

In [9]:
HEADERS = {'User-Agent' : 'https://github.com/nmnshrma', 'From' : 'namans3@uw.edu'}

def fetch_ores_response(revision_ids, headers):
    """
    Fetches the ORES API response for a set of revision IDs
    
    :param revision_ids: list of ids to be fetched
    :param headers: HEADERS for the API call
    
    :returns: nested dict object with ORES API response
    """
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers along so the request identifies its sender to the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    
    return response

In [10]:
def store_and_read(store_loc, action, dict_=None):
    """
    Helper function to read and write dict objects to a location
    
    NOTE: The following ONLY performs a 'write' if there is non-empty dict passed
    
    :param store_loc: location for the file store/read
    :param action: either 'read' or 'write'
    :param dict_: key-value dict to be stored (only used for 'write')
    
    :returns: (for read) dict of stored key-value pairs, empty if no file exists yet
    """
    
    if action not in ['read', 'write']:
        raise ValueError("action value must be read/write")
    
    if action == 'read':
        try:
            # The 'with' block closes the file, so no explicit close() is needed
            with open(store_loc, 'r', newline='') as csv_file:
                # Skip blank rows so that dict() always receives key-value pairs
                return dict(row for row in csv.reader(csv_file) if row)
        except FileNotFoundError:
            # Nothing stored yet: return an empty dict so membership tests still work
            return {}
    
    if dict_:
        # Append so that previously stored rows are preserved
        with open(store_loc, 'a', newline='') as csv_file:
            writer = csv.writer(csv_file)
            for key, value in dict_.items():
                writer.writerow([key, value])
    return None
    

In [11]:
def fetch_ores_response_batchwise(revision_ids, headers, perc_split, store_loc):
    """
    splits a large set of revision IDs to gather and clean pertinent responses
    from the ORES API
    
    Calls fetch_ores_response method for one batch of revision IDs at a time. 
    The number of batches is decided through perc_split
    
    :param revision_ids: list of ids to be fetched
    :param headers: HEADERS for the API call
    :param perc_split: number of batches to split revision_ids into
    :param store_loc: the store loc for the dict object to be written/read
    
    :returns: dict object with {rev_id: prediction} pairs
    """
    init_ids = len(revision_ids)
    # Skip the IDs that have already been read and stored
    # (fall back to an empty dict if nothing has been stored yet)
    ignore_list = store_and_read(store_loc=store_loc, action='read') or {}
    revision_ids = [i for i in revision_ids if str(i) not in ignore_list]
    print(f"IDs list shortened by {init_ids-len(revision_ids)}")
    if len(revision_ids) == 0:
        return ignore_list
    
    # helper values for the API calls
    # Batch size decides the chunk size for the API call (at least 1, so range() gets a valid step)
    n = len(revision_ids)
    batch_size = max(1, n // perc_split)
    
    # Start from the stored predictions so the returned map covers every ID
    data_dict = dict(ignore_list)
    
    for i in range(0, n, batch_size):
        # sends a batch at once
        data = fetch_ores_response(revision_ids[i:i+batch_size], headers=headers)
        for key, val in data['enwiki']['scores'].items():
            wp10 = val.get('wp10', {})
            data_dict[key] = 'NA' if 'error' in wp10 else wp10.get('score', {}).get('prediction', 'NA')
            store_and_read(store_loc=store_loc, action='write', dict_={key: data_dict[key]})
    
    # Return dict object contains: {Rev_id: Prediction}
    return data_dict
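
The function above derives its batch size from a target number of batches. An alternative is to fix the number of IDs per request instead. The sketch below assumes that approach; `chunked` is a hypothetical helper, and any particular chunk size would be an illustrative choice rather than a documented ORES limit.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


batches = list(chunked(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch could then be passed to `fetch_ores_response` in turn, which keeps the request size constant regardless of how many IDs remain to be fetched.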

In [12]:
score_map= fetch_ores_response_batchwise(revision_ids=page_data.rev_id, headers=HEADERS, perc_split=500,
                                         store_loc='results-data/quality-map.csv')

IDs list shortened by 46701


In [13]:
## Maps the quality score from 'score_map' 

page_data['quality_score'] = page_data.rev_id.map(lambda x: score_map.get(str(x), 'NA'))

----

### Data merge

In [14]:
# STEPWISE Preparation for the data

# Only take articles that have a legitimate quality score
final_page_data = page_data[page_data.quality_score != 'NA']


# Inner join to merge file with country, so as to attach populations
final_page_data = final_page_data.merge(WPDS_2018_data_country, how='inner', 
                                        left_on='country', right_on='Geography')

# Remove redundant columns
final_page_data = final_page_data[['page', 'country', 'rev_id', 'quality_score','Population mid-2018 (millions)']]

# Column rename and reshuffle as per the instructions
final_page_data.rename(columns={"page": "article_name", 
                               "quality_score": "article_quality",
                               "rev_id": "revision_id",
                               "Population mid-2018 (millions)": "population"},
                      inplace = True)
final_page_data = final_page_data[['country', 'article_name', 'revision_id', 'article_quality', 'population']]

final_page_data.to_csv('results-data/wiki_page_merged.csv', index=False)

final_page_data.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Chad,Bir I of Kanem,355319463,Stub,15.4
1,Chad,Abdullah II of Kanem,498683267,Stub,15.4
2,Chad,Salmama II of Kanem,565745353,Stub,15.4
3,Chad,Kuri I of Kanem,565745365,Stub,15.4
4,Chad,Mohammed I of Kanem,565745375,Stub,15.4
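
Because the merge above is an inner join, any article whose `country` value has no exact match in the `Geography` column of the population table is dropped silently. A minimal sketch of how the unmatched names on each side could be listed, using plain Python sets; `unmatched_names` is a hypothetical helper and the names below are made-up inputs, not results from this dataset.

```python
def unmatched_names(article_countries, population_countries):
    """Return the names found on only one side of the join, sorted for readability."""
    articles = set(article_countries)
    population = set(population_countries)
    return sorted(articles - population), sorted(population - articles)


# Hypothetical inputs for illustration
only_in_articles, only_in_population = unmatched_names(
    ["Chad", "Zambia", "South Korean"],
    ["Chad", "Zambia", "Korea, South"],
)
print(only_in_articles)    # ['South Korean']
print(only_in_population)  # ['Korea, South']
```

In this notebook the two arguments would be `page_data.country` and `WPDS_2018_data_country.Geography`.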


----

### Analysis
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

#### Examples:
1. if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be .1%.
2. if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
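
The two examples above can be reproduced with a couple of one-line functions. This is only a sketch of the arithmetic; the function names are hypothetical and are not used elsewhere in the notebook.

```python
def articles_per_population_pct(article_count, population):
    """Articles as a percentage of the population (population in individual people)."""
    return article_count * 100 / population


def high_quality_pct(high_quality_count, article_count):
    """FA/GA articles as a percentage of all articles for a country."""
    return high_quality_count * 100 / article_count


print(articles_per_population_pct(10, 10_000))  # 0.1
print(high_quality_pct(2, 10))                  # 20.0
```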

In [15]:
# Generating a helper column that has 1 for good quality articles and 0 for bad

final_page_data.loc[:,'high_quality'] = final_page_data.article_quality.map(lambda x: 
                                                                            1 if x in ['GA', 'FA'] else 0)

final_page_data.head(n=2)

Unnamed: 0,country,article_name,revision_id,article_quality,population,high_quality
0,Chad,Bir I of Kanem,355319463,Stub,15.4,0
1,Chad,Abdullah II of Kanem,498683267,Stub,15.4,0


---

**Results format**
Your results from this analysis will be published in the form of data tables. You are being asked to produce six total tables, that show:

1. **Top 10 countries by coverage**: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. **Bottom 10 countries by coverage**: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

To answer the questions about 'coverage', we need to aggregate, per country, the data in final_page_data prepared in the last step.

In [16]:
country_group_data = final_page_data.groupby('country').agg({'revision_id':'count', 
                                                             'high_quality':'mean', 
                                                             # min of population will just mean the actual population
                                                             'population': 'min'}).reset_index()

# Since aggregated 'revision_id' represent count of those IDs, we rename the column for clarity
country_group_data.rename(columns = {'revision_id':'articles'},
                          inplace=True)

country_group_data.head(n=2)

Unnamed: 0,country,articles,high_quality,population
0,Afghanistan,320,0.0375,36.5
1,Albania,457,0.006565,2.9


In [17]:
# Coverage here is defined as per instructions

country_group_data.loc[:, 'coverage'] = \
(country_group_data.articles/country_group_data.population/1e6)*100

1. **Top 10 countries by coverage:**

In [18]:
country_group_data.sort_values(by='coverage', ascending=False).head(n=10)

Unnamed: 0,country,articles,high_quality,population,coverage
166,Tuvalu,54,0.092593,0.01,0.54
115,Nauru,52,0.0,0.01,0.52
135,San Marino,81,0.0,0.03,0.27
108,Monaco,40,0.0,0.04,0.1
93,Liechtenstein,28,0.0,0.04,0.07
161,Tonga,63,0.0,0.1,0.063
103,Marshall Islands,37,0.0,0.06,0.061667
68,Iceland,201,0.00995,0.4,0.05025
3,Andorra,34,0.0,0.08,0.0425
61,Grenada,36,0.027778,0.1,0.036
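
As a check on the `coverage` column: `population` is expressed in millions, so the percentage computed above amounts to

$$
\text{coverage} = \frac{\text{articles}}{\text{population} \times 10^{6}} \times 100
$$

For the Tuvalu row, this gives $54 / (0.01 \times 10^{6}) \times 100 = 0.54$, which matches the table.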


2. **Bottom 10 countries by coverage:**

In [19]:
country_group_data.sort_values(by='coverage', ascending=True).head(n=10)

Unnamed: 0,country,articles,high_quality,population,coverage
69,India,980,0.017347,1371.3,7.1e-05
70,Indonesia,210,0.047619,265.2,7.9e-05
34,China,1130,0.036283,1393.8,8.1e-05
173,Uzbekistan,28,0.071429,32.9,8.5e-05
51,Ethiopia,101,0.019802,107.5,9.4e-05
82,"Korea, North",36,0.194444,25.6,0.000141
178,Zambia,25,0.0,17.7,0.000141
159,Thailand,112,0.026786,66.2,0.000169
112,Mozambique,58,0.0,30.5,0.00019
13,Bangladesh,319,0.009404,166.4,0.000192


---

3. **Top 10 countries by relative quality**: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. **Bottom 10 countries by relative quality**: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [20]:
# We can utilize the high_quality parameter to measure the relative quality

country_group_data.loc[:, 'high_quality'] = country_group_data.loc[:, 'high_quality']*100

3. **Top 10 countries by relative quality**:

In [21]:
country_group_data.sort_values(by='high_quality', ascending=False).head(n=10)

Unnamed: 0,country,articles,high_quality,population,coverage
82,"Korea, North",36,19.444444,25.6,0.000141
137,Saudi Arabia,118,12.711864,33.4,0.000353
104,Mauritania,48,12.5,4.5,0.001067
31,Central African Republic,66,12.121212,4.7,0.001404
132,Romania,343,11.370262,19.5,0.001759
166,Tuvalu,54,9.259259,0.01,0.54
19,Bhutan,33,9.090909,0.8,0.004125
44,Dominica,12,8.333333,0.07,0.017143
155,Syria,128,7.8125,18.3,0.000699
18,Benin,91,7.692308,11.5,0.000791


4. **Bottom 10 countries by relative quality**:

In [22]:
country_group_data.sort_values(by='high_quality', ascending=True).head(n=10)

Unnamed: 0,country,articles,high_quality,population,coverage
143,Slovakia,116,0.0,5.4,0.002148
114,Namibia,162,0.0,2.5,0.00648
30,Cape Verde,37,0.0,0.6,0.006167
112,Mozambique,58,0.0,30.5,0.00019
38,Costa Rica,147,0.0,5.0,0.00294
108,Monaco,40,0.0,0.04,0.1
43,Djibouti,37,0.0,1.0,0.0037
107,Moldova,423,0.0,3.5,0.012086
167,Uganda,185,0.0,44.1,0.00042
49,Eritrea,16,0.0,6.0,0.000267


---

5. **Geographic regions by coverage**: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. **Geographic regions by relative quality**: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

To answer the above questions, we have to extract the country-region mapping from the WPDS_2018_data dataframe.

In [23]:
# Declares a new column to be filled with region names
WPDS_2018_data.loc[:,'geo_region'] = None

# Extracting list of regions to be mapped to country
geo_regions = list(WPDS_2018_data_region.Geography)

# We extract the region by mapping the first occurrence of region in the WPDS_2018_data dataframe
for idx, geo_region in enumerate(geo_regions):
    i = int(WPDS_2018_data.index[WPDS_2018_data.Geography== geo_region][0])
    
    if geo_region != geo_regions[-1]:
        region_next = geo_regions[idx+1]
        i_next = int(WPDS_2018_data.index[WPDS_2018_data.Geography== region_next][0])
        WPDS_2018_data.loc[i:i_next, 'geo_region'] = geo_region
    else:
        WPDS_2018_data.loc[i:, 'geo_region'] = geo_region
    

WPDS_2018_data.head(n=2)

Unnamed: 0,Geography,Population mid-2018 (millions),geo_region
0,AFRICA,1284.0,AFRICA
1,Algeria,42.7,AFRICA
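
The loop above relies on the layout of the population table: each all-uppercase row names a region, and the country rows that follow belong to it until the next uppercase row. The same idea can be sketched in a single pass over the names. `map_countries_to_regions` is a hypothetical helper, and the short list below is a hand-made excerpt in that layout rather than the full dataset.

```python
def map_countries_to_regions(names):
    """Assign each non-uppercase name to the most recent all-uppercase name."""
    mapping = {}
    current_region = None
    for name in names:
        if name.isupper():
            # An all-uppercase row starts a new region
            current_region = name
        else:
            mapping[name] = current_region
    return mapping


names = ["AFRICA", "Algeria", "Egypt", "NORTHERN AMERICA", "Canada"]
print(map_countries_to_regions(names))
# {'Algeria': 'AFRICA', 'Egypt': 'AFRICA', 'Canada': 'NORTHERN AMERICA'}
```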


In [24]:
# For convenience, we can create a dictionary of the map

country_region_map = dict(list(zip(WPDS_2018_data.Geography,
                                   WPDS_2018_data.geo_region)))

final_page_data.loc[:, 'region'] = final_page_data.country.map(lambda x: 
                                                               country_region_map.get(x, 'None'))

final_page_data.head(n=2)

Unnamed: 0,country,article_name,revision_id,article_quality,population,high_quality,region
0,Chad,Bir I of Kanem,355319463,Stub,15.4,0,AFRICA
1,Chad,Abdullah II of Kanem,498683267,Stub,15.4,0,AFRICA


To figure out region-level data, we need to aggregate, per region, the data prepared in the last step:

In [25]:
region_group_data = final_page_data.groupby('region').\
agg({'revision_id':'count',
     'high_quality':'mean'}).reset_index()


# Merge region population 
region_group_data = \
region_group_data.merge(WPDS_2018_data_region, how='left', left_on='region', right_on='Geography')

# Rename column for clarity:
region_group_data.rename(columns={'Population mid-2018 (millions)':'population'}, inplace=True)
region_group_data.rename(columns = {'revision_id':'articles'}, inplace =True)

# Drop redundant columns
region_group_data.drop(labels="Geography", axis=1,inplace=True)

region_group_data.head(n=2)

Unnamed: 0,region,articles,high_quality,population
0,AFRICA,6851,0.018246,1284.0
1,ASIA,11531,0.026884,4536.0


In [26]:
region_group_data.loc[:, 'coverage'] = \
(region_group_data.articles/region_group_data.population/1e6)*100

5. **Geographic regions by coverage**: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [27]:
region_group_data.sort_values(by='coverage', ascending=False).head(n=10)

Unnamed: 0,region,articles,high_quality,population,coverage
5,OCEANIA,3128,0.0211,41.0,0.007629
2,EUROPE,15864,0.020298,746.0,0.002127
3,LATIN AMERICA AND THE CARIBBEAN,5169,0.013349,649.0,0.000796
0,AFRICA,6851,0.018246,1284.0,0.000534
4,NORTHERN AMERICA,1921,0.051536,365.0,0.000526
1,ASIA,11531,0.026884,4536.0,0.000254


6. **Geographic regions by relative quality**: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [28]:
region_group_data.sort_values(by='high_quality', ascending=False).head(n=10)

Unnamed: 0,region,articles,high_quality,population,coverage
4,NORTHERN AMERICA,1921,0.051536,365.0,0.000526
1,ASIA,11531,0.026884,4536.0,0.000254
5,OCEANIA,3128,0.0211,41.0,0.007629
2,EUROPE,15864,0.020298,746.0,0.002127
0,AFRICA,6851,0.018246,1284.0,0.000534
3,LATIN AMERICA AND THE CARIBBEAN,5169,0.013349,649.0,0.000796


---

## Reflections

Writeup: reflections and implications
Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

In addition to any reflections you want to share about the process of the assignment, please respond to the questions below:

**What biases did you expect to find in the data (before you started working with it), and why?**

My initial assumption was that the English-speaking, western part of the world would come out on top, especially with respect to high-quality articles. This assumption rested on the bias that the western, English-speaking world is generally developed and, with its larger pool of resources, would have higher-quality wiki articles. Furthermore, I wrongly believed that the higher education level in those countries would inevitably place them at the top of the list of countries with the highest-quality coverage.

I also believed that the number of articles on politicians would be much higher overall. Largely on account of the news cycle and the coverage that political personalities receive, I believed the number would be at least an order of magnitude above what it came out to be. Similarly, I did not expect the 'high quality' article percentage to hover between 2 and 5 percent, but to be higher than that, since I had personally never seen a shoddy wiki page about a political figure.

**What (potential) sources of bias did you discover in the course of your data processing and analysis?**

The biggest possible source of bias is the ORES API and its claims about the quality of an article. While Wikimedia has comprehensive guidelines in place, gaps may still be introduced by the subjectivity of the reviewers.

**What might your results suggest about (English) Wikipedia as a data source?**

The coverage of English Wikipedia is truly impressive: with roughly 200 articles per country, the resource is vast. However, finding that over 87% of the articles are of 'Stub' or 'Start' quality is very disappointing and has shaken my belief in the repository as a database. The biggest shock in the whole process was certainly the appearance of North Korea at the top of the 'highest quality' ranking.