- https://www.prb.org/international/indicator/population/table/
- https://figshare.com/articles/Untitled_Item/5513449 Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset. 

"This project contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality" and subcategories, along with the code used to generate that data. Both are released under the CC-BY-SA 4.0 license."

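- ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

  1. FA (featured article)
  2. GA (good article)
  3. B (B-class article)
  4. C (C-class article)
  5. Start (start-class article)
  6. Stub (stub-class article)
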
Use the ORES Python client available via pip.
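
As a rough illustration of what a single score looks like, the sketch below uses a hand-written sample dictionary rather than real ORES output. Its nesting follows the keys that the extraction cell in this notebook reads (`articlequality`, then `score`, then `prediction`); the probability values are invented, and `extract_prediction` is a hypothetical helper name.

```python
# Hand-written sample shaped like one ORES result (all values are invented).
sample_score = {
    "articlequality": {
        "score": {
            "prediction": "Stub",
            "probability": {
                "FA": 0.01, "GA": 0.02, "B": 0.05,
                "C": 0.07, "Start": 0.25, "Stub": 0.60,
            },
        }
    }
}


def extract_prediction(score):
    """Return the predicted quality class, or None if no score is present."""
    try:
        return score["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        return None


# The prediction should agree with the class that has the highest probability.
probabilities = sample_score["articlequality"]["score"]["probability"]
most_likely = max(probabilities, key=probabilities.get)

print(extract_prediction(sample_score))
print(most_likely)
```

A result that carries an error instead of a score has no `score` key, so the helper returns `None` for it rather than raising.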


### Imports

In [1]:
import pandas as pd
from ores import api

Both page_data.csv and WPDS_2018_data.csv contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of page_data.csv, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.

Similarly, WPDS_2018_data.csv contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'Geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.

### Read Datasets

In [2]:
page_path = 'data/country/data/page_data.csv'
wpds_path = 'data/WPDS_2018_data.csv'

In [3]:
page_df = pd.read_csv(page_path)
wpds_df = pd.read_csv(wpds_path)

### Remove "Template" Articles from the Page Data

In [4]:
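# Drop only pages whose name starts with "Template:"; a substring match
# would also remove real articles that merely contain the word "Template".
page_df = page_df[~page_df['page'].str.startswith('Template:')]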

### Split out Regions and Countries

In [5]:
region_df = wpds_df[wpds_df['Geography'].str.isupper()]
country_df = wpds_df[~wpds_df['Geography'].str.isupper()]

### Initiate ORES API Session
The user-agent string (the second argument below) helps the ORES team track requests.

In [6]:
ores_session = api.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")

### Score Pages by `rev_id`

In [7]:
results = ores_session.score("enwiki", ["articlequality"], page_df['rev_id'].tolist())

### Extract Predictions and Catch Articles for which Predictions Cannot be Extracted

In [8]:
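predictions = []
for score in results:
    try:
        predictions.append(score['articlequality']['score']['prediction'])
    except (KeyError, TypeError):
        # No usable score for this revision; mark it so it can be filtered out later.
        predictions.append('None')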

### Add Predictions to Dataframe

In [9]:
page_df['ores_score'] = predictions

### Display Dataframe of Pages without Scores

In [10]:
scoreless_df = page_df[page_df['ores_score'] == 'None']

In [11]:
print(scoreless_df.to_string())

                                          page               country     rev_id ores_score
126              List of politicians in Poland                Poland  516633096       None
222                                 Tingtingru               Vanuatu  550682925       None
330                                Daud Arsala           Afghanistan  627547024       None
539                                Bharat Saud                 Nepal  671484594       None
643                                Robert Sych                Poland  684023803       None
644                Marek Krzysztof Jeleniewski                Poland  684023859       None
830                      Narayani Datt Chataut                 Nepal  698572327       None
894                             Mohammad Amjad              Pakistan  699260156       None
1084                         Gomez das Mariñas                 Spain  703773782       None
1535                              Kamlesh Arya                  Fiji  706810694       None

### Filter out Articles without Scores

In [12]:
page_df = page_df[page_df['ores_score'] != 'None']

### Merge Wikipedia and Population Data

In [13]:
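# Pass the column to drop by keyword; the positional axis argument is no longer accepted by recent pandas.
wp_country_df = pd.merge(page_df, country_df, how='inner', left_on='country', right_on='Geography')\
                  .drop(columns=['Geography'])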

### Collect and Save Wikipedia Page Records that do Not Match Population Data

In [14]:
left_df = pd.merge(page_df, country_df,  how='left', left_on='country', right_on='Geography', indicator=True)
no_match_df = left_df[left_df['_merge'] == 'left_only']

In [15]:
no_match_df.to_csv('data/wp_wpds_countries-no_match.csv', index=False)

### Save Merged Records with the Following Schema
- **Column**
- country
- article_name
- revision_id
- article_quality
- population

In [16]:
wp_country_df.columns = ['article_name', 'country', 'revision_id', 'article_quality', 'population']
wp_country_df.to_csv('data/wp_wpds_politicians_by_country.csv', index=False)

### Read Merged Wikipedia and Population Dataframe for Analysis

In [17]:
wp_country_df = pd.read_csv('data/wp_wpds_politicians_by_country.csv')

### Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Examples:

- If a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
- If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
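
The arithmetic behind these two examples can be checked with a few lines of Python. This is only a sketch; `percentage` is a hypothetical helper name, not part of the assignment.

```python
def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# 10 politician articles for a population of 10,000 people.
articles_per_population = percentage(10, 10_000)

# 2 FA or GA articles out of 10 politician articles.
high_quality_share = percentage(2, 10)

print(f"{articles_per_population:.1f}%")  # 0.1%
print(f"{high_quality_share:.0f}%")       # 20%
```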

### Results format

Your results from this analysis will be published in the form of data tables. You are being asked to produce six tables in total that show:

- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
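
One possible way to build the country-level tables with pandas is sketched below on a few made-up rows. The column names follow the merged file's schema; the derived names (`by_country`, `coverage_pct`, `high_quality_pct`) are hypothetical. The sketch assumes `population` is a plain head count, so if your population file reports other units (for example millions), convert before dividing.

```python
import pandas as pd

# Toy rows using the merged file's column names; the values are invented.
articles = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "article_quality": ["FA", "Stub", "GA", "Stub", "Start", "C"],
    "population": [1000, 1000, 1000, 500, 500, 4000],
})

# Flag articles that ORES predicted to be FA or GA.
articles["high_quality"] = articles["article_quality"].isin(["FA", "GA"])

# One row per country: article count, high-quality count, population.
by_country = articles.groupby("country").agg(
    article_count=("article_quality", "size"),
    high_quality_count=("high_quality", "sum"),
    population=("population", "first"),
)
by_country["coverage_pct"] = 100 * by_country["article_count"] / by_country["population"]
by_country["high_quality_pct"] = (
    100 * by_country["high_quality_count"] / by_country["article_count"]
)

# Top and bottom tables are the same ranking read from opposite ends.
top_coverage = by_country.sort_values("coverage_pct", ascending=False).head(10)
bottom_coverage = by_country.sort_values("coverage_pct", ascending=True).head(10)
top_quality = by_country.sort_values("high_quality_pct", ascending=False).head(10)

print(top_coverage[["article_count", "population", "coverage_pct"]])
```

The regional tables can follow the same pattern once each country has a region label and the regional population totals are used as the denominator.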

Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment, although you are welcome to do so in addition to generating the data tables described above, if you wish to do so!

Reminder: you will find the list of geographic regions, which countries are in each region, and total regional population in the raw WPDS_2018_data.csv file. See "Cleaning the data" above for more information.
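
A minimal sketch of recovering the country-to-region mapping is shown below. It rests on an assumption you should verify against the raw file: that each ALL CAPS region row is immediately followed by the rows for the countries in that region. The toy data, the `Population` column name, and the variable names are all invented for illustration.

```python
import pandas as pd

# Toy stand-in for the raw population file; names and numbers are invented.
# Assumption: each ALL CAPS region row is followed by that region's countries.
wpds = pd.DataFrame({
    "Geography": ["REGION ONE", "Alpha", "Beta", "REGION TWO", "Gamma"],
    "Population": [30, 10, 20, 5, 5],
})

is_region = wpds["Geography"].str.isupper()

# Keep the name only on region rows, then carry it down to the rows below.
wpds["region"] = wpds["Geography"].where(is_region).ffill()

# Country rows now carry the region they were listed under.
country_to_region = (
    wpds.loc[~is_region, ["Geography", "region"]]
    .rename(columns={"Geography": "country"})
    .reset_index(drop=True)
)

print(country_to_region)
```

Merging `country_to_region` onto the per-article data on `country` would then allow grouping by `region`.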

### Writeup: reflections and implications

Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

In addition to any reflections you want to share about the process of the assignment, please respond (briefly) to at least three of the questions below:

- What biases did you expect to find in the data (before you started working with it), and why?
- What (potential) sources of bias did you discover in the course of your data processing and analysis?
- What might your results suggest about (English) Wikipedia as a data source?
- What might your results suggest about the internet and global society in general?
- Can you think of a realistic data science research situation where using these data (to train a model, perform hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?
- Can you think of a realistic data science research situation where using these data (to train a model, perform hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite their inherent limitations and biases?
- How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?

This section doesn't need to be particularly long or thorough, but we'll expect you to write at least a couple paragraphs.

### Required deliverables

A directory in your GitHub repository called data-512-a2 that contains at minimum the following files:

- your two source data files and a description of each
- 1 final data file in CSV format that contains all articles you analyzed, the corresponding country and population, and their predicted quality score.
- 1 Jupyter notebook named hcds-a2-bias that contains all code as well as information necessary to understand each programming step, as well as your findings (six tables) and your writeup (if you have not included it in the README).
- 1 README file in .txt or .md format that contains information to reproduce the analysis, including data descriptions, attributions and provenance information, and descriptions of all relevant resources and documentation (inside and outside the repo) and hyperlinks to those resources, and your writeup (if you have not included it in the notebook). A prototype framework is included in the sample repository.
- 1 LICENSE file that contains an MIT LICENSE for your code.
- If you created any additional process or incremental files in the course of your data processing and analysis (for example, a list of articles for which you were not able to gather ORES scores), please include these in the folder as well, and briefly describe them in the README.

### Helpful tips

- Read all instructions carefully before you begin
- Read all API documentation carefully before you begin
- Experiment with queries in the sandbox of the technical documentation for the API to familiarize yourself with the schema and the data
- Explore the data a bit before starting to be sure you understand how it is structured and what it contains
- Ask questions on Slack if you're unsure about anything. If you need more help, come to office hours or schedule a time to meet with Yihan or Jonathan.
- When documenting/describing your project, think: "If I found this GitHub repo, and wanted to fully reproduce the analysis, what information would I want? What information would I need?"