This is the source code for **Homework #2**

## Step 1: Import libraries

In [85]:
import os
import pandas as pd
import json, time, urllib.parse
import requests

## Step 2: Getting the Article and Population Data

**A. Data Upload**

In [72]:
# Load dataset
df_politician = pd.read_csv('politicians_by_country_sept2022.csv')
df_population = pd.read_csv('population_by_country_2022.csv')

**B. Column renaming for consistency**

In [73]:
# Rename columns
df_politician.rename(columns = {'name': 'article'}, inplace = True)
df_population.rename(columns = {'Geography': 'country'}, inplace = True)

## Step 3: Handling Inconsistencies within Data

**A. Removing Duplicates**

In [74]:
# Review dataframe info
df_politician.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7584 entries, 0 to 7583
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   article  7584 non-null   object
 1   url      7584 non-null   object
 2   country  7584 non-null   object
dtypes: object(3)
memory usage: 177.9+ KB


In [75]:
# Check for duplicate articles at country level
df_politician[df_politician[['article','country']].duplicated()]

Unnamed: 0,article,url,country
6295,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6309,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia


In [76]:
# Check all possible entries of the duplicate records at country level
df_politician.loc[df_politician['article'].isin(['Abdirahman Aw Ali Farrah','Ibrahim Megag Samatar'])]

Unnamed: 0,article,url,country
6198,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6231,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia
6295,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6309,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia


In [77]:
# Remove duplicate articles at country level
df_politician.drop_duplicates(inplace = True)
df_politician.reset_index(drop = True, inplace = True)
df_politician.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7582 entries, 0 to 7581
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   article  7582 non-null   object
 1   url      7582 non-null   object
 2   country  7582 non-null   object
dtypes: object(3)
memory usage: 177.8+ KB


In [78]:
# Check for duplicate articles at world level
df_politician[df_politician[['article']].duplicated()]

Unnamed: 0,article,url,country
1566,Rudi Kolak,https://en.wikipedia.org/wiki/Rudi_Kolak,Croatia
1654,Count Wenzel Chotek of Chotkow and Wognin,https://en.wikipedia.org/wiki/Count_Wenzel_Cho...,Czechia
1669,Eduard Hedvicek,https://en.wikipedia.org/wiki/Eduard_Hedvicek,Czechia
1676,Konstantin Jireček,https://en.wikipedia.org/wiki/Konstantin_Jireček,Czechia
1680,Maximilian Ulrich von Kaunitz,https://en.wikipedia.org/wiki/Maximilian_Ulric...,Czechia
1711,"Leopold, Count von Thun und Hohenstein","https://en.wikipedia.org/wiki/Leopold,_Count_v...",Czechia
1914,Ibrahim Harun,https://en.wikipedia.org/wiki/Ibrahim_Harun,Ethiopia
2513,José Alejandro de Aycinena,https://en.wikipedia.org/wiki/José_Alejandro_d...,Guatemala
2659,José Francisco Barrundia,https://en.wikipedia.org/wiki/José_Francisco_B...,Honduras
3419,Luca Rovinalti,https://en.wikipedia.org/wiki/Luca_Rovinalti,Italy


**Note:** After reviewing a few of these articles, it is unclear whether these politicians actually pursued politics in two countries or are descended from one country and worked in another. We will keep these 48 records for now, assuming that these politicians had careers in both countries.
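
To review every row involved in a cross-country duplicate, rather than only the later occurrence, `duplicated` can be called with `keep=False`. The sketch below is only an illustration: it uses a small hypothetical frame (the article names and country labels are invented) in place of `df_politician`.

```python
import pandas as pd

# Hypothetical stand-in for df_politician; names and countries are invented
toy = pd.DataFrame({
    'article': ['Person A', 'Person B', 'Person A', 'Person C'],
    'country': ['country_a', 'country_b', 'country_c', 'country_d'],
})

# keep=False marks every occurrence of a repeated article, not just the later ones
mask = toy.duplicated(subset='article', keep=False)
cross_country = toy[mask].sort_values(['article', 'country'])

# Number of distinct countries listed for each repeated article
countries_per_article = cross_country.groupby('article')['country'].nunique()
print(cross_country)
print(countries_per_article)
```

Listing both rows side by side makes it easier to check each article by hand before deciding whether to keep it under both countries.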

In [79]:
# Follow the same steps for population data
df_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                233 non-null    object 
 1   Population (millions)  233 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.8+ KB


In [80]:
# Check for duplicate geography records
df_population[df_population.duplicated()]

Unnamed: 0,country,Population (millions)


**B. Missing Values**

In [81]:
df_politician.isnull().sum()

article    0
url        0
country    0
dtype: int64

In [82]:
df_population.isnull().sum()

country                  0
Population (millions)    0
dtype: int64

**Observation:** Neither table has any missing values.

**C. Handling records with 0 Population**

In [145]:
# Check for geographies with 0 population
# (this cell relies on the 'population' and 'region' columns created in Step 4)
print(df_population.shape)
df_population[df_population['population'] == 0]

(233, 4)


Unnamed: 0,country,population,region_flag,region
183,liechtenstein,0.0,0,western europe
185,monaco,0.0,0,western europe
211,san marino,0.0,0,southern europe
223,nauru,0.0,0,oceania
226,palau,0.0,0,oceania
231,tuvalu,0.0,0,oceania


**Observation:** These 6 countries have a recorded population of 0, which cannot be correct, so we drop them from the dataframe.

In [148]:
# Drop countries with 0 population
df_population = df_population[df_population['population'] != 0]
df_population.reset_index(drop = True, inplace = True)
df_population.shape

(227, 4)

## Step 4: Preliminary Data Processing

**A. Creating Region Column in Population Data**

In [83]:
# Flag rows that provide cumulative regional population counts (their names are in upper case)
def is_region_row(val):
    return 1 if val.isupper() else 0

df_population['region_flag'] = df_population['country'].apply(is_region_row)


# Create the region column: region rows keep their own name, country rows inherit the region of the row above
df_population['region'] = None
for i in range(df_population.shape[0]):
    if df_population.loc[i, 'region_flag'] == 1:
        df_population.loc[i, 'region'] = df_population.loc[i, 'country']
    else:
        df_population.loc[i, 'region'] = df_population.loc[i-1, 'region']

        
# Lowercase country and region names
df_population['country'] = df_population['country'].str.lower()
df_population['region'] = df_population['region'].str.lower()
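
The row-by-row loop above can also be written without explicit iteration. The following is a minimal sketch on a hypothetical excerpt laid out like the population file (upper-case region rows followed by their countries); it is an alternative formulation, not the code that produced the results in this notebook.

```python
import pandas as pd

# Hypothetical excerpt in the same layout as the population file:
# region rows are upper case, country rows follow their region
toy = pd.DataFrame({
    'country': ['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
                'WESTERN AFRICA', 'Benin'],
})

# Flag region rows (1) versus country rows (0)
toy['region_flag'] = toy['country'].str.isupper().astype(int)

# Keep the name on region rows, blank out country rows,
# then carry the last region name forward
toy['region'] = toy['country'].where(toy['region_flag'] == 1).ffill()

# Lowercase both columns, as in the cell above
toy['country'] = toy['country'].str.lower()
toy['region'] = toy['region'].str.lower()
print(toy)
```

Like the loop, this assumes the first row is a region row; otherwise the leading country rows would have no region to inherit.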

**B. Converting Population output format**

In [84]:
# Convert population from millions to absolute counts (multiply by 10^6)
df_population['Population (millions)'] = df_population['Population (millions)'] * 10**6

# Rename columns
df_population.rename(columns = {'Population (millions)': 'population'}, inplace = True)

df_population.head()

Unnamed: 0,country,population,region_flag,region
0,world,7963000000.0,1,world
1,africa,1419000000.0,1,africa
2,northern africa,251000000.0,1,northern africa
3,algeria,44900000.0,0,northern africa
4,egypt,103500000.0,0,northern africa


**C. Separating df_population into two dataframes: country & region**

In [149]:
df_region = df_population.loc[df_population['region_flag'] == 1, ['region', 'population']].reset_index(drop = True)
df_region.head()

Unnamed: 0,region,population
0,world,7963000000.0
1,africa,1419000000.0
2,northern africa,251000000.0
3,western africa,430000000.0
4,eastern africa,473000000.0


In [150]:
df_region.shape

(24, 2)

In [151]:
df_country = df_population.loc[df_population['region_flag'] == 0, ['country', 'region', 'population']].reset_index(drop = True)
df_country.head()

Unnamed: 0,country,region,population
0,algeria,northern africa,44900000.0
1,egypt,northern africa,103500000.0
2,libya,northern africa,6800000.0
3,morocco,northern africa,36700000.0
4,sudan,northern africa,46900000.0


In [152]:
df_country.shape

(203, 3)

## Step 4: Getting Article Quality Predictions

**A. Obtain current revision_id of articles**

In [87]:
# Making a page info request to get the current page revision

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'choubju1@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = df_politician.article.to_list()

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
# PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
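
As a rough check on the throttling constants, the arithmetic can be worked through directly. This sketch assumes each request takes exactly the sleep time plus the assumed 2 ms latency; real network latency is usually higher, so actual throughput would be lower than the figure computed here.

```python
# Values copied from the cell above
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

# Assumed time per request: the sleep plus the assumed latency
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

# Lower bound on the run time for the 7582 articles left after de-duplication
n_articles = 7582
min_total_seconds = n_articles * seconds_per_request

print(f'sleep per request: {API_THROTTLE_WAIT:.3f} s')
print(f'max request rate:  {max_requests_per_second:.0f} per second')
print(f'minimum run time:  {min_total_seconds:.1f} s')
```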

In [88]:
# Setup procedure for API call
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
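    # Work on a copy so the shared template dictionary is not modified between calls
    request_template = dict(request_template)
    request_template['titles'] = article_title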
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [89]:
# Request page info for every listed article to obtain its current revision ID
df_politician['revision_id'] = None
for title in ARTICLE_TITLES:
    info = request_pageinfo_per_article(title)
    if info is None:
        continue  # the request failed; leave revision_id as None
    dct = info['query']['pages']
    pageid = next(iter(dct))  # a single-title query returns exactly one page entry
    revision_id = dct[pageid].get('lastrevid')
    
    # include the revision_id in a new column under df_politician dataset
    df_politician.loc[(df_politician['article'] == title), 'revision_id'] = revision_id

In [119]:
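# Restore df_politician from a checkpoint copy; df_politician_dummy1 must already exist (its creation is not shown here)
df_politician = df_politician_dummy1.copy()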

**B. Remove articles with no revision_id**

In [120]:
# List articles with null revision_id
df_politician[df_politician['revision_id'].isnull()]

Unnamed: 0,article,url,country,revision_id
2446,Prince Ofosu Sefah,https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah,Ghana,
2985,Harjit Kaur Talwandi,https://en.wikipedia.org/wiki/Harjit_Kaur_Talw...,India,
3212,Abd al-Razzaq al-Hasani,https://en.wikipedia.org/wiki/'Abd_al-Razzaq_a...,Iraq,
4865,Abiodun Abimbola Orekoya,https://en.wikipedia.org/wiki/Abiodun_Abimbola...,Nigeria,
4879,Segun “Aeroland” Adewale,https://en.wikipedia.org/wiki/Segun_”Aeroland”...,Nigeria,
5801,Roman Konoplev,https://en.wikipedia.org/wiki/Roman_Konoplev,Russia,
6342,Nhlanhla “Lux” Dlamini,https://en.wikipedia.org/wiki/Nhlanhla_”Lux”_D...,South Africa,


**Observation:** The API returned no revision_id for 7 articles. Without a revision_id we cannot request their ORES scores, so we drop these articles.
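
A null revision_id typically arises because, for a title it cannot find, the MediaWiki API returns a placeholder page entry (a negative page ID with a `missing` key) that has no `lastrevid`. The helper below is a hypothetical sketch of handling that case explicitly; `extract_lastrevid` and both sample responses are illustrative and not part of the original notebook.

```python
def extract_lastrevid(pageinfo_response):
    """Return the lastrevid from a single-title page info response, or None."""
    if not pageinfo_response:
        return None
    pages = pageinfo_response.get('query', {}).get('pages', {})
    for page in pages.values():
        # Titles that do not exist carry a 'missing' key and no 'lastrevid'
        if 'missing' in page:
            return None
        return page.get('lastrevid')
    return None

# Hypothetical responses in the shape returned for a found and a missing title
found = {'query': {'pages': {'12345': {'pageid': 12345, 'title': 'Example',
                                       'lastrevid': 987654321}}}}
missing = {'query': {'pages': {'-1': {'ns': 0, 'title': 'No such page',
                                      'missing': ''}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
print(extract_lastrevid(None))
```

Several of the 7 titles contain curly quotation marks or a leading apostrophe in the URL, so a mismatch between the title in the CSV and the live page title is a plausible cause, although we have not verified this.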

In [121]:
# Drop articles with null revision_id
print(f'Before dropping: {df_politician.shape}')
df_politician.dropna(subset = ['revision_id'], inplace = True)
df_politician.reset_index(drop = True, inplace = True)
print(f'After dropping: {df_politician.shape}')

Before dropping: (7582, 4)
After dropping: (7575, 4)


**C. Make ORES request to get prediction scores**

In [124]:
# Make ORES Request

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'choubju1@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and their current revision IDs (values) used for ORES scoring
ARTICLE_REVISIONS = {df_politician.loc[i,'article']:df_politician.loc[i,'revision_id'] for i in range(df_politician.shape[0])}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}

In [125]:
# Setup procedure for API call
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template, leaving the shared dictionary untouched
    request_template = dict(request_template, revid = article_revid)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [126]:
# Request an ORES article-quality prediction for each article's current revision
df_politician['article_quality'] = None
for article, revision_id in ARTICLE_REVISIONS.items():
    score = request_ores_score_per_article(revision_id)
    # Skip revisions for which no usable prediction came back (failed request or model error)
    try:
        article_quality = score['enwiki']['scores'][str(revision_id)]['articlequality']['score']['prediction']
    except (KeyError, TypeError):
        continue
    
    # include the article_quality in a new column under df_politician dataset
    df_politician.loc[(df_politician['article'] == article), 'article_quality'] = article_quality
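
The prediction sits several levels deep in the ORES response, so a small helper can make the lookup tolerant of failed requests. This is a sketch under assumptions: `extract_prediction` is a hypothetical name, the sample responses are trimmed to the keys used here, and the shape of the error entry is assumed rather than taken from this notebook's output.

```python
def extract_prediction(ores_response, revid, context='enwiki', model='articlequality'):
    """Return the predicted quality class for revid, or None if unavailable."""
    try:
        return ores_response[context]['scores'][str(revid)][model]['score']['prediction']
    except (KeyError, TypeError):
        # TypeError covers a None response; KeyError covers an entry without a 'score'
        return None

# Hypothetical, trimmed responses: one successful score and one model error
ok = {'enwiki': {'scores': {'1112193748': {
    'articlequality': {'score': {'prediction': 'Stub'}}}}}}
failed = {'enwiki': {'scores': {'1112193748': {
    'articlequality': {'error': {'type': 'ExampleError', 'message': 'example'}}}}}}

print(extract_prediction(ok, 1112193748))
print(extract_prediction(failed, 1112193748))
print(extract_prediction(None, 1112193748))
```

Returning None instead of raising lets the loop continue, and the unscored articles can then be listed afterwards with `isnull()` in the same way as the missing revision IDs.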

In [127]:
df_politician_dummy2 = df_politician.copy()

In [131]:
# Review total number of articles under each score:
df_politician['article_quality'].value_counts()

Stub     3098
Start    2833
C        1282
GA        242
B          75
FA         45
Name: article_quality, dtype: int64

## Step 5: Combining the Datasets

In [198]:
# Merging datasets
df_politician['country'] = df_politician['country'].str.lower()
df_merge = df_country.merge(df_politician, how = 'outer', on = 'country', indicator = True)
df_merge.drop(columns = 'url', inplace = True)

In [199]:
# Output list of countries with no match
countries_list = df_merge.loc[df_merge['_merge'] != 'both', ['country']].drop_duplicates()
countries_list.reset_index(drop = True, inplace = True)
countries_list

Unnamed: 0,country
0,western sahara
1,mauritius
2,mayotte
3,reunion
4,sao tome and principe
5,eswatini
6,canada
7,united states
8,curacao
9,guadeloupe


**Note:** Besides the countries that appear in only one of the two datasets, the no-match list also contains the 6 countries that were dropped from df_population because their population was recorded as 0.
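
The `_merge` column produced by `indicator = True` also shows which dataset each unmatched country came from. The sketch below uses hypothetical miniature frames (all names and values are invented) to illustrate how the two cases can be told apart.

```python
import pandas as pd

# Hypothetical miniature versions of df_country and df_politician
countries = pd.DataFrame({
    'country': ['country_a', 'country_b'],
    'population': [1_000_000.0, 2_000_000.0],
})
politicians = pd.DataFrame({
    'article': ['Person A', 'Person B'],
    'country': ['country_a', 'country_c'],
})

merged = countries.merge(politicians, how='outer', on='country', indicator=True)

# 'left_only':  present in the population data but has no articles
# 'right_only': has articles but is absent from the population data
no_match = merged.loc[merged['_merge'] != 'both', ['country', '_merge']]
print(no_match)
```

Under this reading, the 6 zero-population countries would be expected to show up as `right_only`, since they were removed from the population side before the merge.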

In [200]:
# Save the list in a text file
countries_list.to_csv('wp_countries-no_match.txt', index=False, header = False)

In [201]:
# Consolidate remaining data into CSV
df_merge = df_merge.loc[df_merge['_merge'] == 'both']
df_merge.drop(columns = '_merge', inplace = True)
df_merge.reset_index(drop = True, inplace = True)

columns = ['country', 'region', 'population', 'article', 'revision_id', 'article_quality']
df_merge = df_merge.loc[:,columns]
df_merge.to_csv('wp_politicians_by_country.csv', index=False)

df_merge.head()

Unnamed: 0,country,region,population,article,revision_id,article_quality
0,algeria,northern africa,44900000.0,Said Abadou,1112193748,Stub
1,algeria,northern africa,44900000.0,Tahar Allan,1059626268,Stub
2,algeria,northern africa,44900000.0,Mohamed Seghir Babes,1079379844,Stub
3,algeria,northern africa,44900000.0,Djelloul Baghli,1053461392,Stub
4,algeria,northern africa,44900000.0,Noureddine Bahbouh,1099284595,Stub


In [203]:
df_merge.shape

(7474, 6)

## Step 6: Analysis

In [210]:
# Total articles per capita by country
cols = ['country', 'region', 'population']
table1 = df_merge.groupby(cols).agg({'article':'count'}).reset_index()
table1.rename(columns = {'article':'article_count'}, inplace = True)
table1['total_articles_percapita'] = table1['article_count'] / table1['population']

In [211]:
# Total high quality articles per capita by country
mask = df_merge['article_quality'].isin(['FA', 'GA'])
cols = ['country', 'region', 'population']
table2 = df_merge[mask].groupby(cols).agg({'article':'count'}).reset_index()
table2.rename(columns = {'article':'article_count'}, inplace = True)
table2['total_high_quality_articles_percapita'] = table2['article_count'] / table2['population']
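
One thing to keep in mind is that filtering on the mask before grouping leaves out any country with no FA or GA articles, so such countries cannot appear in a ranking built from `table2`. If they should be included with a count of zero, one possible approach is to sum a boolean flag per country instead. The sketch below runs on a hypothetical miniature `df_merge`; all values are invented.

```python
import pandas as pd

# Hypothetical miniature df_merge; countries, populations and articles are invented
toy = pd.DataFrame({
    'country': ['country_a', 'country_a', 'country_b'],
    'region': ['region_x', 'region_x', 'region_x'],
    'population': [2_000_000.0, 2_000_000.0, 5_000_000.0],
    'article': ['Person A', 'Person B', 'Person C'],
    'article_quality': ['GA', 'Stub', 'Start'],
})

cols = ['country', 'region', 'population']

# Flag high-quality rows, then sum the flag per country so zero counts are kept
toy['is_high_quality'] = toy['article_quality'].isin(['FA', 'GA'])
counts = (toy.groupby(cols)['is_high_quality'].sum()
             .astype(int)
             .rename('article_count')
             .reset_index())
counts['total_high_quality_articles_percapita'] = counts['article_count'] / counts['population']
print(counts)
```

Whether zero-count countries belong in the bottom-10 list is a judgment call; the results reported in this notebook are based on `table2` as computed above.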

In [219]:
# Total articles per capita by region
table3 = df_merge.groupby('region').agg({'article':'count'}).reset_index()
table3.rename(columns = {'article':'article_count'}, inplace = True)

# Obtain regional population from df_region dataframe
table3b = table3.merge(df_region, how = 'inner', on = 'region')
table3b['total_articles_percapita'] = table3b['article_count'] / table3b['population']

In [221]:
# Total high quality articles per capita by region
mask = df_merge['article_quality'].isin(['FA', 'GA'])
table4 = df_merge[mask].groupby('region').agg({'article':'count'}).reset_index()
table4.rename(columns = {'article':'article_count'}, inplace = True)

# Obtain regional population from df_region dataframe
table4b = table4.merge(df_region, how = 'inner', on = 'region')
table4b['total_high_quality_articles_percapita'] = table4b['article_count'] / table4b['population']

## Step 7: Results

**1. Top 10 countries by coverage: 10 countries with the highest total articles per capita (in descending order)**

In [228]:
table1[['country','total_articles_percapita']].sort_values(by = 'total_articles_percapita', 
                                                           ascending = False).head(10).reset_index(drop = True)

Unnamed: 0,country,total_articles_percapita
0,antigua and barbuda,0.00017
1,federated states of micronesia,0.00013
2,andorra,0.0001
3,barbados,9.3e-05
4,marshall islands,9e-05
5,montenegro,6e-05
6,seychelles,6e-05
7,luxembourg,5.3e-05
8,bhutan,5.1e-05
9,grenada,5e-05


**2. Bottom 10 countries by coverage: 10 countries with the lowest total articles per capita (in ascending order)**

In [229]:
table1[['country','total_articles_percapita']].sort_values(by = 'total_articles_percapita', 
                                                           ascending = True).head(10).reset_index(drop = True)

Unnamed: 0,country,total_articles_percapita
0,china,1.392176e-09
1,mexico,7.843137e-09
2,saudi arabia,8.174387e-08
3,romania,1.052632e-07
4,india,1.255998e-07
5,sri lanka,1.339286e-07
6,egypt,1.352657e-07
7,ethiopia,2.025932e-07
8,taiwan,2.155172e-07
9,vietnam,2.716298e-07


**3. Top 10 countries by high quality: 10 countries with the highest high quality articles per capita (in descending order)**

In [231]:
table2[['country','total_high_quality_articles_percapita']].sort_values(by = 'total_high_quality_articles_percapita',
                                                                        ascending = False).head(10).reset_index(drop = True)

Unnamed: 0,country,total_high_quality_articles_percapita
0,andorra,2e-05
1,montenegro,5e-06
2,albania,2.142857e-06
3,suriname,1.666667e-06
4,bosnia-herzegovina,1.470588e-06
5,lithuania,1.071429e-06
6,croatia,1.052632e-06
7,slovenia,9.52381e-07
8,palestinian territory,9.259259e-07
9,gabon,8.333333e-07


**4. Bottom 10 countries by high quality: 10 countries with the lowest high quality articles per capita (in ascending order)**

In [232]:
table2[['country','total_high_quality_articles_percapita']].sort_values(by = 'total_high_quality_articles_percapita', 
                                                           ascending = True).head(10).reset_index(drop = True)

Unnamed: 0,country,total_high_quality_articles_percapita
0,india,4.2337e-09
1,thailand,1.497006e-08
2,japan,1.601281e-08
3,nigeria,1.830664e-08
4,vietnam,2.012072e-08
5,colombia,2.03666e-08
6,uganda,2.118644e-08
7,pakistan,2.120441e-08
8,sudan,2.132196e-08
9,iran,2.257336e-08


**5. Geographic regions by total coverage: Rank ordered list of geographic regions (in descending order) by total articles per capita**

In [233]:
table3b['rank'] = table3b['total_articles_percapita'].rank(ascending = False)
table3b.sort_values('total_articles_percapita', ascending = False).reset_index(drop = True)

Unnamed: 0,region,article_count,population,total_articles_percapita,rank
0,southern europe,888,151000000.0,5.880795e-06,1.0
1,caribbean,201,44000000.0,4.568182e-06,2.0
2,western europe,684,197000000.0,3.472081e-06,3.0
3,eastern europe,735,287000000.0,2.560976e-06,4.0
4,northern europe,262,107000000.0,2.448598e-06,5.0
5,western asia,686,294000000.0,2.333333e-06,6.0
6,southern africa,117,69000000.0,1.695652e-06,7.0
7,oceania,72,44000000.0,1.636364e-06,8.0
8,eastern africa,648,473000000.0,1.369979e-06,9.0
9,central asia,106,78000000.0,1.358974e-06,10.0


**6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.**

In [235]:
table4b['rank'] = table4b['total_high_quality_articles_percapita'].rank(ascending = False)
table4b.sort_values('total_high_quality_articles_percapita', ascending = False).reset_index(drop = True)

Unnamed: 0,region,article_count,population,total_high_quality_articles_percapita,rank
0,southern europe,46,151000000.0,3.046358e-07,1.0
1,caribbean,8,44000000.0,1.818182e-07,2.0
2,eastern europe,38,287000000.0,1.324042e-07,3.0
3,western europe,22,197000000.0,1.116751e-07,4.0
4,western asia,28,294000000.0,9.52381e-08,5.0
5,northern europe,8,107000000.0,7.476636e-08,6.0
6,southern africa,4,69000000.0,5.797101e-08,7.0
7,central america,10,178000000.0,5.617978e-08,8.0
8,central asia,3,78000000.0,3.846154e-08,9.0
9,southeast asia,24,676000000.0,3.550296e-08,10.0
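
The ranks above appear as 1.0, 2.0 and so on because `Series.rank` returns floats and, with its default `method='average'`, gives tied values a fractional rank. A minimal sketch of producing integer ranks follows, using hypothetical per-capita values that include a tie.

```python
import pandas as pd

# Hypothetical per-capita values, including one tie
toy = pd.DataFrame({
    'region': ['region_x', 'region_y', 'region_z'],
    'percapita': [3.0e-07, 1.0e-07, 1.0e-07],
})

# method='min' gives tied regions the same, lowest rank; astype(int) drops the trailing .0
toy['rank'] = toy['percapita'].rank(ascending=False, method='min').astype(int)
print(toy.sort_values('rank').reset_index(drop=True))
```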
