# Homework 2
The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.


# Step 1: Getting the Article and Population Data
The first step is getting the data, which lives in several different places. You will need data that lists Wikipedia articles of politicians and data for country populations.

The Wikipedia [Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as politicians_by_country.SEPT.2022.csv.

The population data is available in CSV format as population_by_country_2022.csv from the homework folder. This dataset is drawn from the [world population data sheet](https://www.prb.org/international/indicator/population/table) published by the Population Reference Bureau.

### 1a) Importing the required libraries

In [1]:
import json, time, urllib.parse
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# TODO: check whether any of the imported libraries are unused.

### 1b) Loading the clean .csv files into dataframes

In [71]:
df_pol = pd.read_csv('politicians_by_country_SEPT.2022.csv')
df_pop = pd.read_csv('population_by_country_2022.csv')

- politicians by country is saved into a dataframe called *df_pol*
- population by country is saved into a dataframe called *df_pop*

In [72]:
df_pol.head()
# viewing a snapshot of the dataframe loaded

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [73]:
df_pop.head()
# viewing a snapshot of the dataframe loaded

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


### 1c) Data Cleaning

**Some Considerations**  
Crawling Wikipedia categories to identify relevant page subsets can result in misleading and/or duplicate category labels. The data crawl attempted to resolve these, but not all may have been caught. The section below describes how the inconsistencies in the data have been handled.  

The population_by_country_2022.csv contains some rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'Geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in politicians_by_country.SEPT.2022.csv, but we retain some of them so that coverage and quality can be reported by region, as specified in the analysis section below.

**Checking for duplicates in both the dataframes and removing those records.**

In [74]:
print("Number of rows for politician dataframe = ", len(df_pol))
print("Number of rows for population dataframe = ", len(df_pop))

Number of rows for politician dataframe =  7584
Number of rows for population dataframe =  233


In [75]:
df_pol = df_pol.drop_duplicates(subset=['name', 'country', 'url'], keep = 'last')
df_pol = df_pol.reset_index()
print("Number of rows for politician dataframe after removing the duplicates = ", len(df_pol))

Number of rows for politician dataframe after removing the duplicates =  7582


In [76]:
df_pop = df_pop.drop_duplicates()
print("Number of rows for population dataframe after removing the duplicates = ", len(df_pop))

Number of rows for population dataframe after removing the duplicates =  233


*Two rows have been deleted from the politician dataframe, and there were no duplicates in the population dataframe.*
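
Before dropping rows, it can help to look at which records are duplicated. The following is a minimal sketch on a small made-up dataframe (the rows are hypothetical and only share column names with *df_pol*). It uses `duplicated(keep=False)` to list every copy of a duplicated record, and then the same `drop_duplicates` call as above.

```python
import pandas as pd

# Hypothetical sample with the same columns as df_pol (not the real data)
sample = pd.DataFrame({
    "name": ["A Person", "B Person", "A Person"],
    "url": ["https://example.org/A", "https://example.org/B", "https://example.org/A"],
    "country": ["X", "Y", "X"],
})

key_cols = ["name", "country", "url"]

# keep=False marks every member of a duplicate group, not just the extra copies
dupes = sample[sample.duplicated(subset=key_cols, keep=False)]
print(dupes)

# keep='last' retains only the final occurrence of each duplicated record
deduped = sample.drop_duplicates(subset=key_cols, keep="last").reset_index(drop=True)
print(len(sample), "->", len(deduped))
```

Inspecting `dupes` first makes it easier to confirm that the dropped rows really are repeats and not distinct politicians who share a name.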

- Checking for data inconsistencies like nulls/zero numeric values

**Checking for NULL values**

In [77]:
df_pop.isnull().sum()

Geography                0
Population (millions)    0
dtype: int64

In [78]:
df_pol.isnull().sum()

index      0
name       0
url        0
country    0
dtype: int64

*There are no NULL values*

**Checking for ZERO values**

Some population values are given as 0 million. To avoid misleading results, we omit these rows from our analysis.

In [79]:
df_pop[df_pop['Population (millions)'] == 0]

Unnamed: 0,Geography,Population (millions)
183,Liechtenstein,0.0
185,Monaco,0.0
211,San Marino,0.0
223,Nauru,0.0
226,Palau,0.0
231,Tuvalu,0.0


In [80]:
# saving these records in a new dataframe to reuse it later for exclusion
df_pop_zero = df_pop[df_pop['Population (millions)'] == 0]
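
The zero values presumably come from very small populations being rounded to one decimal place in millions, so dividing by them would give infinite per-capita ratios. A minimal sketch of how the saved rows could be used for exclusion follows. The three-row frame is illustrative and the names `pop`, `pop_zero`, and `pop_nonzero` are assumptions, not part of the notebook.

```python
import pandas as pd

# Illustrative rows mirroring the df_pop columns
pop = pd.DataFrame({
    "Geography": ["Algeria", "Monaco", "Egypt"],
    "Population (millions)": [44.9, 0.0, 103.5],
})

# Same selection as df_pop_zero above
pop_zero = pop[pop["Population (millions)"] == 0]

# Exclude those geographies before any per-capita division
pop_nonzero = pop[~pop["Geography"].isin(pop_zero["Geography"])]
print(pop_nonzero)
```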

In [81]:
df_pop.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [82]:
# creating a dataframe with all the regions
df_region = df_pop[df_pop['Geography'].str.isupper()==True]
df_region.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
10,WESTERN AFRICA,430.0
27,EASTERN AFRICA,473.0


**Cumulative regional population counts**

- We first find the rows whose 'Geography' value is fully capitalised, then add a new 'Region' column and populate it with the respective region for each country.

In [83]:
# creating a dataframe with all the regions
df_region = df_pop[df_pop['Geography'].str.isupper()==True]

#joining the df_region table with df_pop
df_pop = pd.concat([df_pop, df_region], axis=1)
df_pop.columns = (['Geography_x', 'Population (millions)_x', 'Geography_y', 'Population (millions)_y'])
df_pop = df_pop.drop(columns=['Population (millions)_y'])

df_pop = df_pop.rename(columns={'Geography_x': 'Geography', 'Geography_y': 'Region',
                               'Population (millions)_x': 'Population (millions)'})
df_pop.head()

Unnamed: 0,Geography,Population (millions),Region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,
4,Egypt,103.5,


In [15]:
# using the ffill() function to fill in the NaN values and deleting the rows where Geography = Region
df_pop = df_pop.ffill(axis=0)
df_pop = df_pop.drop(df_pop[(df_pop.Geography) == (df_pop.Region)].index)
df_pop = df_pop.reset_index(drop=True)
df_pop.head()

Unnamed: 0,Geography,Population (millions),Region
0,Algeria,44.9,NORTHERN AFRICA
1,Egypt,103.5,NORTHERN AFRICA
2,Libya,6.8,NORTHERN AFRICA
3,Morocco,36.7,NORTHERN AFRICA
4,Sudan,46.9,NORTHERN AFRICA


# Step 2: Getting Article Quality Predictions
Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [ORES](https://www.mediawiki.org/wiki/ORES). This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:  
FA - Featured article  
GA - Good article  
B - B-class article  
C - C-class article  
Start - Start-class article  
Stub - Stub-class article    

To get a Wikipedia page quality prediction from ORES for each politician’s article page we will need to:   
a) read each line of politicians_by_country.SEPT.2022.csv  
b) make a page info request to get the current page revision  
c) make an ORES request using the page title and current revision id. 


### 2a) Configuring the API parameters

The code below illustrates how to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). This shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

#### License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022

In [16]:
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<nmohan@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a list of politicians from Wikipedia article titles 
ARTICLE_TITLES = df_pol['name']

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
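
As a quick check of the throttling arithmetic, the sketch below works out the request rate implied by the constants above. The article count is the deduplicated row count reported earlier. The time estimate assumes the 2 ms latency figure holds, which is optimistic for real network calls.

```python
API_LATENCY_ASSUMED = 0.002                              # seconds, as assumed above
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # 0.008 s of sleep per call

# Sleep plus assumed latency gives the intended spacing between requests
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

# Rough lower bound on the run time for the 7582 deduplicated articles
n_articles = 7582
estimated_minutes = n_articles * seconds_per_request / 60.0

print(round(max_requests_per_second), "requests per second at most")
print(round(estimated_minutes, 2), "minutes at minimum")
```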

### 2b) API request function
The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.


In [17]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

- Iterating through the ARTICLE_TITLES to call the above defined function such that we can get the JSON response from the endpoint.
- capturing only the article title and the lastrevid for the above API call into a dataframe called *df_articles_lastrevid*

**Since the number of article pages not found is more than a few, they have been captured in an error log file saved as "error_log.txt"**

In [18]:
pageinfo_list = {}
with open('error_log.txt', 'w') as f:
    for i in range(0, len(ARTICLE_TITLES)):
        try:
            request_op = request_pageinfo_per_article(article_title = ARTICLE_TITLES[i],
                                                      request_template = PAGEINFO_PARAMS_TEMPLATE)['query']['pages']
            pageinfo_list.update(request_op)
        except Exception:
            # log the index of any article whose page info could not be retrieved
            txt = "Couldn't get the page info for: " + str(i)
            f.write(txt)
            f.write('\n')
    
df_articles_lastrevid = pd.DataFrame.from_dict(pageinfo_list, orient='index', columns=['title', 'lastrevid'])
df_articles_lastrevid.reset_index(inplace = True, drop = True)
df_articles_lastrevid.head()

Unnamed: 0,title,lastrevid
0,Shahjahan Noori,1099689000.0
1,Abdul Ghafar Lakanwal,943562300.0
2,Majah Ha Adrif,852404100.0
3,Haroon al-Afghani,1095102000.0
4,Tayyab Agha,1104998000.0
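
To make the response handling easier to follow, here is a hedged sketch of pulling the title and `lastrevid` out of a single response. It assumes the default JSON layout of the API, in which a title that does not exist comes back under a negative page id with a `missing` key and no `lastrevid`. The helper name `extract_lastrevid`, the page id `12345`, and the "No Such Page" title are hypothetical.

```python
# Hand-written responses in the assumed shape of action=query&prop=info
found = {"query": {"pages": {"12345": {"pageid": 12345,
                                       "title": "Tayyab Agha",
                                       "lastrevid": 1104998382}}}}
missing = {"query": {"pages": {"-1": {"title": "No Such Page", "missing": ""}}}}


def extract_lastrevid(json_response):
    """Return (title, lastrevid); lastrevid is None when the page is missing."""
    if not json_response:
        return None, None
    pages = json_response.get("query", {}).get("pages", {})
    # One title is requested per call, so only the first page entry is read
    for page in pages.values():
        return page.get("title"), page.get("lastrevid")
    return None, None


print(extract_lastrevid(found))
print(extract_lastrevid(missing))
```

If that assumption holds, a missing page does not raise an exception, so checking for a `None` revision id is a more direct way to spot articles that were not found.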


In [19]:
# Saving the API call response into a csv file to avoid reloading it multiple times
# Last updated: 13 Oct 2022 (11:15 hrs)
df_articles_lastrevid.to_csv('request_pageinfo_per_article_output.csv')

### 2c) Article quality scores from the ORES endpoint
This example illustrates how to generate quality scores for article revisions using [ORES](https://www.mediawiki.org/wiki/ORES). This example shows how to request a score of a specific revision, where the score provides probabilities for all of the possible article quality levels. The API documentation can be accessed from the main [ORES](https://ores.wikimedia.org) page. However, this documentation is a little skimpy and if you want more information you may have to dig around.

#### License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022

In [21]:
# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL

API_ORES_SCORE_PARAMS = "/scores/{context}/?models={model}&revids={revids}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<nmohan@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revids" : "",              # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}

In [22]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revids'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    # (the URL already has a query string, so the extra parameter is appended with '&')
    if features:
        request_url = request_url+"&features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
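
To see what the function actually requests, this short sketch builds the URL for one revision without sending anything over the network. The revision id is the first one shown in the score table in this notebook. The `params` dictionary is an illustrative stand-in for the template.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={model}&revids={revids}"

# Illustrative stand-in for ORES_PARAMS_TEMPLATE with a revision id filled in
params = {"context": "enwiki", "model": "articlequality", "revids": 1099689043}

# str.format fills each {placeholder} from the matching dictionary key
request_url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(**params)
print(request_url)
```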

- Extracting the ORES score for each revision by running a for loop
- **The rows for which the ORES score couldn't be captured are printed below**

In [24]:
ores_score = {}
for i in range(0, len(df_articles_lastrevid.lastrevid)):
    try:
        revids = str(int(df_articles_lastrevid['lastrevid'][i]))
        req_op = request_ores_score_per_article(revids)['enwiki']['scores']
        ores_score[revids] = req_op[revids]['articlequality']['score']['prediction']
    except:
        print("Couldn't get the ORES info for: ", i)

Couldn't get the ORES info for:  2439
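
The nested lookup in the loop above can be wrapped in a small helper so that a missing or errored score returns `None` instead of raising. This is a sketch under the assumption that responses follow the nesting used in the loop. The helper name `extract_prediction` and both sample payloads are hypothetical.

```python
def extract_prediction(json_response, revid, context="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if no score is present."""
    try:
        return json_response[context]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: the expected keys are absent; TypeError: the response was None
        return None


# Hand-written payloads that mimic the nesting used in the loop above
sample_ok = {"enwiki": {"scores": {"1099689043": {
    "articlequality": {"score": {"prediction": "GA"}}}}}}
sample_error = {"enwiki": {"scores": {"123": {
    "articlequality": {"error": {"message": "hypothetical error payload"}}}}}}

print(extract_prediction(sample_ok, 1099689043))
print(extract_prediction(sample_error, 123))
```

Catching only `KeyError` and `TypeError` keeps unrelated problems, such as a typo in the loop, from being silently swallowed.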


- Creating a dataframe of the API output and renaming the columns for easier tracking and combining later.

In [25]:
df_scores = pd.DataFrame.from_dict(ores_score, orient='index', columns=['prediction'])
df_scores.reset_index(inplace = True)
df_scores = df_scores.rename(columns = {'index': 'lastrevid'})
df_scores['lastrevid'] = df_scores['lastrevid'].astype('int')
df_scores

Unnamed: 0,lastrevid,prediction
0,1099689043,GA
1,943562276,Start
2,852404094,Start
3,1095102390,B
4,1104998382,Start
...,...,...
7522,1073818982,Stub
7523,1106932400,C
7524,904246837,Stub
7525,959111842,Stub


- Storing the API output dataframe as a .csv file to avoid re-running the code again to retrieve the information

In [26]:
df_scores.to_csv('request_ores_score_per_article_output.csv')

# Step 3: Combining the Datasets

Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.  

Identify all countries for which there are no matches and output a list of those countries, with each country on a separate line called: **wp_countries-no_match.txt**  

Consolidate the remaining data into a single CSV file called:
**wp_politicians_by_country.csv**  

The schema for that file should look something like this:  
Column  
country  
region  
population  
article_title  
revision_id  
article_quality

## 3a) Combining Datasets 

- politicians data has (title, country)
- page info has (title, lastrevid)
- ores score has (lastrevid, prediction)
- population data has (country, population, region)

Merging the article page info dataframe with the ORES score predictions

In [27]:
print('Number of records in Article page info = ', len(df_articles_lastrevid))
print('Number of records in ORES score prediction = ', len(df_scores))

df_joined = df_articles_lastrevid.merge(df_scores, on = ['lastrevid'], how = 'left')

# Adding country as well
df_joined = df_pol.merge(df_joined, left_on = "name", right_on = "title", how = 'left')

# Cleaning the dataframe by removing duplicate name column (i.e., title) and url
df_pol_scores = df_joined.drop(['url', 'title', 'index'], axis = 1)
print('Number of records in the joined dataframe with politicians and their ORES score prediction =',
      len(df_pol_scores))

# To view a snippet of the dataframe
df_pol_scores.head()

Number of records in Article page info =  7528
Number of records in ORES score prediction =  7527
Number of records in the joined dataframe with politicians and their ORES score prediction = 7582


Unnamed: 0,name,country,lastrevid,prediction
0,Shahjahan Noori,Afghanistan,1099689000.0,GA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start
2,Majah Ha Adrif,Afghanistan,852404100.0,Start
3,Haroon al-Afghani,Afghanistan,1095102000.0,B
4,Tayyab Agha,Afghanistan,1104998000.0,Start


- Adding the population data to this dataframe as well.

In [28]:
df_consolidated = df_pol_scores.merge(df_pop, left_on = 'country', right_on = 'Geography', how = 'outer')
print('Number of records in the joined dataframe =', len(df_consolidated))
df_consolidated.head()

Number of records in the joined dataframe = 7607


Unnamed: 0,name,country,lastrevid,prediction,Geography,Population (millions),Region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA


## 3b) Finding countries with no matches

In [29]:
# s1 is a list for checking for countries with no wiki data
# creating sets and taking a set difference for the no matches count

s1 = df_consolidated[df_consolidated['country'].isnull()]['Geography'].unique()

# s2 is a list for checking countries with no population data
s2 = df_consolidated[df_consolidated['Geography'].isnull()]['country'].unique()

no_match = list(set(np.append(s1, s2)))
no_match.sort()
no_match

['Australia',
 'Brunei',
 'Canada',
 'China,  Hong Kong SAR',
 'China,  Macao SAR',
 'Curacao',
 'French Guiana',
 'French Polynesia',
 'Guadeloupe',
 'Guam',
 'Ireland',
 'Kiribati',
 'Korean',
 'Martinique',
 'Mauritius',
 'Mayotte',
 'New Caledonia',
 'New Zealand',
 'Philippines',
 'Puerto Rico',
 'Reunion',
 'Sao Tome and Principe',
 'United Kingdom',
 'United States',
 'Western Sahara',
 'eSwatini']
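
An alternative way to find unmatched countries is the `indicator=True` option of `merge`, which labels each row by the side it came from. The sketch below uses two tiny made-up tables. The names `articles` and `population` and the Brunei value are illustrative only.

```python
import pandas as pd

# Miniature stand-ins for the politician and population tables
articles = pd.DataFrame({"name": ["P1", "P2", "P3"],
                         "country": ["Algeria", "Korean", "Egypt"]})
population = pd.DataFrame({"Geography": ["Algeria", "Egypt", "Brunei"],
                           "Population (millions)": [44.9, 103.5, 0.4]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = articles.merge(population, left_on="country", right_on="Geography",
                        how="outer", indicator=True)

# Countries with articles but no population row, and the reverse
no_population = merged.loc[merged["_merge"] == "left_only", "country"]
no_articles = merged.loc[merged["_merge"] == "right_only", "Geography"]

no_match = sorted(set(no_population) | set(no_articles))
print(no_match)
```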

#### Writing to an output text file wp_countries-no_match.txt

In [30]:
with open ('wp_countries-no_match.txt', 'w') as f:
    for i in no_match:
        f.write(i)
        f.write('\n')

#### Consolidate the remaining data into a single CSV file

- Checking for nulls in Geography and country, and dropping the rows where either is null
- renaming all the columns as per the standard given in the instruction file

In [31]:
df_consolidated = df_consolidated[(~df_consolidated['country'].isnull()) & (~df_consolidated['Geography'].isnull())]
df_consolidated = df_consolidated.drop('Geography', axis = 1)

df_consolidated = df_consolidated.rename(columns = {
    'Population (millions)' : 'population',
    'name' : 'article_title',
    'lastrevid': 'revision_id',
    'prediction': 'article_quality',
    'Region' : 'region'
})

- Saving df_consolidated into a csv file as required and displaying a snapshot of the same.

In [32]:
df_consolidated.to_csv('wp_politicians_by_country.csv', index=False)
df_consolidated.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,41.1,SOUTH ASIA


# Step 4: Analysis
The analysis consists of calculating total-articles-per-population (a ratio representing the number of articles per person)  and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis. All of these values are to be “per capita”.  

In this analysis a country can only exist in one region. The **population_by_country_2022.csv** actually represents regions in a hierarchical order. For the analysis, each country is placed in the closest (lowest in the hierarchy) region.  

For this analysis you should consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.  

Also, keep in mind that the population_by_country_2022.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.
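
As a worked example of the unit conversion, the sketch below uses the Afghanistan figures that appear in this notebook (118 articles, population 41.1 million). The variable names are illustrative.

```python
article_count = 118
population_millions = 41.1

# Population is stored in millions, so scale it up before dividing
articles_per_capita = article_count / (population_millions * 1_000_000)

# The same ratio per million people is often easier to read
articles_per_million = article_count / population_millions

print(f"{articles_per_capita:.6e} articles per person")
print(f"{articles_per_million:.2f} articles per million people")
```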

## 4a) Total Articles per Population (articles per capita)

### By Country

In [85]:
# Keep one row per country to get each country's population,
# count the number of articles per country,
# then calculate articles_per_capita
df1 = df_consolidated[~df_consolidated.duplicated(subset=['country', 'region'], keep = 'last')]

# Calculating the population of each country
country_pop = df1[['country', 'population']].groupby('country').sum().reset_index()
country_article_cnt = df_consolidated[['country', 'article_title']].groupby('country').count().reset_index()
total_articles_country = country_pop.merge(country_article_cnt, on='country')
total_articles_country.columns=['country', 'population', 'article_count']
total_articles_country['articles_per_capita'] = total_articles_country['article_count'] / (total_articles_country['population'] * 1000000)

# handling for conditions where population is zero (6 countries)
total_articles_country = total_articles_country[total_articles_country['articles_per_capita'] != np.inf] 
print('On a country level, the dataframe returns the below number of rows')
print(len(total_articles_country['country'].unique()))
total_articles_country.reset_index(inplace=True)
total_articles_country = total_articles_country.drop('index', axis = 1)
total_articles_country.head()

On a country level, the dataframe returns the below number of rows
178


Unnamed: 0,country,population,article_count,articles_per_capita
0,Afghanistan,41.1,118,2.871046e-06
1,Albania,2.8,83,2.964286e-05
2,Algeria,44.9,34,7.572383e-07
3,Andorra,0.1,10,0.0001
4,Angola,35.6,42,1.179775e-06


### By Region

In [42]:
# Repeating the same as above but grouping by region in this case
# Calculating the population of each region from one row per country
df2 = df_consolidated[~df_consolidated.duplicated(subset=['country', 'region'], keep = 'last')] 

region_pop = df2[['region', 'population']].groupby('region').sum().reset_index()
region_article_cnt = df_consolidated[['region', 'article_title']].groupby('region').count().reset_index()
total_articles_region = region_pop.merge(region_article_cnt, on='region')
total_articles_region.columns=['region', 'population', 'article_count']
total_articles_region['articles_per_capita'] = total_articles_region['article_count'] / (total_articles_region['population'] * 1000000)
 
print('On a region level, the dataframe returns the below number of rows')
print(len(total_articles_region['region'].unique()))
total_articles_region.head()

On a region level, the dataframe returns the below number of rows
18


Unnamed: 0,region,population,article_count,articles_per_capita
0,CARIBBEAN,39.5,201,5.088608e-06
1,CENTRAL AMERICA,177.9,195,1.096121e-06
2,CENTRAL ASIA,78.0,106,1.358974e-06
3,EAST ASIA,1665.8,246,1.476768e-07
4,EASTERN AFRICA,470.3,648,1.377844e-06


## 4b) High Quality Articles per Population
This applies only to the articles tagged with FA or GA in the "article_quality" column.
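
The FA/GA filter can also be written with `isin`, which is a little easier to extend than chained comparisons. This is a sketch on made-up rows. The `HIGH_QUALITY` constant and the sample data are assumptions for illustration.

```python
import pandas as pd

HIGH_QUALITY = ["FA", "GA"]

# Made-up rows with the same column names as the consolidated data
sample = pd.DataFrame({
    "article_title": ["P1", "P2", "P3", "P4"],
    "country": ["X", "X", "Y", "Y"],
    "article_quality": ["GA", "Stub", "FA", None],
})

# isin is False for missing predictions, so unscored articles are left out
hq = sample[sample["article_quality"].isin(HIGH_QUALITY)]

# Count the high quality articles per country
hq_counts = hq.groupby("country")["article_title"].count()
print(hq_counts)
```

Countries without any FA or GA article do not appear in `hq_counts` at all, which is worth keeping in mind when reading the bottom-ranked tables.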

### By Country

In [63]:
# Filtering the articles based on the article_quality attribute
# Calculation for article_count and articles_per_capita done the same as above i.e., group by country

df3 = df_consolidated[~df_consolidated.duplicated(subset=['country', 'region'], keep = 'last')]

country_pop = df3[['country', 'population']].groupby('country').sum().reset_index()
hq_country_df = df_consolidated[(df_consolidated['article_quality'] == 
                                 'FA') | (df_consolidated['article_quality'] == 'GA')]

country_count = hq_country_df[['country', 'article_title']].groupby('country').count().reset_index()
hq_country_df = country_pop.merge(country_count, on='country')
hq_country_df.columns=['country', 'population', 'article_count']
hq_country_df['articles_per_capita'] = hq_country_df['article_count'] / (hq_country_df['population'] * 1000000)

# Need to exclude conditions where the population of a country is zero
hq_country_df = hq_country_df[hq_country_df['articles_per_capita'] != np.inf]
hq_country_df.reset_index(inplace=True)
hq_country_df.drop(columns=['index'], inplace=True)

print('On a country level, the high quality dataframe returns the below number of rows')
print(len(hq_country_df['country'].unique()))
hq_country_df.head()

On a country level, the high quality dataframe returns the below number of rows
92


Unnamed: 0,country,population,article_count,articles_per_capita
0,Afghanistan,41.1,6,1.459854e-07
1,Albania,2.8,6,2.142857e-06
2,Andorra,0.1,2,2e-05
3,Armenia,3.0,1,3.333333e-07
4,Azerbaijan,10.2,1,9.803922e-08


### By Region

In [64]:
# Filtering the articles based on the article_quality attribute
# Calculation for article_count and articles_per_capita done the same as above i.e., group by region

df4 = df_consolidated[~df_consolidated.duplicated(subset=['country', 'region'], keep = 'last')] 
region_pop = df4[['region', 'population']].groupby('region').sum().reset_index()

hq_region_df = df_consolidated[(df_consolidated['article_quality'] == 
                                 'FA') | (df_consolidated['article_quality'] == 'GA')]
region_count = hq_region_df[['region', 'article_title']].groupby('region').count().reset_index()
hq_region_df = region_pop.merge(region_count, on='region')
hq_region_df.columns=['region', 'population', 'article_count']
hq_region_df['articles_per_capita'] = hq_region_df['article_count'] / (hq_region_df['population'] * 1000000)

print('On a region level, the high quality dataframe returns the below number of rows')
print(len(hq_region_df['region'].unique()))
hq_region_df.head()

On a region level, the high quality dataframe returns the below number of rows
18


Unnamed: 0,region,population,article_count,articles_per_capita
0,CARIBBEAN,39.5,8,2.025316e-07
1,CENTRAL AMERICA,177.9,10,5.621135e-08
2,CENTRAL ASIA,78.0,3,3.846154e-08
3,EAST ASIA,1665.8,16,9.604995e-09
4,EASTERN AFRICA,470.3,15,3.189454e-08


# Step 5: Results
Your results from this analysis will be produced in the form of data tables. You are being asked to produce six total tables, that show:  

### 1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) 

In [65]:
top10_country = total_articles_country.sort_values(by=['articles_per_capita'],
                                                    ascending=False).head(10).reset_index()
top10_country['country']

0               Antigua and Barbuda
1    Federated States of Micronesia
2                           Andorra
3                          Barbados
4                  Marshall Islands
5                        Montenegro
6                        Seychelles
7                        Luxembourg
8                            Bhutan
9                           Grenada
Name: country, dtype: object

### 2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) 

In [66]:
bottom10_country = total_articles_country.sort_values(by=['articles_per_capita'],
                                                    ascending=True).head(10).reset_index()
bottom10_country['country']

0           China
1          Mexico
2    Saudi Arabia
3         Romania
4           India
5       Sri Lanka
6           Egypt
7        Ethiopia
8          Taiwan
9         Vietnam
Name: country, dtype: object

### 3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) 

In [67]:
top10_hq_country = hq_country_df.sort_values(by=['articles_per_capita'],
                                             ascending=False).head(10).reset_index()
top10_hq_country['country']

0                  Andorra
1               Montenegro
2                  Albania
3                 Suriname
4       Bosnia-Herzegovina
5                Lithuania
6                  Croatia
7                 Slovenia
8    Palestinian Territory
9                    Gabon
Name: country, dtype: object

### 4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [68]:
bottom10_hq_country = hq_country_df.sort_values(by=['articles_per_capita'],
                                             ascending=True).head(10).reset_index()
bottom10_hq_country['country']

0       India
1    Thailand
2       Japan
3     Nigeria
4     Vietnam
5    Colombia
6      Uganda
7    Pakistan
8       Sudan
9        Iran
Name: country, dtype: object

### 5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [69]:
geo_coverage = total_articles_region.sort_values(by=['articles_per_capita'],
                                                ascending=False).reset_index()
geo_coverage['region']

0     NORTHERN EUROPE
1             OCEANIA
2     SOUTHERN EUROPE
3           CARIBBEAN
4      WESTERN EUROPE
5      EASTERN EUROPE
6        WESTERN ASIA
7     SOUTHERN AFRICA
8      EASTERN AFRICA
9        CENTRAL ASIA
10      SOUTH AMERICA
11     WESTERN AFRICA
12    CENTRAL AMERICA
13      MIDDLE AFRICA
14    NORTHERN AFRICA
15     SOUTHEAST ASIA
16         SOUTH ASIA
17          EAST ASIA
Name: region, dtype: object

### 6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [70]:
geo_hq_coverage = hq_region_df.sort_values(by=['articles_per_capita'],
                                           ascending=False).reset_index()
geo_hq_coverage['region']

0     SOUTHERN EUROPE
1     NORTHERN EUROPE
2           CARIBBEAN
3             OCEANIA
4      EASTERN EUROPE
5      WESTERN EUROPE
6        WESTERN ASIA
7     SOUTHERN AFRICA
8     CENTRAL AMERICA
9      SOUTHEAST ASIA
10       CENTRAL ASIA
11     EASTERN AFRICA
12     WESTERN AFRICA
13      SOUTH AMERICA
14    NORTHERN AFRICA
15      MIDDLE AFRICA
16         SOUTH ASIA
17          EAST ASIA
Name: region, dtype: object