# Getting the Article and Population Data

The Wikipedia [Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries.
The population dataset is drawn from the [World Population Data Sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.


In [81]:
import pandas as pd
import numpy as np

In [82]:
# forward slashes avoid the invalid '\p' escape sequence and work on every platform
population_df=pd.read_excel('input/population_2022.xlsx')
politicians_df=pd.read_excel('input/politicians_2022.xlsx')

In [83]:
population_df.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [84]:
politicians_df.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


## Clean the population data and map regions against countries

### Identifying regions from the casing of the Geography column and mapping each country to its region

In [85]:
region=''
regions=[]
for geography in population_df['Geography']:
    # region names are written in upper case; every row that follows belongs to that region
    if geography.isupper():
        region=geography
    regions.append(region)
# assigning the whole column at once avoids chained-indexing assignment
population_df['Region']=regions


In [86]:
population_df.head()

Unnamed: 0,Geography,Population (millions),Region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
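
The same region mapping can also be written without an explicit loop. The following is a minimal sketch, assuming (as the loop above does) that region rows are exactly the ones whose Geography value is written entirely in upper case. The `toy` frame is a hypothetical miniature that reuses a few names from the sheet.

```python
import pandas as pd

# Hypothetical miniature of the population sheet (Geography column only)
toy = pd.DataFrame({'Geography': ['WORLD', 'AFRICA', 'NORTHERN AFRICA',
                                  'Algeria', 'Egypt', 'OCEANIA', 'Samoa']})

# Region rows are the ones written entirely in upper case
is_region = toy['Geography'].str.isupper()

# Keep the name on region rows, blank it elsewhere, then carry it forward
toy['Region'] = toy['Geography'].where(is_region).ffill()

# Dropping the region rows leaves one row per country with its region
countries = toy[~is_region].reset_index(drop=True)
print(countries)
```

Because the mask already identifies the region rows, it also replaces the later step that compares Geography with Region to drop them.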


### Removing the rows where the geography name is the same as the region name

In [87]:
population_df=population_df[population_df['Geography']!=population_df['Region']]

In [88]:
population_df

Unnamed: 0,Geography,Population (millions),Region
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.7,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA


### Fetching basic stats from the population and politician dataframes

In [89]:
print('Total number of rows are : '+ str(len(population_df)))
print('Total number of unique geographies are : '+ str(population_df['Geography'].nunique()))
print('Total number of unique regions are : '+ str(population_df['Region'].nunique()))

Total number of rows are : 209
Total number of unique geographies are : 209
Total number of unique regions are : 19


In [90]:
print('Total number of rows are : '+ str(len(politicians_df)))
print('Total number of unique names are : '+ str(politicians_df['name'].nunique()))


Total number of rows are : 7584
Total number of unique names are : 7534


### Drop duplicate rows based on name, url and country

A politician's name and url can appear more than once because a politician may be mapped to more than one country. We want to keep those entries, so we only remove rows that are duplicated across name, url and country together.

In [91]:
politicians_df.drop_duplicates(subset=['name','url','country'],inplace=True)
politicians_df.head()
print('Total number of rows are : '+ str(len(politicians_df)))

Total number of rows are : 7582
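
To illustrate the rule, here is a small sketch using the two Alexandra Benado entries that appear later in this notebook, plus one exact repeat that is hypothetical and added only for the demonstration.

```python
import pandas as pd

# Two rows taken from the dataset plus one hypothetical exact repeat
toy = pd.DataFrame({
    'name': ['Alexandra Benado', 'Alexandra Benado', 'Alexandra Benado'],
    'url': ['https://en.wikipedia.org/wiki/Alexandra_Benado'] * 3,
    'country': ['Chile', 'Sweden', 'Sweden'],
})

# Rows count as duplicates only if name, url AND country all match
deduped = toy.drop_duplicates(subset=['name', 'url', 'country'])

print(len(toy), '->', len(deduped))
print(deduped['country'].tolist())
```

The repeated Sweden row is dropped, while the Chile and Sweden rows both survive because they differ in country.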


## Make API calls to retrieve article metadata

In [14]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

In [15]:


# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ksahoo@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
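
As a quick worked example of the throttling arithmetic, the sketch below recomputes the wait from the constants above. The batch size of 50 and the title count of 7534 come from later cells in this notebook. The result is only a lower bound on the time spent sleeping, not a measurement.

```python
# Same constants as in the cell above
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

# Time budget per request: the sleep plus the assumed latency
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

# 7534 titles sent in batches of 50 (ceiling division gives the batch count)
batches = -(-7534 // 50)
min_total_sleep = batches * API_THROTTLE_WAIT

print(API_THROTTLE_WAIT, max_requests_per_second, batches, min_total_sleep)
```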


In [16]:
#########
#
#    PROCEDURES/FUNCTIONS
#
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
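
Note that `request_pageinfo_per_article` writes the titles into the shared template dictionary, so the template keeps the titles of the most recent call. A minimal sketch of a non-mutating alternative follows; `build_pageinfo_params` is a hypothetical helper, not part of the notebook.

```python
import copy

# Mirrors PAGEINFO_PARAMS_TEMPLATE from the constants cell
TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_params(titles, template=TEMPLATE):
    """Hypothetical helper: return fresh request parameters for a list of titles."""
    params = copy.copy(template)         # shallow copy is enough for a flat dict
    params["titles"] = "|".join(titles)  # several titles are joined with '|'
    return params

params = build_pageinfo_params(["Bison", "Red squirrel"])
print(params["titles"])
print(repr(TEMPLATE["titles"]))  # the shared template is left untouched
```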


In [17]:

# Getting the list of unique politician names
articles_list=list(politicians_df.name.unique())


In [18]:
len(articles_list)

7534

In [19]:
data={'pages':{}} # dictionary to collect the metadata of all articles
data_null=[] # list to collect the entries for titles that have no article
n=len(articles_list)
for i in range(0,n,50):
    article_arg='|'.join(articles_list[i:min((i+50),n)]) # request up to 50 titles per api call
    info = request_pageinfo_per_article(article_arg)
    if info is None: # the request failed, so there is nothing to record for this batch
        continue
    if '-1' in info['query']['pages'].keys(): # a '-1' key marks a title with no matching page
        data_null.append(info['query']['pages'].pop('-1'))
    data['pages'].update(info['query']['pages']) # adding this batch to the accumulated results
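
The loop above pops only the `'-1'` key from each batch. To my understanding, when several titles in one batch are missing the API assigns them distinct negative keys (`'-1'`, `'-2'`, and so on), which is worth verifying against a live response. The sketch below therefore checks for the `'missing'` marker instead of a specific key. `batched`, `split_pages` and `sample_pages` are hypothetical names, and the sample is a hand-written fragment shaped like `info['query']['pages']`.

```python
def batched(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def split_pages(pages):
    """Separate pages that exist from entries flagged as missing."""
    found, missing = {}, []
    for page_id, page in pages.items():
        if 'missing' in page:
            missing.append(page)
        else:
            found[page_id] = page
    return found, missing

# Hypothetical response fragment shaped like info['query']['pages']
sample_pages = {
    '65412901': {'pageid': 65412901, 'title': 'Abas Basir'},
    '-1': {'title': 'Prince Ofosu Sefah', 'missing': ''},
    '-2': {'title': 'Harjit Kaur Talwandi', 'missing': ''},
}
found, missing = split_pages(sample_pages)
print(list(found), [page['title'] for page in missing])
```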

In [20]:
#list of the articles for which no response was found
data_null

[{'ns': 0,
  'title': 'Prince Ofosu Sefah',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl': 'https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Prince_Ofosu_Sefah&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah'},
 {'ns': 0,
  'title': 'Harjit Kaur Talwandi',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl': 'https://en.wikipedia.org/wiki/Harjit_Kaur_Talwandi',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Harjit_Kaur_Talwandi&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Harjit_Kaur_Talwandi'},
 {'ns': 0,
  'title': 'Abd al-Razzaq al-Hasani',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl

In [21]:
data_list=[] #empty list to append all the metadata of articles by iterating through the dictionary obtained in the previous step
for k,v in data['pages'].items():
    data_list.append(v)

In [22]:
df_politician_wiki= pd.DataFrame(data_list)

In [23]:
df_politician_wiki.head()

Unnamed: 0,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,redirect,new
0,65412901,0,Abas Basir,wikitext,en,en,ltr,2022-10-11T01:20:40Z,1098419766,19306,65415333.0,https://en.wikipedia.org/wiki/Abas_Basir,https://en.wikipedia.org/w/index.php?title=Aba...,https://en.wikipedia.org/wiki/Abas_Basir,,,
1,27428272,0,Abdul Baqi Turkistani,wikitext,en,en,ltr,2022-10-11T03:06:55Z,889226470,1297,27595416.0,https://en.wikipedia.org/wiki/Abdul_Baqi_Turki...,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Baqi_Turki...,,,
2,42972519,0,Abdul Ghafar Lakanwal,wikitext,en,en,ltr,2022-09-26T05:36:04Z,943562276,4165,42972696.0,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,,,
3,29443640,0,Abdul Ghani Ghani,wikitext,en,en,ltr,2022-10-10T23:29:32Z,1072441893,1352,29453228.0,https://en.wikipedia.org/wiki/Abdul_Ghani_Ghani,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Ghani_Ghani,,,
4,44098744,0,Abdul Malik Hamwar,wikitext,en,en,ltr,2022-10-10T23:30:44Z,1100874645,3512,44237349.0,https://en.wikipedia.org/wiki/Abdul_Malik_Hamwar,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Malik_Hamwar,,,


In [24]:
len(df_politician_wiki)

7529

In [25]:
# mask=np.isnan(df_politician_wiki['pageid'])
# df_politician_wiki_nan=df_politician_wiki[mask]
# df_politician_wiki.dropna(subset='pageid',axis=0, inplace=True)
df_politician_wiki['lastrevid']=df_politician_wiki['lastrevid'].astype('str')
# len(df_politician_wiki)

In [26]:
with open("politicians.json", "w") as outfile: #saving the dictionary of articles with metadata in a json file
    json.dump(data, outfile)

Next, we use the latest revision id (lastrevid) of each article to retrieve its predicted quality through ORES API calls.

## Make API calls to score articles through ORES

In [27]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={models}&revids={revids}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ksahoo@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revids" : "",               # the revision to be scored - this will probably change each call
    "models": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article revisions. Therefore, the main parameter is article_revid.

In [28]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revids'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    # the URL already contains a query string, so the extra parameter is appended with '&'
    if features:
        request_url = request_url+"&features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
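
An alternative to string concatenation is to let `urllib.parse.urlencode` assemble the query string, so every parameter is joined with `&` automatically. This is a sketch only. `build_ores_url` is a hypothetical helper, and it percent-encodes the `|` separator as `%7C`, which I am assuming the endpoint accepts in the same way as a literal `|`.

```python
from urllib.parse import urlencode

API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"

def build_ores_url(revids, context="enwiki", models="articlequality", features=False):
    """Hypothetical helper: build the scoring URL for a list of revision ids."""
    query = {"models": models, "revids": "|".join(str(r) for r in revids)}
    if features:
        query["features"] = "true"
    # urlencode joins parameters with '&' and percent-encodes '|' as %7C
    return f"{API_ORES_SCORE_ENDPOINT}/scores/{context}/?{urlencode(query)}"

# Sample revision ids taken from ARTICLE_REVISIONS above
url = build_ores_url([1085687913, 1086582504], features=True)
print(url)
```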


In [29]:
#getting the list of unique revids
rev_id_list=list(df_politician_wiki.lastrevid.unique())

In [30]:
scores={'scores':{}} #empty dictionary to collect scores for all the revids.
n=len(rev_id_list) #using the length of the list that is actually being sliced
for i in range(0,n,50):
    revid_arg='|'.join(rev_id_list[i:min((i+50),n)]) #scoring up to 50 revisions per api request
    info = request_ores_score_per_article(revid_arg)
    if info is None: #skipping the revids for which there is no api response
        continue
    scores['scores'].update(info['enwiki']['scores']) #adding this batch of article quality scores to the dictionary

In [92]:
len(scores['scores'])

7529

In [31]:
scores['scores']['1013838830'] #checking the json for one revid

{'articlequality': {'score': {'prediction': 'Stub',
   'probability': {'B': 0.017602679257176707,
    'C': 0.03741549736464354,
    'FA': 0.00349656390489947,
    'GA': 0.008839274681301947,
    'Start': 0.24961539813724162,
    'Stub': 0.6830305866547366}}}}

In [32]:
revids = [] #empty list to store revids
pred = [] #empty list to store article quality for each revid
for revid, v in scores['scores'].items():
    revids.append(revid)
    pred.append(v['articlequality']['score']['prediction'])

In [33]:
#creating dataframe from the list created above
final_politician_df=pd.DataFrame({'revision_id':revids,'article_quality':pred})

In [34]:
final_politician_df.head()

Unnamed: 0,revision_id,article_quality
0,1013838830,Stub
1,1033383351,Stub
2,1038918070,Start
3,1041460606,B
4,1060707209,Start
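
The extraction loop above assumes every entry contains a `'score'` key. As a hedged sketch, the helper below tolerates entries that lack one. The `'error'` entry in the sample is a hypothetical shape for a revision that could not be scored, and `extract_predictions` is an illustrative name rather than part of the notebook.

```python
def extract_predictions(score_map, model='articlequality'):
    """Return ({revid: prediction}, [revids without a usable score])."""
    predictions, unscored = {}, []
    for revid, result in score_map.items():
        score = result.get(model, {}).get('score')
        if score is None:
            unscored.append(revid)
        else:
            predictions[revid] = score['prediction']
    return predictions, unscored

sample = {
    # Shape copied from the response shown above
    '1013838830': {'articlequality': {'score': {'prediction': 'Stub'}}},
    # Hypothetical entry: assumed shape for a revision with no score
    '999': {'articlequality': {'error': {'message': 'placeholder'}}},
}
predictions, unscored = extract_predictions(sample)
print(predictions, unscored)
```

Keeping the unscored revision ids in a separate list makes it possible to report them instead of silently dropping them.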


# Combining the datasets

### merging the score table and the metadata table

In [37]:
#Map the titles against the revid
final_politician_df1=final_politician_df.merge(df_politician_wiki[['title','lastrevid','fullurl']], left_on='revision_id',right_on='lastrevid')
final_politician_df1.drop('lastrevid',axis=1,inplace=True)

In [38]:
final_politician_df1.head()

Unnamed: 0,revision_id,article_quality,title,fullurl
0,1013838830,Stub,Mohammad Asim Asim,https://en.wikipedia.org/wiki/Mohammad_Asim_Asim
1,1033383351,Stub,Aimal Faizi,https://en.wikipedia.org/wiki/Aimal_Faizi
2,1038918070,Start,Mohammad Sarwar Ahmedzai,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...
3,1041460606,B,Sharif Ghalib,https://en.wikipedia.org/wiki/Sharif_Ghalib
4,1060707209,Start,Bashir Ahmad Bezan,https://en.wikipedia.org/wiki/Bashir_Ahmad_Bezan


In [39]:
len(final_politician_df1)

7529

### merging the previous dataframe with the original politician dataset to map the countries

We can expect the number of rows to increase because some politicians are affiliated with more than one country and therefore appear once per country in the politician dataset.

In [40]:
#Mapping each article to its countries by joining the article title to the politician name
final_politician_df2=final_politician_df1.merge(politicians_df, left_on=['title'],right_on=['name'])
final_politician_df2.drop('name',axis=1,inplace=True)

In [41]:
final_politician_df2.head()

Unnamed: 0,revision_id,article_quality,title,fullurl,url,country
0,1013838830,Stub,Mohammad Asim Asim,https://en.wikipedia.org/wiki/Mohammad_Asim_Asim,https://en.wikipedia.org/wiki/Mohammad_Asim_Asim,Afghanistan
1,1033383351,Stub,Aimal Faizi,https://en.wikipedia.org/wiki/Aimal_Faizi,https://en.wikipedia.org/wiki/Aimal_Faizi,Afghanistan
2,1038918070,Start,Mohammad Sarwar Ahmedzai,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,Afghanistan
3,1041460606,B,Sharif Ghalib,https://en.wikipedia.org/wiki/Sharif_Ghalib,https://en.wikipedia.org/wiki/Sharif_Ghalib,Afghanistan
4,1060707209,Start,Bashir Ahmad Bezan,https://en.wikipedia.org/wiki/Bashir_Ahmad_Bezan,https://en.wikipedia.org/wiki/Bashir_Ahmad_Bezan,Afghanistan


In [42]:
len(final_politician_df2)

7577

In [43]:
x= politicians_df.groupby(by=['name'])['name'].count()
print(x[x>1].head())
sum(x[x>1])

name
Alexandra Benado             2
Ali al-Qaradaghi             2
Antonio Gutiérrez y Ulloa    2
Antonín Janoušek             2
Ashab Uddin Ahmad            2
Name: name, dtype: int64


94

In [44]:
politicians_df[politicians_df['name']=='Alexandra Benado']

Unnamed: 0,name,url,country
1220,Alexandra Benado,https://en.wikipedia.org/wiki/Alexandra_Benado,Chile
6729,Alexandra Benado,https://en.wikipedia.org/wiki/Alexandra_Benado,Sweden
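
The row growth can be reproduced on a toy example. In this sketch the quality labels are placeholders rather than real scores. The `validate='one_to_many'` argument asks pandas to raise an error if a title appears more than once on the left side, which guards against accidental many-to-many joins.

```python
import pandas as pd

# One row per article; the quality labels here are placeholders
scored = pd.DataFrame({
    'title': ['Alexandra Benado', 'Tayyab Agha'],
    'article_quality': ['Start', 'Stub'],
})

# One row per (politician, country) pair, as in politicians_df
politicians = pd.DataFrame({
    'name': ['Alexandra Benado', 'Alexandra Benado', 'Tayyab Agha'],
    'country': ['Chile', 'Sweden', 'Afghanistan'],
})

merged = scored.merge(politicians, left_on='title', right_on='name',
                      validate='one_to_many')
print(len(scored), '->', len(merged))
print(merged[['title', 'country']])
```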


### Merging the previously obtained dataframe with population dataset to get region and population

In [45]:
final_politician_df3=final_politician_df2.merge(population_df, how='outer', left_on='country',right_on='Geography')


In [46]:
final_politician_df3.head()

Unnamed: 0,revision_id,article_quality,title,fullurl,url,country,Geography,Population (millions),Region
0,1013838830,Stub,Mohammad Asim Asim,https://en.wikipedia.org/wiki/Mohammad_Asim_Asim,https://en.wikipedia.org/wiki/Mohammad_Asim_Asim,Afghanistan,Afghanistan,41.1,SOUTH ASIA
1,1033383351,Stub,Aimal Faizi,https://en.wikipedia.org/wiki/Aimal_Faizi,https://en.wikipedia.org/wiki/Aimal_Faizi,Afghanistan,Afghanistan,41.1,SOUTH ASIA
2,1038918070,Start,Mohammad Sarwar Ahmedzai,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,Afghanistan,Afghanistan,41.1,SOUTH ASIA
3,1041460606,B,Sharif Ghalib,https://en.wikipedia.org/wiki/Sharif_Ghalib,https://en.wikipedia.org/wiki/Sharif_Ghalib,Afghanistan,Afghanistan,41.1,SOUTH ASIA
4,1060707209,Start,Bashir Ahmad Bezan,https://en.wikipedia.org/wiki/Bashir_Ahmad_Bezan,https://en.wikipedia.org/wiki/Bashir_Ahmad_Bezan,Afghanistan,Afghanistan,41.1,SOUTH ASIA


In [47]:
len(final_politician_df3)

7602

In [48]:
#checking the number of rows for which no region mapping was found
sum(final_politician_df3['Region'].isnull())

70

In [49]:
sum(final_politician_df3['country'].isnull())

25
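
One way to see which side each unmatched row came from is the `indicator` option of `merge`, which adds a `_merge` column. A minimal sketch on a toy pair of frames follows. The names are taken from the notebook's data, but which side each one falls on here is purely illustrative.

```python
import pandas as pd

politicians = pd.DataFrame({'country': ['Afghanistan', 'Korean']})
population = pd.DataFrame({'Geography': ['Afghanistan', 'Brunei']})

merged = politicians.merge(population, how='outer', left_on='country',
                           right_on='Geography', indicator=True)

# 'left_only': a politician country with no population row
no_population = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()
# 'right_only': a population row with no politician article
no_articles = merged.loc[merged['_merge'] == 'right_only', 'Geography'].tolist()

print(no_population, no_articles)
```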

In [111]:
#getting the names of the country for which no region mapping was found
x=list(final_politician_df3[final_politician_df3['Geography'].isnull()]['country'].unique())

In [112]:
#getting the names of the geographies in the population data for which no politician article was found

y=list(final_politician_df3[final_politician_df3['country'].isnull()]['Geography'].unique())

In [116]:
no_match_list=x+y

In [115]:
print(no_match_list)

['Korean', 'Western Sahara', 'Mauritius', 'Mayotte', 'Reunion', 'Sao Tome and Principe', 'eSwatini', 'Canada', 'United States', 'Curacao', 'Guadeloupe', 'Martinique', 'Puerto Rico', 'French Guiana', 'Brunei', 'Philippines', 'China,  Hong Kong SAR', 'China,  Macao SAR', 'Ireland', 'United Kingdom', 'Australia', 'French Polynesia', 'Guam', 'Kiribati', 'New Caledonia', 'New Zealand']


In [52]:
#dropping the rows where either the politician's country or the population geography is missing
final_politician_df4=final_politician_df3.dropna(subset=['Geography','country'])

In [117]:
with open('output/wp_countries-no_match.txt', 'w') as fp:
    for item in no_match_list:
        # write each item on a new line
        fp.write("No response caught for %s\n" % item)
    print('Done')

Done


In [53]:
len(final_politician_df4)

7507

In [54]:
#renaming the column names (reassigning the result instead of renaming a slice in place)
final_politician_df4=final_politician_df4.rename(columns={'title':'article_title','Population (millions)':'population','Region':'region'})


In [55]:
final_politician_df4.columns

Index(['revision_id', 'article_quality', 'article_title', 'fullurl', 'url',
       'country', 'Geography', 'population', 'region'],
      dtype='object')

In [56]:
#keeping the desired columns
final_politician_df5=final_politician_df4[['country','region','population','article_title','revision_id','article_quality']]

In [57]:
final_politician_df5.head()

Unnamed: 0,country,region,population,article_title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,41.1,Mohammad Asim Asim,1013838830,Stub
1,Afghanistan,SOUTH ASIA,41.1,Aimal Faizi,1033383351,Stub
2,Afghanistan,SOUTH ASIA,41.1,Mohammad Sarwar Ahmedzai,1038918070,Start
3,Afghanistan,SOUTH ASIA,41.1,Sharif Ghalib,1041460606,B
4,Afghanistan,SOUTH ASIA,41.1,Bashir Ahmad Bezan,1060707209,Start


In [58]:
#writing the final dataset to csv
final_politician_df5.to_csv('wp_politicians_by_country.csv')

# Analysis

### getting high quality articles

In [59]:
high_quality = final_politician_df5[(final_politician_df5['article_quality'] == 'FA') | (final_politician_df5['article_quality'] == 'GA')]
high_quality.head()

Unnamed: 0,country,region,population,article_title,revision_id,article_quality
22,Afghanistan,SOUTH ASIA,41.1,Shahjahan Noori,1099689043,GA
62,Afghanistan,SOUTH ASIA,41.1,Ahmed Wali Karzai,1090245979,GA
74,Afghanistan,SOUTH ASIA,41.1,Masoud Khalili,1103105365,GA
90,Afghanistan,SOUTH ASIA,41.1,Amrullah Saleh,1115022704,FA
110,Afghanistan,SOUTH ASIA,41.1,Abdul Salam Zaeef,1106388504,GA


In [60]:
high_quality_country=high_quality.groupby(by='country')['revision_id'].count().reset_index()
high_quality_country.columns=['country','no_articles']
high_quality_country.head()

Unnamed: 0,country,no_articles
0,Afghanistan,6
1,Albania,6
2,Andorra,2
3,Armenia,1
4,Azerbaijan,1


### calculating total-articles-per-population

In [61]:
population_df.head()

Unnamed: 0,Geography,Population (millions),Region
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA


In [62]:
temp_df=population_df[population_df['Geography'].isin(final_politician_df5.country.unique())]


In [63]:
final_politician_df5.country.nunique()

184

In [64]:
temp_df

Unnamed: 0,Geography,Population (millions),Region
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.7,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA


Since the number of unique countries in the final dataframe (final_politician_df5.country.nunique()) is the same as the number of rows in temp_df, each country matches exactly one row of the population data and therefore has exactly one region.

In [65]:

national_articles = final_politician_df5.groupby('country')['revision_id'].count().reset_index()
national_articles.columns=['country','no_articles']
national_articles=national_articles.merge(high_quality_country,how='left',left_on='country',right_on='country')
national_articles.columns=['country','no_articles','no_articles_high']
national_articles['no_articles_high']=national_articles['no_articles_high'].fillna(0)
# national_articles.drop('country',axis=1,inplace=True)
national_articles.head()

Unnamed: 0,country,no_articles,no_articles_high
0,Afghanistan,118,6.0
1,Albania,83,6.0
2,Algeria,34,0.0
3,Andorra,10,2.0
4,Angola,42,0.0


In [66]:
articles_by_country=temp_df.merge(national_articles,how='left',left_on='Geography',right_on='country')
articles_by_country.drop('Geography',axis=1,inplace=True)
articles_by_country.head()

Unnamed: 0,Population (millions),Region,country,no_articles,no_articles_high
0,44.9,NORTHERN AFRICA,Algeria,34,0.0
1,103.5,NORTHERN AFRICA,Egypt,14,0.0
2,6.8,NORTHERN AFRICA,Libya,30,2.0
3,36.7,NORTHERN AFRICA,Morocco,62,1.0
4,46.9,NORTHERN AFRICA,Sudan,33,1.0


In [67]:
articles_by_country['per_capita_articles_country(no. per million)']=articles_by_country['no_articles']/articles_by_country['Population (millions)']
articles_by_country['per_capita_articles_country(no. per million)_high']=articles_by_country['no_articles_high']/articles_by_country['Population (millions)']

In [99]:
articles_by_country[articles_by_country['country']=='Antigua and Barbuda']

Unnamed: 0,Population (millions),Region,country,no_articles,no_articles_high,per_capita_articles_country(no. per million),per_capita_articles_country(no. per million)_high
59,0.1,CARIBBEAN,Antigua and Barbuda,17,0.0,170.0,0.0
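
As a worked example of the per capita calculation, the sketch below recomputes three rows using values that appear in this notebook's outputs. It also shows what happens when the population is listed as 0.0: a non-zero count divided by zero becomes `inf`, and zero divided by zero becomes `NaN`. The shortened column names are for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Antigua and Barbuda', 'China', 'Monaco'],
    'Population (millions)': [0.1, 1436.6, 0.0],
    'no_articles': [17, 2, 13],
    'no_articles_high': [0.0, 0.0, 0.0],
})

# Articles per million people
toy['per_million'] = toy['no_articles'] / toy['Population (millions)']
toy['per_million_high'] = toy['no_articles_high'] / toy['Population (millions)']

print(toy[['country', 'per_million', 'per_million_high']])
```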


In [69]:
articles_by_country.sort_values(by = ['per_capita_articles_country(no. per million)']).head()

Unnamed: 0,Population (millions),Region,country,no_articles,no_articles_high,per_capita_articles_country(no. per million),per_capita_articles_country(no. per million)_high
125,1436.6,EAST ASIA,China,2,0.0,0.001392,0.0
56,127.5,CENTRAL AMERICA,Mexico,1,0.0,0.007843,0.0
97,36.7,WESTERN ASIA,Saudi Arabia,3,2.0,0.081744,0.054496
154,19.0,EASTERN EUROPE,Romania,2,2.0,0.105263,0.105263
110,1417.2,SOUTH ASIA,India,178,6.0,0.1256,0.004234


In [70]:
articles_by_region=articles_by_country.groupby('Region').agg({'Population (millions)':'sum','no_articles':'sum','no_articles_high':'sum'}).reset_index()
articles_by_region.columns=['Region','Population (millions)','no_articles','no_articles_high']
articles_by_region['per_capita_articles_region(no. per million)']=articles_by_region['no_articles']/articles_by_region['Population (millions)']
articles_by_region['per_capita_articles_region(no. per million)_high']=articles_by_region['no_articles_high']/articles_by_region['Population (millions)']
articles_by_region

Unnamed: 0,Region,Population (millions),no_articles,no_articles_high,per_capita_articles_region(no. per million),per_capita_articles_region(no. per million)_high
0,CARIBBEAN,39.5,201,8.0,5.088608,0.202532
1,CENTRAL AMERICA,177.9,195,10.0,1.096121,0.056211
2,CENTRAL ASIA,78.0,106,3.0,1.358974,0.038462
3,EAST ASIA,1665.8,246,16.0,0.147677,0.009605
4,EASTERN AFRICA,470.3,648,15.0,1.377844,0.031895
5,EASTERN EUROPE,287.4,735,38.0,2.557411,0.13222
6,MIDDLE AFRICA,195.9,203,5.0,1.036243,0.025523
7,NORTHERN AFRICA,250.6,227,7.0,0.905826,0.027933
8,NORTHERN EUROPE,33.8,262,8.0,7.751479,0.236686
9,OCEANIA,11.7,86,2.0,7.350427,0.17094
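
The regional figure is a pooled rate (total articles divided by total population), which is not the same as averaging the country rates. The sketch below shows the difference using only the five NORTHERN AFRICA rows displayed earlier, so its numbers will not match the full regional totals in the table.

```python
import pandas as pd

# The five NORTHERN AFRICA rows shown earlier (not the full region)
toy = pd.DataFrame({
    'Region': ['NORTHERN AFRICA'] * 5,
    'country': ['Algeria', 'Egypt', 'Libya', 'Morocco', 'Sudan'],
    'Population (millions)': [44.9, 103.5, 6.8, 36.7, 46.9],
    'no_articles': [34, 14, 30, 62, 33],
})

# Pooled rate: sum the counts and the populations first, then divide
totals = toy.groupby('Region')[['Population (millions)', 'no_articles']].sum()
pooled_rate = totals['no_articles'] / totals['Population (millions)']

# Mean of country rates: small countries weigh as much as large ones
country_rate = toy['no_articles'] / toy['Population (millions)']
mean_of_rates = country_rate.groupby(toy['Region']).mean()

print(pooled_rate['NORTHERN AFRICA'], mean_of_rates['NORTHERN AFRICA'])
```

The pooled rate weights each country by its population, which is the usual meaning of per capita at the regional level.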


# Results

In [102]:
articles_by_country=articles_by_country.sort_values(by='per_capita_articles_country(no. per million)',ascending=True)

In [103]:
articles_by_country.head()

Unnamed: 0,Population (millions),Region,country,no_articles,no_articles_high,per_capita_articles_country(no. per million),per_capita_articles_country(no. per million)_high
125,1436.6,EAST ASIA,China,2,0.0,0.001392,0.0
56,127.5,CENTRAL AMERICA,Mexico,1,0.0,0.007843,0.0
97,36.7,WESTERN ASIA,Saudi Arabia,3,2.0,0.081744,0.054496
154,19.0,EASTERN EUROPE,Romania,2,2.0,0.105263,0.105263
110,1417.2,SOUTH ASIA,India,178,6.0,0.1256,0.004234


### Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) 

In [104]:
#This includes the countries whose population is listed as 0.0 because the data is in millions, rounded to one decimal place.
#Countries with fewer than about 50,000 people are rounded to 0.0, which makes their per capita value infinite.
articles_by_country.tail(10)[['country','per_capita_articles_country(no. per million)']].iloc[::-1]

Unnamed: 0,country,per_capita_articles_country(no. per million)
143,Liechtenstein,inf
169,San Marino,inf
182,Tuvalu,inf
177,Palau,inf
145,Monaco,inf
176,Nauru,inf
59,Antigua and Barbuda,170.0
173,Federated States of Micronesia,130.0
159,Andorra,100.0
61,Barbados,93.333333


In [105]:
articles_by_country[articles_by_country['per_capita_articles_country(no. per million)']!=np.inf]\
                    .tail(10)[['country','per_capita_articles_country(no. per million)']].iloc[::-1]

Unnamed: 0,country,per_capita_articles_country(no. per million)
59,Antigua and Barbuda,170.0
173,Federated States of Micronesia,130.0
159,Andorra,100.0
61,Barbados,93.333333
175,Marshall Islands,90.0
166,Montenegro,60.0
32,Seychelles,60.0
144,Luxembourg,52.857143
109,Bhutan,51.25
65,Grenada,50.0


In [74]:
articles_by_country[articles_by_country['country']=='Monaco']

Unnamed: 0,Population (millions),Region,country,no_articles,no_articles_high,per_capita_articles_country(no. per million),per_capita_articles_country(no. per million)_high
145,0.0,WESTERN EUROPE,Monaco,13,0.0,inf,


## Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order)

In [75]:
articles_by_country.head(10)[['country','per_capita_articles_country(no. per million)']]

Unnamed: 0,country,per_capita_articles_country(no. per million)
125,China,0.001392
56,Mexico,0.007843
97,Saudi Arabia,0.081744
154,Romania,0.105263
110,India,0.1256
115,Sri Lanka,0.133929
1,Egypt,0.135266
26,Ethiopia,0.202593
130,Taiwan,0.215517
124,Vietnam,0.27163


In [76]:
articles_by_country=articles_by_country.sort_values(by='per_capita_articles_country(no. per million)_high',ascending=True)

## Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

In [97]:
articles_by_country.tail(10)[['country','per_capita_articles_country(no. per million)_high']].iloc[::-1]

Unnamed: 0,country,per_capita_articles_country(no. per million)_high
143,Liechtenstein,
177,Palau,
176,Nauru,
145,Monaco,
169,San Marino,
182,Tuvalu,inf
159,Andorra,20.0
166,Montenegro,5.0
158,Albania,2.142857
81,Suriname,1.666667
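
The ranking above mixes `NaN` and `inf` values with real rates, because countries with a listed population of 0.0 cannot be ranked meaningfully. One possible way to exclude them, sketched here on a few values taken from the table above, is to keep only finite rates before picking the largest ones.

```python
import numpy as np
import pandas as pd

col = 'per_capita_articles_country(no. per million)_high'
toy = pd.DataFrame({
    'country': ['Monaco', 'Tuvalu', 'Andorra', 'Montenegro', 'Albania'],
    col: [np.nan, np.inf, 20.0, 5.0, 2.142857],
})

# np.isfinite is False for both NaN and inf, so both kinds of rows are dropped
finite = toy[np.isfinite(toy[col])]

# nlargest returns the rows already sorted in descending order
top = finite.nlargest(3, col)
print(top)
```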


## Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [78]:
articles_by_country.head(10)[['country','per_capita_articles_country(no. per million)_high']]

Unnamed: 0,country,per_capita_articles_country(no. per million)_high
125,China,0.0
60,Bahamas,0.0
51,Belize,0.0
133,Finland,0.0
45,Equatorial Guinea,0.0
67,Jamaica,0.0
49,Namibia,0.0
96,Qatar,0.0
43,Congo,0.0
152,Moldova,0.0


## Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [79]:
articles_by_region=articles_by_region.sort_values(by='per_capita_articles_region(no. per million)',ascending=False)
articles_by_region[['Region','per_capita_articles_region(no. per million)']]

Unnamed: 0,Region,per_capita_articles_region(no. per million)
8,NORTHERN EUROPE,7.751479
9,OCEANIA,7.350427
14,SOUTHERN EUROPE,5.897946
0,CARIBBEAN,5.088608
17,WESTERN EUROPE,3.550025
5,EASTERN EUROPE,2.557411
16,WESTERN ASIA,2.330955
13,SOUTHERN AFRICA,1.732746
4,EASTERN AFRICA,1.377844
2,CENTRAL ASIA,1.358974


## Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita

In [80]:
articles_by_region=articles_by_region.sort_values(by='per_capita_articles_region(no. per million)_high',ascending=False)
articles_by_region[['Region','per_capita_articles_region(no. per million)_high']]

Unnamed: 0,Region,per_capita_articles_region(no. per million)_high
14,SOUTHERN EUROPE,0.304838
8,NORTHERN EUROPE,0.236686
0,CARIBBEAN,0.202532
9,OCEANIA,0.17094
5,EASTERN EUROPE,0.13222
17,WESTERN EUROPE,0.111732
16,WESTERN ASIA,0.095141
13,SOUTHERN AFRICA,0.058737
1,CENTRAL AMERICA,0.056211
12,SOUTHEAST ASIA,0.04288
