**PySDS Week 02 Day 02 v.1 - Exercise - File Types and Text Processing I**

Today we will work through some regular expressions (yay) and some dataframe manipulation. Recall that we used the Canada Wikipedia page as an example. Today you will be asked to read in several pages, compare them on a number of features in a dataframe, and report on what you found. Below is code that you can use to download a Wikipedia page as data.

In [66]:
import urllib.parse
import urllib.request
import bs4
# You can set the page argument to any string that is the title of a Wikipedia page.

def getWikiPage(page="United Kingdom"): 
    '''Returns the XML found using special export of a Wikipedia page.'''
    
    # Here we use urllib.parse.quote to turn spaces and special characters into
    # the percent-encoded characters needed for a URL. For example, spaces become %20.

    URL = "http://en.wikipedia.org/wiki/Special:Export/%s" % urllib.parse.quote(page)

#     print(URL,"\n") # commented out as not necessary for later tasks

    req = urllib.request.Request( URL, headers={'User-Agent': 'OII SDS class 2018.1/Hogan'})
    infile = urllib.request.urlopen(req)

    return infile.read()

# Testing
data = getWikiPage()
soup = bs4.BeautifulSoup(data.decode('utf8'), "lxml")
print(soup.mediawiki.page.revision.id)


<id>864288365</id>
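The percent-encoding step inside `getWikiPage` can be checked without downloading anything. The sketch below is only an illustration: `export_url` is a hypothetical helper, not part of the exercise code, and the second title is simply an example of a name with an accent and an apostrophe.

```python
from urllib.parse import quote

# Same URL pattern as getWikiPage above; nothing is downloaded here.
BASE = "http://en.wikipedia.org/wiki/Special:Export/%s"


def export_url(page):
    """Return the Special:Export URL for a page title (hypothetical helper)."""
    return BASE % quote(page)


# Spaces, accented letters and apostrophes are all percent-encoded.
for title in ["United Kingdom", "Côte d'Ivoire"]:
    print(export_url(title))
```

Printing the URL first is a cheap way to confirm that a country name with unusual characters will be requested correctly before adding it to your list.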


In [71]:
# Now, select 10 countries and place them in a list. 
# These will be rows in a dataframe. 
# For each of the ten countries, 
# find the following features from parsing their wikipedia page: 
# 1. The number of internal wikilinks. 
# 2. The number of external links. 
# 3. The length of the page (in characters)
# 4. The population of the country. 
#   - This last one will be very tricky. It's okay if you cannot get the 
#     regex working, or if you have to build multiple regexes. 
#     Please simply document this. 

# Print the following: 
# The rank order of each of the columns. 
# For example, for wikilinks you might print 
# (note: the numbers below are not accurate)

# Table 1. Number of <Wikilinks>
# Canada        46
# Germany       45
# France        24
# Netherlands   12
# ...

# answer below here
import re
import pandas as pd
from IPython.display import display  # built in to a notebook; needed if run as a script

countries = ['United Kingdom', 'Canada', 'Swaziland', 'Indonesia', 'Suriname',
             'Bhutan', 'Latvia', 'Madagascar', 'Yemen', 'Brazil']

# Compile the regexes once, outside the loop.
re_inner_links = re.compile(r'\[\[.*?\]\]')  # regex as before for internal links
re_outer_links = re.compile(r'https?://[\w\./?&=%]*')  # regex as before for external links
# Captures the string of digits (\d) and commas (,) that follows "population_[some word] = ".
re_population = re.compile(r'population_[a-z]*\s*=\s*([\d,]*)')

columns = ['Wikilinks', 'OuterLinks', 'U_Wikilinks', 'U_OuterLinks', 'PageLength', 'Population']
countries_df = pd.DataFrame(columns=columns)  # initialise df
text_to_parse = {}  # text per country (not really necessary, but handy for quick access later)

for country in countries:
    print(country)  # progress indicator
    data = getWikiPage(country)  # gets page data
    soup = bs4.BeautifulSoup(data.decode('utf8'), "lxml")
    text_to_parse[country] = soup.mediawiki.page.text  # get the text to parse

    inner_links = re_inner_links.findall(text_to_parse[country])
    outer_links = re_outer_links.findall(text_to_parse[country])

    # Page length in characters (len(text_to_parse[country]) gives a slightly different result).
    pagelength = len(soup.text)

    # Population: some pages give it as a wiki template, {{UN Population|countryname}},
    # which is not trivial to fetch with our current approach. Usually (always?) at least
    # one of population estimate or population census has the number written out.
    # We're not going to be fussy, and just select the first of these that has a value.
    populationlist = re_population.findall(text_to_parse[country])  # a list: occasionally there are 2 matches
    # First non-empty string, commas removed, cast to int (raises IndexError if there is none).
    population = int([s for s in populationlist if s][0].replace(',', ''))

    # Add the data as a row in the df.
    countries_df.loc[country] = [len(inner_links), len(outer_links), len(set(inner_links)),
                                 len(set(outer_links)), pagelength, population]

display(countries_df)  # display df

# Display each column in rank order (descending).
for column in countries_df.columns:
    print('\n\nSorted by', column)
    display(countries_df[column].sort_values(ascending=False))
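Before trusting the link counts, it helps to run the two link regexes from the answer on a tiny string where the right answer is obvious. The sample text below is invented for illustration and is not taken from any real page.

```python
import re

# The same two patterns as in the answer above.
re_inner_links = re.compile(r'\[\[.*?\]\]')
re_outer_links = re.compile(r'https?://[\w\./?&=%]*')

# Invented sample wikitext, not taken from any real page.
sample = (
    "[[Ottawa]] is the capital of [[Canada]]. See [[Ottawa]] and "
    "[[File:Flag.svg|thumb]] or [[Category:Countries]]. "
    "Sources: https://example.org/stats?id=1 and "
    "http://my-example.org/report"
)

inner_links = re_inner_links.findall(sample)
outer_links = re_outer_links.findall(sample)

# Total and unique internal links: the repeated [[Ottawa]] is counted once in the set.
print(len(inner_links), len(set(inner_links)))
# The second URL is cut off at the hyphen, which is not in the character class.
print(outer_links)
```

Two things are worth documenting from this check. The internal pattern also counts `[[File:...]]` and `[[Category:...]]` entries as links, and the external pattern stops at any character outside its class (a hyphen, for example), so different URLs on the same hyphenated domain may collapse into one in the unique count.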



# Reviewer's comments




United Kingdom
Canada
Swaziland
Indonesia
Suriname
Bhutan
Latvia
Madagascar
Yemen
Brazil


                Wikilinks  OuterLinks  U_Wikilinks  U_OuterLinks  PageLength  Population
United Kingdom       1631         656         1469           610      325644    63181775
Canada                950         607          883           565      234879    37067011
Swaziland             322         120          258           115       81842     1093238
Indonesia             900         379          789           375      192566   237641326
Suriname              599          99          518            91       87635      541638
Bhutan                660         189          536           186      124736      634982
Latvia                918         258          730           241      162347     1925800
Madagascar            606         220          509           214      167082    12238914
Yemen                 923         206          775           194      207542    19685000
Brazil               1311         402         1154           378      242077   210147125
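The Population column comes from the `population_[a-z]*` regex in the answer, which returns an empty string whenever the value is a template rather than a number. The sketch below shows that behaviour on an invented infobox fragment (the country name "Examplia" and all figures are made up), and wraps the "first non-empty match" step in a hypothetical helper, `first_population`, that returns `None` instead of raising an error when nothing is found.

```python
import re

# The same pattern as in the answer above.
re_population = re.compile(r'population_[a-z]*\s*=\s*([\d,]*)')

# Invented infobox fragment in the style of a country page.
sample_infobox = """
| population_estimate = {{UN Population|Examplia}}
| population_estimate_year = 2018
| population_census = 1,234,567
| population_census_year = 2011
| population_density_km2 = 15.6
"""


def first_population(text):
    """Return the first explicit population figure as an int, or None (hypothetical helper)."""
    for match in re_population.findall(text):
        if match:  # skip empty captures, e.g. where the value is a template
            return int(match.replace(',', ''))
    return None


print(re_population.findall(sample_infobox))
print(first_population(sample_infobox))
print(first_population("no infobox here"))
```

Fields such as `population_estimate_year` are skipped because `[a-z]*` cannot cross the second underscore. Note that taking the first explicit figure can mean taking a census value rather than a current estimate, so the year of each figure may differ between countries.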




Sorted by Wikilinks


United Kingdom    1631
Brazil            1311
Canada             950
Yemen              923
Latvia             918
Indonesia          900
Bhutan             660
Madagascar         606
Suriname           599
Swaziland          322
Name: Wikilinks, dtype: object



Sorted by OuterLinks


United Kingdom    656
Canada            607
Brazil            402
Indonesia         379
Latvia            258
Madagascar        220
Yemen             206
Bhutan            189
Swaziland         120
Suriname           99
Name: OuterLinks, dtype: object



Sorted by U_Wikilinks


United Kingdom    1469
Brazil            1154
Canada             883
Indonesia          789
Yemen              775
Latvia             730
Bhutan             536
Suriname           518
Madagascar         509
Swaziland          258
Name: U_Wikilinks, dtype: object



Sorted by U_OuterLinks


United Kingdom    610
Canada            565
Brazil            378
Indonesia         375
Latvia            241
Madagascar        214
Yemen             194
Bhutan            186
Swaziland         115
Suriname           91
Name: U_OuterLinks, dtype: object



Sorted by PageLength


United Kingdom    325644
Brazil            242077
Canada            234879
Yemen             207542
Indonesia         192566
Madagascar        167082
Latvia            162347
Bhutan            124736
Suriname           87635
Swaziland          81842
Name: PageLength, dtype: object



Sorted by Population


Indonesia         237641326
Brazil            210147125
United Kingdom     63181775
Canada             37067011
Yemen              19685000
Madagascar         12238914
Latvia              1925800
Swaziland           1093238
Bhutan               634982
Suriname             541638
Name: Population, dtype: object
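Every sorted column above reports `dtype: object`. This is most likely because the dataframe was created empty and then filled one row at a time. As a minimal sketch of an alternative, the code below collects the rows first and builds the frame in one step, so that pandas can infer integer columns, then prints each column as a titled rank-order table. It uses three rows and three columns copied from the output above so that nothing needs to be downloaded; the names `rows` and `sample_df` are illustrative only.

```python
import pandas as pd

# Three rows copied from the output above, to avoid downloading anything.
rows = {
    'United Kingdom': {'Wikilinks': 1631, 'PageLength': 325644, 'Population': 63181775},
    'Canada': {'Wikilinks': 950, 'PageLength': 234879, 'Population': 37067011},
    'Indonesia': {'Wikilinks': 900, 'PageLength': 192566, 'Population': 237641326},
}

# Building the frame in one go lets pandas infer integer columns.
sample_df = pd.DataFrame.from_dict(rows, orient='index')

# One titled table per column, in descending rank order.
for number, column in enumerate(sample_df.columns, start=1):
    print('\nTable %d. %s' % (number, column))
    print(sample_df[column].sort_values(ascending=False).to_string())

print(sample_df.dtypes)
```

For a frame that has already been filled row by row, calling `.astype(int)` on it should have a similar effect, assuming every cell holds a whole number.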