
We will compute the [PageRank](https://en.wikipedia.org/wiki/PageRank) of the articles of the [Hawaiian](https://en.wikipedia.org/wiki/Hawaiian_language) Wikipedia, which is available at [haw.wikipedia.org](https://haw.wikipedia.org/wiki/Ka_papa_kinohi). Additional information about the Hawaiian wiki can be found [here](https://meta.wikimedia.org/wiki/List_of_Wikipedias).

_Hints: If you don't speak Hawaiian, you might want to learn the wiki logic from the English wikipedia, and translate your findings. Also, caching is recommended._

__(a)__ Use the special [AllPages](https://haw.wikipedia.org/wiki/Papa_nui:AllPages) page and understand its logic to retrieve the URLs of all articles in the Hawaiian Wikipedia. Make sure to skip redirects.

In [None]:
# a) 

import requests
import requests_cache
import lxml.html as lx
import re

In [None]:
# Papa_nui:AllPages was found by switching over from the English version of the page
url = '/w/index.php?title=Papa_nui:AllPages&from='
session = requests_cache.CachedSession('./source/disc088')
articles = []

In [None]:
while True: 
    result = session.get('https://haw.wikipedia.org' + url)
    result.raise_for_status()  # returns None on success and raises on an HTTP error
    html = lx.fromstring(result.text)
    # ignore all entries in the index that are redirections
    # the class attributes are found via inspecting the html
    articles.extend(html.xpath('//ul[@class="mw-allpages-chunk"]/li[not(@class="allpagesredirect")]/a/@href'))

    try: 
        # Mea aʻe means Next Page
        url = html.xpath('//div[@class="mw-allpages-nav"]/a[contains(text(), "Mea aʻe")]/@href')[0]
    except IndexError:  # no "next page" link, so the last index page was reached
        break

In [None]:
len(articles) 

__(b, i)__ Write a function that scans an article given by its URL and retrieves all links to other articles in the Hawaiian Wikipedia. Avoid links to special pages, images, or other websites. For links that point to a specific section, count only the article itself. Use regular expressions to handle these cases.  
__(ii)__ Make sure to match redirects to their correct destination article. To this end, find out how Wikipedia treats redirects and retrieve the true article. _(Help: Try searching for 'uc davis' on en.wikipedia.org.)_  
I used the collection of article URLs obtained in (a), stored in a dict to allow fast lookups. Each newly found link is checked against the dict. If it is missing, it might be a redirect and receives special treatment.  
__(iii)__ Request all articles and obtain all links to other articles.
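
The link filtering asked for in (i) can be tried out offline before any article is requested. The following is a minimal sketch that applies the same regular expression as the notebook code to a few hrefs. The helper name `normalize_link` and the sample hrefs are hypothetical and only serve as illustration.

```python
import re

# Same pattern as in the notebook's link filter, written as a raw string:
# - not preceded by 'org' (this excludes links to other wikis),
# - contains '/wiki/',
# - no colon after '/wiki/' (this excludes special pages and files),
# - stops before a within-page reference (#).
ARTICLE_LINK = re.compile(r'(?<!org)/wiki/(?!.*:)[^#]*')

def normalize_link(href):
    """Return the article path for an internal article link, else None."""
    match = ARTICLE_LINK.search(href)
    return match.group(0) if match else None

# Hypothetical hrefs, one per case described in the task.
samples = [
    '/wiki/Hawaii',                         # plain article link
    '/wiki/Hawaii#History',                 # link to a section
    '/wiki/Papa_nui:AllPages',              # special page
    'https://en.wikipedia.org/wiki/Spain',  # another website
    '/w/index.php?title=Foo&action=edit',   # not an article view
]
cleaned = [normalize_link(h) for h in samples]
print(cleaned)
```

Only the first two hrefs survive, and both map to the same article path, so a section link is counted once for its article.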


In [None]:
# (b,i)

def fetch_links(article):
    session = get_session()
    result = session.get('https://haw.wikipedia.org' + article)
    try: 
        result.raise_for_status()
    except:
        return None
    
    html = lx.fromstring(result.text)
    links = html.xpath('//div[@id="bodyContent"]//a/@href')
    
    # keep only links that are not preceded by 'org' (links to other wikis), ...
    # that contain a '/wiki/' term, ...
    # whose part after '/wiki/' does not contain a colon (special pages, files), ...
    # and match only the part preceding a within-page reference (#)
    links = [re.findall(r'(?<!org)/wiki/(?!.*:)[^#]*', link) for link in links]
    links = [link[0] for link in links if link != []] # remove unmatched links

    return set(links)

In [None]:
# (ii)
lookup = {key: value for value, key in enumerate(articles)}

def catch_redirect(link):
    if lookup.get(link) is None: # a redirect must have taken place, or the link does not exist
        name = re.findall(r'(?<=/wiki/).*', link)[0]
        # with redirect=no the wiki shows the redirect page itself instead of following it
        url = 'https://haw.wikipedia.org/w/index.php?title=' + name + '&redirect=no'
        result = requests.get(url)
        html = lx.fromstring(result.text)

        targets = html.xpath('//ul[@class="redirectText"]//a/@href|//span[@class="mw-redirectedfrom"]//a/@href')
        if targets: link = targets[0] # otherwise keep the original link

        # remove all within-page references
        link = re.findall(r'(?<!org)/wiki/(?!.*:)[^#]*', link)
        link = link[0] if link else None

    return link

In [None]:
# (iii)

import concurrent.futures, threading

thread_local = threading.local() # instantiates thread to create local data (here: session-attr.)

def get_session():
    if not hasattr(thread_local, "session"): 
        thread_local.session = requests.Session()
    return thread_local.session

In [None]:
def download_site(article):
    article_id = lookup.get(article)
    links = fetch_links(article)
    if not links: # request failed (None) or no links found (empty set)
        pairs = []
    else:
        links = [catch_redirect(link) for link in links]
        pairs = [(article_id, lookup.get(link)) for link in links if lookup.get(link) is not None]
    return pairs

In [None]:
def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers = 4) as executor:
        results = executor.map(download_site, sites)
    return results 

In [None]:
result = download_all_sites(articles)

pairs = list(result) # one list of (source, target) id pairs per article

In [None]:
print(sum([len(p) for p in pairs])) # 7021 

__(c)__ Compute the transition matrix (see [here](https://en.wikipedia.org/wiki/Google_matrix) and [here](https://www.amsi.org.au/teacher_modules/pdfs/Maths_delivers/Pagerank5.pdf) for step-by-step instructions). Make sure to treat dangling nodes, i.e. articles without outgoing links. You may want to use:
```
import numpy as np
from scipy.sparse import csr_matrix
```
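
As a small illustration of this step, the sketch below builds a dense transition matrix for a hypothetical four-page wiki in which one page is dangling. The names `pairs_toy`, `n_toy`, `H_toy`, and `is_dangling` are made up for this example. A dense NumPy array is used only to keep the toy case readable, whereas a sparse matrix is the better choice for the full wiki.

```python
import numpy as np

# Hypothetical link data: pairs_toy[i] holds the (source, target) pairs
# found on page i. Page 3 has no outgoing links (a dangling node).
pairs_toy = [
    [(0, 1), (0, 2)],
    [(1, 2)],
    [(2, 0)],
    [],
]
n_toy = len(pairs_toy)

# Each page spreads its weight equally over its outgoing links.
H_toy = np.zeros((n_toy, n_toy))
for page_links in pairs_toy:
    for src, dst in page_links:
        H_toy[src, dst] += 1 / len(page_links)

# Dangling pages: replace the all-zero row by a uniform distribution,
# so that a random surfer jumps to any page with equal probability.
is_dangling = np.array([len(p) == 0 for p in pairs_toy])
H_toy[is_dangling, :] = 1 / n_toy

print(H_toy.sum(axis=1))
```

After the dangling row is filled, every row sums to one, which is what makes the matrix a valid transition matrix.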

In [None]:
# (c)

import numpy as np
from scipy.sparse import csr_matrix

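# each article spreads its weight equally over its outgoing links
n = len(articles)
row, col, data = zip(*((row, col, 1 / len(p)) for p in pairs for row, col in p))
H = csr_matrix((data, (row, col)), shape = (n, n))

# dangling nodes (articles without outgoing links) jump to any article with equal probability
dangling = [p==[] for p in pairs]
H[dangling,:] = 1 / n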

__(d, i)__ Set the damping factor to `0.85` and compute the PageRank for each article, using forty iterations and starting with a vector with equal entries. __(ii)__ Obtain the top ten articles in terms of PageRank and, retrieving the articles again, find the corresponding English article, if available.

_Return the corresponding English article titles of the top ten articles from the Hawaiian wikipedia._
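
The iteration in (i) can be checked on a toy case first. The sketch below assumes a hypothetical, already row-stochastic transition matrix `P` for three pages and runs forty power-iteration steps with damping factor 0.85. The names `P`, `d`, `n_pages`, `rank`, and `top` are illustrative only.

```python
import numpy as np

# Hypothetical row-stochastic transition matrix for a 3-page toy wiki.
P = np.array([
    [0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
    [0.0, 0.0, 1.0],   # page 1 links to page 2
    [1.0, 0.0, 0.0],   # page 2 links to page 0
])
d = 0.85               # damping factor
n_pages = P.shape[0]

# Start from equal entries and iterate forty times.
rank = np.full(n_pages, 1 / n_pages)
for _ in range(40):
    # follow a link with probability d, otherwise jump to a random page
    rank = d * (rank @ P) + (1 - d) / n_pages * rank.sum()

# Page indices sorted from highest to lowest PageRank.
top = np.argsort(rank)[::-1]
print(rank.round(3), top)
```

Because every row of `P` sums to one, the entries of `rank` keep summing to one after each step, so no renormalization is needed.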

In [None]:
# (d, i)

# one PageRank step: follow a link with probability 0.85, otherwise jump to a random article
vH = lambda v: 0.85 * (v @ H) + 0.15 * np.mean(v)
v = np.array([1 / n] * n)
for _ in range(40):
    v = vH(v)

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(n), v, label = "PageRank") 
plt.legend() 
plt.show()

In [None]:
import pandas as pd
pagerank = pd.Series(v).sort_values(ascending = False).head(10)
page_id = list(pagerank.index)

In [None]:
pagerank

In [None]:
top10 = [articles[p] for p in page_id]

In [None]:
def get_english_article(link): 

    result = requests.get('https://haw.wikipedia.org' + link)
    html = lx.fromstring(result.text)
    english = html.xpath('//li[@class="interlanguage-link interwiki-en mw-list-item"]/a/@href')

    url = english[0] if english else None # None if there is no English counterpart

    return url

def get_title(url): 

    result = requests.get(url)
    html = lx.fromstring(result.text)
    titlelist = html.xpath('//span[@class="mw-page-title-main"]')

    title = titlelist[0].text if titlelist else None # None if no title element is found

    return title

In [None]:
english_urls = [get_english_article(link) for link in top10] # one article has no English counterpart (None)
title = [get_title(link) for link in english_urls if link is not None]
title

#'Spain',
#'Castile and León',
#'Municipality',
#'United States',
#'Hawaii',
#'List of municipalities in Burgos',
#'Capital city',
#'Lithuania',
#'List of municipalities in Soria'