<img src="../source/images/kandinsky.png" alt="" width="400"/>

# STA 220 Data & Web Technologies for Data Analysis

## Homework 3


Due __February 19 (Sunday), 2023__ by 11:59pm. Submit by editing this file, renaming it to __"LastName_FirstName_hw1"__, and then uploading it to Canvas twice, in __ipynb and html__ format! 

---
Instructions: 
1. Put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text. Answer all questions with complete sentences.

2. Your code should be readable; writing a piece of code should be compared to writing a page of a book. Adopt the *one-statement-per-line* rule. The length of each line of your code may not exceed the maximum width of the cell display. If a line is too long, split it into multiple lines to improve readability. 

3. To help understand and maintain code, you should always add comments to explain your code. Use the hash symbol (#) to start writing a comment. Uncommented solutions will not be graded. If you are writing a function, consider using a _docstring_ to add explanation. 

4. Do not clear your output so that we can see your answers without running all of the cells.

### Problem 1 : Getting to Philosophy [10 Points]

Let's play a variation of the [wiki game](https://en.wikipedia.org/wiki/Wikipedia:Wiki_Game) to learn about [this](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) phenomenon. The rules are as follows: 
 - Start from the random article link (in the wiki menu on the left-hand side)
 - Click on the first non-italicized link outside of parentheses 
 - Ignore external links, links to non-article pages (e.g., `/wiki/File:...` or `/wiki/Category:...`), and links to the current page
 - Stop when reaching "Philosophy" or a dead end (a page with no links), or when a loop occurs
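
The link rules above can be sketched as a small filter. This is only an illustration: the helper name `is_article_link` is hypothetical and not part of the assignment, and the check is a simplification that rejects every link containing a colon, which also excludes the few genuine articles with a colon in their title.

```python
import re

def is_article_link(href, current_url):
    """Return True if href points to a different Wikipedia article.

    Hypothetical helper for illustration; urls are in wiki format,
    e.g. '/wiki/Science'.
    """
    # Drop a section anchor such as '#History' before comparing pages
    page = href.split('#', 1)[0]
    # Keep only '/wiki/<title>' without ':'; this rejects external
    # links and namespaced pages such as '/wiki/File:...'
    if re.fullmatch(r'/wiki/[^:]+', page) is None:
        return False
    # Reject links that lead back to the current page
    return page != current_url
```

For example, `is_article_link('/wiki/Science#History', '/wiki/Science')` is rejected because it only jumps to a section of the current page.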

#### Exercise

Write a function `play` that plays the game and stops if "Philosophy" is not reached after `maxiter = 1000` steps. This function should return information to compute the quantities below. 

Play the game $200$ times. Report 
 - the mean number of sites visited per game, 
 - the maximum number of sites visited per game,
 - the number of convergences to "Philosophy", and 
 - the 20 most visited sites over all 200 games. 
 
You may want to use the module `lxml.html` and the function `tostring` from `lxml.etree`, or similar packages, to parse the HTML. Besides these, you are allowed to use `requests`, `re`, and `time`. To display the results, you may use `pandas` and its method `pandas.Series.value_counts()` or similar packages. You might find [regexr.com](https://regexr.com/) helpful. 
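
As a rough sketch of how the four quantities could be computed once the games have been played, the example below uses made-up page lists and the standard library (`collections.Counter` and `statistics.mean`) in place of `pandas`. The variable names and the data are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean

# Made-up results from three games, for illustration only
games = [
    ['/wiki/A', '/wiki/B', '/wiki/Philosophy'],
    ['/wiki/C', '/wiki/B', '/wiki/Philosophy'],
    ['/wiki/D', '/wiki/E'],
]

# Number of sites visited in each game
lengths = [len(pages) for pages in games]
mean_length = mean(lengths)
max_length = max(lengths)

# A game converged if its last page is "Philosophy"
n_philosophy = sum(pages[-1] == '/wiki/Philosophy' for pages in games)

# Count visits over all games and keep the most visited pages
visits = Counter(page for pages in games for page in pages)
top_pages = visits.most_common(2)
```

With real data, `most_common(20)` would give the 20 most visited sites, much like `value_counts().head(20)` on a `pandas.Series`.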

__Hint:__ Consider the results below from the function `play`, which takes the wiki-style url as argument and returns a dictionary. 

In [38]:
play('/wiki/Robert_Alfred_Tarlton')['pages']

['/wiki/Robert_Alfred_Tarlton',
 '/wiki/Birmingham',
 '/wiki/City_status_in_the_United_Kingdom',
 '/wiki/The_Crown',
 '/wiki/State_(polity)',
 '/wiki/Politics',
 '/wiki/Decision-making',
 '/wiki/Psychology',
 '/wiki/Science',
 '/wiki/Scientific_method',
 '/wiki/Empirical_evidence',
 '/wiki/Proposition',
 '/wiki/Philosophy_of_language',
 '/wiki/Analytic_philosophy',
 '/wiki/Academic_discipline',
 '/wiki/Knowledge',
 '/wiki/Descriptive_knowledge',
 '/wiki/Epistemology',
 '/wiki/Outline_of_philosophy',
 '/wiki/Philosophy']

In [39]:
play('/wiki/Riku_Morgan')['pages']

['/wiki/Riku_Morgan',
 '/wiki/Nigerian_Airforce',
 '/wiki/Nigerian_Armed_Forces',
 '/wiki/Military',
 '/wiki/Warfare',
 '/wiki/State_(polity)',
 '/wiki/Politics',
 '/wiki/Decision-making',
 '/wiki/Psychology',
 '/wiki/Science',
 '/wiki/Scientific_method',
 '/wiki/Empirical_evidence',
 '/wiki/Proposition',
 '/wiki/Philosophy_of_language',
 '/wiki/Analytic_philosophy',
 '/wiki/Academic_discipline',
 '/wiki/Knowledge',
 '/wiki/Descriptive_knowledge',
 '/wiki/Epistemology',
 '/wiki/Outline_of_philosophy',
 '/wiki/Philosophy']

In [40]:
play('/wiki/Brigade_Commander_(video_game)')['pages']

['/wiki/Brigade_Commander_(video_game)',
 '/wiki/Amiga_Action',
 '/wiki/Amiga',
 '/wiki/Personal_computer',
 '/wiki/Microcomputer',
 '/wiki/Computer',
 '/wiki/Machine',
 '/wiki/Power_(physics)',
 '/wiki/Physics',
 '/wiki/Natural_science',
 '/wiki/Branches_of_science',
 '/wiki/Sciences',
 '/wiki/Scientific_method',
 '/wiki/Empirical_evidence',
 '/wiki/Proposition',
 '/wiki/Philosophy_of_language',
 '/wiki/Analytic_philosophy',
 '/wiki/Academic_discipline',
 '/wiki/Knowledge',
 '/wiki/Descriptive_knowledge',
 '/wiki/Epistemology',
 '/wiki/Outline_of_philosophy',
 '/wiki/Philosophy']

In [41]:
play('/wiki/Exclusive_(TV_series)')['pages']

['/wiki/Exclusive_(TV_series)', '/wiki/Double_Vision_(company)']

__Solution:__ First, retrieve the HTML of the page. Then, select all text paragraphs (there might be no link in the first one) and convert them back to a string. Remove everything inside parentheses or italics from the string using `re`, and parse the result back into HTML. Search for the first link inside a `try` block, since there might be none. Finally, check that the link is valid and points to a new wiki page. 
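
The parenthesis-removal step is the least obvious part, so here is a minimal sketch of the idea on a plain string. The function name `strip_parentheses` is hypothetical, and the pattern is a simplified variant of the one used in the solution: it deletes innermost `(...)` groups repeatedly and skips any opening parenthesis preceded by `_`, so that wiki links such as `/wiki/State_(polity)` stay intact.

```python
import re

def strip_parentheses(text):
    """Repeatedly delete innermost (...) groups until none remain."""
    previous = None
    while previous != text:
        previous = text
        # '(' not preceded by '_', then characters that are not
        # parentheses, then the closing ')'
        text = re.sub(r'(?<!_)\([^()]*\)', '', text)
    return text

sample = 'Birmingham (a city (in England)) is large'
cleaned = strip_parentheses(sample)
```

Each pass removes only groups without nested parentheses, which is why the loop runs until the string stops changing. The surrounding whitespace is left as it is.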

In [1]:
import lxml.html as lx
import requests
from lxml.etree import tostring
import re 
import time

def remove_italics_and_brackets(html_list): 
    'Removes everything between (...) or <i>...</i>.'
    processed_text = ''
    for html in html_list: 
        text = tostring(html).decode('utf-8')
        text = re.sub(r'(<i>.*?</i>)', '', text) # remove italics

        # iteratively remove all brackets with no other brackets inside
        oldtext = ''
        while oldtext != text: 
            oldtext = text
            text = re.sub(r'(?<=[^_])\(([^\(]*?\))', '', text) 
        # regex explanation: 
        # (?<=[^_])     matches only if the preceding character is not _, as in a wikilink
        #    \(         matches an opening bracket, that... 
        #      (
        #       [^\(]*? is followed by an arbitrary number of non-opening-bracket characters
        #         \)    and concludes with a closing bracket
        #      )   
        processed_text += text
    return processed_text

def first_anchor(html, url): 
    '''
    Returns the first link outside of (...) or <i>...</i> that does 
    not refer outside of wikipedia or to the same page. 
    '''
    text = remove_italics_and_brackets(html)
    html = lx.fromstring(text)
    try: # there might be no link at all
        links = html.xpath('//a/@href') # these links might not be valid! 
        for link in links:  
            # check if the link goes outside of wikipedia or to, e.g., wiki/File:, ... 
            # regex explanation: Match everything that is 
            # (?<!org)      not preceded by org (this would link to wikimedia.org, etc.),   
            #   /wiki/       matches /wiki/,
            #       (?!.*:)  that is not followed by an arbitrary number of characters and ':'
            if re.search(r'(?<!org)/wiki/(?!.*:)', link) is not None:  
                # check if link links to same page (then link = url#something)
                if url not in link: 
                    return link
    except Exception: # treat any parsing problem as "no link found"
        return None

def get_link(url): 
    'Fetches first link on wiki page. '
    response = requests.get('https://en.wikipedia.org' + url) # in wiki format 
    response.raise_for_status()
    html = lx.fromstring(response.text) # Parse the HTML
    try: #this should be all paragraphs, if existing
        paras = html.xpath('//*[@id="mw-content-text"]/div[1]/p|//*[@id="mw-content-text"]/div[1]/ul')
        link = first_anchor(paras, url)
    except Exception: # apparently, no paragraph exists
        link = None
    return link
    
def get_first_link(): 
    'Return link from first random page in wiki format.'
    response = requests.get('https://en.wikipedia.org/wiki/Special:Random')
    return re.sub('https://en.wikipedia.org', '', response.url)

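def play(url = None, maxiter = 1000):
    '''
    Plays one game starting at url (a random page if url is None). 
    Returns a dictionary with the visited pages and the exit code. 
    '''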
    # start page
    if url is None: 
        url = get_first_link()
    pages = [url]
    
    exitcode = None
    for i in range(0, maxiter): 
        time.sleep(0.05)
        # get next link
        link = get_link(url)
        # no valid link on the page
        if link is None:  
            exitcode = 'deadend'
            break
        # loop: the next page has been visited before
        if link in pages: 
            exitcode = 'loop'
            break 
        if link == '/wiki/Philosophy':
            pages.append(link)
            exitcode = 'philosophy'
            break 
        url = link
        pages.append(url)
        
    if exitcode is None: 
        exitcode = 'maxiter'

    return {'pages': pages, 'exitcode': exitcode}

def sim(nsim = 200): 
    'Plays the game nsim times. '
    allpages = []
    lengths = [None] * nsim
    exitcodes = [None] * nsim
    for i in range(0, nsim): 
        result = play()
        allpages.extend(result['pages'])
        lengths[i] = len(result['pages'])
        exitcodes[i] = result['exitcode']
    return [allpages, lengths, exitcodes]
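
Since a full simulation takes a while and depends on the live site, the stopping rules can also be checked offline. The sketch below replaces the web requests with a made-up dictionary that maps each page to its first valid link; `TOY_LINKS` and `play_offline` are hypothetical names, and the structure only mirrors the logic of `play` above.

```python
# Made-up link graph: each page maps to its first valid link
# (None marks a dead end)
TOY_LINKS = {
    '/wiki/A': '/wiki/B',
    '/wiki/B': '/wiki/Philosophy',
    '/wiki/C': '/wiki/D',
    '/wiki/D': '/wiki/C',
    '/wiki/E': None,
}

def play_offline(url, links=TOY_LINKS, maxiter=1000):
    """Follow first links in a dictionary until a stopping rule applies."""
    pages = [url]
    for _ in range(maxiter):
        link = links.get(url)
        # dead end: the page has no valid link
        if link is None:
            return {'pages': pages, 'exitcode': 'deadend'}
        # loop: the next page has been visited before
        if link in pages:
            return {'pages': pages, 'exitcode': 'loop'}
        pages.append(link)
        if link == '/wiki/Philosophy':
            return {'pages': pages, 'exitcode': 'philosophy'}
        url = link
    # the iteration limit was reached without stopping
    return {'pages': pages, 'exitcode': 'maxiter'}
```

Calling `play_offline('/wiki/A')`, `play_offline('/wiki/C')`, and `play_offline('/wiki/E')` exercises the three exit codes `'philosophy'`, `'loop'`, and `'deadend'` without any network access.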

In [3]:
result = sim() # takes ~ 20 min, use sim(1) for a single result

In [6]:
import pandas as pd 
print(pd.Series(result[1]).mean()) 

18.595


In [7]:
print(pd.Series(result[1]).max())

31


In [8]:
pd.Series(result[2]).value_counts()

philosophy    186
loop           13
deadend         1
dtype: int64

In [9]:
pd.Series(result[0]).value_counts().head(20)

/wiki/Philosophy                186
/wiki/Outline_of_philosophy     167
/wiki/Epistemology              167
/wiki/Descriptive_knowledge     167
/wiki/Knowledge                 167
/wiki/Academic_discipline       143
/wiki/Analytic_philosophy       142
/wiki/Philosophy_of_language    142
/wiki/Proposition               142
/wiki/Empirical_evidence        142
/wiki/Scientific_method         139
/wiki/Science                   135
/wiki/Branches_of_science        59
/wiki/Psychology                 42
/wiki/Politics                   31
/wiki/Decision-making            31
/wiki/Social_science             31
/wiki/Physics                    28
/wiki/Natural_science            28
/wiki/Social_group               26
dtype: int64