Section 1: Introduction
=======

When prompted with the LBA, I already knew where I wanted to navigate. I currently reside in Hyderabad, India, specifically in HITEC City, an up-and-coming IT and tech hub on the edge of the city. However, the more interesting cultural areas of the city are near the center. Specifically, I wanted to go to the Old City, the historic Muslim part of the city. Before going there, though, I wanted to learn a little more about it. I opened my browser and went to Wikipedia...and then stopped. In the spirit of the assignment, I wondered whether, instead of autonomously navigating from HITEC City to the Old City, I could autonomously navigate from the _Wikipedia page_ of HITEC City to the _Wikipedia page_ of the Old City. I found this challenge engaging, and set to work.


Section 2: Wikipedia Python Package
=======

My first step was to figure out how to communicate with Wikipedia from Python. I knew that I could scrape each Wikipedia page manually, but I guessed that there was an easier way. Luckily, the Wikipedia API has already been wrapped in a Python package: https://wikipedia.readthedocs.io/en/latest/code.html#api.

This package provides a lot of useful functionality.
For example:

In [163]:
import wikipedia

# Load a Wikipedia page as a page object
SA = wikipedia.page("South Africa")
# Print the title of the page
print(SA.title)
# Print the summary of the page (here, just the first 100 characters)
print(SA.summary[:100])
# Print the links on the page (here, just the first 5)
print(SA.links[:5])

South Africa
South Africa, officially the Republic of South Africa (RSA), is the southernmost country in Africa. 
['+27', '.za', '10th BRICS summit', '11th BRICS summit', '16th meridian east']


This was exactly the functionality that I needed. 
The next step was to determine the rules.

Section 3: Rules - Wikirace
=======

I decided to do this in the style of a wikirace.

In [161]:
wikirace = wikipedia.page('wikirace')
print (wikirace.summary)

Wikiracing is a game using the online encyclopedia Wikipedia which focuses on traversing links from one page to another. It has many different variations and names, including The Wikipedia Game, Wikipedia Maze, Wikispeedia, Wikiwars, Wikipedia Ball, and Litner Ball. External websites have been created to facilitate the game.
The Seattle Times has recommended it as a good educational pastime for children and the Larchmont Gazette has said, "While I don't know any teenagers who would curl up with an encyclopedia for a good read, I hear that a lot are reading it in the process of playing the Wikipedia Game".
The Amazing Wiki Race has been an event at the TechOlympics and the Yale Freshman Olympics.
The average number of links separating any Wikipedia page from the United Kingdom page is 3.67. Other common houserules such as not using the United States page increase the difficulty of the game.


I read about wikiracing (and have done it in the past), and so decided on a set of rules.

1. I must start on a chosen Wikipedia page; this will be called start.
2. I must end on the Wikipedia page that I am looking for; this will be called goal.
3. I can only navigate by using the hyperlinks on the current Wikipedia page. I can choose among the links using any selection procedure I want, but that procedure may only use the names and order of the links.
4. Once I choose a link, I cannot use the back button. I must continue from that page.

This can be broken down into two algorithms, a navigation algorithm and an execution algorithm.
The navigation algorithm will, given a list of links on a wikipedia page, choose which link to go to.
The execution algorithm will, given a link, check if the given link is the goal, and if not navigate to the next page and return a list of all links on that page.

My overall algorithm will be recursive. I will first call a driver function, which calls the execution algorithm for the first time. Each call of the execution algorithm will check whether the given link is the goal; if not, it will call the navigation algorithm and then recursively call itself (the execution algorithm) again.

This captures the essence of a greedy algorithm: at each step, the navigation algorithm chooses the locally optimal next link. It does not look ahead at later choices, but instead chooses the next link based on a specific metric.

Section 4: First Attempt
=======

My first attempt at an algorithm was carried out by hand rather than in Python.
My execution algorithm was simple, as I would just click on the link.
My navigation algorithm for this first attempt was also simple:  
First, check if the desired link is on the page.  
If so, click it and be done.  
If not, choose the first link on the page.

This algorithm is simple enough to execute manually, but for the same reason I did not expect it to be very successful.
I tried this algorithm, starting from  
HITEC City : https://en.wikipedia.org/wiki/HITEC_City  
with the goal of the Old City : https://en.wikipedia.org/wiki/Old_City_(Hyderabad,_India)

The images from my first manual implementation of this algorithm are located in an attached folder.

After 16 steps, and with little success, I decided that I needed to iterate. While I learned some interesting things about computing, I wasn't getting any closer to learning about the Old City.
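
The manual rule from this section is small enough to write down as a function. The following is only a sketch: the name `navigation_first_link` and the toy link list are illustrative assumptions, not part of my actual run, which I did by clicking through the pages by hand.

```python
def navigation_first_link(links, goal):
    """Manual rule: take the goal if it is on the page, else the first link."""
    if not links:
        raise ValueError("page has no outgoing links")
    if goal in links:
        return goal
    return links[0]


# Toy link list standing in for a real page (made-up titles)
toy_links = ["Alpha", "Beta", "Gamma"]
print(navigation_first_link(toy_links, "Gamma"))  # goal is present, so it is chosen
print(navigation_first_link(toy_links, "Delta"))  # goal is absent, so the first link is chosen
```

Because the fallback ignores the goal entirely, the path is determined only by whichever link happens to be listed first on each page, which is consistent with how little progress I made by hand.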

Section 5: Second Attempt
=======

The first thing I updated was my execution algorithm.

In [187]:
# This is the driver function that starts the process.
# The inputs are a start and a goal (both strings) and a navigation function.
def wiki(start, goal, navigation):
    # Create a list that stores the path taken by the algorithm
    link_array = [start]
    # The counter is used to stop the algorithm after a certain
    # number of steps
    counter = 0
    # Call the execution algorithm
    execution(link_array, goal, counter, navigation)
    # Print the path
    print(link_array)

def execution(link_array, goal, counter, navigation):
    #This is how we exit recursion
    if link_array[-1] == goal:
        print( "Success")
        return link_array
    
    #If not, we find all links from the page we have navigated to
    base = wikipedia.page(link_array[-1])
    new_links = base.links
    # Choose the next link by calling the navigation algorithm
    next_step = navigation(link_array, goal, counter, new_links)
    link_array.append(next_step)
    #Update the counter
    counter+=1
    if counter < 20:
        # Call the execution function recursively, 
        # if the counter is below the specified amount
        execution(link_array, goal, counter, navigation)
    else: 
        print ("Failure")

This execution algorithm looked good! However, I still needed to create a navigation algorithm. Given the rules, I knew I was allowed to use the order and names of all links on a page in order to choose a link. The order did not prove to be a helpful metric in my first attempt, which confirmed my suspicion that it would not be very useful for navigation. Thus, I was left with the names of the links. I decided that similarity between the name of each link and the goal would be a good metric. However, I still needed a way to calculate this. I did some research, and came across the longest common subsequence (LCS).

In [165]:
LargestCommonSubsequence = wikipedia.page('Largest Common Subsequence')
print (LargestCommonSubsequence.summary)

The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring problem: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The longest common subsequence problem is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.


This seemed like what I needed. It was more robust than the longest common substring, as there might be similarities between strings that do not occupy consecutive positions. However, there was a problem. I researched how to solve this problem, but the first algorithm that I came across was not ideal. It got the correct answer, but did so in O(2^n) time. This was unacceptable. Luckily, this problem has been solved efficiently, using a concept known as dynamic programming. The essence of dynamic programming is that it helps to solve a problem which can be broken up into many overlapping subproblems. Importantly, solving one subproblem can be done with the answers of other subproblems. Thus, in a naive approach, the same subproblem is solved many times. Dynamic programming records the solution of each subproblem, and reuses it instead of solving the subproblem again the next time it comes up. Our solution can thus be implemented in O(mn) time, with m and n being the lengths of the two strings being compared. This is dramatically faster.
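
To make the difference concrete, here is a minimal sketch of the two approaches, computing only the length of the LCS. The function names `lcs_len_naive` and `lcs_len_memo` are my own illustrative choices, and the memoized version uses Python's `functools.lru_cache` rather than the table-based code I actually used.

```python
from functools import lru_cache


def lcs_len_naive(x, y):
    """Plain recursion: the same pairs of suffixes are solved again and again."""
    if not x or not y:
        return 0
    if x[0] == y[0]:
        return 1 + lcs_len_naive(x[1:], y[1:])
    return max(lcs_len_naive(x[1:], y), lcs_len_naive(x, y[1:]))


def lcs_len_memo(x, y):
    """Same recursion, but each (i, j) subproblem is solved only once."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + solve(i + 1, j + 1)
        return max(solve(i + 1, j), solve(i, j + 1))

    return solve(0, 0)


print(lcs_len_naive("nice", "pie"), lcs_len_memo("nice", "pie"))  # 2 2
```

There are only (m+1)(n+1) distinct (i, j) pairs, so caching them is what brings the work down to O(mn).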

The following code was taken from https://www.geeksforgeeks.org/longest-common-subsequence/. I edited the output slightly (line 47) to return the LCS in a format that worked better for my uses.

In [167]:
# Dynamic programming implementation of LCS problem
 
# Returns the LCS (as a string) of X[0..m-1], Y[0..n-1]
def lcs(X, Y, m, n):
    L = [[0 for x in range(n+1)] for x in range(m+1)]
 
    # Following steps build L[m+1][n+1] in bottom up fashion. Note
    # that L[i][j] contains length of LCS of X[0..i-1] and Y[0..j-1] 
    for i in range(m+1):
        for j in range(n+1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
 
    # Following code is used to reconstruct the LCS
    index = L[m][n]
 
    # Create a character array to store the lcs string
    lcs = [""] * (index+1)
    lcs[index] = "\0"
 
    # Start from the right-most-bottom-most corner and
    # one by one store characters in lcs[]
    i = m
    j = n
    while i > 0 and j > 0:
 
        # If current character in X[] and Y are same, then
        # current character is part of LCS
        if X[i-1] == Y[j-1]:
            lcs[index-1] = X[i-1]
            i-=1
            j-=1
            index-=1
 
        # If not same, then find the larger of two and
        # go in the direction of larger value
        elif L[i-1][j] > L[i][j-1]:
            i-=1
        else:
            j-=1
            
    ###I edited the below line to output in my preferred format
    return ("".join(lcs[:-1]) )
 
# Driver program

def LCS(string1, string2):
    X = string1
    Y = string2
    m = len(X)
    n = len(Y)
    return (lcs(X, Y, m, n))

    
# This code is contributed by BHAVYA JAIN from Geeks_for_Geeks

In [169]:
#Test the LCS
LCS("nice", 'pie')

'ie'

At this point, I had a way to calculate the metric. I just needed to build it into my navigation algorithm. My first idea was to find the score for each link using LCS, then use the max function to find the maximum value. However, the problem was that multiple links might have the same score. I decided that in this situation, I wanted to choose the shortest link, meaning that the LCS would comprise the largest percentage of the link title, which would make it more likely to reflect an actual connection between the goal and that link. This is implemented below.

In [184]:
def navigation1(link_array, goal, counter, new_links):

    link_scores = []
    print (link_array[-1])
    for link in new_links:
        link_scores.append((len(LCS(link, goal)),LCS(link, goal),link))
    
    # Find the max score of all the link scores
    mx = max(link_scores, key=lambda x: x[0])[0]
    
    max_scores = []
    #find all links that have score = mx
    for score in link_scores:
        if score[0] == mx:
            max_scores.append(score)

    #Find the link in max_scores that has the shortest link title
    mn =(min(max_scores, key=lambda x: len(x[2])))
   
    #Return the name of the link
    return mn[2]     

In [185]:
# This is the first test of the algorithm.
# Note that the start and goal must be the exact name
# of the page on wikipedia
wiki('HITEC City','Old City (Hyderabad, India)', navigation1 )

HITEC City
List of educational institutions in Hyderabad (India)
Central Institute for Medicinal and Aromatic Plants (Hyderabad, India)
Council of Scientific and Industrial Research
Hyderabad, India
Success
['HITEC City', 'List of educational institutions in Hyderabad (India)', 'Central Institute for Medicinal and Aromatic Plants (Hyderabad, India)', 'Council of Scientific and Industrial Research', 'Hyderabad, India', 'Old City (Hyderabad, India)']


Success! My algorithm correctly navigated from HITEC City to the Old City. I am now finally able to read all the information about the Old City to gain a deeper cultural understanding before going there.

However, I wanted to see how good my algorithm really was. I decided to test it with a different case.

In [188]:
# I decided to test if my algorithm could navigate from
# "Hydrogen" to "India", two unrelated words.

wiki('Hydrogen', 'India', navigation1)

Hydrogen
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Indium halides
Indium
Failure
['Hydrogen', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides', 'Indium', 'Indium halides']


Clearly, there is a problem. My algorithm got stuck in a loop that only ended because of the step limit. At each decision point, the optimal link was one that the algorithm had visited before. Because the algorithm is deterministic, the same input always produces the same output: once it returns to a page it has visited before, it continues along the same path from that page and runs into the same problem again. Without the counter, this would be an infinite loop. In the next section, I will fix this.

Section 6: Third Attempt
=======

In order to prevent this loop, I needed to update the navigation function so that it will not choose a link that it has already visited. I implement this in the cell below.

In [190]:
def navigation2(link_array, goal, counter,new_links):

    link_scores = []
    print (link_array[-1])
    for link in new_links:
        link_scores.append((len(LCS(link, goal)),LCS(link, goal),link))
    
    # Find the max score of all the link scores
    mx = max(link_scores, key=lambda x: x[0])[0]
    
    #Make a list of all links that have score = mx
    # However, they also must not be links we have previously 
    # visited.
    no_repeats = False
    while not no_repeats:
        max_scores = []
        #find all links that have score = mx
        for score in link_scores:
            if score[0] == mx:
                max_scores.append(score)
        #Now delete all links that have been used before
        deletion_list = [False for x in range(len(max_scores))]
        for i in range(len(max_scores)):
            for j in range(len(link_array)):
                if max_scores[i][2] == link_array[j]:
                    deletion_list[i] = True
        # Delete in reversed order so indexing doesn't become a problem
        for i in reversed(range(len(deletion_list))):
            if deletion_list[i] == True:
                del max_scores[i]
                
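        # If we have at least one new link at this score, we are done.
        # If not, decrease mx and repeat. Once mx reaches 0 with no
        # candidates left, every link has been visited, so stop instead
        # of looping forever.
        if len(max_scores) > 0:
            no_repeats = True
        elif mx > 0:
            mx -=1
        else:
            raise ValueError("all links on this page have already been visited")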
        
    #Find the link in max_scores that has the shortest link title
    mn =(min(max_scores, key=lambda x: len(x[2])))
   
    #Return the name of the link
    return mn[2]    
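
The same selection rule can also be written more compactly. The following is a sketch rather than the code I ran: `lcs_length` and `navigation_compact` are hypothetical names, `lcs_length` returns only the length of the LCS, and the toy link list is made up. It keeps the argument order that the execution function passes in, and it raises a ValueError when every link has already been visited.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def navigation_compact(link_array, goal, counter, new_links):
    """Pick the unvisited link with the highest LCS score; ties go to the shortest title."""
    visited = set(link_array)
    candidates = [link for link in new_links if link not in visited]
    if not candidates:
        raise ValueError("every link on this page has already been visited")
    # Negating the score lets a single min() handle both criteria
    return min(candidates, key=lambda link: (-lcs_length(link, goal), len(link)))


# Made-up link list; "Indium" has already been visited
toy_links = ["Indium", "Indiana", "Iron"]
print(navigation_compact(["Hydrogen", "Indium"], "India", 0, toy_links))  # Indiana
```

Using a set for the visited pages avoids the nested loop over `link_array`, and sorting on the pair (negative score, title length) removes the need to lower the target score step by step.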


In [191]:
wiki( 'HITEC City', 'Old City (Hyderabad, India)', navigation2)

HITEC City
List of educational institutions in Hyderabad (India)
Central Institute for Medicinal and Aromatic Plants (Hyderabad, India)
Council of Scientific and Industrial Research
Hyderabad, India
Success
['HITEC City', 'List of educational institutions in Hyderabad (India)', 'Central Institute for Medicinal and Aromatic Plants (Hyderabad, India)', 'Council of Scientific and Industrial Research', 'Hyderabad, India', 'Old City (Hyderabad, India)']


In [192]:
wiki('Hydrogen', 'India', navigation2)

Hydrogen
Indium
Indium halides
Indium trichloride
Indigane
Silk in the Indian subcontinent
South India
East India
Success
['Hydrogen', 'Indium', 'Indium halides', 'Indium trichloride', 'Indigane', 'Silk in the Indian subcontinent', 'South India', 'East India', 'India']


This updated navigation algorithm succeeded on both our initial goal and our new test case. We can try more test cases.

In [195]:
wiki ( 'Boston', 'India', navigation2)

wiki ( 'Xylophone','Egyptian pyramids',navigation2 )

Boston
Indiana
Indiana Day
Indiana Code
Indianapolis
Success
['Boston', 'Indiana', 'Indiana Day', 'Indiana Code', 'Indianapolis', 'India']
Xylophone
Encyclopædia Britannica Eleventh Edition
Encyclopædia Britannica Ninth Edition
Egyptian hieroglyphs
Egyptian Grammar: Being an Introduction to the Study of Hieroglyphs
Egyptian language
Ancient Egyptian funerary practices
Ancient Egyptian mathematics
Egyptian fractions
Egyptian mathematics
List of ancient Egyptian papyri
Elephantine papyri
Ancient Egyptian religion
Ancient Egyptian burial customs
List of ancient Egyptian dynasties
Ancient Egyptian trade
Ancient Egyptian philosophy
Ancient Egyptian funerary texts
Pyramid Texts
Success
['Xylophone', 'Encyclopædia Britannica Eleventh Edition', 'Encyclopædia Britannica Ninth Edition', 'Egyptian hieroglyphs', 'Egyptian Grammar: Being an Introduction to the Study of Hieroglyphs', 'Egyptian language', 'Ancient Egyptian funerary practices', 'Ancient Egyptian mathematics', 'Egyptian fractions', 'Eg

Things are looking pretty good. However, we can still find test cases where it will fail. The two cells below show two different examples of failure.

In [196]:
wiki ( 'Egyptian pyramids','Xylophone',navigation2 )

Egyptian pyramids
Pyramid of Elephantine
Geographic coordinate system
Selenographic coordinates
Full Moon
Planetary objects proposed in religion, astrology, ufology and pseudoscience
Val Johnson incident
Kelly–Hopkinsville encounter
1566 celestial phenomenon over Basel
Climatology
Atmospheric boundary layer
Atmospheric Model Intercomparison Project
Coupled model intercomparison project
Navy Operational Global Atmospheric Prediction System
Navy Global Environmental Model
Regional Atmospheric Modeling System
Mars regional atmospheric modeling system
Mars Exploration Rovers
Multi-Mission Radioisotope Thermoelectric Generator
Neutron capture therapy of cancer
Failure
['Egyptian pyramids', 'Pyramid of Elephantine', 'Geographic coordinate system', 'Selenographic coordinates', 'Full Moon', 'Planetary objects proposed in religion, astrology, ufology and pseudoscience', 'Val Johnson incident', 'Kelly–Hopkinsville encounter', '1566 celestial phenomenon over Basel', 'Climatology', 'Atmospheric bo

In [194]:
wiki( 'Atlanta', 'Yoga', navigation2)

Atlanta
Montego Bay
Bog Walk
Hodges, Jamaica
Broughton, Jamaica
Stonehenge, Jamaica
Roxborough, Manchester
Roaring River Park
Rose Hall, Montego Bay






DisambiguationError: "Rose Hall" may refer to: 
Rose Hall, New York City
Rose Hall, Guyana
Rose Hall, Saint Vincent and the Grenadines
Rose Hall, Oxford
Rose Hall, Montego Bay
Rose Hall Beach

With the first case, the problem was that our metric was not getting us closer to the goal. We travelled to niche Wikipedia pages about space travel, from which it is hard to reach Xylophone. We could let our algorithm run longer, but I doubt that this would help. Instead, we would need to find a better metric for searching.

The next case, however, was a problem with the wikipedia Python package. The name of the link must be specific, or a disambiguation error is thrown, as it is unclear which page should be chosen. Ideally, each entry in the "links" attribute of a page object would be the exact title of a Wikipedia page. However, in certain cases, the Python package communicates with the Wikipedia API but ends up with a name that differs from the full title of the page. In this scenario, if our algorithm chooses this page, an error is thrown. To resolve this, we could either choose a random page from the candidates listed for the ambiguous name, or communicate with the Wikipedia API ourselves and bypass the wikipedia package.
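
The first option, picking a random candidate, could be sketched as follows. Everything here is an assumption for illustration: `fetch_with_fallback`, `fake_fetch` and `FakeDisambiguationError` are hypothetical names, and the two stand-ins exist only so that the sketch runs offline. With the real package, I would expect to pass `wikipedia.page` and the package's disambiguation exception instead, assuming that exception exposes its candidate titles as an `options` attribute.

```python
import random


class FakeDisambiguationError(Exception):
    """Hypothetical stand-in for the package's disambiguation exception."""

    def __init__(self, title, options):
        super().__init__(title)
        self.options = options


def fake_fetch(title):
    """Hypothetical stand-in for a page lookup, so the sketch runs offline."""
    if title == "Rose Hall":
        raise FakeDisambiguationError(
            title, ["Rose Hall, Guyana", "Rose Hall, Montego Bay"]
        )
    return {"title": title}


def fetch_with_fallback(title, fetch, ambiguous_error, rng=random):
    """Fetch a page; on an ambiguous title, retry once with a random candidate."""
    try:
        return fetch(title)
    except ambiguous_error as err:
        return fetch(rng.choice(err.options))


page = fetch_with_fallback("Rose Hall", fake_fetch, FakeDisambiguationError)
print(page["title"])
```

This only retries once, so a candidate that is itself ambiguous would still raise; a random choice also makes the run non-deterministic, which is a trade-off against the reproducibility of the earlier results.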

Section 7: Future Work
=======

I learned several things from working through this assignment. The first was about preventing infinite loops. In an online algorithm like mine, it is possible to prevent infinite loops with a simple check of previous input. However, in real life, infinite loops may arise from external input that cannot be checked so easily, and more advanced methods of stopping them must be implemented.

Secondly, the greedy nature of our algorithm is limiting. If we "looked ahead", we could potentially navigate through Wikipedia even faster. For example, our algorithm could look through the links on the pages of all the links that achieve close to the maximum LCS score on a page. Of these links, we would choose the link whose subsequent links have the highest average (or maximum, median, etc.) score. This could improve our search, but it would no longer be a greedy algorithm, which is focused only on the local maximum.
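
A one-step look-ahead of this kind might look like the sketch below. It is untested against real Wikipedia data: `navigation_lookahead`, `lcs_length`, the `top_k` parameter and the toy graph (a dict from page title to its links, with made-up connections) are all assumptions for illustration, and it uses the maximum score of the subsequent links.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def navigation_lookahead(graph, current, goal, top_k=3):
    """Among the top_k links by LCS score, pick the one whose own links score best."""
    links = graph[current]
    if goal in links:
        return goal
    # Shortlist the links that a greedy step would rank highest
    ranked = sorted(links, key=lambda link: -lcs_length(link, goal))[:top_k]

    def best_next_score(link):
        # Best score reachable one step beyond this link (0 if it has no links)
        return max((lcs_length(nxt, goal) for nxt in graph.get(link, [])), default=0)

    return max(ranked, key=best_next_score)


# Made-up graph: the greedy choice "Indiana" leads away from the goal
toy_graph = {
    "Start": ["Indiana", "Asia"],
    "Indiana": ["Hoosier"],
    "Asia": ["India"],
}
print(navigation_lookahead(toy_graph, "Start", "India"))  # Asia
```

In the real setting each look-ahead step would cost one extra page request per shortlisted link, so `top_k` trades search quality against the number of API calls.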

An interesting question posed by this assignment is the state of the Wikipedia network. Wikipedia could easily be represented as a network, with pages corresponding to nodes and hyperlinks between them corresponding to directed edges. Armed with this, many interesting things could be done. A shortest path algorithm could be used on the network to find the optimal way to travel between Wikipedia pages. We could also examine certain topics or pages to understand network connectivity, or study the entire network as a whole. Due to the large size of the network, powerful computers would need to be used. Interesting analysis could be done, along the lines of which scientific fields are most densely connected, or similar analysis with actors or celebrities.
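
As a small illustration of the shortest path idea, here is a sketch of breadth-first search on a toy directed graph. The graph, its single-letter page names and the function name `shortest_path` are made up for the example; a real version would need the link structure of Wikipedia itself.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over a dict mapping each page to its outgoing links."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            # Walk back through the parents to rebuild the path
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for link in graph.get(page, []):
            if link not in parents:
                parents[link] = page
                queue.append(link)
    return None  # goal is not reachable from start


toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["E"],
    "E": [],
}
print(shortest_path(toy_graph, "A", "E"))  # ['A', 'C', 'E']
```

Unlike the greedy navigation, this explores pages level by level, so the first time it reaches the goal it has found a path with the fewest clicks. The cost is that it needs the links of every page it passes through.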
