# The Wikipedia Game
The goal of this project is to build an algorithm that performs better than a random player at the Wikipedia game! The Wikipedia game is [INSERT INFO ABOUT WIKIPEDIA GAME].

We've broken the project into three main technical challenges:
* Scrape Wikipedia (or at least a subset) in order to build a data structure with information about what links exist between pages.
* Build graph algorithms and visualizations for finding the shortest distance between two pages.
* Create an algorithm that does not rely on the graph but can still navigate from one page to another in a small number of clicks. Ideally this would equal the shortest possible distance, but it would be cool if it at least performed better than an algorithm that chooses links randomly.
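
To make the second challenge concrete, here is a minimal sketch of a breadth-first search for the shortest path between two pages. It assumes the graph is a dictionary that maps each title to a list of linked titles. The function name `shortest_path` and the single-letter toy graph are made up for illustration and are not part of our `scraping.py` code.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return one shortest list of titles from start to goal, or None if unreachable."""
    if start == goal:
        return [start]

    parent = {start: None}  # remembers how each page was first reached
    queue = deque([start])

    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor in parent:
                continue  # already reached by a path that is at least as short
            parent[neighbor] = page
            if neighbor == goal:
                # Walk the parent pointers back to the start
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)

    return None


# Hypothetical toy graph: each key links to the pages in its list
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

print(shortest_path(toy_graph, "A", "F"))  # ['A', 'B', 'D', 'F']
```

Because links are directed, the path from `"F"` back to `"A"` does not exist in this toy graph, and the function returns `None`.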

## Step 1: Scraping Wikipedia
The easiest way to do this would probably have been to rely on the Wikipedia data dumps that are released twice monthly. However, we wanted to play with Beautiful Soup and requests, so we decided to implement the scraping ourselves. This takes much more time, since every article requires its own HTTP request, which means that we are not able to build a graph of all of Wikipedia, but we also found it to be way cooler!

Before we could actually scrape Wikipedia, we had to decide what data we wanted to store and how we wanted to store it. After looking through a bunch of articles, we found that all Wikipedia pages have the same URL structure:
```
https://en.wikipedia.org/wiki/[articleTitle]
```
So the article on Star Wars has the link https://en.wikipedia.org/wiki/Star_Wars and the article entitled "The Hitchhiker's Guide to the Galaxy" has the link https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy (notice how the apostrophe is percent-encoded as `%27`!).

This meant that as long as we stored the article title, we could always recreate the link for the article.
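
As a rough sketch of that round trip, the standard library's `urllib.parse` can build the link from a title and recover the title from a link. The helper names `title_to_url` and `url_to_title` are hypothetical. Note that Wikipedia's own links leave some characters (such as commas) unescaped, so the output may not match scraped hrefs character for character.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://en.wikipedia.org/wiki/"


def title_to_url(title):
    """Build an article URL: spaces become underscores, other special characters are percent-encoded."""
    return BASE_URL + quote(title.replace(" ", "_"))


def url_to_title(url):
    """Recover a human-readable title from an article URL."""
    return unquote(url[len(BASE_URL):]).replace("_", " ")


url = title_to_url("The Hitchhiker's Guide to the Galaxy")
print(url)                # https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy
print(url_to_title(url))  # The Hitchhiker's Guide to the Galaxy
```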


In [1]:
import requests
from bs4 import BeautifulSoup     

PARSER = "lxml"

title = r"The_Hitchhiker%27s_Guide_to_the_Galaxy"
wiki_url = "https://en.wikipedia.org/wiki/" + title
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text,PARSER)

print(f"The title of the requested page is {soup.title}")

The title of the requested page is <title>The Hitchhiker's Guide to the Galaxy - Wikipedia</title>


A simple way of representing graphs is with dictionaries. Every key in the dictionary represents a node of the graph. The value associated with the key is a list of all of the nodes that the key links to. This is how we decided to represent the Wikipedia graph. Let's take a look at what the Star Wars key would look like!

In [2]:
from scraping import get_wiki_graph_one_step, get_wiki_graph

title= "Star_Wars"
d = get_wiki_graph_one_step([title])

print()
print(f"{title} links to {len(d[title])} other wikipedia articles")
print(f"For example, within the {title} article, there are links to the following wikipedia pages:")
for ref in d[title][:10]:
    print(f"\t{ref}")

(0/1): Star_Wars

Star_Wars links to 1259 other wikipedia articles
For example, within the Star_Wars article, there are links to the following wikipedia pages:
	The_Empire_Strikes_Back
	Casey_Hudson
	Rail_shooter
	Ouija
	A-wing
	Stretch_Armstrong
	Anthony_Breznican
	Harrison_Ford
	The_Rising_Force
	Mr._Potato_Head
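
To show the shape of this dictionary without hitting the network, here is a tiny hand-written example. The two links out of `Star_Wars` appear in the output above, but the other entries are invented for illustration, and `incoming_links` is a hypothetical helper. It highlights that the graph is directed: a page's value lists what it links to, not what links to it.

```python
# Hand-written toy graph; only the Star_Wars entry mirrors links seen in the real output
toy_graph = {
    "Star_Wars": ["Harrison_Ford", "The_Empire_Strikes_Back"],
    "Harrison_Ford": ["Star_Wars"],
    "The_Empire_Strikes_Back": ["Star_Wars", "Harrison_Ford"],
}


def incoming_links(graph, title):
    """Return the sorted list of pages whose link lists contain the given title."""
    return sorted(src for src, targets in graph.items() if title in targets)


print(toy_graph["Star_Wars"])                     # outgoing links are a direct lookup
print(incoming_links(toy_graph, "Harrison_Ford"))  # incoming links require a scan of every key
```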


Above, we just requested the Star Wars Wikipedia page. To build a bigger graph, we can repeat the same process for every page referenced in the Star Wars page, and keep repeating it for as many steps as we'd like.

Let's try this:

In [3]:
title = "New_York_State_Route_373"
out = get_wiki_graph(starting_refs=[title], num_steps=2)

(0/1): New_York_State_Route_373
(0/59): The_New_York_Times
(1/59): Numbered_highways_in_New_York
(2/59): Lake_Champlain
(3/59): United_States
(4/59): 52nd_New_York_State_Legislature
(5/59): Vermont
(6/59): Parkways_in_New_York
(7/59): ISBN_(identifier)
(8/59): Main_Page
(9/59): Theodore_Roosevelt_International_Highway
(10/59): Ausable_Chasm
(11/59): American_Antiquarian_Society
(12/59): Canadian_Pacific_Railway
(13/59): New_York_State_Legislature
(14/59): Keeseville,_New_York
(15/59): Hamlet_(New_York)
(16/59): St._Lawrence_County,_New_York
(17/59): List_of_reference_routes_in_New_York
(18/59): Burlington,_Vermont
(19/59): New_York_State_Route_374
(20/59): Plattsburgh,_New_York
(21/59): General_Drafting
(22/59): Ausable_Chasm,_New_York
(23/59): U.S._Route_9_in_New_York
(24/59): County_Route_17_(Essex_County,_New_York)
(25/59): Standard_Oil_Company_of_New_York
(26/59): List_of_Interstate_Highways_in_New_York
(27/59): Burlington%E2%80%93Port_Kent_Ferry
(28/59): Portland,_Oregon
(29/59): 

In [11]:
d = out[0]
print(f"{title} has {len(d[title])} links")

title_1 = d[title][0]
print(f"One such link is {title_1} which itself has {len(d[title_1])} links all of which link out to other articles")


New_York_State_Route_373 has 60 links
One such link is The_New_York_Times which itself has 927 links all of which link out to other articles


Above we did this for two steps. In the first step we found all of the links in `New_York_State_Route_373`. In the second step we found all of the links within all of these links.
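
The step-by-step idea can be sketched without any network access. In the hypothetical `expand` function below, `fetch_links` stands in for requesting and parsing a page, and the toy link table is invented. This is a simplified illustration of the approach, not the code in `scraping.py`.

```python
def expand(fetch_links, start, num_steps):
    """Grow a link graph level by level; return the graph and the titles not yet fetched."""
    graph = {}
    frontier = [start]

    for _ in range(num_steps):
        # Fetch every page on the current frontier
        for title in frontier:
            graph[title] = fetch_links(title)

        # The next frontier is every linked title that has not been fetched yet
        seen = set(graph)
        next_frontier = set()
        for title in frontier:
            next_frontier.update(t for t in graph[title] if t not in seen)
        frontier = sorted(next_frontier)

    return graph, frontier


# Invented link table standing in for real Wikipedia pages
toy_links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D", "E"], "D": [], "E": ["A"]}

graph, frontier = expand(lambda t: toy_links.get(t, []), "A", num_steps=2)
print(sorted(graph))  # ['A', 'B', 'C']
print(frontier)       # ['D', 'E']
```

After two steps the graph holds the start page and everything it links to, while the returned frontier is the work that a third step would have to do.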

In [13]:
print(f"If we were to do the third step, we would have to find all of the links within {len(out[2])} wikipedia articles")

print(f"Given that this takes about 0.5 seconds per article, just three steps would take {len(out[2])*0.5} seconds or {len(out[2])*0.5/3600} hours")

If we were to do the third step, we would have to find all of the links within 19530 wikipedia articles
Given that this takes about 0.5 seconds per article, just three steps would take 9765.0 seconds or 2.7125 hours


With every additional step we take beyond the initial article, the time to build the graph grows roughly exponentially, because each newly discovered article contributes its own set of links to fetch. Since we didn't want our personal computers to be occupied for multiple days, we did this on the Pomona servers using the `scraping.py` file we wrote. The final product was a .json file with [DESCRIBE NUMBER OF ENTRIES]. In the next section, we will be working with this JSON file to build cool graph visualizations and algorithms.

## Step 2: Graph Visualizations and Algorithms

In [14]:
import json

import bs4
from bs4 import BeautifulSoup     
import requests

PARSER = "lxml"            # to use lxml (the most common), you'll need to install with .../pip install lxml

In [15]:
star_wars_url = "https://en.wikipedia.org/wiki/Star_Wars"
response = requests.get(star_wars_url)
data_from_url = response.text
soup = BeautifulSoup(data_from_url,PARSER)


In [16]:
queue = ["Star Wars"]

In [17]:
def get_wiki_graph_one_step(titles):
    """
    Takes a list of Wikipedia article titles with no repeats.
    Returns a dictionary whose keys are the titles in the original list
    and whose values are lists of the article titles each page links to.
    """
    d = {}

    for i, title in enumerate(titles):
        wiki_url = "https://en.wikipedia.org/wiki/" + title

        # Request the Wikipedia page and parse it
        response = requests.get(wiki_url)
        soup = BeautifulSoup(response.text, PARSER)

        print(f"({i}/{len(titles)}): {title}")

        # Capture all articles referenced within this article
        link_set = set()  # Use a set to ensure no repeats
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith("/wiki/"):
                ref = href[len("/wiki/"):]

                # Skip titles containing ":" (namespaced pages such as "File:" or "Help:", not regular articles)
                if ":" not in ref:
                    link_set.add(ref)
        d[title] = list(link_set)

    return d
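
One thing the logs in this notebook show is that a link with a section anchor, such as `Lake_Champlain_Transportation_Company#Burlington-Port_Kent`, gets stored and fetched separately from the page it points into. A possible variation, sketched below with a hypothetical helper `href_to_title`, strips the `#fragment` before filtering. This is an assumption about how one might tighten the link filter, not what the function above does.

```python
def href_to_title(href):
    """Return the article title for an internal article link, or None for anything else."""
    prefix = "/wiki/"
    if not href or not href.startswith(prefix):
        return None  # external link, or an <a> tag with no href

    # Drop any "#section" anchor so all links into one page share a single key
    ref = href[len(prefix):].split("#", 1)[0]

    # Titles containing ":" are namespaced pages rather than regular articles
    if not ref or ":" in ref:
        return None
    return ref


print(href_to_title("/wiki/Lake_Champlain_Transportation_Company#Burlington-Port_Kent"))
# Lake_Champlain_Transportation_Company
print(href_to_title("/wiki/Special:Random"))  # None
```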

    

In [18]:
def list_of_lists_to_set(lol):
    """Flatten a list of lists into a single set of unique items."""
    return {item for sublist in lol for item in sublist}

In [None]:
def save_dict(d, filename):
    # Serialize the dictionary as JSON and write it to the file;
    # the "with" block closes the file even if writing fails
    with open(filename, "w") as f:
        json.dump(d, f)
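
Reading a saved graph back is the mirror image of saving it. The sketch below uses a hypothetical `load_dict` helper and a throwaway temporary file, and simply checks that a small dictionary survives the round trip through JSON.

```python
import json
import os
import tempfile


def load_dict(filename):
    """Read a JSON file and return its contents as a dictionary."""
    with open(filename, "r") as f:
        return json.load(f)


# Round trip on a tiny invented graph, written to a temporary directory
graph = {"A": ["B", "C"], "B": [], "C": ["A"]}
path = os.path.join(tempfile.mkdtemp(), "toy_graph.json")
with open(path, "w") as f:
    json.dump(graph, f)

restored = load_dict(path)
print(restored == graph)  # True
```

Since every key is a string and every value is a list of strings, JSON preserves the structure exactly.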

In [19]:
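def get_wiki_graph(final_d=None, starting_refs=None, num_steps=1, filename="wiki_graph"):
    """
    Builds the link graph outward from starting_refs for num_steps steps.
    Returns (graph, references found in the last step, references not yet fetched).
    """
    # Use None defaults so repeated calls do not share one mutable dictionary
    if final_d is None:
        final_d = {}
    ref_list = starting_refs if starting_refs else ["New_York_State_Route_373"]
    step_refs = set()

    for i in range(num_steps):
        step_d = get_wiki_graph_one_step(ref_list)
        final_d.update(step_d)

        # Get all of the references that haven't been added to the graph yet
        step_refs = list_of_lists_to_set(list(step_d.values()))
        existing_refs = set(final_d.keys())
        unseen_refs = step_refs.difference(existing_refs)

        ref_list = list(unseen_refs)

        # Save a checkpoint after every step
        save_dict(final_d, f"{filename}_{i}.json")

    return (final_d, step_refs, ref_list)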

In [21]:
final_d, step_refs, ref_list = get_wiki_graph(starting_refs=["New_York_State_Route_373"], num_steps=4, filename="wiki_graph_NYSR")

(0/1): New_York_State_Route_373
(0/59): Auto_trail
(1/59): Parkways_in_New_York
(2/59): Chesterfield,_New_York
(3/59): Port_Kent_and_Hopkinton_Turnpike
(4/59): Burlington,_Vermont
(5/59): Plattsburgh,_New_York
(6/59): ISBN_(identifier)
(7/59): County_Route_17_(Essex_County,_New_York)
(8/59): 52nd_New_York_State_Legislature
(9/59): Interstate_87_(New_York)
(10/59): Lake_Champlain
(11/59): New_York_State_Legislature
(12/59): Theodore_Roosevelt_International_Highway
(13/59): Baltimore,_Maryland
(14/59): Hopkinton,_New_York
(15/59): New_York_State_Route_372
(16/59): Lake_Champlain_Transportation_Company#Burlington-Port_Kent
(17/59): General_Drafting
(18/59): Lake_Champlain_Transportation_Company
(19/59): Port_Kent_(Amtrak_station)
(20/59): Albany,_New_York
(21/59): New_York_State_Route_374
(22/59): Ausable_Chasm,_New_York
(23/59): Amtrak
(24/59): Toll_gate
(25/59): Hamlet_(New_York)
(26/59): Google_Maps
(27/59): List_of_U.S._Routes_in_New_York
(28/59): Ausable_Chasm
(29/59): Toll_road
(30/