# OOT Crawler Testing

## Base Crawler

I built a `BasicCrawler` class that handles most of the functionality needed to build a web-crawling spider. Check out `base_crawler.py` to see how it works.

I'll import it here and then override the parsing and crawling methods of `BasicCrawler` to make it work for this specific challenge.

In [6]:
from base_crawler import BasicCrawler
from collections import deque
from bs4 import BeautifulSoup

In [7]:
class ZeldaCrawler(BasicCrawler):
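    def __init__(self, base_url, db_name, coll_name, use_selenium=False):
        BasicCrawler.__init__(self, base_url, db_name, coll_name, use_selenium)
        # Keep the base URL on the instance so parsing does not rely on a
        # notebook-level global (assumed harmless if BasicCrawler stores it too).
        self.base_url = base_url

    def parse_page_for_profiles(self, page_html):
        profiles = []
        soup = BeautifulSoup(page_html, 'html.parser')
        all_anchor_tags = soup.select('a')
        profile_href = 'profiles'
        for anchor_tag in all_anchor_tags:
            try:
                # Anchors without an href raise KeyError and are skipped below
                anchor_tag_url = anchor_tag['href']
                if profile_href in anchor_tag_url:
                    profile_url = self.base_url + anchor_tag_url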
                    profile_name = anchor_tag.text
                    profiles.append((profile_url, profile_name))
            except KeyError:
                continue
        return profiles
    
    def crawl(self, coll_name=None, start_from_base=True,
                    start_page_route=None, start_record_name='start'):
        if coll_name:
            self._connect_to_new_collection(coll_name)
        # scrape and save start page
        self._scrape_and_save_start_page(start_record_name, start_from_base,
                                         start_page_route)
        # parse_start_page_for_links
        start_html = self.load_start_html_from_db()
        profiles = self.parse_page_for_profiles(start_html)
        # use queue to force breadth first search
        profiles_queue = deque(profiles)
        while profiles_queue:
            profile_url, profile_name = profiles_queue.popleft()
            # Parenthesized form prints the same in Python 2 and Python 3
            print("Scraping page at {}".format(profile_url))
            print("Scraping profile of user {}".format(profile_name))
            self.scrape_and_save_page(profile_url, profile_name)
        

## Custom Parsing Method

The links on this page are unfortunately a little hard to get. Every single link on the page is stored in an anchor (`<a>`) tag, but without a class to differentiate between profiles, speed run videos, navigation, etc. To get around this, let's collect every `<a>` tag and check whether its `href` contains `profiles`. If it does, store the URL of that profile page along with the profile's user name to make it easier to search for specific users in MongoDB later.
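Here is a small, dependency-free sketch of the same filtering idea. It is not the crawler's actual code: it swaps BeautifulSoup for the standard library's `HTMLParser`, is written for Python 3, and uses `urljoin` instead of string concatenation to build the full URL. The `ProfileLinkParser` class and the sample HTML are made up for illustration; only the two user names come from the crawl output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProfileLinkParser(HTMLParser):
    """Collect (url, name) pairs for anchors whose href contains a marker."""

    def __init__(self, base_url, marker="profiles"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.profiles = []
        self._current_href = None  # href of the profile anchor being read
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the anchor has no href
        if href and self.marker in href:
            self._current_href = href
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that sits inside a profile anchor
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            url = urljoin(self.base_url, self._current_href)
            name = "".join(self._text_parts).strip()
            self.profiles.append((url, name))
            self._current_href = None


# Made-up HTML standing in for the leaderboard page.
sample_html = """
<a href="/leaderboards/oot/any">Leaderboard</a>
<a href="/profiles/Torje">Torje</a>
<a name="top">no href here</a>
<a href="/profiles/skater82297">skater82297</a>
<a href="https://example.com/video/123">Run video</a>
"""

parser = ProfileLinkParser("http://zeldaspeedruns.com")
parser.feed(sample_html)
profile_links = parser.profiles
print(profile_links)
```

Only the two `/profiles/...` anchors survive the filter; the navigation link, the anchor without an `href`, and the video link are ignored.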

## Custom Crawling Function

I kept most of the code from the original `crawl()` method of the `BasicCrawler` base class, with the following modifications:

* Used the custom parsing method `parse_page_for_profiles` instead of the `parse_page_for_links` method.
* Modified the simple status messages to include the user name so I can keep track of what the spider is doing.
* Made the spider only scrape one level deep.
  * `BasicCrawler` will eventually (it is a work in progress) be able to scrape to an arbitrary depth defined by a given stopping criterion, so I overrode that behavior here, since we only want the users already on the leaderboard.
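To make the "one level deep" idea concrete, here is a rough sketch of a breadth-first crawl with a depth limit. It is not how `BasicCrawler` does (or will do) it. The `fake_site` dictionary and the `crawl_breadth_first` function are hypothetical stand-ins so the example runs without any network requests, and it is written for Python 3.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links on it.
fake_site = {
    "leaderboard": ["profile/a", "profile/b"],
    "profile/a": ["profile/b", "video/1"],
    "profile/b": ["leaderboard"],
    "video/1": [],
}


def crawl_breadth_first(start, get_links, max_depth=1):
    """Visit pages level by level, stopping max_depth links from start."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()  # FIFO order gives breadth-first
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


one_level = crawl_breadth_first("leaderboard", fake_site.get, max_depth=1)
two_levels = crawl_breadth_first("leaderboard", fake_site.get, max_depth=2)
print(one_level)
print(two_levels)
```

With `max_depth=1` only the pages linked directly from the start page are visited, which matches what the `ZeldaCrawler` does with the leaderboard. Raising the limit to 2 also picks up `video/1`, which is exactly the kind of page I don't want here.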


## Testing the ZeldaCrawler

First, let's set the base URL of the site and the route of the page I want to start crawling from. Let's also set the db and collection where I want to save the scraped HTML for parsing later.

In [8]:
base_url = "http://zeldaspeedruns.com"
page_route = "leaderboards/oot/any"
db_name = 'zelda'
coll_name = 'testing'

In [9]:
spider = ZeldaCrawler(base_url, db_name, coll_name)

In [10]:
spider.crawl(start_from_base=False, start_page_route=page_route)

Scraping start page...
Successful request, status code: 200
Saving start page html...
Successfully inserted record at id num 58b38b0fa4571b590b824d4b
Loading html from start page url http://zeldaspeedruns.com/leaderboards/oot/any
Scraping page at http://zeldaspeedruns.com/profiles/Torje
Scraping profile of user Torje
Successful request, status code: 200
Saving page html...
Successfully inserted record at id num 58b38b10a4571b590b824d4c
Scraping page at http://zeldaspeedruns.com/profiles/skater82297
Scraping profile of user skater82297
Successful request, status code: 200
Saving page html...
Successfully inserted record at id num 58b38b10a4571b590b824d4d
Scraping page at http://zeldaspeedruns.com/profiles/Jodenstone
Scraping profile of user Jodenstone
Successful request, status code: 200
Saving page html...
Successfully inserted record at id num 58b38b11a4571b590b824d4e
Scraping page at http://zeldaspeedruns.com/profiles/bakerawr
Scraping profile of user bakerawr
Successful request, sta

## Success! (Maybe...)

Looks like the spider worked, but with so many hits it is too hard to visually check that every request returned a 200 status code. Let's double-check the counts: we should have 1 start page in the db and 850 user profile pages.

In [15]:
# Counting the start pages
start_page_count = spider.collection.find({'start_page':True}).count()

In [16]:
# Just like the highlander, there can only be one...
start_page_count

1

In [13]:
# Ok, that looks like it worked.
# Now more importantly, there should be 850 user pages
# Let's count those...
user_pages_count = spider.collection.find({'start_page': False}).count()

In [14]:
user_pages_count

850
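The counts confirm how many pages were saved, but not what status code each request returned. One possible way to check that without reading every line is to tally the status codes in the printed output. This sketch assumes the crawl output has been captured as a string; the `tally_status_codes` helper is hypothetical, and the 404 line is a made-up failure message, since the real wording of a failed request is not shown above.

```python
import re
from collections import Counter

# A few lines in the format printed by the crawl above. The 404 line is a
# made-up example of a failure; the real failure message may look different.
log_text = """\
Scraping page at http://zeldaspeedruns.com/profiles/Torje
Successful request, status code: 200
Saving page html...
Scraping page at http://zeldaspeedruns.com/profiles/skater82297
Successful request, status code: 200
Request failed, status code: 404
"""

status_pattern = re.compile(r"status code: (\d{3})")


def tally_status_codes(text):
    """Count how many times each HTTP status code appears in the log text."""
    return Counter(status_pattern.findall(text))


status_counts = tally_status_codes(log_text)
# Anything other than 200 deserves a closer look
non_200 = {code: n for code, n in status_counts.items() if code != "200"}
print(status_counts)
print(non_200)
```

If `non_200` comes back empty for the full crawl output, every logged request returned a 200.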

## Success! (Really!)

Looks like the custom ZeldaCrawler is in business. Now that I know it is working, I'll port it over to `zelda_crawler.py` to use later in this scraping challenge.