# Web Scraping 101

Web scraping is all about gathering data from the internet that follows a very structured format. Code is very good at handling predictable cases since you can break things down to if/else statements. Now, how do we get this data, and how should we store it?

For this tutorial, I'm going to assume you're fluent in Python and have already set up the project environment. The vast majority of this tutorial (if not all of it) only concerns the "data" folder inside the root folder of the project. Feel free to open this folder in VS Code.

## Project Overview

If you want to run the web scraper, type "python index.py" (or "python3 index.py" if you're on Mac/Linux) while inside the data folder. That means index.py is the entry point of the scraper, so that's the file we'll start reading.

This file has 4 main parts to it:
- Imports
- Constants
- Functions
- \_\_main\_\_

The first one you're probably familiar with. We start by importing the standard library modules argparse, os, and unittest (all of which come built into Python).
Then we import functions/classes from our local "data" folder. 

In [None]:
from model.scrape import scrape_data
from util.cache import HTMLCache
from util.utils import time_function
from util.logger import enable_logs
from model.database import Database
from model.indexer import Indexer

The first import, from model.scrape import scrape_data, is from the "scrape.py" file inside of the model folder. The second, from util.cache import HTMLCache, is from the "cache.py" file from the util folder. Why do we separate all of these Python files into different folders? Organization. That's it.

Next we get to constants, which are pretty self-explanatory, and then we get to the main function. This is the meat of our scraper code. It's a function that takes in args and returns None. If you've never dealt with args, no worries, we'll get there. Underneath the "main" function is where our code *actually* starts.

In [None]:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--reset_cache", action="store_true")
    parser.add_argument("--only_tests", action="store_true")
    parser.add_argument("--skip_tests", action="store_true")
    parser.add_argument("--skip_scraping", action="store_true")
    parser.add_argument("--skip_indexing", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--data_path", default=DEFAULT_DATA_PATH)
    parser.add_argument("--database_filename", default=DEFAULT_DATABASE_FILENAME)
    parser.add_argument("--index_filename", default=DEFAULT_INDEX_FILENAME)
    parser.add_argument("--vocabulary_filename", default=DEFAULT_VOCABULARY_FILENAME)
    args = parser.parse_args()
    main(args)


If you don't remember what the if statement is checking for, it just checks whether the current file is the one being run directly rather than being imported.

## Command Line Arguments

We start the program by parsing a bunch of arguments, which let us configure how we run the application without us having to modify our code. If we want to reset our cache before we run the code, instead of doing it manually, all we have to do now is run the program like this:

In [None]:
python index.py --reset_cache
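If you want to poke at how these flags behave without running the whole scraper, here's a small sketch. It only uses argparse itself. The "./example_data" default is a made-up value for illustration, not the project's DEFAULT_DATA_PATH.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # store_true flags default to False and flip to True when the flag is present
    parser.add_argument("--reset_cache", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    # "./example_data" is a made-up default for this sketch
    parser.add_argument("--data_path", default="./example_data")
    return parser


# passing a list simulates typing those words after "python index.py"
args = build_parser().parse_args(["--reset_cache"])
print(args.reset_cache, args.verbose, args.data_path)
```

Flags you don't pass keep their defaults, which is why main can read something like args.skip_tests without checking whether it exists first.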

After we parse the arguments, we call the main function. I'll break it down piece by piece.

In [None]:
def main(args: argparse.Namespace) -> None:
    # run the test suite unless it should be skipped
    if not args.skip_tests:
        # look through the ./tests folder for unit tests
        test_suite = unittest.TestLoader().discover('./tests')
        test_result = unittest.TextTestRunner(verbosity=0).run(test_suite)
        
        if len(test_result.errors) > 0 or len(test_result.failures) > 0:
            print("Test suite failed, quitting program")
            exit()

        # if we only wanted to run tests, end the program here
        if args.only_tests:
            exit()
    else:
        print("Skipping tests, good luck!")


We start with the test suite. If the tests fail, that means the rest of the code will probably fail, so we just quit the program early. The tests are located inside of the tests folder, and they're organized into separate files based on what they're supposed to test.
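If you haven't written unittest tests before, here's roughly what one looks like and how a suite gets run from code. This is a stand-in sketch, not one of the project's real tests. LowercaseTest and suite_passes are names I made up. One thing worth knowing: discover only picks up files whose names match test*.py by default.

```python
import unittest


class LowercaseTest(unittest.TestCase):
    # a stand-in test; the project's real tests live in the tests folder
    def test_lowercase(self):
        self.assertEqual("CompSci".lower(), "compsci")


def suite_passes() -> bool:
    # load the tests from one class instead of discovering a folder
    suite = unittest.TestLoader().loadTestsFromTestCase(LowercaseTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    # wasSuccessful() is False if any test errored or failed
    return result.wasSuccessful()


if __name__ == "__main__":
    print("tests passed:", suite_passes())
```

Checking wasSuccessful() is an alternative to counting errors and failures by hand.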

In [None]:
# if we want extra logs, enable them
if args.verbose:
    enable_logs()

# create the data folder if it doesn't exist
if not os.path.exists(args.data_path):
    os.mkdir(args.data_path)

We then enable extra logs if the --verbose flag was passed, and create the data folder if it doesn't already exist.

In [None]:
# create the cache, reset if necessary
cache = HTMLCache(reset=args.reset_cache)

# create the database
database = Database(args.data_path, args.database_filename)    

## Cache

Then we create 2 very important objects. The first is a cache. What's this cache for? We're going to be downloading a lot of data from the internet, and one thing you'll learn is that downloading stuff is very slow, and we want to reduce how many times we do that.

What the cache does is every time it downloads something from the internet, say the HTML code for the website "www.google.com", it saves that code to a file on your local disk, and maps the url "www.google.com" to where it is in that file. This way, the next time you want the HTML code for "www.google.com", it can just read it from your local disk, which is a lot faster than downloading it again.

Why would you want to download the HTML code for "www.google.com" multiple times? Well let's say you ran your scraper, and then made some changes to it and want to run it again. If you run it again after making those changes, you would probably be looking through the same urls that you were looking through earlier, right? This means that if you re-run your code, it won't have to re-download all of those old websites.

One question you might have is, well if the page on the internet changes, like the logo for "www.google.com" changing, will the HTML code in our cache change to reflect that? No, since I didn't program it to. This might be a problem if the website that we're downloading data from changes a lot, but the UCI General Catalogue doesn't change very often, maybe once a quarter/year. We can pass in the --reset_cache flag when we run the code to delete the cache files and re-download the newest version of the web page.
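To make the idea concrete, here's a minimal sketch of a disk cache. It is not the project's HTMLCache. SimpleHTMLCache, its fetch parameter, and the one-file-per-url layout are all my assumptions for illustration (the description above keeps a url-to-location mapping instead). The download function is passed in so you can try this without touching the network.

```python
import hashlib
import os
from typing import Callable


class SimpleHTMLCache:
    """Hypothetical stand-in for HTMLCache, not the project's code."""

    def __init__(self, folder: str, fetch: Callable[[str], str], reset: bool = False) -> None:
        self.folder = folder
        self.fetch = fetch  # any function that turns a url into HTML text
        os.makedirs(folder, exist_ok=True)
        if reset:
            # throw away everything saved by earlier runs
            for name in os.listdir(folder):
                os.remove(os.path.join(folder, name))

    def _path_for(self, url: str) -> str:
        # hash the url so it becomes a safe, unique file name
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.folder, digest + ".html")

    def get_html(self, url: str) -> str:
        path = self._path_for(url)
        if os.path.exists(path):
            # cache hit: read from disk, no download needed
            with open(path, encoding="utf-8") as f:
                return f.read()
        # cache miss: download once, then save for next time
        html = self.fetch(url)
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        return html
```

Notice that nothing here ever checks whether the page changed online, which is the same staleness tradeoff described above. The reset option is the only way to get fresh copies.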

What's the database object? It's not too complicated luckily, it's basically a Python dictionary that knows how to save/load itself from a file. That's it.
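Here's a sketch of what "a dictionary that knows how to save/load itself" could look like. It is not the project's Database class. SimpleDatabase is a made-up name, and I'm assuming plain dicts stored as JSON, whereas the real one stores CourseData objects and may use a different file format. The method names mirror the ones used in this tutorial.

```python
import json
import os


class SimpleDatabase:
    """Hypothetical stand-in for the project's Database class."""

    def __init__(self, folder: str, filename: str) -> None:
        self.path = os.path.join(folder, filename)
        self.courses = {}  # course id -> course data

    def update_course_data(self, course_id: str, course: dict) -> None:
        # adds the course, or overwrites it if the id is already there
        self.courses[course_id] = course

    def save_course_data(self) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.courses, f)

    def load_course_data(self) -> None:
        with open(self.path, encoding="utf-8") as f:
            self.courses = json.load(f)
```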

In [None]:
if not args.skip_scraping:
    # begin scraping 
    time_function(scrape_data, cache, database)
else:
    # if we wanted to load a previous database instead
    # of scraping all over again, load here
    database.load_course_data()

Now we start scraping stuff from the internet. If you look at the time_function function from the util.utils module, it just runs the function it's given and logs to the console how long that took, in this case for scrape_data. Let's look at scrape_data now.
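As a quick aside, a timing helper like that can be written in a few lines. This is my guess at the general shape, not the project's actual time_function (which uses the project's logger rather than print).

```python
import time
from typing import Any, Callable


def time_function(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    # perf_counter is a clock meant for measuring elapsed time
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    name = getattr(func, "__name__", repr(func))
    print(f"{name} took {elapsed:.2f} seconds")
    # hand back whatever the wrapped function returned
    return result
```

With that out of the way, here is scrape_data: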

In [None]:
def scrape_data(cache: HTMLCache, database: Database) -> None:
    courses = scrape_courses(cache)

    log("Updating course database")
    # store the courses into the database
    for course in courses:
        database.update_course_data(course.id, course)
    log("Saving course database")
    database.save_course_data()
    log("Saved database")

This code is pretty simple. It scrapes a list of courses from the internet, which will be a list of CourseData objects. It then updates the database with these courses and saves the database to the disk as a file.

Let's look at how scrape_courses works now. Its file has the same structure as a lot of the code we've looked at:
- Imports
- Constants
- Functions

## BeautifulSoup

The imports are probably pretty self-explanatory, but we have this external one called BeautifulSoup. This is where the real magic of our scraper comes from. BeautifulSoup turns HTML code (which is usually a string from a file), into easily navigable objects called soups, or BeautifulSoups. You can read more about it on its documentation page: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

We start at the scrape_courses function in the scrape_courses.py file:

In [None]:
# scrapes all of the courses from the COURSES_URL page and returns a list of CourseData objects
def scrape_courses(cache: HTMLCache) -> list:
    log("Scraping courses")
    courses = []
    course_department_urls = get_course_department_urls(cache)
    for course_department_url in course_department_urls:
        department_courses = scrape_course_department(cache, course_department_url)
        courses.extend(department_courses)
    return courses

This function logs a little message to the console and then grabs a list of all of the course department urls. It then goes through each url, downloads and parses that page for its list of courses, and adds them to the total list of courses. How does it scrape the list of course_department_urls? Let's look at the function:

In [None]:
# returns a list of urls of each course department
def get_course_department_urls(cache: HTMLCache) -> list:
    course_department_urls = []
    all_courses_soup = cache.get_soup(COURSES_URL)
    letter_lists = all_courses_soup.find(id="atozindex").find_all("ul")
    for letter_list in letter_lists:
        department_urls = scrape_urls(letter_list, COURSES_URL)
        course_department_urls.extend(department_urls)
    return course_department_urls

The function starts by grabbing the soup of the url COURSES_URL, which has the value of "http://catalogue.uci.edu/allcourses/". Why don't you go to this page? You'll see that it has a list of departments, but more importantly, each of these departments links to its own department page. Try clicking on a department page. You'll notice that that department page contains all of the courses in that department with most of the data we need for our database. 

Go back to the COURSES_URL page and open up the inspector (Ctrl+Shift+I for Windows or right click and click on Inspect). If you go to the Elements tab (which is usually the default tab), you'll see that you're looking at the HTML code of the page. If you hover over various elements, that element will highlight on the website. This is how BeautifulSoup sees the website, as its HTML code.

Now, right click on one of the departments and click "Inspect".

![inspect1](inspect1.PNG)

If you look on the Inspector on the right now, you'll see that it jumped to where the element is in the HTML code.

![inspect2](inspect2.PNG)

The element is an anchor (\<a\>) element, which means that it stores a link to a different page. You can see which page it links to in its "href" attribute, which in this case leads to /allcourses/ac_eng/.

You might say, what kind of path is /allcourses/ac_eng/? Well, it's not a full url, it's supposed to be stuck at the end of the current host, which is https://catalogue.uci.edu/. If you stick the host and the path together, you get: https://catalogue.uci.edu/allcourses/ac_eng/, which is the url of the department page, which we can then download and scrape.
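You don't have to glue those strings together by hand. Python's standard library has urljoin for exactly this. In this sketch, PAGE_URL and to_absolute are names I made up. The url values are the ones mentioned above.

```python
from urllib.parse import urljoin

# the page we found the link on (name is made up for this sketch)
PAGE_URL = "https://catalogue.uci.edu/allcourses/"


def to_absolute(href: str, page_url: str = PAGE_URL) -> str:
    # urljoin resolves href the same way a browser would when you click it
    return urljoin(page_url, href)


print(to_absolute("/allcourses/ac_eng/"))
```

A path that starts with "/" replaces the whole path of the page url, while one without the leading slash is resolved relative to the page's own folder. urljoin handles both cases.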

So our battle plan is this: look for all of these anchor elements, compute the urls of the department pages by sticking each href path onto the host, and then download and scrape the courses on those department pages.

But before we get there, how the heck do we grab all of these anchor elements and their hrefs? If you read through BeautifulSoup's documentation, you'll find that if you're given a soup, there are 2 important methods for finding stuff: find and find_all.

The "find" function lets you find the first element with a given set of parameters. If you use "soup.find("div", id="bob")", then it will return the first div element with the id "bob". The "find_all" function will return all elements with the given set of parameters. If you use "soup.find_all("ul")", then it will return all "ul" elements.
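Here's a tiny example you can run to see both in action. The HTML string is made up for illustration (it is not the catalogue page), and this assumes you have the bs4 package installed.

```python
from bs4 import BeautifulSoup

# made-up HTML, just to have something to search through
HTML = """
<div id="bob">
  <ul><li><a href="/one/">One</a></li></ul>
  <ul><li><a href="/two/">Two</a></li></ul>
</div>
<div id="alice"><ul><li>Other</li></ul></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

bob = soup.find("div", id="bob")   # first div with id "bob"
all_lists = soup.find_all("ul")    # every <ul> on the page
bob_lists = bob.find_all("ul")     # only the <ul>s inside bob

print(len(all_lists), len(bob_lists))
# attributes are read like dictionary keys
print(bob.find("a")["href"])
```

One gotcha: find returns None when nothing matches, so chaining another call onto a failed find will raise an AttributeError.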

There are obviously a lot more features, but with these alone, we can grab all of the anchor elements we need and their href attributes. If you notice the structure of the HTML code, these anchor elements live inside list item (\<li\>) elements, which live inside unordered list (\<ul\>) elements, which live inside a div element with id "atozindex". This means that if we search for all of the anchor elements inside of these unordered list elements inside of the "atozindex" div, we'll get all the anchor elements we need. Let's look back at the get_course_department_urls function:

In [None]:
# returns a list of urls of each course department
def get_course_department_urls(cache: HTMLCache) -> list:
    course_department_urls = []
    all_courses_soup = cache.get_soup(COURSES_URL)
    letter_lists = all_courses_soup.find(id="atozindex").find_all("ul")
    for letter_list in letter_lists:
        department_urls = scrape_urls(letter_list, COURSES_URL)
        course_department_urls.extend(department_urls)
    return course_department_urls

We start by grabbing the soup for the page, and from that soup, we find an element with id="atozindex" (the div we were talking about earlier). Remember that ids are supposed to be unique across the entire page, so only this particular div should have this id. We then call .find_all("ul") to grab all unordered list elements inside of that div. Now for each of these unordered list elements, we search for all of the anchor elements we need and join their href values onto COURSES_URL. Since this process is kinda tedious, I put it in its own utility function, scrape_urls.

If you read through scrape_urls, you'll notice that it doesn't look for all of the list item elements (\<li\>) first before looking for the anchor elements. That's because BeautifulSoup looks recursively for elements. That means it will look inside the children of an element, and the children of those children, and the children of those children, and so on. So why didn't we just look for all anchor elements on the page from the start? The problem is that these course department anchor elements aren't the only anchor elements on the page, which means find_all would return too many elements and it would be annoying to look through them all to see if they're what we're looking for.
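If you don't have the file open, here's a sketch of what a helper like that could look like. It is my approximation under the description above, not the project's actual scrape_urls, which is why it's named scrape_urls_sketch. The sample HTML is made up, apart from the /allcourses/ac_eng/ path we saw in the inspector.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_urls_sketch(element, base_url: str) -> list:
    urls = []
    # find_all searches all descendants, so the <li> layer needs no special handling
    # href=True skips anchors that have no href attribute at all
    for anchor in element.find_all("a", href=True):
        urls.append(urljoin(base_url, anchor["href"]))
    return urls


SAMPLE = """
<a href="/elsewhere/">some navigation link</a>
<div id="atozindex">
  <ul><li><a href="/allcourses/ac_eng/">AC ENG</a></li></ul>
</div>
"""
sample_soup = BeautifulSoup(SAMPLE, "html.parser")
letter_list = sample_soup.find(id="atozindex").find("ul")
print(scrape_urls_sketch(letter_list, "https://catalogue.uci.edu/allcourses/"))
```

Because the search starts from the list element rather than the whole soup, the navigation link outside of the "atozindex" div never shows up in the results.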

After grabbing those urls, the get_course_department_urls function appends them all to one list and returns them.

I'm not going to follow the rest of the functions because I feel like they're mostly self explanatory with the information I've given above. Web scraping really comes down to these 2 things:
- Inspect the HTML page manually for the elements you're looking for and their patterns
- Use BeautifulSoup to find those elements in the HTML code and grab the resources you're looking for from them

Let's say we've scraped all of the course data and we have this massive database of data. Now what?

## Search Engine

The goal of having all this data is to make it searchable by the user. Let's say that the user wants to find the course "COMPSCI 121". What we're providing is a search bar that lets the user type in whatever query they want, as well as a couple of filters that we'll talk about later.

Now here's the problem...the search query can be whatever the user wants. They might type in "compsci 121". They might type in "information retrieval". They might even type in "cOmPsCi 121", and expect the same results for each of these queries. They might even expect different results. How do we make sure that they get the course they're looking for?

This is called Free Text Search (you'll also see it called full-text search). There's no 1 right way to do it, but here's the general formula:

- Preprocess/Tokenize everything
- Create indexes
- (and in our case, because we have those "filters") Filter results
- Rank results

## Tokenization

The first thing we have to do is standardize all of the text we have. That means lowercasing everything and tokenizing everything. What is tokenizing? Tokenizing is splitting up text into its words, i.e. "information retrieval" becomes "information", "retrieval". This makes it so that we don't have to look for individual characters in our database, we can look for whole words. Usually when you tokenize you get rid of things like punctuation so that users don't have to type it in, i.e. "That's his" becomes "thats", "his". It seems simple enough, but how would you tokenize hyphens? Would you split the word up into two different words? Should you include certain characters like "+" for "C++"? We can develop this tokenizer as we learn more about our data.

If we tokenize the user's search query, we should tokenize our database too, and ideally with the same tokenization algorithm. This way, we can see if the tokens in our search query exist in the tokens in our database.
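Here's a minimal tokenizer sketch to show the idea. It is a starting point under my own assumptions (only English letters and digits count as token characters), not the project's tokenizer.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    # lowercase first so "cOmPsCi" and "compsci" produce the same token
    text = text.lower()
    # drop apostrophes (straight and curly) so "that's" becomes "thats"
    text = text.replace("'", "").replace("\u2019", "")
    # every run of letters/digits is a token; anything else acts as a separator
    return re.findall(r"[a-z0-9]+", text)


print(tokenize("That's his"))
print(tokenize("cOmPsCi 121"))
print(tokenize("C++"))
```

This version splits hyphenated words in two and turns "C++" into just "c", which are exactly the kinds of decisions mentioned above that you'd revisit once you know your data better.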

But how would we search through these tokens? If we have the tokens "information" and "retrieval" from our search query, would we go through every single course in our database and check to see which ones have those tokens in their text somewhere? This may be fine if we had few enough courses, but you can imagine that looking through the entire database over and over again for each query would be pretty inefficient.

## Indexes

This is where indexes come into play. Indexes are data structures that make searching faster. We can imagine that after we tokenize our course database, it would look something like this:

In [None]:
# {
#     course_id1: [list of tokens1],
#     course_id2: [list of tokens2],
#     course_id3: [list of tokens3],
#     ...
# }

In which case, if we were looking for all of the courses with the tokens "information" and "retrieval", we could look through every single course id and check if both "information" and "retrieval" are in their list of tokens, and if they are, add them to some list. We have a little under 6000 courses in the database, which means that each search would have to look inside our database about 6000 times, not including the search for the tokens inside of each tokens list. What if our database had a different structure...what if we flipped things around?

In [None]:
# {
#     token1: [list of courses_ids1],
#     token2: [list of courses_ids2],
#     token3: [list of courses_ids3],
#     ...
# }

What if, as above, we had each token as a key, and each value as the list of courses that have that token? Then, all we would have to do to find all courses with tokens "information" and "retrieval" would be to look up those two tokens and keep whichever courses appear in both lists. That's 2 database lookups instead of 6000, which is pretty good in my opinion.

Now we wouldn't want to restructure our database because we might want it in that original structure for other purposes, but instead we could create a new data structure with the structure that we were just talking about. That would be our index. (This flipped layout is commonly called an inverted index.)
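Here's a small sketch of flipping the structure and searching it. The function names and the sample course ids are made up for illustration, and I'm using sets instead of lists so that "courses in both lists" becomes a set intersection. The project's Indexer may be organized differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(course_tokens: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    # flip {course_id: [tokens]} into {token: {course_ids}}
    index = defaultdict(set)
    for course_id, tokens in course_tokens.items():
        for token in tokens:
            index[token].add(course_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query_tokens: List[str]) -> Set[str]:
    if not query_tokens:
        return set()
    # one lookup per query token; an unknown token matches no courses
    postings = [index.get(token, set()) for token in query_tokens]
    # keep only the courses that contain every query token
    return set.intersection(*postings)


# made-up course ids and tokens
course_tokens = {
    "EXAMPLE 1": ["information", "retrieval"],
    "EXAMPLE 2": ["information", "systems"],
    "EXAMPLE 3": ["retrieval", "practice"],
}
index = build_index(course_tokens)
print(search(index, ["information", "retrieval"]))
```

Building the index still touches every course once, but you only pay that cost when the data changes, not on every search.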

Now let's say we use this index, and grab our list of courses that we wanted. Should we just show them as it is? One thing that you should know about search is that users are lazy. They want the result they were looking for at the very top of the list of results. When you look stuff up on Google, how often do you look past the first 3 or 4 results? That's why we have to rank our results by how relevant they are to the user's query.

There's actually no one way to do this, but there are some popular algorithms like tf-idf. It really should be unique to every search engine. In our case, since we know users are searching for courses, we probably want to put the course with id "ICS 33" at the top and courses with prerequisite "ICS 33" underneath instead of vice versa.
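To show what "rank by relevance" can mean in practice, here's one possible scoring sketch: a match in the course id counts for more than a match in the prerequisites. This is not tf-idf and not the project's ranking. The field names, the weights, and the sample courses are all made up.

```python
from typing import Dict, List

# made-up weights: a match in the id matters far more than one in the prerequisites
FIELD_WEIGHTS: Dict[str, float] = {"id": 10.0, "title": 3.0, "prerequisites": 1.0}


def score(course: Dict[str, str], query_tokens: List[str]) -> float:
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        # naive tokenization: lowercase and split on whitespace
        field_tokens = course.get(field, "").lower().split()
        matches = sum(1 for token in query_tokens if token in field_tokens)
        total += weight * matches
    return total


def rank(courses: List[Dict[str, str]], query_tokens: List[str]) -> List[Dict[str, str]]:
    # highest score first
    return sorted(courses, key=lambda course: score(course, query_tokens), reverse=True)


# made-up sample data for illustration
courses = [
    {"id": "ICS 99", "title": "Some later course", "prerequisites": "ICS 33"},
    {"id": "ICS 33", "title": "Some earlier course", "prerequisites": "ICS 32"},
]
print([course["id"] for course in rank(courses, ["ics", "33"])])
```

Both courses mention "33" somewhere, but the one whose id matches the query ends up on top, which is the behavior described above.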

We probably also want to add some special cases into our index, like acronyms. You probably don't want to have to write "compsci 121" into the search bar, maybe you just want to write "cs 121". There's really no one way to do search, so we will adapt our algorithm as we go.
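One simple way to handle acronyms is to rewrite query tokens before looking them up in the index. Here's a sketch of that idea. The alias table and the function name are made up, and a real table would need to be built from the actual department names.

```python
from typing import Dict, List

# made-up alias table: shorthand on the left, the token used in the index on the right
ALIASES: Dict[str, str] = {"cs": "compsci"}


def expand_aliases(tokens: List[str], aliases: Dict[str, str] = ALIASES) -> List[str]:
    # swap each known shorthand for its full token; leave everything else alone
    return [aliases.get(token, token) for token in tokens]


print(expand_aliases(["cs", "121"]))
```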

And if you couldn't tell from the length of this tutorial, there's a lot to web development, just in the web scraping/search engine side alone (the front end and backend are their own beasts), so ask questions! Spend some time on it. It'll take a couple of hours just to get one thing in at the beginning, but you'll get better slowly through experience. If you ever want to get on a call and have me walk you through stuff, I'd be happy to. Good luck!