# Taxi (C@B Scraper)

## Intro
This is a tool to retrieve a list of courses and course descriptions from Courses@Brown (C@B). **I am not responsible for what you do with this script or the data it produces (specifically, I'm not liable for potential violations of C@B's or Brown's TOS with this tool).** C@B's API endpoints are publicly accessible and are noted below. If C@B ever starts requiring authentication or otherwise restricts access to its database, these scripts will stop working.

The tool makes simple `POST` requests to the C@B API. In some cases, it sends parameters that a request originating from the website would not allow, for example requesting a list of _all_ listed courses. Please be mindful of this when using the tool.
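
One way to stay mindful of load on C@B is to pause between consecutive requests. The following is a minimal sketch, not part of Taxi itself: `throttled_map`, `fake_fetch`, and the `delay_seconds` value are hypothetical names and numbers chosen for illustration, and the stand-in fetch function makes no network calls.

```python
import time

def throttled_map(fetch, items, delay_seconds=0.5):
    """Call fetch(item) for each item, sleeping between consecutive calls.

    `fetch` is any callable; `delay_seconds` is an assumed, arbitrary pause.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay_seconds)  # pause between calls, not before the first
        results.append(fetch(item))
    return results

# Stand-in for a real request function, so this example makes no network calls.
def fake_fetch(crn):
    return {"crn": crn}

print(throttled_map(fake_fetch, ["00001", "00002"], delay_seconds=0.1))
```

A suitable delay depends on how many courses are being requested; the value above is only a placeholder.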

## Usage
You should _always_ check the repository first to see whether an up-to-date copy of the C@B data already exists. If it does, you don't need to touch any of this code, and you shouldn't make unnecessary requests! I will try to keep these files up to date, but they may become outdated. Feel free to fork this repository and open a pull request to update the data.

Taxi _always_ checks whether output has already been generated. If it has, Taxi loads that output instead of scraping C@B again. To retrieve course data afresh, delete the `out` directory and run the code. This generates a new `out` directory with the accompanying JSON files for reuse.
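
The load-if-present, otherwise-fetch behaviour described above can be sketched in isolation. This is an illustrative sketch rather than Taxi's actual code: `load_or_fetch` and `fake_fetch_list` are hypothetical names, and the demonstration runs in a temporary directory with a stand-in fetch function instead of contacting C@B.

```python
import json
import os
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return cached JSON from cache_path if present; otherwise call fetch() and cache it."""
    if os.path.isfile(cache_path):
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    data = fetch()
    # Create the parent directory if needed before writing the cache.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as cache_file:
        json.dump(data, cache_file)
    return data

# Stand-in fetch function that records how often it is called.
calls = []
def fake_fetch_list():
    calls.append(1)
    return {"results": []}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache = os.path.join(tmp_dir, "out", "course_list.json")
    first = load_or_fetch(cache, fake_fetch_list)   # cache miss: fake_fetch_list runs
    second = load_or_fetch(cache, fake_fetch_list)  # cache hit: fake_fetch_list is skipped
    print(len(calls))
```

Deleting the cached file is what forces the next call to fetch again, which mirrors deleting the `out` directory.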

## JSON Schema
I've tried to detail some of the return schema below. 
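
As a rough guide, the code in this notebook reads only a few fields: the listing response has a `results` list, and each entry carries a `crn` and a `srcdb`. The sketch below mocks that shape; the values are invented placeholders rather than real C@B data, real responses contain more fields than shown, and `course_keys` is a hypothetical helper.

```python
# Mock listing response. Only the keys that the code in this notebook reads are
# shown; the values are invented placeholders, not real C@B data.
mock_listing = {
    "results": [
        {"crn": "00001", "srcdb": "999999"},
        {"crn": "00002", "srcdb": "999999"},
    ]
}

def course_keys(listing):
    """Return (crn, srcdb) pairs, the identifiers needed to request course details."""
    return [(course["crn"], course["srcdb"]) for course in listing["results"]]

print(course_keys(mock_listing))
```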

### Imports
We first import the necessary packages: `requests` for requests to the C@B API and `json` to build the JSON payload and parse the response. We also need filesystem access to read and write files. Installing `tqdm` enables progress bars, but the code can fall back to running without it.

In [None]:
import requests
import json
from os import path
# Imports zipfile to (de)compress output
import zipfile
try:
    import zlib
    compression = zipfile.ZIP_DEFLATED
except ImportError:
    compression = zipfile.ZIP_STORED
# Progress bars. If tqdm is not installed, fall back to a no-op wrapper.
try:
    from tqdm.auto import tqdm
except ImportError:
    tqdm = lambda x: x

### API Endpoints
These define global variables for C@B API endpoints. 

In [None]:
# url to request list of courses. Must pass payload as specified (below). 
LIST_REQUEST_URL = "https://cab.brown.edu/api/?page=fose&route=search"
# url to request specific course information, must pass payload as specified (below). 
COURSE_REQUEST_URL = "https://cab.brown.edu/api/?page=fose&route=details"

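The request bodies sent to these two routes, as used by the functions in the sections that follow, can be built and inspected without sending anything. This is a small sketch with hypothetical helper names (`build_list_payload`, `build_course_payload`); the payload structure is copied from this notebook's own code, and the CRN passed in the usage lines is a made-up value.

```python
import json

def build_list_payload(criteria=None, srcdb="999999"):
    """Serialize the body for the search route; empty criteria requests every course."""
    return json.dumps({"other": {"srcdb": srcdb}, "criteria": criteria or []})

def build_course_payload(crn, srcdb):
    """Serialize the body for the details route for one CRN in one srcdb."""
    return json.dumps({"group": "", "key": "crn:" + str(crn), "srcdb": str(srcdb)})

# Inspect the serialized bodies; nothing is sent over the network here.
print(build_list_payload())
print(build_course_payload(12345, 999999))
```

Printing the bodies first is a cheap way to check a narrower `criteria` list before issuing a real request.
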
### Directory Endpoints
It's also nice to define where we want our output files to go. These are relative paths. 

In [None]:
import os
OUTPUT_DIR = "out/"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # Create the output directory if it is missing.
COURSE_LIST_PATH = OUTPUT_DIR + "course_list.json"
COURSE_DESC_PATH = OUTPUT_DIR + "course_desc.json"

### Course Listing
The following functions request a full listing of the courses currently catalogued by C@B. They return the listing as a dictionary and also save a `.json` copy, along with a zipped copy, to the output directory.

In [None]:
def get_course_list():
    # Payload with placeholder srcdb and empty criteria. Returns the entire database of courses offered.
    payload = {"other": {"srcdb": "999999"}, "criteria": []}
    # Alternative payload to test on a small set of data (doesn't overload C@B).
    # payload = {"other": {"srcdb": "999999"}, "criteria": [{"field":"keyword","value":"csci 0300"},{"field":"is_ind_study","value":"N"}]}
    # Make the request and fail loudly on an HTTP error status.
    r = requests.post(LIST_REQUEST_URL, data=json.dumps(payload))
    r.raise_for_status()
    json_output = r.json()
    print("Successfully retrieved records for " + str(len(json_output['results'])) + " courses.")
    return json_output

def save_list(response):
    # Saves response as a .json file for easy access
    with open(COURSE_LIST_PATH, 'w+') as outfile:
        json.dump(response, outfile)
    with zipfile.ZipFile(COURSE_LIST_PATH + ".zip", "w") as zip_ref:
        print("zipping")
        zip_ref.write(COURSE_LIST_PATH, compress_type=compression)

def course_listings():
    if path.isfile(COURSE_LIST_PATH):
        with open(COURSE_LIST_PATH) as course_list_file:
            data = json.load(course_list_file)
            print("Existing data found for course listings, " + str(len(data['results'])) + " courses loaded.")
            return data
    elif path.isfile(COURSE_LIST_PATH + ".zip"):
        print("Zipfile found, unzipping course listings.")
        with zipfile.ZipFile(COURSE_LIST_PATH + ".zip", "r") as zip_ref:
            # save_list stores the member as "out/course_list.json", so extract relative to the working directory.
            zip_ref.extractall()
        # Runs again since the file now exists.
        return course_listings()
    else:
        print("Could not find existing data, querying C@B.")
        data = get_course_list()
        save_list(data)
        return data

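`ZipFile.write` stores a member under the path it was given unless `arcname` is set, and `extractall` recreates that stored path beneath the target directory, so the two need to agree. A minimal round trip in a temporary directory, assuming `zlib` is available for `ZIP_DEFLATED` and using hypothetical helper names (`zip_json`, `unzip_into`), might look like this:

```python
import json
import os
import tempfile
import zipfile

def zip_json(json_path, zip_path):
    """Compress json_path into zip_path, storing only the bare file name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(json_path, arcname=os.path.basename(json_path))

def unzip_into(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return archive.namelist()

with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "course_list.json")
    with open(source, "w") as outfile:
        json.dump({"results": []}, outfile)
    zip_json(source, source + ".zip")
    os.remove(source)  # simulate having only the archive on disk
    members = unzip_into(source + ".zip", tmp_dir)
    restored = os.path.isfile(source)
    print(members, restored)
```

Checking `namelist()` on an existing archive shows which stored paths it contains before anything is extracted.
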
### Course Descriptions
The following functions request course descriptions for each course. 

In [None]:
def get_course(crn, srcdb, verbose=False):
    """
    Gets the response for a course with a specific CRN and 'srcdb' (semester id).
    Both crn and srcdb may be passed as either a string or a number.
    """
    if verbose:
        print("Getting course description for CRN " + str(crn) + " and srcdb " + str(srcdb) + ".")
    payload = {"group": "", "key": "crn:" + str(crn), "srcdb": str(srcdb)}
    r = requests.post(COURSE_REQUEST_URL, data=json.dumps(payload))
    return json.loads(r.text)

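def get_course_desc(course_list=None):
    # Load the listing lazily; a default of course_listings() would run at definition time.
    if course_list is None:
        course_list = course_listings()
    descriptions = []
    for course in tqdm(course_list['results']):
        descriptions.append(get_course(course['crn'], course['srcdb']))
    return descriptions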

def save_desc(response):
    # Saves response as a .json file for easy access
    with open(COURSE_DESC_PATH, 'w+') as outfile:
        json.dump(response, outfile)
    with zipfile.ZipFile(COURSE_DESC_PATH + ".zip", "w") as zip_ref:
        print("zipping")
        zip_ref.write(COURSE_DESC_PATH, compress_type=compression)

def course_desc():
    if path.isfile(COURSE_DESC_PATH):
        with open(COURSE_DESC_PATH) as course_desc_file:
            data = json.load(course_desc_file)
            print("Existing data found for course metadata, " + str(len(data)) + " courses loaded.")
            return data
    elif path.isfile(COURSE_DESC_PATH + ".zip"):
        print("Zipfile found, unzipping course metadata.")
        with zipfile.ZipFile(COURSE_DESC_PATH + ".zip", "r") as zip_ref:
            # save_desc stores the member as "out/course_desc.json", so extract relative to the working directory.
            zip_ref.extractall()
        # Runs again since the file now exists.
        return course_desc()
    else:
        print("Could not find existing data, querying C@B for course metadata.")
        data = get_course_desc()
        save_desc(data)
        return data

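Because `get_course_desc` issues one request per course, an exception on any single request propagates and the descriptions gathered so far are lost. One possible variation, sketched here with hypothetical names (`fetch_descriptions`, `flaky_fetch`) and a stand-in fetcher that needs no network access, records failures and carries on:

```python
def fetch_descriptions(courses, fetch_one):
    """Fetch each course, collecting failures instead of aborting the whole run.

    `fetch_one(crn, srcdb)` is any callable taking the same arguments as get_course.
    """
    descriptions, failed = [], []
    for course in courses:
        try:
            descriptions.append(fetch_one(course["crn"], course["srcdb"]))
        except Exception as error:  # broad on purpose: network and JSON errors alike
            failed.append((course["crn"], str(error)))
    return descriptions, failed

# Stand-in fetcher that fails for one made-up CRN.
def flaky_fetch(crn, srcdb):
    if crn == "00002":
        raise ValueError("simulated failure")
    return {"crn": crn, "srcdb": srcdb}

sample = [{"crn": "00001", "srcdb": "999999"}, {"crn": "00002", "srcdb": "999999"}]
done, failed = fetch_descriptions(sample, flaky_fetch)
print(len(done), failed)
```

The `failed` list could then be retried on its own rather than repeating every request.
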
In [None]:
cdesc = course_desc()

In [None]:
import pandas as pd

df = pd.DataFrame(cdesc)

In [None]:
df.head()

In [None]:
df['title']