# Scraping for, Storing, and Exploring New York City Comedy Shows

As a big fan of stand-up comedy, I was frustrated by the design of the Comedy Cellar's website and by the lack of a central list of shows in New York City. Apart from bigger shows, which are usually listed on sites such as Ticketmaster, show information appears only on individual comedy club websites.

I was inspired to collect the Comedy Cellar's shows after using their website. Users choose a date from a dropdown menu, and the site shows the headlines for all of that night's shows, but the user must then click each headline individually to reveal the show's full information. For a single date, this can take up to about ten clicks to reveal everything.

The goal of this project was both to make the Comedy Cellar's show information easily queryable and to store show information for multiple clubs in one database. So far I have scraped shows from the Comedy Cellar, The Stand NYC, and Carolines, with the possibility of adding more clubs in the future. The shows were scraped using BeautifulSoup on December 16, 2017.

I chose MongoDB to store the shows because its flexible schema accommodates shows that lack values for some fields, and because it is better suited to storing array fields than a SQL database.
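
To make the schema point concrete, here is a minimal sketch of two show documents with different shapes. The field names (`club`, `title`, `acts`) and the values are hypothetical and are not taken from the scraped data; the `missing_fields` helper is also just for illustration. Both documents could sit in the same collection, even though one has an array of acts and the other omits that field entirely.

```python
# Hypothetical show documents; the field names and values are illustrative
# assumptions, not the actual schema used in this project.
show_with_lineup = {
    "club": "Comedy Cellar",
    "title": "Example Late Show",
    "acts": ["Comic A", "Comic B", "Comic C"],  # array field stored as a list
}
show_without_lineup = {
    "club": "Carolines",
    "title": "Example Headliner",
    # no "acts" field: a document store does not require one
}


def missing_fields(doc, expected):
    """Return the expected field names that a document does not define."""
    return sorted(set(expected) - set(doc))


expected = ["club", "title", "acts"]
print(missing_fields(show_with_lineup, expected))     # []
print(missing_fields(show_without_lineup, expected))  # ['acts']
```

In a relational schema, the list of acts would typically need its own table, and the missing field would need a nullable column.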

## Load in the Data

In [1]:
from pymongo import MongoClient
from pprint import pprint
import json
from datetime import datetime

In [2]:
client = MongoClient('localhost:27017')
db = client['comedy']
coll = db.shows

In [5]:
# Load the scraped shows for each club
with open('cellar_shows.json', 'r') as f_cellar, \
     open('the_stand_shows.json', 'r') as f_stand, \
     open('carolines_shows.json', 'r') as f_carolines:
    cellar_shows = json.load(f_cellar)
    stand_shows = json.load(f_stand)
    carolines_shows = json.load(f_carolines)

all_shows = cellar_shows + stand_shows + carolines_shows

# Convert date and time strings to datetime objects, then insert each show
for d in all_shows:
    d['date'] = datetime.strptime(d['date'], '%B %d, %Y')
    for key, val in d['time'].items():
        d['time'][key] = datetime.strptime(val, '%B %d, %Y %I:%M %p')
    coll.insert_one(d)
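
The date handling above can be tried without a database. The sketch below applies the same two format strings to a single made-up record; the `show` and `door` keys inside `time`, the sample values, and the `parse_show_dates` helper are assumptions for illustration, since the actual keys in the JSON files are not shown here.

```python
from datetime import datetime

DATE_FORMAT = '%B %d, %Y'
TIME_FORMAT = '%B %d, %Y %I:%M %p'


def parse_show_dates(show):
    """Return a copy of a show record with its date strings parsed."""
    parsed = dict(show)
    parsed['date'] = datetime.strptime(show['date'], DATE_FORMAT)
    parsed['time'] = {
        key: datetime.strptime(val, TIME_FORMAT)
        for key, val in show['time'].items()
    }
    return parsed


# Hypothetical record; the 'show' and 'door' keys are assumptions.
sample = {
    'date': 'December 16, 2017',
    'time': {
        'show': 'December 16, 2017 8:00 PM',
        'door': 'December 16, 2017 7:15 PM',
    },
}

parsed = parse_show_dates(sample)
print(parsed['date'])          # 2017-12-16 00:00:00
print(parsed['time']['show'])  # 2017-12-16 20:00:00
```

Storing real `datetime` values rather than strings lets MongoDB sort and range-filter shows by date and time.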

In [6]:
# Cursor.count() was removed in PyMongo 4; count_documents is the replacement
coll.count_documents({})

190