## Notebook Intro

This notebook contains the code for the first step in this project. The following data points from <a href='https://www.ibdb.com/'>IBDB (Internet Broadway Database)</a> were included:<br><br>

<li><b>National Tour Routes</b> - List of cities and the length of stay for all <a href='https://www.ibdb.com/theatre/national-tour-100020'>national tours</a> listed.</li>
<li><b>Tour Details</b> - Details about each tour, including the opening and closing dates, the tour label, and whether it was an original or a revival.</li>

### Master Import

In [1]:
import requests
import re
from bs4 import BeautifulSoup
import numpy as np
import datetime
import pandas as pd

## Show Description Table

In this section, the features for one particular show (<a href='https://www.ibdb.com/tour-production/miss-saigon-first-national-509979#Tours'>Miss Saigon</a>) are scraped as a test case; the same steps are later applied to all other tours.

In [2]:
# Test Show - Miss Saigon

saigon = requests.get('https://www.ibdb.com/tour-production/miss-saigon-first-national-509979#Tours')
soup = BeautifulSoup(saigon.text, 'html5lib')

In [3]:
# SHOW TITLE
show_title = soup.find_all('h3')[0]
show_title = str(show_title).split('>')[1].split('<')[0]
show_title

'Miss Saigon'

In [4]:
# TOUR DESCRIPT
tour_descript = soup.find_all('div', class_='tag-block-compact')[0]
tour_descript = str(tour_descript).split('>')[2].split('<')[0]
tour_descript

'Tour: First National'

In [5]:
# PLAY / MUSICAL
show_type = soup.find_all('div', class_='tag-block-compact')[1]
show_type = str(show_type).split('>')[2].split('<')[0]
show_type

'Musical'

In [6]:
# TOUR OPENING DATE
tour_opening = soup.find_all('div', class_='col s6 txt-paddings')
tour_opening = str(tour_opening).split('>')[2].split('<')[0]
tour_opening = datetime.datetime.strptime(tour_opening, '%b %d, %Y')
print(tour_opening.date())

1992-10-03


In [7]:
# TOUR CLOSING DATE
tour_closing = soup.find_all('div', class_='col s6 txt-paddings vertical-divider')
tour_closing = str(tour_closing).split('>')[2].split('<')[0]
tour_closing = datetime.datetime.strptime(tour_closing, '%b %d, %Y')
print(tour_closing.date())

1996-07-07
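
Not every tour page has both dates filled in: the Frozen tour scraped later in this notebook has an empty `tour_closing` value, and `strptime` raises a `ValueError` on an empty string. The following is a minimal sketch of a tolerant parser, assuming the same `'%b %d, %Y'` format used above. The name `parse_tour_date` is hypothetical and is not taken from `project_functions`.

```python
import datetime

def parse_tour_date(raw, fmt='%b %d, %Y'):
    """Return a date parsed from raw, or None if raw is empty or malformed.

    Hypothetical helper for illustration; not part of project_functions.
    """
    raw = (raw or '').strip()
    if not raw:
        return None
    try:
        return datetime.datetime.strptime(raw, fmt).date()
    except ValueError:
        # Unexpected format: treat it as missing rather than stopping the scrape
        return None

print(parse_tour_date('Jul 07, 1996'))  # 1996-07-07
print(parse_tour_date(''))              # None
```

Returning `None` keeps a long scraping run going when one page is incomplete; the missing values can then be handled during cleaning.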


In [8]:
# ORIGINAL / REVIVAL
# Look up the tag block once, then search it for an <i> tag reading 'Original'
revival = soup.find('div', class_='col s12 txt-paddings tag-block-compact')
stars = revival.find_all('i', string='Original')
if len(stars)==1:
    stars = str(stars).split('>')[1].split('<')[0]
else:
    stars = 'Revival'
stars

'Original'

## Tour Stops Table

This section highlights the scraping process for each individual tour stop (found on the same page scraped above).

In [9]:
# SHOW TITLE
show_title = soup.find_all('h3')[0]
show_title = str(show_title).split('>')[1].split('<')[0]
show_title

'Miss Saigon'

In [10]:
# LIST OF CITIES ON TOUR
cities = soup.find_all('div', class_='col s12 m3 filter-key')
city_list = []

for city in cities:
    city = str(city).split('>')[1].split('<')[0]
    city_list.append(city)
    
city_list[0]

'Vancouver, BC'

In [11]:
# DATES IN TOWN (past, current, and future)
dates = soup.find_all('div', class_='col s12 m4')
date_list = []

for date in dates:
    date = str(date).split('>')[1].split('<')[0]
    date_list.append(date)
    
date_list[0]

'May 12, 1996 - Jul 07, 1996'
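
Each entry holds both ends of a stay in one string, so the length of stay has to be derived from it. The sketch below shows one way to do that, assuming every entry has the form `'start - end'` with both dates in `'%b %d, %Y'` format; entries that break this assumption (for example, open-ended stops) would need separate handling. The name `parse_stay` is hypothetical.

```python
import datetime

def parse_stay(date_range, fmt='%b %d, %Y'):
    """Split 'start - end' into two dates and the number of days between them."""
    start_raw, end_raw = date_range.split(' - ')
    start = datetime.datetime.strptime(start_raw.strip(), fmt).date()
    end = datetime.datetime.strptime(end_raw.strip(), fmt).date()
    # Difference between the two dates; the closing day itself is not counted
    return start, end, (end - start).days

start, end, days = parse_stay('May 12, 1996 - Jul 07, 1996')
print(start, end, days)  # 1996-05-12 1996-07-07 56
```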

In [12]:
# THEATRE NAME
theatres = soup.find_all('div', class_='col s12 m5')
theatre_list = []

for theatre in theatres:
    theatre = str(theatre).split('>')[2].split('<')[0]
    theatre_list.append(theatre)
    
theatre_list[0]

'Queen Elizabeth Theatre'
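
The city, date, and theatre lists above are parallel: position `i` in each list describes the same stop. One possible way to combine them into one record per stop is sketched below. The function name `build_stop_rows` is hypothetical, and the sample values are copied from the `show_stops` output in this notebook; how `show_stops` actually assembles its table is not shown here.

```python
# Sample values copied from the show_stops output in this notebook
city_list = ['Vancouver, BC', 'Denver, CO']
date_list = ['May 12, 1996 - Jul 07, 1996', 'Mar 17, 1996 - May 05, 1996']
theatre_list = ['Queen Elizabeth Theatre', 'Buell Theatre']

def build_stop_rows(title, cities, dates, theatres):
    """Zip three parallel lists into a list of per-stop dictionaries."""
    # A length mismatch would silently misalign rows, so fail early instead
    if not (len(cities) == len(dates) == len(theatres)):
        raise ValueError('city, date, and theatre lists differ in length')
    return [
        {'title': title, 'city': c, 'dates': d, 'theatre': t}
        for c, d, t in zip(cities, dates, theatres)
    ]

rows = build_stop_rows('Miss Saigon (Tour: First National)',
                       city_list, date_list, theatre_list)
print(rows[0]['city'], '|', rows[0]['theatre'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain a table with one column per key.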

<br><b>The scraping steps tested above were wrapped into functions that scrape these values from a list of URLs (each function also accepts a single URL as its argument).</b><br> <br>
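
The source of `project_functions` is not shown in this notebook, so the following is only a sketch of the general pattern: normalise the argument so that a single URL and a list of URLs follow the same code path, then report progress while looping. The names `scrape_many` and `scrape_one` are hypothetical, and `fake_scraper` is a stand-in so the example runs without network access.

```python
def scrape_many(urls, scrape_one):
    """Apply scrape_one to a single URL or to each URL in a list.

    Both names are hypothetical; scrape_one stands in for the per-page
    scraping steps and returns one result per URL.
    """
    if isinstance(urls, str):
        urls = [urls]  # a single URL becomes a one-item list
    results = []
    total = len(urls)
    for i, url in enumerate(urls, start=1):
        results.append(scrape_one(url))
        if total > 1:
            # Progress report for longer runs
            print(f"This script is {round(i / total * 100, 1)}% complete.")
    return results

def fake_scraper(url):
    """Stand-in for a real page scraper; returns a one-field record."""
    return {'reference_url': url}

single = scrape_many('https://www.ibdb.com/tour-production/frozen-521605#Tours',
                     fake_scraper)
several = scrape_many(['url-a', 'url-b'], fake_scraper)
```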

In [13]:
from project_functions import show_details, show_stops

# This function finds all details listed above

show_details('https://www.ibdb.com/tour-production/frozen-521605#Tours')

Unnamed: 0,title,tour_descript,show_type,tour_opening,tour_closing,original_or_revival,reference_url
0,Frozen (Tour),Tour,Musical,2019-11-10,,Original,https://www.ibdb.com/tour-production/frozen-52...


In [14]:
# This function finds details for each tour stop (as listed above)

show_stops('https://www.ibdb.com/tour-production/miss-saigon-first-national-509979#Tours')

Unnamed: 0,title,city,dates,theatre
0,Miss Saigon (Tour: First National),"Vancouver, BC","May 12, 1996 - Jul 07, 1996",Queen Elizabeth Theatre
1,Miss Saigon (Tour: First National),"Denver, CO","Mar 17, 1996 - May 05, 1996",Buell Theatre
2,Miss Saigon (Tour: First National),"Cleveland, OH","Jan 21, 1996 - Mar 10, 1996",Ohio Theatre
3,Miss Saigon (Tour: First National),"Chicago, IL","Oct 25, 1995 - Jan 14, 1996",Auditorium Theatre of Roosevelt University
4,Miss Saigon (Tour: First National),"Los Angeles, CA","Jan 17, 1995 - Oct 15, 1995",Ahmanson Theatre
5,Miss Saigon (Tour: First National),"Detroit, MI","Oct 11, 1994 - Jan 08, 1995",Masonic Temple Theatre
6,Miss Saigon (Tour: First National),"Washington, DC","Jun 07, 1994 - Oct 02, 1994",Opera House (DC)
7,Miss Saigon (Tour: First National),"Fort Lauderdale, FL","Mar 29, 1994 - May 29, 1994",Broward Center For The Performing Arts
8,Miss Saigon (Tour: First National),"Minneapolis, MN","Jan 11, 1994 - Mar 20, 1994",Orpheum Theatre - Minneapolis
9,Miss Saigon (Tour: First National),"Denver, CO","Oct 19, 1993 - Jan 02, 1994",Buell Theatre


## Obtain List of Tour URLs and Extract Details

In this step, the full list of URLs for all national tours is collected; these URLs are scraped later in the notebook.

In [15]:
# Request HTML from page listing all national tours and their links

tours = requests.get('https://www.ibdb.com/theatre/national-tour-100020')
soup = BeautifulSoup(tours.text, 'html5lib')

In [16]:
# Find the relative links (URL IDs) of the different tours

raw_urls = soup.find_all('a', class_='font-11pt')
raw_url_list = []

# Build the full URLs from the site root, each link's href, and the #Tours anchor

for link in raw_urls:
    url = "http://www.ibdb.com" + link['href'] + '#Tours'
    raw_url_list.append(url)
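
String concatenation works as long as every `href` is a site-relative path. A slightly more defensive alternative, sketched here with the standard library's `urljoin`, also copes with an `href` that is already a full URL. The helper name `build_tour_url` and the example paths are hypothetical and do not correspond to real IBDB pages.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.ibdb.com'

def build_tour_url(href, base=BASE_URL):
    """Join a scraped href onto the site root and append the #Tours anchor."""
    return urljoin(base, href) + '#Tours'

# Hypothetical href, for illustration only
print(build_tour_url('/broadway-production/example-show-12345'))
# http://www.ibdb.com/broadway-production/example-show-12345#Tours
```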

In [17]:
# Test the scraped URLs on a small sample

sample = raw_url_list[0:5]
show_details(sample)

This script is 20.0% complete.
This script is 40.0% complete.
This script is 60.0% complete.
This script is 80.0% complete.
This script is 100.0% complete.


Unnamed: 0,title,tour_descript,show_type,tour_opening,tour_closing,original_or_revival,reference_url
0,The Lion King (Tour: Gazelle),Tour: Gazelle,Musical,2002-04-17,2017-07-23,Original,http://www.ibdb.com/broadway-production/the-li...
1,The Book of Mormon (Tour: Jumamosi),Tour: Jumamosi,Musical,2012-12-11,2020-03-11,Original,http://www.ibdb.com/broadway-production/the-bo...
2,The Book of Mormon (Tour: Latter Day),Tour: Latter Day,Musical,2012-08-14,2016-05-01,Original,http://www.ibdb.com/broadway-production/the-bo...
3,Pippin (Tour),Tour,Musical,2014-09-06,2016-02-28,Revival,http://www.ibdb.com/broadway-production/pippin...
4,Rodgers + Hammerstein's Cinderella (Tour),Tour,Musical,2014-10-10,2016-05-08,Original,http://www.ibdb.com/broadway-production/rodger...


In [18]:
sample = raw_url_list[0:5]
show_stops(sample)

Unnamed: 0,title,city,dates,theatre
0,The Lion King (Tour: Gazelle),"Houston, TX","Jun 27, 2017 - Jul 23, 2017",Hobby Center For The Performing Arts
1,The Lion King (Tour: Gazelle),"Greenville, SC","May 31, 2017 - Jun 25, 2017",Peace Center For The Performing Arts
2,The Lion King (Tour: Gazelle),"Oklahoma City, OK","May 09, 2017 - May 28, 2017",Civic Center Music Hall
3,The Lion King (Tour: Gazelle),"St. Louis, MO","Apr 18, 2017 - May 07, 2017",Fox Theatre - St. Louis
4,The Lion King (Tour: Gazelle),"Salt Lake City, UT","Mar 23, 2017 - Apr 16, 2017",George S. and Dolores Doré Eccles Theater
...,...,...,...,...
517,Rodgers + Hammerstein's Cinderella (Tour),"West Palm Beach, FL","Nov 11, 2014 - Nov 16, 2014",Raymond F. Kravis Center For The Performing Arts
518,Rodgers + Hammerstein's Cinderella (Tour),"Charlotte, NC","Nov 04, 2014 - Nov 09, 2014",Belk Theater
519,Rodgers + Hammerstein's Cinderella (Tour),"Miami, FL","Oct 28, 2014 - Nov 02, 2014",Adrienne Arsht Center for the Performing Arts ...
520,Rodgers + Hammerstein's Cinderella (Tour),"Tampa, FL","Oct 21, 2014 - Oct 26, 2014","David A. Straz, Jr. Center for the Performing ..."


<br><b>It looks like both of these work! Let's iterate over the full URL list...</b><br><br>

In [None]:
show_details_df = show_details(raw_url_list)

In [20]:
show_details_df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,tour_descript,show_type,tour_opening,tour_closing,original_or_revival,reference_url,year
0,0,0,The Lion King (Tour: Gazelle),Tour: Gazelle,Musical,2002-04-17 00:00:00,2017-07-23 00:00:00,Original,http://www.ibdb.com/broadway-production/the-li...,2002
1,1,1,The Book of Mormon (Tour: Jumamosi),Tour: Jumamosi,Musical,2012-12-11 00:00:00,2020-03-11 00:00:00,Original,http://www.ibdb.com/broadway-production/the-bo...,2012
2,2,2,The Book of Mormon (Tour: Latter Day),Tour: Latter Day,Musical,2012-08-14 00:00:00,2016-05-01 00:00:00,Original,http://www.ibdb.com/broadway-production/the-bo...,2012
3,3,3,Pippin (Tour),Tour,Musical,2014-09-06 00:00:00,2016-02-28 00:00:00,Revival,http://www.ibdb.com/broadway-production/pippin...,2014
4,4,4,Rodgers + Hammerstein's Cinderella (Tour),Tour,Musical,2014-10-10 00:00:00,2016-05-08 00:00:00,Original,http://www.ibdb.com/broadway-production/rodger...,2014
...,...,...,...,...,...,...,...,...,...,...
714,715,715,Hadrian VII (Tour),Tour,Play,1969-09-04 00:00:00,1970-05-30 00:00:00,Original,http://www.ibdb.com/broadway-production/hadria...,1969
715,716,716,Dylan (Tour),Tour,Play,1970-01-14 00:00:00,1970-05-30 00:00:00,Original,http://www.ibdb.com/broadway-production/dylan-...,1970
716,717,717,Sarafina! (Tour),Tour,Musical,1990-03-19 00:00:00,1991-09-29 00:00:00,Original,http://www.ibdb.com/broadway-production/sarafi...,1990
717,718,718,Sophisticated Ladies (Tour),Tour,Musical,1983-05-24 00:00:00,1983-10-09 00:00:00,Original,http://www.ibdb.com/broadway-production/sophis...,1983


### Cleaning Show Details Dataframe

In [27]:
show_details_df = pd.read_csv('data/ibdb_show_details.csv')

In [23]:
# Drop the one row whose year is missing (works for both a float NaN and the string 'nan')
show_details_df = show_details_df[show_details_df['year'].astype(str) != 'nan'].copy()

# Set year to integer
show_details_df['year'] = show_details_df['year'].astype(int)

In [24]:
shows_amount = len(show_details_df[show_details_df.year>2003])
print("There have been " + str(shows_amount) + " Broadway national tours since 2004.")

There have been 207 Broadway national tours since 2004.


In [None]:
# index=False keeps each save/reload cycle from adding another 'Unnamed: 0' column
show_details_df.to_csv('data/ibdb_show_details.csv', index=False)

## Scraping Full Tour Routes

In [23]:
show_stops_df = show_stops(raw_url_list)

In [26]:
show_stops_df.to_csv('data/idbd_show_stops.csv')

### List of Most Commonly Visited Cities in U.S. (1960 - Present)

In [30]:
show_stops_df.city.value_counts().head(25)

Chicago, IL          502
Boston, MA           466
Los Angeles, CA      450
Detroit, MI          430
St. Louis, MO        415
Philadelphia, PA     414
San Francisco, CA    407
Washington, DC       398
Cleveland, OH        396
Pittsburgh, PA       382
Baltimore, MD        379
Dallas, TX           360
Denver, CO           356
Atlanta, GA          344
Cincinnati, OH       323
Toronto, ON          312
Houston, TX          300
Seattle, WA          295
Minneapolis, MN      275
San Diego, CA        268
Louisville, KY       268
Columbus, OH         265
Indianapolis, IN     258
Hartford, CT         255
Orlando, FL          248
Name: city, dtype: int64