### Web Scraping the Ryman Calendar

In this exercise, you will use BeautifulSoup to build a dataset of upcoming events at the Ryman. This information is available at https://ryman.com/events/; you will take the contents of that page and convert them into a pandas DataFrame.

The website splits the events across multiple pages, but start by just working on the first page. Later on in the exercise, you'll take what you've done for the first page and apply it across other pages.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
import re

##### Question 1

Start by using either the inspector or the page source. Can you identify a tag that might be helpful for finding the names of all performers? For now, just worry about the headliner and don't worry about the opener. (E.g., for Vince Gill, featuring Wendy Moten, we only care about Vince Gill.) Use this tag to create a list containing just the name of each headliner.

The div tag with class equal to 'info clearfix' would be a good way to iterate through all event descriptions (date, location, name of show), as long as we skip the first one.  
The h3 tag will be helpful for finding performers.

In [115]:
# Send get request to the Ryman events URL
URL = 'https://ryman.com/events/'
response = requests.get(URL)

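# Convert response to BeautifulSoup object (naming the parser explicitly avoids a GuessedAtParserWarning)
soup = BeautifulSoup(response.text, 'html.parser')

# Iterating through all event blocks on the first page (skipping the first one, as noted above) and storing the headliners in performers
all_info = soup.find_all('div', {'class' : 'info clearfix'})[1:]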
performers = [info.h3.a.text for info in all_info]
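
To try the selection logic without requesting the live site, here is a minimal offline sketch. The HTML snippet and the `headliners` helper are hypothetical, modelled only on the class and tag names mentioned above; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (not copied from ryman.com)
SAMPLE_HTML = """
<div class="info clearfix"><h3><a href="/x">Placeholder Block</a></h3></div>
<div class="info clearfix"><h3><a href="/event/a">Ella Langley</a></h3></div>
<div class="info clearfix"><h3><a href="/event/b">Watchhouse</a></h3></div>
"""

def headliners(html, skip=1):
    """Return the h3 link text of each 'info clearfix' div, ignoring the first `skip` divs."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", {"class": "info clearfix"})[skip:]
    # Guard against blocks that lack an h3 or a link instead of raising AttributeError
    return [b.h3.a.get_text(strip=True) for b in blocks if b.h3 and b.h3.a]

names = headliners(SAMPLE_HTML)
print(names)
```

The guard in the list comprehension is a design choice: a block without a headline is dropped silently, which keeps the scrape running but can leave `performers` shorter than the other lists, so it is worth checking lengths before building a DataFrame.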

##### Question 2

Next, try to find a tag that could be used to find the date and time for each show. Extract these into a list. Challenge: Convert these into two lists, one containing the date and the other containing the time. (E.g., split Mar 9, 2023 8:00 PM into Mar 9, 2023 and 8:00 PM.)

The span tag could be helpful for finding the date and time for each show, although there is a different class for each part of the date.  We could also get all text for each act and use a regex to match the date and time.

In [71]:
# Getting all text for all acts on the first page 
all_info_text = [info.get_text(separator=' ', strip=True) for info in all_info]

In [78]:
# Using regex to match the dates and times
datetimes = [re.match(r'\w{3,4}\s\d{1,2}\s-?\s?\d{0,2}\s?,\s\d{4}(\s\d{1,2}:\d{2})?', info).group().replace(' ,',',') for info in all_info_text]

In [79]:
# Using regex to match just the dates
dates = [re.match(r'\w{3,4}\s\d{1,2}\s-?\s?\d{0,2}\s?,\s\d{4}', info).group().replace(' ,',',') for info in all_info_text]

In [80]:
# Using regex to match just the times (some acts do not have a time)
times=[]
for item in datetimes:
    time = re.findall(r'\d{1,2}:\d{2}', item)
    if time:
        times.append(time[0])
    else:
        times.append('No time')
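
The date and time extraction above can also be done in a single pass with named groups. This is a sketch under assumptions: the sample strings are hypothetical (shaped like the `get_text` output used above), the `split_date_time` helper is not part of the exercise, and the optional AM/PM capture assumes the suffix follows the time after one space.

```python
import re

# Date (optionally a day range) followed by an optional time with optional AM/PM
DATE_TIME_RE = re.compile(
    r'(?P<date>\w{3,4}\s\d{1,2}(?:\s?-\s?\d{1,2})?\s?,\s\d{4})'
    r'(?:\s(?P<time>\d{1,2}:\d{2}(?:\s[AP]M)?))?'
)

def split_date_time(text):
    """Return (date, time); either may be None when the text does not match."""
    m = DATE_TIME_RE.match(text)
    if m is None:
        return None, None
    # Tidy the space that get_text leaves before the comma
    return m.group('date').replace(' ,', ','), m.group('time')

# Hypothetical inputs, not scraped data
print(split_date_time('Nov 6 - 7 , 2025 Ella Langley'))
print(split_date_time('Nov 8 , 2025 8:00 PM Watchhouse'))
```

Returning `None` instead of calling `.group()` directly avoids an `AttributeError` when a listing has an unexpected date layout.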

##### Question 3

Take the lists you created in Questions 1 and 2 and convert them into a pandas DataFrame.

In [81]:
acts_df = pd.DataFrame({'Performer':performers, 'Date':dates, 'Time':times})
acts_df.head(2)

| | Performer | Date | Time |
|---|---|---|---|
| 0 | Ella Langley | Nov 6 - 7, 2025 | No time |
| 1 | Watchhouse | Nov 8, 2025 | 8:00 |


##### Question 4

Add to your data frame the opening act for all shows that list an opener.

In [82]:
openers=[]
for act in all_info:
    opener = act.h4
    if opener:
        openers.append(opener.text.split(maxsplit=1)[1].strip())
    else:
        openers.append('No opener')

acts_df['Opener']=openers
acts_df.head(2)

| | Performer | Date | Time | Opener |
|---|---|---|---|---|
| 0 | Ella Langley | Nov 6 - 7, 2025 | No time | Kaitlin Butts (11/6) and Mae Estes (11/7) |
| 1 | Watchhouse | Nov 8, 2025 | 8:00 | Sarah Kate Morgan & Leo Shannon |
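
The opener extraction relies on `split(maxsplit=1)[1]` to drop the first word of the `h4` text. The sketch below isolates that step so it can be tested on its own. It assumes the `h4` text starts with a single lead-in word such as "with"; the sample strings and the `strip_leading_word` helper are hypothetical.

```python
def strip_leading_word(text, default='No opener'):
    """Drop the first whitespace-delimited word; fall back to `default` if nothing follows it."""
    parts = text.split(maxsplit=1)
    # A one-word or empty string has no second part, so indexing [1] would raise IndexError
    return parts[1].strip() if len(parts) == 2 else default

# Hypothetical h4 texts, not scraped data
print(strip_leading_word('with Sarah Kate Morgan & Leo Shannon'))
print(strip_leading_word('TBA'))
```

If the lead-in were two words (for example "special guest"), this would leave the second word in place, so it is worth checking a few real values before relying on it.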


##### Question 5

Now, let's see if we can get the results beyond the first page. For this, you'll need to open the Web Developer Tools of your browser and navigate to the Network tab. Click the "Load More Events" button and you should see a GET request to the www.ryman.com domain.  
    a. Inspect this request and you should see that it goes to a URL like "https://www.ryman.com/events/events_ajax/24?category=0&venue=0&team=0&exclude=&per_page=12&came_from_page=event-list-page". In your Jupyter notebook, send a get request to this url and inspect the results.  
    b. You should find that the results that you get are HTML, but that they are not exactly formatted in a way that can be parsed. See if you can clean up the results set so that you can extract out the same information as above.  
    c. Create a DataFrame that contains data for the next 60 shows.

In [109]:
URL = 'https://www.ryman.com/events/events_ajax/24?category=0&venue=0&team=0&exclude=&per_page=12&came_from_page=event-list-page'
response = requests.get(URL)
# Inspecting response.text shows the HTML wrapped in quotes with backslash escapes (\", \/, \n, \t), so it needs cleaning up before parsing
response.text

'"<div class=\\"eventItem entry holidays_at_the_ryman featured clearfix\\" data-month=\\"December\\" data-year=\\"2025\\" role=\\"group\\" aria-label=\\"event\\">\\n\\t<div class=\\"thumb\\">\\n\\t\\t<a href=\\"https:\\/\\/www.ryman.com\\/event\\/2025-andrew-peterson\\" id=\\"event_listing_thumb_2025-andrew-peterson\\" title=\\"More Info for Andrew Peterson\\" tabindex=\\"-1\\"><img src=\\"https:\\/\\/images.discovery-prod.axs.com\\/2025\\/06\\/uploadedimage_6855aa3f90d4c.jpg\\" alt=\\"More Info for Andrew Peterson\\"\\/><\\/a>\\t\\t\\n<\\/div>\\t<div class=\\"info clearfix\\">\\n\\t\\t<div class=\\"date\\">\\n\\t<a href=\\"\\/event\\/2025-andrew-peterson\\" tabindex=\\"-1\\" title=\\"More Info\\">\\n\\t<span class=\\"m-date__month\\">Dec <\\/span><span class=\\"m-date__day\\"> 7<\\/span><span class=\\"m-date__separator\\">-<\\/span><span class=\\"m-date__day\\">8<\\/span><span class=\\"m-date__year\\">, 2025<\\/span>\\t\\t<\\/a>\\n<\\/div>\\n\\t\\t\\t<div class=\\"location\\">\\n\\t\\

In [110]:
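# First attempt: strip every backslash. This also turns the \n and \t escapes into stray 'n' and 't' text, visible in the output below
soup = BeautifulSoup(response.text.replace('\\',''))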
print(soup.prettify())

<html>
 <body>
  <p>
   "
  </p>
  <div aria-label="event" class="eventItem entry holidays_at_the_ryman featured clearfix" data-month="December" data-year="2025" role="group">
   nt
   <div class="thumb">
    ntt
    <a href="https://www.ryman.com/event/2025-andrew-peterson" id="event_listing_thumb_2025-andrew-peterson" tabindex="-1" title="More Info for Andrew Peterson">
     <img alt="More Info for Andrew Peterson" src="https://images.discovery-prod.axs.com/2025/06/uploadedimage_6855aa3f90d4c.jpg"/>
    </a>
    ttn
   </div>
   t
   <div class="info clearfix">
    ntt
    <div class="date">
     nt
     <a href="/event/2025-andrew-peterson" tabindex="-1" title="More Info">
      nt
      <span class="m-date__month">
       Dec
      </span>
      <span class="m-date__day">
       7
      </span>
      <span class="m-date__separator">
       -
      </span>
      <span class="m-date__day">
       8
      </span>
      <span class="m-date__year">
       , 2025
      </span>
      tt
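
The stray `nt` and `tt` text in the output above comes from deleting the backslash in `\n` and `\t` while leaving the letter behind. The escapes (`\"`, `\/`, `\n`, `\t`) look like those of a JSON-encoded string, so decoding the body as JSON is one way to recover clean HTML. The sketch below compares the two approaches on a tiny hypothetical sample written in the same style as the response, not the real payload.

```python
import json

# Hypothetical sample in the same escaped style as the response above
raw = r'"<div class=\"date\">\n\t<span>Dec <\/span>\n<\/div>"'

# Stripping backslashes leaves the wrapping quotes and stray 'n'/'t' characters
stripped = raw.replace('\\', '')

# JSON decoding removes the wrapping quotes and restores real newlines and tabs
decoded = json.loads(raw)

print(repr(stripped))
print(repr(decoded))
```

With `requests`, calling `response.json()` performs the same decoding on the response body, assuming the body really is valid JSON.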
 

In [131]:
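# Creating a DataFrame that contains data for the next 60 shows
num_shows=60
URL = f'https://www.ryman.com/events/events_ajax/24?category=0&venue=0&team=0&exclude=&per_page={num_shows}&came_from_page=event-list-page'
response = requests.get(URL)

# The body looks like a JSON-encoded string (\" \/ \n \t escapes), so decode it with
# response.json() instead of stripping backslashes, which leaves stray 'n'/'t' text behind
soup = BeautifulSoup(response.json(), 'html.parser')

# The first block in this response is a real event (see the output above), so nothing is skipped here
all_info = soup.find_all('div', {'class' : 'info clearfix'})
performers = [info.h3.a.text for info in all_info]

all_info_text = [info.get_text(separator=' ', strip=True) for info in all_info]

# Reusing the Question 2 patterns; this assumes the text is laid out as on the first page
datetimes = [re.match(r'\w{3,4}\s\d{1,2}\s-?\s?\d{0,2}\s?,\s\d{4}(\s\d{1,2}:\d{2})?', info).group().replace(' ,',',') for info in all_info_text]
dates = [re.match(r'\w{3,4}\s\d{1,2}\s-?\s?\d{0,2}\s?,\s\d{4}', info).group().replace(' ,',',') for info in all_info_text]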

In [None]:
times=[]
for item in datetimes:
    time = re.findall(r'\d{1,2}:\d{2}', item)
    if time:
        times.append(time[0])
    else:
        times.append('No time')

openers=[]
for act in all_info:
    opener = act.h4
    if opener:
        openers.append(opener.text.split(maxsplit=1)[1].strip())
    else:
        openers.append('No opener')

more_acts_df = pd.DataFrame({'Performer':performers, 'Date':dates, 'Time':times, 'Opener':openers})

more_acts_df