# Scraping the Digital Wildcatters Empower Event Agenda

![](https://media-exp1.licdn.com/dms/image/C561BAQGjVV5Nd2FfrQ/company-background_10000/0/1644015618982?e=1648008000&v=beta&t=jeylykbf83NNnYQaF8lxP7YoEmsgXSUDlcgjHvaWr-k)

Empower is one of several awesome events hosted by [Digital Wildcatters](https://digitalwildcatters.com/). Make sure you check them out. This is their sick banner.

# Scraping the Agenda

Conferences are awesome, but interacting with their agendas never is. There are multiple speaking locations, a variety of speakers, and networking happening in between. I had a simple goal: hack a script together to download the agenda into a CSV so I can upload the events to a Google Calendar.

Why not just update the calendar by hand? Well, because that is no fun and I wouldn't learn anything along the way.

![](screen_shot.ping)

The Agenda is laid out perfectly for a little `Beautiful Soup` action. Fish out the title and dangling `<p></p>` tags and bingo bango, you've got an event. 

Each Agenda item is wrapped in an `elementor-toggle-item` div. Once those are selected, parsing each one for its `elementor-toggle-title` and nested `<p>` text provides all the details needed to populate the [Google Calendar](https://calendar.google.com/calendar/u/1?cid=Y19iazM5dms4aDBxNjZuMWs4cDVjODBnNXNna0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t).

```html
<div class="elementor-toggle-item">
    <a class="elementor-toggle-title"></a>
    <div class="elementor-tab-content">
        <p></p>
        ...
    </div>
</div>

```

***Plan of action***

1. Download the page
2. Parse with bs4
3. Extract the events with class tags
4. For each event:
    - Get the title
    - For each `<p>` tag, get the relevant data from the tag
5. Manipulate event dicts into GCalendar format
6. DataFrame to CSV

---

# Imports

In [3]:
import datetime
import os.path
import pandas as pd
import pytz
import re
import requests

from typing import Dict, List

import bs4

## Top level items

In [4]:
def get_item_title(item: bs4.element.Tag) -> str:
    """Grab title class"""
    return item.find("a", {"class": "elementor-toggle-title"}).text

def get_item_subitems(item: bs4.element.Tag) -> List[str]:
    """Grab all talk details"""
    return [x.text for x in item.find_all("p")]

## Nested items

In [5]:
def process_item_time(subitems: List[str]) -> str:
    """Overkill time extraction"""
    # Careful! next() raises StopIteration if no subitem contains "Time:"
    time = next(filter(lambda x: "Time:" in x, subitems))

    # Find a pattern like "9:30 AM"
    match = re.search(r'([0-9]{1,2}:[0-9]{2}\s[AP]M)', time.upper())
    if match:
        return match[1]
    raise ValueError(f"No time found in: '{time}'")
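
To make the regular expression concrete, here is a small standalone sketch of the same idea. The sample strings and the helper name `find_time` are made up for illustration, and unlike the function above it returns `None` instead of raising when nothing matches.

```python
import re
from typing import List, Optional

# Same pattern as above: 1-2 digit hour, 2 digit minutes, whitespace, AM/PM
TIME_PATTERN = re.compile(r"([0-9]{1,2}:[0-9]{2}\s[AP]M)")


def find_time(subitems: List[str]) -> Optional[str]:
    """Return the first clock time on the "Time:" line, or None (hypothetical helper)."""
    # next() with a default avoids StopIteration when no line mentions "Time:"
    line = next((x for x in subitems if "Time:" in x), None)
    if line is None:
        return None
    match = TIME_PATTERN.search(line.upper())
    return match[1] if match else None


# Hypothetical sample data, not taken from the real agenda
sample = ["Time: 9:30 am - 9:50 am", "Where: Example Stage"]
print(find_time(sample))                    # 9:30 AM
print(find_time(["Where: Example Stage"]))  # None
```

Only the first match is returned, so a line that lists both a start and an end time still yields the start time.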

In [6]:
def process_item_where(subitems: List[str]) -> str:
    """Grab the stage or tent"""
    # Same caveat as above: StopIteration if no subitem contains "Where:"
    loc = next(filter(lambda x: "Where:" in x, subitems))
    return loc.split(":")[-1].strip()
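
One thing worth knowing about `split(":")[-1]`: it keeps only the text after the *last* colon, so a location that itself contains a colon would be cut short. A hedged alternative is to split on the first colon only with `str.partition`. The helper name and sample strings below are hypothetical.

```python
def location_after_label(line: str) -> str:
    """Return everything after the first colon, stripped (hypothetical helper)."""
    # partition() splits on the first ":" only, so later colons survive
    _label, _sep, value = line.partition(":")
    return value.strip()


# Made-up examples, not from the real agenda
print(location_after_label("Where: Example Stage"))        # Example Stage
print(location_after_label("Where: Tent B: North Side"))   # Tent B: North Side
print("Where: Tent B: North Side".split(":")[-1].strip())  # North Side
```

For an agenda whose locations never contain a colon, both approaches give the same result.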

In [14]:
def process_item_description(subitems: List[str]) -> str:
    """Grab the description if it exists"""
    # Some items don't have one. Just return None. No big deal
    try:
        idx = subitems.index("Description:")
    except ValueError:
        return None
    return subitems[idx + 1]
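
The lookup above relies on the label sitting in its own `<p>` with the text in the next one. A slightly more forgiving sketch of the same "label, then body" idea pairs each paragraph with its successor, so a trailing label or stray whitespace does not raise. The helper name and the sample lists are assumptions for illustration.

```python
from typing import List, Optional


def find_description(subitems: List[str]) -> Optional[str]:
    """Return the paragraph following the "Description:" label, if any (hypothetical helper)."""
    # zip() pairs each paragraph with the one after it; a label in last place has no pair
    for label, body in zip(subitems, subitems[1:]):
        if label.strip() == "Description:":
            return body
    return None


# Made-up paragraphs, not from the real agenda
print(find_description(["Time: 9:30 AM", "Description:", "A made-up blurb."]))  # A made-up blurb.
print(find_description(["Time: 9:30 AM", "Description:"]))                      # None
```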

## Combine into super function!

![Omni Man](omni-man.png)

In [15]:
def extract_data(item: bs4.element.Tag) -> Dict[str, str]:
    """This is the main processing function"""
    title = get_item_title(item)  # Title

    subitems = get_item_subitems(item)  # Talk details
    if len(subitems) <= 1:
        return None

    time = process_item_time(subitems)
    where = process_item_where(subitems)
    description = process_item_description(subitems)
    return {
        "Subject": title,
        "Start Date": time,  # Placeholder, replaced with the real date later
        "Start Time": time,
        "Location": where,
        "Description": description,
    }

# Execute scraping magic

In [16]:
# Get the content
page = "https://digitalwildcatters.com/empower-energizing-bitcoin/agenda/"
resp = requests.get(page)
resp.raise_for_status()

# Parse into soup
soup = bs4.BeautifulSoup(resp.content, "html.parser")

In [17]:
# These are the tasty agenda bits!
items = soup.find_all("div", {"class": "elementor-toggle-item"})

The next cell solves the problem. It's not pretty, but it does not need to be. A one-off parsing script can be ugly if it works and saves time. What is being solved? The _Date_! The dates are `<h2>` elements that are not linked to each Agenda item. Instead of using the various navigation options `BeautifulSoup` provides, I found the first event on the second day and updated all following Agenda items to have that date.

In [18]:
##############################
# WARNING - Ugly code ahead  
##############################


day = 30  # First day of event
events = [] 
for event in filter(lambda x: x is not None, (extract_data(item) for item in items)):
    
    # First event on Day 2
    if event["Subject"] == "The Rise of Renewables":
        day = 31

    # Update dates and times (every event gets a 20 minute slot)
    event["Start Date"] = f"2022/03/{day}"
    event["End Date"] = event["Start Date"]
    start = datetime.datetime.strptime(event["Start Time"], "%I:%M %p")
    end = start + datetime.timedelta(minutes=20)
    event["End Time"] = end.strftime("%I:%M %p")
    
    events.append(event)
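
The end-time arithmetic in that cell can be tried on its own. This sketch assumes, as the cell does, that every event lasts 20 minutes. The function name `add_minutes` is hypothetical.

```python
import datetime


def add_minutes(clock: str, minutes: int = 20) -> str:
    """Shift a 12-hour clock string such as "9:30 AM" by some minutes (hypothetical helper)."""
    fmt = "%I:%M %p"
    start = datetime.datetime.strptime(clock, fmt)
    return (start + datetime.timedelta(minutes=minutes)).strftime(fmt)


print(add_minutes("9:30 AM"))   # 09:50 AM
print(add_minutes("11:50 AM"))  # 12:10 PM
print(add_minutes("11:50 PM"))  # 12:10 AM, which is really the next day
```

Two things to notice: `strftime` zero-pads the hour, so a scraped `9:30 AM` ends up next to a computed `09:50 AM`, and an event starting late at night would wrap past midnight while "End Date" stays the same as "Start Date".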


# Bask in glory

Here we go. The rabbit has been extracted from the hat. All the events in a DataFrame ready to be exported to CSV and imported into the Google Calendar.

In [20]:
df = pd.DataFrame.from_dict(events)
df.head().sort_values(by="Location")

| | Subject | Start Date | Start Time | Location | Description | End Date | End Time |
|---|---|---|---|---|---|---|---|
| 0 | Energy 101: Getting Schooled Up On Power Gener... | 2022/03/30 | 9:30 AM | Beatles Stage | This is a crash course on energy as a whole. C... | 2022/03/30 | 09:50 AM |
| 2 | From Hash to Cash: The Economics of Bitcoin Mi... | 2022/03/30 | 10:15 AM | Beatles Stage | Why should anyone care about bitcoin mining? I... | 2022/03/30 | 10:35 AM |
| 4 | Why Texas will be the Bitcoin Mining Capital o... | 2022/03/30 | 11:00 AM | Beatles Stage | Bitcoin experts say Texas is the world’s newes... | 2022/03/30 | 11:20 AM |
| 1 | Keynote: Crusoe | 2022/03/30 | 9:45 AM | Big Tent | Keynote by Cully Cavness | 2022/03/30 | 10:05 AM |
| 3 | Why Bitcoin Changes What We Know About Energy | 2022/03/30 | 10:30 AM | Big Tent | The energy industry has historically been a pr... | 2022/03/30 | 10:50 AM |


# CSVs

In [None]:
df.to_csv("empower.agenda.csv", index=False)
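
Before uploading, it can be worth a quick look at what the file will contain. The sketch below uses the standard library `csv` module rather than pandas: it writes one made-up event to an in-memory buffer, using the same column names as the DataFrame above, and reads it back.

```python
import csv
import io

# Column names copied from the DataFrame above
COLUMNS = ["Subject", "Start Date", "Start Time", "Location",
           "Description", "End Date", "End Time"]

# One made-up event, for illustration only
event = {
    "Subject": "Example Talk",
    "Start Date": "2022/03/30",
    "Start Time": "9:30 AM",
    "Location": "Example Stage",
    "Description": "A blurb, with a comma",
    "End Date": "2022/03/30",
    "End Time": "09:50 AM",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(event)

# Read it back to confirm the round trip, including the quoted comma
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Description"])  # A blurb, with a comma
```

Descriptions scraped from a web page often contain commas, so confirming that they survive the round trip is a cheap check before importing.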

Now, go upload the CSV to a new Google Calendar.

# Calendar

You can access the calendar [HERE](https://calendar.google.com/calendar/u/1?cid=Y19iazM5dms4aDBxNjZuMWs4cDVjODBnNXNna0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t), until the event is over.

![](gcalendar.png)
