In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Discussion 5

# Visualization, Imputation, and Web Scraping


### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- Web Scraping using `BeautifulSoup`.
- Worksheet 📝.
    - Visualizing Data
    - Imputing Missing Values

## Example: Scraping the Happening @ Michigan page

---

### Example: Scraping the Happening @ Michigan page

- Our goal in today's discussion is to create a DataFrame with information about each event listed at [events.umich.edu](https://events.umich.edu).

In [None]:
res = requests.get('https://events.umich.edu')
res

In [None]:
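# Name the parser explicitly so results don't depend on which parsers are installed.
soup = BeautifulSoup(res.text, 'html.parser')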

- Let's start by opening the page in Chrome, right-clicking on the page, and clicking "Inspect".<br><small>As we can see, the HTML is relatively complicated – this is usually the case for real websites!</small>

### Identifying `<div>`s

- It's not always obvious which `<div>`s we want. The Inspect tool helps, but it's still worth checking that `find_all` returns the number of elements we expect.

In [None]:
divs = soup.find_all(class_='col-xs-12')

In [None]:
len(divs)
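
- To see why the count is worth checking, here is a minimal sketch on made-up HTML.<br><small>The markup and the `event` and `footer` classes are assumptions for illustration, not the real structure of events.umich.edu.</small>

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, NOT the real events.umich.edu markup.
html = """
<div class="row">
  <div class="col-xs-12 event">Event A</div>
  <div class="col-xs-12 event">Event B</div>
  <div class="col-xs-12 footer">Not an event</div>
</div>
"""
toy_soup = BeautifulSoup(html, 'html.parser')

# class_ matches any element whose class list contains 'col-xs-12',
# so the footer <div> is included too.
by_col = toy_soup.find_all(class_='col-xs-12')

# A CSS selector that requires both classes is narrower.
by_both = toy_soup.select('div.col-xs-12.event')

print(len(by_col), len(by_both))
```

- If the count is larger than the number of events visible on the page, the selector is probably also matching layout elements.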

- Again, let's deal with one `<div>` at a time. First, we should extract the title of the event.

In [None]:
divs[0]

In [None]:
divs[0].find('div', class_='event-title').find('a').get('title')

- Next, let's extract the time and location.

In [None]:
divs[0].find('time').get('datetime')

In [None]:
divs[0].find('ul').find('a').get('title')
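
- The same chain of `find` and `get` calls can be tried on a small, self-contained snippet.<br><small>The HTML below is a hypothetical imitation of one event `<div>`; the real page may differ.</small>

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure used above.
html = """
<div class="col-xs-12">
  <div class="event-title">
    <a href="/event/1" title="Data Science Seminar">Data Science Sem...</a>
  </div>
  <time datetime="2025-02-12 16:00">4:00pm</time>
  <ul><li><a href="/loc/1" title="Virtual">Virtual</a></li></ul>
</div>
"""
toy_div = BeautifulSoup(html, 'html.parser').find('div', class_='col-xs-12')

title_tag = toy_div.find('div', class_='event-title').find('a')
title_attr = title_tag.get('title')  # value of the title="..." attribute
title_text = title_tag.text          # text between <a> and </a>

when = toy_div.find('time').get('datetime')        # attribute, not the displayed time
where = toy_div.find('ul').find('a').get('title')  # first link inside the first <ul>

print(title_attr, when, where)
```

- `.get` reads an attribute of a tag, while `.text` reads what's displayed between the opening and closing tags. Here, the attribute holds the full title.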

### Parsing a single event, and then every event

- As before, we'll implement a function that takes in a BeautifulSoup object corresponding to a single `<div>` and returns a dictionary with the relevant information about that event.

In [None]:
def process_event(div):
    title = div.find('div', class_='event-title').find('a').get('title')
    location = div.find('ul').find('a').get('title')
    # Convert the string to a Timestamp so we can sort and filter by time later.
    time = pd.to_datetime(div.find('time').get('datetime'))
    return {'title': title, 'time': time, 'location': location}

In [None]:
process_event(divs[12])
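
- Here is a small sketch of what the `pd.to_datetime` conversion in `process_event` gives us.<br><small>The timestamp strings are made up for illustration.</small>

```python
import pandas as pd

# Made-up strings in the same style as a <time datetime="..."> attribute.
raw = pd.Series(['2025-02-12 16:00', '2025-02-12 09:30'])
parsed = pd.to_datetime(raw)

# Once parsed, the .dt accessor and comparisons are available.
hours = parsed.dt.hour.tolist()
earliest = parsed.min()
afternoon = parsed[parsed.dt.hour >= 12]

print(hours, earliest, len(afternoon))
```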

- Now, we can call it on every `<div>` in `divs`.<br><small>Remember, we already ran `divs = soup.find_all(class_='col-xs-12')`.</small>

In [None]:
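row_list = []
for div in divs:
    try:
        row_list.append(process_event(div))
    except Exception as e:
        # Skip any <div> that doesn't have the structure process_event expects,
        # but print the error so we can see what went wrong.
        print(e)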
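
- Why the `try`/`except`? `find` returns `None` when nothing matches, and calling a method on `None` raises an `AttributeError`. A minimal sketch on made-up HTML:<br><small>The markup and the helper `get_title` are hypothetical, used only to illustrate the failure.</small>

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second <div> has the right class but no event inside.
html = """
<div class="col-xs-12">
  <div class="event-title"><a title="Chess Club">Chess Club</a></div>
</div>
<div class="col-xs-12"><p>Sidebar with no event inside</p></div>
"""
toy_divs = BeautifulSoup(html, 'html.parser').find_all('div', class_='col-xs-12')

def get_title(div):
    # Hypothetical helper: same chain of calls as the title line in process_event.
    return div.find('div', class_='event-title').find('a').get('title')

titles, skipped = [], 0
for div in toy_divs:
    try:
        titles.append(get_title(div))
    except AttributeError:
        # div.find(...) returned None, so the next .find call failed.
        skipped += 1

print(titles, skipped)
```

- Catching the specific `AttributeError` is a bit stricter than catching every `Exception`; either way, it's worth looking at how many `<div>`s were skipped.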

In [None]:
events = pd.DataFrame(row_list)
events.head()

- Now, `events` is a DataFrame, like any other!

In [None]:
# Which events are in-person today?
events[~events['location'].isin(['Virtual', ''])]
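
- The `~` and `isin` combination above can be checked on a toy DataFrame.<br><small>The rows below are made up; they are not scraped results.</small>

```python
import pandas as pd

# Made-up rows standing in for the scraped events.
toy_events = pd.DataFrame({
    'title': ['Seminar', 'Workshop', 'Concert'],
    'location': ['Virtual', '', 'Hill Auditorium'],
})

# True where the location is 'Virtual' or missing (an empty string).
is_virtual_or_blank = toy_events['location'].isin(['Virtual', ''])

# ~ flips the Boolean Series, keeping only the remaining rows.
in_person = toy_events[~is_virtual_or_blank]

print(in_person['title'].tolist())
```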

<h2><a href="https://study.practicaldsc.org/disc05/">Worksheet</a> 📝</h2>

---