# Scraping Python.org with Requests and Beautiful Soup

# Book: Python Web Scraping Cookbook (Packt)
# Chapter 1: 01_events_with_requests.py

import requests
from bs4 import BeautifulSoup

# Getting ready...

# In this recipe, we will scrape the upcoming Python events
# from https://www.python.org/events/python-events/.

def get_upcoming_events(url):

    # Use requests to make an HTTP GET request for the events page
    # (for example, https://www.python.org/events/python-events/):
    req = requests.get(url)

    # The call above downloaded the page and returned a Response object,
    # which we stored in req. The decoded page content is available through
    # its .text property.
    soup = BeautifulSoup(req.text, 'lxml')

    # Now we tell Beautiful Soup to find the main <ul> tag for the recent 
    # events, and then to get all the <li> tags below it.
    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')


    # And finally we can loop through each of the <li> elements, 
    # extracting the event details, and print each to the console:
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

# How it works...

# We will dive into details of both Requests and Beautiful Soup in the next 
# chapter, but for now let's just summarize a few key points about how this 
# works. 

# The following are the important points about Requests:

# - Requests is used to execute HTTP requests. We used it to make a GET
#   request to the URL of the events page.
# - The Response object returned by Requests holds the results of the
#   request. This is not only the page content, but also other details of
#   the result, such as the HTTP status code and headers.
# - Requests is used only to get the page; it does not do any parsing.
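
# To make the points about the response concrete, here is a minimal sketch
# of reading a few of its fields. The helper names summarize_response and
# fetch_summary are hypothetical (they are not part of the recipe), and the
# SimpleNamespace object is only an offline stand-in that carries the same
# attribute names as a real Response, so the example runs without a network.

```python
from types import SimpleNamespace

import requests


def summarize_response(resp):
    """Collect a few of the fields a requests Response exposes."""
    return {
        'status_code': resp.status_code,
        'content_type': resp.headers.get('Content-Type'),
        'text_length': len(resp.text),
    }


def fetch_summary(url, timeout=10):
    """Hypothetical helper: GET the url, then summarize the response."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return summarize_response(resp)


# Offline stand-in for a Response (illustration only, not a real request).
fake = SimpleNamespace(
    status_code=200,
    headers={'Content-Type': 'text/html'},
    text='<html></html>',
)
print(summarize_response(fake))
```

# Against the live site, fetch_summary('https://www.python.org/events/python-events/')
# would return the same three keys, with values that depend on the server.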

# We use Beautiful Soup to do the parsing of the HTML and also the finding of 
# content within the HTML. 

# We used the power of Beautiful Soup to:
# - Find the <ul> element representing the section, which is found by looking
#   for a <ul> with a class attribute that has a value of list-recent-events.
# - From that object, we find all the <li> elements.
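
# The two lookups above can also be written as a single CSS selector. This
# is an alternative technique, not what the recipe itself uses. The sketch
# below runs both forms against a made-up HTML string (an assumption for
# illustration) and uses the built-in html.parser so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Made-up markup: one matching list and one unrelated list.
html = (
    '<ul class="list-recent-events"><li>A</li><li>B</li></ul>'
    '<ul class="other"><li>C</li></ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Two-step form: find the <ul>, then collect the <li> tags below it.
via_find = soup.find('ul', class_='list-recent-events').find_all('li')

# One-step form: CSS selector for <li> descendants of the matching <ul>.
via_select = soup.select('ul.list-recent-events li')

find_texts = [li.get_text() for li in via_find]
select_texts = [li.get_text() for li in via_select]
print(find_texts, select_texts)
```

# Both forms pick up only the items of the list-recent-events list; the <li>
# in the other list is ignored.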

# Each of these <li> tags represents a different event. We iterate over them,
# building a dictionary from the event data found in child HTML tags:
# - The name is extracted from the <a> tag that is a child of the <h3> tag.
# - The location is the text content of the <span> with a class of event-location.
# - The time is the text content of the <time> tag (the code reads .text,
#   not the tag's datetime attribute).
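
# The extraction steps can be tried without a network connection. The sketch
# below is a variant of the recipe's loop, not the book's code: parse_events
# is a hypothetical name, SAMPLE_HTML is made-up markup that only mimics the
# structure described above (including an assumed datetime attribute), and
# html.parser stands in for lxml. It returns the dictionaries instead of
# printing them, and it shows the <time> tag's text next to its datetime
# attribute so that the difference between the two is visible.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of the events list.
SAMPLE_HTML = """
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="/events/1/">Example Conf</a></h3>
    <p>
      <time datetime="2020-08-28T00:00:00+00:00">28 Aug. 2020</time>
      <span class="event-location">Example City</span>
    </p>
  </li>
</ul>
"""


def parse_events(html):
    """Return one dict per <li> under the list-recent-events <ul>."""
    soup = BeautifulSoup(html, 'html.parser')
    event_list = soup.find('ul', class_='list-recent-events')
    if event_list is None:
        # find() returns None when nothing matches, so stop early.
        return []
    results = []
    for item in event_list.find_all('li'):
        time_tag = item.find('time')
        location_tag = item.find('span', class_='event-location')
        results.append({
            'name': item.find('h3').find('a').get_text(strip=True),
            'location': location_tag.get_text(strip=True),
            'time': time_tag.get_text(strip=True),      # visible text
            'datetime': time_tag.get('datetime'),       # attribute value
        })
    return results


events = parse_events(SAMPLE_HTML)
print(events)
```

# Returning a list rather than printing inside the loop keeps the parsing
# step separate from the output step, which makes it easier to check.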

{'name': 'PyCon JP 2020', 'location': 'Tokyo, Japan', 'time': '28 Aug. – 29 Aug.  2020'}
{'name': 'PyCon TW 2020', 'location': 'International Conference Hall ,No.1, University Road, Tainan City 701, Taiwan', 'time': '05 Sept. – 06 Sept.  2020'}
{'name': 'PyCon SK 2020', 'location': 'Bratislava, Slovakia', 'time': '11 Sept. – 13 Sept.  2020'}
{'name': 'DjangoCon Europe 2020', 'location': 'Porto, Portugal', 'time': '16 Sept. – 20 Sept.  2020'}
{'name': 'PyCon APAC 2020', 'location': 'Kota Kinabalu, Sabah, Malaysia', 'time': '19 Sept. – 20 Sept.  2020'}
{'name': 'DragonPy 2020', 'location': 'Ljubljana, Slovenia', 'time': '19 Sept. – 20 Sept.  2020'}
