# An Introduction to Web Scraping with Python

Web scraping is a way to collect information from websites using code. It can be especially useful when working with data that isn’t easily downloadable. There are several approaches and tools for web scraping—this workshop will focus on one of them: Python’s Beautiful Soup package. Python is an open-source language, so with the right setup, anyone can use this tool.

Web scraping can support different stages of the research data lifecycle, including the planning phase (e.g., identifying available online data) and the active data collection phase. This workshop is intended for those who are new to web scraping and want to explore how it can be used in a research context.

The session is hosted by Cornell University Library’s Research Data & Open Scholarship team and is part of the Data Den workshop series. 


Sharing statement: CC BY 4.0 (Creative Commons Attribution 4.0 International)

## <a id = "contents">Contents</a>
* <a href='#intro'>Introduction</a>
    - <a href='#summary'>Workshop Summary</a> 
    - <a href='#presenters'>Presenters</a>
    - <a href='#collaborators'>Collaborators</a>
    - <a href='#objectives'>Learning Objectives</a> 
    - <a href='#knowledge'>Assumed Knowledge</a>
    - <a href='#logistics'>Logistics</a>
* <a href = "#problems">Exercises</a>
* <a href='#resources'>Additional Resources</a>

# <a id = "intro"></a>Introduction

## <a id = "summary"></a>Workshop Summary
Learn how to gather data from websites using Python! In this beginner-friendly workshop, you’ll learn the basics of web scraping with Beautiful Soup. We’ll show you how to dig through HTML to find the info you need, and talk about when it makes more sense to use an API or tools like Selenium. You’ll also get tips on cleaning up your data so it’s ready to use. 


## <a id = "presenters"></a>Instructor

Jacob Grippin\
Statistical Consultant\
Cornell Center for Social Sciences\
jrg363@cornell.edu 


## <a id = "collaborators"></a>Collaborators
Iliana Burgos\
Emerging Data Practices Librarian\
Digital Scholarship Services\
itb23@cornell.edu 

Lencia McKee\
Research Data Librarian\
Research Data & Open Scholarship\
lcb235@cornell.edu 

## <a id = "objectives"></a>Learning Objectives

By the end of the workshop, attendees will be able to:
1. **Basics of Beautiful Soup**: Use the Beautiful Soup Python package to parse HTML (such as tags and attributes) and extract content relevant to their research questions from webpages.
2. **Comparing methods and packages**: Compare different approaches (web scraping vs. APIs) and tools (Beautiful Soup vs. Selenium) and select the most appropriate one based on their skill level and needs.
3. **Data preparation**: Clean, structure, and export scraped data so that it is ready for analysis.

## <a id = "knowledge"></a>Assumed Knowledge

This workshop is for folks who are new to Python but have at least a little coding experience. You don’t need to be an expert, but some familiarity with basic programming ideas will be helpful. We won’t be covering the very basics of Python, so it's best if you've seen a bit of code before.

### Library Imports

Install and/or load libraries that we will use in this workshop.

In [1]:
!pip install bs4
!pip install pandas
!pip install python-docx

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pandas as pd
import pprint
import os
from docx import Document

# <a id = "problems"></a>Exercises

## <a id = "beautifulsoup"></a>BeautifulSoup

<p id = "exercise_desc">In this exercise we will look at the University Postings from Cornell President Martha Pollack. You can view the postings <a href = "https://statements.cornell.edu/pollack.cfm">here</a>. The objective is to collect each article's title, content, date, and URL, then save the title, date, and URL to a data frame and the full text of each article to a Word document.</p>

In [3]:
#Open the page containing University Statements by President Pollack
html = urlopen('https://statements.cornell.edu/pollack.cfm')
#Use the BeautifulSoup package to load the content. 
bs_html = BeautifulSoup(html, 'html.parser')
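
The cell above assumes the request succeeds. Here is a minimal sketch of a more defensive fetch, assuming you want a timeout and a readable message when a page cannot be loaded. The helper name `fetch_html` and the 10-second default are illustrative choices, not part of the workshop code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, or None if the request fails.

    `fetch_html` is a hypothetical helper name used only for this sketch.
    """
    try:
        # The context manager closes the connection once the body is read.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"Server returned HTTP {err.code} for {url}")
    except URLError as err:
        # The server could not be reached (bad hostname, no network, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

The returned bytes can be passed to `BeautifulSoup(..., 'html.parser')` exactly as `html` is above, after checking that the result is not `None`.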

### HTML Tags & Attributes

- **Tags** mark the different kinds of components in an HTML file, such as `<title>` for the page title, `<div>` for a designated section of the document, `<h1>` for a top-level header, or `<a>` for links.
- Most HTML tags need a matching closing tag, such as `</title>` or `</div>`, so each element has a clear beginning and end.
- HTML tags often carry additional information inside the opening tag itself, known as **attributes**. You can think of an attribute as an additional identifier for a given tag.
- Attributes are written as a keyword such as `class`, `id`, or `href` (for hyperlinks), followed by an equal sign and a value, as in `<div class='container'>`.

Here are two external references for learning more about the available [tags](https://www.w3schools.com/tags/default.asp) and [attributes](https://www.w3schools.com/tags/ref_attributes.asp) within HTML.
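
To make tags and attributes concrete, here is a small sketch that parses a hand-written HTML string instead of a live page. The markup and the variable names (`sample_html`, `soup`, `div`, `link`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, not taken from the workshop site).
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="container" id="main">
      <h1>First header</h1>
      <a href="https://example.com/about">About</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

div = soup.find("div")   # first <div> element in the document
link = soup.find("a")    # first <a> element in the document

print(div.name)          # the tag name: div
print(div["id"])         # value of the id attribute: main
print(div["class"])      # class can hold several values, so it is a list: ['container']
print(link["href"])      # value of the href attribute
print(link.get_text())   # text between <a> and </a>: About
```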


The parsed Beautiful Soup object offers a variety of methods for looking at specific tags within the HTML. To decide which tags and attributes to ask for, use your browser's Inspect feature: right-click the item you want to extract and click 'Inspect'. See this guide for more information about the <a href = "https://www.theodinproject.com/lessons/foundations-inspecting-html-and-css">inspector feature</a>.

The following cell retrieves every `a` tag in the HTML file that has the class `cu-statement-title`, which is how the article links are marked up:

In [4]:
#Use the find_all feature to find all article links. 
links = bs_html.find_all('a', {'class':'cu-statement-title'})

Attributes distinguish different subgroups of the same base HTML tag. This makes it easy for a web scraping program to retrieve specific HTML elements that would otherwise be hard to tell apart because they share the same tag. Here, `<a class="cu-statement-title">` separates the statement URLs from the other URLs found on the page.

Let's look at the results by printing just the first one.

In [6]:
print(links[0])

<a class="cu-statement-title" href="../2024/20240531-recommendations.cfm">External advisory committee recommendations</a>
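
The effect of filtering by attribute is easier to see on a small example. The sketch below uses made-up markup with two statement links and one unrelated link; only the class name `cu-statement-title` comes from the page we are scraping.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics what the Inspect feature shows for the listing page.
sample_html = """
<ul>
  <li><a class="cu-statement-title" href="../2024/first.cfm">First statement</a></li>
  <li><a class="cu-statement-title" href="https://example.com/2024/second.cfm">Second statement</a></li>
  <li><a class="nav-link" href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Without a filter, every <a> tag is returned.
all_links = soup.find_all("a")
# With an attribute filter, only the matching <a> tags are returned.
statement_links = soup.find_all("a", {"class": "cu-statement-title"})
# Equivalent keyword form; the trailing underscore avoids Python's reserved word `class`.
same_links = soup.find_all("a", class_="cu-statement-title")

print(len(all_links))        # 3
print(len(statement_links))  # 2
print(len(same_links))       # 2
```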


### Looping to get URL
The code above extracts all the HTML for the `<a>` tags whose `class` attribute is 'cu-statement-title'. But when we print a result, it still looks messy. Let's use a loop to go through all the results and extract only the `href` attribute, which gives us each article's URL.

In [5]:
link_url = []
for i in links:
    link_url.append(i['href'])
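
The same loop can also be written as a list comprehension. The sketch below runs on made-up markup and additionally skips tags that have no `href` at all, a case that does not necessarily occur on the real page; `sample_links` and `sample_urls` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third link has no href attribute.
sample_html = """
<a class="cu-statement-title" href="../2024/first.cfm">First</a>
<a class="cu-statement-title" href="https://example.com/second.cfm">Second</a>
<a class="cu-statement-title">No link target</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
sample_links = soup.find_all("a", {"class": "cu-statement-title"})

# tag["href"] raises KeyError when the attribute is missing;
# tag.get("href") returns None instead, which we filter out.
sample_urls = [a.get("href") for a in sample_links if a.get("href") is not None]

print(sample_urls)  # ['../2024/first.cfm', 'https://example.com/second.cfm']
```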

Now let's examine the contents of the list. The code below displays the number of URLs we have extracted.

In [7]:
len(link_url)

111

The result above shows there are 111 articles on this page. Let's look at the first nine using the code below (the slice `0:9` covers indexes 0 through 8).

In [9]:
print(link_url[0:9])

['../2024/20240531-recommendations.cfm', '../2024/20240530-student-referendum.cfm', '../2024/20240514-encampment-update.cfm', 'https://statements.cornell.edu/2024/20240509-some-news.cfm', 'https://statements.cornell.edu/2024/20240429-campus-events.cfm', 'https://statements.cornell.edu/2024/20240422-patient-safety-wcm.cfm', '../2024/20240419-resources.cfm', '../2024/20240327-community.cfm', '../2024/20240319-incident-in-collegetown.cfm']


### Cleaning Up
Notice in the results above that some of the URLs we extracted are full URLs, which is what we want. Others are only paths relative to the current page, and those cannot be opened on their own. We need a fix so that every URL in our list is a full URL.

In [10]:
base = "https://statements.cornell.edu/"

In [11]:
#Add the base to every URL that is not already a full URL
for count, i in enumerate(link_url):
    if "http" not in i:
        link_url[count] = base + i

In [12]:
#Remove the '../' that the relative paths started with
link_url = [st.replace('../', '') for st in link_url]

## Explanation of the above process
Looking at the URLs on the articles page manually, I can see that each one starts with the base "https://statements.cornell.edu/". So the URLs in the earlier printout that are not full need two changes: the base added to the beginning, and the '../' removed. Using an if condition, we check whether 'http' is present in each link; if it is not, we add the base. A second step then removes '../' from every URL.

Let's look at the first nine results now.


In [13]:
print(link_url[0:9])

['https://statements.cornell.edu/2024/20240531-recommendations.cfm', 'https://statements.cornell.edu/2024/20240530-student-referendum.cfm', 'https://statements.cornell.edu/2024/20240514-encampment-update.cfm', 'https://statements.cornell.edu/2024/20240509-some-news.cfm', 'https://statements.cornell.edu/2024/20240429-campus-events.cfm', 'https://statements.cornell.edu/2024/20240422-patient-safety-wcm.cfm', 'https://statements.cornell.edu/2024/20240419-resources.cfm', 'https://statements.cornell.edu/2024/20240327-community.cfm', 'https://statements.cornell.edu/2024/20240319-incident-in-collegetown.cfm']
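
As an alternative to adding the base and removing '../' by hand, Python's standard library can resolve a relative link against the address of the page it was found on. This is a sketch of that approach rather than the method used in the cells above; `page_url`, `sample_hrefs`, and `full_urls` are illustrative names.

```python
from urllib.parse import urljoin

# Address of the listing page the links were scraped from.
page_url = "https://statements.cornell.edu/pollack.cfm"

# Two of the values seen in the output above: one relative, one already absolute.
sample_hrefs = [
    "../2024/20240531-recommendations.cfm",
    "https://statements.cornell.edu/2024/20240509-some-news.cfm",
]

# urljoin resolves relative paths and leaves absolute URLs unchanged.
full_urls = [urljoin(page_url, href) for href in sample_hrefs]

for url in full_urls:
    print(url)
# https://statements.cornell.edu/2024/20240531-recommendations.cfm
# https://statements.cornell.edu/2024/20240509-some-news.cfm
```

Because `urljoin` handles both cases, no `if` check or string replacement is needed.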


### Extracting the Date, Title, and Content of Each Article: Looping Again
Now each URL is full. Excellent! We are ready to move to the next step, which uses another loop. The plan is to go to each article page, read its HTML, and pull out the information we want (title, date, article content). Then we save each article as an individual Word file.

Let's talk about the Inspect feature again. Right-click the item you want to extract and click 'Inspect'. See this guide for more information about the <a href = "https://www.theodinproject.com/lessons/foundations-inspecting-html-and-css">inspector feature</a>.

In [14]:
#Display where word files will get saved
os.getcwd()

'C:\\Users\\jrg363\\Workshops FA25\\Web Scraping Library'

In [18]:
current_directory = os.getcwd()
#Loop through each individual URL
for i in link_url:
    #Open URL
    html = urlopen(i)
    #Store html code using beautifulSoup
    bs_html = BeautifulSoup(html, 'html.parser')
    #Find the title and get the associated text.
    title = bs_html.find('h2', {'class':'cu-headline'})
    title = title.get_text()
    #Find the date and get the associated text.
    date = bs_html.find('time', {'class':'news-date'})
    date = date.get_text()
    #Clean up the date a bit. 
    date = re.sub(r'\s+', ' ',date).strip()
    #Find the article content and get the associated text.
    paragraphs = bs_html.find_all('p')
    #Create a name for the Word document that will be saved. 
    name = title + "_" + date
    name = re.sub(r'[^A-Za-z0-9 ]+', '_', name)
    #Build the full path. python-docx writes the .docx format, so use that extension.
    filename = os.path.join(current_directory, name + ".docx")
    #print(filename)
    doc = Document()
    doc.add_heading(title)
    for p in paragraphs:
        text = p.get_text()
        doc.add_paragraph(text)
    #Save word document. 
    doc.save(filename)
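
The two text-cleaning steps in the loop (collapsing whitespace and replacing characters that are awkward in file names) can be tried on their own without downloading anything. The sketch below wraps them in two hypothetical helpers, `clean_whitespace` and `build_docx_path`; the sample title and date are made up, since the exact date format on the site is not shown here.

```python
import re
from pathlib import Path


def clean_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def build_docx_path(title, date, folder="."):
    """Build a path of the form <folder>/<title>_<date>.docx.

    Hypothetical helper: the name and the `folder` parameter are assumptions.
    """
    name = clean_whitespace(title) + "_" + clean_whitespace(date)
    # Replace each run of characters other than letters, digits, and spaces.
    name = re.sub(r"[^A-Za-z0-9 ]+", "_", name)
    # pathlib picks the right separator for the operating system.
    return Path(folder) / (name + ".docx")


print(clean_whitespace("\n   May 31,\n 2024  "))
# May 31, 2024
print(build_docx_path("Update: campus events", "May 31, 2024").name)
# Update_ campus events_May 31_ 2024.docx
```

Building the path with `pathlib` (or `os.path.join`) instead of a hard-coded backslash lets the same notebook run on Windows, macOS, and Linux. If you pass the result to `doc.save`, converting it with `str(...)` first is the conservative choice.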

### Finishing Up
We now have a Word file for each article, which we could use for text or sentiment analysis. Let's also create a data frame that holds each article's title, date, and URL. We already know how to get this information from the code in the previous section. The code below combines it into a pandas DataFrame and exports it to an Excel spreadsheet.

In [16]:
dates = []
titles = []
for i in link_url:
    #Open URL
    html = urlopen(i)
    #Store html code using beautifulSoup
    bs_html = BeautifulSoup(html, 'html.parser')
    #Find the title and get the associated text.
    title = bs_html.find('h2', {'class':'cu-headline'})
    title = title.get_text()
    #Append title to a list
    titles.append(title)
    #Find the date and get the associated text.
    date = bs_html.find('time', {'class':'news-date'})
    date = date.get_text()
    #Clean up the date a bit. 
    date = re.sub(r'\s+', ' ',date).strip()
    #Append date to list
    dates.append(date)

#Combine the results into a dataframe. 
article_data = pd.DataFrame({
    'title': titles,
    'date': dates,
    'url': link_url,
})
current_directory = os.getcwd()
filename = os.path.join(current_directory, "article_data.xlsx")
#Export to Excel (pandas needs the openpyxl package installed to write .xlsx files)
article_data.to_excel(filename)
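
Both loops above extract the title and date in the same way, and both assume every page contains those elements. One possible refactoring, sketched here on made-up HTML strings rather than live pages, puts that logic in a single function that returns `None` for anything it cannot find. The function name `parse_article` and the sample pages are hypothetical.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


def parse_article(html):
    """Return the title and date found in one article page.

    Missing elements become None instead of raising AttributeError.
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h2", {"class": "cu-headline"})
    date_tag = soup.find("time", {"class": "news-date"})
    title = title_tag.get_text().strip() if title_tag is not None else None
    date = None
    if date_tag is not None:
        # Collapse line breaks and repeated spaces inside the date text.
        date = re.sub(r"\s+", " ", date_tag.get_text()).strip()
    return {"title": title, "date": date}


# Hypothetical pages standing in for downloaded articles.
sample_pages = {
    "https://example.com/a.cfm": (
        '<h2 class="cu-headline">First</h2>'
        '<time class="news-date">\n May 1,\n 2024 </time>'
    ),
    "https://example.com/b.cfm": "<p>No headline or date on this page</p>",
}

rows = []
for url, html in sample_pages.items():
    row = parse_article(html)
    row["url"] = url
    rows.append(row)

sample_data = pd.DataFrame(rows, columns=["title", "date", "url"])
print(sample_data)
```

With a function like this, each page only needs to be downloaded once: the same pass can write the Word file and collect the row for the data frame. When exporting, `to_excel(filename, index=False)` leaves out the extra column of row numbers.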

# <a id = "resources"></a>Additional Resources

- [Beautiful Soup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/#making-the-soup)

- <a href = "https://www.geeksforgeeks.org/html/tags-vs-elements-vs-attributes-in-html/"> HTML tags, elements and attributes</a>

- <a href = "https://www.tutorialspoint.com/html/html_attributes.htm">What are HTML Attributes</a>

- <a href= "https://www.w3schools.com/tags/ref_attributes.asp">List of all HTML Attributes</a>

- <a href = "https://www.w3schools.com/TAGS/default.asp">List of all HTML Tags</a>

- <a href = "https://jsonapi.org/examples/">Sample API JSON data</a>

- <a href = "https://socialsciences.cornell.edu/computing-and-data/workshops-and-training">List of Fall 2025 CCSS Workshops</a>

- <a href = "https://socialsciences.cornell.edu/computing-and-data/schedule-a-consultation">Schedule a one-on-one consultation with CCSS Staff</a>

- <a href = "https://www.geeksforgeeks.org/python/python-basics/">Python Basics</a>

- <a href = "https://colab.research.google.com/drive/1FM2lQlVqkq8t1gu9paKacfcnLIfAHZKV?authuser=1&usp=drive_link#scrollTo=q8L6UD2DawPL">Sample BeautifulSoup Python Script File and Guide created by CCSS</a>

- <a href = "https://vod.video.cornell.edu/media/Web+Scraping+in+Python%28BeautifulSoup%29/1_7v2s9fgz/319524772">CCSS Python BeautifulSoup Workshop Recording Fall 2023</a>

- <a href = "https://www.geeksforgeeks.org/web-scraping/scrape-table-from-website-using-python-selenium/">Python Selenium</a>

- <a href = "https://colab.research.google.com/drive/1u5kBOxzMH3ER4kRLuOJlKXfOVtofbE7J?usp=sharing#scrollTo=7H5_OqDn6tbm">Sample Selenium Python Script File and Guide created by CCSS</a>

- <a href = "https://vod.video.cornell.edu/media/Intermediate+Web+Scraping+in+Python+%28Selenium%29/1_jrv69o2p/319524772">CCSS Python Selenium Workshop Recording Fall 2023</a>



