# **Introduction to Web Scraping and BeautifulSoup**

BeautifulSoup is a Python library that helps you to easily extract data from HTML and XML files. It provides a number of methods that can be used to navigate and search through the parse tree of a document.

To use BeautifulSoup and the Requests library, you will need to install them first. You can do this by running the following commands:

In [1]:
# Install the necessary libraries (uncomment the lines below to run them in a notebook)
#!pip install beautifulsoup4
#!pip install requests

Let's go explore a website and take a look at the HTML. We will use the IMSDB database of movie scripts: https://imsdb.com

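To explore the HTML of the page you are working with, open it in your browser and:

*   Right-click anywhere on the page
*   Click Inspect
*   Turn on the hover cursor button at the top left of the panel that opens, then hover over parts of the page to see the matching HTML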

## **How we get access to this HTML**

With the requests library imported, you can then use its various methods to make HTTP requests. For example, to make a GET request to a web server, you can use the `get()` method:

In [5]:
import requests

url = 'https://www.nytimes.com/ca/section/technology'


# Make a request to the webpage
response = requests.get(url)

The response variable will contain the server's response to the request. You can then use various attributes and methods of the response object to inspect the server's response and any data it may have sent back.
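As a minimal sketch of what inspecting the response can look like, the helper below gathers a few commonly used attributes into a dictionary. The name `summarize_response` is hypothetical and not part of Requests; `status_code`, `ok`, `headers`, `encoding`, and `text` are attributes of the object returned by `requests.get()`.

```python
def summarize_response(response):
    """Collect a few commonly inspected attributes of a Requests response.

    `summarize_response` is a hypothetical helper name used for illustration.
    """
    return {
        "status_code": response.status_code,  # e.g. 200 for success, 404 for not found
        "ok": response.ok,                    # True when the status code is below 400
        "content_type": response.headers.get("Content-Type"),
        "encoding": response.encoding,        # encoding used to decode response.text
        "n_characters": len(response.text),   # size of the decoded body
    }
```

With the `response` from the cell above, `summarize_response(response)` gives a quick overview before any parsing. Calling `response.raise_for_status()` first is one way to stop early on a 4xx or 5xx status.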

You can also access the contents of the response by using the text attribute of the response object:


> response_text = response.text


This will give you the raw text of the response, which you can then parse or manipulate as needed.

In [6]:
print(response.text)

<!DOCTYPE html>
<html lang="en"  xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <meta charset="utf-8" />
    <title data-rh="true">Technology - The New York Times Canada</title>
    <meta data-rh="true" property="og:description" content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture."/><meta data-rh="true" name="description" content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture."/><meta data-rh="true" property="twitter:description" name="description" content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture."/><meta data-rh="true" property="og:title" content="Technology"/><meta data-rh="true" property="twitter:title" content="Technology"/><meta data-rh="true" property="og:image" content="https://static01.nyt.com/newsgraphics/images/icons/defaultPromoCrop.png"/><meta data-rh="true" pro

Now, we can start working with BeautifulSoup.

In [7]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

In [13]:
# All of the text on the page, with the HTML tags removed
soup.text

'\n\n\n\nTechnology - The New York Times Canada\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to contentSkip to site indexTechnology\xa0Today’s PaperAdvertisementSKIP ADVERTISEMENTSupported bySKIP ADVERTISEMENTTechnologyDealBookMarketsEconomyEnergyMediaTechnologyPersonal TechSmall BusinessYour MoneyMutual Funds & ETFsHighlightsTest Yourself: Which Faces Were Made by A.I.?People tend to overestimate their ability to spot digital fakes, researchers found.\xa0By Stuart A. ThompsonCreditApple Says It Will Remove a Health Feature From New Apple WatchesStarting on Thursday, the Apple Watch Series 9 and Watch Ultra 2 will no longer detect people’s blood oxygen levels, to comply with a ruling by the International Trade Commission.\xa0By Tripp MickleCreditMichael M. Santiago/Getty ImagesTech TipHow to Cut Down Your Screen Time but Still Get Stuff DoneGoogle’s Routines and Apple’s Shortcuts combine multiple steps into one command to make your phone or tablet do more of the work for you.\xa0B
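In the output above, neighbouring pieces of text run together ("Skip to contentSkip to site index"). A small sketch of one way around this, using made-up HTML rather than the real page: `get_text()` accepts a `separator` and a `strip` argument.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
html = "<div><h2>Headline</h2><p>Summary text.</p><p>By A. Writer</p></div>"
demo_soup = BeautifulSoup(html, "html.parser")

# .text joins every string with no separator, so words run together
run_together = demo_soup.text

# get_text() can put a separator between strings and strip surrounding whitespace
readable = demo_soup.get_text(separator=" ", strip=True)

print(run_together)
print(readable)
```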

In BeautifulSoup, find() and find_all() are methods used to search the parse tree of an HTML or XML document for tags and elements that match certain criteria.


> **find()** returns the first element that matches the specified criteria


> **find_all()** returns a list of all the elements that match the criteria.


Both find() and find_all() take a number of optional arguments that allow you to specify the criteria for the search. For example, you can use the name argument to search for elements with a specific tag name:

In [14]:
# Find the first element with the tag 'p'
p_element = soup.find('p')
print("The first element with tag p", p_element)

# Find all elements with the tag 'p'
p_elements = soup.find_all('p')
print("All elements with the tag p", p_elements)

The first element with tag p <p>Advertisement</p>
All elements with the tag p [<p>Advertisement</p>, <p>Supported by</p>, <p class="css-tskdi9 e1hr934v5">People tend to overestimate their ability to spot digital fakes, researchers found.</p>, <p class="css-1notxxm e1hr934v3"><span class="css-me3p27"><span> </span></span><span class="css-1ekjaje e1hr934v4"></span><span class="css-9voj2j">By<!-- --> <span class="css-1baulvz last-byline" itemprop="name">Stuart A. Thompson</span></span></p>, <p class="css-tskdi9 e1hr934v5">Starting on Thursday, the Apple Watch Series 9 and Watch Ultra 2 will no longer detect people’s blood oxygen levels, to comply with a ruling by the International Trade Commission.</p>, <p class="css-1notxxm e1hr934v3"><span class="css-me3p27"><span> </span></span><span class="css-1ekjaje e1hr934v4"></span><span class="css-9voj2j">By<!-- --> <span class="css-1baulvz last-byline" itemprop="name">Tripp Mickle</span></span></p>, <p class="css-tskdi9 e1hr934v5">Google’s Routi
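The two methods also differ in what they return when nothing matches, which matters when you chain calls. The sketch below uses a small made-up HTML snippet (not the New York Times page) to show this.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = """
<body>
  <p>Advertisement</p>
  <p class="summary">First summary.</p>
  <p class="summary">Second summary.</p>
</body>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first_p = demo_soup.find("p")            # a single Tag: the first match
all_p = demo_soup.find_all("p")          # a list-like collection of every match
missing = demo_soup.find("table")        # None, because nothing matches
no_tables = demo_soup.find_all("table")  # empty collection, safe to loop over

print(first_p.get_text())
print(len(all_p))
print(missing is None)
print(len(no_tables))
```

Because `find()` can return `None`, it is worth checking the result before accessing attributes on it.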

In [18]:
soup.prettify()

'<!DOCTYPE html>\n<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">\n <head>\n  <meta charset="utf-8"/>\n  <title data-rh="true">\n   Technology - The New York Times Canada\n  </title>\n  <meta content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture." data-rh="true" property="og:description"/>\n  <meta content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture." data-rh="true" name="description"/>\n  <meta content="Technology industry news, commentary and analysis, with reporting on big tech, startups, and internet culture." data-rh="true" name="description" property="twitter:description"/>\n  <meta content="Technology" data-rh="true" property="og:title"/>\n  <meta content="Technology" data-rh="true" property="twitter:title"/>\n  <meta content="https://static01.nyt.com/newsgraphics/images/icons/defaultPromoCrop.png" data-rh="true" property="og:image"

You can also search for elements with a specific class name, either with the `class_` argument or, as in the cell below, by passing a dictionary to `attrs`:

In [22]:
# Find all the div elements with the class 'css-14ee9cx'
articles = soup.find_all('div', attrs={'class':'css-14ee9cx'})

articles

# try to pull out some relevant information for the articles (title, description, link)

[<div class="css-14ee9cx"><article class="css-1l4spti"><div class="css-79elbk"><figure aria-label="media" class="css-ulz9xo" role="group"><div class="css-79elbk" data-testid="photoviewer-children-Image"><img alt="" class="css-rq4mmj" decoding="async" height="100" sizes="(min-width: 1024px) 205px, 150px" src="https://static01.nyt.com/images/2024/01/22/multimedia/22deepfake-biden-2-kjwm/22deepfake-biden-2-kjwm-thumbWide.jpg?quality=75&amp;auto=webp&amp;disable=upscale" srcset="https://static01.nyt.com/images/2024/01/22/multimedia/22deepfake-biden-2-kjwm/22deepfake-biden-2-kjwm-thumbWide.jpg?quality=100&amp;auto=webp 190w,https://static01.nyt.com/images/2024/01/22/multimedia/22deepfake-biden-2-kjwm/22deepfake-biden-2-kjwm-videoThumb.jpg?quality=100&amp;auto=webp 75w,https://static01.nyt.com/images/2024/01/22/multimedia/22deepfake-biden-2-kjwm/22deepfake-biden-2-kjwm-videoLarge.jpg?quality=100&amp;auto=webp 768w,https://static01.nyt.com/images/2024/01/22/multimedia/22deepfake-biden-2-kjwm/

There are many other arguments you can use to narrow down your search, such as `id` or `attrs`. To extract the text from a tag you have found, use its `text` attribute or its `get_text()` method.
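As a rough sketch of how these arguments combine, and of pulling out a title, description, and link as suggested in the comment above, the example below works on made-up HTML. The tag names, class names, and `data-section` attribute are assumptions for illustration and do not reflect the real structure of the New York Times page.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the structure and names are assumptions for illustration
html = """
<div id="stream">
  <article class="story featured" data-section="tech">
    <a href="/2024/01/story-one.html"><h3>Story one</h3></a>
    <p class="summary">Summary of story one.</p>
  </article>
  <article class="story" data-section="business">
    <a href="/2024/01/story-two.html"><h3>Story two</h3></a>
    <p class="summary">Summary of story two.</p>
  </article>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the given name
stories = demo_soup.find_all("article", class_="story")
# id and attrs narrow the search by other attributes
stream = demo_soup.find(id="stream")
tech = demo_soup.find_all("article", attrs={"data-section": "tech"})

# Build one record per article
records = []
for story in stories:
    records.append({
        "title": story.find("h3").get_text(strip=True),
        "description": story.find("p", class_="summary").get_text(strip=True),
        "link": story.find("a")["href"],
    })

print(records)
```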

## **IMSDB Exercise**

In [6]:
import random
from bs4 import BeautifulSoup
import requests

In [7]:
base_url = 'http://www.imsdb.com/'

To start, we fetch the IMSDB page that lists all of the scripts on the site. Again, we use the Requests library and BeautifulSoup. We look for every `p` tag, since those contain the information we need.

In [8]:
response = requests.get('https://imsdb.com/all-scripts.html')
html = response.text

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all('p')

# what do the paragraphs look like?
paragraphs[0:5]

[<p><a href="/Movie Scripts/10 Things I Hate About You Script.html" title="10 Things I Hate About You Script">10 Things I Hate About You</a> (1997-11 Draft)<br/><i>Written by Karen McCullah Lutz,Kirsten Smith,William Shakespeare</i><br/></p>,
 <p><a href="/Movie Scripts/12 Script.html" title="12 Script">12</a> (Undated Draft)<br/><i>Written by Lawrence Bridges</i><br/></p>,
 <p><a href="/Movie Scripts/12 and Holding Script.html" title="12 and Holding Script">12 and Holding</a> (2004-04 Draft)<br/><i>Written by Anthony Cipriano</i><br/></p>,
 <p><a href="/Movie Scripts/12 Monkeys Script.html" title="12 Monkeys Script">12 Monkeys</a> (1994-06 Draft)<br/><i>Written by David Peoples,Janet Peoples</i><br/></p>,
 <p><a href="/Movie Scripts/12 Years a Slave Script.html" title="12 Years a Slave Script">12 Years a Slave</a> (Undated Draft)<br/><i>Written by John Ridley</i><br/></p>]

As you can see, this gives us information about the scripts like title, who wrote the script, and the link to the page for that particular script (which will be important for us to get the script).

In [9]:
paragraphs[1].a['href']

'/Movie Scripts/12 Script.html'
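Note that this link is relative, starts with a slash, and contains spaces. Plain string concatenation with `base_url` produces a double slash (`http://www.imsdb.com//Movie Scripts/...`). A small sketch of an alternative using the standard library's `urljoin` and `quote`:

```python
from urllib.parse import quote, urljoin

base_url = "http://www.imsdb.com/"
relative_link = "/Movie Scripts/12 Script.html"

# urljoin resolves the relative path against the base without doubling the slash
joined = urljoin(base_url, relative_link)

# quote() percent-encodes the spaces; "/" is left untouched by default
encoded = urljoin(base_url, quote(relative_link))

print(joined)
print(encoded)
```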

Since there are quite a few scripts, we will focus on a random selection of 20 using `random.choices()` (note that this samples with replacement, so the same script can be picked more than once). Below, we loop through this random selection.

In [11]:
for p in random.choices(paragraphs[1:], k=20):
  relative_link = p.a['href']
  #print(relative_link)
  tail = relative_link.split('/')[-1]
  #print('fetching %s' % tail)
  front_page_response = requests.get(base_url + relative_link)
  front_soup = BeautifulSoup(front_page_response.text, "html.parser")

  try:
    script_link = front_soup.find_all('p', align="center")[0].a['href']

  except IndexError:
    print('{} has no script.'.format(tail))
    continue

  if script_link.endswith('.html'):
    title = script_link.split('/')[-1].split(' Script')[0]
    script_url = base_url + script_link
    #print(script_url)
    script_response = requests.get(script_url)
    script_soup = BeautifulSoup(script_response.text, "html.parser")
    script = script_soup.find('td', attrs={'class':'scrtext'}).get_text()
    # Show the title and the first 200 characters of the script
    print(title, script[0:200])
  else:
    print('{} is not html.'.format(tail))

Alien.html 



   "Alien", early draft, by Dan O'Bannon



   




                               ALIEN


                  (project formerly titled STARBEAST)


                Story by Dan O'
Aliens.html 



Aliens - by James Cameron






                               "ALIENS"


                                  by


                             James Cameron







         
Deer-Hunter,-The.html 



The Deer Hunter



EXT. PENNSYLVANIA STEEL MILL - LIGHT SNOW - DAY

The plant is massive, grime-streaked, squatting in the valley
under five massive stacks, each one trailing a black rib
Titanic.html 


                               T I T A N I C


                              a screenplay by
                               James Cameron



1 BLACKNESS

Then two faint lights appear, c
Sugar.html 



                                SUGAR



                              Written by

                        Anna Boden & Ryan Fleck





1    EXT. ACADEMY FIELD - DOMINIC
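The loop above draws its selection with `random.choices()`, which samples with replacement. If 20 distinct scripts are wanted, `random.sample()` is one alternative. The sketch below uses a stand-in list instead of the real `paragraphs`, and the seed is there only to make the run repeatable.

```python
import random

# Stand-in for the list of <p> tags; any sequence behaves the same way
items = [f"script_{i}" for i in range(100)]

rng = random.Random(0)  # seeded only so the sketch is repeatable

with_replacement = rng.choices(items, k=20)    # the same item may appear twice
without_replacement = rng.sample(items, k=20)  # 20 distinct items

print(len(set(with_replacement)), len(set(without_replacement)))
```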

Some cleaning we may want to do is removing leftover tags and metadata at the top of each script. This step can be incorporated into the loop in the previous cell.
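The cleaning code itself is not shown in this notebook, so the following is only a sketch of what a first pass could look like. It is limited to whitespace cleanup (the runs of blank lines visible in the output above), and `clean_script` is a hypothetical helper name.

```python
import re

def clean_script(raw_text):
    """Tidy the text pulled from a script page (a rough sketch, not a full cleaner)."""
    # Normalize Windows line endings so the patterns below only deal with "\n"
    text = raw_text.replace("\r\n", "\n")
    # Remove spaces and tabs at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop leading blank lines and trailing whitespace, keeping the first line's indentation
    return text.lstrip("\n").rstrip()
```

Inside the loop, this could be applied as `script = clean_script(script)` right after the call to `get_text()`. Removing site-specific metadata would need rules based on what the pages actually contain.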

## Accessing the robots.txt file

Websites will sometimes explicitly state in their terms and conditions whether you are allowed to scrape them. Many also publish a robots.txt file that lists, per user agent, which paths crawlers are allowed or disallowed to fetch. (Python's `urllib.robotparser` module can read such a file, and its `can_fetch()` method returns `True` or `False` for a given user agent and URL.) If the site explicitly does not allow web scraping, I would avoid collecting data from it (if it is possible at all; sometimes sites have protections against scraping).

**For our previous examples, The New York Times allows for limited scraping of their content, which makes sense as it is subscription-based. It only allows specific agents like twitterbot to collect content so if we wanted to scrape full news articles or other information we would need to look elsewhere. IMSDB allows scraping for educational purposes and does not consider it an infringement of copyright under their disclaimers. It does not have a robots.txt file for us to look at so we will stick to looking at The New York Times.**

In [14]:
# We can look at robots.txt (if it exists)
response = requests.get('https://www.nytimes.com/robots.txt')

if response.status_code == 200:
    print(response.text)
else:
    print('Did not locate robots.txt')



User-agent: *
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /puzzles/leaderboards/invite/*
Disallow: /svc
Allow: /svc/crosswords
Allow: /svc/games
Allow: /svc/letter-boxed
Allow: /svc/spelling-bee
Allow: /svc/vertex
Allow: /svc/wordle
Disallow: /video/embedded/*
Disallow: /search
Disallow: /multiproduct/
Disallow: /hd/
Disallow: /inyt/
Disallow: /*?*query=
Disallow: /*.pdf$
Disallow: /*?*login=
Disallow: /*?*searchResultPosition=
Disallow: /*?*campaignId=
Disallow: /*?*mcubz=
Disallow: /*?*smprod=
Disallow: /*?*ProfileID=
Disallow: /*?*ListingID=
Disallow: /wirecutter/wp-admin/
Disallow: /wirecutter/*.zip$
Disallow: /wirecutter/*.csv$
Disallow: /wirecutter/deals/beta
Disallow: /wirecutter/data-requests
Disallow: /wirecutter/search
Disallow: /wirecutter/*?s=
Disallow: /wirecutter/*&xid=
Disallow: /wirecutter/*?q=
Disallow: /wirecutter/*?l=
Disallow: /search
Disallow: /*?*smid=
Disallow: /*?*partner=
Disallow: /*?*utm_source=
Allow: /wirecutter/*?*utm_source=
Allow: /ads/public/
Allow: /

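As you can see, The New York Times disallows a long list of paths for generic crawlers (`User-agent: *`), including search pages, ads, and URLs that carry tracking parameters.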
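Rules like the ones printed above can also be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a short excerpt of those rules from a string so that it runs offline; the two test URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A short excerpt of the rules printed above, so the sketch runs without a network call
robots_lines = """\
User-agent: *
Disallow: /ads/
Disallow: /search
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

base = "https://www.nytimes.com"
search_allowed = parser.can_fetch("*", base + "/search?query=python")  # falls under Disallow: /search
section_allowed = parser.can_fetch("*", base + "/section/technology")  # falls under Allow: /

print(search_allowed, section_allowed)
```

To read the live file instead, `parser.set_url(base + "/robots.txt")` followed by `parser.read()` can replace the `parse()` call. One caveat: the standard-library parser matches rules by path prefix, so wildcard patterns such as `/*?*query=` may not be interpreted the way a search engine crawler would interpret them.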

## Text Augmentation

There are many strategies for augmenting text to increase the number of samples in your dataset while adding variability to them. One example is to swap synonyms into a piece of text. How many words you swap can be changed, and may affect the performance of whatever task you are trying to accomplish.

How can I get synonyms for a word?

We have seen examples of how to use WordNet in NLTK. Expanding on this, we can collect synonyms for a particular word.

In [3]:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

word = "awesome"
synonyms = []
word_synsets = wordnet.synsets(word)
print(word_synsets)

[nltk_data] Downloading package wordnet to /root/nltk_data...


[Synset('amazing.s.02')]


So we have one synset here that is 'amazing', which can have a set of synonyms.

In [4]:
for syn in word_synsets:
    for lem in syn.lemmas():
        print(lem)

Lemma('amazing.s.02.amazing')
Lemma('amazing.s.02.awe-inspiring')
Lemma('amazing.s.02.awesome')
Lemma('amazing.s.02.awful')
Lemma('amazing.s.02.awing')


So we are going to extract these words.

In [5]:
for syn in word_synsets:
    for lem in syn.lemmas():
        synonym = lem.name().replace("_", " ").replace("-", " ")
        if synonym not in synonyms:
            synonyms.append(synonym)

# we also added awesome, but we don't want to swap in the same word so you can remove it
if word in synonyms:
    synonyms.remove(word)

print(synonyms)

['amazing', 'awe inspiring', 'awful', 'awing']


How do you think you might add to this code to choose a number of words to swap in synonyms?
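One possible answer, as a sketch rather than the only approach: pick a fixed number of word positions at random and replace each with a randomly chosen synonym. The function name `swap_synonyms` and its parameters are hypothetical. The lookup is passed in as a function so that the WordNet loop above could be wrapped and plugged in; a toy dictionary stands in for WordNet here so the example runs without downloads.

```python
import random

def swap_synonyms(text, get_synonyms, n_swaps, seed=None):
    """Replace up to n_swaps words in text with a randomly chosen synonym.

    get_synonyms is any function mapping a lowercase word to a list of synonyms
    (for example, a wrapper around the WordNet loop above).
    """
    rng = random.Random(seed)
    words = text.split()
    # Positions of words that have at least one synonym available
    candidates = [i for i, w in enumerate(words) if get_synonyms(w.lower())]
    # Pick distinct positions, never more than are available
    for i in rng.sample(candidates, k=min(n_swaps, len(candidates))):
        words[i] = rng.choice(get_synonyms(words[i].lower()))
    return " ".join(words)

# Toy lookup standing in for WordNet
toy_synonyms = {"awesome": ["amazing", "awe inspiring"], "movie": ["film"]}

def lookup(word):
    return toy_synonyms.get(word, [])

augmented = swap_synonyms("That movie was awesome", lookup, n_swaps=1, seed=0)
print(augmented)
```

This simple version splits on whitespace, so a word with punctuation attached (such as "awesome.") would not be matched; tokenizing with NLTK first would be one way to handle that.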