# Web Scraping — Part 2 — Workbook

In this lesson, we introduce how to scrape multiple web pages with the Python libraries `requests` and `BeautifulSoup`.

---

## Quick Demonstration of Image Scraping — NYT Front Page

### Import Requests and BeautifulSoup

Once again, we're going to use the `requests` library and the `BeautifulSoup` library to scrape data.

In [1]:
import requests
from bs4 import BeautifulSoup

### Get HTML Data and Extract Text

*The New York Times* Front Page: https://nytimes.com

Here we're going to request the URL for *The New York Times* front page, extract the text of the web page, and then transform it into a BeautifulSoup document.

In [2]:
response = requests.get("https://nytimes.com")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

Here we search through the HTML code to find all the `<img>` tags:

In [3]:
document.find_all('img')

[<img alt="The Morning Logo" class="" src="/vi-assets/static-assets/icon-the-morning_144x144-b12a6923b6ad9102b766352261b1a847.webp"/>,
 <img alt="The Upshot Logo" class="" src="/vi-assets/static-assets/icon-the-upshot_144x144-0b1553ff703bbd07ac8fe73e6d215888.webp"/>,
 <img alt="The Daily Logo" class="" src="https://static01.nyt.com/images/2017/01/29/podcasts/the-daily-album-art/the-daily-album-art-mediumSquare149-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>,
 <img alt="The Run-Up Logo" class="css-hqhlyo" src="https://static01.nyt.com/images/2022/08/29/podcasts/the-run-up-album-art/the-run-up-album-art-thumbLarge.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>,
 <img alt="Morning Briefing: Europe Logo" class="" src="/vi-assets/static-assets/icon-europe-morning-briefing_144x144-f0a330cb12ba0c31f81f13e25f6d0d18.webp"/>,
 <img alt="The Interpreter Logo" class="" src="/vi-assets/static-assets/icon-the-interpreter_144x144-b29b74b2ebedb8e74823f33b16fb8167.webp"/>,
 <img alt="You

To display these images in our Jupyter notebook, we're going to import `Markdown` and `display` from the `IPython.display` module, which allow us to render code output as Markdown and thus display the images in this notebook.

In [4]:
from IPython.display import Markdown, display

# Loop through all the images on the NYT front page
for image in document.find_all('img'):
    
    # Convert the image tag to a string
    image_string = str(image)
    
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

<img alt="The Morning Logo" class="" src="/vi-assets/static-assets/icon-the-morning_144x144-b12a6923b6ad9102b766352261b1a847.webp"/>

<img alt="The Upshot Logo" class="" src="/vi-assets/static-assets/icon-the-upshot_144x144-0b1553ff703bbd07ac8fe73e6d215888.webp"/>

<img alt="The Daily Logo" class="" src="https://static01.nyt.com/images/2017/01/29/podcasts/the-daily-album-art/the-daily-album-art-mediumSquare149-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="The Run-Up Logo" class="css-hqhlyo" src="https://static01.nyt.com/images/2022/08/29/podcasts/the-run-up-album-art/the-run-up-album-art-thumbLarge.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Morning Briefing: Europe Logo" class="" src="/vi-assets/static-assets/icon-europe-morning-briefing_144x144-f0a330cb12ba0c31f81f13e25f6d0d18.webp"/>

<img alt="The Interpreter Logo" class="" src="/vi-assets/static-assets/icon-the-interpreter_144x144-b29b74b2ebedb8e74823f33b16fb8167.webp"/>

<img alt="Your Places: Global Update Logo" class="" src="/vi-assets/static-assets/icon-yourplaces-globalupdate_144x144-c25aba1c2904f301a08ad33183f723c6.webp"/>

<img alt="Canada Letter Logo" class="" src="/vi-assets/static-assets/icon-canada-letter_144x144-65d899377edbcce9773d31fd03a77e8d.webp"/>

<img alt="DealBook Logo" class="" src="/vi-assets/static-assets/icon-dealbook_144x144-28e8f71aafff426804c3a92b1b176e07.webp"/>

<img alt="Hard Fork Logo" class="" src="https://static01.nyt.com/images/2022/09/28/podcasts/hard-fork-album-art/hard-fork-album-art-mediumSquare149-v2.png?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Read Like the Wind Logo" class="" src="/vi-assets/static-assets/icon-read-like-the-wind_144x144-5bcf9faf41d0b49df1df29e59a868b36.webp"/>

<img alt="Watching Logo" class="" src="/vi-assets/static-assets/icon-watching_144x144-631a1da177f9fda1a7f4614ad8e607bd.webp"/>

<img alt="Book Review Logo" class="" src="https://static01.nyt.com/images/2018/03/27/books/book-review-album-art-v2/book-review-album-art-v2-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Popcast Logo" class="" src="https://static01.nyt.com/images/2011/05/20/multimedia/music-popcast/music-popcast-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Open Thread Logo" class="" src="/vi-assets/static-assets/icon-open-thread-fashion_144x144-8e1b4b3fd68c2f333faa63097da2249b.webp"/>

<img alt="Well Logo" class="" src="/vi-assets/static-assets/icon-well_144x144-433c9d15dc985dded9b705942592c6fb.webp"/>

<img alt="Modern Love Logo" class="" src="https://static01.nyt.com/images/2020/09/21/podcasts/modernlove-logo/modernlove-logo-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Matter of Opinion Logo" class="" src="https://static01.nyt.com/images/2023/05/08/podcasts/matter-of-opinion-album-art/matter-of-opinion-album-art-thumbLarge-v2.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="The Ezra Klein Show Logo" class="" src="https://static01.nyt.com/images/2023/04/05/podcasts/ezra-klein-album-art/ezra-klein-album-art-thumbLarge-v3.png"/>

<img alt="The Interview Logo" class="" src="/vi-assets/static-assets/NYT-TheInterview-0232c6c95d42d77941fd3d8e5d2776cb.webp"/>

<img alt="The Headlines Logo" class="" src="https://static01.nyt.com/images/2022/10/12/podcasts/headlines-albumartwork-audioapp-2/headlines-albumartwork-audioapp-2-thumbLarge.png?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Serial: The Good Whale Logo" class="" src="/vi-assets/static-assets/TheGoodWhale_144x144-f0eb13463973a1da8814328998990640.webp"/>

<img alt="Audio Logo" class="" src="/vi-assets/static-assets/icon-audio_144x144-dc00c6581be29065cbd19ec7a83a3767.webp"/>

<img alt="Gameplay Logo" class="" src="/vi-assets/static-assets/icon-gameplay_144x144-b6cc5e2a7cc27a43096274a02921329c.webp"/>

<img alt="Easy Mode Logo" class="" src="/vi-assets/static-assets/icon-games-easymode_144x144-307b8f657d987516abff44220313daae.webp"/>

<img alt="The Cooking Newsletter Logo" class="" src="/vi-assets/static-assets/icon-cooking_144x144-5a8be1ef711d4ba5e66b0be7a2ca8bfe.webp"/>

<img alt="The Veggie Logo" class="" src="/vi-assets/static-assets/icon-the-veggie_144x144-f99606e1ca100f88cdfd8d763bf442c5.webp"/>

<img alt="Five Weeknight Dishes Logo" class="" src="/vi-assets/static-assets/icon-five-weeknight-dishes_144x144-97d51c5d4ba98233667b4057e3d852ab.webp"/>

<img alt="The Recommendation Logo" class="" src="/vi-assets/static-assets/icon-the-recommendation_144x144-3e66bd6cc82013bd511c31a8f04d4ff7.webp"/>

<img alt="Clean Everything Logo" class="" src="/vi-assets/static-assets/icon-clean-everything_144x144-97312e349d7284039a2153cb541b7fda.webp"/>

<img alt="The Pulse Logo" class="" src="/vi-assets/static-assets/icon-athletic-pulse_144x144-393cbda91e2678278456723b62a9b21f.webp"/>

<img alt="Scoop City Logo" class="" src="/vi-assets/static-assets/icon-athletic-scoop-city_144x144-131bb9a92c77857aa6cac44772a74a77.webp"/>

<img alt="The Windup Logo" class="" src="/vi-assets/static-assets/icon-athletic-windup_144x144-c03f2bf7ebd88f1c239ba4a6b2228679.webp"/>

<img alt="The Athletic FC Logo" class="" src="/vi-assets/static-assets/icon-athletic-fc_144x144-a673fb497a7a58fd0a80b3d007b73b2f.webp"/>

<img alt="The Morning Logo" class="" src="/vi-assets/static-assets/icon-the-morning_144x144-b12a6923b6ad9102b766352261b1a847.webp"/>

<img alt="The Upshot Logo" class="" src="/vi-assets/static-assets/icon-the-upshot_144x144-0b1553ff703bbd07ac8fe73e6d215888.webp"/>

<img alt="The Daily Logo" class="" src="https://static01.nyt.com/images/2017/01/29/podcasts/the-daily-album-art/the-daily-album-art-mediumSquare149-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="The Run-Up Logo" class="css-hqhlyo" src="https://static01.nyt.com/images/2022/08/29/podcasts/the-run-up-album-art/the-run-up-album-art-thumbLarge.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Morning Briefing: Europe Logo" class="" src="/vi-assets/static-assets/icon-europe-morning-briefing_144x144-f0a330cb12ba0c31f81f13e25f6d0d18.webp"/>

<img alt="The Interpreter Logo" class="" src="/vi-assets/static-assets/icon-the-interpreter_144x144-b29b74b2ebedb8e74823f33b16fb8167.webp"/>

<img alt="Your Places: Global Update Logo" class="" src="/vi-assets/static-assets/icon-yourplaces-globalupdate_144x144-c25aba1c2904f301a08ad33183f723c6.webp"/>

<img alt="Canada Letter Logo" class="" src="/vi-assets/static-assets/icon-canada-letter_144x144-65d899377edbcce9773d31fd03a77e8d.webp"/>

<img alt="DealBook Logo" class="" src="/vi-assets/static-assets/icon-dealbook_144x144-28e8f71aafff426804c3a92b1b176e07.webp"/>

<img alt="Hard Fork Logo" class="" src="https://static01.nyt.com/images/2022/09/28/podcasts/hard-fork-album-art/hard-fork-album-art-mediumSquare149-v2.png?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Read Like the Wind Logo" class="" src="/vi-assets/static-assets/icon-read-like-the-wind_144x144-5bcf9faf41d0b49df1df29e59a868b36.webp"/>

<img alt="Watching Logo" class="" src="/vi-assets/static-assets/icon-watching_144x144-631a1da177f9fda1a7f4614ad8e607bd.webp"/>

<img alt="Book Review Logo" class="" src="https://static01.nyt.com/images/2018/03/27/books/book-review-album-art-v2/book-review-album-art-v2-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Popcast Logo" class="" src="https://static01.nyt.com/images/2011/05/20/multimedia/music-popcast/music-popcast-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Open Thread Logo" class="" src="/vi-assets/static-assets/icon-open-thread-fashion_144x144-8e1b4b3fd68c2f333faa63097da2249b.webp"/>

<img alt="Well Logo" class="" src="/vi-assets/static-assets/icon-well_144x144-433c9d15dc985dded9b705942592c6fb.webp"/>

<img alt="Modern Love Logo" class="" src="https://static01.nyt.com/images/2020/09/21/podcasts/modernlove-logo/modernlove-logo-thumbLarge-v3.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Matter of Opinion Logo" class="" src="https://static01.nyt.com/images/2023/05/08/podcasts/matter-of-opinion-album-art/matter-of-opinion-album-art-thumbLarge-v2.jpg?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="The Ezra Klein Show Logo" class="" src="https://static01.nyt.com/images/2023/04/05/podcasts/ezra-klein-album-art/ezra-klein-album-art-thumbLarge-v3.png"/>

<img alt="The Interview Logo" class="" src="/vi-assets/static-assets/NYT-TheInterview-0232c6c95d42d77941fd3d8e5d2776cb.webp"/>

<img alt="The Headlines Logo" class="" src="https://static01.nyt.com/images/2022/10/12/podcasts/headlines-albumartwork-audioapp-2/headlines-albumartwork-audioapp-2-thumbLarge.png?quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Serial: The Good Whale Logo" class="" src="/vi-assets/static-assets/TheGoodWhale_144x144-f0eb13463973a1da8814328998990640.webp"/>

<img alt="Audio Logo" class="" src="/vi-assets/static-assets/icon-audio_144x144-dc00c6581be29065cbd19ec7a83a3767.webp"/>

<img alt="Gameplay Logo" class="" src="/vi-assets/static-assets/icon-gameplay_144x144-b6cc5e2a7cc27a43096274a02921329c.webp"/>

<img alt="Easy Mode Logo" class="" src="/vi-assets/static-assets/icon-games-easymode_144x144-307b8f657d987516abff44220313daae.webp"/>

<img alt="The Cooking Newsletter Logo" class="" src="/vi-assets/static-assets/icon-cooking_144x144-5a8be1ef711d4ba5e66b0be7a2ca8bfe.webp"/>

<img alt="The Veggie Logo" class="" src="/vi-assets/static-assets/icon-the-veggie_144x144-f99606e1ca100f88cdfd8d763bf442c5.webp"/>

<img alt="Five Weeknight Dishes Logo" class="" src="/vi-assets/static-assets/icon-five-weeknight-dishes_144x144-97d51c5d4ba98233667b4057e3d852ab.webp"/>

<img alt="The Recommendation Logo" class="" src="/vi-assets/static-assets/icon-the-recommendation_144x144-3e66bd6cc82013bd511c31a8f04d4ff7.webp"/>

<img alt="Clean Everything Logo" class="" src="/vi-assets/static-assets/icon-clean-everything_144x144-97312e349d7284039a2153cb541b7fda.webp"/>

<img alt="The Pulse Logo" class="" src="/vi-assets/static-assets/icon-athletic-pulse_144x144-393cbda91e2678278456723b62a9b21f.webp"/>

<img alt="Scoop City Logo" class="" src="/vi-assets/static-assets/icon-athletic-scoop-city_144x144-131bb9a92c77857aa6cac44772a74a77.webp"/>

<img alt="The Windup Logo" class="" src="/vi-assets/static-assets/icon-athletic-windup_144x144-c03f2bf7ebd88f1c239ba4a6b2228679.webp"/>

<img alt="The Athletic FC Logo" class="" src="/vi-assets/static-assets/icon-athletic-fc_144x144-a673fb497a7a58fd0a80b3d007b73b2f.webp"/>

<img alt="Donald Trump Jr. raises his right hand while speaking at a lectern that bears a “Trump Vance” sign." class="css-dzl7b5" loading="lazy"/>

<img alt="Donald Trump Jr. raises his right hand while speaking at a lectern that bears a “Trump Vance” sign." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/26/multimedia/26DC-TRUMPJR-TOP-qhkb/26DC-TRUMPJR-TOP-qhkb-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="John Fetterman in a plum-colored short-sleeved shirt and standing in a hallway. A gold-plated elevator is visible behind him." class="css-dzl7b5" loading="lazy"/>

<img alt="John Fetterman in a plum-colored short-sleeved shirt and standing in a hallway. A gold-plated elevator is visible behind him." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/27/multimedia/27-pol-on-politics-topitem-qpkg/27-pol-on-politics-topitem-qpkg-smallSquare252.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="President Claudia Sheinbaum of Mexico, sitting at a table and wearing earphones. A small Mexican flag is in front of her." class="css-dzl7b5" loading="lazy"/>

<img alt="President Claudia Sheinbaum of Mexico, sitting at a table and wearing earphones. A small Mexican flag is in front of her." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/27/multimedia/27trump-news-mexico-sheinbaum1-ptvb/27trump-news-mexico-sheinbaum1-ptvb-smallSquare252-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/12/27/multimedia/27NAT-TEXAS-BORDER-promo-ctwg/27NAT-TEXAS-BORDER-promo-ctwg-smallSquare252-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-02-vtkl/Thanksgiving-Balloons-Tapper-02-vtkl-verticalTwoByThree735.jpg" width="735"/>

<img alt="Graham Dickie" class="css-dc6zx6 ey68jwv2" src="https://static01.nyt.com/images/icons/t_logo_150_black.png" title="Graham Dickie"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-06-vtkl/Thanksgiving-Balloons-Tapper-06-vtkl-verticalTwoByThree735.jpg" width="735"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-07-vtkl/Thanksgiving-Balloons-Tapper-07-vtkl-verticalTwoByThree735.jpg" width="735"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-01-qzjc/Thanksgiving-Balloons-Tapper-01-qzjc-verticalTwoByThree735.jpg" width="735"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-tfjl/Thanksgiving-Balloons-Tapper-tfjl-verticalTwoByThree735.jpg" width="735"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-kvpw/Thanksgiving-Balloons-Tapper-kvpw-verticalTwoByThree735.jpg" width="735"/>

<img alt="" class="css-jox6xh e1hd6kzw0" height="1102" src="https://static01.nyt.com/images/2024/11/27/multimedia/Thanksgiving-Balloons-Tapper-05-vtkl/Thanksgiving-Balloons-Tapper-05-vtkl-verticalTwoByThree735.jpg" width="735"/>

<img alt="A black-and-white photo of men in overhauls pulling produce along a long hall with a soaring ceiling and lamps hanging in a row overhead." class="css-dzl7b5" loading="lazy"/>

<img alt="A black-and-white photo of men in overhauls pulling produce along a long hall with a soaring ceiling and lamps hanging in a row overhead." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/27/multimedia/27uk-smithfield-03-vfbw-homepage/27uk-smithfield-03-vfbw-homepage-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="People sitting on and around an escalator." class="css-dzl7b5" loading="lazy"/>

<img alt="People sitting on and around an escalator." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28ukraine-strikes-1-cmlb/28ukraine-strikes-1-cmlb-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/18/multimedia/00india-exams-01-wplz/00india-exams-01-wplz-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="A colonial-style house behind trees and shrubbery." class="css-dzl7b5" loading="lazy"/>

<img alt="A colonial-style house behind trees and shrubbery." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/26/multimedia/26kenburns-newhampshirehome-alt1-lbkm/26kenburns-newhampshirehome-alt1-lbkm-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-01-wzkc/28nat-big-turkey-hp-fader-01-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-02-wzkc/28nat-big-turkey-hp-fader-02-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-03-wzkc/28nat-big-turkey-hp-fader-03-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-04-wzkc/28nat-big-turkey-hp-fader-04-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-05-wzkc/28nat-big-turkey-hp-fader-05-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28nat-big-turkey-hp-fader-06-wzkc/28nat-big-turkey-hp-fader-06-wzkc-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Maureen Dowd" class="css-1ii2lp6" loading="lazy"/>

<img alt="Maureen Dowd" class="css-122y91a" src="https://static01.nyt.com/images/2018/04/02/opinion/maureen-dowd/maureen-dowd-thumbLarge.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Donald Trump, onstage and in a navy suit with a red tie, dances with his arms in the air." class="css-dzl7b5" loading="lazy"/>

<img alt="Donald Trump, onstage and in a navy suit with a red tie, dances with his arms in the air." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/28/multimedia/28dowd-hgzl/28dowd-hgzl-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Daniela J. Lamas" class="css-1ii2lp6" loading="lazy"/>

<img alt="Daniela J. Lamas" class="css-122y91a" src="https://static01.nyt.com/images/2021/04/23/opinion/daniela-lamas/daniela-lamas-thumbLarge.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Pamela Paul" class="css-1ii2lp6" loading="lazy"/>

<img alt="Pamela Paul" class="css-122y91a" src="https://static01.nyt.com/images/2022/07/12/opinion/pamela-paul-new/pamela-paul-new-thumbLarge-v2.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="A headdress with red, white and black feathers and beads." class="css-dzl7b5" loading="lazy"/>

<img alt="A headdress with red, white and black feathers and beads." class="css-122y91a" src="https://static01.nyt.com/images/2024/11/26/multimedia/00spotted-tail-01-hkpv/00spotted-tail-01-hkpv-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/11/27/multimedia/27PROJECTIONIST-DEADWYLER-02-bktq/27PROJECTIONIST-DEADWYLER-02-bktq-threeByTwoMediumAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/03/02/crosswords/alpha-wordle-icon-new/alpha-wordle-icon-new-smallSquare252-v3.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2023/08/25/crosswords/alpha-connections-icon-original/alpha-connections-icon-original-smallSquare252.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2024/01/16/crosswords/alpha-strands-icon/alpha-strands-icon-smallSquare252.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2020/03/23/crosswords/spelling-bee-logo-nytgames-hi-res/spelling-bee-logo-nytgames-hi-res-smallSquare252-v4.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2020/03/23/crosswords/crossword-logo-nytgames-hires/crossword-logo-nytgames-hires-smallSquare252-v3.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2021/03/23/multimedia/alpha-mini-promo-1616527576800/alpha-mini-promo-1616527576800-smallSquare252-v4.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>
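
Many of the `src` values in the output above are relative paths such as `/vi-assets/static-assets/...`, and some `<img>` tags have no `src` at all, so those images may not render in the notebook. Here is a minimal sketch of one way to turn relative paths into full URLs with `urljoin` from the standard library. The base URL and the helper name `absolute_src` are assumptions for illustration, not part of the original lesson:

```python
from urllib.parse import urljoin

# Assumed base URL: the same address requested earlier in this lesson
BASE_URL = "https://nytimes.com"

def absolute_src(src, base_url=BASE_URL):
    """Hypothetical helper: return a full URL for an image's src value."""
    # urljoin resolves relative paths against base_url and
    # leaves already-absolute URLs unchanged
    return urljoin(base_url, src)

# Two src values copied from the output above
relative_src = "/vi-assets/static-assets/icon-well_144x144-433c9d15dc985dded9b705942592c6fb.webp"
absolute = "https://static01.nyt.com/images/icons/t_logo_150_black.png"

print(absolute_src(relative_src))
print(absolute_src(absolute))
```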

## Quick Demonstration of Image Scraping — Bill Gates's LinkedIn Page

https://www.linkedin.com/in/williamhgates/

In [5]:
response = requests.get("https://www.linkedin.com/in/williamhgates/")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [6]:
from IPython.display import Markdown, display

# Loop through all the images on the LinkedIn page
for image in document.find_all('img'):
    # Convert the image tag to a string
    image_string = str(image)
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

What's going wrong here?

In [7]:
response

<Response [999]>
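
The response object holds a clue. Successful requests normally come back with a status code in the 200s, and 999 is not one of the standard HTTP status codes, which suggests that the site did not serve the page to our script. Here is a minimal sketch of a check we could run before parsing; the helper name `looks_successful` is hypothetical:

```python
def looks_successful(status_code):
    """Hypothetical helper: True only for status codes in the 2xx (success) range."""
    return 200 <= status_code < 300

# 999 is the code from the LinkedIn request above; 200 is the standard "OK" code
for code in [200, 999]:
    if looks_successful(code):
        print(code, "-> safe to parse the HTML")
    else:
        print(code, "-> the page was probably not delivered; inspect the response first")
```

With `requests`, this number is available as `response.status_code`.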

## Scraping Multiple Web Pages At a Time

In the last lesson, we figured out how to scrape the lyrics for a single Missy Elliott song.

In [8]:
response = requests.get("https://genius.com/Missy-elliott-work-it-lyrics")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [10]:
lyrics_tag = document.find_all("div", attrs={"class":"Lyrics__Container-sc-1ynbvzw-1 kUgSbL"})

lyrics = []
for content in lyrics_tag:
    line = content.get_text(separator=" ")
    lyrics.append(line)

for content in lyrics:
    print(content)

[Intro] DJ, please pick up your phone, I'm on the request line This is a Missy Elliott one-time exclusive (C'mon) [Chorus] Is it worth it? Let me work it I put my thing down, flip it and reverse it ​ti esrever dna ti pilf ,nwod gniht ym tuP ​ti esrever dna ti pilf ,nwod gniht ym tuP If you got a big— let me search ya And find out how hard I gotta work ya ​ti esrever dna ti pilf ,nwod gniht ym tuP ​ti esrever dna ti pilf ,nwod gniht ym tuP  (C'mon) [Verse 1] I'd like to get to know ya so I could show ya Put the pussy on ya like I told ya Give me all your numbers so I can phone ya Your girl acting stank, then call me over Not on the bed, lay me on your sofa Call before you come, I need to shave my chocha You do or you don't or you will or won't ya? Go downtown and eat it like a vulture See my hips and my tips, don't ya? See my ass and my lips, don't ya? Lost a few pounds and my waist for ya This the kind of beat that go ra-ta-ta Ra-ta-ta-ta-ta-ta-ta-ta-ta-ta Sex me so good I say blah-bla

But how can we scrape lyrics for multiple Missy Elliott songs at a time?

### Figure Out the Pattern

What we need to do is figure out how to programmatically generate the correct Genius web page URL for each song we're interested in:

`f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"`

In [11]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

```
for song in song_titles:
    formatted_song = ?????
    response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    document.find('p').text
```

Let's inspect the Genius web pages for each of these songs:

https://genius.com/Missy-elliott-work-it-lyrics

https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics

https://genius.com/Missy-elliott-wtf-where-they-from-lyrics
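
Comparing these URLs suggests the pattern: only the middle part changes, and it looks like the song title in lowercase, with hyphens in place of spaces and no parentheses. As a quick check, this sketch rebuilds the three URLs from hand-typed middle parts (the variable name `formatted_titles` is just for illustration):

```python
# Middle parts copied by hand from the three URLs above
formatted_titles = ["work-it", "the-rain-supa-dupa-fly", "wtf-where-they-from"]

urls = []
for formatted_song in formatted_titles:
    # Insert each formatted title into the shared URL pattern
    url = f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"
    urls.append(url)
    print(url)
```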

### Make Song Titles Fit Pattern — Your Turn!

Create a function called `format_song()` that will take in a song title and then return the song title correctly formatted for its Genius web page.

For example, the song `WTF (Where They From)` needs to be converted to `wtf-where-they-from`.

Hint: You will need to use [string methods](https://info1350.github.io/Intro-CA-SP21/02-Python/06-String-Methods.html#id1)!
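
As a refresher, here is a small sketch of string methods applied one at a time to a made-up title (not one of the songs above). It does not solve the exercise, but these are the kinds of building blocks you might combine:

```python
title = "Hello There (Demo)"  # made-up example title

# .lower() returns a lowercase copy of the string
lowered = title.lower()
print(lowered)        # hello there (demo)

# .replace(old, new) swaps every occurrence of old for new
hyphenated = title.replace(" ", "-")
print(hyphenated)     # Hello-There-(Demo)

# Replacing with an empty string removes a character entirely
no_open_paren = title.replace("(", "")
print(no_open_paren)  # Hello There Demo)
```

Each method returns a new string rather than changing `title`, so the results can be saved to a variable or chained together.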

In [12]:
def format_song(song):
    #Your Code Here 👇
    
    
    
    
    return formatted_song

Test your function on these two song titles to make sure it's working correctly.

In [149]:
format_song('WTF (Where They From)')

'wtf-where-they-from'

In [150]:
format_song('Work It')

'work-it'

### Put It All Together

In [156]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Now use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [None]:
for song in song_titles:
    formatted_song = ???? #Use your format_song() function here
    response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    lyrics = document.find('p').text
    print(lyrics)

## Write Lyrics to a Text File

In [152]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Here we are writing the lyrics to a text file rather than printing them out.

Again, use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [160]:
with open('Missy-Elliott-Lyrics.txt', mode='w') as file_object:
    
    for song in song_titles:
        formatted_song = format_song(song)  #Use your format_song() function here
        response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
        html_string = response.text
        document = BeautifulSoup(html_string, "html.parser")
        lyrics = document.find('p').text
        
        file_object.write(lyrics)
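
One detail to watch: `write()` does not add a line break on its own, so the last word of one song and the first word of the next can run together in the file. Here is a minimal sketch of writing a separator after each text, using placeholder strings and a temporary file rather than the real lyrics:

```python
import os
import tempfile

# Placeholder texts standing in for scraped lyrics
texts = ["first song ends here", "second song starts here"]

# Write to a temporary folder so that this sketch does not touch the real lyrics file
file_path = os.path.join(tempfile.mkdtemp(), "example-lyrics.txt")

with open(file_path, mode="w", encoding="utf-8") as file_object:
    for text in texts:
        file_object.write(text)
        # Add a blank line so that neighboring texts stay separate
        file_object.write("\n\n")

# Read the file back in to confirm what was written
with open(file_path, encoding="utf-8") as file_object:
    contents = file_object.read()

print(contents)
```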

## Count Top Words From File

If we wanted to find out the most frequent words in Missy Elliott's lyrics, we could use the word counter code that we've used in previous lessons.

In [13]:
import re
from collections import Counter

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']


def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    # Use a raw string so that \W is passed to the regular expression unchanged
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

def get_top_words(full_text, number_of_words=20):
    all_the_words = split_into_words(full_text)
    meaningful_words = [word for word in all_the_words if word not in stopwords]
    meaningful_words_tally = Counter(meaningful_words)
    most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_words)
    return most_frequent_meaningful_words
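
To see what these functions do, here is a small standalone sketch that applies the same steps to one short line from the lyrics printed earlier. It repeats the splitting and counting inline, with a tiny stand-in stopword list, so that it runs on its own:

```python
import re
from collections import Counter

sample_text = "I put my thing down, flip it and reverse it"
small_stopwords = ["i", "my", "it", "and", "down"]  # tiny stand-in list for this sketch

# Lowercase the text, then split wherever there is a run of non-word characters
words = re.split(r"\W+", sample_text.lower())

# Drop stopwords, plus any empty strings that splitting can leave at the edges
meaningful_words = [word for word in words if word and word not in small_stopwords]

# Tally the remaining words and keep the most frequent ones
tally = Counter(meaningful_words)
print(tally.most_common(3))
```

`most_common()` returns a list of `(word, count)` pairs, which is also the shape of the result that `get_top_words()` returns.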

Let's read in the file that we created and get the top words.

In [None]:
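# Open the file in a with-block so that it is closed automatically after reading
with open('Missy-Elliott-Lyrics.txt') as file_object:
    missy_lyrics = file_object.read()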
get_top_words(missy_lyrics)

## What patterns do you notice about the top 20 words from these Missy Elliott songs?
Feel free to open the text file in the file browser on the left and inspect the lyrics manually.

## Bonus: If You Wanted to Change the Artist...

In [None]:
artist = 'Bts'
song_titles = ['Dynamite', 'Euphoria', 'Fake Love']

for song in song_titles:
    formatted_song = ???? #Use your format_song() function here
    response = requests.get(f"https://genius.com/{artist}-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    lyrics = document.find('p').text
    print(lyrics)

## Group Discussion

* Do you think scholars should use web scraping in their research? Why or why not?
* How would you feel if you found out that one of your social media posts had been included in an academic article without your knowledge?
* What are some strategies that you think scholars might use to do web scraping in an ethical way?

If there is anything wrong, please open [an issue on GitHub](https://github.com/GroningenDH/Cultural-Analytics-Open-Science-Guide/issues) or email f.pianzola@rug.nl