# Web Scraping with Beautiful Soup

Web scraping is the task of programmatically extracting information from websites. Many Python packages can be used for web scraping; this notebook looks at Beautiful Soup. For documentation and installation instructions, see the [beautifulsoup4 page on PyPI](https://pypi.org/project/beautifulsoup4/).

This notebook shows how to extract text from an online article with Beautiful Soup, then do some cleanup with Python.

In [2]:
from urllib import request
from bs4 import BeautifulSoup

In [3]:
url = 'https://nyti.ms/2uAQS89'
html = request.urlopen(url).read().decode('utf8')
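# name the parser explicitly; omitting it triggers a warning and can give different results on different machines
soup = BeautifulSoup(html, 'html.parser')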

In [4]:
# remove all script and style elements so their contents do not end up in the text
for tag in soup(["script", "style"]):
    tag.extract()    # detach the element from the parse tree


In [5]:
# extract text
text = soup.get_text()
print(text[:200])




With Snowflakes and Unicorns, Marina Ratner and Maryam Mirzakhani Explored a Universe in Motion - The New York Times



















SectionsSEARCHSkip to contentSkip to site indexScienceToday’s


### Clean up

The next code block splits the text into lines and uses a regular expression to drop the lines that are empty or contain only whitespace, leaving a list of text chunks. The chunks are numbered as they are printed, which shows how many there are.

In [10]:
import re
text_chunks = [chunk for chunk in text.splitlines() if not re.match(r'^\s*$', chunk)]
for i, chunk in enumerate(text_chunks):
    print(i+1, chunk)

1 With Snowflakes and Unicorns, Marina Ratner and Maryam Mirzakhani Explored a Universe in Motion - The New York Times
2 SectionsSEARCHSkip to contentSkip to site indexScienceToday’s PaperScience|With Snowflakes and Unicorns, Marina Ratner and Maryam Mirzakhani Explored a Universe in Motionhttps://nyti.ms/2vdyTIHAdvertisementContinue reading the main storySupported byContinue reading the main storyEssayWith Snowflakes and Unicorns, Marina Ratner and Maryam Mirzakhani Explored a Universe in MotionMarina Ratner in Moscow in 1991.Credit...via Anna RatnerBy Amie WilkinsonAug. 7, 2017The mathematics section of the National Academy of Sciences lists 104 members. Just four are women. As recently as June, that number was six.Marina Ratner and Maryam Mirzakhani could not have been more different, in personality and in background. Dr. Ratner was a Soviet Union-born Jew who ended up at the University of California, Berkeley, by way of Israel. She had a heart attack at 78 at her home in early July

There is still a lot of junk at the top and bottom of the text, such as navigation labels run together without spaces (`SectionsSEARCHSkip to content`), along with other problems that need further text processing.
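As one possible next step in the cleanup, the sketch below collapses runs of whitespace inside each line before dropping the empty ones. The function name `clean_chunks` and the `sample` string are made up for illustration; they are not part of the notebook above, and this does not address the run-together words.

```python
import re

def clean_chunks(raw_text):
    """Split text into lines, collapse internal whitespace, drop empty lines."""
    chunks = []
    for line in raw_text.splitlines():
        # \s+ matches any run of whitespace, including tabs and non-breaking spaces
        collapsed = re.sub(r'\s+', ' ', line).strip()
        if collapsed:
            chunks.append(collapsed)
    return chunks

# hypothetical input that imitates the shape of the scraped text
sample = "\n\n  Title of   the page \n\n\t\nSections\u00a0 SEARCH\n"
for i, chunk in enumerate(clean_chunks(sample), start=1):
    print(i, chunk)
```

In the notebook, the same function could be applied to `text` in place of the list comprehension.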

### Extracting paragraph tags

Using the soup object created above, the next code block extracts the `<p>` elements. These could be processed further to get plain text by stripping the tags and keeping only the text inside them, for example with `p.get_text()`.

In [11]:
for p in soup.select('p'):
    print(p)

<p>Advertisement</p>
<p>Supported by</p>
<p class="css-c2jxua e6idgb70">Essay</p>
<p class="css-1nuro5j e1jsehar1" itemprop="author" itemscope="" itemtype="http://schema.org/Person">By<!-- --> <span class="css-1baulvz last-byline" itemprop="name">Amie Wilkinson</span></p>
<p class="css-158dogj evys1bk0">The mathematics section of the National Academy of Sciences lists 104 members. Just four are women. As recently as June, that number was six.</p>
<p class="css-158dogj evys1bk0">Marina Ratner and Maryam Mirzakhani could not have been more different, in personality and in background. Dr. Ratner was a Soviet Union-born Jew who ended up at the University of California, Berkeley, by way of Israel. She had a <a class="css-1g7m0tk" href="https://www.nytimes.com/2017/07/25/science/marina-ratner-dead-mathematician.html" title="">heart attack at 78</a> at her home in early July.</p>
<p class="css-158dogj evys1bk0">Success came relatively late in her career, in her 50s, when she produced her most
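A minimal sketch of that further processing, assuming the goal is one plain string per paragraph. It runs on a small inline HTML snippet (`sample_html`, invented here to resemble the output above, with a placeholder link) rather than on the live page, so it does not depend on the article still being online.

```python
from bs4 import BeautifulSoup

# hypothetical snippet shaped like the <p> elements printed above
sample_html = """
<p>Advertisement</p>
<p class="byline">By <span>Amie Wilkinson</span></p>
<p>She had a <a href="https://example.com/obit">heart attack at 78</a> at her home.</p>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get_text() returns the text of an element and all of its descendants,
# so nested tags such as <span> and <a> are stripped but their text is kept
paragraphs = [p.get_text() for p in sample_soup.find_all('p')]
for para in paragraphs:
    print(para)
```

On the real page, short boilerplate paragraphs such as "Advertisement" would still need to be filtered out, for instance by length or by class.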

### Extracting links

The following code block extracts hyperlinks from the page and prints the `href` of the first ten.

In [12]:
# limit=10 stops the search after the first ten <a> elements
for link in soup.find_all('a', limit=10):
    print(link.get('href'))

#site-content
#site-index
https://www.nytimes.com/section/science
/
https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi
https://www.nytimes.com/section/todayspaper
/section/science
/
/
https://www.facebook.com/dialog/feed?app_id=9869919170&link=https%3A%2F%2Fwww.nytimes.com%2F2017%2F08%2F07%2Fscience%2Fwomen-mathematicians-maryam-mirzakhani-marina-ratner.html%3Fsmid%3Dfb-share&name=With%20Snowflakes%20and%20Unicorns%2C%20Marina%20Ratner%20and%20Maryam%20Mirzakhani%20Explored%20a%20Universe%20in%20Motion&redirect_uri=https%3A%2F%2Fwww.facebook.com%2F
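Several of the links printed above are relative (`/`, `/section/science`) or fragment-only (`#site-content`). The sketch below shows one way to turn them into absolute URLs with the standard library's `urljoin`. The `base_url` value is a hypothetical placeholder for the address of the page the links came from, and the `hrefs` list is copied by hand from the output above.

```python
from urllib.parse import urljoin

# hypothetical address of the page the links were scraped from
base_url = 'https://www.nytimes.com/2017/08/07/science/article.html'

hrefs = [
    '#site-content',
    '/',
    '/section/science',
    'https://www.nytimes.com/section/todayspaper',
]

# urljoin resolves each href against the base; absolute URLs pass through unchanged
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Fragment-only links resolve to the page itself plus the fragment, so they can be dropped first if only links to other pages are of interest.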
