# Python Web Scraping Tutorial using BeautifulSoup
   Part of it adapted from Vik Paruchuri
   
   When performing data science tasks, it’s common to want to use data found on the internet. You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you’ll want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.
   
##   The components of a web page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

1. HTML – contains the main content of the page.
2. CSS – adds styling to make the page look nicer.
3. JS – JavaScript files add interactivity to web pages.
4. Images – image formats such as JPG and PNG allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

## HTML
HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

> `<html>`

> `</html>`

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

>`<html>
    <head>
    </head>
    <body>
    </body>
></html>`

We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:

>`<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text!

>Here's a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

+ child – a child is a tag inside another tag. So the two p tags above are both children of the body tag.
+ parent – a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
+ sibling – a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
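
The parent and child relationships described above can be checked programmatically. The following is a minimal sketch that uses Python's standard-library `html.parser` module (not BeautifulSoup, which is introduced later) to record each tag's parent. The class name `ParentTracker` is made up for this illustration.

```python
from html.parser import HTMLParser

HTML = """<html>
    <head>
    </head>
    <body>
        <p>Here's a paragraph of text!</p>
        <p>Here's a second paragraph of text!</p>
    </body>
</html>"""


class ParentTracker(HTMLParser):
    """Record (tag, parent) pairs while walking the document."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.pairs = []   # (tag, parent tag or None)

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.pairs.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag when its end tag arrives.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = ParentTracker()
tracker.feed(HTML)
for tag, parent in tracker.pairs:
    print(f"{tag!r} is a child of {parent!r}")
```

Tags that share the same parent in this output, such as head and body, are siblings.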

We can also add properties to HTML tags that change their behavior:

>`<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text! [Learn Data Science Online](https://www.dataquest.io)

>Here's a second paragraph of text! [Python](https://www.python.org)

In the above example, we added two a tags. The a tag defines a link and tells the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common HTML tags. Here are a few others:

+ div – indicates a division, or area, of the page.
+ b – bolds any text inside.
+ i – italicizes any text inside.
+ table – creates a table.
+ form – creates an input form.
+ For a full list of tags, Google it, :-).

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

>`<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
></html>`

Here’s how this will look:

>Here's a paragraph of text! [Learn Data Science Online](https://www.dataquest.io)

>Here's a second paragraph of text! [Python](https://www.python.org)

As you can see, adding classes and ids doesn’t change how the tags are rendered at all.
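
Although classes and ids do not change the rendering, they are what a scraper searches by. As a preview, here is a small sketch using BeautifulSoup (covered later in this tutorial) on the example document above. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

# An id is unique on the page, so find() returns the single matching tag.
learn_link = soup.find(id='learn-link')
print(learn_link['href'])

# A class can be shared: both paragraphs carry "bold-paragraph".
bold = soup.find_all('p', class_='bold-paragraph')
print(len(bold))

# "extra-large" appears on one p tag and on one a tag.
extra_large = [tag.name for tag in soup.find_all(class_='extra-large')]
print(extra_large)
```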

## The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. GET is just one of several types of requests we can make using requests. If you want to learn more, check out the [Requests documentation](http://docs.python-requests.org/en/master/).

**Twitter, Spotify, Microsoft, Amazon, Lyft, BuzzFeed, Reddit, The NSA, Her Majesty's Government, Google, Twilio, Runscope, Mozilla, Heroku, PayPal, NPR, Obama for America, Transifex, Native Instruments, The Washington Post, SoundCloud, Kippt, Sony, and Federal U.S. Institutions that prefer to be unnamed claim to use Requests internally.**

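Let’s try downloading a page. A simple sample page is available at http://dataquestio.github.io/web-scraping-pages/simple.html; the example below instead requests a live Zillow page, https://www.zillow.com/salt-lake-city-ut/sold/. We first download it using the requests.get method.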

In [3]:
import requests

In [2]:
page = requests.get("https://www.zillow.com/salt-lake-city-ut/sold/") #the url of the page you want to download.

After running our request, we get a **Response object**. This object has a status_code property, which indicates if the page was downloaded successfully:

In [3]:
page.status_code

200

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
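
To make the status code families concrete, here is a small sketch that uses only the standard library's `http.HTTPStatus` enumeration. The helper name `describe_status` is made up for this illustration and is not part of requests. In requests itself, `Response.raise_for_status()` raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    families = {
        1: 'informational',
        2: 'success',
        3: 'redirect',
        4: 'client error',
        5: 'server error',
    }
    family = families.get(code // 100, 'other')
    try:
        phrase = HTTPStatus(code).phrase   # standard reason phrase
    except ValueError:
        phrase = 'unknown'                 # code not in the standard list
    return f"{code} {phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```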

We can print out the HTML content of the page using the content property:

In [4]:
page.content #get content of the page

b'<html><head><meta name="robots" content="noindex, nofollow"/><script src="https://www.google.com/recaptcha/api.js"></script><link href="https://www.zillowstatic.com/vstatic/80d5e73/static/css/z-pages/captcha.css" type="text/css" rel="stylesheet" media="screen"/><script>\n            function handleCaptcha(response) {\n                var vid = getQueryString("vid"); // getQueryString is implemented below\n                var uuid = getQueryString("uuid");\n                var name = \'_pxCaptcha\';\n                var cookieValue =  btoa(JSON.stringify({r:response,v:vid,u:uuid}));\n                var cookieParts = [name, \'=\', cookieValue, \'; path=/\'];\n                cookieParts.push(\'; domain=\' + window.location.hostname);\n                cookieParts.push(\'; max-age=10\');//expire after 10 seconds\n                document.cookie = cookieParts.join(\'\');\n                var originalURL = getOriginalUrl("url");\n                var originalHost = window.location.host;\n 

In [5]:
page.headers #get headers of the page

{'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Date': 'Sat, 21 Sep 2019 17:33:41 GMT', 'Server': 'Apache-Coyote/1.1', 'X-B3-TraceId': '5d865ef56fe571034b428f944e4a594f', 'X-B3-SpanId': '4b428f944e4a594f', 'X-B3-Sampled': '1', 'Cache-Control': 'no-cache', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'X-Internal-Host': '044', 'Z-Using-Act': '2', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'via': '1.1 zgs, 1.1 27b097f1b9769d8459cc46b29d99a61b.cloudfront.net (CloudFront)', 'Set-Cookie': 'AWSALB=d5qjW23nVtnbISO30vqcWGhuFGAONZe/P2qrngkhLnpMpy5NuMolvkTmk5NjOkYftLzgSAeK4UV9m2s6QehEf1CKPwJPgMrmaPiCt5uElEwHiynopeAi6QTWFS99; Expires=Sat, 28 Sep 2019 17:33:41 GMT; Path=/, JSESSIONID=8FDEE8018B46911C8A2F9843F5BB88E7; Path=/; HttpOnly, zguid=23|%248096bc7d-0c35-4295-9e60-77dbfed49d56; Max-Age=315576000; Expires=Fri, 21 Sep 2029 05:33:41 GMT; Path=/; Domain=.zillow.com; HTTPOnly, zgsession=1|dedbf5d7-fb10-4d0e-aace-c7e52f0c6b97; Path

In [6]:
page.encoding #get encoding of the page

'UTF-8'

### Passing Parameters In URLs

You often want to send some sort of data in the URL’s query string. If you were constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. http://httpbin.org/get?key=val. Requests allows you to provide these arguments as a dictionary of strings, using the params keyword argument. As an example, if you wanted to pass key1=value1 and key2=value2 to http://httpbin.org/get, you would use the following code:

In [7]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('http://httpbin.org/get', params=payload)

You can see that the URL has been correctly encoded by printing the URL:

In [8]:
r.url

'http://httpbin.org/get?key1=value1&key2=value2'
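
The encoded URL can also be inspected without sending anything over the network. The sketch below builds a request with `requests.Request` and calls `prepare()` on it. It also shows two behaviours described in the Requests documentation: a list value is expanded into repeated keys, and a key whose value is `None` is left out. The extra keys are made up for this illustration.

```python
import requests

# key2 has two values; key3 is None and should be dropped from the URL.
payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None}

# prepare() builds the final URL but does not send the request.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)
```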

In [9]:
r = requests.get('http://apple.com')

In [10]:
r.content[:100]

b'\t\n\n\n\t\n\n\n\t\n\n\n\t\n\n\t\t\n\t\t\n\n\t\n\n\n\t \n\n\n\t\t\t\t\n\n\t\t\t\t\t\n\t\t\t\t\t\t\n\n\n\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/'

### Custom Headers

If you’d like to add HTTP headers to a request, simply pass in a dict to the headers parameter.

For example, we didn’t specify our user-agent in the previous example:

In [11]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)
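
If several requests should share the same headers, one option is a `requests.Session`. The following sketch sets a default user-agent on a session and prepares a request without sending it, so the outgoing headers can be inspected. It also prints the user-agent that requests sends when none is specified.

```python
import requests

# Without a custom header, requests identifies itself as python-requests/<version>.
default_agent = requests.utils.default_user_agent()
print(default_agent)

session = requests.Session()
session.headers.update({'User-Agent': 'my-app/0.0.1'})  # applies to every request

# Per-request headers are merged with the session's headers.
request = requests.Request(
    'GET',
    'https://api.github.com/some/endpoint',
    headers={'Accept': 'application/json'},
)
prepared = session.prepare_request(request)  # nothing is sent here
print(prepared.headers['User-Agent'])
print(prepared.headers['Accept'])
```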

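Some websites refuse requests that do not appear to come from a web browser, such as requests sent by a Python script. That is what happened above: Zillow answered with a CAPTCHA page instead of the listings, even though the status code was 200. In such cases, you have to send the request as if it came from a web browser.

The user-agent header is the tool for this: setting it to a browser’s user-agent string makes the request look like it came from that browser.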

In [12]:
#Make your Python request look as if it came from a web browser
url="https://www.zillow.com/salt-lake-city-ut/sold/"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko)'}
r = requests.get(url, headers=headers)

In [13]:
r.content

b'<!DOCTYPE html><html itemscope="" itemtype="http://schema.org/Organization" class="no-js zsg-theme-modernized null" lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:product="http://ogp.me/ns/product#" >\n<head>\n<link rel="preconnect" href="//fonts.googleapis.com"/><link rel="preconnect" href="//maps.googleapis.com"/><link rel="preconnect" href="//c.go-mpulse.net"/><link rel="preconnect" href="//s.zillow.net"/><link rel="preconnect" href="//keystone-ext.develop.zillow.net"/><link rel="preconnect" href="//fonts.gstatic.com"/><link rel="preconnect" href="//www.google-analytics.com"/><link rel="preconnect" href="//sb.scorecardresearch.com"/><link rel="preconnect" crossorigin="true" href="https://www.zillow.com/graphql/"></link><link rel="preconnect" crossorigin="true" href="https://mortgageapi.zillow.com"></link><meta charset="utf-8"/><script type="application/javascript">\n        !function(e){var t={};functi

In [14]:
r.headers # display the full information of the header of the page received

{'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Date': 'Sat, 21 Sep 2019 17:33:43 GMT', 'Server': 'Apache-Coyote/1.1', 'X-Internal-Host': '038', 'Z-Using-Act': '2', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Cache-Control': 'no-cache', 'X-Frame-Options': 'deny', 'Content-Security-Policy': 'frame-ancestors zillow.highspot.com view.highspot.com;', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'via': '1.1 zgs, 1.1 3046ef5af023075cfbd500968062319e.cloudfront.net (CloudFront)', 'Set-Cookie': 'AWSALB=c9OzWW7nwuLA/cTXEIXV4/mJwbHWi3BoNYSM0ImDAr48JC+KCeQaK7bhuEyRo85aVge3GLL82cgtwiZ/YJJFyGsndKfOHwtJ+aTlDpu2aUt3ccaowO00cPxtvoBg; Expires=Sat, 28 Sep 2019 17:33:42 GMT; Path=/, JSESSIONID=F16962700109666635DD0430A40C651C; Path=/; HttpOnly, search=6|1571679222979%7Cregion%3Dsalt-lake-city-ut%26rect%3D40.85297%252C-111.739457%252C40.700246%252C-112.101511%26fs%3D0%26fr%3D0%26mmm%3D0%26rs%3D1%26ah%3D0%09%01%096909%09%09%09%090%09US_%09

### Response Types
There are several response types:
+ Text, e.g., a web page
+ Bytes content, e.g., images, files
+ JSON content
+ Raw content

In [15]:
#Text response
r = requests.get('https://api.github.com/events')
r.text[:200]#display the first 200 characters of the content
#with r.text, Requests will automatically decode content from the server.

'[{"id":"10466257126","type":"PushEvent","actor":{"id":30449848,"login":"facfur58","display_login":"facfur58","gravatar_id":"","url":"https://api.github.com/users/facfur58","avatar_url":"https://avatar'

In [16]:
#bytes response content
r = requests.get('http://higheredutah.org/wp-content/uploads/2016/03/USU-search-cmte.jpg')
r.content[:100]

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00`\x00`\x00\x00\xff\xdb\x00C\x00\n\x07\x07\x07\x07\x07\n\x07\x07\n\x0e\t\t\t\x0e\x11\x0c\x0b\x0b\x0c\x11\x14\x10\x10\x10\x10\x10\x14\x11\x0f\x11\x11\x11\x11\x0f\x11\x11\x17\x1a\x1a\x1a\x17\x11\x1f!!!!\x1f+---+2222222222\xff\xdb\x00C\x01\x0b\x0e\x0e\x1f\x14\x1f'

In [17]:
from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))
i.show()

#Save bytes contents
with open('usu.jpg', 'wb') as fd:
    fd.write(r.content)

In [18]:
#JSON response content
r = requests.get('https://api.github.com/events')
r.json()

[{'id': '10466257271',
  'type': 'PushEvent',
  'actor': {'id': 44934141,
   'login': 'scottjordanswimming',
   'display_login': 'scottjordanswimming',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/scottjordanswimming',
   'avatar_url': 'https://avatars.githubusercontent.com/u/44934141?'},
  'repo': {'id': 209838817,
   'name': 'scottjordanswimming/my-each-online-web-pt-090819',
   'url': 'https://api.github.com/repos/scottjordanswimming/my-each-online-web-pt-090819'},
  'payload': {'push_id': 4061010772,
   'size': 1,
   'distinct_size': 1,
   'ref': 'refs/heads/wip',
   'head': 'c1dc8f5e737ed368eb7c2f0e7d89d89b80e497f1',
   'before': 'd3ab19aadfd2b214e76bd6d6e76abd87951ab43b',
   'commits': [{'sha': 'c1dc8f5e737ed368eb7c2f0e7d89d89b80e497f1',
     'author': {'email': 'scottjordanswimming@gmail.com',
      'name': 'Scott Jordan'},
     'message': 'Automatically backed up by Learn',
     'distinct': True,
     'url': 'https://api.github.com/repos/scottjordanswimming/my-

#### Raw response
In the rare case that you'd like to get the raw socket response from the server, you can access r.raw. If you want to do this, make sure you set stream=True in your initial request. Once you do, you can do this:

In [19]:
r = requests.get('https://api.github.com/events', stream=True)
#r.raw
r.raw.read(10)


b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should use a pattern like this to save what is being streamed to a file:

In [20]:
filename='test'
with open(filename, 'wb') as fd: # 'wb' is write as bytes
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

ChunkedEncodingError: ('Connection broken: IncompleteRead(39 bytes read)', IncompleteRead(39 bytes read))

Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.

Streaming is especially useful when you are downloading a large file but only need a small part of it, since it saves both download time and storage space.

In [4]:
url = 'https://www.sec.gov/Archives/edgar/data/7323/0000065984-14-000065.txt'  # this file is more than 400 megabytes
marker = '<TYPE>GRAPHIC'  # stop downloading once this string shows up
r = requests.get(url, stream=True)
filename = 'test.txt'
with open(filename, 'w') as fd:
    n = 0      # number of chunks written in full
    cont = ''  # last 12 characters of the previous chunk, kept because the 13-character marker may be split across two chunks
    for chunk in r.iter_content(chunk_size=1024*1024):
        text = chunk.decode('utf-8', errors='replace')  # tolerate a multi-byte character cut at a chunk boundary
        test = cont + text        # prepend the previous chunk's tail before searching
        inde = test.find(marker)  # -1 means not found, otherwise the index of the first occurrence
        if inde != -1:  # if found, write only the text before the marker, then stop
            print('found it')
            fd.write(text[:max(inde - len(cont), 0)])  # subtract the prepended tail; write nothing if the marker began in the previous chunk
            break
        fd.write(text)     # if not found, write the current chunk to the file
        cont = text[-12:]  # retain the last 12 characters for the next chunk
        n += 1
print(n)

found it
566
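
The chunk-boundary bookkeeping in the cell above can be isolated and tested without any download. The sketch below is one way to do it, under two assumptions: the function name `find_marker` is made up for this illustration, and it searches raw bytes rather than decoded text, which sidesteps decoding errors when a multi-byte character is split between chunks. Any iterable of bytes objects works as input, including the one returned by `Response.iter_content()`.

```python
def find_marker(chunks, marker):
    """Return the offset of `marker` in the concatenated stream, or -1.

    Only the last len(marker) - 1 bytes of the data seen so far are kept,
    so memory use stays small even for very large downloads.
    """
    tail = b""      # end of the previous data, in case the marker straddles chunks
    consumed = 0    # stream offset at which `tail` begins
    keep = len(marker) - 1
    for chunk in chunks:
        window = tail + chunk
        index = window.find(marker)
        if index != -1:
            return consumed + index
        tail = window[-keep:] if keep else b""
        consumed += len(window) - len(tail)
    return -1


# Made-up sample data: the marker is split across two chunks.
sample = [b"abc<TY", b"PE>GRAPHIC rest of file"]
print(find_marker(sample, b"<TYPE>GRAPHIC"))
```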


## Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse a document like this and extract information from its tags. For more information, see the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Below, we download a page, then import the library and create an instance of the BeautifulSoup class to parse it:

In [7]:
nytimes=requests.get('https://www.nytimes.com/section/technology')

In [22]:
nytimes.text[:1000]

'<!DOCTYPE html>\n<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">Technology - The New York Times</title>\n    <meta data-rh="true" itemprop="inLanguage" content="en"/><meta data-rh="true" id="applicationName" name="applicationName" content="collection"/><meta data-rh="true" name="nyt-collection:identifier" content="technology"/><meta data-rh="true" name="CN" content="technology"/><meta data-rh="true" name="nyt-collection:type" content="sectioncollection"/><meta data-rh="true" name="CT" content="sectionfront"/><meta data-rh="true" name="nyt-collection:display-name" content="Technology"/><meta data-rh="true" name="nyt-collection:tagline" content=""/><meta data-rh="true" name="nyt-collection:promotional-image" content=""/><meta data-rh="true" name="PT" content="collection"/><meta data-rh="true" name="asset_id" content="100000004820462"/><meta data-rh="true" name="slug" content="technology"/><meta data-rh="true" property="og:descriptio

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(nytimes.content, 'html.parser') #we need to specify a parser to soup

In [None]:
soup.get_text()[15000:17000]

### Basic tagging information

In [25]:
soup.a #get the first a tag

<a class="css-1rn5q1r" href="#site-content">Skip to content</a>

In [14]:
soup.a.name #get the tag name

'a'

In [15]:
soup.a.string #get the string inside the tag

'Skip to content'

In [10]:
soup.a['class'] #get the 'class' attribute of tag a

['css-1rn5q1r']

In [11]:
soup.a['href'] #get the 'href' attribute of tag a, which is the hyperlink target

'#site-content'

In [12]:
soup.title #get the first title tag

<title data-rh="true">Technology - The New York Times</title>

In [16]:
soup.title.parent.name # get the parent name of title tag

'head'

In [17]:
soup.img #get the first 'img' tag

<img alt="" src="https://static01.nyt.com/images/2019/09/23/world/23forgotten-print1/merlin_161147955_f2afa5be-eee6-4593-ba75-86717cc01aa2-jumbo.jpg" srcset="https://static01.nyt.com/images/2019/09/23/world/23forgotten-print1/merlin_161147955_f2afa5be-eee6-4593-ba75-86717cc01aa2-jumbo.jpg 1024w, https://static01.nyt.com/images/2019/09/23/world/23forgotten-print1/23forgotten-print1-videoLarge-v2.jpg 768w, https://static01.nyt.com/images/2019/09/23/world/23forgotten-print1/23forgotten-print1-mediumThreeByTwo440-v2.jpg 440w"/>

### Finding all instances of a tag at once

Attribute access such as soup.a returns only the first instance of a tag. If we want every instance, we can use the find_all method instead, which finds all the instances of a tag on a page.

`soup.find_all('a')` returns a list of all 'a' tags

`soup.find_all('div')` returns a list of all 'div' tags

In [18]:
#get all hyperlinks of a page

# List comprehension; equivalent to:
#     links = []
#     for a in soup.find_all('a'):
#         links.append(a['href'])
links = [a['href'] for a in soup.find_all('a')]

In [19]:
print(soup.find_all('a'))

[<a class="css-1rn5q1r" href="#site-content">Skip to content</a>, <a class="css-1rn5q1r" href="#site-index">Skip to site index</a>, <a class="css-nuvmzp" href="https://www.nytimes.com/section/technology">Technology</a>, <a aria-label="New York Times Logo. Click to visit the homepage" class="css-nhjhh0 e1huz5gh1" href="/"><svg class="" fill="#000" viewbox="0 0 184 25" xmlns="http://www.w3.org/2000/svg"><path d="M13.8 2.9c0-2-1.9-2.5-3.4-2.5v.3c.9 0 1.6.3 1.6 1 0 .4-.3 1-1.2 1-.7 0-2.2-.4-3.3-.8C6.2 1.4 5 1 4 1 2 1 .6 2.5.6 4.2c0 1.5 1.1 2 1.5 2.2l.1-.2c-.2-.2-.5-.4-.5-1 0-.4.4-1.1 1.4-1.1.9 0 2.1.4 3.7.9 1.4.4 2.9.7 3.7.8v3.1L9 10.2v.1l1.5 1.3v4.3c-.8.5-1.7.6-2.5.6-1.5 0-2.8-.4-3.9-1.6l4.1-2V6l-5 2.2C3.6 6.9 4.7 6 5.8 5.4l-.1-.3c-3 .8-5.7 3.6-5.7 7 0 4 3.3 7 7 7 4 0 6.6-3.2 6.6-6.5h-.2c-.6 1.3-1.5 2.5-2.6 3.1v-4.1l1.6-1.3v-.1l-1.6-1.3V5.8c1.5 0 3-1 3-2.9zm-8.7 11l-1.2.6c-.7-.9-1.1-2.1-1.1-3.8 0-.7 0-1.5.2-2.1l2.1-.9v6.2zm10.6 2.3l-1.3 1 .2.2.6-.5 2.2 2 3-2-.1-.2-.8.5-1-1V9.4l.8-.6 1.7 1

In [20]:
len(links)

201
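
One caveat with `a['href']`: it raises a `KeyError` if an a tag has no href attribute. The page above happened not to trigger this, but other pages may. The sketch below shows two ways to cope, using a made-up HTML snippet: filter with `href=True`, or use `.get('href')`, which returns `None` when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the middle anchor has no href attribute.
html = '<a href="/a">A</a><a name="top">no link</a><a href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')

# Keep only a tags that actually have an href attribute.
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)

# Or keep every a tag and accept None where href is missing.
maybe_links = [a.get('href') for a in soup.find_all('a')]
print(maybe_links)
```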

In [21]:
links[:20]

['#site-content',
 '#site-index',
 'https://www.nytimes.com/section/technology',
 '/',
 'https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi',
 'https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi',
 'https://www.nytimes.com/section/todayspaper',
 '/pages/business/dealbook/index.html',
 'https://markets.on.nytimes.com',
 '/section/business/economy',
 '/section/business/energy-environment',
 '/section/business/media',
 '/section/technology',
 '/section/technology/personaltech',
 '/section/business/smallbusiness',
 '/section/your-money',
 '/2019/09/23/technology/right-to-be-forgotten-law-europe.html',
 '/2019/09/23/technology/right-to-be-forgotten-law-europe.html',
 '/2019/09/23/technology/apple-mac-pro-texas-tariffs.html',
 '/2019/09/23/technology/apple-mac-pro-texas-tariffs.html']
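
Many of these links are relative (they start with `/` or `#`), so they cannot be requested as they are. A minimal sketch of resolving them against the page's URL with the standard library's `urllib.parse.urljoin` follows. The short list is copied from the output above.

```python
from urllib.parse import urljoin

base = 'https://www.nytimes.com/section/technology'  # the page the links came from
links = [
    '#site-content',
    '/',
    'https://markets.on.nytimes.com',
    '/section/business/economy',
    '/2019/09/23/technology/right-to-be-forgotten-law-europe.html',
]

# urljoin leaves absolute URLs unchanged and resolves relative ones against base.
absolute = [urljoin(base, link) for link in links]
for link in absolute:
    print(link)
```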

In [22]:
l=set(links) #convert list to set, removing duplicates
print(len(l))

116
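
Converting to a set removes duplicates but discards the original order of the links. If the order on the page matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each item in order. The short list below is made up from links seen above.

```python
links = [
    '#site-content',
    '#site-index',
    '/',
    '/section/technology',
    '/',
    '/section/technology',
]

# Dictionary keys are unique and keep insertion order, so this drops
# duplicates while preserving the order of first appearance.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```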


In [23]:
links=list(l)
links[:20]

['/2019/09/22/business/media/mit-media-lab-food-computer.html',
 '/2019/09/23/technology/right-to-be-forgotten-law-europe.html',
 'https://www.nytimes.com/section/t-magazine',
 'https://nytmediakit.com/',
 'https://www.nytimes.com/section/climate',
 'https://thewirecutter.com',
 '/2019/09/20/technology/house-antitrust-investigation-big-tech.html',
 'https://www.nytimes.com/section/todayspaper',
 'https://www.nytimes.com/section/arts/television',
 '/section/business/energy-environment',
 'https://www.nytimes.com/section/politics',
 '/2019/09/19/smarter-living/wirecutter/smart-lights-enhance-home-security-and-shine-a-light-on-crime.html',
 'https://www.nytimes.com/section/science',
 '/2019/09/22/business/china-social-credit-business.html',
 'https://twitter.com/nytimestech',
 'https://www.nytimes.com/section/us',
 '/2019/09/20/arts/design/imagenet-trevor-paglen-ai-facial-recognition.html',
 'https://www.nytimes.com/video',
 'https://www.nytimes.com/marketing/newsletters',
 'https://www.n

#### Find all headlines

In [24]:
x = soup.find_all('h2')
x

[<h2 class="css-1dv1kvn">Highlights</h2>,
 <h2 class="css-cgu9om e1xdw5350"><a data-rref="" href="/2019/09/23/technology/right-to-be-forgotten-law-europe.html">One Brother Stabbed the Other. The Journalist Who Wrote About It Paid a Price.</a></h2>,
 <h2 class="css-1hdf4fa e1xdw5352"><a data-rref="" href="/2019/09/23/technology/apple-mac-pro-texas-tariffs.html">Apple Keeps Making Computer in Texas After Tariff Waivers</a></h2>,
 <h2 class="css-1hdf4fa e1xdw5352"><a data-rref="" href="/2019/09/22/business/china-social-credit-business.html">China Scores Businesses, and Low Grades Could Be a Trade-War Weapon</a></h2>,
 <h2 class="css-1hdf4fa e1xdw5352"><a data-rref="" href="/2019/09/20/technology/airbnb-employees-ipo-payouts.html">Inside Airbnb, Employees Eager for Big Payouts Pushed It to Go Public</a></h2>,
 <h2 class="css-f5vkeu ejccfdg1"><a href="/section/technology/personaltech">Personal Technology</a></h2>,
 <h2 class="css-y3otqb e134j7ei0"><a href="/2019/09/20/style/tinder-swipenigh

In [25]:
headlines=[x.get_text() for x in soup.find_all('h2', class_='css-1j9dxys e1xfvim30')]#Note: class_ has a trailing underscore because class is a reserved word in Python

In [26]:
headlines

['As HBO Celebrates a Big Night, Questions About Its Future Loom',
 'WeWork C.E.O.’s Ouster Is Weighed in Bid to Salvage I.P.O.',
 'M.I.T. Media Lab, Already Rattled by the Epstein Scandal, Has a New Worry',
 'Congress Asks More than 80 Companies for Big Tech Complaints',
 'Twitter Suspends Account of Former Adviser to Saudi Crown Prince',
 '‘Nerd,’ ‘Nonsmoker,’ ‘Wrongdoer’: How Might A.I. Label You?',
 'Facebook’s Suspension of ‘Tens of Thousands’ of Apps Reveals Wider Privacy Issues',
 'India’s Chandrayaan-2 Marks 60 Years of Moon Crashes and Hard Landings',
 'The Week in Tech: An Emerging Twist on Antitrust',
 'Funny or Die Finds New Life in the Streaming Era']

In [27]:
len(headlines)

10

In [28]:
h=set(headlines) #removing duplicates

In [29]:
len(h)

10

In [30]:
h

{'As HBO Celebrates a Big Night, Questions About Its Future Loom',
 'Congress Asks More than 80 Companies for Big Tech Complaints',
 'Facebook’s Suspension of ‘Tens of Thousands’ of Apps Reveals Wider Privacy Issues',
 'Funny or Die Finds New Life in the Streaming Era',
 'India’s Chandrayaan-2 Marks 60 Years of Moon Crashes and Hard Landings',
 'M.I.T. Media Lab, Already Rattled by the Epstein Scandal, Has a New Worry',
 'The Week in Tech: An Emerging Twist on Antitrust',
 'Twitter Suspends Account of Former Adviser to Saudi Crown Prince',
 'WeWork C.E.O.’s Ouster Is Weighed in Bid to Salvage I.P.O.',
 '‘Nerd,’ ‘Nonsmoker,’ ‘Wrongdoer’: How Might A.I. Label You?'}

### What if we only need the latest news?

In [34]:
latest_panel=soup.find('section', id='latest-panel')

In [28]:
latest_panel

In [35]:
hl=latest_panel.li #find the first 'li' tag within latest_panel

AttributeError: 'NoneType' object has no attribute 'li'

In [33]:
hl

NameError: name 'hl' is not defined

In [32]:
hl.a['href'] #get link from hl

NameError: name 'hl' is not defined

In [31]:
hl.h2.get_text().replace('\n','').strip() #get the h2 tag from hl and get the text between tags, removing '\n' and trailing spaces

NameError: name 'hl' is not defined

In [36]:
hl.time #get the 'time' tag from hl

NameError: name 'hl' is not defined

In [37]:
hl.time['datetime'] #get the 'datetime' attribute of the time tag

NameError: name 'hl' is not defined
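
The errors above all trace back to one cause: `find` returns `None` when nothing on the page matches, and every later step then fails. A defensive sketch that checks for `None` before going further is shown below. The HTML snippet and the helper name `first_story_link` are made up for this illustration.

```python
from bs4 import BeautifulSoup


def first_story_link(soup, panel_id):
    """Return the href of the first link in the first li of the panel, or None."""
    panel = soup.find('section', id=panel_id)
    if panel is None:          # no such section on this page
        return None
    item = panel.li            # first li inside the panel, or None
    if item is None or item.a is None:
        return None
    return item.a.get('href')


# Made-up page that contains a panel with a different id.
html = '<section id="other-panel"><ol><li><a href="/x">X</a></li></ol></section>'
soup = BeautifulSoup(html, 'html.parser')

print(first_story_link(soup, 'latest-panel'))  # section is missing
print(first_story_link(soup, 'other-panel'))   # section exists
```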

### Get all li tags in latest_panel whose id starts with 'story-id'

In [38]:
hls=latest_panel.find_all('li', id=lambda x: x and x.startswith('story-id'))

AttributeError: 'NoneType' object has no attribute 'find_all'

In [39]:
len(hls)

NameError: name 'hls' is not defined

In [40]:
#get the headline titles, hyperlink, and date time of the headlines
headlines=[(x.h2.get_text().replace('\n','').strip(),x.a['href'], x.time['datetime']) for x in hls]

NameError: name 'hls' is not defined
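
Since the live page no longer contains a section with id `latest-panel`, the cells above cannot run as written. To show what they are meant to do, here is a sketch that applies the same calls to a made-up HTML snippet. Its structure (the `li` ids, the `h2`, `a`, and `time` tags) is an assumption chosen to match the code above, not the actual markup of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the code above expects.
html = """
<section id="latest-panel">
  <ol>
    <li id="story-id-1">
      <a href="/2019/09/23/technology/example-one.html"><h2>First example headline</h2></a>
      <time datetime="2019-09-23T10:00:00-04:00">Sept. 23</time>
    </li>
    <li id="ad-slot">Advertisement</li>
    <li id="story-id-2">
      <a href="/2019/09/22/technology/example-two.html"><h2>Second example headline</h2></a>
      <time datetime="2019-09-22T08:30:00-04:00">Sept. 22</time>
    </li>
  </ol>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')
latest_panel = soup.find('section', id='latest-panel')

# The function receives each li's id value (None if the li has no id).
hls = latest_panel.find_all('li', id=lambda x: x and x.startswith('story-id'))

# Collect (title, link, datetime) for every matching story.
headlines = [(x.h2.get_text(strip=True), x.a['href'], x.time['datetime']) for x in hls]
for headline in headlines:
    print(headline)
```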

In [41]:
len(headlines)

10

In [42]:
headlines

['As HBO Celebrates a Big Night, Questions About Its Future Loom',
 'WeWork C.E.O.’s Ouster Is Weighed in Bid to Salvage I.P.O.',
 'M.I.T. Media Lab, Already Rattled by the Epstein Scandal, Has a New Worry',
 'Congress Asks More than 80 Companies for Big Tech Complaints',
 'Twitter Suspends Account of Former Adviser to Saudi Crown Prince',
 '‘Nerd,’ ‘Nonsmoker,’ ‘Wrongdoer’: How Might A.I. Label You?',
 'Facebook’s Suspension of ‘Tens of Thousands’ of Apps Reveals Wider Privacy Issues',
 'India’s Chandrayaan-2 Marks 60 Years of Moon Crashes and Hard Landings',
 'The Week in Tech: An Emerging Twist on Antitrust',
 'Funny or Die Finds New Life in the Streaming Era']