# Crawling and Extracting Data from Websites

In [None]:
# !sudo -H apt-get -y install libxml2-dev libxslt-dev python3-dev
# !sudo -H pip3 install -U lxml

## Searching in HTML: Fetching the webpage title from ESPN.com

Let's start by fetching the ESPN.com home page and reading its title. We will work our way up to the headlines later.



In [1]:
import requests # This module allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas # To create a dataframe

# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

In [2]:
doc

<Element html at 0x7efde8e64a98>

Let's start by getting the content of the `<title>` node from the site:

In [3]:
# This returns a list with the text of every node that matches,
# i.e., every node of the form <title>......</title>
results = doc.xpath('//title/text()')
results

['ESPN: Serving sports fans. Anytime. Anywhere.']

In [4]:
results[0]

'ESPN: Serving sports fans. Anytime. Anywhere.'
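Because `xpath` always returns a list, indexing with `[0]` raises an `IndexError` when nothing matches. The sketch below shows one possible way to guard against that. It uses a small made-up HTML string rather than the live ESPN page, and the helper name `first_or_none` is an assumption for illustration, not part of lxml.

```python
from lxml import html

# A tiny made-up page (an assumption for illustration, not ESPN's real HTML)
sample_doc = html.fromstring(
    "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
)

def first_or_none(matches):
    """Hypothetical helper: return the first XPath match, or None if there is none."""
    return matches[0] if matches else None

# //title matches one node, //h1 matches nothing in this sample page
title = first_or_none(sample_doc.xpath("//title/text()"))
missing = first_or_none(sample_doc.xpath("//h1/text()"))

print(title)
print(missing)
```

With a guard like this, a page that lacks the element gives us `None` instead of stopping the whole crawl.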

In [5]:
results = doc.xpath('//title')
results

[<Element title at 0x7efde8e64ae8>]

In [6]:
# gives only the text stored directly under the node
results[0].text 

'ESPN: Serving sports fans. Anytime. Anywhere.'

In [7]:
# gives all the text stored directly under the node *and* its children
# for the <title> node, it does not make a difference
results[0].text_content() 

'ESPN: Serving sports fans. Anytime. Anywhere.'

In [None]:
# Compare the two results to see the difference
# The first one gives all the text stored under the root <HTML> node
# The second gives only the text immediately under HTML (but not under the children)
# doc.xpath("/html")[0].text_content()
# doc.xpath("/html")[0].text
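The difference between `.text` and `.text_content()` is easier to see on a small node than on the full page. The following is a minimal sketch on a made-up snippet (the HTML string is an assumption, not taken from any real site):

```python
from lxml import html

# Made-up snippet: a <div> with some text and one child <b> node
snippet = html.fromstring("<div>Hello <b>big</b> world</div>")

# .text gives only the text directly under <div>, up to its first child
direct_text = snippet.text

# .text_content() gives all the text under <div>, including its children
all_text = snippet.text_content()

print(repr(direct_text))
print(repr(all_text))
```

Note that the text after the `<b>` child (" world") is not part of `.text`; lxml stores it as the `.tail` of the `<b>` node.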

### Exercise

* Connect to the NYU Stern website, and fetch the title of the page

In [None]:
# your code here

#### Solution

In [8]:
stern_url = 'http://www.stern.nyu.edu'
stern_html = requests.get(stern_url).text
stern_doc = html.fromstring(stern_html)

In [9]:
title1 = stern_doc.xpath("//title/text()")
title1[0]

'NYU Stern School of Business | Full-time MBA, Part-time (Langone) MBA, Undergraduate, PhD, Executive MBA Business Programs - NYU Stern'

In [10]:
title2 = stern_doc.xpath("//title")
title2[0].text

'NYU Stern School of Business | Full-time MBA, Part-time (Langone) MBA, Undergraduate, PhD, Executive MBA Business Programs - NYU Stern'

## Searching for elements of interest in the web page

The `doc` variable is an `HtmlElement` object, and we can now use **XPath** queries to locate the elements that we need. (Depending on time, we may do an XPath tutorial in class. For now, you can look at the [W3Schools tutorial](http://www.w3schools.com/xpath/xpath_nodes.asp).)

For example, to find all the `<a ...> ... </a>` tags in the returned html, which store the links in the page, we issue the command:

In [11]:
links = doc.xpath("//a")
len(links)

162

In [None]:
links = doc.xpath("//a")
# Iterates over all the links (this means all the nodes
# that matched the //a XPath query) and prints the content
# of the attribute href and the text for that node
for link in links:
    print("==================================")
    print(link.get("href"), "==>", link.text_content())

In [21]:
# Let's now pick one of the links (here, the one at position 80)

lnk = links[80]
type(lnk)

lxml.html.HtmlElement

The `lnk` variable is again an `HtmlElement`. To get the parts of the element that we need, we can use the `get` method (e.g., to get the `href` attribute) and the `text_content` method (to get the text within the `<a>...</a>` tag).

In [22]:
lnk.get("href")

'/mlb/story/_/id/30147626/world-series-daily-get-ready-clayton-kershaw-vs-tyler-glasnow-game-1'
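The `href` above is a relative path, so it cannot be fetched on its own. One common way to turn it into a full URL is `urljoin` from Python's standard library, combined with the address of the page we crawled. A minimal sketch, assuming `http://www.espn.com/` as the base:

```python
from urllib.parse import urljoin

# The page we fetched, and the relative href we found inside it
base_url = "http://www.espn.com/"
relative_href = "/mlb/story/_/id/30147626/world-series-daily-get-ready-clayton-kershaw-vs-tyler-glasnow-game-1"

# urljoin resolves the relative path against the base URL
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged
already_absolute = urljoin(base_url, "http://www.nielsen.com/digitalprivacy")
print(already_absolute)
```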

In [23]:
lnk.text_content()

"World Series Daily: Get ready for Clayton Kershaw vs. Tyler Glasnow in Game 1After all the oddities of 2020, the World Series is here. We've got all the info you need for Tuesday's opener.7hMLB Insiders"

In [24]:
lnk.text_content().strip()

"World Series Daily: Get ready for Clayton Kershaw vs. Tyler Glasnow in Game 1After all the oddities of 2020, the World Series is here. We've got all the info you need for Tuesday's opener.7hMLB Insiders"

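Notice that `text_content()` glues together the text of all the child nodes with no separator, which is why "Game 1" runs straight into "After" in the output above; `strip()` only trims whitespace at the two ends. One possible workaround is to join the pieces ourselves using lxml's `itertext()`. The sketch below uses a made-up link whose markup is an assumption, not ESPN's actual HTML.

```python
from lxml import html

# Made-up link with several child nodes (an assumption, not ESPN's real markup)
card = html.fromstring(
    '<a href="/story">'
    "<span>Big game tonight</span>"
    "<span>All the info you need.</span>"
    "<span>7h</span>"
    "</a>"
)

# text_content() concatenates the pieces with nothing in between
glued = card.text_content()

# itertext() yields each piece of text separately, so we can choose the separator
separated = " | ".join(piece.strip() for piece in card.itertext() if piece.strip())

print(glued)
print(separated)
```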
Now, let's revisit the _list comprehension_ approach that we discussed in the Python Primer session, for quickly constructing lists:

In [None]:
urls = [lnk.get("href") for lnk in doc.xpath('//a')]
urls
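A list comprehension can also filter while it builds the list. Some `<a>` tags have no `href` at all (so `get("href")` returns `None`), and others only point to `#`. The sketch below, on a made-up snippet that is not the real ESPN page, shows one way we might keep only the usable links:

```python
from lxml import html

# Made-up snippet with a placeholder link, a real link, and an <a> without href
page = html.fromstring(
    '<div><a href="#">Menu</a><a href="/nfl/">NFL</a><a>No link</a></div>'
)

# Without a filter we get every href, including "#" and None
all_hrefs = [a.get("href") for a in page.xpath(".//a")]

# With an "if" clause we keep only the hrefs that point somewhere
real_hrefs = [h for h in all_hrefs if h and h != "#"]

print(all_hrefs)
print(real_hrefs)
```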

### Exercise

Use a list comprehension to get the `text_content` of all the links in the page.

In [None]:
# your code here


And now create a list where we put together the text content and the URL for each link.

In [None]:
# your code here

#### Solution

In [None]:
text = [lnk.text_content().strip() for lnk in doc.xpath("//a")]
text

In [30]:
# Creating a list of tuples where we put together href and text for each link
list_tuples = [(lnk.get("href"), lnk.text_content().strip()) for lnk in doc.xpath("//a")]

In [31]:
# Creating a list of dictionaries with the text and URL for each link
list_dicts = [{"URL": lnk.get("href"), "Text": lnk.text_content().strip()} for lnk in doc.xpath("//a")]

In [32]:
import pandas as pd
pd.DataFrame(list_dicts)

Unnamed: 0,URL,Text
0,#,Menu
1,/,ESPN
2,#,
3,#,scores
4,/nfl/,NFL
...,...,...
157,https://disneyprivacycenter.com/kids-privacy-p...,Children's Online Privacy Policy
158,http://preferences-mgr.truste.com/?type=espn&a...,Interest-Based Ads
159,http://www.nielsen.com/digitalprivacy,About Nielsen Measurement
160,https://privacy.thewaltdisneycompany.com/en/dn...,Do Not Sell My Info


### More Advanced Example: Get the list of headlines from ESPN


Now, let's examine how we can get the data from the website. The key is to understand the structure of the HTML, where the data that we need is stored, and how to fetch the elements. Then, using the appropriate XPath queries, we will get what we want.

Let's start by fetching the page and parsing it:

In [53]:
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

Using the `"Right-Click > Inspect"` option of Chrome,
we right-click on the headlines and select `"Inspect"`.
This opens the HTML source of the page, at the node that we clicked on.
There we see that the headlines all sit under a `<div class="headlineStack">` tag.

In [54]:
headlineNode = doc.xpath('//div[@class="headlineStack"]')

The result of that operation is a list with 8 elements.

In [55]:
type(headlineNode)

list

In [56]:
len(headlineNode)

8
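One caveat with `@class="headlineStack"`: it matches only when the `class` attribute is exactly that string. If a site adds a second class to the same tag, the query silently stops matching. A commonly used XPath 1.0 idiom matches a class as a whole word instead. The sketch below demonstrates it on made-up HTML; the extra class names are assumptions for illustration, not ESPN's.

```python
from lxml import html

# Made-up boxes: exact class, class plus an extra one, and a look-alike class
page = html.fromstring(
    "<div>"
    '<div class="headlineStack">A</div>'
    '<div class="headlineStack top-headlines">B</div>'
    '<div class="headlineStackWide">C</div>'
    "</div>"
)

# Exact comparison: only the first box matches
exact = page.xpath('.//div[@class="headlineStack"]')

# Whole-word comparison: pad the class list with spaces and look for " headlineStack "
token_query = (
    './/div[contains(concat(" ", normalize-space(@class), " "), " headlineStack ")]'
)
token = page.xpath(token_query)

exact_text = [d.text for d in exact]
token_text = [d.text for d in token]
print(exact_text)
print(token_text)
```

The look-alike `headlineStackWide` is excluded in both cases, which a plain `contains(@class, "headlineStack")` would not guarantee.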

Each headline is under a `<li><a href="...."></a>` tag.
So, we get all the `<li><a ...>` tags within the first `<div class="headlineStack">`
(the first element of the `headlineNode` list).

In [57]:
headlines = headlineNode[0].xpath('.//li/a')
len(headlines)

7
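The leading dot in `.//li/a` matters: it makes the query relative to the node we call it on. Without the dot, `//li/a` searches the whole document again, even when called on a single node. A minimal sketch on made-up HTML with two headline boxes (the content is an assumption for illustration):

```python
from lxml import html

# Made-up page with two headline boxes
page = html.fromstring(
    "<div>"
    '<div class="headlineStack"><ul>'
    '<li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
    "</ul></div>"
    '<div class="headlineStack"><ul>'
    '<li><a href="/c">C</a></li>'
    "</ul></div>"
    "</div>"
)

first_box = page.xpath('.//div[@class="headlineStack"]')[0]

# Relative query: only the links inside the first box
relative_hrefs = [a.get("href") for a in first_box.xpath(".//li/a")]

# Absolute query: all the links in the document, regardless of the starting node
absolute_hrefs = [a.get("href") for a in first_box.xpath("//li/a")]

print(relative_hrefs)
print(absolute_hrefs)
```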

Now, we have the nodes with the content in the `headlines` variable.
We extract the text and the URL.

In [58]:
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
data

[{'Title': 'Source: Dolphins to go with Tua as starting QB',
  'URL': '/nfl/story/_/id/30154407/source-miami-dolphins-naming-tua-tagovailoa-starting-quarterback'},
 {'Title': 'Khabib: Only want St-Pierre bout after Gaethje',
  'URL': '/mma/story/_/id/30154475/khabib-nurmagomedov-only-want-georges-st-pierre-bout-justin-gaethje'},
 {'Title': "Raptors' Bjorkgren named Pacers head coach",
  'URL': '/nba/story/_/id/30154580/indiana-pacers-hire-toronto-raptors-assistant-nate-bjorkgren-head-coach'},
 {'Title': "Jones 'not in feel-good mood' about Dallas at 2-4",
  'URL': '/nfl/story/_/id/30154402/jerry-jones-not-feel-good-mood-dallas-cowboys-2-4-record-nfc-east-lead'},
 {'Title': 'Luhnow again denies role in Astros sign scandal',
  'URL': '/mlb/story/_/id/30151774/ex-astros-gm-jeff-luhnow-again-denies-role-houston-astros-sign-stealing-scandal'},
 {'Title': "Zeke on Dallas loss: 'I'm sorry; this one is on me'",
  'URL': '/nfl/story/_/id/30152139/dallas-cowboys-ezekiel-elliott-takes-blame-blowo

And let's create our dataframe, so that we can have a better view:

In [59]:
dataframe = pandas.DataFrame(data)
dataframe

Unnamed: 0,Title,URL
0,Source: Dolphins to go with Tua as starting QB,/nfl/story/_/id/30154407/source-miami-dolphins...
1,Khabib: Only want St-Pierre bout after Gaethje,/mma/story/_/id/30154475/khabib-nurmagomedov-o...
2,Raptors' Bjorkgren named Pacers head coach,/nba/story/_/id/30154580/indiana-pacers-hire-t...
3,Jones 'not in feel-good mood' about Dallas at 2-4,/nfl/story/_/id/30154402/jerry-jones-not-feel-...
4,Luhnow again denies role in Astros sign scandal,/mlb/story/_/id/30151774/ex-astros-gm-jeff-luh...
5,Zeke on Dallas loss: 'I'm sorry; this one is o...,/nfl/story/_/id/30152139/dallas-cowboys-ezekie...
6,Kiper/McShay: Sleepers for NFL draft?,/nfl/draft2021/insider/story/_/id/30147118/202...


#### Of course, there is always more than one way to skin a cat...

Alternatively, if we do not want to restrict ourselves to just the first headline box, we can write a single query that returns all the headlines matching the XPath `//div[@class="headlineStack"]//li/a`:

In [64]:
headlines = doc.xpath('//div[@class="headlineStack"]//li/a')
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

Unnamed: 0,Title,URL
0,Source: Dolphins to go with Tua as starting QB,/nfl/story/_/id/30154407/source-miami-dolphins...
1,Khabib: Only want St-Pierre bout after Gaethje,/mma/story/_/id/30154475/khabib-nurmagomedov-o...
2,Raptors' Bjorkgren named Pacers head coach,/nba/story/_/id/30154580/indiana-pacers-hire-t...
3,Jones 'not in feel-good mood' about Dallas at 2-4,/nfl/story/_/id/30154402/jerry-jones-not-feel-...
4,Luhnow again denies role in Astros sign scandal,/mlb/story/_/id/30151774/ex-astros-gm-jeff-luh...
5,Zeke on Dallas loss: 'I'm sorry; this one is o...,/nfl/story/_/id/30152139/dallas-cowboys-ezekie...
6,Kiper/McShay: Sleepers for NFL draft?,/nfl/draft2021/insider/story/_/id/30147118/202...
7,"Expert picks: Dodgers or Rays? MVP? Most HRs, ...",/mlb/story/_/id/30151229/world-series-2020-pic...
8,Everything you need to know about Dodgers-Rays,/mlb/story/_/id/30144920/world-series-2020-ult...
9,"Women + Sports Summit: Connecting, engaging an...",http://promo.espn.com/espnw/?addata=espn:front...


In [65]:
headlines = doc.xpath('//a[@data-mptype="headline"]')
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

Unnamed: 0,Title,URL
0,Source: Dolphins to go with Tua as starting QB,/nfl/story/_/id/30154407/source-miami-dolphins...
1,Khabib: Only want St-Pierre bout after Gaethje,/mma/story/_/id/30154475/khabib-nurmagomedov-o...
2,Raptors' Bjorkgren named Pacers head coach,/nba/story/_/id/30154580/indiana-pacers-hire-t...
3,Jones 'not in feel-good mood' about Dallas at 2-4,/nfl/story/_/id/30154402/jerry-jones-not-feel-...
4,Luhnow again denies role in Astros sign scandal,/mlb/story/_/id/30151774/ex-astros-gm-jeff-luh...
5,Zeke on Dallas loss: 'I'm sorry; this one is o...,/nfl/story/_/id/30152139/dallas-cowboys-ezekie...
6,Kiper/McShay: Sleepers for NFL draft?,/nfl/draft2021/insider/story/_/id/30147118/202...
7,"Expert picks: Dodgers or Rays? MVP? Most HRs, ...",/mlb/story/_/id/30151229/world-series-2020-pic...
8,Everything you need to know about Dodgers-Rays,/mlb/story/_/id/30144920/world-series-2020-ult...
9,"Women + Sports Summit: Connecting, engaging an...",http://promo.espn.com/espnw/?addata=espn:front...


## In Class Example: Crawl BuzzFeed

* We will try to get the top articles that appear on BuzzFeed
* We will grab the link for the article, the text of the title, the description, and the editor.
* The results will be stored in a dataframe (we will see in detail what a dataframe is, in a couple of modules)


In [43]:
#your code here
import requests

from lxml import html

resp = requests.get("http://www.buzzfeed.com")
doc = html.fromstring(resp.text)

In [46]:
story_nodes = doc.xpath('//article[@data-buzzblock="story-card"]')

#### Solution for BuzzFeed (as of October 20, 2020)

In [52]:
import requests # This module allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas

# Let's start by fetching the page, and parsing it
url = "http://www.buzzfeed.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

articleNodes = doc.xpath("//article[@data-buzzblock='story-card']") 

def parseArticleNode(article):
    headline = article.xpath(".//h2")[0].text_content().strip()
    headline_link = article.xpath(".//a")[0].get("href")
    description = article.xpath(".//p")[0].text_content()
    editor = article.xpath(".//a[contains(@class,'card__meta__link')]/span")[0].text_content().strip()

    result = {
        "headline": headline,
        "URL" : headline_link,
        "description" : description,
        "editor" : editor
    }
    return result

data = [parseArticleNode(article) for article in articleNodes]
df = pandas.DataFrame(data)
df

Unnamed: 0,headline,URL,description,editor
0,31 Products With Before-And-After Photos That ...,https://www.buzzfeed.com/jennifertonti/before-...,Seeing really is believing — and we believe.,Jennifer Tonti
1,WTF Is Happening Today? BuzzFeed Newsletters C...,https://www.buzzfeed.com/newsletters?ref=hplego,We have newsletters to help you stay in the lo...,BuzzFeed
2,Culinary School Grads Are Sharing The Cooking ...,https://www.buzzfeed.com/melissaharrison/culin...,Who knew??,Melissa Jameson
3,TikTok Exploded With Some Major Balloon Drama ...,https://www.buzzfeednews.com/article/laurenstr...,People were sucked into drama over one woman's...,Lauren Strapagiel
4,"25 ""Haunting Of Bly Manor"" Details That Actual...",https://www.buzzfeed.com/noradominick/haunting...,"I'll always cry over the word ""confetti,"" and ...",Nora Dominick
...,...,...,...,...
83,A Woman In Her Thirties Died Of COVID-19 On A ...,https://www.buzzfeednews.com/article/tasneemna...,A Texas woman in her thirties died after she h...,Tasneem Nashrulla
84,Justin Bieber Got Really Worked Up During His ...,https://www.buzzfeed.com/larryfitzmaurice/just...,You have to see it.,larryfitzmaurice
85,24 Things From Bellesa Boutique To Help You Ha...,https://www.buzzfeed.com/taylor_steele/things-...,Fall has come and so will you.,Taylor Steele
86,"18 Side-By-Sides Of The ""Great British Bake Of...",https://www.buzzfeed.com/jenniferabidor/great-...,So. Much. Chocolate.,Jen Abidor
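The `parseArticleNode` function indexes every result with `[0]`, so a single story card without, say, a description would raise an `IndexError` and stop the whole crawl. The sketch below shows one possible defensive variant. It runs on made-up HTML rather than the live BuzzFeed page, and the helper name `first_text` is an assumption for illustration, not part of lxml.

```python
from lxml import html

def first_text(node, query):
    """Hypothetical helper: stripped text of the first match, or None if no match."""
    matches = node.xpath(query)
    return matches[0].text_content().strip() if matches else None

# Made-up story cards (an assumption, not BuzzFeed's real markup);
# the second card has no <p> description
page = html.fromstring(
    "<div>"
    '<article data-buzzblock="story-card">'
    '<h2><a href="/x"> First story </a></h2><p>Desc</p>'
    "</article>"
    '<article data-buzzblock="story-card">'
    '<h2><a href="/y">Second story</a></h2>'
    "</article>"
    "</div>"
)

rows = [
    {
        "headline": first_text(article, ".//h2"),
        "description": first_text(article, ".//p"),
    }
    for article in page.xpath('.//article[@data-buzzblock="story-card"]')
]

for row in rows:
    print(row)
```

Missing fields come back as `None`, which pandas then shows as empty values in the dataframe instead of failing.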
