## Crawling and Extracting Data from Websites

This module uses a few libraries that are not part of the Python standard library and need to be installed on your machine. Your instance should already have them installed; if it does not, run the following commands (the leading `!` runs each one as a shell command from within the notebook):





In [None]:
!sudo -H apt-get -y install libxml2-dev libxslt-dev python3-dev

and then

In [None]:
!sudo -H pip3 install -U lxml

In [None]:
!sudo -H pip3 install -U pandas

### Fetching the headlines from ESPN.com

Let's start by trying to fetch the headlines from the site ESPN.com.



In [1]:
import requests # This library allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas # To create a dataframe

# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

In [2]:
doc

<Element html at 0x7fb6f579f458>

Let's start by getting the content of the `<title>` node from the site:

In [3]:
# This returns a list with the text of every node that matches the query,
# i.e., every node of the form <title>......</title>
results = doc.xpath('//title/text()')
results

['ESPN: The Worldwide Leader in Sports']

In [4]:
results[0]

'ESPN: The Worldwide Leader in Sports'

In [5]:
results = doc.xpath('//title')
results

[<Element title at 0x7fb6d3290c78>]

In [6]:
# gives only the text stored directly under the node
results[0].text 

'ESPN: The Worldwide Leader in Sports'

In [7]:
# gives all the text stored directly under the node *and* its children
# for the <title> node, it does not make a difference
results[0].text_content() 

'ESPN: The Worldwide Leader in Sports'

In [10]:
# Compare the two results to see the difference
# The first one gives all the text stored under the root <HTML> node
# The second gives only the text immediately under HTML (but not under the children)
# doc.xpath("/html")[0].text_content()
# doc.xpath("/html")[0].text
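
To see the difference between `.text` and `.text_content()` without printing the whole ESPN page, here is a minimal sketch on a made-up HTML snippet in which a `<p>` node contains a child `<b>` node. The snippet and the variable names (`snippet`, `sample`, `p`) are assumptions for illustration and do not come from the pages fetched in this module.

```python
from lxml import html

# A made-up snippet: the <p> node has text of its own and a <b> child
snippet = "<html><body><p>Saints' Thomas <b>rips</b> Norman</p></body></html>"
sample = html.fromstring(snippet)

p = sample.xpath("//p")[0]

# .text holds only the text that appears before the first child node
print(repr(p.text))

# .text_content() concatenates the text of the node and all its descendants
print(repr(p.text_content()))
```

For a node without children, such as `<title>`, the two give the same string, which is why the earlier cells showed no difference.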

In [11]:
stern_url = 'http://www.stern.nyu.edu'
stern_html = requests.get(stern_url).text
stern_doc = html.fromstring(stern_html)

In [14]:
title1 = stern_doc.xpath("//title/text()")
title1[0]

'NYU Stern School of Business | Full-time MBA, Part-time (Langone) MBA, Undergraduate, PhD, Executive MBA Business Programs - NYU Stern'

In [15]:
title2 = stern_doc.xpath("//title")
title2[0].text

'NYU Stern School of Business | Full-time MBA, Part-time (Langone) MBA, Undergraduate, PhD, Executive MBA Business Programs - NYU Stern'

The `doc` variable is an `HtmlElement` object, and we can use **XPath** queries to locate the elements that we need. (Depending on time, we may do an XPath tutorial in class. For now, you can look at the [W3Schools tutorial](http://www.w3schools.com/xpath/xpath_nodes.asp).)

For example, the links in a page are stored in `<a ...> ... </a>` tags. A typical tag looks like the one shown first below; the `//a` query that follows it finds all such tags in the returned HTML:

In [None]:
<a href='http://www.stern.nyu.edu'>NYU Stern</a>

In [16]:
links = doc.xpath("//a")
len(links)

166
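
If you want to experiment with XPath without depending on a live site, one option is to parse a small HTML string. The sketch below uses a made-up page (the markup, the `nav` class, and the variable names are assumptions for illustration) and shows three common query shapes: selecting text, selecting an attribute, and filtering nodes with a predicate.

```python
from lxml import html

# A made-up page, used only to try out a few XPath queries
page = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/nfl/" class="nav">NFL</a>
    <a href="/nba/" class="nav">NBA</a>
    <a href="http://www.stern.nyu.edu">NYU Stern</a>
  </body>
</html>
"""
sample = html.fromstring(page)

# text() returns the text of the matching nodes as strings
titles = sample.xpath("//title/text()")

# @href returns the value of the href attribute of every <a> node
hrefs = sample.xpath("//a/@href")

# [@class="nav"] keeps only the <a> nodes whose class attribute is exactly "nav"
nav_links = sample.xpath('//a[@class="nav"]')

print(titles)
print(hrefs)
print([a.text_content() for a in nav_links])
```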

In [17]:
links = doc.xpath("//a")
# Iterates over all the links (this means all the nodes
# that matched the //a XPath query) and prints the content
# of the attribute href and the text for that node
for link in links:
    print("==================================")
    print(link.get("href"), "==>", link.text_content())

# ==> Menu
/ ==> ESPN
# ==> 
# ==> Scores
/nfl/ ==> NFL
/nba/ ==> NBA
/mlb/ ==> MLB
/college-football/ ==> NCAAF
/soccer/ ==> Soccer
/nhl/ ==> NHL
/golf/ ==> Golf
/tennis/ ==> Tennis
# ==> …
/mma/ ==> MMA
/wwe/ ==> WWE
/boxing/ ==> Boxing
/esports/ ==> esports
/chalk/ ==> Chalk
/analytics/ ==> Analytics
/mens-college-basketball/ ==> NCAAM
/womens-basketball/ ==> WNBA
/womens-basketball/ ==> NCAAW
http://www.espn.com/espnw/sport/ncaa-softball/ ==> NCAA Softball
/racing/nascar/ ==> NASCAR
/jayski/ ==> Jayski
/f1/ ==> F1
/racing/ ==> Racing
/olympics/ ==> Olympic Sports
/horse-racing/ ==> Horse
http://www.espn.com/college-sports/football/recruiting/ ==> RN FB
http://www.espn.com/college-sports/basketball/recruiting/ ==> RN BB
/college-sports/ ==> NCAA
/little-league-world-series/ ==> LLWS
http://www.espn.com/specialolympics/ ==> Special Olympics
http://xgames.com/ ==> X Games
http://espncricinfo.com/ ==> Cricket
/rugby/ ==> Rugby
http://www.tsn.ca/cfl ==> CFL
# ==> More ESPN
/fantasy/ ==>

In [18]:
lnk = links[70]
type(lnk)

lxml.html.HtmlElement

The `lnk` variable is again an `HtmlElement`. To get the parts of the element that we need, we can use the `get` method (e.g., to get the `href` attribute) and the `text_content` method (to get the text within the `<a>...</a>` tag).

In [20]:
lnk.get("href")

'/nfl/story/_/id/24940559/michael-thomas-calls-josh-norman-goof-ball-con-artist-cap-late-night-twitter-feud'

In [21]:
lnk.text_content()

"Saints' Thomas rips Norman in Twitter tirade"

In [22]:
lnk.text_content().strip()

"Saints' Thomas rips Norman in Twitter tirade"

Now, let's revisit the _list comprehension_ approach that we discussed in the Python Primer session, for quickly constructing lists:

In [23]:
urls = [lnk.get("href") for lnk in doc.xpath('//a')]
urls

['#',
 '/',
 '#',
 '#',
 '/nfl/',
 '/nba/',
 '/mlb/',
 '/college-football/',
 '/soccer/',
 '/nhl/',
 '/golf/',
 '/tennis/',
 '#',
 '/mma/',
 '/wwe/',
 '/boxing/',
 '/esports/',
 '/chalk/',
 '/analytics/',
 '/mens-college-basketball/',
 '/womens-basketball/',
 '/womens-basketball/',
 'http://www.espn.com/espnw/sport/ncaa-softball/',
 '/racing/nascar/',
 '/jayski/',
 '/f1/',
 '/racing/',
 '/olympics/',
 '/horse-racing/',
 'http://www.espn.com/college-sports/football/recruiting/',
 'http://www.espn.com/college-sports/basketball/recruiting/',
 '/college-sports/',
 '/little-league-world-series/',
 'http://www.espn.com/specialolympics/',
 'http://xgames.com/',
 'http://espncricinfo.com/',
 '/rugby/',
 'http://www.tsn.ca/cfl',
 '#',
 '/fantasy/',
 'http://www.espn.com/espnradio/index',
 'http://www.espn.com/watch/',
 'http://www.espn.com/watch/espnplus/?om-navmethod=topnav',
 'https:plus.espn.com/',
 'http://www.espn.com/watch/espnplus/',
 'http://www.espn.com/watch/collections/11264/nhl-live
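
Many of the `href` values above are relative paths such as `/nfl/`. If we later want to fetch those pages, we need full URLs. A minimal sketch, assuming the links should be resolved against the address of the page we fetched, is to use `urljoin` from the standard library. The three sample values are copied from the output above; `base_url`, `sample_hrefs`, and `absolute_urls` are illustrative names.

```python
from urllib.parse import urljoin

# The page the links were found on
base_url = "http://www.espn.com/"

# A few href values copied from the list above
sample_hrefs = ["/nfl/", "/college-sports/", "http://xgames.com/"]

# urljoin resolves relative paths against base_url
# and leaves URLs that are already absolute untouched
absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]
print(absolute_urls)
```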

#### Exercise

Use a list comprehension to get the `text_content` of all the links in the page.

In [25]:
# your code here
text = [lnk.text_content().strip() for lnk in doc.xpath("//a")]
text

['Menu',
 'ESPN',
 '',
 'Scores',
 'NFL',
 'NBA',
 'MLB',
 'NCAAF',
 'Soccer',
 'NHL',
 'Golf',
 'Tennis',
 '…',
 'MMA',
 'WWE',
 'Boxing',
 'esports',
 'Chalk',
 'Analytics',
 'NCAAM',
 'WNBA',
 'NCAAW',
 'NCAA Softball',
 'NASCAR',
 'Jayski',
 'F1',
 'Racing',
 'Olympic Sports',
 'Horse',
 'RN FB',
 'RN BB',
 'NCAA',
 'LLWS',
 'Special Olympics',
 'X Games',
 'Cricket',
 'Rugby',
 'CFL',
 'More ESPN',
 'Fantasy',
 'Listen',
 'Watch',
 'ESPN+',
 'Subscribe Now',
 'ESPN+ Home',
 'Watch 180+ NHL Games',
 "Kobe Bryant's Detail",
 '30 for 30 Archive',
 'MLB Playoffs',
 'Fantasy Basketball: Sign Up',
 '',
 'Manage Favorites',
 'ESPN Deportes',
 'The Undefeated',
 'espnW',
 'ESPNFC',
 'X Games',
 'SEC Network',
 'ESPN',
 'ESPN Fantasy',
 'Facebook',
 'Twitter',
 'Instagram',
 'Snapchat',
 'YouTube',
 'ESPN Daily Newsletter',
 'ESPN Daily Calendar',
 "Step into Kyrie's world of ridiculous dribble movesChris Forsberg",
 'Injury-riddled Jags sign free-agent RB Charles',
 "Steelers' Brown sued 

And now create a list where we put together the text content and the URL for each link:

In [28]:
# Creating a list of tuples where we put together href and text for each link
list_tuples = [(lnk.get("href"), lnk.text_content().strip()) for lnk in doc.xpath("//a")]

In [29]:
# Creating a list of dictionaries with the text and URL for each link
list_dicts = [{"URL": lnk.get("href"), "Text": lnk.text_content().strip()} for lnk in doc.xpath("//a")]

In [31]:
import pandas as pd
pd.DataFrame(list_dicts)

|   | Text | URL |
|---|------|-----|
| 0 | Menu | # |
| 1 | ESPN | / |
| 2 |  | # |
| 3 | Scores | # |
| 4 | NFL | /nfl/ |
| 5 | NBA | /nba/ |
| 6 | MLB | /mlb/ |
| 7 | NCAAF | /college-football/ |
| 8 | Soccer | /soccer/ |
| 9 | NHL | /nhl/ |
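
The table contains rows whose URL is just `#` or whose text is empty. If such placeholder links are not wanted, one possible approach is to add a condition to the list comprehension. The sketch below reuses the first rows of the table as sample data; `sample_links` and `real_links` are illustrative names, and which rows count as "placeholders" is an assumption you may want to adjust.

```python
# The first rows of the table above, as a list of dictionaries
sample_links = [
    {"URL": "#", "Text": "Menu"},
    {"URL": "/", "Text": "ESPN"},
    {"URL": "#", "Text": ""},
    {"URL": "#", "Text": "Scores"},
    {"URL": "/nfl/", "Text": "NFL"},
]

# Keep an entry only if it has a URL other than "#" and a non-empty text
real_links = [
    d for d in sample_links
    if d["URL"] not in (None, "#") and d["Text"]
]
print(real_links)
```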


#### Let's get the headlines...

Now, let's examine how we can get the data from the website. The key is to understand the structure of the HTML, where the data that we need is stored, and how to fetch the elements. Then, using the appropriate XPath queries, we will get what we want.

In [32]:
import requests # This library allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas # To create a dataframe

Let's start by fetching the page and parsing it:

In [33]:
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

Using the `"Right-Click > Inspect"` option of Chrome,
we right-click on the headlines and select `"Inspect"`.
This opens the source code of the page.
There we see that all the headlines are under a `<div class="headlineStack">` tag.

In [47]:
headlinenodes = doc.xpath('//div[@class="headlineStack"]')
headlines = headlinenodes[1].xpath('.//li/a')
[(h.text_content(), h.get("href")) for h in headlines]

[('Injury-riddled Jags sign free-agent RB Charles',
  '/nfl/story/_/id/24942611/jacksonville-jaguars-sign-free-agent-running-back-jamaal-charles'),
 ("Steelers' Brown sued after alleged incident",
  '/nfl/story/_/id/24942401/antonio-brown-sued-allegedly-throwing-items-balcony'),
 ("Saints' Thomas rips Norman in Twitter tirade",
  '/nfl/story/_/id/24940559/michael-thomas-calls-josh-norman-goof-ball-con-artist-cap-late-night-twitter-feud'),
 ("Annoyed Bregman: Astros should be 'main event'",
  '/mlb/story/_/id/24942409/alex-bregman-houston-astros-annoyed-alds-schedule-glad-return-prime'),
 ('Sources: Suns leaning toward making Jones GM',
  '/nba/story/_/id/24942773/phoenix-suns-leaning-promoting-james-jones-full-general-manager'),
 ("Jones: Cowboys haven't had No. 1 WR in years",
  '/nfl/story/_/id/24941622/jerry-jones-says-cowboys-had-no-1-wr-years'),
 ('Kiper & McShay: RB prospects for 2019 draft',
  '/college-football/insider/story/_/id/24919864/favorite-top-running-backs-2019-nfl-dra

In [None]:
headlineNode = doc.xpath('//div[@class="headlineStack"]')

The result of that operation is a list with 6 elements.

In [None]:
type(headlineNode)

In [None]:
len(headlineNode)
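
Note that `@class="headlineStack"` matches only when the whole `class` attribute is exactly that string. A hedged sketch on made-up markup (the extra `top` class and the variable names are assumptions, not something observed on ESPN.com) shows how `contains()` can be used when a node carries more than one class:

```python
from lxml import html

# Made-up markup: the second <div> carries an additional, hypothetical class
markup = """
<html><body>
  <div class="headlineStack"><ul><li><a href="/a">First</a></li></ul></div>
  <div class="headlineStack top"><ul><li><a href="/b">Second</a></li></ul></div>
</body></html>
"""
tree = html.fromstring(markup)

# Exact comparison: only the <div> whose class attribute is exactly "headlineStack"
exact = tree.xpath('//div[@class="headlineStack"]')

# Substring test: any <div> whose class attribute contains "headlineStack"
loose = tree.xpath('//div[contains(@class, "headlineStack")]')

print(len(exact), len(loose))
```

Keep in mind that `contains()` is a plain substring test, so it would also match a class such as `headlineStackSmall`.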

Each headline is under a `<li><a href="...."></a>` tag.
So, we get all the `<li><a ...>` tags within one of the `<div class="headlineStack">` nodes
(here, the node at index 1 of the `headlineNode` list)

In [None]:
headlines = headlineNode[1].xpath('.//li/a')
len(headlines)
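
The leading dot in `.//li/a` matters: it makes the query relative to the node we call it on. Without the dot, `//li/a` searches the whole document, even when it is called on a single node. The sketch below illustrates this on made-up markup; the markup and variable names are assumptions for illustration.

```python
from lxml import html

# Made-up markup with two headline boxes
markup = """
<html><body>
  <div class="headlineStack"><ul><li><a href="/a">First</a></li></ul></div>
  <div class="headlineStack"><ul>
    <li><a href="/b">Second</a></li>
    <li><a href="/c">Third</a></li>
  </ul></div>
</body></html>
"""
tree = html.fromstring(markup)
boxes = tree.xpath('//div[@class="headlineStack"]')

# Relative query: only the links inside the second box
inside_box = boxes[1].xpath('.//li/a')

# Absolute query: all links in the document, regardless of the starting node
whole_doc = boxes[1].xpath('//li/a')

print([a.text_content() for a in inside_box])
print([a.text_content() for a in whole_doc])
```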

Now, we have the nodes with the content in the `headlines` variable.
We extract the text and the URL.

In [None]:
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
data

And let's create our dataframe, so that we can have a better view:

In [None]:
dataframe = pandas.DataFrame(data)
dataframe

#### Of course, there is always more than one way to skin a cat...

Alternatively, if we did not want to restrict ourselves to a single headline box, we could write one query that returns all the headlines matching the XPath `//div[@class="headlineStack"]//li/a`:

In [48]:
headlines = doc.xpath('//div[@class="headlineStack"]//li/a')
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

|   | Title | URL |
|---|-------|-----|
| 0 | Players not getting enough Heisman love | /college-football/story/_/id/24930711/heisman-... |
| 1 | Injury-riddled Jags sign free-agent RB Charles | /nfl/story/_/id/24942611/jacksonville-jaguars-... |
| 2 | Steelers' Brown sued after alleged incident | /nfl/story/_/id/24942401/antonio-brown-sued-al... |
| 3 | Saints' Thomas rips Norman in Twitter tirade | /nfl/story/_/id/24940559/michael-thomas-calls-... |
| 4 | Annoyed Bregman: Astros should be 'main event' | /mlb/story/_/id/24942409/alex-bregman-houston-... |
| 5 | Sources: Suns leaning toward making Jones GM | /nba/story/_/id/24942773/phoenix-suns-leaning-... |
| 6 | Jones: Cowboys haven't had No. 1 WR in years | /nfl/story/_/id/24941622/jerry-jones-says-cowb... |
| 7 | Kiper & McShay: RB prospects for 2019 draft | /college-football/insider/story/_/id/24919864/... |
| 8 | Exploring the complex nature of love as it rel... | http://www.espn.com/watch/series/c31393a0-346f... |
| 9 | Early NFL betting look for Week 6: Another big... | /chalk/insider/story/_/id/24941619/another-big... |


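Another option is to select the headline links directly through their `data-mptype="headline"` attribute:

In [49]: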
headlines = doc.xpath('//a[@data-mptype="headline"]')
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

|   | Title | URL |
|---|-------|-----|
| 0 | Players not getting enough Heisman love | /college-football/story/_/id/24930711/heisman-... |
| 1 | Injury-riddled Jags sign free-agent RB Charles | /nfl/story/_/id/24942611/jacksonville-jaguars-... |
| 2 | Steelers' Brown sued after alleged incident | /nfl/story/_/id/24942401/antonio-brown-sued-al... |
| 3 | Saints' Thomas rips Norman in Twitter tirade | /nfl/story/_/id/24940559/michael-thomas-calls-... |
| 4 | Annoyed Bregman: Astros should be 'main event' | /mlb/story/_/id/24942409/alex-bregman-houston-... |
| 5 | Sources: Suns leaning toward making Jones GM | /nba/story/_/id/24942773/phoenix-suns-leaning-... |
| 6 | Jones: Cowboys haven't had No. 1 WR in years | /nfl/story/_/id/24941622/jerry-jones-says-cowb... |
| 7 | Kiper & McShay: RB prospects for 2019 draft | /college-football/insider/story/_/id/24919864/... |
| 8 | Exploring the complex nature of love as it rel... | http://www.espn.com/watch/series/c31393a0-346f... |
| 9 | Early NFL betting look for Week 6: Another big... | /chalk/insider/story/_/id/24941619/another-big... |


### In Class Example: Crawl BuzzFeed

* We will try to get the top articles that appear on BuzzFeed
* We will grab the link for the article, the text of the title, the description, and the editor.
* The results will be stored in a dataframe (we will see in detail what a dataframe is, in a couple of modules)


In [50]:
#your code here
import requests

from lxml import html

resp = requests.get("http://www.buzzfeed.com")
doc = html.fromstring(resp.text)

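In [51]:
# The story cards are the <div> nodes marked with data-buzzblock="story-card"
story_nodes = doc.xpath('//div[@data-buzzblock="story-card"]')
len(story_nodes)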

20

In [4]:
import pandas as pd
results = []
story_nodes = doc.xpath('//div[@data-buzzblock="story-card"]')
for story in story_nodes:
    headline = story.xpath(".//h2")[0].text_content()
    description = story.xpath(".//p")[0].text_content()
    url = story.xpath(".//a")[0].get("href")
    editor = story.xpath(".//a/span")[0].text_content().strip()
    entry = {
        "headline": headline,
        "description": description,
        "url": url,
        "editor": editor
    }
    results.append(entry)

pd.DataFrame(results)

|   | description | editor | headline | url |
|---|-------------|--------|----------|-----|
| 0 | No matter how fast your life is, you still hav... | Devric Kiyota | 28 Useful Products For Anyone Who's Always On ... | https://www.buzzfeed.com/devrickiyota9/things-... |
| 1 | We have newsletters to help you stay in the lo... | BuzzFeed | WTF Is Happening Today? BuzzFeed Newsletters C... | https://www.buzzfeed.com/newsletters?ref=hplego |
| 2 | "In the movie Freaky Friday she does a body sw... | Michael Blackmon | People Blasted Fox News For Saying Jamie Lee C... | /michaelblackmon/fox-news-halloween-jamie-lee-... |
| 3 | At near-Category 5 strength, Michael is the mo... | BuzzFeed News | "Worst Case Scenario": Hurricane Michael Makes... | https://www.buzzfeednews.com/article/buzzfeedn... |
| 4 | Caramel apples, pumpkins, and crunchy leaves g... | Margret Wiggins | Go On A Sweater Shopping Spree And We'll Tell ... | /mwiggins/go-on-a-sweater-shopping-spree-and-w... |
| 5 | Nauman Hussain, whose father owns the limo at ... | Tasneem Nashrulla | Limo Company Owner's Son Charged With Criminal... | https://www.buzzfeednews.com/article/tasneemna... |
| 6 | A little bit of sass and class. | Javier Moreno | Everyone Is At Least A Little Bit Sassy — How ... | /javiermoreno/sassy-quiz?bfsource=ovthpvariant |
| 7 | Dust, declutter, and dump everything you don't... | BuzzFeed Promotions | Nifty's 7-Day Cleaning Challenge Will Help You... | https://www.buzzfeed.com/buzzfeedpromotions/si... |
| 8 | "It’s definitely been more polarizing than I e... | Tanya Chen | The Therapist In The Shane Dawson Series About... | https://www.buzzfeednews.com/article/tanyachen... |
| 9 | It only took 30 minutes! | Jenna Guillaume | The New Bachelorette Made A Transphobic Commen... | /jennaguillaume/bachelorette-australia-queerba... |


#### Solution for BuzzFeed (as of October 9, 2018)

In [1]:
import requests # This command allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas

# Let's start by fetching the page, and parsing it
url = "http://www.buzzfeed.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

articleNodes = doc.xpath("//div[@data-buzzblock='story-card']") 

def parseArticleNode(article):
    headline = article.xpath(".//h2")[0].text_content()
    headline_link = article.xpath(".//a")[0].get("href")
    description = article.xpath(".//p")[0].text_content()
    editor = article.xpath(".//a[contains(@class,'card__meta__link')]/span")[0].text_content().strip()

    result = {
        "headline": headline,
        "URL" : headline_link,
        "description" : description,
        "editor" : editor
    }
    return result

data = [parseArticleNode(article) for article in articleNodes]
df = pandas.DataFrame(data)
df

|   | URL | description | editor | headline |
|---|-----|-------------|--------|----------|
| 0 | https://www.buzzfeed.com/devrickiyota9/things-... | No matter how fast your life is, you still hav... | Devric Kiyota | 28 Useful Products For Anyone Who's Always On ... |
| 1 | https://www.buzzfeed.com/newsletters?ref=hplego | We have newsletters to help you stay in the lo... | BuzzFeed | WTF Is Happening Today? BuzzFeed Newsletters C... |
| 2 | /michaelblackmon/fox-news-halloween-jamie-lee-... | "In the movie Freaky Friday she does a body sw... | Michael Blackmon | People Blasted Fox News For Saying Jamie Lee C... |
| 3 | https://www.buzzfeednews.com/article/buzzfeedn... | At near-Category 5 strength, Michael is the mo... | BuzzFeed News | "Worst Case Scenario": Hurricane Michael Makes... |
| 4 | /mwiggins/go-on-a-sweater-shopping-spree-and-w... | Caramel apples, pumpkins, and crunchy leaves g... | Margret Wiggins | Go On A Sweater Shopping Spree And We'll Tell ... |
| 5 | https://www.buzzfeednews.com/article/tasneemna... | Nauman Hussain, whose father owns the limo at ... | Tasneem Nashrulla | Limo Company Owner's Son Charged With Criminal... |
| 6 | /javiermoreno/sassy-quiz?bfsource=ovthpvariant | A little bit of sass and class. | Javier Moreno | Everyone Is At Least A Little Bit Sassy — How ... |
| 7 | https://www.buzzfeed.com/buzzfeedpromotions/si... | Dust, declutter, and dump everything you don't... | BuzzFeed Promotions | Nifty's 7-Day Cleaning Challenge Will Help You... |
| 8 | https://www.buzzfeednews.com/article/tanyachen... | "It’s definitely been more polarizing than I e... | Tanya Chen | The Therapist In The Shane Dawson Series About... |
| 9 | /jennaguillaume/bachelorette-australia-queerba... | It only took 30 minutes! | Jenna Guillaume | The New Bachelorette Made A Transphobic Commen... |
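
One caveat about the code above: indexing with `[0]` raises an `IndexError` whenever a story card lacks one of the expected elements. A possible way to guard against that, sketched here under the assumption that a missing field should simply become `None`, is a small helper. The helper `first_text` and the made-up markup are hypothetical and are not part of lxml or of the BuzzFeed page.

```python
from lxml import html

def first_text(node, query, default=None):
    """Return the stripped text of the first node matching the XPath
    query (evaluated relative to `node`), or `default` if nothing matches."""
    matches = node.xpath(query)
    if not matches:
        return default
    return matches[0].text_content().strip()

# Made-up markup: the second card has no <p> description
card_markup = """
<html><body>
  <div data-buzzblock="story-card"><h2> A headline </h2><p>A description</p></div>
  <div data-buzzblock="story-card"><h2>No description here</h2></div>
</body></html>
"""
cards = html.fromstring(card_markup).xpath("//div[@data-buzzblock='story-card']")

rows = [
    {"headline": first_text(c, ".//h2"), "description": first_text(c, ".//p")}
    for c in cards
]
print(rows)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame` in the same way as before; missing values then show up as empty cells instead of stopping the crawl.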
