## Crawling and Extracting Data from Websites

This module uses several non-standard libraries, which need to be installed on your machine. Your instance should already have them installed by default; if it does not, run the following commands at the Unix shell prompt (the leading `!` runs a shell command from inside the notebook):





In [1]:
!sudo -H apt-get -y install libxml2-dev libxslt-dev python3-dev

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'libxslt1-dev' instead of 'libxslt-dev'
libxslt1-dev is already the newest version (1.1.28-2.1).
python3-dev is already the newest version (3.5.1-3).
libxml2-dev is already the newest version (2.9.3+dfsg1-1ubuntu0.1).
The following packages were automatically installed and are no longer required:
  linux-headers-4.4.0-62 linux-headers-4.4.0-62-generic
  linux-image-4.4.0-62-generic
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


and then

In [2]:
!sudo -H pip3 install -U lxml

Requirement already up-to-date: lxml in /usr/local/lib/python3.5/dist-packages


In [3]:
!sudo -H pip3 install -U pandas

Requirement already up-to-date: pandas in /usr/local/lib/python3.5/dist-packages
Requirement already up-to-date: pytz>=2011k in /usr/local/lib/python3.5/dist-packages (from pandas)
Requirement already up-to-date: python-dateutil>=2 in /usr/local/lib/python3.5/dist-packages (from pandas)
Requirement already up-to-date: numpy>=1.7.0 in /usr/local/lib/python3.5/dist-packages (from pandas)
Requirement already up-to-date: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2->pandas)


### Fetching the headlines from ESPN.com

Let's start by trying to fetch the headlines from the site ESPN.com.



In [4]:
import requests # This command allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas # To create a dataframe

# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document
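
The fetch above assumes the request succeeds. As a minimal sketch (the helper name `fetch_document` and the 10-second timeout are our own choices, not something the site requires), we can wrap the same two steps in a function that gives up on a slow server and complains loudly about an error response:

```python
import requests
from lxml import html


def fetch_document(url, timeout=10):
    """Fetch a URL and return the parsed lxml document (hypothetical helper)."""
    # timeout: stop waiting instead of hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    # same parsing step as in the cell above
    return html.fromstring(response.text)


# Example call (needs network access, so it is left commented out):
# doc = fetch_document("http://www.espn.com/")
```

All the exceptions that `requests` raises derive from `requests.exceptions.RequestException`, so a single `except` clause around the call is enough to handle a failed fetch.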

The `doc` variable is an `HtmlElement` object, and we can now use **XPath** queries to locate the elements that we need. (Time permitting, we may go through an XPath tutorial in class. For now, you can look at the [W3Schools tutorial](http://www.w3schools.com/xpath/xpath_nodes.asp).)

For example, to find all the `<a ...> ... </a>` tags in the returned html, which store the links in the page, we issue the command:

In [5]:
links = doc.xpath("//a")
len(links)

228
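
The live page changes all the time, so it can help to try a few XPath patterns on a small snippet first. The sketch below uses hand-written HTML (the markup and the `demo` variable are made up for illustration) and shows four common query shapes:

```python
from lxml import html

# A tiny, made-up page so the results do not depend on the live site
snippet = """
<html><body>
  <div id="nav"><a href="/nfl/">NFL</a><a href="/nba/">NBA</a></div>
  <div id="main">
    <a href="http://example.com/story" class="headline">A story</a>
    <a>No link here</a>
  </div>
</body></html>
"""
demo = html.fromstring(snippet)

all_links = demo.xpath("//a")                 # every <a> anywhere in the document
with_href = demo.xpath("//a[@href]")          # only the <a> tags that carry an href
nav_links = demo.xpath('//div[@id="nav"]/a')  # <a> children of the div with id="nav"
hrefs = demo.xpath("//a/@href")               # the attribute values themselves, as strings

print(len(all_links), len(with_href), len(nav_links))  # 4 3 2
print(hrefs)
```

Note that `xpath()` always returns a list, even when only one element (or none) matches.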

In [21]:
lnk = links[70]
type(lnk)


lxml.html.HtmlElement

The `lnk` variable is again an `HtmlElement`. To get the parts of the HTML element that we need, we can use the `get()` method (e.g., to get the `href` attribute) and the `text_content()` method (to get the text within the `<a>...</a>` tag).
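
Here is a small offline sketch of the two methods on a hand-written link (the markup is made up for illustration). It also shows two details that are easy to miss: `get()` returns `None` for a missing attribute, and `text_content()` glues together the text of all nested tags without adding spaces.

```python
from lxml import html

# A made-up link with three nested <span> tags
link = html.fromstring(
    '<a href="/mlb/story"><span>MLB</span><span>ESPN.com</span>'
    "<span>Tebow's baseball journey</span></a>"
)

print(link.get("href"))      # /mlb/story
print(link.get("title"))     # None, because the attribute is absent
print(link.text_content())   # MLBESPN.comTebow's baseball journey

# itertext() yields each piece of text separately, so we can choose the separator
pieces = [t.strip() for t in link.itertext() if t.strip()]
print(" | ".join(pieces))    # MLB | ESPN.com | Tebow's baseball journey
```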

In [25]:
lnk.get("href")

'/mlb/story/_/id/18747575/tim-tebow-baseball-journey-hits-mets-spring-training'

In [26]:
lnk.text_content()

"MLBESPN.comTebow's baseball journey hits Mets spring trainingFrom Heisman Trophy-winning and NFL starting quarterback to minor league-hopeful outfielder, here's how Tim Tebow's path brought him to the Grapefruit League."

Now, let's revisit the _list comprehension_ approach that we discussed in the Python Primer session, for quickly constructing lists:

In [9]:
urls = [lnk.get("href") for lnk in doc.xpath('//a')]
urls

['#',
 '/',
 '#',
 '#',
 '/nfl/',
 '/nba/',
 '/mlb/',
 '/college-football/',
 'http://www.espnfc.com',
 '/mens-college-basketball/',
 '#',
 '/nhl/',
 '/golf/',
 '/tennis/',
 '/mma/',
 '/wwe/',
 '/boxing/',
 '/esports/',
 '/chalk/',
 '/analytics/',
 '/womens-basketball/',
 '/womens-basketball/',
 '/racing/nascar/',
 '/racing/',
 '/horse-racing/',
 'http://www.espn.com/college-sports/football/recruiting/',
 'http://www.espn.com/college-sports/basketball/recruiting/index',
 '/college-sports/',
 'http://www.espn.com/moresports/story/_/page/LittleLeagueWorldSeries/little-league-world-series-espn',
 '/olympics/',
 'http://www.espn.com/extra/specialolympics/',
 'http://xgames.com/',
 'http://espncricinfo.com/',
 'http://www.espnscrum.com/',
 '/endurance/',
 'http://www.tsn.ca/cfl',
 '#',
 '/fantasy/',
 'http://www.espn.com/espnradio/index',
 'http://www.espn.com/watchespn/index',
 None,
 '#',
 'http://www.espn.com/nfl/story/_/id/18749426/2017-nfl-combine-preview-prospect-targets-positions-nee
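
The list above mixes relative paths (`/nfl/`), absolute URLs, placeholder `#` links, and `None` for `<a>` tags without an `href`. One possible way to clean it up, sketched here on a short hand-picked sample (the variable names are our own), is to skip the unusable entries and resolve the rest against the page URL with `urljoin` from the standard library:

```python
from urllib.parse import urljoin

base_url = "http://www.espn.com/"
# A few values of the kinds that appear in the list above
raw_hrefs = ["#", "/nfl/", "http://www.espnfc.com", None, "/nba/", "/nfl/"]

absolute_urls = []
for href in raw_hrefs:
    # skip <a> tags without an href and in-page placeholder links
    if href is None or href.startswith("#"):
        continue
    # relative paths are resolved against base_url; absolute URLs pass through unchanged
    full = urljoin(base_url, href)
    # keep the first occurrence of each URL, preserving order
    if full not in absolute_urls:
        absolute_urls.append(full)

print(absolute_urls)
# ['http://www.espn.com/nfl/', 'http://www.espnfc.com', 'http://www.espn.com/nba/']
```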

#### Exercise

Use a list comprehension to get the `text_content()` of all the links in the page.

In [10]:
# your code here

And now create a list that puts together the text content and the URL for each link.

In [11]:
# your code here

#### Let's get the headlines...

Now, let's examine how we can get the data from the website. The key is to understand the structure of the HTML, where the data that we need is stored, and how to fetch the elements. Then, using the appropriate XPath queries, we will get what we want.

In [12]:
import requests # This command allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas # To create a dataframe

Let's start by fetching the page, and parsing it

In [13]:
url = "http://www.espn.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

Using the `"Right-Click > Inspect"` option of Chrome,
we right-click on one of the headlines and select `"Inspect"`.
This opens the source code at that element.
There we see that all the headlines sit under a `<div class="headlineStack">` tag,
which is also the only such tag in the HTML source.

In [14]:
headlineNode = doc.xpath('//div[@class="headlineStack"]')[0]
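
Two things can go wrong with a query like this one. First, `[@class="..."]` compares the whole attribute value, so a tag with a second class name no longer matches. Second, indexing with `[0]` raises an `IndexError` when nothing matches. The sketch below uses made-up markup to illustrate both points:

```python
from lxml import html

# Made-up page: the second div carries an extra class name
page = html.fromstring("""
<html><body>
  <div class="headlineStack">
    <ul><li><a href="/nba/story-1">First headline</a></li></ul>
  </div>
  <div class="headlineStack top-stories">
    <ul><li><a href="/nfl/story-2">Second headline</a></li></ul>
  </div>
</body></html>
""")

exact = page.xpath('//div[@class="headlineStack"]')              # whole attribute must be equal
loose = page.xpath('//div[contains(@class, "headlineStack")]')   # substring match
missing = page.xpath('//div[@class="noSuchClass"]')              # no match gives an empty list

print(len(exact), len(loose), len(missing))  # 1 2 0

# Check the list before indexing, instead of letting [0] raise IndexError
node = exact[0] if exact else None
```

Since sites redesign their pages often, checking for an empty result is a cheap way to notice that the structure has changed.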

Each headline is under a `<li><a href="...."></a>` tag.
So, we get all the `<li><a ...>` tags within the `<div class="headlineStack">`
(which is stored in the `headlineNode` variable).

In [15]:
headlines = headlineNode.xpath('.//ul/li/a')
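
The leading dot in `.//ul/li/a` matters: it makes the query relative to `headlineNode`. Without the dot, `//` searches the whole document, even when we call `xpath()` on a sub-node. A quick check on made-up markup:

```python
from lxml import html

# Made-up page with one link in a menu and two links in the headline stack
page = html.fromstring("""
<html><body>
  <ul id="menu"><li><a href="/nfl/">NFL</a></li></ul>
  <div class="headlineStack">
    <ul>
      <li><a href="/story-1">Headline one</a></li>
      <li><a href="/story-2">Headline two</a></li>
    </ul>
  </div>
</body></html>
""")
stack = page.xpath('//div[@class="headlineStack"]')[0]

inside = stack.xpath(".//ul/li/a")    # relative: only the descendants of stack
anywhere = stack.xpath("//ul/li/a")   # absolute: the whole document, menu link included

print(len(inside), len(anywhere))  # 2 3
```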

Now we have the nodes with the content in the `headlines` variable.
We extract the text and the URL of each one.

In [16]:
data = [{"Title": a.text_content(), "URL": a.get("href")} for a in headlines]
data

[{'Title': '76ers C Embiid to have MRI on injured knee',
  'URL': '/nba/story/_/id/18779961/joel-embiid-philadelphia-76ers-undergo-mri-injured-knee'},
 {'Title': "Sources: Season likely over for Knicks' Noah",
  'URL': '/nba/story/_/id/18780034/joakim-noah-new-york-knicks-likely-done-season-knee-injury'},
 {'Title': 'Source: Bears unlikely to tag WR Jeffery again',
  'URL': '/nfl/story/_/id/18778296/chicago-bears-unlikely-tag-alshon-jeffery-again'},
 {'Title': "Baylor's Mulkey expresses regret over remarks",
  'URL': '/womens-college-basketball/story/_/id/18774890/kim-mulkey-baylor-bears-expresses-regret-comment'},
 {'Title': 'Sources: Bogut seeks 76ers exit to join Cavs',
  'URL': '/nba/story/_/id/18776384/andrew-bogut-negotiating-philadelphia-76ers-release-wants-join-cavs'},
 {'Title': "Nicklaus: Tiger's current status is puzzling",
  'URL': '/golf/story/_/id/18774951/jack-nicklaus-says-tiger-woods-status-puzzling'},
 {'Title': 'Eight NFL trades that should happen',
  'URL': '/nfl/in

And let's create our dataframe, so that we can have a better view:

In [17]:
dataframe = pandas.DataFrame(data)
dataframe

| | Title | URL |
|---|---|---|
| 0 | 76ers C Embiid to have MRI on injured knee | /nba/story/_/id/18779961/joel-embiid-philadelp... |
| 1 | Sources: Season likely over for Knicks' Noah | /nba/story/_/id/18780034/joakim-noah-new-york-... |
| 2 | Source: Bears unlikely to tag WR Jeffery again | /nfl/story/_/id/18778296/chicago-bears-unlikel... |
| 3 | Baylor's Mulkey expresses regret over remarks | /womens-college-basketball/story/_/id/18774890... |
| 4 | Sources: Bogut seeks 76ers exit to join Cavs | /nba/story/_/id/18776384/andrew-bogut-negotiat... |
| 5 | Nicklaus: Tiger's current status is puzzling | /golf/story/_/id/18774951/jack-nicklaus-says-t... |
| 6 | Eight NFL trades that should happen | /nfl/insider/story/_/id/18775789/eight-nfl-tra... |
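
The `...` at the end of each URL in the view above is only display shortening; the dataframe still stores the complete strings. As a small sketch (the `AbsoluteURL` column name is our own choice), we can rebuild a dataframe from the first two headlines shown earlier, read one cell in full, and add a column with the complete address:

```python
from urllib.parse import urljoin

import pandas

base_url = "http://www.espn.com/"
# The first two entries from the headline list above
rows = [
    {"Title": "76ers C Embiid to have MRI on injured knee",
     "URL": "/nba/story/_/id/18779961/joel-embiid-philadelphia-76ers-undergo-mri-injured-knee"},
    {"Title": "Sources: Season likely over for Knicks' Noah",
     "URL": "/nba/story/_/id/18780034/joakim-noah-new-york-knicks-likely-done-season-knee-injury"},
]
frame = pandas.DataFrame(rows)

# Reading a single cell returns the full, unshortened string
print(frame.loc[0, "URL"])

# Resolve every relative path against the site address
frame["AbsoluteURL"] = frame["URL"].apply(lambda href: urljoin(base_url, href))
print(frame.loc[0, "AbsoluteURL"])
```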


### In Class Example: Crawl BuzzFeed

* We will try to get the top articles that appear on Buzzfeed
* We will grab the link for the article, the text of the title, the description, and the editor.
* The results will be stored in a dataframe (we will see in detail what a dataframe is, in a couple of modules)


In [18]:
#your code here
import requests
from lxml import html

resp = requests.get("http://www.buzzfeed.com")
doc = html.fromstring(resp.text)



#### Solution for Buzzfeed (as of February 27, 2017)

In [27]:
import requests # This command allows us to fetch URLs
from lxml import html # This module will allow us to parse the returned HTML/XML
import pandas

# Let's start by fetching the page, and parsing it
url = "http://www.buzzfeed.com/"
response = requests.get(url) # get the html of that url
doc = html.fromstring(response.text) # parse it and create a document

articleNodes = doc.xpath("//div[@data-module='card-article']") 

def parseArticleNode(article):
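    headline_node = article.xpath(".//a/h2")[0]
    headline = headline_node.text_content()
    # The <a> that wraps the <h2> carries the link. (The recorded run shown below
    # stored the headline text here instead, hence the titles in its URL column.)
    headline_link = headline_node.getparent().get("href")
    description = article.xpath(".//a/p")[0].text_content()
    editor = article.xpath(".//a[contains(@class,'card__byline-link')]/span")[0].text_content().strip()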
    
    result = {
        "headline": headline,
        "URL" : headline_link,
        "description" : description,
        "editor" : editor
    }
    return result

data = [parseArticleNode(article) for article in articleNodes]
df = pandas.DataFrame(data)
df

| | URL | description | editor | headline |
|---|---|---|---|---|
| 0 | Your Man Mahershala Ali Now Has An Oscar | If you thought he looked good in yellow, wait ... | Marcus Jones | Your Man Mahershala Ali Now Has An Oscar |
| 1 | This Colour Test Will Reveal How Often You Mas... | How many doses of self-loving are you receiving? | Ben Henry | This Colour Test Will Reveal How Often You Mas... |
| 2 | Build The Perfect Date And We'll Tell You What... | Roses are red, violets are blue... | Kevin Smith | Build The Perfect Date And We'll Tell You What... |
| 3 | 23 Dirty Jokes You'll Only Appreciate If You H... | [watching porn] I hope they stay together. | Gena-mour Barrett | 23 Dirty Jokes You'll Only Appreciate If You H... |
| 4 | This Woman Responded To The Body-Shaming Comme... | "As it turns out, happiness isn't a size." | Rachael Krishna | This Woman Responded To The Body-Shaming Comme... |
| 5 | Build A Dream Date Night And We'll Tell You If... | Are you in lust or loved up? | Jasmin Nahar | Build A Dream Date Night And We'll Tell You If... |
| 6 | How Old Were These Actors When They Filmed The... | Warning: This quiz will make you feel old. | paceyandjoey | How Old Were These Actors When They Filmed The... |
| 7 | Another Jewish Cemetery Was Vandalized | Around 100 headstones were knocked over at the... | Claudia Koerner | Another Jewish Cemetery Was Vandalized |
| 8 | Brie Larson Refused To Applaud Casey Affleck W... | This is the second time this awards season tha... | Ellie Woodward | Brie Larson Refused To Applaud Casey Affleck W... |
| 9 | 27 Ways To Trick People Into Thinking You're G... | Bad hair day? I don't know her. | Maitland Quitmeyer | 27 Ways To Trick People Into Thinking You're G... |
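
The solution indexes every query result with `[0]`, so a single card without a description or a byline stops the whole crawl with an `IndexError`. One way around this, sketched below, is a pair of small helpers that return `None` when nothing matches. The helper names and the markup are made up for illustration; the markup only mimics the structure that the queries above expect and is not BuzzFeed's actual HTML.

```python
from lxml import html

# Made-up cards: the second one has no description and no byline
cards = html.fromstring("""
<html><body>
  <div data-module="card-article">
    <a href="/post-1"><h2>First title</h2><p>First description</p></a>
    <a class="card__byline-link" href="/author-1"><span> Author One </span></a>
  </div>
  <div data-module="card-article">
    <a href="/post-2"><h2>Second title</h2></a>
  </div>
</body></html>
""")


def first_text(node, query):
    """Stripped text of the first element matching query, or None if no match."""
    matches = node.xpath(query)
    return matches[0].text_content().strip() if matches else None


def first_attr(node, query):
    """First attribute value matching query (query ends in /@name), or None."""
    matches = node.xpath(query)
    return matches[0] if matches else None


records = [
    {
        "headline": first_text(card, ".//a/h2"),
        "URL": first_attr(card, ".//a[h2]/@href"),  # href of the <a> that wraps the <h2>
        "description": first_text(card, ".//a/p"),
        "editor": first_text(card, ".//a[contains(@class,'card__byline-link')]/span"),
    }
    for card in cards.xpath("//div[@data-module='card-article']")
]

for record in records:
    print(record)
```

Missing values then show up as `None` in the resulting list (and as empty cells in a dataframe) instead of aborting the run.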
