
# Crawling and Extracting Data from Websites

In [1]:
from bs4 import BeautifulSoup

## Searching in HTML: Fetching the webpage title from ESPN.com

Let's start by trying to fetch the headlines from the site ESPN.com.



In [5]:
import requests # This command allows us to fetch URLs
import pandas # To create a dataframe

# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"

# Add a user-agent, to pretend to be a browser, not a Python script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# get the html of that url
response = requests.get(url, headers=headers)

# Parse the web page
espn_soup = BeautifulSoup(response.text, 'html.parser')
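
The fetch-and-parse steps above recur several times in this notebook, so they can be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original notebook: the name `fetch_soup` and its `user_agent` and `timeout` parameters are made up for illustration. It also adds two safeguards that the cell above does not have, a timeout and a check of the HTTP status code.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, user_agent="Mozilla/5.0", timeout=10):
    """Fetch a URL and return the parsed page (hypothetical helper)."""
    # Send a browser-like User-Agent, as in the cell above
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With such a helper, a fetch-and-parse cell could shrink to a single call such as `fetch_soup("http://www.espn.com/")`.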

Let's start by getting the content of the `<title>` node from the site:

In [6]:
# This code finds the tag <title> in the HTML of ESPN.com
results = espn_soup.find('title')
results

<title>ESPN - Serving Sports Fans. Anytime. Anywhere.</title>

In [7]:
# Now let's get the text of that node
results = espn_soup.find('title').string
results

'ESPN - Serving Sports Fans. Anytime. Anywhere.'
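
Note that `find` returns `None` when no tag matches, so chaining `.string` onto it fails with an `AttributeError` on a page without a `<title>`. Below is a minimal sketch of a guarded lookup, run on two hand-written HTML snippets rather than the live ESPN page; the helper name `page_title` is hypothetical.

```python
from bs4 import BeautifulSoup

def page_title(soup):
    """Return the stripped text of <title>, or None if the page has no title."""
    node = soup.find("title")
    if node is None:
        return None
    return node.get_text(strip=True)

# Two invented pages: one with a title, one without
with_title = BeautifulSoup(
    "<html><head><title> Example page </title></head></html>", "html.parser"
)
without_title = BeautifulSoup(
    "<html><body><p>No title here</p></body></html>", "html.parser"
)

print(page_title(with_title))     # the title text, without surrounding spaces
print(page_title(without_title))  # None, instead of an exception
```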

### Exercise

* Connect to the NYU Stern website, and fetch the title of the page

In [10]:
# your code here

# Let's start by fetching the page, and parsing it
url = "http://www.stern.nyu.edu/"

# Add a user-agent, to pretend to be a browser, not a Python script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# get the html of that url
response = requests.get(url, headers=headers)

# Parse the web page
stern_soup = BeautifulSoup(response.text, 'html.parser')

stern_soup.find('title')

<title>NYU Stern School of Business | Full-time MBA, Part-time (Langone) MBA, Undergraduate, PhD, Executive MBA Business Programs - NYU Stern</title>

In [None]:
# @title Solution
stern_url = 'http://www.stern.nyu.edu'
stern_html = requests.get(stern_url).text
stern_soup = BeautifulSoup(stern_html, 'html.parser')

title = stern_soup.find('title').string
title

## Searching for elements of interest in the web page

Now, let's say that we are looking to retrieve *multiple* elements from a web page. For that we can use the `soup.find_all` command.

For example, to find all the `<a ...> ... </a>` tags in the returned html, which store the links in the page, we issue the command:

In [11]:
# Get all the <a ...> ... </a> elements, which are the links on the page
links = espn_soup.find_all("a")
len(links)

164

In [12]:
# Let's pick now one of the many links
lnk = links[80]
type(lnk.string)

bs4.element.NavigableString

In [14]:
links[80]

<a class="" data-mptype="headline" href="/sports-betting/story/_/id/38749633/2023-nba-thursday-betting-odds-tips-lines-stats-more" name="&amp;lpos=fp:feed:xx:story:1:related">NBA betting: Three picks for Thursday</a>

To get the parts of the HTML element that we need, we can use the `get` method (e.g., to get the `href` attribute) and the `text` attribute (to get the text within the `<a>...</a>` tag).

In [15]:
lnk.get("href")

'/sports-betting/story/_/id/38749633/2023-nba-thursday-betting-odds-tips-lines-stats-more'

In [16]:
lnk.text

'NBA betting: Three picks for Thursday'

In [None]:
# The strip() removes blank spaces before and after the text
lnk.text.strip()
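
The cells above use both `.string` and `.text`. The two agree when a tag contains a single piece of text, but `.string` is `None` when the tag has more than one child, while `.text` concatenates all the nested text. A small sketch on invented HTML (not taken from ESPN) that shows the difference:

```python
from bs4 import BeautifulSoup

# Two made-up links: one with plain text, one with a nested <b> tag
html = '<a href="/nba/">NBA</a><a href="/nfl/"><b>NFL</b> scores</a>'
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("a")

# The first link has exactly one child, so .string and .text agree
print(repr(simple.string), repr(simple.text))

# The second link has two children (<b> and a text node), so .string is None
print(repr(nested.string), repr(nested.text))
```

For this reason, `.text` (or `get_text()`) tends to be the safer choice when the inner structure of a tag is not known in advance.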

Let's put everything together

In [17]:
links = espn_soup.find_all("a")

# Iterates over all the links (that is, all the <a> nodes
# returned by find_all) and prints the content
# of the attribute href and the text for that node
for link in links:
    print("==================================")
    print(link.get("href"), "==>", link.text.strip())

# ==> Menu
/ ==> ESPN
# ==> 
# ==> scores
/nfl/ ==> NFL
/mlb/ ==> MLB
/college-football/ ==> NCAAF
/nba/ ==> NBA
/nhl/ ==> NHL
/soccer/ ==> Soccer
# ==> …
/mens-college-basketball/ ==> NCAAM
/womens-college-basketball/ ==> NCAAW
/sports-betting/ ==> Sports Betting
/boxing/ ==> Boxing
http://www.tsn.ca/cfl ==> CFL
/college-sports/ ==> NCAA
https://www.espncricinfo.com/ ==> Cricket
/f1/ ==> F1
/golf/ ==> Golf
/horse-racing/ ==> Horse
/little-league-world-series/ ==> LLWS
/mma/ ==> MMA
/racing/nascar/ ==> NASCAR
/nba-g-league/ ==> NBA G League
/olympics/ ==> Olympic Sports
/pll/ ==> PLL
/racing/ ==> Racing
/college-sports/basketball/recruiting/ ==> RN BB
/college-sports/football/recruiting/ ==> RN FB
/rugby/ ==> Rugby
/tennis/ ==> Tennis
/wnba/ ==> WNBA
/wwe/ ==> WWE
http://xgames.com/ ==> X Games
/xfl/ ==> XFL
# ==> More ESPN
/fantasy/ ==> Fantasy
https://www.espn.com/espnradio/index ==> Listen
https://www.espn.com/watch/ ==> Watch
https://www.espn.com/espnplus/?om-navmethod=topnav ==> E

Now, let's revisit the _list comprehension_ approach that we discussed in the Python Primer session, for quickly constructing lists:

In [18]:
# List comprehension approach
urls = [lnk.get("href") for lnk in espn_soup.find_all('a')]
urls

# Verbose version, equivalent to the list comprehension above
urls = []
for lnk in espn_soup.find_all('a'):
  urls.append(lnk.get("href"))


['#',
 '/',
 '#',
 '#',
 '/nfl/',
 '/mlb/',
 '/college-football/',
 '/nba/',
 '/nhl/',
 '/soccer/',
 '#',
 '/mens-college-basketball/',
 '/womens-college-basketball/',
 '/sports-betting/',
 '/boxing/',
 'http://www.tsn.ca/cfl',
 '/college-sports/',
 'https://www.espncricinfo.com/',
 '/f1/',
 '/golf/',
 '/horse-racing/',
 '/little-league-world-series/',
 '/mma/',
 '/racing/nascar/',
 '/nba-g-league/',
 '/olympics/',
 '/pll/',
 '/racing/',
 '/college-sports/basketball/recruiting/',
 '/college-sports/football/recruiting/',
 '/rugby/',
 '/tennis/',
 '/wnba/',
 '/wwe/',
 'http://xgames.com/',
 '/xfl/',
 '#',
 '/fantasy/',
 'https://www.espn.com/espnradio/index',
 'https://www.espn.com/watch/',
 'https://www.espn.com/espnplus/?om-navmethod=topnav',
 'http://www.espn.com/watch/espnplus/',
 'https://secure.web.plus.espn.com/billing/purchase/ESPN_PURCHASE_CMPGN/ESPN_PURCHASE_VOCHR/YESPN?start=login&locale=en_US&om-navmethod=LRailSubscribe',
 'https://www.espn.com/espnplus/player/_/id/20734b62-3

In [None]:
# You can safely skip the code below.
# A bit fancier, adding a prefix of http://www.espn.com when the URL is
# relative and does not include the domain. The domain has no trailing
# slash, because relative links such as /nfl/ already start with one.
domain = "http://www.espn.com"
urls = [
    lnk.get("href") if lnk.get("href").startswith("http") else domain + lnk.get("href")
    for lnk in espn_soup.find_all('a') if lnk.get("href")
]
urls
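
Gluing strings together only works when the slashes happen to line up. As an alternative, Python's standard library offers `urllib.parse.urljoin`, which resolves a relative link against a base URL and leaves absolute links unchanged. A minimal sketch on a few hrefs copied from the output above:

```python
from urllib.parse import urljoin

base = "http://www.espn.com/"
hrefs = [
    "/nfl/",                                  # relative link
    "https://www.espncricinfo.com/",          # already absolute
    "/college-sports/football/recruiting/",   # relative link
]

# urljoin resolves each href against the base URL
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```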

### Exercise

Use a list comprehension to get the text content of all the links in the page.

In [None]:
# your code here


And now create a list where we put together the text content and the URL for each link.

In [None]:
# your code here

#### Solution

In [None]:
text = [lnk.text.strip() for lnk in espn_soup.find_all("a")]
text

In [None]:
# Do not include empty pieces of text
text = [lnk.text.strip() for lnk in espn_soup.find_all("a") if len(lnk.text.strip())>0]
text

In [None]:
# Creating a list of tuples where we put together href and text for each link
list_tuples = [(lnk.get("href"), lnk.text.strip()) for lnk in espn_soup.find_all("a")]
list_tuples

In [None]:
# Creating a list of dictionaries with the text and URL for each link
list_dicts = [{"URL": lnk.get("href"), "Text": lnk.text.strip()} for lnk in espn_soup.find_all("a")]
list_dicts

In [None]:
import pandas as pd
pd.DataFrame(list_dicts)

### More Advanced Example: Get the list of headlines from ESPN


Now, let's examine how we can get the data from the website. The key is to understand the structure of the HTML, where the data that we need is stored, and how to fetch the elements. Then, using the appropriate `find_all` calls or CSS selectors, we will get what we want.

Let's start by fetching the page and parsing it:

In [20]:
# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"

# Add a user-agent, to pretend to be a browser, not a Python script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# get the html of that url
response = requests.get(url, headers=headers)

# Parse the web page
espn_soup = BeautifulSoup(response.text, 'html.parser')

Using Chrome, we right-click on the headlines and select `"Inspect"`.
This opens the developer tools, which show the HTML source for that part of the page.
There we see that the headlines all sit under a `<div class="headlineStack">` tag.

In [21]:
headlineNode = espn_soup.find_all('div', class_='headlineStack')

The result of that operation is a `ResultSet`, a list-like object; in this run it holds 6 elements.

In [22]:
type(headlineNode)

bs4.element.ResultSet

In [23]:
len(headlineNode)

6

Each headline is under a `<li><a href="...."></a>` tag.
So, we get all the `<li><a ...>` tags within the first `<div class="headlineStack">`
(the first element of the "`headlineNode`" variable).

In [26]:
headlines = headlineNode[0].find_all('li')
headlines = [li.find('a') for li in headlines]
len(headlines)

9

In [28]:
headlines[1]

<a class="" data-mptype="headline" href="/mlb/story/_/id/38750240/astros-dusty-baker-retires-26-seasons-mlb-manager" name="&amp;lpos=fp:feed:xx:coll:headlines:2">Retiring Baker: Not goodbye, just 'see you later'</a>
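
One caveat with this approach: `li.find('a')` returns `None` for any `<li>` that contains no link, which would break a later call such as `a.text`. The sketch below filters those out. It runs on hand-written markup that only mimics the structure described above; the HTML is invented and is not ESPN's real page.

```python
from bs4 import BeautifulSoup

# Invented markup with two headline boxes; one <li> has no link
html = """
<div class="headlineStack">
  <ul>
    <li><a href="/nfl/story/1">First headline</a></li>
    <li>Plain item without a link</li>
    <li><a href="/nba/story/2">Second headline</a></li>
  </ul>
</div>
<div class="headlineStack">
  <ul><li><a href="/mlb/story/3">Third headline</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

stacks = soup.find_all("div", class_="headlineStack")
first_stack_items = stacks[0].find_all("li")

# Keep only the <li> elements that actually contain a link
anchors = [li.find("a") for li in first_stack_items if li.find("a") is not None]
data = [{"Title": a.text.strip(), "URL": a.get("href")} for a in anchors]
print(data)
```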

Now we have the nodes with the content in the `headlines` variable.
We extract the text and the URL.

In [29]:
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines]
data

[{'Title': "Fins WR Hill says he's 'good,' will play vs. Pats",
  'URL': '/nfl/story/_/id/38750787/dolphins-wr-tyreek-hill-says-good-play-vs-patriots'},
 {'Title': "Retiring Baker: Not goodbye, just 'see you later'",
  'URL': '/mlb/story/_/id/38750240/astros-dusty-baker-retires-26-seasons-mlb-manager'},
 {'Title': 'Suns without stars Beal, Booker against Lakers',
  'URL': '/nba/story/_/id/38750754/suns-stars-bradley-beal-devin-booker-lakers'},
 {'Title': 'Masters unlikely to ease entry for LIV golfers',
  'URL': '/golf/story/_/id/38749880/masters-unlikely-change-qualifying-criteria-liv-golfers'},
 {'Title': "Sens' Pinto gets 41-game gambling suspension",
  'URL': '/nhl/story/_/id/38749754/sources-senators-shane-pinto-suspended-41-games-gambling'},
 {'Title': "Browns' Watson unsure how long injury will linger",
  'URL': '/nfl/story/_/id/38749890/deshaun-watson-unsure-how-long-shoulder-injury-linger'},
 {'Title': 'NCAA at Michigan for investigation, sources say',
  'URL': '/college-footb

And let's create our dataframe, so that we can have a better view

In [30]:
dataframe = pandas.DataFrame(data)
dataframe

Unnamed: 0,Title,URL
0,"Fins WR Hill says he's 'good,' will play vs. Pats",/nfl/story/_/id/38750787/dolphins-wr-tyreek-hi...
1,"Retiring Baker: Not goodbye, just 'see you later'",/mlb/story/_/id/38750240/astros-dusty-baker-re...
2,"Suns without stars Beal, Booker against Lakers",/nba/story/_/id/38750754/suns-stars-bradley-be...
3,Masters unlikely to ease entry for LIV golfers,/golf/story/_/id/38749880/masters-unlikely-cha...
4,Sens' Pinto gets 41-game gambling suspension,/nhl/story/_/id/38749754/sources-senators-shan...
5,Browns' Watson unsure how long injury will linger,/nfl/story/_/id/38749890/deshaun-watson-unsure...
6,"NCAA at Michigan for investigation, sources say",/college-football/story/_/id/38750516/ncaa-mic...
7,Messi named finalist for MLS newcomer award,/soccer/story/_/id/38750217/lionel-messi-named...
8,Latest NFL trade deadline buzz,/nfl/insider/story/_/id/38725576/nfl-week-8-tr...


#### Of course, there is always more than one way to skin a cat...

Alternatively, if we do not want to restrict ourselves to just the first headline box, we can write a single query that returns all the headlines: every `<a>` tag that sits under a `<li>` tag, which in turn sits under a `<div class="headlineStack">`.

In [31]:
headlines = espn_soup.select('div.headlineStack li a')
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines if a.has_attr('href')]
df = pandas.DataFrame(data)
df

Unnamed: 0,Title,URL
0,"Fins WR Hill says he's 'good,' will play vs. Pats",/nfl/story/_/id/38750787/dolphins-wr-tyreek-hi...
1,"Retiring Baker: Not goodbye, just 'see you later'",/mlb/story/_/id/38750240/astros-dusty-baker-re...
2,"Suns without stars Beal, Booker against Lakers",/nba/story/_/id/38750754/suns-stars-bradley-be...
3,Masters unlikely to ease entry for LIV golfers,/golf/story/_/id/38749880/masters-unlikely-cha...
4,Sens' Pinto gets 41-game gambling suspension,/nhl/story/_/id/38749754/sources-senators-shan...
5,Browns' Watson unsure how long injury will linger,/nfl/story/_/id/38749890/deshaun-watson-unsure...
6,"NCAA at Michigan for investigation, sources say",/college-football/story/_/id/38750516/ncaa-mic...
7,Messi named finalist for MLS newcomer award,/soccer/story/_/id/38750217/lionel-messi-named...
8,Latest NFL trade deadline buzz,/nfl/insider/story/_/id/38725576/nfl-week-8-tr...
9,NBA betting: Three picks for Thursday,/sports-betting/story/_/id/38749633/2023-nba-t...

A third option is to look up the `<a>` tags directly by their `data-mptype="headline"` attribute:


In [33]:
headlines = espn_soup.find_all('a', {'data-mptype': 'headline'})
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

Unnamed: 0,Title,URL
0,"Fins WR Hill says he's 'good,' will play vs. Pats",/nfl/story/_/id/38750787/dolphins-wr-tyreek-hi...
1,"Retiring Baker: Not goodbye, just 'see you later'",/mlb/story/_/id/38750240/astros-dusty-baker-re...
2,"Suns without stars Beal, Booker against Lakers",/nba/story/_/id/38750754/suns-stars-bradley-be...
3,Masters unlikely to ease entry for LIV golfers,/golf/story/_/id/38749880/masters-unlikely-cha...
4,Sens' Pinto gets 41-game gambling suspension,/nhl/story/_/id/38749754/sources-senators-shan...
5,Browns' Watson unsure how long injury will linger,/nfl/story/_/id/38749890/deshaun-watson-unsure...
6,"NCAA at Michigan for investigation, sources say",/college-football/story/_/id/38750516/ncaa-mic...
7,Messi named finalist for MLS newcomer award,/soccer/story/_/id/38750217/lionel-messi-named...
8,Latest NFL trade deadline buzz,/nfl/insider/story/_/id/38725576/nfl-week-8-tr...
9,NBA betting: Three picks for Thursday,/sports-betting/story/_/id/38749633/2023-nba-t...
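
On the ESPN page captured above, the CSS path query and the attribute query return the same ten rows, but the two are not interchangeable in general. The CSS path only matches links nested inside a `headlineStack` box, while the attribute query matches any `<a>` tag carrying `data-mptype="headline"`, wherever it sits. A sketch on invented markup, where the two differ:

```python
from bs4 import BeautifulSoup

# Invented markup: one headline link lives outside the headlineStack box
html = """
<div class="headlineStack">
  <ul>
    <li><a data-mptype="headline" href="/a">Alpha</a></li>
    <li><a data-mptype="headline" href="/b">Beta</a></li>
  </ul>
</div>
<div class="other">
  <a data-mptype="headline" href="/c">Gamma</a>
  <a href="/d">Not a headline</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS path: <a> under <li> under <div class="headlineStack">
by_path = [a.get("href") for a in soup.select("div.headlineStack li a")]

# Attribute lookup with find_all
by_attr = [a.get("href") for a in soup.find_all("a", {"data-mptype": "headline"})]

# The same attribute lookup written as a CSS selector
by_css_attr = [a.get("href") for a in soup.select('a[data-mptype="headline"]')]

print(by_path)
print(by_attr)
print(by_css_attr)
```

Which query is preferable depends on which part of the page structure is expected to stay stable over time.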


## In Class Example: Crawl BuzzFeed

* We will try to get the top articles that appear on BuzzFeed.
* We will grab the link for the article, the text of the title, and the editor.
* The results will be stored in a dataframe (we will see in detail what a dataframe is in a couple of modules).
* Let's also try to create an API that returns a structured JSON object with the results from BuzzFeed.


In [83]:
#your code here
import requests
resp = requests.get("http://www.buzzfeed.com")
buzzfeed = BeautifulSoup(resp.text, 'html.parser')


In [84]:
story_nodes = buzzfeed.find_all('li', {'aria-label': 'item', "role": "group"})

In [None]:
print(story_nodes[0].prettify())

In [37]:
len(story_nodes)


42



In [None]:
# @title Solution for Buzzfeed (as of October 23, 2023)

import requests # This command allows us to fetch URLs
import pandas
import re

# Let's start by fetching the page, and parsing it

resp = requests.get("http://www.buzzfeed.com")
buzzfeed = BeautifulSoup(resp.text, 'html.parser')

story_nodes = buzzfeed.find_all('li', {'aria-label': 'item', "role": "group"})

def parseStory(s):
    headline = s.find("h2").text.strip()
    link = s.find("a").get("href")
    category_node = s.find("span", class_="bold")

    if category_node:
      category_text = category_node.text
      if category_text.endswith("Trending"):
        category_text = category_text.replace("Trending", ", Trending")
    else:
      category_text = None

    # Let's search all the "span" nodes
    time_ago_nodes = s.find_all("span")
    # We will store in "time ago" the time the article was published
    # if we can find it. Otherwise it will remain empty
    time_ago = None
    for t in time_ago_nodes:
      # If the stripped text ends with " ago", it is the publication time
      if t.text.strip().endswith(" ago"):
        time_ago = t.text.strip()

    # Find the editor
    editor_node = s.find("div", class_="xs-text-6 text-gray xs-mt1")
    editor_text = None
    if editor_node:
      editor_text = editor_node.text[3:]


    entry = {"headline": headline,
             "URL": link,
             "category": category_text,
             "time_ago": time_ago,
             "editor": editor_text
             }

    return entry

data = [parseStory(article) for article in story_nodes]
df = pandas.DataFrame(data)
df
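
In the solution above, `s.find("h2").text` raises an `AttributeError` for any story node that lacks an `<h2>`, which can happen whenever the site changes its layout. One way to make such a parser more tolerant is a small lookup helper that returns `None` when nothing matches. The sketch below is an illustration only: the helper name `first_text` is hypothetical and the markup is invented, loosely following the selectors used in the solution.

```python
from bs4 import BeautifulSoup

def first_text(node, name, **kwargs):
    """Text of the first matching descendant, or None when nothing matches."""
    match = node.find(name, **kwargs)
    return match.get_text(strip=True) if match is not None else None

# Invented markup: the second item has no <h2> and no category
html = """
<ul>
  <li aria-label="item" role="group">
    <a href="/story-1"><h2> A headline </h2></a>
    <span class="bold">Quiz</span><span>2 hours ago</span>
  </li>
  <li aria-label="item" role="group"><a href="/promo">No heading here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
nodes = soup.find_all("li", {"aria-label": "item", "role": "group"})

rows = [
    {
        "headline": first_text(s, "h2"),
        "category": first_text(s, "span", class_="bold"),
        "URL": s.find("a").get("href") if s.find("a") is not None else None,
    }
    for s in nodes
]
print(rows)
```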

In [88]:
# @title Solution for Buzzfeed API (as of October 23, 2023)

def get_buzzfeed():

  resp = requests.get("http://www.buzzfeed.com")
  buzzfeed = BeautifulSoup(resp.text, 'html.parser')
  story_nodes = buzzfeed.find_all('li', {'aria-label': 'item', "role": "group"})

  entries = []
  for s in story_nodes:
    entry = parseStory(s)
    # Let's add the entry to the list
    entries.append(entry)

  return entries


buzzfeed = get_buzzfeed()



In [89]:
# @title Solution for Buzzfeed API (as of October 23, 2023)

!pip install -U -q flask pyngrok

from flask import Flask, render_template, jsonify
from pyngrok import ngrok

ngrok_authtoken = 'YOUR_NGROK_AUTHTOKEN'  # replace with your own ngrok authtoken; avoid sharing real tokens in notebooks
ngrok.set_auth_token(ngrok_authtoken)

port = 5000
app = Flask(__name__)
public_url = ngrok.connect(port).public_url

@app.route('/buzzfeed_api',  methods=['GET'])
def buzzfeed():

    list_of_articles = get_buzzfeed()
    api_results = {"total": len(list_of_articles), "articles": list_of_articles}

    return jsonify(api_results)

print(f" * Our page is at {public_url}/buzzfeed_api")
app.run(use_reloader=False, port=port)

ERROR: Cannot uninstall 'blinker'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

 * Our page is at https://f2a7-34-73-64-71.ngrok-free.app/buzzfeed_api
 * Serving Flask app '__main__'
 * Debug mode: off
 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [26/Oct/2023 23:38:41] "GET /buzzfeed_api HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [26/Oct/2023 23:38:42] "GET /favicon.ico HTTP/1.1" 404 -
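
To check the shape of the JSON response without starting a server or an ngrok tunnel, one option is Flask's built-in test client. The sketch below is a simplified stand-in for the cell above, not a replacement for it: `fake_get_buzzfeed` is a hypothetical function that returns fixed, made-up entries instead of crawling BuzzFeed.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def fake_get_buzzfeed():
    # Stand-in for get_buzzfeed(); returns fixed, made-up entries
    return [{"headline": "Example story", "URL": "/example"}]

@app.route("/buzzfeed_api", methods=["GET"])
def buzzfeed_api():
    articles = fake_get_buzzfeed()
    return jsonify({"total": len(articles), "articles": articles})

# The test client issues requests in-process; no server or tunnel is started
with app.test_client() as client:
    response = client.get("/buzzfeed_api")
    payload = response.get_json()

print(response.status_code, payload["total"])
```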
