<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/04-Web_Scraping/B-Crawling_HTML_Pages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawling and Extracting Data from Websites

In [None]:
from bs4 import BeautifulSoup

## Searching in HTML: Fetching the webpage title from ESPN.com

Let's start by trying to fetch the headlines from the site ESPN.com.



In [None]:
import requests # This command allows us to fetch URLs
import pandas # To create a dataframe

# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"

# Add a user-agent, to pretend to be a browser, not a Python script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# get the html of that url
response = requests.get(url, headers=headers)

# Parse the web page
espn_soup = BeautifulSoup(response.text, 'html.parser')

Let's start by getting the content of the `<title>` node from the site:

In [None]:
results = espn_soup.find('title')
results

In [None]:
# Now let's get the text of that node
results = espn_soup.find('title').string
results

### Exercise

* Connect to the NYU Stern website, and fetch the title of the page

In [None]:
# your code here

In [None]:
# @title Solution
stern_url = 'http://www.stern.nyu.edu'
stern_html = requests.get(stern_url).text
stern_soup = BeautifulSoup(stern_html, 'html.parser')

title = stern_soup.find('title').string
title

## Searching for elements of interest in the web page

Now, let's say that we are looking to retrieve *multiple* elements from a web page. For that we can use the `soup.find_all` command.

For example, to find all the `<a ...> ... </a>` tags in the returned html, which store the links in the page, we issue the command:

In [None]:
# Get all the <a ...> ... </a> elements, which are the links on the page
links = espn_soup.find_all("a")
len(links)

In [None]:
# Let's now pick one of the many links
lnk = links[80]
type(lnk.string)

To get the parts of the HTML element that we need, we can use the `get` method (e.g., to get the `href` attribute) and the `text` attribute (to get the text within the `<a>...</a>` tag).

In [None]:
lnk.get("href")

In [None]:
lnk.text

In [None]:
# The strip() removes blank spaces before and after the text
lnk.text.strip()
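
The difference between `.string`, `.text`, and `.get()` is easier to see on a tiny, made-up HTML snippet than on a live page. The sketch below is only an illustration: the HTML and the variable names are assumptions, not taken from ESPN.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet, used only for illustration
html = """
<a href="/nfl">NFL</a>
<a href="/nba"> NBA <span>scores</span></a>
<a>No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")
simple, nested, no_href = soup.find_all("a")

# .string is the text of a tag that has exactly one child
simple_string = simple.string          # 'NFL'
# With mixed content (text plus a <span>), .string is None
nested_string = nested.string          # None
# .text concatenates all the text inside the tag, including nested tags
nested_text = nested.text.strip()      # 'NBA scores'
# .get() returns None, instead of raising an error, for a missing attribute
missing_href = no_href.get("href")     # None

print(simple_string, nested_string, nested_text, missing_href)
```

This is why the examples here use `.text` together with `strip()`: it works even when the link contains nested tags.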

Let's put everything together:

In [None]:
links = espn_soup.find_all("a")

# Iterates over all the links (that is, all the <a> nodes
# returned by find_all) and prints the content
# of the attribute href and the text for that node
for link in links:
    print("==================================")
    print(link.get("href"), "==>", link.text.strip())

Now, let's revisit the _list comprehension_ approach that we discussed in the Python Primer session, for quickly constructing lists:

In [None]:
urls = [lnk.get("href") for lnk in espn_soup.find_all('a')]
urls

In [None]:
# You can safely skip the code below.
# A bit fancier: resolve relative URLs (those that do not include the domain)
# against http://www.espn.com/. urljoin leaves absolute URLs unchanged and
# avoids the double slash that plain string concatenation would produce.
from urllib.parse import urljoin

domain = "http://www.espn.com/"
urls = [
    urljoin(domain, lnk.get("href"))
    for lnk in espn_soup.find_all('a') if lnk.get("href")
]
urls
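
To see how relative and absolute `href` values resolve against a base URL, here is a small sketch using `urllib.parse.urljoin` from the standard library. The sample `href` values are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://www.espn.com/"

# Hypothetical href values of the kinds typically found in a page
hrefs = ["/nfl/scoreboard", "nba/standings", "https://www.espn.com/mlb/"]

# Relative hrefs are resolved against the base; absolute ones are kept as-is
resolved = [urljoin(base, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(href, "==>", full)
```

Note that an `href` starting with `/` does not end up with a double slash after the domain, which is what simple string concatenation with a base ending in `/` would give.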

### Exercise

Use a list comprehension to get the text content of all the links in the page.

In [None]:
# your code here


And now create a list that puts together the text content and the URL for each link.

In [None]:
# your code here

#### Solution

In [None]:
text = [lnk.text.strip() for lnk in espn_soup.find_all("a")]
text

In [None]:
# Do not include empty pieces of text
text = [lnk.text.strip() for lnk in espn_soup.find_all("a") if len(lnk.text.strip())>0]
text

In [None]:
# Creating a list of tuples where we put together href and text for each link
list_tuples = [(lnk.get("href"), lnk.text.strip()) for lnk in espn_soup.find_all("a")]
list_tuples

In [None]:
# Creating a list of dictionaries with the text and URL for each link
list_dicts = [{"URL": lnk.get("href"), "Text": lnk.text.strip()} for lnk in espn_soup.find_all("a")]
list_dicts

In [None]:
import pandas as pd
pd.DataFrame(list_dicts)

### More Advanced Example: Get the list of headlines from ESPN


Now, let's examine how we can get the data from the website. The key is to understand the structure of the HTML, where the data that we need is stored, and how to fetch the elements. Then, using the appropriate BeautifulSoup queries (`find`, `find_all`, or CSS selectors), we will get what we want.

Let's start by fetching the page, and parsing it

In [None]:
# Let's start by fetching the page, and parsing it
url = "http://www.espn.com/"

# Add a user-agent, to pretend to be a browser, not a Python script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# get the html of that url
response = requests.get(url, headers=headers)

# Parse the web page
espn_soup = BeautifulSoup(response.text, 'html.parser')

Using the `"Right-Click > Inspect"` option of Chrome,
we right-click on the headlines and select `"Inspect"`.
This opens the developer tools, showing the HTML source for that element.
There we see that all the headlines are under a `<div class="headlineStack">` tag.

In [None]:
headlineNode = espn_soup.find_all('div', class_='headlineStack')


The result of that operation is a list-like `ResultSet`. It had 8 elements when this notebook was written, but the count can change as the site changes.

In [None]:
type(headlineNode)

In [None]:
len(headlineNode)

Each headline is under a `<li><a href="...."></a>` tag.
So, we pick one of the `<div class="headlineStack">` nodes
(they are stored in the `headlineNode` variable) and get all the `<li><a ...>` tags within it.

In [None]:
headlines = headlineNode[1].find_all('li')
headlines = [li.find('a') for li in headlines]
len(headlines)

Now we have the nodes with the content in the `headlines` variable.
We extract the text and the URL.

In [None]:
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines]
data

And let's create our dataframe, so that we can have a better view:

In [None]:
dataframe = pandas.DataFrame(data)
dataframe

#### Of course, there is always more than one way to skin a cat...

Alternatively, if we did not want to restrict ourselves to a single headline box, we could write one query that returns all the headlines: every `<a>` tag that appears under an `<li>` tag, which in turn appears under a `<div class="headlineStack">`.

In [None]:
headlines = espn_soup.select('div.headlineStack li a')
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines if a.has_attr('href')]
df = pandas.DataFrame(data)
df
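
To check that the CSS selector and the step-by-step `find_all` approach pick the same nodes, we can try both on a small, made-up document. The HTML below only mimics the structure described above and is an assumption for illustration, not real ESPN markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the headline structure; not real ESPN markup
html = """
<div class="headlineStack">
  <ul>
    <li><a href="/story/1">First headline</a></li>
    <li><a href="/story/2">Second headline</a></li>
  </ul>
</div>
<div class="other">
  <ul><li><a href="/ad">Not a headline</a></li></ul>
</div>
<div class="headlineStack">
  <ul><li><a href="/story/3">Third headline</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step by step: every headlineStack div, then every <li>, then its <a>
nested = [
    li.find("a")
    for div in soup.find_all("div", class_="headlineStack")
    for li in div.find_all("li")
    if li.find("a") is not None
]

# One CSS selector: <a> inside <li> inside <div class="headlineStack">
selected = soup.select("div.headlineStack li a")

titles_nested = [a.text for a in nested]
titles_selected = [a.text for a in selected]
print(titles_selected)
```

Both approaches skip the link inside `<div class="other">`, because it is not under a `headlineStack` node.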

In [None]:
headlines = espn_soup.find_all('a', {'data-mptype': 'headline'})
data = [{"Title": a.text, "URL": a.get("href")} for a in headlines]
df = pandas.DataFrame(data)
df

## In Class Example: Crawl BuzzFeed

* We will try to get the top articles that appear on BuzzFeed.
* We will grab the link for the article, the text of the title, and the editor.
* The results will be stored in a dataframe (we will see in detail what a dataframe is in a couple of modules).


In [None]:
#your code here
import requests

resp = requests.get("http://www.buzzfeed.com")
buzzfeed = BeautifulSoup(resp.text, 'html.parser')

In [None]:
story_nodes = buzzfeed.find_all('li', {'aria-label': 'item', "role": "group"})

In [None]:
len(story_nodes)


In [None]:
# @title Solution for Buzzfeed (as of October 23, 2023)

import requests # This command allows us to fetch URLs
import pandas
import re

# Let's start by fetching the page, and parsing it

resp = requests.get("http://www.buzzfeed.com")
buzzfeed = BeautifulSoup(resp.text, 'html.parser')

articleNodes = buzzfeed.find_all('li', {'aria-label': 'item', "role": "group"})

def parseArticleNode(article):
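    # Guard against nodes that lack an <h2> or <a> tag
    headline_node = article.find("h2")
    link_node = article.find("a")
    headline = headline_node.text.strip() if headline_node else ""
    headline_link = link_node.get("href") if link_node else ""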

    editor_node = article.find("div", {"class": "xs-text-6 text-gray xs-mt1"})
    editor_text = editor_node.text if editor_node else ""

    regex = re.compile(r'^by (.*)$')
    matches = list(regex.finditer(editor_text))
    editor = matches[0].group(1) if len(matches)>0 else ""

    result = {
        "headline": headline,
        "URL" : headline_link,
        "editor" : editor
    }
    return result

data = [parseArticleNode(article) for article in articleNodes]
df = pandas.DataFrame(data)
df
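
The regular expression in the solution can be tried on its own, without fetching any page. The sketch below reuses the same pattern; the helper name `extract_editor` and the sample bylines are hypothetical, made up for illustration.

```python
import re

# Same pattern as in the solution: the text must start with "by "
byline_pattern = re.compile(r'^by (.*)$')

def extract_editor(text):
    """Return the name after 'by ', or an empty string if there is no match."""
    match = byline_pattern.search(text)
    return match.group(1) if match else ""

# Made-up bylines, for illustration only
samples = ["by Jane Doe", "Jane Doe", "  by Jane Doe  "]

# Applied to the raw text, and to the text after strip()
editors = [extract_editor(s) for s in samples]
stripped = [extract_editor(s.strip()) for s in samples]
print(editors)
print(stripped)
```

Because the pattern is anchored with `^`, a byline with leading whitespace does not match unless the text is stripped first.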