# Web Scraping With Python and Requests-HTML

In this Python for SEO tutorial, we will learn how to scrape a website with Python using the `Requests-HTML` library.

(Code example included)

@author: Jean-Christophe Chouinard: Technical SEO / Data Scientist > [LinkedIn](https://www.linkedin.com/in/jeanchristophechouinard/) > [@ChouinardJC](https://twitter.com/ChouinardJC) > Blog > [jcchouinard.com](https://www.jcchouinard.com/) > Complete Tutorial > [Web Scraping With Python and Requests-HTML](https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/)

## What is Web Scraping?
*Web scraping* is the process of fetching a webpage and *parsing* its content to extract specific information.

*Parsing* means that you analyze a document to describe the syntax (i.e. the HTML structure). Without a *parser*, your HTML document will look like a single block of text.

When you are scraping a website, you are asking the server to send you an HTML document that you *parse* to understand the building blocks (`<head>`,`<body>`,`<title>`,`<h1>`, etc.). Once the structure is understood, you can pull out any information that you want.
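
To make the idea of *parsing* concrete, here is a minimal sketch that uses Python's built-in `html.parser` module instead of `Requests-HTML`. The `TagCollector` class and the sample HTML string are made up for this illustration; the point is only to show how a parser turns a block of text into named building blocks.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag and the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that sits between <title> and </title>
        if self._in_title:
            self.title += data


# Made-up document, used only for this illustration
raw_html = (
    "<html><head><title>Example page</title></head>"
    "<body><h1>Hello</h1></body></html>"
)

parser = TagCollector()
parser.feed(raw_html)

print(parser.tags)   # the structure that the parser recognized
print(parser.title)  # the text pulled out of the <title> tag
```

Without the parser, `raw_html` is just one string; after parsing, we can ask for a specific piece such as the title.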

## What is the Requests-HTML Library?

The `requests-HTML` library is an HTML parser that lets you use *CSS Selectors* and *XPath Selectors* to extract the information that you want from a web page.


## Install and load Libraries

In this tutorial, we will use the `requests` library to "call" the URL by making HTTP requests to servers, the `requests-HTML` library to *parse* the data, and the `pandas` library to work with the scraped information. 

In [None]:
!pip install requests
!pip install requests-HTML
!pip install pandas
# `re` and `urllib.parse` are part of the Python standard library and need no install

## Call the URL With requests.get()

Create a session with `HTMLSession()`, then use its `.get()` method to request the URL that you want to scrape.

To catch request failures, I will wrap the call in a `try` and `except` statement that prints the error if the request does not work.

We will store the response in a variable called `response`.

In [None]:
import requests
from requests_html import HTMLSession

url = "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"

try:
    session = HTMLSession()
    response = session.get(url)
    
except requests.exceptions.RequestException as error:
    print(error)

## Structure of Scraping Functions

The structure of the `requests-HTML` parsing call goes like this:

`variable.attribute.function(*selector*, parameters)`

The `variable` is the instance that you created using the `.get(url)` function.

The `attribute` is the type of content that you want to extract (`html` / `lxml`).

The `requests-HTML` parser also has many useful built-in attributes and methods for SEOs.

* **links**: Get all links found on a page, as they appear in the HTML (anchors excluded);
* **absolute_links**: Get all links found on a page, in absolute form (anchors excluded);
* **find()**: Find elements on a page with a CSS Selector;
* **xpath()**: Find elements on a page with an XPath expression.
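
To show what "absolute form" means for a link, here is a small sketch that uses `urljoin` and `urldefrag` from the standard library. It is not the library's internal code, and the page URL and `href` values are hypothetical; it only illustrates how relative links can be resolved against the page URL and how anchors (`#...`) can be dropped.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical page URL and href values, for illustration only
page_url = "https://www.example.com/blog/post/"
raw_hrefs = [
    "/about/",
    "contact.html",
    "https://other.example.org/",
    "/about/#team",
]

absolute = set()
for href in raw_hrefs:
    full_url = urljoin(page_url, href)          # resolve against the page URL
    full_url, _fragment = urldefrag(full_url)   # drop any "#anchor" part
    absolute.add(full_url)

print(sorted(absolute))
```

Because the result is a set, `/about/` and `/about/#team` collapse into a single URL.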

### Extract the Title From the Page

Here, we are going to use `find()` with the `html` attribute to "find" the `<title>` tag using the `'title'` *CSS Selector* and return a list of elements (`[<Element 'title' >]`).

In [None]:
title =  response.html.find('title')
print(title)

To print the actual title, we need to select the first item of the list by its index and read its `text` attribute.

In [None]:
print(title[0].text)

This is the same as passing the `first=True` parameter to `find()`, which lets us do everything in a one-liner.

In [None]:
title =  response.html.find('title', first=True).text
print(title)

### Extract Meta Description

To extract the meta description from a page, we will use the `xpath()` function with the `//meta[@name="description"]/@content` XPath expression, which returns a list of matching values.

In [None]:
meta_desc =  response.html.xpath('//meta[@name="description"]/@content')
print(meta_desc)

### Extract All Links From a Webpage

In [None]:
links = response.html.absolute_links
print(links)

### Extract Information Using Class or ID

You can extract any specific information from a page using the dot (`.`) notation to select a class, or the pound (`#`) notation to select the ID.

Here we are going to extract the author name using the class.

In [7]:
author = response.html.find('.post-author', first=True).text
print(author)

Hamlet Batista


### Extract Canonical Link

In [8]:
canonical = response.html.xpath("//link[@rel='canonical']/@href")
print(canonical)

['https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/']


### Extract Hreflang

In [9]:
hreflang = response.html.xpath("//link[@rel='alternate']/@hreflang")
print(hreflang)

[]


### Extract Meta Robots

In [10]:
meta_robots = response.html.xpath("//meta[@name='ROBOTS']/@content")
print(meta_robots)

['NOODP']
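
The XPath calls above return lists, even when there is only one value (canonical) or none at all (hreflang). Here is a minimal sketch of a helper that unwraps such a list safely. The name `first_or_none` is my own, and the sample values are copied from the outputs shown above.

```python
def first_or_none(values):
    """Return the first item of a list, or None when the list is empty."""
    return values[0] if values else None


# Values copied from the cell outputs shown above
canonical = ['https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/']
hreflang = []
meta_robots = ['NOODP']

print(first_or_none(canonical))
print(first_or_none(hreflang))
print(first_or_none(meta_robots))
```

Indexing with `hreflang[0]` would raise an `IndexError` on the empty list, which is why the helper checks first.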


### Extract Nested Information

To extract information within a specific location you can dig down the DOM using CSS Selectors.

In [11]:
get_nav_links = response.html.find('a.sub-m-cat span')

We will build a for loop to loop through all the indices in the `get_nav_links` list and add the text of each element to another list called `nav_links`.

In [12]:
nav_links = []

for i in range(len(get_nav_links)):
    x = get_nav_links[i].text
    nav_links.append(x)
    
nav_links

['SEO', 'PPC', 'CONTENT', 'SOCIAL', 'NEWS', 'ADVERTISE', 'MORE']
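
The same loop can also be written as a list comprehension. In this sketch, `FakeElement` is a hypothetical stand-in for the elements returned by `find()`, so that the example runs without calling the website; only the `.text` attribute is modelled.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element; only `.text` is modelled."""
    text: str


# Stand-in for the list returned by response.html.find('a.sub-m-cat span')
get_nav_links = [FakeElement("SEO"), FakeElement("PPC"), FakeElement("CONTENT")]

# Read the text of every element in one line
nav_links = [element.text for element in get_nav_links]
print(nav_links)
```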

### Save a Subsection of a Page in a Variable

If the content that you want to extract is always in a specific element (a `<div>` or an `<article>`, for example), you can save that element in a variable and run your queries on it.

Here, I will extract links that are in the actual content of a post by "saving" the `post-342779` article in a variable called `article`.

In [13]:
article = response.html.find('article.cis_post_item_initial.post-342779', first=True)
article_links = article.xpath('//a/@href')

## Case Study: Extract Broken Links

In [15]:
import re
import requests
from requests_html import HTMLSession
from urllib.parse import urlparse

# Get Domain Name With urlparse
url = "https://www.jobillico.com/fr/partenaires-corporatifs"
parsed_url = urlparse(url)
domain = parsed_url.scheme + "://" + parsed_url.netloc

# Get URL 
session = HTMLSession()
r = session.get(url)

# Extract Links
jlinks = r.html.xpath('//a/@href')

# Remove email, JavaScript and phone links; turn relative paths into absolute URLs
updated_links = []

for link in jlinks:
    if re.search(r"@|javascript:|tel:", link):
        continue  # skip links that cannot be requested over HTTP
    if not link.startswith("http"):
        link = domain + link  # assumes the path starts with "/"
    updated_links.append(link)
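
A cleaning step like this one can also be written without regular expressions. The sketch below is an alternative, not the method used in this tutorial: it resolves every `href` with `urljoin` (which also handles relative paths that do not start with `/`) and keeps only `http` and `https` URLs. The function name `clean_links` and the sample `href` values are made up for the example.

```python
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, page_url):
    """Resolve hrefs against page_url and keep only http(s) URLs, in order."""
    cleaned = []
    for href in hrefs:
        absolute = urljoin(page_url, href.strip())
        # mailto:, javascript: and tel: links have a different scheme
        if urlparse(absolute).scheme in ("http", "https"):
            cleaned.append(absolute)
    return cleaned


page_url = "https://www.jobillico.com/fr/partenaires-corporatifs"

# Made-up href values, for illustration only
sample_hrefs = [
    "/some/path",
    "mailto:info@example.com",
    "javascript:void(0)",
    "tel:+15555550100",
    "https://www.example.com/page",
    "contact",
]

print(clean_links(sample_hrefs, page_url))
```

Filtering on the scheme avoids dropping valid URLs that simply contain an `@` character.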

In [None]:
print(updated_links)

In [None]:
broken_links = []

for link in updated_links:
    print(link)
    try:
        # Request each link once and reuse the status code
        status_code = requests.get(link, timeout=10).status_code
        if status_code != 200:
            broken_links.append(link)
    except requests.exceptions.RequestException as e:
        print(e)

broken_links
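
To try the broken link logic without sending real requests, the check can be wrapped in a function that receives the "get status" step as a parameter. This is a sketch under my own assumptions: `find_broken_links`, `get_status` and the `fake_statuses` table are hypothetical names, and the URLs are placeholders.

```python
def find_broken_links(links, get_status):
    """Return the links whose status is not 200, checking each link once.

    `get_status` is any callable that takes a URL and returns a status code,
    or raises an exception when the request fails.
    """
    broken = []
    for link in links:
        try:
            status = get_status(link)
        except Exception as error:  # broad on purpose: the callable is unknown
            print(f"{link}: {error}")
            continue
        if status != 200:
            broken.append(link)
    return broken


# Hypothetical status table standing in for real HTTP requests
fake_statuses = {
    "https://example.com/ok": 200,
    "https://example.com/gone": 404,
}

print(find_broken_links(list(fake_statuses), fake_statuses.get))
```

With real requests, `get_status` could be `lambda link: requests.get(link, timeout=10).status_code`.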

## Full Code

In [None]:
import pandas as pd
import requests
from requests_html import HTMLSession


url = "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"

try:
    session = HTMLSession()
    response = session.get(url)
except requests.exceptions.RequestException as error:
    print(error)

    
# Get Title
title =  response.html.find('title', first=True).text

#Get H1
h1 =  response.html.find('h1', first=True).text

#Get all Links
links = response.html.absolute_links

#Get Author using Class
author = response.html.find('.post-author', first=True).text

#Get Canonical Link
canonical = response.html.xpath("//link[@rel='canonical']/@href")

#Get Hreflang
hreflang = response.html.xpath("//link[@rel='alternate']/@hreflang")

#Get Meta Robots
meta_robots = response.html.xpath("//meta[@name='ROBOTS']/@content")

#Get Navigational links using nested CSS Selector and For Loops
get_nav_links = response.html.find('a.sub-m-cat span')

nav_links = []

for i in range(len(get_nav_links)):
    x = get_nav_links[i].text
    nav_links.append(x)
    
nav_links

#Create a variable to extract data from the actual article only.
article = response.html.find('article.cis_post_item_initial.post-342779', first=True)
article_links = article.xpath('//a/@href')
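
The `pandas` library is imported above to work with the scraped information. Here is one possible way to do that, as a sketch: store the values in a dictionary and load them into a one-row `DataFrame`. The values are copied from the cell outputs shown earlier in this tutorial, and the `flatten` helper is my own addition to turn lists into plain text.

```python
import pandas as pd

# Values copied from the cell outputs shown earlier in this tutorial
page_data = {
    "url": "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/",
    "author": "Hamlet Batista",
    "canonical": ["https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"],
    "hreflang": [],
    "meta_robots": ["NOODP"],
}


def flatten(value):
    """Join list values into one string so that each cell holds plain text."""
    return ", ".join(value) if isinstance(value, list) else value


row = {key: flatten(value) for key, value in page_data.items()}
df = pd.DataFrame([row])

# Transpose to read the single row as a column of field / value pairs
print(df.T)
```

Scraping more pages would add one dictionary per page to the list passed to `pd.DataFrame()`.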
