We start by importing needed libs

In [24]:
import lxml.html as parser
import requests
import csv
from urllib.parse import urlsplit, urljoin

Our starting page will be of a search I made on submarino about the Moto G phone.
The first thing is to download the page and parse it with lxml

In [25]:
start_url = "https://www.submarino.com.br/busca/?conteudo=moto%20g&filtro=%5B%7B%22id%22%3A%22category_breadcrumb_name_level_pt_suba_1%22%2C%22value%22%3A%22Celulares%20e%20Smartphones%22%7D%2C%7B%22id%22%3A%22category_breadcrumb_name_level_pt_suba_2%22%2C%22value%22%3A%22Moto%20G%22%7D%5D&ordenacao=moreRelevant&origem=nanook"
r = requests.get(start_url)
html = parser.fromstring(r.text)

Now we inspect our product's links with our browser:

![title](images/product.png)

There are a few things to note from this image:

1. All the product anchor tag's have the same class name which is not used anywhere else
2. Links are relative
3. The items page link is full of queries that are irelevant for the routing(probably some analytics/marketing stuff)

Thus, we need xpath to extract the links and tinker a bit with the links for later usage

In [26]:
links = html.xpath("//a[@class='card-product-url']/@href")
print(len(links))
print(links[:3])

24
['/produto/123683569?pfm_carac=moto%20g&pfm_index=0&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20', '/produto/131349704?pfm_carac=moto%20g&pfm_index=1&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20', '/produto/131349641?pfm_carac=moto%20g&pfm_index=2&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20']


In [27]:
base_url = "https://www.submarino.com.br"
links = [urljoin(base_url, l) for l in links]
print(links[:3])
links = [urlsplit(l)._replace(query="").geturl() for l in links]
print(links[:3])

['https://www.submarino.com.br/produto/123683569?pfm_carac=moto%20g&pfm_index=0&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20', 'https://www.submarino.com.br/produto/131349704?pfm_carac=moto%20g&pfm_index=1&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20', 'https://www.submarino.com.br/produto/131349641?pfm_carac=moto%20g&pfm_index=2&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20']
['https://www.submarino.com.br/produto/123683569', 'https://www.submarino.com.br/produto/131349704', 'https://www.submarino.com.br/produto/131349641']


Now that we have figured a way to get our product's links, we need to handle pagination

By inspecting the link to load more pages we find this:

![title](images/next_page.png)

Using this information, we can get the next page's link

In [28]:
next_page = urljoin(base_url, html.xpath("//div[@class='card card-pagination']/a/@href")[0])
next_page

'https://www.submarino.com.br/busca/?conteudo=moto g&filtro=[{"id":"category_breadcrumb_name_level_pt_suba_1","value":"Celulares e Smartphones"},{"id":"category_breadcrumb_name_level_pt_suba_2","value":"Moto G"}]&ordenacao=moreRelevant&origem=nanook&limite=24&offset=24'

Now let's check everything is the same on the next page

In [29]:
r = requests.get(next_page)
next_page_html = parser.fromstring(r.text)
next_page_links = html.xpath("//a[@class='card-product-url']/@href")
print(len(next_page_links))
print(next_page_links[0])
third_page = urljoin(base_url, next_page_html.xpath("//div[@class='card card-pagination']/a/@href")[0])
print(third_page)

24
/produto/123683569?pfm_carac=moto%20g&pfm_index=0&pfm_page=search&pfm_pos=grid&pfm_type=search_page%20
https://www.submarino.com.br/busca/?conteudo=moto g&filtro=[{"id":"category_breadcrumb_name_level_pt_suba_1","value":"Celulares e Smartphones"},{"id":"category_breadcrumb_name_level_pt_suba_2","value":"Moto G"}]&limite=24&offset=0&ordenacao=moreRelevant&origem=nanook


The links extraction part seems to be fine, but there is something weird with the next page extraction

If we take a closer look we will see two query parameters "limite" and "offset", this time they are &limite=24&offset=0, whereas the second page url has &limite=24&offset=24 as params, it seems this link is back at the first page.

Let's check our links extraction, maybe on the second page we have two divs with 'card card-pagination' as classes

In [30]:
next_page_html.xpath("//div[@class='card card-pagination']/a/@href")

['/busca/?conteudo=moto g&filtro=[{"id":"category_breadcrumb_name_level_pt_suba_1","value":"Celulares e Smartphones"},{"id":"category_breadcrumb_name_level_pt_suba_2","value":"Moto G"}]&limite=24&offset=0&ordenacao=moreRelevant&origem=nanook',
 '/busca/?conteudo=moto g&filtro=[{"id":"category_breadcrumb_name_level_pt_suba_1","value":"Celulares e Smartphones"},{"id":"category_breadcrumb_name_level_pt_suba_2","value":"Moto G"}]&limite=24&offset=48&ordenacao=moreRelevant&origem=nanook']

As suspected there is a previous page link and a next page link. The next one is always the last one.

We can use this information do write a simple loop that will extract all the product item's links for this search

In [31]:
# First a simple url cleaning method
def clean_url(url):
    return urlsplit(urljoin(base_url, url))._replace(query="").geturl()
# Initialize our links array
links = []
# First page scraping
r = requests.get(start_url)
h = parser.fromstring(r.text)
links += h.xpath("//a[@class='card-product-url']/@href")
# Next page on the starting page is at 0 index, there is no previous page
next_page = urljoin(base_url, h.xpath("//div[@class='card card-pagination']/a/@href")[0])
while next_page:
    r = requests.get(next_page)
    h = parser.fromstring(r.text)
    links += h.xpath("//a[@class='card-product-url']/@href")
    try:
        next_page = urljoin(base_url, h.xpath("//div[@class='card card-pagination']/a/@href")[1])
    # If we get a Index Error, we are at the last page and there is only a previous page link, break the loop
    except IndexError as e: 
            next_page = None
            
# Clean links
links = [clean_url(l) for l in links]

In [32]:
print(len(links))
print(links[:3])

70
['https://www.submarino.com.br/produto/123683569', 'https://www.submarino.com.br/produto/131349704', 'https://www.submarino.com.br/produto/131349641']


As we can see, it seems we have all links for this search, let's now investigate the product name and price extraction

In [33]:
r = requests.get("https://www.submarino.com.br/produto/22339133/moto-g-5-geracao-plus-32gb-ouro?condition=NEW")
product_html = parser.fromstring(r.text)

As before we start by inspecting the html to find where the product's name and price lie and how we could write some xpath expressions to extract them

![title](images/name.png)

![title](images/price.png)

They seem both very straight forward, we should only need to do a little bit of string manipulation to get the price as a float

In [34]:
name = product_html.xpath("//h1[@class='product-name']/text()")[0]
price_str = product_html.xpath("//p[@class='sales-price']/text()")[0]
price = float(price_str[3:].replace(".", "").replace(",", "."))
print("Name: %s, Price: %s" % (name, str(price)))

Name: Moto G (5ª Geração) Plus 32gb - Ouro, Price: 2999.0


And here we finish our exploration of the pages we want to scrape as we have all the information needed to write a good crawler for this website