## *Practice and study guide for web scraping with requests*

In [61]:
# Import the required libraries

import requests 
import lxml.html as html
import pandas as pd

In [24]:
url_ppal = 'https://books.toscrape.com/index.html' # Website with a catalogue of books.

### List of XPath expressions

In [25]:
# Get the links to the book categories
links_categorias = '//ul[@class="nav nav-list"]/li//ul/li/a/@href'

# Get the book titles
titulos = '//article[@class="product_pod"]//h3/a/text()'

# Get the price of each book
precios = '//li//div/p[@class="price_color"]/text()'

In [26]:
# XPath snippets for the browser console.
# $x('//article[@class="product_pod"]//h3/a/text()').map(x => x.textContent)
# $x('//li//div/p[@class="price_color"]').map(x => x.textContent)

### Requests on the book category URLs

In [27]:
# Store the HTTP response returned after sending the GET request to the site's server.
# requests.get() is the requests function that sends a GET request to a URL;
# the URL to fetch is passed as its argument.

r = requests.get(url_ppal)

# Print the status code of the response to the GET request.
# It reports the outcome of the request: 200 for success, 404 for not found, 500 for an internal server error, etc.
print(r.status_code)

200
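
As a small aside on the status codes mentioned in the comment above, the standard library's `http.HTTPStatus` enum maps each code to its official phrase. This is only an illustrative sketch; `describe_status` is a helper name invented here and is not part of the notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

# The three codes mentioned in the comment above.
for code in (200, 404, 500):
    print(describe_status(code))
```

With requests itself, calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which avoids comparing the code by hand.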


In [28]:
# Decode explicitly as UTF-8 to avoid problems with special characters
home = r.content.decode('utf-8') 

# Parse the HTML held in home. The resulting 'parser' object makes it easy to navigate the page
# and to extract and manipulate its information later.

parser = html.fromstring(home)

# Store the list of URLs returned by running the XPath query on the parser object.

categorias_url = parser.xpath(links_categorias)
print(categorias_url)

['catalogue/category/books/travel_2/index.html', 'catalogue/category/books/mystery_3/index.html', 'catalogue/category/books/historical-fiction_4/index.html', 'catalogue/category/books/sequential-art_5/index.html', 'catalogue/category/books/classics_6/index.html', 'catalogue/category/books/philosophy_7/index.html', 'catalogue/category/books/romance_8/index.html', 'catalogue/category/books/womens-fiction_9/index.html', 'catalogue/category/books/fiction_10/index.html', 'catalogue/category/books/childrens_11/index.html', 'catalogue/category/books/religion_12/index.html', 'catalogue/category/books/nonfiction_13/index.html', 'catalogue/category/books/music_14/index.html', 'catalogue/category/books/default_15/index.html', 'catalogue/category/books/science-fiction_16/index.html', 'catalogue/category/books/sports-and-games_17/index.html', 'catalogue/category/books/add-a-comment_18/index.html', 'catalogue/category/books/fantasy_19/index.html', 'catalogue/category/books/new-adult_20/index.html', 

In [29]:
# The URLs are relative paths, so I turn them into full addresses by prepending the site root.

root_url = 'https://books.toscrape.com/'
categorias_url = [root_url + x for x in categorias_url]

In [30]:
# Print an arbitrary URL to check that the transformation worked.
print(categorias_url[5])

https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html
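
Prepending the root works here because the hrefs on the main page are relative to the site root. A more general option, sketched below, is `urllib.parse.urljoin` from the standard library, which resolves a relative href against the URL of the page it was found on. The first href comes from the output above; the second one, with its `../` segments, is a hypothetical example of a link found on a category page.

```python
from urllib.parse import urljoin

page_url = 'https://books.toscrape.com/index.html'
href = 'catalogue/category/books/philosophy_7/index.html'

# Resolve the href against the page it was scraped from.
category_url = urljoin(page_url, href)
print(category_url)

# Hypothetical href with parent-directory segments, as might appear on a category page.
book_href = '../../../sophies-world_966/index.html'
book_url = urljoin(category_url, book_href)
print(book_url)
```

Plain string concatenation would leave the `../` segments in the second URL, whereas `urljoin` collapses them.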


In [31]:
# Request the category at index 5 (Philosophy), whose first title I know is "Sophie's World".

prueba = requests.get(categorias_url[5])
home = prueba.content.decode('utf-8')
parser = html.fromstring(home)
resultado = parser.xpath(titulos)
print(resultado[0])

Sophie's World


---

Extra explanation: let's break down the logic of this code step by step.

**r = requests.get(url_ppal):**

This line sends an HTTP GET request to the URL specified in the url_ppal variable.
It retrieves the web page's content, including HTML, and stores it in the r variable.

**home = r.content.decode('utf-8'):**

Here we decode the response body from raw bytes into a Python string.
The content attribute of r holds the body of the response as bytes, exactly as the server sent it.
Decoding those bytes as UTF-8 ensures that non-ASCII characters, such as the £ sign in the prices, are interpreted correctly.
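
To make the decoding step concrete, here is a minimal sketch that needs no network access. It encodes the price string seen later in the notebook into bytes, standing in for what `r.content` would hold, and decodes it with two different codecs.

```python
# Bytes as they would arrive in r.content for the text "£15.94".
raw = "£15.94".encode("utf-8")

good = raw.decode("utf-8")    # the intended text
bad = raw.decode("latin-1")   # each byte read as one character, which garbles the pound sign

print(raw)                  # b'\xc2\xa315.94'
print(len(good), len(bad))  # 6 7
```

The pound sign takes two bytes in UTF-8, so reading the bytes with a single-byte codec yields an extra stray character in front of it.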

**parser = html.fromstring(home):**

We parse the string with the fromstring function from the lxml.html module (imported as html).
It returns the root element of the parsed document tree, which lets us navigate the HTML and extract data from it.

**titulo = parser.xpath(titulos):**

Using the xpath method of the parser, we search for nodes that match the XPath expression stored in titulos.
That expression selects the text of each link inside the h3 heading of every product_pod article.
The matches come back as a list of strings, which is stored in the titulo variable.

**print(titulo):**

Finally, we print the extracted titles to the console.

In summary, this code retrieves a web page, decodes its content, creates an HTML parser, extracts specific data (titles) from the HTML using an XPath expression, and prints the extracted data to the console. It's a common pattern for web scraping and data extraction tasks.


---

### Brief concepts on XPath:

XPath is a language for navigating and querying XML or HTML documents. In web scraping, it locates specific elements on a web page based on their attributes or structure. Here's how it works step by step:

**Document structure:**

XPath treats an HTML or XML document as a tree, with elements (tags) as nodes and the relationships between elements as branches.

**Element selection:**

You write an XPath expression, which works like a query, to pinpoint the element(s) you want to extract.
For example, to select all div elements with a specific class attribute, you write an expression that targets those elements.

**XPath path:**

XPath expressions follow a path-like syntax that mirrors the hierarchy of elements in the document. Here's a breakdown:

//: Selects matching nodes at any depth. At the start of an expression, it searches the entire document.
div: Selects all div elements.
[@class="your-class"]: Keeps only the elements whose attribute has the given value, such as class="your-class".

**Results:**

When you apply the XPath expression to the document, it returns a list of matching nodes.
For example, an expression that selects div elements with a specific class returns a list of all those div elements.

**Data extraction:**

You can further extract data from the selected elements, such as text content, attributes, or nested elements, as needed.
In this notebook, the text of interest is the content of the a elements inside each h3 heading, which holds the book titles.

In summary, XPath is a powerful tool for navigating the hierarchical structure of HTML or XML documents. It lets you target specific elements based on their attributes or position in the tree, which makes it essential for web scraping tasks such as extracting data from web pages.
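
The pieces described above can be tried on a tiny document without touching the network. The sketch below uses lxml, as the rest of the notebook does; the HTML snippet and the class names in it are invented for illustration.

```python
import lxml.html as html

# Invented document: two "item" divs at different depths and one "other" div.
document = html.fromstring("""
<html><body>
  <div class="item">First</div>
  <section><div class="item">Second</div></section>
  <div class="other">Third</div>
</body></html>
""")

all_divs = document.xpath('//div')                          # every div, at any depth
item_divs = document.xpath('//div[@class="item"]')          # filtered by attribute value
item_texts = document.xpath('//div[@class="item"]/text()')  # text nodes of the matches
direct_texts = document.xpath('/html/body/div/text()')      # only divs directly under body

print(len(all_divs), len(item_divs))
print(item_texts)
print(direct_texts)
```

The last query shows the difference between `/` and `//`: a single slash only steps to direct children, so the div nested inside section is not matched.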




**User**

This code does not work now, but it used to work fine.

links_para_ciclo = get_urls(url_ppal)

Gives me this error:

NameError: name 'get_urls' is not defined

**ChatGPT**

The error message "NameError: name 'get_urls' is not defined" indicates that the Python function get_urls is not recognized or defined in your current code. This error typically occurs when you try to call a function that hasn't been defined or imported.

To fix this issue, make sure that the get_urls function is defined and available in your code. Here are some steps to resolve the problem:

**Check function definition:** First, ensure that a function named get_urls is defined somewhere in your code. It should look something like this:

def get_urls(url):
    # Function implementation here
    return urls

**Import modules:** If the get_urls function is defined in an external module or script, make sure you have imported that module at the beginning of your code using an import statement. For example:

from my_module import get_urls

Replace my_module with the actual name of the module where get_urls is defined.

**Function call:** Ensure that you are calling the get_urls function correctly with the appropriate arguments. Double-check that the function call matches the function signature.

If you can provide more details about your code or share the relevant parts of it, I can offer more specific guidance on how to resolve the issue.
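
In this notebook the helper is defined further down under the name fc_get_urls, so calling get_urls fails simply because no object with that name exists in the session. The sketch below reproduces the error offline; the body of fc_get_urls here is a stand-in that returns a fixed list instead of scraping.

```python
def fc_get_urls(url):
    """Stand-in for the notebook's helper: returns a fixed list instead of scraping."""
    return [url + 'catalogue/category/books/travel_2/index.html']

try:
    # No function called get_urls was ever defined, so Python raises NameError.
    links = get_urls('https://books.toscrape.com/')
except NameError as exc:
    error_message = str(exc)
    print(error_message)

# Calling the function by the name it was actually defined under works.
links = fc_get_urls('https://books.toscrape.com/')
print(links)
```

The same error also appears after a kernel restart if the cell that defines the function has not been run again.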







---

### Requests on the title of each book

In [33]:
r = requests.get(url_ppal)
home = r.content.decode('utf-8') 
parser = html.fromstring(home)
titulo = parser.xpath(titulos)
print(titulo)

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]
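
Several titles in the output above end in "..." because that is the shortened link text shown on the page. Assuming the site keeps the full title in the title attribute of each link (an assumption about the page's markup, not something shown in this notebook), a variant of the XPath expression could read that attribute instead. The snippet below is a hand-written imitation of that markup, so it runs offline.

```python
import lxml.html as html

# Hand-written imitation of the assumed markup: short link text, full title attribute.
snippet = """
<html><body>
  <article class="product_pod">
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
  <article class="product_pod">
    <h3><a href="#" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>
  </article>
</body></html>
"""

parser = html.fromstring(snippet)

short_titles = parser.xpath('//article[@class="product_pod"]//h3/a/text()')
full_titles = parser.xpath('//article[@class="product_pod"]//h3/a/@title')

print(short_titles)
print(full_titles)
```

Only the last step of the expression changes: `/text()` selects the text inside the link, while `/@title` selects the value of its title attribute.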


### Requests on the price of each book

In [34]:
# Check the price of Sophie's World (£15.94)

r = requests.get(categorias_url[5])
home = r.content.decode('utf-8') 
parser = html.fromstring(home)
precio = parser.xpath(precios)
print(precio[0])

£15.94


### Building the scraping functions

In [44]:
def fc_get_urls(url_ppal):
    # Return the absolute URL of every book category linked from the main page.
    root_url = 'https://books.toscrape.com/'
    links_categorias = '//ul[@class="nav nav-list"]/li//ul/li/a/@href'

    r = requests.get(url_ppal)
    home = r.content.decode('utf-8')
    parser = html.fromstring(home)

    # The hrefs are relative, so prepend the site root to each one.
    categorias_url = parser.xpath(links_categorias)
    categorias_url = [root_url + x for x in categorias_url]

    return categorias_url

In [45]:
def fc_parser_content(url):
    diccionario = {}  # Empty dictionary that will hold the titles and prices extracted from the page.

    r = requests.get(url)

    # Only decode and parse the body when the request succeeded.
    if r.status_code == 200:
        home = r.content.decode('utf-8')
        parser = html.fromstring(home)

        titulos = '//article[@class="product_pod"]//h3/a/text()'
        diccionario['Título'] = parser.xpath(titulos)

        precios = '//li//div/p[@class="price_color"]/text()'
        diccionario['Precio'] = parser.xpath(precios)
    return diccionario

In [37]:
# Check that it works for the first 20 titles on the main page.
fc_parser_content(url_ppal)

{'Título': ['A Light in the ...',
  'Tipping the Velvet',
  'Soumission',
  'Sharp Objects',
  'Sapiens: A Brief History ...',
  'The Requiem Red',
  'The Dirty Little Secrets ...',
  'The Coming Woman: A ...',
  'The Boys in the ...',
  'The Black Maria',
  'Starving Hearts (Triangular Trade ...',
  "Shakespeare's Sonnets",
  'Set Me Free',
  "Scott Pilgrim's Precious Little ...",
  'Rip it Up and ...',
  'Our Band Could Be ...',
  'Olio',
  'Mesaerion: The Best Science ...',
  'Libertarianism for Beginners',
  "It's Only the Himalayas"],
 'Precio': ['£51.77',
  '£53.74',
  '£50.10',
  '£47.82',
  '£54.23',
  '£22.65',
  '£33.34',
  '£17.93',
  '£22.60',
  '£52.15',
  '£13.99',
  '£20.66',
  '£17.46',
  '£52.29',
  '£35.02',
  '£57.25',
  '£23.88',
  '£37.59',
  '£51.33',
  '£45.17']}

In [46]:
# Build a variable with all the category links on the site, to scrape them one by one
links_para_ciclo = fc_get_urls(url_ppal)

In [47]:
links_para_ciclo

['https://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'https://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html',
 'https://books.toscrape.com/catalogue

In [48]:
# Check that the count matches the number of category pages on the site.
len(links_para_ciclo) 

50

In [49]:
# data is a list in which each element is a dictionary holding the titles and prices of the books on one page.

data = []
for index, link in enumerate(links_para_ciclo):
    # The message reads "Scraping page number {index}".
    print(f'Se está escrapeando la página número {index}')
    data.append(fc_parser_content(link))

Se está escrapeando la página número 0
Se está escrapeando la página número 1
Se está escrapeando la página número 2
Se está escrapeando la página número 3
Se está escrapeando la página número 4
Se está escrapeando la página número 5
Se está escrapeando la página número 6
Se está escrapeando la página número 7
Se está escrapeando la página número 8
Se está escrapeando la página número 9
Se está escrapeando la página número 10
Se está escrapeando la página número 11
Se está escrapeando la página número 12
Se está escrapeando la página número 13
Se está escrapeando la página número 14
Se está escrapeando la página número 15
Se está escrapeando la página número 16
Se está escrapeando la página número 17
Se está escrapeando la página número 18
Se está escrapeando la página número 19
Se está escrapeando la página número 20
Se está escrapeando la página número 21
Se está escrapeando la página número 22
Se está escrapeando la página número 23
Se está escrapeando la página número 24
Se está es
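
A loop like this one stops at the first network error and silently keeps empty dictionaries for pages that did not return status 200. One possible way to make it more forgiving is sketched below. The names scrape_all and fake_fetch are invented for this example; fake_fetch stands in for fc_parser_content so the sketch runs without network access.

```python
def scrape_all(links, fetch):
    """Call fetch(link) for every link, keeping good results and recording failures."""
    results, failed = [], []
    for link in links:
        try:
            page = fetch(link)
        except Exception as exc:  # in real use, narrow this to requests.RequestException
            failed.append((link, str(exc)))
            continue
        if page:  # an empty dictionary means the page returned no data
            results.append(page)
        else:
            failed.append((link, 'empty result'))
    return results, failed


def fake_fetch(link):
    """Hypothetical stand-in for fc_parser_content, driven by the link text."""
    if 'broken' in link:
        raise ConnectionError('simulated network failure')
    if 'empty' in link:
        return {}
    return {'title': ['Example title'], 'price': ['10.00']}


results, failed = scrape_all(['page-ok', 'page-empty', 'page-broken'], fake_fetch)
print(len(results), 'pages scraped')
print(failed)
```

Passing the fetch function as an argument keeps the loop testable; in the notebook it would be called as `scrape_all(links_para_ciclo, fc_parser_content)`.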

### Building a DataFrame

In [50]:
# Build one DataFrame per page and concatenate them all in a single call.
df = pd.concat([pd.DataFrame(j) for j in data])
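
When the per-page DataFrames are concatenated, each one keeps its own row labels, so the combined index restarts at 0 for every page. The sketch below shows this on two small pages built from titles and prices that appear in this notebook, and how `ignore_index=True` renumbers the rows during concatenation.

```python
import pandas as pd

# Two small "pages" in the same shape that fc_parser_content returns.
pages = [
    {'Título': ["It's Only the Himalayas", 'Under the Tuscan Sun'], 'Precio': ['£45.17', '£37.33']},
    {'Título': ["Sophie's World"], 'Precio': ['£15.94']},
]

stacked = pd.concat([pd.DataFrame(page) for page in pages])
renumbered = pd.concat([pd.DataFrame(page) for page in pages], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Either approach works; renumbering at concatenation time just removes the need for a later reset_index call.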

In [59]:
df.head(15)

|   | Título | Precio |
|---|---|---|
| 0 | It's Only the Himalayas | £45.17 |
| 1 | Full Moon over Noah’s ... | £49.43 |
| 2 | See America: A Celebration ... | £48.87 |
| 3 | Vagabonding: An Uncommon Guide ... | £36.94 |
| 4 | Under the Tuscan Sun | £37.33 |
| 5 | A Summer In Europe | £44.34 |
| 6 | The Great Railway Bazaar | £30.54 |
| 7 | A Year in Provence ... | £56.88 |
| 8 | The Road to Little ... | £23.21 |
| 9 | Neither Here nor There: ... | £38.95 |


In [57]:
"""
Cleaning Data.
"""
def clean_data(df):
    # Reset the index so the rows are numbered consecutively
    df = df.reset_index(drop=True)
    # Rename column 'Precio' to 'Price(£)'
    df = df.rename(columns={'Precio': 'Price(£)'})
    # Remove the "£" symbol (as a literal string, not a regex) and convert to numeric
    df["Price(£)"] = df["Price(£)"].str.replace("£", "", regex=False).astype(float)
    return df

df_clean = clean_data(df.copy())
df_clean.head()

|   | Título | Price(£) |
|---|---|---|
| 0 | It's Only the Himalayas | 45.17 |
| 1 | Full Moon over Noah’s ... | 49.43 |
| 2 | See America: A Celebration ... | 48.87 |
| 3 | Vagabonding: An Uncommon Guide ... | 36.94 |
| 4 | Under the Tuscan Sun | 37.33 |
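
The same price conversion can be checked on plain strings, outside pandas. The sketch below uses a regular expression to pull out the numeric part, which also copes with a price whose pound sign was garbled by a wrong decoding. The helper name parse_price is invented for this example.

```python
import re

def parse_price(text):
    """Return the numeric part of a price string such as '£15.94' as a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return float(match.group())

# Two prices from the table above, plus one with a garbled currency symbol.
prices = [parse_price(p) for p in ['£45.17', '£49.43', 'Â£15.94']]
print(prices)  # [45.17, 49.43, 15.94]
```

Extracting the digits is more tolerant than removing one specific symbol, at the cost of ignoring which currency the price was in.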


In [60]:
df_clean.tail()

|   | Título | Price(£) |
|---|---|---|
| 512 | Why the Right Went ... | 52.65 |
| 513 | Equal Is Unfair: America's ... | 56.86 |
| 514 | Amid the Chaos | 36.58 |
| 515 | Dark Notes | 19.19 |
| 516 | The Long Shadow of ... | 10.97 |


This gives me a DataFrame that is ready to use.