<a href="https://colab.research.google.com/github/leobioinf0/Web_scraping/blob/main/S12_T02_web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IT Academy - Data Science with Python
## [Sprint 12. Advanced machine learning](https://github.com/jesussantana/Supervised-Regression/blob/main/notebooks/S12_T01_Supevised_Regression.ipynb) 
### [S12 T02: Web scraping task](https://github.com/jesussantana/Supervised-Regression) 



#### Exercises: 

Learn how to do web scraping.


- Exercise 1: 
    - Scraping a page on the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

- Exercise 2: 
    - Document the data set you generated in a Word file, including the kind of information that Kaggle data files provide.

- Exercise 3: 
    - Choose a web page you want and do web scraping using the Scrapy library.


# Prerequisites

## Upgrade modules

In [None]:
!apt-get update
!apt-get install chromium-chromedriver

!pip3 install --upgrade selenium
!pip3 install --upgrade python-docx
!pip3 install --upgrade fake_useragent
!pip3 install --upgrade scrapy
!pip3 install --upgrade scrapy-fake-useragent

## Load modules

In [None]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx import Document 

from fake_useragent import UserAgent
import scrapy_fake_useragent
from scrapy.http import TextResponse

import warnings
warnings.filterwarnings('ignore')

#  Exercise 1: 
  - Scraping a page on the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

## BeautifulSoup

In [None]:
url = 'https://www.bolsamadrid.es'

We perform an HTTP GET request to the given URL. 

The server's reply is stored in the Python object `html`. Despite its name, this is a `requests` `Response` object; the raw HTML is available as `html.content`.

The output `<Response [200]>` means that the request succeeded and the page content was fetched.

In [None]:
html = requests.get(url)
html

<Response [200]>

We create a Beautiful Soup object that takes `html.content` as its input.


In [None]:
soup = BeautifulSoup(html.content, 'html.parser')

Inspecting the page, we see that the link to the share prices is an `<a>` tag with the text `Acciones` (Spanish for "Shares").

![01.png](https://github.com/leobioinf0/Web_scraping/blob/main/01.png?raw=true)


In [None]:
title = soup.find("title").get_text().replace("\r\n","").replace("\t","")
print(title)

Bolsa de Madrid


We find that specific HTML element by its tag and text:

In [None]:
a_tag = soup.find('a', string = 'Acciones')
print(a_tag)

<a href="/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000" target="_self">Acciones</a>


We extract the value of its href attribute

In [None]:
link = a_tag.get(key='href')
print(link)

/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000
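
`soup.find` returns `None` when no tag matches, so a change in the link text would make the `a_tag.get` call fail with an `AttributeError`. A minimal sketch of a defensive lookup, run on an inline HTML sample (the helper name `find_link` and the sample markup are illustrative assumptions, not part of the original notebook):

```python
from bs4 import BeautifulSoup

# Inline sample that mimics the anchor found on the home page (assumed markup).
sample_html = (
    '<div><a href="/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000" '
    'target="_self">Acciones</a></div>'
)


def find_link(soup, text):
    """Return the href of the first <a> whose text equals `text`, or None."""
    tag = soup.find("a", string=text)
    if tag is None:
        return None
    return tag.get("href")


demo_soup = BeautifulSoup(sample_html, "html.parser")
found = find_link(demo_soup, "Acciones")
missing = find_link(demo_soup, "Bonos")  # no such link in the sample
print(found)
print(missing)
```

Returning `None` lets the caller decide whether a missing link is an error or simply a page that changed.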


In [None]:
url_shares = url+link.replace("esp","ing")
print(url_shares)

https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000
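
Plain concatenation works here because `link` starts with `/` and `url` has no trailing slash. Note also that `replace("esp", "ing")` would rewrite every occurrence of `esp` in the path. A sketch of a slightly more defensive variant with the standard library's `urljoin` (the names ending in `_demo` are illustrative):

```python
from urllib.parse import urljoin

base_demo = "https://www.bolsamadrid.es"
link_demo = "/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000"

# Swap only the leading language segment, then resolve against the base URL.
english_link = link_demo.replace("/esp/", "/ing/", 1)
url_shares_demo = urljoin(base_demo, english_link)
print(url_shares_demo)
```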


We perform an HTTP GET request to the new URL.

In [None]:
html = requests.get(url_shares)
html

<Response [200]>

We create a Beautiful Soup object

In [None]:
soup = BeautifulSoup(html.content, 'html.parser')

Inspecting the page, we see that the share values are tabulated in a `table` tag with the id `ctl00_Contenido_tblAcciones`

![02.png](https://github.com/leobioinf0/Web_scraping/blob/main/02.png?raw=true)


We find that specific HTML element by its ID:

In [None]:
table_tag = soup.find(id='ctl00_Contenido_tblAcciones')

We iterate over the child elements and extract the text.

In [None]:
rows = []
for child in table_tag.children:
    element = []
    if child != "\n":
        for i in child:
            if i != "\n":
                element.append(i.text)
        rows.append(element)

In [None]:
for r in rows:
    print(r)

['Name', 'Last', '% Dif.', 'High', 'Low', 'Volume', 'Turnover (€ Thousands)', 'Date', 'Time']
['ACCIONA', '142.2000', '2.30', '142.2000', '137.2000', '135,038', '19,028.55', '15/02/2022', 'Close']
['ACERINOX', '11.5600', '2.21', '11.6750', '11.2350', '1,037,506', '11,990.35', '15/02/2022', 'Close']
['ACS', '22.2900', '1.50', '22.3400', '21.7400', '697,215', '15,486.02', '15/02/2022', 'Close']
['AENA', '153.6500', '2.43', '154.1000', '149.1500', '95,529', '14,625.18', '15/02/2022', 'Close']
['ALMIRALL', '10.4000', '1.76', '10.4300', '10.1400', '273,597', '2,812.44', '15/02/2022', 'Close']
['AMADEUS', '61.9800', '2.72', '62.0600', '59.6600', '658,758', '40,662.63', '15/02/2022', 'Close']
['ARCELORMIT.', '26.9350', '2.18', '27.3900', '25.8300', '1,225,413', '32,963.22', '15/02/2022', 'Close']
['B.SANTANDER', '3.4230', '2.29', '3.4230', '3.3090', '81,077,793', '272,444.88', '15/02/2022', 'Close']
['BA.SABADELL', '0.9152', '1.13', '0.9298', '0.8992', '60,332,575', '55,048.44', '15/02/2022',
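
Iterating over `.children` also yields the whitespace strings between tags, which is why the loop above filters out `"\n"`. A hedged alternative sketch selects the row and cell tags directly with `find_all`; it runs on a small inline table (the sample HTML is an assumption that mimics the structure of the real page):

```python
from bs4 import BeautifulSoup

# Two-column sample with the same table id as the real page (assumed markup).
sample_table = """
<table id="ctl00_Contenido_tblAcciones">
  <tr><th>Name</th><th>Last</th></tr>
  <tr><td>ACCIONA</td><td>142.2000</td></tr>
  <tr><td>ACERINOX</td><td>11.5600</td></tr>
</table>
"""

soup_demo = BeautifulSoup(sample_table, "html.parser")
table_demo = soup_demo.find(id="ctl00_Contenido_tblAcciones")

# One inner list per <tr>; header cells (<th>) and data cells (<td>) alike.
rows_demo = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table_demo.find_all("tr")
]
print(rows_demo)
```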

We tabulate the extracted data.

In [None]:
df = pd.DataFrame(rows[1:], columns=rows[0])
df 

Unnamed: 0,Name,Last,% Dif.,High,Low,Volume,Turnover (€ Thousands),Date,Time
0,ACCIONA,142.2,2.3,142.2,137.2,135038,19028.55,15/02/2022,Close
1,ACERINOX,11.56,2.21,11.675,11.235,1037506,11990.35,15/02/2022,Close
2,ACS,22.29,1.5,22.34,21.74,697215,15486.02,15/02/2022,Close
3,AENA,153.65,2.43,154.1,149.15,95529,14625.18,15/02/2022,Close
4,ALMIRALL,10.4,1.76,10.43,10.14,273597,2812.44,15/02/2022,Close
5,AMADEUS,61.98,2.72,62.06,59.66,658758,40662.63,15/02/2022,Close
6,ARCELORMIT.,26.935,2.18,27.39,25.83,1225413,32963.22,15/02/2022,Close
7,B.SANTANDER,3.423,2.29,3.423,3.309,81077793,272444.88,15/02/2022,Close
8,BA.SABADELL,0.9152,1.13,0.9298,0.8992,60332575,55048.44,15/02/2022,Close
9,BANKINTER,5.786,-0.89,5.906,5.724,4203853,24352.77,15/02/2022,Close
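
The scraped cells are strings, so a value such as `'1,037,506'` cannot be used in arithmetic until it is converted. A minimal sketch of the conversion on two sample rows taken from the output above (the `_demo` names and the choice of columns are illustrative):

```python
import pandas as pd

rows_demo = [
    ["Name", "Last", "Volume", "Date"],
    ["ACCIONA", "142.2000", "135,038", "15/02/2022"],
    ["ACERINOX", "11.5600", "1,037,506", "15/02/2022"],
]
df_demo = pd.DataFrame(rows_demo[1:], columns=rows_demo[0])

# Remove the thousands separator, then let pandas infer int or float.
for col in ["Last", "Volume"]:
    df_demo[col] = pd.to_numeric(df_demo[col].str.replace(",", "", regex=False))

# Dates on the page are written day/month/year.
df_demo["Date"] = pd.to_datetime(df_demo["Date"], format="%d/%m/%Y")
print(df_demo.dtypes)
```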


We extract some other information

In [None]:
TituloPag = soup.find("div", {"class": "TituloPag"}).text
TituloPag

'Session Prices'

In [None]:
Ctr = soup.find("div", {"class": "Ctr"}).text
Ctr

'IBEX 35®'

In [None]:
Nota = soup.find("div", {"class": "Nota"}).text
Nota = Nota.replace("\n","").split(".")
Nota = [txt.lstrip() for txt in Nota[:-1]]
Nota

['Data Delayed 15 minutes',
 'Prices expressed in euros',
 'Turnover expressed in thousands of euros',
 'Volume and turnover includes all transactions until the closing of trading session',
 'The total volume and turnover, including special operations carried out after the closing of trading session, is available in the historic information']

Save data into a file

In [None]:
filename= './{}_{}_BeautifulSoup.csv'.format(TituloPag.replace(" ",""),Ctr.replace(r"®","").replace(r" ",""))
df.to_csv(filename, index=False)
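
The chained `replace` calls remove the space and the `®` sign so that the page title and index name can be used in a file name. A sketch of the same idea with one regular expression, which also drops any other character that is awkward in file names (the helper name `compact` is hypothetical):

```python
import re


def compact(text):
    """Drop every character that is not an ASCII letter, digit, or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "", text)


filename_demo = "./{}_{}_BeautifulSoup.csv".format(
    compact("Session Prices"), compact("IBEX 35®")
)
print(filename_demo)
```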

## Selenium

Web Driver for Chrome

In [None]:
# to run Selenium in Google Colab
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-extensions')

We instantiate a webdriver for Chrome

In [None]:
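# Recent Selenium 4 releases no longer accept the driver path as the first
# positional argument; chromedriver is located on the PATH instead.
wd = webdriver.Chrome(options=options)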

We call the `get` method of the `wd` object created above, passing the URL of the page we want to scrape.

In [None]:
wd.get(url_shares)

To solve this part of the exercise differently from the previous one, we locate the data by XPath instead of by element ID.

The table with the values of interest has the following XPaths:
- For the header row: `/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr[1]/th`

- For the data rows: `/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr` (the cells of each row are its `td` children)
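
Absolute XPaths like these break as soon as any ancestor element changes. A relative XPath anchored on the table `id` found earlier is usually shorter and more stable. The sketch below illustrates the idea offline with the standard library's `xml.etree.ElementTree` on a small well-formed sample (an assumption that mimics the page structure); with Selenium, an expression such as `//table[@id="ctl00_Contenido_tblAcciones"]//tr` could be passed to `find_elements(By.XPATH, ...)` in the same way.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the real page (assumed, heavily simplified markup).
sample_page = """
<html><body><div><form><div>
  <table id="ctl00_Contenido_tblAcciones">
    <tbody>
      <tr><th>Name</th><th>Last</th></tr>
      <tr><td>ACCIONA</td><td>142.2000</td></tr>
    </tbody>
  </table>
</div></form></div></body></html>
""".strip()

root = ET.fromstring(sample_page)

# Relative path: any table with this id, then every row below it.
relative_xpath = ".//table[@id='ctl00_Contenido_tblAcciones']//tr"
table_rows_demo = root.findall(relative_xpath)

header_demo = [th.text for th in table_rows_demo[0].findall("th")]
data_demo = [[td.text for td in tr.findall("td")] for tr in table_rows_demo[1:]]
print(header_demo)
print(data_demo)
```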

We extract the header

In [None]:
table_header = wd.find_elements(By.XPATH, "/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr[1]/th")

We iterate over the elements of the header row and extract the text.

In [None]:
col_names = []
for column in table_header:
    col_names.append(column.text)
print(col_names)

['Name', 'Last', '% Dif.', 'High', 'Low', 'Volume', 'Turnover (€ Thousands)', 'Date', 'Time']


We extract the table entries

In [None]:
table_rows = wd.find_elements(By.XPATH, "/html/body/div[1]/table/tbody/tr[4]/td[2]/div[1]/form/div[6]/table/tbody/tr")

We iterate over the elements of the table and extract the text

In [None]:
rows = []
for row in table_rows[1:]:
    element = []
    values = row.find_elements(By.XPATH, "td")
    for value in values:
        element.append(value.text)
    rows.append(element)

In [None]:
for r in rows:
    print(r)

['ACCIONA', '142.2000', '2.30', '142.2000', '137.2000', '135,038', '19,028.55', '15/02/2022', 'Close']
['ACERINOX', '11.5600', '2.21', '11.6750', '11.2350', '1,037,506', '11,990.35', '15/02/2022', 'Close']
['ACS', '22.2900', '1.50', '22.3400', '21.7400', '697,215', '15,486.02', '15/02/2022', 'Close']
['AENA', '153.6500', '2.43', '154.1000', '149.1500', '95,529', '14,625.18', '15/02/2022', 'Close']
['ALMIRALL', '10.4000', '1.76', '10.4300', '10.1400', '273,597', '2,812.44', '15/02/2022', 'Close']
['AMADEUS', '61.9800', '2.72', '62.0600', '59.6600', '658,758', '40,662.63', '15/02/2022', 'Close']
['ARCELORMIT.', '26.9350', '2.18', '27.3900', '25.8300', '1,225,413', '32,963.22', '15/02/2022', 'Close']
['B.SANTANDER', '3.4230', '2.29', '3.4230', '3.3090', '81,077,793', '272,444.88', '15/02/2022', 'Close']
['BA.SABADELL', '0.9152', '1.13', '0.9298', '0.8992', '60,332,575', '55,048.44', '15/02/2022', 'Close']
['BANKINTER', '5.7860', '-0.89', '5.9060', '5.7240', '4,203,853', '24,352.77', '15/0

We tabulate the extracted data.

In [None]:
df= pd.DataFrame(rows, columns=col_names)
df.head(3)

Unnamed: 0,Name,Last,% Dif.,High,Low,Volume,Turnover (€ Thousands),Date,Time
0,ACCIONA,142.2,2.3,142.2,137.2,135038,19028.55,15/02/2022,Close
1,ACERINOX,11.56,2.21,11.675,11.235,1037506,11990.35,15/02/2022,Close
2,ACS,22.29,1.5,22.34,21.74,697215,15486.02,15/02/2022,Close


We extract some other information

In [None]:
TituloPag = wd.find_element(By.CLASS_NAME, "TituloPag").text
TituloPag

'Session Prices'

In [None]:
Ctr = wd.find_element(By.CLASS_NAME, "Ctr").text
Ctr

'IBEX 35®'

In [None]:
Nota = wd.find_element(By.CLASS_NAME, "Nota").text
Nota = Nota.replace("\n","").split(".")
Nota = [txt.lstrip() for txt in Nota[:-1]]
Nota

['Data Delayed 15 minutes',
 'Prices expressed in euros',
 'Turnover expressed in thousands of euros',
 'Volume and turnover includes all transactions until the closing of trading session',
 'The total volume and turnover, including special operations carried out after the closing of trading session, is available in the historic information']

Save data into a file

In [None]:
filename= './{}_{}_Selenium.csv'.format(TituloPag.replace(" ",""),Ctr.replace(r"®","").replace(r" ",""))
df.to_csv(filename, index=False)

We close the driver

In [None]:
wd.quit()

# Exercise 2: 
  - Document the data set you generated in a Word file, including the kind of information that Kaggle data files provide.

In [None]:
df = pd.read_csv(filename,
                 decimal='.',
                 thousands=',')
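# regex=True is required in pandas 2.x, where str.replace defaults to literal matching
df.columns = df.columns.str.replace(r"% |\.| \(.*\)", "", regex=True)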
df=df.round(1)
df.head(3)

Unnamed: 0,Name,Last,Dif,High,Low,Volume,Turnover,Date,Time
0,ACCIONA,142.2,2.3,142.2,137.2,135038,19028.6,15/02/2022,Close
1,ACERINOX,11.6,2.2,11.7,11.2,1037506,11990.4,15/02/2022,Close
2,ACS,22.3,1.5,22.3,21.7,697215,15486.0,15/02/2022,Close
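
The pattern used to clean the column names has three alternatives: a leading `% `, any literal dot, and a parenthesised suffix preceded by a space. A small standard-library sketch that applies the same pattern to the original header, to show what each name becomes (the variable names are illustrative):

```python
import re

columns_demo = ["Name", "Last", "% Dif.", "High", "Low", "Volume",
                "Turnover (€ Thousands)", "Date", "Time"]

# Same pattern as in the cell above: "% " or "." or " (anything)".
pattern = re.compile(r"% |\.| \(.*\)")
cleaned_demo = [pattern.sub("", name) for name in columns_demo]
print(cleaned_demo)
```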


In [None]:
date = df.Date[0]
print(url)
print(date)
print(title)
print(TituloPag)
print(Ctr)
print(Nota)

https://www.bolsamadrid.es
15/02/2022
Bolsa de Madrid
Session Prices
IBEX 35®
['Data Delayed 15 minutes', 'Prices expressed in euros', 'Turnover expressed in thousands of euros', 'Volume and turnover includes all transactions until the closing of trading session', 'The total volume and turnover, including special operations carried out after the closing of trading session, is available in the historic information']


In [None]:
content = ["Name: Legal name of the company.",
           "Last: Last price per share of the company expressed in euros.",
           "Dif: Net change of the day expressed in percentage.",
           "High: Maximum price during the day expressed in euros.",
           "Low: Minimum price during the day expressed in euros.",
           "Volume: Number of shares that changed hands during a given day.",
           "Turnover: Expressed in thousands of euros. Is the result of multiplying the number of shares traded by the price of each share.",
           "Date: Trading date.",
           "Time: Trading time."]

In [98]:
# Create document object
document = Document()

header_1 = document.add_heading("'{}' webpage Scraping".format(title), 0)
header_1.alignment = WD_ALIGN_PARAGRAPH.CENTER

#Context
document.add_heading("Context", 1)
p = document.add_paragraph("")
p.add_run("This dataset contains the {} of the shares of all the {} companies registered on the date {}. The information was scraped from {}".format(TituloPag,Ctr,date,url))
document.paragraphs[2].runs[0].font.size = Pt(8)
document.paragraphs[2].runs[0].font.name = 'Verdana'

#Content
document.add_heading("Content", 1)
p2 = document.add_paragraph("")
p2.add_run("The information contained in this dataset include: {}".format(', '.join(df.columns)))
document.paragraphs[4].runs[0].font.size = Pt(8)
document.paragraphs[4].runs[0].font.name = 'Verdana'

for column in content:
    document.add_paragraph(column, style="List Bullet")

for paragraph in document.paragraphs[5:]:
    paragraph.runs[0].font.size = Pt(8)
    paragraph.runs[0].font.name = 'Verdana'

#Acknowledgements
document.add_heading("Acknowledgements", 1)
p3 = document.add_paragraph("")
p3.add_run("The information was scraped from {}".format(url))
document.paragraphs[15].runs[0].font.size = Pt(8)
document.paragraphs[15].runs[0].font.name = 'Verdana'

#Inspiration
document.add_heading("Inspiration", 1)
p4 = document.add_paragraph("")
p4.add_run("Exercise 2: Document the data set you generated in a Word file, including the kind of information that Kaggle data files provide.")
document.paragraphs[17].runs[0].font.size = Pt(8)
document.paragraphs[17].runs[0].font.name = 'Verdana'

#Data
document.add_heading("Data", 1)
document.add_paragraph("")

##Table
t = document.add_table(df.shape[0]+1, df.shape[1])
t.allow_autofit =True
t.autofit = True

for j in range(df.shape[-1]):
    t.cell(0,j).text = df.columns[j]

for i in range(df.shape[0]):
    for j in range(df.shape[-1]):
        t.cell(i+1,j).text = str(df.values[i,j])

for row in t.rows:
    for cell in row.cells:
        paragraphs = cell.paragraphs
        for paragraph in paragraphs:
            for run in paragraph.runs:
                font = run.font
                font.size= Pt(6)

# Justify all the paragraphs
p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
p2.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
p3.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
p4.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY

# python-docx writes the .docx format, so the file name uses that extension
document.save('{}_{}.docx'.format(TituloPag.replace(" ",""),Ctr.replace(r"®","").replace(r" ","")))

# Exercise 3: 
  - Choose a web page you want and do web scraping using the Scrapy library.

Xpath checking

In [104]:
url = 'https://esenciacosmetics.com/3-esenciaspaises?id_category={}&n={}'.format(3, 31)
print(url)

https://esenciacosmetics.com/3-esenciaspaises?id_category=3&n=31


Make header

In [105]:
headers = {'user-agent': UserAgent().Chrome}

In [106]:
print(headers)

{'user-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36'}


Make request

In [107]:
res = requests.get(url, headers=headers)
print(res)

<Response [200]>


Get response

In [110]:
response = TextResponse(res.url, body=res.text, encoding='utf-8')
print(response)

<200 https://esenciacosmetics.com/3-esenciaspaises?id_category=3&n=31>


Check links by xpath

In [111]:
links = response.xpath('//*[@class="product_list grid row"]/li/div/div/div[1]/a/@href').extract()
links

['https://esenciacosmetics.com/esenciaspaises/28-brasil-champu-color-protect-de-acai.html',
 'https://esenciacosmetics.com/esenciaspaises/29-brasil-crema-acondicionadora-color-protect-de-acai.html',
 'https://esenciacosmetics.com/esenciaspaises/42-brasil-pack-supercolor-de-acai.html',
 'https://esenciacosmetics.com/esenciaspaises/4-brasil-rostro-radiante-de-acai-crema-de-dia.html',
 'https://esenciacosmetics.com/esenciaspaises/25-corea-antiaging-manos-de-ginseng-vegana.html',
 'https://esenciacosmetics.com/esenciaspaises/31-corea-crema-acondicionadora-estimulante-de-ginseng.html',
 'https://esenciacosmetics.com/esenciaspaises/10-corea-crema-facial-antiedad-rejuvenecedora-de-ginseng.html',
 'https://esenciacosmetics.com/esenciaspaises/43-corea-pack-hair-stimulator-de-ginseng.html',
 'https://esenciacosmetics.com/esenciaspaises/52-corea-ritual-antiedad-total-cara-y-manos-de-ginseng.html',
 'https://esenciacosmetics.com/esenciaspaises/50-india-pack-antimanchas-total-manos-y-cara-de-azafra

Check product titles by xpath

In [112]:
titles = response.xpath('//*[@class="product_list grid row"]/li/div/div/div[2]/h5/a/@title').extract()
titles

['Brasil - champú COLOR PROTECT de açai',
 'Brasil - crema acondicionadora COLOR PROTECT de açai',
 'Brasil - pack SUPERCOLOR de açai',
 'Brasil - rostro RADIANTE de açai crema de día',
 'Corea - ANTIAGING MANOS de ginseng (VEGANA)',
 'Corea - crema acondicionadora ESTIMULANTE de ginseng',
 'Corea - crema facial ANTIEDAD REJUVENECEDORA de ginseng',
 'Corea - pack HAIR STIMULATOR de ginseng',
 'Corea - ritual ANTIEDAD TOTAL cara y manos de ginseng',
 'India - pack ANTIMANCHAS TOTAL manos y cara de azafrán',
 'Italia - crema corporal SUPERHIDRATANTE de lirio blanco (VEGANA)',
 'Italia - pack SUPERHIDRATANTE cara y cuerpo de LIRIO',
 'Italia - rostro HIPERHIDRATADO noche de lirio blanco',
 'La India - crema ANTIMANCHAS manos de azafrán (VEGANA)',
 'La India - DIFUMINADOR MANCHAS facial noche de azafrán',
 'Malasia - champú ANTICAÍDA de centella asiática',
 'Malasia - crema acondicionadora ANTICAÍDA de centella asiática',
 'Malasia - crema ULTRAINTENSIVA ANTICELULÍTICA de centella asiática

Check prices by xpath

In [113]:
prices = response.xpath('//*[@class="product_list grid row"]/li/div/div/div[2]/div[1]/span[1]/text()').extract()
prices

['16,00 €',
 '18,00 €',
 '29,00 €',
 '24,95 €',
 '12,50 €',
 '18,00 €',
 '24,95 €',
 '29,00 €',
 '32,00 €',
 '32,00 €',
 '24,95 €',
 '45,00 €',
 '24,95 €',
 '12,50 €',
 '24,95 €',
 '16,00 €',
 '18,00 €',
 '27,00 €',
 '29,00 €',
 '24,95 €',
 '24,95 €',
 '45,00 €',
 '16,00 €',
 '18,00 €',
 '24,95 €',
 '29,00 €',
 '24,95 €',
 '12,50 €',
 '24,95 €',
 '24,95 €',
 '12,50 €']
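
The prices come back as text with a decimal comma and a currency sign. A minimal sketch of turning them into numbers, assuming that `,` marks decimals and `.` (if present) groups thousands; the function name `parse_price` is hypothetical:

```python
def parse_price(text):
    """Convert a price string such as '16,00 €' to a float."""
    number = text.replace("€", "").strip()
    # Assumption: '.' groups thousands and ',' marks decimals.
    return float(number.replace(".", "").replace(",", "."))


prices_demo = ["16,00 €", "24,95 €", "12,50 €"]
parsed_demo = [parse_price(p) for p in prices_demo]
print(parsed_demo)
```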

Check image sources by xpath

In [114]:
imgs =  response.xpath('//*[@class="product_list grid row"]/li/div/div/div[1]/a/img/@src').extract()
imgs

['https://esenciacosmetics.com/166-home_default/brasil-champu-color-protect-de-acai.jpg',
 'https://esenciacosmetics.com/131-home_default/brasil-crema-acondicionadora-color-protect-de-acai.jpg',
 'https://esenciacosmetics.com/170-home_default/brasil-pack-supercolor-de-acai.jpg',
 'https://esenciacosmetics.com/115-home_default/brasil-rostro-radiante-de-acai-crema-de-dia.jpg',
 'https://esenciacosmetics.com/184-home_default/corea-antiaging-manos-de-ginseng-vegana.jpg',
 'https://esenciacosmetics.com/132-home_default/corea-crema-acondicionadora-estimulante-de-ginseng.jpg',
 'https://esenciacosmetics.com/83-home_default/corea-crema-facial-antiedad-rejuvenecedora-de-ginseng.jpg',
 'https://esenciacosmetics.com/172-home_default/corea-pack-hair-stimulator-de-ginseng.jpg',
 'https://esenciacosmetics.com/154-home_default/corea-ritual-antiedad-total-cara-y-manos-de-ginseng.jpg',
 'https://esenciacosmetics.com/197-home_default/india-pack-antimanchas-total-manos-y-cara-de-azafran.jpg',
 'https://e
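
The four XPath queries return parallel lists, so the items can be combined position by position. A minimal sketch, using the first two values from the outputs above, that checks the lists have equal length before building a table (the `_demo` names are illustrative):

```python
import pandas as pd

titles_demo = ["Brasil - champú COLOR PROTECT de açai",
               "Brasil - crema acondicionadora COLOR PROTECT de açai"]
prices_demo = ["16,00 €", "18,00 €"]
links_demo = [
    "https://esenciacosmetics.com/esenciaspaises/28-brasil-champu-color-protect-de-acai.html",
    "https://esenciacosmetics.com/esenciaspaises/29-brasil-crema-acondicionadora-color-protect-de-acai.html",
]

# If one XPath misses a product, the lists drift out of step; fail early.
lengths = {len(titles_demo), len(prices_demo), len(links_demo)}
if len(lengths) != 1:
    raise ValueError("XPath results have different lengths: {}".format(lengths))

products_demo = pd.DataFrame(
    {"title": titles_demo, "price": prices_demo, "link": links_demo}
)
print(products_demo.shape)
```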

### Make scrapy project

In [115]:
!scrapy startproject esenciacosmetics

New Scrapy project 'esenciacosmetics', using template directory '/usr/local/lib/python3.7/dist-packages/scrapy/templates/project', created in:
    /content/esenciacosmetics

You can start your first spider with:
    cd esenciacosmetics
    scrapy genspider example example.com


Write items.py file

In [116]:
%%writefile esenciacosmetics/esenciacosmetics/items.py
import scrapy

class EsenciacosmeticsItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    img = scrapy.Field()
    link = scrapy.Field()
    
    def to_dict(self):
        # For convenience
        data= {'title': self['title'],
                'price': self['price'],
                'img': self['img'],
                'link': self['link']}
        return data

Overwriting esenciacosmetics/esenciacosmetics/items.py


Check items.py file

In [117]:
!cat esenciacosmetics/esenciacosmetics/items.py

import scrapy

class EsenciacosmeticsItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    img = scrapy.Field()
    link = scrapy.Field()
    
    def to_dict(self):
        # For convenience
        data= {'title': self['title'],
                'price': self['price'],
                'img': self['img'],
                'link': self['link']}
        return data

Write spider.py file

In [120]:
%%writefile esenciacosmetics/esenciacosmetics/spiders/spider.py
import scrapy
from esenciacosmetics.items import EsenciacosmeticsItem
import scrapy_fake_useragent

class EsenciacosmeticsSpider(scrapy.Spider):
    name = 'EsenciacosmeticsSpider'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES' : {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware' : 300,
        }
    }

    def __init__(self, cat1=3, cat2=31, **kwargs):
        self.start_url = 'https://esenciacosmetics.com/3-esenciaspaises?id_category={}&n={}'.format(cat1, cat2)
        super().__init__(**kwargs)
        
    def start_requests(self):
        import requests
        from fake_useragent import UserAgent
        from scrapy.http import TextResponse
        
        # Iterate all pages
        headers = {'user-agent': UserAgent().Chrome}
        res = requests.get(self.start_url, headers=headers)
        response = TextResponse(res.url, body=res.text, encoding='utf-8')
        urls = response.xpath('//*[@class="product_list grid row"]/li/div/div/div[1]/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_page)
            
    def parse_page(self, response):
        item = EsenciacosmeticsItem()
        item['title'] = response.xpath('//*[@class="product_main_name"]/text()').extract()
        item['price'] = response.xpath('//*[@id="our_price_display"]/text()').extract()
        item['img'] = response.xpath('//*[@id="bigpic"]/@src').extract()[0]
        item['link'] = response.url
        yield item

Writing esenciacosmetics/esenciacosmetics/spiders/spider.py


Check spider.py file

In [121]:
!cat esenciacosmetics/esenciacosmetics/spiders/spider.py

import scrapy
from esenciacosmetics.items import EsenciacosmeticsItem
import scrapy_fake_useragent

class EsenciacosmeticsSpider(scrapy.Spider):
    name = 'EsenciacosmeticsSpider'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES' : {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware' : 300,
        }
    }

    def __init__(self, cat1=3, cat2=31, **kwargs):
        self.start_url = 'https://esenciacosmetics.com/3-esenciaspaises?id_category={}&n={}'.format(cat1, cat2)
        super().__init__(**kwargs)
        
    def start_requests(self):
        import requests
        from fake_useragent import UserAgent
        from scrapy.http import TextResponse
        
        # Iterate all pages
        headers = {'user-agent': UserAgent().Chrome}
        res = requests.get(self.start_url, headers=headers)
        response = TextResponse(res.url, body=res.text, encoding='utf-8')
     

Write run.sh file

In [139]:
%%writefile run.sh
cd esenciacosmetics
scrapy crawl EsenciacosmeticsSpider -o Esenciacosmetics.csv -a cat1=3 -a cat2=31

Writing run.sh


Check run.sh file

In [140]:
!cat ./run.sh

cd esenciacosmetics
scrapy crawl EsenciacosmeticsSpider -o Esenciacosmetics.csv -a cat1=3 -a cat2=31

Check access permissions 

In [141]:
!ls -l run.sh

-rw-r--r-- 1 root root 100 Feb 16 06:59 run.sh


Change the access permissions of file

In [142]:
!chmod +x run.sh

Check access permissions 

In [143]:
!ls -l run.sh

-rwxr-xr-x 1 root root 100 Feb 16 06:59 run.sh


Execute run.sh

In [126]:
!./run.sh

2022-02-16 06:29:31 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: esenciacosmetics)
2022-02-16 06:29:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.7.12 (default, Jan 15 2022, 18:48:18) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
2022-02-16 06:29:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-02-16 06:29:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'esenciacosmetics',
 'NEWSPIDER_MODULE': 'esenciacosmetics.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['esenciacosmetics.spiders']}
2022-02-16 06:29:31 [scrapy.extensions.telnet] INFO: Telnet Password: 2f93e7d08f277aa0
2022-02-16 06:29:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensi

Read csv file

In [127]:
filepath = "./esenciacosmetics/Esenciacosmetics.csv"
df = pd.read_csv(filepath)
df

Unnamed: 0,img,link,price,title
0,https://esenciacosmetics.com/166-large_default...,https://esenciacosmetics.com/esenciaspaises/28...,"16,00 €",Brasil - champú COLOR PROTECT de açai
1,https://esenciacosmetics.com/131-large_default...,https://esenciacosmetics.com/esenciaspaises/29...,"18,00 €",Brasil - crema acondicionadora COLOR PROTECT d...
2,https://esenciacosmetics.com/115-large_default...,https://esenciacosmetics.com/esenciaspaises/4-...,"24,95 €",Brasil - rostro RADIANTE de açai crema de día
3,https://esenciacosmetics.com/172-large_default...,https://esenciacosmetics.com/esenciaspaises/43...,"29,00 €",Corea - pack HAIR STIMULATOR de ginseng
4,https://esenciacosmetics.com/132-large_default...,https://esenciacosmetics.com/esenciaspaises/31...,"18,00 €",Corea - crema acondicionadora ESTIMULANTE de g...
5,https://esenciacosmetics.com/83-large_default/...,https://esenciacosmetics.com/esenciaspaises/10...,"24,95 €",Corea - crema facial ANTIEDAD REJUVENECEDORA d...
6,https://esenciacosmetics.com/170-large_default...,https://esenciacosmetics.com/esenciaspaises/42...,"29,00 €",Brasil - pack SUPERCOLOR de açai
7,https://esenciacosmetics.com/184-large_default...,https://esenciacosmetics.com/esenciaspaises/25...,"12,50 €",Corea - ANTIAGING MANOS de ginseng (VEGANA)
8,https://esenciacosmetics.com/154-large_default...,https://esenciacosmetics.com/esenciaspaises/52...,"32,00 €",Corea - ritual ANTIEDAD TOTAL cara y manos de ...
9,https://esenciacosmetics.com/197-large_default...,https://esenciacosmetics.com/esenciaspaises/50...,"32,00 €",India - pack ANTIMANCHAS TOTAL manos y cara de...
