# M7 T02: Web scraping task

Learn how to perform web scraping.

## Level 1 - Exercise 1
Scrape a page from the Madrid stock exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

_Exploring the Madrid stock exchange website. The landing page shows several sections with relevant news, trend charts, summary tables for various indices, and tweets. On the right-hand side there is a menu for navigating the different markets, indices, listed companies, and statistics and publications._

_Among all the data available, one table is particularly well suited to web scraping (at first glance it already looks like an Excel sheet). It shows the evolution of the indices and can be found at the following link: https://www.bolsamadrid.es/esp/aspx/Indices/Resumen.aspx_

_The table is titled "Resumen de Índices" and contains the following variables:_
- _Nombre_	
- _Anterior_
- _Último_	
- _% Dif._	
- _Máximo_	
- _Mínimo_	
- _Fecha_	
- _Hora_	
- _% Dif. Año 2021_

_We use the **requests** library to download the page's HTML:_

In [50]:
import requests

URL = "https://www.bolsamadrid.es/esp/aspx/Indices/Resumen.aspx"
page = requests.get(URL)

print(page.text)


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head data-idioma="esp" data-hora-act="Wed, 01 Dec 2021 10:02:26 GMT" data-app-path="/" data-bolsa="BMadrid" data-analytics-id="UA-35966870-2"><meta http-equiv="X-UA-Compatible" content="IE=11" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta id="ctl00_copyright" name="copyright" content="Copyright © BME 2021" /><title>
	Bolsa de Madrid - Resumen de Índices
</title><link id="ctl00_RSSLink1" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=Todo" title="Bolsa de Madrid: Todos los contenidos agregados" /><link id="ctl00_RSSLink2" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=NotasPrensa" title="Bolsa de Madrid: Notas de Prensa" /><link id="ctl00_RSSLink3" rel="alternate" type="application/rss+xml" href="/esp/aspx/RSS/RSS.ashx?feed=Reg

_After inspecting the HTML and the table on the page, we find that the table we are interested in has the attribute **id="ctl00_Contenido_tblÍndices"**._

_We load the BeautifulSoup library and create the corresponding object:_

In [51]:
from bs4 import BeautifulSoup

In [52]:
soup = BeautifulSoup(page.content, "html.parser")

_We select the table we are interested in: id="ctl00_Contenido_tblÍndices"_

In [53]:
results = soup.find(id="ctl00_Contenido_tblÍndices")

_We pretty-print the table to see the data more clearly:_

In [54]:
print(results.prettify())

<table align="Center" cellpadding="3" cellspacing="0" class="TblPort" id="ctl00_Contenido_tblÍndices" style="margin-bottom: 20px;" width="85%">
 <tr align="center">
  <th scope="col">
   Nombre
  </th>
  <th scope="col">
   Anterior
  </th>
  <th scope="col">
   Último
  </th>
  <th scope="col">
   % Dif.
  </th>
  <th scope="col">
   Máximo
  </th>
  <th scope="col">
   Mínimo
  </th>
  <th scope="col">
   Fecha
  </th>
  <th scope="col">
   Hora
  </th>
  <th class="Ult" scope="col">
   % Dif.
   <br/>
   Año 2021
  </th>
 </tr>
 <tr align="right">
  <td align="left" class="DifFlSb">
   IBEX 35®
  </td>
  <td>
   8.305,10
  </td>
  <td>
   8.404,50
  </td>
  <td class="DifClSb">
   1,20
  </td>
  <td>
   8.405,70
  </td>
  <td>
   8.342,00
  </td>
  <td align="center">
   01/12/2021
  </td>
  <td align="center">
   09:46:54
  </td>
  <td class="DifClSb Ult">
   4,10
  </td>
 </tr>
 <tr align="right">
  <td align="left" class="DifFlSb">
   IBEX 35® con Dividendos
  </td>
  <td>
   25.

_Each row of the table is delimited by < tr > (the individual cells inside a row are < td >):_

In [55]:
rows = results.find_all('tr')

_We check the number of rows in the table:_

In [56]:
print(len(rows))

82
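
_The count of 82 most likely includes the header row, because `find_all('tr')` returns every row whether its cells are < th > or < td >. The sketch below illustrates this on a made-up miniature table; the HTML snippet, the `demo` id and the variable names are assumptions for illustration, not the real page._

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the index table: one header row, two data rows.
html = """
<table id="demo">
  <tr><th>Nombre</th><th>Anterior</th><th class="Ult">% Dif.<br/>Año 2021</th></tr>
  <tr><td>IBEX 35®</td><td>8.305,10</td><td>4,10</td></tr>
  <tr><td>IBEX 35® con Dividendos</td><td>25.546,10</td><td>6,75</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find(id="demo")
all_rows = table.find_all("tr")

# Header cells are <th>; get_text(" ", strip=True) joins the text around <br/>.
header = [th.get_text(" ", strip=True) for th in all_rows[0].find_all("th")]

# Data rows are the ones that contain at least one <td>.
data_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in all_rows
    if tr.find("td") is not None
]

print(len(all_rows), len(data_rows))
print(header)
```

_Reading the names from the header cells is one way to avoid typing the column list by hand, at the cost of keeping characters such as `%` and spaces in the names._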


_We create the dataframe dadesBorsa, where we will store all the data extracted from the website:_

In [57]:
import pandas as pd

In [58]:
dadesBorsa = pd.DataFrame()

In [59]:
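for row in rows:
    cells = row.find_all('td')
    # The header row only has <th> cells, so it yields no data and is skipped
    if not cells:
        continue
    data = [cell.text for cell in cells]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    dadesBorsa = pd.concat([dadesBorsa, pd.DataFrame(data).T])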

In [60]:
dadesBorsa.columns =['Nombre','Anterior','Ultimo','PercDif','Maximo','Minimo','Fecha','Hora','PercDif2021']
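
_An alternative to growing the dataframe one row at a time is to collect the cell texts in a plain list and build the dataframe in a single call. The sketch below assumes the rows have already been extracted as lists of strings; `sample_rows` is stand-in data and the three-column layout is a simplification of the nine real columns._

```python
import pandas as pd

# Stand-in for the scraped rows; the header row yields an empty list
# because it has no <td> cells.
sample_rows = [
    [],
    ["IBEX 35®", "8.305,10", "8.404,50"],
    ["IBEX 35® con Dividendos", "25.546,10", "25.899,70"],
]

# Drop empty rows, then build the dataframe once with the column names attached.
clean_rows = [row for row in sample_rows if row]
df = pd.DataFrame(clean_rows, columns=["Nombre", "Anterior", "Ultimo"])

print(df.shape)
print(df.index.tolist())
```

_Built this way, the index runs 0, 1, 2, ... instead of repeating 0 for every row._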

_We review the data in our dataframe:_

In [61]:
dadesBorsa.head()

| | Nombre | Anterior | Ultimo | PercDif | Maximo | Minimo | Fecha | Hora | PercDif2021 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IBEX 35® | 8.305,10 | 8.404,50 | 1,20 | 8.405,70 | 8.342,00 | 01/12/2021 | 09:46:54 | 4,10 |
| 0 | IBEX 35® con Dividendos | 25.546,10 | 25.899,70 | 1,38 | 25.903,50 | 25.707,30 | 01/12/2021 | 09:46:54 | 6,75 |
| 0 | IBEX MEDIUM CAP® | 13.000,80 | 13.084,00 | 0,64 | 13.091,60 | 13.040,20 | 01/12/2021 | 09:46:25 | 2,89 |
| 0 | IBEX SMALL CAP® | 7.810,70 | 7.854,00 | 0,55 | 7.854,20 | 7.810,20 | 01/12/2021 | 09:46:07 | -3,01 |
| 0 | IBEX 35® Bancos | 428,70 | 437,20 | 1,98 | 437,40 | 430,80 | 01/12/2021 | 09:46:51 | 16,25 |


In [15]:
dadesBorsa.tail()

| | Nombre | Anterior | Ultimo | PercDif | Maximo | Minimo | Fecha | Hora | PercDif2021 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Índice ITX Inverso X3 | 190,30 | 225,00 | 18,23 | 227,30 | 208,30 | 30/11/2021 | 17:38:00 | -56,03 |
| 0 | Índice TEF Inverso X5 | 11.197,90 | 10.693,90 | -4,50 | 11.775,10 | 10.432,40 | 30/11/2021 | 17:38:00 | 101,28 |
| 0 | Índice SAN Inverso X5 | 5.818,30 | 6.163,60 | 5,93 | 6.503,20 | 5.620,10 | 30/11/2021 | 17:38:00 | 595,64 |
| 0 | Índice BBVA Inverso X5 | 12.438,90 | 13.723,60 | 10,33 | 14.183,30 | 12.402,60 | 30/11/2021 | 17:38:00 | 570,91 |
| 0 | Índice ITX Inverso X5 | 990,50 | 1.291,80 | 30,42 | 1.311,80 | 1.146,60 | 30/11/2021 | 17:38:00 | -81,69 |


## Level 2 - Exercise 2
Document your generated dataset in a Word document, with the information that the various Kaggle files include.

_Let's explore our dataset a little:_

In [62]:
dadesBorsa.shape

(81, 9)

In [63]:
dadesBorsa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81 entries, 0 to 0
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Nombre       81 non-null     object
 1   Anterior     81 non-null     object
 2   Ultimo       81 non-null     object
 3   PercDif      81 non-null     object
 4   Maximo       81 non-null     object
 5   Minimo       81 non-null     object
 6   Fecha        81 non-null     object
 7   Hora         81 non-null     object
 8   PercDif2021  81 non-null     object
dtypes: object(9)
memory usage: 6.3+ KB
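
_Since `info()` reports every column as `object`, `Fecha` and `Hora` are plain strings at this point. One possible way to combine them into a real timestamp, assuming the day/month/year and 24-hour formats seen in the table, is sketched below; the helper name `parse_timestamp` is hypothetical._

```python
from datetime import datetime

def parse_timestamp(fecha: str, hora: str) -> datetime:
    """Combine a 'dd/mm/yyyy' date and an 'HH:MM:SS' time into a datetime."""
    return datetime.strptime(f"{fecha.strip()} {hora.strip()}", "%d/%m/%Y %H:%M:%S")

# Values taken from the first row of the table.
ts = parse_timestamp("01/12/2021", "09:46:54")
print(ts.isoformat())
```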


In [64]:
dadesBorsa.describe()

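| | Nombre | Anterior | Ultimo | PercDif | Maximo | Minimo | Fecha | Hora | PercDif2021 |
|---|---|---|---|---|---|---|---|---|---|
| count | 81 | 81 | 81 | 81 | 81 | 81 | 81 | 81 | 81 |
| unique | 81 | 79 | 81 | 68 | 81 | 81 | 2 | 15 | 78 |
| top | IBEX 35® Inverso X3 | - | 16.055,75 | 1,11 | 5.482,40 | 8.342,00 | 01/12/2021 | 09:46:29 | 6,16 |
| freq | 1 | 3 | 1 | 4 | 1 | 1 | 70 | 37 | 2 |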


In [65]:
dadesBorsa.dtypes

Nombre         object
Anterior       object
Ultimo         object
PercDif        object
Maximo         object
Minimo         object
Fecha          object
Hora           object
PercDif2021    object
dtype: object
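
_All nine columns are strings, so any numeric analysis needs a conversion step first. The values follow the Spanish convention (a dot as thousands separator, a comma as decimal mark), and `describe()` shows that `Anterior` sometimes holds a bare `-`. Below is a minimal sketch of a converter; the helper name `parse_spanish_number` is hypothetical, and treating `-` as a missing value is an assumption._

```python
from typing import Optional

def parse_spanish_number(text: str) -> Optional[float]:
    """Convert '8.305,10' to 8305.1; return None for placeholders such as '-'."""
    cleaned = text.strip()
    if cleaned in ("", "-"):
        return None  # assumed to mean "no value available"
    # Remove thousands separators first, then switch the decimal comma to a dot.
    return float(cleaned.replace(".", "").replace(",", "."))

# "-0,75" is an arbitrary negative input used only to exercise the sign handling.
for raw in ["8.305,10", "1,20", "-0,75", "-"]:
    print(raw, "->", parse_spanish_number(raw))
```

_The helper could then be applied column by column, for example with `dadesBorsa['Anterior'].map(parse_spanish_number)`._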

## Level 3 - Exercise 3
Choose any website you like and scrape it using the Scrapy library.

_The website http://quotes.toscrape.com contains a collection of well-known quotes, each with its author's name._

_We fetch the page's HTML:_

In [29]:
import requests

URL3 = "http://quotes.toscrape.com"
page3 = requests.get(URL3)

print(page3.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

_The data we are interested in is inside the elements with class **quote**._

_We load the libraries we need:_

In [31]:
import scrapy
from scrapy.crawler import CrawlerProcess

_We create the required classes; the extracted data will be written to a JSON Lines file (one JSON object per line):_

In [32]:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('Exercici3_resultats_frases.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

In [33]:
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
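    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # Write each item to the .jl file through the pipeline defined above
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},
        # FEEDS (Scrapy 2.1+) replaces the deprecated FEED_FORMAT / FEED_URI settings
        'FEEDS': {'quoteresult.json': {'format': 'json'}},
    }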
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

_We run the crawler:_

In [34]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

2021-12-01 09:39:32 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2021-12-01 09:39:32 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.18362-SP0
2021-12-01 09:39:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor


In [35]:
process.crawl(QuotesSpider)

2021-12-01 09:39:33 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
  exporter = cls(crawler)



<Deferred at 0x25da9337310>

In [36]:
process.start()

_The extracted quotes are in the file Exercici3_resultats_frases.jl._
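
_A `.jl` file holds one JSON object per line, so it can be read back with the standard library. The sketch below uses an in-memory stand-in for the file (the sample records are made up) and a hypothetical helper `read_json_lines`; to read the real output, an open file handle would be passed instead._

```python
import io
import json

def read_json_lines(handle):
    """Yield one decoded object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# In-memory stand-in for Exercici3_resultats_frases.jl
sample = io.StringIO(
    '{"text": "Example quote", "author": "Example Author", "tags": ["demo"]}\n'
    '{"text": "Another quote", "author": "Another Author", "tags": []}\n'
)
quotes = list(read_json_lines(sample))
print(len(quotes), quotes[0]["author"])
```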