# [IT Academy - Data Science with Python](https://www.barcelonactiva.cat/es/itacademy)
## Sprint 16: Web Scraping
### [Github Web Scraping](https://github.com/jesussantana/Web-Scraping)

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)  
[![Made with Jupyter](https://img.shields.io/badge/Made%20with-Jupyter-orange?style=for-the-badge&logo=Jupyter)](https://jupyter.org/try)  


In [1]:
import pandas as pd
import numpy as np
# ^^^ pyforest auto-imports - don't write above this line
# ==============================================================================
# Auto Import Dependencies
# ==============================================================================
# pyforest imports dependencies according to use in the notebook
# ==============================================================================

In [2]:
#%pip install selenium
#%pip install webdriver_manager
#import sys
#!{sys.executable} -m pip install -U selenium
import pyforest

In [3]:
# Dependencies not Included in Auto Import*
# ==============================================================================

import csv
import requests
from time import sleep

# urllib
# ==============================================================================
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

# BeautifulSoup
# ==============================================================================
from bs4 import BeautifulSoup

# Selenium
# ==============================================================================
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

## Exercise 1: 
  - Perform web scraping of a page on the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

![Bolsa De Madrid](img/bolsademadrid.png)

## Web Scraping with Beautiful Soup

In [4]:
url = 'https://www.bolsamadrid.es'

### Function to Scrape a Web Page

In [5]:
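## Scrape a page, handling HTTP exceptions

def scrap_page(url):
    try:
        # requests raises its own exceptions, not urllib's HTTPError/URLError
        html = requests.get(url, timeout=30)
        # Turn 4xx/5xx responses into an HTTPError
        html.raise_for_status()

    except requests.exceptions.HTTPError as e:

        print(e)

    except requests.exceptions.RequestException:

        print("Server down or incorrect domain")

    else:
        soup = BeautifulSoup(html.content, 'html.parser')

        return soup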
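
`scrap_page` returns `None` when it catches an error, so a caller may want to retry before giving up. A minimal sketch, assuming a hypothetical helper `fetch_with_retries` (not part of the notebook) that wraps any fetch function such as `scrap_page`; a stand-in function replaces the real request so the example runs offline:

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns something other than None.

    Hypothetical helper, not part of the notebook.
    The wait doubles after every failed attempt (1 s, 2 s, 4 s, ...).
    """
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        # Only wait if another attempt is still to come
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for scrap_page that fails twice before succeeding
calls = []

def flaky_fetch(url):
    calls.append(url)
    return "<html>ok</html>" if len(calls) >= 3 else None

# Record the waits instead of actually sleeping
waits = []
page = fetch_with_retries(flaky_fetch, "https://www.bolsamadrid.es", sleep=waits.append)
print(page, waits)
```

In the notebook the call would look like `fetch_with_retries(scrap_page, url)`.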

## Scrape the Main Page

In [6]:
soup = scrap_page(url)

### Explore HTML

In [7]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head data-analytics-id="UA-35966870-2" data-app-path="/" data-bolsa="BMadrid" data-hora-act="Wed, 30 Jun 2021 06:45:42 GMT" data-idioma="esp">
  <meta content="IE=11" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Copyright © BME 2021" id="ctl00_copyright" name="copyright"/>
  <title>
   Bolsa de Madrid
  </title>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=Todo" id="ctl00_RSSLink1" rel="alternate" title="Bolsa de Madrid: Todos los contenidos agregados" type="application/rss+xml"/>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=NotasPrensa" id="ctl00_RSSLink2" rel="alternate" title="Bolsa de Madrid: Notas de Prensa" type="application/rss+xml"/>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=Regulacion" id="ctl00_RSSLink3" rel="alternate" title="Bolsa de Madrid: 

In [8]:
soup.title

<title>
	Bolsa de Madrid
</title>

In [9]:
soup.title.parent.prettify()

'<head data-analytics-id="UA-35966870-2" data-app-path="/" data-bolsa="BMadrid" data-hora-act="Wed, 30 Jun 2021 06:45:42 GMT" data-idioma="esp">\n <meta content="IE=11" http-equiv="X-UA-Compatible"/>\n <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n <meta content="Copyright © BME 2021" id="ctl00_copyright" name="copyright"/>\n <title>\n  Bolsa de Madrid\n </title>\n <link href="/esp/aspx/RSS/RSS.ashx?feed=Todo" id="ctl00_RSSLink1" rel="alternate" title="Bolsa de Madrid: Todos los contenidos agregados" type="application/rss+xml"/>\n <link href="/esp/aspx/RSS/RSS.ashx?feed=NotasPrensa" id="ctl00_RSSLink2" rel="alternate" title="Bolsa de Madrid: Notas de Prensa" type="application/rss+xml"/>\n <link href="/esp/aspx/RSS/RSS.ashx?feed=Regulacion" id="ctl00_RSSLink3" rel="alternate" title="Bolsa de Madrid: Regulación: Circulares e Instrucciones Operativas" type="application/rss+xml"/>\n <link href="/esp/aspx/RSS/RSS.ashx?feed=Indices" id="ctl00_RSSLink4" rel="alternate"

In [10]:
for child in soup.title.children:
    print(child)


	Bolsa de Madrid



### Search for data of interest

In [11]:
soup.table

<table align="center" cellpadding="0" cellspacing="0"><tr class="noimpr">
<td colspan="6" id="CabeceraArr">
<div id="Idiomas"><ul><li class="mclick"><a href="/?id=ing" target="_self"> English </a></li></ul></div>
<div id="MenuSup"><ul><li class="mclick"><a href="/esp/BMadrid/Contacto.aspx" target="_self"> Contacto </a></li><li class="mclick"><a href="/esp/Inversores/Agenda/HorarioMercado.aspx" target="_self"> Horario Mercado </a></li><li class="mclick"><a href="/esp/aspx/Inversores/Agenda/Calendario.aspx" target="_self"> Calendario bursátil </a></li><li class="mclick"><a href="/esp/RSS.aspx" target="_self"> RSS   <img align="absmiddle" alt="RSS" border="0" src="/images/IconoRSS.png"/> </a></li></ul></div>
</td>
</tr><tr valign="top">
<td class="BaseIzq noimpr" rowspan="3"><div></div></td>
<td class="BaseSep noimpr" rowspan="3"></td>
<td class="BaseMenu" id="CabeceraLogo"><a href="/?id=esp"><img alt="Bolsa de Madrid" border="0" src="/images/Base/LogoBMadrid.gif"/></a></td>
<td class="no

In [12]:
print(soup.find_all('a'))

[<a href="/?id=ing" target="_self"> English </a>, <a href="/esp/BMadrid/Contacto.aspx" target="_self"> Contacto </a>, <a href="/esp/Inversores/Agenda/HorarioMercado.aspx" target="_self"> Horario Mercado </a>, <a href="/esp/aspx/Inversores/Agenda/Calendario.aspx" target="_self"> Calendario bursátil </a>, <a href="/esp/RSS.aspx" target="_self"> RSS   <img align="absmiddle" alt="RSS" border="0" src="/images/IconoRSS.png"/> </a>, <a href="/?id=esp"><img alt="Bolsa de Madrid" border="0" src="/images/Base/LogoBMadrid.gif"/></a>, <a href="https://www.bolsasymercados.es/" target="_blank"><img alt="Bolsas y Mercados Españoles" border="0" height="45" src="/images/Base/LogoBMEBlanco.png?v=Six" width="118"/></a>, <a></a>, <a href="javascript:document.forms.formBusq.submitbusq();"><span class="BtnBuscarDcha" title="Buscar"></span></a>, <a href="/?id=esp" target="_self">Inicio</a>, <a href="#" target="_self">SOBRE NOSOTROS</a>, <a href="/esp/BMadrid/BMadrid.aspx" target="_self">Bolsa de Madrid</a>, 

In [13]:
list(soup.find_all('a', string='Acciones'))

[<a href="/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000" target="_self">Acciones</a>,
 <a href="/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000" target="_self">Acciones</a>]

In [14]:
# search related links 

links = []

for link in soup.find_all('a', string='Acciones'):
    
    links.append(link.get('href'))

In [15]:
links

['/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000',
 '/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000']

In [16]:
new_url = url + links[0]

new_url

'https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000'
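
String concatenation works here because `url` has no trailing slash and the `href` starts with `/`. A slightly more robust sketch, using the same relative links as above, resolves them with `urllib.parse.urljoin` and drops the duplicate while keeping order:

```python
from urllib.parse import urljoin

base_url = 'https://www.bolsamadrid.es'
hrefs = [
    '/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000',
    '/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000',
]

# dict.fromkeys keeps the first occurrence of each href, in order
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves each href against the base URL, as a browser would
absolute_links = [urljoin(base_url, href) for href in unique_hrefs]
print(absolute_links)
```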

## Scrape Sub Page - esp/aspx/Mercados/Precios.aspx?indice=ESI100000000 -

In [17]:
soup2 = scrap_page(new_url)

In [18]:
print(soup2.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head data-analytics-id="UA-35966870-2" data-app-path="/" data-bolsa="BMadrid" data-hora-act="Wed, 30 Jun 2021 06:46:07 GMT" data-idioma="esp">
  <meta content="IE=11" http-equiv="X-UA-Compatible"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Copyright © BME 2021" id="ctl00_copyright" name="copyright"/>
  <title>
   Bolsa de Madrid - Precios de la sesión
  </title>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=Todo" id="ctl00_RSSLink1" rel="alternate" title="Bolsa de Madrid: Todos los contenidos agregados" type="application/rss+xml"/>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=NotasPrensa" id="ctl00_RSSLink2" rel="alternate" title="Bolsa de Madrid: Notas de Prensa" type="application/rss+xml"/>
  <link href="/esp/aspx/RSS/RSS.ashx?feed=Regulacion" id="ctl00_RSSLink3" rel="alternate" t

In [19]:
soup2.table

<table align="center" cellpadding="0" cellspacing="0"><tr class="noimpr">
<td colspan="6" id="CabeceraArr">
<div id="Idiomas"><ul><li class="mclick"><a href="/?id=ing" target="_self"> English </a></li></ul></div>
<div id="MenuSup"><ul><li class="mclick"><a href="/esp/BMadrid/Contacto.aspx" target="_self"> Contacto </a></li><li class="mclick"><a href="/esp/Inversores/Agenda/HorarioMercado.aspx" target="_self"> Horario Mercado </a></li><li class="mclick"><a href="/esp/aspx/Inversores/Agenda/Calendario.aspx" target="_self"> Calendario bursátil </a></li><li class="mclick"><a href="/esp/RSS.aspx" target="_self"> RSS   <img align="absmiddle" alt="RSS" border="0" src="/images/IconoRSS.png"/> </a></li></ul></div>
</td>
</tr><tr valign="top">
<td class="BaseIzq noimpr" rowspan="3"><div></div></td>
<td class="BaseSep noimpr" rowspan="3"></td>
<td class="BaseMenu" id="CabeceraLogo"><a href="/?id=esp"><img alt="Bolsa de Madrid" border="0" src="/images/Base/LogoBMadrid.gif"/></a></td>
<td class="no

In [20]:
scrap_url = new_url + '&punto=indice'

In [21]:
scrap_url

'https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000&punto=indice'

### Scrape Table Sub Page esp/aspx/Mercados/Precios.aspx?indice=ESI100000000&punto=indice

In [22]:
## Function to scrape the page table and return a DataFrame, handling HTTP exceptions

def get_table(round, url=url):
    
    round_url = f'{url}/{round}'
    
    # Call function to scrap page
    soup = scrap_page(round_url)
        
    
    # Extract columns & rows for the table
    rows = []
    
    for child in soup.find_all('table')[4].children:
        row = []
        
        for td in child:
            
            try:
                row.append(td.text.replace('\n', ''))
                
            except AttributeError:
                continue
                
        if len(row) > 0:
            
            rows.append(row)
                
    # Create DataFrame from the rows
    df = pd.DataFrame(rows[1:], columns=rows[0])

    return df
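
The function above picks the fifth `<table>` on the page by position and walks its children row by row. To illustrate the same idea without network access or extra dependencies, here is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The `TableRows` class and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # cells of the row being read, or None
        self._cell = None     # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:                 # skip rows without cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample, shaped like the first rows of the prices table
sample_html = """
<table>
  <tr><th>Nombre</th><th>Fecha</th></tr>
  <tr><td>ACCIONA</td><td>29/06/2021</td></tr>
  <tr><td>ACERINOX</td><td>29/06/2021</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows
print(header)
print(body)
```

The resulting `header` and `body` could then be passed to `pd.DataFrame(body, columns=header)`, as `get_table` does with its own rows.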

### Create DataFrame

In [23]:
df = get_table(1, scrap_url)

In [24]:
df

Unnamed: 0,Nombre,Últ.,% Dif.,Máx.,Mín.,Volumen,Efectivo (miles €),Fecha,Hora
0,ACCIONA,1279000,55,1287000,1255000,134.957,"17.228,35",29/06/2021,Cierre
1,ACERINOX,102050,252,102150,99380,1.245.523,"12.652,92",29/06/2021,Cierre
2,ACS,225600,108,227000,222700,859.168,"19.364,47",29/06/2021,Cierre
3,AENA,1387500,-222,1424000,1375500,204.709,"28.513,77",29/06/2021,Cierre
4,ALMIRALL,148500,-34,150200,148300,278.377,"4.143,63",29/06/2021,Cierre
5,AMADEUS,589800,-372,609200,585400,981.049,"58.245,16",29/06/2021,Cierre
6,ARCELORMIT.,264100,367,265550,255400,484.879,"12.682,41",29/06/2021,Cierre
7,B.SANTANDER,32740,35,33080,32505,22.552.387,"73.841,55",29/06/2021,Cierre
8,BA.SABADELL,5742,-49,5904,5702,26.484.511,"15.263,90",29/06/2021,Cierre
9,BANKINTER,42590,-16,43210,42380,1.526.039,"6.507,48",29/06/2021,Cierre


### Create CSVs in a loop

In [None]:
# Create the CSVs in a loop, pausing a random interval between calls so the site does not block us
# Change the range to select how many rounds to fetch (30 here, e.g. one per day)

for round in range(1, 31):
    
    table = get_table(round, scrap_url)
    table.to_csv(f'../data/external/PL_tablematchday{round}.csv', index=False)
    sleep(np.random.randint(1, 10))
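
`to_csv` fails if the target folder does not exist, so it can help to build the output path up front. A minimal sketch, assuming a hypothetical helper `csv_path_for` and a temporary directory standing in for `../data/external`:

```python
import random
import tempfile
from pathlib import Path

def csv_path_for(round_number, base_dir):
    """Return the CSV path for one round, creating base_dir if needed."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    return base / f'PL_tablematchday{round_number}.csv'

with tempfile.TemporaryDirectory() as tmp:
    paths = [csv_path_for(n, Path(tmp) / 'external') for n in range(1, 4)]
    names = [p.name for p in paths]
    folder_exists = paths[0].parent.is_dir()

# A random pause between requests, in seconds, in the spirit of the loop above
delay = random.uniform(1, 10)
print(names, folder_exists)
```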

# Selenium

![esmarketingdigital](img/esmarketingdigital.png)

In [25]:
url = "https://esmarketingdigital.com"

In [26]:
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 91.0.4472
Get LATEST driver version for 91.0.4472
Driver [/home/jesus/.wdm/drivers/chromedriver/linux64/91.0.4472.101/chromedriver] found in cache


In [27]:
driver.get(url)
print(driver.page_source)

<html class="wide wow-animation desktop landscape rd-navbar-static-linked" lang="en"><head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <title>
    es Marketing Digital | Fresh Ideas for Grow Your Business
  </title>
  <meta name="description" content="Digital Marketing Services for Business, Data Science, Applications Development, Social Selling, Positioning, 
      Seo for results, Marketing Automation, Web Analytics.">
  <meta name="author" content="Jesús Santana | esMarketingdigital.com | info@esmarketingdigital.com">
  <meta content="global" name="distribution">
  <meta content="15 days" name="revisit">
  <meta content="15 days" name="revisit-after">
  <meta content="document" name="resource-type">
  <meta content="all" name="audience">
  <meta content="general" name="rating">
  <meta content="all" name="robots">
  <meta content="es-ES" name="langua

In [28]:
print(driver.title)

es Marketing Digital | Fresh Ideas for Grow Your Business


In [29]:
print(driver.current_url)

https://esmarketingdigital.com/


In [30]:
portfolio = driver.find_element(By.CSS_SELECTOR, "#portfolio")

In [31]:
print(portfolio.text)

We offer different solutions that completely cover the needs of our clients.
Feel free to request a custom project or learn more about some of our latest work below.


In [32]:
for a in driver.find_elements(By.XPATH, './/a'):
    print(a.get_attribute('href'))

tel:+34608574193
mailto:info@esmarketingdigital.com
https://esmarketingdigital.com/#
https://es.linkedin.com/in/chus-santana
https://github.com/jesussantana
https://twitter.com/esmktdigital
https://www.facebook.com/esMarketingDigital.es/
https://www.instagram.com/esmarketingdigital
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://www.esmarketingdigital.es/
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://esmarketingdigital.com/#
https://www.construimostucasa.com/
https://teachablemachine.withgoogle.com/models/ILuyfBO_9/
https://esmarketingdigital.com/#
https://jesussantana.github.io/TetrisConquest/
https://teachablemachine.withgoogle.com/models/51m9nmioQ/
https://drive.google.com/drive/folders/1-CuUumpUn0aP398X21IrwKuHo2NOVHXn?usp=sharing
https://github.com/jesussantana
mailto:info@esmarketingdigital.com
tel:+34608574193
None
https://esmar
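
The list mixes page links with `tel:` and `mailto:` entries, repeated anchors and a `None`. A small sketch, using a few of the values printed above, keeps only distinct `http`/`https` links; `web_links` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def web_links(hrefs):
    """Return distinct http(s) links in their original order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:                      # skip None and empty strings
            continue
        if urlparse(href).scheme not in ('http', 'https'):
            continue                      # skip tel:, mailto: and similar
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result

hrefs = [
    'tel:+34608574193',
    'mailto:info@esmarketingdigital.com',
    'https://esmarketingdigital.com/#',
    'https://github.com/jesussantana',
    'https://esmarketingdigital.com/#',
    None,
    'https://github.com/jesussantana',
]
links = web_links(hrefs)
print(links)
```

With Selenium, the input list could come from `[a.get_attribute('href') for a in driver.find_elements(By.XPATH, './/a')]`.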

In [33]:
driver.quit()

## Exercise 2: 
  - Document the dataset you generated in a Word file, including the information that Kaggle dataset files provide.

In [34]:
ibex = pd.read_csv('../data/external/PL_tablematchday1.csv', parse_dates = [0])

In [35]:
ibex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Nombre              35 non-null     object
 1   Últ.                35 non-null     object
 2   % Dif.              35 non-null     object
 3   Máx.                35 non-null     object
 4   Mín.                35 non-null     object
 5   Volumen             35 non-null     object
 6   Efectivo (miles €)  35 non-null     object
 7   Fecha               35 non-null     object
 8   Hora                35 non-null     object
dtypes: object(9)
memory usage: 2.6+ KB


In [36]:
ibex.head()

Unnamed: 0,Nombre,Últ.,% Dif.,Máx.,Mín.,Volumen,Efectivo (miles €),Fecha,Hora
0,ACCIONA,1279000,55,1287000,1255000,134.957,"17.228,35",29/06/2021,Cierre
1,ACERINOX,102050,252,102150,99380,1.245.523,"12.652,92",29/06/2021,Cierre
2,ACS,225600,108,227000,222700,859.168,"19.364,47",29/06/2021,Cierre
3,AENA,1387500,-222,1424000,1375500,204.709,"28.513,77",29/06/2021,Cierre
4,ALMIRALL,148500,-34,150200,148300,278.377,"4.143,63",29/06/2021,Cierre
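
Every column is read as `object` because the values use Spanish number formatting: a dot as thousands separator and a comma as decimal mark. A minimal sketch of a converter, assuming a hypothetical helper `parse_es_number`; it is applied here to values from the `Volumen` and `Efectivo (miles €)` columns shown above.

```python
def parse_es_number(text):
    """Convert a Spanish-formatted number such as '17.228,35' to a float."""
    # Drop thousands separators, then turn the decimal comma into a dot
    cleaned = text.strip().replace('.', '').replace(',', '.')
    return float(cleaned)

volumen = parse_es_number('1.245.523')
efectivo = parse_es_number('17.228,35')
print(volumen, efectivo)
```

On the DataFrame this could be applied per column, for example `ibex['Volumen'].map(parse_es_number)`.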


# Creating a Dataset with Web Scraping

## Updated price variation of the IBEX 35 shares  

### Context  

The main objective of this dataset is to collect the IBEX 35 results for 30 sessions in real time. The data comes from the following page:

https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000&punto=indice  

### Dataset title  

The title is: PL_tablematchday{n}, where {n} is the round number used in the CSV file name.

### Description of the dataset  

This dataset collects the main information on the rise and fall of the share prices of the IBEX 35 companies in real time. It includes the last price, the percentage difference from the previous day, the price range during the day, the volume of shares, the amount traded in euros, and the date and time of the data when the website was accessed.

### Content:  

- Nombre: company name
- Últ.: last share price (in euros)
- % Dif.: percentage difference between the last share price and the closing price of the previous day
- Máx.: maximum price the shares reached during the day
- Mín.: minimum price the shares reached during the day
- Volumen: number of shares traded for each company
- Efectivo (miles €): total value of the shares traded (in thousands of euros)
- Fecha: day, month and year of the available information (dd/mm/yyyy)
- Hora: time of the available information (hh:mm). After the closing time (17:35 Spanish time) the value is 'Cierre' (closing).  
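
The `Fecha` and `Hora` fields described above can be combined into a single timestamp. A minimal sketch, assuming a hypothetical helper `parse_session_time` that maps the 'Cierre' marker seen in the data to the 17:35 closing time mentioned in the list; the intraday time `10:15` is an invented sample value.

```python
from datetime import datetime

CLOSING_TIME = '17:35'  # closing time stated in the description (Spanish time)

def parse_session_time(fecha, hora):
    """Combine 'dd/mm/yyyy' and 'hh:mm' (or 'Cierre') into a datetime."""
    if hora.strip().lower() == 'cierre':
        hora = CLOSING_TIME
    return datetime.strptime(f'{fecha.strip()} {hora.strip()}', '%d/%m/%Y %H:%M')

closing = parse_session_time('29/06/2021', 'Cierre')
intraday = parse_session_time('29/06/2021', '10:15')  # invented sample time
print(closing, intraday)
```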

### Code and dataset

https://github.com/jesussantana/Web-Scraping

## Exercise 3: 
  - Choose a web page you want and do web scraping using the Scrapy library.

## Web Scraping with Scrapy

### Create an HTML file with a copy of the web page's code

In [None]:
"""import scrapy


class QuotesSpider(scrapy.Spider):
    name = "spider"

    def start_requests(self):
        urls = [
            'http://esmarketingdigital.es/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # For a root URL this is the host name; only used by the per-page filename below
        page = response.url.split("/")[-2]
        filename = 'esmarketingdigital_copy.html'
        #filename = f'spyder-{page}.html'
        # Save the raw response body as a local copy of the page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')"""
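
The spider always writes to the same file name. If more than one page were crawled, a name could be derived from each URL instead, which is what the commented-out `spyder-{page}.html` line hints at. A minimal sketch with the standard library, where `filename_for` is a hypothetical helper and not part of Scrapy:

```python
from urllib.parse import urlparse

def filename_for(url, prefix='spider'):
    """Build a file name such as 'spider-example.com.html' from a URL."""
    parsed = urlparse(url)
    # Use the host plus any non-empty path segments, joined with dashes
    parts = [parsed.netloc] + [seg for seg in parsed.path.split('/') if seg]
    return f"{prefix}-{'-'.join(parts)}.html"

root_name = filename_for('http://esmarketingdigital.es/')
deep_name = filename_for('https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx')
print(root_name)
print(deep_name)
```

Inside the spider, the call would be `filename_for(response.url)`.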

In [37]:
!scrapy runspider myspider.py

2021-06-30 06:47:39 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-06-30 06:47:39 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.5 (default, May 27 2021, 13:30:53) - [GCC 9.3.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.8.0-59-generic-x86_64-with-glibc2.29
2021-06-30 06:47:39 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-06-30 06:47:39 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2021-06-30 06:47:39 [scrapy.extensions.telnet] INFO: Telnet Password: 742671d6fff5141b
2021-06-30 06:47:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-06-30 06:47:39 [scrapy.middleware] INFO: Enabled downloader mi

In [38]:
print("File esmarketingdigital_copy.html created")

File esmarketingdigital_copy.html created


![Spider esmarketingdigital](img/spider_esmarketingdigital.png)

In [39]:
filename = "esmarketingdigital_copy.html"

In [40]:
# Use a context manager so the file is closed after reading
with open(filename) as f:
    print(f.read())

<!DOCTYPE html>
<html dir='ltr' lang='es-ES' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<link href='https://www.blogger.com/static/v1/widgets/204402360-widget_css_bundle.css' rel='stylesheet' type='text/css'/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async='async' src='https://www.googletagmanager.com/gtag/js?id=UA-82403465-1'></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-82403465-1');
</script>
<meta content='width=device-width, initial-scale=1, maximum-scale=1' name='viewport'/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<meta content='blogger' name='generator'/>
<link href='https://www.esmarketingdigital.es/favicon.ico' rel='icon' type='image/x-icon'/>
<link href='https://www.esmarketin