# Scraping JavaScript without JavaScript

Web scraping techniques for extracting information from sites that rely heavily on JavaScript, without rendering them in a headless browser, using only Python, BeautifulSoup, requests, the Firefox inspector, and regex.

## Case 1 - The start of a search

In this case we will see how to extract the results of a Google search by digging into the HTTP protocol and understanding the format of URLs.


**Resources**
- [HTTP Methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)
- [HTTP Overview](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)


In [1]:
!pip install beautifulsoup4 requests



Anatomy of a URL

![URL Anatomy](https://pbs.twimg.com/media/EzLO6aeVoAcYyYM.jpg)
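
As a minimal sketch of the anatomy pictured above, the standard library's `urllib.parse` can split the Google search URL used in this case into its parts. The variable names are illustrative only; the calls (`quote`, `quote_plus`, `urlparse`, `parse_qs`) are all part of the Python standard library.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlparse

# quote() percent-encodes the space; quote_plus() writes it as "+" instead.
encoded = quote("dolar blue")
encoded_plus = quote_plus("dolar blue")

url = f"https://www.google.com/search?q={encoded}"
parts = urlparse(url)

print(parts.scheme)  # protocol
print(parts.netloc)  # host (and port, if any)
print(parts.path)    # resource path on the host
print(parts.query)   # raw query string, still encoded

# parse_qs() decodes the query string into a dict of lists.
params = parse_qs(parts.query)
print(params)
```

Both encodings decode to the same search text, which is why the hrefs returned by Google show `dolar+blue` even though the request was sent with `dolar%20blue`.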

In [3]:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import quote, urlparse

search = quote("dolar blue")  # URL-encode the search string
rsp = requests.get(f"https://www.google.com/search?q={search}")
soup = bs(rsp.text)
links = soup.find_all(href=True)  # every tag that carries an href attribute
# links = soup.find_all("a")  # alternative: anchor tags only
links

[<a href="/?sa=X&amp;ved=0ahUKEwj-6MLIobbzAhWRElkFHQx2DJ4QOwgC"><span class="V6gwVd">G</span><span class="iWkuvd">o</span><span class="cDrQ7">o</span><span class="V6gwVd">g</span><span class="ntlR9">l</span><span class="iWkuvd tJ3Myc">e</span></a>,
 <a class="l" href="/?output=search&amp;ie=UTF-8&amp;sa=X&amp;ved=0ahUKEwj-6MLIobbzAhWRElkFHQx2DJ4QPAgE"><span class="V6gwVd">G</span><span class="iWkuvd">o</span><span class="cDrQ7">o</span><span class="V6gwVd">g</span><span class="ntlR9">l</span><span class="iWkuvd tJ3Myc">e</span></a>,
 <a href="/search?q=dolar+blue&amp;ie=UTF-8&amp;gbv=1&amp;sei=a9ZdYb6HOpGl5NoPjOyx8Ak">here</a>,
 <a class="eZt8xd" href="/search?q=dolar+blue&amp;ie=UTF-8&amp;source=lnms&amp;tbm=nws&amp;sa=X&amp;ved=0ahUKEwj-6MLIobbzAhWRElkFHQx2DJ4Q_AUICCgB">News</a>,
 <a class="eZt8xd" href="/search?q=dolar+blue&amp;ie=UTF-8&amp;source=lnms&amp;tbm=isch&amp;sa=X&amp;ved=0ahUKEwj-6MLIobbzAhWRElkFHQx2DJ4Q_AUICSgC">Images</a>,
 <a class="eZt8xd" href="/search?q=dolar+blue&a

In [4]:
from urllib.parse import parse_qs
for x in links:
  u = urlparse(x["href"])
  # Result links carry the destination URL in the "q" query parameter
  possible_url = parse_qs(u.query).get("q")
  if possible_url:
    parsed = urlparse(possible_url[0])
    # Keep only absolute URLs that point outside Google
    if parsed.netloc != "" and "google.com" not in parsed.netloc:
      print(possible_url)

['https://dolarhoy.com/cotizaciondolarblue']
['https://dolarhoy.com/cotizacion-dolar-mep']
['https://dolarhoy.com/cotizacion-dolar-bolsa']
['https://dolarhoy.com/seccion/merval']
['https://dolarhoy.com/seccion/empresas']
['https://www.cronista.com/MercadosOnline/moneda.html?id=ARSB']
['https://www.cronista.com/finanzas-mercados/dolar-hoy-puedo-comprar-dolares-con-cuentas-bancarias-cotitulares/']
['https://www.cronista.com/finanzas-mercados/todos-los-dolares-falsos-que-se-detectaron-este-ano-que-es-el-dolar-morgan/']
['https://www.cronista.com/MercadosOnline/dolar.html']
['https://www.pagina12.com.ar/372493-dolar-blue-hoy-a-cuanto-cotiza-el-lunes-04-de-octubre']
['https://www.ambito.com/contenidos/dolar.html']
['https://www.ambito.com/finanzas/dolar-blue/hoy-cuanto-cerro-este-martes-5-octubre-n5292366']
['https://www.lanacion.com.ar/economia/dolar/dolar-blue-hoy-a-cuanto-cotiza-el-martes-5-de-octubre-nid05102021/']
['https://www.lanacion.com.ar/economia/dolar/dolar-blue-hoy-a-cuanto-cot
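
The filter in the cell above works because the `q` parameter plays two roles: in navigation links it holds the search text (`dolar blue`, which has no host), while in result links it holds the destination URL. The sketch below wraps that logic in a function and runs it on sample hrefs so it can be tried offline. The `/url?q=...&sa=U` shape of the sample result links and the `external_targets` helper are assumptions for illustration, not something confirmed by the output shown here.

```python
from urllib.parse import parse_qs, urlparse


def external_targets(hrefs, own_domain="google.com"):
    """Return the unique absolute URLs carried in the `q` parameter of each href."""
    targets = []
    for href in hrefs:
        candidates = parse_qs(urlparse(href).query).get("q", [])
        for candidate in candidates[:1]:
            netloc = urlparse(candidate).netloc
            # An empty netloc means `q` was plain search text, not a URL.
            if netloc and own_domain not in netloc and candidate not in targets:
                targets.append(candidate)
    return targets


sample_hrefs = [
    "/search?q=dolar+blue&ie=UTF-8&gbv=1",                   # q is the search text
    "/url?q=https://dolarhoy.com/cotizaciondolarblue&sa=U",  # assumed result-link shape
    "/url?q=https://dolarhoy.com/cotizaciondolarblue&sa=U",  # duplicate, dropped
    "/url?q=https://accounts.google.com/ServiceLogin&sa=U",  # hypothetical internal link
]
print(external_targets(sample_hrefs))
```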

## Case 2 - Following the price of the dollar

Here we skip the HTML entirely and request the JSON directly from the endpoint that serves the data.

![Web arch](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview/fetching_a_page.png)

![web development timeline](https://raw.githubusercontent.com/mraible/history-of-web-frameworks-timeline/master/history-of-web-frameworks-timeline.png)

**Resources**
- [web development timeline](https://github.com/mraible/history-of-web-frameworks-timeline)

In [5]:
url = "https://www.dolarsi.com/api/api.php?type=valoresprincipales"

r = requests.get(url)
data = r.json()
data[1]

{'casa': {'agencia': '310',
  'compra': '182,00',
  'decimales': '2',
  'nombre': 'Dolar Blue',
  'variacion': '0',
  'venta': '185,00',
  'ventaCero': 'TRUE'}}
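
The API returns prices as strings with a comma as the decimal separator, so they need converting before any arithmetic. Below is a minimal sketch using the record shown above; `parse_ars` is a hypothetical helper, and treating `.` as a thousands separator is an assumption that the sample data does not confirm.

```python
def parse_ars(value):
    """Convert a string such as '182,00' (comma as decimal separator) to float.

    Assumption: any '.' in the input is a thousands separator and is dropped.
    """
    return float(value.replace(".", "").replace(",", "."))


# Record copied from the output above (only the fields used here).
record = {"casa": {"nombre": "Dolar Blue", "compra": "182,00", "venta": "185,00"}}

compra = parse_ars(record["casa"]["compra"])  # buy price
venta = parse_ars(record["casa"]["venta"])    # sell price
spread = venta - compra
print(record["casa"]["nombre"], compra, venta, spread)
```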

## Case 3 - Not everything is JSON or HTML

In this case we rely on the robots.txt and sitemap.xml conventions to get the list of URLs we are interested in.

**Resources**
- [The anatomy of a Large-Scale Hypertextual Web Search Engine](http://infolab.stanford.edu/~backrub/google.html)
- [Introduction to robots.txt](https://developers.google.com/search/docs/advanced/robots/intro)
- [Sitemaps.org](https://sitemaps.org/)



In [7]:
robots = requests.get("https://www.carrefour.com.ar/robots.txt")
sitemap = None
for x in robots.text.split("\n"):
  # A "Sitemap: <url>" line points to the site's sitemap
  if "Sitemap" in x:
    print(x.split()[1])
    sitemap = x.split()[1]

https://carrefour.com.ar/sitemap.xml
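
As an alternative to splitting lines by hand, the standard library's `urllib.robotparser` can read the same file and also answer whether a path may be fetched. This is a sketch under assumptions: the robots.txt text below is a made-up sample (only the `Sitemap` line comes from the output above, the `Disallow` rule is hypothetical), `my-scraper` is an invented user agent, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Sample content; the Disallow rule is hypothetical, not Carrefour's real file.
robots_txt = """\
User-agent: *
Disallow: /checkout/

Sitemap: https://carrefour.com.ar/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of sitemap URLs, or None when there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# can_fetch() applies the Allow/Disallow rules for the given user agent.
allowed = parser.can_fetch("my-scraper", "https://www.carrefour.com.ar/sitemap.xml")
blocked = parser.can_fetch("my-scraper", "https://www.carrefour.com.ar/checkout/cart")
print(allowed, blocked)
```

With a live site, `parser.set_url(...)` followed by `parser.read()` would replace the `parse()` call.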


In [8]:
rsitemap = requests.get(sitemap)
soup = bs(rsitemap.text)
soup

<html><body><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><sitemap>
<loc>https://www.carrefour.com.ar/sitemap/appsRoutes-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/brand-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/category-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/subcategory-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/department-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/product-0.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.carrefour.com.ar/sitemap/product-1.xml</loc>
<lastmod>2021-10-06T06:57:14.500Z</lastmod>
</sitemap>
<sitemap>
<loc>https://w

In [9]:
product_links = []
for x in soup.find_all("loc"):
  if "product" in x.text:
    print(x.text)
    product_links.append(x.text)
products_response = requests.get(product_links[0])
soup = bs(products_response.text)

https://www.carrefour.com.ar/sitemap/product-0.xml
https://www.carrefour.com.ar/sitemap/product-1.xml
https://www.carrefour.com.ar/sitemap/product-2.xml
https://www.carrefour.com.ar/sitemap/product-3.xml
https://www.carrefour.com.ar/sitemap/product-4.xml


In [11]:
product_link = [x.text for x in soup.find_all("loc")]
for p in product_link[:5]:
  print(p)

https://www.carrefour.com.ar/pegamento-para-plasticos-suprabond-25-cc/p
https://www.carrefour.com.ar/adhesivo-transparente-poxipol-16-g/p
https://www.carrefour.com.ar/masilla-epoxi-poxilina-70-g/p
https://www.carrefour.com.ar/burlete-autoadhesivo-suprabond-5-mt/p
https://www.carrefour.com.ar/limpia-radiadores-bardahl--350-cc/p
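
Parsing the sitemap with BeautifulSoup's HTML parser works, but it wraps the document in `<html><body>` as seen in the output above. A possible alternative, sketched here with the standard library's `xml.etree.ElementTree`, treats the sitemap as XML and uses the sitemaps.org namespace explicitly. The sample document is a shortened copy of the index shown earlier, and `sitemap_locs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemap index shown earlier.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened sample of the sitemap index, for offline experimentation.
sitemap_index = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.carrefour.com.ar/sitemap/brand-0.xml</loc>
    <lastmod>2021-10-06T06:57:14.500Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.carrefour.com.ar/sitemap/product-0.xml</loc>
    <lastmod>2021-10-06T06:57:14.500Z</lastmod>
  </sitemap>
</sitemapindex>
"""


def sitemap_locs(xml_text):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]


all_locs = sitemap_locs(sitemap_index)
product_sitemaps = [url for url in all_locs if "product" in url]
print(product_sitemaps)
```

The same function applies to the per-product sitemaps, since they use `<loc>` elements in the same namespace.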


## Case 4 - JavaScript watches it on YouTube

**Resources**
- [BigPipe](https://www.facebook.com/notes/10158791368532200/)

In [12]:
url = "https://www.youtube.com/watch?v=828rZgV9t1g"
r = requests.get(url)
soup = bs(r.text)

In [15]:
for s in soup.find_all('script'):
  print(s)
  print("======")

<script nonce="e8+s5C6q+RYAq4hJ2lS8Tg">var ytcfg={d:function(){return window.yt&&yt.config_||ytcfg.data_||(ytcfg.data_={})},get:function(k,o){return k in ytcfg.d()?ytcfg.d()[k]:o},set:function(){var a=arguments;if(a.length>1)ytcfg.d()[a[0]]=a[1];else for(var k in a[0])ytcfg.d()[k]=a[0][k]}};
window.ytcfg.set('EMERGENCY_BASE_URL', '\/error_204?t\x3djserror\x26level\x3dERROR\x26client.name\x3d1\x26client.version\x3d2.20211005.01.00-canary_control');</script>
<script nonce="e8+s5C6q+RYAq4hJ2lS8Tg">(function(){window.yterr=window.yterr||true;window.unhandledErrorMessages={};window.unhandledErrorCount=0;
window.onerror=function(msg,url,line,columnNumber,error){var err;if(error)err=error;else{err=new Error;err.stack="";err.message=msg;err.fileName=url;err.lineNumber=line;if(!isNaN(columnNumber))err["columnNumber"]=columnNumber}var message=String(err.message);if(!err.message||message in window.unhandledErrorMessages||window.unhandledErrorCount>=5)return;window.unhandledErrorCount+=1;window.un

In [20]:
# Looking at one script tag in detail, we can see that
# "ytInitialPlayerResponse" is a JSON object
str(soup.find_all('script')[18])

'<script nonce="e8+s5C6q+RYAq4hJ2lS8Tg">var ytInitialPlayerResponse = {"responseContext":{"serviceTrackingParams":[{"service":"GFEEDBACK","params":[{"key":"is_viewed_live","value":"False"},{"key":"logged_in","value":"0"},{"key":"e","value":"24111809,23884386,23804281,24034168,23858058,24077266,23990877,23946420,24001373,24108863,24106628,24111165,24094715,23882502,23966208,24088877,23998056,24004644,24105791,23918597,24007246,24005534,24082661,24101686,24002923,24082057,24106839,24590705,24085811,24058380,24100822,24590712,24109594,24094668,9406012,24002022,24056274,24108448,24007790,23983296,23934970,23944779,1714253,24080738,24077242,24028143,24091859,23857948,24101841,23968386,24049820,24105186,24036947,24107035,23986017,23744176,24106092,23735348,24002025"}]},{"service":"CSI","params":[{"key":"c","value":"WEB"},{"key":"cver","value":"2.20211005.01.00-canary_control"},{"key":"yt_li","value":"0"},{"key":"GetPlayer_rid","value":"0x223ff2f06fb47064"}]},{"service":"GUIDED_HELP","params"

In [16]:
import re
import json

data = []
for ix, s in enumerate(soup.find_all('script')):
  # Greedy match for anything that looks like a JSON object or array
  parsed = re.findall(r"{.+[:,].+}|\[.+[,:].+\]", str(s.string))
  try:
    if parsed:
      _d = json.loads(parsed[0])
      print(ix, _d)
      data.append(_d)
  except json.JSONDecodeError:
    # Not valid JSON (for example plain JavaScript); skip this script tag
    pass


17 {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'http://www.youtube.com/user/PyDataTV', 'name': 'PyData'}}]}
18 {'responseContext': {'serviceTrackingParams': [{'service': 'GFEEDBACK', 'params': [{'key': 'is_viewed_live', 'value': 'False'}, {'key': 'logged_in', 'value': '0'}, {'key': 'e', 'value': '24111809,23884386,23804281,24034168,23858058,24077266,23990877,23946420,24001373,24108863,24106628,24111165,24094715,23882502,23966208,24088877,23998056,24004644,24105791,23918597,24007246,24005534,24082661,24101686,24002923,24082057,24106839,24590705,24085811,24058380,24100822,24590712,24109594,24094668,9406012,24002022,24056274,24108448,24007790,23983296,23934970,23944779,1714253,24080738,24077242,24028143,24091859,23857948,24101841,23968386,24049820,24105186,24036947,24107035,23986017,23744176,24106092,23735348,24002025'}]}, {'service': 'CSI', 'params': [{'key': 'c', 'value': 'WEB'}, {'key': 'cver', 'v
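
The greedy regex above succeeds only when the matched text is a single valid JSON value; if a script contains more JavaScript after the object on the same line, the match runs to the last `}` and `json.loads` rejects it. One possible alternative, sketched below, locates the variable assignment and lets `json.JSONDecoder.raw_decode` stop where the object ends. The `extract_js_object` helper and the sample script text are assumptions for illustration; the field values are taken from the `videoDetails` output in this notebook.

```python
import json
import re


def extract_js_object(script_text, variable_name):
    """Return the JSON value assigned to `var <variable_name> = ...`, or None."""
    match = re.search(rf"var\s+{re.escape(variable_name)}\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows it.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Made-up script body: a JSON object followed by more JavaScript on the same line.
sample_script = (
    'var ytInitialPlayerResponse = {"videoDetails": '
    '{"author": "PyData", "lengthSeconds": "1269"}};'
    'var meta = {"a": 1};'
)

player = extract_js_object(sample_script, "ytInitialPlayerResponse")
print(player["videoDetails"])
```

With the real page, `script_text` would be `str(s.string)` for each script tag returned by `soup.find_all('script')`.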

In [17]:
data[1].keys()

dict_keys(['responseContext', 'playabilityStatus', 'streamingData', 'playbackTracking', 'captions', 'videoDetails', 'annotations', 'playerConfig', 'storyboards', 'microformat', 'trackingParams', 'attestation', 'videoQualityPromoSupportedRenderers', 'messages', 'frameworkUpdates'])

In [18]:
data[1]["videoDetails"]

{'allowRatings': True,
 'author': 'PyData',
 'averageRating': 4.7454543,
 'channelId': 'UCOjD18EJYcsBog4IozkF_7w',
 'isCrawlable': True,
 'isLiveContent': False,
 'isOwnerViewing': False,
 'isPrivate': False,
 'isUnpluggedCorpus': False,
 'lengthSeconds': '1269',
 'shortDescription': 'PyData Tel Aviv Meetup #17 \n7 November 2018\nSponsored and hosted by SimilarWeb\nhttps://www.meetup.com/PyData-Tel-Aviv/\n\nwww.pydata.org\n\nPyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. \n\nPyData conferences aim to be acces