# Web Aggregator, Step by Step

## Goal

Extract the URLs and download formats of the water-related datasets from the City of Montréal's open data portal.

## Approach

1. understand the data before coding
    * HTML, CSS
2. find a Python _library_ for (HTML) _scraping_
    * scrapy: https://pypi.org/project/Scrapy/
    * beautifulsoup: https://pypi.org/project/beautifulsoup4/
3. read the docs
    * https://www.crummy.com/software/BeautifulSoup/bs4/doc/
4. install
    * Colab: `!pip install beautifulsoup4`
    * local machine: `pip install beautifulsoup4`
5. explore
    * `type()`
    * introspection: `obj.` + Tab
        * methods: `obj.method()`
        * attributes: `obj.attribute`
6. keep the final code (the code that works)
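
The "explore" step can also be done in plain code when Tab completion is not available. The following is a minimal sketch using only built-ins; the string `obj` is just a stand-in for whatever object is being explored.

```python
# A plain string stands in for any unfamiliar object
obj = "Montréal"

# type() tells which class the object is an instance of
print(type(obj))

# dir() lists every name reachable through "obj."; skip the underscore names
public_names = [name for name in dir(obj) if not name.startswith("_")]

# callable() separates methods from plain attributes
methods = [name for name in public_names if callable(getattr(obj, name))]
attributes = [name for name in public_names if not callable(getattr(obj, name))]

print(methods[:5])
print(attributes)
```

The same three calls (`type`, `dir`, `callable`) work on a `requests` response or a BeautifulSoup object.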

## Install

In [None]:
!pip install beautifulsoup4 requests

## Imports

The import name is not always the same as the _package_ name (here, the `beautifulsoup4` package is imported as `bs4`)... :(

In [None]:
import requests
import bs4

## Explore

In [None]:
# "Eau" (water) tag, restricted to the ville-de-montreal organization
# constant
EAU_URL = "https://donnees.montreal.ca/search?q=tags:Eau%20organization:ville-de-montreal&from=0"

# request, response
response = requests.get(EAU_URL)

In [None]:
# explore the response
response

In [None]:
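# instance of the main class
# type + introspection (namespace, attributes, methods)
# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = bs4.BeautifulSoup(response.text, "html.parser")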

In [None]:
# explore the soup
soup

In [None]:
# DOM
# element (tag): elem
# content (text): elem.text
# attribute: elem['attr'] or, better, elem.get('attr')
# tree
html = soup.find('html')

In [None]:
# explore the element
html
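
To try the DOM accessors without hitting the network, here is a small sketch on an inline HTML string. The markup and the name `demo_soup` are made up for illustration and are not the real page; the point is the difference between `elem['attr']` and `elem.get('attr')` when the attribute is missing.

```python
import bs4

# Hypothetical snippet, not the real page markup
HTML = """
<html><body>
  <h3 class="text-lg"><a href="/dataset/example">Example dataset</a></h3>
  <h3 class="text-lg"><a>No href on this one</a></h3>
</body></html>
"""

demo_soup = bs4.BeautifulSoup(HTML, "html.parser")
first, second = demo_soup.select("h3.text-lg a")

print(first.name)          # the tag name
print(first.text)          # the text content
print(first["href"])       # attribute access by key
print(second.get("href"))  # .get() returns None when the attribute is missing

# Key access on a missing attribute raises KeyError instead
try:
    second["href"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```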

In [None]:
# scraping
h3_links = soup.select("h3.text-lg a")

In [None]:
# explore the list of scraped elements
h3_links

## Exercise: list of datasets

In [None]:
# relative links... watch the slashes (do they need stripping?)
base_url = 'https://donnees.montreal.ca'
# for loop
for a in h3_links:
    # string formatting: f-string
    print(f"{a.text}: {base_url}{a.get('href')}")
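
One way to sidestep the slash question is `urljoin` from the standard library, which resolves a relative link against a base URL. This is a sketch of an alternative to plain string concatenation; the `hrefs` values are hypothetical examples, not links taken from the page.

```python
from urllib.parse import urljoin

base_url = "https://donnees.montreal.ca"

# Hypothetical hrefs: with a leading slash, without one, and already absolute
hrefs = ["/dataset/example", "dataset/example", "https://example.org/other"]

# urljoin inserts or drops slashes as needed and leaves absolute URLs alone
joined = [urljoin(base_url, href) for href in hrefs]
for url in joined:
    print(url)
```

The first two hrefs resolve to the same URL, and the third is returned unchanged.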

## Exercise: available file formats

In [None]:
# for each result, print its title, its formats, and its URL
results = soup.select("ul.pt-gutter li.mt-2")
for result in results:
    links = result.select("h3.text-lg a")
    # one-line if/else (conditional expression)
    link = links[0] if len(links) == 1 else None
    if link is None:
        # skip results that do not have exactly one title link
        continue
    format_links = result.select("ul.mt-2 li a")
    # list comprehension
    format_texts = [fmt.text for fmt in format_links]
    # join
    formats_str = ','.join(format_texts)
    print(f"{link.text} ({formats_str}): {base_url}{link.get('href')}")
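
The per-result extraction can be checked offline by wrapping it in a function and feeding it a small HTML string. In this sketch the markup only mimics the selectors used above, and `summarize` and `demo_soup` are hypothetical names introduced for the example.

```python
import bs4

# Hypothetical markup that only mimics the selectors used above
HTML = """
<ul class="pt-gutter">
  <li class="mt-2">
    <h3 class="text-lg"><a href="/dataset/a">Dataset A</a></h3>
    <ul class="mt-2"><li><a>CSV</a></li><li><a>GeoJSON</a></li></ul>
  </li>
  <li class="mt-2">
    <p>Result without a title link</p>
  </li>
</ul>
"""


def summarize(result):
    """Return (title, href, formats), or None without exactly one title link."""
    links = result.select("h3.text-lg a")
    if len(links) != 1:
        return None
    link = links[0]
    formats = [fmt.text.strip() for fmt in result.select("ul.mt-2 li a")]
    return link.text.strip(), link.get("href"), formats


demo_soup = bs4.BeautifulSoup(HTML, "html.parser")
summaries = [summarize(r) for r in demo_soup.select("ul.pt-gutter li.mt-2")]
print(summaries)
```

Returning `None` for an unexpected result keeps the decision of what to do with it (skip, log, fail) in the calling code.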

## Exercise: pagination

In [None]:
# the web page says there are 24 results
len(results)
# but we only have 10... because of pagination


In [None]:
# get all pages
results_per_page = 10
results_from = 0    # not named "from": that is a reserved keyword in Python
has_results = True
pages = []

# while loop
while has_results:
    # get page with a specific "from" parameter
    url = f"https://donnees.montreal.ca/search?q=tags:Eau%20organization:ville-de-montreal&from={results_from}"
    print(f"Calling: {url}")  # printing each URL helps troubleshoot an infinite loop
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, "html.parser")

    # test whether there are results
    h3 = soup.select("h3.text-lg")
    if len(h3) > 0:
        # ok this page has results
        # add to list of pages (append! because ordered)
        pages.append(soup)
        # increment from for next page
        results_from += results_per_page
        print(f"next = {results_from}")
    else:
        # no more results
        # exit condition
        has_results = False
        print("Stop it...")
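
The same "keep going until a page comes back empty" logic can be sketched without the network, with a hard cap on the number of pages as a guard against an infinite loop. Everything here is illustrative: `fetch_all`, `fake_fetch_page`, `max_pages`, and the fake data are assumptions, with the fake function standing in for the real request.

```python
def fetch_all(fetch_page, page_size=10, max_pages=100):
    """Collect pages until one comes back empty or max_pages is reached."""
    collected = []
    offset = 0
    for _ in range(max_pages):
        items = fetch_page(offset)
        if not items:
            # empty page: exit condition
            break
        collected.append(items)
        offset += page_size
    return collected


# Stand-in for the real request: 24 fake results served 10 at a time
FAKE_RESULTS = [f"dataset-{i}" for i in range(24)]
called_offsets = []


def fake_fetch_page(offset):
    called_offsets.append(offset)
    return FAKE_RESULTS[offset:offset + 10]


fake_pages = fetch_all(fake_fetch_page)
page_sizes = [len(page) for page in fake_pages]
print(page_sizes)
print(called_offsets)
```

With 24 results and pages of 10, the sketch makes four calls: three pages with content, then one empty page that ends the loop.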

In [None]:
# get all results
all_results = []
for soup in pages:
    results = soup.select("ul.pt-gutter li.mt-2")
    all_results.extend(results)
len(all_results)
# 24, there we go

In [None]:
# format results (same code as before, applied to every page)
for result in all_results:
    links = result.select("h3.text-lg a")
    # one-line if/else (conditional expression)
    link = links[0] if len(links) == 1 else None
    if link is None:
        # skip results that do not have exactly one title link
        continue
    format_links = result.select("ul.mt-2 li a")
    format_texts = [fmt.text for fmt in format_links]
    formats_str = ','.join(format_texts)
    print(f"{link.text} ({formats_str}): {base_url}{link.get('href')}")

## Conclusion

* no magic: scraping is very sensitive to the structure of the page(s)...
* you have to know that structure
* it can change
* that is why we prefer APIs...


# License

Copyright 2021 Montréal-Python

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
