# Data Science applied to CEDAE data

CEDAE is the corporation that provides drinking water and wastewater services for the city of Rio de Janeiro.
It publishes plenty of data for the press and for the population, as required by a law imposed by the Ministry of Health of Brazil.

There is data for physical, chemical and biological parameters of the drinking water of all ETAs (Estações de Tratamento de Água, or _Water Treatment Stations_).
In particular, Guandu is the largest ETA in the world, providing drinking water for the municipalities of Nilópolis, Nova Iguaçu, Duque de Caxias, Belford Roxo, São João de Meriti, Itaguaí, Queimados and Rio de Janeiro.

The data is in PDF form. We would like to convert these PDFs into CSV so that we can visualize the data better in [Tableau](https://www.tableau.com/). First, we need to download the HTML page with the links to the PDFs. The page is available at the following URL, encoded in UTF-8.

In [None]:
import urllib.request

with urllib.request.urlopen('https://cedae.com.br/relatoriosguandu') as fp:
    HTML_page = fp.read().decode('utf8')

To scrape links to PDFs relating to geosmin, we inherit from the `HTMLParser` class in the `html.parser` module. This child class collects the links whose text satisfies a given predicate, such as containing a keyword.

In [None]:
import html.parser

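class URLScrapper(html.parser.HTMLParser):
    """Collect the href of every <a> tag whose text satisfies `predicate`."""

    def __init__(self, predicate):
        super().__init__()
        self.URL = None    # href(s) of the <a> tag currently open, if any
        self.URLs = []     # hrefs collected so far
        self.pred = predicate

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Percent-encode spaces so the link can be passed to urlopen
            self.URL = [value.replace(' ', '%20')
                        for name, value in attrs
                        if name == 'href']

    def handle_data(self, data):
        # Text found while an <a> tag is open is tested against the predicate
        if self.URL and self.pred(data):
            self.URLs += self.URL

    def handle_endtag(self, tag):
        if tag == 'a':
            self.URL = None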

We then feed the parser with the HTML page contents and get the scraped URLs. We are first interested in data relating to geosmin.

In [None]:
scrapper = URLScrapper(lambda data: 'GEOSMINA' in data)
scrapper.feed(HTML_page)
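
To see how the three handlers cooperate, the sketch below runs the same idea on a small inline HTML string instead of the live page. The class name `LinkCollector`, the sample markup and the `example.com` URLs are assumptions made for illustration; they are not taken from the CEDAE site.

```python
import html.parser


class LinkCollector(html.parser.HTMLParser):
    """Hypothetical stand-in for URLScrapper, condensed for illustration."""

    def __init__(self, predicate):
        super().__init__()
        self.current = None   # hrefs of the <a> tag currently open, if any
        self.found = []       # hrefs whose link text satisfied the predicate
        self.pred = predicate

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current = [v.replace(' ', '%20') for k, v in attrs if k == 'href']

    def handle_data(self, data):
        if self.current and self.pred(data):
            self.found += self.current

    def handle_endtag(self, tag):
        if tag == 'a':
            self.current = None


# Made-up markup: one matching link, one non-matching link,
# and the keyword appearing outside of any link.
sample = (
    '<p>GEOSMINA outside a link is ignored</p>'
    '<a href="https://example.com/rel geosmina.pdf">RELATÓRIO GEOSMINA</a>'
    '<a href="https://example.com/turbidez.pdf">TURBIDEZ</a>'
)

collector = LinkCollector(lambda text: 'GEOSMINA' in text)
collector.feed(sample)
print(collector.found)
```

Only the first link is collected: the keyword inside the `<p>` element is skipped because no `<a>` tag is open at that point, and the space in the `href` is percent-encoded on the way in.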

We create a directory for storing the PDFs (and really don't care if it already exists).

In [None]:
import os

# exist_ok=True makes the call a no-op when the directory already exists
os.makedirs('data', exist_ok=True)

Finally, we download the PDFs into the newly created directory.

In [None]:
import urllib.parse

for scrapped_URL in scrapper.URLs:
    with urllib.request.urlopen(scrapped_URL) as infp:
        # Derive the local file name from the last component of the URL path
        parts = urllib.parse.urlparse(scrapped_URL)
        urlpath = urllib.parse.unquote(parts.path)
        filepath = os.path.join('data', os.path.basename(urlpath))
        # Write the response body to disk unchanged
        with open(filepath, 'wb') as outfp:
            outfp.write(infp.read())
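
The file name derivation in the loop above can be traced step by step without any network access. The following is a minimal sketch on a hypothetical URL (the host, path and query string are invented for the example), showing what each `urllib.parse` call contributes.

```python
import os
import urllib.parse

# Hypothetical URL, shaped like the percent-encoded links the parser collects.
url = 'https://example.com/reports/RELATORIO%20GEOSMINA.pdf?download=1'

parts = urllib.parse.urlparse(url)          # split into scheme, host, path, query, ...
urlpath = urllib.parse.unquote(parts.path)  # '%20' becomes a space again
filename = os.path.basename(urlpath)        # keep only the last path component
filepath = os.path.join('data', filename)   # place it inside the 'data' directory

print(filename)
```

Using `parts.path` rather than the whole URL keeps the query string out of the file name, and `unquote` restores the spaces that were percent-encoded when the link was collected.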