# Data Science applied to CEDAE data

<img src="cedae-logo.png" align="right" width="200" style="margin: 20px"/>

CEDAE is the corporation that provides drinking water and wastewater services for the city of Rio de Janeiro.
It publishes plenty of data for the press and the population, as required by a law imposed by the Ministry of Health of Brazil.

The data covers physical, chemical and biological parameters of the drinking water of all ETAs (Estações de Tratamento de Água, or _Water Treatment Stations_).
In particular, Guandu is the largest ETA in the world, providing drinking water for the municipalities of Nilópolis, Nova Iguaçu, Duque de Caxias, Belford Roxo, São João de Meriti, Itaguaí, Queimados and Rio de Janeiro.

The data is published as PDF files. We would like to convert these PDFs into CSV so that we can visualize the data better in [Tableau](https://www.tableau.com/). First, we want to download the PDFs relating to geosmin (geosmina, in Portuguese). Links to them are available on the following public web page:

In [1]:
CEDAE_page_URL = 'https://cedae.com.br/relatoriosguandu'
CEDAE_page_encoding = 'utf8'
CEDAE_page_keywords = ['GEOSMINA']

We download the HTML page with the `urllib` module. `urllib.request.urlopen` returns a file-like response object; reading it yields the page as bytes, which we then decode from UTF-8 into a string.

In [2]:
import urllib.request

print('Requesting \'%s\'' % CEDAE_page_URL)
with urllib.request.urlopen(CEDAE_page_URL) as fp:
    CEDAE_data_bytes = fp.read()
    print('Decoding HTML to \'%s\' encoding' % CEDAE_page_encoding)
    CEDAE_data_Unicode = CEDAE_data_bytes.decode(CEDAE_page_encoding)

Requesting 'https://cedae.com.br/relatoriosguandu'
Decoding HTML to 'utf8' encoding
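
The encoding above is hard-coded. As a minimal sketch, assuming the server declares a charset in its `Content-Type` header, the decoding step could prefer that value and fall back to UTF-8 otherwise. The helper `decode_page` and the sample header are hypothetical and only illustrate the idea offline; they are not part of the original notebook.

```python
from email.message import Message

def decode_page(raw_bytes, declared_charset=None, fallback='utf8'):
    """Decode raw page bytes, preferring the declared charset (hypothetical helper)."""
    encoding = declared_charset or fallback
    # errors='replace' substitutes malformed bytes instead of raising an exception
    return raw_bytes.decode(encoding, errors='replace')

# Simulated response headers (assumption: a live response exposes the same
# method through `fp.headers`, which is an email.message.Message subclass).
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
declared = headers.get_content_charset()

sample_bytes = 'Nova Iguaçu'.encode('latin-1')
decoded = decode_page(sample_bytes, declared)
print(decoded)
```

When no charset is declared, `get_content_charset()` returns `None` and the helper falls back to UTF-8, which matches the behavior of the cell above.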


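To scrape the links to PDFs relating to geosmin, we subclass `HTMLParser` from the `html.parser` module. The parser remembers the `href` of the anchor it is currently inside and keeps that URL if the anchor text contains one of the keywords.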

In [3]:
import html.parser

class URLScrapper(html.parser.HTMLParser):
    def __init__(self, keywords):
        super().__init__()
        self.URL = None
        self.URLs = []
        self.keywords = keywords

    def get_URLs(self):
        return self.URLs

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.URL = value
                    break

    def handle_data(self, data):
        if self.URL is not None:
            for keyword in self.keywords:
                if keyword in data:
                    self.URLs.append(self.URL)
                    break

    def handle_endtag(self, tag):
        if tag == 'a':
            self.URL = None
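
The class above relies on the order in which `HTMLParser` invokes its handlers: start tag, then text, then end tag. The following sketch, using a hypothetical `EventLogger` class and a made-up HTML snippet, records those callbacks so the sequence can be inspected without any network access.

```python
import html.parser

class EventLogger(html.parser.HTMLParser):
    """Records parser callbacks in the order they fire (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(('data', data.strip()))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

# Made-up snippet: one link matching the keyword and one that does not
sample_html = ('<p><a href="report.pdf">RESULTADOS GEOSMINA</a> '
               '<a href="other.pdf">OUTROS</a></p>')

logger = EventLogger()
logger.feed(sample_html)
for event in logger.events:
    print(event)
```

Because the text of each link is delivered between its start and end tags, storing the `href` in `handle_starttag` and clearing it in `handle_endtag` is enough to associate link text with its URL.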

We then feed the HTML page contents to the parser and collect the scraped URLs.

In [8]:
scrapper = URLScrapper(CEDAE_page_keywords)
print('Scrapping URLs from HTML')
scrapper.feed(CEDAE_data_Unicode)
scrapped_URLs = scrapper.get_URLs()
print('%d URLs scrapped' % len(scrapped_URLs))

Scrapping URLs from HTML
2 URLs scrapped


We create a directory for storing the PDFs; it is fine if it already exists.

In [5]:
import os

# Create the output directory; do nothing if it is already there
os.makedirs('data', exist_ok=True)

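Now we iterate over the URLs, downloading each file and storing it in the newly created directory. Spaces in the URL are percent-encoded before the request, and the file name is taken from the last component of the URL path.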

In [9]:
import urllib.parse

for scrapped_URL in scrapped_URLs:
    scrapped_URL = scrapped_URL.replace(' ', '%20')
    print('Requesting \'%s\'' % scrapped_URL)
    with urllib.request.urlopen(scrapped_URL) as infp:
        parsed_URL = urllib.parse.urlparse(scrapped_URL)
        filename = os.path.basename(parsed_URL.path).replace('%20', ' ')
        filepath = os.path.join('data', filename)
        print('Writing to \'%s\'' % filepath)
        with open(filepath, 'wb') as outfp:
            outfp.write(infp.read())

Requesting 'https://storage.googleapis.com/site-cedae/Qualidade_da_Agua/RelatorioGuandu/2021/RESULTADOS%20GEOSMINA%20-%20MIB%20-%20%2020210504.pdf'
Writing to 'data/RESULTADOS GEOSMINA - MIB -  20210504.pdf'
Requesting 'https://storage.googleapis.com/site-cedae/Qualidade_da_Agua/RelatorioGuandu/RESULTADOS%20GEOSMINA%20-%20MIB%20-%20%202020.pdf'
Writing to 'data/RESULTADOS GEOSMINA - MIB -  2020.pdf'
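
The loop above handles only spaces, by replacing them with `%20` and back. As a hedged alternative sketch, `urllib.parse.quote` and `urllib.parse.unquote` cover other reserved characters as well. The helper names `quote_url_path` and `filename_from_url` are hypothetical, and `example.com` is a placeholder; the second URL is the one printed in the output above.

```python
import os
import urllib.parse

def quote_url_path(url):
    """Percent-encode unsafe characters in the path of a URL (hypothetical helper)."""
    parts = urllib.parse.urlsplit(url)
    # '%' is kept safe so that already-encoded sequences are not encoded twice
    quoted_path = urllib.parse.quote(parts.path, safe='/%')
    return urllib.parse.urlunsplit(parts._replace(path=quoted_path))

def filename_from_url(url):
    """Return the decoded last path component of a URL (hypothetical helper)."""
    path = urllib.parse.urlsplit(url).path
    return urllib.parse.unquote(os.path.basename(path))

raw_url = 'https://example.com/a b/c d.pdf'
print(quote_url_path(raw_url))

report_url = ('https://storage.googleapis.com/site-cedae/Qualidade_da_Agua/'
              'RelatorioGuandu/RESULTADOS%20GEOSMINA%20-%20MIB%20-%20%202020.pdf')
print(filename_from_url(report_url))
```

Keeping `%` in the `safe` set is a design choice that assumes the scraped links are either raw or already encoded, never a mix where a literal percent sign must itself be escaped.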
