## Wrangling and analyzing data from the brazilian water reservoir system

## Table of Contents
- [1. Introduction](#intro)
- [2. Gathering Data](#gathering)

<a id='intro'></a>
### 1. Introduction

Brazil may be the owner of 20% of the world's water supply, as stated by [this World Bank's article](https://www.worldbank.org/en/news/feature/2016/07/27/how-brazil-managing-water-resources-new-report-scd). However, hand in hand with the abundance comes the wastage, access' inequality and tough management. Searching on the web, I found very difficult to have a clear view regarding the situation of water levels in our rivers and reservoirs. So I decided to build this Python webscrape program to create a clean dataset, not just for me but for other data scientists that could use the data. Also, I performed some analysis and data visualization to provide some summary results.  

The data source of the Brazilian Water Reservoirs can be found at [HIDROWEB](http://www.snirh.gov.br/hidroweb/serieshistoricas), the portal maintained by the National Water Agency (ANA).

<a id='gathering'></a>
### 2. Gathering Data

In [49]:
# Import all the necessary Python modules for this project
import pandas as pd
import numpy as np
import requests
import re
import glob
import json
import os
import time
from timeit import default_timer as timer
from bs4 import BeautifulSoup
from csv import writer
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### 2.1. Gathering the list of Water Reservoirs

This list of reservoirs will be used later to webscrape all the data series content from each reservoir.

In [2]:
# Save a randon URL (from one reservoir) inside a variable
URL = r"https://www.ana.gov.br/sar0/MedicaoSin?dropDownListEstados=&dropDownListReservatorios=19083&dataInicial=01%2F01%2F1900&dataFinal=01%2F06%2F2019&button=Buscar"

In [3]:
# Saving the requests response inside a variable
# Set the 'verify' pameter as 'False' as the website does not use security certificate verification (SSL) 
response = requests.get(URL, verify=False)



In [8]:
# Using Beautiful Soup to parse the HTML content of the URL
soup = BeautifulSoup(response.text, 'html.parser')

In [13]:
# Find the html tag that contains the list of the reservoirs and store it inside 'reservoir_tag'
# Then find inside the 'reservoir_tag' all the 'option' tags that store the reservoir's code and name
reservoir_tag = soup.find('select', class_="form-control input-m-sm", id="dropDownListReservatorios")
options = reservoir_tag.find_all_next("option")

In [14]:
# Count and inspect the content of the option tags:
for i,option in enumerate(options):
    print(i, option)

0 <option value="">Selecione</option>
1 <option selected="selected" value="19083">14 DE JULHO                                       </option>
2 <option value="19015">A. VERMELHA                                       </option>
3 <option value="19111">AIMORES                                           </option>
4 <option value="19140">ANTA                                              </option>
5 <option value="19040">B. BONITA                                         </option>
6 <option value="19029">B.COQUEIROS                                       </option>
7 <option value="19110">BAGUARI                                           </option>
8 <option value="19141">BALBINA                                           </option>
9 <option value="19041">BARIRI                                            </option>
10 <option value="19067">BARRA GRANDE                                      </option>
11 <option value="19143">BATALHA                                           </option>
12 <option value

In [15]:
# Delete the first option tag on the list as it is not a value option tag
del options[0]

In [18]:
# Create the RegEx to extract the 5 digits of the reservoir code, store inside a variable to use on the loop below
# Create the RegEx to extract the reservoir names inside the >< signs, store inside a variable to use on the loop below
re_number = r"[0-9]{5}"
re_name = r">(.+)<"

# Create an empty dictionary to store the result of the loop inside it
reservoirs = dict()

# Inside the loop:
# Get the first string of 're_number' RegEx result and store it as key inside the 'reservoirs' dictionary
# Get the first string of 're_name' RegEx result and store it as value inside the 'reservoirs' dictionary
for option in options:
    number = re.findall(re_number, str(option))[0]    
    name = re.findall(re_name, str(option))[0].strip()    
    reservoirs[number] = name

In [19]:
# Checking the reservoirs dictionary created
reservoirs

{'19083': '14 DE JULHO',
 '19015': 'A. VERMELHA',
 '19111': 'AIMORES',
 '19140': 'ANTA',
 '19040': 'B. BONITA',
 '19029': 'B.COQUEIROS',
 '19110': 'BAGUARI',
 '19141': 'BALBINA',
 '19041': 'BARIRI',
 '19067': 'BARRA GRANDE',
 '19143': 'BATALHA',
 '19152': 'BELO MONTE',
 '19036': 'BILLINGS',
 '19127': 'BOA ESPERANÇA',
 '19026': 'C. DOURADA',
 '19020': 'C.BRANCO-1',
 '19021': 'C.BRANCO-2',
 '19153': 'CACHOEIRA CALDEIRAO',
 '19011': 'CACONDE',
 '19028': 'CACU',
 '19001': 'CAMARGOS',
 '19068': 'CAMPOS NOVOS',
 '19129': 'CANA BRAVA',
 '19109': 'CANDONGA',
 '19054': 'CANOAS I',
 '19053': 'CANOAS II',
 '19055': 'CAPIVARA',
 '19081': 'CASTRO ALVES',
 '19050': 'CHAVANTES',
 '19144': 'COARACY NUNES',
 '19154': 'COLIDER',
 '19024': 'CORUMBA',
 '19023': 'CORUMBA-3',
 '19022': 'CORUMBA-4',
 '19139': 'CURUA-UNA',
 '19080': 'D. FRANCISCA',
 '19145': 'DARDANELOS',
 '19012': 'E. DA CUNHA',
 '19039': 'EDGARD SOUZA',
 '19017': 'EMBORCAÇÃO',
 '19076': 'ERNESTINA',
 '19033': 'ESPORA',
 '19133': 'ESTREITO',

In [20]:
# Creating a directory to save all the html files that are going to be created on the next cell

# Get the working directory
pwd = os.getcwd()

# Create the path for the new directory 'reservoirs_html'
dir_res = os.path.join(pwd, "reservoirs_html")
print(dir_res)

# If the new directory is not created, then create it
if not os.path.isdir(dir_res):
    os.mkdir(dir_res)

C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html


In [21]:
# Creating the html files for each reservoir monitoring the time
# For this project, we will grab the data series from January 1st, 1900 to January 1st, 2019

%time
# Create two separated urls to insert the number of the reservoir on the middle, on the loop below 
URL1 = "https://www.ana.gov.br/sar0/MedicaoSin?dropDownListEstados=&dropDownListReservatorios="
URL2 = "&dataInicial=01%2F01%2F1900&dataFinal=01%2F01%2F2019&button=Buscar"

# This loop will use the iterator i and reservoir number 'num_res' inside 'reservoirs' dictionary 
# to create a url 'URLRES' for each reservoir, as explained below:
# Create a new directory path 'file_name' for this new html file
# Print the name of the file path on the output 
# If the file_name does not exist, then create it

for i,num_res in enumerate(reservoirs.keys()):    
    URLRES = URL1 + num_res + URL2
    file_name = os.path.join(dir_res, num_res + ".html")
    print(file_name, end= '> ')
    if not (os.path.isfile(file_name)): # Verifica se o arquivo existir não baixa tudo novamente
        response = requests.get(URLRES, verify=False)    
        with open(file_name, mode='wb') as file:
            file.write(response.content)

Wall time: 0 ns
C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19083.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19015.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19111.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19140.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19040.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19029.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19110.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19141.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19041.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19067.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19143.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19152.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19036.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19127.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19026.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19020.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19021.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19153.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19011.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19028.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19001.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19068.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19129.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19109.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19054.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19053.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19055.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19081.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19050.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19144.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19154.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19024.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19023.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19022.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19139.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19080.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19145.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19012.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19039.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19017.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19076.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19033.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19133.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19146.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19102.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19073.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19030.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19062.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19093.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19003.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19004.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19059.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19084.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19147.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19136.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19037.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19105.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19035.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19034.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19042.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19008.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19097.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19115.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19070.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19058.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19116.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19079.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19087.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19155.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19025.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19002.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19078.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19007.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19092.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19089.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19148.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19063.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19046.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19048.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19006.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19132.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19099.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19013.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19122.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19005.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19069.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19086.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19014.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19112.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19142.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19019.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19072.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19082.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19123.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19044.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19103.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19018.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19051.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19124.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19125.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19010.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19090.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19071.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19077.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19149.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19117.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19131.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19104.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19095.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19156.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19049.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19088.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19038.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19108.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19047.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19043.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19074.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19120.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19118.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19138.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19113.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19057.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19016.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19032.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19106.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19031.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19066.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19107.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19052.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19065.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19085.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19064.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19137.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19091.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19094.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19061.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19100.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19135.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19161.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19157.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19075.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19158.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19130.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19027.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19060.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19128.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19150.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19121.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19096.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19114.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19151.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19056.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19160.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19098.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19045.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19119.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19134.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19101.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19009.html> 



C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19126.html> 



### 2.1. Gathering the data series for each Water Reservoir

This will be the historical data series webscraped for each Water Reservoir. The data series is measured daily for each reservoir. Remember that we set a fixed starting date and finishing for each reservoir's data series (from Jan 1st, 1900 to Jan 1st, 2019) 

In [56]:
# Open one random html file created previously (reservoir 19001 on this example) and parse its content using Beautiful Soup 
file_name = os.path.join(dir_res, "19001"+".html")
                         
with open(file_name, encoding='utf8') as file:
    soup = BeautifulSoup(file, "html")

In [57]:
# Inspecting the data series inside the tbody tag
soup.find_all('tbody', class_="list")[0]

<tbody class="list">
<tr>
<td class="text-center coluna_1">19001</td>
<td class="text-center coluna_2">CAMARGOS                                          </td>
<td class="text-center coluna_3"></td>
<td class="text-center coluna_4"></td>
<td class="text-center coluna_5"></td>
<td class="text-center coluna_6"></td>
<td class="text-center coluna_7"></td>
<td class="text-center coluna_8">98,31</td>
<td class="text-center coluna_9"></td>
<td class="text-center coluna_11"></td>
<td class="text-center coluna_13">01/06/1932</td>
</tr>
<tr>
<td class="text-center coluna_1">19001</td>
<td class="text-center coluna_2">CAMARGOS                                          </td>
<td class="text-center coluna_3"></td>
<td class="text-center coluna_4"></td>
<td class="text-center coluna_5"></td>
<td class="text-center coluna_6"></td>
<td class="text-center coluna_7"></td>
<td class="text-center coluna_8">99,67</td>
<td class="text-center coluna_9"></td>
<td class="text-center coluna_11"></td>
<td class="

In [58]:
# Storing the 'tbody' tag content inside soup to test the data series webscrape with a small set of data
soup = soup.find('tbody', class_="list")

In [59]:
# Create a function to extract the data from each html_line (inside each tag)
# The function has a RegEx to only grab the content inside the >< tags and store it inside 'x'
# If the tag is empty, it will return blank, if it is not blank, then get the first value inside 'x' and remove the blanks
def get_row_value(html_line):
    string = str(html_line)
    x = re.findall(r'>(.+)<', string)
    x = x[0].split()[0] if x else ""
    return x

# This loop will iterate with each day's measure tag created previously if it has more than one value (other than the date),
# get the tags inside each tag and store it inside 'day_measure' list
# Then filter the data collected using a lambda expression if the string 'x' is not a new line (\n)
# Then map the tags list 'filtered' into the get_row_value function's parameter 'html_line'

for i, day_measure in enumerate(soup):    
    if len(day_measure) > 1:
        day_measure = list(day_measure)
        filtered = filter(lambda x: True if x!='\n' else False, day_measure)
        values = map(get_row_value, filtered)

In [60]:
# Checking if the code worked (compare it with the last <tr> </tr> tags inside the 'tbody' tags)
print(list(values))

['19001', 'CAMARGOS', '905,98', '112,65', '77,00', '0,00', '77,00', '114,54', '36,77', '112,95', '02/01/2019']


In [61]:
# Now that the function works, let's build it to run into every html file and create a consolidated data series csv file

# First, create a list with the translated columns labels
column_labels = ['codigo_reservatorio',
                 'nome_reservatorio',
                 'cota',
                 'afluencia',
                 'defluencia',
                 'vazao_vertida',
                 'vazao_turbinada',
                 'vazao_natural',
                 'volume_util',
                 'vazao_incremental',
                 'data']

In [62]:
# Create a csv file to store the results of all loops, monitoring the time
start_time = time.time()
csv_file = open(os.path.join(dir_res, "reservatorios.csv"), mode='w')
results = writer(csv_file, delimiter=',')

# Write the column lables on the file
results.writerow(column_labels)

# This loop will iterate over each html file inside the directory,
# print the file being scraped (just so we can monitor the process),
# read the html file and grab the 'tbody' tags using Beautiful Soup,
# get the tags inside each tag and store it inside 'day_measure' list,
# then filter the data collected using a lambda expression if the string 'x' is not a new line (\n),
# then map the tags list 'filtered' into the get_row_value function's parameter 'html_line',
# then store the results on the csv file

for i, file in enumerate(glob.glob(dir_res + "/*.html")):
    print(i, file)
    with open(file, encoding='utf8') as file:
        soup = BeautifulSoup(file, "html")
        soup = soup.find('tbody', class_="list")
        for i, day_measure in enumerate(soup):    
            if len(day_measure) > 1:
                day_measure = list(day_measure)
                filtered = filter(lambda x: True if x!='\n' else False, day_measure)
                values = map(get_row_value, filtered)
                results.writerow(values)

# Close the csv file
csv_file.close()   
# Print the time to run the program
print("--- %s seconds ---" % (time.time() - start_time))

0 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19001.html
1 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19002.html
2 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19003.html
3 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19004.html
4 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19005.html
5 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19006.html
6 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19007.html
7 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\res

64 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19065.html
65 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19066.html
66 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19067.html
67 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19068.html
68 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19069.html
69 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19070.html
70 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19071.html
71 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Find

128 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19129.html
129 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19130.html
130 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19131.html
131 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19132.html
132 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19133.html
133 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19134.html
134 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate Data Findings\reservoirs_html\19135.html
135 C:\Users\libor\DataScience\Data Analyst Nanodegree - Udacity\Project 5 - Communicate D