# Extract historical data from Guandu

Sometimes Guandu data is included in reports that cover broader geographical areas, such as the city of Rio de Janeiro, so those reports need special treatment. As the 2005 Rio de Janeiro report states:

_"A partir de Maio de 2005 para melhor verificação da qualidade da água fornecida ao município as análises de qualidade de água passaram a ter controles estatísticos por sistema de abastecimento."_

I would translate it into English as:

_"From May 2005 onward, in order to better verify the quality of the water supplied to the municipality, water quality analyses began to have statistical controls per supply system."_

For that reason, we can't use data from the 2004 reports: they were collected per municipality, not per water supply system.

First, let's get the path of every Guandu report.

In [1]:
import os
import glob
import tabula
import datetime
import pandas as pd

In [19]:
reports = {
    2005: {
        'area': (59.169689641273685, 41.67616875712657, 67.11003627569528, 63.740022805017105),
        'columns': ('colour', 'turbity', 'pH', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
        'page': 2,
        'shiftlast': True,
    },
    # 2006 is missing
    2007: {
        'area': (72.74800456100343, 40.92741935483871, 81.89851767388826, 64.63709677419355),
        'page': 1,
        'columns': ('colour', 'turbity', 'pH', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
    },
    2008: {
        'area': (76.15771368086796, 41.548231025098275, 86.74252447737497, 63.501663138796495),
        'page': 1,
        'columns': ('colour', 'turbity', 'pH', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
    },
    2009: {
        'area': (8.135593220338983, 1.6617790811339197, 56.440677966101696, 31.867057673509287),
        'page': 2,
        'shiftlast': True,
        'columns': ('month', 'colour_turbity_tests', 'chlorine_bacteriology_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
    },
    2010: {
        'area': (8.135593220338983, 1.6617790811339197, 56.440677966101696, 31.867057673509287),
        'page': 2,
        'shiftlast': True,
        'columns': ('month', 'colour_turbity_tests', 'chlorine_bacteriology_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
    },
    2011: {
        'area': (7.525655644241732, 2.4188671638782506, 34.83466362599772, 32.37250554323725),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'thermo_coliforms'),
    },
    2012: {
        'area': (7.525655644241732, 2.297923805684338, 34.7206385404789, 32.412819995968555),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'e_coli'),
    },
    2013: {
        'area': (7.525655644241732, 2.297923805684338, 34.7206385404789, 32.412819995968555),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'e_coli'),
    },
    2014: {
        'area': (7.525655644241732, 2.3382382584156423, 34.94868871151653, 32.493448901431165),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'e_coli'),
    },
    2015: {
        'area': (4.275940706955531, 2.479338842975207, 40.87799315849487, 31.62668816770812),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'total_coliforms_rec', 'e_coli', 'e_coli_rec'),
    },
    # 2016 has no table
    2017: {
        'area': (8.022284122562674, 1.5513897866839044, 54.038997214484674, 33.80736910148675),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'total_coliforms_rec', 'e_coli', 'e_coli_rec'),
    },
    2018: {
        'area': (8.0, 1.3680781758957654, 54.0, 33.811074918566774),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'total_coliforms_rec', 'e_coli', 'e_coli_rec'),
    },
    # 2019 has no table
    2020: {
        'area': (25.725806451612904, 4.104903078677309, 49.71774193548387, 32.782212086659065),
        'page': 2,
        'columns': ('month', 'chlorine_bacteriology_turbity_tests', 'colour_tests', 'turbity', 'colour', 'free_chlorine', 'total_coliforms', 'total_coliforms_rec', 'e_coli', 'e_coli_rec'),
        'shiftlast': True,
    }
}

In [23]:
for year in os.listdir('input'):
    try:
        year_int = int(year)
    except ValueError:
        continue
    if year_int not in reports:
        continue
    # Newer reports have Guandu data exclusively
    path = os.path.join('input', year, 'guandu.pdf')
    if os.path.isfile(path):
        reports[year_int]['path'] = path
        continue
    # Older reports include Guandu in Rio de Janeiro reports
    altpath = os.path.join('input', year, 'rio_de_janeiro.pdf')
    if os.path.isfile(altpath):
        reports[year_int]['path'] = altpath
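
The lookup order used above (a dedicated Guandu file first, the Rio de Janeiro report as a fallback) can also be sketched as a small helper. This is only an illustration: `find_report` and its `candidates` parameter are hypothetical names that do not appear in the notebook, and the demonstration runs on a throwaway directory of empty placeholder files rather than on the real `input` folder.

```python
import os
import tempfile

def find_report(root, year, candidates=("guandu.pdf", "rio_de_janeiro.pdf")):
    """Return the first existing candidate under root/year, or None."""
    for name in candidates:
        path = os.path.join(root, str(year), name)
        if os.path.isfile(path):
            return path
    return None

# Build a temporary tree with empty placeholder files (not real reports).
with tempfile.TemporaryDirectory() as root:
    layout = [("2005", "rio_de_janeiro.pdf"),
              ("2017", "guandu.pdf"),
              ("2017", "rio_de_janeiro.pdf")]
    for year, name in layout:
        os.makedirs(os.path.join(root, year), exist_ok=True)
        open(os.path.join(root, year, name), "w").close()
    found = {year: find_report(root, year) for year in (2005, 2017, 2019)}
    # Keep only the file names, since the temporary root differs on every run.
    names = {year: (os.path.basename(p) if p else None)
             for year, p in found.items()}

print(names)
```

A year with no matching file maps to `None`, which makes missing reports easy to spot before any parsing starts.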

Let's take a look at the years we're able to obtain.

In [24]:
# Only list the years for which a PDF was actually found
print(*sorted(year for year, data in reports.items() if 'path' in data), sep=', ')

2005, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2017, 2018, 2020


Now let's extract the tables from these PDFs. We need to know the page, and the area within that page, where each table resides; that is what the `page` and `area` entries in `reports` are for. This part is very manual, because the layout of the reports has changed a lot over time, so we can't fully rely on the auto-detect feature of `tabula-py`.
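
As a rough illustration of how such an `area` tuple could be derived, the sketch below converts a box measured in PDF points into percentages of the page. It assumes the `(top, left, bottom, right)` ordering that `tabula-py` documents for `area`, and that the values are read as percentages because `relative_area=True` is passed. The helper `to_relative_area` and the A4 page size are assumptions made for this example; they are not taken from the reports.

```python
def to_relative_area(top, left, bottom, right, page_width, page_height):
    """Convert an absolute box in points to (top, left, bottom, right) percentages."""
    return (
        100.0 * top / page_height,
        100.0 * left / page_width,
        100.0 * bottom / page_height,
        100.0 * right / page_width,
    )

# Hypothetical box on an A4 page (595 x 842 points); the numbers are made up.
area = to_relative_area(top=84.2, left=59.5, bottom=421.0, right=297.5,
                        page_width=595.0, page_height=842.0)
rounded = [round(value, 2) for value in area]
print(rounded)
```

Vertical coordinates are divided by the page height and horizontal ones by the page width, which is why two reports with the same layout but different page sizes could share one tuple.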

Also, a common issue with tables parsed by `tabula-py` is that the last row comes out shifted sideways by one column. The `shiftlast` flag marks the reports where this has to be undone.
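
The correction used in the extraction loop moves the values of the last row one column to the right. A minimal sketch on a toy frame shows the effect; the table contents are invented, and the frame is kept as `object` dtype on the assumption that parsed cells arrive as text.

```python
import pandas as pd

# Toy table with invented values: in the last row every value sits one
# column too far to the left, leaving the final cell empty.
df = pd.DataFrame(
    [["Nov", "1.2", "7.0"],
     ["1.1", "6.9", None]],
    columns=["month", "turbidity", "pH"],
    dtype=object,
)

# Same expression as in the extraction loop: shift the last row to the right
df.iloc[-1] = df.tail(1).shift(1, axis=1).iloc[0]
last_row = df.iloc[-1].tolist()
print(last_row)
```

After the shift, the first cell of the last row is empty (NaN) and the two values sit under `turbidity` and `pH`. The other rows are left untouched.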

We also rename the columns to common keywords, so that every report can be accessed through the same interface.

In [25]:
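for year, reportdata in reports.items():
    path = reportdata.get('path')
    area = reportdata.get('area')
    page = reportdata.get('page')
    if path and area and page:
        # `area` holds percentages of the page, hence relative_area=True
        dfs = tabula.read_pdf(path, pages=page, area=area,
                              relative_area=True, silent=True, lattice=True)
        assert dfs, year
        df = dfs[0]
        # Drop rows and columns that are entirely empty
        df = df.dropna(how='all', axis=0)
        df = df.dropna(how='all', axis=1)
        if reportdata.get('shiftlast', False):
            # Move the values of the last row one column to the right
            df.iloc[-1] = df.tail(1).shift(1, axis=1).iloc[0]
        # Keep at most the last 12 rows, then drop columns left empty by that
        df = df.tail(12)
        df = df.dropna(how='all', axis=1)
        # Rename the columns to the common keywords, when provided
        columns = reportdata.get('columns')
        if columns:
            df.columns = columns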
        reportdata['df'] = df
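
Because every frame now uses the same keywords, one quantity can be pulled out of all years with a single lookup. The sketch below illustrates the idea on two toy frames with invented values. `collect` is a hypothetical helper, not part of the notebook, and the real frames would come from `reportdata['df']`.

```python
import pandas as pd

# Toy stand-ins for two parsed reports with different column sets.
frames = {
    2008: pd.DataFrame({"colour": [2.5, 3.0], "free_chlorine": [1.1, 1.3]}),
    2012: pd.DataFrame({"month": ["Jan", "Feb"], "colour": [2.0, 2.2],
                        "free_chlorine": [1.4, 1.2], "e_coli": [0, 0]}),
}

def collect(frames, column):
    """Return {year: Series} for every report that has the given keyword."""
    return {year: df[column].reset_index(drop=True)
            for year, df in frames.items() if column in df.columns}

# One column per year, aligned by row position
chlorine = pd.DataFrame(collect(frames, "free_chlorine"))
e_coli_years = sorted(collect(frames, "e_coli"))
print(chlorine)
print(e_coli_years)
```

Keywords that only exist in some layouts, such as `e_coli`, simply yield fewer years instead of raising an error.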