# Mexican federal budget pre-processing pipeline

## Instructions

To run the notebook:

1. choose a unique `ITERATION_LABEL` for each pipeline run
2. specify and describe your input files (`INPUT_FILES`)
3. make sure your column mapping (`COLUMN_ALIASES`) is correct
4. run the whole notebook by clicking on __Kernel > Restart & Run All__

## Imports

In [1]:
from sys import stdout
from pandas import read_csv, concat, DataFrame, ExcelWriter, ExcelFile, Series
from numpy import nan, isnan
from os.path import join, isdir
from os import mkdir
from json import dumps, loads
from pprint import pprint

## Settings

Choose a unique iteration label for each pipeline run.

In [2]:
ITERATION_LABEL = 'iteration-13-all-new-data-2008-2016'

Put your input files inside the input folder (`INPUT_FOLDER`, set to `pipeline.in.v2` in the Configuration section) and describe them here.

In [3]:
INPUT_FILES = {
#     2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
#     2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
#     2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
#     2013: {'name': 'Cuenta_Publica_2013.csv', 'encoding': 'windows-1252'},
#     2014: {'name': 'Cuenta_Publica_2014.csv', 'encoding': 'windows-1252'},
#     2015: {'name': 'Cuenta_Publica_2015.csv', 'encoding': 'windows-1252'},
#     2016: {'name': '2016_2T_Gasto_OS.csv', 'encoding': 'windows-1252'} # cp850 for the original 2016 file
    2008: {'name': 'Cuenta_Publica_2008.csv', 'encoding': 'windows-1252'},
    2009: {'name': 'Cuenta_Publica_2009.csv', 'encoding': 'windows-1252'},
    2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
    2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
    2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
    2013: {'name': 'CP_2013.csv', 'encoding': 'windows-1252'},
    2014: {'name': 'CP_2014.csv', 'encoding': 'windows-1252'},
    2015: {'name': 'CP_2015.csv', 'encoding': 'windows-1252'},
    2016: {'name': 'PEF_AC01_2t_2016.csv', 'encoding': 'windows-1252'} 
}


If your input files don't all have the same column names, define your mapping here. 

In [4]:
with open('columns.aliases.json') as json:
    COLUMN_ALIASES = loads(json.read())
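
The lookup in `map_columns_to_aliases` further down treats `COLUMN_ALIASES` as a dictionary that maps each reference column name to a list of alternative spellings. The sketch below shows that shape with made-up entries. The alias strings and the helper `resolve` are illustrative assumptions, not the contents of the real JSON file.

```python
# Hypothetical aliases, for illustration only; the real mapping lives in the JSON file.
EXAMPLE_ALIASES = {
    'CICLO': ['Ciclo', 'Año'],
    'ID_RAMO': ['Ramo', 'ID Ramo'],
    'MONTO_APROBADO': ['Aprobado'],
    'DESC_UR': None,  # assumed: a reference name may have no aliases registered
}

def resolve(column, aliases):
    """Return the reference name for a raw column name, or None if unknown."""
    if column in aliases:
        return column  # already a reference name
    for reference, alternatives in aliases.items():
        if alternatives and column in alternatives:
            return reference
    return None

print(resolve('Ramo', EXAMPLE_ALIASES))   # an alias
print(resolve('CICLO', EXAMPLE_ALIASES))  # a reference name
print(resolve('Foo', EXAMPLE_ALIASES))    # not registered
```

A column that resolves to `None` here corresponds to the `NO ALIAS REGISTERED FOR` message printed by the pipeline.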

In [5]:
from json import dumps

with open('column.aliases.json', 'w+', encoding='utf-8') as json:
    json.write(dumps(COLUMN_ALIASES, indent=4, ensure_ascii=False))

The following hierarchical categories will have IDs prefixed with the parent categories:

In [6]:
HIERARCHIES = {
    'functional': [
        'GPO_FUNCIONAL', 
        'ID_FUNCION', 
        'ID_SUBFUNCION', 
        'ID_AI'
    ],
    'administrative': [
        'ID_RAMO', 
        'ID_UR'
    ],
    'activities': [
        'DESC_MODALIDAD', 
        'PP'
    ],
}

The following columns are unused and removed at the end of the pipeline:

In [7]:
REMOVE_OUTPUT_COLUMNS = [
    'Reasignacion',
    'Objeto del Gasto',
    'Descripción de Reasignacion',
    'Descripción de Objeto del Gasto'
]

In [8]:
REMOVE_INPUT_COLUMNS = {
    2016: [
        'Adefas',
        'Partida Específica',
        'Partida Genérica',
        'Descripción de Partida Genérica',
        'Descripcion de Partida Específica',
        'Ejercicio',
        'Devengado',
        'Ejercido',
    ]
}

That's it. Now just run the notebook from beginning to end.

## Configuration

In [9]:
BASENAME = 'mexican_federal_budget'
INPUT_FOLDER = 'pipeline.in.v2'
OUTPUT_FOLDER = 'pipeline.out'
ITERATION_FOLDER = join(OUTPUT_FOLDER, ITERATION_LABEL)
MERGED_FILE = join(ITERATION_FOLDER, BASENAME + '.merged.csv')
CATALOGS_FILE = 'objeto_del_gasto.catalog.xlsx'

In [10]:
if isdir(ITERATION_FOLDER):
    raise ValueError('Please enter a unique iteration label')
    
mkdir(ITERATION_FOLDER)

## Encoding inspection

Detect the encodings of the input files using the `cChardet` library. __Warning:__ detection is not always accurate and is meant only as an indication. In the end, encodings are taken from `INPUT_FILES`.

In [11]:
def detect_encodings():
    """Detect CSV file encoding with the cChardet library"""

    try:
        import cchardet as chardet
    except ImportError:
        cChardet = 'https://github.com/PyYoshi/cChardet'
        print('Encoding inspection skipped: install %s' % cChardet)
        return

    results = {}
    results_file = join(OUTPUT_FOLDER, ITERATION_LABEL, 'encodings.detected.json')
    
    for year, file in sorted(INPUT_FILES.items()):
        datafile = join(INPUT_FOLDER, file['name'])
        
        with open(datafile, 'rb') as f:
            text = f.read()
            
        result = chardet.detect(text)
        results.update({year: result})
        print(year, 'Inspected', file['name'], result)
    
    with open(results_file, 'w+') as json:
        json.write(dumps(results, indent=4))
        print('\nSaved encoding detection report to', results_file)
        
# detect_encodings()
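
If `cChardet` is not available, a cruder check is to try decoding the raw bytes with a few candidate encodings. The sketch below uses only the standard library. The function name `try_encodings` and the candidate list are assumptions made for illustration.

```python
def try_encodings(raw_bytes, candidates=('utf-8', 'windows-1252', 'cp850')):
    """Return the candidate encodings that decode raw_bytes without error."""
    usable = []
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the byte sequence
        usable.append(encoding)
    return usable

# An accented header encoded as windows-1252 is not valid UTF-8.
sample = 'Descripción de Reasignacion'.encode('windows-1252')
print(try_encodings(sample))
```

A successful decode does not prove the encoding is right: single-byte encodings accept almost any input, so the result only rules candidates out.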

## Load files

In [12]:
def read_columns(file, encoding):
    """Return clean CSV file headers"""
    
    with open(file, encoding=encoding) as csv:
        header = csv.readline()
        return header.replace('\n', '').split(',')
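
Splitting the header on commas works as long as no column name contains a comma. A minimal sketch of a more defensive variant, assuming headers may be quoted, uses the standard `csv` module. The name `parse_header` is hypothetical.

```python
import csv

def parse_header(line):
    """Split one CSV header line into column names, honouring quoted fields."""
    return next(csv.reader([line]))

print(parse_header('CICLO,"Descripción, larga",ID_RAMO\n'))
```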

In [13]:
def force_strings(columns):
    """Return string enforcement for each column of a CSV file"""
    
    for column in columns:
        yield column, str

In [14]:
def load_csv_files():
    """Load raw data (CSV) files"""
    
    batch = {}
    
    for year, file in sorted(INPUT_FILES.items()):
        filepath = join(INPUT_FOLDER, file['name'])
        column_names = read_columns(filepath, file['encoding'])
        column_types = dict(force_strings(column_names))
        
        batch[year] = read_csv(filepath, encoding=file['encoding'], dtype=column_types)
        print('Loaded', file['name'], 'with encoding', file['encoding'])
    
    print()
    stdout.flush()

    for year in sorted(INPUT_FILES.keys()):
        if year in REMOVE_INPUT_COLUMNS:
            for column in REMOVE_INPUT_COLUMNS[year]:
                try:
                    del batch[year][column]
                    print(year, 'deleted', column)
                except KeyError:
                    print(year, column, 'not found in', INPUT_FILES[year]['name'])

        stdout.flush()

    return batch

## Clean the data

In [15]:
def strip_cell_padding(batch):
    for year in sorted(batch.keys()):
        for column in list(batch[year].columns):
            # Strip the values first: once renamed, a padded column name no longer exists
            batch[year][column] = batch[year][column].apply(lambda x: x.strip() if isinstance(x, str) else x)
            batch[year].rename(columns={column: column.strip()}, inplace=True)
        print(year, 'stripped cell paddings')
        stdout.flush()

In [16]:
def delete_empty_columns(batch):
    for year in batch.keys():
        for column in batch[year].columns:
            if 'Unnamed:' in column:
                try:
                    del batch[year][column]
                    print(year, column, 'deleted')
                    stdout.flush()
                except KeyError:
                    pass  

In [17]:
def count_missing_values(batch):
    collector = {}
    table = []

    for column in get_union_of_columns(batch):
        row = {'Column': column}
        collector.update({column: []})
        
        for year in batch.keys():
            if column in batch[year].columns:
                is_empty = batch[year][column].isnull()
                empty_lines = batch[year].where(is_empty).dropna(how='all')
                collector[column].extend(empty_lines.to_dict(orient='records'))
                nb_empty_cells = len(empty_lines)
            else:
                nb_empty_cells = nan
                
            row.update({year: nb_empty_cells})
            if nb_empty_cells not in (nan, 0):
                print(year, 'found', nb_empty_cells, 'missing values in', column)

        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    # reindex(columns=...) replaces reindex_axis, which recent pandas releases no longer provide
    empty_values_overview_table = DataFrame(table).reindex(columns=ordered_columns)
    
    return empty_values_overview_table, collector

In [18]:
def count_duplicates(batch):
    for year, df in sorted(batch.items()):
        # duplicated() returns a boolean Series, so its sum is the number of duplicate lines
        nb_duplicate_lines = df.duplicated().sum()
        print(year, 'found', nb_duplicate_lines, 'duplicate lines')

## Alias column names

In [19]:
def get_union_of_columns(batch):
    union = set()
    for year in batch.keys():
        union = union | set(batch[year].columns)
    return union

In [20]:
from yaml import safe_load

def load_aliases(file):
    # safe_load parses plain YAML without constructing arbitrary Python objects
    with open(file) as yaml:
        aliases = safe_load(yaml.read())
        return aliases

In [21]:
def map_columns_to_aliases(batch, list_of_aliases):
    for year in sorted(batch.keys()):
        for column in sorted(batch[year].columns):
            if not column in list_of_aliases:
                for reference, aliases in list_of_aliases.items():
                    if aliases:
                        if column in aliases:
                            batch[year].rename(columns={column: reference}, inplace=True)
                            print(year, column, 'replaced with', reference)
                            stdout.flush()
                            break  
                else:
                    print(year, 'NO ALIAS REGISTERED FOR', column)
                    stdout.flush()

In [22]:
def build_overview(batch):
    table = []
    
    for column in get_union_of_columns(batch):
        row = {'Column': column}
        for year in batch.keys():
            row.update({year: column in batch[year].columns})
        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    
    # reindex(columns=...) replaces reindex_axis, which recent pandas releases no longer provide
    overview = DataFrame(table).reindex(columns=ordered_columns)
    print('Column mapping overview: done')
    return overview

## Check expenditure sums

There's a little cleaning to do on the amount columns: zeros are represented by a dash, and thousands are assumed to be separated by a comma.

In [23]:
EXPENDITURE_COLUMNS = [
    'MONTO_EJERCIDO', 
    'MONTO_DEVENGADO', 
    'MONTO_APROBADO', 
    'MONTO_PAGADO', 
    'MONTO_MODIFICADO', 
    'MONTO_ADEFAS', 
    'MONTO_EJERCICIO'
]
count = 0

def clean_expenditure_columns(batch):
    check_sums = []

    for column in EXPENDITURE_COLUMNS:
        row = {'Column': column}
        
        for year in sorted(batch.keys()):
            try:
                series = batch[year][column]
                
                # I'm assuming a single '-' represents zero
                series = series.apply(lambda x: '0' if x == '-' else x)
                try:
                    series = series.apply(lambda x: x.replace(',', '') if x is not nan else x)    
                except AttributeError:
                    if count < 10:
                        print(year, column)
                batch[year][column] = series.astype(float)
                check_sum = batch[year][column].sum()
                
                print(year, 'cleaned and summed', column, '=', check_sum, 'pesos')
                
            except KeyError:
                check_sum = nan
                
            row.update({year: check_sum})
        
        check_sums.append(row)

    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    # reindex(columns=...) replaces reindex_axis, which recent pandas releases no longer provide
    return DataFrame(check_sums).reindex(columns=ordered_columns)
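
The cleaning rule applied to each amount can be illustrated on single values. This is a minimal sketch under the same assumptions as the cell above (a lone dash means zero, commas separate thousands). The helper `parse_amount` is hypothetical and not part of the pipeline.

```python
def parse_amount(text):
    """Convert a raw amount string to a float; None (missing) stays None."""
    if text is None:
        return None
    text = text.strip()
    if text == '-':
        return 0.0  # assumption: a single dash stands for zero
    return float(text.replace(',', ''))  # assumption: comma is the thousands separator

for raw in ['1,234,567.50', '-', '-12', None]:
    print(repr(raw), '->', parse_amount(raw))
```

Note that a dash followed by digits is still read as a negative number; only a dash on its own is mapped to zero.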

## Objeto del Gasto Column split

In [24]:
from os.path import join

def generate_catalog(file):
    
    catalog_ = {}
    catalog_file = ExcelFile(file)
    INDEX_COLUMN = 0
    
    for sheet in catalog_file.sheet_names:
        if sheet != 'Concatenated':
            name = sheet.lower().replace(' ', '_')
            output = join('objeto_del_gasto.catalog', name + '.csv')

            df = catalog_file.parse(sheet).dropna()
            index = df.columns[INDEX_COLUMN]

            df[index] =  df[index].astype(str)
            df.set_index(index, inplace=True)
            df = df.groupby(df.index).first()
            df.sort_index(inplace=True)
            
            message = 'Loaded catalog {sheet} into "{name}" ({nb} lines)'
            parameters = dict(sheet=sheet, name=name, nb=len(df))

            print(message.format(**parameters))
            catalog_[name] = df['DESCRIPCION']
    
    print()
    return catalog_

__Note!__ Years are hard-coded in the function below.

In [25]:
def split_objeto_del_gasto(batch):
    catalog = generate_catalog(CATALOGS_FILE)
    missing_in_catalog = []
    
    def has_digits(n, N):
        return not isinstance(n, float) and len(n) >= N 
            
    def lookup(n, table, year):
        try:
            return catalog[table].loc[n]
        except KeyError:
            missing_in_catalog.append({'year': year, 'table': table, 'ID': n})
            return nan
        except TypeError:
            # n is nan
            return nan
    
    for year in sorted(batch.keys()):
        if year == 2016:
            print('Skipping', year, 'because the raw CSV already has the required columns')
        
        else:
            objeto = batch[year]['ID_CONCEPTO'].astype(str)

            batch[year]['ID_CAPITULO'] = objeto.apply(lambda x: x[0] + '000' if x not in (nan, 'nan') else nan)
            batch[year]['ID_CONCEPTO'] = objeto.apply(lambda x: x[:2] + '00' if x not in (nan, 'nan') else nan)
            batch[year]['DESC_CAPITULO'] = batch[year]['ID_CAPITULO'].map(lambda x: lookup(x, 'capitulo', year))  
            batch[year]['DESC_CONCEPTO'] = batch[year]['ID_CONCEPTO'].map(lambda x: lookup(x, 'concepto', year))  
            
            nb_generica_digits = 4 if year in (2008, 2009, 2010) else 3
            
            # Keep the partida genérica only when the code has at least 4 digits
            batch[year]['PARTIDA_GENERICA'] = objeto.apply(lambda x: x[:nb_generica_digits] if has_digits(x, 4) else nan)
            batch[year]['DESC_PARTIDA_GENERICA'] = batch[year]['PARTIDA_GENERICA'].map(lambda x: lookup(x, 'partida_generica', year))  
            
            if year not in (2008, 2009, 2010):
                batch[year]['PARTIDA_ESPECIFICA'] = objeto.apply(lambda x: x if has_digits(x, 5) else nan)
                batch[year]['DESC_PARTIDA_ESPECIFICA'] = batch[year]['PARTIDA_ESPECIFICA'].map(lambda x: lookup(x, 'partida_específica', year) if has_digits(x, 5) else nan)  
            else:
                batch[year]['PARTIDA_ESPECIFICA'] = nan
                batch[year]['DESC_PARTIDA_ESPECIFICA'] = nan

            print(year, 'broke down "Objeto del Gasto" column')
        
    return DataFrame(missing_in_catalog).drop_duplicates(['ID', 'table'])
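
To make the splitting rules easier to follow, here they are applied to a single code, without pandas or the catalog lookup. The function `split_objeto` is a hypothetical helper that mirrors the slicing in `split_objeto_del_gasto`; it assumes the code is a non-empty string of digits.

```python
def split_objeto(code, year):
    """Break one 'Objeto del Gasto' code into its hierarchy levels."""
    old_scheme = year in (2008, 2009, 2010)  # these years use 4-digit genérica codes
    capitulo = code[0] + '000'
    concepto = code[:2] + '00'
    generica = None
    especifica = None
    if len(code) >= 4:
        generica = code[:4] if old_scheme else code[:3]
    if len(code) >= 5 and not old_scheme:
        especifica = code
    return capitulo, concepto, generica, especifica

print(split_objeto('35801', 2014))
print(split_objeto('3502', 2008))
```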

## Prefix IDs 
Disambiguating sub-categories may require prefixing their IDs with their parents' IDs.

In [26]:
def prefix_ids(batch):
    for year in batch.keys():       
        for hierarchy, levels in HIERARCHIES.items():
            prefix = batch[year]['CICLO'].apply(lambda x: '')
            for n, level in enumerate(levels):
                dash = '.' if n > 0 else ''
                prefix = prefix + dash + batch[year][level]  
                batch[year][level] = prefix
                
                print(year, 'prefixed', hierarchy, 'level', n, level)
                stdout.flush()
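
The effect of the prefixing on one row can be sketched in plain Python. The helper `prefix_levels` and the sample IDs are assumptions for illustration; the pipeline does the same thing column-wise on whole Series.

```python
def prefix_levels(ids):
    """Prefix each level's ID with the IDs of all its ancestors, joined by dots."""
    prefixed = []
    for level_id in ids:
        parent = prefixed[-1] + '.' if prefixed else ''
        prefixed.append(parent + level_id)
    return prefixed

# One row of a four-level hierarchy, from top level to bottom level.
print(prefix_levels(['2', '6', '05', '003']))
```

After prefixing, two sub-categories that share the same raw ID under different parents end up with distinct IDs.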

## Remove unused columns

In [27]:
def remove_unused_columns(batch):
    for year, budget in batch.items():
        for column in REMOVE_OUTPUT_COLUMNS:
            try:
                del budget[column]
                print(year, 'deleted', column)
            except KeyError:
                print(column, ': no such column to delete')

##  Pipeline

In [28]:
def do_pipeline():

    def echo_section(section):
        print('\n', section, '\n')

    echo_section('Loading files')
    datasets = load_csv_files()
    
    echo_section('Delete empty columns')
    delete_empty_columns(datasets)

    echo_section('Stripping padding from cells')
    strip_cell_padding(datasets)
    
    echo_section('Counting duplicate lines (NOT de-duplicating)')
    count_duplicates(datasets)
    
    echo_section('Mapping column to aliases')
    map_columns_to_aliases(datasets, COLUMN_ALIASES)

    echo_section('Counting missing values')
    missing_values_report, bad_records = count_missing_values(datasets)
    
    echo_section('Building column mapping overview')
    column_mapping_report = build_overview(datasets)
    
    echo_section('Cleaning expenditure columns')
    sums_report = clean_expenditure_columns(datasets)
    
    echo_section('Breaking down Objeto del Gasto column')
    missing_catalog_ids = split_objeto_del_gasto(datasets)
        
    echo_section('Prefixing IDs of certain category hierarchies')
    prefix_ids(datasets)

    echo_section('Removing unused columns')
    remove_unused_columns(datasets)

    echo_section('Saving pipeline configuration')

    reports_file = join(ITERATION_FOLDER, BASENAME + '.reports.xlsx')
    # The context manager saves and closes the workbook; without it the file is never written
    with ExcelWriter(reports_file) as writer:
        missing_values_report.to_excel(writer, sheet_name='missing values', index=False)
        column_mapping_report.to_excel(writer, sheet_name='column mapping', index=False)
        sums_report.to_excel(writer, sheet_name='check sums', index=False)
        missing_catalog_ids.to_excel(writer, sheet_name='missing_catalog_IDs', index=False)
    print('Saved 4 reports to', reports_file)

    aliases_file = join(ITERATION_FOLDER, BASENAME + '.aliases.json')
    inputs_file = join(ITERATION_FOLDER, BASENAME + '.inputs.json')
    levels_file = join(ITERATION_FOLDER, BASENAME + '.levels.json')
    bad_records_file = join(ITERATION_FOLDER, BASENAME + '.missing.json')

    with open(bad_records_file, 'w+') as json:
        json.write(dumps(bad_records, indent=4))
        
    with open(aliases_file, 'w+') as json:
        json.write(dumps(COLUMN_ALIASES, indent=4))
        
    with open(levels_file, 'w+') as json:
        json.write(dumps(HIERARCHIES, indent=4))
        
    with open(inputs_file, 'w+') as json:
        json.write(dumps(INPUT_FILES, indent=4))
    
    print('Saved input configuration to', inputs_file)    
    print('Saved column aliases to', aliases_file) 
    print('Saved bad records (those with empty cells) to', bad_records_file)    
    print('Saved hierarchy levels used for prefixing to', levels_file) 
    
    echo_section('Pipeline run "%s" done' % ITERATION_LABEL)

    return datasets, missing_catalog_ids, column_mapping_report, missing_values_report, sums_report

## Run the pipeline

In [29]:
budgets, missing_ids, column_mapping, missing_values, sums = do_pipeline()


 Loading files 

Loaded Cuenta_Publica_2008.csv with encoding windows-1252
Loaded Cuenta_Publica_2009.csv with encoding windows-1252
Loaded Cuenta_Publica_2010.csv with encoding windows-1252
Loaded Cuenta_Publica_2011.csv with encoding windows-1252
Loaded Cuenta_Publica_2012.csv with encoding windows-1252
Loaded CP_2013.csv with encoding windows-1252
Loaded CP_2014.csv with encoding windows-1252
Loaded CP_2015.csv with encoding windows-1252


 Delete empty columns 

2009 Unnamed: 25 deleted
2009 Unnamed: 26 deleted
2009 Unnamed: 27 deleted
2009 Unnamed: 28 deleted
2009 Unnamed: 29 deleted
2009 Unnamed: 30 deleted
2009 Unnamed: 31 deleted
2009 Unnamed: 32 deleted
2009 Unnamed: 33 deleted
2009 Unnamed: 34 deleted
2009 Unnamed: 35 deleted
2009 Unnamed: 36 deleted
2009 Unnamed: 37 deleted
2009 Unnamed: 38 deleted
2009 Unnamed: 39 deleted
2011 Unnamed: 25 deleted
2011 Unnamed: 26 deleted
2011 Unnamed: 27 deleted
2011 Unnamed: 28 deleted
2011 Unnamed: 29 deleted
2011 Unnamed: 30 deleted
201

In [30]:
from gc import collect
collect()

3033

In [31]:
for year, budget in budgets.items():
    filepath = MERGED_FILE.replace('merged', str(year))
    budget.to_csv(filepath, encoding='utf-8', index=False)
    print('Saved', filepath)
    stdout.flush()

Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2008.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2009.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2010.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2011.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2012.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2013.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2014.csv
Saved pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.2015.csv


In [32]:
merged = concat(list(budgets.values()))
merged.to_csv(MERGED_FILE, encoding='utf-8', index=False)
print('Saved merged dataset to', MERGED_FILE)    

Saved merged dataset to pipeline.out/iteration-12-new-data-new-aliases/mexican_federal_budget.merged.csv


## Quality control

In [33]:
sorted(list(budget.columns))

['CICLO',
 'DESC_AI',
 'DESC_CAPITULO',
 'DESC_CONCEPTO',
 'DESC_ENT_FED',
 'DESC_FF',
 'DESC_FUNCION',
 'DESC_GPO_FUNCIONAL',
 'DESC_MODALIDAD',
 'DESC_PARTIDA_ESPECIFICA',
 'DESC_PARTIDA_GENERICA',
 'DESC_PP',
 'DESC_RAMO',
 'DESC_SUBFUNCION',
 'DESC_TIPOGASTO',
 'DESC_UR',
 'ENTIDAD',
 'GPO_FUNCIONAL',
 'ID_AI',
 'ID_CAPITULO',
 'ID_CONCEPTO',
 'ID_FF',
 'ID_FUNCION',
 'ID_MODALIDAD',
 'ID_PPI',
 'ID_RAMO',
 'ID_SUBFUNCION',
 'ID_TIPOGASTO',
 'ID_UR',
 'MONTO_ADEFAS',
 'MONTO_APROBADO',
 'MONTO_DEVENGADO',
 'MONTO_EJERCICIO',
 'MONTO_MODIFICADO',
 'MONTO_PAGADO',
 'PARTIDA_ESPECIFICA',
 'PARTIDA_GENERICA',
 'PP']

In [34]:
merged.sample(n=10)

Unnamed: 0,CICLO,DESC_AI,DESC_CAPITULO,DESC_CONCEPTO,DESC_ENT_FED,DESC_FF,DESC_FUNCION,DESC_GPO_FUNCIONAL,DESC_MODALIDAD,DESC_PARTIDA_ESPECIFICA,...,MONTO_ADEFAS,MONTO_APROBADO,MONTO_DEVENGADO,MONTO_EJERCICIO,MONTO_EJERCIDO,MONTO_MODIFICADO,MONTO_PAGADO,PARTIDA_ESPECIFICA,PARTIDA_GENERICA,PP
55804,2011,Servicios de apoyo administrativo,Servicios personales,Otras prestaciones sociales y económicas,,Recursos fiscales,Temas Empresariales,Desarrollo Económico,Apoyo al proceso presupuestario y para mejorar...,Compensación garantizada,...,,10026116.0,,,9421827.0,,,15402.0,154,Apoyo al proceso presupuestario y para mejorar...
210104,2012,Democracia preservada y fortalecida mediante l...,Servicios personales,Seguridad social,Coahuila,Recursos fiscales,Coordinación de la Política de Gobierno,Gobierno,Específicos,Cuotas para el seguro de separación individual...,...,,0.0,,,333853.0,,,14404.0,144,Específicos.3
101865,2011,Manejo eficiente y sustentable del agua y prev...,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",,Recursos fiscales,Agua Potable y Alcantarillado,Desarrollo Social,Sujetos a Reglas de Operación,Mantenimiento y conservación de vehículos terr...,...,,0.0,,,108176.0,,,35501.0,355,Sujetos a Reglas de Operación.74
72443,2015,Carreteras alimentadoras y caminos rurales efi...,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",Michoacán,Recursos fiscales,Transporte,Desarrollo Económico,Proyectos de Inversión,"Servicios de lavandería, limpieza e higiene",...,0.0,3420.0,3420.0,3420.0,,3420.0,3420.0,35801.0,358,Proyectos de Inversión.31
176904,2014,Investigación del delito federal,Servicios personales,Remuneraciones al personal de carácter permanente,Distrito Federal,Recursos fiscales,Justicia,Gobierno,Prestación de Servicios Públicos,Sueldos base,...,7941.48,616769.0,449615.72,449615.72,,449615.72,441674.24,11301.0,113,Prestación de Servicios Públicos.9
13173,2008,Política de ingresos equitativa y promotora de...,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",,Recursos fiscales,Hacienda,Gobierno,"Planeación, formulación, implementación, segui...",,...,,629488.0,,,50500.0,,,,3502,"Planeación, formulación, implementación, segui..."
127074,2009,Democracia preservada y fortalecida mediante l...,Materiales y suministros,"Productos químicos, farmacéuticos y de laborat...",,Recursos fiscales,Gobernación,Gobierno,Específicos,,...,,0.0,,,0.0,,,,2500,Específicos.1
191662,2014,Turismo con sello propio de calidad hospitalid...,Servicios generales,"Servicios financieros, bancarios y comerciales",Distrito Federal,Recursos fiscales,Turismo,Desarrollo Económico,Promoción y fomento,Seguros de bienes patrimoniales,...,0.0,118648.0,110130.23,110130.23,,110130.23,110130.23,34501.0,345,Promoción y fomento.3
125217,2013,Servicios de apoyo administrativo,Materiales y suministros,"Vestuario, blancos, prendas de protección y ar...",Distrito Federal,Recursos fiscales,Seguridad Nacional,Gobierno,Apoyo al proceso presupuestario y para mejorar...,Prendas de protección personal,...,,106232.0,,64933.83,,,,27201.0,272,Apoyo al proceso presupuestario y para mejorar...
56655,2011,Diseño y aplicación de la política educativa,Materiales y suministros,"Vestuario, blancos, prendas de protección y ar...",,Recursos fiscales,Educación,Desarrollo Social,"Planeación, seguimiento y evaluación de políti...",Prendas de protección personal,...,,3935.0,,,27436.0,,,27201.0,272,"Planeación, seguimiento y evaluación de políti..."


In [35]:
objeto_breakdown = [
    'CICLO', 
    'ID_CAPITULO', 
    'ID_CONCEPTO', 
    'PARTIDA_ESPECIFICA', 
    'PARTIDA_GENERICA'
]
merged[objeto_breakdown].sample(n=20)

| | CICLO | ID_CAPITULO | ID_CONCEPTO | PARTIDA_ESPECIFICA | PARTIDA_GENERICA |
|---|---|---|---|---|---|
| 100459 | 2009 | 2000 | 2400 | | 2401 |
| 44344 | 2013 | 6000 | 6200 | 62501.0 | 625 |
| 78550 | 2015 | 3000 | 3700 | 37504.0 | 375 |
| 160312 | 2013 | 2000 | 2900 | 29201.0 | 292 |
| 235733 | 2015 | 3000 | 3700 | 37201.0 | 372 |
| 156169 | 2013 | 2000 | 2400 | 24201.0 | 242 |
| 63478 | 2011 | 3000 | 3700 | 37204.0 | 372 |
| 48721 | 2014 | 3000 | 3300 | 33602.0 | 336 |
| 86898 | 2008 | 3000 | 3800 | | 3813 |
| 65376 | 2009 | 2000 | 2100 | | 2101 |


In [36]:
print('Total: missing', len(missing_ids), 'catalog IDs to breakdown the "Objeto del Gasto" column')
print('Tables:', dict(missing_ids.groupby('table').count()['ID']))
print('Years:', dict(missing_ids.groupby('year').count()['ID']))
try:
    missing_ids.sample(n=20)
except ValueError:
    pass

Total: missing 76 catalog IDs to breakdown the "Objeto del Gasto" column
Tables: {'partida_generica': 45, 'partida_específica': 24, 'concepto': 7}
Years: {2008: 48, 2009: 4, 2012: 22, 2013: 1, 2015: 1}


In [37]:
missing_ids.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76 entries, 0 to 18706
Data columns (total 3 columns):
ID       76 non-null object
table    76 non-null object
year     76 non-null int64
dtypes: int64(1), object(2)
memory usage: 2.4+ KB


In [38]:
column_mapping

| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DESC_AI | True | True | True | True | True | True | True | True |
| 1 | MONTO_ADEFAS | False | False | False | False | False | False | True | True |
| 2 | ID_TIPOGASTO | True | True | True | True | True | True | True | True |
| 3 | DESC_PP | True | True | True | True | True | True | True | True |
| 4 | DESC_SUBFUNCION | True | True | True | True | True | True | True | True |
| 5 | DESC_FF | True | True | True | True | True | True | True | True |
| 6 | MONTO_APROBADO | True | True | True | True | True | True | True | True |
| 7 | MONTO_MODIFICADO | False | False | False | False | False | False | True | True |
| 8 | ID_SUBFUNCION | True | True | True | True | True | True | True | True |
| 9 | DESC_UR | True | True | True | True | True | True | True | True |


In [39]:
missing_values

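| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | DESC_AI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | MONTO_ADEFAS | | | | | | | 0.0 | 0.0 |
| 2 | ID_TIPOGASTO | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | DESC_PP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | DESC_SUBFUNCION | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | DESC_FF | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | MONTO_APROBADO | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7 | MONTO_MODIFICADO | | | | | | | 0.0 | 0.0 |
| 8 | ID_SUBFUNCION | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | DESC_UR | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |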


In [40]:
sums

| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MONTO_EJERCIDO | 2576692000000.0 | 2296086000000.0 | 2474100000000.0 | 2717372000000.0 | 2896331000000.0 | | | |
| 1 | MONTO_DEVENGADO | | | | | | | 5076810000000.0 | 5508987000000.0 |
| 2 | MONTO_APROBADO | 1992356000000.0 | 2289715000000.0 | 2376915000000.0 | 2560196000000.0 | 2754868000000.0 | 4322619000000.0 | 4905401000000.0 | 5138442000000.0 |
| 3 | MONTO_PAGADO | | | | | | | 4972712000000.0 | 5366652000000.0 |
| 4 | MONTO_MODIFICADO | | | | | | | 5013990000000.0 | 5398608000000.0 |
| 5 | MONTO_ADEFAS | | | | | | | 36941610000.0 | 31122650000.0 |
| 6 | MONTO_EJERCICIO | | | | | | 4617618000000.0 | 5010877000000.0 | 5399018000000.0 |


In [41]:
merged.sample(n=20) 

Unnamed: 0,CICLO,DESC_AI,DESC_CAPITULO,DESC_CONCEPTO,DESC_ENT_FED,DESC_FF,DESC_FUNCION,DESC_GPO_FUNCIONAL,DESC_MODALIDAD,DESC_PARTIDA_ESPECIFICA,...,MONTO_ADEFAS,MONTO_APROBADO,MONTO_DEVENGADO,MONTO_EJERCICIO,MONTO_EJERCIDO,MONTO_MODIFICADO,MONTO_PAGADO,PARTIDA_ESPECIFICA,PARTIDA_GENERICA,PP
62025,2012,Servicios de apoyo administrativo,Servicios generales,Otros servicios generales,Guanajuato,Recursos fiscales,Transporte,Desarrollo Económico,Apoyo al proceso presupuestario y para mejorar...,Otros impuestos y derechos,...,,36425.0,,,28314.21,,,39202,392,Apoyo al proceso presupuestario y para mejorar...
123809,2013,Servicios de apoyo administrativo,Servicios personales,Seguridad social,Distrito Federal,Recursos fiscales,Seguridad Nacional,Gobierno,Apoyo al proceso presupuestario y para mejorar...,Aportaciones al ISSFAM,...,,706829.0,,601725.3,,,,14102,141,Apoyo al proceso presupuestario y para mejorar...
186218,2015,"Apoyo al ingreso, a la salud y a la educación ...",Servicios personales,Otras prestaciones sociales y económicas,Distrito Federal,Recursos fiscales,Protección Social,Desarrollo Social,"Planeación, seguimiento y evaluación de políti...",Otras prestaciones,...,0.0,435983.0,435983.0,435983.0,,435983.0,435983.0,15901,159,"Planeación, seguimiento y evaluación de políti..."
102264,2014,"Micro, pequeñas y medianas empresas productiva...",Servicios personales,Seguridad social,Guerrero,Recursos fiscales,"Asuntos Económicos, Comerciales y Laborales en...",Desarrollo Económico,Prestación de Servicios Públicos,Aportaciones al ISSSTE,...,0.0,230577.0,150463.9,150463.9,,150463.9,150463.9,14101,141,Prestación de Servicios Públicos.9
43260,2011,"Carreteras eficientes, seguras y suficientes",Servicios generales,Otros servicios generales,,Recursos fiscales,Comunicaciones y Transportes,Desarrollo Económico,Regulación y supervisión,Otros impuestos y derechos,...,,874500.0,,,2037601.0,,,39202,392,Regulación y supervisión.8
248448,2014,Generación de energía eléctrica,Servicios personales,Remuneraciones adicionales y especiales,Guanajuato,Ingresos Propios,Combustibles y Energía,Desarrollo Económico,Específicos,Remuneraciones por horas extraordinarias,...,0.0,718243.0,0.0,0.0,,0.0,0.0,13301,133,Específicos.582
64214,2013,Carreteras alimentadoras y caminos rurales efi...,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",Oaxaca,Recursos fiscales,Transporte,Desarrollo Económico,Proyectos de Inversión,Otros servicios comerciales,...,,1200.0,,0.0,,,,33602,336,Proyectos de Inversión.31
286145,2013,"Pago de pensiones por retiro, cesantía en edad...",Servicios personales,Remuneraciones adicionales y especiales,Estado de México,Recursos fiscales,Protección Social,Desarrollo Social,Pensiones y jubilaciones,Prima quinquenal por años de servicios efectiv...,...,,0.0,,3205.0,,,,13101,131,Pensiones y jubilaciones.24
242204,2013,Entorno ecológico,Materiales y suministros,Materiales y articulos de construccion y de re...,Jalisco,Ingresos Propios,Combustibles y Energía,Desarrollo Económico,Prestación de Servicios Públicos,Otros materiales y artículos de construcción y...,...,,158801.0,,0.0,,,,24901,249,Prestación de Servicios Públicos.12
225022,2014,Fondo de Aportaciones para los Servicios de Salud,Participaciones y aportaciones,Aportaciones,Sinaloa,Recursos fiscales,Salud,Desarrollo Social,Gasto Federalizado,Aportaciones federales a las entidades federat...,...,0.0,121842.0,121842.0,121842.0,,121842.0,121842.0,83114,831,Gasto Federalizado.2


In [42]:
breakdown = [
    'CICLO', 
    'ID_CAPITULO', 
    'ID_CONCEPTO', 
    'PARTIDA_GENERICA',        
    'PARTIDA_ESPECIFICA', 
    'DESC_CAPITULO',
    'DESC_CONCEPTO', 
    'DESC_PARTIDA_GENERICA',
    'DESC_PARTIDA_ESPECIFICA'
]

merged[breakdown].sample(n=200)

Unnamed: 0,CICLO,ID_CAPITULO,ID_CONCEPTO,PARTIDA_GENERICA,PARTIDA_ESPECIFICA,DESC_CAPITULO,DESC_CONCEPTO,DESC_PARTIDA_GENERICA,DESC_PARTIDA_ESPECIFICA
14145,2014,1000,1400,143,14302,Servicios personales,Seguridad social,Aportaciones al sistema para el retiro,Depósitos para el ahorro solidario
108668,2015,2000,2100,211,21101,Materiales y suministros,"Materiales de administracion, emision de docum...","Materiales, útiles y equipos menores de oficina",Materiales y útiles de oficina
121045,2013,1000,1300,132,13201,Servicios personales,Remuneraciones adicionales y especiales,"Primas de vacaciones, dominical y gratificació...",Primas de vacaciones y dominical
340997,2015,1000,1500,154,15401,Servicios personales,Otras prestaciones sociales y económicas,Prestaciones contractuales,Prestaciones establecidas por condiciones gene...
38322,2011,3000,3500,358,35801,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",Servicios de limpieza y manejo de desechos,"Servicios de lavandería, limpieza e higiene"
201815,2015,2000,2700,275,27501,Materiales y suministros,"Vestuario, blancos, prendas de protección y ar...","Blancos y otros productos textiles, excepto pr...","Blancos y otros productos textiles, excepto pr..."
101099,2015,2000,2600,261,26105,Materiales y suministros,"Combustibles, lubricantes y aditivos","Combustibles, lubricantes y aditivos","Combustibles, lubricantes y aditivos para maqu..."
94045,2013,1000,1400,144,14401,Servicios personales,Seguridad social,Aportaciones para seguros,Cuotas para el seguro de vida del personal civil
125993,2013,2000,2700,274,27401,Materiales y suministros,"Vestuario, blancos, prendas de protección y ar...",Productos textiles,Productos textiles
102735,2015,2000,2100,212,21201,Materiales y suministros,"Materiales de administracion, emision de docum...",Materiales y útiles de impresión y reproducción,Materiales y útiles de impresión y reproducción


In [43]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1811445 entries, 0 to 341255
Data columns (total 39 columns):
CICLO                      object
DESC_AI                    object
DESC_CAPITULO              object
DESC_CONCEPTO              object
DESC_ENT_FED               object
DESC_FF                    object
DESC_FUNCION               object
DESC_GPO_FUNCIONAL         object
DESC_MODALIDAD             object
DESC_PARTIDA_ESPECIFICA    object
DESC_PARTIDA_GENERICA      object
DESC_PP                    object
DESC_RAMO                  object
DESC_SUBFUNCION            object
DESC_TIPOGASTO             object
DESC_UR                    object
ENTIDAD                    object
GPO_FUNCIONAL              object
ID_AI                      object
ID_CAPITULO                object
ID_CONCEPTO                object
ID_FF                      object
ID_FUNCION                 object
ID_MODALIDAD               object
ID_PPI                     object
ID_RAMO                    object
ID_S