# Mexican federal budget pre-processing pipeline

## Instructions

To run the notebook:

1. choose a unique `ITERATION_LABEL` for each pipeline run
2. specify and describe your input files (`INPUT_FILES`)
3. make sure your column mapping (`COLUMN_ALIASES`) is correct
4. run the whole notebook by clicking on __Kernel > Restart & Run All__

## Settings

Choose a unique iteration label for each pipeline run.

In [1]:
ITERATION_LABEL = 'iteration-9-development'

Put your input files inside the `pipeline.in` folder and describe them here.

In [2]:
INPUT_FILES = {
    2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
    2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
    2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
    2013: {'name': 'Cuenta_Publica_2013.csv', 'encoding': 'windows-1252'},
    2014: {'name': 'Cuenta_Publica_2014.csv', 'encoding': 'windows-1252'},
    2015: {'name': 'Cuenta_Publica_2015.csv', 'encoding': 'windows-1252'},
    2016: {'name': '2016_2T_Gasto_OS.csv', 'encoding': 'windows-1252'} # cp850 for the original 2016 file
}

If your input files don't all have the same column names, define your mapping here. 

In [3]:
COLUMN_ALIASES = {
    'Actividad Institucional': ['AI'],
    'Adefas': ['ADEFAS'],
    'Aprobado': [
        'PEF_2016',
        'Importe Presupuesto de Egresos de la Federación',
        'Importe Presupuesto de Egresos de la Federación (PEF)'
    ],
    'Ciclo': None,
    'Clave de cartera': ['CLAVE_CARTERA'],
    'Descripción de Fuente de Financiamiento': ['FUENTE_FINAN_DESCRIPCION'],
    'Descripción de Función': ['FUNCIONL_DESCRIPCION'],
    'Descripción de Grupo Funcional': [
        'Descripción de Finalidad',
        'GRUPO_FUN_DESCRIPCION',
        'Descripción de Grupo Funcional'
    ],
    'Descripción de Objeto del Gasto': ['CONCEPTO_DESCRIPCION'],
    'Descripción de Programa Presupuestario': ['PROGR_PRES_DESCRIPCION'],
    'Descripción de Ramo': ['RAMO_DESCRIPCION'],
    'Descripción de Reasignacion': ['REASIGNACION_DESCRIPCION'],
    'Descripción de Subfunción': ['SUBFUNCIONL_DESCRIPCION', 'Descripción de subfunción'],
    'Descripción de Tipo de Gasto': ['TIPO_GASTO_DESCRIPCION'],
    'Descripción de Unidad Responsable': ['UNIDAD_DESCRIPCION'],
    'Descripción de la Actividad Institucional': [
        'ACTIVIDAD_INST_DESCRIPCION',
        'Descripción de Actividad Institucional'
    ],
    'Descripción de Entidad Federativa': ['Descripción de la entidad federativa', 'ENTIDAD_FED_DESCRIPCION'],
    'Descripción de la modalidad del programa presupuestario': [
        'MODALIDAD_DESCRIPCION',
        'Descripción del Identificador del Programa Presupuestario',
        'Descripción del Identificador de Programa Presupuestario'
    ],
    'Devengado': None,
    'Ejercicio': None,
    'Ejercido': None,
    'Entidad Federativa': ['EF'],
    'Fuente de Financiamiento': ['FF', 'Fuente de Finaciamiento'],
    'Función': ['FN'],
    'Grupo Funcional': [
        'Finalidad', 'GF', 'Grupo Funcional'
    ],
    'Modalidad del Programa presupuestario': [
        'MOD',
        'Identificador de Programa Presupuestario',
        'Identificador del Programa Presupuestario'
    ],
    'Modificado': None,
    'Objeto del Gasto': ['CONCEPTO'],
    'Pagado': None,
    'Programa Presupuestario': ['PP'],
    'Ramo': ['RAMO'],
    'Reasignacion': ['RA'],
    'Subfunción': ['SF'],
    'Tipo de Gasto': ['TG'],
    'Unidad Responsable': ['UNIDAD'],
    'Capitulo': None,
    'Concepto': None,
    'Partida Genérica': None,
    'Partida Específica': None,
    'Descripción de Capitulo': None,
    'Descripción de Concepto': None,
    'Descripción de Partida Genérica': None,
    'Descripción de Partida Específica': ['Descripcion de Partida Específica'],    
}
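
As a rough illustration of how such a mapping resolves a raw header to its reference name, the alias table can be inverted into a flat lookup. This is only a sketch: `invert_aliases` and `sample_aliases` are hypothetical names that do not appear in the notebook, and the pipeline itself renames columns in `map_columns_to_aliases` further down.

```python
# Hypothetical helper (not part of the notebook): invert an alias table so
# that every known spelling of a column points to its reference name.
def invert_aliases(alias_table):
    lookup = {}
    for reference, aliases in alias_table.items():
        lookup[reference] = reference      # the reference name maps to itself
        for alias in aliases or []:        # None means "no aliases registered"
            lookup[alias] = reference
    return lookup


# A small excerpt in the same shape as COLUMN_ALIASES
sample_aliases = {
    'Ramo': ['RAMO'],
    'Función': ['FN'],
    'Ciclo': None,
}

lookup = invert_aliases(sample_aliases)
print(lookup['RAMO'])
print(lookup['FN'])
print('PP' in lookup)  # headers without a registered alias are simply absent
```

A header that is missing from the lookup corresponds to the `NO ALIAS REGISTERED FOR` message printed by the pipeline.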

The following hierarchical categories will have IDs prefixed with the parent categories:

In [4]:
HIERARCHIES = {
    'functional': [
        'Grupo Funcional', 
        'Función', 
        'Subfunción', 
        'Actividad Institucional'
    ],
    'administrative': [
        'Ramo', 
        'Unidad Responsable'
    ],
    'activities': [
        'Modalidad del Programa presupuestario', 
        'Programa Presupuestario'
    ],
}
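
To make the effect of prefixing concrete, here is a minimal sketch on a single record, assuming IDs are plain strings joined with a dash (as in the pipeline output, e.g. `3-2-1`). The function `prefix_levels` and the sample row are hypothetical; the notebook applies the same idea column-wise in `prefix_ids`.

```python
# Hypothetical single-row version of the prefixing step: each level's ID is
# replaced by the dash-joined chain of its ancestors' IDs plus its own ID.
def prefix_levels(row, levels):
    prefixed = dict(row)
    parts = []
    for level in levels:
        parts.append(row[level])
        prefixed[level] = '-'.join(parts)
    return prefixed


functional_levels = [
    'Grupo Funcional',
    'Función',
    'Subfunción',
    'Actividad Institucional',
]

row = {
    'Grupo Funcional': '3',
    'Función': '2',
    'Subfunción': '1',
    'Actividad Institucional': '24',
}

for level, value in prefix_levels(row, functional_levels).items():
    print(level, '=', value)
```

The top level keeps its ID unchanged, and each deeper level becomes unique across parents.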

The following columns are unused and removed at the end of the pipeline:

In [5]:
REMOVE_OUTPUT_COLUMNS = [
    'Reasignacion',
    'Objeto del Gasto',
    'Descripción de Reasignacion',
    'Descripción de Objeto del Gasto'
]

In [6]:
REMOVE_INPUT_COLUMNS = {
    2016: [
        'Adefas',
        'Partida Específica',
        'Partida Genérica',
        'Descripción de Partida Genérica',
        'Descripcion de Partida Específica',
        'Ejercicio',
        'Devengado',
        'Ejercido',
    ]
}

That's it. Now just run the notebook from beginning to end.

## Imports

In [7]:
from sys import stdout
from pandas import read_csv, concat, DataFrame, ExcelWriter, ExcelFile, Series
from numpy import nan, isnan
from os.path import join, isdir
from os import mkdir
from json import dumps, loads
from pprint import pprint

## Configuration

In [8]:
BASENAME = 'mexican_federal_budget'
INPUT_FOLDER = 'pipeline.in'
OUTPUT_FOLDER = 'pipeline.out'
ITERATION_FOLDER = join(OUTPUT_FOLDER, ITERATION_LABEL)
MERGED_FILE = join(ITERATION_FOLDER, BASENAME + '.merged.csv')
CATALOGS_FOLDER = 'objeto_del_gasto.catalog'
CATALOGS_FILE = 'objeto_del_gasto.catalog.xlsx'

In [9]:
if isdir(ITERATION_FOLDER):
    raise ValueError('Please enter a unique iteration label')
    
mkdir(ITERATION_FOLDER)

## Encoding inspection

Detect the file encodings of the input files using the `cChardet` utility library. __Warning:__ it's not always accurate. This is meant as an indication only. In the end, encodings will be taken from `INPUT_FILES`.

In [10]:
def detect_encodings():
    """Detect CSV file encoding with the cChardet library"""

    try:
        import cchardet as chardet
    except ImportError:
        cChardet = 'https://github.com/PyYoshi/cChardet'
        print('Encoding inspection skipped: install %s' % cChardet)
        return

    results = {}
    results_file = join(OUTPUT_FOLDER, ITERATION_LABEL, 'encodings.detected.json')
    
    for year, file in sorted(INPUT_FILES.items()):
        datafile = join(INPUT_FOLDER, file['name'])
        
        with open(datafile, 'rb') as f:
            text = f.read()
            
        result = chardet.detect(text)
        results.update({year: result})
        print(year, 'Inspected', file['name'], result)
    
    with open(results_file, 'w+') as json:
        json.write(dumps(results, indent=4))
        print('\nSaved encoding detection report to', results_file)
        
# detect_encodings()
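
Because detection is only indicative, a quick standard-library sanity check can complement it: try decoding a few bytes with each candidate encoding. The sketch below is an assumption about how one might do that, not part of the notebook; `decodes_cleanly` is a hypothetical helper and the sample bytes are made up.

```python
# Hypothetical sanity check: does a byte string decode without errors under a
# given encoding? A clean decode is necessary but not sufficient, since
# single-byte code pages such as cp850 accept any byte.
def decodes_cleanly(raw, encoding):
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Sample header text encoded the way the input files are declared
raw = 'Descripción de Ramo'.encode('windows-1252')

for candidate in ('utf-8', 'windows-1252', 'cp850'):
    print(candidate, decodes_cleanly(raw, candidate))

# cp850 decodes without error but yields the wrong accented character,
# so it is worth eyeballing the decoded text as well
print(raw.decode('windows-1252'))
print(raw.decode('cp850'))
```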

## Load files

In [11]:
def read_columns(file, encoding):
    """Return clean CSV file headers"""
    
    with open(file, encoding=encoding) as csv:
        header = csv.readline()
        return header.replace('\n', '').split(',')

In [12]:
def force_strings(columns):
    """Return string enforcement for each column of a CSV file"""
    
    for column in columns:
        yield column, str

In [13]:
def load_csv_files():
    """Load raw data (CSV) files"""
    
    batch = {}
    
    for year, file in sorted(INPUT_FILES.items()):
        filepath = join(INPUT_FOLDER, file['name'])
        column_names = read_columns(filepath, file['encoding'])
        column_types = dict(force_strings(column_names))
        
        batch[year] = read_csv(filepath, encoding=file['encoding'], dtype=column_types)
        print('Loaded', file['name'], 'with encoding', file['encoding'])
    
    print()
    stdout.flush()

    for year in sorted(INPUT_FILES.keys()):
        if year in REMOVE_INPUT_COLUMNS:
            for column in REMOVE_INPUT_COLUMNS[year]:
                try:
                    del batch[year][column]
                    print(year, 'deleted', column)
                except KeyError:
                    print(year, column, 'not found in', INPUT_FILES[year]['name'])

        stdout.flush()

    return batch

## Clean the data

In [14]:
def strip_cell_padding(batch):
    for year in sorted(batch.keys()):
        for column in batch[year].columns:
            # Rename first, then address the column by its stripped name
            stripped = column.strip()
            batch[year].rename(columns={column: stripped}, inplace=True)
            batch[year][stripped] = batch[year][stripped].apply(lambda x: x.strip() if isinstance(x, str) else x)
        print(year, 'stripped cell paddings')
        stdout.flush()

In [15]:
def delete_empty_columns(batch):
    for year in batch.keys():
        for column in batch[year].columns:
            if 'Unnamed:' in column:
                try:
                    del batch[year][column]
                    print(year, column, 'deleted')
                    stdout.flush()
                except KeyError:
                    pass  

In [16]:
def count_missing_values(batch):
    collector = {}
    table = []

    for column in get_union_of_columns(batch):
        row = {'Column': column}
        collector.update({column: []})
        
        for year in batch.keys():
            if column in batch[year].columns:
                is_empty = batch[year][column].isnull()
                empty_lines = batch[year].where(is_empty).dropna(how='all')
                collector[column].extend(empty_lines.to_dict(orient='records'))
                nb_empty_cells = len(empty_lines)
            else:
                nb_empty_cells = nan
                
            row.update({year: nb_empty_cells})
            if nb_empty_cells not in (nan, 0):
                print(year, 'found', nb_empty_cells, 'missing values in', column)

        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    # reindex_axis was removed from pandas; reindex(columns=...) is equivalent
    empty_values_overview_table = DataFrame(table).reindex(columns=ordered_columns)
    
    return empty_values_overview_table, collector

In [17]:
def count_duplicates(batch):
    for year, df in sorted(batch.items()):
        nb_duplicate_lines = df.duplicated().sum()
        print(year, 'found', nb_duplicate_lines, 'duplicate lines')

## Alias column names

In [18]:
def get_union_of_columns(batch):
    union = set()
    for year in batch.keys():
        union = union | set(batch[year].columns)
    return union

In [19]:
from yaml import safe_load

def load_aliases(file):
    # safe_load avoids the Loader argument that recent PyYAML requires for load()
    with open(file) as yaml:
        aliases = safe_load(yaml.read())
        return aliases

In [20]:
def map_columns_to_aliases(batch, list_of_aliases):
    for year in sorted(batch.keys()):
        for column in sorted(batch[year].columns):
            if column not in list_of_aliases:
                for reference, aliases in list_of_aliases.items():
                    if aliases:
                        if column in aliases:
                            batch[year].rename(columns={column: reference}, inplace=True)
                            print(year, column, 'replaced with', reference)
                            stdout.flush()
                            break  
                else:
                    print(year, 'NO ALIAS REGISTERED FOR', column)
                    stdout.flush()

In [21]:
def build_overview(batch):
    table = []
    
    for column in get_union_of_columns(batch):
        row = {'Column': column}
        for year in batch.keys():
            row.update({year: column in batch[year].columns})
        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    
    # reindex_axis was removed from pandas; reindex(columns=...) is equivalent
    overview = DataFrame(table).reindex(columns=ordered_columns)
    print('Column mapping overview: done')
    return overview

## Check expenditure sums

There's a little cleaning to do on the amount columns: zeros are represented by a dash, and thousands are assumed to be separated by a comma.

In [22]:
EXPENDITURE_COLUMNS = [
    'Ejercido', 
    'Devengado', 
    'Aprobado', 
    'Pagado', 
    'Modificado', 
    'Adefas', 
    'Ejercicio'
]
count = 0

def clean_expenditure_columns(batch):
    check_sums = []

    for column in EXPENDITURE_COLUMNS:
        row = {'Column': column}
        
        for year in sorted(batch.keys()):
            try:
                series = batch[year][column]
                
                # I'm assuming '-' represents zero
                series = series.apply(lambda x: '0' if x == '-' else x)
                try:
                    series = series.apply(lambda x: x.replace(',', '') if x is not nan else x)    
                except AttributeError:
                    if count < 10:
                        print(year, column)
                batch[year][column] = series.astype(float)
                check_sum = batch[year][column].sum()
                
                print(year, 'cleaned and summed', column, '=', check_sum, 'pesos')
                
            except KeyError:
                check_sum = nan
                
            row.update({year: check_sum})
        
        check_sums.append(row)

    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    # reindex_axis was removed from pandas; reindex(columns=...) is equivalent
    return DataFrame(check_sums).reindex(columns=ordered_columns)
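
The two cleaning rules (a lone dash means zero, commas are thousands separators) can be sketched on individual values as follows. `parse_amount` is a hypothetical helper, and it uses `None` for missing values where the notebook relies on `nan`.

```python
# Hypothetical per-value version of the cleaning rules described above.
def parse_amount(text):
    if text is None:                 # missing value: leave it missing
        return None
    text = text.strip()
    if text == '-':                  # a lone dash is assumed to mean zero
        return 0.0
    return float(text.replace(',', ''))  # drop thousands separators


for raw_value in ['-', '53,844', '1,416,654.4', None]:
    print(repr(raw_value), '->', parse_amount(raw_value))
```

Summing the cleaned values per year and column gives the check sums reported by the pipeline.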

## Objeto del Gasto Column split

In [23]:
from os.path import join

def generate_catalog(file):
    
    catalog_ = {}
    catalog_file = ExcelFile(file)
    INDEX_COLUMN = 0
    
    for sheet in catalog_file.sheet_names:
        if sheet != 'Concatenated':
            name = sheet.lower().replace(' ', '_')
            output = join('objeto_del_gasto.catalog', name + '.csv')

            df = catalog_file.parse(sheet).dropna()
            index = df.columns[INDEX_COLUMN]

            df[index] =  df[index].astype(str)
            df.set_index(index, inplace=True)
            df = df.groupby(df.index).first()
            df.sort_index(inplace=True)
            
            message = 'Loaded catalog {sheet} into "{name}" ({nb} lines)'
            parameters = dict(sheet=sheet, name=name, nb=len(df))

            print(message.format(**parameters))
            catalog_[name] = df['DESCRIPCION']
    
    print()
    return catalog_

__Note!__ Years are hard-coded in the script below.

In [24]:
def split_objeto_del_gasto(batch):
    catalog = generate_catalog(CATALOGS_FILE)
    missing_in_catalog = []
    
    def has_digits(n, N):
        return not isinstance(n, float) and len(n) >= N 
            

    def lookup(n, table, year):
        try:
            return catalog[table].loc[n]
        except KeyError:
            missing_in_catalog.append({'year': year, 'table': table, 'ID': n})
            return nan
        except TypeError:
            # n is nan
            return nan
    
    for year in sorted(batch.keys()):
        if year == 2016:
            print('Skipping', year, 'because the raw CSV already has the required columns')
        
        else:
            objeto = batch[year]['Objeto del Gasto'].astype(str)

            batch[year]['Capitulo'] = objeto.apply(lambda x: x[0] + '000' if x not in (nan, 'nan') else nan)
            batch[year]['Concepto'] = objeto.apply(lambda x: x[:2] + '00' if x not in (nan, 'nan') else nan)
            batch[year]['Descripción de Capitulo'] = batch[year]['Capitulo'].map(lambda x: lookup(x, 'capitulo', year))  
            batch[year]['Descripción de Concepto'] = batch[year]['Concepto'].map(lambda x: lookup(x, 'concepto', year))  
            
            # Skip the LAST year of the dataset (currently 2016) it has split columns already
            batch[year]['Partida Genérica'] = objeto.apply(lambda x: x[:3] if has_digits(x, 4) else nan)
            batch[year]['Descripción de Partida Genérica'] = batch[year]['Partida Genérica'].map(lambda x: lookup(x, 'partida_generica', year))  
            
            if year not in (2008, 2009, 2010):
                batch[year]['Partida Específica'] = objeto.apply(lambda x: x if has_digits(x, 5) else nan)
                batch[year]['Descripción de Partida Específica'] = batch[year]['Partida Específica'].map(lambda x: lookup(x, 'partida_especifica', year) if has_digits(x, 5) else nan)  
            else:
                batch[year]['Partida Específica'] = nan
                batch[year]['Descripción de Partida Específica'] = nan

            print(year, 'broke down "Objeto del Gasto" column')
        
    return DataFrame(missing_in_catalog).drop_duplicates()
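
The slicing rules used above can be illustrated on a single code. This is a simplified sketch: `split_objeto` is a hypothetical helper, it returns `None` where the notebook uses `nan`, and it ignores both the catalog lookups and the year-specific exceptions.

```python
# Hypothetical single-value version of the "Objeto del Gasto" breakdown.
def split_objeto(code):
    return {
        'Capitulo': code[0] + '000',                            # first digit
        'Concepto': code[:2] + '00',                            # first two digits
        'Partida Genérica': code[:3] if len(code) >= 4 else None,
        'Partida Específica': code if len(code) >= 5 else None,
    }


print(split_objeto('22104'))  # five-digit code: all four levels available
print(split_objeto('3400'))   # four-digit code: no partida específica
```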

## Prefix IDs

Disambiguating sub-categories may require prefixing their IDs with their parents' IDs.

In [25]:
def prefix_ids(batch):
    for year in batch.keys():       
        for hierarchy, levels in HIERARCHIES.items():
            prefix = batch[year]['Ciclo'].apply(lambda x: '')
            for n, level in enumerate(levels):
                dash = '-' if n > 0 else ''
                prefix = prefix + dash + batch[year][level]  
                batch[year][level] = prefix
                
                print(year, 'prefixed', hierarchy, 'level', n, level)
                stdout.flush()

## Remove unused columns

In [26]:
def remove_unused_columns(batch):
    for year, budget in batch.items():
        for column in REMOVE_OUTPUT_COLUMNS:
            try:
                del budget[column]
                print(year, 'deleted', column)
            except KeyError:
                pass

## Pipeline

In [27]:
def do_pipeline():

    def echo_section(section):
        print('\n', section, '\n')

    echo_section('Loading files')
    datasets = load_csv_files()
    
    echo_section('Delete empty columns')
    delete_empty_columns(datasets)

    echo_section('Stripping padding from cells')
    strip_cell_padding(datasets)
    
    echo_section('Counting duplicate lines (NOT de-duplicating)')
    count_duplicates(datasets)
    
    echo_section('Mapping column to aliases')
    map_columns_to_aliases(datasets, COLUMN_ALIASES)

    echo_section('Counting missing values')
    missing_values_report, bad_records = count_missing_values(datasets)
    
    echo_section('Building column mapping overview')
    column_mapping_report = build_overview(datasets)
    
    echo_section('Cleaning expenditure columns')
    sums_report = clean_expenditure_columns(datasets)
    
    echo_section('Breaking down Objeto del Gasto column')
    missing_catalog_ids = split_objeto_del_gasto(datasets)
        
    echo_section('Prefixing IDs of certain category hierarchies')
    prefix_ids(datasets)

    echo_section('Removing unused columns')
    remove_unused_columns(datasets)

    echo_section('Saving pipeline configuration')

    reports_file = join(ITERATION_FOLDER, BASENAME + '.reports.xlsx')
    # The context manager writes the workbook to disk when the block exits
    with ExcelWriter(reports_file) as writer:
        missing_values_report.to_excel(writer, sheet_name='missing values', index=False)
        column_mapping_report.to_excel(writer, sheet_name='column mapping', index=False)
        sums_report.to_excel(writer, sheet_name='check sums', index=False)
        missing_catalog_ids.to_excel(writer, sheet_name='missing_catalog_IDs', index=False)
    print('Saved 4 reports to', reports_file)

    aliases_file = join(ITERATION_FOLDER, BASENAME + '.aliases.json')
    inputs_file = join(ITERATION_FOLDER, BASENAME + '.inputs.json')
    levels_file = join(ITERATION_FOLDER, BASENAME + '.levels.json')
    bad_records_file = join(ITERATION_FOLDER, BASENAME + '.missing.json')

    with open(bad_records_file, 'w+') as json:
        json.write(dumps(bad_records, indent=4))
        
    with open(aliases_file, 'w+') as json:
        json.write(dumps(COLUMN_ALIASES, indent=4))
        
    with open(levels_file, 'w+') as json:
        json.write(dumps(HIERARCHIES, indent=4))
        
    with open(inputs_file, 'w+') as json:
        json.write(dumps(INPUT_FILES, indent=4))
    
    print('Saved input configuration to', inputs_file)    
    print('Saved column aliases to', aliases_file) 
    print('Saved bad records (those with empty cells) to', bad_records_file)    
    print('Saved hierarchy levels used for prefixing to', levels_file) 
    
    echo_section('Pipeline run "%s" done' % ITERATION_LABEL)

    return datasets, missing_catalog_ids, column_mapping_report, missing_values_report, sums_report

## Run the pipeline

In [28]:
budgets, missing_ids, column_mapping, missing_values, sums = do_pipeline()


 Loading files 

Loaded Cuenta_Publica_2010.csv with encoding windows-1252
Loaded Cuenta_Publica_2011.csv with encoding windows-1252
Loaded Cuenta_Publica_2012.csv with encoding windows-1252
Loaded Cuenta_Publica_2013.csv with encoding windows-1252
Loaded Cuenta_Publica_2014.csv with encoding windows-1252
Loaded Cuenta_Publica_2015.csv with encoding windows-1252
Loaded 2016_2T_Gasto_OS.csv with encoding windows-1252

2016 deleted Adefas
2016 deleted Partida Específica
2016 deleted Partida Genérica
2016 deleted Descripción de Partida Genérica
2016 deleted Descripcion de Partida Específica
2016 deleted Ejercicio
2016 deleted Devengado
2016 deleted Ejercido

 Delete empty columns 

2011 Unnamed: 25 deleted
2011 Unnamed: 26 deleted
2011 Unnamed: 27 deleted
2011 Unnamed: 28 deleted
2011 Unnamed: 29 deleted
2011 Unnamed: 30 deleted
2011 Unnamed: 31 deleted
2011 Unnamed: 32 deleted
2011 Unnamed: 33 deleted
2011 Unnamed: 34 deleted
2011 Unnamed: 35 deleted
2011 Unnamed: 36 deleted
2011 Unname

In [29]:
from gc import collect
collect()

2524

In [30]:
for year, budget in budgets.items():
    filepath = MERGED_FILE.replace('merged', str(year))
    budget.to_csv(filepath, encoding='utf-8', index=False)
    print('Saved', filepath)
    stdout.flush()

Saved pipeline.out/iteration-9-development/mexican_federal_budget.2016.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2010.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2011.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2012.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2013.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2014.csv
Saved pipeline.out/iteration-9-development/mexican_federal_budget.2015.csv


In [31]:
merged = concat(list(budgets.values()))
merged.to_csv(MERGED_FILE, encoding='utf-8', index=False)
print('Saved merged dataset to', MERGED_FILE)    

Saved merged dataset to pipeline.out/iteration-9-development/mexican_federal_budget.merged.csv


## Quality control

In [32]:
sorted(list(budget.columns))

['Actividad Institucional',
 'Adefas',
 'Aprobado',
 'Capitulo',
 'Ciclo',
 'Clave de cartera',
 'Concepto',
 'Descripción de Capitulo',
 'Descripción de Concepto',
 'Descripción de Entidad Federativa',
 'Descripción de Fuente de Financiamiento',
 'Descripción de Función',
 'Descripción de Grupo Funcional',
 'Descripción de Partida Específica',
 'Descripción de Partida Genérica',
 'Descripción de Programa Presupuestario',
 'Descripción de Ramo',
 'Descripción de Subfunción',
 'Descripción de Tipo de Gasto',
 'Descripción de Unidad Responsable',
 'Descripción de la Actividad Institucional',
 'Descripción de la modalidad del programa presupuestario',
 'Devengado',
 'Ejercicio',
 'Entidad Federativa',
 'Fuente de Financiamiento',
 'Función',
 'Grupo Funcional',
 'Modalidad del Programa presupuestario',
 'Modificado',
 'Pagado',
 'Partida Específica',
 'Partida Genérica',
 'Programa Presupuestario',
 'Ramo',
 'Subfunción',
 'Tipo de Gasto',
 'Unidad Responsable']

In [33]:
budget.sample(n=10)

Unnamed: 0,Ciclo,Ramo,Descripción de Ramo,Unidad Responsable,Descripción de Unidad Responsable,Grupo Funcional,Descripción de Grupo Funcional,Función,Descripción de Función,Subfunción,...,Adefas,Ejercicio,Capitulo,Concepto,Descripción de Capitulo,Descripción de Concepto,Partida Genérica,Descripción de Partida Genérica,Partida Específica,Descripción de Partida Específica
150124,2015,15,"Desarrollo Agrario, Territorial y Urbano",15-127,Delegación Estatal en Chiapas,3,Desarrollo Económico,3-2,"Agropecuaria, Silvicultura, Pesca y Caza",3-2-1,...,0.0,53844.0,1000,1300,Servicios personales,Remuneraciones adicionales y especiales,131,Primas por años de servicios efectivos prestados,13101,
220505,2015,32,Tribunal Federal de Justicia Fiscal y Administ...,32-206,"Sala Regional del Norte Centro I, con sede en ...",1,Gobierno,1-2,Justicia,1-2-1,...,0.0,20505.85,2000,2200,Materiales y suministros,Alimentos y utensilios,221,Productos alimenticios para personas,22104,
119084,2015,11,Educación Pública,11-L6I,Comisión Nacional de Cultura Física y Deporte,2,Desarrollo Social,2-4,"Recreación, Cultura y Otras Manifestaciones So...",2-4-1,...,0.0,0.0,4000,4300,"Transferencias, asignaciones, subsidios y otra...",Subsidios y subvenciones,438,Subsidios a Entidades Federativas y Municipios,43801,
179758,2015,17,Procuraduría General de la República,17-900,Visitaduría General,1,Gobierno,1-2,Justicia,1-2-2,...,11625.96,69653.45,1000,1400,Servicios personales,Seguridad social,144,Aportaciones para seguros,14405,
12450,2015,22,Instituto Nacional Electoral,22-300,Juntas Distritales Ejecutivas,1,Gobierno,1-3,Coordinación de la Política de Gobierno,1-3-6,...,0.0,1416654.4,1000,1700,Servicios personales,Pago de estímulos a servidores públicos,171,Estímulos,17101,
43604,2015,8,"Agricultura, Ganadería, Desarrollo Rural, Pesc...",8-141,Delegación en Puebla,3,Desarrollo Económico,3-2,"Agropecuaria, Silvicultura, Pesca y Caza",3-2-1,...,0.0,27416.15,2000,2100,Materiales y suministros,"Materiales de administracion, emision de docum...",211,"Materiales, útiles y equipos menores de oficina",21101,
76369,2015,9,Comunicaciones y Transportes,9-640,Centro SCT Oaxaca,3,Desarrollo Económico,3-5,Transporte,3-5-1,...,0.0,0.0,2000,2200,Materiales y suministros,Alimentos y utensilios,221,Productos alimenticios para personas,22104,
53217,2015,9,Comunicaciones y Transportes,9-627,Centro SCT Chiapas,3,Desarrollo Económico,3-5,Transporte,3-5-1,...,0.0,0.0,2000,2100,Materiales y suministros,"Materiales de administracion, emision de docum...",211,"Materiales, útiles y equipos menores de oficina",21101,
9687,2015,3,Poder Judicial,3-100,Suprema Corte de Justicia de la Nación,1,Gobierno,1-2,Justicia,1-2-1,...,0.0,29109.9,2000,2100,Materiales y suministros,"Materiales de administracion, emision de docum...",211,"Materiales, útiles y equipos menores de oficina",21101,
50553,2015,9,Comunicaciones y Transportes,9-200,Subsecretaría de Infraestructura,3,Desarrollo Económico,3-5,Transporte,3-5-1,...,0.0,269891.43,1000,1500,Servicios personales,Otras prestaciones sociales y económicas,154,Prestaciones contractuales,15401,


In [34]:
objeto_breakdown = [
    'Ciclo', 
    'Capitulo', 'Concepto', 
    'Partida Específica', 
    'Partida Genérica'
]
budget[objeto_breakdown].sample(n=20)

Unnamed: 0,Ciclo,Capitulo,Concepto,Partida Específica,Partida Genérica
137117,2015,2000,2400,24901,249
130225,2015,2000,2200,22104,221
36610,2015,2000,2400,24601,246
154009,2015,1000,1400,14201,142
25050,2015,1000,1300,13101,131
112317,2015,1000,1100,11301,113
71970,2015,2000,2900,29101,291
145607,2015,1000,1100,11301,113
122439,2015,3000,3300,33801,338
83800,2015,3000,3500,35201,352


In [35]:
print('Total: missing', len(missing_ids), 'catalog IDs to breakdown the "Objeto del Gasto" column')
print('Tables:', dict(missing_ids.groupby('table').count()['ID']))
print('Years:', dict(missing_ids.groupby('year').count()['ID']))
missing_ids.sample(n=20)

Total: missing 2385 catalog IDs to breakdown the "Objeto del Gasto" column
Tables: {'partida_especifica': 2317, 'concepto': 7, 'partida_generica': 61}
Years: {2010: 68, 2011: 576, 2012: 457, 2013: 423, 2014: 430, 2015: 431}


Unnamed: 0,ID,table,year
92646,740,partida_generica,2010
742718,31901,partida_especifica,2014
484406,99101,partida_especifica,2012
1008472,51301,partida_especifica,2015
103862,34601,partida_especifica,2011
993472,92201,partida_especifica,2015
525245,54105,partida_especifica,2013
993497,83115,partida_especifica,2015
232972,81429,partida_especifica,2011
895344,32504,partida_especifica,2014


In [36]:
column_mapping

Unnamed: 0,Column,2010,2011,2012,2013,2014,2015,2016
0,Descripción de Concepto,False,False,False,False,False,False,True
1,Fuente de Financiamiento,True,True,True,True,True,True,True
2,Concepto,False,False,False,False,False,False,True
3,Subfunción,True,True,True,True,True,True,True
4,Modalidad del Programa presupuestario,True,True,True,True,True,True,True
5,Ejercido,True,True,True,True,False,False,False
6,Descripción de la Actividad Institucional,True,True,True,True,True,True,True
7,Capitulo,False,False,False,False,False,False,True
8,Ramo,True,True,True,True,True,True,True
9,Descripción de Fuente de Financiamiento,True,True,True,True,True,True,True


In [37]:
missing_values

Unnamed: 0,Column,2010,2011,2012,2013,2014,2015,2016
0,Descripción de Concepto,,,,,,,0.0
1,Fuente de Financiamiento,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Concepto,,,,,,,0.0
3,Subfunción,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Modalidad del Programa presupuestario,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,Ejercido,0.0,1.0,0.0,0.0,,,
6,Descripción de la Actividad Institucional,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,Capitulo,,,,,,,0.0
8,Ramo,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Descripción de Fuente de Financiamiento,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [38]:
sums

Unnamed: 0,Column,2010,2011,2012,2013,2014,2015,2016
0,Ejercido,2474100000000.0,2695930000000.0,2896331000000.0,3134797000000.0,,,
1,Devengado,,,,3135015000000.0,3426242000000.0,3761997000000.0,
2,Aprobado,2376915000000.0,2538282000000.0,2754868000000.0,2943495000000.0,3334259000000.0,3508463000000.0,5296820000000.0
3,Pagado,,,,,3386609000000.0,3728056000000.0,2707418000000.0
4,Modificado,,,,,3427172000000.0,3763467000000.0,2850453000000.0
5,Adefas,,,,,36941610000.0,31122650000.0,
6,Ejercicio,,,,,3424774000000.0,3760422000000.0,


In [39]:
merged.sample(n=20) 

Unnamed: 0,Actividad Institucional,Adefas,Aprobado,Capitulo,Ciclo,Clave de cartera,Concepto,Descripción de Capitulo,Descripción de Concepto,Descripción de Entidad Federativa,...,Modalidad del Programa presupuestario,Modificado,Pagado,Partida Específica,Partida Genérica,Programa Presupuestario,Ramo,Subfunción,Tipo de Gasto,Unidad Responsable
256806,3-8-1-24,,74804.0,1000,2016,0.0,1500,Servicios personales,Otras prestaciones sociales y económicas,Aguascalientes,...,E,26352.0,43190.0,,,E-15,51,3-8-1,1,51-GYN
57004,3-5-1-10,,7680.0,3000,2012,9096270016.0,3700,Servicios generales,Servicios de traslado y viáticos,Chiapas,...,K,,,37504.0,375.0,K-31,9,3-5-1,3,9-627
253392,2-6-9-12,,400741.0,1000,2016,0.0,1400,Servicios personales,Seguridad social,Quintana Roo,...,M,200370.0,0.0,,,M-2,51,2-6-9,1,51-GYN
112347,3-1-1-7,0.0,15600.0,1000,2015,0.0,1300,Servicios personales,Remuneraciones adicionales y especiales,Tamaulipas,...,B,16681.28,16681.28,13101.0,131.0,B-2,10,3-1-1,1,10-LAT
231509,3-8-3-5,0.0,3600.0,3000,2014,0.0,3300,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",Jalisco,...,P,1358.0,1358.0,33602.0,336.0,P-1,38,3-8-3,1,38-90X
72033,2-1-4-25,,28000.0,2000,2010,,2200,Materiales y suministros,Alimentos y utensilios,,...,P,,,,220.0,P-13,12,2-1-4,1,12-V00
52448,3-5-6-8,42150.92,260000.0,3000,2015,0.0,3300,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",Baja California,...,G,302150.8,259999.88,33801.0,338.0,G-1,9,3-5-6,1,9-622
201708,1-3-6-1,0.0,715.0,3000,2014,0.0,3300,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",Nuevo León,...,R,3840.0,3840.0,33602.0,336.0,R-5,22,1-3-6,1,22-200
125213,2-6-8-12,0.0,6028480.0,4000,2014,0.0,4400,"Transferencias, asignaciones, subsidios y otra...",Ayudas sociales,Distrito Federal,...,S,5832935.0,5832935.0,44101.0,441.0,S-150,12,2-6-8,1,12-NHK
22286,1-5-2-5,0.0,426373.0,1000,2014,0.0,1400,Servicios personales,Seguridad social,Distrito Federal,...,P,306846.0,306846.0,14302.0,143.0,P-3,6,1-5-2,1,6-200


In [40]:
# Inspect the records saved with empty cells, grouped by column
with open(join(ITERATION_FOLDER, BASENAME + '.missing.json')) as file:
    bad_records = loads(file.read())
bad_records['Descripción de Fuente de Financiamiento']

[{'Actividad Institucional': '3',
  'Aprobado': nan,
  'Ciclo': '2011',
  'Descripción de Fuente de Financiamiento': nan,
  'Descripción de Función': 'Ciencia y Tecnología',
  'Descripción de Grupo Funcional': 'Desarrollo Económico',
  'Descripción de Objeto del Gasto': nan,
  'Descripción de Programa Presupuestario': nan,
  'Descripción de Ramo': 'Consejo Nacional de Ciencia y Tecnología',
  'Descripción de Subfunción': 'Investigación Científica',
  'Descripción de Tipo de Gasto': nan,
  'Descripción de Unidad Responsable': 'Centro de Investigación en Materiales Avanzados, S.C.',
  'Descripción de la Actividad Institucional': nan,
  'Descripción de la modalidad del programa presupuestario': nan,
  'Ejercido': nan,
  'Fuente de Financiamiento': nan,
  'Función': '7',
  'Grupo Funcional': '3',
  'Modalidad del Programa presupuestario': nan,
  'Objeto del Gasto': nan,
  'Programa Presupuestario': nan,
  'Ramo': '38',
  'Subfunción': '1',
  'Tipo de Gasto': nan,
  'Unidad Responsable': '9
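
The record above marks absent values as `nan`. As a minimal sketch, assuming each entry in the report is a plain dict whose missing values are float NaN, the helper below lists which fields are empty in a record. The names `missing_fields` and `example` are hypothetical, and the example values are illustrative only.

```python
import math


def missing_fields(record):
    """Return the names of the fields whose value is a float NaN."""
    return [
        key for key, value in record.items()
        if isinstance(value, float) and math.isnan(value)
    ]


# Hypothetical record shaped like the output above (values are illustrative)
example = {
    'Ciclo': '2011',
    'Aprobado': float('nan'),
    'Ramo': '38',
    'Ejercido': float('nan'),
}

print(missing_fields(example))
```

Applied to every entry of one column's list, this shows at a glance whether a record lacks only the description or a whole group of related fields.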

In [41]:
breakdown = [
    'Ciclo',
    'Capitulo',
    'Concepto',
    'Partida Genérica',
    'Partida Específica',
    'Descripción de Capitulo',
    'Descripción de Concepto',
    'Descripción de Partida Genérica',
    'Descripción de Partida Específica'
]

merged[breakdown].sample(n=200)

Unnamed: 0,Ciclo,Capitulo,Concepto,Partida Genérica,Partida Específica,Descripción de Capitulo,Descripción de Concepto,Descripción de Partida Genérica,Descripción de Partida Específica
254908,2012,2000,2400,246,24601,Materiales y suministros,Materiales y articulos de construccion y de re...,Material eléctrico y electrónico,
59485,2016,3000,3700,,,Servicios generales,Servicios de traslado y viáticos,,
104306,2015,2000,2400,246,24601,Materiales y suministros,Materiales y articulos de construccion y de re...,Material eléctrico y electrónico,
17413,2015,6000,6200,622,62201,Inversion publica,Obra publica en bienes propios,Edificación no habitacional,
103308,2011,2000,2600,261,26103,Materiales y suministros,"Combustibles, lubricantes y aditivos","Combustibles, lubricantes y aditivos",
85608,2010,3000,3400,340,,Servicios generales,"Servicios financieros, bancarios y comerciales",,
181317,2016,3000,3300,,,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",,
30905,2013,4000,4300,431,43101,"Transferencias, asignaciones, subsidios y otra...",Subsidios y subvenciones,Subsidios a la producción,
140020,2014,3000,3200,327,32701,Servicios generales,Servicios de arrendamiento,Arrendamiento de activos intangibles,
116623,2011,2000,2200,221,22103,Materiales y suministros,Alimentos y utensilios,Productos alimenticios para personas,
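
In the sample above the codes look nested: `24601` starts with `246`, which shares its first two digits with `2400`, which in turn shares its first digit with `2000`. The sketch below checks that pattern row by row. It assumes the nesting holds for the whole classification, which the sample suggests but does not prove. The helpers `digits` and `is_consistent` and the frame `toy` are hypothetical, and the last toy row is deliberately inconsistent.

```python
import pandas as pd


def digits(value):
    """Normalise a code such as '2400' or 246.0 to a plain digit string."""
    return str(int(float(value)))


def is_consistent(row):
    """Check that every available code starts with its parent's significant digits."""
    capitulo = digits(row['Capitulo'])[:1]   # '2000' -> '2'
    concepto = digits(row['Concepto'])[:2]   # '2400' -> '24'
    generica = row['Partida Genérica']
    especifica = row['Partida Específica']

    ok = concepto.startswith(capitulo)
    if pd.notna(generica):
        ok = ok and digits(generica).startswith(concepto)
        if pd.notna(especifica):
            ok = ok and digits(especifica).startswith(digits(generica))
    return ok


# Illustrative stand-in for `merged[breakdown]`
toy = pd.DataFrame({
    'Capitulo': ['2000', '3000', '2000'],
    'Concepto': ['2400', '3700', '2400'],
    'Partida Genérica': [246.0, None, 311.0],
    'Partida Específica': [24601.0, None, 31101.0],
})

toy['consistent'] = toy.apply(is_consistent, axis=1)
print(toy)
```

Rows with missing lower levels, such as the 2016 rows in the sample, are only checked as far as their codes go.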


In [42]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520443 entries, 0 to 239082
Data columns (total 39 columns):
Actividad Institucional                                    1520443 non-null object
Adefas                                                     482560 non-null float64
Aprobado                                                   1520441 non-null float64
Capitulo                                                   1520442 non-null object
Ciclo                                                      1520443 non-null object
Clave de cartera                                           1252602 non-null object
Concepto                                                   1520442 non-null object
Descripción de Capitulo                                    1520442 non-null object
Descripción de Concepto                                    1517818 non-null object
Descripción de Entidad Federativa                          1252430 non-null object
Descripción de Fuente de Financiamiento                  

In [43]:
is_prospera = budget['Descripción de Programa Presupuestario'] == 'PROSPERA Programa de Inclusión Social'
budget.loc[is_prospera, 'Programa Presupuestario'].dropna()

119706    S-72
119707    S-72
119731    S-72
119732    S-72
131557    S-72
131558    S-72
131559    S-72
131560    S-72
131561    S-72
131562    S-72
131563    S-72
135670    S-72
135671    S-72
135672    S-72
135673    S-72
135674    S-72
135675    S-72
135676    S-72
135677    S-72
135678    S-72
135679    S-72
135680    S-72
135681    S-72
135682    S-72
135683    S-72
135684    S-72
135685    S-72
135686    S-72
135687    S-72
135688    S-72
          ... 
192323    S-72
192324    S-72
192325    S-72
192326    S-72
192327    S-72
192328    S-72
192329    S-72
192330    S-72
192331    S-72
192332    S-72
192333    S-72
192334    S-72
192335    S-72
192336    S-72
192337    S-72
192338    S-72
192339    S-72
192340    S-72
192341    S-72
192342    S-72
192343    S-72
192344    S-72
192345    S-72
192346    S-72
192347    S-72
192348    S-72
192349    S-72
192350    S-72
192351    S-72
192352    S-72
Name: Programa Presupuestario, dtype: object

In [44]:
budget.loc[budget['Programa Presupuestario'] == '72', 'Descripción de Programa Presupuestario'].dropna().unique()

array([], dtype=object)
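
The lookup comes back empty because the codes in `Programa Presupuestario` carry a letter prefix, as in `S-72`, so nothing equals the bare string `'72'`. A minimal sketch of one way to search by number instead, assuming every code has the form `<letter>-<number>`: split the code once on the hyphen and filter on the second part. The frame `toy`, the columns `modalidad` and `numero`, and the placeholder descriptions are hypothetical.

```python
import pandas as pd

# Illustrative stand-in for `budget`; only the S-72 description comes from the notebook
toy = pd.DataFrame({
    'Programa Presupuestario': ['S-72', 'S-72', 'U-8', 'A-1'],
    'Descripción de Programa Presupuestario': [
        'PROSPERA Programa de Inclusión Social',
        'PROSPERA Programa de Inclusión Social',
        'placeholder description 1',
        'placeholder description 2',
    ],
})

# Split 'S-72' into 'S' and '72' (at most one split per code)
parts = toy['Programa Presupuestario'].str.split('-', n=1, expand=True)
toy['modalidad'] = parts[0]
toy['numero'] = parts[1]

matches = toy.loc[toy['numero'] == '72', 'Descripción de Programa Presupuestario'].unique()
print(matches)
```

Keeping the letter and the number in separate columns also makes it possible to see whether the same number is reused under different letters.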

In [45]:
budget.groupby(['Programa Presupuestario'])['Programa Presupuestario'].count()

Programa Presupuestario
A-1      1241
A-10       16
A-15      432
A-17       29
A-18       72
A-19      101
A-2      1486
A-20      130
A-21      198
A-22       40
A-23        5
A-25        3
A-3       323
A-4      1040
A-6       636
A-7       477
A-8       301
A-9       192
A-900      29
B-1        95
B-2       938
B-3        97
B-4        32
C-1        32
C-2        33
C-3        33
C-4        32
D-1         4
D-11        1
D-2         2
         ... 
U-40       36
U-5        37
U-52        1
U-57       29
U-58       33
U-59       32
U-6       218
U-67       37
U-7        45
U-74       49
U-75       32
U-76        2
U-77       33
U-79      247
U-8       110
U-80       34
U-81       34
U-82       19
U-83        8
U-84        1
U-87        1
U-88       33
U-9       240
U-90       10
U-91       32
U-93       12
U-94        1
U-95        2
Y-3         1
Y-4         1
Name: Programa Presupuestario, dtype: int64
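
The listing above is sorted as text, which is why `A-10` comes before `A-2`. A small sketch of a natural ordering, assuming every code has the form `<letter>-<integer>`: sort by the letter first and then by the numeric part. The helper `natural_key` is hypothetical, and the four counts are copied from the output above.

```python
import pandas as pd


def natural_key(code):
    """Turn 'A-10' into ('A', 10) so the numeric part sorts as a number."""
    letter, _, number = code.partition('-')
    return (letter, int(number))


# A few counts copied from the output above
counts = pd.Series({'A-1': 1241, 'A-10': 16, 'A-2': 1486, 'B-1': 95})

ordered = counts.reindex(sorted(counts.index, key=natural_key))
print(ordered)
```

If some codes do not follow the assumed pattern, `int(number)` raises a `ValueError`, which is a quick way to find them.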