# Mexican federal budget pre-processing pipeline

## Instructions

To run the notebook:

1. choose a unique `ITERATION_LABEL` for each pipeline run
2. specify and describe your input files (`INPUT_FILES`)
3. make sure your column mapping (`COLUMN_ALIASES`) is correct
4. run the whole notebook by clicking on __Kernel > Restart & Run All__

## Imports

In [1]:
from sys import stdout
from pandas import read_csv, concat, DataFrame, ExcelWriter, ExcelFile, Series
from numpy import nan, isnan
from os.path import join, isdir
from os import mkdir
from json import dumps, loads
from pprint import pprint

## Settings

Choose a unique iteration label for each pipeline run.

In [2]:
ITERATION_LABEL = 'iteration-15-all-new-data-2008-2016'

Put your input files inside the input folder (`INPUT_FOLDER`, set to `pipeline.in.v2` in the Configuration section) and describe them here.

In [3]:
INPUT_FILES = {
#     2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
#     2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
#     2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
#     2013: {'name': 'Cuenta_Publica_2013.csv', 'encoding': 'windows-1252'},
#     2014: {'name': 'Cuenta_Publica_2014.csv', 'encoding': 'windows-1252'},
#     2015: {'name': 'Cuenta_Publica_2015.csv', 'encoding': 'windows-1252'},
#     2016: {'name': '2016_2T_Gasto_OS.csv', 'encoding': 'windows-1252'} # cp850 for the original 2016 file
    2008: {'name': 'Cuenta_Publica_2008.csv', 'encoding': 'windows-1252'},
    2009: {'name': 'Cuenta_Publica_2009.csv', 'encoding': 'windows-1252'},
    2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
    2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
    2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
    2013: {'name': 'CP_2013.csv', 'encoding': 'windows-1252'},
    2014: {'name': 'CP_2014.csv', 'encoding': 'windows-1252'},
    2015: {'name': 'CP_2015.csv', 'encoding': 'windows-1252'},
    2016: {'name': 'PEF_AC01_2t_2016.csv', 'encoding': 'windows-1252'} 
}


If your input files don't all have the same column names, define your mapping here. 

In [4]:
with open('columns.aliases.json') as json:
    COLUMN_ALIASES = loads(json.read())

In [5]:
from json import dumps

with open('column.aliases.json', 'w+', encoding='utf-8') as json:
    json.write(dumps(COLUMN_ALIASES, indent=4, ensure_ascii=False))
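
The contents of the aliases file are not shown in this notebook. Judging from how `map_columns_to_aliases` reads it further down, it maps each reference column name to a list of alternative headers. The sketch below illustrates that lookup under this assumption; the alias values and the `resolve_alias` helper are made up for illustration and are not part of the pipeline.

```python
from pandas import DataFrame

# Hypothetical aliases: reference name -> list of alternative headers
example_aliases = {
    'ID_RAMO': ['Ramo', 'RAMO'],
    'DESC_RAMO': ['Descripción de Ramo'],
}

def resolve_alias(column, aliases):
    """Return the reference name for a column, or None if it is unknown."""
    if column in aliases:
        return column
    for reference, alternatives in aliases.items():
        if alternatives and column in alternatives:
            return reference
    return None

# Made-up input frame with non-reference headers
raw = DataFrame({'Ramo': ['01'], 'Descripción de Ramo': ['Poder Legislativo']})
renamed = raw.rename(columns={c: resolve_alias(c, example_aliases) or c for c in raw.columns})
print(list(renamed.columns))
```

Columns without a registered alias keep their original name here, which mirrors the `NO ALIAS REGISTERED FOR` warning path of the pipeline.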

The following hierarchical categories will have IDs prefixed with the parent categories:

In [6]:
HIERARCHIES = {
    'functional': [
        'GPO_FUNCIONAL', 
        'ID_FUNCION', 
        'ID_SUBFUNCION', 
        'ID_AI'
    ],
    'administrative': [
        'ID_RAMO', 
        'ID_UR'
    ],
    'activities': [
        'DESC_MODALIDAD', 
        'PP'
    ],
}

The following columns are unused and removed at the end of the pipeline:

In [7]:
REMOVE_OUTPUT_COLUMNS = [
    'Reasignacion',
    'Objeto del Gasto',
    'Descripción de Reasignacion',
    'Descripción de Objeto del Gasto'
]

In [8]:
REMOVE_INPUT_COLUMNS = {
#     2016: [
#         'Adefas',
#         'Partida Específica',
#         'Partida Genérica',
#         'Descripción de Partida Genérica',
#         'Descripcion de Partida Específica',
#         'Ejercicio',
#         'Devengado',
#         'Ejercido',
#     ]
}

That's it. Now just run the notebook from beginning to end.

## Configuration

In [9]:
BASENAME = 'mexican_federal_budget'
INPUT_FOLDER = 'pipeline.in.v2'
OUTPUT_FOLDER = 'pipeline.out'
ITERATION_FOLDER = join(OUTPUT_FOLDER, ITERATION_LABEL)
MERGED_FILE = join(ITERATION_FOLDER, BASENAME + '.merged.csv')
CATALOGS_FILE = 'objeto_del_gasto.catalog.xlsx'

In [10]:
if isdir(ITERATION_FOLDER):
    raise ValueError('Please enter a unique iteration label')
    
mkdir(ITERATION_FOLDER)

## Encoding inspection

Detect the encodings of the input files using the `cChardet` library. __Warning:__ detection is not always accurate and is meant as an indication only. The pipeline always uses the encodings declared in `INPUT_FILES`.

In [11]:
def detect_encodings():
    """Detect CSV file encoding with the cChardet library"""

    try:
        import cchardet as chardet
    except ImportError:
        cChardet = 'https://github.com/PyYoshi/cChardet'
        print('Encoding inspection skipped: install %s' % cChardet)
        return

    results = {}
    results_file = join(OUTPUT_FOLDER, ITERATION_LABEL, 'encodings.detected.json')
    
    for year, file in sorted(INPUT_FILES.items()):
        datafile = join(INPUT_FOLDER, file['name'])
        
        with open(datafile, 'rb') as f:
            text = f.read()
            
        result = chardet.detect(text)
        results.update({year: result})
        print(year, 'Inspected', file['name'], result)
    
    with open(results_file, 'w+') as json:
        json.write(dumps(results, indent=4))
        print('\nSaved encoding detection report to', results_file)
        
# detect_encodings()
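
To see why the declared encoding matters, here is a minimal sketch that decodes the same bytes with the two encodings mentioned in `INPUT_FILES` (`windows-1252` and `cp850`). The sample string is made up; only the standard library is used.

```python
# Illustrative only: the same bytes decoded with two candidate encodings
original = 'Descripción de Partida Genérica'
raw_bytes = original.encode('windows-1252')

as_cp1252 = raw_bytes.decode('windows-1252')   # round-trips cleanly
as_cp850 = raw_bytes.decode('cp850')           # accented letters come out wrong

print(as_cp1252)
print(as_cp850)
```

Both calls succeed without raising, so a wrong encoding can go unnoticed until someone reads the accented text.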

## Load files

In [12]:
def read_columns(file, encoding):
    """Return clean CSV file headers"""
    
    with open(file, encoding=encoding) as csv:
        header = csv.readline()
        return header.replace('\n', '').split(',')

In [13]:
def force_strings(columns):
    """Return string enforcement for each column of a CSV file"""
    
    for column in columns:
        yield column, str
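
A likely reason for forcing every column to `str` (an assumption; the notebook does not say so) is that type inference would turn zero-padded IDs into integers. A small sketch with made-up data:

```python
from io import StringIO
from pandas import read_csv

# Made-up two-column sample with a zero-padded ID
csv_text = 'ID_RAMO,MONTO_APROBADO\n06,"1,234"\n'

inferred = read_csv(StringIO(csv_text))
as_strings = read_csv(StringIO(csv_text), dtype={'ID_RAMO': str, 'MONTO_APROBADO': str})

print(inferred['ID_RAMO'][0])      # leading zero lost by type inference
print(as_strings['ID_RAMO'][0])    # leading zero preserved
```

Amounts are loaded as text too, which is why they are converted to floats in a separate cleaning step.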

In [14]:
def load_csv_files():
    """Load raw data (CSV) files"""
    
    batch = {}
    
    for year, file in sorted(INPUT_FILES.items()):
        filepath = join(INPUT_FOLDER, file['name'])
        column_names = read_columns(filepath, file['encoding'])
        column_types = dict(force_strings(column_names))
        
        batch[year] = read_csv(filepath, encoding=file['encoding'], dtype=column_types)
        print('Loaded', file['name'], 'with encoding', file['encoding'])
    
    print()
    stdout.flush()

    for year in sorted(INPUT_FILES.keys()):
        if year in REMOVE_INPUT_COLUMNS:
            for column in REMOVE_INPUT_COLUMNS[year]:
                try:
                    del batch[year][column]
                    print(year, 'deleted', column)
                except KeyError:
                    print(year, column, 'not found in', INPUT_FILES[year]['name'])

        stdout.flush()

    return batch

## Clean the data

In [15]:
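def strip_cell_padding(batch):
    """Strip surrounding whitespace from column names and cell values"""
    for year in sorted(batch.keys()):
        for column in batch[year].columns:
            stripped = column.strip()
            batch[year].rename(columns={column: stripped}, inplace=True)
            # Address the column by its new name; missing cells are not strings and pass through
            batch[year][stripped] = batch[year][stripped].apply(lambda x: x.strip() if isinstance(x, str) else x)
        print(year, 'stripped cell paddings')
        stdout.flush()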

In [16]:
def delete_empty_columns(batch):
    for year in batch.keys():
        for column in batch[year].columns:
            if 'Unnamed:' in column:
                try:
                    del batch[year][column]
                    print(year, column, 'deleted')
                    stdout.flush()
                except KeyError:
                    pass  

In [17]:
def count_missing_values(batch):
    collector = {}
    table = []

    for column in get_union_of_columns(batch):
        row = {'Column': column}
        collector.update({column: []})
        
        for year in batch.keys():
            if column in batch[year].columns:
                is_empty = batch[year][column].isnull()
                empty_lines = batch[year].where(is_empty).dropna(how='all')
                collector[column].extend(empty_lines.to_dict(orient='records'))
                nb_empty_cells = len(empty_lines)
            else:
                nb_empty_cells = nan
                
            row.update({year: nb_empty_cells})
            if nb_empty_cells not in (nan, 0):
                print(year, 'found', nb_empty_cells, 'missing values in', column)

        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    empty_values_overview_table = DataFrame(table).reindex(columns=ordered_columns)
    
    return empty_values_overview_table, collector

In [18]:
def count_duplicates(batch):
    for year, df in sorted(batch.items()):
        nb_duplicate_lines = df.duplicated().sum()
        print(year, 'found', nb_duplicate_lines, 'duplicate lines')

## Alias column names

In [19]:
def get_union_of_columns(batch):
    union = set()
    for year in batch.keys():
        union = union | set(batch[year].columns)
    return union

In [20]:
from yaml import safe_load

def load_aliases(file):
    """Load column aliases from a YAML file"""
    with open(file) as yaml:
        return safe_load(yaml)

In [21]:
def map_columns_to_aliases(batch, list_of_aliases):
    for year in sorted(batch.keys()):
        for column in sorted(batch[year].columns):
            if not column in list_of_aliases:
                for reference, aliases in list_of_aliases.items():
                    if aliases:
                        if column in aliases:
                            batch[year].rename(columns={column: reference}, inplace=True)
                            print(year, column, 'replaced with', reference)
                            stdout.flush()
                            break  
                else:
                    print(year, 'NO ALIAS REGISTERED FOR', column)
                    stdout.flush()

In [22]:
def build_overview(batch):
    table = []
    
    for column in get_union_of_columns(batch):
        row = {'Column': column}
        for year in batch.keys():
            row.update({year: column in batch[year].columns})
        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    
    overview = DataFrame(table).reindex(columns=ordered_columns)
    print('Column mapping overview: done')
    return overview

## Check expenditure sums

There's a little cleaning to do on the amount columns: zeros are represented by a dash, and thousands are assumed to be separated by a comma.

In [23]:
EXPENDITURE_COLUMNS = [
    'MONTO_EJERCIDO', 
    'MONTO_DEVENGADO', 
    'MONTO_APROBADO', 
    'MONTO_PAGADO', 
    'MONTO_MODIFICADO', 
    'MONTO_ADEFAS', 
    'MONTO_EJERCICIO'
]
count = 0

def clean_expenditure_columns(batch):
    check_sums = []

    for column in EXPENDITURE_COLUMNS:
        row = {'Column': column}
        
        for year in sorted(batch.keys()):
            try:
                series = batch[year][column]
                
                # I'm assuming a single '-' represents zero
                series = series.apply(lambda x: '0' if x == '-' else x)
                try:
                    series = series.apply(lambda x: x.replace(',', '') if x is not nan else x)    
                except AttributeError:
                    if count < 10:
                        print(year, column)
                batch[year][column] = series.astype(float)
                check_sum = batch[year][column].sum()
                
                print(year, 'cleaned and summed', column, '=', check_sum, 'pesos')
                
            except KeyError:
                check_sum = nan
                
            row.update({year: check_sum})
        
        check_sums.append(row)

    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    return DataFrame(check_sums).reindex(columns=ordered_columns)
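
The cleaning rule can be tried in isolation on a toy series. This is a sketch of the rule described above, not the pipeline function itself; the values and the `to_number` helper are made up for illustration.

```python
from numpy import nan
from pandas import Series

raw = Series(['1,234,567', '-', nan, '89.5'])   # made-up amounts

def to_number(value):
    """Convert one raw amount cell to a float, keeping missing cells as NaN."""
    if not isinstance(value, str):
        return value            # NaN stays NaN
    if value == '-':
        return 0.0              # a lone dash is taken to mean zero
    return float(value.replace(',', ''))

cleaned = raw.apply(to_number)
print(cleaned.tolist())
print(cleaned.sum())            # missing cells are skipped by sum()
```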

## Objeto del Gasto Column split

In [24]:
from os.path import join

def generate_catalog(file):
    
    catalog_ = {}
    catalog_file = ExcelFile(file)
    INDEX_COLUMN = 0
    
    for sheet in catalog_file.sheet_names:
        if sheet != 'Concatenated':
            name = sheet.lower().replace(' ', '_')
            output = join('objeto_del_gasto.catalog', name + '.csv')

            df = catalog_file.parse(sheet).dropna()
            index = df.columns[INDEX_COLUMN]

            df[index] =  df[index].astype(str)
            df.set_index(index, inplace=True)
            df = df.groupby(df.index).first()
            df.sort_index(inplace=True)
            
            message = 'Loaded catalog {sheet} into "{name}" ({nb} lines)'
            parameters = dict(sheet=sheet, name=name, nb=len(df))

            print(message.format(**parameters))
            catalog_[name] = df['DESCRIPCION']
    
    print()
    return catalog_

__Note!__ The years 2008-2010 (shorter codes, no partida específica) and 2016 (already split) are hard-coded in the function below.

In [25]:
def split_objeto_del_gasto(batch):
    catalog = generate_catalog(CATALOGS_FILE)
    missing_in_catalog = []
    
    def has_digits(n, N):
        return not isinstance(n, float) and len(n) >= N 
            
    def lookup(n, table, year):
        try:
            return catalog[table].loc[n]
        except KeyError:
            missing_in_catalog.append({'year': year, 'table': table, 'ID': n})
            return nan
        except TypeError:
            # n is nan
            return nan
    
    for year in sorted(batch.keys()):
        if year == 2016:
            print('Skipping', year, 'because the raw CSV already has the required columns')
        
        else:
            objeto = batch[year]['ID_CONCEPTO'].astype(str)

            batch[year]['ID_CAPITULO'] = objeto.apply(lambda x: x[0] + '000' if x not in (nan, 'nan') else nan)
            batch[year]['ID_CONCEPTO'] = objeto.apply(lambda x: x[:2] + '00' if x not in (nan, 'nan') else nan)
            batch[year]['DESC_CAPITULO'] = batch[year]['ID_CAPITULO'].map(lambda x: lookup(x, 'capitulo', year))  
            batch[year]['DESC_CONCEPTO'] = batch[year]['ID_CONCEPTO'].map(lambda x: lookup(x, 'concepto', year))  
            
            nb_generica_digits = 4 if year in (2008, 2009, 2010) else 3
            
            # Skip the LAST year of the dataset (currently 2016) it has split columns already
            batch[year]['PARTIDA_GENERICA'] = objeto.apply(lambda x: x[:nb_generica_digits] if has_digits(x, 4) else nan)
            batch[year]['DESC_PARTIDA_GENERICA'] = batch[year]['PARTIDA_GENERICA'].map(lambda x: lookup(x, 'partida_generica', year))  
            
            if year not in (2008, 2009, 2010):
                batch[year]['PARTIDA_ESPECIFICA'] = objeto.apply(lambda x: x if has_digits(x, 5) else nan)
                batch[year]['DESC_PARTIDA_ESPECIFICA'] = batch[year]['PARTIDA_ESPECIFICA'].map(lambda x: lookup(x, 'partida_específica', year) if has_digits(x, 5) else nan)  
            else:
                batch[year]['PARTIDA_ESPECIFICA'] = nan
                batch[year]['DESC_PARTIDA_ESPECIFICA'] = nan

            print(year, 'broke down "Objeto del Gasto" column')
        
    return DataFrame(missing_in_catalog).drop_duplicates(['ID', 'table'])
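
The slicing rules are easier to follow on single codes. The helper below is a hypothetical, standalone restatement of the logic in `split_objeto_del_gasto` (it uses `None` where the pipeline uses `nan`, and it skips the catalog lookups):

```python
def split_code(code, year):
    """Break one expenditure-object code into its hierarchy levels."""
    legacy = year in (2008, 2009, 2010)   # these years keep 4 digits and have no partida específica
    return {
        'ID_CAPITULO': code[0] + '000',
        'ID_CONCEPTO': code[:2] + '00',
        'PARTIDA_GENERICA': code[:4 if legacy else 3] if len(code) >= 4 else None,
        'PARTIDA_ESPECIFICA': code if (not legacy and len(code) >= 5) else None,
    }

print(split_code('15401', 2012))
print(split_code('6102', 2009))
```

The two sample codes correspond to values visible in the quality-control samples further down (for example `1000 / 1500 / 154 / 15401` for 2012).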

## Prefix IDs 
Disambiguating sub-categories may require prefixing their IDs with their parents' IDs.

In [26]:
def prefix_ids(batch):
    for year in batch.keys():       
        for hierarchy, levels in HIERARCHIES.items():
            prefix = batch[year]['CICLO'].apply(lambda x: '')
            for n, level in enumerate(levels):
                dash = '.' if n > 0 else ''
                prefix = prefix + dash + batch[year][level]  
                batch[year][level] = prefix
                
                print(year, 'prefixed', hierarchy, 'level', n, level)
                stdout.flush()
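
A toy run of the same idea on the `administrative` hierarchy shows the effect. The IDs are made up; the point is that two children sharing an ID under different parents become distinguishable.

```python
from pandas import DataFrame

levels = ['ID_RAMO', 'ID_UR']             # the 'administrative' hierarchy
toy = DataFrame({
    'ID_RAMO': ['06', '11'],
    'ID_UR': ['100', '100'],              # same child ID under two different parents
})

prefix = None
for level in levels:
    # Accumulate parent IDs, joined with a dot, and overwrite the level with the result
    prefix = toy[level] if prefix is None else prefix + '.' + toy[level]
    toy[level] = prefix

print(toy['ID_UR'].tolist())
```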

## Remove unused columns

In [27]:
def remove_unused_columns(batch):
    for year, budget in batch.items():
        for column in REMOVE_OUTPUT_COLUMNS:
            try:
                del budget[column]
                print(year, 'deleted', column)
            except KeyError:
                print(column, ': no such column to delete')

## Pipeline

In [28]:
def do_pipeline():

    def echo_section(section):
        print('\n', section, '\n')

    echo_section('Loading files')
    datasets = load_csv_files()
    
    echo_section('Delete empty columns')
    delete_empty_columns(datasets)

    echo_section('Stripping padding from cells')
    strip_cell_padding(datasets)
    
    echo_section('Counting duplicate lines (NOT de-duplicating)')
    count_duplicates(datasets)
    
    echo_section('Mapping column to aliases')
    map_columns_to_aliases(datasets, COLUMN_ALIASES)

    echo_section('Counting missing values')
    missing_values_report, bad_records = count_missing_values(datasets)
    
    echo_section('Building column mapping overview')
    column_mapping_report = build_overview(datasets)
    
    echo_section('Cleaning expenditure columns')
    sums_report = clean_expenditure_columns(datasets)
    
    echo_section('Breaking down Objeto del Gasto column')
    missing_catalog_ids = split_objeto_del_gasto(datasets)
        
    echo_section('Prefixing IDs of certain category hierarchies')
    prefix_ids(datasets)

    echo_section('Removing unused columns')
    remove_unused_columns(datasets)

    echo_section('Saving pipeline configuration')

    reports_file = join(ITERATION_FOLDER, BASENAME + '.reports.xlsx')
    writer = ExcelWriter(reports_file)    
    missing_values_report.to_excel(writer, 'missing values', encoding='utf-8', index=False)
    column_mapping_report.to_excel(writer, 'column mapping', encoding='utf-8', index=False)
    sums_report.to_excel(writer, 'check sums', encoding='utf-8', index=False)
    missing_catalog_ids.to_excel(writer, 'missing_catalog_IDs', encoding='utf-8', index=False)
    writer.close()  # write the workbook to disk
    print('Saved 4 reports to', reports_file)

    aliases_file = join(ITERATION_FOLDER, BASENAME + '.aliases.json')
    inputs_file = join(ITERATION_FOLDER, BASENAME + '.inputs.json')
    levels_file = join(ITERATION_FOLDER, BASENAME + '.levels.json')
    bad_records_file = join(ITERATION_FOLDER, BASENAME + '.missing.json')

    with open(bad_records_file, 'w+') as json:
        json.write(dumps(bad_records, indent=4))
        
    with open(aliases_file, 'w+') as json:
        json.write(dumps(COLUMN_ALIASES, indent=4))
        
    with open(levels_file, 'w+') as json:
        json.write(dumps(HIERARCHIES, indent=4))
        
    with open(inputs_file, 'w+') as json:
        json.write(dumps(INPUT_FILES, indent=4))
    
    print('Saved input configuration to', inputs_file)    
    print('Saved column aliases to', aliases_file) 
    print('Saved bad records (those with empty cells) to', bad_records_file)    
    print('Saved hierarchy levels used for prefixing to', levels_file) 
    
    echo_section('Pipeline run "%s" done' % ITERATION_LABEL)

    return datasets, missing_catalog_ids, column_mapping_report, missing_values_report, sums_report

## Run the pipeline

In [29]:
budgets, missing_ids, column_mapping, missing_values, sums = do_pipeline()


 Loading files 

Loaded Cuenta_Publica_2008.csv with encoding windows-1252
Loaded Cuenta_Publica_2009.csv with encoding windows-1252
Loaded Cuenta_Publica_2010.csv with encoding windows-1252
Loaded Cuenta_Publica_2011.csv with encoding windows-1252
Loaded Cuenta_Publica_2012.csv with encoding windows-1252
Loaded CP_2013.csv with encoding windows-1252
Loaded CP_2014.csv with encoding windows-1252
Loaded CP_2015.csv with encoding windows-1252
Loaded PEF_AC01_2t_2016.csv with encoding windows-1252


 Delete empty columns 

2009 Unnamed: 25 deleted
2009 Unnamed: 26 deleted
2009 Unnamed: 27 deleted
2009 Unnamed: 28 deleted
2009 Unnamed: 29 deleted
2009 Unnamed: 30 deleted
2009 Unnamed: 31 deleted
2009 Unnamed: 32 deleted
2009 Unnamed: 33 deleted
2009 Unnamed: 34 deleted
2009 Unnamed: 35 deleted
2009 Unnamed: 36 deleted
2009 Unnamed: 37 deleted
2009 Unnamed: 38 deleted
2009 Unnamed: 39 deleted
2011 Unnamed: 25 deleted
2011 Unnamed: 26 deleted
2011 Unnamed: 27 deleted
2011 Unnamed: 28 delete

In [30]:
from gc import collect
collect()

3029

In [31]:
for year, budget in budgets.items():
    filepath = MERGED_FILE.replace('merged', str(year))
    budget.to_csv(filepath, encoding='utf-8', index=False)
    print('Saved', filepath)
    stdout.flush()

Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2016.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2008.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2009.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2010.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2011.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2012.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2013.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2014.csv
Saved pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.2015.csv


In [32]:
merged = concat(list(budgets.values()))
merged.to_csv(MERGED_FILE, encoding='utf-8', index=False)
print('Saved merged dataset to', MERGED_FILE)    

Saved merged dataset to pipeline.out/iteration-15-all-new-data-2008-2016/mexican_federal_budget.merged.csv
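
Because the yearly files do not all carry the same amount columns, the merged frame has gaps. A minimal sketch with two made-up one-row frames shows what `concat` does in that case; `sort=True` is passed here only to make the column order explicit and is not part of the notebook's call.

```python
from pandas import DataFrame, concat

# Made-up frames standing in for two yearly budgets
y2012 = DataFrame({'CICLO': ['2012'], 'MONTO_EJERCIDO': [69769.0]})
y2013 = DataFrame({'CICLO': ['2013'], 'MONTO_EJERCICIO': [13392.0]})

toy_merged = concat([y2012, y2013], sort=True)
print(toy_merged)
```

Columns missing from a year are filled with NaN, and the original row indices are kept, so index values repeat across years unless `ignore_index=True` is passed.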


## Quality control

In [33]:
sorted(list(budget.columns))

['CICLO',
 'DESC_AI',
 'DESC_CAPITULO',
 'DESC_CONCEPTO',
 'DESC_ENT_FED',
 'DESC_FF',
 'DESC_FUNCION',
 'DESC_GPO_FUNCIONAL',
 'DESC_MODALIDAD',
 'DESC_PARTIDA_ESPECIFICA',
 'DESC_PARTIDA_GENERICA',
 'DESC_PP',
 'DESC_RAMO',
 'DESC_SUBFUNCION',
 'DESC_TIPOGASTO',
 'DESC_UR',
 'ENTIDAD',
 'GPO_FUNCIONAL',
 'ID_AI',
 'ID_CAPITULO',
 'ID_CONCEPTO',
 'ID_FF',
 'ID_FUNCION',
 'ID_MODALIDAD',
 'ID_PPI',
 'ID_RAMO',
 'ID_SUBFUNCION',
 'ID_TIPOGASTO',
 'ID_UR',
 'MONTO_ADEFAS',
 'MONTO_APROBADO',
 'MONTO_DEVENGADO',
 'MONTO_EJERCICIO',
 'MONTO_MODIFICADO',
 'MONTO_PAGADO',
 'PARTIDA_ESPECIFICA',
 'PARTIDA_GENERICA',
 'PP']

In [34]:
merged.sample(n=20)

Unnamed: 0,CICLO,DESC_AI,DESC_CAPITULO,DESC_CONCEPTO,DESC_ENT_FED,DESC_FF,DESC_FUNCION,DESC_GPO_FUNCIONAL,DESC_MODALIDAD,DESC_PARTIDA_ESPECIFICA,...,MONTO_ADEFAS,MONTO_APROBADO,MONTO_DEVENGADO,MONTO_EJERCICIO,MONTO_EJERCIDO,MONTO_MODIFICADO,MONTO_PAGADO,PARTIDA_ESPECIFICA,PARTIDA_GENERICA,PP
211387,2012,Democracia preservada y fortalecida mediante l...,Servicios personales,Otras prestaciones sociales y económicas,Durango,Recursos fiscales,Coordinación de la Política de Gobierno,Gobierno,Específicos,Prestaciones establecidas por condiciones gene...,...,,0.0,,,69769.0,,,15401.0,154.0,Específicos.3
59565,2013,Aeropuertos eficientes y competitivos,Servicios personales,Seguridad social,Morelos,Recursos fiscales,Transporte,Desarrollo Económico,Regulación y supervisión,Aportaciones al FOVISSSTE,...,,288528.0,,13392.0,,,,14201.0,142.0,Regulación y supervisión.2
11021,2009,Promoción y Coordinación de las políticas publ...,Inversion publica,Obra publica en bienes de dominio publico,,Recursos fiscales,Asistencia Social,Desarrollo Social,Proyectos de Inversión,,...,,20500000.0,,,7377123.0,,,,6102.0,Proyectos de Inversión.14
95761,2013,Diseño y aplicación de la política educativa,Servicios personales,Seguridad social,Distrito Federal,Recursos fiscales,Educación,Desarrollo Social,"Planeación, seguimiento y evaluación de políti...",Aportaciones al FOVISSSTE,...,,368648.0,,389599.68,,,,14201.0,142.0,"Planeación, seguimiento y evaluación de políti..."
9269,2013,Servicios de apoyo administrativo,Servicios personales,Otras prestaciones sociales y económicas,Distrito Federal,Recursos fiscales,Relaciones Exteriores,Gobierno,Apoyo al proceso presupuestario y para mejorar...,Compensación garantizada,...,,10464618.0,,7524178.34,,,,15402.0,154.0,Apoyo al proceso presupuestario y para mejorar...
40904,2011,"Carreteras eficientes, seguras y suficientes",Servicios generales,Servicios de traslado y viáticos,,Recursos fiscales,Comunicaciones y Transportes,Desarrollo Económico,Proyectos de Inversión,Viáticos nacionales para labores en campo y de...,...,,191250.0,,,167599.0,,,37501.0,375.0,Proyectos de Inversión.3
44573,2011,"Carreteras eficientes, seguras y suficientes",Servicios generales,"Servicios de instalacion, reparacion, mantenim...",,Recursos fiscales,Comunicaciones y Transportes,Desarrollo Económico,Regulación y supervisión,Mantenimiento y conservación de inmuebles para...,...,,32146.0,,,13955.0,,,35101.0,351.0,Regulación y supervisión.10
23005,2009,Formación recursos humanos para el sector (edu...,Servicios generales,Servicios basicos,,Recursos fiscales,Educación,Desarrollo Social,Prestación de Servicios Públicos,,...,,170900.0,,,152654.0,,,,3109.0,Prestación de Servicios Públicos.3
189126,2014,Servicios de apoyo administrativo,Servicios personales,Otras prestaciones sociales y económicas,Distrito Federal,Recursos fiscales,Protección Social,Desarrollo Social,Apoyo al proceso presupuestario y para mejorar...,Otras prestaciones,...,0.0,1053281.0,0.0,0.0,,0.0,0.0,15901.0,159.0,Apoyo al proceso presupuestario y para mejorar...
33716,2016,Democracia preservada y fortalecida mediante l...,Servicios personales,Remuneraciones al personal de carácter transit...,Estado de México,Recursos fiscales,Coordinación de la Política de Gobierno,Gobierno,Específicos,,...,,1806889.0,,,903456.0,,1531853.01,,,Específicos.5


In [35]:
objeto_breakdown = [
    'CICLO', 
    'ID_CAPITULO', 
    'ID_CONCEPTO', 
    'PARTIDA_ESPECIFICA', 
    'PARTIDA_GENERICA'
]
merged[objeto_breakdown].sample(n=20)

| | CICLO | ID_CAPITULO | ID_CONCEPTO | PARTIDA_ESPECIFICA | PARTIDA_GENERICA |
|---|---|---|---|---|---|
| 176059 | 2014 | 2000 | 2500 | 25301.0 | 253 |
| 261322 | 2012 | 1000 | 1500 | 15101.0 | 151 |
| 271673 | 2015 | 1000 | 1300 | 13410.0 | 134 |
| 109443 | 2013 | 3000 | 3100 | 31101.0 | 311 |
| 144684 | 2012 | 1000 | 1300 | 13202.0 | 132 |
| 46222 | 2012 | 3000 | 3300 | 33101.0 | 331 |
| 149733 | 2013 | 1000 | 1400 | 14401.0 | 144 |
| 234665 | 2012 | 3000 | 3700 | 37501.0 | 375 |
| 263909 | 2015 | 3000 | 3500 | 35901.0 | 359 |
| 80324 | 2015 | 2000 | 2100 | 21601.0 | 216 |


In [36]:
print('Total: missing', len(missing_ids), 'catalog IDs to breakdown the "Objeto del Gasto" column')
print('Tables:', dict(missing_ids.groupby('table').count()['ID']))
print('Years:', dict(missing_ids.groupby('year').count()['ID']))
try:
    missing_ids.sample(n=20)
except ValueError:
    pass

Total: missing 76 catalog IDs to breakdown the "Objeto del Gasto" column
Tables: {'partida_generica': 45, 'partida_específica': 24, 'concepto': 7}
Years: {2008: 48, 2009: 4, 2012: 22, 2013: 1, 2015: 1}


In [37]:
missing_ids.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 76 entries, 0 to 18706
Data columns (total 3 columns):
ID       76 non-null object
table    76 non-null object
year     76 non-null int64
dtypes: int64(1), object(2)
memory usage: 2.4+ KB


In [38]:
column_mapping

| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_UR | True | True | True | True | True | True | True | True | True |
| 1 | DESC_ENT_FED | False | False | False | False | True | True | True | True | True |
| 2 | DESC_FUNCION | True | True | True | True | True | True | True | True | True |
| 3 | ID_MODALIDAD | True | True | True | True | True | True | True | True | True |
| 4 | DESC_RAMO | True | True | True | True | True | True | True | True | True |
| 5 | DESC_FF | True | True | True | True | True | True | True | True | True |
| 6 | ID_AI | True | True | True | True | True | True | True | True | True |
| 7 | DESC_MODALIDAD | True | True | True | True | True | True | True | True | True |
| 8 | DESC_PP | True | True | True | True | True | True | True | True | True |
| 9 | ENTIDAD | False | False | False | False | True | True | True | True | True |


In [39]:
missing_values

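| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_UR | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | DESC_ENT_FED | NaN | NaN | NaN | NaN | 172.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | DESC_FUNCION | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | ID_MODALIDAD | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | DESC_RAMO | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | DESC_FF | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | ID_AI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | DESC_MODALIDAD | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | DESC_PP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | ENTIDAD | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |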


In [40]:
sums

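| | Column | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MONTO_EJERCIDO | 2576692000000.0 | 2296086000000.0 | 2474100000000.0 | 2717372000000.0 | 2896331000000.0 | NaN | NaN | NaN | 2160939000000.0 |
| 1 | MONTO_DEVENGADO | NaN | NaN | NaN | NaN | NaN | NaN | 5076810000000.0 | 5508987000000.0 | NaN |
| 2 | MONTO_APROBADO | 1992356000000.0 | 2289715000000.0 | 2376915000000.0 | 2560196000000.0 | 2754868000000.0 | 4322619000000.0 | 4905401000000.0 | 5138442000000.0 | 5297126000000.0 |
| 3 | MONTO_PAGADO | NaN | NaN | NaN | NaN | NaN | NaN | 4972712000000.0 | 5366652000000.0 | 2137679000000.0 |
| 4 | MONTO_MODIFICADO | NaN | NaN | NaN | NaN | NaN | NaN | 5013990000000.0 | 5398608000000.0 | NaN |
| 5 | MONTO_ADEFAS | NaN | NaN | NaN | NaN | NaN | NaN | 36941610000.0 | 31122650000.0 | NaN |
| 6 | MONTO_EJERCICIO | NaN | NaN | NaN | NaN | NaN | 4617618000000.0 | 5010877000000.0 | 5399018000000.0 | NaN |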


In [41]:
merged.sample(n=20)

Unnamed: 0,CICLO,DESC_AI,DESC_CAPITULO,DESC_CONCEPTO,DESC_ENT_FED,DESC_FF,DESC_FUNCION,DESC_GPO_FUNCIONAL,DESC_MODALIDAD,DESC_PARTIDA_ESPECIFICA,...,MONTO_ADEFAS,MONTO_APROBADO,MONTO_DEVENGADO,MONTO_EJERCICIO,MONTO_EJERCIDO,MONTO_MODIFICADO,MONTO_PAGADO,PARTIDA_ESPECIFICA,PARTIDA_GENERICA,PP
219888,2015,Impartición de justicia en materia fiscal y ad...,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",Querétaro,Recursos fiscales,Justicia,Gobierno,Prestación de Servicios Públicos,Mantenimiento y conservación de maquinaria y e...,...,0.0,100000.0,120397.7,120397.7,,120397.7,120397.7,35701.0,357.0,Prestación de Servicios Públicos.1
13848,2008,"Defensa de la integridad, la independencia, la...",Servicios personales,Remuneraciones adicionales y especiales,,Recursos fiscales,Soberanía,Gobierno,Otras actividades relevantes,,...,,2806987.0,,,3063596.0,,,,1309.0,Otras actividades relevantes.2
25249,2015,Gasto público transparente y orientado a resul...,Servicios personales,Seguridad social,Distrito Federal,Recursos fiscales,Asuntos Financieros y Hacendarios,Gobierno,"Planeación, seguimiento y evaluación de políti...",Depósitos para el ahorro solidario,...,0.0,17378.0,126496.15,126496.15,,126496.15,126496.15,14302.0,143.0,"Planeación, seguimiento y evaluación de políti..."
215159,2013,Democracia preservada y fortalecida mediante l...,Servicios personales,Otras prestaciones sociales y económicas,Coahuila,Recursos fiscales,Coordinación de la Política de Gobierno,Gobierno,Específicos,Asignaciones adicionales al sueldo,...,,115920.0,,123064.0,,,,15403.0,154.0,Específicos.8
23700,2012,Servicios de apoyo administrativo,Servicios generales,Servicios de arrendamiento,Distrito Federal,Recursos fiscales,Asuntos Financieros y Hacendarios,Gobierno,Apoyo al proceso presupuestario y para mejorar...,Arrendamiento de equipo y bienes informáticos,...,,0.0,,,0.0,,,32301.0,323.0,Apoyo al proceso presupuestario y para mejorar...
299276,2015,Cobertura de la atención médica preventiva,Servicios personales,Seguridad social,Durango,Ingresos Propios,Salud,Desarrollo Social,Prestación de Servicios Públicos,Cuotas para el seguro de separación individual...,...,0.0,0.0,1888.0,1888.0,,1888.0,1888.0,14404.0,144.0,Prestación de Servicios Públicos.7
164456,2014,Producción y Protección Forestal,Servicios generales,Servicios basicos,Tabasco,Recursos fiscales,"Agropecuaria, Silvicultura, Pesca y Caza",Desarrollo Económico,Prestación de Servicios Públicos,Servicio de energía eléctrica,...,0.0,279527.0,239967.43,239967.43,,239967.43,239967.43,31101.0,311.0,Prestación de Servicios Públicos.14
86950,2016,Carreteras alimentadoras y caminos rurales efi...,Servicios generales,Servicios de arrendamiento,Guerrero,Recursos fiscales,Transporte,Desarrollo Económico,Proyectos de Inversión,,...,,2981.0,,,0.0,,0.0,,,Proyectos de Inversión.31
57406,2016,Servicios de incorporación y recaudación,Servicios personales,Seguridad social,Aguascalientes,Ingresos Propios,Salud,Desarrollo Social,Prestación de Servicios Públicos,,...,,9589603.0,,,5267217.5,,5267217.5,,,Prestación de Servicios Públicos.6
230455,2014,Generación de conocimiento científico para el ...,Servicios generales,"Servicios profesionales, cientificos, tecnicos...",Guanajuato,Recursos fiscales,"Ciencia, Tecnología e Innovación",Desarrollo Económico,Prestación de Servicios Públicos,Información en medios masivos derivada de la o...,...,0.0,461736.0,346300.0,346300.0,,346300.0,346300.0,33605.0,336.0,Prestación de Servicios Públicos.1


In [42]:
breakdown = [
    'CICLO', 
    'ID_CAPITULO', 
    'ID_CONCEPTO', 
    'PARTIDA_GENERICA',        
    'PARTIDA_ESPECIFICA', 
    'DESC_CAPITULO',
    'DESC_CONCEPTO', 
    'DESC_PARTIDA_GENERICA',
    'DESC_PARTIDA_ESPECIFICA'
]

merged[breakdown].sample(n=200)

Unnamed: 0,CICLO,ID_CAPITULO,ID_CONCEPTO,PARTIDA_GENERICA,PARTIDA_ESPECIFICA,DESC_CAPITULO,DESC_CONCEPTO,DESC_PARTIDA_GENERICA,DESC_PARTIDA_ESPECIFICA
233206,2012,2000,2100,212,21201,Materiales y suministros,"Materiales de administracion, emision de docum...",Materiales y útiles de impresión y reproducción,Materiales y útiles de impresión y reproducción
111363,2014,1000,1300,132,13201,Servicios personales,Remuneraciones adicionales y especiales,"Primas de vacaciones, dominical y gratificació...",Primas de vacaciones y dominical
202559,2012,2000,2400,247,24701,Materiales y suministros,Materiales y articulos de construccion y de re...,Artículos metálicos para la construcción,Artículos metálicos para la construcción
43014,2009,5000,5300,5304,,"Bienes muebles, inmuebles e intangibles",Equipo e instrumental medico y de laboratorio,"Vehículos y equipo terrestres, aéreos, marítim...",
53113,2013,3000,3200,322,32201,Servicios generales,Servicios de arrendamiento,Arrendamiento de edificios,Arrendamiento de edificios y locales
48587,2011,3000,3700,371,37104,Servicios generales,Servicios de traslado y viáticos,Pasajes aéreos,Pasajes aéreos nacionales para servidores públ...
10589,2014,3000,3100,319,31904,Servicios generales,Servicios basicos,Servicios integrales y otros servicios,Servicios integrales de infraestructura de cóm...
25571,2011,3000,3500,351,35101,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",Conservación y mantenimiento menor de inmuebles,Mantenimiento y conservación de inmuebles para...
53786,2015,2000,2100,211,21101,Materiales y suministros,"Materiales de administracion, emision de docum...","Materiales, útiles y equipos menores de oficina",Materiales y útiles de oficina
64245,2009,3000,3500,3503,,Servicios generales,"Servicios de instalacion, reparacion, mantenim...",Mantenimiento y conservación de maquinaria y e...,


In [43]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1909649 entries, 0 to 341255
Data columns (total 39 columns):
CICLO                      object
DESC_AI                    object
DESC_CAPITULO              object
DESC_CONCEPTO              object
DESC_ENT_FED               object
DESC_FF                    object
DESC_FUNCION               object
DESC_GPO_FUNCIONAL         object
DESC_MODALIDAD             object
DESC_PARTIDA_ESPECIFICA    object
DESC_PARTIDA_GENERICA      object
DESC_PP                    object
DESC_RAMO                  object
DESC_SUBFUNCION            object
DESC_TIPOGASTO             object
DESC_UR                    object
ENTIDAD                    object
GPO_FUNCIONAL              object
ID_AI                      object
ID_CAPITULO                object
ID_CONCEPTO                object
ID_FF                      object
ID_FUNCION                 object
ID_MODALIDAD               object
ID_PPI                     object
ID_RAMO                    object
ID_S

In [44]:
len(merged)

1909649

In [45]:
merged.sample(n=10000).to_csv(join(ITERATION_FOLDER, BASENAME + '_sample.csv'))