# Mexican federal budget pre-processing pipeline

## Instructions

To run the notebook:

1. choose a unique `ITERATION_LABEL` for each pipeline run
2. specify and describe your input files (`INPUT_FILES`)
3. make sure your column mapping (`COLUMN_ALIASES`) is correct
4. run the whole notebook by clicking on __Kernel > Restart & Run All__

## Settings

Choose a unique iteration label for each pipeline run.

In [1]:
ITERATION_LABEL = 'iteration-4'

Put your input files inside the `pipeline.in` folder and describe them here.

In [2]:
INPUT_FILES = {
    2010: {'name': 'Cuenta_Publica_2010.csv', 'encoding': 'windows-1252'},
    2011: {'name': 'Cuenta_Publica_2011.csv', 'encoding': 'windows-1252'},
    2012: {'name': 'Cuenta_Publica_2012.csv', 'encoding': 'windows-1252'},
    2013: {'name': 'Cuenta_Publica_2013.csv', 'encoding': 'windows-1252'},
    2014: {'name': 'Cuenta_Publica_2014.csv', 'encoding': 'windows-1252'},
    2015: {'name': 'Cuenta_Publica_2015.csv', 'encoding': 'windows-1252'},
    2016: {'name': 'PEF2016_AC01.csv', 'encoding': 'cp850'}
}

If your input files don't all have the same column names, define your mapping here. 

In [3]:
COLUMN_ALIASES = {
    'Actividad Institucional': ['AI'],
    'Adefas': ['ADEFAS'],
    'Aprobado': [
        'PEF_2016',
        'Importe Presupuesto de Egresos de la Federación',
        'Importe Presupuesto de Egresos de la Federación (PEF)'
    ],
    'Ciclo': None,
    'Clave de cartera': ['CLAVE_CARTERA'],
    'Descripción de Fuente de Financiamiento': ['FUENTE_FINAN_DESCRIPCION'],
    'Descripción de Función': ['FUNCIONL_DESCRIPCION'],
    'Descripción de Grupo Funcional': [
        'Descripción de Finalidad',
        'GRUPO_FUN_DESCRIPCION',
        'Descripción de Grupo Funcional'
    ],
    'Descripción de Objeto del Gasto': ['CONCEPTO_DESCRIPCION'],
    'Descripción de Programa Presupuestario': ['PROGR_PRES_DESCRIPCION'],
    'Descripción de Ramo': ['RAMO_DESCRIPCION'],
    'Descripción de Reasignacion': ['REASIGNACION_DESCRIPCION'],
    'Descripción de Subfunción': ['SUBFUNCIONL_DESCRIPCION'],
    'Descripción de Tipo de Gasto': ['TIPO_GASTO_DESCRIPCION'],
    'Descripción de Unidad Responsable': ['UNIDAD_DESCRIPCION'],
    'Descripción de la Actividad Institucional': [
        'ACTIVIDAD_INST_DESCRIPCION',
        'Descripción de Actividad Institucional'
    ],
    'Descripción de la entidad federativa': ['ENTIDAD_FED_DESCRIPCION'],
    'Descripción de la modalidad del programa presupuestario': [
        'MODALIDAD_DESCRIPCION',
        'Descripción del Identificador del Programa Presupuestario',
        'Descripción del Identificador de Programa Presupuestario'
    ],
    'Devengado': None,
    'Ejercicio': None,
    'Ejercido': None,
    'Entidad Federativa': ['EF'],
    'Fuente de Financiamiento': ['FF'],
    'Función': ['FN'],
    'Grupo Funcional': [
        'Finalidad', 'GF', 'Grupo Funcional'
    ],
    'Modalidad del Programa presupuestario': [
        'MOD',
        'Identificador de Programa Presupuestario',
        'Identificador del Programa Presupuestario'
    ],
    'Modificado': None,
    'Objeto del Gasto': ['CONCEPTO'],
    'Pagado': None,
    'Programa Presupuestario': ['PP'],
    'Ramo': None,
    'Reasignacion': ['RA'],
    'Subfunción': ['SF'],
    'Tipo de Gasto': ['TG'],
    'Unidad Responsable': ['UNIDAD']
}

That's it. Now just run the notebook from beginning to end.

## Imports

In [4]:
from sys import stdout
from pandas import read_csv, concat, DataFrame, ExcelWriter
from numpy import nan
from os.path import join, isdir
from os import mkdir
from json import dumps
from pprint import pprint

## Configuration

In [5]:
BASENAME = 'mexican_federal_budget'
INPUT_FOLDER = 'pipeline.in'
OUTPUT_FOLDER = 'pipeline.out'
ITERATION_FOLDER = join(OUTPUT_FOLDER, ITERATION_LABEL)
MERGED_FILE = join(ITERATION_FOLDER, BASENAME + '.merged.csv')
CATALOGS_FOLDER = 'objeto_del_gasto.catalog'

In [6]:
if isdir(ITERATION_FOLDER):
    raise ValueError('Please enter a unique iteration label')
    
mkdir(ITERATION_FOLDER)

## Encoding inspection

Detect the encodings of the input files using the `cChardet` utility library. __Warning:__ detection is not always accurate and is meant as an indication only. In the end, encodings are taken from `INPUT_FILES`.

In [7]:
def detect_encodings():
    """Detect CSV file encoding with the cChardet library"""

    try:
        import cchardet as chardet
    except ImportError:
        cChardet = 'https://github.com/PyYoshi/cChardet'
        print('Encoding inspection skipped: install %s' % cChardet)
        return

    results = {}
    results_file = join(OUTPUT_FOLDER, ITERATION_LABEL, 'encodings.detected.json')
    
    for year, file in sorted(INPUT_FILES.items()):
        datafile = join(INPUT_FOLDER, file['name'])
        
        with open(datafile, 'rb') as f:
            text = f.read()
            
        result = chardet.detect(text)
        results.update({year: result})
        print(year, 'Inspected', file['name'], result)
    
    with open(results_file, 'w+') as json:
        json.write(dumps(results, indent=4))
        print('\nSaved encoding detection report to', results_file)
        
# detect_encodings()
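
Because detection is only indicative, it can help to compare how the same bytes decode under each candidate encoding. The following is a minimal sketch using only the standard library; `try_encodings` and the sample string are illustrative assumptions, not part of the pipeline.

```python
def try_encodings(raw_bytes, candidates=('utf-8', 'windows-1252', 'cp850')):
    """Decode raw_bytes with each candidate; None marks a decoding failure."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results

# Hypothetical sample: a header word as it would be stored in windows-1252
sample = 'Descripción'.encode('windows-1252')
decoded = try_encodings(sample)

for encoding, text in decoded.items():
    print(encoding, repr(text))
```

A decoding that succeeds is not necessarily the right one: single-byte encodings such as `cp850` accept any byte sequence, so the decoded text still needs to be checked by eye.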

## Load files

In [8]:
def read_columns(file, encoding):
    """Return clean CSV file headers"""
    
    with open(file, encoding=encoding) as csv:
        header = csv.readline()
        return header.replace('\n', '').split(',')

In [9]:
def force_strings(columns):
    """Return string enforcement for each column of a CSV file"""
    
    for column in columns:
        yield column, str

In [10]:
def load_csv_files():
    """Load raw data (CSV) files"""
    
    batch = {}
    
    for year, file in sorted(INPUT_FILES.items()):
        filepath = join(INPUT_FOLDER, file['name'])
        column_names = read_columns(filepath, file['encoding'])
        column_types = dict(force_strings(column_names))
        
        batch[year] = read_csv(filepath, encoding=file['encoding'], dtype=column_types)
        print('Loaded', file['name'], 'with encoding', file['encoding'])
        stdout.flush()
            
    return batch
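
Before loading, it may be worth checking that every file described in `INPUT_FILES` is actually present in the input folder. This is a small sketch under that assumption; `find_missing_inputs` is a hypothetical helper and the file and folder names in the example are made up.

```python
from os.path import join, isfile

def find_missing_inputs(input_files, input_folder='pipeline.in'):
    """Return the (year, filename) pairs whose file is not in input_folder."""
    missing = []
    for year, file in sorted(input_files.items()):
        if not isfile(join(input_folder, file['name'])):
            missing.append((year, file['name']))
    return missing

# Hypothetical one-entry configuration, for illustration only
example_files = {2010: {'name': 'no_such_file.csv', 'encoding': 'windows-1252'}}
missing_inputs = find_missing_inputs(example_files, input_folder='no_such_folder')
print(missing_inputs)
```

In the notebook this could be called as `find_missing_inputs(INPUT_FILES, INPUT_FOLDER)` before `load_csv_files()`.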

## Clean the data

In [11]:
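def strip_cell_padding(batch):
    """Strip surrounding whitespace from column names and string cells"""
    for year in sorted(batch.keys()):
        for column in batch[year].columns:
            clean_name = column.strip()
            # Rename first, then address the column by its cleaned name:
            # the original name no longer exists if it carried padding
            batch[year].rename(columns={column: clean_name}, inplace=True)
            batch[year][clean_name] = batch[year][clean_name].apply(lambda x: x.strip() if isinstance(x, str) else x)
        print(year, 'stripped cell paddings')
        stdout.flush()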

In [12]:
def delete_empty_columns(batch):
    for year in batch.keys():
        for column in batch[year].columns:
            if 'Unnamed:' in column:
                try:
                    del batch[year][column]
                    print(year, column, 'deleted')
                    stdout.flush()
                except KeyError:
                    pass  

In [13]:
def count_missing_values(batch):
    table = []

    for column in get_union_of_columns(batch):
        row = {'Column': column}
        
        for year in batch.keys():
            if column in batch[year].columns:
                nb_empty_cells = batch[year][column].apply(lambda x: 1 if x is nan else 0).sum()
            else:
                nb_empty_cells = nan
                
            row.update({year: nb_empty_cells})
            if nb_empty_cells not in (nan, 0):
                print(year, 'found', nb_empty_cells, 'missing values in', column)

        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    
    return DataFrame(table).reindex(columns=ordered_columns)

In [14]:
def count_duplicates(batch):
    for year, df in sorted(batch.items()):
        nb_duplicate_lines = df.duplicated().sum()
        print(year, 'found', nb_duplicate_lines, 'duplicate lines')

## Alias column names

In [15]:
def get_union_of_columns(batch):
    union = set()
    for year in batch.keys():
        union = union | set(batch[year].columns)
    return union

In [16]:
from yaml import safe_load

def load_aliases(file):
    """Load a column alias mapping from a YAML file"""
    with open(file) as yaml:
        return safe_load(yaml)

In [17]:
def map_columns_to_aliases(batch, list_of_aliases):
    for year in sorted(batch.keys()):
        for column in sorted(batch[year].columns):
            if not column in list_of_aliases:
                for reference, aliases in list_of_aliases.items():
                    if aliases:
                        if column in aliases:
                            batch[year].rename(columns={column: reference}, inplace=True)
                            print(year, column, 'replaced with', reference)
                            stdout.flush()
                            break  
                else:
                    print(year, 'NO ALIAS: ', column)
                    stdout.flush()

In [18]:
def build_overview(batch):
    table = []
    
    for column in get_union_of_columns(batch):
        row = {'Column': column}
        for year in batch.keys():
            row.update({year: column in batch[year].columns})
        table.append(row)
        
    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    
    overview = DataFrame(table).reindex(columns=ordered_columns)
    return overview

## Check expenditure sums

The amount columns need a little cleaning: zeros are represented by a dash, and thousands are assumed to be separated by a comma.

In [19]:
EXPENDITURE_COLUMNS = [
    'Ejercido', 
    'Devengado', 
    'Aprobado', 
    'Pagado', 
    'Modificado', 
    'Adefas', 
    'Ejercicio'
]

def clean_expenditure_columns(batch):
    check_sums = []

    for column in EXPENDITURE_COLUMNS:
        row = {'Column': column}
        
        for year in sorted(batch.keys()):
            try:
                series = batch[year][column]
                
                # I'm assuming '-' represents zero
                series = series.apply(lambda x: '0' if x == '-' else x)
                series = series.apply(lambda x: x.replace(',', '') if x is not nan else x)                
                batch[year][column] = series.astype(float)
                check_sum = batch[year][column].sum()
                
                print(year, 'cleaned and summed', column, '=', check_sum, 'pesos')
                
            except KeyError:
                check_sum = nan
                
            row.update({year: check_sum})
        
        check_sums.append(row)

    ordered_columns = ['Column']
    ordered_columns.extend(sorted(batch.keys()))
    return DataFrame(check_sums).reindex(columns=ordered_columns)
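
The cleaning rule is easier to see on a single cell. This is a minimal sketch of the same two assumptions (a dash means zero, commas separate thousands); `parse_amount` is a hypothetical helper and not a function used by the pipeline.

```python
from math import isnan

def parse_amount(cell):
    """Convert one raw amount cell to a float; missing values pass through."""
    if not isinstance(cell, str):
        return cell  # NaN (missing value) is left untouched
    cell = cell.strip()
    if cell == '-':
        return 0.0  # assumption: a lone dash stands for zero
    return float(cell.replace(',', ''))  # assumption: comma = thousands separator

print(parse_amount('1,174,125.0'))
print(parse_amount('-'))
```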

## Objeto del Gasto Column split

In [20]:
from pandas import read_csv
from os import listdir
from os.path import join

def load_catalogs(folder):
    """Load every catalog CSV in folder, indexed by its first column"""
    
    catalogs = {}
    
    for file in listdir(folder):
        # The catalog name is the file name up to the first dot
        name = file.split('.')[0]
        print('Loading catalog table:', name)
        filepath = join(folder, file)
        
        catalogs[name] = read_csv(filepath)
        index_column = catalogs[name].columns[0]
        catalogs[name].set_index(index_column, inplace=True)
    
    return catalogs

In [21]:
c = load_catalogs('objeto_del_gasto.catalog')

Loading catalog table: partida_generica
Loading catalog table: capitulo
Loading catalog table: concepto
Loading catalog table: partida_especifica


In [22]:
c['capitulo']

CAPITULO,DESCRIPCION
1000,Servicios personales
2000,Materiales y suministros
3000,Servicios generales
4000,"Transferencias, asignaciones, subsidios y otra..."
5000,"Bienes muebles, inmuebles e intangibles"
6000,Inversion publica
7000,Inversiones financieras y otras provisiones
8000,Participaciones y aportaciones
9000,Deuda publica


In [23]:
# def has_5_digits(n):
#     try:
#         return n is not nan and int(n) >= 10000
#     except ValueError:
#         print(n)
#         return False

# def split_objeto_del_gasto(batch):
#     catalog = load_catalogs(CATALOGS_FOLDER)
    
#     for year in sorted(batch.keys()):
#         print(year)
#         objeto = batch[year]['Objeto del Gasto'].apply(lambda x: int(x) if x is not nan else nan)
        
#         batch[year]['Capitulo'] = objeto.apply(lambda x: int(x/10000) if has_5_digits(x) else nan)
#         batch[year]['Concepto'] = objeto.apply(lambda x: int(x/1000) * 100 if has_5_digits(x) else nan)
#         batch[year]['Partida Genérica'] = objeto.apply(lambda x: int(x/100) if has_5_digits(x) else nan)
#         batch[year]['Partida Específica'] = objeto

# #         batch[year]['Descripción de Capitulo'] = catalog['capitulo'].loc[batch[year]['Capitulo']]

In [24]:
catalog = load_catalogs(CATALOGS_FOLDER)
missing_indices = []

def has_5_digits(n):
    return n is not nan and len(n) == 5 

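def lookup(n, table):
    """Return the catalog row for code n, or a placeholder if absent"""
    if isinstance(n, float):  # NaN marks a code that could not be split
        return 'Not found in the catalog'
    try:
        # NOTE: .loc returns the whole row (a Series), not just its description text
        return catalog[table].loc[int(n)]
    except KeyError:
        missing_indices.append({'table': table, 'index': n})
        return 'Not found in the catalog'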
        
def split_objeto_del_gasto(batch):
    for year in sorted(batch.keys()):
        print(year, 'splitting objeto del gasto column')
        objeto = batch[year]['Objeto del Gasto'].astype(str)
        
        batch[year]['Capitulo'] = objeto.apply(lambda x: x[0] + '000' if has_5_digits(x) else nan)
        batch[year]['Concepto'] = objeto.apply(lambda x: x[:2] + '00' if has_5_digits(x) else nan)
        batch[year]['Partida Genérica'] = objeto.apply(lambda x: x[:3] if has_5_digits(x) else nan)
        batch[year]['Partida Específica'] = objeto.apply(lambda x: x if has_5_digits(x) else nan)

        batch[year]['Descripción de Capitulo'] = batch[year]['Capitulo'].map(lambda x: lookup(x, 'capitulo'))  
        batch[year]['Descripción de Concepto'] = batch[year]['Concepto'].map(lambda x: lookup(x, 'concepto'))  
        batch[year]['Descripción de Partida Genérica'] = batch[year]['Partida Genérica'].map(lambda x: lookup(x, 'partida_generica'))  
        batch[year]['Descripción de Partida Específica'] = batch[year]['Partida Específica'].map(lambda x: lookup(x, 'partida_especifica'))  
        

Loading catalog table: partida_generica
Loading catalog table: capitulo
Loading catalog table: concepto
Loading catalog table: partida_especifica
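
To make the split concrete, here is the same slicing applied to one code as a plain function. It is a sketch only: `split_code` is a hypothetical helper, and the `isdigit` check is an extra assumption that the notebook's `has_5_digits` does not make.

```python
def split_code(code):
    """Split a 5-digit Objeto del Gasto code into its four levels."""
    # Assumption: only 5-character, all-digit strings are valid codes
    if not (isinstance(code, str) and len(code) == 5 and code.isdigit()):
        return None
    return {
        'Capitulo': code[0] + '000',
        'Concepto': code[:2] + '00',
        'Partida Genérica': code[:3],
        'Partida Específica': code,
    }

print(split_code('26103'))
print(split_code('3821'))  # 4-digit codes cannot be split this way
```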


In [25]:
str(1209.0000)

'1209.0'

## Pipeline

In [26]:
def do_pipeline():

    def echo_section(section):
        print('\n', section, '\n')

    echo_section('Loading files')
    datasets = load_csv_files()
    
    echo_section('Delete empty columns')
    delete_empty_columns(datasets)

    echo_section('Stripping padding from cells')
    strip_cell_padding(datasets)
    
#     echo_section('Counting duplicate lines (NOT de-duplicating)')
#     count_duplicates(datasets)
    
    echo_section('Mapping column to aliases')
    map_columns_to_aliases(datasets, COLUMN_ALIASES)

#     echo_section('Counting missing values')
#     missing_values_report = count_missing_values(datasets)
    
#     echo_section('Building column mapping overview')
#     column_mapping_report = build_overview(datasets)
    
#     echo_section('Cleaning expenditure columns')
#     sums_report = clean_expenditure_columns(datasets)
    
    echo_section('Splitting Objeto del Gasto')
    split_objeto_del_gasto(datasets)
        
    echo_section('Merging datasets')
    merged_dataset = concat(list(datasets.values()))
    
#     reports_file = join(ITERATION_FOLDER, BASENAME + '.report.xlsx')
#     writer = ExcelWriter(reports_file)
#     aliases_file = join(ITERATION_FOLDER, BASENAME + '.aliases.json')
#     inputs_file = join(ITERATION_FOLDER, BASENAME + '.inputs.json')

#     merged_dataset.to_csv(MERGED_FILE, encoding='utf-8', index=False)
# #     missing_values_report.to_excel(writer, 'missing values', encoding='utf-8', index=False)
# #     column_mapping_report.to_excel(writer, 'column mapping', encoding='utf-8', index=False)
# #     sums_report.to_excel(writer, 'check sums', encoding='utf-8', index=False)
    
#     with open(aliases_file, 'w+') as json:
#         json.write(dumps(COLUMN_ALIASES, indent=4))
        
#     with open(inputs_file, 'w+') as json:
#         json.write(dumps(INPUT_FILES, indent=4))
    
#     print('Saved merged datasets to', MERGED_FILE)    
#     print('Saved input configuration to', inputs_file)    
#     print('Saved reports configuration to', aliases_file)    
#     print('Saved reports to', reports_file)    

    echo_section('Pipeline run "%s" done and saved to %s' % (ITERATION_LABEL, ITERATION_FOLDER))

    return merged_dataset#, column_mapping_report, missing_values_report, sums_report, datasets

## Run the pipeline

In [27]:
merged_budget = do_pipeline()#, column_mapping, missing_values, sums, raw_data = do_pipeline()


 Loading files 

Loaded Cuenta_Publica_2010.csv with encoding windows-1252
Loaded Cuenta_Publica_2011.csv with encoding windows-1252
Loaded Cuenta_Publica_2012.csv with encoding windows-1252
Loaded Cuenta_Publica_2013.csv with encoding windows-1252
Loaded Cuenta_Publica_2014.csv with encoding windows-1252
Loaded Cuenta_Publica_2015.csv with encoding windows-1252
Loaded PEF2016_AC01.csv with encoding cp850

 Delete empty columns 

2011 Unnamed: 25 deleted
2011 Unnamed: 26 deleted
2011 Unnamed: 27 deleted
2011 Unnamed: 28 deleted
2011 Unnamed: 29 deleted
2011 Unnamed: 30 deleted
2011 Unnamed: 31 deleted
2011 Unnamed: 32 deleted
2011 Unnamed: 33 deleted
2011 Unnamed: 34 deleted
2011 Unnamed: 35 deleted
2011 Unnamed: 36 deleted
2011 Unnamed: 37 deleted
2011 Unnamed: 38 deleted
2011 Unnamed: 39 deleted
2011 Unnamed: 40 deleted
2011 Unnamed: 41 deleted

 Stripping padding from cells 

2010 stripped cell paddings
2011 stripped cell paddings
2012 stripped cell paddings
2013 stripped cell padd

## Quality control

In [28]:
list(merged_budget.columns)

['Actividad Institucional',
 'Adefas',
 'Aprobado',
 'Capitulo',
 'Ciclo',
 'Clave de cartera',
 'Concepto',
 'Descripción de Capitulo',
 'Descripción de Concepto',
 'Descripción de Fuente de Financiamiento',
 'Descripción de Función',
 'Descripción de Grupo Funcional',
 'Descripción de Objeto del Gasto',
 'Descripción de Partida Específica',
 'Descripción de Partida Genérica',
 'Descripción de Programa Presupuestario',
 'Descripción de Ramo',
 'Descripción de Reasignacion',
 'Descripción de Subfunción',
 'Descripción de Tipo de Gasto',
 'Descripción de Unidad Responsable',
 'Descripción de la Actividad Institucional',
 'Descripción de la entidad federativa',
 'Descripción de la modalidad del programa presupuestario',
 'Devengado',
 'Ejercicio',
 'Ejercido',
 'Entidad Federativa',
 'Fuente de Financiamiento',
 'Función',
 'Grupo Funcional',
 'Modalidad del Programa presupuestario',
 'Modificado',
 'Objeto del Gasto',
 'Pagado',
 'Partida Específica',
 'Partida Genérica',
 'Programa Pre

In [29]:
merged_budget.sample(n=10)

Unnamed: 0,Actividad Institucional,Adefas,Aprobado,Capitulo,Ciclo,Clave de cartera,Concepto,Descripción de Capitulo,Descripción de Concepto,Descripción de Fuente de Financiamiento,...,Objeto del Gasto,Pagado,Partida Específica,Partida Genérica,Programa Presupuestario,Ramo,Reasignacion,Subfunción,Tipo de Gasto,Unidad Responsable
36752,6,-,49500,3000,2014,0.0,3300,"DESCRIPCION Servicios generales Name: 3000,...",...,Recursos fiscales,...,33401,63750.0,33401,334,1,8,,1,1,143
138157,5,-,3346279,1000,2014,0.0,1500,DESCRIPCION Servicios personales Name: 1000...,DESCRIPCION Otras prestaciones sociales y e...,Recursos fiscales,...,15402,3342342.0,15402,154,3,14,,2,1,121
190508,3,-,644167,3000,2014,0.0,3500,"DESCRIPCION Servicios generales Name: 3000,...",...,Recursos fiscales,...,35201,208800.0,35201,352,2,21,,1,1,W3N
52318,3,,1174125.0,1000,2013,0.0,1100,DESCRIPCION Servicios personales Name: 1000...,DESCRIPCION Remuneraciones al personal de c...,Recursos fiscales,...,11301,,11301,113,10,9,,1,1,634
2908,3,,-,1000,2012,0.0,1400,DESCRIPCION Servicios personales Name: 1000...,"DESCRIPCION Seguridad social Name: 1400, dt...",Recursos fiscales,...,14302,,14302,143,1,3,,1,1,100
94396,3,,59489,2000,2011,,2900,DESCRIPCION Materiales y suministros Name: ...,"DESCRIPCION Herramientas, refacciones y acc...",Recursos fiscales,...,29101,,29101,291,10,16,,5,1,B05
270546,1,,22800,3000,2012,0.0,3100,"DESCRIPCION Servicios generales Name: 3000,...",DESCRIPCION CONCEPTO ...,Recursos fiscales,...,31801,,31801,318,2,40,,2,1,100
4740,4,-,7230,2000,2014,0.0,2500,DESCRIPCION Materiales y suministros Name: ...,...,Recursos fiscales,...,25601,1102.0,25601,256,1,3,,1,1,211
14748,4,0.00,0.00,2000,2015,0.0,2200,DESCRIPCION Materiales y suministros Name: ...,DESCRIPCION Alimentos y utensilios Name: 22...,Recursos fiscales,...,22301,2000.0,22301,223,2,2,,1,1,139
85081,5,,46777242,3000,2011,,3300,"DESCRIPCION Servicios generales Name: 3000,...",...,"Gasto financiado con recursos del BID-BIRF, as...",...,33302,,33302,333,1,14,,3,2,312


In [30]:
# sums

In [31]:
# column_mapping

In [32]:
# missing_values

In [33]:
# with open(MERGED_FILE) as file:
#     for n in range(10):
#         print(file.readline())

In [34]:
merged_budget.columns

Index(['Actividad Institucional', 'Adefas', 'Aprobado', 'Capitulo', 'Ciclo',
       'Clave de cartera', 'Concepto', 'Descripción de Capitulo',
       'Descripción de Concepto', 'Descripción de Fuente de Financiamiento',
       'Descripción de Función', 'Descripción de Grupo Funcional',
       'Descripción de Objeto del Gasto', 'Descripción de Partida Específica',
       'Descripción de Partida Genérica',
       'Descripción de Programa Presupuestario', 'Descripción de Ramo',
       'Descripción de Reasignacion', 'Descripción de Subfunción',
       'Descripción de Tipo de Gasto', 'Descripción de Unidad Responsable',
       'Descripción de la Actividad Institucional',
       'Descripción de la entidad federativa',
       'Descripción de la modalidad del programa presupuestario', 'Devengado',
       'Ejercicio', 'Ejercido', 'Entidad Federativa',
       'Fuente de Financiamiento', 'Función', 'Grupo Funcional',
       'Modalidad del Programa presupuestario', 'Modificado',
       'Objeto del

In [35]:
merged_budget[['Ciclo', 'Objeto del Gasto', 'Capitulo', 'Concepto', 'Partida Específica', 'Partida Genérica']].sample(n=50)

Unnamed: 0,Ciclo,Objeto del Gasto,Capitulo,Concepto,Partida Específica,Partida Genérica
19069,2014,26103,2000.0,2600.0,26103.0,261.0
112092,2015,14105,1000.0,1400.0,14105.0,141.0
96940,2013,14403,1000.0,1400.0,14403.0,144.0
156064,2013,22104,2000.0,2200.0,22104.0,221.0
123990,2010,3821,,,,
37135,2016,1400,,,,
62165,2013,31401,3000.0,3100.0,31401.0,314.0
25489,2011,21601,2000.0,2100.0,21601.0,216.0
235729,2014,14105,1000.0,1400.0,14105.0,141.0
9963,2016,3300,,,,


In [36]:
nan is not nan

False

In [37]:
has_5_digits(nan)

False

In [38]:
merged_budget[['Ciclo', 'Objeto del Gasto']].where(merged_budget['Objeto del Gasto'].apply(lambda x: len(x) == 4 if x is not nan else False)).sample(n=200).dropna()

Unnamed: 0,Ciclo,Objeto del Gasto
8449,2016,3900
29404,2016,6200
96123,2010,3603
113316,2010,3411
35660,2010,1509
29839,2016,3200
86585,2010,2101
32185,2010,3407
20972,2016,2100
96381,2010,1414


In [39]:
1 == 1.002

False

In [40]:
round(1.002)

1

In [50]:
set([item['index'] for item in missing_indices])

{'13203',
 '13204',
 '15302',
 '15404',
 '15405',
 '15406',
 '15903',
 '15904',
 '15905',
 '15906',
 '15907',
 '15908',
 '15909',
 '15911',
 '15912',
 '21199',
 '213',
 '214',
 '215',
 '216',
 '21801',
 '22199',
 '23199',
 '24199',
 '25199',
 '25600',
 '26199',
 '27199',
 '29199',
 '31199',
 '32199',
 '33106',
 '33199',
 '33802',
 '34102',
 '34199',
 '34502',
 '34901',
 '35103',
 '35199',
 '36199',
 '36301',
 '36400',
 '36401',
 '37199',
 '37702',
 '38199',
 '38502',
 '39102',
 '39199',
 '39802',
 '39903',
 '39911',
 '43802',
 '43803',
 '43804',
 '43805',
 '43806',
 '43807',
 '43808',
 '43809',
 '43810',
 '43811',
 '43812',
 '43813',
 '43814',
 '43815',
 '43816',
 '43817',
 '43818',
 '43819',
 '43820',
 '43821',
 '43822',
 '43823',
 '43824',
 '43825',
 '43826',
 '43827',
 '43828',
 '43829',
 '43830',
 '43831',
 '43832',
 '43833',
 '44107',
 '44108',
 '44109',
 '44110',
 '44113',
 '44199',
 '44201',
 '44501',
 '44502',
 '46301',
 '48103',
 '49199',
 '51199',
 '52199',
 '53199',
 '54199'

In [51]:
set([item['table'] for item in missing_indices])

{'partida_especifica', 'partida_generica'}
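
Grouping the missing codes by catalog table makes the report easier to read than two separate sets. This is a minimal sketch; `group_missing` is a hypothetical helper, and `sample_missing` is a made-up excerpt in the same shape as `missing_indices`.

```python
from collections import defaultdict

def group_missing(missing):
    """Group missing catalog codes by table, de-duplicated and sorted."""
    grouped = defaultdict(set)
    for item in missing:
        grouped[item['table']].add(item['index'])
    return {table: sorted(indices) for table, indices in grouped.items()}

# Illustrative records shaped like the entries of missing_indices
sample_missing = [
    {'table': 'partida_especifica', 'index': '13203'},
    {'table': 'partida_generica', 'index': '213'},
    {'table': 'partida_especifica', 'index': '13203'},
    {'table': 'partida_especifica', 'index': '21199'},
]
grouped_missing = group_missing(sample_missing)
print(grouped_missing)
```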

In [59]:
missing_in_catalog = DataFrame(missing_indices).drop_duplicates()

In [62]:
missing_in_catalog.to_excel('catalog.missing.xlsx')

In [56]:
len(missing_in_catalog.head().drop_duplicates())

4

In [57]:
from pickle import 

ImportError: cannot import name 'pickle'

In [76]:
merged_budget[merged_budget['Objeto del Gasto'] == '85131'][['Ciclo', 'Descripción de Objeto del Gasto']]

122909    Yucatán
Name: Descripción de Objeto del Gasto, dtype: object

In [72]:
mx2011 = read_csv('pipeline.in/Cuenta_Publica_2011.csv', encoding='windows-1252')

In [78]:
mx2011[mx2011['Objeto del Gasto'] == 85131][['Ciclo', 'Descripción de Objeto del Gasto']]

Unnamed: 0,Ciclo,Descripción de Objeto del Gasto
122909,2011,Yucatán
