# Archivo Historico de Localidades

This notebook contains an initial exploration of the files in the Historical Archive of Localities (Archivo Histórico de Localidades, AHL) and joins the records, which are split by state, into a single CSV per theme.

## Libraries and Configurations

In [2]:
# Libraries

import os
import shutil
from zipfile import ZipFile
import csv
import random

In [3]:
verbose = True  # Print execution notes

## Zipfile content

The base zip is named 'AHL.zip' and it has a compressed size of 78 megabytes.

In [4]:
dir_datos = 'Datos'                                      # Zip subdirectory
nom_archivo = 'AHL.zip'                                  # Zip name
ruta_zip_general = os.path.join(dir_datos, nom_archivo)  # Zip route


if verbose:
    print('Data directory content...')
    print(os.listdir(dir_datos))
    print('Size:')
    print(os.path.getsize(ruta_zip_general)/1_000_000, 'MB')

Data directory content...
['AHL.zip']
Size:
78.002994 MB


The general zip contains 32 zips, one for each state of the Republic of Mexico. Each state zip contains 5 files, four CSVs and one PDF:
* Habitantes.csv
* Maestro.csv
* Movimientos.csv
* Res_hist.csv
* Estructura_Archivos_AHL.pdf (description of the database structure, identical in all state zips)
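
The structure described above can also be checked without writing anything to disk. The following is a minimal sketch, not part of the original notebook: `listar_zip_anidado` is a hypothetical helper name, and it assumes every nested member of interest ends in `.zip`. It reads each inner zip into memory and lists its members.

```python
import io
from zipfile import ZipFile


def listar_zip_anidado(ruta_zip):
    """Return {inner zip name: list of member names} without extracting to disk.

    Hypothetical helper for illustration only.
    """
    contenido = {}
    with ZipFile(ruta_zip, 'r') as zip_general:
        for nombre in zip_general.namelist():
            if not nombre.lower().endswith('.zip'):
                continue  # Skip anything that is not a nested zip
            # Read the inner zip into memory and open it as a file-like object
            datos = io.BytesIO(zip_general.read(nombre))
            with ZipFile(datos, 'r') as zip_estatal:
                contenido[nombre] = zip_estatal.namelist()
    return contenido
```

Called as `listar_zip_anidado(ruta_zip_general)`, it should return one entry per state zip. Each inner zip is loaded fully into memory, which is acceptable for files of this size but would not suit much larger archives.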

In [7]:
# AHL.zip
dir_extraccion = 'temp_ahl'  # Directory where the files will be decompressed

try:
    os.mkdir(dir_extraccion)    

except FileExistsError:
    print(f'Dir {dir_extraccion} already exists...')

with ZipFile(ruta_zip_general, 'r') as zip_file:
    lista_zips = zip_file.namelist()

    # Select and extract a random zip
    zip_estatal = random.choice(lista_zips)
    zip_file.extract(zip_estatal, dir_extraccion)


# State zip 
nom_zip_estatal = zip_estatal[0:5]          # Zip name without extension
dir_zip_estatal = os.path.join(dir_extraccion, nom_zip_estatal) # Dir to create
ruta_zip_estatal = os.path.join(dir_extraccion, zip_estatal) # Zip route
try:
    os.mkdir(dir_zip_estatal)     
except FileExistsError:
    print(f'Dir {dir_zip_estatal} already exists...')


with ZipFile(ruta_zip_estatal) as info_estatal:
    lista_estatal = info_estatal.namelist()
    
    # Extract all the files
    info_estatal.extractall(dir_zip_estatal)
    
if verbose:
    print(f'Number of files in the main zip: {len(lista_zips)}', end = '\n\n' )
    print('Zip content:', end = '\n\n')
    print(lista_zips, end = '\n\n')
    print('Content of the zips contained in AHL.zip:', end = '\n\n')
    print(lista_estatal)

  

Dir temp_ahl already exists...
Number of files in the main zip: 32

Zip content:

['AHL01.zip', 'AHL02.zip', 'AHL03.zip', 'AHL04.zip', 'AHL05.zip', 'AHL06.zip', 'AHL07.zip', 'AHL08.zip', 'AHL09.zip', 'AHL10.zip', 'AHL11.zip', 'AHL12.zip', 'AHL13.zip', 'AHL14.zip', 'AHL15.zip', 'AHL16.zip', 'AHL17.zip', 'AHL18.zip', 'AHL19.zip', 'AHL20.zip', 'AHL21.zip', 'AHL22.zip', 'AHL23.zip', 'AHL24.zip', 'AHL25.zip', 'AHL26.zip', 'AHL27.zip', 'AHL28.zip', 'AHL29.zip', 'AHL30.zip', 'AHL31.zip', 'AHL32.zip']

Content of the zips contained in AHL.zip:

['AHL07Habitantes.csv', 'AHL07Maestro.csv', 'AHL07Movimientos.csv', 'AHL07Res_hist.csv', 'Estructura_Archivos_AHL.pdf']


INEGI normally follows international data-sharing guidelines, which use UTF-8 as the standard encoding, but some of the data it publishes is encoded in Windows-1252 (Python codec 'cp1252') or Latin-1 (Python codec 'latin-1') instead. This usually happens when the data has been edited in and exported from Excel or Access, and it is the case with these files. The joined files will be exported in UTF-8.
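
One way to find out which encoding a given file uses is to try the candidates in order and keep the first one that decodes without errors. The sketch below is an illustration under assumptions, not the method used in this notebook: `detectar_encoding` is a hypothetical helper and the candidate list is only a suggestion. Note that 'latin-1' decodes any byte sequence, so it works as a last resort, and a successful decode does not prove that the text is interpreted correctly.

```python
def detectar_encoding(ruta, candidatos=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration only. Returns None when no
    candidate works.
    """
    with open(ruta, 'rb') as archivo:
        datos = archivo.read()  # Whole file in memory; fine for files of this size

    for encoding in candidatos:
        try:
            datos.decode(encoding)
        except UnicodeDecodeError:
            continue  # This candidate fails, try the next one
        return encoding
    return None
```

The order of the candidates matters: 'utf-8' is the strictest and goes first, while 'latin-1' never fails and therefore goes last.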


In [8]:
# Sample the CSVs
muestra_csvs = {}
temas = ['Habitantes','Maestro','Movimientos','Res_hist']

for tema in temas:
    nom_csv = nom_zip_estatal + tema + '.csv'
    ruta_csv = os.path.join(dir_zip_estatal, nom_csv)
    
    try:
        with open(ruta_csv, 'r', encoding="cp1252") as archivo_csv: # Tried utf-8 first but there were problems
            reader = csv.reader(archivo_csv)
            header = next(reader)        # First row: column names
            data_sample = next(reader)   # Second row: first data record
            muestra_csvs[tema] = dict(zip(header, data_sample))
        
        
        
    except FileNotFoundError:
        print(f'File {nom_csv} does not exist in directory {nom_zip_estatal}...')

    except UnicodeDecodeError:
        print("Try a different encoding. Mexican files normally come",
        "in utf-8 or cp1252 encodings.")

In [9]:
# Delete temporary directory
shutil.rmtree(dir_extraccion)

The CVE_GEOEST column appears in all tables. Habitantes and Movimientos share the CLAVE column. The rest of the columns are unique to the tables they belong to. Some columns contain null values.
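
The comparison of columns can be reproduced with set operations on the headers. This is a minimal sketch: `columnas_compartidas` is a hypothetical helper, and the `ejemplo` dictionary only holds the column names printed in the sample output of this notebook.

```python
from itertools import combinations


def columnas_compartidas(columnas_por_tema):
    """Return (columns present in every table, extra columns shared by pairs).

    Hypothetical helper for illustration only.
    """
    conjuntos = {tema: set(cols) for tema, cols in columnas_por_tema.items()}
    comunes = set.intersection(*conjuntos.values())

    por_par = {}
    for (t1, c1), (t2, c2) in combinations(conjuntos.items(), 2):
        extra = (c1 & c2) - comunes  # Shared by this pair but not by all tables
        if extra:
            por_par[(t1, t2)] = sorted(extra)
    return sorted(comunes), por_par


# Column names as printed in the sample output of this notebook
ejemplo = {
    'Habitantes': ['CLAVE', 'CVE_GEOEST', 'EVE_CENSAL', 'INDICE_HAB',
                   'FUENTE', 'TOT_HAB', 'TOT_HOM', 'TOT_MUJ'],
    'Maestro': ['CVE_GEOEST', 'LATITUD', 'LONGITUD', 'ALTITUD',
                'CARTA_TOPO', 'TIPO', 'NOMBRE_EDO', 'NOMBRE_MUN'],
    'Movimientos': ['CLAVE', 'CVE_GEOEST', 'INDICE', 'NOM_LOC',
                    'NOM_MUN', 'CAT_POLI', 'CAT_ADMIVA', 'ORI_MODIF'],
    'Res_hist': ['CVE_GEOEST', 'DATOS', 'DATOS_BIBLIOGRAFIA'],
}

comunes, por_par = columnas_compartidas(ejemplo)
print(comunes)  # Columns present in all four tables
print(por_par)  # Additional columns shared by specific pairs of tables
```

Inside the notebook the same function could be fed with `{tema: list(muestra) for tema, muestra in muestra_csvs.items()}`, provided the cell that builds `muestra_csvs` has been run.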

In [11]:
if verbose:
    for tema in temas[:3]:
        print(f'Sample Column/Value of CSV {tema}:')
        for item in muestra_csvs[tema].items(): print(item)
        print()
    
    print(f'Sample Column/Value of CSV {temas[3]}:')
    for key, value in muestra_csvs[temas[3]].items():
        reducido = value.replace('\n', '')[:50]
        print(f'(\'{key}\', \'{reducido}\'...)')

Sample Column/Value of CSV Habitantes:
('CLAVE', '45071')
('CVE_GEOEST', '070010001')
('EVE_CENSAL', '2010')
('INDICE_HAB', '15160793')
('FUENTE', 'Censo')
('TOT_HAB', '7,515')
('TOT_HOM', '3,664')
('TOT_MUJ', '3,851')

Sample Column/Value of CSV Maestro:
('CVE_GEOEST', '070050095')
('LATITUD', '')
('LONGITUD', '')
('ALTITUD', '')
('CARTA_TOPO', '')
('TIPO', 'R')
('NOMBRE_EDO', 'Chiapas')
('NOMBRE_MUN', 'Amatán')

Sample Column/Value of CSV Movimientos:
('CLAVE', '5298379')
('CVE_GEOEST', '071100004')
('INDICE', '19')
('NOM_LOC', 'Chacampon del Carmen')
('NOM_MUN', 'San Lucas')
('CAT_POLI', 'Finca')
('CAT_ADMIVA', '')
('ORI_MODIF', 'Censo de 2020.\nLocalidad deshabitada.')

Sample Column/Value of CSV Res_hist:
('CVE_GEOEST', '070010001'...)
('DATOS', 'Acacoyagua-   Aca-coyahua-c, del idioma azteca; lu'...)
('DATOS_BIBLIOGRAFIA', 'Bibliografía:(1)   Peñafiel, Antonio. Nomenclatura'...)


## Decompress CSVs by theme

In this section the state CSVs are unzipped into a folder called CSVs, each one inside a subfolder named after its theme.
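
The resulting layout can be summarised by how the destination path of each CSV is built. A minimal sketch, assuming the naming pattern seen in the zip listings (state prefix such as `AHL07` followed by the theme); `ruta_destino` is a hypothetical helper that is not used elsewhere in the notebook.

```python
import os


def ruta_destino(zip_estatal, tema, dir_base='CSVs'):
    """Build the path where the csv of a theme ends up for a given state zip.

    Hypothetical helper for illustration only.
    """
    # 'AHL07.zip' -> 'AHL07'
    prefijo = os.path.splitext(os.path.basename(zip_estatal))[0]
    # CSVs/<tema>/<prefijo><tema>.csv
    return os.path.join(dir_base, tema, prefijo + tema + '.csv')


print(ruta_destino('AHL07.zip', 'Habitantes'))
```

The path separator in the printed result depends on the operating system, because the path is built with `os.path.join`.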

### Functions

In [12]:
def extraer_archivo(ruta_zip, dir_extraccion, archivo_extraer, verbose = False):
    """
    Function to extract a file inside a zip to a given directory.
    
    :param ruta_zip: Path where zip is located.
    :param dir_extraccion: Directory where the file will be extracted.
    :param archivo_extraer: Name of the file to extract.
    :param verbose: Print comments in procedure.
    :return: Returns nothing, just extracts the file to disk.

    """
    # Create the target directory regardless of verbosity
    if not os.path.exists(dir_extraccion):
        os.makedirs(dir_extraccion)
        if verbose:
            print(f'Creating directory {dir_extraccion}')

    ruta_archivo_extraer = os.path.join(dir_extraccion, archivo_extraer)


    if verbose:
        if os.path.exists(ruta_archivo_extraer):
            print(f'File {ruta_archivo_extraer} already exists, overwriting...')
        
        else:
            print(f'Decompressing file {ruta_archivo_extraer}')


    with ZipFile(ruta_zip, 'r') as zip_file:
        zip_file.extract(archivo_extraer, dir_extraccion)


In [13]:


def extraer_tema(dir_extraccion, tema,  dir_archivos, zipfile_list, verbose = False):
    """
    Function to extract the csv of a theme from every state zip.
    
     :param dir_extraccion: Directory where the files will be extracted.
     :param tema: Theme to be extracted, can be one of four:
         Habitantes, Maestro, Movimientos, and Res_hist
     :param dir_archivos: Directory where the state zips are located.
     :param zipfile_list: List of state zip files where the csvs
         are located.
     :param verbose: Print comments in procedure.
    
     :return: Doesn't return anything, just extracts the csvs of a theme
         to a subdirectory of dir_extraccion named after the theme.
    """
    dir_extraccion_tema = os.path.join(dir_extraccion, tema)
    print(f'Decompressing CSVs of {tema} in {dir_extraccion_tema}', end='... ')
    for zip_file in zipfile_list:
        archivo_tema = zip_file[:5] + tema +'.csv'
        ruta_zip_estatal = os.path.join(dir_archivos, zip_file)
        extraer_archivo(ruta_zip_estatal, dir_extraccion_tema, archivo_tema, verbose = False)
    print('Extracted.')


### Extract files by theme

In [14]:
ruta_zip_general = os.path.join('Datos', 'AHL.zip')
dir_extraccion_estatales = 'temp_zips_estatales' # Temporary directory
with ZipFile(ruta_zip_general, 'r') as zip_file:
    zip_file.extractall(dir_extraccion_estatales)

In [15]:
# Themes contained in zips are: ['Habitantes', 'Maestro', 'Movimientos', 'Res_hist']
dir_extraccion = 'CSVs'    # Dir where themes will be extracted

extraer_tema(dir_extraccion, 'Habitantes', dir_extraccion_estatales, lista_zips, verbose)
extraer_tema(dir_extraccion, 'Maestro', dir_extraccion_estatales, lista_zips, verbose)
extraer_tema(dir_extraccion, 'Movimientos', dir_extraccion_estatales, lista_zips, verbose)
extraer_tema(dir_extraccion, 'Res_hist', dir_extraccion_estatales, lista_zips, verbose)

Decompressing CSVs of Habitantes in CSVs\Habitantes... Extracted.
Decompressing CSVs of Maestro in CSVs\Maestro... Extracted.
Decompressing CSVs of Movimientos in CSVs\Movimientos... Extracted.
Decompressing CSVs of Res_hist in CSVs\Res_hist... Extracted.


## Join CSVs by theme

Code to join the state CSVs of each theme into a single file.
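
As a compact illustration of the idea, the sketch below concatenates several CSVs in one pass and keeps only the first header. It is a hedged alternative, not the implementation used in this notebook: `concatenar_csvs` is a hypothetical helper, and it opens the files with `newline=''`, which the `csv` module documentation recommends so that line breaks inside quoted fields are handled by the reader and writer rather than by the file object.

```python
import csv


def concatenar_csvs(rutas, salida, encoding_entrada='cp1252'):
    """Write every data row of `rutas` to `salida`, keeping only one header.

    Hypothetical helper for illustration only. Returns the number of
    data rows written.
    """
    total = 0
    header_escrito = False
    with open(salida, 'w', encoding='utf-8', newline='') as destino:
        writer = csv.writer(destino, lineterminator='\n')
        for ruta in rutas:
            with open(ruta, 'r', encoding=encoding_entrada,
                      errors='replace', newline='') as origen:
                reader = csv.reader(origen)
                header = next(reader, None)
                if header is None:
                    continue  # Empty file: nothing to copy
                if not header_escrito:
                    writer.writerow(header)  # Header of the first non-empty file
                    header_escrito = True
                for fila in reader:
                    writer.writerow(fila)
                    total += 1
    return total
```

Opening the output file once avoids reopening it in append mode for every state, at the cost of assuming that all input files share the same header.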

### Functions

In [16]:
def unir_csvs(unido, a_unir, con_header = False, verbose = False):
    """
    Function that reads one csv file and writes its rows to another.
    
     :param unido: Path of the csv file that receives the rows.
     :param a_unir: Path of the csv file to append to the joined csv.
     :param con_header: True to create the joined csv and write the header
         first; False to skip the header and append only the data rows.
     :param verbose: True to print information about what is being done.

     :return: Doesn't return anything, just merges csvs.

    """

    with open(a_unir, 'r', encoding="cp1252", errors='replace') as csv_a_unir:
        doc_reader = csv.reader(csv_a_unir)
        header = next(doc_reader)  # Always consume the header row

        if con_header:
            # First file: create (or overwrite) the joined csv and keep the header
            with open(unido, 'w', encoding='utf-8', errors='replace') as csv_unido:
                doc_writer = csv.writer(csv_unido, lineterminator='\n')
                doc_writer.writerow(header)

                if verbose:
                    print(f'Added Header: {header}')

                doc_writer.writerows(doc_reader)

                if verbose:
                    print(f'Rows of {a_unir} added to {unido}.')

        else:
            # Remaining files: append only the data rows
            with open(unido, 'a', encoding='utf-8') as csv_unido:
                doc_writer = csv.writer(csv_unido, lineterminator='\n')
                doc_writer.writerows(doc_reader)
                if verbose:
                    print(f'Rows of {a_unir} joined to {unido}.')

In [17]:
def unir_tema(ruta_csv, dir_carpeta, tema, verbose = False):
    """
    Function to join all the csvs of a theme into a single csv.

     :param ruta_csv: Path of the file where the csvs are to be joined.
     :param dir_carpeta: Directory where the theme folder is located.
     :param tema: Name of the folder with the csvs of the theme.
     :param verbose: Print comments in procedure.

     :return: Doesn't return anything, just writes the joined csv
        to ruta_csv.
    
    """
    if os.path.exists(ruta_csv):
        print(f'File {ruta_csv} already exists... deleting it')
        os.remove(ruta_csv)

    con_header = True
    dir_tema = os.path.join(dir_carpeta, tema)
    csvs_a_unir = sorted(os.listdir(dir_tema))  # Sorted so the states are joined in order

    if verbose:
        print(f'Joining files in directory {tema}', end = '... ')
        
    for nom_csv in csvs_a_unir:  # Not named csv, to avoid shadowing the csv module
        ruta_csv_a_unir = os.path.join(dir_tema, nom_csv)

        unir_csvs(ruta_csv, ruta_csv_a_unir, con_header, verbose=False)

        con_header = False # To only copy headers of the first CSV

    if verbose:
        print(f'Joined in {ruta_csv}')




### Implementation: join CSVs by theme

In [18]:
# Join Theme CSVs
# There are encoding errors in files 25 and 32.
# The bytes that could not be decoded were replaced with the Unicode
# replacement character (U+FFFD) by the parameter errors = 'replace'
# in the function unir_csvs

unir_tema(os.path.join('CSVs', 'habitantes.csv'), 'CSVs', 'Habitantes', verbose)
unir_tema(os.path.join('CSVs', 'maestro.csv'), 'CSVs', 'Maestro', verbose)
unir_tema(os.path.join('CSVs', 'movimientos.csv'), 'CSVs', 'Movimientos', verbose)
unir_tema(os.path.join('CSVs', 'res_hist.csv'), 'CSVs', 'Res_hist', verbose)


File CSVs\habitantes.csv already exists... deleting it
Joining files in directory Habitantes... Joined in CSVs\habitantes.csv
File CSVs\maestro.csv already exists... deleting it
Joining files in directory Maestro... Joined in CSVs\maestro.csv
File CSVs\movimientos.csv already exists... deleting it
Joining files in directory Movimientos... Joined in CSVs\movimientos.csv
File CSVs\res_hist.csv already exists... deleting it
Joining files in directory Res_hist... Joined in CSVs\res_hist.csv


In [19]:
# Resulting file information
for tema in temas:
    csv_general = tema.lower() + '.csv'
    ruta = os.path.join('CSVs', csv_general)
    print(f'Size of {csv_general}:', end = ' ')
    print(os.stat(ruta).st_size / 1000000, 'megabytes')
    # The joined files are utf-8, so read them with that encoding
    with open(ruta, 'r', encoding='utf-8') as csv_tema:
        # Lines minus the header; fields with embedded line breaks span
        # several lines, so this can overestimate the number of records
        num_lineas = sum(1 for _ in csv_tema) - 1
        print(f'Number of records: {num_lineas}')

Size of habitantes.csv: 98.951205 megabytes
Number of records: 2138702
Size of maestro.csv: 33.13731 megabytes
Number of records: 409446
Size of movimientos.csv: 221.355535 megabytes
Number of records: 3565344
Size of res_hist.csv: 13.379794 megabytes
Number of records: 198847
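
The counts above are obtained by counting lines. Some fields contain line breaks (for example ORI_MODIF in Movimientos, as seen in the sample), so a record can span more than one line. A minimal sketch of counting records with `csv.reader` instead follows; `contar_registros` is a hypothetical helper and has not been run against the joined files, so no figures are given for them.

```python
import csv


def contar_registros(ruta, encoding='utf-8'):
    """Count the data records of a csv, treating quoted line breaks as data.

    Hypothetical helper for illustration only.
    """
    with open(ruta, 'r', encoding=encoding, newline='') as archivo:
        reader = csv.reader(archivo)
        next(reader, None)  # Skip the header row, if there is one
        return sum(1 for _ in reader)
```

For a file without embedded line breaks both approaches give the same number; when they differ, the difference is the number of extra lines taken up by multi-line fields.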


In [20]:
# Delete directory with State Zips
shutil.rmtree(dir_extraccion_estatales)

# Delete directory with State CSVs
shutil.rmtree('CSVs/Habitantes/')
shutil.rmtree('CSVs/Maestro/')
shutil.rmtree('CSVs/Movimientos/')
shutil.rmtree('CSVs/Res_hist/')