# Habitantes Table

The Historical Archive of Localities (AHL) is an INEGI database with records of the censuses and counts carried out in the Mexican Republic from 1900 to date. The population data of the set is contained in the Habitantes table. This notebook looks at the data it contains.


## Libraries and Configurations

In [2]:
import pandas as pd
import numpy as np

In [3]:
verbose = True

## Overview

CVE_GEOEST is the field shared by all AHL tables. It is the geostatistical key that identifies a specific locality. Because the key consists only of digits, it is read as a number on import unless the type is specified, and the leading '0' of some keys is lost (e.g. '010010001' is imported as 10010001), which breaks the key format. The column must therefore be treated as text.
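
The effect can be reproduced on a tiny in-memory CSV. This is only a sketch: the two rows in `sample_csv` are made up for illustration and are not taken from the real file. It compares the default import with an import that forces the key to text, and shows one possible way to repair keys that were already read as numbers.

```python
import io

import pandas as pd

# Hypothetical two-row sample, used only for illustration.
sample_csv = "CVE_GEOEST,TOT_HAB\n010010001,82234\n320580001,1500\n"

# Default import: the column is inferred as integer and the leading zero is dropped.
as_number = pd.read_csv(io.StringIO(sample_csv))

# Import with the key forced to text: the 9-character format is preserved.
as_text = pd.read_csv(io.StringIO(sample_csv), dtype={"CVE_GEOEST": str})

# Possible repair for keys already read as numbers: pad back to 9 characters.
repaired = as_number["CVE_GEOEST"].astype(str).str.zfill(9)

print(as_number["CVE_GEOEST"].tolist())
print(as_text["CVE_GEOEST"].tolist())
print(repaired.tolist())
```

Padding with `zfill` only works because every key is expected to be exactly 9 characters long; reading the column as text from the start avoids relying on that.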

In [4]:
# Define problematic columns

tipo_columnas = {'CVE_GEOEST': str}

In [5]:
# Import file
habitantes = pd.read_csv('CSVs/habitantes.csv', dtype=tipo_columnas)

In [6]:
# Remove the '#' to run a line of code.
# General info

#habitantes.memory_usage().sum() / 1_000_000  # About 150 MB in memory
#habitantes.info(show_counts = True)          # There are no null values
#habitantes.columns                           # Column names
#habitantes[['TOT_HAB', 'TOT_HOM', 'TOT_MUJ']].describe()
                                              # Column statistics
                     


CVE_GEOEST is made up of 9 numeric characters. The first two represent the state code ('01'), the next three characters represent the municipality ('001'), and the last four numbers represent the locality code ('0001').

The states are numbered in alphabetical order. The municipalities and localities are not, because the numbering is not reset when new records are created. There is only one naming rule for localities: the key '0001' always represents the municipal seat (cabecera municipal), where the local government resides.
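
The 2 + 3 + 4 layout described above can be sketched with string slicing. The keys below are hypothetical, and the column names `CVE_ENT`, `CVE_MUN`, `CVE_LOC` and `ES_CABECERA` are chosen here for illustration; they are not columns of the Habitantes table.

```python
import pandas as pd

# Hypothetical keys, used only to illustrate the 2 + 3 + 4 layout.
claves = pd.Series(["010010001", "010010094", "320580001"], name="CVE_GEOEST")

partes = pd.DataFrame({
    "CVE_GEOEST": claves,
    "CVE_ENT": claves.str[:2],    # state: first two characters
    "CVE_MUN": claves.str[2:5],   # municipality: next three characters
    "CVE_LOC": claves.str[5:],    # locality: last four characters
})

# Locality '0001' is always the municipal seat (cabecera municipal).
partes["ES_CABECERA"] = partes["CVE_LOC"] == "0001"

print(partes)
```

Slicing only gives the right pieces when the key was kept as 9-character text.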

In [7]:
# Record example
habitantes.head()

Unnamed: 0,CLAVE,CVE_GEOEST,EVE_CENSAL,INDICE_HAB,FUENTE,TOT_HAB,TOT_HOM,TOT_MUJ
0,1,10010001,1940,14962331,Censo,82234,37821,44413
1,1,10010001,1930,14962332,Censo,62244,28687,33557
2,1,10010001,1900,14962333,Censo,35052,16229,18823
3,1,10010001,1970,14962334,Censo,181277,-,-
4,1,10010001,2005,14962335,Conteo,663671,319649,344022


The column CLAVE has the same number of unique values as CVE_GEOEST, and both columns contain repeated values. Each CVE_GEOEST is always paired with the same CLAVE. INDICE_HAB is unique for each record in the table.
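
One way to check this kind of relationship is to count distinct partners per group. The sketch below runs on a small hypothetical frame (`muestra`) that only mimics the three columns involved; the values are invented.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of habitantes.
muestra = pd.DataFrame({
    "CLAVE": [1, 1, 2, 2, 3],
    "CVE_GEOEST": ["010010001", "010010001", "010010094", "010010094", "010020001"],
    "INDICE_HAB": [101, 102, 103, 104, 105],
})

# Distinct CLAVE values per CVE_GEOEST, and distinct CVE_GEOEST values per CLAVE.
claves_por_geoest = muestra.groupby("CVE_GEOEST")["CLAVE"].nunique()
geoest_por_clave = muestra.groupby("CLAVE")["CVE_GEOEST"].nunique()

# The two columns map one-to-one when every group has exactly one partner.
uno_a_uno = bool((claves_por_geoest == 1).all() and (geoest_por_clave == 1).all())

# INDICE_HAB identifies a record when no value repeats.
indice_unico = bool(muestra["INDICE_HAB"].is_unique)

print(uno_a_uno, indice_unico)
```

The same two `groupby` calls could be run on `habitantes` itself; they test the pairing directly instead of inferring it from matching counts of unique values.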

TOT_HAB (total inhabitants), TOT_HOM (total men), and TOT_MUJ (total women) contain numerical values that represent numbers of people. Some records hold a non-numeric entry in these fields (for example '-'), so the columns are stored as object and not as integer.
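
A minimal sketch of how such a column could be converted, assuming that '-' is the only kind of non-numeric entry (the notebook does not confirm this). The nullable `Int64` dtype is a choice made here, not something the notebook uses.

```python
import pandas as pd

# Values modelled on the sample records; '-' marks a missing figure.
tot_hom = pd.Series(["37821", "28687", "-", "319649"], name="TOT_HOM")

# errors='coerce' turns anything that cannot be parsed into NaN (result is float64).
numerico = pd.to_numeric(tot_hom, errors="coerce")

# The nullable Int64 dtype keeps whole numbers and missing values in one column.
entero = numerico.astype("Int64")

print(entero)
```

Counting `entero.isna()` afterwards shows how many entries were not numeric.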

The values in FUENTE represent the different types of census event.

In [8]:
# CVE_GEOEST and CLAVE
#len(habitantes)                                # CSV contains 2138702 records
#len(habitantes.CVE_GEOEST.unique())            # There are 408754 unique values in CVE_GEOEST
#habitantes.CVE_GEOEST.value_counts().max()     # Max frequency of a value is 16
#habitantes.CVE_GEOEST.value_counts().min()     # Min frequency of a value is 1

#set([len(clave) for clave in habitantes.CVE_GEOEST])   # All keys are 9 characters long


#len((habitantes['CVE_GEOEST'] + habitantes['CLAVE'].apply(str)).unique())
                                                # If you concatenate CLAVE with CVE_GEOEST,
                                                # the number of unique values is the same (408754).
                                                # This means that the value of CLAVE is the same for
                                                # each value in CVE_GEOEST
#habitantes['INDICE_HAB'].is_unique             # INDICE_HAB is a unique value for every record



In [9]:
lista_columnas = habitantes.columns.values.tolist()

print(f'Number of records in habitantes: {len(habitantes)}', end = '\n\n')

for col in lista_columnas[ :5]: 
    print(f'Number of unique values in {col}: {len(habitantes[col].unique())}')
    
print()
print('Unique values of EVE_CENSAL:', end = '\n\n')
print(habitantes['EVE_CENSAL'].unique(), end = '\n\n')

print('Unique values of FUENTE:', end = '\n\n')
print(habitantes['FUENTE'].unique(), end = '\n\n')

print('Unique values of INDICE_HAB:', end = '\n\n')
print(len(habitantes['INDICE_HAB'].unique()), end = '\n\n')



Number of records in habitantes: 2138702

Number of unique values in CLAVE: 408754
Number of unique values in CVE_GEOEST: 408754
Number of unique values in EVE_CENSAL: 19
Number of unique values in INDICE_HAB: 2138702
Number of unique values in FUENTE: 2

Unique values of EVE_CENSAL:

[1940 1930 1900 1970 2005 1995 1950 2020 1960 2010 1910 2000 1980 1990
 1921 1939 1920    2 2013]

Unique values of FUENTE:

['Censo' 'Conteo']

Unique values of INDICE_HAB:

2138702



In [10]:
# EVE_CENSAL

#sorted(habitantes.EVE_CENSAL.unique().tolist()) # Unique value list
habitantes.EVE_CENSAL.value_counts()            # Census events with the fewest records:
                                                 # 1939, 2, 1920
                                                 # and 2013


2010    279187
2020    275867
2005    265585
2000    255718
1995    252784
1990    173771
1980    129895
1970     98154
1960     87878
1950     75029
1940     69733
1930     57411
1921     43851
1910     41045
1900     32605
1939       179
2            8
1920         1
2013         1
Name: EVE_CENSAL, dtype: int64
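
The low-frequency events can also be picked out programmatically. This sketch copies a few of the counts above into a Series; the cutoff `UMBRAL` is a hypothetical value chosen only to separate the tiny groups from the regular census years.

```python
import pandas as pd

# A subset of the record counts shown in the value_counts() output above.
conteos = pd.Series(
    {2010: 279187, 1900: 32605, 1939: 179, 2: 8, 1920: 1, 2013: 1},
    name="EVE_CENSAL",
)

# Hypothetical cutoff: events with fewer records than this are flagged for review.
UMBRAL = 1000

sospechosos = conteos[conteos < UMBRAL].index.tolist()

print(sospechosos)
```

On the real table, `conteos` would come from `habitantes['EVE_CENSAL'].value_counts()`, and `habitantes['EVE_CENSAL'].isin(sospechosos)` would select the matching records.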

In [11]:
# Analysis of the records with EVE_CENSAL values 1939, 1920, 2 and 2013

# Make dataframes of specific census events
df_localidades_2 = habitantes.loc[habitantes['EVE_CENSAL'] == 2]
df_localidades_1920 = habitantes.loc[habitantes['EVE_CENSAL'] == 1920]
df_localidades_1939 = habitantes.loc[habitantes['EVE_CENSAL'] == 1939]
df_localidades_2013 = habitantes.loc[habitantes['EVE_CENSAL'] == 2013]

In [22]:
# EVE_CENSAL 2

# All the values of EVE_CENSAL represent a year in which a census or count was carried out,
# except value 2.

#df_localidades_2['EVE_CENSAL'].value_counts()      # They are geostatistical keys
                                                    # that only appear in recent censuses.
                                                    # The oldest census is from the year 1900.
#df_localidades_2['TOT_HAB'].dtypes                 # The data in this column is of type
                                                    # integer and string.
#len(df_localidades_2)                              # There are 8 records
#df_localidades_2                                   # With a value of 0


In [25]:
# EVE_CENSAL 1920 and 2013 have only one record each.

#df_localidades_1920  
#df_localidades_2013                # TOT_HAB, TOT_HOM, TOT_MUJ with a value of 0                 

In [27]:
# EVE_CENSAL 1939

#df_localidades_1939.head()
#len(df_localidades_1939)                           # 179 records in total

#df_localidades_1939[['TOT_HAB', 'TOT_HOM', 'TOT_MUJ']].value_counts()
                                                    # 169 without values in any column
                                                    # 9 with a value of 0 in all the columns
                                                    # Only one record has the values 2, 1, 1 in
                                                    # TOT_HAB, TOT_HOM, TOT_MUJ

In [13]:
# Convert the population columns to numbers; non-numeric entries become NaN.
#habitantes['TOT_HAB'] = pd.to_numeric(habitantes['TOT_HAB'],
#                                      errors='coerce')

#habitantes['TOT_HOM'] = pd.to_numeric(habitantes['TOT_HOM'],
#                                      errors='coerce',
#                                      downcast='integer')

#habitantes['TOT_MUJ'] = pd.to_numeric(habitantes['TOT_MUJ'],
#                                      errors='coerce',
#                                      downcast='integer')