# Data Prep - Codebook Processing
As we saw earlier, the Census and IPVS datasets contain many coded variables. To make them easier to interpret, we should give them more descriptive names. To do that, we will use the official codebooks and their variable descriptions.

In [1]:
# installing dependencies for data preparation
!pip install -r ../configs/dependencies/dataprep_requirements.txt >> ../configs/dependencies/package_installation.txt

In [2]:
# loading the autoreload and code formatter (lab_black) extensions
%load_ext autoreload
%load_ext lab_black
%autoreload 2

In [3]:
###### Loading the necessary libraries #########

# PySpark dependencies:
import pyspark
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
import pyspark.sql.types as T
from pyspark.sql.window import Window

# Sedona dependencies:
from sedona.utils.adapter import Adapter
from sedona.register import SedonaRegistrator
from sedona.utils import KryoSerializer, SedonaKryoRegistrator
from sedona.core.SpatialRDD import SpatialRDD
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.core.formatMapper import GeoJsonReader

# database utilities:
from sqlalchemy import create_engine
import sqlite3 as db
import pandas as pd
import geopandas as gpd

# plotting and data visualization:
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, Image

# other relevant libraries:
import warnings
import unidecode
import inflection
import unicodedata
from datetime import datetime, timedelta
from functools import partial
import json
import re
import os
from glob import glob
import shutil
import itertools
import chardet

# importing the atlas utilities:
from atlasutils import (
    save_to_filesystem,
    save_as_table,
    rotate_xticks,
    get_file_encoding,
    normalize_entities,
    standardize_variable_names,
    apply_category_map,
)


# setting global parameters for warnings and pandas display:
warnings.filterwarnings("ignore")
pd.set_option("display.precision", 4)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# 0. Configuring Spark

In [4]:
# function to encapsulate standard spark configurations:
def init_spark(app_name):

    spark = (
        SparkSession.builder.appName(app_name)
        .config("spark.files.overwrite", "true")
        .config("spark.serializer", KryoSerializer.getName)
        .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
        .config(
            "spark.jars.packages",
            "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.1-incubating,"
            "org.datasyslab:geotools-wrapper:geotools-24.1",
        )
        .config("spark.sql.repl.eagerEval.enabled", True)
        .config("spark.sql.repl.eagerEval.maxNumRows", 5)
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
        .config("spark.sql.parquet.compression.codec", "gzip")
        .config("sedona.global.charset", "utf8")
        .enableHiveSupport()
        .getOrCreate()
    )

    SedonaRegistrator.registerAll(spark)

    return spark

In [5]:
# init the spark session:
spark = init_spark("SP Atlas - Codebook Processing")

In [6]:
# verifying the session status:
spark

# 1. Inspecting the Data
The codebooks and other official documentation about the data are located in the `references/documentation` directory.

In [7]:
# listing the folders for the datasets available:
!ls --recursive ../references/documentation/

../references/documentation/:
ibge  idh  ipvs  layers  rais

../references/documentation/ibge:
codebook_features_selected.json  ibge_census_codebook_atlas.xlsx
feature_selection_ibge.csv	 ibge_census_codebook.json
feature_selection_ibge.json	 ibge_summary_documentation.pdf
feature_selection_ibge.xlsx

../references/documentation/idh:
idh_2010_codebook.csv  idh_2010_file_structure.csv  idh_2010_metadata.csv

../references/documentation/ipvs:
ipvs_codebook.xlsx

../references/documentation/layers:
Dicionario_Logradouro_2020_CEM.pdf

../references/documentation/rais:
rais_industry_commerce_services.xls  rais_raw_dictionary.xls
rais_industry_dictionary.xls


In [8]:
# get the documents path:
DATA_DOC_PATH = "../references/documentation/"

# loading the IBGE codebooks:
ibge_codebook = pd.read_excel(DATA_DOC_PATH + "ibge/ibge_census_codebook_atlas.xlsx")

# verifying the results:
ibge_codebook.head()

Unnamed: 0,dataset_name_pt,variable_name,is_selected,variable_description_pt,variable_description_en
0,Básico,Cod_setor,1,Código do setor,Sector code
1,Básico,Cod_Grandes Regiões,1,Código das Grandes Regiões (Regiões Geográficas),Code of Large Regions (Geographical Regions)
2,Básico,Nome_Grande_Regiao,1,Nome das Grandes Regiões (Regiões Geográficas),Name of large regions (geographical regions)
3,Básico,Cod_UF,1,Código da Unidade da Federação,Federation Unit Code
4,Básico,Nome_da_UF,1,Nome da Unidade da Federação,Name of the Federation Unit


In [9]:
# loading IPVS file:
ipvs_codebook = pd.read_excel(
    DATA_DOC_PATH + "ipvs/ipvs_codebook.xlsx", sheet_name="Variáveis"
)

# looking at the results:
ipvs_codebook.head()

Unnamed: 0,NOME DO ARQUIVO,VARIAVEL,NOME DA VARIAVEL,DESCRICAO DA VARIAVEL,FONTE,TIPO,normalized_variable
0,IPVS 2010 EST SP,ID,Identificados,Identificados,IBGE,NÚMERO,id
1,IPVS 2010 EST SP,AREA,Área em Km2,Área em Km2,IBGE,TEXTO,sector_area_square_kms
2,IPVS 2010 EST SP,CD_GEOCODI,Código do setor censitário do IBGE,Código do setor censitário do IBGE,IBGE,NÚMERO,sector_code
3,IPVS 2010 EST SP,TIPO,Tipo de setor censitário,Tipo de setor censitário,IBGE,TEXTO,sector_type
4,IPVS 2010 EST SP,CD_GEOCODM,Código do município do IBGE,Código do município do IBGE,IBGE,NÚMERO,city_code_census


For the IPVS file, I previously labeled the variable names in the `normalized_variable` column. Since the census codebook contains more than `4000` variables, we will instead apply an automated normalization procedure to its descriptions to obtain more reasonable variable names. We will do that with a few text processing operations on the codebook.
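
As a minimal sketch of how the labels in `normalized_variable` could later be used, the block below builds a rename mapping from the coded names to the readable ones. The records are copied from the `ipvs_codebook.head()` output above; the helper `build_rename_map` is a hypothetical name introduced here for illustration and is not part of `atlasutils`.

```python
# Rows copied from the `ipvs_codebook.head()` output shown above.
ipvs_records = [
    {"VARIAVEL": "ID", "normalized_variable": "id"},
    {"VARIAVEL": "AREA", "normalized_variable": "sector_area_square_kms"},
    {"VARIAVEL": "CD_GEOCODI", "normalized_variable": "sector_code"},
    {"VARIAVEL": "TIPO", "normalized_variable": "sector_type"},
    {"VARIAVEL": "CD_GEOCODM", "normalized_variable": "city_code_census"},
]


def build_rename_map(records, source_key, target_key):
    """Hypothetical helper: map coded names to readable names.

    Raises ValueError if two coded variables would receive the same
    readable name, since that would create duplicate columns.
    """
    rename_map = {}
    seen_targets = set()
    for record in records:
        source, target = record[source_key], record[target_key]
        if target in seen_targets:
            raise ValueError(f"duplicate target name: {target!r}")
        seen_targets.add(target)
        rename_map[source] = target
    return rename_map


ipvs_rename_map = build_rename_map(ipvs_records, "VARIAVEL", "normalized_variable")
print(ipvs_rename_map["CD_GEOCODI"])  # sector_code
```

With the real codebook, `ipvs_codebook.to_dict("records")` should yield records of the same shape, and the resulting dictionary could be passed to `DataFrame.rename(columns=...)` in pandas.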

# 2. Normalizing Codebook Variable Names

One way to obtain cleaner names is to normalize the variable descriptions. This includes, for example, removing words that don't add meaning to the variable name (i.e., stopwords) and removing textual variations that are not relevant for this use case (capitalization, accents, et cetera).
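
To make these steps concrete, here is a minimal sketch of such a normalization using only the standard library. It is not the implementation of `standardize_variable_names` from `atlasutils`, which is not shown in this notebook; the function name `sketch_normalize_description` and the stopword list are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical stopword list, chosen only for illustration.
STOPWORDS = {"a", "an", "and", "of", "the", "in", "for", "to", "with"}


def sketch_normalize_description(description: str) -> str:
    """Turn a free-text description into a snake_case identifier."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", description)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase and keep only alphanumeric tokens (punctuation is dropped).
    tokens = re.findall(r"[a-z0-9]+", ascii_text.lower())
    # Remove stopwords and join the remaining tokens with underscores.
    return "_".join(token for token in tokens if token not in STOPWORDS)


print(sketch_normalize_description("Name of the Federation Unit"))
# name_federation_unit
print(sketch_normalize_description("Code of Large Regions (Geographical Regions)"))
# code_large_regions_geographical_regions
```

A fixed stopword set keeps the sketch dependency-free; a real pipeline might use a larger list and would also need to handle descriptions that collapse to the same name.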

In [10]:
# normalizing the English descriptions with the function from atlasutils:
ibge_codebook["normalized_variable"] = ibge_codebook["variable_description_en"].apply(
    standardize_variable_names
)

# simplifying the original variable names: strip accents, trim whitespace,
# lowercase, and replace spaces with underscores:
ibge_codebook.loc[:, "simplified_variable_name"] = ibge_codebook["variable_name"].apply(
    lambda text: unicodedata.normalize("NFD", text)
    .encode("ascii", "ignore")
    .decode("ascii")
    .strip()
    .lower()
    .replace(" ", "_")
)

In [14]:
# saving the results:
ibge_codebook.to_json(
    "../references/documentation/ibge/codebook_features_selected.json", orient="records"
)
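
Because the codebook is written with `orient="records"`, the file is a JSON array with one object per variable. The sketch below shows how such a file could be read back and filtered on `is_selected`. It parses an inline string instead of the real file so that it runs on its own; the first two records follow the codebook rows shown earlier, while the third (`Example_Var`) is a made-up entry added only to illustrate an unselected variable.

```python
import json

# Inline stand-in for codebook_features_selected.json (records orientation).
# "Example_Var" is a hypothetical entry, not taken from the real codebook.
sample_json = """
[
  {"variable_name": "Cod_setor", "is_selected": 1,
   "simplified_variable_name": "cod_setor"},
  {"variable_name": "Cod_UF", "is_selected": 1,
   "simplified_variable_name": "cod_uf"},
  {"variable_name": "Example_Var", "is_selected": 0,
   "simplified_variable_name": "example_var"}
]
"""

records = json.loads(sample_json)

# Keep only the simplified names of the variables flagged as selected.
selected_names = [
    record["simplified_variable_name"]
    for record in records
    if record["is_selected"] == 1
]

print(selected_names)  # ['cod_setor', 'cod_uf']
```

To use the real file, the inline string could be replaced by `json.load` on an open file handle pointing to the path used above.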