# Creating a JSON file containing information on all variables

This notebook creates a JSON dictionary file, 'full_var_dict.json', containing for each variable its original name, a descriptive name, and a brief description of each of its values. The file is meant to ease the EDA and to make graphs easier to read and understand, with informative titles and category descriptions for each variable.
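
As an illustration of that goal, the sketch below shows one way such a dictionary could be used to label a frequency table during EDA. The variable name `EXAMPLE_VAR`, its description, and its labels are invented for the example and are not taken from the codebook; only the `Descripción` and `diccionario` field names follow the structure built later in this notebook.

```python
import pandas as pd

# Hypothetical excerpt of full_var_dict.json (name, description and labels are invented)
full_var_dict = {
    "EXAMPLE_VAR": {
        "Descripción": "Example yes/no question",
        "diccionario": {"1": "Yes", "2": "No"},
    }
}

# Toy data holding the raw numeric codes
data = pd.DataFrame({"EXAMPLE_VAR": [1, 2, 2, 1, 2]})


def labelled_counts(df, var, var_info):
    """Return a descriptive title and value counts indexed by category label."""
    entry = var_info[var]
    labels = entry["diccionario"]
    # JSON object keys are always strings, so cast the codes before mapping
    counts = df[var].astype(str).map(labels).value_counts()
    return entry["Descripción"], counts


title, counts = labelled_counts(data, "EXAMPLE_VAR", full_var_dict)
print(title)
print(counts)
```

The returned title and labelled counts can then be passed to whatever plotting call the EDA uses.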

In [2]:
import pandas as pd
import json



The Excel file is the original data codebook, obtained from the website of the INE (Instituto Nacional de Estadística), the owner of the data.

Tables 1 to 4 explain in detail the meaning of each value of the categorical variables in the data, as well as how missing data is coded for numerical and categorical variables.

In [41]:
# Specify the Excel file path
excel_file_path = '../data/raw/disreg_enceursalud20_a.xlsx'

# List of sheet names
sheet_names = ['Tablas1', 'Tablas2', 'Tablas3', 'Tablas4']

This code iterates through tables 1 through 4 and creates a single JSON file, 'var_keys_dict.json', with the description of each value of the categorical variables in the data, as well as how missing data is coded.

In [42]:
# Initialize variables
tables = {}

# Iterate through sheets
for sheet_name in sheet_names:
    # Read the Excel sheet and drop the specified column
    df = pd.read_excel(excel_file_path, sheet_name).drop('Unnamed: 2', axis=1, errors='ignore')

    # Initialize variables
    current_table = None
    table_data = {}

    # Iterate through rows
    for index, row in df.iterrows():
        # Check if the current row is the start of a new table
        if isinstance(row['Unnamed: 0'], str) and not row['Unnamed: 0'].isdigit() and row['Unnamed: 0'] != 'Código ':
            # Save the previous table data
            if current_table is not None:
                tables[current_table] = table_data

            # Set the current table name and initialize an empty dictionary for table data
            current_table, table_data = row['Unnamed: 0'], {}

        # Check if the row contains data (skip rows with NaN values)
        if pd.notna(row['Unnamed: 0']) and (isinstance(row['Unnamed: 0'], int) or (isinstance(row['Unnamed: 0'], str) and row['Unnamed: 0'].isdigit())):
            # Append the row data to the table_data dictionary
            table_data[row['Unnamed: 0']] = row['Unnamed: 1']

    # Save the last table data for the current sheet
    if current_table is not None:
        tables[current_table] = table_data

# Save the combined data to a JSON file
json_file_path = '../data/json_files/var_keys_dict.json'
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(tables, json_file, indent=2, ensure_ascii=False)

print(f'Successfully converted {excel_file_path} to {json_file_path}')

Successfully converted ../data/raw/disreg_enceursalud20_a.xlsx to ../data/json_files/var_keys_dict.json
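
To make the parsing rule easier to follow, here is a minimal sketch of the same idea on a small in-memory sheet. It is a simplified variant, not the cell above: the helper `parse_code_tables` is hypothetical, it strips whitespace and stores every code as a string, and the labels in the toy sheet are invented. A non-numeric cell other than the 'Código' header starts a new table, and numeric cells add code/label pairs to the current table.

```python
import pandas as pd


def parse_code_tables(df, key_col="Unnamed: 0", label_col="Unnamed: 1", header="Código"):
    """Split a stacked codebook sheet into {table_name: {code: label}}."""
    tables = {}
    current = None
    for _, row in df.iterrows():
        key = row[key_col]
        if pd.isna(key):
            continue  # blank separator row
        if isinstance(key, float) and key.is_integer():
            key = int(key)  # codes may be read as floats such as 1.0
        text = str(key).strip()
        if text.isdigit():
            # A code row: attach it to the table currently being read
            if current is not None:
                tables[current][text] = row[label_col]
        elif text != header:
            # Any other non-header text starts a new table
            current = text
            tables[current] = {}
    return tables


# Toy sheet mimicking the stacked layout (labels are invented)
sheet = pd.DataFrame({
    "Unnamed: 0": ["TSINO", "Código ", 1, 2, None, "T1SINO", "Código ", "1", "2"],
    "Unnamed: 1": [None, "Descripción", "Sí", "No", None, None, "Descripción", "Sí", "No"],
})

parsed = parse_code_tables(sheet)
print(parsed)
```

Storing the codes as strings matches what ends up in the JSON file anyway, since `json.dump` converts integer keys to strings.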


This code creates a separate JSON file, 'var_dict.json', that maps each original variable name in the data to a brief description of the variable. The variable names in the data are short and uninformative, hence the need for a dictionary that explains what each name means.

In [43]:
df = pd.read_excel(excel_file_path, 'Diseño', usecols=['Variable','Descripción','Diccionario de la variable']).set_index('Variable')

# Drop the variables that have no value dictionary
df.dropna(subset=['Diccionario de la variable'], inplace=True)
# Rename the dictionary column to a shorter name
df.rename(columns={'Diccionario de la variable':'diccionario'}, inplace=True)

#converts the variable dictionary into json format
json_file_path = '../data/json_files/var_dict.json'
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    df.to_json(json_file, orient='index', lines=False, indent=2,force_ascii=False, default_handler=str)
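
The shape produced by `orient='index'` can be checked with a small sketch. The three variables below are invented stand-ins for rows of the 'Diseño' sheet; only the column names and the table names `TSINO` and `T1SINO` come from this notebook.

```python
import json

import pandas as pd

# Hypothetical rows standing in for the 'Diseño' sheet
design = pd.DataFrame({
    "Variable": ["VAR_A", "VAR_B", "VAR_C"],
    "Descripción": ["First example variable", "Second example variable", "Numeric variable"],
    "Diccionario de la variable": ["TSINO", "T1SINO", None],
}).set_index("Variable")

# Keep only variables that point to a code table, then shorten the column name
design = (
    design
    .dropna(subset=["Diccionario de la variable"])
    .rename(columns={"Diccionario de la variable": "diccionario"})
)

# With no path argument, to_json returns the JSON text instead of writing a file
as_json = design.to_json(orient="index", force_ascii=False, indent=2)
var_dict_example = json.loads(as_json)
print(as_json)
```

Each variable name becomes a top-level key whose value holds its `Descripción` and the name of its code table under `diccionario`, which is the structure the merge step relies on.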

This code opens the two JSON files created above and combines them into a single file, 'full_var_dict.json', containing for each variable: its name, its description, and the description of each of its categorical values and/or missing-value codes.

In [44]:
# Load JSON strings into dictionaries
path1 = '../data/json_files/var_keys_dict.json'
path2 = '../data/json_files/var_dict.json'

with open(path1, 'r', encoding='utf-8') as json_file:  # loads the value keys of each code table
    var_keys_dict = json.load(json_file)

with open(path2, 'r', encoding='utf-8') as json_file:  # loads the variable descriptions
    var_dict = json.load(json_file)

# These two conditionals change the value keys to binary for the variables that were
# transformed to binary in the data cleaning notebook
if 'TSINO' in var_keys_dict:
    tsino_data = var_keys_dict['TSINO']
    
    # Create a new key '0' with the modified values
    tsino_data['0'] = f"{tsino_data['2']}/{tsino_data['8']} - {tsino_data['9']}"
    
    # Remove the unnecessary keys
    tsino_data.pop('2')
    tsino_data.pop('8')
    tsino_data.pop('9')

if 'T1SINO' in var_keys_dict:
    tsino_data = var_keys_dict['T1SINO']
    
    # Create a new key '0' with the modified values
    tsino_data['0'] = f"{tsino_data['2']}"
    # Remove the unnecessary keys
    tsino_data.pop('2')

# Iterate through var_dict and replace each code table name with that table's value keys
for key, value in var_dict.items():
    if "diccionario" in value and value["diccionario"] in var_keys_dict:
        var_dict[key]["diccionario"] = var_keys_dict[value["diccionario"]]

# Storing the resulting JSON (as full_var_dict)
json_file_path = '../data/json_files/full_var_dict.json'
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(var_dict, json_file, indent=2, ensure_ascii=False)
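
The merge can be sketched on toy dictionaries as follows. The helpers `collapse_to_binary` and `attach_code_tables` are hypothetical, the labels and variable names are invented, and the collapse joins all labels with a single separator, which is a simplification of the formatting used in the cell above.

```python
def collapse_to_binary(codes, merge_keys, sep="/"):
    """Return a copy of a code table with the codes in merge_keys merged under key '0'."""
    merged_label = sep.join(codes[k] for k in merge_keys if k in codes)
    result = {k: v for k, v in codes.items() if k not in merge_keys}
    result["0"] = merged_label
    return result


def attach_code_tables(var_dict, var_keys_dict):
    """Replace each variable's code table name with the table itself."""
    out = {}
    for name, info in var_dict.items():
        info = dict(info)  # copy so the input is left untouched
        table_name = info.get("diccionario")
        if isinstance(table_name, str) and table_name in var_keys_dict:
            info["diccionario"] = var_keys_dict[table_name]
        out[name] = info
    return out


# Toy inputs (labels and variable names are invented)
var_keys = {"TSINO": {"1": "Sí", "2": "No", "8": "No sabe", "9": "No contesta"}}
variables = {
    "VAR_A": {"Descripción": "First example variable", "diccionario": "TSINO"},
    "VAR_B": {"Descripción": "Second example variable", "diccionario": "UNKNOWN"},
}

var_keys["TSINO"] = collapse_to_binary(var_keys["TSINO"], ["2", "8", "9"])
merged = attach_code_tables(variables, var_keys)
print(merged)
```

A variable whose table name is not found keeps the name as a plain string, which mirrors the behaviour of the loop above.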

# Final details to prepare the final JSON file for deployment

Variables 'P87_2a' and 'P87_13a' are renamed to 'P87_2a_nuevo' and 'P87_13a_nuevo'.

In [12]:
json_file_path = '../data/json_files/full_var_dict.json'
with open(json_file_path, 'r', encoding='utf-8') as json_file_full:  # loads the full JSON file
    full_var_dict = json.load(json_file_full)

if 'P87_2a' in full_var_dict:
    full_var_dict['P87_2a_nuevo'] = full_var_dict.pop('P87_2a')
#    full_var_dict['P87_2a_nuevo']['Descripción'] = full_var_dict.pop('Medicinas para el estómago y/o las alteraciones digestivas consumidas o recetadas')

if 'P87_13a' in full_var_dict:
    full_var_dict['P87_13a_nuevo'] = full_var_dict.pop('P87_13a')

# Write the modified JSON data back to the file
with open(json_file_path, 'w', encoding='utf-8') as file:
    json.dump(full_var_dict, file, indent=2, ensure_ascii=False)

In [10]:
full_var_dict['P87_13a_nuevo']['Descripción']

'Medicinas para el estómago y/o las alteraciones digestivas consumidas'
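
As a hedged alternative to the `pop`-based renaming, the sketch below renames keys with a mapping, which keeps the original key order, and round-trips the result through a UTF-8 file to confirm that accented characters survive. The helper `rename_keys`, the temporary file name, and the toy entries are assumptions made for this example.

```python
import json
import os
import tempfile


def rename_keys(d, renames):
    """Return a new dict with keys renamed per `renames`, preserving order."""
    return {renames.get(k, k): v for k, v in d.items()}


# Toy dictionary (descriptions are invented)
toy = {
    "P87_2a": {"Descripción": "Descripción de ejemplo"},
    "OTHER_VAR": {"Descripción": "Otra descripción"},
}
renames = {"P87_2a": "P87_2a_nuevo", "P87_13a": "P87_13a_nuevo"}
renamed = rename_keys(toy, renames)

# Round trip through a temporary UTF-8 file
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "example_full_var_dict.json")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(renamed, f, indent=2, ensure_ascii=False)
    with open(tmp_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(list(reloaded))
```

Keys listed in `renames` but absent from the dictionary are simply ignored, so no membership check is needed.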