## Adoption in Brazil

In [166]:
from tqdm import tqdm
import warnings
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
divider = '-'*80

### Loading the directories

In [167]:
# Base directories for the children data and the prospective adoptive parents data
base_dir_children = 'data/criancas_para_adocao/'
base_dir_parents = 'data/prospective_adoptive_parents/'

The `display_directory_structure` function walks a directory tree with `os.walk` and prints each folder followed by the files it contains.
This makes it easy to see how the data files are organized before loading them.

In [168]:
# Function to display directory structure
def display_directory_structure(base_dir):
    for root, dirs, files in os.walk(base_dir):
        level = root.replace(base_dir, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print(f'{indent}{os.path.basename(root)}/')
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print(f'{subindent}{f}')
            
# Display directory structure for children and parents data
print("Children Data Directory Structure:")
display_directory_structure(base_dir_children)
print("\nProspective Adoptive Parents Data Directory Structure:")
display_directory_structure(base_dir_parents)

Children Data Directory Structure:
/
sudeste/
    by_disability.xlsx
    by_infectious_disease.xlsx
    by_ethnicity.xlsx
    by_gender.xlsx
    by_UF.xlsx
    by_siblings.xlsx
    by_age.xlsx
    by_disease.xlsx
general/
    by_disability.xlsx
    by_infectious_disease.xlsx
    by_ethnicity.xlsx
    by_gender.xlsx
    by_UF.xlsx
    by_siblings.xlsx
    by_region.xlsx
    by_age.xlsx
    by_disease.xlsx
nordeste/
    by_disability.xlsx
    by_infectious_disease.xlsx
    by_ethnicity.xlsx
    by_gender.xlsx
    by_UF.xlsx
    by_siblings.xlsx
    by_age.xlsx
    by_disease.xlsx
sul/
    by_disability.xlsx
    by_infectious_disease.xlsx
    by_ethnicity.xlsx
    by_gender.xlsx
    by_UF.xlsx
    by_siblings.xlsx
    by_age.xlsx
    by_disease.xlsx
norte/
    by_disability.xlsx
    by_infectious_disease.xlsx
    by_ethnicity.xlsx
    by_gender.xlsx
    by_UF.xlsx
    by_siblings.xlsx
    by_age.xlsx
    by_disease.xlsx
centro_oeste/
    by_disability.xlsx
    by_infectious_disease.xlsx
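
In the listing above the root is printed as `/` and the region folders are not indented. That is consistent with `base_dir` ending in a path separator: `os.path.basename` of such a path is an empty string, and stripping the prefix from a subfolder path leaves no separator to count. The following is a minimal sketch of an alternative that measures depth with `pathlib` instead. `print_tree` and the temporary demo layout are illustrative names, not part of the notebook.

```python
import os
import tempfile
from pathlib import Path


def print_tree(base_dir):
    """Print a directory tree, indenting each entry by its depth below base_dir."""
    base = Path(base_dir).resolve()
    lines = []
    for root, dirs, files in os.walk(base):
        dirs.sort()  # os.walk honors in-place changes, so this fixes the visit order
        depth = len(Path(root).relative_to(base).parts)
        lines.append(f"{' ' * 4 * depth}{Path(root).name}/")
        for name in sorted(files):
            lines.append(f"{' ' * 4 * (depth + 1)}{name}")
    print("\n".join(lines))
    return lines


# Hypothetical miniature layout, created in a temporary folder for the demo
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "children"
    (demo / "sul").mkdir(parents=True)
    (demo / "sul" / "by_age.xlsx").touch()
    (demo / "norte").mkdir()
    (demo / "norte" / "by_age.xlsx").touch()
    # The trailing separator is deliberate: the result should not depend on it
    tree_lines = print_tree(str(demo) + os.sep)
```

Because the depth comes from the path components relative to the base folder, the output is the same whether or not the base path ends with a separator.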
 

`region_translation` is a dictionary that maps the Portuguese region folder names to their English equivalents.

The `load_region_data` function takes the base directory `base_dir` and a specific `region` as arguments.
It builds the path to the regional directory (`region_dir`) and initializes an empty dictionary `data` to hold the DataFrames.
It then iterates over the files in the regional directory. Each Excel file (ending in `.xlsx`) is read into a DataFrame `df`.
A `region` column is added to `df`, holding the English region name looked up in `region_translation`.
The DataFrame is stored in `data` under a key derived from the file name (without the extension).
Finally, the function returns `data`, which contains all DataFrames for the specified region.
Column names are not changed at this stage; they are renamed later, in the Translation section.

The code then loads every region in `regions` into two nested dictionaries, `children_data` and `parents_data`, keyed first by region and then by file name.

In [169]:
# Translation mappings for region names
region_translation = {
    'centro_oeste': 'Midwest',
    'nordeste': 'Northeast',
    'norte': 'North',
    'sudeste': 'Southeast',
    'sul': 'South',
    'general': 'General'
}

# Function to load data from each regional directory without renaming columns
def load_region_data(base_dir, region):
    region_dir = os.path.join(base_dir, region)
    data = {}
    for file in os.listdir(region_dir):
        if file.endswith('.xlsx'):
            key = os.path.splitext(file)[0]  # file name without its extension, used as the key
            df = pd.read_excel(os.path.join(region_dir, file))
            df['region'] = region_translation[region]  # Translate and add a column for the region
            data[key] = df
    return data

# Load data from all regions for children and parents
regions = ['centro_oeste', 'nordeste', 'norte', 'sudeste', 'sul', 'general']
children_data = {region: load_region_data(base_dir_children, region) for region in regions}
parents_data = {region: load_region_data(base_dir_parents, region) for region in regions}
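
The keys of each per-region dictionary come straight from the file names, so it can help to inspect them before reading any data. Below is a small sketch of that key derivation written with `pathlib`. It only lists files and does not open them. `list_excel_keys` and the demo file names are hypothetical. `Path.stem` drops only the final extension, and `sorted` makes the order reproducible, since `os.listdir` returns entries in arbitrary order.

```python
import tempfile
from pathlib import Path


def list_excel_keys(region_dir):
    """Map each .xlsx file in region_dir to the key its DataFrame would be stored under."""
    return {path.stem: path.name for path in sorted(Path(region_dir).glob("*.xlsx"))}


# Hypothetical folder content, created in a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp:
    for name in ["by_age.xlsx", "by_UF.xlsx", "notes.txt", "by_age.v2.xlsx"]:
        (Path(tmp) / name).touch()
    keys = list_excel_keys(tmp)
    print(keys)
```

The non-Excel file is skipped, and a name with an extra dot such as `by_age.v2.xlsx` keeps its full stem instead of colliding with `by_age`.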

In [170]:
# Function to display the loaded data in a pretty format
def display_loaded_data(data, title):
    for region, dfs in data.items():
        print(f"\nRegion: {region_translation[region]}")
        for key, df in dfs.items():
            print(f"\n{title} - {key}")
            display(df.head())

# Display the loaded data for children and parents
print(divider)
print(" Loaded Children Data:")
print(divider)
display_loaded_data(children_data, "Children Data")
print(divider)
print(" Loaded Prospective Adoptive Parents Data:")
print(divider)
display_loaded_data(parents_data, "Parents Data")

--------------------------------------------------------------------------------
 Loaded Children Data:
--------------------------------------------------------------------------------

Region: Midwest

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,331,Midwest
1,Deficiência Intelectual,35,Midwest
2,Deficiência Física e Intelectual,18,Midwest
3,Deficiência Física,6,Midwest



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,387,Midwest
1,Sim,3,Midwest



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Parda,236,Midwest
1,Branca,75,Midwest
2,Preta,63,Midwest
3,Indigena,15,Midwest
4,Amarela,1,Midwest



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,219,Midwest
1,Feminino,171,Midwest



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,DF,45,40,Midwest
1,GO,45,73,Midwest
2,MS,38,87,Midwest
3,MT,20,42,Midwest



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,148,Midwest
1,Um Irmão,83,Midwest
2,Dois Irmãos,81,Midwest
3,Mais de 3 Irmãos,40,Midwest
4,Três Irmãos,38,Midwest



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,23,Midwest
1,De 2 a 4 anos,23,Midwest
2,De 4 a 6 anos,35,Midwest
3,De 6 a 8 anos,39,Midwest
4,De 8 a 10 anos,45,Midwest



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,313,Midwest
1,Sim,77,Midwest



Region: Northeast

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,656,Northeast
1,Deficiência Intelectual,130,Northeast
2,Deficiência Física e Intelectual,41,Northeast
3,Deficiência Física,17,Northeast



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,840,Northeast
1,Sim,4,Northeast



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Parda,583,Northeast
1,Preta,154,Northeast
2,Branca,106,Northeast
3,Amarela,1,Northeast



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,448,Northeast
1,Feminino,396,Northeast



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,AL,15,28,Northeast
1,BA,42,178,Northeast
2,CE,59,111,Northeast
3,MA,21,38,Northeast
4,PB,34,47,Northeast



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,358,Northeast
1,Um Irmão,144,Northeast
2,Dois Irmãos,143,Northeast
3,Mais de 3 Irmãos,106,Northeast
4,Três Irmãos,93,Northeast



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,52,Northeast
1,De 2 a 4 anos,51,Northeast
2,De 4 a 6 anos,54,Northeast
3,De 6 a 8 anos,63,Northeast
4,De 8 a 10 anos,82,Northeast



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,678,Northeast
1,Sim,166,Northeast



Region: North

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,135,North
1,Deficiência Intelectual,51,North
2,Deficiência Física e Intelectual,19,North
3,Deficiência Física,10,North



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,213,North
1,Sim,2,North



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Parda,175,North
1,Branca,24,North
2,Preta,12,North
3,Indigena,4,North



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,115,North
1,Feminino,100,North



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,AC,8,3,North
1,AM,22,48,North
2,AP,1,10,North
3,PA,28,47,North
4,RO,11,20,North



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,102,North
1,Dois Irmãos,37,North
2,Um Irmão,37,North
3,Mais de 3 Irmãos,22,North
4,Três Irmãos,17,North



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,8,North
1,De 2 a 4 anos,9,North
2,De 4 a 6 anos,21,North
3,De 6 a 8 anos,15,North
4,De 8 a 10 anos,22,North



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,154,North
1,Sim,61,North



Region: Southeast

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,1800,Southeast
1,Deficiência Intelectual,288,Southeast
2,Deficiência Física e Intelectual,88,Southeast
3,Deficiência Física,36,Southeast



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,2191,Southeast
1,Sim,21,Southeast



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Parda,1113,Southeast
1,Branca,588,Southeast
2,Preta,474,Southeast
3,Não Informada,18,Southeast
4,Amarela,16,Southeast



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,1208,Southeast
1,Feminino,1004,Southeast



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,ES,52,69,Southeast
1,MG,200,394,Southeast
2,RJ,142,121,Southeast
3,SP,713,521,Southeast



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,844,Southeast
1,Um Irmão,478,Southeast
2,Dois Irmãos,406,Southeast
3,Três Irmãos,251,Southeast
4,Mais de 3 Irmãos,233,Southeast



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,228,Southeast
1,De 2 a 4 anos,147,Southeast
2,De 4 a 6 anos,192,Southeast
3,De 6 a 8 anos,206,Southeast
4,De 8 a 10 anos,229,Southeast



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,1808,Southeast
1,Sim,404,Southeast



Region: South

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,935,South
1,Deficiência Intelectual,181,South
2,Deficiência Física e Intelectual,49,South
3,Deficiência Física,10,South



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,1159,South
1,Sim,16,South



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Branca,629,South
1,Parda,421,South
2,Preta,112,South
3,Indigena,10,South
4,Amarela,3,South



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,589,South
1,Feminino,586,South



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,PR,201,272,South
1,RS,162,317,South
2,SC,59,164,South



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,475,South
1,Um Irmão,272,South
2,Dois Irmãos,231,South
3,Três Irmãos,100,South
4,Mais de 3 Irmãos,97,South



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,82,South
1,De 2 a 4 anos,62,South
2,De 4 a 6 anos,70,South
3,De 6 a 8 anos,73,South
4,De 8 a 10 anos,110,South



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,910,South
1,Sim,265,South



Region: General

Children Data - by_disability


Unnamed: 0,Deficiência,Disponíveis para Adoção,region
0,Sem Deficiência,3857,General
1,Deficiência Intelectual,685,General
2,Deficiência Física e Intelectual,215,General
3,Deficiência Física,79,General



Children Data - by_infectious_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,4790,General
1,Sim,46,General



Children Data - by_ethnicity


Unnamed: 0,Etnia,Disponíveis para Adoção,region
0,Parda,2528,General
1,Branca,1422,General
2,Preta,815,General
3,Indigena,32,General
4,Amarela,21,General



Children Data - by_gender


Unnamed: 0,Gênero,Disponíveis para Adoção,region
0,Masculino,2579,General
1,Feminino,2257,General



Children Data - by_UF


Unnamed: 0,UF,Disponível - vinculada a pretendente,Disponível - não vinculada a pretendente,region
0,AC,8,3,General
1,AL,15,28,General
2,AM,22,48,General
3,AP,1,10,General
4,BA,42,178,General



Children Data - by_siblings


Unnamed: 0,Irmãos,Disponíveis para Adoção,region
0,Sem Irmão,1927,General
1,Um Irmão,1014,General
2,Dois Irmãos,898,General
3,Três Irmãos,499,General
4,Mais de 3 Irmãos,498,General



Children Data - by_region


Unnamed: 0,Região,Disponíveis para Adoção,region
0,Centro-Oeste,390,General
1,Nordeste,844,General
2,Norte,215,General
3,Sudeste,2212,General
4,Sul,1175,General



Children Data - by_age


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,region
0,Até 2 anos,393,General
1,De 2 a 4 anos,292,General
2,De 4 a 6 anos,372,General
3,De 6 a 8 anos,396,General
4,De 8 a 10 anos,488,General



Children Data - by_disease


Unnamed: 0,Doença,Disponíveis para Adoção,region
0,Não,3863,General
1,Sim,973,General


--------------------------------------------------------------------------------
 Loaded Prospective Adoptive Parents Data:
--------------------------------------------------------------------------------

Region: Midwest

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,2091,Midwest
1,Não,331,Midwest



Parents Data - by_quantity


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,1395,Midwest
1,2,924,Midwest
2,Acima,103,Midwest



Parents Data - by_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,2288,Midwest
1,Deficiência Física,85,Midwest
2,Deficiência Física e Intelectual,43,Midwest
3,Deficiência Intelectual,6,Midwest



Parents Data - by_infectious_illness


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,2255,Midwest
1,Sim,167,Midwest



Parents Data - by_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,1610,Midwest
1,Parda,639,Midwest
2,Branca,633,Midwest
3,Amarela,239,Midwest
4,Preta,166,Midwest



Parents Data - by_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,1682,Midwest
1,Feminino,578,Midwest
2,Masculino,162,Midwest



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,DF,454,Midwest
1,GO,1111,Midwest
2,MS,241,Midwest
3,MT,616,Midwest



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,1673,Midwest
1,Estadual,621,Midwest
2,Municipal,127,Midwest
3,Internacional,1,Midwest



Parents Data - by_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,458,Midwest
1,De 2 a 4 anos,755,Midwest
2,De 4 a 6 anos,730,Midwest
3,De 6 a 8 anos,322,Midwest
4,De 8 a 10 anos,95,Midwest



Parents Data - by_disease


Unnamed: 0,Doença,Pretendentes Disponíveis,region
0,Não,1767,Midwest
1,Sim,655,Midwest



Region: Northeast

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,4619,Northeast
1,Não,870,Northeast



Parents Data - by_quantity


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,3682,Northeast
1,2,1657,Northeast
2,Acima,150,Northeast



Parents Data - by_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,5244,Northeast
1,Deficiência Física,123,Northeast
2,Deficiência Física e Intelectual,111,Northeast
3,Deficiência Intelectual,11,Northeast



Parents Data - by_infectious_disease


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,5159,Northeast
1,Sim,330,Northeast



Parents Data - by_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,3547,Northeast
1,Parda,1606,Northeast
2,Branca,1302,Northeast
3,Amarela,396,Northeast
4,Preta,354,Northeast



Parents Data - by_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,3240,Northeast
1,Feminino,1765,Northeast
2,Masculino,484,Northeast



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,AL,331,Northeast
1,BA,1398,Northeast
2,CE,1115,Northeast
3,MA,238,Northeast
4,PB,514,Northeast



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,2373,Northeast
1,Estadual,2262,Northeast
2,Municipal,851,Northeast
3,Internacional,3,Northeast



Parents Data - by_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,1423,Northeast
1,De 2 a 4 anos,1843,Northeast
2,De 4 a 6 anos,1374,Northeast
3,De 6 a 8 anos,496,Northeast
4,De 8 a 10 anos,187,Northeast



Parents Data - by_disease


Unnamed: 0,Doença,Pretendentes Disponíveis,region
0,Não,4107,Northeast
1,Sim,1382,Northeast



Region: North

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,1003,North
1,Não,201,North



Parents Data - by_quantity


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,719,North
1,2,438,North
2,Acima,47,North



Parents Data - by_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,1123,North
1,Deficiência Física,39,North
2,Deficiência Física e Intelectual,38,North
3,Deficiência Intelectual,4,North



Parents Data - by_infectious_disease


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,1121,North
1,Sim,83,North



Parents Data - by_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,823,North
1,Parda,325,North
2,Branca,296,North
3,Amarela,131,North
4,Preta,110,North



Parents Data - by_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,704,North
1,Feminino,347,North
2,Masculino,153,North



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,AC,72,North
1,AM,145,North
2,AP,62,North
3,PA,448,North
4,RO,279,North



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,505,North
1,Estadual,456,North
2,Municipal,239,North
3,Internacional,4,North



Parents Data - by_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,282,North
1,De 2 a 4 anos,380,North
2,De 4 a 6 anos,323,North
3,De 6 a 8 anos,120,North
4,De 8 a 10 anos,45,North



Parents Data - by_disease


Unnamed: 0,Doença,Pretendentes Disponíveis,region
0,Não,977,North
1,Sim,227,North



Region: Southeast

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,15721,Southeast
1,Não,2023,Southeast



Parents Data - by_accepted_disease


Unnamed: 0,Doença,Pretendentes Disponíveis,region
0,Não,11023,Southeast
1,Sim,6721,Southeast



Parents Data - by_accepted_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,16806,Southeast
1,Deficiência Física,752,Southeast
2,Deficiência Física e Intelectual,124,Southeast
3,Deficiência Intelectual,62,Southeast



Parents Data - by_quantity_accepted


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,11208,Southeast
1,2,6182,Southeast
2,Acima,354,Southeast



Parents Data - by_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,11100,Southeast
1,Branca,5456,Southeast
2,Parda,4999,Southeast
3,Amarela,1275,Southeast
4,Preta,1153,Southeast



Parents Data - by_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,12279,Southeast
1,Feminino,4186,Southeast
2,Masculino,1279,Southeast



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,ES,703,Southeast
1,MG,4886,Southeast
2,RJ,3172,Southeast
3,SP,8983,Southeast



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,8602,Southeast
1,Estadual,7242,Southeast
2,Municipal,1898,Southeast
3,Internacional,12,Southeast



Parents Data - by_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,2506,Southeast
1,De 2 a 4 anos,5649,Southeast
2,De 4 a 6 anos,5736,Southeast
3,De 6 a 8 anos,2743,Southeast
4,De 8 a 10 anos,773,Southeast



Parents Data - by_infectious_disease_accepted


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,16093,Southeast
1,Sim,1651,Southeast



Region: South

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,8394,South
1,Não,930,South



Parents Data - by_quantity


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,5247,South
1,2,3868,South
2,Acima,205,South



Parents Data - by_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,8880,South
1,Deficiência Física,357,South
2,Deficiência Física e Intelectual,60,South
3,Deficiência Intelectual,27,South



Parents Data - by_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,5232,South
1,Branca,3861,South
2,Parda,2537,South
3,Amarela,995,South
4,Preta,480,South



Parents Data - by_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,7000,South
1,Feminino,1795,South
2,Masculino,525,South



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,PR,2639,South
1,RS,3673,South
2,SC,3008,South



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,5701,South
1,Estadual,3068,South
2,Municipal,554,South
3,Internacional,1,South



Parents Data - by_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,1545,South
1,De 2 a 4 anos,2957,South
2,De 4 a 6 anos,3068,South
3,De 6 a 8 anos,1291,South
4,De 8 a 10 anos,313,South



Parents Data - by_infectious_disease_accepted


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,8513,South
1,Sim,811,South



Parents Data - by_disease


Unnamed: 0,Doença,Pretendentes Disponíveis,region
0,Não,5336,South
1,Sim,3988,South



Region: General

Parents Data - by_couple


Unnamed: 0,Casal?,Pretendentes Disponíveis,region
0,Sim,31899,General
1,Não,4349,General



Parents Data - by_accepted_age


Unnamed: 0,Idade,Pretendentes Disponíveis,region
0,Até 2 anos,6228,General
1,De 2 a 4 anos,11606,General
2,De 4 a 6 anos,11240,General
3,De 6 a 8 anos,4989,General
4,De 8 a 10 anos,1412,General



Parents Data - by_accepted_disability


Unnamed: 0,Deficiência,Pretendentes Disponíveis,region
0,Sem Deficiência,34402,General
1,Deficiência Física,1359,General
2,Deficiência Física e Intelectual,377,General
3,Deficiência Intelectual,110,General



Parents Data - by_quantity_accepted


Unnamed: 0,Qtd. aceita adotar,Pretendentes Disponíveis,region
0,1,22301,General
1,2,13087,General
2,Acima,860,General



Parents Data - by_UF


Unnamed: 0,UF,Pretendentes Disponíveis,region
0,AC,72,General
1,AL,331,General
2,AM,144,General
3,AP,62,General
4,BA,1396,General



Parents Data - by_region


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Centro-Oeste,2426,General
1,Nordeste,5496,General
2,Norte,1205,General
3,Sudeste,17789,General
4,Sul,9332,General



Parents Data - by_accepted_ethnicity


Unnamed: 0,Etnia,Pretendentes Disponíveis,region
0,Qualquer,22352,General
1,Branca,11571,General
2,Parda,10126,General
3,Amarela,3036,General
4,Preta,2265,General



Parents Data - by_type


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Nacional,18854,General
1,Estadual,13649,General
2,Municipal,3669,General
3,Internacional,21,General



Parents Data - by_infectious_disease_accepted


Unnamed: 0,Doença infectocontagiosa,Pretendentes Disponíveis,region
0,Não,33202,General
1,Sim,3046,General



Parents Data - by_accepted_gender


Unnamed: 0,Gênero,Pretendentes Disponíveis,region
0,Qualquer,24949,General
1,Feminino,8689,General
2,Masculino,2610,General
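
The parents tables printed above are not named the same way in every region. For example, the Midwest has `by_quantity` and `by_infectious_illness`, while the Southeast has `by_quantity_accepted` and `by_infectious_disease_accepted`. The sketch below shows one way to surface such differences by comparing the key sets. `compare_keys` is a hypothetical helper, and the key lists are copied from the output above for two regions only.

```python
def compare_keys(keys_by_region):
    """Return, per region, the table names that are not shared by every region."""
    all_sets = [set(keys) for keys in keys_by_region.values()]
    common = set.intersection(*all_sets)
    return {region: sorted(set(keys) - common) for region, keys in keys_by_region.items()}


# Key lists copied from the printed output for two of the regions
parents_keys = {
    "centro_oeste": [
        "by_couple", "by_quantity", "by_disability", "by_infectious_illness",
        "by_ethnicity", "by_gender", "by_UF", "by_type", "by_age", "by_disease",
    ],
    "sudeste": [
        "by_couple", "by_accepted_disease", "by_accepted_disability",
        "by_quantity_accepted", "by_ethnicity", "by_gender", "by_UF", "by_type",
        "by_age", "by_infectious_disease_accepted",
    ],
}

odd_keys = compare_keys(parents_keys)
for region, names in odd_keys.items():
    print(region, names)
```

In the notebook, the same check could be run on the loaded data by passing `{region: list(dfs) for region, dfs in parents_data.items()}`.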


### Translation

In [171]:
# Translation mappings for column names and content
translation_dict_columns_children = {
    'Disponíveis para Adoção': 'available_for_adoption',
    'Fx. Etária': 'age_group',
    'Idade': 'age_preference',
    'Gênero': 'gender',
    'Etnia': 'ethnicity',
    'Deficiência': 'disability',
    'Doença': 'disease',
    'Irmãos': 'sibling',
    'UF': 'UF',
    'Região': 'region',
    'Disponível - vinculada a pretendente': 'available_for_adoption_linked',
    'Disponível - não vinculada a pretendente': 'available_for_adoption_unlinked'   
}

translation_dict_columns_parents = {
    'Disponíveis para Adoção': 'available_for_adoption',
    'Fx. Etária': 'age_group',
    'Idade': 'age_preference',
    'Gênero': 'gender_preference',
    'Etnia': 'ethnicity_preference',
    'Deficiência': 'disability_preference',
    'Doença': 'disease_preference',  # This will be conditionally changed based on file name
    'Irmãos': 'sibling_preference',
    'UF': 'UF',
    'Região': 'region',
    'Qtd. aceita adotar': 'accepted_quantity',
    'Casal?': 'couple',
    'Pretendentes Disponíveis': 'available_candidates',
    'Doença infectocontagiosa': 'infectious_disease_preference',
    'Tipo': 'adoption_type'

}

translation_dict_content = {
    'age_group': {
        'Até 2 anos': '0-2 years', 'De 2 a 4 anos': '2-4 years', 'De 4 a 6 anos': '4-6 years',
        'De 6 a 8 anos': '6-8 years', 'De 8 a 10 anos': '8-10 years', 'De 10 a 12 anos': '10-12 years',
        'De 12 a 14 anos': '12-14 years', 'De 14 a 16 anos': '14-16 years', 'Maior de 16 anos': '16+ years'
    },
    'age_preference': {
        'Até 2 anos': '0-2 years', 'De 2 a 4 anos': '2-4 years', 'De 4 a 6 anos': '4-6 years',
        'De 6 a 8 anos': '6-8 years', 'De 8 a 10 anos': '8-10 years', 'De 10 a 12 anos': '10-12 years',
        'De 12 a 14 anos': '12-14 years', 'De 14 a 16 anos': '14-16 years', 'Maior de 16 anos': '16+ years'
    },
    'gender': {
        'Masculino': 'Male', 'Feminino': 'Female', 'Qualquer': 'Any'
    },
    'disease': {
        'Não': 'No', 'Sim': 'Yes'
    },
    'infectious_disease': {
        'Não': 'No', 'Sim': 'Yes'
    },
    'infectious_disease_preference': {
        'Não': 'No', 'Sim': 'Yes'
    },
    'accepted_quantity': {
        'Acima': 'Above'
    },
    'couple': {
        'Não': 'No', 'Sim': 'Yes'
    },
    'sibling': {
        'Sem Irmão': 'No Siblings', 'Um Irmão': 'One Sibling', 'Dois Irmãos': 'Two Siblings',
        'Três Irmãos': 'Three Siblings', 'Mais de 3 Irmãos': 'More than Three Siblings'
    },
    'disability': {
        'Sem Deficiência': 'No Disability', 'Deficiência Intelectual': 'Intellectual Disability',
        'Deficiência Física e Intelectual': 'Physical and Intellectual Disability', 'Deficiência Física': 'Physical Disability'
    },
    'disability_preference': {
        'Sem Deficiência': 'No Disability', 'Deficiência Intelectual': 'Intellectual Disability',
        'Deficiência Física e Intelectual': 'Physical and Intellectual Disability', 'Deficiência Física': 'Physical Disability'
    },
    'ethnicity': {
        'Parda': 'Mixed', 'Branca': 'White', 'Preta': 'Black', 'Indígena': 'Indigenous', 'Amarela': 'Asian', 'Não Informada': 'Not Informed', 'Qualquer': 'Any'
    },
    'region': {
        'Centro Oeste': 'Midwest', 'Nordeste': 'Northeast', 'Norte': 'North', 'Sudeste': 'Southeast', 'Sul': 'South'
    },
    'available_for_adoption': {
        'Disponível - vinculada a pretendente': 'Available - Linked to Applicant',
        'Disponível - não vinculada a pretendente': 'Available - Not Linked to Applicant'
    },
    'adoption_type': {
        'Nacional': 'National', 'Estadual': 'State', 'Municipal': 'Municipal', 'Internacional': 'International'
    }
}
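
Content translation relies on exact string matches, so small spelling differences leave values untranslated. The raw tables above contain `Indigena` (no accent) and `Centro-Oeste` (hyphenated), while `translation_dict_content` uses the keys `Indígena` and `Centro Oeste`. The following sketch, which uses only the standard library, lists values that have no exact key and suggests the closest key after removing accents and hyphens. `normalize`, `untranslated`, and the shortened demo mappings are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.replace("-", " ").lower().strip()


def untranslated(values, mapping):
    """Values with no exact key in mapping, paired with a near-match key if one exists."""
    by_normal_form = {normalize(key): key for key in mapping}
    return {v: by_normal_form.get(normalize(v)) for v in values if v not in mapping}


# Shortened copies of the ethnicity and region mappings, for illustration only
ethnicity_map = {"Parda": "Mixed", "Branca": "White", "Preta": "Black",
                 "Indígena": "Indigenous", "Amarela": "Asian"}
region_map = {"Centro Oeste": "Midwest", "Nordeste": "Northeast", "Norte": "North"}

missing_ethnicity = untranslated(["Parda", "Indigena", "Amarela"], ethnicity_map)
missing_region = untranslated(["Centro-Oeste", "Norte"], region_map)
print(missing_ethnicity)
print(missing_region)
```

A value paired with `None` would have no near match at all and would need a new dictionary entry.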

In [172]:
# Function to conditionally rename columns based on file name
def conditionally_rename_columns(df, filename, col_translations):
    col_translations = col_translations.copy()  # Copy to avoid modifying the original dictionary
    if 'by_infectious_disease' in filename:
        col_translations['Doença'] = 'infectious_disease'
    elif 'by_disease' in filename:
        col_translations['Doença'] = 'disease'
    return df.rename(columns=col_translations)

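# Combine the linked and unlinked counts into a single 'available_for_adoption' column.
# The check uses the translated column names, so it only applies to DataFrames whose columns have already been renamed.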
def preprocess_data(data):
    for region, dfs in data.items():
        for key, df in dfs.items():
            if 'available_for_adoption_linked' in df.columns and 'available_for_adoption_unlinked' in df.columns:
                df['available_for_adoption'] = df['available_for_adoption_linked'] + df['available_for_adoption_unlinked']
                df.drop(['available_for_adoption_linked', 'available_for_adoption_unlinked'], axis=1, inplace=True)
    return data


# Preprocess children data
children_data = preprocess_data(children_data)


# Function to translate column names and content
def translate_dataframe(df, filename, col_translations, content_translations):
    if not isinstance(df, pd.DataFrame):
        return df
    
    # Conditionally rename columns based on file name
    df = conditionally_rename_columns(df, filename, col_translations)
    
    # Translate content for each column
    for col in df.columns:
        if col in content_translations:
            if pd.api.types.is_object_dtype(df[col]):
                df[col] = df[col].map(content_translations[col]).fillna(df[col])
    
    return df

def apply_translations(data, col_translations, content_translations):
    for region, dfs in data.items():
        for key, df in dfs.items():
            try:
                data[region][key] = translate_dataframe(df, key, col_translations, content_translations)
            except Exception as e:
                pass  # Silently handle exceptions
            
            # Special handling for 'by_region' DataFrame
            if key == 'by_region':
                if 'Região' in df.columns:
                    df['region'] = df['Região'].map(content_translations['region']).fillna(df['Região'])
                if 'Disponíveis para Adoção' in df.columns:
                    df = df.rename(columns={'Disponíveis para Adoção': 'available_for_adoption'})
                data[region][key] = df
    
    return data

# Apply translations to children and parents data
children_data = apply_translations(children_data, translation_dict_columns_children, translation_dict_content)
parents_data = apply_translations(parents_data, translation_dict_columns_parents, translation_dict_content)
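
`apply_translations` discards every exception, so a table that fails to translate stays in Portuguese without leaving any trace. Below is a minimal sketch of the same nested loop that records failures with the standard `logging` module instead. `apply_safely` and the toy data are hypothetical, and plain lists stand in for DataFrames.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("translation")


def apply_safely(tables, transform):
    """Apply transform to every table; on failure keep the original and record the error."""
    failures = []
    for region, region_tables in tables.items():
        for key, table in region_tables.items():
            try:
                region_tables[key] = transform(table)
            except Exception as exc:
                logger.warning("could not translate %s/%s: %s", region, key, exc)
                failures.append((region, key, repr(exc)))
    return failures


def upper_all(labels):
    """Toy transform standing in for the real translation step."""
    return [label.upper() for label in labels]


# Toy stand-ins for DataFrames: lists of labels; None makes the transform fail
demo = {"sul": {"by_gender": ["masculino", "feminino"], "by_age": None}}
failures = apply_safely(demo, upper_all)
print(failures)
```

Returning the list of failures lets the caller decide whether a partially translated dataset is acceptable or whether the run should stop.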

In [173]:
# Display the translated data for children and parents
print(divider)
print(" Translated Children Data:")
print(divider)
display_loaded_data(children_data, "Children Data")
print(divider)
print(" Translated Prospective Adoptive Parents Data:")
print(divider)
display_loaded_data(parents_data, "Parents Data")

--------------------------------------------------------------------------------
 Translated Children Data:
--------------------------------------------------------------------------------

Region: Midwest

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,331,Midwest
1,Intellectual Disability,35,Midwest
2,Physical and Intellectual Disability,18,Midwest
3,Physical Disability,6,Midwest



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,387,Midwest
1,Yes,3,Midwest



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,Mixed,236,Midwest
1,White,75,Midwest
2,Black,63,Midwest
3,Indigena,15,Midwest
4,Asian,1,Midwest



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,219,Midwest
1,Female,171,Midwest



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,DF,45,40,Midwest
1,GO,45,73,Midwest
2,MS,38,87,Midwest
3,MT,20,42,Midwest



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,148,Midwest
1,One Sibling,83,Midwest
2,Two Siblings,81,Midwest
3,More than Three Siblings,40,Midwest
4,Three Siblings,38,Midwest



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,23,Midwest
1,2-4 years,23,Midwest
2,4-6 years,35,Midwest
3,6-8 years,39,Midwest
4,8-10 years,45,Midwest



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,313,Midwest
1,Yes,77,Midwest



Region: Northeast

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,656,Northeast
1,Intellectual Disability,130,Northeast
2,Physical and Intellectual Disability,41,Northeast
3,Physical Disability,17,Northeast



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,840,Northeast
1,Yes,4,Northeast



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,Mixed,583,Northeast
1,Black,154,Northeast
2,White,106,Northeast
3,Asian,1,Northeast



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,448,Northeast
1,Female,396,Northeast



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,AL,15,28,Northeast
1,BA,42,178,Northeast
2,CE,59,111,Northeast
3,MA,21,38,Northeast
4,PB,34,47,Northeast



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,358,Northeast
1,One Sibling,144,Northeast
2,Two Siblings,143,Northeast
3,More than Three Siblings,106,Northeast
4,Three Siblings,93,Northeast



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,52,Northeast
1,2-4 years,51,Northeast
2,4-6 years,54,Northeast
3,6-8 years,63,Northeast
4,8-10 years,82,Northeast



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,678,Northeast
1,Yes,166,Northeast



Region: North

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,135,North
1,Intellectual Disability,51,North
2,Physical and Intellectual Disability,19,North
3,Physical Disability,10,North



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,213,North
1,Yes,2,North



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,Mixed,175,North
1,White,24,North
2,Black,12,North
3,Indigena,4,North



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,115,North
1,Female,100,North



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,AC,8,3,North
1,AM,22,48,North
2,AP,1,10,North
3,PA,28,47,North
4,RO,11,20,North



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,102,North
1,Two Siblings,37,North
2,One Sibling,37,North
3,More than Three Siblings,22,North
4,Three Siblings,17,North



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,8,North
1,2-4 years,9,North
2,4-6 years,21,North
3,6-8 years,15,North
4,8-10 years,22,North



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,154,North
1,Yes,61,North



Region: Southeast

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,1800,Southeast
1,Intellectual Disability,288,Southeast
2,Physical and Intellectual Disability,88,Southeast
3,Physical Disability,36,Southeast



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,2191,Southeast
1,Yes,21,Southeast



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,Mixed,1113,Southeast
1,White,588,Southeast
2,Black,474,Southeast
3,Not Informed,18,Southeast
4,Asian,16,Southeast



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,1208,Southeast
1,Female,1004,Southeast



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,ES,52,69,Southeast
1,MG,200,394,Southeast
2,RJ,142,121,Southeast
3,SP,713,521,Southeast



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,844,Southeast
1,One Sibling,478,Southeast
2,Two Siblings,406,Southeast
3,Three Siblings,251,Southeast
4,More than Three Siblings,233,Southeast



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,228,Southeast
1,2-4 years,147,Southeast
2,4-6 years,192,Southeast
3,6-8 years,206,Southeast
4,8-10 years,229,Southeast



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,1808,Southeast
1,Yes,404,Southeast



Region: South

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,935,South
1,Intellectual Disability,181,South
2,Physical and Intellectual Disability,49,South
3,Physical Disability,10,South



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,1159,South
1,Yes,16,South



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,White,629,South
1,Mixed,421,South
2,Black,112,South
3,Indigena,10,South
4,Asian,3,South



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,589,South
1,Female,586,South



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,PR,201,272,South
1,RS,162,317,South
2,SC,59,164,South



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,475,South
1,One Sibling,272,South
2,Two Siblings,231,South
3,Three Siblings,100,South
4,More than Three Siblings,97,South



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,82,South
1,2-4 years,62,South
2,4-6 years,70,South
3,6-8 years,73,South
4,8-10 years,110,South



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,910,South
1,Yes,265,South



Region: General

Children Data - by_disability


Unnamed: 0,disability,available_for_adoption,region
0,No Disability,3857,General
1,Intellectual Disability,685,General
2,Physical and Intellectual Disability,215,General
3,Physical Disability,79,General



Children Data - by_infectious_disease


Unnamed: 0,infectious_disease,available_for_adoption,region
0,No,4790,General
1,Yes,46,General



Children Data - by_ethnicity


Unnamed: 0,ethnicity,available_for_adoption,region
0,Mixed,2528,General
1,White,1422,General
2,Black,815,General
3,Indigena,32,General
4,Asian,21,General



Children Data - by_gender


Unnamed: 0,gender,available_for_adoption,region
0,Male,2579,General
1,Female,2257,General



Children Data - by_UF


Unnamed: 0,UF,available_for_adoption_linked,available_for_adoption_unlinked,region
0,AC,8,3,General
1,AL,15,28,General
2,AM,22,48,General
3,AP,1,10,General
4,BA,42,178,General



Children Data - by_siblings


Unnamed: 0,sibling,available_for_adoption,region
0,No Siblings,1927,General
1,One Sibling,1014,General
2,Two Siblings,898,General
3,Three Siblings,499,General
4,More than Three Siblings,498,General



Children Data - by_region


Unnamed: 0,Região,available_for_adoption,region
0,Centro-Oeste,390,Centro-Oeste
1,Nordeste,844,Northeast
2,Norte,215,North
3,Sudeste,2212,Southeast
4,Sul,1175,South



Children Data - by_age


Unnamed: 0,age_group,available_for_adoption,region
0,0-2 years,393,General
1,2-4 years,292,General
2,4-6 years,372,General
3,6-8 years,396,General
4,8-10 years,488,General



Children Data - by_disease


Unnamed: 0,disease,available_for_adoption,region
0,No,3863,General
1,Yes,973,General


--------------------------------------------------------------------------------
 Translated Prospective Adoptive Parents Data:
--------------------------------------------------------------------------------

Region: Midwest

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,2091,Midwest
1,No,331,Midwest



Parents Data - by_quantity


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,1395,Midwest
1,2,924,Midwest
2,Above,103,Midwest



Parents Data - by_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,2288,Midwest
1,Physical Disability,85,Midwest
2,Physical and Intellectual Disability,43,Midwest
3,Intellectual Disability,6,Midwest



Parents Data - by_infectious_illness


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,2255,Midwest
1,Yes,167,Midwest



Parents Data - by_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,1610,Midwest
1,Parda,639,Midwest
2,Branca,633,Midwest
3,Amarela,239,Midwest
4,Preta,166,Midwest



Parents Data - by_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,1682,Midwest
1,Feminino,578,Midwest
2,Masculino,162,Midwest



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,DF,454,Midwest
1,GO,1111,Midwest
2,MS,241,Midwest
3,MT,616,Midwest



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,1673,Midwest
1,State,621,Midwest
2,Municipal,127,Midwest
3,International,1,Midwest



Parents Data - by_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,458,Midwest
1,2-4 years,755,Midwest
2,4-6 years,730,Midwest
3,6-8 years,322,Midwest
4,8-10 years,95,Midwest



Parents Data - by_disease


Unnamed: 0,disease,available_candidates,region
0,No,1767,Midwest
1,Yes,655,Midwest



Region: Northeast

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,4619,Northeast
1,No,870,Northeast



Parents Data - by_quantity


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,3682,Northeast
1,2,1657,Northeast
2,Above,150,Northeast



Parents Data - by_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,5244,Northeast
1,Physical Disability,123,Northeast
2,Physical and Intellectual Disability,111,Northeast
3,Intellectual Disability,11,Northeast



Parents Data - by_infectious_disease


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,5159,Northeast
1,Yes,330,Northeast



Parents Data - by_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,3547,Northeast
1,Parda,1606,Northeast
2,Branca,1302,Northeast
3,Amarela,396,Northeast
4,Preta,354,Northeast



Parents Data - by_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,3240,Northeast
1,Feminino,1765,Northeast
2,Masculino,484,Northeast



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,AL,331,Northeast
1,BA,1398,Northeast
2,CE,1115,Northeast
3,MA,238,Northeast
4,PB,514,Northeast



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,2373,Northeast
1,State,2262,Northeast
2,Municipal,851,Northeast
3,International,3,Northeast



Parents Data - by_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,1423,Northeast
1,2-4 years,1843,Northeast
2,4-6 years,1374,Northeast
3,6-8 years,496,Northeast
4,8-10 years,187,Northeast



Parents Data - by_disease


Unnamed: 0,disease,available_candidates,region
0,No,4107,Northeast
1,Yes,1382,Northeast



Region: North

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,1003,North
1,No,201,North



Parents Data - by_quantity


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,719,North
1,2,438,North
2,Above,47,North



Parents Data - by_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,1123,North
1,Physical Disability,39,North
2,Physical and Intellectual Disability,38,North
3,Intellectual Disability,4,North



Parents Data - by_infectious_disease


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,1121,North
1,Yes,83,North



Parents Data - by_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,823,North
1,Parda,325,North
2,Branca,296,North
3,Amarela,131,North
4,Preta,110,North



Parents Data - by_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,704,North
1,Feminino,347,North
2,Masculino,153,North



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,AC,72,North
1,AM,145,North
2,AP,62,North
3,PA,448,North
4,RO,279,North



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,505,North
1,State,456,North
2,Municipal,239,North
3,International,4,North



Parents Data - by_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,282,North
1,2-4 years,380,North
2,4-6 years,323,North
3,6-8 years,120,North
4,8-10 years,45,North



Parents Data - by_disease


Unnamed: 0,disease,available_candidates,region
0,No,977,North
1,Yes,227,North



Region: Southeast

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,15721,Southeast
1,No,2023,Southeast



Parents Data - by_accepted_disease


Unnamed: 0,disease_preference,available_candidates,region
0,Não,11023,Southeast
1,Sim,6721,Southeast



Parents Data - by_accepted_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,16806,Southeast
1,Physical Disability,752,Southeast
2,Physical and Intellectual Disability,124,Southeast
3,Intellectual Disability,62,Southeast



Parents Data - by_quantity_accepted


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,11208,Southeast
1,2,6182,Southeast
2,Above,354,Southeast



Parents Data - by_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,11100,Southeast
1,Branca,5456,Southeast
2,Parda,4999,Southeast
3,Amarela,1275,Southeast
4,Preta,1153,Southeast



Parents Data - by_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,12279,Southeast
1,Feminino,4186,Southeast
2,Masculino,1279,Southeast



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,ES,703,Southeast
1,MG,4886,Southeast
2,RJ,3172,Southeast
3,SP,8983,Southeast



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,8602,Southeast
1,State,7242,Southeast
2,Municipal,1898,Southeast
3,International,12,Southeast



Parents Data - by_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,2506,Southeast
1,2-4 years,5649,Southeast
2,4-6 years,5736,Southeast
3,6-8 years,2743,Southeast
4,8-10 years,773,Southeast



Parents Data - by_infectious_disease_accepted


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,16093,Southeast
1,Yes,1651,Southeast



Region: South

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,8394,South
1,No,930,South



Parents Data - by_quantity


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,5247,South
1,2,3868,South
2,Above,205,South



Parents Data - by_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,8880,South
1,Physical Disability,357,South
2,Physical and Intellectual Disability,60,South
3,Intellectual Disability,27,South



Parents Data - by_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,5232,South
1,Branca,3861,South
2,Parda,2537,South
3,Amarela,995,South
4,Preta,480,South



Parents Data - by_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,7000,South
1,Feminino,1795,South
2,Masculino,525,South



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,PR,2639,South
1,RS,3673,South
2,SC,3008,South



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,5701,South
1,State,3068,South
2,Municipal,554,South
3,International,1,South



Parents Data - by_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,1545,South
1,2-4 years,2957,South
2,4-6 years,3068,South
3,6-8 years,1291,South
4,8-10 years,313,South



Parents Data - by_infectious_disease_accepted


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,8513,South
1,Yes,811,South



Parents Data - by_disease


Unnamed: 0,disease,available_candidates,region
0,No,5336,South
1,Yes,3988,South



Region: General

Parents Data - by_couple


Unnamed: 0,couple,available_candidates,region
0,Yes,31899,General
1,No,4349,General



Parents Data - by_accepted_age


Unnamed: 0,age_preference,available_candidates,region
0,0-2 years,6228,General
1,2-4 years,11606,General
2,4-6 years,11240,General
3,6-8 years,4989,General
4,8-10 years,1412,General



Parents Data - by_accepted_disability


Unnamed: 0,disability_preference,available_candidates,region
0,No Disability,34402,General
1,Physical Disability,1359,General
2,Physical and Intellectual Disability,377,General
3,Intellectual Disability,110,General



Parents Data - by_quantity_accepted


Unnamed: 0,accepted_quantity,available_candidates,region
0,1,22301,General
1,2,13087,General
2,Above,860,General



Parents Data - by_UF


Unnamed: 0,UF,available_candidates,region
0,AC,72,General
1,AL,331,General
2,AM,144,General
3,AP,62,General
4,BA,1396,General



Parents Data - by_region


Unnamed: 0,Tipo,Pretendentes Disponíveis,region
0,Centro-Oeste,2426,General
1,Nordeste,5496,General
2,Norte,1205,General
3,Sudeste,17789,General
4,Sul,9332,General



Parents Data - by_accepted_ethnicity


Unnamed: 0,ethnicity_preference,available_candidates,region
0,Qualquer,22352,General
1,Branca,11571,General
2,Parda,10126,General
3,Amarela,3036,General
4,Preta,2265,General



Parents Data - by_type


Unnamed: 0,adoption_type,available_candidates,region
0,National,18854,General
1,State,13649,General
2,Municipal,3669,General
3,International,21,General



Parents Data - by_infectious_disease_accepted


Unnamed: 0,infectious_disease_preference,available_candidates,region
0,No,33202,General
1,Yes,3046,General



Parents Data - by_accepted_gender


Unnamed: 0,gender_preference,available_candidates,region
0,Qualquer,24949,General
1,Feminino,8689,General
2,Masculino,2610,General


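### Children and Parents Data Merging
Each filter table (`by_age`, `by_gender`, ...) stores only aggregate counts. We reload the data and then, for each region, expand those counts into one row per child (and one row per prospective parent), merging the per-filter DataFrames into a single consolidated DataFrame per region.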

In [214]:
def load_and_preprocess_data(base_dir, is_children_data=True):
    data = {}
    for region in os.listdir(base_dir):
        if region in region_translation:
            region_dir = os.path.join(base_dir, region)
            data[region_translation[region]] = {}
            for file in os.listdir(region_dir):
                if file.endswith('.xlsx'):
                    key = file.split('.')[0]
                    df = pd.read_excel(os.path.join(region_dir, file))
                    df['region'] = region_translation[region]
                    
                    # Apply translations
                    col_translations = translation_dict_columns_children if is_children_data else translation_dict_columns_parents
                    df = translate_dataframe(df, key, col_translations, translation_dict_content)
                    
                    # Preprocess children data
                    if is_children_data and 'available_for_adoption_linked' in df.columns and 'available_for_adoption_unlinked' in df.columns:
                        df['available_for_adoption'] = df['available_for_adoption_linked'] + df['available_for_adoption_unlinked']
                        df.drop(['available_for_adoption_linked', 'available_for_adoption_unlinked'], axis=1, inplace=True)
                    
                    data[region_translation[region]][key] = df
    return data

# Load and preprocess data
children_data = load_and_preprocess_data(base_dir_children, is_children_data=True)
parents_data = load_and_preprocess_data(base_dir_parents, is_children_data=False)

In [215]:
# Functions to generate IDs, distribute counts, and ensure columns exist
def generate_ids(prefix, count):
    return [f"{prefix}{str(i).zfill(3)}" for i in range(1, count + 1)]
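
A quick usage check of `generate_ids`, with the helper repeated so the snippet runs on its own. The prefix "C" is an arbitrary choice for illustration. Note that `zfill(3)` only pads up to three digits, which matters here because some regions have more than 999 records.

```python
def generate_ids(prefix, count):
    # Same logic as the notebook helper, repeated for a self-contained example.
    return [f"{prefix}{str(i).zfill(3)}" for i in range(1, count + 1)]

ids = generate_ids("C", 3)
print(ids)

# zfill(3) pads to at least three digits; larger numbers are left as they are,
# so the 1000th ID has four digits after the prefix.
wide = generate_ids("C", 1000)[-1]
print(wide)
```

Because the width is not fixed beyond 999, sorting such IDs as strings does not match their numeric order.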

In [216]:
def distribute_counts(df, count_column, id_prefix):
    distributed_rows = []
    for _, row in df.iterrows():
        count = int(row[count_column])
        for i in range(count):
            new_row = row.copy()
            new_row[f'{id_prefix.capitalize()}_ID'] = f"{new_row['region']}_{id_prefix}_{len(distributed_rows)+1:03d}"
            distributed_rows.append(new_row)
    return pd.DataFrame(distributed_rows)
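
`distribute_counts` builds the expanded frame row by row, which can be slow for large counts. As a hedged alternative, the sketch below does the same expansion with `Index.repeat`. The name `expand_counts` and the toy values are assumptions for illustration, and the approach assumes the input frame has a unique index.

```python
import pandas as pd

def expand_counts(df, count_column, id_prefix):
    # Repeat each index label `count` times, then select those rows.
    counts = df[count_column].astype(int).to_numpy()
    expanded = df.loc[df.index.repeat(counts)].reset_index(drop=True)
    # Sequential IDs across the whole frame, same format as distribute_counts.
    expanded[f'{id_prefix.capitalize()}_ID'] = [
        f"{region}_{id_prefix}_{i:03d}"
        for i, region in enumerate(expanded['region'], start=1)
    ]
    return expanded

# Toy input shaped like a by_gender table (values are made up).
toy = pd.DataFrame({
    'gender': ['Male', 'Female'],
    'available_for_adoption': [2, 1],
    'region': ['North', 'North'],
})
result = expand_counts(toy, 'available_for_adoption', 'child')
print(result)
```

Like `distribute_counts`, this keeps the count column on every expanded row; drop it afterwards if it is no longer needed.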

In [233]:
def ensure_columns_exist(df, columns):
    missing_columns = []
    for column in columns:
        if column not in df.columns:
            df[column] = None
            missing_columns.append(column)
    
    if missing_columns:
        warnings.warn(f"Missing columns added with default value None: {missing_columns}")
    
    return df
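
A short usage sketch for `ensure_columns_exist`, with the helper repeated so the snippet is self-contained. The toy frame is an assumption for illustration. It shows that the helper modifies the DataFrame it receives (the returned object is the same one) and emits a single warning listing the added columns.

```python
import warnings
import pandas as pd

def ensure_columns_exist(df, columns):
    # Copy of the notebook helper so the snippet runs on its own.
    missing_columns = []
    for column in columns:
        if column not in df.columns:
            df[column] = None
            missing_columns.append(column)
    if missing_columns:
        warnings.warn(f"Missing columns added with default value None: {missing_columns}")
    return df

original = pd.DataFrame({'gender': ['Male', 'Female']})

# Record warnings so the example can inspect them instead of printing to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ensure_columns_exist(original, ['gender', 'ethnicity'])

messages = [str(w.message) for w in caught if "Missing columns" in str(w.message)]
print(list(result.columns))
print(result is original)
print(messages)
```

Pass `df.copy()` if the caller's frame should stay untouched.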

In [234]:
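def create_consolidated_children_data_for_region(region, dfs):
    """Build one row per child for `region` by expanding the by_age counts,
    then attach one column per remaining filter whose total matches."""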
    column_mapping = {
        'siblings': 'sibling',
        'infectious_disease': 'infectious_disease',
        'disability': 'disability',
        'ethnicity': 'ethnicity',
        'gender': 'gender',
        'UF': 'UF',
        'disease': 'disease',
        'age_group': 'age_group'
    }
    
    if 'by_age' in dfs and 'available_for_adoption' in dfs['by_age'].columns:
        base_df = dfs['by_age'].copy()
        total_children = base_df['available_for_adoption'].sum()
        
        new_rows = []
        for _, row in base_df.iterrows():
            for i in range(int(row['available_for_adoption'])):
                new_row = row.copy()
                new_row['Child_ID'] = f"{region}_{len(new_rows)+1:03d}"
                new_rows.append(new_row)
        
        base_df = pd.DataFrame(new_rows).reset_index(drop=True)
        base_df = base_df.drop(columns=['available_for_adoption'])
        
        for filter_name, df in dfs.items():
            if filter_name != 'by_age' and 'available_for_adoption' in df.columns:
                column_name = column_mapping.get(filter_name.replace('by_', ''), filter_name.replace('by_', ''))
                total_count = df['available_for_adoption'].sum()
                if total_count == len(base_df):
                    new_column = []
                    for _, row in df.iterrows():
                        if column_name in row:
                            new_column.extend([row[column_name]] * int(row['available_for_adoption']))
                        else:
                            print(f"Warning: Column '{column_name}' not found in filter '{filter_name}' for region {region}")
                            new_column.extend([None] * int(row['available_for_adoption']))
                    base_df[column_name] = new_column
                else:
                    print(f"Warning: Total count mismatch for filter '{filter_name}' in region {region}. Expected {len(base_df)}, got {total_count}. Skipping this filter.")
        
        return base_df
    else:
        print(f"Warning: 'by_age' filter not found or 'available_for_adoption' column missing for region {region}")
        return pd.DataFrame()
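
The function above attaches each filter's expanded values to the base frame in row order. This reproduces the count of every category within each filter, but the pairing between attributes on a given row follows from the row order of the tables rather than from the source data, which only provides per-filter totals. A minimal sketch with made-up counts (the `expand` helper and the numbers are assumptions for illustration):

```python
from collections import Counter

def expand(pairs):
    # Turn [(label, count), ...] into a flat list with one entry per individual.
    return [label for label, count in pairs for _ in range(count)]

# Made-up marginal tables for four hypothetical children.
gender = expand([('Male', 2), ('Female', 2)])
disease = expand([('No', 3), ('Yes', 1)])

# Positional pairing, analogous to base_df[column_name] = new_column.
records = list(zip(gender, disease))

# Each marginal count is preserved ...
gender_counts = Counter(g for g, _ in records)
disease_counts = Counter(d for _, d in records)

# ... but the joint counts are determined purely by row order.
joint = Counter(records)
print(gender_counts, disease_counts)
print(joint)
```

In this toy case every 'Yes' lands on a 'Female' row only because of the ordering, so cross-tabulations of the consolidated frame should be read with that in mind.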

In [235]:
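def create_consolidated_parents_data_for_region(region, dfs):
    """Build one row per prospective parent for `region` by expanding the by_couple
    counts, then attach one column per remaining filter whose total matches."""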
    column_mapping = {
        'couple': 'couple',
        'accepted_quantity': 'accepted_quantity',
        'disability_preference': 'disability_preference',
        'infectious_disease_preference': 'infectious_disease_preference',
        'ethnicity_preference': 'ethnicity_preference',
        'gender_preference': 'gender_preference',
        'UF': 'UF',
        'adoption_type': 'adoption_type',
        'age_preference': 'age_preference',
        'disease_preference': 'disease_preference',
        'accepted_disease': 'disease_preference',
        'accepted_disability': 'disability_preference',
        'quantity_accepted': 'accepted_quantity',
        'accepted_ethnicity': 'ethnicity_preference',
        'accepted_gender': 'gender_preference',
        'accepted_age': 'age_preference'
    }
    
    if 'by_couple' in dfs and 'available_candidates' in dfs['by_couple'].columns:
        base_df = dfs['by_couple'].copy()
        total_parents = base_df['available_candidates'].sum()
        
        new_rows = []
        for _, row in base_df.iterrows():
            for i in range(int(row['available_candidates'])):
                new_row = row.copy()
                new_row['Parent_ID'] = f"{region}_{len(new_rows)+1:03d}"
                new_rows.append(new_row)
        
        base_df = pd.DataFrame(new_rows).reset_index(drop=True)
        base_df = base_df.drop(columns=['available_candidates'])
        
        for filter_name, df in dfs.items():
            if filter_name != 'by_couple' and 'available_candidates' in df.columns:
                column_name = column_mapping.get(filter_name.replace('by_', ''), filter_name.replace('by_', ''))
                total_count = df['available_candidates'].sum()
                if total_count == len(base_df):
                    new_column = []
                    for _, row in df.iterrows():
                        if column_name in row:
                            new_column.extend([row[column_name]] * int(row['available_candidates']))
                        else:
                            print(f"Warning: Column '{column_name}' not found in filter '{filter_name}' for region {region}")
                            new_column.extend([None] * int(row['available_candidates']))
                    base_df[column_name] = new_column
                else:
                    print(f"Warning: Total count mismatch for filter '{filter_name}' in region {region}. Expected {len(base_df)}, got {total_count}. Skipping this filter.")
        
        return base_df
    else:
        print(f"Warning: 'by_couple' filter not found or 'available_candidates' column missing for region {region}")
        return pd.DataFrame()

In [236]:
def consolidate_data_for_all_regions(children_data, parents_data):
    # The dictionaries returned by load_and_preprocess_data are keyed by the
    # translated region names (e.g. 'Southeast', 'General'), so the keys are
    # used directly rather than looked up in region_translation again.
    consolidated_data = {}
    
    for region in tqdm(children_data.keys(), desc="Processing children data for all regions"):
        if region != 'General':  # Skip 'General' region for children
            consolidated_data[f"children_{region}"] = create_consolidated_children_data_for_region(region, children_data[region])
    
    for region in tqdm(parents_data.keys(), desc="Processing parents data for all regions"):
        if region != 'General':  # Skip 'General' region for parents
            consolidated_data[f"parents_{region}"] = create_consolidated_parents_data_for_region(region, parents_data[region])
    
    # Handle 'General' region separately if needed
    if 'General' in children_data:
        consolidated_data["children_General"] = create_consolidated_children_data_for_region('General', children_data['General'])
    if 'General' in parents_data:
        consolidated_data["parents_General"] = create_consolidated_parents_data_for_region('General', parents_data['General'])
    
    return consolidated_data

In [237]:
def process_general_data(data, count_column):
    processed_data = []
    for key, df in data.items():
        if count_column in df.columns:
            for _, row in df.iterrows():
                count = int(row[count_column])
                for _ in range(count):
                    new_row = row.drop(count_column).to_dict()
                    processed_data.append(new_row)
    return pd.DataFrame(processed_data)

In [238]:
# Consolidate data for all regions
consolidated_data = {}

for region, dfs in children_data.items():
    if region != 'General':
        print(f"Processing children data for region: {region}")
        consolidated_data[f"children_{region}"] = create_consolidated_children_data_for_region(region, dfs)

for region, dfs in parents_data.items():
    if region != 'General':
        print(f"Processing parents data for region: {region}")
        consolidated_data[f"parents_{region}"] = create_consolidated_parents_data_for_region(region, dfs)

# Display basic info about the consolidated dataframes for each region
for key, df in consolidated_data.items():
    print(f"\n{key} Data:")
    print(df.info())
    display(df.head())

Processing children data for region: Southeast
Processing children data for region: Northeast
Processing children data for region: South
Processing children data for region: North
Processing children data for region: Midwest
Processing parents data for region: Southeast
Processing parents data for region: Northeast
Processing parents data for region: South
Processing parents data for region: North
Processing parents data for region: Midwest



children_Southeast Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2207 entries, 0 to 2206
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age_group  2207 non-null   object
 1   region     2207 non-null   object
 2   Child_ID   2207 non-null   object
dtypes: object(3)
memory usage: 51.9+ KB
None


Unnamed: 0,age_group,region,Child_ID
0,0-2 years,Southeast,Southeast_001
1,0-2 years,Southeast,Southeast_002
2,0-2 years,Southeast,Southeast_003
3,0-2 years,Southeast,Southeast_004
4,0-2 years,Southeast,Southeast_005



children_Northeast Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 844 entries, 0 to 843
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   age_group           844 non-null    object
 1   region              844 non-null    object
 2   Child_ID            844 non-null    object
 3   disability          844 non-null    object
 4   infectious_disease  844 non-null    object
 5   ethnicity           844 non-null    object
 6   gender              844 non-null    object
 7   UF                  844 non-null    object
 8   sibling             844 non-null    object
 9   disease             844 non-null    object
dtypes: object(10)
memory usage: 66.1+ KB
None


Unnamed: 0,age_group,region,Child_ID,disability,infectious_disease,ethnicity,gender,UF,sibling,disease
0,0-2 years,Northeast,Northeast_001,No Disability,No,Mixed,Male,AL,No Siblings,No
1,0-2 years,Northeast,Northeast_002,No Disability,No,Mixed,Male,AL,No Siblings,No
2,0-2 years,Northeast,Northeast_003,No Disability,No,Mixed,Male,AL,No Siblings,No
3,0-2 years,Northeast,Northeast_004,No Disability,No,Mixed,Male,AL,No Siblings,No
4,0-2 years,Northeast,Northeast_005,No Disability,No,Mixed,Male,AL,No Siblings,No



children_South Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1174 entries, 0 to 1173
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age_group  1174 non-null   object
 1   region     1174 non-null   object
 2   Child_ID   1174 non-null   object
dtypes: object(3)
memory usage: 27.6+ KB
None


Unnamed: 0,age_group,region,Child_ID
0,0-2 years,South,South_001
1,0-2 years,South,South_002
2,0-2 years,South,South_003
3,0-2 years,South,South_004
4,0-2 years,South,South_005



children_North Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   age_group           215 non-null    object
 1   region              215 non-null    object
 2   Child_ID            215 non-null    object
 3   disability          215 non-null    object
 4   infectious_disease  215 non-null    object
 5   ethnicity           215 non-null    object
 6   gender              215 non-null    object
 7   UF                  215 non-null    object
 8   sibling             215 non-null    object
 9   disease             215 non-null    object
dtypes: object(10)
memory usage: 16.9+ KB
None


Unnamed: 0,age_group,region,Child_ID,disability,infectious_disease,ethnicity,gender,UF,sibling,disease
0,0-2 years,North,North_001,No Disability,No,Mixed,Male,AC,No Siblings,No
1,0-2 years,North,North_002,No Disability,No,Mixed,Male,AC,No Siblings,No
2,0-2 years,North,North_003,No Disability,No,Mixed,Male,AC,No Siblings,No
3,0-2 years,North,North_004,No Disability,No,Mixed,Male,AC,No Siblings,No
4,0-2 years,North,North_005,No Disability,No,Mixed,Male,AC,No Siblings,No



children_Midwest Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   age_group           390 non-null    object
 1   region              390 non-null    object
 2   Child_ID            390 non-null    object
 3   disability          390 non-null    object
 4   infectious_disease  390 non-null    object
 5   ethnicity           390 non-null    object
 6   gender              390 non-null    object
 7   UF                  390 non-null    object
 8   sibling             390 non-null    object
 9   disease             390 non-null    object
dtypes: object(10)
memory usage: 30.6+ KB
None


Unnamed: 0,age_group,region,Child_ID,disability,infectious_disease,ethnicity,gender,UF,sibling,disease
0,0-2 years,Midwest,Midwest_001,No Disability,No,Mixed,Male,DF,No Siblings,No
1,0-2 years,Midwest,Midwest_002,No Disability,No,Mixed,Male,DF,No Siblings,No
2,0-2 years,Midwest,Midwest_003,No Disability,No,Mixed,Male,DF,No Siblings,No
3,0-2 years,Midwest,Midwest_004,No Disability,No,Mixed,Male,DF,No Siblings,No
4,0-2 years,Midwest,Midwest_005,No Disability,No,Mixed,Male,DF,No Siblings,No



parents_Southeast Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17744 entries, 0 to 17743
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   couple                       17744 non-null  object
 1   region                       17744 non-null  object
 2   Parent_ID                    17744 non-null  object
 3   disease_preference           17744 non-null  object
 4   disability_preference        17744 non-null  object
 5   accepted_quantity            17744 non-null  object
 6   gender                       0 non-null      object
 7   UF                           17744 non-null  object
 8   age                          0 non-null      object
 9   infectious_disease_accepted  0 non-null      object
dtypes: object(10)
memory usage: 1.4+ MB
None


Unnamed: 0,couple,region,Parent_ID,disease_preference,disability_preference,accepted_quantity,gender,UF,age,infectious_disease_accepted
0,Yes,Southeast,Southeast_001,Não,No Disability,1,,ES,,
1,Yes,Southeast,Southeast_002,Não,No Disability,1,,ES,,
2,Yes,Southeast,Southeast_003,Não,No Disability,1,,ES,,
3,Yes,Southeast,Southeast_004,Não,No Disability,1,,ES,,
4,Yes,Southeast,Southeast_005,Não,No Disability,1,,ES,,



parents_Northeast Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5489 entries, 0 to 5488
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   couple              5489 non-null   object
 1   region              5489 non-null   object
 2   Parent_ID           5489 non-null   object
 3   quantity            0 non-null      object
 4   disability          0 non-null      object
 5   infectious_disease  0 non-null      object
 6   gender              0 non-null      object
 7   UF                  5489 non-null   object
 8   type                0 non-null      object
 9   age                 0 non-null      object
 10  disease             5489 non-null   object
dtypes: object(11)
memory usage: 471.8+ KB
None


Unnamed: 0,couple,region,Parent_ID,quantity,disability,infectious_disease,gender,UF,type,age,disease
0,Yes,Northeast,Northeast_001,,,,,AL,,,No
1,Yes,Northeast,Northeast_002,,,,,AL,,,No
2,Yes,Northeast,Northeast_003,,,,,AL,,,No
3,Yes,Northeast,Northeast_004,,,,,AL,,,No
4,Yes,Northeast,Northeast_005,,,,,AL,,,No



parents_South Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9324 entries, 0 to 9323
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   couple                       9324 non-null   object
 1   region                       9324 non-null   object
 2   Parent_ID                    9324 non-null   object
 3   disability                   0 non-null      object
 4   type                         0 non-null      object
 5   infectious_disease_accepted  0 non-null      object
 6   disease                      9324 non-null   object
dtypes: object(7)
memory usage: 510.0+ KB
None


Unnamed: 0,couple,region,Parent_ID,disability,type,infectious_disease_accepted,disease
0,Yes,South,South_001,,,,No
1,Yes,South,South_002,,,,No
2,Yes,South,South_003,,,,No
3,Yes,South,South_004,,,,No
4,Yes,South,South_005,,,,No



parents_North Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1204 entries, 0 to 1203
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   couple              1204 non-null   object
 1   region              1204 non-null   object
 2   Parent_ID           1204 non-null   object
 3   quantity            0 non-null      object
 4   disability          0 non-null      object
 5   infectious_disease  0 non-null      object
 6   gender              0 non-null      object
 7   UF                  1204 non-null   object
 8   type                0 non-null      object
 9   age                 0 non-null      object
 10  disease             1204 non-null   object
dtypes: object(11)
memory usage: 103.6+ KB
None


Unnamed: 0,couple,region,Parent_ID,quantity,disability,infectious_disease,gender,UF,type,age,disease
0,Yes,North,North_001,,,,,AC,,,No
1,Yes,North,North_002,,,,,AC,,,No
2,Yes,North,North_003,,,,,AC,,,No
3,Yes,North,North_004,,,,,AC,,,No
4,Yes,North,North_005,,,,,AC,,,No



parents_Midwest Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2422 entries, 0 to 2421
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   couple              2422 non-null   object
 1   region              2422 non-null   object
 2   Parent_ID           2422 non-null   object
 3   quantity            0 non-null      object
 4   disability          0 non-null      object
 5   infectious_illness  0 non-null      object
 6   gender              0 non-null      object
 7   UF                  2422 non-null   object
 8   type                0 non-null      object
 9   age                 0 non-null      object
 10  disease             2422 non-null   object
dtypes: object(11)
memory usage: 208.3+ KB
None


Unnamed: 0,couple,region,Parent_ID,quantity,disability,infectious_illness,gender,UF,type,age,disease
0,Yes,Midwest,Midwest_001,,,,,DF,,,No
1,Yes,Midwest,Midwest_002,,,,,DF,,,No
2,Yes,Midwest,Midwest_003,,,,,DF,,,No
3,Yes,Midwest,Midwest_004,,,,,DF,,,No
4,Yes,Midwest,Midwest_005,,,,,DF,,,No


In [94]:
# Functions to create consolidated data
def create_consolidated_children_data(children_data):
    required_columns = ['Child_ID', 'region', 'UF', 'gender', 'age_group', 'disability', 'infectious_disease', 'ethnicity', 'sibling', 'disease']
    frames = []
    for region, dfs in children_data.items():
        for key, df in dfs.items():
            if 'available_for_adoption' in df.columns:
                # Expand each count row into one row per child
                df = distribute_counts(df, 'available_for_adoption', f"{region}_child", 'Child_ID')
                df = ensure_columns_exist(df, required_columns)
                frames.append(df[required_columns])
    # Concatenate once instead of growing a DataFrame inside the loop
    if not frames:
        return pd.DataFrame(columns=required_columns)
    return pd.concat(frames, ignore_index=True)

In [95]:
def create_consolidated_parents_data(parents_data):
    required_columns = ['Parent_ID', 'region', 'UF', 'couple', 'gender_preference', 'age_preference', 'ethnicity_preference', 'accepted_quantity', 'disability_preference', 'disease_preference', 'infectious_disease_preference', 'adoption_type']
    frames = []
    for region, dfs in parents_data.items():
        for key, df in dfs.items():
            if 'available_candidates' in df.columns:
                # Expand each count row into one row per prospective parent
                df = distribute_counts(df, 'available_candidates', f"{region}_parent", 'Parent_ID')
                df = ensure_columns_exist(df, required_columns)
                frames.append(df[required_columns])
    # Concatenate once instead of growing a DataFrame inside the loop
    if not frames:
        return pd.DataFrame(columns=required_columns)
    return pd.concat(frames, ignore_index=True)
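
The two consolidation functions above depend on distribute_counts and ensure_columns_exist, which are defined elsewhere in the notebook. As a rough illustration of the idea they rely on (turning an aggregated count row into one row per individual, then padding the missing columns), here is a minimal sketch. The helpers expand_counts and add_missing_columns are hypothetical stand-ins, not the notebook's actual implementations, and the ID format and toy values are assumptions.

```python
import pandas as pd

def expand_counts(df, count_col, id_prefix, id_col):
    """Hypothetical stand-in: repeat each row `count_col` times and add sequential IDs."""
    # Coerce counts to non-negative integers; unparseable values count as 0
    counts = pd.to_numeric(df[count_col], errors='coerce').fillna(0).clip(lower=0).astype(int)
    # Repeat each index label by its count, then drop the now-redundant count column
    expanded = df.loc[df.index.repeat(counts.to_numpy())].drop(columns=[count_col])
    expanded = expanded.reset_index(drop=True)
    expanded[id_col] = [f"{id_prefix}_{i}" for i in range(1, len(expanded) + 1)]
    return expanded

def add_missing_columns(df, required_columns):
    """Hypothetical stand-in: add any required column that is absent, filled with None."""
    df = df.copy()
    for col in required_columns:
        if col not in df.columns:
            df[col] = None
    return df

# Toy input (made-up numbers): two aggregated rows become three individual rows
toy = pd.DataFrame({'gender': ['Male', 'Female'], 'available_for_adoption': [2, 1]})
toy_expanded = expand_counts(toy, 'available_for_adoption', 'toy_child', 'Child_ID')
toy_expanded = add_missing_columns(toy_expanded, ['Child_ID', 'gender', 'UF'])
print(toy_expanded)
```

Because each source file aggregates by a single attribute, rows expanded this way only carry that one attribute; the other columns stay empty, which is consistent with the sparse non-null counts in the consolidated info() output.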

In [96]:
# Preprocess children data
children_data = preprocess_data(children_data)

In [97]:
# Consolidate data for all regions
consolidated_data = {}

for region, dfs in children_data.items():
    if region != 'General':
        print(f"Processing children data for region: {region}")
        consolidated_data[f"children_{region}"] = create_consolidated_children_data_for_region(region, dfs)

for region, dfs in parents_data.items():
    if region != 'General':
        print(f"Processing parents data for region: {region}")
        consolidated_data[f"parents_{region}"] = create_consolidated_parents_data_for_region(region, dfs)

# Display basic info about the consolidated dataframes for each region
for key, df in consolidated_data.items():
    print(f"\n{key} Data:")
    print(df.info())
    display(df.head())

Children Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82200 entries, 0 to 82199
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Child_ID            82200 non-null  object
 1   region              82200 non-null  object
 2   UF                  9672 non-null   object
 3   gender              9672 non-null   object
 4   age_group           9660 non-null   object
 5   disability          9672 non-null   object
 6   infectious_disease  9672 non-null   object
 7   ethnicity           9672 non-null   object
 8   sibling             9672 non-null   object
 9   disease             9672 non-null   object
dtypes: object(10)
memory usage: 6.3+ MB
None


Unnamed: 0,Child_ID,region,UF,gender,age_group,disability,infectious_disease,ethnicity,sibling,disease
0,centro_oeste_child_1,Midwest,,,,No Disability,,,,
1,centro_oeste_child_2,Midwest,,,,No Disability,,,,
2,centro_oeste_child_3,Midwest,,,,No Disability,,,,
3,centro_oeste_child_4,Midwest,,,,No Disability,,,,
4,centro_oeste_child_5,Midwest,,,,No Disability,,,,



Parents Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717676 entries, 0 to 717675
Data columns (total 12 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Parent_ID                      717676 non-null  object
 1   region                         717676 non-null  object
 2   UF                             72427 non-null   object
 3   couple                         72431 non-null   object
 4   gender_preference              72427 non-null   object
 5   age_preference                 72427 non-null   object
 6   ethnicity_preference           102106 non-null  object
 7   accepted_quantity              72427 non-null   object
 8   disability_preference          72431 non-null   object
 9   disease_preference             17744 non-null   object
 10  infectious_disease_preference  72431 non-null   object
 11  adoption_type                  72386 non-null   object
dtypes: object(12)
memory usage: 6

Unnamed: 0,Parent_ID,region,UF,couple,gender_preference,age_preference,ethnicity_preference,accepted_quantity,disability_preference,disease_preference,infectious_disease_preference,adoption_type
0,centro_oeste_parent_1,Midwest,,Yes,,,,,,,,
1,centro_oeste_parent_2,Midwest,,Yes,,,,,,,,
2,centro_oeste_parent_3,Midwest,,Yes,,,,,,,,
3,centro_oeste_parent_4,Midwest,,Yes,,,,,,,,
4,centro_oeste_parent_5,Midwest,,Yes,,,,,,,,


In [86]:
display(consolidated_children)

Unnamed: 0,Child_ID,region,UF,gender,age_group,disability,infectious_disease,ethnicity,sibling,disease
0,centro_oeste_child_001,Midwest,,,,No Disability,,,,
1,centro_oeste_child_002,Midwest,,,,No Disability,,,,
2,centro_oeste_child_003,Midwest,,,,No Disability,,,,
3,centro_oeste_child_004,Midwest,,,,No Disability,,,,
4,centro_oeste_child_005,Midwest,,,,No Disability,,,,
...,...,...,...,...,...,...,...,...,...,...
72523,general_child_969,General,,,,,,,,Yes
72524,general_child_970,General,,,,,,,,Yes
72525,general_child_971,General,,,,,,,,Yes
72526,general_child_972,General,,,,,,,,Yes


In [None]:
def consolidate_data(data_dict, data_type):
    consolidated = []
    all_columns = set()

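    # First pass to collect all unique columns
    for region, dfs in data_dict.items():
        for df in dfs.values():
            all_columns.update(df.columns)
    # Include the identifier columns added below so the column selection keeps them
    all_columns.update(['region', 'data_type', 'data_subtype'])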

    # Second pass to ensure all dataframes have the same columns
    for region, dfs in data_dict.items():
        for key, df in dfs.items():
            # Reset the index to ensure uniqueness
            df = df.reset_index(drop=True)
            # Add columns to identify the source of each row
            df['region'] = region_translation[region]  # Use the translated region names
            df['data_type'] = data_type
            df['data_subtype'] = key

            # Add missing columns as NaN
            for col in all_columns:
                if col not in df.columns:
                    df[col] = None

            # Ensure all columns are in the same order
            df = df[sorted(all_columns)]

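            # Drop only rows that are entirely empty. A plain dropna() would
            # discard every row, because the padded columns above are all None.
            df = df.dropna(how='all')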
            consolidated.append(df)
    
    # Concatenate all DataFrames
    result = pd.concat(consolidated, ignore_index=True)
    
    # Ensure all column names are strings
    result.columns = result.columns.astype(str)
    
    return result

# Consolidate children and parents data
children_consolidated = consolidate_data(children_data, "children")
parents_consolidated = consolidate_data(parents_data, "parents")

# Display basic info about the consolidated dataframes
print("Children Data:")
print(children_consolidated.info())
print("\nParents Data:")
print(parents_consolidated.info())
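
Because each per-attribute file carries only its own attribute column, a stacked frame is mostly NaN outside that column, so the arguments passed to dropna matter. The toy frames below (made-up values, not the notebook's data) compare dropna() with dropna(how='all'). They also show that pd.concat pads missing columns with NaN on its own.

```python
import pandas as pd

# Two per-attribute tables that share only the count column
by_gender = pd.DataFrame({'gender': ['Male', 'Female'], 'count': [3, 2]})
by_age = pd.DataFrame({'age_group': ['0-2 years'], 'count': [5]})

# concat takes the union of the columns and fills the gaps with NaN
stacked = pd.concat([by_gender, by_age], ignore_index=True)

# Every row is missing either gender or age_group
rows_after_any = len(stacked.dropna())           # drops rows with any NaN
rows_after_all = len(stacked.dropna(how='all'))  # drops only fully empty rows

print(stacked)
print(rows_after_any, rows_after_all)
```

In this toy case the default dropna() leaves no rows at all, while how='all' keeps all three.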

In [28]:
def reshape_children_data(df):
    columns = {
        'region': 'region',
        'UF': 'UF',
        'gender': 'gender',
        'age_group': 'age_group',
        'disability': 'disability',
        'infectious_disease': 'infectious_disease',
        'ethnicity': 'ethnicity',
        'sibling': 'sibling',
        'disease': 'disease'
    }
    # Select the relevant columns and rename them. reindex adds any missing
    # column as NaN instead of raising a KeyError, so no separate check is needed.
    df = df.reindex(columns=list(columns.keys())).rename(columns=columns)
    return df

# Reshape the children data
children_final = reshape_children_data(children_consolidated)

print("Final Children Data:")
print(children_final.info())
display(children_final.head())

NameError: name 'children_consolidated' is not defined

In [78]:
# Function to display top 10 rows of a dataframe
def display_top_10(df, title):
    print(f"\nTop 10 rows of {title}:")
    display(df.head(10))
    
# Display top 10 rows of each consolidated dataframe
display_top_10(children_consolidated, "Children Data")
display_top_10(parents_consolidated, "Parents Data")



Top 10 rows of Children Data:


Unnamed: 0,disability,available_for_adoption,Region,Child_ID,infectious_disease,ethnicity,gender,UF,available_for_adoption_linked,available_for_adoption_unlinked,sibling,age_group,disease,region
0,No Disability,331.0,Midwest,Child0001,,,,,,,,,,
1,Intellectual Disability,35.0,Midwest,Child0002,,,,,,,,,,
2,Physical and Intellectual Disability,18.0,Midwest,Child0003,,,,,,,,,,
3,Physical Disability,6.0,Midwest,Child0004,,,,,,,,,,
4,,387.0,Midwest,Child0005,No,,,,,,,,,
5,,3.0,Midwest,Child0006,Yes,,,,,,,,,
6,,236.0,Midwest,Child0007,,Mixed,,,,,,,,
7,,75.0,Midwest,Child0008,,White,,,,,,,,
8,,63.0,Midwest,Child0009,,Black,,,,,,,,
9,,15.0,Midwest,Child0010,,Indigena,,,,,,,,



Top 10 rows of Parents Data:


Unnamed: 0,couple,available_candidates,Region,Parent_ID,accepted_quantity,disability_preference,infectious_disease_preference,ethnicity,gender,UF,adoption_type,age_preference,disease
0,Yes,2091,Midwest,Parent0001,,,,,,,,,
1,No,331,Midwest,Parent0002,,,,,,,,,
2,,1395,Midwest,Parent0003,1,,,,,,,,
3,,924,Midwest,Parent0004,2,,,,,,,,
4,,103,Midwest,Parent0005,Above,,,,,,,,
5,,2288,Midwest,Parent0006,,No Disability,,,,,,,
6,,85,Midwest,Parent0007,,Physical Disability,,,,,,,
7,,43,Midwest,Parent0008,,Physical and Intellectual Disability,,,,,,,
8,,6,Midwest,Parent0009,,Intellectual Disability,,,,,,,
9,,2255,Midwest,Parent0010,,,No,,,,,,


In [12]:
# directory for the data
base_dir_children = 'data/criancas_para_adocao/'
base_dir_parents = 'data/prospective_adoptive_parents/'

# Function to dynamically rename columns based on the file name
def rename_columns_based_on_filename(df, filename):
    if 'by_infectious_disease' in filename:
        df = df.rename(columns={'Doença': 'infectious_disease'})
    elif 'by_disease' in filename:
        df = df.rename(columns={'Doença': 'disease_status'})
    elif 'by_disability' in filename:
        df = df.rename(columns={'Deficiência': 'disability_status'})
    elif 'by_siblings' in filename:
        df = df.rename(columns={'Irmãos': 'sibling_status'})
    return df

# Function to load data from each regional directory
def load_region_data(base_dir, region):
    region_dir = os.path.join(base_dir, region)
    data = {}
    for file in os.listdir(region_dir):
        if file.endswith('.xlsx'):
            key = file.split('.')[0]
            df = pd.read_excel(os.path.join(region_dir, file))
            df = rename_columns_based_on_filename(df, file)
            df['Region'] = region
            data[key] = df
    return data

# Load data from all regions for children and parents
regions = ['centro_oeste', 'nordeste', 'norte', 'sudeste', 'sul', 'general']
children_data = {region: load_region_data(base_dir_children, region) for region in regions}
parents_data = {region: load_region_data(base_dir_parents, region) for region in regions}

In [9]:
# Translation mappings
translation_dict = {
    'age_group': {
        'Até 2 anos': '0-2 years', 'De 2 a 4 anos': '2-4 years', 'De 4 a 6 anos': '4-6 years',
        'De 6 a 8 anos': '6-8 years', 'De 8 a 10 anos': '8-10 years', 'De 10 a 12 anos': '10-12 years',
        'De 12 a 14 anos': '12-14 years', 'De 14 a 16 anos': '14-16 years', 'Maior de 16 anos': '16+ years'
    },
    'gender': {
        'Masculino': 'Male', 'Feminino': 'Female'
    },
    'disease_status': {
        'Não': 'No', 'Sim': 'Yes'
    },
    'sibling_status': {
        'Sem Irmão': 'No Siblings', 'Um Irmão': 'One Sibling', 'Dois Irmãos': 'Two Siblings',
        'Três Irmãos': 'Three Siblings', 'Mais de 3 Irmãos': 'More than Three Siblings'
    },
    'disability_status': {
        'Sem Deficiência': 'No Disability', 'Deficiência Intelectual': 'Intellectual Disability',
        'Deficiência Física e Intelectual': 'Physical and Intellectual Disability', 'Deficiência Física': 'Physical Disability'
    },
    'ethnicity': {
        'Parda': 'Mixed', 'Branca': 'White', 'Preta': 'Black', 'Indígena': 'Indigenous', 'Amarela': 'Asian', 'Não Informada': 'Not Informed'
    },
    'region': {
        'Centro Oeste': 'Midwest', 'Nordeste': 'Northeast', 'Norte': 'North', 'Sudeste': 'Southeast', 'Sul': 'South'
    },
    'available_for_adoption': {
        'Disponível - vinculada a pretendente': 'Available - Linked to Applicant',
        'Disponível - não vinculada a pretendente': 'Available - Not Linked to Applicant'
    }
    # Add more translations as needed
}

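# Function to translate column values
def translate_columns(df, col_name, translation_dict):
    if col_name in df.columns and col_name in translation_dict:
        mapping = translation_dict[col_name]
        # Mapping with a dict turns unmapped values (including numeric counts)
        # into NaN, so fall back to the original value when there is no translation
        df[col_name] = df[col_name].map(lambda value: mapping.get(value, value))
    return df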

# Function to standardize and translate columns
def standardize_and_translate_columns(df, col_mapping, translation_dict):
    df = df.rename(columns=col_mapping)
    for col in col_mapping.values():
        df = translate_columns(df, col, translation_dict)
    return df

# Example column mappings for children data
col_mappings_children = {
    'Disponíveis para Adoção': 'available_for_adoption',
    'Fx. Etária': 'age_group',
    'Gênero': 'gender',
    'Etnia': 'ethnicity',
    'Deficiência': 'disability_status',
    'Doença': 'disease_status',
    'Irmãos': 'sibling_status',
    'UF': 'UF',
    'Região': 'region',
    # Distinct target names avoid creating two columns called 'available_for_adoption'
    'Disponível - vinculada a pretendente': 'available_for_adoption_linked',
    'Disponível - não vinculada a pretendente': 'available_for_adoption_unlinked',
}

# Example column mappings for parents data
col_mappings_parents = {
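    'Disponíveis para Adoção': 'available_for_adoption',
    # The parents files report their counts under 'Pretendentes Disponíveis'
    'Pretendentes Disponíveis': 'available_candidates',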
    'Fx. Etária': 'age_group',
    'Gênero': 'gender',
    'Etnia': 'ethnicity',
    'Deficiência': 'disability_preference',
    'Doença': 'disease_preference',
    'Irmãos': 'sibling_preference',
    'UF': 'UF',
    'Região': 'region'
}

# Standardize and translate columns for each dataframe in all regions
for region, data in children_data.items():
    for key, df in data.items():
        children_data[region][key] = standardize_and_translate_columns(df, col_mappings_children, translation_dict)

for region, data in parents_data.items():
    for key, df in data.items():
        parents_data[region][key] = standardize_and_translate_columns(df, col_mappings_parents, translation_dict)
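
Series.map with a dictionary returns NaN for any value that has no key in the dictionary, so a translation table that misses a category, or that is applied to a numeric column, silently blanks it. The sketch below uses a made-up series and a hypothetical helper, translate_keep_unmapped, to compare plain map with a variant that keeps the original value when no translation exists.

```python
import pandas as pd

gender_translation = {'Masculino': 'Male', 'Feminino': 'Female'}

# 'Qualquer' (any) appears in the parents' gender preferences but has no translation here
raw = pd.Series(['Masculino', 'Feminino', 'Qualquer'])

def translate_keep_unmapped(series, mapping):
    """Hypothetical helper: translate known values, leave unknown ones unchanged."""
    return series.map(lambda value: mapping.get(value, value))

plain = raw.map(gender_translation)                     # unmapped value becomes NaN
kept = translate_keep_unmapped(raw, gender_translation)  # unmapped value is preserved

print(plain.tolist())
print(kept.tolist())
```

Keeping the untranslated value makes gaps in the translation table visible in the output instead of hiding them as missing data.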

In [15]:
# Function to preview data from a specific region and key
def preview_data(data, region, key, num_rows=5):
    print(f"Preview of {key} data for {region} region:")
    display(data[region][key].head(num_rows))

# Preview some data for children
preview_data(children_data, 'general', 'by_age')
preview_data(children_data, 'general', 'by_gender')

# Preview some data for prospective adoptive parents
preview_data(parents_data, 'general', 'by_accepted_age')
preview_data(parents_data, 'general', 'by_accepted_gender')

Preview of by_age data for general region:


Unnamed: 0,Fx. Etária,Disponíveis para Adoção,Region
0,Até 2 anos,393,general
1,De 2 a 4 anos,292,general
2,De 4 a 6 anos,372,general
3,De 6 a 8 anos,396,general
4,De 8 a 10 anos,488,general


Preview of by_gender data for general region:


Unnamed: 0,Gênero,Disponíveis para Adoção,Region
0,Masculino,2579,general
1,Feminino,2257,general


Preview of by_accepted_age data for general region:


Unnamed: 0,Idade,Pretendentes Disponíveis,Region
0,Até 2 anos,6228,general
1,De 2 a 4 anos,11606,general
2,De 4 a 6 anos,11240,general
3,De 6 a 8 anos,4989,general
4,De 8 a 10 anos,1412,general


Preview of by_accepted_gender data for general region:


Unnamed: 0,Gênero,Pretendentes Disponíveis,Region
0,Qualquer,24949,general
1,Feminino,8689,general
2,Masculino,2610,general


In [None]:
# Function to standardize column names and ensure consistency
def standardize_columns(df, col_mapping):
    df = df.rename(columns=col_mapping)
    return df

# Example column mappings (add more as needed)
col_mappings = {
    'Disponíveis para Adoção': 'Available_for_Adoption',
    'Fx. Etária': 'Age_Group',
    'Gênero': 'Gender',
    'Etnia': 'Ethnicity',
    'Deficiência': 'Disability_Status',
    'Doença': 'Disease_Status',
    'Irmãos': 'Sibling_Status',
    'UF': 'UF',
    'Região': 'Region',
    'Tipo': 'Adoption_Preference',
    'Casal?': 'Marital_Status',
    'Doença infectocontagiosa': 'Infectious_Disease_Preference'
}

# Standardize columns for each dataframe in all regions
for region, data in children_data.items():
    for key, df in data.items():
        children_data[region][key] = standardize_columns(df, col_mappings)

for region, data in parents_data.items():
    for key, df in data.items():
        parents_data[region][key] = standardize_columns(df, col_mappings)