This document describes the columns of the datasets, given that:
- Column names are inconsistent between files.
- Some data is duplicated between datasets.
- No relational diagram was provided.

This document also explains some of the column classifications defined in the local package module `core_ds4a_project.columns`.

Setting up the notebook:

In [1]:
import glob
import json
import os

import pandas as pd
import numpy as np

from dotenv import load_dotenv

from core_ds4a_project import columns as project_columns
from core_ds4a_project.cleaning import normalize_columns_name, unique_columns_from_dataframes

%load_ext autoreload
%autoreload 1
%aimport core_ds4a_project, core_ds4a_project.cleaning, core_ds4a_project.columns


pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 320)


# Environment variables
load_dotenv('envvars')
ROOT_DATA_PATH = os.environ.get('ROOT_DATA_PATH')
RAW_DATA_PATH = os.environ.get('RAW_DATA_PATH') or f'{ROOT_DATA_PATH}/raw'
CLEAN_DATA_PATH = os.environ.get('CLEAN_DATA_PATH') or f'{ROOT_DATA_PATH}/clean'

Reading data:

> Note: ideally, columns would be normalized right after reading each dataframe. That is not done here because the raw column names are needed to build the column renaming dictionary.

In [2]:
def read_csv(file):
    df = pd.read_csv(file, sep=';', encoding="ISO-8859-1", nrows=1)
    return df


dataset_dfs_dict = {}
dataset_files_dict = {}  # used in later sections

file = glob.glob(f'{RAW_DATA_PATH}/*COLOCACION*.xlsx')[0]  # single file dataset
df = pd.read_excel(file, nrows=1, usecols='A:BB')
dataset_dfs_dict['COLOCACION_NEW'] = [df]

# csv_datasets = ['CARTERA', 'CONTACTO', 'NEGOCIO']
csv_datasets = ['CARTERA', 'COLOCACION', 'CONTACTO', 'NEGOCIO']
for dataset in csv_datasets:
    files = glob.glob(f'{RAW_DATA_PATH}/*{dataset}*.csv')
    dfs = [read_csv(f) for f in files]

    dataset_dfs_dict[dataset] = dfs
    dataset_files_dict[dataset] = files

## Columns renaming dictionary

In [3]:
dataframes = [pd.concat(dfs) for dfs in dataset_dfs_dict.values()]
raw_columns = unique_columns_from_dataframes(dataframes)
normalized_columns = normalize_columns_name(raw_columns)

renaming_dict = {raw: norm for (raw, norm) in zip(raw_columns, normalized_columns)}
renaming_file = f'{ROOT_DATA_PATH}/dict-renaming-raw-columns.json'

with open(renaming_file, 'w', encoding='utf-8') as file:
    json.dump(renaming_dict, file, indent=4)
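
As a minimal sketch of how the saved dictionary could be used later, the snippet below writes a small renaming dictionary to a temporary JSON file, reads it back, and applies it with `DataFrame.rename`. The raw column names and their normalized forms are made-up examples; the actual output of `normalize_columns_name` is not shown in this document.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical raw -> normalized names, for illustration only
example_renaming = {"Fecha Desembolso": "FECHA_DESEMBOLSO", "Cod. Linea": "COD_LINEA"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dict-renaming-raw-columns.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(example_renaming, fh, indent=4)
    with open(path, encoding="utf-8") as fh:
        loaded_renaming = json.load(fh)

raw_df = pd.DataFrame({"Fecha Desembolso": ["2020-01-31"], "Cod. Linea": [12], "Otra": [1]})
# Columns missing from the dictionary are left untouched by rename
renamed_df = raw_df.rename(columns=loaded_renaming)
print(list(renamed_df.columns))
```

Because `rename` ignores keys that are absent from a dataframe, the same dictionary can be applied to every file even when the files carry different subsets of columns.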

## Columns dataframe

In [4]:
columns_series_list = []

for (dataset, dfs) in dataset_dfs_dict.items():
    for df in dfs:
        df.columns = normalize_columns_name(df.columns)

    unique_columns = unique_columns_from_dataframes(dfs)
    columns_series = pd.Series({c: True for c in unique_columns}, name=dataset)
    columns_series_list.append(columns_series)

columns_df = pd.concat(columns_series_list, axis=1).fillna(False).sort_index()
columns_df = columns_df[columns_df.columns.sort_values()]
columns_df.shape


(300, 5)
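
To make the shape of this dataframe concrete, here is a small sketch of the same presence-matrix idea on toy data. The dataset names `A` and `B` and their column lists are hypothetical; the explicit `astype(bool)` is an addition that keeps a plain boolean dtype after the missing entries are filled.

```python
import pandas as pd

# Toy column lists standing in for the real datasets (hypothetical names)
toy_columns = {
    "A": ["CLIENTE", "MONTO"],
    "B": ["CLIENTE", "ESTADO"],
}

# One boolean Series per dataset, indexed by column name
presence = [pd.Series(True, index=cols, name=name) for name, cols in toy_columns.items()]

toy_columns_df = (
    pd.concat(presence, axis=1)  # outer-aligns on column name, gaps become NaN
    .fillna(False)
    .astype(bool)                # keep a plain boolean dtype
    .sort_index()
)
# Number of datasets in which each column appears
toy_counts = toy_columns_df.sum(axis=1)
print(toy_columns_df)
```

Each row is a column name and each column is a dataset, so a row sum gives the number of datasets that share that column.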

The complete columns dataframe is presented in the last section.

The following is a reduced version of the columns dataframe, useful for finding relations and duplicated columns between datasets:

In [5]:
columns_cut_df = (columns_df
    .copy()
    .drop(index=project_columns.NEGOCIO_PIVOTED_COLUMNS)
    .drop(columns=['COLOCACION'])  # old dataset
)

# CARTERA columns present in only a few files are treated as absent
columns_cut_df.loc[project_columns.CARTERA_USELESS_COLUMNS, 'CARTERA'] = False

counts_series = columns_cut_df.sum(axis=1).rename('COUNTS').astype(int)
columns_cut_df = pd.concat([columns_cut_df, counts_series], axis=1).query('COUNTS > 1')
columns_cut_df.shape

(28, 5)

In [6]:
columns_cut_df

| | CARTERA | COLOCACION_NEW | CONTACTO | NEGOCIO | COUNTS |
|---|---|---|---|---|---|
| ASIGNADO_A | False | False | True | True | 2 |
| CLIENTE | True | False | True | True | 3 |
| COD_LINEA | True | True | False | False | 2 |
| COD_MODALIDAD | True | True | False | False | 2 |
| DIRECCION | False | True | False | True | 2 |
| ESTADO | False | True | False | True | 2 |
| ESTADO_CIVIL | False | True | True | False | 2 |
| FECHA_APROBA | True | True | False | False | 2 |
| FECHA_DESEMBOLSO | True | True | False | False | 2 |
| FECHA_NACIMIENTO | False | True | True | False | 2 |


The following are the two main identifier columns (keys) across datasets:
- OBLIGACION: identifier that links CARTERA, CASTIGO and COLOCACION datasets.
- CLIENTE (former "Homologacion Documento de Identidad"): identifier that links CARTERA, CONTACTO, and NEGOCIO datasets.
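
As an illustration of how such a key could be used, the sketch below joins two toy frames on `CLIENTE` and flags rows without a match. All values are made up, and it assumes `CLIENTE` appears at most once per row in the contact-side frame, which this document does not confirm for the real CONTACTO dataset.

```python
import pandas as pd

# Toy frames with made-up values; only the CLIENTE key name comes from the text
cartera_toy = pd.DataFrame({"CLIENTE": [1, 1, 2], "MONTO": [100, 250, 80]})
contacto_toy = pd.DataFrame({"CLIENTE": [1, 3], "ESTADO_CIVIL": ["SOLTERO", "CASADO"]})

# A left join keeps every CARTERA row; `indicator` adds a _merge column
# telling whether each row found a partner on the right side
merged = cartera_toy.merge(contacto_toy, on="CLIENTE", how="left", indicator=True)

# Clients present in the left frame but absent from the right one
unmatched = merged.loc[merged["_merge"] == "left_only", "CLIENTE"].tolist()
print(merged)
```

Checking the `_merge` column after a join is a cheap way to measure how well an identifier actually links two datasets before relying on it.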

## Columns in CARTERA dataset

The CARTERA dataset is split into 56 files, some of which contain a different number of columns. The following are all columns available in the CARTERA dataset, along with the number of files in which each column is present:

In [7]:
cartera_columns_df = unique_columns_from_dataframes(dataframes=dataset_dfs_dict['CARTERA'],
                                                    return_column_df=True,
                                                    index=[os.path.basename(f) for f in dataset_files_dict['CARTERA']])

cartera_cols_counts = cartera_columns_df.sum(axis=0).rename('COUNTS')
cartera_cols_counts

BUS_REGION                  1
CALIFICACION_CIERRE        56
CAPITAL_VEN                56
CEDULA                      3
CELULAR                     2
CLIENTE                    56
COD_LINEA                  56
COD_MODALIDAD              56
COMISION                   56
CORREO                      2
CUOTAS_PACTADAS            56
CUOTAS_PENDIENTES          56
DIAS_VENCIDO               56
DIRECCION                   2
EDAD                        4
EJECUTIVO_ACTUAL            1
ESTADO_CIVIL                2
ESTRATO                     4
FACTORRH                    4
FECHA_APROBA               56
FECHA_DESEMBOLSO           56
FECHA_NACIMIENTO            4
FECHA_PROXIMO_PAGO         56
FECHA_SOLICITUD            56
FECHA_ULT_PAGO             56
FECHA_UTL_ACTUALIZACION     2
FECHA_VENCIMIENTO_FINAL    56
GARANTIA_REAL              56
GENERO                      4
INTERES_VEN                56
LINEA                      56
MODALIDAD                  56
MONTO                      56
MORA      

Columns in `core_ds4a_project.columns.CARTERA_USELESS_COLUMNS` are CARTERA columns considered useless because they are present in fewer than 10 of the 56 CARTERA files:
> Notes:
> - Columns EDAD and ESTRATO are present in the COLOCACION dataset.
> - Columns ESTADO_CIVIL, FECHA_NACIMIENTO, NIVEL_ESTUDIOS, and PROFESION are present in both the COLOCACION and CONTACTO datasets.

In [8]:
cartera_cols_counts_less_than = cartera_cols_counts[cartera_cols_counts < 10].index
pd.Index(project_columns.CARTERA_USELESS_COLUMNS).equals(cartera_cols_counts_less_than)

True

In [9]:
project_columns.CARTERA_USELESS_COLUMNS

['BUS_REGION',
 'CEDULA',
 'CELULAR',
 'CORREO',
 'DIRECCION',
 'EDAD',
 'EJECUTIVO_ACTUAL',
 'ESTADO_CIVIL',
 'ESTRATO',
 'FACTORRH',
 'FECHA_NACIMIENTO',
 'FECHA_UTL_ACTUALIZACION',
 'GENERO',
 'MUJER_CABEZA',
 'MUNICIPIO',
 'NIVEL_ESTUDIOS',
 'NOMBRE',
 'PROFESION',
 'RANGO_PAGO',
 'REGION_1',
 'REGION_REAL',
 'SUCURSAL',
 'SUCURSALES',
 'SUCURSAL_1',
 'TEL_FIJO']
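
The sketch below shows, on toy data, one way sparse columns like these could be identified and dropped when stacking files. The three toy frames and the threshold of 2 are hypothetical stand-ins for the 56 CARTERA files and the threshold of 10 used above.

```python
import pandas as pd

# Three toy "files"; only the first carries the extra EDAD column
toy_files = [
    pd.DataFrame({"CLIENTE": [1], "MONTO": [100], "EDAD": [30]}),
    pd.DataFrame({"CLIENTE": [2], "MONTO": [200]}),
    pd.DataFrame({"CLIENTE": [3], "MONTO": [300]}),
]

# Count in how many files each column appears
file_counts = pd.Series([c for df in toy_files for c in df.columns]).value_counts()

min_files = 2  # hypothetical threshold; the notebook uses 10 out of 56
sparse_columns = file_counts[file_counts < min_files].index.tolist()

# Stack all files, then drop the columns that are mostly missing
combined = pd.concat(toy_files, ignore_index=True).drop(columns=sparse_columns)
print(combined.columns.tolist())
```

Dropping these columns before analysis avoids carrying fields that would be empty for most of the stacked rows.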

## Columns in NEGOCIO dataset

Columns in `core_ds4a_project.columns.NEGOCIO_PIVOTED_COLUMNS` are columns that have been pivoted. These pivoted columns are present only in the NEGOCIO dataset.

In [10]:
# Match any NEGOCIO pivoted column
regex = '|'.join([f'({col})' for col in project_columns.NEGOCIO_PIVOTED_COLUMNS])
index = columns_df.index.str.match(regex)

# Parentheses are required: `&` binds tighter than `==` in Python
only_in_negocio = (columns_df.loc[index, 'NEGOCIO'] & (columns_df[index].sum(axis=1) == 1)).all()
only_in_negocio

True
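
For readers who need the pivoted columns in long form, the following is one possible sketch using `DataFrame.melt` on toy data. It is not taken from the project code: the values are made up, and it only handles names that end in a numeric suffix such as `_1` or `_2`.

```python
import pandas as pd

# Toy NEGOCIO-like frame with made-up values
negocio_toy = pd.DataFrame({
    "CLIENTE": [1, 2],
    "CANTIDAD_DE_INVENTARIO_1": [5, 7],
    "CANTIDAD_DE_INVENTARIO_2": [3, None],
})

# One row per (client, pivoted column); empty slots are discarded
long_df = (
    negocio_toy
    .melt(id_vars="CLIENTE", var_name="COLUMN", value_name="VALUE")
    .dropna(subset=["VALUE"])
)

# Split e.g. CANTIDAD_DE_INVENTARIO_2 into its base name and numeric position
parts = long_df["COLUMN"].str.extract(r"^(?P<FIELD>.*)_(?P<POSITION>\d+)$")
long_df = pd.concat([long_df[["CLIENTE", "VALUE"]], parts], axis=1)
long_df["POSITION"] = long_df["POSITION"].astype(int)
print(long_df)
```

The long form has one row per filled slot, which makes it easier to count or filter entries per client than scanning many numbered columns.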

In [11]:
project_columns.NEGOCIO_PIVOTED_COLUMNS

['CANTIDAD_DE_INVENTARIO_1',
 'CANTIDAD_DE_INVENTARIO_2',
 'CANTIDAD_DE_INVENTARIO_3',
 'CANTIDAD_DE_INVENTARIO_4',
 'CANTIDAD_DE_INVENTARIO_5',
 'CANTIDAD_DE_INVENTARIO_6',
 'DESCRIPCION_DEL_ACREEDOR_1',
 'DESCRIPCION_DEL_ACREEDOR_2',
 'DESCRIPCION_DEL_ACREEDOR_3',
 'DESCRIPCION_DEL_ACREEDOR_4',
 'DESCRIPCION_DEL_ACTIVO_FIJO_1',
 'DESCRIPCION_DEL_ACTIVO_FIJO_2',
 'DESCRIPCION_DEL_ACTIVO_FIJO_3',
 'DESCRIPCION_DEL_PRODUCTO_1',
 'DESCRIPCION_DEL_PRODUCTO_2',
 'DESCRIPCION_DEL_PRODUCTO_3',
 'DESCRIPCION_DEL_PRODUCTO_4',
 'DESCRIPCION_DEL_PRODUCTO_5',
 'DESCRIPCION_DEL_PRODUCTO_6',
 'DESCRIPCION_DEL_PRODUCTO_QUE_COMERCIALIZA_1',
 'DESCRIPCION_DEL_PRODUCTO_QUE_COMERCIALIZA_2',
 'DESCRIPCION_DEL_PRODUCTO_QUE_COMERCIALIZA_3',
 'DESCRIPCION_DEL_PRODUCTO_QUE_COMERCIALIZA_4',
 'DESCRIPCION_DEL_PRODUCTO_QUE_COMERCIALIZA_5',
 'DESTINO_DEL_PASIVO_1',
 'DESTINO_DEL_PASIVO_2',
 'DESTINO_DEL_PASIVO_3',
 'DESTINO_DEL_PASIVO_4',
 'ENTIDAD_CON_LA_CUAL_TIENE_LA_HIPOTECA_O_LA_PRENDA_1',
 'ENTIDAD_CON_LA_C

## Comparing CONTACTO and NEGOCIO

> Note: the columns in `core_ds4a_project.columns.NEGOCIO_PIVOTED_COLUMNS` have been compacted for display by replacing the trailing number or word with "PIVOTED". The compacted names are not actual columns of the NEGOCIO dataset.

In [12]:
contacto_negocio_cols_df = (columns_df
    [['CONTACTO', 'NEGOCIO']]
    .drop(index=project_columns.NEGOCIO_PIVOTED_COLUMNS)
    .query('CONTACTO | NEGOCIO')
)

# Compacting pivoted columns
pattern = r'^(?P<col>.*)_\w*$'
negocio_pivoted_cols = (pd.Series(project_columns.NEGOCIO_PIVOTED_COLUMNS)
                        .str.replace(pattern, lambda m: m.group('col') + '_PIVOTED', regex=True)
                        .drop_duplicates()
                        )
negocio_pivoted_cols_df = pd.DataFrame([True for c in negocio_pivoted_cols],
                  columns=['NEGOCIO'],
                  index=negocio_pivoted_cols)

pd.concat([contacto_negocio_cols_df, negocio_pivoted_cols_df]).fillna(False).sort_index()

| | CONTACTO | NEGOCIO |
|---|---|---|
| ACTIVIDAD | False | True |
| ACTIVIDAD_CIIU_PRIMARIA | True | False |
| ACTIVIDAD_ECONOMICA | True | False |
| ADMINISTRA_RECURSOS_PUBLICOS | True | False |
| ARRENDADOR_DEL_NEGOCIO | False | True |
| ASIGNADO_A | True | True |
| AUTORIZA_EL_ENVIO_DE_INFORMACION_POR_MEDIO_DE_CORREO_ELCTRONICO | True | False |
| AUTORIZA_LA_CONSULTA_EN_LAS_CENTRALES_DE_RIESGO | True | False |
| BANCO_MONEDA_EXTRANJERA | True | False |
| BARRIO | False | True |


## Complete columns dataframe

In [13]:
counts_series = columns_df.sum(axis=1).rename('COUNTS').astype(int)
pd.concat([columns_df, counts_series], axis=1)

| | CARTERA | COLOCACION | COLOCACION_NEW | CONTACTO | NEGOCIO | COUNTS |
|---|---|---|---|---|---|---|
| ACTIVIDAD | False | False | False | False | True | 1 |
| ACTIVIDAD_CIIU_PRIMARIA | False | False | False | True | False | 1 |
| ACTIVIDAD_ECONOMICA | False | False | False | True | False | 1 |
| ADMINISTRA_RECURSOS_PUBLICOS | False | False | False | True | False | 1 |
| ANO_CONTABILIZA | False | False | True | False | False | 1 |
| ARRENDADOR_DEL_NEGOCIO | False | False | False | False | True | 1 |
| ASIGNADO_A | False | False | False | True | True | 2 |
| AUTORIZA_EL_ENVIO_DE_INFORMACION_POR_MEDIO_DE_CORREO_ELCTRONICO | False | False | False | True | False | 1 |
| AUTORIZA_LA_CONSULTA_EN_LAS_CENTRALES_DE_RIESGO | False | False | False | True | False | 1 |
| BANCO_MONEDA_EXTRANJERA | False | False | False | True | False | 1 |
