# Anonymization Techniques: Generalization

## 1. Goal:
Implement the generalization technique on a dataset.

## 2. Specification:
Load the “covid.csv” and “regioes.csv” datasets provided. The first dataset is already pseudo-anonymized; the second is used to generalize an attribute. Implement a program that generalizes the COVID dataset:

- Define a generalization tree with 2 levels for the attribute “Municipio”, based on the mesoregions of the State of Ceará contained in the dataset “regioes.csv”, and another tree for date of birth with 3 levels: original data (yyyy/mm/dd), date without day (yyyy/mm), and date without day and month (yyyy). Level 0 is the original data level.
- From the “covid.csv” dataset and the generalization trees defined above, the program must perform:
   1. Full-domain generalization of the “Municipio” attribute values to level 1, writing an anonymized “covid_anon01.csv” file;
   2. Full-domain generalization of the “Nascimento” attribute values to level 1, writing an anonymized “covid_anon02.csv” file;
   3. Subtree generalization of the “Nascimento” attribute values to level 2, writing an anonymized “covid_anon03.csv” file. Apply the anonymization only to records of individuals born in the 1950s (1950 to 1959);
   4. Full-domain generalization of the attributes “Municipio” (level 1) and “Nascimento” (level 2), writing an anonymized “covid_anon04.csv” file.

The program should display on screen the two generalization hierarchies defined from the dataset.

Program an interface for selecting and viewing the datasets generated for each output, where the user can view the selected anonymized dataset.

In [1]:
import warnings
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')  # np.warnings was removed in NumPy 1.24

Dataset "covid.csv"

In [2]:
df_covid = pd.read_csv('covid.csv', sep=';')

In [3]:
df_covid.head()

Unnamed: 0.1,Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame
0,0,c3ba634113e4b5eb0e3eaae93b09759b,231290.0,SOBRAL,CE,MASCULINO,2003-08-14,Negativo
1,1,ac84809bfc89b992a0a0221e50b135c0,230960.0,PACAJUS,CE,MASCULINO,1983-11-07,Negativo
2,2,28ccfaa0c53b792cd1ffa0b7e535f617,230523.0,HORIZONTE,CE,FEMININO,1982-01-14,Negativo
3,3,9683fc5fd2c0f7b72fa92ffd259d738a,230440.0,FORTALEZA,CE,MASCULINO,1992-03-12,Negativo
4,4,e257ccdc48289f02e047cbf046251319,230370.0,CAUCAIA,CE,MASCULINO,1970-03-06,Negativo


In [4]:
df_covid['CodigoMunicipio'] = df_covid['CodigoMunicipio'].astype('int')

In [5]:
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 879543 entries, 0 to 879542
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Unnamed: 0       879543 non-null  int64 
 1   Identificador    879543 non-null  object
 2   CodigoMunicipio  879543 non-null  int32 
 3   Municipio        879543 non-null  object
 4   Estado           879543 non-null  object
 5   Genero           879543 non-null  object
 6   Nascimento       879543 non-null  object
 7   ResultadoExame   879543 non-null  object
dtypes: int32(1), int64(1), object(6)
memory usage: 50.3+ MB


Dataset "regioes.csv"

In [6]:
df_regiao = pd.read_csv('regioes.csv')

In [7]:
df_regiao.head()

Unnamed: 0.1,Unnamed: 0,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
0,0,230010,Sul Cearense,CE,Nordeste
1,1,230015,Norte Cearense,CE,Nordeste
2,2,230020,Noroeste Cearense,CE,Nordeste
3,3,230030,Sertões Cearenses,CE,Nordeste
4,4,230040,Sertões Cearenses,CE,Nordeste


In [8]:
df_regiao.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184 entries, 0 to 183
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Unnamed: 0                     184 non-null    int64 
 1   Código IBGE do Município       184 non-null    int64 
 2   Nome da Mesoregião             184 non-null    object
 3   Sigla da Unidade da Federação  184 non-null    object
 4   Nome das Grandes Regiões       184 non-null    object
dtypes: int64(2), object(3)
memory usage: 7.3+ KB


In [9]:
# merging datasets on the "Municipio" code. 
df_covid_regiao = pd.merge(df_covid,df_regiao,left_on='CodigoMunicipio',right_on = 'Código IBGE do Município', how = 'inner')

In [10]:
df_covid_regiao

Unnamed: 0,Unnamed: 0_x,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame,Unnamed: 0_y,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
0,0,c3ba634113e4b5eb0e3eaae93b09759b,231290,SOBRAL,CE,MASCULINO,2003-08-14,Negativo,166,231290,Noroeste Cearense,CE,Nordeste
1,9,8db897c909c1c246b56ee491e0df994f,231290,SOBRAL,CE,MASCULINO,1989-12-28,Negativo,166,231290,Noroeste Cearense,CE,Nordeste
2,65,17997c9bd21d9931ea130f1ca65236e9,231290,SOBRAL,CE,FEMININO,2000-07-08,Negativo,166,231290,Noroeste Cearense,CE,Nordeste
3,95,25e5850e62482886d70efb3f8438aaa7,231290,SOBRAL,CE,FEMININO,1954-03-20,Negativo,166,231290,Noroeste Cearense,CE,Nordeste
4,96,de7400d331c8521d71baa53409d8b4f8,231290,SOBRAL,CE,MASCULINO,2002-12-06,Negativo,166,231290,Noroeste Cearense,CE,Nordeste
...,...,...,...,...,...,...,...,...,...,...,...,...,...
879538,876429,8ca78bc532162483be4b9ec4c4bc5e64,231123,POTIRETAMA,CE,FEMININO,1974-09-29,Negativo,146,231123,Jaguaribe,CE,Nordeste
879539,876938,112a5ad7b915503baca248adcd32c6f5,231123,POTIRETAMA,CE,MASCULINO,1941-08-20,Negativo,146,231123,Jaguaribe,CE,Nordeste
879540,877011,4b3a7183752a62c90e2f59cac1d8a984,231123,POTIRETAMA,CE,FEMININO,1984-02-29,Positivo,146,231123,Jaguaribe,CE,Nordeste
879541,877515,3072ba3050b62e29449eca68420a3983,231123,POTIRETAMA,CE,FEMININO,1991-07-04,Positivo,146,231123,Jaguaribe,CE,Nordeste


In [11]:
# Dropping repeated and/or unimportant columns:
df_covid_regiao.drop(columns = ['Unnamed: 0_x', 'Unnamed: 0_y', 'CodigoMunicipio', 'Estado'], inplace = True)

In [12]:
df_covid_regiao.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 879543 entries, 0 to 879542
Data columns (total 9 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Identificador                  879543 non-null  object
 1   Municipio                      879543 non-null  object
 2   Genero                         879543 non-null  object
 3   Nascimento                     879543 non-null  object
 4   ResultadoExame                 879543 non-null  object
 5   Código IBGE do Município       879543 non-null  int64 
 6   Nome da Mesoregião             879543 non-null  object
 7   Sigla da Unidade da Federação  879543 non-null  object
 8   Nome das Grandes Regiões       879543 non-null  object
dtypes: int64(1), object(8)
memory usage: 67.1+ MB
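
The row count is the same before and after the inner join (879543), which suggests every `CodigoMunicipio` found a matching region. A minimal sketch of how this could be checked explicitly, using a left join with `indicator=True` on toy frames (the two small frames and the code `999999` are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_covid and df_regiao (999999 is a made-up, unmatched code)
covid = pd.DataFrame({'CodigoMunicipio': [231290, 230440, 999999]})
regioes = pd.DataFrame({
    'Código IBGE do Município': [231290, 230440],
    'Nome da Mesoregião': ['Noroeste Cearense', 'Metropolitana de Fortaleza'],
})

# A left join keeps every covid row; _merge records whether a region was found
checked = covid.merge(regioes, left_on='CodigoMunicipio',
                      right_on='Código IBGE do Município',
                      how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'CodigoMunicipio'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join dropped no records.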


### Define a generalization tree with 2 levels for the attribute “Municipio” considering the mesoregions of the State of Ceará contained in the dataset “regioes.csv”, and another tree for date of birth with 3 levels: original data (yyyy/mm/dd), date without day (yyyy/mm), date without day and month (yyyy). Consider level 0 the original data level.

**Anonymization for Municipality in 2 levels:**

                                Anonymization technique used for Municipality ("Municípios"): Generalization
                                        
[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO...]            [ARACATI, SAO JOAO DO JAGUARIBE...]                                [CRATEUS, PIQUET CARNEIRO, TAUA...]
                          
                          Level 1: Generalization for all municipalities according to Mesoregion
                       |                                          |                                       |
                       |                                          |                                       | 
                       |                                          |                                       | 
                    [SOBRAL]                               [POTIRETAMA]                               [CRATEUS] 
                                                       Level 0: Municipality ("Município")
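
The specification also asks for the hierarchies to be shown on screen. A minimal sketch of printing the municipality tree programmatically, assuming it is held as a plain dictionary from mesoregion to municipalities (the names `hierarchy` and `render_tree` are illustrative, and only a few municipalities from the dataset are listed):

```python
# Level 1 (mesoregion) -> level 0 (municipalities); a small excerpt of the real mapping
hierarchy = {
    'Noroeste Cearense': ['SOBRAL', 'ACARAU', 'TIANGUA'],
    'Jaguaribe': ['ARACATI', 'POTIRETAMA'],
    'Sertões Cearenses': ['CRATEUS', 'TAUA'],
}

def render_tree(tree):
    """Return the two-level hierarchy as indented text lines."""
    lines = []
    for parent, children in tree.items():
        lines.append(f'Level 1: {parent}')
        lines.extend(f'    Level 0: {child}' for child in children)
    return lines

print('\n'.join(render_tree(hierarchy)))
```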

**Anonymization for Date of Birth in 3 levels:**

                                Anonymization technique used for Date of Birth: Generalization 
                                
                [2003]                                     [1989]                                  [2000] 
                                            Level 2: Generalization for year
                   |                                          |                                       |
                   |                                          |                                       | 
                   |                                          |                                       | 
               [2003-08]                                  [1989-12]                              [2000-07] 
                                      
                                          Level 1: Generalization for year and month
                                      
                   |                                          |                                       |
                   |                                          |                                       | 
                   |                                          |                                       | 
             [2003-08-14]                               [1989-12-28]                            [2000-07-08] 
                                                  Level 0: Original date
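
Because the dates are ISO-formatted strings (yyyy-mm-dd), each level of this tree corresponds to a prefix of the original value. A minimal sketch under that assumption (the names `PREFIX_LENGTH` and `generalize_date` are illustrative and not part of the notebook):

```python
# Number of characters of an ISO date (yyyy-mm-dd) kept at each tree level
PREFIX_LENGTH = {0: 10, 1: 7, 2: 4}

def generalize_date(value, level):
    """Generalize an ISO date string to the requested level of the tree."""
    return value[:PREFIX_LENGTH[level]]

for level in (0, 1, 2):
    print(level, generalize_date('2003-08-14', level))
```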

                                        

**1. The full domain generalization of the “Municipio” attribute values to level 1 and write an anonymized “covid_anon01.csv” file.**

In [13]:
df_municipio = df_covid_regiao.copy()

In [14]:
# Distinct mesoregions present in the dataset
df_municipio['Nome da Mesoregião'].unique()

array(['Noroeste Cearense', 'Metropolitana de Fortaleza',
       'Norte Cearense', 'Sul Cearense', 'Jaguaribe', 'Sertões Cearenses',
       'Centro-Sul Cearense'], dtype=object)

In [15]:
# List of municipalities according to mesoregion:
list_municipio = df_municipio.groupby('Nome da Mesoregião')['Municipio'].unique()
list_municipio

Nome da Mesoregião
Centro-Sul Cearense           [IGUATU, OROS, CEDRO, BAIXIO, CARIUS, ICO, LAV...
Jaguaribe                     [ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...
Metropolitana de Fortaleza    [PACAJUS, HORIZONTE, FORTALEZA, CAUCAIA, MARAN...
Noroeste Cearense             [SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...
Norte Cearense                [CASCAVEL, SAO GONCALO DO AMARANTE, PARACURU, ...
Sertões Cearenses             [CRATEUS, PIQUET CARNEIRO, TAUA, ARNEIROZ, QUI...
Sul Cearense                  [MILAGRES, MISSAO VELHA, CRATO, BREJO SANTO, J...
Name: Municipio, dtype: object

In [16]:
dic_municipio = dict(list_municipio)
dic_municipio

{'Centro-Sul Cearense': array(['IGUATU', 'OROS', 'CEDRO', 'BAIXIO', 'CARIUS', 'ICO',
        'LAVRAS DA MANGABEIRA', 'VARZEA ALEGRE', 'ANTONINA DO NORTE',
        'JUCAS', 'QUIXELO', 'IPAUMIRIM', 'UMARI', 'TARRAFAS'], dtype=object),
 'Jaguaribe': array(['ARACATI', 'SAO JOAO DO JAGUARIBE', 'LIMOEIRO DO NORTE', 'RUSSAS',
        'MORADA NOVA', 'QUIXERE', 'JAGUARIBARA', 'JAGUARIBE', 'IRACEMA',
        'TABULEIRO DO NORTE', 'PEREIRO', 'ITAICABA', 'ERERE', 'ICAPUI',
        'JAGUARETAMA', 'IBICUITINGA', 'FORTIM', 'JAGUARUANA', 'ALTO SANTO',
        'PALHANO', 'POTIRETAMA'], dtype=object),
 'Metropolitana de Fortaleza': array(['PACAJUS', 'HORIZONTE', 'FORTALEZA', 'CAUCAIA', 'MARANGUAPE',
        'PACATUBA', 'MARACANAU', 'AQUIRAZ', 'EUSEBIO', 'ITAITINGA',
        'GUAIÚBA'], dtype=object),
 'Noroeste Cearense': array(['SOBRAL', 'ACARAU', 'TIANGUA', 'SAO BENEDITO', 'CAMOCIM', 'CRUZ',
        'HIDROLANDIA', 'SANTA QUITERIA', 'PIRES FERREIRA', 'GRANJA',
        'ITAREMA', 'CATUNDA', 'IPU', 'RERI

In [17]:
# Performing anonymization generalizing the municipality ("município") in question to all municipalities in their respective mesoregion:
df_municipio['Municipio'] = df_municipio['Nome da Mesoregião'].map(dic_municipio)

In [18]:
df_municipio.sample(5)

Unnamed: 0,Identificador,Municipio,Genero,Nascimento,ResultadoExame,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
858733,b41d39d5428c26165b65e10464b8b695,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,1983-03-09,Negativo,230050,Noroeste Cearense,CE,Nordeste
112169,17e6e486e111128824e4210e203f8632,"[PACAJUS, HORIZONTE, FORTALEZA, CAUCAIA, MARAN...",FEMININO,1977-08-24,Negativo,230440,Metropolitana de Fortaleza,CE,Nordeste
17789,aeb790983a784aae616d6bf63d9d5849,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,1982-04-09,Negativo,231290,Noroeste Cearense,CE,Nordeste
582417,82b4d59cc57dedff45c584a583c97aee,"[IGUATU, OROS, CEDRO, BAIXIO, CARIUS, ICO, LAV...",MASCULINO,1954-09-11,Negativo,230380,Centro-Sul Cearense,CE,Nordeste
128163,9db5524e368268b3cc3dc297b41ec4de,"[PACAJUS, HORIZONTE, FORTALEZA, CAUCAIA, MARAN...",FEMININO,1958-12-22,Negativo,230440,Metropolitana de Fortaleza,CE,Nordeste
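
The cell above stores the whole list of municipalities of the mesoregion as the level-1 value. Another possible representation, shown here only as a sketch, is to store the mesoregion name itself by inverting the dictionary into a municipality-to-mesoregion lookup (`to_mesoregion` and `dic_excerpt` are hypothetical names; the excerpt uses a few entries from `dic_municipio`):

```python
# Excerpt of the mesoregion -> municipalities mapping
dic_excerpt = {
    'Noroeste Cearense': ['SOBRAL', 'ACARAU'],
    'Metropolitana de Fortaleza': ['FORTALEZA', 'CAUCAIA'],
}

# Invert it: each municipality (level 0) points to its mesoregion (level 1)
to_mesoregion = {town: meso for meso, towns in dic_excerpt.items() for town in towns}

# Full-domain generalization: every value is replaced by its level-1 ancestor
records = ['SOBRAL', 'FORTALEZA', 'CAUCAIA']
generalized = [to_mesoregion[town] for town in records]
print(generalized)
```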


Although the statement does not explicitly request it, the suppression technique is also applied to the "Código IBGE do Município" field. Otherwise the original municipality could be recovered by looking up its code, which would defeat the generalization.

In [19]:
# Suppression Technique: every IBGE code value is replaced by "******"
df_municipio['Código IBGE do Município'] = '******'

In [20]:
df_municipio.sample(5)

Unnamed: 0,Identificador,Municipio,Genero,Nascimento,ResultadoExame,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
433810,76cd168b8687b9721d87efc44aee8bd9,"[PACAJUS, HORIZONTE, FORTALEZA, CAUCAIA, MARAN...",FEMININO,1980-12-16,Provável,******,Metropolitana de Fortaleza,CE,Nordeste
11329,6e8385beccdb0f9fe5157ed948e2f983,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,1989-01-09,Negativo,******,Noroeste Cearense,CE,Nordeste
257539,20afb4980455ec2acc0eef89b7b2ee92,"[PACAJUS, HORIZONTE, FORTALEZA, CAUCAIA, MARAN...",FEMININO,1971-07-18,Negativo,******,Metropolitana de Fortaleza,CE,Nordeste
489641,63c263f404516799ca2671c9a919ae90,"[CASCAVEL, SAO GONCALO DO AMARANTE, PARACURU, ...",MASCULINO,1985-01-11,Negativo,******,Norte Cearense,CE,Nordeste
659780,7050ac1a85fbeb4026d81c18ea8fb39a,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,1981-03-30,Negativo,******,Noroeste Cearense,CE,Nordeste


In [21]:
# Converting the dataframe to an anonymized file:
df_municipio.to_csv('covid_anon01.csv', index = False)

**2. The full domain generalization of the “Nascimento” attribute values to level 1 and writing an anonymized “covid_anon02.csv” file.**

In [22]:
df_nascimento = df_covid.copy()

In [23]:
df_nascimento.drop(columns = ['Unnamed: 0'], inplace = True)

Inspecting the dataset reveals birth dates earlier than 1900. The first step is therefore to rewrite these values so that any date between 1600 and 1899 falls within the 20th century.

In [24]:
# Dates of birth with values between 1600-1899.
df_nascimento[df_nascimento['Nascimento'].str.contains('^16|^17|^18')]

Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame
31516,cef7b4f3742dbd0590ac4f803f82b9f7,230300,CARIDADE,CE,MASCULINO,1885-10-25,Provável
34517,81477c7bf949f9108ea4e7baa986460e,230340,CARNAUBAL,CE,FEMININO,1899-12-30,Positivo
37156,538ff42c112efcf5db644fbd12155426,231140,QUIXERAMOBIM,CE,FEMININO,1899-12-30,Negativo
40178,4a3f2df72236698b9db233294c9aa55d,230470,GRANJA,CE,FEMININO,1899-12-30,Negativo
40179,badbbdfd53b5de472963375134af27d8,230426,DEPUTADO IRAPUAN PINHEIRO,CE,MASCULINO,1899-12-30,Negativo
40181,98ef3813ede57b158ed15027e61749b1,230533,IBICUITINGA,CE,FEMININO,1899-12-30,Negativo
40191,e3ff9650bc38b225d2b8813c642ea251,230550,IGUATU,CE,FEMININO,1899-12-30,Positivo
40193,31f5c8867ab60523a6d3f80b1ca398bc,230580,IPU,CE,FEMININO,1899-12-30,Negativo
40196,6dbdcf86a1374024010d7c98d05a1d1d,230550,IGUATU,CE,FEMININO,1899-12-30,Negativo
40200,f6a1d946e6c12774aff3977beb4885ce,230630,ITAPAJE,CE,FEMININO,1899-12-30,Negativo


In [25]:
# Replacing the century prefixes 16, 17 and 18 with 19, so that the base only has dates of birth from 1900 onwards
# (regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default)
index = df_nascimento[df_nascimento['Nascimento'].str.contains('^16|^17|^18')].index
df_nascimento.loc[index, 'Nascimento'] = df_nascimento.loc[index,'Nascimento'].str.replace('^16|^17|^18','19', regex=True)

In [26]:
# Example of a value that previously had a date: 1894-03-16
df_nascimento.loc[df_nascimento['Identificador']=='a4e89b7dfa149cac27b5b524eae59ca1']

Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame
64694,a4e89b7dfa149cac27b5b524eae59ca1,231400,VARZEA ALEGRE,CE,MASCULINO,1994-03-16,Positivo
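
The same century correction can be expressed with Python's `re` module. This sketch (the helper name `fix_century` is illustrative) anchors the pattern at the start of the string, so only the century prefix is rewritten and a matching month or day is left alone:

```python
import re

def fix_century(date_str):
    """Rewrite a leading 16, 17 or 18 as 19; leave every other date untouched."""
    return re.sub(r'^(16|17|18)', '19', date_str)

print(fix_century('1894-03-16'))
print(fix_century('1992-03-18'))
```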


After this correction, the dataset is copied so that levels 1 and 2 can each be generalized from the same cleaned data.

In [27]:
df_nascimento_1 = df_nascimento.copy()

In [28]:
df_nascimento_1['Nascimento'] = pd.to_datetime(df_nascimento['Nascimento'])

In [29]:
# Creating Level 1: Year and Month
df_nascimento_1['Nascimento'] = df_nascimento_1['Nascimento'].map(lambda x: x.strftime("%Y-%m"))

In [30]:
# Random samples from the anonymized dataset for the first level of the "Nascimento" attribute:
df_nascimento_1.sample(5)

Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame
452014,33f4dc8eff891f390ff7f34016151b37,230250,BREJO SANTO,CE,FEMININO,1992-08,Negativo
502908,24136f58e355feadf5ee328d682bf78e,230880,MORAUJO,CE,MASCULINO,1987-02,Negativo
867505,005136abc424901dcefc1e17a3d7de7d,231025,PARAIPABA,CE,FEMININO,1999-10,Provável
134174,592d29e87617e6935b181d1d90aee2a1,230540,ICO,CE,FEMININO,2012-01,Negativo
801425,ebc963748dcdd38a55f84d8dcd7bff7c,230730,JUAZEIRO DO NORTE,CE,MASCULINO,1990-04,Positivo


In [31]:
df_nascimento_1.to_csv('covid_anon02.csv', index = False)

**3. Subtree generalization of the “Nascimento” attribute values to level 2 and write an anonymized “covid_anon03.csv” file. Apply anonymization only to records whose individual was born in the 1950s (1950 to 1959).**

As the statement specifies, only individuals born in the 1950s are anonymized; all other records are left unchanged. The birth ("Nascimento") attribute therefore ends up at level 2 for records inside the range and at level 0 for records outside it.
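
The rule can be stated compactly as a function of a single date. A minimal plain-Python sketch, assuming ISO-formatted date strings (the helper name `generalize_1950s` is illustrative and not part of the notebook):

```python
def generalize_1950s(date_str):
    """Return only the year for births in 1950-1959; keep other dates at level 0."""
    year = int(date_str[:4])
    return date_str[:4] if 1950 <= year <= 1959 else date_str

for d in ('1953-08-03', '1941-01-05', '1960-01-01'):
    print(d, '->', generalize_1950s(d))
```

The cells that follow reach the same result on the full dataset by building a date-to-value dictionary and applying it with `map`.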

In [32]:
df_nascimento_2 = df_nascimento.copy()

In [33]:
# Creating Level 2: Year
df_nascimento_2['Ano'] = pd.DatetimeIndex(df_nascimento_2['Nascimento']).year

In [34]:
df_nascimento_2['Ano'] = df_nascimento_2['Ano'].astype('int')

In [35]:
df_nascimento_2

Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame,Ano
0,c3ba634113e4b5eb0e3eaae93b09759b,231290,SOBRAL,CE,MASCULINO,2003-08-14,Negativo,2003
1,ac84809bfc89b992a0a0221e50b135c0,230960,PACAJUS,CE,MASCULINO,1983-11-07,Negativo,1983
2,28ccfaa0c53b792cd1ffa0b7e535f617,230523,HORIZONTE,CE,FEMININO,1982-01-14,Negativo,1982
3,9683fc5fd2c0f7b72fa92ffd259d738a,230440,FORTALEZA,CE,MASCULINO,1992-03-12,Negativo,1992
4,e257ccdc48289f02e047cbf046251319,230370,CAUCAIA,CE,MASCULINO,1970-03-06,Negativo,1970
...,...,...,...,...,...,...,...,...
879538,034184a3bba5a9d9020c6f78287cfb30,230440,FORTALEZA,CE,FEMININO,1982-12-08,Positivo,1982
879539,7f4c28d219dd03de2457b16de64563a0,230440,FORTALEZA,CE,FEMININO,1980-12-02,Negativo,1980
879540,4f8257be1fefe7e9a3b17f08247792c6,230440,FORTALEZA,CE,FEMININO,1988-05-17,Provável,1988
879541,2b610df8ede251af2e17a6ac07726678,230410,CRATEUS,CE,FEMININO,1970-09-28,Negativo,1970


In [36]:
intervalo = df_nascimento_2[['Nascimento','Ano']].loc[(df_nascimento_2['Ano']>=1950) & (df_nascimento_2['Ano']<=1959)] 
intervalo

Unnamed: 0,Nascimento,Ano
26,1951-04-12,1951
36,1950-02-03,1950
39,1953-08-03,1953
64,1952-01-17,1952
88,1958-12-23,1958
...,...,...
879519,1951-01-04,1951
879527,1956-09-09,1956
879528,1954-12-18,1954
879535,1956-10-21,1956


In [37]:
residual = df_nascimento_2[['Nascimento','Ano']].loc[(df_nascimento_2['Ano']<1950) | (df_nascimento_2['Ano']>1959)].copy()
# Outside the 1950s the date stays at level 0, so it maps to itself
residual['Ano'] = residual['Nascimento']
residual

Unnamed: 0,Nascimento,Ano
0,2003-08-14,2003-08-14
1,1983-11-07,1983-11-07
2,1982-01-14,1982-01-14
3,1992-03-12,1992-03-12
4,1970-03-06,1970-03-06
...,...,...
879538,1982-12-08,1982-12-08
879539,1980-12-02,1980-12-02
879540,1988-05-17,1988-05-17
879541,1970-09-28,1970-09-28


In [38]:
# Concatenating both parts so the lookup covers all 879543 rows of the original dataset:
df_dic = pd.concat([intervalo,residual])
df_dic

Unnamed: 0,Nascimento,Ano
26,1951-04-12,1951
36,1950-02-03,1950
39,1953-08-03,1953
64,1952-01-17,1952
88,1958-12-23,1958
...,...,...
879538,1982-12-08,1982-12-08
879539,1980-12-02,1980-12-02
879540,1988-05-17,1988-05-17
879541,1970-09-28,1970-09-28


In [39]:
dic_ano = dict(zip(df_dic['Nascimento'],(df_dic['Ano'])))
dic_ano

{'1951-04-12': 1951,
 '1950-02-03': 1950,
 '1953-08-03': 1953,
 '1952-01-17': 1952,
 '1958-12-23': 1958,
 '1954-03-20': 1954,
 '1953-03-25': 1953,
 '1958-03-08': 1958,
 '1956-06-27': 1956,
 '1959-05-03': 1959,
 '1953-05-10': 1953,
 '1950-01-15': 1950,
 '1957-03-05': 1957,
 '1950-01-17': 1950,
 '1950-07-03': 1950,
 '1953-04-21': 1953,
 '1957-02-08': 1957,
 '1952-05-29': 1952,
 '1958-06-16': 1958,
 '1958-11-30': 1958,
 '1951-08-28': 1951,
 '1956-02-02': 1956,
 '1954-05-15': 1954,
 '1956-02-21': 1956,
 '1953-07-12': 1953,
 '1953-11-01': 1953,
 '1957-03-16': 1957,
 '1953-03-20': 1953,
 '1959-12-15': 1959,
 '1956-05-09': 1956,
 '1950-10-09': 1950,
 '1959-07-07': 1959,
 '1951-04-10': 1951,
 '1959-04-21': 1959,
 '1953-10-25': 1953,
 '1956-02-23': 1956,
 '1957-02-02': 1957,
 '1950-12-13': 1950,
 '1958-05-22': 1958,
 '1952-01-01': 1952,
 '1952-07-28': 1952,
 '1953-09-02': 1953,
 '1952-05-27': 1952,
 '1959-10-17': 1959,
 '1950-11-05': 1950,
 '1951-11-07': 1951,
 '1955-12-26': 1955,
 '1957-08-07'

In [40]:
# Replacing only the values that belong to the range: 1950-1959
df_nascimento_2['Nascimento'] = df_nascimento_2['Nascimento'].map(dic_ano)

In [41]:
df_nascimento_2.drop(columns = ['Ano'], inplace = True)

In [42]:
df_nascimento_2.sample(10)

Unnamed: 0,Identificador,CodigoMunicipio,Municipio,Estado,Genero,Nascimento,ResultadoExame
430612,9c9bf7cfbed16ed820f34daa7fb840e7,230440,FORTALEZA,CE,MASCULINO,1997-10-19,Negativo
700623,d20204b2d01db43b1ef6e66145f726c4,231150,QUIXERE,CE,FEMININO,1987-04-01,Negativo
524130,60527a799878b0fec4bb9a7f8ca75944,231290,SOBRAL,CE,FEMININO,1991-12-25,Positivo
186667,62a90e916893bdb218ce2fed7a7b8f90,230765,MARACANAU,CE,MASCULINO,1953,Provável
170210,9715b4562db9395974adf0dba70795c1,230190,BARBALHA,CE,FEMININO,1985-11-16,Negativo
102274,eea32868e19c59623edbca9b832303c3,231340,TIANGUA,CE,MASCULINO,1980-10-22,Negativo
677291,d26d5d0c426fba4ad13cf8b9eed19cd5,231140,QUIXERAMOBIM,CE,FEMININO,1979-06-11,Positivo
203160,a687e8756a9c7775db0eb394b6903c2d,230490,GROAIRAS,CE,MASCULINO,1995-08-25,Negativo
533774,60a30c5a1df698198476db4d119a2b7f,231180,RUSSAS,CE,MASCULINO,1941-01-05,Negativo
635281,215f31142d8facc65a42e3d3401fd9fa,231320,TAMBORIL,CE,MASCULINO,1966-09-12,Positivo


In [43]:
# Converting the dataframe to an anonymized file:
df_nascimento_2.to_csv('covid_anon03.csv', index = False)

**4. Full domain generalization of the values of the “Municipio” (level 1) and "Nascimento" (level 2) attributes and recording an anonymized “covid_anon04.csv” file.**

The starting point is df_municipio, the merged COVID and region dataframe in which the municipality has already been anonymized.

In [44]:
df_municipio_nascimento = df_municipio.copy()
df_municipio_nascimento

Unnamed: 0,Identificador,Municipio,Genero,Nascimento,ResultadoExame,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
0,c3ba634113e4b5eb0e3eaae93b09759b,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,2003-08-14,Negativo,******,Noroeste Cearense,CE,Nordeste
1,8db897c909c1c246b56ee491e0df994f,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,1989-12-28,Negativo,******,Noroeste Cearense,CE,Nordeste
2,17997c9bd21d9931ea130f1ca65236e9,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,2000-07-08,Negativo,******,Noroeste Cearense,CE,Nordeste
3,25e5850e62482886d70efb3f8438aaa7,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,1954-03-20,Negativo,******,Noroeste Cearense,CE,Nordeste
4,de7400d331c8521d71baa53409d8b4f8,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,2002-12-06,Negativo,******,Noroeste Cearense,CE,Nordeste
...,...,...,...,...,...,...,...,...,...
879538,8ca78bc532162483be4b9ec4c4bc5e64,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1974-09-29,Negativo,******,Jaguaribe,CE,Nordeste
879539,112a5ad7b915503baca248adcd32c6f5,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",MASCULINO,1941-08-20,Negativo,******,Jaguaribe,CE,Nordeste
879540,4b3a7183752a62c90e2f59cac1d8a984,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1984-02-29,Positivo,******,Jaguaribe,CE,Nordeste
879541,3072ba3050b62e29449eca68420a3983,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1991-07-04,Positivo,******,Jaguaribe,CE,Nordeste


In [45]:
df_municipio_nascimento.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 879543 entries, 0 to 879542
Data columns (total 9 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Identificador                  879543 non-null  object
 1   Municipio                      879543 non-null  object
 2   Genero                         879543 non-null  object
 3   Nascimento                     879543 non-null  object
 4   ResultadoExame                 879543 non-null  object
 5   Código IBGE do Município       879543 non-null  object
 6   Nome da Mesoregião             879543 non-null  object
 7   Sigla da Unidade da Federação  879543 non-null  object
 8   Nome das Grandes Regiões       879543 non-null  object
dtypes: object(9)
memory usage: 67.1+ MB


In [46]:
# Replacing only the values that belong to the range: 1950-1959
df_municipio_nascimento['Nascimento'] = df_municipio_nascimento['Nascimento'].map(dic_ano)

In [47]:
df_municipio_nascimento

Unnamed: 0,Identificador,Municipio,Genero,Nascimento,ResultadoExame,Código IBGE do Município,Nome da Mesoregião,Sigla da Unidade da Federação,Nome das Grandes Regiões
0,c3ba634113e4b5eb0e3eaae93b09759b,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,2003-08-14,Negativo,******,Noroeste Cearense,CE,Nordeste
1,8db897c909c1c246b56ee491e0df994f,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,1989-12-28,Negativo,******,Noroeste Cearense,CE,Nordeste
2,17997c9bd21d9931ea130f1ca65236e9,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,2000-07-08,Negativo,******,Noroeste Cearense,CE,Nordeste
3,25e5850e62482886d70efb3f8438aaa7,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",FEMININO,1954,Negativo,******,Noroeste Cearense,CE,Nordeste
4,de7400d331c8521d71baa53409d8b4f8,"[SOBRAL, ACARAU, TIANGUA, SAO BENEDITO, CAMOCI...",MASCULINO,2002-12-06,Negativo,******,Noroeste Cearense,CE,Nordeste
...,...,...,...,...,...,...,...,...,...
879538,8ca78bc532162483be4b9ec4c4bc5e64,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1974-09-29,Negativo,******,Jaguaribe,CE,Nordeste
879539,112a5ad7b915503baca248adcd32c6f5,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",MASCULINO,1941-08-20,Negativo,******,Jaguaribe,CE,Nordeste
879540,4b3a7183752a62c90e2f59cac1d8a984,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1984-02-29,Positivo,******,Jaguaribe,CE,Nordeste
879541,3072ba3050b62e29449eca68420a3983,"[ARACATI, SAO JOAO DO JAGUARIBE, LIMOEIRO DO N...",FEMININO,1991-07-04,Positivo,******,Jaguaribe,CE,Nordeste


In [48]:
df_municipio_nascimento.to_csv('covid_anon04.csv', index = False)
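
Item 4 asks for full-domain generalization of “Nascimento” to level 2, whereas `dic_ano` maps only 1950s births to their year and leaves every other date at level 0. In addition, `map` returns NaN for any date that is not a key of the dictionary. As a sketch of what a full-domain variant could look like, every date is reduced to its year regardless of decade (the three-row `sample` frame is illustrative, with dates taken from the table above):

```python
import pandas as pd

sample = pd.DataFrame({'Nascimento': ['2003-08-14', '1954-03-20', '1941-08-20']})

# Full-domain level 2: keep only the year (first four characters) for every record
sample['Nascimento'] = sample['Nascimento'].str[:4]
print(sample['Nascimento'].tolist())
```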

# Interface

The interface has four buttons, one for each anonymization produced in this notebook, so the user can choose which anonymized dataset to view.

In [51]:
import tkinter as tk

import pandas as pd
from pandastable import Table

# Button label -> anonymized file written earlier in this notebook
DATASETS = [
    ("Anonymization 1: Municipality ('Município') - level 1", 'covid_anon01.csv'),
    ("Anonymization 2: Birth ('Nascimento') - level 1", 'covid_anon02.csv'),
    ("Anonymization 3: Birth ('Nascimento') - level 2", 'covid_anon03.csv'),
    ("Anonymization 4: Municipality ('Município') (1) and Birth ('Nascimento') (2)", 'covid_anon04.csv'),
]

class MyApp:
    def __init__(self, master):
        self.master = master
        master.title("Anonymized Datasets")
        master.configure(bg='#4169e1')

        # One button per anonymized dataset, stacked vertically
        for row, (label, path) in enumerate(DATASETS):
            button = tk.Button(master, text=label, font='Helvetica 16', fg='black', bg='white',
                               command=lambda p=path: self.display_df_in_new_window(p))
            button.grid(column=0, row=row, padx=20, pady=20)

    def display_df_in_new_window(self, path):
        frame = tk.Toplevel(self.master)  # new window for the selected dataset
        self.table = Table(frame, dataframe=pd.read_csv(path), showtoolbar=True, showstatusbar=True)
        self.table.show()


root = tk.Tk()
my_gui = MyApp(root)
root.mainloop()