
**[PT]** Para executar este ficheiro veja o ficheiro README.md nesta directoria

**[EN]** To run this file: follow instructions in the README.md file in this directory.



# **[PT]** Remissivas
# **[EN]** Cross references

The digital version of the FA carries over a cross-reference mechanism from the card catalog. As was usual at the time, extra cards were inserted into the catalog to guide users searching for name variants of the main entry. These cards have a “base” name, followed by the word “vide” and a “name expression”. 

This notebook analyses the cross-reference information in the FA.

Data from this notebook was used in:


## **[PT]** Inicialização
## **[EN]** Setup

In [500]:
from timelinknb import current_time,current_machine, get_mhk_db
import ucalumni.config as alumniconf

db_name = alumniconf.mhk_db_name
db = get_mhk_db(db_name, connect_args={'connect_timeout': 3600})
print(current_machine,current_time,f'db={db_name}')


mini-m1.local 2022-05-07 11:43:37.538887 db=ucalumni


Prepare a data frame to collect the results of the cross-reference analysis.


In [501]:
import pandas as pd

columns = ['data','sequential','random']
vars = ['vide','vide_plus',
        'see','see_matched','see_matched_ok','nodate_novide',
        'aka','nodate','nodate_novide','aka_matched','aka_matched_ok',
        'records_matched','records_matched_ok','records_error',
        'matched_pairs','matched_pairs_ok','records_see_aka','records_aka_see','records_aka_aka', 'records_see_see',
        'records_transitive','records_asymmetric']

match_info = pd.DataFrame(index=vars,columns=columns)
match_records = dict([(k,dict.fromkeys(columns)) for k in vars])
match_info.sort_index(inplace=True)
match_info

Unnamed: 0,data,sequential,random
aka,,,
aka_matched,,,
aka_matched_ok,,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,,,
nodate_novide,,,
nodate_novide,,,
records_aka_aka,,,
records_aka_see,,,


## **[PT]** Conjunto de registos com "vide"
## **[EN]** Set of "see also" records ("vide")

### **[PT]** Conjunto de fichas que incluem uma nota remissiva (vide)

Algumas fichas geram mais que uma linha na tabela seguinte, quando incluem mais que uma faculdade ou nome geográfico.

Por isso o número de linhas da tabela é superior ao número de fichas. 

**Para obter o número real de fichas é necessário contar o número de identificadores (números de seis dígitos) de registo diferentes:**

    nvide = len(vide.index.unique())

### **[EN]** Set of records which contain a "see" note (vide)


Note that records with more than one faculty and/or more than one geographic name 
generate more than one line. So the number of lines in the data frame is greater
than the number of records.

**To obtain the real number of records in a data frame it is necessary to count the number of unique record identifiers (six-digit numbers) in the data frame index.**

    nvide = len(vide.index.unique())
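
The difference between rows and records can be illustrated with a tiny data frame. This is only a sketch: the identifiers and values below are invented for illustration and do not come from the database.

```python
import pandas as pd

# Hypothetical data: record 100001 has two faculties, so it produces two rows.
rows = pd.DataFrame(
    {
        "name": ["Person A", "Person A", "Person B"],
        "faculdade": ["Leis", "Cânones", None],
    },
    index=pd.Index(["100001", "100001", "100002"], name="id"),
)

n_rows = len(rows)                    # number of lines in the data frame
n_records = len(rows.index.unique())  # number of distinct record identifiers

print("rows:", n_rows, "records:", n_records)
```

Counting rows overstates the number of records whenever a record repeats in the index, which is why the cells below always count unique index values.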

In [502]:
from timelinknb.pandas import attribute_to_df


# Get list of people with attribute nome-vide and add nome-geografico, nome-pai, entry date and faculdade
vide = attribute_to_df(
                    the_type='nome-vide',
                    person_info=True,
                    more_cols=['nome-geografico','faculdade','nome-pai','uc-entrada','uc-saida'],
                    sql_echo=False)
# drop columns that are not useful
vide.drop(['nome-vide.date','nome-vide.obs','nome-geografico.date','nome-geografico.obs','nome-pai.date','nome-pai.obs','uc-entrada.date','uc-entrada.obs'],axis=1, inplace=True)
nvide = len(vide.index.unique())
print(current_machine,current_time,f'db={db_name}')
print("Number of records with 'vide' cross reference:'",nvide)
match_info.loc['vide','data'] = nvide
match_records['vide']['data'] = vide.index.unique()
print()
print(vide.info())



mini-m1.local 2022-05-07 11:43:37.538887 db=ucalumni
Number of records with 'vide' cross reference:' 8625

<class 'pandas.core.frame.DataFrame'>
Index: 9286 entries, 127765 to 358077
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9286 non-null   object
 1   sex              9286 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8784 non-null   object
 4   faculdade        4793 non-null   object
 5   faculdade.date   4793 non-null   object
 6   faculdade.obs    4775 non-null   object
 7   nome-pai         3484 non-null   object
 8   uc-entrada       9286 non-null   object
 9   uc-saida         9286 non-null   object
 10  uc-saida.date    9286 non-null   object
 11  uc-saida.obs     0 non-null      object
dtypes: object(12)
memory usage: 1.2+ MB
None


In [503]:

print()
print("Check a few:")
vide.head(5)


Check a few:


id,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,


#### **[EN]** Problems in processing "vide" notes with multiple "vide"

There are a few cases in the form  "vide _...name..._ e vide _...name..._"

1. https://pesquisa.auc.uc.pt/details?id=141274
2. https://pesquisa.auc.uc.pt/details?id=147377
3. https://pesquisa.auc.uc.pt/details?id=147659
4. https://pesquisa.auc.uc.pt/details?id=150350
5. https://pesquisa.auc.uc.pt/details?id=150562
6. https://pesquisa.auc.uc.pt/details?id=152472
7. https://pesquisa.auc.uc.pt/details?id=189389
8. https://pesquisa.auc.uc.pt/details?id=190076
9. https://pesquisa.auc.uc.pt/details?id=191599
10. https://pesquisa.auc.uc.pt/details?id=192039
11. https://pesquisa.auc.uc.pt/details?id=196728
12. https://pesquisa.auc.uc.pt/details?id=197167
13. https://pesquisa.auc.uc.pt/details?id=207991
14. https://pesquisa.auc.uc.pt/details?id=209208
15. https://pesquisa.auc.uc.pt/details?id=216619
16. https://pesquisa.auc.uc.pt/details?id=244099
17. https://pesquisa.auc.uc.pt/details?id=248624
18. https://pesquisa.auc.uc.pt/details?id=266150

19. 130281 Nuno da Câmara is tricky because it combines a note and a "vide", and the "vide" part has two names:
      "Nuno da Câmara (D.), vide Nuno Casimiro da Câmara e Nuno José da Câmara". It links with 130516 and 130517.
      https://pesquisa.auc.uc.pt/details?id=130281
  
Handling these cases requires changing the grammar rules, which is scheduled for the next version.
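
As an illustration of what such a change would need to cover, the sketch below splits an entry of the form "base, vide A e vide B" with a regular expression. It is not the grammar used by the importer: the function name `split_vide` and the first sample entry are hypothetical, and the second sample shows that case 19 above would still come out as a single expression.

```python
import re

# Separator: optional comma, optional "e", then the word "vide".
VIDE_SEPARATOR = re.compile(r",?\s*\b(?:e\s+)?vide\s+")


def split_vide(entry: str) -> tuple[str, list[str]]:
    """Split an entry into the base name and its list of "vide" expressions."""
    parts = VIDE_SEPARATOR.split(entry)
    base = parts[0].strip()
    expressions = [p.strip() for p in parts[1:] if p.strip()]
    return base, expressions


# Hypothetical entry with two "vide" expressions.
print(split_vide("Fulano de Tal, vide Beltrano e vide Sicrano"))

# Case 19: the second name has no "vide" of its own, so it is not separated.
print(split_vide("Nuno da Câmara (D.), vide Nuno Casimiro da Câmara e Nuno José da Câmara"))
```

Splitting on a bare "e" would separate the two names in case 19, but it would also break expressions such as "e Faria", so that case probably needs a dedicated rule.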

### **[PT]** Determinar o tipo de remissiva

### **[EN]** Determine the type of cross reference

__Forward cross references (“see”)__
* Almost empty records with a name with “vide”
* A few with more than one (…vide... e vide…)
* No dates (empty “UnitDateInitial” field)
* Other than the name:
    * 93% place of birth
    * 27% father’s name
    * 23% faculty 
  
__Back cross references (“also known as/aka”)__
* Normal records with “vide” in the name.
* Dates (valid “UnitDateInitial” field)
* Contain all types of information:
    * 97% place of birth
    * 53% father’s name
    * 99% faculty
    * degrees, enrolment, and so on.
* Can be matched with “see” records.
* These records are the non preferred form of the name and should link to a preferred form.



#### **[PT]** Remissivas tipo "ver": registos com 'vide' e sem datas

Estes registos são a forma não preferida do nome e remetem para a forma principal.

#### **[EN]** "See"  or forward cross-references: "vide" and no dates

These records are the non preferred form of the name and should link to a preferred form.

In [504]:

zdate_filter = vide['uc-entrada'] == '0000-00-00'
vide.loc[zdate_filter,'rec_type'] = 'see'

see_vide = vide[zdate_filter]
nsee_vide = len(see_vide.index.unique())
match_info.loc['see','data'] = nsee_vide
match_records['see']['data'] = list(see_vide.index.unique())
print("Number of vide records with zero dates (forward cross references):",nsee_vide)

nsee_vide_geo = len(see_vide[see_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['see_geo','data'] = nsee_vide_geo
print(f"    of which {nsee_vide_geo} with place of birth {nsee_vide_geo/nsee_vide:.2%}")

nsee_vide_pai = len(see_vide[see_vide['nome-pai'].notnull()].index.unique())
match_info.loc['see_pai','data'] = nsee_vide_pai
print(f"    of which {nsee_vide_pai} with father's name  {nsee_vide_pai/nsee_vide:.2%}")

nsee_vide_fac = len(see_vide[see_vide['faculdade'].notnull()].index.unique())
match_info.loc['see_fac','data'] = nsee_vide_fac
print(f"    of which {nsee_vide_fac} with faculty        {nsee_vide_fac/nsee_vide:.2%}")

print()
base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']


Number of vide records with zero dates (forward cross references): 5563
    of which 5153 with place of birth 92.63%
    of which 1512 with father's name  27.18%
    of which 1305 with faculty        23.46%



In [505]:
match_info.sort_index(inplace=True)
match_info.fillna(" ")

Unnamed: 0,data,sequential,random
aka,,,
aka_matched,,,
aka_matched_ok,,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,,,
nodate_novide,,,
nodate_novide,,,
records_aka_aka,,,
records_aka_see,,,


In [506]:
# Show some
see_vide.head()

id,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128053,António da Fonseca Cabral,m,Fonseca,Samodães,Cânones,0000-00-00,Cânones,Sebastião da Fonseca Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128061,António de Matos Cabral,m,Matos,Alhos Vedros,Cânones,0000-00-00,Cânones,Tomé de Matos Cabral,0000-00-00,0000-00-00,0000-00-00,,see


#### **[PT]** Remissivas inversas ("também conhecido por"): registos completos com "vide"
#### **[EN]** "Aka" or back references: complete records with "vide" and other types of information

These are the records that should be linked back to zero date vide records.

There are too few of them!

In [507]:
# count vide records with a proper (non-zero) date
aka_filter = vide['uc-entrada'] != '0000-00-00'
vide.loc[aka_filter,'rec_type'] = 'aka'
aka_vide = vide[aka_filter]

naka_vide = len(set(aka_vide.index.values))
match_info.loc['aka','data'] = naka_vide
print("Number of records with vide and proper date (aka):",naka_vide)
match_records['aka']['data'] = list(aka_vide.index.unique())

naka_vide_geo = len(aka_vide[aka_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['aka_geo','data'] = naka_vide_geo
print(f"    of which {naka_vide_geo} with place of birth {naka_vide_geo/naka_vide:.2%}")

naka_vide_pai = len(aka_vide[aka_vide['nome-pai'].notnull()].index.unique())
match_info.loc['aka_pai','data'] = naka_vide_pai
print(f"    of which {naka_vide_pai} with father's name  {naka_vide_pai/naka_vide:.2%}")

naka_vide_fac = len(aka_vide[aka_vide['faculdade'].notnull()].index.unique())
match_info.loc['aka_fac','data'] = naka_vide_fac
print(f"    of which {naka_vide_fac} with faculty        {naka_vide_fac/naka_vide:.2%}")

print("Number of records with vide and zero date (see):",nsee_vide)
# subtract the dated (aka) records from the zero-date (see) records
print("Number of zero date records in excess of dated vide records         :", nsee_vide-naka_vide)
match_info.sort_index(inplace=True)

Number of records with vide and proper date (aka): 3062
    of which 2973 with place of birth 97.09%
    of which 1619 with father's name  52.87%
    of which 3035 with faculty        99.12%
Number of records with vide and zero date (see): 5563
Number of zero date records in excess of dated vide records         : 2501


In [508]:
match_info.fillna(" ")

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,,
aka_matched_ok,,,
aka_pai,1619.0,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,,,
nodate_novide,,,


In [509]:
# Show some
aka_vide.head()

id,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,,aka
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,,aka
128142,Diogo de Morais Cabral,m,Morais,Mêda,Cânones,1681-10-28,Cânones,,1681-10-28,1689-10-01,1689-10-01,,aka
128155,Fernão Cabral,m,Albuquerque,Celorico da Beira,Cânones,1663-10-15,Cânones,,1663-10-15,1674-07-24,1674-07-24,,aka
128333,Inácio de Figueiredo Cabral,m,Albuquerque,Penalva,Cânones,1655-10-22,Cânones,,1655-10-22,1662-07-21,1662-07-21,,aka


### **[EN]** Look at other records with no dates, even if they have no "vide" expression

This tests whether all zero-date records are part of the cross-reference scheme.
The "vide" expression may have been missed when the record was entered in the database.

First collect all records with a zero date.

In [510]:
from timelinknb.pandas import attribute_to_df
from timelinknb import Session

with Session() as session:
    session.begin()

# Get list of people with no start-date and add nome-geografico, nome-pai, nome-vide and faculdade
zero_date = attribute_to_df(
                    the_type='uc-entrada',
                    the_value='0000-00-00',
                    person_info=True,
                    more_cols=['nome-vide','nome-geografico','nome-pai','faculdade','uc-saida'],
                    sql_echo=False)
zero_date.drop(['nome-vide.date','nome-vide.obs','nome-geografico.date','nome-geografico.obs','nome-pai.date','nome-pai.obs','uc-entrada.date','uc-entrada.obs'],axis=1, inplace=True)                    
nzero_date = len(set(zero_date.index.unique()))
print()
print(current_machine,current_time,f'db={db_name}')
print("Total number of rows with zero date:", len(zero_date))
print("Total number of records with zero date:", nzero_date)
match_info.loc['nodate','data'] = nzero_date

base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']
zero_date[base_vide_cols].count(axis=0)



mini-m1.local 2022-05-07 11:43:37.538887 db=ucalumni
Total number of rows with zero date: 6061
Total number of records with zero date: 5763


nome-vide          5843
nome-geografico    5605
faculdade          1521
nome-pai           1673
dtype: int64

#### **[EN]** List of records with no date and no "vide": are they part of the cross references?

These are zero-date records with no "vide" information, 
which means that there are no name transformations 
to use when searching for matching records.
But since they have no dates, they might still be part of 
the cross-reference set.

In late April 2022 there were around 200 records.


In [511]:

# From the zero date set filter those with no "vide" 
zd_no_vide = zero_date[zero_date['nome-vide'].isnull()]
nzd_no_vide = len(set(zd_no_vide.index.values))
print()
print("Number of records with zero date and no 'vide':",nzd_no_vide)




Number of records with zero date and no 'vide': 200


#### **[EN]** Check if the unit dates were left blank by mistake

If a record with no unit dates nevertheless contains dated information, 
then it would be possible to register the unit dates from that information,
and the blank unit dates are an error.
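
A minimal sketch of how unit dates could be derived from dated attributes, assuming dates are ISO strings and using the same bounds as the filter applied below. The function name `infer_unit_dates` is hypothetical and is not part of `timelinknb`.

```python
def infer_unit_dates(attribute_dates):
    """Return (initial, final) unit dates inferred from attribute dates, or None.

    Dates are ISO strings (YYYY-MM-DD), so plain string comparison orders them.
    Zero dates and dates outside the expected range are ignored.
    """
    valid = sorted(
        d for d in attribute_dates
        if d and "0000-00-00" < d < "1917-12-31"
    )
    if not valid:
        return None
    return valid[0], valid[-1]


# Attribute dates of record 128851 as listed in the sample output, plus a zero date.
print(infer_unit_dates(["1695-02-23", "1695-05-18", "0000-00-00"]))
```

A record for which this returns a pair of dates is a normal record with unfilled unit dates, not a cross reference.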

First collect all the attributes available for those "zero date no vide" records.

In [512]:
from timelinknb.pandas import group_attributes

zdnv_group = group_attributes(set(zd_no_vide.index.values))

Next search for attributes with valid dates in that set.

In [513]:
zdnv_with_dates = (zdnv_group['date']>'0000-00-00') & (zdnv_group['date'] < '1917-12-31')
false_zd = zdnv_group[zdnv_with_dates]
nfalse_zd = len(set(false_zd.index.values))
print("Number of records with dates in attributes but not unit dates:",nfalse_zd)
print("These are not cross-reference records, just records with unfilled unit dates")
print("Sample:")
false_zd.head(10)[['name','date','type','value','attr_obs']]

Number of records with dates in attributes but not unit dates: 59
These are not cross-reference records, just records with unfilled unit dates
Sample:


id,name,date,type,value,attr_obs
128851,Luís de Miranda Cabral,1695-02-23,grau,Bacharel em Artes,Incorporação de Bacharel em Artes: 23.02.1695
128851,Luís de Miranda Cabral,1695-02-23,grau.ano,Bacharel em Artes.1695,Incorporação de Bacharel em Artes: 23.02.1695
128851,Luís de Miranda Cabral,1695-05-18,grau,Licenciado em Artes,18.05.1695
128851,Luís de Miranda Cabral,1695-05-18,grau.ano,Licenciado em Artes.1695,18.05.1695
131475,Diogo Fialho,1656-03-30,grau,Licenciado em Artes,Licenciado em Artes 30.03.1656: Atos e Graus L...
131475,Diogo Fialho,1656-03-30,grau.ano,Licenciado em Artes.1656,Licenciado em Artes 30.03.1656: Atos e Graus L...
137651,Manuel Pais de Figueiredo,1685-07-30,grau,Formatura em Cânones,Formatura 30.07.1685
137651,Manuel Pais de Figueiredo,1685-07-30,grau.ano,Formatura em Cânones.1685,Formatura 30.07.1685
139433,Alexandre da Fonseca,1614-05-17,grau,Licenciado em Artes,Licenciado em Artes 17.05.1614
139433,Alexandre da Fonseca,1614-05-17,grau.ano,Licenciado em Artes.1614,Licenciado em Artes 17.05.1614


We remove those records from the possible cross-reference additions.

In late April 2022 there were around 60 such records. They are not cross references.

Removing those from the zero-date, no "vide" records, around 140 remain.


In [514]:
zd_no_vide_clean = zd_no_vide.drop(false_zd.index)
zd_no_vide_clean['rec_type'] = 'see'
nzd_no_vide_clean = len(zd_no_vide_clean.index.unique())
print()
print("Number of records with zero date and no 'vide' (cleaned):",nzd_no_vide_clean)
match_info.loc['nodate_novide','data'] = nzd_no_vide_clean
# store under the nested 'data' key, as for the other entries (was a tuple key)
match_records['nodate_novide']['data'] = zd_no_vide_clean.index.unique()
print("Information contained in these records:")
base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai','uc-entrada']
zd_no_vide_clean[base_vide_cols].count(axis=0)


Number of records with zero date and no 'vide' (cleaned): 141
Information contained in these records:


nome-vide            0
nome-geografico    132
faculdade           79
nome-pai            63
uc-entrada         152
dtype: int64

Let's see what they look like.

In [515]:
zd_no_vide_clean[['name','nome-vide','nome-pai','nome-geografico','faculdade','faculdade.obs','uc-entrada','rec_type']].head().sort_values('name').fillna(" ")

id,name,nome-vide,nome-pai,nome-geografico,faculdade,faculdade.obs,uc-entrada,rec_type
128114,Belchior de Sá Cabral,,,Alfândega,Cânones,Cânones,0000-00-00,see
129384,Damião Dias Caldeira,,,Estremoz,Leis,Leis,0000-00-00,see
128371,João Cabral,,António Teixeira,Torres Vedras,,,0000-00-00,see
130534,Manuel Domingues Ferreira,,,Ferreiros,,,0000-00-00,see
130359,Manuel Ferreira,,,Penalva,Artes,Faculdade inferida,0000-00-00,see


#### **[EN]** Add zero date records with no 'vide' to records to be matched

We join the zero-date, no "vide" records to the "vide" records.

We assume that zero-date records are also "see also" records which were not flagged as "vide" due to input variations.

But we know this is not always the case: some of the zero-date records are normal records where the unit dates were not recorded for some reason.

In [516]:
import pandas as pd

vide_plus = pd.concat([vide,zd_no_vide_clean])
nvide_plus = len(vide_plus.index.unique())
match_info.loc['vide_plus','data'] = nvide_plus
match_records['vide_plus']['data'] = vide_plus.index.unique()
print(f"Number of unique records involved in the cross references: {nvide_plus}")
vide_plus.info()

Number of unique records involved in the cross references: 8766
<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 127765 to 316291
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
dtypes: object(13)
memory usage: 1.3+ MB


#### **[EN]** Update information on "see" type references

This takes into account the new zero-date, no "vide" records (around 140).

In [517]:
see_vide = vide_plus[vide_plus['uc-entrada'] == '0000-00-00']

nsee_vide = len(see_vide.index.unique())
match_info.loc['see','data'] = nsee_vide
match_records['see']['data']=list(see_vide.index.unique())
print("Number of vide records with zero dates (forward cross references) updated:",nsee_vide)

nsee_vide_geo = len(see_vide[see_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['see_geo','data'] = nsee_vide_geo
print(f"    of which {nsee_vide_geo} with place of birth {nsee_vide_geo/nsee_vide:.2%}")

nsee_vide_pai = len(see_vide[see_vide['nome-pai'].notnull()].index.unique())
match_info.loc['see_pai','data'] = nsee_vide_pai
print(f"    of which {nsee_vide_pai} with father's name  {nsee_vide_pai/nsee_vide:.2%}")

nsee_vide_fac = len(see_vide[see_vide['faculdade'].notnull()].index.unique())
match_info.loc['see_fac','data'] = nsee_vide_fac
print(f"    of which {nsee_vide_fac} with faculty        {nsee_vide_fac/nsee_vide:.2%}")
print()

base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']


Number of vide records with zero dates (forward cross references) updated: 5704
    of which 5274 with place of birth 92.46%
    of which 1569 with father's name  27.51%
    of which 1378 with faculty        24.16%



#### **[EN]** Closer look at "see" references


##### **[EN]** Presence of place of birth in zero date records

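Most of them have place of birth information.
Note that some records have more than one geographic name, so row counts can exceed the number of unique records.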

In [518]:
see_vide_with_geo = see_vide[see_vide['nome-geografico'].notnull()]
nsee_vide_with_geo = len(set(see_vide_with_geo.index.values))
print("See references with geo info (unique records):",
       nsee_vide_with_geo,
       "out of",nsee_vide,
       # the percentage uses the same denominator that is printed (nsee_vide)
       f'= {nsee_vide_with_geo/nsee_vide:.2%}')
print("Other information (note that some records have more than one geographic name)")
see_vide_with_geo[base_vide_cols].count(axis=0)


See references with geo info (unique records): 5274 out of 5704 = 92.46%
Other information (note that some records have more than one geographic name)


nome-vide          5433
nome-geografico    5565
faculdade          1392
nome-pai           1639
dtype: int64

##### **[EN]** See references with no birth place

Check which information is available when place of birth is missing.

The values are similar to normal "see" records.


In [519]:
see_vide_nogeo = see_vide[see_vide['nome-geografico'].isnull()]
nsee_vide_nogeo = len(set(see_vide_nogeo.index.values))
print("Zero date records without geo info:",
       nsee_vide_nogeo,
       "out of",nzero_date,
       f'= {nsee_vide_nogeo/nzero_date:.2%}')
print()
print("Other information:")
see_vide_nogeo[base_vide_cols].count(axis=0)

Zero date records without geo info: 430 out of 5763 = 7.46%

Other information:


nome-vide          410
nome-geografico      0
faculdade           64
nome-pai            20
dtype: int64

### **[EN]** Final list of records involved in cross references



In [520]:
import pandas as pd

pd.set_option('display.max_rows',50)
vide_plus.info()
vide_plus.head(10)

<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 127765 to 316291
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
dtypes: object(13)
memory usage: 1.3+ MB


id,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,,aka
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,,aka
128053,António da Fonseca Cabral,m,Fonseca,Samodães,Cânones,0000-00-00,Cânones,Sebastião da Fonseca Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128061,António de Matos Cabral,m,Matos,Alhos Vedros,Cânones,0000-00-00,Cânones,Tomé de Matos Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128066,António de Mendonça Cabral,m,Mendonça,Pernambuco,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128093,António Pinto Cabral,m,Pinto,Grajal,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128103,António Veloso Cabral,m,Veloso,Sanfins,,,,,0000-00-00,0000-00-00,0000-00-00,,see


In [521]:
match_info.sort_index(inplace=True)
match_info.fillna('')

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,,
aka_matched_ok,,,
aka_pai,1619.0,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,5763.0,,
nodate_novide,141.0,,


## **[EN]** Analyse geographic names for variations

An alternative approach is to build an index of variations in geographic names and 
find name matches within the set of names from similarly spelled places.

For similarity we use the so-called "gestalt pattern matching" from the Python standard library module `difflib`; see: https://docs.python.org/3/library/difflib.html

The threshold ratio was determined empirically. It is not a problem that some false variations are detected, since 
a further check is done by matching the names.

The following code compares all the geographic names and prints those considered to be variants of the same place, with
the similarity ratio. Note that this ratio is sensitive to length: it fails the threshold for short forms like "Algozo/Algoso" and
when the two forms have different lengths, like "Poiares/Vila Nova de Poiares".

Still, it detects many useful variants.
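
The length sensitivity can be checked on a few pairs. This is a small sketch using `difflib.SequenceMatcher` with the same 0.85 threshold as the cell below; the helper name `similarity` is introduced here only for illustration.

```python
import difflib

THRESHOLD = 0.85  # same value as diff_threshold in the notebook


def similarity(a: str, b: str) -> float:
    """Similarity ratio: 2 * matched characters / total characters in both strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


pairs = [
    ("Arco", "Arcos"),                    # one extra letter in a short name
    ("Algozo", "Algoso"),                 # one substituted letter in a short name
    ("Poiares", "Vila Nova de Poiares"),  # short form versus long form
]

for a, b in pairs:
    ratio = similarity(a, b)
    verdict = "variant" if ratio >= THRESHOLD else "below threshold"
    print(f"{a} / {b}: {ratio:.3f} ({verdict})")
```

Because the ratio divides by the combined length of both strings, a single substitution costs proportionally more in a short name, and a long prefix such as "Vila Nova de" pulls the score down even when the shorter form is fully contained in the longer one.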

In [522]:
import difflib

# we have many variations in geographic names
have_geoname_filter = vide_plus['nome-geografico'].notnull()
geonames_index = sorted(vide_plus[have_geoname_filter]['nome-geografico'].unique())
print("Number of different geo names:",len(geonames_index))
geo_similars = {}
diff_threshold = .85

# geonames_index was built from non-null values, so no None checks are needed
for geo in geonames_index:
    # compare only with names that sort after geo, so each pair is tested once
    for similar in [s for s in geonames_index if s > geo]:
        diff = difflib.SequenceMatcher(None, geo, similar).ratio()
        if diff >= diff_threshold:
            print(f"{geo} / {similar} diff:{diff:.3}")
            # register the variant in both directions
            geo_similars.setdefault(geo, []).append(similar)
            geo_similars.setdefault(similar, []).append(geo)


Number of different geo names: 1429
Aguas Santas / Águas Santas diff:0.917
Alcacer / Alcácer diff:0.857
Alcaide / Alcaíde diff:0.857
Alenquer / Alquer diff:0.857
Alhos Vedras / Alhos Vedros diff:0.917
Almalaguer / Almalaguez diff:0.9
Ameixilhoeira da Carregação / Mexilhoeira da Carregação diff:0.923
Anadaluzia / Andaluzia diff:0.947
Angra do Heroismo / Angra do Heroísmo diff:0.941
Arco / Arcos diff:0.889
Arcos Valdevez / Arcos de Valdevez diff:0.903
Arcos Valdevez / Arcos de Valedevez diff:0.875
Arcos de Valdevez / Arcos de Valedevez diff:0.971
Arcozelo / Arcozelos diff:0.941
Arrifana de Sousa / Arrifana do Sousa diff:0.941
Atei / Athei diff:0.889
Bairros / Barrosa diff:0.857
Barcelos / Barcos diff:0.857
Barcos / Buarcos diff:0.923
Barreira / Barreiria diff:0.941
Barreira / Barreiro diff:0.875
Barrosa / Barrosas diff:0.933
Bemviver / Benviver diff:0.875
Cabacos / Cabaços diff:0.857
Cabananas / Cabanas diff:0.875
Cabeceira de Basto / Cabeceira de Bastos diff:0.973
Cabeceira de Basto / C

## **[EN]** Matching "vide" references

Try to match "forward" and "backward" references by generating the target names from the "vide"
expressions.

### **[PT]** Geração de nomes alternativos a partir das expressões de "vide"


Existem quatro tipos de expressões "vide".

1. Remoção de Apelido: António Veloso Cabral, vide Veloso. Interpretação: António Veloso
2. Adição de Apelido : André de Campos, vide Cordeiro. Interpretação: André de Campos Cordeiro
3. Substituição do nome: Adriano Sisnando Brotero de Avelar Quintino, vide Adriano Sisnando Brotero Quintino de Avelar. Interpretação: Adriano Sisnando Brotero Quintino de Avelar
4. Substituição de Apelido: Francisco António Campos, vide de Novais Campos. Interpretação: Francisco António de Novais Campos
  

### **[EN]** Generation of target names from "vide" expressions

 


1. “Cut”: António Veloso Cabral, vide Veloso, result: António Veloso. The “vide” expression is a family name that occurs in the base name before the last one; the target name is the base name up to and including the “vide” expression, i.e. a shorter version of the base name with the last family name(s) removed.
2. “Add”: André de Campos, vide Cordeiro, result: André de Campos Cordeiro. The “vide” expression is not present in the base name; the target name is the base name with the “vide” expression added at the end. In some cases the real target name has an extra particle before the “vide” expression (“de”, “e”, etc.).
3. “Replace”: Adriano Sisnando Brotero de Avelar Quintino, vide Adriano Sisnando Brotero Quintino de Avelar, result: Adriano Sisnando Brotero Quintino de Avelar. The “vide” expression is a full name that shares the first name with the base name. This happens when the transformation of family names cannot be expressed by “cut” or “add”, so the author of the card wrote the full target name after “vide” for clarity.
4. “Partial replace”: Francisco António Campos, vide de Novais Campos, result: Francisco António de Novais Campos. The “vide” expression replaces part of the base name: the base name and the “vide” expression overlap at the end, and the overlapping part of the base name is replaced by the whole “vide” expression.
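The four rules can be sketched as a small standalone function. This is a simplified illustration, not the notebook's actual implementation (which appears in a later cell and also consults a list of known first names); the function name `target_name` and the rule labels it returns are introduced here for illustration only.

```python
from os.path import commonprefix


def target_name(base, vide):
    """Return (target name, rule) for a base name and a "vide" expression.

    Simplified sketch of the four rules; real data needs extra checks.
    """
    # Cut: the vide expression occurs inside the base name.
    pos = base.find(vide)
    if pos > -1:
        return base[:pos] + vide, "cut"

    # Replace: the vide expression is a full name with the same first name.
    if base.split()[0] == vide.split()[0]:
        return vide, "replace"

    # Partial replace: both strings end with the same whole word(s).
    # The common suffix is found by reversing both strings.
    suffix = commonprefix([base[::-1], vide[::-1]])[::-1]
    if suffix.startswith(" "):
        overlap = suffix.strip()
        return base[: len(base) - len(overlap)] + vide, "partial replace"

    # Add: otherwise the vide expression is an additional family name.
    return base + " " + vide, "add"


examples = [
    ("António Veloso Cabral", "Veloso"),
    ("André de Campos", "Cordeiro"),
    ("Adriano Sisnando Brotero de Avelar Quintino",
     "Adriano Sisnando Brotero Quintino de Avelar"),
    ("Francisco António Campos", "de Novais Campos"),
]
for base, vide in examples:
    print(base, "| vide", vide, "|", target_name(base, vide))
```

The order of the tests matters: "cut" is tried first because a vide expression contained in the base name would otherwise also qualify for the other rules.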



In [523]:
vide_plus.loc['217701']

name               José de Santo António
sex                                    m
nome-vide                      Lencastre
nome-geografico                      NaN
faculdade                            NaN
faculdade.date                       NaN
faculdade.obs                        NaN
nome-pai                             NaN
uc-entrada                    0000-00-00
uc-saida                      0000-00-00
uc-saida.date                 0000-00-00
uc-saida.obs                        None
rec_type                             see
Name: 217701, dtype: object

#### **[EN]** Collect first names from database, filter rare ones

In [524]:
# collect possible first names from current database
from timelinknb.pandas import attribute_values
from timelinknb import Session


# collect list of first names, ignore the less frequent ones
#
threshold = 5
pnomes = []
pnomes_table = attribute_values('nome-primeiro')
for id,linha in pnomes_table.iterrows():
    pnome = id
    count = linha['count']
    if count>threshold:
        pnomes.append(pnome)

print(f"Number of first names with more than {threshold} occurrences {len(pnomes)}")
print("Use this to copy to other places:")
print()
print("[")
for i in range(len(pnomes)):
    print(f"'{pnomes[i]}',", end='')
    if int((i+1)%5) == 0:
        print()
print("]")

Number of first names with more than 5 occurrences 279
Use this to copy to other places:

[
'Abel','Abílio','Acácio','Acúrcio','Adão',
'Adelino','Adolfo','Adriano','Adrião','Afonso',
'Agnelo','Agostinho','Aires','Albano','Alberto',
'Albino','Aleixo','Alexandre','Alfredo','Alípio',
'Álvaro','Amadeu','Amador','Amancio','Amândio',
'Amaro','Ambrósio','Américo','Anacleto','Anastácio',
'André','Ângelo','Aníbal','Aniceto','Anselmo',
'Antão','Antero','António','Antrónio','Apolinário',
'Arcanjo','Aristides','Armando','Arnaldo','Arsenio',
'Artur','Ascenso','Atanásio','Augusto','Aureliano',
'Aurélio','Avelino','Baltasar','Barnabé','Bartolomeu',
'Basílio','Batista','Belchior','Benjamim','Bento',
'Berardo','Bernardino','Bernardo','Boaventura','Bonifácio',
'Brás','Bruno','Caetano','Calisto','Camilo',
'Cândido','Carlos','Casimiro','Celestino','César',
'Cipriano','Cláudio','Clemente','Constantino','Cosme',
'Crisogono','Crispim','Cristiano','Cristóvão','Custódio',
'Dâmaso','Damião','Daniel','David','De

#### **[EN]** Apply vide expression transformation for records.

Print the replacements that involve changing the first name, because they are error-prone.




In [525]:
from os.path import commonprefix
import re


vide_plus['loookup']=''
vide_plus['vide_type']=''

for id,linha in vide_plus.iterrows():
    nome =  linha['name']
    if not pd.isnull(linha['nome-vide']) :
        nome_vide = linha['nome-vide']
        nomes = nome.split(" ")
        nomes_vide = nome_vide.split(" ")
        # find a common suffix (invert names, use commonprefix, invert result)
        terminacao_comum = commonprefix([nome[::-1],nome_vide[::-1]])[::-1]
        # check it is a separate name and not just common letters at the end
        # a proper family name should share a starting space 
        if len(terminacao_comum) > 0:
            if terminacao_comum[0] != ' ':
                terminacao_comum = ''    # not a separate name, abandon
            else:
                terminacao_comum = terminacao_comum.strip()
        # open question: using the common suffix currently lowers the number of matches
        # terminacao_comum = ''

        # Type CUT: vide is an inner part of the original name
        # e.g. André Vaz Cabaço, vide Vaz
        # but also Manuel de Almeida Cabral, vide de Almeida 
        pos = nome.find(nome_vide)
        if pos > -1:
            lookup_name = nome[0:pos] + nome_vide
            vtype="cut"
        # Type REP: vide name looks like a full name 
        # e.g. António de Abreu Bacelar de Azevedo, vide António Abreu Bacelar
        # relaxing the same-first-name rule produces many false matches ("leaks"),
        # e.g. this variant leaks a lot:
        #   elif len(nomes_vide)>1 and nomes_vide[0] in pnomes :
        elif nomes[0] == nomes_vide[0]:
            lookup_name = nome_vide 
            vtype='rep'
        # Type REPAP: vide overlaps end of name
        # e.g. Joaquim Carvalho, vide Ramos de Carvalho
        # but vide must not contain first name
        # in that case probably a REP
        # otherwise it generates leaks and lowers the number of matches
        elif terminacao_comum > '':
            if not nomes_vide[0] in pnomes :
                # escape the suffix so that characters such as '.' are matched literally
                lookup_name = re.sub(re.escape(terminacao_comum) + '$',nome_vide,nome)
                vtype='repap'
            else:  # if common termination and first name better replace
                lookup_name = nome_vide
                vtype='rep'              
        else:
            # TYPE ADD vide name is not part of original nor a full name
            # so it must be an additional surname
            # e.g. Fernão Cabral, vide Albuquerque = Fernão Cabral Albuquerque
            lookup_name = nome+" "+nome_vide
            vtype='add'
    else:
        lookup_name = nome
        vtype='novid'
  
        
    # Try to recover cases where the first name itself was replaced.
    # They are missed by the REP and REPAP rules above and end up
    # producing a lookup that is the concatenation of two names.
    # This rule was added after examining bad "ADD" and "REPAP" results:
    # if the result is a long name (>5 names), both name and vide start
    # with first names, and vide is also long (at least 4 names), then it is
    # probably a replace that changes the first name.
    # Rows without a vide expression ('novid') are skipped, because
    # nomes and nomes_vide are not set for them.
    nomes_lookup = lookup_name.split()
    if vtype not in ('rep', 'novid') \
         and nomes[0] in pnomes and nomes_vide[0] in pnomes\
         and nomes[0] != nomes_vide[0]\
         and len(nomes_vide) > 3\
         and len(nomes_lookup) > 5:
        old_lookup = lookup_name
        lookup_name = nome_vide
        vtype = 'rep+'
        print(id,nome,"vide", nome_vide,"--->",lookup_name,"\n  instead of", old_lookup)

    # print(f'{type} :{id:7}{nome:40}{nome_vide:40} = {lookup_name}')
    vide_plus.loc[id,'lookup'] = lookup_name
    vide_plus.loc[id,'vide_type'] = vtype


174974 Dionísio Dinis de Oliveira vide Dinis de Oliveira da Fonseca ---> Dinis de Oliveira da Fonseca 
  instead of Dionísio Dinis de Oliveira Dinis de Oliveira da Fonseca
175757 João Batista de Oliveira vide José Batista de Oliveira Baeina ---> José Batista de Oliveira Baeina 
  instead of João Batista de Oliveira José Batista de Oliveira Baeina
179121 João Xavier Mousinho da Silveira Gomide vide José Xavier Mousinho Gomide da Silveira ---> José Xavier Mousinho Gomide da Silveira 
  instead of João Xavier Mousinho da Silveira Gomide José Xavier Mousinho Gomide da Silveira
180061 João António Osório Pereira Gouveia vide Adriano Osório Pereira Guerra ---> Adriano Osório Pereira Guerra 
  instead of João António Osório Pereira Gouveia Adriano Osório Pereira Guerra
180742 Adriano Osório Pereira Guerra vide João António Pereira Cerenato ---> João António Pereira Cerenato 
  instead of Adriano Osório Pereira Guerra João António Pereira Cerenato
195729 Joaquim da Conceição vide Jacinto de Sã

In [526]:
vide_plus.loc['217701']

name                         José de Santo António
sex                                              m
nome-vide                                Lencastre
nome-geografico                                NaN
faculdade                                      NaN
faculdade.date                                 NaN
faculdade.obs                                  NaN
nome-pai                                       NaN
uc-entrada                              0000-00-00
uc-saida                                0000-00-00
uc-saida.date                           0000-00-00
uc-saida.obs                                  None
rec_type                                       see
loookup                                           
vide_type                                      add
lookup             José de Santo António Lencastre
Name: 217701, dtype: object

#### **[EN]** Collect stats on type of vide transformation applied

In [527]:
vide_types = vide_plus.groupby('vide_type').count()[['name']]
vide_types['perc'] = vide_types['name']/ vide_types['name'].sum()
vide_types

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,4057,0.429858
cut,4126,0.437169
novid,152,0.016105
rep,1057,0.111994
rep+,20,0.002119
repap,26,0.002755


#### **[EN]** Double check partial replace transformation

* Partial replace:
    * Francisco António Campos, vide de Novais Campos, result: Francisco António de Novais Campos. 
        * the “vide” expression replaces part of the base name; 
        * the “base name” and the “vide” expression overlap at the end; 
        * the matched part in the “base name” is replaced by the “vide” expression.

These transformations are sensitive to misspelled first names.

In [528]:
# Check if we got many cases of vide with overlap, and if they are handled right
repap = vide_plus[['name','nome-vide','vide_type', 'lookup','nome-geografico','faculdade','uc-entrada']].sort_values(['name','nome-vide','vide_type', 'lookup'])
repap[repap.vide_type == 'repap']

Unnamed: 0_level_0,name,nome-vide,vide_type,lookup,nome-geografico,faculdade,uc-entrada
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
151153,António Barreiros,Rodrigues Barreiros,repap,António Rodrigues Barreiros,Lisboa,Leis,1655-10-16
171604,Belchior de Matos,Rodrigues de Matos,repap,Belchior Rodrigues de Matos,Vila Viçosa,Cânones,1621-10-11
171604,Belchior de Matos,Rodrigues de Matos,repap,Belchior Rodrigues de Matos,Vila Viçosa,Leis,1621-10-11
192525,Diogo Chamorro,Homem Chamorro,repap,Diogo Homem Chamorro,Porto,,0000-00-00
130230,Francisco António Campos,de Novais Campos,repap,Francisco António de Novais Campos,Azeitão,,0000-00-00
135280,Francisco Tavares de Figueiredo,Farncisco Xavier Tavares de Figueiredo,repap,Francisco Farncisco Xavier Tavares de Figueiredo,Meãs,Cânones,1762-10-01
135280,Francisco Tavares de Figueiredo,Farncisco Xavier Tavares de Figueiredo,repap,Francisco Farncisco Xavier Tavares de Figueiredo,vila,Cânones,1762-10-01
209659,Gaspar da Cunha,Macedo da Cunha,repap,Gaspar Macedo da Cunha,Amarante,,0000-00-00
165045,Isidoro da Cunha de Eça,dos Santos de Eça,repap,Isidoro da Cunha dos Santos de Eça,Alvorninha,,0000-00-00
165046,Isidoro dos Santos de Eça,da Cunha de Eça,repap,Isidoro dos Santos da Cunha de Eça,Alvorninha,Cânones,1735-10-01


Current failures:

* 135280: Francisco Tavares de Figueiredo __vide__ Farncisco Xavier Tavares de Figueiredo; repap result: Francisco Farncisco Xavier Tavares de Figueiredo. __first name misspelled__
* 245474: Jerónimo de Magalhães Mexia __vide__ jerónimo josé de Macêdo Magalhães Mexia; repap result: Jerónimo de jerónimo josé de Macêdo Magalhães ... __first name misspelled__
* 277264: José Luís Alves Feijó __vide__ Angelo do Santíssimo Sacramento Alves Feijó; repap result: José Luís Angelo do Santíssimo Sacramento Alve... __first name misspelled__
* 228003: José da Fonseca Marques da Silva __vide__ da Fonseca da Silva; repap result: José da Fonseca Marques da Fonseca da Silva. __bad expression, should be a replace__
* 204835: Luís de Figueiredo __vide__ Figueiredo Lobo ou Lobo de Figueiredo; repap result: Luís Figueiredo Lobo ou Lobo de Figueiredo. __bad expression, not understandable__
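One possible way to flag the misspelled-first-name cases before the partial replace rule is applied is to compare the first tokens of the base name and of the "vide" expression with a similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library; the helper name `first_name_misspelled` and the 0.8 threshold are assumptions made for illustration and have not been tuned on the FA data.

```python
from difflib import SequenceMatcher


def first_name_misspelled(base, vide, threshold=0.8):
    """Guess whether vide starts with a misspelled copy of the base first name.

    Hypothetical helper: True when the two first tokens differ
    but are very similar according to SequenceMatcher.
    """
    first_base = base.split()[0].lower()
    first_vide = vide.split()[0].lower()
    if first_base == first_vide:
        return False  # same first name, nothing to flag
    ratio = SequenceMatcher(None, first_base, first_vide).ratio()
    return ratio >= threshold


# Record 135280: "Farncisco" is a transposition of "Francisco".
print(first_name_misspelled("Francisco Tavares de Figueiredo",
                            "Farncisco Xavier Tavares de Figueiredo"))
# A genuine partial replace: the vide expression starts with a particle.
print(first_name_misspelled("Francisco António Campos", "de Novais Campos"))
```

A flagged record could then be treated as a full "replace" (using the base first name) instead of a partial replace, but that decision is left open here.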


#### **[EN]** Remove particles from names

To increase the chance of a match, we make copies of the names, and of the target names derived from "vide",
without the particles (“de”, “da”, “do”, “das”, “dos”, “e”) that Portuguese names use rather inconsistently.


In [529]:

def remove_particles(name):
    particles = ("de","da","e","das","dos","do")
    return " ".join([n for n in name.split() if n not in particles])

vide_plus['name_sp'] = vide_plus['name'].apply(lambda name: remove_particles(name))
vide_plus['lookup_sp'] = vide_plus['lookup'].apply(lambda name: remove_particles(name))
vide_plus[vide_plus['name']!=vide_plus['name_sp']][['name','name_sp', 'lookup','lookup_sp']].head(10)

Unnamed: 0_level_0,name,name_sp,lookup,lookup_sp
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
127798,António Joaquim do Cabo,António Joaquim Cabo,António Joaquim do Cabo e Faria,António Joaquim Cabo Faria
127819,Álvaro de Madureira Cabral,Álvaro Madureira Cabral,Álvaro de Madureira,Álvaro Madureira
128053,António da Fonseca Cabral,António Fonseca Cabral,António da Fonseca,António Fonseca
128061,António de Matos Cabral,António Matos Cabral,António de Matos,António Matos
128066,António de Mendonça Cabral,António Mendonça Cabral,António de Mendonça,António Mendonça
128125,Britaldo de Gouveia Cabral,Britaldo Gouveia Cabral,Britaldo de Gouveia,Britaldo Gouveia
128142,Diogo de Morais Cabral,Diogo Morais Cabral,Diogo de Morais,Diogo Morais
128159,Filipe de Gouveia Cabral,Filipe Gouveia Cabral,Filipe de Gouveia,Filipe Gouveia
128244,Francisco Xavier da Veiga Cabral,Francisco Xavier Veiga Cabral,Francisco Xavier da Veiga,Francisco Xavier Veiga
128295,João da Costa Cabreira,João Costa Cabreira,João da Costa,João Costa


#### **[EN]** Save name transformations for reference

Output table with the generation of target names from base names and vide expressions.

File available at https://github.com/joaquimrcarvalho/ucalumni/blob/master/inferences/remissivas/vide_transform.csv

This table allows comparison between successive versions for fine-tuning.

In [530]:
vide_plus.loc['217701']

name                         José de Santo António
sex                                              m
nome-vide                                Lencastre
nome-geografico                                NaN
faculdade                                      NaN
faculdade.date                                 NaN
faculdade.obs                                  NaN
nome-pai                                       NaN
uc-entrada                              0000-00-00
uc-saida                                0000-00-00
uc-saida.date                           0000-00-00
uc-saida.obs                                  None
rec_type                                       see
loookup                                           
vide_type                                      add
lookup             José de Santo António Lencastre
name_sp                         José Santo António
lookup_sp             José Santo António Lencastre
Name: 217701, dtype: object

In [531]:
# save for change tracking
vide_plus[['name','name_sp','nome-vide','vide_type','lookup', 'lookup_sp']].sort_values('name_sp').to_csv('../inferences/remissivas/vide_transform.csv',sep=',')

### **[EN]** Match

#### **[EN]** Sort by geographic name, name and lookup

The first attempt is to order the records so that matching cross references end up in consecutive rows.
We sort by place of birth and, within each place of birth, by name and target vide name.

This acts as a similarity filter that puts many matches in successive rows.
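The effect of the sort key can be seen on a toy example. The sketch below uses plain Python tuples rather than the notebook's dataframe; the three records are taken from the Constância rows displayed in this notebook, and the helper name `sort_key` is introduced here for illustration.

```python
# (id, place of birth, name without particles, target name without particles)
records = [
    ("202622", "Constância", "Fernão Álvares Temudo", "Fernão Álvares"),
    ("171438", "Constância", "João Veiga Mendes Nogueira", "João Veiga"),
    ("144388", "Constância", "Fernão Álvares", "Fernão Álvares Temudo"),
]


def sort_key(name_sp, lookup_sp):
    # Sorting the two names makes the key identical for a record and
    # for its reciprocal (where name and target name are swapped).
    return "-".join(sorted([name_sp, lookup_sp]))


ordered = sorted(records, key=lambda r: (r[1], sort_key(r[2], r[3])))
for rec_id, place, name_sp, lookup_sp in ordered:
    print(rec_id, place, sort_key(name_sp, lookup_sp))
```

Records 202622 and 144388 share the same key and therefore land in adjacent rows, while 171438, which has no reciprocal in this sample, sorts after them.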


In [532]:

# sort by place of birth, then by a key built from 'lookup_sp' and 'name_sp' (ordered alphabetically)
# this should put the vide pair in successive rows, but misses some cases due to ordering issues
cols = ['lookup_sp','name_sp']
vide_plus['sort_key'] = vide_plus[cols].apply(lambda row: '-'.join(sorted(row.values.astype(str))), axis=1)
vide_plus.sort_values(['nome-geografico','sort_key'], inplace=True)
vide_plus[['nome-geografico','sort_key', 'name_sp','lookup_sp','nome-vide','vide_type','uc-entrada','nome-pai','faculdade']].head()

Unnamed: 0_level_0,nome-geografico,sort_key,name_sp,lookup_sp,nome-vide,vide_type,uc-entrada,nome-pai,faculdade
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
198423,Constância,António Homem Magalhães-António Homem Magalhãe...,António Homem Magalhães Corte Real,António Homem Magalhães,Magalhães,cut,0000-00-00,,Leis
144388,Constância,Fernão Álvares-Fernão Álvares Temudo,Fernão Álvares,Fernão Álvares Temudo,Temudo,add,1573-11-13,Pantaleão Rosado,Cânones
202622,Constância,Fernão Álvares-Fernão Álvares Temudo,Fernão Álvares Temudo,Fernão Álvares,Álvares,cut,0000-00-00,,
129617,Constância,João Claúdio Ferreira Calado-João Claúdio Ferr...,João Claúdio Ferreira Calado,João Claúdio Ferreira Calado Oliveiro,Oliveiro,add,0000-00-00,,
171438,Constância,João Veiga-João Veiga Mendes Nogueira,João Veiga Mendes Nogueira,João Veiga,João da Veiga,cut,0000-00-00,,Leis


In [533]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                                NaN
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

In [534]:
# temporarily set records with no nome-geografico to a placeholder string
# so that they are grouped together for consideration
vide_plus.loc[vide_plus['nome-geografico'].isnull(),'nome-geografico'] = '***NA***'


In [535]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

#### **[EN]** Sequential matching

As of late April, around 3600 records were matched by this process.

However, some matches are inconsistent (asymmetric or ambiguous).
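The core of the sequential pass is a comparison between each row and the one before it. The following is a simplified sketch under the assumption that the rows are already sorted as described above; it leaves out the record types and the duplicate handling of the actual cell, and the function name `sequential_pairs` is hypothetical.

```python
def sequential_pairs(rows):
    """rows: list of (id, name_sp, lookup_sp), sorted by place and sort key.

    Two consecutive rows match when each one's target name
    is the other one's name. Both directions are recorded.
    """
    pairs = []
    for previous, current in zip(rows, rows[1:]):
        prev_id, prev_name, prev_lookup = previous
        cur_id, cur_name, cur_lookup = current
        if prev_lookup == cur_name and prev_name == cur_lookup:
            pairs.append((cur_id, prev_id))
            pairs.append((prev_id, cur_id))
    return pairs


rows = [
    ("202622", "Fernão Álvares Temudo", "Fernão Álvares"),
    ("144388", "Fernão Álvares", "Fernão Álvares Temudo"),
    ("171438", "João Veiga Mendes Nogueira", "João Veiga"),
]
print(sequential_pairs(rows))
```

Because only neighbouring rows are compared, a reciprocal pair separated by another row with the same key can be missed, which is the ordering issue mentioned above.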


In [536]:
def compare_names(name1, name2):
    return remove_particles(name1) == remove_particles(name2)

previous_lookup = ''
previous_nome = ''
previous_id = ''
previous_date = ''
sequential_matches = []

for id,linha in vide_plus.iterrows():
    nome = linha['name_sp']
    lookup_name = linha['lookup_sp']
    uc_date = linha['uc-entrada']
    rec_type = linha['rec_type']

    if compare_names(previous_lookup,nome)\
         and compare_names(previous_nome,lookup_name)\
         and id != previous_id:
        # we store the direction of the match see-aka or aka-see
        from_type = rec_type
        to_type = previous_rec_type
        to_tuple = (id,previous_id,f'{from_type}-{to_type}')
        from_tuple = (previous_id,id,f'{to_type}-{from_type}')
        if to_tuple in sequential_matches:
            # print("Skipping duplicate match",to_tuple)
            pass
        else:
            sequential_matches.append((id,previous_id,f'{from_type}-{to_type}'))
        if from_tuple in sequential_matches:
            # print("Skipping duplicate match",from_tuple)
            pass
        else:    
            sequential_matches.append((previous_id,id,f'{to_type}-{from_type}'))

    previous_id = id
    previous_nome = nome
    previous_lookup = lookup_name  
    previous_date = uc_date
    previous_rec_type = rec_type

vide_plus.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 198423 to 230315
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  9438 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
 13  loookup          9438 non-null   object
 14  vide_type        9438 non-null   object
 15  lookup           9438 non-null   object
 16  name_sp          9438 non-null   object
 17  lookup_sp        9438 non-null 

##### **[EN]** Analyse sequential match results

In [537]:
method = 'sequential'
match_records['matched_pairs'][method] = list(set(sequential_matches))
match_info.loc['matched_pairs', method] = len(match_records['matched_pairs'][method])

# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'see-see']

# records
rec_matched = set([id for (id,d,t) in sequential_matches]+
                  [id for (o,id,t) in sequential_matches])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_matched'][method] = list(rec_matched)
match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_matched', method] = len(rec_matched)
match_info.loc['records_see_aka', method] = len(rec_see_aka)
match_info.loc['records_aka_see', method] = len(rec_aka_see)
match_info.loc['records_aka_aka', method] = len(rec_aka_aka)
match_info.loc['records_see_see', method] = len(rec_see_see)

# new
match_info.loc['aka_matched', method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched', method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

##### **[EN]** Check for ambiguous matches

Look for records matched with more than one other record, or involved in transitive matching (A->B->C).

Note that sequential matching only pairs symmetric records, so this method generates no asymmetric matches.
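The transitive check amounts to finding connected components with more than two records in the graph of matched pairs. The notebook uses `networkx` for this; the sketch below does the same with a plain breadth-first search so that it runs without extra dependencies. The function name `components` is hypothetical, the first four pairs are modelled on the first ambiguous group reported in this notebook, and the ids "A" and "B" are made up to stand for an unambiguous pair.

```python
from collections import defaultdict, deque


def components(pairs):
    """Return the connected components of an undirected graph of (origin, dest) pairs."""
    graph = defaultdict(set)
    for origin, dest in pairs:
        graph[origin].add(dest)
        graph[dest].add(origin)

    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in group:
                continue
            group.add(node)
            queue.extend(graph[node] - group)
        seen |= group
        result.append(group)
    return result


pairs = [
    ("181236", "178312"), ("178312", "181236"),
    ("181236", "278765"), ("278765", "181236"),
    ("A", "B"), ("B", "A"),  # made-up unambiguous pair
]
ambiguous = [c for c in components(pairs) if len(c) > 2]
print(ambiguous)
```

A component of exactly two records is a clean one-to-one match; anything larger means at least one record is linked to several others and needs manual review.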

In [538]:
import networkx as nx

method = 'sequential'

matched_pairs = match_records['matched_pairs'][method]
records = match_records['records_matched'][method]

matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))


# alternative method, perhaps more informative:
pairs_to_check = match_records['matched_pairs'][method]

asymmetric_pairs = []
for (o,d,t) in pairs_to_check:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in pairs_to_check:
        asymmetric_pairs.append((o,d,t))
        print("asymmetric match:",(o,d,t))

print("Records with more than one match      :", len(matched_multiple))
print("Records with just one match           :", len(matched_single))

G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]
# number of records in ambiguous matches
amb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)
for amb in transitive:
    print(amb)
print("Are multiple in ambiguous?            :",set(matched_multiple).issubset(set(amb_records)))

rec_errors = set(amb_records).union(matched_multiple).union(matched_single)
rec_ok = set(records).difference(rec_errors)

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error', method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok', method] = len(rec_ok)
match_records['records_asymmetric'][method] = list(matched_single)
match_info.loc['records_asymmetric', method] = len(matched_single)
match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive', method] = namb_records

# new
aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok', method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok', method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok', method] = len(pairs_ok)

vide_plus.loc[matched_single,'match_error'] = False
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method
vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E02-Multiple match "+method
vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E01-Ambiguity in match "+method

match_info.fillna('')

Records with more than one match      : 4
Records with just one match           : 0
Records in ambiguous matches          : 12
{'278765', '181236', '178312'}
{'162809', '136283', '169757'}
{'183307', '183306', '235009'}
{'203494', '255769', '206151'}
Are multiple in ambiguous?            : True


Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,1913.0,
aka_matched_ok,,1907.0,
aka_pai,1619.0,,
matched_pairs,,3644.0,
matched_pairs_ok,,3628.0,
nodate,5763.0,,
nodate_novide,141.0,,


In [539]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

##### **[EN]** Show some of the ambiguous records

In [540]:
from timelinknb.pandas import display_group_attributes

pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 8
for ambiguous_records in transitive[:show_only]:
    display_group_attributes(ambiguous_records,
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,181236,António Gomes de Macedo,Gomes,Coimbra,0000-00-00,,
1,178312,António Gomes,Macedo,Coimbra,1641-10-02,Teologia,Manuel Rodrigues
2,278765,António Gomes,Macedo,Coimbra,1749-04-26,Medicina,José Gomes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,181236,naturalidade,Coimbra,
1,0000-00-00,181236,nome,António Gomes,"António Gomes de Macedo, vide Gomes"
2,0000-00-00,181236,nome,António Gomes de Macedo,
3,0000-00-00,181236,nome-vide,Gomes,
4,0000-00-00,181236,uc-entrada,0000-00-00,
5,0000-00-00,181236,uc-saida,0000-00-00,
6,1641-10-02,178312,faculdade,Teologia,Teologia
7,1641-10-02,178312,matricula-faculdade,Teologia,02.10.1641
8,1641-10-02,178312,naturalidade,Coimbra,
9,1641-10-02,178312,nome,António Gomes,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,136283,Manuel Francisco de Ribolhos,Manuel Francisco,Ribolhos,0000-00-00,Cânones,
1,169757,Manuel Francisco de Ribolhos,Francisco,Ribolhos,0000-00-00,,
2,162809,Manuel Francisco,Manuel Francisco de Ribolhos,Ribolhos,1756-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,136283,faculdade,Cânones,Cânones
1,0000-00-00,136283,naturalidade,Ribolhos,
2,0000-00-00,169757,naturalidade,Ribolhos,
3,0000-00-00,136283,nome,Manuel Francisco,"Manuel Francisco de Ribolhos, vide Manuel Francisco"
4,0000-00-00,169757,nome,Manuel Francisco,"Manuel Francisco de Ribolhos, vide Francisco"
5,0000-00-00,136283,nome,Manuel Francisco de Ribolhos,
6,0000-00-00,169757,nome,Manuel Francisco de Ribolhos,
7,0000-00-00,169757,nome-vide,Francisco,
8,0000-00-00,136283,nome-vide,Manuel Francisco,
9,0000-00-00,136283,uc-entrada,0000-00-00,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,183306,António de Sousa,António de Sousa Macedo,Lisboa,0000-00-00,,Gonçalo de Sousa
1,235009,António de Sousa,Macedo,Lisboa,0000-00-00,,João
2,183307,António de Sousa de Macedo,Sousa,Lisboa,1629-11-15,Leis,Gonçalo de Sousa


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,183306,naturalidade,Lisboa,
1,0000-00-00,235009,naturalidade,Lisboa,
2,0000-00-00,183306,nome,António de Sousa,
3,0000-00-00,235009,nome,António de Sousa,
4,0000-00-00,183306,nome,António de Sousa Macedo,"António de Sousa, vide António de Sousa Macedo"
5,0000-00-00,235009,nome,António de Sousa Macedo,"António de Sousa, vide Macedo"
6,0000-00-00,183306,nome-pai,Gonçalo de Sousa,
7,0000-00-00,235009,nome-pai,João,
8,0000-00-00,183306,nome-vide,António de Sousa Macedo,
9,0000-00-00,235009,nome-vide,Macedo,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,206151,Francisco Rodrigues de Valadares,Rodrigues,Vila Viçosa,0000-00-00,Cânones,Rodrigo Rodrigues
1,203494,Francisco Rodrigues,Valadares,Vila Viçosa,1605-10-20,Cânones,Rodrigo Rodrigues
2,255769,Francisco Rodrigues,Valadares,Vila Viçosa,1605-10-20,Cânones,Rodrigo Rodrigues


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,206151,faculdade,Cânones,Cânones
1,0000-00-00,206151,naturalidade,Vila Viçosa,
2,0000-00-00,206151,nome,Francisco Rodrigues,"Francisco Rodrigues de Valadares, vide Rodrigues"
3,0000-00-00,206151,nome,Francisco Rodrigues de Valadares,
4,0000-00-00,206151,nome-pai,Rodrigo Rodrigues,
5,0000-00-00,206151,nome-vide,Rodrigues,
6,0000-00-00,206151,uc-entrada,0000-00-00,
7,0000-00-00,206151,uc-saida,0000-00-00,
8,1605-10-20,203494,colegio,Colégio de S.Paulo,colegial de São Paulo
9,1605-10-20,255769,colegio,Colégio de S.Paulo,colegial de São Paulo


##### **[EN]** aka-aka matches in sequential mode

In sequential mode we do not filter by record type, so some aka-aka and see-see matches occur.
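
The cell that follows keeps only pairs where `o<d`. A minimal sketch of why that filter helps, assuming each match is stored in both directions (origin to destination and back); the pairs below are illustrative only and reuse the `(origin, destination, type)` layout of `matched_pairs`:

```python
# Illustrative pairs in the (origin, destination, type) layout; not real results.
pairs = [
    ('184012', '243154', 'aka-aka'),
    ('243154', '184012', 'aka-aka'),   # same pair, opposite direction
    ('150364', '265272', 'aka-aka'),
    ('265272', '150364', 'aka-aka'),
    ('206151', '203494', 'see-aka'),   # dropped: not an aka-aka match
]

# Keep aka-aka pairs only, and only the direction where origin < destination,
# so that each symmetric pair is listed once.
unique_aka_aka = [(o, d, t) for o, d, t in pairs if t == 'aka-aka' and o < d]
print(unique_aka_aka)
```

Ids are compared as strings here, which orders them like numbers only as long as all ids have the same number of digits.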

In [541]:
from timelinknb.pandas import display_group_attributes

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs']['sequential']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]
show_only = 10
print(f"aka-aka matches in sequential mode (show only {show_only}) of {len(show_pairs)}:")
for o,d,t in show_pairs[:show_only]:
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

aka-aka matches in sequential mode (show only 10) of 100:


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,243154,João Manuel de Pina,Manuel,Óbidos,1601-10-15,Cânones,Francisco Gorjão
1,184012,João Manuel,Pina,Óbidos,1605-10-07,Cursos jurídicos (Cânones ou Leis),Francisco Gorjão


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-15,243154,faculdade,Cânones,Cânones
1,1601-10-15,243154,matricula-faculdade,Cânones,15.10.1601
2,1601-10-15,243154,naturalidade,Óbidos,
3,1601-10-15,243154,nome,João Manuel,"João Manuel de Pina, vide Manuel"
4,1601-10-15,243154,nome,João Manuel de Pina,
5,1601-10-15,243154,nome-pai,Francisco Gorjão,
6,1601-10-15,243154,nome-vide,Manuel,
7,1601-10-15,243154,uc-entrada,1601-10-15,
8,1601-10-15,243154,uc-entrada.ano,1601,
9,1605-10-07,184012,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,Cânones,Tristão Barbosa
1,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,Leis,Tristão Barbosa
2,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,Cânones,Tristão Barbosa
3,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,Leis,Tristão Barbosa


Unnamed: 0,date,id,type,value,attr_obs
0,1624-10-22,150364,faculdade,Cânones,Faculdade corrigida
1,1624-10-22,265272,faculdade,Cânones,Faculdade corrigida
2,1624-10-22,150364,faculdade,Leis,Faculdade corrigida
3,1624-10-22,265272,faculdade,Leis,Faculdade corrigida
4,1624-10-22,150364,faculdade-original,Leis,
5,1624-10-22,265272,faculdade-original,Leis,
6,1624-10-22,150364,instituta,1624-10-22,"""1624/10/22 1624-10-22"""
7,1624-10-22,150364,naturalidade,Lisboa,
8,1624-10-22,265272,naturalidade,Lisboa,
9,1624-10-22,150364,nome,Diogo Barbosa,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,Cânones,João Machado
1,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,Leis,João Machado
2,249651,Simão Machado de Miranda,Machado,Ilha da Madeira,1624-00-00,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1624-00-00,249651,faculdade,Cânones,Cânones
1,1624-00-00,249651,instituta,1624-00-00,??.10.1624
2,1624-00-00,249651,naturalidade,Ilha da Madeira,
3,1624-00-00,249651,nome,Simão Machado,"Simão Machado de Miranda, vide Machado"
4,1624-00-00,249651,nome,Simão Machado de Miranda,
5,1624-00-00,249651,nome-vide,Machado,
6,1624-00-00,249651,uc-entrada,1624-00-00,
7,1624-00-00,249651,uc-entrada.ano,1624,
8,1624-10-02,182217,faculdade,Cânones,Faculdade corrigida
9,1624-10-02,182217,faculdade,Leis,Faculdade corrigida


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,250513,Simão Lourenço,Coelho,Tancos,1648-10-31,Leis,Teodósio Lourenço
1,194939,Simão Lourenço Coelho,Lourenço,Tancos,1650-10-12,Leis,Teodósio Lourenço


Unnamed: 0,date,id,type,value,attr_obs
0,1648-10-31,250513,faculdade,Leis,Leis
1,1648-10-31,250513,instituta,1648-10-31,1648.10.31 1648-10-31
2,1648-10-31,250513,naturalidade,Tancos,
3,1648-10-31,250513,nome,Simão Lourenço,
4,1648-10-31,250513,nome,Simão Lourenço Coelho,"Simão Lourenço, vide Coelho"
5,1648-10-31,250513,nome-pai,Teodósio Lourenço,
6,1648-10-31,250513,nome-vide,Coelho,
7,1648-10-31,250513,uc-entrada,1648-10-31,
8,1648-10-31,250513,uc-entrada.ano,1648,
9,1649-10-02,250513,matricula-faculdade,Leis,1649.10.02


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,194913,Semião Coelho,Amaral,Viseu,1659-03-20,Medicina,
1,145682,Semião Coelho do Amaral,Coelho,Viseu,1659-10-02,Medicina,João Cardoso de Loureiro


Unnamed: 0,date,id,type,value,attr_obs
0,1659-03-20,194913,faculdade,Medicina,Medicina
1,1659-03-20,194913,grau,Bacharel em Artes,Bacharel em Artes 20.03.1659
2,1659-03-20,194913,naturalidade,Viseu,
3,1659-03-20,194913,nome,Semião Coelho,
4,1659-03-20,194913,nome,Semião Coelho Amaral,"Semião Coelho, vide Amaral"
5,1659-03-20,194913,nome-vide,Amaral,
6,1659-03-20,194913,uc-entrada,1659-03-20,
7,1659-03-20,194913,uc-entrada.ano,1659,
8,1659-03-20,194913,uc-saida,1659-03-20,
9,1659-03-20,194913,uc-saida.ano,1659,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,202938,Manuel da Costa,Lemos e Tunes,Arganil,1725-10-01,Cânones,Pedro Nunes da Costa
1,205883,Manuel da Costa Lemos e Tunes,Costa,Arganil,1729-07-15,,Pedro Nunes da Costa


Unnamed: 0,date,id,type,value,attr_obs
0,1725-10-01,202938,faculdade,Cânones,Cânones
1,1725-10-01,202938,matricula-faculdade,Cânones,01.10.1725
2,1725-10-01,202938,naturalidade,Arganil,
3,1725-10-01,202938,nome,Manuel da Costa,
4,1725-10-01,202938,nome,Manuel da Costa Lemos e Tunes,"Manuel da Costa, vide Lemos e Tunes"
5,1725-10-01,202938,nome-nota,padre,
6,1725-10-01,202938,nome-pai,Pedro Nunes da Costa,
7,1725-10-01,202938,nome-vide,Lemos e Tunes,
8,1725-10-01,202938,padre,sim,padre
9,1725-10-01,202938,uc-entrada,1725-10-01,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,252428,Jorge Fernandes,Aires,Lisboa,1572-03-15,Cânones,Aires Vaz
1,141854,Jorge Fernandes Aires,Fernandes,Lisboa,1583-10-01,Cânones,Aires Vaz


Unnamed: 0,date,id,type,value,attr_obs
0,1572-03-15,252428,faculdade,Cânones,Cânones
1,1572-03-15,252428,grau,Bacharel em Artes,
2,1572-03-15,252428,naturalidade,Lisboa,
3,1572-03-15,252428,nome,Jorge Fernandes,
4,1572-03-15,252428,nome,Jorge Fernandes Aires,"Jorge Fernandes, vide Aires"
5,1572-03-15,252428,nome-pai,Aires Vaz,
6,1572-03-15,252428,nome-vide,Aires,
7,1572-03-15,252428,uc-entrada,1572-03-15,
8,1572-03-15,252428,uc-entrada.ano,1572,
9,1573-05-08,252428,grau,Licenciado em Cânones,Licenciado 08.05.1573


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,169085,Mateus Neto,Miguéis,Buarcos,1645-10-16,Cânones,Pedro Brás
1,245978,Mateus Neto Miguéis,Neto,"Redondos, Buarcos",1677-03-16,Cânones,Pedro Vaz ou Brás


Unnamed: 0,date,id,type,value,attr_obs
0,1645-10-16,169085,faculdade,Cânones,Cânones
1,1645-10-16,169085,instituta,1645-10-16,16.10.1645 1645-10-16
2,1645-10-16,169085,naturalidade,Buarcos,
3,1645-10-16,169085,nome,Mateus Neto,
4,1645-10-16,169085,nome,Mateus Neto Miguéis,"Mateus Neto, vide Miguéis"
5,1645-10-16,169085,nome-nota,padre,
6,1645-10-16,169085,nome-pai,Pedro Brás,
7,1645-10-16,169085,nome-vide,Miguéis,
8,1645-10-16,169085,padre,sim,padre
9,1645-10-16,169085,uc-entrada,1645-10-16,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,129711,Hilário da Rocha Calheiros,Rocha,Caldas,1660-10-01,Cursos jurídicos (Cânones ou Leis),
1,173982,Hilário da Rocha,Calheiros,Caldas,1661-10-15,Cânones,António da Rocha


Unnamed: 0,date,id,type,value,attr_obs
0,1660-10-01,129711,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1660-10-01,129711,instituta,1660-10-01,01.10.1660 1660-10-01
2,1660-10-01,173982,instituta,1660-10-01,1660.10.01 1660-10-01
3,1660-10-01,129711,naturalidade,Caldas,
4,1660-10-01,129711,nome,Hilário da Rocha,"Hilário da Rocha Calheiros, vide Rocha"
5,1660-10-01,129711,nome,Hilário da Rocha Calheiros,
6,1660-10-01,129711,nome-vide,Rocha,
7,1660-10-01,129711,uc-entrada,1660-10-01,
8,1660-10-01,129711,uc-entrada.ano,1660,
9,1660-10-01,129711,uc-saida,1660-10-01,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,178636,Francisco Gomes,Miranda,Lisboa,1597-10-01,Cânones,Basílio Gomes
1,248706,Francisco Gomes de Miranda,Gomes,Lisboa,1607-01-18,Cânones,Baílio Gomes


Unnamed: 0,date,id,type,value,attr_obs
0,1597-10-01,178636,faculdade,Cânones,Cânones
1,1597-10-01,178636,instituta,1597-10-01,01.10.1597 1597-10-01
2,1597-10-01,178636,naturalidade,Lisboa,
3,1597-10-01,178636,nome,Francisco Gomes,
4,1597-10-01,178636,nome,Francisco Gomes Miranda,"Francisco Gomes, vide Miranda"
5,1597-10-01,178636,nome-pai,Basílio Gomes,
6,1597-10-01,178636,nome-vide,Miranda,
7,1597-10-01,178636,uc-entrada,1597-10-01,
8,1597-10-01,178636,uc-entrada.ano,1597,
9,1598-10-21,178636,matricula-faculdade,Cânones,21.10.1598


In [542]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

#### **[EN]** Non-sequential matching

In [543]:
import pandas as pd
import numpy as np

vide_plus['match_error']=False

previous_geo_name = ''

# list of records that need to be debugged
# set a breakpoint on the "pass" statement of the "if"
#  at the start of the loop
# problematic = ['169888','169890','214417']  # add to the list what you want to debug
#
# 211703 (see) matches 168662, which in turn matches 211704 and 211706
problematic = ['168662','211703','211704','211706'] 


# Loop through the vide records
random_matches = []
for id,linha in vide_plus.iterrows():

    if id in problematic:
        pass  # set a breakpoint here to debug problematic records

    nome = linha['name_sp']
    lookup_name = linha['lookup_sp']
    nome_geo = linha['nome-geografico']
    rec_type = linha['rec_type']

    # we now check for similar geo names
    # and load names from variants
    if nome_geo != previous_geo_name:
        simile = geo_similars.get(nome_geo,[])
        if len(simile) > 0:
            # we have similar geo names
            local_records = vide_plus[vide_plus['nome-geografico'].isin([nome_geo] + simile)]
        else:   # if no variants just load names with this place of birth
            local_records = vide_plus[vide_plus['nome-geografico'] == nome_geo]

        previous_geo_name = nome_geo

    # search for records with name matching the vide expression of this record
    candidates = []

    found_lookup_name = local_records['name_sp']==lookup_name

    for cid,same_name in local_records[found_lookup_name].iterrows():
        if same_name['lookup_sp'] == nome and cid != id:
            candidates.append(cid)

    # if nothing found search for records with vide expression equal to this one
    if len(candidates) == 0:
        found_lookup_name = local_records['lookup_sp']==lookup_name

        for cid,same_name in local_records[found_lookup_name].iterrows():
            if same_name['lookup_sp'] == nome and cid != id:
                candidates.append(cid)
    
    if len(candidates) > 0:  # we found some candidates
        for cand in candidates:
            mrec_type = vide_plus.loc[[cand]].iloc[0]['rec_type']
            mtype = f'{rec_type}-{mrec_type}'
            match = (id,cand,mtype)
            if match not in random_matches:
                random_matches.append(match)
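
The first search step of the loop above can be sketched on a toy table. This is a simplified, standard-library illustration and not the notebook's pandas code: it omits the similar place names (`geo_similars`) and the fallback search, and all record values and the helper name `reciprocal_matches` are invented for the example.

```python
# Toy records: id -> (name, vide lookup name, place of birth, record type).
# All values are invented for illustration.
records = {
    '1': ('joao manuel', 'joao manuel pina', 'obidos', 'see'),
    '2': ('joao manuel pina', 'joao manuel', 'obidos', 'aka'),
    '3': ('joao manuel pina', 'joao manuel', 'lisboa', 'aka'),  # other place: no match
}

def reciprocal_matches(records):
    """Pair records whose name and lookup name mirror each other within one place."""
    # Index record ids by (place, name) so each lookup is a dictionary access.
    by_place_name = {}
    for rid, (name, lookup, place, rtype) in records.items():
        by_place_name.setdefault((place, name), []).append(rid)

    matches = []
    for rid, (name, lookup, place, rtype) in records.items():
        # Candidates: records from the same place whose name equals our lookup name.
        for cid in by_place_name.get((place, lookup), []):
            cname, clookup, cplace, ctype = records[cid]
            # Accept only if the candidate points back to this record's name.
            if clookup == name and cid != rid:
                matches.append((rid, cid, f'{rtype}-{ctype}'))
    return matches

matches = reciprocal_matches(records)
print(matches)
```

Record `'3'` shares both names with record `'2'` but has a different place of birth, so it is never considered as a candidate for record `'1'`.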
    



##### **[EN]** Analyse results (random)

In [544]:
method = 'random'
match_records['matched_pairs'][method] = list(set(random_matches))
match_info.loc['matched_pairs', method] = len(match_records['matched_pairs'][method])

# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'see-see']


# records
rec_matched = set([id for (id,d,t) in random_matches]+
                  [id for (o,id,t) in random_matches])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_matched'][method] = list(rec_matched)
match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_matched', method] = len(rec_matched)
match_info.loc['records_see_aka', method] = len(rec_see_aka)
match_info.loc['records_aka_see', method] = len(rec_aka_see)
match_info.loc['records_aka_aka', method] = len(rec_aka_aka)
match_info.loc['records_see_see', method] = len(rec_see_see)

# new
match_info.loc['aka_matched', method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched', method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

matched_rand_df = pd.DataFrame(random_matches, columns=['from','to','type'])
matched_rand_df.groupby('type').count()

Unnamed: 0_level_0,from,to
type,Unnamed: 1_level_1,Unnamed: 2_level_1
aka-aka,218,218
aka-see,1769,1769
see-aka,1796,1796
see-see,21,21
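
Note that the table above counts pairs, not records: a match found in both directions contributes two rows. A small standard-library sketch of the same two tallies, using invented pairs rather than the notebook's data:

```python
from collections import Counter

# Invented pairs in the (origin, destination, type) layout.
matches = [
    ('1', '2', 'see-aka'), ('2', '1', 'aka-see'),   # symmetric see/aka match
    ('3', '4', 'aka-aka'), ('4', '3', 'aka-aka'),   # symmetric aka-aka match
    ('5', '6', 'see-aka'),                          # one-way match
]

# Number of pairs per match type (the analogue of groupby('type').count()).
pairs_by_type = Counter(t for _, _, t in matches)

# Distinct records that take part in at least one pair, as origin or destination.
records_matched = {rid for o, d, _ in matches for rid in (o, d)}

print(pairs_by_type)
print(len(records_matched))
```

Here five pairs involve six distinct records, and the two `aka-aka` rows describe a single pair of records.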


##### **[EN]** Check the matches for errors (random)

In [545]:
import networkx as nx

method = 'random'

matched_pairs = match_records['matched_pairs'][method]
records = match_records['records_matched'][method]

matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))

print("Records with more than two matches    :", len(matched_multiple))
print("Records asymmetric (one match only)   :", len(matched_single))

G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]
# number of records in ambiguous matches
amb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)
for amb in transitive:
    print(amb)
print("Are multiple in ambiguous             :",set(matched_multiple).issubset(set(amb_records)))

rec_errors = set(amb_records).union(matched_multiple).union(matched_single)
rec_ok = set(records).difference(rec_errors)

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error', method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok', method] = len(rec_ok)
match_records['records_asymmetric'][method] = list(matched_single)
match_info.loc['records_asymmetric', method] = len(matched_single)
match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive', method] = namb_records

# new
aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok', method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok', method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok', method] = len(pairs_ok)

vide_plus.loc[matched_single,'match_error'] = False # we don't consider a single match an error
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method
vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E02-Multiple match "+method
vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E03-Ambiguity in match "+method
match_info.fillna('')

Records with more than two matches    : 38
Records asymmetric (one match only)   : 76
Records in ambiguous matches          : 116
{'186310', '186417', '132440', '186309', '238066'}
{'194771', '316381', '201515'}
{'251536', '251534', '216361'}
{'201728', '254410', '201733'}
{'233103', '221393', '233094'}
{'248881', '171093', '171095'}
{'189900', '189891', '222156'}
{'212857', '253465', '252907'}
{'142386', '171377', '142388'}
{'232128', '232079', '210090'}
{'266114', '146547', '266130'}
{'243711', '152894', '152890'}
{'278765', '181236', '178312'}
{'233838', '206536', '206540'}
{'172681', '172677', '190285'}
{'153316', '139166', '139146'}
{'245530', '245531', '146206'}
{'208877', '208873', '243481'}
{'191659', '134102', '191654'}
{'160196', '160158', '152482'}
{'175730', '143426', '175731'}
{'147410', '240879', '240882'}
{'238845', '238842', '233035'}
{'211704', '211706', '211703', '168662'}
{'203494', '206151', '203487', '161932', '255769'}
{'226697', '226704', '195422'}
{'162809', '13

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,1913.0,1970.0
aka_matched_ok,,1907.0,1897.0
aka_pai,1619.0,,
matched_pairs,,3644.0,3804.0
matched_pairs_ok,,3628.0,3614.0
nodate,5763.0,,
nodate_novide,141.0,,


In [546]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

##### **[EN]** Show some of the ambiguous records


In [547]:
pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 6
for ambiguous_records in transitive[:show_only]:
    display_group_attributes(ambiguous_records,
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,238066,Manuel Gomes Pereira,Gomes,Coimbra,0000-00-00,,
1,186309,Manuel Gomes,Pereira,Coimbra,1553-00-00,Cânones,
2,132440,Manuel Gomes Quaresma,Gomes,Coimbra,1714-04-24,Artes,
3,186310,Manuel Gomes,Quaresma,Coimbra,1714-10-01,Medicina,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,186417,faculdade,Cânones,Cânones
1,0000-00-00,186417,naturalidade,Coimbra,
2,0000-00-00,238066,naturalidade,Coimbra,
3,0000-00-00,186417,nome,Manuel Gomes,
4,0000-00-00,238066,nome,Manuel Gomes,"Manuel Gomes Pereira, vide Gomes"
5,0000-00-00,238066,nome,Manuel Gomes Pereira,
6,0000-00-00,186417,nome-pai,Simão Gomes Rebelo,
7,0000-00-00,238066,nome-vide,Gomes,
8,0000-00-00,186417,uc-entrada,0000-00-00,
9,0000-00-00,238066,uc-entrada,0000-00-00,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,201515,Manuel José de Magalhães Teixeira,Manuel José,Braga,0000-00-00,,
1,194771,Manuel José,Magalhães Teixeira,Braga,1735-01-30,Cânones,
2,316381,Manuel José de Magalhães Teixeira,Manuel José,Braga,1741-06-27,Cânones,Domingos Teixeira de Magalhães


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,201515,naturalidade,Braga,
1,0000-00-00,201515,nome,Manuel José,"Manuel José de Magalhães Teixeira, vide Manuel José"
2,0000-00-00,201515,nome,Manuel José de Magalhães Teixeira,
3,0000-00-00,201515,nome-vide,Manuel José,
4,0000-00-00,201515,uc-entrada,0000-00-00,
5,0000-00-00,201515,uc-saida,0000-00-00,
6,1735-01-30,194771,faculdade,Cânones,Cânones
7,1735-01-30,194771,instituta,1735-01-30,30.01.1735 1735-01-30
8,1735-01-30,194771,naturalidade,Braga,
9,1735-01-30,194771,nome,Manuel José,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,216361,Manuel Vidigal,Morais,Évora,0000-00-00,Cânones,
1,251534,Manuel Vidigal de Morais,Vidigal,Évora,0000-00-00,,Crispim Luís
2,251536,Manuel Vidigal de Morais,Vidigal,Évora,1667-10-21,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,216361,faculdade,Cânones,Cânones
1,0000-00-00,216361,naturalidade,Évora,
2,0000-00-00,251534,naturalidade,Évora,
3,0000-00-00,216361,nome,Manuel Vidigal,
4,0000-00-00,251534,nome,Manuel Vidigal,"Manuel Vidigal de Morais, vide Vidigal"
5,0000-00-00,216361,nome,Manuel Vidigal Morais,"Manuel Vidigal, vide Morais"
6,0000-00-00,251534,nome,Manuel Vidigal de Morais,
7,0000-00-00,251534,nome-pai,Crispim Luís,
8,0000-00-00,216361,nome-vide,Morais,
9,0000-00-00,251534,nome-vide,Vidigal,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,201728,Miguel Luís Teixeira,Miguel Luís,Lamego,0000-00-00,,António Luís
1,201733,Miguel Luís Teixeira,Luís,Lamego,0000-00-00,Cânones,
2,254410,Miguel Luís,Teixeira,Lamego,1667-11-15,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,201733,faculdade,Cânones,Cânones
1,0000-00-00,201728,naturalidade,Lamego,
2,0000-00-00,201733,naturalidade,Lamego,
3,0000-00-00,201728,nome,Miguel Luís,"Miguel Luís Teixeira, vide Miguel Luís"
4,0000-00-00,201733,nome,Miguel Luís,"Miguel Luís Teixeira, vide Luís"
5,0000-00-00,201728,nome,Miguel Luís Teixeira,
6,0000-00-00,201733,nome,Miguel Luís Teixeira,
7,0000-00-00,201733,nome-nota,padre,
8,0000-00-00,201728,nome-pai,António Luís,
9,0000-00-00,201733,nome-vide,Luís,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,233094,Sebastião Soares,Pais,Lisboa,0000-00-00,,
1,233103,Sebastião Soares,Pais,Lisboa,1604-10-08,Leis,Diogo Dias
2,221393,Sebastião Soares Pais,Soares,Lisboa,1607-10-05,Leis,Diogo Dias


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,233094,naturalidade,Lisboa,
1,0000-00-00,233094,nome,Sebastião Soares,
2,0000-00-00,233094,nome,Sebastião Soares Pais,"Sebastião Soares, vide Pais"
3,0000-00-00,233094,nome-vide,Pais,
4,0000-00-00,233094,uc-entrada,0000-00-00,
5,0000-00-00,233094,uc-saida,0000-00-00,
6,1604-10-08,233103,faculdade,Leis,Leis
7,1604-10-08,233103,instituta,1604-10-08,08.10.1604 1604-10-08
8,1604-10-08,233103,naturalidade,Lisboa,
9,1604-10-08,233103,nome,Sebastião Soares,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,248881,António Fernandes,Nogueira,Góis,0000-00-00,,
1,171093,António Fernandes Nogueira,Fernandes,Góis,1602-10-09,Medicina,Baltasar
2,171095,António Fernandes Nogueira,Fernandes,Góis,1602-10-09,Medicina,Baltasar Gomes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,248881,naturalidade,Góis,
1,0000-00-00,248881,nome,António Fernandes,
2,0000-00-00,248881,nome,António Fernandes Nogueira,"António Fernandes, vide Nogueira"
3,0000-00-00,248881,nome-vide,Nogueira,
4,0000-00-00,248881,uc-entrada,0000-00-00,
5,0000-00-00,248881,uc-saida,0000-00-00,
6,1602-00-00,171093,grau,Bacharel em Artes,cursar tudo quanto se requer para Bacharel em Artes 1602
7,1602-03-16,171093,grau,Bacharel em Medicina,Bacharel 16.03.1602
8,1602-10-09,171093,faculdade,Medicina,Medicina
9,1602-10-09,171095,faculdade,Medicina,Medicina


##### **[EN]** Show some of the aka-aka records (random)

In [548]:
from timelinknb.pandas import display_group_attributes

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs']['random']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]
show_only = 4
print(f"aka-aka matches in random mode (show only {show_only}) of {len(show_pairs)}:")
for o,d,t in show_pairs[:show_only]:
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

aka-aka matches in random mode (show only 4) of 109:


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,243154,João Manuel de Pina,Manuel,Óbidos,1601-10-15,Cânones,Francisco Gorjão
1,184012,João Manuel,Pina,Óbidos,1605-10-07,Cursos jurídicos (Cânones ou Leis),Francisco Gorjão


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-15,243154,faculdade,Cânones,Cânones
1,1601-10-15,243154,matricula-faculdade,Cânones,15.10.1601
2,1601-10-15,243154,naturalidade,Óbidos,
3,1601-10-15,243154,nome,João Manuel,"João Manuel de Pina, vide Manuel"
4,1601-10-15,243154,nome,João Manuel de Pina,
5,1601-10-15,243154,nome-pai,Francisco Gorjão,
6,1601-10-15,243154,nome-vide,Manuel,
7,1601-10-15,243154,uc-entrada,1601-10-15,
8,1601-10-15,243154,uc-entrada.ano,1601,
9,1605-10-07,184012,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,Cânones,Tristão Barbosa
1,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,Leis,Tristão Barbosa
2,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,Cânones,Tristão Barbosa
3,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,Leis,Tristão Barbosa


Unnamed: 0,date,id,type,value,attr_obs
0,1624-10-22,150364,faculdade,Cânones,Faculdade corrigida
1,1624-10-22,265272,faculdade,Cânones,Faculdade corrigida
2,1624-10-22,150364,faculdade,Leis,Faculdade corrigida
3,1624-10-22,265272,faculdade,Leis,Faculdade corrigida
4,1624-10-22,150364,faculdade-original,Leis,
5,1624-10-22,265272,faculdade-original,Leis,
6,1624-10-22,150364,instituta,1624-10-22,"""1624/10/22 1624-10-22"""
7,1624-10-22,150364,naturalidade,Lisboa,
8,1624-10-22,265272,naturalidade,Lisboa,
9,1624-10-22,150364,nome,Diogo Barbosa,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,Cânones,João Machado
1,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,Leis,João Machado
2,249651,Simão Machado de Miranda,Machado,Ilha da Madeira,1624-00-00,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1624-00-00,249651,faculdade,Cânones,Cânones
1,1624-00-00,249651,instituta,1624-00-00,??.10.1624
2,1624-00-00,249651,naturalidade,Ilha da Madeira,
3,1624-00-00,249651,nome,Simão Machado,"Simão Machado de Miranda, vide Machado"
4,1624-00-00,249651,nome,Simão Machado de Miranda,
5,1624-00-00,249651,nome-vide,Machado,
6,1624-00-00,249651,uc-entrada,1624-00-00,
7,1624-00-00,249651,uc-entrada.ano,1624,
8,1624-10-02,182217,faculdade,Cânones,Faculdade corrigida
9,1624-10-02,182217,faculdade,Leis,Faculdade corrigida


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,250513,Simão Lourenço,Coelho,Tancos,1648-10-31,Leis,Teodósio Lourenço
1,194939,Simão Lourenço Coelho,Lourenço,Tancos,1650-10-12,Leis,Teodósio Lourenço


Unnamed: 0,date,id,type,value,attr_obs
0,1648-10-31,250513,faculdade,Leis,Leis
1,1648-10-31,250513,instituta,1648-10-31,1648.10.31 1648-10-31
2,1648-10-31,250513,naturalidade,Tancos,
3,1648-10-31,250513,nome,Simão Lourenço,
4,1648-10-31,250513,nome,Simão Lourenço Coelho,"Simão Lourenço, vide Coelho"
5,1648-10-31,250513,nome-pai,Teodósio Lourenço,
6,1648-10-31,250513,nome-vide,Coelho,
7,1648-10-31,250513,uc-entrada,1648-10-31,
8,1648-10-31,250513,uc-entrada.ano,1648,
9,1649-10-02,250513,matricula-faculdade,Leis,1649.10.02


In [549]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

In [550]:

# set back the missing nome_geografico to null
no_geo_filter = vide_plus['nome-geografico'] == '***NA***'
vide_plus.loc[no_geo_filter,'nome-geografico'] = np.nan
print(len(vide_plus[vide_plus['nome-geografico'] == '***NA***']))
vide_plus.info()

0
<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 198423 to 230315
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
 13  loookup          9438 non-null   object
 14  vide_type        9438 non-null   object
 15  lookup           9438 non-null   object
 16  name_sp          9438 non-null   object
 17  lookup_sp        9438 non-nul

In [551]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                                NaN
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     


Why:
* 230791	Abrantes	Manuel Fernandes da Silveira	None	Manuel Fernandes da Silveira	Fernandes da Silveira	0000-00-00	NaN
* 230756	Abrantes	Manuel Fernandes da Silveira	None	Manuel da Silveira	Manuel Fernandes da Silveira	1598-10-12	Leis

### **[EN]** Merge the results of the two methods and check for errors again

Since we are mixing matches from two different methods, the combined set can contain new errors that neither method produced on its own, especially transitive matches.
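
To see how the union of two individually consistent pair lists can yield a transitive (ambiguous) match, consider this toy sketch. The record ids and the `connected_groups` helper are hypothetical, and a plain-Python union-find stands in here for the `networkx` connected-components call used in the cell below.

```python
def connected_groups(pairs):
    """Group record ids linked directly or indirectly by matched pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for o, d in pairs:
        root_o, root_d = find(o), find(d)
        if root_o != root_d:
            parent[root_o] = root_d

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Hypothetical ids: each method alone links only two records...
seq_pairs = [("A", "B"), ("B", "A")]
rand_pairs = [("B", "C"), ("C", "B")]

# ...but their union chains three records together.
merged = set(seq_pairs) | set(rand_pairs)
ambiguous = [g for g in connected_groups(merged) if len(g) > 2]
print(ambiguous)
```

Any group with more than two members means that at least one record is linked to two different partners, which is why such groups are flagged as errors.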

In [552]:
import networkx as nx
method = 'data'  # short for merged methods

matched_rand_pairs = match_records['matched_pairs']['random']
matched_seq_pairs = match_records['matched_pairs']['sequential']
matched_pairs = list(set(matched_rand_pairs + matched_seq_pairs))
print("Number of matched pairs (union of both methods)  :",len(matched_pairs))
match_records['matched_pairs'][method] = matched_pairs
match_info.loc['matched_pairs',method] = len(matched_pairs)

rec_errors_seq = match_records['records_error']['sequential']
rec_errors_rand = match_records['records_error']['random']


# And now filter; this is necessary because the errors detected differ between the two methods
matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if i == '172681':
        pass
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
  
        
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))
print("Number of matches random              :",len(matched_rand_pairs))
print("Number of matches sequential          :",len(matched_seq_pairs))
print("Number of matches both                :",len(matched_pairs))
print("Records with more than two matches    :", len(matched_multiple))
print("Records with just one match           :", len(matched_single))

# alternative method, perhaps more informative:
pairs_to_check = matched_pairs
print()
print("The following pairs have to reverse match:")
asymmetric_pairs = []
for (o,d,t) in matched_pairs:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in matched_pairs:
        asymmetric_pairs.append((o,d,t))
        print((o,d,t))

# now test for transitivity
G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]

# number of records in ambiguous matches
amb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)

match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive', method] = namb_records

for amb in transitive:
    print(amb)
print("Are multiple in ambiguous             :",set(matched_multiple).issubset(set(amb_records)))

vide_plus.loc[matched_single,'match_error'] = False
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method

vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E03-Multiple match "+method

vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E04-Ambiguity in match "+method

# do a new list of records in error
rec_errors = set(amb_records).union(matched_multiple)

records = set(rec_in_matches)
rec_ok = records.difference(rec_errors)

print("Records involved in matches           :", len(records))
print("Records matched without errors        :", len(rec_ok))
print("Records matched with errors           :", len(rec_errors))

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error', method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok', method] = len(rec_ok)
match_records['records_matched'][method] = list(records)
match_info.loc['records_matched', method] = len(records)

aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok', method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok', method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok', method] = len(pairs_ok)


Number of matched pairs (union of both methods)  : 3818
Number of matches random              : 3804
Number of matches sequential          : 3644
Number of matches both                : 3818
Records with more than two matches    : 38
Records with just one match           : 76

The following pairs have to reverse match:
('223100', '252345', 'see-aka')
('194373', '128907', 'see-aka')
('225717', '173224', 'aka-see')
('195505', '160216', 'see-aka')
('181367', '214929', 'see-aka')
('186417', '132440', 'see-aka')
('204089', '196842', 'see-aka')
('130534', '163482', 'see-aka')
('230791', '230756', 'see-aka')
('235056', '142554', 'see-aka')
('140367', '208712', 'see-aka')
('203487', '206151', 'see-see')
('256874', '141866', 'see-aka')
('129384', '139423', 'see-aka')
('241346', '197630', 'see-aka')
('206505', '158689', 'see-aka')
('185191', '181415', 'see-aka')
('205772', '209662', 'see-aka')
('233397', '230482', 'see-aka')
('241026', '241012', 'see-aka')
('128114', '211581', 'see-aka')
('18427

Detect asymmetries in matches with no errors (asymmetries are not considered errors)
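
The reverse-match test used in the cell below can be sketched on toy data as follows. The ids and the `find_asymmetric` helper are hypothetical; the sketch only assumes that pairs are `(origin, destination, type)` tuples and that `see-aka` and `aka-see` are each other's reverse.

```python
# Reverse type of each match type; symmetric types map to themselves.
REVERSE_TYPE = {"see-aka": "aka-see", "aka-see": "see-aka"}


def find_asymmetric(pairs):
    """Return the pairs (o, d, t) whose reverse (d, o, reverse(t)) is absent."""
    pair_set = set(pairs)  # set membership avoids rescanning the list
    return [
        (o, d, t)
        for (o, d, t) in pairs
        if (d, o, REVERSE_TYPE.get(t, t)) not in pair_set
    ]


# Hypothetical ids, for illustration only.
pairs = [
    ("1", "2", "see-aka"),
    ("2", "1", "aka-see"),   # reverse of the first pair: symmetric
    ("3", "4", "see-aka"),   # no ("4", "3", "aka-see"): asymmetric
    ("5", "6", "aka-aka"),
    ("6", "5", "aka-aka"),
]
print(find_asymmetric(pairs))  # [('3', '4', 'see-aka')]
```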

In [553]:
method = 'data'  # short for merged methods

pairs_ok = match_records['matched_pairs_ok'][method]
# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'see-see']

aka_in_see_aka = [d for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
aka_in_aka_see = [o for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
asymmetry_aka = sorted(list(set(aka_in_see_aka) ^ set(aka_in_aka_see)))

print("Asymmetry for aka:", asymmetry_aka )
print("N Asymmetry for aka:", len(asymmetry_aka) )

see_in_see_aka = [o for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
see_in_aka_see = [d for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
asymmetry_see = sorted(list(set(see_in_see_aka) ^ set(see_in_aka_see)))

print("Asymmetry for see:", asymmetry_see )
print("N Asymmetry for see:", len(asymmetry_see))

# alternative method, perhaps more informative:
print()
pairs_to_check = pairs_ok
print("Asymmetric matches: for each match bellow the reverse one was not found")
asymmetric_pairs = []
for (o,d,t) in pairs_to_check:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in pairs_to_check:
        asymmetric_pairs.append((o,d,t))
        print((o,d,t))

asymmetric_records = set([o for (o,d,t) in asymmetric_pairs] + [d for (o,d,t) in asymmetric_pairs])
print()

match_info.loc['records_asymmetric',method] = len(asymmetric_records)
match_records['records_asymmetric'][method] = list(asymmetric_records)
# records
rec_matched = set([id for (id,d,t) in pairs_ok]+
                  [id for (o,id,t) in pairs_ok])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_see_aka', method] = len(rec_see_aka)
match_info.loc['records_aka_see', method] = len(rec_aka_see)
match_info.loc['records_aka_aka', method] = len(rec_aka_aka)
match_info.loc['records_see_see', method] = len(rec_see_see)

# new
match_info.loc['aka_matched', method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched', method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

matched_rand_df = pd.DataFrame(pairs_ok, columns=['from','to','type'])
matched_rand_df.groupby('type').count()

Asymmetry for aka: ['128907', '131947', '139423', '141866', '142554', '151170', '151354', '158689', '160216', '163482', '166987', '177796', '179399', '181415', '182602', '182659', '196842', '197630', '198768', '200713', '203369', '207054', '208712', '209662', '211581', '214929', '225717', '230482', '230756', '239847', '241012', '242752', '245318', '247449', '248237', '252345']
N Asymmetry for aka: 36
Asymmetry for see: ['128114', '129384', '130534', '134006', '140367', '148963', '149251', '163021', '164227', '173224', '181367', '184271', '184419', '185191', '188508', '194373', '195505', '204089', '205772', '206278', '206505', '209320', '212796', '221053', '223100', '226997', '230791', '233397', '233550', '234767', '235056', '241026', '241346', '251029', '252478', '256874']
N Asymmetry for see: 36

Asymmetric matches: for each match bellow the reverse one was not found
('223100', '252345', 'see-aka')
('194373', '128907', 'see-aka')
('225717', '173224', 'aka-see')
('195505', '160216', 's

type,from,to
aka-aka,188,188
aka-see,1722,1722
see-aka,1746,1746
see-see,9,9


In [554]:
match_info.fillna("")

Unnamed: 0,data,sequential,random
aka,3062,,
aka_fac,3035,,
aka_geo,2973,,
aka_matched,1910,1913.0,1970.0
aka_matched_ok,1940,1907.0,1897.0
aka_pai,1619,,
matched_pairs,3818,3644.0,3804.0
matched_pairs_ok,3665,3628.0,3614.0
nodate,5763,,
nodate_novide,141,,


##### **[EN]** Check the role of no-date, no-vide records in asymmetric matches

Since these records have no "vide" expression, they do not generate the symmetric name for lookup.

In [555]:
see_with_no_vide = set(match_records['vide_plus']['data']) - set(match_records['vide']['data'])
print("Number of records with see and no vide: ",len(see_with_no_vide))
asymmetric_see_no_vide = list(set(asymmetric_records).intersection(see_with_no_vide))
print("See no vide part in asymmetric matches: ",len(asymmetric_see_no_vide),set(asymmetric_records).intersection(see_with_no_vide))

Number of records with see and no vide:  141
See no vide part in asymmetric matches:  22 {'204089', '181367', '256874', '223100', '128114', '149251', '140367', '251029', '209320', '195505', '252478', '234767', '194373', '148963', '185191', '164227', '184271', '129384', '233397', '134006', '130534', '136704'}


##### **[EN]** Check asymmetric pairs

In [556]:

match_list = asymmetric_pairs

pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 6
for (o,d,t) in match_list[:show_only]:
    print(o,d,t)
    display_group_attributes([o,d],
                             header_cols=['uc-entrada','name','nome-vide','naturalidade','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

223100 252345 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,223100,0000-00-00,João Fernandes de Passos,,Penaguião,Leis,Gonçalo Fernandes
1,252345,1597-11-12,João Fernandes,Passos,Penaguião,Cânones,Gonçalo Fernandes
2,252345,1597-11-12,João Fernandes,Passos,Penaguião,Leis,Gonçalo Fernandes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,223100,faculdade,Leis,Leis
1,0000-00-00,223100,naturalidade,Penaguião,
2,0000-00-00,223100,nome,João Fernandes de Passos,
3,0000-00-00,223100,nome-pai,Gonçalo Fernandes,
4,0000-00-00,223100,uc-entrada,0000-00-00,
5,0000-00-00,223100,uc-saida,0000-00-00,
6,1597-11-12,252345,faculdade,Cânones,Faculdade corrigida
7,1597-11-12,252345,faculdade,Leis,Faculdade corrigida
8,1597-11-12,252345,faculdade-original,Cânones,
9,1597-11-12,252345,matricula-faculdade,Leis,


194373 128907 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,194373,0000-00-00,Manuel Coelho,,Guimarães,Cânones,Roque Coelho
1,128907,1640-11-05,Manuel Coelho Cabral,Coelho,Guimarães,Cânones,Roque Coelho


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,194373,faculdade,Cânones,Cânones
1,0000-00-00,194373,naturalidade,Guimarães,
2,0000-00-00,194373,nome,Manuel Coelho,
3,0000-00-00,194373,nome-pai,Roque Coelho,
4,0000-00-00,194373,uc-entrada,0000-00-00,
5,0000-00-00,194373,uc-saida,0000-00-00,
6,1640-11-05,128907,faculdade,Cânones,Cânones
7,1640-11-05,128907,matricula-faculdade,Cânones,05.11.1640
8,1640-11-05,128907,naturalidade,Guimarães,
9,1640-11-05,128907,nome,Manuel Coelho,"Manuel Coelho Cabral, vide Coelho"


225717 173224 aka-see


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,173224,0000-00-00,João Nunes,Rogado,Terena,Leis,André Rogado
1,225717,1587-10-03,João Nunes Rogado,Rogado,Terena,Leis,André Rogado


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,173224,faculdade,Leis,Leis
1,0000-00-00,173224,naturalidade,Terena,
2,0000-00-00,173224,nome,João Nunes,
3,0000-00-00,173224,nome,João Nunes Rogado,"João Nunes, vide Rogado"
4,0000-00-00,173224,nome-pai,André Rogado,
5,0000-00-00,173224,nome-vide,Rogado,
6,0000-00-00,173224,uc-entrada,0000-00-00,
7,0000-00-00,173224,uc-saida,0000-00-00,
8,1587-10-03,225717,faculdade,Leis,Leis
9,1587-10-03,225717,matricula-faculdade,Leis,1587.10.03


195505 160216 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,195505,0000-00-00,Constantino Ribeiro do Lago,,Braga,,
1,160216,1634-10-01,Constantino Ribeiro,Lago,Braga,Leis,Manuel Ribeiro


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,195505,naturalidade,Braga,
1,0000-00-00,195505,nome,Constantino Ribeiro do Lago,
2,0000-00-00,195505,uc-entrada,0000-00-00,
3,0000-00-00,195505,uc-saida,0000-00-00,
4,1634-10-01,160216,faculdade,Leis,Leis
5,1634-10-01,160216,naturalidade,Braga,
6,1634-10-01,160216,nome,Constantino Ribeiro,
7,1634-10-01,160216,nome,Constantino Ribeiro Lago,"Constantino Ribeiro, vide Lago"
8,1634-10-01,160216,nome-pai,Manuel Ribeiro,
9,1634-10-01,160216,nome-vide,Lago,


181367 214929 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,181367,0000-00-00,Gaspar de Macedo,,"Cepões, Lamego",Cânones,Gaspar Luís
1,214929,1660-11-23,Gaspar de Macedo Sampaio,Macedo,"Cepões, Lamego",Cânones,Gaspar Luís


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,181367,faculdade,Cânones,Cânones
1,0000-00-00,181367,naturalidade,"Cepões, Lamego",Cepões (Lamego)
2,0000-00-00,181367,nome,Gaspar de Macedo,
3,0000-00-00,181367,nome-pai,Gaspar Luís,
4,0000-00-00,181367,uc-entrada,0000-00-00,
5,0000-00-00,181367,uc-saida,0000-00-00,
6,1660-11-23,214929,faculdade,Cânones,Cânones
7,1660-11-23,214929,instituta,1660-11-23,23.11.1660 1660-11-23
8,1660-11-23,214929,naturalidade,"Cepões, Lamego",Cepões (Lamego)
9,1660-11-23,214929,nome,Gaspar de Macedo,"Gaspar de Macedo Sampaio, vide Macedo"


204089 196842 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,204089,0000-00-00,Domingos Correia de Torres,,Lisboa,Cânones,Miguel Fernandes
1,196842,1622-01-12,Domingos Correia,Torres,Lisboa,Cânones,Miguel Fernandes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,204089,faculdade,Cânones,Cânones
1,0000-00-00,204089,naturalidade,Lisboa,
2,0000-00-00,204089,nome,Domingos Correia de Torres,
3,0000-00-00,204089,nome-pai,Miguel Fernandes,
4,0000-00-00,204089,uc-entrada,0000-00-00,
5,0000-00-00,204089,uc-saida,0000-00-00,
6,1622-01-12,196842,faculdade,Cânones,Cânones
7,1622-01-12,196842,matricula-faculdade,Cânones,12.01.1622
8,1622-01-12,196842,naturalidade,Lisboa,
9,1622-01-12,196842,nome,Domingos Correia,


Add match information to the records
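
A sketch of the same lookup on toy data, assuming (as in the cell below) that pairs are `(origin, destination, type)` tuples and that only the first match per origin is kept. The `first_match_by_origin` helper and the ids are hypothetical; building a dictionary once avoids rescanning the pair list for every record.

```python
def first_match_by_origin(pairs):
    """Map each origin id to its first (destination, match_type)."""
    first = {}
    for o, d, mtype in pairs:
        first.setdefault(o, (d, mtype))  # keep the first pair seen for o
    return first


# Hypothetical ids, for illustration only.
pairs = [
    ("10", "20", "see-aka"),
    ("20", "10", "aka-see"),
]
lookup = first_match_by_origin(pairs)

# Records without a match get (None, None), as get_match does.
ids = ["10", "20", "30"]
matches = [lookup.get(i, (None, None)) for i in ids]
print(matches)  # [('20', 'see-aka'), ('10', 'aka-see'), (None, None)]
```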

In [557]:
import pandas as pd

pairs = match_records['matched_pairs_ok']['data']
def get_match(id,pairs):
    match_list = [(d,mtype) for (o,d,mtype) in pairs if o == id]
    if len(match_list) == 0:
        return (None,None)
    else:
        return match_list[0]

ids = vide_plus.index.values
matches = [get_match(id,pairs) for id in ids]
cols = pd.DataFrame(matches,columns=['match','match_type'], index=ids)
vide_plus = pd.concat([vide_plus,cols],axis=1)


## **[EN]** Match results

### **[EN]** Match general statistics

In [558]:
nvide_plus = match_info.loc['vide_plus','data']
match_info.fillna("")
vars_perc_vide = ['aka','aka_fac','aka_geo','aka_pai',
                'nodate','nodate_novide',
                'records_matched','records_matched_ok',
                'records_see_aka','records_see_see',
                'records_aka_see','records_aka_aka',
                'records_transitive',
                'vide','vide_plus']

match_info.loc[vars_perc_vide,'perc_vide_plus'] = match_info.loc[vars_perc_vide,'data']/nvide_plus

nrec_matched = match_info.loc['records_matched_ok','data']
vars_perc_matches = ['records_matched_ok',
                     'records_see_aka','records_see_see',
                     'records_aka_see','records_aka_aka',
                     'records_transitive']
match_info.loc[vars_perc_matches,'perc_matched_ok'] = match_info.loc[vars_perc_matches,'data']/nrec_matched
                     
nsee = match_info.loc['see','data']
vars_perc_see = ['see_matched','see_matched_ok','records_see_aka','records_see_see', 'see','see_fac','see_geo','see_pai']
match_info.loc[vars_perc_see,'perc_type'] = match_info.loc[vars_perc_see,'data']/nsee
match_info.loc[vars_perc_see,'type'] = 'see'


naka = match_info.loc['aka','data']
vars_perc_aka = ['aka','aka_matched','aka_matched_ok','aka_fac','aka_geo','aka_pai','records_aka_see','records_aka_aka']
match_info.loc[vars_perc_aka,'perc_type'] = match_info.loc[vars_perc_aka,'data']/naka
match_info.loc[vars_perc_aka,'type'] = 'aka'


nmatched_pairs = match_info.loc['matched_pairs','data']
vars_perc_matched = ['matched_pairs','matched_pairs_ok']
match_info.loc[vars_perc_matched,'perc_type'] = match_info.loc[vars_perc_matched,'data']/nmatched_pairs
match_info.loc[vars_perc_matched,'type'] = 'matched_pairs'

nrec_matched = match_info.loc['records_matched','data']
vars_perc_rec_matched = ['records_matched','records_matched_ok','records_transitive']
match_info.loc[vars_perc_rec_matched,'perc_type'] = match_info.loc[vars_perc_rec_matched,'data']/nrec_matched
match_info.loc[vars_perc_rec_matched,'type'] = 'records_mached'

In [559]:
match_info.fillna(" ")

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


### **[EN]** Generate matching file and dataframe

In [568]:

matching_view_cols = ['match','nome-geografico','uc-entrada','uc-saida','name','nome-vide','lookup','nome-pai','vide_type','faculdade','match_type','match_error','match_obs']

matched_error = vide_plus[vide_plus['match_error']==True]
matched_error_index = matched_error.index.unique()

matched_index = match_records['records_matched']['data']
matched_ok_index = list(set(matched_index)-set(matched_error_index))

matched = vide_plus.loc[matched_index].sort_values(['sort_key','nome-geografico','uc-entrada'])[matching_view_cols]
nmatched = len(matched_index)
print("Number of matched records:",nmatched)
matched.to_csv('../inferences/remissivas/vide_matched.csv',sep=',',)
matched.head(40)


Number of matched records: 3818


Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
179898,151044.0,Pinheiro de Ázere,0000-00-00,0000-00-00,Adrião da Barca de Gouveia,Barca,Adrião da Barca,,cut,,see-aka,False,
151044,179898.0,Pinheiro de Ázere,1596-10-19,1620-07-11,Adrião da Barca,Gouveia,Adrião da Barca Gouveia,Baltasar Cardoso,add,Cânones,aka-see,False,
151589,131748.0,Viana,0000-00-00,0000-00-00,Afonso de Barros,Caminha,Afonso de Barros Caminha,,add,,see-aka,False,
131748,151589.0,Viana,1684-10-01,1687-10-01,Afonso de Barros Caminha,Barros,Afonso de Barros,,cut,Cânones,aka-see,False,
250325,151588.0,Estremoz,0000-00-00,0000-00-00,Afonso de Barros Preto,Barros,Afonso de Barros,Francisco Dias Zagalo,cut,,see-aka,False,
151588,250325.0,Estremoz,1563-11-16,1577-10-11,Afonso de Barros,Preto,Afonso de Barros Preto,,add,Leis,aka-see,False,
181618,186611.0,Viseu,0000-00-00,0000-00-00,Afonso Botelho Machado,Botelho,Afonso Botelho,,cut,Leis,see-aka,False,
186611,181618.0,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Cânones,aka-see,False,
186611,181618.0,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Leis,aka-see,False,
221796,164067.0,Elvas,0000-00-00,0000-00-00,Afonso Frausto Segurado,Frausto,Afonso Frausto,,cut,,see-aka,False,


### **[EN]** Matches, excluding errors


**ERROR**: `records_matched_ok` lists records that are in error.

In [569]:
matched_ok_index = match_records['records_matched_ok']['data']
matched.loc[matched_ok_index].sort_values(['name','nome-geografico','uc-entrada']).head(40).fillna('')

Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
151044,179898,Pinheiro de Ázere,1596-10-19,1620-07-11,Adrião da Barca,Gouveia,Adrião da Barca Gouveia,Baltasar Cardoso,add,Cânones,aka-see,False,
179898,151044,Pinheiro de Ázere,0000-00-00,0000-00-00,Adrião da Barca de Gouveia,Barca,Adrião da Barca,,cut,,see-aka,False,
186611,181618,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Cânones,aka-see,False,
186611,181618,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Leis,aka-see,False,
181618,186611,Viseu,0000-00-00,0000-00-00,Afonso Botelho Machado,Botelho,Afonso Botelho,,cut,Leis,see-aka,False,
164067,221796,Elvas,1621-00-00,1627-02-12,Afonso Frausto,Segurado,Afonso Frausto Segurado,,add,Cânones,aka-see,False,
221796,164067,Elvas,0000-00-00,0000-00-00,Afonso Frausto Segurado,Frausto,Afonso Frausto,,cut,,see-aka,False,
177343,203124,Fronteira,1577-10-03,1585-11-23,Afonso Garcia,Tinoco,Afonso Garcia Tinoco,Pedro Garcia Tinoco,add,Leis,aka-see,False,
203124,177343,Fronteira,0000-00-00,0000-00-00,Afonso Garcia Tinoco,Garcia,Afonso Garcia,Pedro Garcia Tinoco,cut,Leis,see-aka,False,
187008,190253,Arruda,0000-00-00,0000-00-00,Afonso Henriques,Homem,Afonso Henriques Homem,,add,Cânones,see-aka,False,


### **[EN]** Show differences in matching results

In [570]:
match_info

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


#### **[EN]** Only matched in random mode

The extra success of the random mode comes from a better tolerance to variations in geographic names.

This is because the random mode uses a similarity factor to find students with the same birthplace, while the sequential method relies on sorting by geographic name and name so that matching records end up adjacent.

Each method succeeds in some cases where the other fails, but the random mode finds more matches overall (3804 pairs against 3644).
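
The contrast can be illustrated with place-name variants taken from the sample output below. This is only a sketch: the similarity measure and threshold actually used by the random mode are not shown in this section, so `difflib.SequenceMatcher`, the 0.8 cut-off, and the `similar_place` helper are assumptions. "Nespereira" is a hypothetical intervening place name added to show how sorting can separate two variants.

```python
from difflib import SequenceMatcher


def similar_place(a, b, threshold=0.8):
    """True when two place names are close enough to count as the same.

    The threshold is an assumed value, not the one used by the notebook.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Spelling variants of the same place are accepted by a similarity test...
print(similar_place("Nogoselo", "Nagoselo"))
print(similar_place("Ilha Terceira", "Ilha da Terceira"))
# ...while clearly different places are rejected.
print(similar_place("Viseu", "Lisboa"))

# Sorting, by contrast, can place another name between two variants,
# so they are no longer adjacent.
places = sorted(["Nogoselo", "Nespereira", "Nagoselo"])
print(places)  # ['Nagoselo', 'Nespereira', 'Nogoselo']
```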

In [572]:
matched_rand_index = match_records['records_matched_ok']['random']
matched_seq_index = match_records['records_matched_ok']['sequential']
matched_error_index = match_records['records_error']['data']

matched_rand_only = list(set(matched_rand_index)-set(matched_seq_index)-set(matched_error_index))
nmatched_rand_only = len(matched_rand_only)
print(f"Number of records matched only in random access mode (errors excluded): {nmatched_rand_only}")
print()
print("Sample:")
matched.loc[matched_rand_only].sort_values(['name','nome-geografico','uc-entrada',])[matching_view_cols].head(40)

Number of records matched only in random access mode (errors excluded): 68

Sample:


Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
183928,222335,Nogoselo,1756-10-01,1756-10-01,Agostinho Manuel,Agostinho Manuel de Sequeira,Agostinho Manuel de Sequeira,,rep,Cursos jurídicos (Cânones ou Leis),aka-see,False,
222335,183928,Nagoselo,0000-00-00,0000-00-00,Agostinho Manuel de Sequeira,Agostinho Manuel,Agostinho Manuel,,cut,,see-aka,False,
148502,213090,Santiago do Cacém,1650-10-12,1658-03-30,André Ascenso,Salema,André Ascenso Salema,Manuel Raposo Pessanha,add,Cânones,aka-see,False,
148502,213090,Santiago do Cacém,1650-10-12,1658-03-30,André Ascenso,Salema,André Ascenso Salema,Manuel Raposo Pessanha,add,Leis,aka-see,False,
213090,148502,Santiago de Cacém,0000-00-00,0000-00-00,André Ascenso Salema,Ascenso,André Ascenso,,cut,Leis,see-aka,False,
178267,251037,Ilha Terceira,1567-10-01,1575-05-17,André Gomes,Monteiro,André Gomes Monteiro,António Vaz,add,Cânones,aka-see,False,
251037,178267,Ilha da Terceira,0000-00-00,0000-00-00,André Gomes Monteiro,Gomes,André Gomes,,cut,,see-aka,False,
222372,140377,Várzae de Meruge,1728-10-01,1731-10-01,André de Sequeira,Abranches,André de Sequeira Abranches,,add,Cânones,aka-see,False,
140377,222372,Várzea de Meruge,0000-00-00,0000-00-00,André de Sequeira Abranches,Sequeira,André de Sequeira,,cut,,see-aka,False,
238571,202253,Setã,1632-01-10,1659-11-08,António Lopes,Leitão,António Lopes Leitão,António André,add,Cânones,aka-see,False,


#### **[EN]** Only matched in sequential mode

In a few cases the sequential method is more successful.

In [573]:
pd.set_option('display.max_rows',100)


matched_seq_only = list(set(matched_seq_index)-set(matched_rand_index)-set(matched_error_index))
nmatched_not_rand = len(matched_seq_only)
print(f"Number of records matched only in sequential mode (errors excluded): {nmatched_not_rand}")
print()
matched.loc[matched_seq_only].sort_values(['name','nome-geografico','uc-entrada',]).head(20)[matching_view_cols]


Number of records matched only in sequential mode (errors excluded): 14



Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
241634,250486,Algoso,0000-00-00,0000-00-00,António Pimentel,Morais,António Pimentel Morais,,add,Cânones,see-aka,False,
250486,241634,Algozo,1656-10-15,1665-03-24,António Pimentel Morais,Pimentel,António Pimentel,,cut,Leis,aka-see,False,
212564,151964,Mouta,1613-03-23,1615-10-05,Jorge Vaz,Barros,Jorge Vaz Barros,Manuel Vaz,add,Teologia,aka-see,False,
151964,212564,Mouta ou Moita,0000-00-00,0000-00-00,Jorge Vaz de Barros,Vaz,Jorge Vaz,Manuel Vaz,cut,Teologia,see-aka,False,
188868,131782,Azoia de Baixo,1748-10-01,1752-10-01,José Henriques,Figueira,José Henriques Figueira,,add,Cânones,aka-see,False,
131782,188868,Azoia,0000-00-00,0000-00-00,José Henriques Figueira,José Henriques,José Henriques,,cut,Cânones,see-aka,False,
239722,242576,São José de Godim,0000-00-00,0000-00-00,José Manuel Borges de Sousa,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa Pinto,,rep,,see-aka,False,
242576,239722,Gondim,1762-10-01,1766-10-01,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa,José Manuel Borges de Sousa,,cut,Leis,aka-see,False,
242576,239722,São José,1762-10-01,1766-10-01,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa,José Manuel Borges de Sousa,,cut,Leis,aka-see,False,
136003,143938,Lobelhe do Mato,1642-10-20,1648-02-29,Manuel Cardoso,de Almeida,Manuel Cardoso de Almeida,Agostinho Cardoso,add,Medicina,aka-see,False,


#### **[EN]** Analyse Aka to Aka (see also) links.

These are true duplicates. Some of them could be prevented with a check on dates, but this serves to assess the extend of duplicate records in the data.

Analysis:
* 150364 265272: strange, because the two records are exactly the same except for the name and vide **and the date on instituta**. One of the two should be a "see" record.
* 129553 191232: also strange: same father name, and both records contain the same enrollment in "instituta" on 1601-10-14. Record 191232 
  has the faculdade "Medicina" and an enrollment date of 1613.01.12, while keeping the instituta date. 
* 207361 251998: this looks like a late addition to the "vide" scheme; a note on record 251998 states "Mudou o nome no ano de 1573 aos 03.06 - Atos e Graus 10, fl. 143, caderno 3º" (the name was changed on 03.06.1573). To conform, 251998 should be a "see" record with no dates.
* 190606 248991: record 248991 should be a "see" record; it retains a single enrollment date, 1588.10.01, which also exists in the paired record. With no dates and the redundant enrollment removed, 248991 would have been a normal match.
* 5 188413 193737: this is a true duplicate, but the shorter record 188413 seems to contain only redundant information, except that the faculdade is 
  recorded as "Leis" while in 193737 it is recorded as "Cânones". Note that, apart from the faculdade field itself, 193737 always refers to "Leis" in the various fields, including the degree.
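
The date checks discussed above can be isolated in a small pure function. This is a minimal sketch, not the notebook's own code: the function name `classify_aka_pair` and the flag labels are hypothetical, and it assumes dates are `YYYY-MM-DD` strings as in the `uc-entrada` and `uc-saida` columns. The sample dates are those of pairs 150364/265272 and 169085/245978.

```python
from typing import List


def classify_aka_pair(entry_o: str, exit_o: str,
                      entry_d: str, exit_d: str,
                      threshold_years: int = 15) -> List[str]:
    """Return date-based flags for a pair of aka records (hypothetical helper)."""
    flags = []
    # A record with a single date (entry == exit) is probably a "see" record.
    if entry_o == exit_o:
        flags.append("origin_possible_see")
    if entry_d == exit_d:
        flags.append("dest_possible_see")
    if entry_o == entry_d and exit_o == exit_d:
        # Identical dates: possible double registration of the same card.
        flags.append("same_dates")
    elif abs(int(entry_o[:4]) - int(entry_d[:4])) > threshold_years:
        # Entry years too far apart: probably two different people.
        flags.append("far_apart")
    return flags


same = classify_aka_pair("1624-10-22", "1633-05-21", "1624-10-22", "1633-05-21")
far = classify_aka_pair("1645-10-16", "1664-10-01", "1677-03-16", "1677-03-16")
print(same)
print(far)
```

Keeping the rule in one function makes it easier to try other thresholds without re-running the display loop.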

In [574]:
vide_plus.columns

Index(['name', 'sex', 'nome-vide', 'nome-geografico', 'faculdade',
       'faculdade.date', 'faculdade.obs', 'nome-pai', 'uc-entrada', 'uc-saida',
       'uc-saida.date', 'uc-saida.obs', 'rec_type', 'loookup', 'vide_type',
       'lookup', 'name_sp', 'lookup_sp', 'sort_key', 'match_error',
       'match_obs', 'match', 'match_type'],
      dtype='object')

In [611]:
from timelinknb.pandas import display_group_attributes

date_threshold = 15  # difference in years for flagging false duplicate.
show_only = 20

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs_ok']['data']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]

aka_aka_same_date = []
aka_aka_far_apart = []
aka_aka_possible_see = []
for o,d,t in show_pairs:
    # get the entry and exit dates to filter out pairs that cannot be the same person
    date_o = matched.loc[[o]]['uc-entrada'].iloc[0]
    date_d = matched.loc[[d]]['uc-entrada'].iloc[0]
    date_s_o = matched.loc[[o]]['uc-saida'].iloc[0]
    date_s_d = matched.loc[[d]]['uc-saida'].iloc[0]

    # a record whose entry and exit dates coincide has a single date: probably a "see" record
    if date_o == date_s_o:
        aka_aka_possible_see.append(o)

    if date_d == date_s_d:
        aka_aka_possible_see.append(d)

    if date_o == date_d and date_s_o == date_s_d:
        # identical entry and exit dates: possible double registration of the same card
        aka_aka_same_date.append((o,d,t))
    else:
        year_o = int(date_o[:4])
        year_d = int(date_d[:4])
        if abs(year_o - year_d) > date_threshold:
            # entry years too far apart: probably a false match
            aka_aka_far_apart.append((o,d,t))

print(f"Number of aka-aka pairs with the same date:",len(aka_aka_same_date))
print(f"Number of aka-aka pairs more {date_threshold} years apart:",len(aka_aka_far_apart))
print(f"Number of possible false aka records (records with a single date, probably a see record)",len(aka_aka_possible_see))


print(f"aka-aka matches (show only {show_only}) of {len(show_pairs)}:")
i = 0
for o,d,t in show_pairs[:show_only]:
    i += 1
    print(i,o,d)
    if (o,d,t) in aka_aka_same_date:
        print("SAME DATES: Possible double registration of the same card")
    elif (o,d,t) in aka_aka_far_apart:
        print(f"FAR APART >{date_threshold} years: possible false match, records chronologically affar")
    if o in aka_aka_possible_see:
        print(f"{o} is a possible 'see' record")
    if d in aka_aka_possible_see:
        print(f"{d} is a possible 'see' record")
    
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','uc-saida','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')


Number of aka-aka pairs with the same date: 5
Number of aka-aka pairs more 15 years apart: 9
Number of possible false aka records (records with a single date, probably a see record) 38
aka-aka matches (show only 20) of 94:
1 184012 243154
184012 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,243154,João Manuel de Pina,Manuel,Óbidos,1601-10-15,1614-06-26,Cânones,Francisco Gorjão
1,184012,João Manuel,Pina,Óbidos,1605-10-07,1605-10-07,Cursos jurídicos (Cânones ou Leis),Francisco Gorjão


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-15,243154,faculdade,Cânones,Cânones
1,1601-10-15,243154,matricula-faculdade,Cânones,15.10.1601
2,1601-10-15,243154,naturalidade,Óbidos,
3,1601-10-15,243154,nome,João Manuel,"João Manuel de Pina, vide Manuel"
4,1601-10-15,243154,nome,João Manuel de Pina,
5,1601-10-15,243154,nome-pai,Francisco Gorjão,
6,1601-10-15,243154,nome-vide,Manuel,
7,1601-10-15,243154,uc-entrada,1601-10-15,
8,1601-10-15,243154,uc-entrada.ano,1601,
9,1605-10-07,184012,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida


2 150364 265272
SAME DATES: Possible double registration of the same card


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,1633-05-21,Cânones,Tristão Barbosa
1,150364,Diogo Barbosa,Carvalho,Lisboa,1624-10-22,1633-05-21,Leis,Tristão Barbosa
2,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,1633-05-21,Cânones,Tristão Barbosa
3,265272,Diogo Barbosa de Carvalho,Barbosa,Lisboa,1624-10-22,1633-05-21,Leis,Tristão Barbosa


Unnamed: 0,date,id,type,value,attr_obs
0,1624-10-22,150364,faculdade,Cânones,Faculdade corrigida
1,1624-10-22,265272,faculdade,Cânones,Faculdade corrigida
2,1624-10-22,150364,faculdade,Leis,Faculdade corrigida
3,1624-10-22,265272,faculdade,Leis,Faculdade corrigida
4,1624-10-22,150364,faculdade-original,Leis,
5,1624-10-22,265272,faculdade-original,Leis,
6,1624-10-22,150364,instituta,1624-10-22,"""1624/10/22 1624-10-22"""
7,1624-10-22,150364,naturalidade,Lisboa,
8,1624-10-22,265272,naturalidade,Lisboa,
9,1624-10-22,150364,nome,Diogo Barbosa,


3 182217 249651


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,1629-07-08,Cânones,João Machado
1,182217,Simão Machado,Miranda,Ilha da Madeira,1624-10-02,1629-07-08,Leis,João Machado
2,249651,Simão Machado de Miranda,Machado,Ilha da Madeira,1624-00-00,1626-10-02,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1624-00-00,249651,faculdade,Cânones,Cânones
1,1624-00-00,249651,instituta,1624-00-00,??.10.1624
2,1624-00-00,249651,naturalidade,Ilha da Madeira,
3,1624-00-00,249651,nome,Simão Machado,"Simão Machado de Miranda, vide Machado"
4,1624-00-00,249651,nome,Simão Machado de Miranda,
5,1624-00-00,249651,nome-vide,Machado,
6,1624-00-00,249651,uc-entrada,1624-00-00,
7,1624-00-00,249651,uc-entrada.ano,1624,
8,1624-10-02,182217,faculdade,Cânones,Faculdade corrigida
9,1624-10-02,182217,faculdade,Leis,Faculdade corrigida


4 194939 250513


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,250513,Simão Lourenço,Coelho,Tancos,1648-10-31,1655-04-30,Leis,Teodósio Lourenço
1,194939,Simão Lourenço Coelho,Lourenço,Tancos,1650-10-12,1655-04-30,Leis,Teodósio Lourenço


Unnamed: 0,date,id,type,value,attr_obs
0,1648-10-31,250513,faculdade,Leis,Leis
1,1648-10-31,250513,instituta,1648-10-31,1648.10.31 1648-10-31
2,1648-10-31,250513,naturalidade,Tancos,
3,1648-10-31,250513,nome,Simão Lourenço,
4,1648-10-31,250513,nome,Simão Lourenço Coelho,"Simão Lourenço, vide Coelho"
5,1648-10-31,250513,nome-pai,Teodósio Lourenço,
6,1648-10-31,250513,nome-vide,Coelho,
7,1648-10-31,250513,uc-entrada,1648-10-31,
8,1648-10-31,250513,uc-entrada.ano,1648,
9,1649-10-02,250513,matricula-faculdade,Leis,1649.10.02


5 145682 194913
194913 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,194913,Semião Coelho,Amaral,Viseu,1659-03-20,1659-03-20,Medicina,
1,145682,Semião Coelho do Amaral,Coelho,Viseu,1659-10-02,1665-05-20,Medicina,João Cardoso de Loureiro


Unnamed: 0,date,id,type,value,attr_obs
0,1659-03-20,194913,faculdade,Medicina,Medicina
1,1659-03-20,194913,grau,Bacharel em Artes,Bacharel em Artes 20.03.1659
2,1659-03-20,194913,naturalidade,Viseu,
3,1659-03-20,194913,nome,Semião Coelho,
4,1659-03-20,194913,nome,Semião Coelho Amaral,"Semião Coelho, vide Amaral"
5,1659-03-20,194913,nome-vide,Amaral,
6,1659-03-20,194913,uc-entrada,1659-03-20,
7,1659-03-20,194913,uc-entrada.ano,1659,
8,1659-03-20,194913,uc-saida,1659-03-20,
9,1659-03-20,194913,uc-saida.ano,1659,


6 202938 205883


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,202938,Manuel da Costa,Lemos e Tunes,Arganil,1725-10-01,1731-07-17,Cânones,Pedro Nunes da Costa
1,205883,Manuel da Costa Lemos e Tunes,Costa,Arganil,1729-07-15,1731-07-17,,Pedro Nunes da Costa


Unnamed: 0,date,id,type,value,attr_obs
0,1725-10-01,202938,faculdade,Cânones,Cânones
1,1725-10-01,202938,matricula-faculdade,Cânones,01.10.1725
2,1725-10-01,202938,naturalidade,Arganil,
3,1725-10-01,202938,nome,Manuel da Costa,
4,1725-10-01,202938,nome,Manuel da Costa Lemos e Tunes,"Manuel da Costa, vide Lemos e Tunes"
5,1725-10-01,202938,nome-nota,padre,
6,1725-10-01,202938,nome-pai,Pedro Nunes da Costa,
7,1725-10-01,202938,nome-vide,Lemos e Tunes,
8,1725-10-01,202938,padre,sim,padre
9,1725-10-01,202938,uc-entrada,1725-10-01,


7 141854 252428
141854 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,252428,Jorge Fernandes,Aires,Lisboa,1572-03-15,1585-05-07,Cânones,Aires Vaz
1,141854,Jorge Fernandes Aires,Fernandes,Lisboa,1583-10-01,1583-10-01,Cânones,Aires Vaz


Unnamed: 0,date,id,type,value,attr_obs
0,1572-03-15,252428,faculdade,Cânones,Cânones
1,1572-03-15,252428,grau,Bacharel em Artes,
2,1572-03-15,252428,naturalidade,Lisboa,
3,1572-03-15,252428,nome,Jorge Fernandes,
4,1572-03-15,252428,nome,Jorge Fernandes Aires,"Jorge Fernandes, vide Aires"
5,1572-03-15,252428,nome-pai,Aires Vaz,
6,1572-03-15,252428,nome-vide,Aires,
7,1572-03-15,252428,uc-entrada,1572-03-15,
8,1572-03-15,252428,uc-entrada.ano,1572,
9,1573-05-08,252428,grau,Licenciado em Cânones,Licenciado 08.05.1573


8 169085 245978
FAR APART >15 years: possible false match, records chronologically affar
245978 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,169085,Mateus Neto,Miguéis,Buarcos,1645-10-16,1664-10-01,Cânones,Pedro Brás
1,245978,Mateus Neto Miguéis,Neto,"Redondos, Buarcos",1677-03-16,1677-03-16,Cânones,Pedro Vaz ou Brás


Unnamed: 0,date,id,type,value,attr_obs
0,1645-10-16,169085,faculdade,Cânones,Cânones
1,1645-10-16,169085,instituta,1645-10-16,16.10.1645 1645-10-16
2,1645-10-16,169085,naturalidade,Buarcos,
3,1645-10-16,169085,nome,Mateus Neto,
4,1645-10-16,169085,nome,Mateus Neto Miguéis,"Mateus Neto, vide Miguéis"
5,1645-10-16,169085,nome-nota,padre,
6,1645-10-16,169085,nome-pai,Pedro Brás,
7,1645-10-16,169085,nome-vide,Miguéis,
8,1645-10-16,169085,padre,sim,padre
9,1645-10-16,169085,uc-entrada,1645-10-16,


9 129711 173982
129711 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,129711,Hilário da Rocha Calheiros,Rocha,Caldas,1660-10-01,1660-10-01,Cursos jurídicos (Cânones ou Leis),
1,173982,Hilário da Rocha,Calheiros,Caldas,1661-10-15,1668-01-30,Cânones,António da Rocha


Unnamed: 0,date,id,type,value,attr_obs
0,1660-10-01,129711,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1660-10-01,129711,instituta,1660-10-01,01.10.1660 1660-10-01
2,1660-10-01,173982,instituta,1660-10-01,1660.10.01 1660-10-01
3,1660-10-01,129711,naturalidade,Caldas,
4,1660-10-01,129711,nome,Hilário da Rocha,"Hilário da Rocha Calheiros, vide Rocha"
5,1660-10-01,129711,nome,Hilário da Rocha Calheiros,
6,1660-10-01,129711,nome-vide,Rocha,
7,1660-10-01,129711,uc-entrada,1660-10-01,
8,1660-10-01,129711,uc-entrada.ano,1660,
9,1660-10-01,129711,uc-saida,1660-10-01,


10 178636 248706
248706 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,178636,Francisco Gomes,Miranda,Lisboa,1597-10-01,1608-10-11,Cânones,Basílio Gomes
1,248706,Francisco Gomes de Miranda,Gomes,Lisboa,1607-01-18,1607-01-18,Cânones,Baílio Gomes


Unnamed: 0,date,id,type,value,attr_obs
0,1597-10-01,178636,faculdade,Cânones,Cânones
1,1597-10-01,178636,instituta,1597-10-01,01.10.1597 1597-10-01
2,1597-10-01,178636,naturalidade,Lisboa,
3,1597-10-01,178636,nome,Francisco Gomes,
4,1597-10-01,178636,nome,Francisco Gomes Miranda,"Francisco Gomes, vide Miranda"
5,1597-10-01,178636,nome-pai,Basílio Gomes,
6,1597-10-01,178636,nome-vide,Miranda,
7,1597-10-01,178636,uc-entrada,1597-10-01,
8,1597-10-01,178636,uc-entrada.ano,1597,
9,1598-10-21,178636,matricula-faculdade,Cânones,21.10.1598


11 190606 248991
248991 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,190606,Fernão Pires de Castro,Pires,Elvas,1586-10-01,1591-05-13,Teologia,Fernão Pires
1,248991,Fernão Pires,Castro,Elvas,1588-10-01,1588-10-01,Teologia,Fernão Pires


Unnamed: 0,date,id,type,value,attr_obs
0,1586-10-01,190606,faculdade,Teologia,Teologia
1,1586-10-01,190606,matricula-faculdade,Teologia,01.10.1586
2,1586-10-01,190606,naturalidade,Elvas,
3,1586-10-01,190606,nome,Fernão Pires,"Fernão Pires de Castro, vide Pires"
4,1586-10-01,190606,nome,Fernão Pires de Castro,
5,1586-10-01,190606,nome-pai,Fernão Pires,
6,1586-10-01,190606,nome-vide,Pires,
7,1586-10-01,190606,uc-entrada,1586-10-01,
8,1586-10-01,190606,uc-entrada.ano,1586,
9,1587-10-01,190606,matricula-faculdade,Teologia,01.10.1587


12 211073 240873


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,211073,Simão Pestana da Cunha,Pestana,Ferreirim,1709-10-01,1719-10-01,Teologia,
1,240873,Simão Pestana,Cunha,Ferreirim,1709-10-01,1720-05-18,Teologia,Manuel da Fonseca


Unnamed: 0,date,id,type,value,attr_obs
0,1709-10-01,211073,faculdade,Teologia,Teologia
1,1709-10-01,240873,faculdade,Teologia,Teologia
2,1709-10-01,211073,matricula-faculdade,Teologia,01.10.1709
3,1709-10-01,240873,matricula-faculdade,Teologia,01.10.1709
4,1709-10-01,211073,naturalidade,Ferreirim,
5,1709-10-01,240873,naturalidade,Ferreirim,
6,1709-10-01,211073,nome,Simão Pestana,"Simão Pestana da Cunha, vide Pestana"
7,1709-10-01,240873,nome,Simão Pestana,
8,1709-10-01,240873,nome,Simão Pestana Cunha,"Simão Pestana, vide Cunha"
9,1709-10-01,211073,nome,Simão Pestana da Cunha,


13 160999 214635
FAR APART >15 years: possible false match, records chronologically affar


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,214635,Pedro Velho,Fragoso,Vila Nova de Portimão,1608-10-01,1617-07-23,Cânones,Francisco Velho
1,160999,Pedro Velho Fragoso,Velho,Vila Nova de Portimão,1626-05-12,1636-11-05,,Francisco Velho


Unnamed: 0,date,id,type,value,attr_obs
0,1608-10-01,214635,faculdade,Cânones,Cânones
1,1608-10-01,214635,instituta,1608-10-01,01.10.1608 1608-10-01
2,1608-10-01,214635,naturalidade,Vila Nova de Portimão,
3,1608-10-01,214635,nome,Pedro Velho,
4,1608-10-01,214635,nome,Pedro Velho Fragoso,"Pedro Velho, vide Fragoso"
5,1608-10-01,214635,nome-pai,Francisco Velho,
6,1608-10-01,214635,nome-vide,Fragoso,
7,1608-10-01,214635,uc-entrada,1608-10-01,
8,1608-10-01,214635,uc-entrada.ano,1608,
9,1609-10-06,214635,matricula-faculdade,Cânones,06.10.1609


14 206932 207569
206932 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,207569,Jerónimo Mendes,Vale,Lisboa,1569-10-01,1578-06-27,Cânones,
1,206932,Jerónimo Mendes do Vale,Mendes,Lisboa,1573-11-06,1573-11-06,Cânones,Duarte Mendes


Unnamed: 0,date,id,type,value,attr_obs
0,1569-10-01,207569,faculdade,Cânones,Faculdade inferida
1,1569-10-01,207569,naturalidade,Lisboa,
2,1569-10-01,207569,nome,Jerónimo Mendes,
3,1569-10-01,207569,nome,Jerónimo Mendes Vale,"Jerónimo Mendes, vide Vale"
4,1569-10-01,207569,nome-vide,Vale,
5,1569-10-01,207569,uc-entrada,1569-10-01,
6,1569-10-01,207569,uc-entrada.ano,1569,
7,1573-11-06,206932,faculdade,Cânones,Cânones
8,1573-11-06,206932,matricula-faculdade,Cânones,06.11.1573
9,1573-11-06,206932,naturalidade,Lisboa,


15 190442 209990
209990 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,190442,Bernardo de Castro,Leite,Braga,1625-10-17,1632-07-07,Cânones,Domingos Dias
1,209990,Bernardo de Castro Leite,Castro,Braga,1628-10-29,1628-10-29,Cânones,Domingos Dias Leite


Unnamed: 0,date,id,type,value,attr_obs
0,1625-10-17,190442,faculdade,Cânones,Cânones
1,1625-10-17,190442,instituta,1625-10-17,17.10.1625 1625-10-17
2,1625-10-17,190442,naturalidade,Braga,
3,1625-10-17,190442,nome,Bernardo de Castro,
4,1625-10-17,190442,nome,Bernardo de Castro Leite,"Bernardo de Castro, vide Leite"
5,1625-10-17,190442,nome-pai,Domingos Dias,
6,1625-10-17,190442,nome-vide,Leite,
7,1625-10-17,190442,uc-entrada,1625-10-17,
8,1625-10-17,190442,uc-entrada.ano,1625,
9,1626-10-14,190442,matricula-faculdade,Cânones,14.10.1626


16 197653 226783
197653 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,197653,Guilherme Tavares,Silva,Lisboa,1719-10-01,1719-10-01,Cursos jurídicos (Cânones ou Leis),
1,226783,Guilherme Tavares da Silva,Tavares,Lisboa,1720-10-01,1726-06-28,Cânones,Manuel Tavares


Unnamed: 0,date,id,type,value,attr_obs
0,1719-10-01,197653,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1719-10-01,197653,instituta,1719-10-01,01.10.1719 1719-10-01
2,1719-10-01,197653,naturalidade,Lisboa,
3,1719-10-01,197653,nome,Guilherme Tavares,
4,1719-10-01,197653,nome,Guilherme Tavares Silva,"Guilherme Tavares, vide Silva"
5,1719-10-01,197653,nome-vide,Silva,
6,1719-10-01,197653,uc-entrada,1719-10-01,
7,1719-10-01,197653,uc-entrada.ano,1719,
8,1719-10-01,197653,uc-saida,1719-10-01,
9,1719-10-01,197653,uc-saida.ano,1719,


17 133131 180160


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,180160,Manuel de Gouveia,Quintela,Lisboa,1656-10-01,1657-10-01,Cânones,
1,133131,Manuel de Gouveia Quintela,Gouveia,Lisboa,1656-10-07,1664-02-22,Cânones,João de Gouveia


Unnamed: 0,date,id,type,value,attr_obs
0,1656-10-01,180160,faculdade,Cânones,Cânones
1,1656-10-01,180160,naturalidade,Lisboa,
2,1656-10-01,180160,nome,Manuel de Gouveia,
3,1656-10-01,180160,nome,Manuel de Gouveia Quintela,"Manuel de Gouveia, vide Quintela"
4,1656-10-01,180160,nome-vide,Quintela,
5,1656-10-01,180160,uc-entrada,1656-10-01,
6,1656-10-01,180160,uc-entrada.ano,1656,
7,1656-10-07,133131,faculdade,Cânones,Cânones
8,1656-10-07,133131,instituta,1656-10-07,07.10.1656 1656-10-07
9,1656-10-07,180160,instituta,1656-10-07,07.10.1656 1656-10-07


18 153630 246901
FAR APART >15 years: possible false match, records chronologically affar


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,246901,Gaspar Pinto,Fonseca,Amarante,1591-01-04,1594-10-01,Cânones,Gonçalo Vaz
1,153630,Gaspar Pinto da Fonseca,Pinto,Amarante,1616-12-15,1623-07-04,Leis,Pedro de Seixas


Unnamed: 0,date,id,type,value,attr_obs
0,1591-01-04,246901,faculdade,Cânones,Cânones
1,1591-01-04,246901,instituta,1591-01-04,04.01.1591 1591-01-04
2,1591-01-04,246901,naturalidade,Amarante,
3,1591-01-04,246901,nome,Gaspar Pinto,
4,1591-01-04,246901,nome,Gaspar Pinto Fonseca,"Gaspar Pinto, vide Fonseca"
5,1591-01-04,246901,nome-pai,Gonçalo Vaz,
6,1591-01-04,246901,nome-vide,Fonseca,
7,1591-01-04,246901,uc-entrada,1591-01-04,
8,1591-01-04,246901,uc-entrada.ano,1591,
9,1592-10-01,246901,matricula-faculdade,Cânones,01.10.1592


19 142500 265903
SAME DATES: Possible double registration of the same card


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,265903,Matias de Almeida,Pereira Preto de Almada,Figueiró dos Vinhos,1734-10-01,1741-10-01,Cânones,
1,142500,Matias de Almeida Pereira Preto de Almada,Almeida,Figueiró dos Vinhos,1734-10-01,1741-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1734-10-01,142500,faculdade,Cânones,Cânones
1,1734-10-01,265903,faculdade,Cânones,Cânones
2,1734-10-01,265903,instituta,1734-10-01,"""1734/10/01 1734-10-01"""
3,1734-10-01,142500,naturalidade,Figueiró dos Vinhos,
4,1734-10-01,265903,naturalidade,Figueiró dos Vinhos,
5,1734-10-01,142500,nome,Matias de Almeida,"Matias de Almeida Pereira Preto de Almada, vide Almeida"
6,1734-10-01,265903,nome,Matias de Almeida,
7,1734-10-01,142500,nome,Matias de Almeida Pereira Preto de Almada,
8,1734-10-01,265903,nome,Matias de Almeida Pereira Preto de Almada,"Matias de Almeida, vide Pereira Preto de Almada"
9,1734-10-01,142500,nome-vide,Almeida,


20 187658 187661


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,187661,Gaspar da Costa Brandão,Gaspar Afonso da Costa Brandão,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",1720-10-01,1726-05-25,Leis,
1,187658,Gaspar Afonso da Costa Brandão,Gaspar da Costa Brandão,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",1721-10-01,1726-07-28,Leis,Bento de Figueiredo Brandão


Unnamed: 0,date,id,type,value,attr_obs
0,1720-10-01,187661,faculdade,Leis,Leis
1,1720-10-01,187661,instituta,1720-10-01,01.10.1720 1720-10-01
2,1720-10-01,187661,naturalidade,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",
3,1720-10-01,187661,nome,Gaspar Afonso da Costa Brandão,"Gaspar da Costa Brandão, vide Gaspar Afonso da Costa Brandão"
4,1720-10-01,187661,nome,Gaspar da Costa Brandão,
5,1720-10-01,187661,nome-vide,Gaspar Afonso da Costa Brandão,
6,1720-10-01,187661,uc-entrada,1720-10-01,
7,1720-10-01,187661,uc-entrada.ano,1720,
8,1721-10-01,187658,faculdade,Leis,Leis
9,1721-10-01,187658,naturalidade,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",


#### **[EN]** Types of transformations in matched records

In [576]:
vide_types_matches = matched.groupby('vide_type').count()[['name']]
vide_types_matches['perc'] = vide_types_matches['name']/ vide_types_matches['name'].sum()
vide_types_matches

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,1897,0.460213
cut,1899,0.460699
novid,26,0.006308
rep,294,0.071325
repap,6,0.001456
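
The `add`, `cut` and `rep` types describe how the lookup name is derived from the record name and the vide expression. The sketch below is reverse-engineered from sample rows in this notebook (for example "Jorge Vaz de Barros, vide Vaz" giving "Jorge Vaz"); it is not the notebook's actual implementation, the function name `build_lookup` is hypothetical, and the `novid`, `repap` and `rep+` types are deliberately not covered.

```python
from typing import Tuple


def build_lookup(name: str, vide: str) -> Tuple[str, str]:
    """Guess the vide type and the lookup name (illustrative assumption only)."""
    name_tokens = name.split()
    vide_tokens = vide.split()
    n, v = len(name_tokens), len(vide_tokens)
    # cut: the vide expression occurs inside the name; keep the name up to it.
    if v < n:
        for i in range(n - v + 1):
            if name_tokens[i:i + v] == vide_tokens:
                return "cut", " ".join(name_tokens[:i + v])
    # rep: the vide expression is a full name starting with the same first name.
    if name_tokens and vide_tokens and name_tokens[0] == vide_tokens[0]:
        return "rep", vide
    # add: the vide expression is appended to the name.
    return "add", f"{name} {vide}"


print(build_lookup("Jorge Vaz de Barros", "Vaz"))
print(build_lookup("Manuel Cardoso", "de Almeida"))
print(build_lookup("Afonso Sardinha", "Afonso Vaz Sardinha"))
```

Under these assumptions an `add` record and its `cut` counterpart generate each other's names, which is consistent with the two types having almost identical counts in the table above.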


In [577]:
match_info.fillna("")

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


### **[EN]** Analysis of non matched records

In [578]:
vide_plus.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 198423 to 230315
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
 13  loookup          9438 non-null   object
 14  vide_type        9438 non-null   object
 15  lookup           9438 non-null   object
 16  name_sp          9438 non-null   object
 17  lookup_sp        9438 non-null 

In [579]:

pd.set_option('display.max_rows',250)
matched_index = match_records['records_matched']['data']
non_matched_index = set(vide_plus.index.unique())-set(matched_index)
non_matched_cols = ['nome-geografico','match','name','nome-vide','vide_type','lookup',
                    'faculdade','nome-pai','uc-entrada','match_error','match_obs']
vide_non_matched = (vide_plus.loc[list(non_matched_index)]
                    .sort_values(['sort_key','nome-geografico'])[non_matched_cols])
vide_non_matched.to_csv('../inferences/remissivas/vide_non_matched.csv',sep=',')

In [580]:
vide_types_non_matches = vide_non_matched.groupby('vide_type').count()[['name']]
vide_types_non_matches['perc'] = vide_types_non_matches['name']/ vide_types_non_matches['name'].sum()
vide_types_non_matches

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,2160,0.406321
cut,2227,0.418924
novid,126,0.023702
rep,763,0.143529
rep+,20,0.003762
repap,20,0.003762
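
Combining this table with the one for matched records gives a match rate per vide type. The sketch below copies the counts by hand from the two tables; it assumes both tables count rows of the same `vide_plus` frame, which seems plausible because their totals add up to its 9438 entries. The variable names are hypothetical.

```python
# Row counts per vide_type, copied from the two tables in this notebook.
matched_counts = {"add": 1897, "cut": 1899, "novid": 26, "rep": 294, "repap": 6}
non_matched_counts = {"add": 2160, "cut": 2227, "novid": 126,
                      "rep": 763, "rep+": 20, "repap": 20}

vide_types = sorted(set(matched_counts) | set(non_matched_counts))
match_rate = {}
for vide_type in vide_types:
    m = matched_counts.get(vide_type, 0)
    n = non_matched_counts.get(vide_type, 0)
    # Share of rows of this type that found a match.
    match_rate[vide_type] = m / (m + n)

total_rows = sum(matched_counts.values()) + sum(non_matched_counts.values())
print("total rows:", total_rows)
for vide_type in vide_types:
    print(f"{vide_type:6s} {match_rate[vide_type]:.3f}")
```

Read this way, `add` and `cut` rows match a little under half of the time, while `rep` rows match in roughly 28% of cases and `rep+` rows never match.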


### **[EN]** Sample of non-matched records


In [581]:
vide_non_matched.head(31)

Unnamed: 0,nome-geografico,match,name,nome-vide,vide_type,lookup,faculdade,nome-pai,uc-entrada,match_error,match_obs
220890,Portalegre,,"""Pedro Rodrigues, vide; Abreu""",,novid,"""Pedro Rodrigues, vide; Abreu""",,,0000-00-00,False,
271719,Abreiro,,Abel de Mendonça Machado de Araújo,Abel de Mendonça,cut,Abel de Mendonça,,,0000-00-00,False,
271719,Mirandela,,Abel de Mendonça Machado de Araújo,Abel de Mendonça,cut,Abel de Mendonça,,,0000-00-00,False,
182548,Eiró,,Abel Xavier Teixeira de Magalhães,José Joaquim Xavier Teixeira de Magalhães,rep,José Joaquim Xavier Teixeira de Magalhães,Cursos jurídicos (Cânones ou Leis),,0000-00-00,False,
285686,Oliveira de Frades,,Abílio Ribeiro de Almeida Campos de Melo,Abílio Ribeiro de Almeida,cut,Abílio Ribeiro de Almeida,Cursos jurídicos (Cânones ou Leis),António de Almeida Silva Campos de Melo,0000-00-00,False,
285686,Pinheiro,,Abílio Ribeiro de Almeida Campos de Melo,Abílio Ribeiro de Almeida,cut,Abílio Ribeiro de Almeida,Cursos jurídicos (Cânones ou Leis),António de Almeida Silva Campos de Melo,0000-00-00,False,
286149,Amoreira da Gandra,,Adelino Pinto Tavares Ferrão de Mendonça,Ferrão,cut,Adelino Pinto Tavares Ferrão,,,0000-00-00,False,
226700,Marvão,,Adolfo Augusto Rôlo,Adolfo António Rôlo,rep,Adolfo António Rôlo,Medicina,,1871-06-06,False,
226683,Marvão,,Adolfo António Rôlo,Adolfo Augusto Zuzarte Rôlo,rep,Adolfo Augusto Zuzarte Rôlo,,,0000-00-00,False,
273326,Lisboa,,Adriano Ernesto de Castilho Barreto,Castilho,cut,Adriano Ernesto de Castilho,,,0000-00-00,False,



Analysis:
1. 220890	Portalegre	"Pedro Rodrigues, vide; Abreu" links with 140806 __problem in vide expression__
2. 271719	Abreiro/Mirandela	Abel de Mendonça Machado de Araújo	Abel de Mendonça links with 286147 __no back vide expression__
3. 182548	Eiró	Abel Xavier Teixeira de Magalhães	José Joaquim Xavier Teixeira de Magalhães links with 182950  __no back vide expression__
4. 285686	Oliveira de Frades	Abílio Ribeiro de Almeida Campos de Melo	Abílio Ribeiro de Almeida links with 142075 __no back vide expression__
5. 286149	Amoreira da Gandra	Adelino Pinto Tavares Ferrão de Mendonça	Ferrão links with 248088 __no back vide expression__ and __typo in geo name__
6. 273326	Lisboa	Adriano Ernesto de Castilho Barreto	Castilho links with 189993 __no back vide expression__
7. 230176	Arcos	Tomás Joaquim Lopes de Mariz e Silva	Adriano Joaquim Lopes Mariz e Silva Monteiro links with 250994 __variation in the vide name (Maris/Mariz)__
8. 282429	NaN	Adriano Osório Pereira Gouveia	Adriano Osório Pereira Cerenato	rep	Adriano Osório Pereira Cerenato	links with 291196 __no back vide expression__
9. 296930	Almarge	Adriano Sisnando Brotero de Avelar Quintino	Adriano Sisnando Brotero Quintino de Avelar	rep	Adriano Sisnando Brotero Quintino de Avelar links with 133134 __no back vide expression__
10. 225520	Lisboa	Adrião Pereira	Gomes	add	Adrião Pereira Gomes	Cânones, links with 178240 __no back vide expression__
11. 147465	Trancoso	Afonso Tavares de Araújo	Afonso de Araújo Tavares	rep	Afonso de Araújo Tavares links with 197047 __no back vide expression__
12. 169888	Lisboa	Afonso Furtado	Mendonça	add	Afonso Furtado Mendonça links with 214147 (see) or 169890 __ambiguity__
13. 251547	Baía	Afonso Luís	da Fonseca	add	Afonso Luís da Fonseca	links with 139362 __no back vide expression__
14. 225529	Monção	Afonso Pereira	Pimenta	add	Afonso Pereira Pimenta	 links with 241162 __no back vide expression__
15. 129050	Elvas	Afonso Rodrigues Caldas	Rodrigues	cut	Afonso Rodrigues __no link found__
16. 221241	Elvas	Afonso Sardinha	Afonso Vaz Sardinha	rep	Afonso Vaz Sardinha	Cânones see link missing	__no link found__
17. 235544	Elvas	Afonso Soares da Mota	Afonso Soares de Lemos	rep	Afonso Soares de Lemos	link 211794 	__no back vide expression__ 
18. 225535	Aldeia Nova do Cabo	Afonso de Sá Pereira	Sá	cut	Afonso de Sá links with 211378 __no back vide expression__ 
19. 199294	Vila Real	Afonso Teixeira	   Mendonça e Azevedo	add	Afonso Teixeira Mendonça e Azevedo	Cânones	 links with 148819/See  214149/see __ambiguity__
20. 316331	Quinta do Alqueidão	Agostinho António de Sousa Brito Resende	Soutomaior	add	Agostinho António de Sousa Brito Resende Souto...
	links with 224178 __no back vide expression__ and __no match on geo name (Alqueidão vs. Quinta do Alqueidão)__
21. 234238	Lisboa	Agostinho Armando de Vasconcelos e Sousa	Agostinho Armando Vasconcelos	rep	Agostinho Armando Vasconcelos	
        links with 148028 __lookup fails: it does not match the name of the linked record, although both records have the same lookup__
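
The most frequent diagnosis in the list above is "no back vide expression": a record points to another one that does not point back. A minimal sketch of that check follows, using toy records with made-up ids ("A" to "D") and simplified names; the dictionary layout and the function name `find_links` are assumptions for illustration, not the structures used in this notebook.

```python
# Toy records: "lookup" is the name generated by the vide expression, or None.
records = {
    "A": {"name": "Afonso de Sá Pereira", "lookup": "Afonso de Sá"},
    "B": {"name": "Afonso de Sá", "lookup": None},  # no vide expression
    "C": {"name": "Jorge Vaz", "lookup": "Jorge Vaz Barros"},
    "D": {"name": "Jorge Vaz Barros", "lookup": "Jorge Vaz"},
}


def find_links(recs):
    """Split name links into symmetric ones and one-way ones (no back vide)."""
    by_name = {}
    for rid, rec in recs.items():
        by_name.setdefault(rec["name"], []).append(rid)

    symmetric, one_way = [], []
    for rid, rec in recs.items():
        if not rec["lookup"]:
            continue
        for target in by_name.get(rec["lookup"], []):
            if target == rid:
                continue
            if recs[target]["lookup"] == rec["name"]:
                symmetric.append((rid, target))
            else:
                one_way.append((rid, target))
    return symmetric, one_way


symmetric, one_way = find_links(records)
print("symmetric:", symmetric)
print("no back vide:", one_way)
```

Listing the one-way links separately gives a candidate set of records where a back reference could be added by hand.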

### **[EN]** Aka Records non matched

There is an imbalance between the number of "see" and "aka" records, so a high number of unmatched "see" records is expected.

Aka records should be more easily matched with the corresponding "see" record. That is indeed the case, with around 62% of aka records matched (`aka_matched` in the `match_info` table above).

Let's see the reason why Aka records do not find a matching "see".



In [583]:
aka_see_not_matched_index = vide_non_matched[vide_non_matched['uc-entrada']!='0000-00-00'].index.unique()
print("Number of aka records not matched:", len(aka_see_not_matched_index))
print("Partial list, change head parameter for more:")

vide_non_matched.loc[aka_see_not_matched_index].head(20)

Number of aka records not matched: 1053
Partial list, change head parameter for more:


Unnamed: 0,nome-geografico,match,name,nome-vide,vide_type,lookup,faculdade,nome-pai,uc-entrada,match_error,match_obs
226700,Marvão,,Adolfo Augusto Rôlo,Adolfo António Rôlo,rep,Adolfo António Rôlo,Medicina,,1871-06-06,False,
250994,Arcos,,Adriano Joaquim de Mariz e Silva Monteiro,Tomás Joaquim Lopes de Maris e Silva,rep+,Tomás Joaquim Lopes de Maris e Silva,Cursos jurídicos (Cânones ou Leis),,1794-10-14,False,
250994,Aveiro,,Adriano Joaquim de Mariz e Silva Monteiro,Tomás Joaquim Lopes de Maris e Silva,rep+,Tomás Joaquim Lopes de Maris e Silva,Cursos jurídicos (Cânones ou Leis),,1794-10-14,False,
180061,Salgueiro,,João António Osório Pereira Gouveia,Adriano Osório Pereira Guerra,rep+,Adriano Osório Pereira Guerra,Cursos jurídicos (Cânones ou Leis),,1800-10-31,False,
180742,Salgueiro,,Adriano Osório Pereira Guerra,João António Pereira Cerenato,rep+,João António Pereira Cerenato,Leis,,1799-10-07,False,
129050,Elvas,,Afonso Rodrigues Caldas,Rodrigues,cut,Afonso Rodrigues,Leis,,1657-11-02,False,
221241,Elvas,,Afonso Sardinha,Afonso Vaz Sardinha,rep,Afonso Vaz Sardinha,Cânones,Gonçalo Rodrigues,1706-10-01,False,
199294,Vila Real,,Afonso Teixeira,Mendonça e Azevedo,add,Afonso Teixeira Mendonça e Azevedo,Cânones,,1650-11-08,False,
199294,Vila Real,,Afonso Teixeira,Mendonça e Azevedo,add,Afonso Teixeira Mendonça e Azevedo,Leis,,1650-11-08,False,
187458,Santa Olaia,,Agostinho Brandão,Pinto,add,Agostinho Brandão Pinto,Cursos jurídicos (Cânones ou Leis),,1688-01-21,False,


##### Analysis

1. 226700 Marvão Adolfo Augusto Rôlo, vide Adolfo António Rôlo matches 226683 Adolfo António Rôlo, vide Adolfo Augusto Zuzarte Rôlo __back vide does not match__
2. 250994 Arcos	Tomás Joaquim Lopes de Mariz e Silva vide Adriano Joaquim Lopes Mariz e Silva Monteiro links with  230176 __variation in the vide name (Maris/Mariz)__
3. 180061	Salgueiro	_João António Osório Pereira Gouveia_, vide Adriano Osório Pereira Guerra	rep+	Adriano Osório Pereira Guerra	Direito (Cânones ou Leis) 1800-10-31	__variation in the vide name__
  * 180742	Salgueiro   Adriano Osório Pereira Guerra, vide _João António Pereira Cerenato_	Leis 1799-10-07
  * Other possible matches 291196, 191903 complex case
4. 129050 Elvas Afonso Rodrigues Caldas vide Rodrigues cut Afonso Rodrigues Leis NaN 1657-11-02 __see record not found manually__
5. 221241 Elvas Afonso Sardinha vide Afonso Vaz Sardinha rep Afonso Vaz Sardinha Cânones Gonçalo Rodrigues 1706-10-01 __see record not found manually__
6. 199294	Vila Real	Afonso Teixeira	vide Mendonça e Azevedo	add	Afonso Teixeira Mendonça e Azevedo	Cânones	1650-11-08
  * links with see record 214149 Afonso Teixeira de Mendonça, vide Teixeira  __vide in aka record does not match name in see record__
  * links also with 148819 Afonso Teixeira de Azevedo, vide Teixeira __vide in aka record does not match name in see record__
  * so the vide expression in 199294 should be __vide Mendonça e vide Azevedo__ to link with both Afonso Teixeira de Mendonça and Afonso Teixeira de Azevedo
7. 187458	Santa Olaia	Agostinho Brandão	vide Pinto	add	Agostinho Brandão Pinto	Direito (Cânones ou Leis)	NaN	1688-01-21	__matching record is aka, not see__
  * links with 245344, which is neither a see record nor a vide record. __187458 and 245344 are duplicates__ __matching record is aka, not see__
8.  152599	Lisboa	Agostinho José de Carvalho vide	Agostinho José de Figueiredo Carvalho e Oliveira	Leis	1791-10-27
   *  links with 174123 Agostinho José de Figueiredo Carvalho e Oliveira, but it is not a see record. __152599 and 174213 are duplicates__ __matching record is aka, not see__
9. 149805	Lisboa Aires Correia Baharém vide	Correia	cut	Aires Correia	Teologia	pai Manuel Correia de Menezes	1594-10-18 
  * links with see record 196492 slight variation in the vide expression __variation in the vide name (Baharém/Baharem)__


10. 192844	Ovfmatsen	Alberto Chremer	vide Cremert	add	Alberto Chremer Cremert	Cânones	
   *  links with 207263 Alberto Cremert no vide expression __matching record is aka, not see__ __duplicate__
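
Several of the cases above fail only because of small spelling differences (Maris/Mariz, Baharém/Baharem). The following is a minimal sketch of how such near misses could be flagged for manual review. It is not the matching procedure used in this notebook: `strip_accents` and `similarity` are hypothetical helpers, and any cut-off applied to the score would be an assumption to tune against the data.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(text):
    """Remove combining marks so that 'Baharém' and 'Baharem' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring accents and case."""
    a_norm = strip_accents(a).lower()
    b_norm = strip_accents(b).lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio()


# Differs only by an accent: identical after normalization.
accent_only = similarity("Aires Correia Baharém", "Aires Correia Baharem")

# Differs by one letter (Maris/Mariz): high ratio, but below 1.
spelling = similarity(
    "Tomás Joaquim Lopes de Maris e Silva",
    "Tomás Joaquim Lopes de Mariz e Silva",
)

# Unrelated names, for contrast: noticeably lower ratio.
different = similarity("Adolfo António Rôlo", "Afonso Vaz Sardinha")

print(accent_only, spelling, different)
```

A score close to 1 suggests a spelling variant worth checking by hand; it does not by itself prove that two records refer to the same person.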


### **[EN]** Matched records

Successive lines are matches. A record may appear on more than one line when it has more than one geographic name or faculty.
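
Matched records end up on adjacent lines because both members of a pair share the same `sort_key`. A minimal sketch of the idea, assuming the key joins the two normalized names in alphabetical order. This rule is inferred from the `sort_key` values displayed elsewhere in this notebook, and `pair_key` is a hypothetical helper, not the notebook's own function.

```python
def pair_key(name, lookup):
    """Order-independent key: both directions of a cross reference share it."""
    return "-".join(sorted([name, lookup]))


# (id, normalized name, normalized lookup name); values copied from this notebook's output.
records = [
    (133510, "Estevão Cardoso", "Estevão Cardoso Silveira"),
    (152876, "Estevão Dias", "Estevão Dias Pereira"),
    (230444, "Estevão Cardoso Silveira", "Estevão Cardoso"),
    (233458, "Estevão Dias Pereira", "Estevão Dias"),
]

# Sorting by the shared key (then by id) places each pair on successive lines.
ordered = sorted(records, key=lambda r: (pair_key(r[1], r[2]), r[0]))
ordered_ids = [r[0] for r in ordered]
print(ordered_ids)
```

Because the key ignores which record holds the short form and which holds the long form, the "see" record and the "aka" record of a pair sort together.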

In [588]:
vide_plus.loc[matched_index].sort_values(['nome-geografico','sort_key','uc-entrada'])[['uc-entrada','nome-geografico','name','lookup','nome-pai','faculdade','faculdade.obs','match_obs']].head(30)

Unnamed: 0,uc-entrada,nome-geografico,name,lookup,nome-pai,faculdade,faculdade.obs,match_obs
202622,0000-00-00,Constância,Fernão de Álvares Temudo,Fernão de Álvares,,,,
144388,1573-11-13,Constância,Fernão de Álvares,Fernão de Álvares Temudo,Pantaleão Rosado,Cânones,Cânones,
171438,0000-00-00,Constância,João da Veiga Mendes Nogueira,João da Veiga,,Leis,Leis,
213495,1757-10-01,Constância,João da Veiga,João da Veiga Mendes Nogueira,,Leis,Leis,
214577,0000-00-00,Constância,Julião Velho,Julião Velho Almeida,,,,
143676,1663-07-10,Constância,Julião Velho de Almeida,Julião Velho,,Cânones,Cânones,
203159,0000-00-00,Constância,Manuel da Costa,Manuel da Costa Oliveira,Manuel da Costa,,,
176277,1672-01-24,Constância,Manuel da Costa de Oliveira,Manuel da Costa,Manuel da Costa,Cânones,Cânones,
243351,0000-00-00,Constância,Manuel Ribeiro Pinhão,Manuel Ribeiro,Pedro Ribeiro,Cânones,Cânones,
165844,1623-10-09,Constância,Manuel Ribeiro,Manuel Ribeiro Pinhão,Pedro Ribeiro,Cânones,. Cânones,


# **[EN]** Save current stats on cross reference processing

Saving the stats to a file tracked by git makes it possible to see later how the situation evolves.

In [57]:
# save status to file
fname = '030-remissivas_info.txt'

with open(fname,'w+') as f:
    print(f"Cross references, current stats: {current_time}",file=f)
    print(file=f)

    vide_plus.info(buf=f)

    print(match_info.fillna(""), file=f)

NameError: name 'nmatched_ok' is not defined

### **[EN]** Focus on specific records

Define a column and a pattern to search for. Pattern is a _regular expression_.
For more information on the patterns and alternative searches see https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html

Examples:

* column='name', pattern='André': 'André' anywhere in column 'name' (will also get 166395 Manuel André Ribeiro)
* column='name', pattern='André$': names ending in 'André' (e.g. 146664 Manuel André)
* column='name', pattern='^André': names starting with 'André'
* column='name', pattern='André|Joaquim': names containing either 'André' or 'Joaquim'
* column='nome_geografico', pattern='Alcácer|Alcacer': geographic name contains either 'Alcácer' or 'Alcacer'
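
The patterns above can be tried on a small table first. A minimal sketch, assuming a toy dataframe with a single `name` column: the ids and names are illustrative, not records from the database, and `select` is a hypothetical helper that wraps the same `str.contains` call used in the next cell.

```python
import pandas as pd

# Toy data for illustration only; one row has a missing name.
people = pd.DataFrame(
    {"name": ["André Lopes", "Manuel André Ribeiro", "Manuel André", "Joaquim Dias", None]},
    index=[1, 2, 3, 4, 5],
)


def select(df, column, pattern):
    """Rows of df whose column matches the regular expression pattern."""
    # na=False: rows with a missing value are treated as non-matches.
    return df[df[column].str.contains(pattern, na=False)]


anywhere = select(people, "name", "André")          # 'André' anywhere in the name
ending = select(people, "name", "André$")           # names ending in 'André'
starting = select(people, "name", "^André")         # names starting with 'André'
either = select(people, "name", "André|Joaquim")    # either 'André' or 'Joaquim'

print(list(anywhere.index), list(ending.index), list(starting.index), list(either.index))
```

The row with the missing name never appears in a selection, which is the effect of `na=False`.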

In [209]:
import pandas as pd
column = 'name'
pattern = '^Estevão'
pd.set_option('display.max_rows',1000)
# na=False: rows where the column value is missing count as non-matches instead of causing an error
vide_selection = vide_plus[vide_plus[column].str.contains(pattern,na=False)]
vide_selection.sort_values([column])

Unnamed: 0_level_0,name,sex,nome_vide,nome_vide_date,nome_vide_obs,nome_geografico,nome_geografico_date,nome_geografico_obs,faculdade,faculdade_date,faculdade_obs,nome_pai,nome_pai_date,nome_pai_obs,uc_entrada,uc_entrada_date,uc_entrada_obs,loookup,vide_type,lookup,name_sp,lookup_sp,sort_key,match_error,match_seq,match_problem,match_rand
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1
200659,Estevão Afonso da Costa,m,Afonso,1664-10-15,,Bragança,1664-10-15,,Cânones,1664-10-15,Cânones,,,,1664-10-15,1664-10-15,,,cut,Estevão Afonso,Estevão Afonso Costa,Estevão Afonso,Estevão Afonso-Estevão Afonso Costa,,,,
276913,Estevão Anacleto Duarte,m,Estevão Anacleto,0000-00-00,,Vila Viçosa,0000-00-00,,Leis,0000-00-00,Leis,António Duarte,0000-00-00,,0000-00-00,0000-00-00,,,cut,Estevão Anacleto,Estevão Anacleto Duarte,Estevão Anacleto,Estevão Anacleto-Estevão Anacleto Duarte,,,,
236147,Estevão Barreto de Magalhães e Menezes,m,Estevão de Magalhães e Menezes,0000-00-00,,Braga,0000-00-00,,,,,,,,0000-00-00,0000-00-00,,,rep,Estevão de Magalhães e Menezes,Estevão Barreto Magalhães Menezes,Estevão Magalhães Menezes,Estevão Barreto Magalhães Menezes-Estevão Magalhães Menezes,,236150.0,,236150.0
129044,Estevão Caetano,m,de Araújo Rangel,1724-10-01,,Porto,1724-10-01,,Direito (Cânones ou Leis),1724-10-01,Faculdade inferida,,,,1724-10-01,1724-10-01,,,add,Estevão Caetano de Araújo Rangel,Estevão Caetano,Estevão Caetano Araújo Rangel,Estevão Caetano-Estevão Caetano Araújo Rangel,,,,
134106,Estevão Caetano de Araújo Rangel,m,Caetanao,0000-00-00,,Porto,0000-00-00,,,,,,,,0000-00-00,0000-00-00,,,add,Estevão Caetano de Araújo Rangel Caetanao,Estevão Caetano Araújo Rangel,Estevão Caetano Araújo Rangel Caetanao,Estevão Caetano Araújo Rangel-Estevão Caetano Araújo Rangel Caetanao,,,,
133510,Estevão Cardoso,m,da Silveira,1615-10-02,,Vila Viçosa,1615-10-02,,Leis,1615-10-02,Leis,,,,1615-10-02,1615-10-02,,,add,Estevão Cardoso da Silveira,Estevão Cardoso,Estevão Cardoso Silveira,Estevão Cardoso-Estevão Cardoso Silveira,,230444.0,,230444.0
230444,Estevão Cardoso da Silveira,m,Cardoso,0000-00-00,,Vila Viçosa,0000-00-00,,,,,,,,0000-00-00,0000-00-00,,,cut,Estevão Cardoso,Estevão Cardoso Silveira,Estevão Cardoso,Estevão Cardoso-Estevão Cardoso Silveira,,133510.0,,133510.0
152876,Estevão Dias,m,Pereira,0000-00-00,,Cascais,0000-00-00,,,,,,,,0000-00-00,0000-00-00,,,add,Estevão Dias Pereira,Estevão Dias,Estevão Dias Pereira,Estevão Dias-Estevão Dias Pereira,,233458.0,,233458.0
233458,Estevão Dias Pereira,m,Dias,1619-10-24,,Cascais,1619-10-24,,Cânones,1619-10-24,Cânones,Álvaro Pereira,1619-10-24,,1619-10-24,1619-10-24,,,cut,Estevão Dias,Estevão Dias Pereira,Estevão Dias,Estevão Dias-Estevão Dias Pereira,,152876.0,W01-Match from aka to see,152876.0
293823,Estevão Falcão Cota,m,Menezes,0000-00-00,,,,,,,,,,,0000-00-00,0000-00-00,,,add,Estevão Falcão Cota Menezes,Estevão Falcão Cota,Estevão Falcão Cota Menezes,Estevão Falcão Cota-Estevão Falcão Cota Menezes,,,,


## **[PT]** Listas ordenadas
## **[EN]** Sorted lists

In [211]:
vide_selection[['nome_geografico','match_seq','match_rand','match_problem','name','nome_vide','faculdade','nome_vide_date']].sort_values(['nome_geografico','name','nome_vide_date']).head(20)

Unnamed: 0_level_0,nome_geografico,match_seq,match_rand,match_problem,name,nome_vide,faculdade,nome_vide_date
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
173368,Abrantes,,,,Estevão Lopes Galvão,Lopes,,0000-00-00
149184,Arrifana de Sousa,,,,Estevão de Freitas e Azevedo,Freitas,,0000-00-00
206520,Beco,,,,Estevão Mendes,Vasconcelos,Cânones,0000-00-00
233530,Beja,,,,Estevão Lopes Pereira,Lopes,,0000-00-00
236147,Braga,236150.0,236150.0,,Estevão Barreto de Magalhães e Menezes,Estevão de Magalhães e Menezes,,0000-00-00
236150,Braga,236147.0,236147.0,W01-Match from aka to see,Estevão de Magalhães e Menezes,Estevão Barreto de Magalhães e Menezes,Cânones,1738-10-01
200659,Bragança,,,,Estevão Afonso da Costa,Afonso,Cânones,1664-10-15
309989,Brasil,,,,Estevão Mauricio de Velasco e Tavora,Estevão Mauricio de Velasco Molina,Cânones,0000-00-00
286832,Brasil,,,,Estevão Maurício de Velasco Molina,Estevão Maurício de Velasco e Távora,Cânones,1761-11-05
198947,Brasil,,,,Estevão Maurício de Velasco e Távora,Velasco,,0000-00-00
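
In the table above, 236147 and 236150 point at each other in `match_seq`, which is the expected reciprocal case. A minimal sketch of how one-directional matches could be listed, assuming the matches are available as a plain mapping from record id to matched id. The mapping below is built by hand from ids mentioned in this notebook, and `asymmetric` is a hypothetical helper.

```python
def asymmetric(matches):
    """Ids whose matched record does not point back to them."""
    return sorted(i for i, m in matches.items() if matches.get(m) != i)


# Hand-built mapping: record id -> id of the record it was matched to.
matches = {
    236147: 236150,
    236150: 236147,   # reciprocal pair
    133510: 230444,
    230444: 133510,   # reciprocal pair
    226700: 226683,   # 226683 has no match back to 226700
}

one_way = asymmetric(matches)
print(one_way)
```

Ids returned by the check are candidates for the kind of manual analysis done earlier in this notebook, for example a "back vide" that does not match.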


### Examine individual records

In [106]:
from timelinknb import pperson,Session
pd.set_option('display.max_rows',250)

with Session() as session:
    session.begin()
    pperson(219458)


n$Estevão José dos Santos/m/id=219458/obs="""

            Id: 219458
            Código de referência: PT/AUC/ELU/UC-AUC/B/001-001/S/003081

            Nome        : Estevão José dos Santos, vide Estevão José
            Data inicial: 0000-00-00
            Data final  : 0000-00-00
            Filiação:
            Naturalidade: Lisboa
            Faculdade:

            Matrícula(s):

            Instituta:
        """
  atr$código-de-referência/"PT/AUC/ELU/UC-AUC/B/001-001/S/003081"/2020-12-30
  atr$data-do-registo/2020-12-30/2020-12-30
  atr$url/"https://pesquisa.auc.uc.pt/details?id=219458"/2020-12-30
  ls$uc-entrada/0000-00-00/0000-00-00
  ls$uc-saida/0000-00-00/0000-00-00
  ls$nome-vide/Estevão José/0000-00-00
  ls$nome/Estevão José/0000-00-00/obs=Estevão José dos Santos, vide Estevão José
  ls$nome/Estevão José dos Santos/0000-00-00
  ls$nome-primeiro/Estevão/0000-00-00
  ls$nome-apelido/José dos Santos/0000-00-00
  ls$nome-apelido/Santos/0000-00-00
  ls$naturalidade/Lisboa/00

In [120]:
from timelinknb import Session
from timelinknb.pandas import group_attributes

pd.set_option('display.max_rows',250)

with Session() as session:
    session.begin()
    ga = group_attributes(['215193','182145'],person_info=False,exclude_attributes=['pobs'])

ga.sort_values(['date','type','value'], inplace=True)
ga[['date','type','value','attr_obs']]


Unnamed: 0_level_0,date,type,value,attr_obs
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
182145,1604-10-11,faculdade,Cânones,Faculdade corrigida
182145,1604-10-11,faculdade,Leis,Faculdade corrigida
215193,1604-10-11,faculdade,Leis,Leis
182145,1604-10-11,faculdade.ano,Cânones.1604,Faculdade corrigida
182145,1604-10-11,faculdade.ano,Leis.1604,Faculdade corrigida
215193,1604-10-11,faculdade.ano,Leis.1604,Leis
182145,1604-10-11,instituta,1604-10-11,11.10.1604 1604-10-11
215193,1604-10-11,instituta,1604-10-11,1604.10.11 1604-10-11
182145,1604-10-11,instituta.ano,1604,11.10.1604 1604-10-11
215193,1604-10-11,instituta.ano,1604,1604.10.11 1604-10-11
