
**[EN]** English only
# Cross references

The digital version of the FA carries over a cross-reference mechanism from the card catalog. As was usual at the time, extra cards were inserted into the catalog to guide users searching for name variants of the main entry. These cards have a “base” name, followed the word “vide” and a “name expression”. 

This notebook analyses the cross-reference information in the FA.




To run this file: follow instructions in the README.md file in this directory.



#### This is a very long file. Use the outline view in the left pane to quickly jump to sections

##  Setup

In [4]:
from timelinknb import current_time,current_machine, get_mhk_db
import ucalumni.config as alumniconf

db_name = alumniconf.mhk_db_name
db = get_mhk_db(db_name, connect_args={'connect_timeout': 3600})
print(current_machine,current_time,f'db={db_name}')


imac-jrc.local 2022-05-11 19:09:38.661894 db=ucalumni


Prepare a dataframe to collect the results of cross reference analysis


In [5]:
import pandas as pd

columns = ['data','sequential','random']
vars = ['vide','vide_plus',
        'see','see_matched','see_matched_ok','nodate_novide',
        'aka','nodate','nodate_novide','aka_matched','aka_matched_ok',
        'records_matched','records_matched_ok','records_error',
        'matched_pairs','matched_pairs_ok','records_see_aka','records_aka_see','records_aka_aka', 'records_see_see',
        'records_transitive','records_asymmetric']

match_info = pd.DataFrame(index=vars,columns=columns)
match_records = dict([(k,dict.fromkeys(columns)) for k in vars])
match_info.sort_index(inplace=True)
match_info

Unnamed: 0,data,sequential,random
aka,,,
aka_matched,,,
aka_matched_ok,,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,,,
nodate_novide,,,
nodate_novide,,,
records_aka_aka,,,
records_aka_see,,,


## Get records which contain a "see" note (vide)

Note that records with more than one faculty and/or more than one geographic name 
generate more than one line. So the number of lines in the data frame is greater
than the number of records.

**To obtain the real number of records in a data frame it is necessary to count the number of unique record identifiers (six-digit numbers) in the data frame index.**

> nvide = len(vide.index.unique())

In [6]:
from timelinknb.pandas import attribute_to_df


# Get list of people with attribute nome-vide and add nome-geografico, nome-pai, entry date and faculdade
vide = attribute_to_df(
                    the_type='nome-vide',
                    person_info=True,
                    more_cols=['nome-geografico','faculdade','nome-pai','uc-entrada','uc-saida'],
                    sql_echo=False)
# drop columns that are not usefull
vide.drop(['nome-vide.date','nome-vide.obs','nome-geografico.date','nome-geografico.obs','nome-pai.date','nome-pai.obs','uc-entrada.date','uc-entrada.obs'],axis=1, inplace=True)
nvide = len(vide.index.unique())
print(current_machine,current_time,f'db={db_name}')
print("Number of records with 'vide' cross reference:'",nvide)
match_info.loc['vide','data'] = nvide
match_records['vide']['data'] = vide.index.unique()
print()
print(vide.info())



imac-jrc.local 2022-05-11 19:09:38.661894 db=ucalumni
Number of records with 'vide' cross reference:' 8625

<class 'pandas.core.frame.DataFrame'>
Index: 9286 entries, 127765 to 358077
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9286 non-null   object
 1   sex              9286 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8784 non-null   object
 4   faculdade        4793 non-null   object
 5   faculdade.date   4793 non-null   object
 6   faculdade.obs    4775 non-null   object
 7   nome-pai         3484 non-null   object
 8   uc-entrada       9286 non-null   object
 9   uc-saida         9286 non-null   object
 10  uc-saida.date    9286 non-null   object
 11  uc-saida.obs     0 non-null      object
dtypes: object(12)
memory usage: 1.2+ MB
None


In [7]:

print()
print("Check a few:")
vide.head(5)


Check a few:


Unnamed: 0_level_0,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,


#### Problems in processing 'vide' notes with multiple "vide"

There are a few cases in the form  "vide _...name..._ e vide _...name..._"

1. https://pesquisa.auc.uc.pt/details?id=141274
2. https://pesquisa.auc.uc.pt/details?id=147377
3. https://pesquisa.auc.uc.pt/details?id=147659
4. https://pesquisa.auc.uc.pt/details?id=150350
5. https://pesquisa.auc.uc.pt/details?id=150562
6. https://pesquisa.auc.uc.pt/details?id=152472
7. https://pesquisa.auc.uc.pt/details?id=189389
8. https://pesquisa.auc.uc.pt/details?id=190076
9. https://pesquisa.auc.uc.pt/details?id=191599
10. https://pesquisa.auc.uc.pt/details?id=192039
11. https://pesquisa.auc.uc.pt/details?id=196728
12. https://pesquisa.auc.uc.pt/details?id=197167
13. https://pesquisa.auc.uc.pt/details?id=207991
14. https://pesquisa.auc.uc.pt/details?id=209208
15. https://pesquisa.auc.uc.pt/details?id=216619
16. https://pesquisa.auc.uc.pt/details?id=244099
17. https://pesquisa.auc.uc.pt/details?id=248624
18. https://pesquisa.auc.uc.pt/details?id=266150

19. https://pesquisa.auc.uc.pt/details?id=130281 
      * Nuno da Câmara	is tricky, because it combines note and vide, and the vide part has two names Nuno da Câmara (D.), 
        vide Nuno Casimiro da Câmara e Nuno José da Câmara it links with  130516 and 130517
        Handling these requires changing the grammar rules, scheduled for next version.

### Determine the type of cross reference

__Forward cross references (“see”)__
* Almost empty records with a name with “vide”
* A few with more than one (…vide... e vide…)
* No dates (empty “UnitDateInitial” field)
* Other than the name:
    * 93% place of birth
    * 27% father’s name
    * 23% faculty 
  
__Back cross references (“also knows as/aka”)__
* Normal records with “vide” in the name.
* Dates (valid “UnitDateInitial” field)
* Contain all types of information:
    * 97% place of birth
    * 53% father’s name
    * 99% faculty
    * degrees, enrolment, and so on.
* Can be matched with “see” records.
* These records are the non preferred form of the name and should link to a preferred form.



#### "See"  or forward cross-references: "vide" and no dates

These records are the non preferred form of the name and should link to a preferred form.

In [9]:

zdate_filter = vide['uc-entrada'] == '0000-00-00'
vide.loc[zdate_filter,'rec_type'] = 'see'

see_vide = vide[zdate_filter]
nsee_vide = len(see_vide.index.unique())
match_info.loc['see','data'] = nsee_vide
match_records['see']['data'] = list(see_vide.index.unique())
print("Number of vide records with zero dates (forward cross references):",nsee_vide)

nsee_vide_geo = len(see_vide[see_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['see_geo','data'] = nsee_vide_geo
print(f"    of which {nsee_vide_geo} with place of birth {nsee_vide_geo/nsee_vide:.2%}")

nsee_vide_pai = len(see_vide[see_vide['nome-pai'].notnull()].index.unique())
match_info.loc['see_pai','data'] = nsee_vide_pai
print(f"    of which {nsee_vide_pai} with father's name  {nsee_vide_pai/nsee_vide:.2%}")

nsee_vide_fac = len(see_vide[see_vide['faculdade'].notnull()].index.unique())
match_info.loc['see_fac','data'] = nsee_vide_fac
print(f"    of which {nsee_vide_fac} with faculty        {nsee_vide_fac/nsee_vide:.2%}")

print()
base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']


Number of vide records with zero dates (forward cross references): 5563
    of which 5153 with place of birth 92.63%
    of which 1512 with father's name  27.18%
    of which 1305 with faculty        23.46%



In [10]:
match_info.sort_index(inplace=True)
match_info.fillna(" ")

Unnamed: 0,data,sequential,random
aka,,,
aka_matched,,,
aka_matched_ok,,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,,,
nodate_novide,,,
nodate_novide,,,
records_aka_aka,,,
records_aka_see,,,


In [11]:
# Show some
see_vide.head()

Unnamed: 0_level_0,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128053,António da Fonseca Cabral,m,Fonseca,Samodães,Cânones,0000-00-00,Cânones,Sebastião da Fonseca Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128061,António de Matos Cabral,m,Matos,Alhos Vedros,Cânones,0000-00-00,Cânones,Tomé de Matos Cabral,0000-00-00,0000-00-00,0000-00-00,,see


#### "Aka"  or back references: records with "vide" other types of information

These are the records that should be linked back to zero date vide records.

There are too few of them!

In [12]:
# count vide record with a proper (non-zero) date
aka_filter = vide['uc-entrada'] != '0000-00-00'
vide.loc[aka_filter,'rec_type'] = 'aka'
aka_vide = vide[aka_filter]

naka_vide = len(set(aka_vide.index.values))
match_info.loc['aka','data'] = naka_vide
print("Number of records with vide and proper date (aka):",naka_vide)
match_records['aka']['data'] = list(aka_vide.index.unique())

naka_vide_geo = len(aka_vide[aka_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['aka_geo','data'] = naka_vide_geo
print(f"    of which {naka_vide_geo} with place of birth {naka_vide_geo/naka_vide:.2%}")

naka_vide_pai = len(aka_vide[aka_vide['nome-pai'].notnull()].index.unique())
match_info.loc['aka_pai','data'] = naka_vide_pai
print(f"    of which {naka_vide_pai} with father's name  {naka_vide_pai/naka_vide:.2%}")

naka_vide_fac = len(aka_vide[aka_vide['faculdade'].notnull()].index.unique())
match_info.loc['aka_fac','data'] = naka_vide_fac
print(f"    of which {naka_vide_fac} with faculty        {naka_vide_fac/naka_vide:.2%}")

print("Number of records with vide and zero date (see):",nsee_vide)
# we subtract 
print("Number of zero date records in excess of dated vide records         :", nsee_vide-naka_vide)
match_info.sort_index(inplace=True)

Number of records with vide and proper date (aka): 3062
    of which 2973 with place of birth 97.09%
    of which 1619 with father's name  52.87%
    of which 3035 with faculty        99.12%
Number of records with vide and zero date (see): 5563
Number of zero date records in excess of dated vide records         : 2501


In [16]:
match_info.fillna(" ")

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,,
aka_matched_ok,,,
aka_pai,1619.0,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,5763.0,,
nodate_novide,,,


In [17]:
# Show some
aka_vide.head()

Unnamed: 0_level_0,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,,aka
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,,aka
128142,Diogo de Morais Cabral,m,Morais,Mêda,Cânones,1681-10-28,Cânones,,1681-10-28,1689-10-01,1689-10-01,,aka
128155,Fernão Cabral,m,Albuquerque,Celorico da Beira,Cânones,1663-10-15,Cânones,,1663-10-15,1674-07-24,1674-07-24,,aka
128333,Inácio de Figueiredo Cabral,m,Albuquerque,Penalva,Cânones,1655-10-22,Cânones,,1655-10-22,1662-07-21,1662-07-21,,aka


### Look at other records with no dates, even if they have no "vide" expression

To test if all zero date records are part of the cross reference scheme.
Maybe the "vide" expression was missed during input in the database.

Frist collect all records th zero date.

In [15]:
from timelinknb.pandas import attribute_to_df
from timelinknb import Session

with Session() as session:
    session.begin()

# Get list of people with no start-date and add nome-geografico, nome-pai, nome-vide and faculdade
zero_date = attribute_to_df(
                    the_type='uc-entrada',
                    the_value='0000-00-00',
                    person_info=True,
                    more_cols=['nome-vide','nome-geografico','nome-pai','faculdade','uc-saida'],
                    sql_echo=False)
zero_date.drop(['nome-vide.date','nome-vide.obs','nome-geografico.date','nome-geografico.obs','nome-pai.date','nome-pai.obs','uc-entrada.date','uc-entrada.obs'],axis=1, inplace=True)                    
nzero_date = len(set(zero_date.index.unique()))
print()
print(current_machine,current_time,f'db={db_name}')
print("Total number of rows with zero date:", len(zero_date))
print("Total number of records with zero date:", nzero_date)
match_info.loc['nodate','data'] = nzero_date

base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']
zero_date[base_vide_cols].count(axis=0)



imac-jrc.local 2022-05-11 19:09:38.661894 db=ucalumni
Total number of rows with zero date: 6061
Total number of records with zero date: 5763


nome-vide          5843
nome-geografico    5605
faculdade          1521
nome-pai           1673
dtype: int64

#### List of records with no date and no "vide": are they part of the cross references?

These are zero date records with no vide information, 
which means that there are no name transformations 
to be used in searching of matching records.
But since they have no dates they might be part of 
the cross-reference set.

In late April 2022 there were around 200 records.


In [18]:

# From the zero date set filter those with no "vide" 
zd_no_vide = zero_date[zero_date['nome-vide'].isnull()]
nzd_no_vide = len(set(zd_no_vide.index.values))
print()
print("Number of records with zero date and no 'vide':",nzd_no_vide)




Number of records with zero date and no 'vide': 200


#### Check if the unit dates were left blank by mistake

If a record with no unit dates contains neverthless dated information, 
then it would be possible to register the unit dates from that information,
and the blank unit dates are an error.

First collect all the attributes available for those "zero date no vide" records.

In [20]:
from timelinknb.pandas import group_attributes

zdnv_group = group_attributes(set(zd_no_vide.index.values))

Next search for attributes with valid dates in that set.

In [21]:
zdnv_with_dates = (zdnv_group['date']>'0000-00-00') & (zdnv_group['date'] < '1917-12-31')
false_zd = zdnv_group[zdnv_with_dates]
nfalse_zd = len(set(false_zd.index.values))
print("Number of records with dates in attributes but not unit dates:",nfalse_zd)
print("These are not cross-reference records, just records with unfilled unit dates")
print("Sample:")
false_zd.head(10)[['name','date','type','value','attr_obs']]

Number of records with dates in attributes but not unit dates: 59
These are not cross-reference records, just records with unfilled unit dates
Sample:


Unnamed: 0_level_0,name,date,type,value,attr_obs
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
128851,Luís de Miranda Cabral,1695-02-23,grau,Bacharel em Artes,Incorporação de Bacharel em Artes: 23.02.1695
128851,Luís de Miranda Cabral,1695-02-23,grau.ano,Bacharel em Artes.1695,Incorporação de Bacharel em Artes: 23.02.1695
128851,Luís de Miranda Cabral,1695-05-18,grau,Licenciado em Artes,18.05.1695
128851,Luís de Miranda Cabral,1695-05-18,grau.ano,Licenciado em Artes.1695,18.05.1695
131475,Diogo Fialho,1656-03-30,grau,Licenciado em Artes,Licenciado em Artes 30.03.1656: Atos e Graus L...
131475,Diogo Fialho,1656-03-30,grau.ano,Licenciado em Artes.1656,Licenciado em Artes 30.03.1656: Atos e Graus L...
137651,Manuel Pais de Figueiredo,1685-07-30,grau,Formatura em Cânones,Formatura 30.07.1685
137651,Manuel Pais de Figueiredo,1685-07-30,grau.ano,Formatura em Cânones.1685,Formatura 30.07.1685
139433,Alexandre da Fonseca,1614-05-17,grau,Licenciado em Artes,Licenciado em Artes 17.05.1614
139433,Alexandre da Fonseca,1614-05-17,grau.ano,Licenciado em Artes.1614,Licenciado em Artes 17.05.1614


We remove those records from the possible cross reference aditions.

In late April 2022 there were 60 of such record. They are not cross references.

Removing those from the zero dated, no "vide" records, around 140 remain.


In [24]:
zd_no_vide_clean = zd_no_vide.drop(false_zd.index)
zd_no_vide_clean['rec_type'] = 'see'
nzd_no_vide_clean = len(zd_no_vide_clean.index.unique())
print()
print("Number of records with zero date and no 'vide' (cleaned):",nzd_no_vide_clean)
match_info.loc['nodate_novide','data'] = nzd_no_vide_clean
match_records['nodate_novide','data'] = zd_no_vide_clean.index.unique()
print("Information contained in these records:")
base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai','uc-entrada']
zd_no_vide_clean[base_vide_cols].count(axis=0)


Number of records with zero date and no 'vide' (cleaned): 141
Information contained in these records:


nome-vide            0
nome-geografico    132
faculdade           79
nome-pai            63
uc-entrada         152
dtype: int64

Lets see what they look like

In [25]:
zd_no_vide_clean[['name','nome-vide','nome-pai','nome-geografico','faculdade','faculdade.obs','uc-entrada','rec_type']].head().sort_values('name').fillna(" ")

Unnamed: 0_level_0,name,nome-vide,nome-pai,nome-geografico,faculdade,faculdade.obs,uc-entrada,rec_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
128114,Belchior de Sá Cabral,,,Alfândega,Cânones,Cânones,0000-00-00,see
129384,Damião Dias Caldeira,,,Estremoz,Leis,Leis,0000-00-00,see
128371,João Cabral,,António Teixeira,Torres Vedras,,,0000-00-00,see
130534,Manuel Domingues Ferreira,,,Ferreiros,,,0000-00-00,see
130359,Manuel Ferreira,,,Penalva,Artes,Faculdade inferida,0000-00-00,see


#### Add zero date records with no 'vide' to records to be matched

We join the zero date no 'vide' records to the vide records,.

We assume that zero date records are also "see also" records which were not flagged as 'vide' due to input variations.

But we know this is not always the case, some of the zero date records are normal records where the unit dates were not recorded for some reason.

In [26]:
import pandas as pd

vide_plus = pd.concat([vide,zd_no_vide_clean])
nvide_plus = len(vide_plus.index.unique())
match_info.loc['vide_plus','data'] = nvide_plus
match_records['vide_plus']['data'] = vide_plus.index.unique()
print(f"Number of unique records involved in the cross references: {nvide_plus}")
vide_plus.info()

Number of unique records involved in the cross references: 8766
<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 127765 to 316291
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
dtypes: object(13)
memory usage: 1.3+ MB


#### Update information on "see" type referecences

Taking into account the new zero date, no vide records (around 140)

In [27]:
see_vide = vide_plus[vide_plus['uc-entrada'] == '0000-00-00']

nsee_vide = len(see_vide.index.unique())
match_info.loc['see','data'] = nsee_vide
match_records['see']['data']=list(see_vide.index.unique())
print("Number of vide records with zero dates (forward cross references) updated:",nsee_vide)

nsee_vide_geo = len(see_vide[see_vide['nome-geografico'].notnull()].index.unique())
match_info.loc['see_geo','data'] = nsee_vide_geo
print(f"    of which {nsee_vide_geo} with place of birth {nsee_vide_geo/nsee_vide:.2%}")

nsee_vide_pai = len(see_vide[see_vide['nome-pai'].notnull()].index.unique())
match_info.loc['see_pai','data'] = nsee_vide_pai
print(f"    of which {nsee_vide_pai} with father's name  {nsee_vide_pai/nsee_vide:.2%}")

nsee_vide_fac = len(see_vide[see_vide['faculdade'].notnull()].index.unique())
match_info.loc['see_fac','data'] = nsee_vide_fac
print(f"    of which {nsee_vide_fac} with faculty        {nsee_vide_fac/nsee_vide:.2%}")
print()

base_vide_cols=['nome-vide','nome-geografico','faculdade','nome-pai']


Number of vide records with zero dates (forward cross references) updated: 5704
    of which 5274 with place of birth 92.46%
    of which 1569 with father's name  27.51%
    of which 1378 with faculty        24.16%



#### Closer look at "see" references


##### Presence of place of birth in zero date records

Most of them have place of birth information.


In [28]:
see_vide_with_geo = see_vide[see_vide['nome-geografico'].notnull()]
nsee_vide_with_geo = len(set(see_vide_with_geo.index.values))
print("See references with geo info (unique records):",
       nsee_vide_with_geo,
       "out of",nsee_vide,
       f'= {nsee_vide_with_geo/nzero_date:.2%}')
print("Other information (note that some records have more than one geographic name)")
see_vide_with_geo[base_vide_cols].count(axis=0)


See references with geo info (unique records): 5274 out of 5704 = 91.51%
Other information (note that some records have more than one geographic name)


nome-vide          5433
nome-geografico    5565
faculdade          1392
nome-pai           1639
dtype: int64

##### See references with no birth place

Check which information is available when place of birth is missing.

The values are similar to normal "see" records.


In [29]:
see_vide_nogeo = see_vide[see_vide['nome-geografico'].isnull()]
nsee_vide_nogeo = len(set(see_vide_nogeo.index.values))
print("Zero date records without geo info:",
       nsee_vide_nogeo,
       "out of",nzero_date,
       f'= {nsee_vide_nogeo/nzero_date:.2%}')
print()
print("Other information:")
see_vide_nogeo[base_vide_cols].count(axis=0)

Zero date records without geo info: 430 out of 5763 = 7.46%

Other information:


nome-vide          410
nome-geografico      0
faculdade           64
nome-pai            20
dtype: int64

### Final ist of records involved in cross references



In [30]:
import pandas as pd

pd.set_option('display.max_rows',50)
vide_plus.info()
vide_plus.head(10)

<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 127765 to 316291
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
dtypes: object(13)
memory usage: 1.3+ MB


Unnamed: 0_level_0,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,uc-saida.date,uc-saida.obs,rec_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
127765,André Vaz Cabaço,m,Vaz,Coimbra,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127798,António Joaquim do Cabo,m,e Faria,Belém,,,,,0000-00-00,0000-00-00,0000-00-00,,see
127819,Álvaro de Madureira Cabral,m,Madureira,Lamego,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128000,António Cabral,m,Castelo Branco,Celorico,Leis,1603-10-07,Leis,João Gil de Abreu,1603-10-07,1616-05-16,1616-05-16,,aka
128013,António Cabral,m,Camelo,Ranhados,Cânones,1642-10-29,Cânones,Lourenço Cabral,1642-10-29,1651-05-21,1651-05-21,,aka
128053,António da Fonseca Cabral,m,Fonseca,Samodães,Cânones,0000-00-00,Cânones,Sebastião da Fonseca Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128061,António de Matos Cabral,m,Matos,Alhos Vedros,Cânones,0000-00-00,Cânones,Tomé de Matos Cabral,0000-00-00,0000-00-00,0000-00-00,,see
128066,António de Mendonça Cabral,m,Mendonça,Pernambuco,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128093,António Pinto Cabral,m,Pinto,Grajal,,,,,0000-00-00,0000-00-00,0000-00-00,,see
128103,António Veloso Cabral,m,Veloso,Sanfins,,,,,0000-00-00,0000-00-00,0000-00-00,,see


In [31]:
match_info.sort_index(inplace=True)
match_info.fillna('')

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,,
aka_matched_ok,,,
aka_pai,1619.0,,
matched_pairs,,,
matched_pairs_ok,,,
nodate,5763.0,,
nodate_novide,141.0,,


## Analyse geographic names for variations

Birth place is key to maching, but there are many variations for the same place name
and many spelling varitants.

We make an index of variations in geographic names and 
find name matches within the set of names from similar spelled places.

For similarity we use the so called "gestalt pattern matching" from the Python library see: https://docs.python.org/3/library/difflib.html

The threshold ratio was determined empirically. It is not a problem that some false varitations are detected since 
a further check is done with matching the names.

The following code compares all the the geographic names and prints out those considered to be variants of the same place, with
the similarity ration. Note that this ratio is sensitive to length, and fails the threshold in short forms like "Algozo/Algoso" or
when the two forms are of different lengths like "Poiares/Vila Nova de Poiares".

Stil it detects many usefull variants.

In [32]:
import difflib

# we have many variations in geografic names
have_geoname_filter = vide_plus['nome-geografico'].notnull()
geonames_index = sorted(vide_plus[have_geoname_filter]['nome-geografico'].unique())
print("Number of different geo names:",len(geonames_index))
geo_similars = {}
diff_threshold = .85

for geo in [g for g in geonames_index if g is not None]:

    for similar in [s for s in geonames_index if s is not None and s > geo] :
        diff = difflib.SequenceMatcher(None, geo, similar).ratio()
        if  diff >= diff_threshold and geo is not None and similar is not None:
            pass
            print(f"{geo} / {similar} diff:{diff:.3}")
            geo_similars[geo] = geo_similars.get(geo,[]) + [similar]
            geo_similars[similar] = geo_similars.get(similar,[]) + [geo]


Number of different geo names: 1429
Aguas Santas / Águas Santas diff:0.917
Alcacer / Alcácer diff:0.857
Alcaide / Alcaíde diff:0.857
Alenquer / Alquer diff:0.857
Alhos Vedras / Alhos Vedros diff:0.917
Almalaguer / Almalaguez diff:0.9
Ameixilhoeira da Carregação / Mexilhoeira da Carregação diff:0.923
Anadaluzia / Andaluzia diff:0.947
Angra do Heroismo / Angra do Heroísmo diff:0.941
Arco / Arcos diff:0.889
Arcos Valdevez / Arcos de Valdevez diff:0.903
Arcos Valdevez / Arcos de Valedevez diff:0.875
Arcos de Valdevez / Arcos de Valedevez diff:0.971
Arcozelo / Arcozelos diff:0.941
Arrifana de Sousa / Arrifana do Sousa diff:0.941
Atei / Athei diff:0.889
Bairros / Barrosa diff:0.857
Barcelos / Barcos diff:0.857
Barcos / Buarcos diff:0.923
Barreira / Barreiria diff:0.941
Barreira / Barreiro diff:0.875
Barrosa / Barrosas diff:0.933
Bemviver / Benviver diff:0.875
Cabacos / Cabaços diff:0.857
Cabananas / Cabanas diff:0.875
Cabeceira de Basto / Cabeceira de Bastos diff:0.973
Cabeceira de Basto / C

## Matching "vide" references

Try to match "forward" and "backward" references by generating the target names from "vide"
expressions 

### Generation of target names from "vide" expressions

 
There are four types of "vide" expressions:

1. “Cut”: António Veloso Cabral, vide Veloso, result: António Veloso. The “vide” expression is a family name before the last; the target name is computed as the base name up to and including the “vide” expression; the resulting name is a shorter version of the base name, with the last family name(s) removed
2. “Add”: André de Campos, vide Cordeiro, result: André de Campos Cordeiro. The “vide” expression is not present in the base name; the target name is the base name with the “vide” expression added at the end; in some cases, the real target name will have and extra particle before the vide expression, like “de”, “e”, ... etc...). 
3.	“Replace”: Adriano Sisnando Brotero de Avelar Quintino, vide Adriano Sisnando Brotero Quintino de Avelar, result: Adriano Sisnando Brotero Quintino de Avelar. The “vide” expression is a full name, sharing the first name with the base name. This happens when the transformation of family names cannot be expressed by “cut” and “add”, so the author of the card wrote the full target name after “vide” for clarity.
4.	“Partial replace”: Francisco António Campos, vide de Novais Campos, result: Francisco António de Novais Campos. The “vide” expression replaces part of the base name; the “base name” and the “vide” expression overlap at the end; the matched part in the “base name” is replaced by the “vide” expression.



#### Collect first names from database, filter rare ones

We need the first names in the database for the next step.


In [33]:
# collect possible first names from current database
from timelinknb.pandas import attribute_values
from timelinknb import Session


# collect list of first names, ignore the less frequent ones
#
threshold = 5
pnomes = []
pnomes_table = attribute_values('nome-primeiro')
for id,linha in pnomes_table.iterrows():
    pnome = id
    count = linha['count']
    if count>threshold:
        pnomes.append(pnome)

print(f"Number of first names with more than {threshold} occurrences {len(pnomes)}")
print("Use this to copy to other places:")
print()
print("[")
for i in range(len(pnomes)):
    print(f"'{pnomes[i]}',", end='')
    if int((i+1)%5) == 0:
        print()
print("]")

Number of first names with more than 5 occurrences 279
Use this to copy to other places:

[
'Abel','Abílio','Acácio','Acúrcio','Adão',
'Adelino','Adolfo','Adriano','Adrião','Afonso',
'Agnelo','Agostinho','Aires','Albano','Alberto',
'Albino','Aleixo','Alexandre','Alfredo','Alípio',
'Álvaro','Amadeu','Amador','Amancio','Amândio',
'Amaro','Ambrósio','Américo','Anacleto','Anastácio',
'André','Ângelo','Aníbal','Aniceto','Anselmo',
'Antão','Antero','António','Antrónio','Apolinário',
'Arcanjo','Aristides','Armando','Arnaldo','Arsenio',
'Artur','Ascenso','Atanásio','Augusto','Aureliano',
'Aurélio','Avelino','Baltasar','Barnabé','Bartolomeu',
'Basílio','Batista','Belchior','Benjamim','Bento',
'Berardo','Bernardino','Bernardo','Boaventura','Bonifácio',
'Brás','Bruno','Caetano','Calisto','Camilo',
'Cândido','Carlos','Casimiro','Celestino','César',
'Cipriano','Cláudio','Clemente','Constantino','Cosme',
'Crisogono','Crispim','Cristiano','Cristóvão','Custódio',
'Dâmaso','Damião','Daniel','David','De


#### Apply vide expression transformation for records, get the target name

Echo replacements that involve changing the first name, because they are error prone.




In [34]:
from os.path import commonprefix
import re


vide_plus['loookup']=''
vide_plus['vide_type']=''

for id,linha in vide_plus.iterrows():
    nome =  linha['name']
    if not pd.isnull(linha['nome-vide']) :
        nome_vide = linha['nome-vide']
        nomes = nome.split(" ")
        nomes_vide = nome_vide.split(" ")
        # find a common suffix (invert names, use commonprefix, invert result)
        terminacao_comum = commonprefix([nome[::-1],nome_vide[::-1]])[::-1]
        # check it is a separate name and not just common letters at the end
        # a proper family name should share a starting space 
        if len(terminacao_comum) > 0:
            if terminacao_comum[0] != ' ':
                terminacao_comum = ''    # not a separate name, abandom
            else:
                terminacao_comum = terminacao_comum.strip()
        # currently using common suffix lowers the matches why?
        # terminacao_comum = ''

        # Type CUT: vide is a inner part of the original name
        # e.g. André Vaz Cabaço, vide Vaz
        # but also Manuel de Almeida Cabral, vide de Almeida 
        pos = nome.find(nome_vide)
        if pos > -1:
            lookup_name = nome[0:pos] + nome_vide
            vtype="cut"
        # Type REP: vide name looks like a full name 
        # e.g. António de Abreu Bacelar de Azevedo, vide António Abreu Bacelar
        # relaxing the same first name rule, lots  of leaks
        #  This leaks a lot : elif len(nomes_vide)>1 and nomes_vide[0] in pnomes :
        elif nomes[0] == nomes_vide[0]:
            lookup_name = nome_vide 
            vtype='rep'
        # Type REPAP: vide overlaps end of name
        # e.g. Joaquim Carvalho, vide Ramos de Carvalho
        # but vide must not contain first name
        # in that case probably a REP
        # otherwise generates leaks and lowers mumbers of matches
        elif terminacao_comum > '':
            if not nomes_vide[0] in pnomes :
                lookup_name = re.sub(f'{terminacao_comum}$',nome_vide,nome)
                vtype='repap'
            else:  # if common termination and first name better replace
                lookup_name = nome_vide
                vtype='rep'              
        else:
            # TYPE ADD vide name is not part of original nor a full name
            # so it must be an aditional surname
            # e.g. Fernão Cabral, vide Albuquerque = Fernão Cabral Albuquerque
            lookup_name = nome+" "+nome_vide
            vtype='add'
    else:
        lookup_name = nome
        vtype='novid'
  
        
    # we try to recover cases where there was replacement of first name
    # they are missed by the REP amd REPAP rules above and end up 
    # producing lookup which are the sobreposition of two names
    # this was added by examining bad "ADD" and "REPAP" results
    # if the result is a long name (>5 names), both name and vide start
    # with first names and vide also long (>4) then probable a replace
    # that changes the first name.
    nomes_lookup = lookup_name.split()
    if vtype != 'rep' \
         and nomes[0] in pnomes and nomes_vide[0] in pnomes\
         and nomes[0] != nomes_vide[0]\
         and len(nomes_vide) > 3\
         and len(nomes_lookup) > 5:
        old_lookup = lookup_name
        lookup_name = nome_vide
        vtype = 'rep+'
        print(id,nome,"vide", nome_vide,"--->",lookup_name,"\n  instead of", old_lookup)

    # print(f'{type} :{id:7}{nome:40}{nome_vide:40} = {lookup_name}')
    vide_plus.loc[id,'lookup'] = lookup_name
    vide_plus.loc[id,'vide_type'] = vtype


174974 Dionísio Dinis de Oliveira vide Dinis de Oliveira da Fonseca ---> Dinis de Oliveira da Fonseca 
  instead of Dionísio Dinis de Oliveira Dinis de Oliveira da Fonseca
175757 João Batista de Oliveira vide José Batista de Oliveira Baeina ---> José Batista de Oliveira Baeina 
  instead of João Batista de Oliveira José Batista de Oliveira Baeina
179121 João Xavier Mousinho da Silveira Gomide vide José Xavier Mousinho Gomide da Silveira ---> José Xavier Mousinho Gomide da Silveira 
  instead of João Xavier Mousinho da Silveira Gomide José Xavier Mousinho Gomide da Silveira
180061 João António Osório Pereira Gouveia vide Adriano Osório Pereira Guerra ---> Adriano Osório Pereira Guerra 
  instead of João António Osório Pereira Gouveia Adriano Osório Pereira Guerra
180742 Adriano Osório Pereira Guerra vide João António Pereira Cerenato ---> João António Pereira Cerenato 
  instead of Adriano Osório Pereira Guerra João António Pereira Cerenato
195729 Joaquim da Conceição vide Jacinto de Sã

#### Collect stats on type of vide transformation applied

In [35]:
vide_types = vide_plus.groupby('vide_type').count()[['name']]
vide_types['perc'] = vide_types['name']/ vide_types['name'].sum()
vide_types

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,4057,0.429858
cut,4126,0.437169
novid,152,0.016105
rep,1057,0.111994
rep+,20,0.002119
repap,26,0.002755


#### Double check partial replace transformation

* Partial replace:
    * Francisco António Campos, vide de Novais Campos, result: Francisco António de Novais Campos. 
        * the “vide” expression replaces part of the base name; 
        * the “base name” and the “vide” expression overlap at the end; 
        * the matched part in the “base name” is replaced by the “vide” expression.

They are sensistive to misspelling in first names.

In [36]:
# Check if we got many cases of vide with overlap, and if they are handled right
repap = vide_plus[['name','nome-vide','vide_type', 'lookup','nome-geografico','faculdade','uc-entrada']].sort_values(['name','nome-vide','vide_type', 'lookup'])
repap[repap.vide_type == 'repap']

Unnamed: 0_level_0,name,nome-vide,vide_type,lookup,nome-geografico,faculdade,uc-entrada
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
151153,António Barreiros,Rodrigues Barreiros,repap,António Rodrigues Barreiros,Lisboa,Leis,1655-10-16
171604,Belchior de Matos,Rodrigues de Matos,repap,Belchior Rodrigues de Matos,Vila Viçosa,Cânones,1621-10-11
171604,Belchior de Matos,Rodrigues de Matos,repap,Belchior Rodrigues de Matos,Vila Viçosa,Leis,1621-10-11
192525,Diogo Chamorro,Homem Chamorro,repap,Diogo Homem Chamorro,Porto,,0000-00-00
130230,Francisco António Campos,de Novais Campos,repap,Francisco António de Novais Campos,Azeitão,,0000-00-00
135280,Francisco Tavares de Figueiredo,Farncisco Xavier Tavares de Figueiredo,repap,Francisco Farncisco Xavier Tavares de Figueiredo,Meãs,Cânones,1762-10-01
135280,Francisco Tavares de Figueiredo,Farncisco Xavier Tavares de Figueiredo,repap,Francisco Farncisco Xavier Tavares de Figueiredo,vila,Cânones,1762-10-01
209659,Gaspar da Cunha,Macedo da Cunha,repap,Gaspar Macedo da Cunha,Amarante,,0000-00-00
165045,Isidoro da Cunha de Eça,dos Santos de Eça,repap,Isidoro da Cunha dos Santos de Eça,Alvorninha,,0000-00-00
165046,Isidoro dos Santos de Eça,da Cunha de Eça,repap,Isidoro dos Santos da Cunha de Eça,Alvorninha,Cânones,1735-10-01


Current fails:

* 135280	Francisco Tavares de Figueiredo	__vide__ Farncisco Xavier Tavares de Figueiredo	repap	Francisco Farncisco Xavier Tavares de Figueiredo __first name misspelled__	
* 245474    Jerónimo de Magalhães Mexia	jerónimo __vide__ josé de Macêdo Magalhães Mexia	repap	Jerónimo de jerónimo josé de Macêdo Magalhães ...	__first name misspelled__
* 277264	José Luís Alves Feijó __vide__ Angelo do Santíssimo Sacramento Alves Feijó	repap	José Luís Angelo do Santíssimo Sacramento Alve...	__first name misspelled__
* 228003	José da Fonseca Marques da Silva __vide__ da Fonseca da Silva	repap	José da Fonseca Marques da Fonseca da Silva: __bad expression should be a replace__
* 204835    Luís de Figueiredo	__vide__ Figueiredo Lobo ou Lobo de Figueiredo	repap	Luís Figueiredo Lobo ou Lobo de Figueiredo __bad expression not understandable__


#### Remove particles from names

To increase the chance of matches we make a copy of names and target names derived from "vide"
without the particles that are used in Portuguese names (not very uniformely)


In [37]:

def remove_particles(name):
    particles = ("de","da","e","das","dos","do")
    return " ".join([n for n in name.split() if n not in particles])

vide_plus['name_sp'] = vide_plus['name'].apply(lambda name: remove_particles(name))
vide_plus['lookup_sp'] = vide_plus['lookup'].apply(lambda name: remove_particles(name))
vide_plus[vide_plus['name']!=vide_plus['name_sp']][['name','name_sp', 'lookup','lookup_sp']].head(10)

Unnamed: 0_level_0,name,name_sp,lookup,lookup_sp
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
127798,António Joaquim do Cabo,António Joaquim Cabo,António Joaquim do Cabo e Faria,António Joaquim Cabo Faria
127819,Álvaro de Madureira Cabral,Álvaro Madureira Cabral,Álvaro de Madureira,Álvaro Madureira
128053,António da Fonseca Cabral,António Fonseca Cabral,António da Fonseca,António Fonseca
128061,António de Matos Cabral,António Matos Cabral,António de Matos,António Matos
128066,António de Mendonça Cabral,António Mendonça Cabral,António de Mendonça,António Mendonça
128125,Britaldo de Gouveia Cabral,Britaldo Gouveia Cabral,Britaldo de Gouveia,Britaldo Gouveia
128142,Diogo de Morais Cabral,Diogo Morais Cabral,Diogo de Morais,Diogo Morais
128159,Filipe de Gouveia Cabral,Filipe Gouveia Cabral,Filipe de Gouveia,Filipe Gouveia
128244,Francisco Xavier da Veiga Cabral,Francisco Xavier Veiga Cabral,Francisco Xavier da Veiga,Francisco Xavier Veiga
128295,João da Costa Cabreira,João Costa Cabreira,João da Costa,João Costa


#### Save name transformations for reference

Output table with the generation of target names from base names and vide expressions.

File available at /inferences/cross-references/vide_transform.csv

This table allows comparing between sucessive versions for fine tuning.

In [531]:
# save for change tracking
vide_plus[['name','name_sp','nome-vide','vide_type','lookup', 'lookup_sp']].sort_values('name_sp').to_csv('../inferences/remissivas/vide_transform.csv',sep=',')

### Match

#### Sort by geographic name, name and lookup

First attemp is to order the records so that matching cross references end up in consecutive rows.
We sort by place of birth and inside place of birth by name and target vide name.

This is a type of similarity filter, that puts many matches in sucessive rows.


In [40]:

# sort by naturalidade, 'lookup', 'name' (with the  name and lookup ordered alfabetically)
# this should put the vide pair in sucessive rows, but misses some cases due to ordering issues
cols = ['lookup_sp','name_sp']
vide_plus['sort_key'] = vide_plus[cols].apply(lambda row: '-'.join(sorted(row.values.astype(str))), axis=1)
vide_plus.sort_values(['nome-geografico','sort_key'], inplace=True)
vide_plus[['nome-geografico','sort_key', 'name_sp','lookup_sp','nome-vide','vide_type','uc-entrada','nome-pai','faculdade']].head(10)

Unnamed: 0_level_0,nome-geografico,sort_key,name_sp,lookup_sp,nome-vide,vide_type,uc-entrada,nome-pai,faculdade
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
198423,Constância,António Homem Magalhães-António Homem Magalhãe...,António Homem Magalhães Corte Real,António Homem Magalhães,Magalhães,cut,0000-00-00,,Leis
144388,Constância,Fernão Álvares-Fernão Álvares Temudo,Fernão Álvares,Fernão Álvares Temudo,Temudo,add,1573-11-13,Pantaleão Rosado,Cânones
202622,Constância,Fernão Álvares-Fernão Álvares Temudo,Fernão Álvares Temudo,Fernão Álvares,Álvares,cut,0000-00-00,,
129617,Constância,João Claúdio Ferreira Calado-João Claúdio Ferr...,João Claúdio Ferreira Calado,João Claúdio Ferreira Calado Oliveiro,Oliveiro,add,0000-00-00,,
171438,Constância,João Veiga-João Veiga Mendes Nogueira,João Veiga Mendes Nogueira,João Veiga,João da Veiga,cut,0000-00-00,,Leis
213495,Constância,João Veiga-João Veiga Mendes Nogueira,João Veiga,João Veiga Mendes Nogueira,João da Veiga Mendes Nogueira,rep,1757-10-01,,Leis
143676,Constância,Julião Velho-Julião Velho Almeida,Julião Velho Almeida,Julião Velho,Velho,cut,1663-07-10,,Cânones
214577,Constância,Julião Velho-Julião Velho Almeida,Julião Velho,Julião Velho Almeida,Almeida,add,0000-00-00,,
176277,Constância,Manuel Costa-Manuel Costa Oliveira,Manuel Costa Oliveira,Manuel Costa,Costa,cut,1672-01-24,Manuel da Costa,Cânones
203159,Constância,Manuel Costa-Manuel Costa Oliveira,Manuel Costa,Manuel Costa Oliveira,Oliveira,add,0000-00-00,Manuel da Costa,


In [43]:
# we set records with no nome geográfico temporarly to a string
# so that we can have them together for consideration
vide_plus.loc[vide_plus['nome-geografico'].isnull(),'nome-geografico'] = '***NA***'


#### Sequential matching

In late April around 3600 records found a match by this process.

Bit some matches are inconsistent (assymetric or ambiguous).


In [44]:
def compare_names(name1, name2):
    return remove_particles(name1) == remove_particles(name2)

previous_lookup = ''
previous_nome = ''
previous_id = ''
previous_date = ''
sequential_matches = []

for id,linha in vide_plus.iterrows():
    nome = linha['name_sp']
    lookup_name = linha['lookup_sp']
    uc_date = linha['uc-entrada']
    rec_type = linha['rec_type']

    if compare_names(previous_lookup,nome)\
         and compare_names(previous_nome,lookup_name)\
         and id != previous_id:
        # we store the direction of the match see-aka or aka-see
        from_type = rec_type
        to_type = previous_rec_type
        to_tuple = (id,previous_id,f'{from_type}-{to_type}')
        from_tuple = (previous_id,id,f'{to_type}-{from_type}')
        if to_tuple in sequential_matches:
            # print("Skipping duplicate match",to_tuple)
            pass
        else:
            sequential_matches.append((id,previous_id,f'{from_type}-{to_type}'))
        if from_tuple in sequential_matches:
            # print("Skipping duplicate match",to_tuple)
            pass
        else:    
            sequential_matches.append((previous_id,id,f'{to_type}-{from_type}'))

    previous_id = id
    previous_nome = nome
    previous_lookup = lookup_name  
    previous_date = uc_date
    previous_rec_type = rec_type

vide_plus.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 198423 to 230315
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  9438 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
 13  loookup          9438 non-null   object
 14  vide_type        9438 non-null   object
 15  lookup           9438 non-null   object
 16  name_sp          9438 non-null   object
 17  lookup_sp        9438 non-null 

##### Analyse sequential match results

In [46]:
method = 'sequential'
match_records['matched_pairs'][method] = list(set(sequential_matches))
match_info.loc['matched_pairs'][method] = len(match_records['matched_pairs'][method])

# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in sequential_matches if mtype == 'see-see']

# records
rec_matched = set([id for (id,d,t) in sequential_matches]+
                  [id for (o,id,t) in sequential_matches])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_matched'][method] = list(rec_matched)
match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_matched'][method] = len(rec_matched)
match_info.loc['records_see_aka'][method] = len(rec_see_aka)
match_info.loc['records_aka_see'][method] = len(rec_aka_see)
match_info.loc['records_aka_aka'][method] = len(rec_aka_aka)
match_info.loc['records_see_see'][method] = len(rec_see_see)

# new
match_info.loc['aka_matched'][method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched'][method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

##### Check for ambigous matches
Look for records matched with more than one or involved in transitive matching (A->B->C)

Note that sequential methid only matches symmetric links and so no asymmetric cases are generated in this method

In [48]:
import networkx as nx

method = 'sequential'

matched_pairs = match_records['matched_pairs'][method]
records = match_records['records_matched'][method]

matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))


# alternative method, perhaps more informative:
pairs_to_check = match_records['matched_pairs'][method]

asymmetric_pairs = []
for (o,d,t) in pairs_to_check:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in pairs_to_check:
        asymmetric_pairs.append((o,d,t))
        print("asymmetic match:",(o,d,t))

print("Records with more than one match      :", len(matched_multiple))
print("Records with just one match           :", len(matched_single))

G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]
# number of records in ambiguous matches
amb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)
for amb in transitive:
    print(amb)
print("Are multiple in ambiguous?            :",set(matched_multiple).issubset(set(amb_records)))

rec_errors = set(amb_records).union(matched_multiple).union(matched_single)
rec_ok = set(records).difference(rec_errors)

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error'][method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok'][method] = len(rec_ok)
match_records['records_asymmetric'][method] = list(matched_single)
match_info.loc['records_asymmetric'][method] = len(matched_single)
match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive'][method] = namb_records

# new
aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok'][method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok'][method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok'][method] = len(pairs_ok)

vide_plus.loc[matched_single,'match_error'] = False
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method
vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E02-Multiple match "+method
vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E01-Ambiguity in match "+method

match_info.fillna('')

Records with more than one match      : 4
Records with just one match           : 0
Records in ambiguous matches          : 12
{'183306', '235009', '183307'}
{'169757', '162809', '136283'}
{'255769', '206151', '203494'}
{'278765', '181236', '178312'}
Are multiple in ambiguous?            : True


Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,1913.0,
aka_matched_ok,,1907.0,
aka_pai,1619.0,,
matched_pairs,,3644.0,
matched_pairs_ok,,3628.0,
nodate,5763.0,,
nodate_novide,141.0,,


In [539]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

##### Show some of the ambiguous records

pandas and timelink in colorful relation

In [49]:
from timelinknb.pandas import display_group_attributes

pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 8
for ambiguous_records in transitive[:show_only]:
    display_group_attributes(ambiguous_records,
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,183306,António de Sousa,António de Sousa Macedo,Lisboa,0000-00-00,,Gonçalo de Sousa
1,235009,António de Sousa,Macedo,Lisboa,0000-00-00,,João
2,183307,António de Sousa de Macedo,Sousa,Lisboa,1629-11-15,Leis,Gonçalo de Sousa


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,183306,naturalidade,Lisboa,
1,0000-00-00,235009,naturalidade,Lisboa,
2,0000-00-00,183306,nome,António de Sousa,
3,0000-00-00,235009,nome,António de Sousa,
4,0000-00-00,183306,nome,António de Sousa Macedo,"António de Sousa, vide António de Sousa Macedo"
5,0000-00-00,235009,nome,António de Sousa Macedo,"António de Sousa, vide Macedo"
6,0000-00-00,183306,nome-pai,Gonçalo de Sousa,
7,0000-00-00,235009,nome-pai,João,
8,0000-00-00,183306,nome-vide,António de Sousa Macedo,
9,0000-00-00,235009,nome-vide,Macedo,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,136283,Manuel Francisco de Ribolhos,Manuel Francisco,Ribolhos,0000-00-00,Cânones,
1,169757,Manuel Francisco de Ribolhos,Francisco,Ribolhos,0000-00-00,,
2,162809,Manuel Francisco,Manuel Francisco de Ribolhos,Ribolhos,1756-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,136283,faculdade,Cânones,Cânones
1,0000-00-00,136283,naturalidade,Ribolhos,
2,0000-00-00,169757,naturalidade,Ribolhos,
3,0000-00-00,136283,nome,Manuel Francisco,"Manuel Francisco de Ribolhos, vide Manuel Francisco"
4,0000-00-00,169757,nome,Manuel Francisco,"Manuel Francisco de Ribolhos, vide Francisco"
5,0000-00-00,136283,nome,Manuel Francisco de Ribolhos,
6,0000-00-00,169757,nome,Manuel Francisco de Ribolhos,
7,0000-00-00,169757,nome-vide,Francisco,
8,0000-00-00,136283,nome-vide,Manuel Francisco,
9,0000-00-00,136283,uc-entrada,0000-00-00,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,206151,Francisco Rodrigues de Valadares,Rodrigues,Vila Viçosa,0000-00-00,Cânones,Rodrigo Rodrigues
1,203494,Francisco Rodrigues,Valadares,Vila Viçosa,1605-10-20,Cânones,Rodrigo Rodrigues
2,255769,Francisco Rodrigues,Valadares,Vila Viçosa,1605-10-20,Cânones,Rodrigo Rodrigues


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,206151,faculdade,Cânones,Cânones
1,0000-00-00,206151,naturalidade,Vila Viçosa,
2,0000-00-00,206151,nome,Francisco Rodrigues,"Francisco Rodrigues de Valadares, vide Rodrigues"
3,0000-00-00,206151,nome,Francisco Rodrigues de Valadares,
4,0000-00-00,206151,nome-pai,Rodrigo Rodrigues,
5,0000-00-00,206151,nome-vide,Rodrigues,
6,0000-00-00,206151,uc-entrada,0000-00-00,
7,0000-00-00,206151,uc-saida,0000-00-00,
8,1605-10-20,203494,colegio,Colégio de S.Paulo,colegial de São Paulo
9,1605-10-20,255769,colegio,Colégio de S.Paulo,colegial de São Paulo


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,181236,António Gomes de Macedo,Gomes,Coimbra,0000-00-00,,
1,178312,António Gomes,Macedo,Coimbra,1641-10-02,Teologia,Manuel Rodrigues
2,278765,António Gomes,Macedo,Coimbra,1749-04-26,Medicina,José Gomes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,181236,naturalidade,Coimbra,
1,0000-00-00,181236,nome,António Gomes,"António Gomes de Macedo, vide Gomes"
2,0000-00-00,181236,nome,António Gomes de Macedo,
3,0000-00-00,181236,nome-vide,Gomes,
4,0000-00-00,181236,uc-entrada,0000-00-00,
5,0000-00-00,181236,uc-saida,0000-00-00,
6,1641-10-02,178312,faculdade,Teologia,Teologia
7,1641-10-02,178312,matricula-faculdade,Teologia,02.10.1641
8,1641-10-02,178312,naturalidade,Coimbra,
9,1641-10-02,178312,nome,António Gomes,


##### aka-aka matches in sequential mode

In sequential mode we do not filter by record type so some aka-aka and see-see matches occur.
matches 

In [50]:
from timelinknb.pandas import display_group_attributes

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs']['sequential']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]
show_only = 10
print(f"aka-aka matches in sequential mode (show only {show_only}) of {len(show_pairs)}:")
for o,d,t in show_pairs[:show_only]:
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

aka-aka matches in sequential mode (show only 10) of 100:


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,143239,Jerónimo de Almeida,Ribeiro,Ferreira,1553-10-00,Leis,
1,163231,Jerónimo de Almeida Ribeiro,Almeida,Ferreira,1560-01-24,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1553-00-00:1555-06-00,143239,instituta,1553-00-00:1555-06-00,"curso: 1 curso de Instituta, 1 de Código desde Outubro de 1553 a Junho de 1555"
1,1553-10-00,143239,faculdade,Leis,Leis
2,1553-10-00,143239,naturalidade,Ferreira,
3,1553-10-00,143239,nome,Jerónimo de Almeida,
4,1553-10-00,143239,nome,Jerónimo de Almeida Ribeiro,"Jerónimo de Almeida, vide Ribeiro"
5,1553-10-00,143239,nome-vide,Ribeiro,
6,1553-10-00,143239,uc-entrada,1553-10-00,
7,1553-10-00,143239,uc-entrada.ano,1553,
8,1559-07-27,143239,exame,Exame para Bacharel,27.07.1559
9,1559-07-27,143239,grau,Bacharel em Leis,"""1559/07/27"""


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,129553,Manuel de Castro Caldeira,Castro,Abrantes,1601-10-14,Cursos jurídicos (Cânones ou Leis),Lopo Castro
1,191232,Manuel de Castro,Caldeira,Abrantes,1601-10-14,Medicina,Lopo de Castro


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-14,129553,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1601-10-14,191232,faculdade,Medicina,Medicina
2,1601-10-14,129553,instituta,1601-10-14,14.10.1601 1601-10-14
3,1601-10-14,191232,instituta,1601-10-14,14.10.1601 1601-10-14
4,1601-10-14,129553,naturalidade,Abrantes,
5,1601-10-14,191232,naturalidade,Abrantes,
6,1601-10-14,129553,nome,Manuel de Castro,"Manuel de Castro Caldeira, vide Castro"
7,1601-10-14,191232,nome,Manuel de Castro,
8,1601-10-14,129553,nome,Manuel de Castro Caldeira,
9,1601-10-14,191232,nome,Manuel de Castro Caldeira,"Manuel de Castro, vide Caldeira"


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,167251,Pedro Simões Esteves,Simões,Campo Maior,1665-10-19,Cursos jurídicos (Cânones ou Leis),
1,231579,Pedro Simões,Pedro Simões Esteves,Campo Maior,1665-10-19,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1665-03-21,231579,grau,Bacharel em Artes,Bacharel em Artes 21.03.1665
1,1665-10-19,167251,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
2,1665-10-19,231579,faculdade,Cânones,Cânones
3,1665-10-19,167251,instituta,1665-10-19,19.10.1665 1665-10-19
4,1665-10-19,231579,instituta,1665-10-19,19.10.1665 1665-10-19
5,1665-10-19,167251,naturalidade,Campo Maior,
6,1665-10-19,231579,naturalidade,Campo Maior,
7,1665-10-19,167251,nome,Pedro Simões,"Pedro Simões Esteves, vide Simões"
8,1665-10-19,231579,nome,Pedro Simões,
9,1665-10-19,167251,nome,Pedro Simões Esteves,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,149046,Bernardo José de Azevedo,Vieira,Paredes,1725-10-01,Cânones,José de Azevedo Vieira
1,216902,Bernardo José de Azevedo Vieira,Azevedo,Paredes,1726-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1725-10-01,149046,faculdade,Cânones,Cânones
1,1725-10-01,149046,instituta,1725-10-01,01.10.1725 1725-10-01
2,1725-10-01,149046,naturalidade,Paredes,
3,1725-10-01,149046,nome,Bernardo José de Azevedo,
4,1725-10-01,149046,nome,Bernardo José de Azevedo Vieira,"Bernardo José de Azevedo, vide Vieira"
5,1725-10-01,149046,nome-nota,padre,
6,1725-10-01,149046,nome-pai,José de Azevedo Vieira,
7,1725-10-01,149046,nome-vide,Vieira,
8,1725-10-01,149046,padre,sim,padre
9,1725-10-01,149046,uc-entrada,1725-10-01,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,210366,Joaquim José Ribeiro de Vasconcelos,Joaquim José Ribeiro,"Baía, Brasil",1779-10-26,Filosofia,João Ribeiro de Vasconcelos
1,163896,Joaquim José Ribeiro,Vasconcelos,Baía,1781-11-05,Matemática,João Ribeiro de Vasconcelos


Unnamed: 0,date,id,type,value,attr_obs
0,1779-10-26,210366,faculdade,Filosofia,Filosofia
1,1779-10-26,210366,matricula-faculdade,Filosofia,(obrigado)
2,1779-10-26,210366,matricula-faculdade.obrigado,Filosofia,(obrigado)
3,1779-10-26,210366,matricula-faculdade.obrigado.ano,Filosofia.1779,(obrigado)
4,1779-10-26,210366,naturalidade,"Baía, Brasil",
5,1779-10-26,210366,nome,Joaquim José Ribeiro,"Joaquim José Ribeiro de Vasconcelos, vide Joaquim José Ribeiro"
6,1779-10-26,210366,nome,Joaquim José Ribeiro de Vasconcelos,
7,1779-10-26,210366,nome-pai,João Ribeiro de Vasconcelos,
8,1779-10-26,210366,nome-vide,Joaquim José Ribeiro,
9,1779-10-26,210366,uc-entrada,1779-10-26,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,132440,Manuel Gomes Quaresma,Gomes,Coimbra,1714-04-24,Artes,
1,186310,Manuel Gomes,Quaresma,Coimbra,1714-10-01,Medicina,


Unnamed: 0,date,id,type,value,attr_obs
0,1714-04-24,132440,faculdade,Artes,Faculdade inferida
1,1714-04-24,132440,grau,Bacharel em Artes,Bacharel em artes 24.04.1714
2,1714-04-24,132440,naturalidade,Coimbra,
3,1714-04-24,132440,nome,Manuel Gomes,"Manuel Gomes Quaresma, vide Gomes"
4,1714-04-24,132440,nome,Manuel Gomes Quaresma,
5,1714-04-24,132440,nome-vide,Gomes,
6,1714-04-24,132440,uc-entrada,1714-04-24,
7,1714-04-24,132440,uc-entrada.ano,1714,
8,1714-10-01,186310,faculdade,Medicina,Medicina
9,1714-10-01,186310,matricula-faculdade,Medicina,01.10.1714


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,173563,Manuel Dias Nunes,Dias,Beja,1659-10-20,Medicina,
1,161852,Manuel Dias,Nunes,Beja,1663-02-22,Medicina,


Unnamed: 0,date,id,type,value,attr_obs
0,1659-10-20,173563,faculdade,Medicina,Medicina
1,1659-10-20,173563,matricula-faculdade,Medicina,20.10.1659
2,1659-10-20,173563,naturalidade,Beja,
3,1659-10-20,173563,nome,Manuel Dias,"Manuel Dias Nunes, vide Dias"
4,1659-10-20,173563,nome,Manuel Dias Nunes,
5,1659-10-20,173563,nome-vide,Dias,
6,1659-10-20,173563,uc-entrada,1659-10-20,
7,1659-10-20,173563,uc-entrada.ano,1659,
8,1660-12-23,173563,matricula-faculdade,Medicina,23.12.1660
9,1661-10-15,173563,matricula-faculdade,Medicina,15.10.1661


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,200599,João Teixeira,Morais,Bragança,1615-10-26,Cânones,Jacome de Morais
1,251249,João Teixeira de Morais,Teixeira,Bragança,1616-02-22,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1615-10-26,200599,faculdade,Cânones,Cânones
1,1615-10-26,200599,grau,Bacharel em Cânones,
2,1615-10-26,200599,matricula-faculdade,Cânones,26.10.1615
3,1615-10-26,200599,naturalidade,Bragança,
4,1615-10-26,200599,nome,João Teixeira,
5,1615-10-26,200599,nome,João Teixeira Morais,"João Teixeira, vide Morais"
6,1615-10-26,200599,nome-pai,Jacome de Morais,
7,1615-10-26,200599,nome-vide,Morais,
8,1615-10-26,200599,uc-entrada,1615-10-26,
9,1615-10-26,200599,uc-entrada.ano,1615,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,266159,Domingos Antunes,de Abreu,Lisboa,1578-10-08,Cânones,Brás Dias de Abreu
1,140681,Domingos Antunes de Abreu,Antunes,Lisboa,1579-10-27,Cânones,Brás Dias de Abreu


Unnamed: 0,date,id,type,value,attr_obs
0,1578-10-08,266159,faculdade,Cânones,Cânones
1,1578-10-08,266159,instituta,1578-10-08,"""1578/10/08 1578-10-08"""
2,1578-10-08,266159,naturalidade,Lisboa,
3,1578-10-08,266159,nome,Domingos Antunes,
4,1578-10-08,266159,nome,Domingos Antunes de Abreu,"Domingos Antunes, vide de Abreu"
5,1578-10-08,266159,nome-pai,Brás Dias de Abreu,
6,1578-10-08,266159,nome-vide,de Abreu,
7,1578-10-08,266159,uc-entrada,1578-10-08,
8,1578-10-08,266159,uc-entrada.ano,1578,
9,1578-10-08:1579-06-08,266159,instituta,1578-10-08:1579-06-08,curso: Instituta e Cânones: 08.10.1578 até 08.06.1579


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,213910,Rui Lopes da Veiga,Lopes,Coimbra,1558-02-03,Artes,
1,253753,Rui Lopes,da Veiga,Coimbra,1568-12-23,Leis,


Unnamed: 0,date,id,type,value,attr_obs
0,1558-02-03,213910,faculdade,Artes,Faculdade inferida
1,1558-02-03,213910,grau,Bacharel em Artes,ter o tempo que se requer para Bacharel em Artes: 03.02.1558
2,1558-02-03,213910,naturalidade,Coimbra,
3,1558-02-03,213910,nome,Rui Lopes,"Rui Lopes da Veiga, vide Lopes"
4,1558-02-03,213910,nome,Rui Lopes da Veiga,
5,1558-02-03,213910,nome-vide,Lopes,
6,1558-02-03,213910,uc-entrada,1558-02-03,
7,1558-02-03,213910,uc-entrada.ano,1558,
8,1560-05-23,213910,grau,Licenciado em Artes,23.05.1560
9,1560-05-23,213910,uc-saida,1560-05-23,


In [51]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

#### Non sequential matching

In [52]:
import pandas as pd
import numpy as np

vide_plus['match_error']=False

previous_geo_name = ''

# list of record to need to be debugged
# use a break point in "pass" statement of the "if" 
#  at the start of the loop
# problematic = ['169888','169890','214417']  # add to list what you what to debug
#
# 211703 See matches 168662 which in turn matches 211704 e 211706
problematic = ['168662','211703','211704','211706'] 


# Loop through the vide records
random_matches = []
for id,linha in vide_plus.iterrows():

    if id in problematic:
        pass  # do breakpoint here to debug problematic records

    nome = linha['name_sp']
    lookup_name = linha['lookup_sp']
    nome_geo = linha['nome-geografico']
    rec_type = linha['rec_type']

    # we now check for similar geo names
    # and load names from variants
    if nome_geo != previous_geo_name:
        simile = geo_similars.get(nome_geo,[])
        if len(simile) > 0 :
            # we have similar geo names
            local_records = vide_plus[vide_plus['nome-geografico'].isin([nome_geo] + simile)]
            pass
        else:   # if no variants just load names with this place of birth
            local_records = vide_plus[vide_plus['nome-geografico'] == nome_geo]

        previous_geo_name = nome_geo

    # search for records with name matching the vide expression of this record
    candidates = []

    found_lookup_name = local_records['name_sp']==lookup_name

    for cid,same_name in local_records[found_lookup_name].iterrows():
        if same_name['lookup_sp'] == nome and cid != id:
            candidates.append(cid)

    # if nothing found search for records with vide expression equal to this one
    if len(candidates) == 0:
        found_lookup_name = local_records['lookup_sp']==lookup_name

        for cid,same_name in local_records[found_lookup_name].iterrows():
            if same_name['lookup_sp'] == nome and cid != id:
                candidates.append(cid)
    
    if len(candidates) > 0:  # we found some candidates
        for cand in candidates:
            mrec_type = vide_plus.loc[[cand]].iloc[0]['rec_type']
            mtype = f'{rec_type}-{mrec_type}'
            match = (id,cand,mtype)
            if match not in random_matches:
                random_matches.append(match)
    



##### Analyse results (random)

In [53]:
method = 'random'
match_records['matched_pairs'][method] = list(set(random_matches))
match_info.loc['matched_pairs'][method] = len(match_records['matched_pairs'][method])

# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in random_matches if mtype == 'see-see']


# records
rec_matched = set([id for (id,d,t) in random_matches]+
                  [id for (o,id,t) in random_matches])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_matched'][method] = list(rec_matched)
match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_matched'][method] = len(rec_matched)
match_info.loc['records_see_aka'][method] = len(rec_see_aka)
match_info.loc['records_aka_see'][method] = len(rec_aka_see)
match_info.loc['records_aka_aka'][method] = len(rec_aka_aka)
match_info.loc['records_see_see'][method] = len(rec_see_see)

# new
match_info.loc['aka_matched'][method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched'][method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

matched_rand_df = pd.DataFrame(random_matches, columns=['from','to','type'])
matched_rand_df.groupby('type').count()

Unnamed: 0_level_0,from,to
type,Unnamed: 1_level_1,Unnamed: 2_level_1
aka-aka,218,218
aka-see,1769,1769
see-aka,1796,1796
see-see,21,21


#####  Check the matches for errors (random)

In [54]:
import networkx as nx

method = 'random'

matched_pairs = match_records['matched_pairs'][method]
records = match_records['records_matched'][method]

matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))

print("Records with more than two matches    :", len(matched_multiple))
print("Records asymmetric (one match only)   :", len(matched_single))

G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]
# number of records in ambiguous matches
amb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)
for amb in transitive:
    print(amb)
print("Are multiple in ambiguous             :",set(matched_multiple).issubset(set(amb_records)))

rec_errors = set(amb_records).union(matched_multiple).union(matched_single)
rec_ok = set(records).difference(rec_errors)

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error'][method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok'][method] = len(rec_ok)
match_records['records_asymmetric'][method] = list(matched_single)
match_info.loc['records_asymmetric'][method] = len(matched_single)
match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive'][method] = namb_records

# new
aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok'][method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok'][method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok'][method] = len(pairs_ok)

vide_plus.loc[matched_single,'match_error'] = False # we dont consider a single match an error
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method
vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E02-Multiple match "+method
vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E03-Ambiguity in match "+method
match_info.fillna('')

Records with more than two matches    : 38
Records asymmetric (one match only)   : 76
Records in ambiguous matches          : 116
{'238842', '233035', '238845'}
{'142386', '142388', '171377'}
{'160196', '160158', '152482'}
{'266130', '146547', '266114'}
{'194771', '201515', '316381'}
{'226697', '226704', '195422'}
{'132440', '238066', '186417', '186310', '186309'}
{'208873', '243481', '208877'}
{'152890', '243711', '152894'}
{'238488', '238487', '226966'}
{'242102', '238985', '242104'}
{'233838', '206540', '206536'}
{'222156', '189891', '189900'}
{'240879', '147410', '240882'}
{'183306', '235009', '183307'}
{'164823', '166756', '164824'}
{'169888', '169890', '214147'}
{'153316', '139166', '139146'}
{'169757', '162809', '136283'}
{'172677', '172681', '190285'}
{'221393', '233103', '233094'}
{'143426', '175731', '175730'}
{'203487', '206151', '161932', '255769', '203494'}
{'210090', '232128', '232079'}
{'253465', '212857', '252907'}
{'134102', '191654', '191659'}
{'254410', '201728', '20

Unnamed: 0,data,sequential,random
aka,3062.0,,
aka_fac,3035.0,,
aka_geo,2973.0,,
aka_matched,,1913.0,1970.0
aka_matched_ok,,1907.0,1897.0
aka_pai,1619.0,,
matched_pairs,,3644.0,3804.0
matched_pairs_ok,,3628.0,3614.0
nodate,5763.0,,
nodate_novide,141.0,,


##### Show some of the ambiguous records


In [55]:
pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 6
for ambiguous_records in transitive[:show_only]:
    display_group_attributes(ambiguous_records,
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,233035,Miguel Soares,Pereira,Porto,0000-00-00,,
1,238842,Miguel Soares Pereira,Soares,Porto,1594-11-07,Cânones,Bernardino Pereira
2,238845,Miguel Soares Pereira,Soares,Porto,1600-11-20,Cânones,Leonardo Pereira


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,233035,naturalidade,Porto,
1,0000-00-00,233035,nome,Miguel Soares,
2,0000-00-00,233035,nome,Miguel Soares Pereira,"Miguel Soares, vide Pereira"
3,0000-00-00,233035,nome-vide,Pereira,
4,0000-00-00,233035,uc-entrada,0000-00-00,
5,0000-00-00,233035,uc-saida,0000-00-00,
6,1594-11-07,238842,faculdade,Cânones,Cânones
7,1594-11-07,238842,instituta,1594-11-07,07.11.1594 1594-11-07
8,1594-11-07,238842,naturalidade,Porto,
9,1594-11-07,238842,nome,Miguel Soares,"Miguel Soares Pereira, vide Soares"


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,142386,André de Matos de Almada,Matos,Lisboa,0000-00-00,,
1,142388,André de Matos de Almada,Matos,Lisboa,1576-10-01,Leis,Jerónimo de Matos
2,171377,André de Matos,Almada,Lisboa,1639-11-07,Cânones,Fernão de Matos


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,142386,naturalidade,Lisboa,
1,0000-00-00,142386,nome,André de Matos,"André de Matos de Almada, vide Matos"
2,0000-00-00,142386,nome,André de Matos de Almada,
3,0000-00-00,142386,nome-vide,Matos,
4,0000-00-00,142386,uc-entrada,0000-00-00,
5,0000-00-00,142386,uc-saida,0000-00-00,
6,1576-10-01,142388,faculdade,Leis,Leis
7,1576-10-01,142388,naturalidade,Lisboa,
8,1576-10-01,142388,nome,André de Matos,"André de Matos de Almada, vide Matos"
9,1576-10-01,142388,nome,André de Matos de Almada,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,152482,Brás Ribeiro da Fonseca,Ribeiro,Nabainhos,0000-00-00,Leis,
1,160158,Brás Ribeiro,Fonseca,Nabainhos,1640-10-09,Leis,Miguel Ribeiro
2,160196,Brás Ribeiro,Fonseca,Nabainhos,1640-10-09,Leis,Miguel Ribeiro


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,152482,faculdade,Leis,Leis
1,0000-00-00,152482,naturalidade,Nabainhos,
2,0000-00-00,152482,nome,Brás Ribeiro,"Brás Ribeiro da Fonseca, vide Ribeiro"
3,0000-00-00,152482,nome,Brás Ribeiro da Fonseca,
4,0000-00-00,152482,nome-vide,Ribeiro,
5,0000-00-00,152482,uc-entrada,0000-00-00,
6,0000-00-00,152482,uc-saida,0000-00-00,
7,1640-10-09,160158,faculdade,Leis,Leis
8,1640-10-09,160196,faculdade,Leis,Leis
9,1640-10-09,160158,instituta,1640-10-09,1640.10.09 1640-10-09


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,266114,Rafael dos Anjos de Andrade,dos Anjos,Lisboa,0000-00-00,,Nicolau Ferreira
1,146547,Rafael dos Anjos de Andrade,dos Anjos,Lisboa,1640-10-01,Medicina,Nicolau Ferreira
2,266130,Rafael dos Anjos,de Andrade,Lisboa,1640-10-01,Medicina,Nicolau Ferreira


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,266114,naturalidade,Lisboa,
1,0000-00-00,266114,nome,Rafael dos Anjos,"Rafael dos Anjos de Andrade, vide dos Anjos"
2,0000-00-00,266114,nome,Rafael dos Anjos de Andrade,
3,0000-00-00,266114,nome-pai,Nicolau Ferreira,
4,0000-00-00,266114,nome-vide,dos Anjos,
5,0000-00-00,266114,uc-entrada,0000-00-00,
6,0000-00-00,266114,uc-saida,0000-00-00,
7,1640-10-01,146547,faculdade,Medicina,Medicina
8,1640-10-01,266130,faculdade,Medicina,Medicina
9,1640-10-01,146547,matricula-faculdade,Medicina,"""1640/10/01"""


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,201515,Manuel José de Magalhães Teixeira,Manuel José,Braga,0000-00-00,,
1,194771,Manuel José,Magalhães Teixeira,Braga,1735-01-30,Cânones,
2,316381,Manuel José de Magalhães Teixeira,Manuel José,Braga,1741-06-27,Cânones,Domingos Teixeira de Magalhães


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,201515,naturalidade,Braga,
1,0000-00-00,201515,nome,Manuel José,"Manuel José de Magalhães Teixeira, vide Manuel José"
2,0000-00-00,201515,nome,Manuel José de Magalhães Teixeira,
3,0000-00-00,201515,nome-vide,Manuel José,
4,0000-00-00,201515,uc-entrada,0000-00-00,
5,0000-00-00,201515,uc-saida,0000-00-00,
6,1735-01-30,194771,faculdade,Cânones,Cânones
7,1735-01-30,194771,instituta,1735-01-30,30.01.1735 1735-01-30
8,1735-01-30,194771,naturalidade,Braga,
9,1735-01-30,194771,nome,Manuel José,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,195422,Francisco Xavier,da Silva,Lisboa,0000-00-00,,
1,226704,Francisco Xavier da Silva,Xavier,Lisboa,1725-11-26,Cânones,
2,226697,Francisco Xavier da Silva,Francisco Xavier,Lisboa,1760-12-15,Cânones,Manuel Nunes


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,195422,naturalidade,Lisboa,
1,0000-00-00,195422,nome,Francisco Xavier,
2,0000-00-00,195422,nome,Francisco Xavier da Silva,"Francisco Xavier, vide da Silva"
3,0000-00-00,195422,nome-vide,da Silva,
4,0000-00-00,195422,uc-entrada,0000-00-00,
5,0000-00-00,195422,uc-saida,0000-00-00,
6,1725-11-26,226704,faculdade,Cânones,Cânones
7,1725-11-26,226704,matricula-faculdade,Cânones,26.11.1725
8,1725-11-26,226704,naturalidade,Lisboa,
9,1725-11-26,226704,nome,Francisco Xavier,"Francisco Xavier da Silva, vide Xavier"


#### Show some of the aka-aka records (potential duplicates)

In [56]:
from timelinknb.pandas import display_group_attributes

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs']['random']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]
show_only = 4
print(f"aka-aka matches in sequential mode (show only {show_only}) of {len(show_pairs)}:")
for o,d,t in show_pairs[:show_only]:
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

aka-aka matches in sequential mode (show only 4) of 109:


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,143239,Jerónimo de Almeida,Ribeiro,Ferreira,1553-10-00,Leis,
1,163231,Jerónimo de Almeida Ribeiro,Almeida,Ferreira,1560-01-24,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1553-00-00:1555-06-00,143239,instituta,1553-00-00:1555-06-00,"curso: 1 curso de Instituta, 1 de Código desde Outubro de 1553 a Junho de 1555"
1,1553-10-00,143239,faculdade,Leis,Leis
2,1553-10-00,143239,naturalidade,Ferreira,
3,1553-10-00,143239,nome,Jerónimo de Almeida,
4,1553-10-00,143239,nome,Jerónimo de Almeida Ribeiro,"Jerónimo de Almeida, vide Ribeiro"
5,1553-10-00,143239,nome-vide,Ribeiro,
6,1553-10-00,143239,uc-entrada,1553-10-00,
7,1553-10-00,143239,uc-entrada.ano,1553,
8,1559-07-27,143239,exame,Exame para Bacharel,27.07.1559
9,1559-07-27,143239,grau,Bacharel em Leis,"""1559/07/27"""


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,129553,Manuel de Castro Caldeira,Castro,Abrantes,1601-10-14,Cursos jurídicos (Cânones ou Leis),Lopo Castro
1,191232,Manuel de Castro,Caldeira,Abrantes,1601-10-14,Medicina,Lopo de Castro


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-14,129553,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1601-10-14,191232,faculdade,Medicina,Medicina
2,1601-10-14,129553,instituta,1601-10-14,14.10.1601 1601-10-14
3,1601-10-14,191232,instituta,1601-10-14,14.10.1601 1601-10-14
4,1601-10-14,129553,naturalidade,Abrantes,
5,1601-10-14,191232,naturalidade,Abrantes,
6,1601-10-14,129553,nome,Manuel de Castro,"Manuel de Castro Caldeira, vide Castro"
7,1601-10-14,191232,nome,Manuel de Castro,
8,1601-10-14,129553,nome,Manuel de Castro Caldeira,
9,1601-10-14,191232,nome,Manuel de Castro Caldeira,"Manuel de Castro, vide Caldeira"


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,167251,Pedro Simões Esteves,Simões,Campo Maior,1665-10-19,Cursos jurídicos (Cânones ou Leis),
1,231579,Pedro Simões,Pedro Simões Esteves,Campo Maior,1665-10-19,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1665-03-21,231579,grau,Bacharel em Artes,Bacharel em Artes 21.03.1665
1,1665-10-19,167251,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
2,1665-10-19,231579,faculdade,Cânones,Cânones
3,1665-10-19,167251,instituta,1665-10-19,19.10.1665 1665-10-19
4,1665-10-19,231579,instituta,1665-10-19,19.10.1665 1665-10-19
5,1665-10-19,167251,naturalidade,Campo Maior,
6,1665-10-19,231579,naturalidade,Campo Maior,
7,1665-10-19,167251,nome,Pedro Simões,"Pedro Simões Esteves, vide Simões"
8,1665-10-19,231579,nome,Pedro Simões,
9,1665-10-19,167251,nome,Pedro Simões Esteves,


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,faculdade,nome-pai
0,149046,Bernardo José de Azevedo,Vieira,Paredes,1725-10-01,Cânones,José de Azevedo Vieira
1,216902,Bernardo José de Azevedo Vieira,Azevedo,Paredes,1726-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1725-10-01,149046,faculdade,Cânones,Cânones
1,1725-10-01,149046,instituta,1725-10-01,01.10.1725 1725-10-01
2,1725-10-01,149046,naturalidade,Paredes,
3,1725-10-01,149046,nome,Bernardo José de Azevedo,
4,1725-10-01,149046,nome,Bernardo José de Azevedo Vieira,"Bernardo José de Azevedo, vide Vieira"
5,1725-10-01,149046,nome-nota,padre,
6,1725-10-01,149046,nome-pai,José de Azevedo Vieira,
7,1725-10-01,149046,nome-vide,Vieira,
8,1725-10-01,149046,padre,sim,padre
9,1725-10-01,149046,uc-entrada,1725-10-01,


In [549]:
vide_plus.loc['217701']

name                                         José de Santo António
sex                                                              m
nome-vide                                                Lencastre
nome-geografico                                           ***NA***
faculdade                                                      NaN
faculdade.date                                                 NaN
faculdade.obs                                                  NaN
nome-pai                                                       NaN
uc-entrada                                              0000-00-00
uc-saida                                                0000-00-00
uc-saida.date                                           0000-00-00
uc-saida.obs                                                  None
rec_type                                                       see
loookup                                                           
vide_type                                                     

In [550]:

# set back the missing nome_geografico to null
no_geo_filter = vide_plus['nome-geografico'] == '***NA***'
vide_plus.loc[no_geo_filter,'nome-geografico'] = np.nan
print(len(vide_plus[vide_plus['nome-geografico'] == '***NA***']))
vide_plus.info()

0
<class 'pandas.core.frame.DataFrame'>
Index: 9438 entries, 198423 to 230315
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             9438 non-null   object
 1   sex              9438 non-null   object
 2   nome-vide        9286 non-null   object
 3   nome-geografico  8916 non-null   object
 4   faculdade        4872 non-null   object
 5   faculdade.date   4872 non-null   object
 6   faculdade.obs    4853 non-null   object
 7   nome-pai         3547 non-null   object
 8   uc-entrada       9438 non-null   object
 9   uc-saida         9438 non-null   object
 10  uc-saida.date    9438 non-null   object
 11  uc-saida.obs     0 non-null      object
 12  rec_type         9438 non-null   object
 13  loookup          9438 non-null   object
 14  vide_type        9438 non-null   object
 15  lookup           9438 non-null   object
 16  name_sp          9438 non-null   object
 17  lookup_sp        9438 non-nul


Why:
* 230791	Abrantes	Manuel Fernandes da Silveira	None	Manuel Fernandes da Silveira	Fernandes da Silveira	0000-00-00	NaN
* 230756	Abrantes	Manuel Fernandes da Silveira	None	Manuel da Silveira	Manuel Fernandes da Silveira	1598-10-12	Leis

### Merge the result of the two methods, check for errors, again

Since we are mixing matches from two different methods it can happen that, together, they introduce new errrors,  specially transitive matches.

In [57]:
import networkx as nx
method = 'data'  # short for merged methods

matched_rand_pairs = match_records['matched_pairs']['random']
matched_seq_pairs = match_records['matched_pairs']['sequential']
matched_pairs = list(set(matched_rand_pairs + matched_seq_pairs))
print("Number of matched pairs (union of both methods)  :",len(matched_pairs))
match_records['matched_pairs'][method] = matched_pairs
match_info.loc['matched_pairs',method] = len(matched_pairs)

rec_errors_seq = match_records['records_error']['sequential']
rec_errors_rand = match_records['records_error']['random']


# And now filter, this is necessary because error detected are different in each method
matched_multiple = []
matched_single = []

origins = [o for (o,d,t) in matched_pairs]
destinations = [d for (o,d,t) in matched_pairs]
rec_in_matches = origins + destinations
for i in rec_in_matches:
    c = rec_in_matches.count(i)
    if i == '172681':
        pass
    if c >2:
        matched_multiple.append(i)
    elif c == 1:
        matched_single.append(i)
  
        
matched_multiple = list(set(matched_multiple))
matched_single = list(set(matched_single))
print("Number of matches random              :",len(matched_rand_pairs))
print("Number of matches sequential          :",len(matched_seq_pairs))
print("Number of matches both                :",len(matched_pairs))
print("Records with more than two matches    :", len(matched_multiple))
print("Records with just one match           :", len(matched_single))

# alternative method, perhaps more informative:
pairs_to_check = matched_pairs
print()
print("The following pairs have to reverse match:")
asymmetric_pairs = []
for (o,d,t) in matched_pairs:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in matched_pairs:
        asymmetric_pairs.append((o,d,t))
        print((o,d,t))

# now test for transitivity
G = nx.Graph()
simple_pairs = [(o,d) for (o,d,t) in matched_pairs]
G.add_edges_from(simple_pairs)
transitive  = [c for c in nx.connected_components(G) if len(c) > 2]

# number of records in ambiguous matchesamb_records = [item for amb in transitive for item in amb]
namb_records = len(set(amb_records))
print("Records in ambiguous matches          :", namb_records)

match_records['records_transitive'][method] = list(amb_records)
match_info.loc['records_transitive'][method] = namb_records

for amb in transitive:
    print(amb)
print("Are multiple in ambiguous             :",set(matched_multiple).issubset(set(amb_records)))

vide_plus.loc[matched_single,'match_error'] = False
vide_plus.loc[matched_single,'match_obs'] = "W01-Single match (asymmetric) "+method

vide_plus.loc[matched_multiple,'match_error'] = True
vide_plus.loc[matched_multiple,'match_obs'] = "E03-Multiple match"+method

vide_plus.loc[amb_records,'match_error'] = True
vide_plus.loc[amb_records,'match_obs'] = "E04-Ambiguity in match "+method

# do a new list of records in error
rec_errors = set(amb_records).union(matched_multiple)

records = set(rec_in_matches)
rec_ok = records.difference(rec_errors)

print("Records involved in matches           :", len(records))
print("Records matched without errors        :", len(rec_ok))
print("Records matched with errors           :", len(rec_errors))

match_records['records_error'][method] = list(rec_errors)
match_info.loc['records_error'][method] = len(rec_errors)
match_records['records_matched_ok'][method] = list(rec_ok)
match_info.loc['records_matched_ok'][method] = len(rec_ok)
match_records['records_matched'][method] = list(records)
match_info.loc['records_matched'][method] = len(records)

aka = match_records['aka']['data']
aka_ok = set(aka).intersection(rec_ok)
see = match_records['see']['data']
see_ok = set(see).intersection(rec_ok)
match_info.loc['aka_matched_ok'][method] = len(aka_ok)
match_records['aka_matched_ok'][method] = list(aka_ok)
match_info.loc['see_matched_ok'][method] = len(see_ok)
match_records['see_matched_ok'][method] = list(see_ok)

pairs_ok = set([(o,d,t) for (o,d,t) in match_records['matched_pairs'][method]
                                                        if o not in rec_errors and d not in rec_errors])
match_records['matched_pairs_ok'][method] = list(pairs_ok)
match_info.loc['matched_pairs_ok'][method] = len(pairs_ok)


Number of matched pairs (union of both methods)  : 3818
Number of matches random              : 3804
Number of matches sequential          : 3644
Number of matches both                : 3818
Records with more than two matches    : 38
Records with just one match           : 76

The following pairs have to reverse match:
('241026', '241012', 'see-aka')
('177796', '184419', 'aka-see')
('234767', '203369', 'see-aka')
('212796', '242752', 'see-aka')
('225717', '173224', 'aka-see')
('164227', '248237', 'see-aka')
('223100', '252345', 'see-aka')
('195505', '160216', 'see-aka')
('221053', '207054', 'see-aka')
('209320', '151354', 'see-aka')
('181367', '214929', 'see-aka')
('206278', '151170', 'see-aka')
('130534', '163482', 'see-aka')
('206505', '158689', 'see-aka')
('182659', '233550', 'aka-see')
('186417', '238066', 'see-see')
('166987', '163021', 'aka-see')
('203487', '206151', 'see-see')
('199474', '193320', 'see-aka')
('136704', '227086', 'see-see')
('204089', '196842', 'see-aka')
('18427

Detect asymmetries in matches with no errors (asymmetries are not considered errors)

In [58]:
method = 'data'  # short for merged methods

pairs_ok = match_records['matched_pairs_ok'][method]
# pairs 
pairs_see_aka = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
pairs_aka_see = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
pairs_aka_aka = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'aka-aka']
pairs_see_see = [(o,d,mtype) for (o,d,mtype) in pairs_ok if mtype == 'see-see']

aka_in_see_aka = [d for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
aka_in_aka_see = [o for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
asymmetry_aka = sorted(list(set(aka_in_see_aka) ^ set(aka_in_aka_see)))

print("Asymmetry for aka:", asymmetry_aka )
print("N Asymmetry for aka:", len(asymmetry_aka) )

see_in_see_aka = [o for (o,d,mtype) in pairs_ok if mtype == 'see-aka']
see_in_aka_see = [d for (o,d,mtype) in pairs_ok if mtype == 'aka-see']
asymmetry_see = sorted(list(set(see_in_see_aka) ^ set(see_in_aka_see)))

print("Asymmetry for see:", asymmetry_see )
print("N Asymmetry for see:", len(asymmetry_see))

# alternative method, perhaps more informative:
print()
pairs_to_check = pairs_ok
print("Asymmetric matches: for each match bellow the reverse one was not found")
asymmetric_pairs = []
for (o,d,t) in pairs_to_check:
    if t == 'see-aka':
        rt = 'aka-see'
    elif t == 'aka-see':
        rt = 'see-aka'
    else:
        rt = t
    if (d,o,rt) not in pairs_to_check:
        asymmetric_pairs.append((o,d,t))
        print((o,d,t))

asymmetric_records = set([o for (o,d,t) in asymmetric_pairs] + [d for (o,d,t) in asymmetric_pairs])
print()

match_info.loc['records_asymmetric',method] = len(asymmetric_records)
match_records['records_asymmetric'][method] = list(asymmetric_records)
# records
rec_matched = set([id for (id,d,t) in pairs_ok]+
                  [id for (o,id,t) in pairs_ok])
rec_see_aka = set([id for (id,d,t) in pairs_see_aka])
rec_aka_see = set([id for (o,id,t) in pairs_aka_see])
rec_see_see = set([id for (id,d,t) in pairs_see_see] +
                  [id for (o,id,t) in pairs_see_see])
rec_aka_aka = set([id for (id,d,t) in pairs_aka_aka] +
                  [id for (o,id,t) in pairs_aka_aka])

match_records['records_see_aka'][method] = list(rec_see_aka)
match_records['records_aka_see'][method] = list(rec_aka_see)
match_records['records_aka_aka'][method] = list(rec_aka_aka)
match_records['records_see_see'][method] = list(rec_see_see)

match_info.loc['records_see_aka'][method] = len(rec_see_aka)
match_info.loc['records_aka_see'][method] = len(rec_aka_see)
match_info.loc['records_aka_aka'][method] = len(rec_aka_aka)
match_info.loc['records_see_see'][method] = len(rec_see_see)

# new
match_info.loc['aka_matched'][method] = len(rec_aka_see.union(rec_aka_aka))
match_records['aka_matched'][method] = list(rec_aka_see.union(rec_aka_aka))
match_info.loc['see_matched'][method] = len(rec_see_aka.union(rec_see_see))
match_records['see_matched'][method] = list(rec_see_aka.union(rec_see_see))

matched_rand_df = pd.DataFrame(pairs_ok, columns=['from','to','type'])
matched_rand_df.groupby('type').count()

Asymmetry for aka: ['128907', '131947', '139423', '141866', '142554', '151170', '151354', '158689', '160216', '163482', '166987', '177796', '179399', '181415', '182602', '182659', '196842', '197630', '198768', '200713', '203369', '207054', '208712', '209662', '211581', '214929', '225717', '230482', '230756', '239847', '241012', '242752', '245318', '247449', '248237', '252345']
N Asymmetry for aka: 36
Asymmetry for see: ['128114', '129384', '130534', '134006', '140367', '148963', '149251', '163021', '164227', '173224', '181367', '184271', '184419', '185191', '188508', '194373', '195505', '204089', '205772', '206278', '206505', '209320', '212796', '221053', '223100', '226997', '230791', '233397', '233550', '234767', '235056', '241026', '241346', '251029', '252478', '256874']
N Asymmetry for see: 36

Asymmetric matches: for each match bellow the reverse one was not found
('241026', '241012', 'see-aka')
('177796', '184419', 'aka-see')
('234767', '203369', 'see-aka')
('212796', '242752', 's

Unnamed: 0_level_0,from,to
type,Unnamed: 1_level_1,Unnamed: 2_level_1
aka-aka,188,188
aka-see,1722,1722
see-aka,1746,1746
see-see,9,9


In [59]:
match_info.fillna("")

Unnamed: 0,data,sequential,random
aka,3062,,
aka_fac,3035,,
aka_geo,2973,,
aka_matched,1910,1913.0,1970.0
aka_matched_ok,1940,1907.0,1897.0
aka_pai,1619,,
matched_pairs,3818,3644.0,3804.0
matched_pairs_ok,3665,3628.0,3614.0
nodate,5763,,
nodate_novide,141,,


##### Check role of no date novide records in assymetries matches

Since these records have no "vide" expression they do not generate the symmetric name forookup.

In [60]:
see_with_no_vide = set(match_records['vide_plus']['data']) - set(match_records['vide']['data'])
print("Number of records with see and no vide: ",len(see_with_no_vide))
asymmetric_see_no_vide = list(set(asymmetric_records).intersection(see_with_no_vide))
print("See no vide part in asymmetric matches: ",len(asymmetric_see_no_vide),set(asymmetric_records).intersection(see_with_no_vide))

Number of records with see and no vide:  141
See no vide part in asymmetric matches:  22 {'181367', '251029', '195505', '185191', '252478', '209320', '140367', '134006', '149251', '184271', '129384', '256874', '194373', '204089', '130534', '233397', '234767', '164227', '223100', '128114', '148963', '136704'}


##### Check asymmetric pairs

In [61]:

match_list = asymmetric_pairs

pd.set_option('display.max_rows',250)

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

show_only = 6
for (o,d,t) in match_list[:show_only]:
    print(o,d,t)
    display_group_attributes([o,d],
                             header_cols=['uc-entrada','name','nome-vide','naturalidade','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')

241026 241012 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,241026,0000-00-00,Luís Alves Mergulhão,Luís Alves Mergulhão,Lisboa,,
1,241012,1601-10-29,Luís de Mergulhão,Luís Alves Mergulhão,Lisboa,Leis,Diogo Mergulhão


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,241026,naturalidade,Lisboa,
1,0000-00-00,241026,nome,Luís Alves Mergulhão,"Luís Alves Mergulhão, vide Luís Alves Mergulhão"
2,0000-00-00,241026,nome,Luís Alves Mergulhão,
3,0000-00-00,241026,nome-vide,Luís Alves Mergulhão,
4,0000-00-00,241026,uc-entrada,0000-00-00,
5,0000-00-00,241026,uc-saida,0000-00-00,
6,1601-10-29,241012,faculdade,Leis,Leis
7,1601-10-29,241012,instituta,1601-10-29,29.10.1601 1601-10-29
8,1601-10-29,241012,naturalidade,Lisboa,
9,1601-10-29,241012,nome,Luís Alves Mergulhão,"Luís de Mergulhão, vide Luís Alves Mergulhão"


177796 184419 aka-see


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,184419,0000-00-00,Domingos Marques,Giraldes,Idanha-a-Nova,,
1,177796,1658-10-15,Domingos Marques Giraldes,Giraldes,Idanha-a-Nova,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,184419,naturalidade,Idanha-a-Nova,
1,0000-00-00,184419,nome,Domingos Marques,
2,0000-00-00,184419,nome,Domingos Marques Giraldes,"Domingos Marques, vide Giraldes"
3,0000-00-00,184419,nome-vide,Giraldes,
4,0000-00-00,184419,uc-entrada,0000-00-00,
5,0000-00-00,184419,uc-saida,0000-00-00,
6,1658-10-15,177796,faculdade,Cânones,Cânones
7,1658-10-15,177796,instituta,1658-10-15,15.10.1658 1658-10-15
8,1658-10-15,177796,naturalidade,Idanha-a-Nova,
9,1658-10-15,177796,nome,Domingos Marques Giraldes,"Domingos Marques Giraldes, vide Giraldes"


234767 203369 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,234767,0000-00-00,Francisco Leitão Pereira,,Índia,,
1,203369,1550-10-10,Francisco Leitão,Pereira,"Índia, Goa",Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,234767,naturalidade,Índia,
1,0000-00-00,234767,nome,Francisco Leitão Pereira,
2,0000-00-00,234767,uc-entrada,0000-00-00,
3,0000-00-00,234767,uc-saida,0000-00-00,
4,1550-10-10,203369,faculdade,Cânones,Faculdade inferida
5,1550-10-10,203369,instituta,1550-10-10,"""""""  1 curso em Instituta 10.10.1550: 2 meses em Cânones - Junho e Julho de 1556 - em Cânones, 10 meses de Outubro de 1557 a 31.07.1558 a 23.07.1559  """""""
6,1550-10-10,203369,naturalidade,"Índia, Goa",Índia (Goa)
7,1550-10-10,203369,nome,Francisco Leitão,
8,1550-10-10,203369,nome,Francisco Leitão Pereira,"Francisco Leitão, vide Pereira"
9,1550-10-10,203369,nome-vide,Pereira,


212796 242752 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,212796,0000-00-00,Manuel Mendes,Mendes,Viana do Alentejo,Cânones,Manuel Mendes Pimentel
1,242752,1612-10-27,Manuel Mendes Pimentel,Mendes,Viana do Alentejo,Cânones,Manuel Mendes Pimentel
2,242752,1612-10-27,Manuel Mendes Pimentel,Mendes,Viana do Alentejo,Leis,Manuel Mendes Pimentel


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,212796,faculdade,Cânones,Cânones
1,0000-00-00,212796,naturalidade,Viana do Alentejo,
2,0000-00-00,212796,nome,Manuel Mendes,"Manuel Mendes, vide Mendes"
3,0000-00-00,212796,nome,Manuel Mendes,
4,0000-00-00,212796,nome-pai,Manuel Mendes Pimentel,
5,0000-00-00,212796,nome-vide,Mendes,
6,0000-00-00,212796,uc-entrada,0000-00-00,
7,0000-00-00,212796,uc-saida,0000-00-00,
8,1612-10-27,242752,faculdade,Cânones,Faculdade corrigida
9,1612-10-27,242752,faculdade,Leis,Faculdade corrigida


225717 173224 aka-see


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,173224,0000-00-00,João Nunes,Rogado,Terena,Leis,André Rogado
1,225717,1587-10-03,João Nunes Rogado,Rogado,Terena,Leis,André Rogado


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,173224,faculdade,Leis,Leis
1,0000-00-00,173224,naturalidade,Terena,
2,0000-00-00,173224,nome,João Nunes,
3,0000-00-00,173224,nome,João Nunes Rogado,"João Nunes, vide Rogado"
4,0000-00-00,173224,nome-pai,André Rogado,
5,0000-00-00,173224,nome-vide,Rogado,
6,0000-00-00,173224,uc-entrada,0000-00-00,
7,0000-00-00,173224,uc-saida,0000-00-00,
8,1587-10-03,225717,faculdade,Leis,Leis
9,1587-10-03,225717,matricula-faculdade,Leis,1587.10.03


164227 248237 see-aka


Unnamed: 0,id,uc-entrada,name,nome-vide,naturalidade,faculdade,nome-pai
0,164227,0000-00-00,Aires Fernandes Freire,,Lisboa,,
1,248237,1552-00-00,Aires Fernandes,Freire,Lisboa,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,0000-00-00,164227,naturalidade,Lisboa,
1,0000-00-00,164227,nome,Aires Fernandes Freire,
2,0000-00-00,164227,uc-entrada,0000-00-00,
3,0000-00-00,164227,uc-saida,0000-00-00,
4,1552-00-00,248237,faculdade,Cânones,Faculdade inferida
5,1552-00-00,248237,naturalidade,Lisboa,
6,1552-00-00,248237,nome,Aires Fernandes,
7,1552-00-00,248237,nome,Aires Fernandes Freire,"Aires Fernandes, vide Freire"
8,1552-00-00,248237,nome-vide,Freire,
9,1552-00-00,248237,uc-entrada,1552-00-00,


Add match information to the records

In [62]:
import pandas as pd

pairs = match_records['matched_pairs_ok']['data']
def get_match(id,pairs):
    match_list = [(d,mtype) for (o,d,mtype) in pairs if o == id]
    if len(match_list) == 0:
        return (None,None)
    else:
        return match_list[0]

ids = vide_plus.index.values
matches = [get_match(id,pairs) for id in ids]
cols = pd.DataFrame(matches,columns=['match','match_type'], index=ids)
vide_plus = pd.concat([vide_plus,cols],axis=1)


## Match results

### Match general statistics

In [63]:
nvide_plus = match_info.loc['vide_plus','data']
match_info.fillna("")
vars_perc_vide = ['aka','aka_fac','aka_geo','aka_pai',
                'nodate','nodate_novide',
                'records_matched','records_matched_ok',
                'records_see_aka','records_see_see',
                'records_aka_see','records_aka_aka',
                'records_transitive',
                'vide','vide_plus']

match_info.loc[vars_perc_vide,'perc_vide_plus'] = match_info.loc[vars_perc_vide,'data']/nvide_plus

nrec_matched = match_info.loc['records_matched_ok','data']
vars_perc_matches = ['records_matched_ok',
                     'records_see_aka','records_see_see',
                     'records_aka_see','records_aka_aka',
                     'records_transitive']
match_info.loc[vars_perc_matches,'perc_matched_ok'] = match_info.loc[vars_perc_matches,'data']/nrec_matched
                     
nsee = match_info.loc['see','data']
vars_perc_see = ['see_matched','see_matched_ok','records_see_aka','records_see_see', 'see','see_fac','see_geo','see_pai']
match_info.loc[vars_perc_see,'perc_type'] = match_info.loc[vars_perc_see,'data']/nsee
match_info.loc[vars_perc_see,'type'] = 'see'


naka = match_info.loc['aka','data']
vars_perc_aka = ['aka','aka_matched','aka_matched_ok','aka_fac','aka_geo','aka_pai','records_aka_see','records_aka_aka']
match_info.loc[vars_perc_aka,'perc_type'] = match_info.loc[vars_perc_aka,'data']/naka
match_info.loc[vars_perc_aka,'type'] = 'aka'


nmatched_pairs = match_info.loc['matched_pairs','data']
vars_perc_matched = ['matched_pairs','matched_pairs_ok']
match_info.loc[vars_perc_matched,'perc_type'] = match_info.loc[vars_perc_matched,'data']/nmatched_pairs
match_info.loc[vars_perc_matched,'type'] = 'matched_pairs'

nrec_matched = match_info.loc['records_matched','data']
vars_perc_rec_matched = ['records_matched','records_matched_ok','records_transitive']
match_info.loc[vars_perc_rec_matched,'perc_type'] = match_info.loc[vars_perc_rec_matched,'data']/nrec_matched
match_info.loc[vars_perc_rec_matched,'type'] = 'records_mached'

In [64]:
match_info.fillna(" ")

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


### Generate matching file and dataframe

In [67]:

matching_view_cols = ['match','nome-geografico','uc-entrada','uc-saida','name','nome-vide','lookup','nome-pai','vide_type','faculdade','match_type','match_error','match_obs']

matched_error = vide_plus[vide_plus['match_error']==True]
matched_error_index = matched_error.index.unique()

matched_index = match_records['records_matched']['data']
matched_ok_index = list(set(matched_index)-set(matched_error_index))

matched = vide_plus.loc[matched_index].sort_values(['sort_key','nome-geografico','uc-entrada'])[matching_view_cols]
nmatched = len(matched_index)
print("Number of matched records:",nmatched)
matched.to_csv('../inferences/cross-references/vide_matched.csv',sep=',',)
matched.head(40)


Number of matched records: 3818


Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
179898,151044.0,Pinheiro de Ázere,0000-00-00,0000-00-00,Adrião da Barca de Gouveia,Barca,Adrião da Barca,,cut,,see-aka,False,
151044,179898.0,Pinheiro de Ázere,1596-10-19,1620-07-11,Adrião da Barca,Gouveia,Adrião da Barca Gouveia,Baltasar Cardoso,add,Cânones,aka-see,False,
151589,131748.0,Viana,0000-00-00,0000-00-00,Afonso de Barros,Caminha,Afonso de Barros Caminha,,add,,see-aka,False,
131748,151589.0,Viana,1684-10-01,1687-10-01,Afonso de Barros Caminha,Barros,Afonso de Barros,,cut,Cânones,aka-see,False,
250325,151588.0,Estremoz,0000-00-00,0000-00-00,Afonso de Barros Preto,Barros,Afonso de Barros,Francisco Dias Zagalo,cut,,see-aka,False,
151588,250325.0,Estremoz,1563-11-16,1577-10-11,Afonso de Barros,Preto,Afonso de Barros Preto,,add,Leis,aka-see,False,
181618,186611.0,Viseu,0000-00-00,0000-00-00,Afonso Botelho Machado,Botelho,Afonso Botelho,,cut,Leis,see-aka,False,
186611,181618.0,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Cânones,aka-see,False,
186611,181618.0,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Leis,aka-see,False,
221796,164067.0,Elvas,0000-00-00,0000-00-00,Afonso Frausto Segurado,Frausto,Afonso Frausto,,cut,,see-aka,False,


### Matches, excluding errors



In [68]:
matched_ok_index = match_records['records_matched_ok']['data']
matched.loc[matched_ok_index].sort_values(['name','nome-geografico','uc-entrada']).head(40).fillna('')

Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
151044,179898,Pinheiro de Ázere,1596-10-19,1620-07-11,Adrião da Barca,Gouveia,Adrião da Barca Gouveia,Baltasar Cardoso,add,Cânones,aka-see,False,
179898,151044,Pinheiro de Ázere,0000-00-00,0000-00-00,Adrião da Barca de Gouveia,Barca,Adrião da Barca,,cut,,see-aka,False,
186611,181618,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Cânones,aka-see,False,
186611,181618,Viseu,1609-10-01,1624-10-30,Afonso Botelho,Machado,Afonso Botelho Machado,António Machado,add,Leis,aka-see,False,
181618,186611,Viseu,0000-00-00,0000-00-00,Afonso Botelho Machado,Botelho,Afonso Botelho,,cut,Leis,see-aka,False,
164067,221796,Elvas,1621-00-00,1627-02-12,Afonso Frausto,Segurado,Afonso Frausto Segurado,,add,Cânones,aka-see,False,
221796,164067,Elvas,0000-00-00,0000-00-00,Afonso Frausto Segurado,Frausto,Afonso Frausto,,cut,,see-aka,False,
177343,203124,Fronteira,1577-10-03,1585-11-23,Afonso Garcia,Tinoco,Afonso Garcia Tinoco,Pedro Garcia Tinoco,add,Leis,aka-see,False,
203124,177343,Fronteira,0000-00-00,0000-00-00,Afonso Garcia Tinoco,Garcia,Afonso Garcia,Pedro Garcia Tinoco,cut,Leis,see-aka,False,
187008,190253,Arruda,0000-00-00,0000-00-00,Afonso Henriques,Homem,Afonso Henriques Homem,,add,Cânones,see-aka,False,


### Show diferences in matching results

In [570]:
match_info

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


####  Only matched in random mode

The extra success of the random mode comes from a better tolerance to variations of geographic names.

This is because the random mode uses a similarity factor to find students of with the same birth place while the sequential method uses sorting on geographic name and names to get the matches adjacent.

Each methods manages to succeed in cases where the other fails, but random is more efficient.

In [69]:
matched_rand_index = match_records['records_matched_ok']['random']
matched_seq_index = match_records['records_matched_ok']['sequential']
matched_error_index = match_records['records_error']['data']

matched_rand_only = list(set(matched_rand_index)-set(matched_seq_index)-set(matched_error_index))
nmatched_rand_only = len(matched_rand_only)
print(f"Number of records matched only in random access mode (errors excluded): {nmatched_rand_only}")
print()
print("Sample:")
matched.loc[matched_rand_only].sort_values(['name','nome-geografico','uc-entrada',])[matching_view_cols].head(40)

Number of records matched only in random access mode (errors excluded): 68

Sample:


Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
183928,222335,Nogoselo,1756-10-01,1756-10-01,Agostinho Manuel,Agostinho Manuel de Sequeira,Agostinho Manuel de Sequeira,,rep,Cursos jurídicos (Cânones ou Leis),aka-see,False,
222335,183928,Nagoselo,0000-00-00,0000-00-00,Agostinho Manuel de Sequeira,Agostinho Manuel,Agostinho Manuel,,cut,,see-aka,False,
148502,213090,Santiago do Cacém,1650-10-12,1658-03-30,André Ascenso,Salema,André Ascenso Salema,Manuel Raposo Pessanha,add,Cânones,aka-see,False,
148502,213090,Santiago do Cacém,1650-10-12,1658-03-30,André Ascenso,Salema,André Ascenso Salema,Manuel Raposo Pessanha,add,Leis,aka-see,False,
213090,148502,Santiago de Cacém,0000-00-00,0000-00-00,André Ascenso Salema,Ascenso,André Ascenso,,cut,Leis,see-aka,False,
178267,251037,Ilha Terceira,1567-10-01,1575-05-17,André Gomes,Monteiro,André Gomes Monteiro,António Vaz,add,Cânones,aka-see,False,
251037,178267,Ilha da Terceira,0000-00-00,0000-00-00,André Gomes Monteiro,Gomes,André Gomes,,cut,,see-aka,False,
222372,140377,Várzae de Meruge,1728-10-01,1731-10-01,André de Sequeira,Abranches,André de Sequeira Abranches,,add,Cânones,aka-see,False,
140377,222372,Várzea de Meruge,0000-00-00,0000-00-00,André de Sequeira Abranches,Sequeira,André de Sequeira,,cut,,see-aka,False,
238571,202253,Setã,1632-01-10,1659-11-08,António Lopes,Leitão,António Lopes Leitão,António André,add,Cânones,aka-see,False,


#### Only matched in sequential mode

A few cases sequential is more successful.

In [70]:
pd.set_option('display.max_rows',100)


matched_seq_only = list(set(matched_seq_index)-set(matched_rand_index)-set(matched_error_index))
nmatched_not_rand = len(matched_seq_only)
print(f"Number of records matched only in sequential mode (errors excluded): {nmatched_not_rand}")
print()
matched.loc[matched_seq_only].sort_values(['name','nome-geografico','uc-entrada',]).head(20)[matching_view_cols]


Number of records matched only in sequential mode (errors excluded): 14



Unnamed: 0,match,nome-geografico,uc-entrada,uc-saida,name,nome-vide,lookup,nome-pai,vide_type,faculdade,match_type,match_error,match_obs
241634,250486,Algoso,0000-00-00,0000-00-00,António Pimentel,Morais,António Pimentel Morais,,add,Cânones,see-aka,False,
250486,241634,Algozo,1656-10-15,1665-03-24,António Pimentel Morais,Pimentel,António Pimentel,,cut,Leis,aka-see,False,
212564,151964,Mouta,1613-03-23,1615-10-05,Jorge Vaz,Barros,Jorge Vaz Barros,Manuel Vaz,add,Teologia,aka-see,False,
151964,212564,Mouta ou Moita,0000-00-00,0000-00-00,Jorge Vaz de Barros,Vaz,Jorge Vaz,Manuel Vaz,cut,Teologia,see-aka,False,
188868,131782,Azoia de Baixo,1748-10-01,1752-10-01,José Henriques,Figueira,José Henriques Figueira,,add,Cânones,aka-see,False,
131782,188868,Azoia,0000-00-00,0000-00-00,José Henriques Figueira,José Henriques,José Henriques,,cut,Cânones,see-aka,False,
239722,242576,São José de Godim,0000-00-00,0000-00-00,José Manuel Borges de Sousa,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa Pinto,,rep,,see-aka,False,
242576,239722,Gondim,1762-10-01,1766-10-01,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa,José Manuel Borges de Sousa,,cut,Leis,aka-see,False,
242576,239722,São José,1762-10-01,1766-10-01,José Manuel Borges de Sousa Pinto,José Manuel Borges de Sousa,José Manuel Borges de Sousa,,cut,Leis,aka-see,False,
136003,143938,Lobelhe do Mato,1642-10-20,1648-02-29,Manuel Cardoso,de Almeida,Manuel Cardoso de Almeida,Agostinho Cardoso,add,Medicina,aka-see,False,


#### Analyse Aka to Aka (see also) links.

These are true duplicates. Some of them could be prevented with a check on dates, but this serves to assess the extend of duplicate records in the data.

Analysis:
* 150364 265272: strange because the two records are exactly the same except for the name and vide **and the date on instituta** One of the vide should be a "see" record.
* 129553 191232: also strange: same father name, both records contain the same enrollment in "instituta" in 1601-10-14. the record 191232 
  has the faculdade "Medicina" and an enrollment date of 1613.01.12, while keeping the instituta date. 
* 207361 251998 This looks like a late addition to the "vide" scheme, a note on record 251998 states "Mudou o nome no ano de 1573 aos 03.06 - Atos e Graus 10, fl. 143, caderno 3º". To be in conformity 251988 should be a "see" records with no dates.
* 190606 248991 the record 248991 should be a "See" it retains a single enrolment date  in 1588.10.01 which also exists in the paired record. With no dates and the redundant enrollment removed 248991 this would have been a normal match
* 5 188413 193737: this is a true duplicate but the shorter record 188413 seems to contain redundant information except that the faculdade is 
  recorded as "Leis" while in 193737 is recorded as "Cânones". Note that except for the name of the faculdade 193737 always refers "Leis" in the various fields, including the degree.

In [71]:
vide_plus.columns

Index(['name', 'sex', 'nome-vide', 'nome-geografico', 'faculdade',
       'faculdade.date', 'faculdade.obs', 'nome-pai', 'uc-entrada', 'uc-saida',
       'uc-saida.date', 'uc-saida.obs', 'rec_type', 'loookup', 'vide_type',
       'lookup', 'name_sp', 'lookup_sp', 'sort_key', 'match_error',
       'match_obs', 'match', 'match_type'],
      dtype='object')

In [72]:
from timelinknb.pandas import display_group_attributes

date_threshold = 15  # difference in years for flagging false duplicate.
show_only = 20

no_show=['código-de-referência','data-do-registo','url','faculdade.ano','naturalidade.ano',
         'matricula-faculdade.ano','nome-apelido','nome-primeiro','nome-geografico.ano',
         'grau.ano','matricula-outra.ano','nome-geografico','instituta.ano']

pairs = match_records['matched_pairs_ok']['data']
show_pairs = [(o,d,t) for o,d,t in pairs if t == 'aka-aka' and o<d]

aka_aka_same_date = []
aka_aka_far_apart = []
aka_aka_possible_see = []
for o,d,t in show_pairs:
    if o == '141854':
        pass
    # get the dates of entry to filter those that cannot be then same
    date_o = matched.loc[[o]]['uc-entrada'][0]
    date_d = matched.loc[[d]]['uc-entrada'][0]
    date_s_o = matched.loc[[o]]['uc-saida'][0]
    date_s_d = matched.loc[[d]]['uc-saida'][0]

    if date_o == date_s_o:
        aka_aka_possible_see.append(o)
        
    if date_d == date_s_d:
        aka_aka_possible_see.append(d)

    if date_o == date_d and date_s_o == date_s_d:
        # print("aka-aka pair with same date:",date_o,(o,d,t))
        aka_aka_same_date.append((o,d,t))
    else:
        year_o = int(date_o[:4])    
        year_d = int(date_d[:4])
        if max(year_o,year_d) - min(year_o,year_d) > date_threshold:
            # print(f"False aka-aka: records more than {date_threshold} years appart",(o,d,t),date_s_o,date_d)
            aka_aka_far_apart.append((o,d,t))

print(f"Number of aka-aka pairs with the same date:",len(aka_aka_same_date))
print(f"Number of aka-aka pairs more {date_threshold} years apart:",len(aka_aka_far_apart))
print(f"Number of possible false aka records (records with a single date, probably a see record)",len(aka_aka_possible_see))


print(f"aka-aka matches (show only {show_only}) of {len(show_pairs)}:")
i = 0
for o,d,t in show_pairs[:show_only]:
    i += 1
    print(i,o,d)
    if (o,d,t) in aka_aka_same_date:
        print("SAME DATES: Possible double registration of the same card")
    elif (o,d,t) in aka_aka_far_apart:
        print(f"FAR APART >{date_threshold} years: possible false match, records chronologically affar")
    if o in aka_aka_possible_see:
        print(f"{o} is a possible 'see' record")
    if d in aka_aka_possible_see:
        print(f"{d} is a possible 'see' record")
    
    display_group_attributes([o,d],
                             header_cols=['name','nome-vide','naturalidade','uc-entrada','uc-saida','faculdade','nome-pai'],
                             exclude_attributes=no_show,
                             sort_attributes=['date','type','value'],
                             cmap_name='Pastel1')


Number of aka-aka pairs with the same date: 5
Number of aka-aka pairs more 15 years apart: 9
Number of possible false aka records (records with a single date, probably a see record) 38
aka-aka matches (show only 20) of 94:
1 143239 163231


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,143239,Jerónimo de Almeida,Ribeiro,Ferreira,1553-10-00,1559-07-27,Leis,
1,163231,Jerónimo de Almeida Ribeiro,Almeida,Ferreira,1560-01-24,1560-07-24,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1553-00-00:1555-06-00,143239,instituta,1553-00-00:1555-06-00,"curso: 1 curso de Instituta, 1 de Código desde Outubro de 1553 a Junho de 1555"
1,1553-10-00,143239,faculdade,Leis,Leis
2,1553-10-00,143239,naturalidade,Ferreira,
3,1553-10-00,143239,nome,Jerónimo de Almeida,
4,1553-10-00,143239,nome,Jerónimo de Almeida Ribeiro,"Jerónimo de Almeida, vide Ribeiro"
5,1553-10-00,143239,nome-vide,Ribeiro,
6,1553-10-00,143239,uc-entrada,1553-10-00,
7,1553-10-00,143239,uc-entrada.ano,1553,
8,1559-07-27,143239,exame,Exame para Bacharel,27.07.1559
9,1559-07-27,143239,grau,Bacharel em Leis,"""1559/07/27"""


2 129553 191232
129553 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,191232,Manuel de Castro,Caldeira,Abrantes,1601-10-14,1613-01-12,Medicina,Lopo de Castro
1,129553,Manuel de Castro Caldeira,Castro,Abrantes,1601-10-14,1601-10-14,Cursos jurídicos (Cânones ou Leis),Lopo Castro


Unnamed: 0,date,id,type,value,attr_obs
0,1601-10-14,129553,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
1,1601-10-14,191232,faculdade,Medicina,Medicina
2,1601-10-14,129553,instituta,1601-10-14,14.10.1601 1601-10-14
3,1601-10-14,191232,instituta,1601-10-14,14.10.1601 1601-10-14
4,1601-10-14,129553,naturalidade,Abrantes,
5,1601-10-14,191232,naturalidade,Abrantes,
6,1601-10-14,129553,nome,Manuel de Castro,"Manuel de Castro Caldeira, vide Castro"
7,1601-10-14,191232,nome,Manuel de Castro,
8,1601-10-14,129553,nome,Manuel de Castro Caldeira,
9,1601-10-14,191232,nome,Manuel de Castro Caldeira,"Manuel de Castro, vide Caldeira"


3 167251 231579
167251 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,167251,Pedro Simões Esteves,Simões,Campo Maior,1665-10-19,1665-10-19,Cursos jurídicos (Cânones ou Leis),
1,231579,Pedro Simões,Pedro Simões Esteves,Campo Maior,1665-10-19,1666-10-15,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1665-03-21,231579,grau,Bacharel em Artes,Bacharel em Artes 21.03.1665
1,1665-10-19,167251,faculdade,Cursos jurídicos (Cânones ou Leis),Faculdade inferida
2,1665-10-19,231579,faculdade,Cânones,Cânones
3,1665-10-19,167251,instituta,1665-10-19,19.10.1665 1665-10-19
4,1665-10-19,231579,instituta,1665-10-19,19.10.1665 1665-10-19
5,1665-10-19,167251,naturalidade,Campo Maior,
6,1665-10-19,231579,naturalidade,Campo Maior,
7,1665-10-19,167251,nome,Pedro Simões,"Pedro Simões Esteves, vide Simões"
8,1665-10-19,231579,nome,Pedro Simões,
9,1665-10-19,167251,nome,Pedro Simões Esteves,


4 149046 216902


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,149046,Bernardo José de Azevedo,Vieira,Paredes,1725-10-01,1746-10-01,Cânones,José de Azevedo Vieira
1,216902,Bernardo José de Azevedo Vieira,Azevedo,Paredes,1726-10-01,1729-10-01,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1725-10-01,149046,faculdade,Cânones,Cânones
1,1725-10-01,149046,instituta,1725-10-01,01.10.1725 1725-10-01
2,1725-10-01,149046,naturalidade,Paredes,
3,1725-10-01,149046,nome,Bernardo José de Azevedo,
4,1725-10-01,149046,nome,Bernardo José de Azevedo Vieira,"Bernardo José de Azevedo, vide Vieira"
5,1725-10-01,149046,nome-nota,padre,
6,1725-10-01,149046,nome-pai,José de Azevedo Vieira,
7,1725-10-01,149046,nome-vide,Vieira,
8,1725-10-01,149046,padre,sim,padre
9,1725-10-01,149046,uc-entrada,1725-10-01,


5 163896 210366
163896 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,210366,Joaquim José Ribeiro de Vasconcelos,Joaquim José Ribeiro,"Baía, Brasil",1779-10-26,1781-10-05,Filosofia,João Ribeiro de Vasconcelos
1,163896,Joaquim José Ribeiro,Vasconcelos,Baía,1781-11-05,1781-11-05,Matemática,João Ribeiro de Vasconcelos


Unnamed: 0,date,id,type,value,attr_obs
0,1779-10-26,210366,faculdade,Filosofia,Filosofia
1,1779-10-26,210366,matricula-faculdade,Filosofia,(obrigado)
2,1779-10-26,210366,matricula-faculdade.obrigado,Filosofia,(obrigado)
3,1779-10-26,210366,matricula-faculdade.obrigado.ano,Filosofia.1779,(obrigado)
4,1779-10-26,210366,naturalidade,"Baía, Brasil",
5,1779-10-26,210366,nome,Joaquim José Ribeiro,"Joaquim José Ribeiro de Vasconcelos, vide Joaquim José Ribeiro"
6,1779-10-26,210366,nome,Joaquim José Ribeiro de Vasconcelos,
7,1779-10-26,210366,nome-pai,João Ribeiro de Vasconcelos,
8,1779-10-26,210366,nome-vide,Joaquim José Ribeiro,
9,1779-10-26,210366,uc-entrada,1779-10-26,


6 161852 173563
161852 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,173563,Manuel Dias Nunes,Dias,Beja,1659-10-20,1666-01-18,Medicina,
1,161852,Manuel Dias,Nunes,Beja,1663-02-22,1663-02-22,Medicina,


Unnamed: 0,date,id,type,value,attr_obs
0,1659-10-20,173563,faculdade,Medicina,Medicina
1,1659-10-20,173563,matricula-faculdade,Medicina,20.10.1659
2,1659-10-20,173563,naturalidade,Beja,
3,1659-10-20,173563,nome,Manuel Dias,"Manuel Dias Nunes, vide Dias"
4,1659-10-20,173563,nome,Manuel Dias Nunes,
5,1659-10-20,173563,nome-vide,Dias,
6,1659-10-20,173563,uc-entrada,1659-10-20,
7,1659-10-20,173563,uc-entrada.ano,1659,
8,1660-12-23,173563,matricula-faculdade,Medicina,23.12.1660
9,1661-10-15,173563,matricula-faculdade,Medicina,15.10.1661


7 200599 251249
251249 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,200599,João Teixeira,Morais,Bragança,1615-10-26,1616-03-02,Cânones,Jacome de Morais
1,251249,João Teixeira de Morais,Teixeira,Bragança,1616-02-22,1616-02-22,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1615-10-26,200599,faculdade,Cânones,Cânones
1,1615-10-26,200599,grau,Bacharel em Cânones,
2,1615-10-26,200599,matricula-faculdade,Cânones,26.10.1615
3,1615-10-26,200599,naturalidade,Bragança,
4,1615-10-26,200599,nome,João Teixeira,
5,1615-10-26,200599,nome,João Teixeira Morais,"João Teixeira, vide Morais"
6,1615-10-26,200599,nome-pai,Jacome de Morais,
7,1615-10-26,200599,nome-vide,Morais,
8,1615-10-26,200599,uc-entrada,1615-10-26,
9,1615-10-26,200599,uc-entrada.ano,1615,


8 140681 266159


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,266159,Domingos Antunes,de Abreu,Lisboa,1578-10-08,1594-05-15,Cânones,Brás Dias de Abreu
1,140681,Domingos Antunes de Abreu,Antunes,Lisboa,1579-10-27,1593-10-16,Cânones,Brás Dias de Abreu


Unnamed: 0,date,id,type,value,attr_obs
0,1578-10-08,266159,faculdade,Cânones,Cânones
1,1578-10-08,266159,instituta,1578-10-08,"""1578/10/08 1578-10-08"""
2,1578-10-08,266159,naturalidade,Lisboa,
3,1578-10-08,266159,nome,Domingos Antunes,
4,1578-10-08,266159,nome,Domingos Antunes de Abreu,"Domingos Antunes, vide de Abreu"
5,1578-10-08,266159,nome-pai,Brás Dias de Abreu,
6,1578-10-08,266159,nome-vide,de Abreu,
7,1578-10-08,266159,uc-entrada,1578-10-08,
8,1578-10-08,266159,uc-entrada.ano,1578,
9,1578-10-08:1579-06-08,266159,instituta,1578-10-08:1579-06-08,curso: Instituta e Cânones: 08.10.1578 até 08.06.1579


9 213910 253753


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,213910,Rui Lopes da Veiga,Lopes,Coimbra,1558-02-03,1560-05-23,Artes,
1,253753,Rui Lopes,da Veiga,Coimbra,1568-12-23,1569-12-11,Leis,


Unnamed: 0,date,id,type,value,attr_obs
0,1558-02-03,213910,faculdade,Artes,Faculdade inferida
1,1558-02-03,213910,grau,Bacharel em Artes,ter o tempo que se requer para Bacharel em Artes: 03.02.1558
2,1558-02-03,213910,naturalidade,Coimbra,
3,1558-02-03,213910,nome,Rui Lopes,"Rui Lopes da Veiga, vide Lopes"
4,1558-02-03,213910,nome,Rui Lopes da Veiga,
5,1558-02-03,213910,nome-vide,Lopes,
6,1558-02-03,213910,uc-entrada,1558-02-03,
7,1558-02-03,213910,uc-entrada.ano,1558,
8,1560-05-23,213910,grau,Licenciado em Artes,23.05.1560
9,1560-05-23,213910,uc-saida,1560-05-23,


10 187658 187661


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,187661,Gaspar da Costa Brandão,Gaspar Afonso da Costa Brandão,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",1720-10-01,1726-05-25,Leis,
1,187658,Gaspar Afonso da Costa Brandão,Gaspar da Costa Brandão,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",1721-10-01,1726-07-28,Leis,Bento de Figueiredo Brandão


Unnamed: 0,date,id,type,value,attr_obs
0,1720-10-01,187661,faculdade,Leis,Leis
1,1720-10-01,187661,instituta,1720-10-01,01.10.1720 1720-10-01
2,1720-10-01,187661,naturalidade,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",
3,1720-10-01,187661,nome,Gaspar Afonso da Costa Brandão,"Gaspar da Costa Brandão, vide Gaspar Afonso da Costa Brandão"
4,1720-10-01,187661,nome,Gaspar da Costa Brandão,
5,1720-10-01,187661,nome-vide,Gaspar Afonso da Costa Brandão,
6,1720-10-01,187661,uc-entrada,1720-10-01,
7,1720-10-01,187661,uc-entrada.ano,1720,
8,1721-10-01,187658,faculdade,Leis,Leis
9,1721-10-01,187658,naturalidade,"Vila Cova de Sub-Avô, hoje Vila Cova de Alva",


11 144496 163573


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,144496,João Álvares,Ribeiro,Porto,1607-10-09,1616-04-15,Cânones,Pantaleão Luís
1,163573,João Álvares Ribeiro,Álvares,Porto,1612-12-12,1615-06-04,Cânones,Pantaleão Luís


Unnamed: 0,date,id,type,value,attr_obs
0,1607-10-09,144496,faculdade,Cânones,Cânones
1,1607-10-09,144496,instituta,1607-10-09,"""1607/10/09 1607-10-09"""
2,1607-10-09,144496,naturalidade,Porto,
3,1607-10-09,144496,nome,João Álvares,
4,1607-10-09,144496,nome,João Álvares Ribeiro,"João Álvares, vide Ribeiro"
5,1607-10-09,144496,nome-pai,Pantaleão Luís,
6,1607-10-09,144496,nome-vide,Ribeiro,
7,1607-10-09,144496,uc-entrada,1607-10-09,
8,1607-10-09,144496,uc-entrada.ano,1607,
9,1608-10-06,144496,matricula-faculdade,Cânones,"""1608/10/06"""


12 143242 251147


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,251147,Jerónimo de Almeida Morais,Almeida,Lisboa,1723-10-01,1731-11-24,(Medecina),
1,143242,Jerónimo de Almeida,Morais,Lisboa,1723-10-01,1725-10-01,Medicina,


Unnamed: 0,date,id,type,value,attr_obs
0,1723-10-01,251147,faculdade,(Medecina),
1,1723-10-01,143242,faculdade,Medicina,Medicina
2,1723-10-01,251147,faculdade-original,Medecina,
3,1723-10-01,251147,matricula-faculdade,(Medecina),01.10.1723
4,1723-10-01,143242,matricula-faculdade,Medicina,"""1723/10/01"""
5,1723-10-01,143242,naturalidade,Lisboa,
6,1723-10-01,251147,naturalidade,Lisboa,
7,1723-10-01,143242,nome,Jerónimo de Almeida,
8,1723-10-01,251147,nome,Jerónimo de Almeida,"Jerónimo de Almeida Morais, vide Almeida"
9,1723-10-01,143242,nome,Jerónimo de Almeida Morais,"Jerónimo de Almeida, vide Morais"


13 168704 205306


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,168704,António Mendes Neto,Mendes,Santarém,1549-04-11,1553-05-31,Cânones,
1,168704,António Mendes Neto,Mendes,Santarém,1549-04-11,1553-05-31,Leis,
2,205306,António Mendes,Neto,Santarém,1540-04-11,1549-07-10,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1540-04-11,205306,faculdade,Cânones,Cânones
1,1540-04-11,205306,naturalidade,Santarém,
2,1540-04-11,205306,nome,António Mendes,
3,1540-04-11,205306,nome,António Mendes Neto,"António Mendes, vide Neto"
4,1540-04-11,205306,nome-vide,Neto,
5,1540-04-11,205306,uc-entrada,1540-04-11,
6,1540-04-11,205306,uc-entrada.ano,1540,
7,1549-04-11,168704,faculdade,Cânones,Faculdade corrigida
8,1549-04-11,168704,faculdade,Leis,Faculdade corrigida
9,1549-04-11,168704,faculdade-original,Cânones,


14 181667 219120


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,219120,António Machado Vilas Boas,Machado,Vila do Conde,1621-10-02,1625-10-16,Cânones,
1,181667,António Machado,Vilas Boas,"Vila do Conde, Porto",1625-10-03,1625-10-10,Cânones,Bartolomeu Jorge


Unnamed: 0,date,id,type,value,attr_obs
0,1621-10-02,219120,faculdade,Cânones,Cânones
1,1621-10-02,219120,matricula-faculdade,Cânones,02.10.1621
2,1621-10-02,219120,naturalidade,Vila do Conde,
3,1621-10-02,219120,nome,António Machado,"António Machado Vilas Boas, vide Machado"
4,1621-10-02,219120,nome,António Machado Vilas Boas,
5,1621-10-02,219120,nome-vide,Machado,
6,1621-10-02,219120,uc-entrada,1621-10-02,
7,1621-10-02,219120,uc-entrada.ano,1621,
8,1622-10-03,219120,matricula-faculdade,Cânones,03.10.1622
9,1623-10-05,219120,matricula-faculdade,Cânones,05.10.1623


15 133131 180160


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,180160,Manuel de Gouveia,Quintela,Lisboa,1656-10-01,1657-10-01,Cânones,
1,133131,Manuel de Gouveia Quintela,Gouveia,Lisboa,1656-10-07,1664-02-22,Cânones,João de Gouveia


Unnamed: 0,date,id,type,value,attr_obs
0,1656-10-01,180160,faculdade,Cânones,Cânones
1,1656-10-01,180160,naturalidade,Lisboa,
2,1656-10-01,180160,nome,Manuel de Gouveia,
3,1656-10-01,180160,nome,Manuel de Gouveia Quintela,"Manuel de Gouveia, vide Quintela"
4,1656-10-01,180160,nome-vide,Quintela,
5,1656-10-01,180160,uc-entrada,1656-10-01,
6,1656-10-01,180160,uc-entrada.ano,1656,
7,1656-10-07,133131,faculdade,Cânones,Cânones
8,1656-10-07,133131,instituta,1656-10-07,07.10.1656 1656-10-07
9,1656-10-07,180160,instituta,1656-10-07,07.10.1656 1656-10-07


16 144661 171665


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,171665,Pedro Álvares Nogueira,Álvares,Coimbra,1573-10-02,1595-10-10,Cânones,Álvaro Annes Nogueira
1,144661,Pedro Álvares,Nogueira,Coimbra,1573-10-02,1574-07-31,Cânones,


Unnamed: 0,date,id,type,value,attr_obs
0,1573-10-02,144661,faculdade,Cânones,Cânones
1,1573-10-02,171665,faculdade,Cânones,Cânones
2,1573-10-02,171665,matricula-faculdade,Cânones,02.10.1573
3,1573-10-02,144661,naturalidade,Coimbra,
4,1573-10-02,171665,naturalidade,Coimbra,
5,1573-10-02,144661,nome,Pedro Álvares,
6,1573-10-02,171665,nome,Pedro Álvares,"Pedro Álvares Nogueira, vide Álvares"
7,1573-10-02,144661,nome,Pedro Álvares Nogueira,"Pedro Álvares, vide Nogueira"
8,1573-10-02,171665,nome,Pedro Álvares Nogueira,
9,1573-10-02,171665,nome-nota,padre,


17 202057 206393
202057 is a possible 'see' record


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,206393,João Rodrigues,Teles,Arraiolos,1616-11-12,1623-02-14,Medicina,André Rodrigues
1,202057,João Rodrigues Teles,Rodrigues,Arraiolos,1618-05-02,1618-05-02,Artes,


Unnamed: 0,date,id,type,value,attr_obs
0,1616-11-12,206393,faculdade,Medicina,Medicina
1,1616-11-12,206393,matricula-faculdade,Medicina,1616.11.12
2,1616-11-12,206393,naturalidade,Arraiolos,
3,1616-11-12,206393,nome,João Rodrigues,
4,1616-11-12,206393,nome,João Rodrigues Teles,"João Rodrigues, vide Teles"
5,1616-11-12,206393,nome-pai,André Rodrigues,
6,1616-11-12,206393,nome-vide,Teles,
7,1616-11-12,206393,uc-entrada,1616-11-12,
8,1616-11-12,206393,uc-entrada.ano,1616,
9,1617-10-15,206393,instituta,1617-10-15,1617.10.15 1617-10-15


18 191348 237553


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,237553,Manuel Pereira,Castro,Monção,1597-10-13,1604-11-02,Cânones,Manuel Pereira
1,191348,Manuel Pereira de Castro,Pereira,Monção,1605-11-15,1606-05-13,Cânones,Manuel Pereira


Unnamed: 0,date,id,type,value,attr_obs
0,1597-10-13,237553,faculdade,Cânones,Cânones
1,1597-10-13,237553,instituta,1597-10-13,13.10.1597 1597-10-13
2,1597-10-13,237553,naturalidade,Monção,
3,1597-10-13,237553,nome,Manuel Pereira,
4,1597-10-13,237553,nome,Manuel Pereira Castro,"Manuel Pereira, vide Castro"
5,1597-10-13,237553,nome-pai,Manuel Pereira,
6,1597-10-13,237553,nome-vide,Castro,
7,1597-10-13,237553,uc-entrada,1597-10-13,
8,1597-10-13,237553,uc-entrada.ano,1597,
9,1599-02-08,237553,matricula-faculdade,Cânones,08.02.1599


19 199685 206340


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,199685,Diogo Mendes Leão,Mendes,Lisboa,1652-09-16,1653-10-15,Leis,
1,206340,Diogo Mendes,Leão,Lisboa,1654-11-28,1660-07-24,Medicina,André Mendes de Leão


Unnamed: 0,date,id,type,value,attr_obs
0,1652-09-16,199685,faculdade,Leis,Leis
1,1652-09-16,199685,matricula-faculdade,Leis,16.09.1652
2,1652-09-16,199685,naturalidade,Lisboa,
3,1652-09-16,199685,nome,Diogo Mendes,"Diogo Mendes Leão, vide Mendes"
4,1652-09-16,199685,nome,Diogo Mendes Leão,
5,1652-09-16,199685,nome-vide,Mendes,
6,1652-09-16,199685,uc-entrada,1652-09-16,
7,1652-09-16,199685,uc-entrada.ano,1652,
8,1653-10-15,199685,matricula-faculdade,Leis,15.10.1653
9,1653-10-15,199685,uc-saida,1653-10-15,


20 160999 214635
FAR APART >15 years: possible false match, records chronologically affar


Unnamed: 0,id,name,nome-vide,naturalidade,uc-entrada,uc-saida,faculdade,nome-pai
0,214635,Pedro Velho,Fragoso,Vila Nova de Portimão,1608-10-01,1617-07-23,Cânones,Francisco Velho
1,160999,Pedro Velho Fragoso,Velho,Vila Nova de Portimão,1626-05-12,1636-11-05,,Francisco Velho


Unnamed: 0,date,id,type,value,attr_obs
0,1608-10-01,214635,faculdade,Cânones,Cânones
1,1608-10-01,214635,instituta,1608-10-01,01.10.1608 1608-10-01
2,1608-10-01,214635,naturalidade,Vila Nova de Portimão,
3,1608-10-01,214635,nome,Pedro Velho,
4,1608-10-01,214635,nome,Pedro Velho Fragoso,"Pedro Velho, vide Fragoso"
5,1608-10-01,214635,nome-pai,Francisco Velho,
6,1608-10-01,214635,nome-vide,Fragoso,
7,1608-10-01,214635,uc-entrada,1608-10-01,
8,1608-10-01,214635,uc-entrada.ano,1608,
9,1609-10-06,214635,matricula-faculdade,Cânones,06.10.1609


#### Types of transformations in matched records

In [73]:
vide_types_matches = matched.groupby('vide_type').count()[['name']]
vide_types_matches['perc'] = vide_types_matches['name']/ vide_types_matches['name'].sum()
vide_types_matches

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,1897,0.460213
cut,1899,0.460699
novid,26,0.006308
rep,294,0.071325
repap,6,0.001456


In [74]:
match_info.fillna("")

Unnamed: 0,data,sequential,random,perc_vide_plus,perc_matched_ok,perc_type,type
aka,3062,,,0.349304,,1.0,aka
aka_fac,3035,,,0.346224,,0.991182,aka
aka_geo,2973,,,0.339151,,0.970934,aka
aka_matched,1910,1913.0,1970.0,,,0.623775,aka
aka_matched_ok,1940,1907.0,1897.0,,,0.633573,aka
aka_pai,1619,,,0.184691,,0.528739,aka
matched_pairs,3818,3644.0,3804.0,,,1.0,matched_pairs
matched_pairs_ok,3665,3628.0,3614.0,,,0.959927,matched_pairs
nodate,5763,,,0.657426,,,
nodate_novide,141,,,0.016085,,,


### Analysis of non matched records

In [77]:

pd.set_option('display.max_rows',250)
matched_index = match_records['records_matched']['data']
non_matched_index = set(vide_plus.index.unique())-set(matched_index)
vide_non_matched = vide_plus.loc[list(non_matched_index)].sort_values(['sort_key','nome-geografico'])[['nome-geografico','match','name','nome-vide','vide_type','lookup','faculdade','nome-pai','uc-entrada','match_error','match_obs']]
vide_non_matched.to_csv('../inferences/cross-references/vide_non_matched.csv',sep=',')

In [78]:
vide_types_non_matches = vide_non_matched.groupby('vide_type').count()[['name']]
vide_types_non_matches['perc'] = vide_types_non_matches['name']/ vide_types_non_matches['name'].sum()
vide_types_non_matches

Unnamed: 0_level_0,name,perc
vide_type,Unnamed: 1_level_1,Unnamed: 2_level_1
add,2160,0.406321
cut,2227,0.418924
novid,126,0.023702
rep,763,0.143529
rep+,20,0.003762
repap,20,0.003762


### Sample of non-matched records


In [79]:
vide_non_matched.head(31)

Unnamed: 0,nome-geografico,match,name,nome-vide,vide_type,lookup,faculdade,nome-pai,uc-entrada,match_error,match_obs
220890,Portalegre,,"""Pedro Rodrigues, vide; Abreu""",,novid,"""Pedro Rodrigues, vide; Abreu""",,,0000-00-00,False,
271719,Abreiro,,Abel de Mendonça Machado de Araújo,Abel de Mendonça,cut,Abel de Mendonça,,,0000-00-00,False,
271719,Mirandela,,Abel de Mendonça Machado de Araújo,Abel de Mendonça,cut,Abel de Mendonça,,,0000-00-00,False,
182548,Eiró,,Abel Xavier Teixeira de Magalhães,José Joaquim Xavier Teixeira de Magalhães,rep,José Joaquim Xavier Teixeira de Magalhães,Cursos jurídicos (Cânones ou Leis),,0000-00-00,False,
285686,Oliveira de Frades,,Abílio Ribeiro de Almeida Campos de Melo,Abílio Ribeiro de Almeida,cut,Abílio Ribeiro de Almeida,Cursos jurídicos (Cânones ou Leis),António de Almeida Silva Campos de Melo,0000-00-00,False,
285686,Pinheiro,,Abílio Ribeiro de Almeida Campos de Melo,Abílio Ribeiro de Almeida,cut,Abílio Ribeiro de Almeida,Cursos jurídicos (Cânones ou Leis),António de Almeida Silva Campos de Melo,0000-00-00,False,
286149,Amoreira da Gandra,,Adelino Pinto Tavares Ferrão de Mendonça,Ferrão,cut,Adelino Pinto Tavares Ferrão,,,0000-00-00,False,
226700,Marvão,,Adolfo Augusto Rôlo,Adolfo António Rôlo,rep,Adolfo António Rôlo,Medicina,,1871-06-06,False,
226683,Marvão,,Adolfo António Rôlo,Adolfo Augusto Zuzarte Rôlo,rep,Adolfo Augusto Zuzarte Rôlo,,,0000-00-00,False,
273326,Lisboa,,Adriano Ernesto de Castilho Barreto,Castilho,cut,Adriano Ernesto de Castilho,,,0000-00-00,False,



Analysis:
1. 220890	Portalegre	"Pedro Rodrigues, vide; Abreu" links with 140806 __problem in vide expression__
2. 271719	Abreiro/Mirandela	Abel de Mendonça Machado de Araújo	Abel de Mendonça links with 286147 __no back vide expression__
3. 182548	Eiró	Abel Xavier Teixeira de Magalhães	José Joaquim Xavier Teixeira de Magalhães links with 182950  __no back vide expression__
4. 285686	Oliveira de Frades	Abílio Ribeiro de Almeida Campos de Melo	Abílio Ribeiro de Almeida links with 142075 __no back vide expression__
5. 286149	Amoreira da Gandra	Adelino Pinto Tavares Ferrão de Mendonça	Ferrão links with 248088 __no back vide expression__ and __typo in geo name__
6. 273326	Lisboa	Adriano Ernesto de Castilho Barreto	Castilho links with 189993 __no back vide expression__
7. 230176	Arcos	Tomás Joaquim Lopes de Mariz e Silva	Adriano Joaquim Lopes Mariz e Silva Monteiro links with 250994 __variation in the vide name (Maris/Mariz)__
8. 282429	NaN	Adriano Osório Pereira Gouveia	Adriano Osório Pereira Cerenato	rep	Adriano Osório Pereira Cerenato	links with 291196 __no back vide expression__
9. 296930	Almarge	Adriano Sisnando Brotero de Avelar Quintino	Adriano Sisnando Brotero Quintino de Avelar	rep	Adriano Sisnando Brotero Quintino de Avelar links with 133134 __no back vide expression__
10. 225520	Lisboa	Adrião Pereira	Gomes	add	Adrião Pereira Gomes	Cânones, links with 178240 __no back vide expression__
11. 147465	Trancoso	Afonso Tavares de Araújo	Afonso de Araújo Tavares	rep	Afonso de Araújo Tavares links with 197047 __no back vide expression__
12. 169888	Lisboa	Afonso Furtado	Mendonça	add	Afonso Furtado Mendonça link with 214147 (see) or 169890__ __ambiguity__ 
13. 251547	Baía	Afonso Luís	da Fonseca	add	Afonso Luís da Fonseca	links with 139362 __no back vide expression__
14. 225529	Monção	Afonso Pereira	Pimenta	add	Afonso Pereira Pimenta	 links with 241162 __no back vide expression__
15. 129050	Elvas	Afonso Rodrigues Caldas	Rodrigues	cut	Afonso Rodrigues __no link found__
16. 221241	Elvas	Afonso Sardinha	Afonso Vaz Sardinha	rep	Afonso Vaz Sardinha	Cânones see link missing	__no link found__
17. 235544	Elvas	Afonso Soares da Mota	Afonso Soares de Lemos	rep	Afonso Soares de Lemos	link 211794 	__no back vide expression__ 
18. 225535	Aldeia Nova do Cabo	Afonso de Sá Pereira	Sá	cut	Afonso de Sá links with 211378 __no back vide expression__ 
19. 199294	Vila Real	Afonso Teixeira	   Mendonça e Azevedo	add	Afonso Teixeira Mendonça e Azevedo	Cânones	 links with 148819/See  214149/see __ambiguity__
20. 316331	Quinta do Alqueidão	Agostinho António de Sousa Brito Resende	Soutomaior	add	Agostinho António de Sousa Brito Resende Souto...
	link to 224178 	__no back vide expression_ NO match on geoname Alqueidão, quinta do Alqueidão__
21. 234238 Lisboa	Agostinho Armando de Vasconcelos e Sousa	Agostinho Armando Vasconcelos	rep	Agostinho Armando Vasconcelos	
        Links to 148028 __fail lookup not matching linked record name: both lookup the same though__

### Aka Records non matched

There is an inbalance of "see" and "aka" numbers, so a high number of unmatched "see"  is expected.

Aka records should be more easily matched with corresponding see. That is the case in fact with around 55% of aka records matched

Let's see the reason why Aka records do not find a matching "see".



In [80]:
aka_see_not_matched_index = vide_non_matched[vide_non_matched['uc-entrada']!='0000-00-00'].index.unique()
print("Number of aka records not matched:", {len(aka_see_not_matched_index)})
print("Partial list, change head parameter for more:")

vide_non_matched.loc[aka_see_not_matched_index].head(20)

Number of aka records not matched: {1053}
Partial list, change head parameter for more:


Unnamed: 0,nome-geografico,match,name,nome-vide,vide_type,lookup,faculdade,nome-pai,uc-entrada,match_error,match_obs
226700,Marvão,,Adolfo Augusto Rôlo,Adolfo António Rôlo,rep,Adolfo António Rôlo,Medicina,,1871-06-06,False,
250994,Arcos,,Adriano Joaquim de Mariz e Silva Monteiro,Tomás Joaquim Lopes de Maris e Silva,rep+,Tomás Joaquim Lopes de Maris e Silva,Cursos jurídicos (Cânones ou Leis),,1794-10-14,False,
250994,Aveiro,,Adriano Joaquim de Mariz e Silva Monteiro,Tomás Joaquim Lopes de Maris e Silva,rep+,Tomás Joaquim Lopes de Maris e Silva,Cursos jurídicos (Cânones ou Leis),,1794-10-14,False,
180061,Salgueiro,,João António Osório Pereira Gouveia,Adriano Osório Pereira Guerra,rep+,Adriano Osório Pereira Guerra,Cursos jurídicos (Cânones ou Leis),,1800-10-31,False,
180742,Salgueiro,,Adriano Osório Pereira Guerra,João António Pereira Cerenato,rep+,João António Pereira Cerenato,Leis,,1799-10-07,False,
129050,Elvas,,Afonso Rodrigues Caldas,Rodrigues,cut,Afonso Rodrigues,Leis,,1657-11-02,False,
221241,Elvas,,Afonso Sardinha,Afonso Vaz Sardinha,rep,Afonso Vaz Sardinha,Cânones,Gonçalo Rodrigues,1706-10-01,False,
199294,Vila Real,,Afonso Teixeira,Mendonça e Azevedo,add,Afonso Teixeira Mendonça e Azevedo,Cânones,,1650-11-08,False,
199294,Vila Real,,Afonso Teixeira,Mendonça e Azevedo,add,Afonso Teixeira Mendonça e Azevedo,Leis,,1650-11-08,False,
187458,Santa Olaia,,Agostinho Brandão,Pinto,add,Agostinho Brandão Pinto,Cursos jurídicos (Cânones ou Leis),,1688-01-21,False,


##### Analysis

1. 226700 Marvão Adolfo Augusto Rôlo, vide Adolfo António Rôlo matches 226683 Adolfo António Rôlo, vide Adolfo Augusto Zuzarte Rôlo __back vide does not match__
2. 250994 Arcos	Tomás Joaquim Lopes de Mariz e Silva vide Adriano Joaquim Lopes Mariz e Silva Monteiro links with  230176 __variation in the vide name (Maris/Mariz)__
3. 180061	Salgueiro	_João António Osório Pereira Gouveia_, vide Adriano Osório Pereira Guerra	rep+	Adriano Osório Pereira Guerra	Direito (Cânones ou Leis) 1800-10-31	__variation in the vide name__
  * 180742	Salgueiro   Adriano Osório Pereira Guerra, vide _João António Pereira Cerenato_	Leis 1799-10-07
  * Other possible matches 291196, 191903 complex case
4. 129050	Elvas	Afonso Rodrigues Caldas	vide Rodrigues	cut	Afonso Rodrigues	Leis	NaN	1657-11-02	__see record not found manualy__
5. 221241	Elvas	Afonso Sardinha	vide Afonso Vaz Sardinha	rep	Afonso Vaz Sardinha	Cânones	Gonçalo Rodrigues	1706-10-01	__see record not found manualy__
6. 199294	Vila Real	Afonso Teixeira	vide Mendonça e Azevedo	add	Afonso Teixeira Mendonça e Azevedo	Cânones	1650-11-08
  * links with see record 214149 Afonso Teixeira de Mendonça, vide Teixeira  __vide in aka record does not match name in see record__
  * links also with 148819 Afonso Teixeira de Azevedo, vide Teixeira __vide in aka record does not match name in see record__
  * so the vide expression in 199924 should be __vide Mendonça e vide Azevedo__ to link with Afonso Teixeira de Mendonça and Afonso Teixeira de Azevedo
7. 187458	Santa Olaia	Agostinho Brandão	vide Pinto	add	Agostinho Brandão Pinto	Direito (Cânones ou Leis)	NaN	1688-01-21	__matching record is aka, not see__
  * links with 245344 which is not a see record nor a vide record. __187458 and 245344 are dupicates__ __matching record is aka, not see__
8.  152599	Lisboa	Agostinho José de Carvalho vide	Agostinho José de Figueiredo Carvalho e Oliveira	Leis	1791-10-27
   *  links with 174123 Agostinho José de Figueiredo Carvalho e Oliveira, but it is not a see record. __152599 and 174213 are duplicates__ __matching record is aka, not see__
9. 149805	Lisboa Aires Correia Baharém vide	Correia	cut	Aires Correia	Teologia	pai Manuel Correia de Menezes	1594-10-18 
  * links with see record 196492 slight variation in the vide expression __variation in the vide name (Baharém/Baharem)__


10. 192844	Ovfmatsen	Alberto Chremer	vide Cremert	add	Alberto Chremer Cremert	Cânones	
   *  links with 207263 Alberto Cremert no vide expression __matching record is aka, not see__ __duplicate__


### Matched records

Sucessive lines are matches. Sometime more than one line per record when there is more than one geographic name or faculty.

In [81]:
vide_plus.loc[matched_index].sort_values(['nome-geografico','sort_key','uc-entrada'])[['uc-entrada','nome-geografico','name','lookup','nome-pai','faculdade','faculdade.obs','match_obs']].head(30)

Unnamed: 0,uc-entrada,nome-geografico,name,lookup,nome-pai,faculdade,faculdade.obs,match_obs
202622,0000-00-00,Constância,Fernão de Álvares Temudo,Fernão de Álvares,,,,
144388,1573-11-13,Constância,Fernão de Álvares,Fernão de Álvares Temudo,Pantaleão Rosado,Cânones,Cânones,
171438,0000-00-00,Constância,João da Veiga Mendes Nogueira,João da Veiga,,Leis,Leis,
213495,1757-10-01,Constância,João da Veiga,João da Veiga Mendes Nogueira,,Leis,Leis,
214577,0000-00-00,Constância,Julião Velho,Julião Velho Almeida,,,,
143676,1663-07-10,Constância,Julião Velho de Almeida,Julião Velho,,Cânones,Cânones,
203159,0000-00-00,Constância,Manuel da Costa,Manuel da Costa Oliveira,Manuel da Costa,,,
176277,1672-01-24,Constância,Manuel da Costa de Oliveira,Manuel da Costa,Manuel da Costa,Cânones,Cânones,
243351,0000-00-00,Constância,Manuel Ribeiro Pinhão,Manuel Ribeiro,Pedro Ribeiro,Cânones,Cânones,
165844,1623-10-09,Constância,Manuel Ribeiro,Manuel Ribeiro Pinhão,Pedro Ribeiro,Cânones,. Cânones,


# Save current stats on cross reference processing

This allows later in "git" see how the situation evolves.

In [95]:
# save status to file
fname = '../inferences/cross-references/015-remissivas_info.txt'

with open(fname,'w+') as f:
    print(f"Cross references, current stats: {current_time}",file=f)
    print(file=f)

    vide_plus.info(buf=f)

    print(match_info.fillna(""), file=f)
    

    



### Focus on specific records

Use this to check specific records.


Define a column and a pattern to search for. Pattern is a _regular expression_.
For more information on the patterns and alternative searches see https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html

Examples:

* column='nome', pattern='André': 'André' anywhere in column 'name' (will also get 166395 Manuel André Ribeiro)
* column='nome', pattern='André$': names ending in 'André' (e.g. 146664 Manuel André )
* column='nome', pattern='^André': names starting with 'André'
* column='nome', pattern='André|Joaquim': names containing either 'André' or 'Joaquim'
* column='naturalidade', pattern='Alcácer|Alcacer':  naturalidade contains either 'Alcácer' or 'Alcacer'

In [94]:
import pandas as pd
column = 'name'
pattern = '^Estevão'
pd.set_option('display.max_rows',1000)
#                                      na=False prevents errors column is missing
vide_selection = vide_plus[vide_plus[column].str.contains(pattern,na=False)]
vide_selection.sort_values([column]).head(10)

Unnamed: 0,name,sex,nome-vide,nome-geografico,faculdade,faculdade.date,faculdade.obs,nome-pai,uc-entrada,uc-saida,...,loookup,vide_type,lookup,name_sp,lookup_sp,sort_key,match_error,match_obs,match,match_type
200659,Estevão Afonso da Costa,m,Afonso,Bragança,Cânones,1664-10-15,Cânones,,1664-10-15,1671-07-03,...,,cut,Estevão Afonso,Estevão Afonso Costa,Estevão Afonso,Estevão Afonso-Estevão Afonso Costa,False,,,
276913,Estevão Anacleto Duarte,m,Estevão Anacleto,Vila Viçosa,Leis,0000-00-00,Leis,António Duarte,0000-00-00,0000-00-00,...,,cut,Estevão Anacleto,Estevão Anacleto Duarte,Estevão Anacleto,Estevão Anacleto-Estevão Anacleto Duarte,False,,,
236147,Estevão Barreto de Magalhães e Menezes,m,Estevão de Magalhães e Menezes,Braga,,,,,0000-00-00,0000-00-00,...,,rep,Estevão de Magalhães e Menezes,Estevão Barreto Magalhães Menezes,Estevão Magalhães Menezes,Estevão Barreto Magalhães Menezes-Estevão Maga...,False,,236150.0,see-aka
129044,Estevão Caetano,m,de Araújo Rangel,Porto,Cursos jurídicos (Cânones ou Leis),1724-10-01,Faculdade inferida,,1724-10-01,1725-10-01,...,,add,Estevão Caetano de Araújo Rangel,Estevão Caetano,Estevão Caetano Araújo Rangel,Estevão Caetano-Estevão Caetano Araújo Rangel,False,,,
134106,Estevão Caetano de Araújo Rangel,m,Caetanao,Porto,,,,,0000-00-00,0000-00-00,...,,add,Estevão Caetano de Araújo Rangel Caetanao,Estevão Caetano Araújo Rangel,Estevão Caetano Araújo Rangel Caetanao,Estevão Caetano Araújo Rangel-Estevão Caetano ...,False,,,
133510,Estevão Cardoso,m,da Silveira,Vila Viçosa,Leis,1615-10-02,Leis,,1615-10-02,1623-05-24,...,,add,Estevão Cardoso da Silveira,Estevão Cardoso,Estevão Cardoso Silveira,Estevão Cardoso-Estevão Cardoso Silveira,False,,230444.0,aka-see
230444,Estevão Cardoso da Silveira,m,Cardoso,Vila Viçosa,,,,,0000-00-00,0000-00-00,...,,cut,Estevão Cardoso,Estevão Cardoso Silveira,Estevão Cardoso,Estevão Cardoso-Estevão Cardoso Silveira,False,,133510.0,see-aka
152876,Estevão Dias,m,Pereira,Cascais,,,,,0000-00-00,0000-00-00,...,,add,Estevão Dias Pereira,Estevão Dias,Estevão Dias Pereira,Estevão Dias-Estevão Dias Pereira,False,,233458.0,see-aka
233458,Estevão Dias Pereira,m,Dias,Cascais,Cânones,1619-10-24,Cânones,Álvaro Pereira,1619-10-24,1623-10-03,...,,cut,Estevão Dias,Estevão Dias Pereira,Estevão Dias,Estevão Dias-Estevão Dias Pereira,False,,152876.0,aka-see
293823,Estevão Falcão Cota,m,Menezes,***NA***,,,,,0000-00-00,0000-00-00,...,,add,Estevão Falcão Cota Menezes,Estevão Falcão Cota,Estevão Falcão Cota Menezes,Estevão Falcão Cota-Estevão Falcão Cota Menezes,False,,,



## Sorted lists

In [89]:
vide_selection[['nome-geografico','match','match_type','match_obs','name','nome-vide','faculdade','uc-entrada','uc-saida']].sort_values(['nome-geografico','name','uc-entrada']).head(20)

Unnamed: 0,nome-geografico,match,match_type,match_obs,name,nome-vide,faculdade,uc-entrada,uc-saida
293823,***NA***,,,,Estevão Falcão Cota,Menezes,,0000-00-00,0000-00-00
285795,***NA***,,,,Estevão Machado de Melo,Castro,,0000-00-00,0000-00-00
173368,Abrantes,,,,Estevão Lopes Galvão,Lopes,,0000-00-00,0000-00-00
149184,Arrifana de Sousa,,,,Estevão de Freitas e Azevedo,Freitas,,0000-00-00,0000-00-00
206520,Beco,,,,Estevão Mendes,Vasconcelos,Cânones,0000-00-00,0000-00-00
233530,Beja,,,,Estevão Lopes Pereira,Lopes,,0000-00-00,0000-00-00
236147,Braga,236150.0,see-aka,,Estevão Barreto de Magalhães e Menezes,Estevão de Magalhães e Menezes,,0000-00-00,0000-00-00
236150,Braga,236147.0,aka-see,,Estevão de Magalhães e Menezes,Estevão Barreto de Magalhães e Menezes,Cânones,1738-10-01,1740-10-01
200659,Bragança,,,,Estevão Afonso da Costa,Afonso,Cânones,1664-10-15,1671-07-03
309989,Brasil,,,,Estevão Mauricio de Velasco e Tavora,Estevão Mauricio de Velasco Molina,Cânones,0000-00-00,0000-00-00


### Examine individual records

In [90]:
from timelinknb import pperson,Session
pd.set_option('display.max_rows',250)

with Session() as session:
    session.begin()
    pperson(219458)


n$Estevão José dos Santos/m/id=219458/obs="""

            Id: 219458
            Código de referência: PT/AUC/ELU/UC-AUC/B/001-001/S/003081

            Nome        : Estevão José dos Santos, vide Estevão José
            Data inicial: 0000-00-00
            Data final  : 0000-00-00
            Filiação:
            Naturalidade: Lisboa
            Faculdade:

            Matrícula(s):

            Instituta:
        """
  atr$código-de-referência/"PT/AUC/ELU/UC-AUC/B/001-001/S/003081"/2020-12-30
  atr$data-do-registo/2020-12-30/2020-12-30
  atr$url/"https://pesquisa.auc.uc.pt/details?id=219458"/2020-12-30
  ls$uc-entrada/0000-00-00/0000-00-00
  ls$uc-saida/0000-00-00/0000-00-00
  ls$nome-vide/Estevão José/0000-00-00
  ls$nome/Estevão José/0000-00-00/obs=Estevão José dos Santos, vide Estevão José
  ls$nome/Estevão José dos Santos/0000-00-00
  ls$nome-primeiro/Estevão/0000-00-00
  ls$nome-apelido/José dos Santos/0000-00-00
  ls$nome-apelido/Santos/0000-00-00
  ls$nome-geografico/Lisboa

### Examine groups of records in a single chrological table

In [92]:
from timelinknb import Session
from timelinknb.pandas import group_attributes

pd.set_option('display.max_rows',250)

with Session() as session:
    session.begin()
    ga = group_attributes(['215193','182145'],person_info=False,exclude_attributes=['pobs'])

ga.sort_values(['date','type','value'], inplace=True)
ga[['date','type','value','attr_obs']]


Unnamed: 0_level_0,date,type,value,attr_obs
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
182145,1604-10-11,faculdade,Cânones,Faculdade corrigida
182145,1604-10-11,faculdade,Leis,Faculdade corrigida
215193,1604-10-11,faculdade,Leis,Leis
182145,1604-10-11,faculdade-original,Leis,
182145,1604-10-11,faculdade.ano,Cânones.1604,Faculdade corrigida
182145,1604-10-11,faculdade.ano,Leis.1604,Faculdade corrigida
215193,1604-10-11,faculdade.ano,Leis.1604,Leis
182145,1604-10-11,instituta,1604-10-11,11.10.1604 1604-10-11
215193,1604-10-11,instituta,1604-10-11,1604.10.11 1604-10-11
182145,1604-10-11,instituta.ano,1604,11.10.1604 1604-10-11
