# Exploring the authority records of the Portuguese National Library




## References

* https://dados.gov.pt/pt/datasets/catalogo-bnp-registos-de-autoridade/ (download)
* https://purl.pt/11442/1/


### Students with Wikidata information and a Portuguese National Library (BNP) id

In [89]:
import pandas as pd

# Read both id columns as strings so the ids are kept exactly as written
students = pd.read_csv("../inferences/wikidata/students_wikidata_matched.csv", dtype={'bnp_id': str, 'fauc_id': str})
# Keep only the students that have both a BNP id and a FAUC id
bnp_fauc = students.loc[students['bnp_id'].notnull() & students['fauc_id'].notnull()]
# Map BNP id -> FAUC id
bnp_fauc_dict = dict(zip(bnp_fauc['bnp_id'], bnp_fauc['fauc_id']))
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    455 non-null    int64 
 1   wikidata      455 non-null    object
 2   name          455 non-null    object
 3   bnp_id        277 non-null    object
 4   naturalidade  421 non-null    object
 5   placeID       421 non-null    object
 6   data_nasc     455 non-null    object
 7   fauc_id       160 non-null    object
dtypes: int64(1), object(7)
memory usage: 28.6+ KB
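
The reason for reading `bnp_id` and `fauc_id` as strings can be illustrated on a small CSV. This is a minimal sketch, assuming pandas is installed; the rows below are invented for illustration and only the column names mirror the real file. Without the `dtype` argument, a numeric-looking column with gaps is read as floats, which would no longer match the string ids found in the authority records.

```python
import io

import pandas as pd

# Invented sample rows; only the column names mirror the real file.
csv_text = """wikidata,name,bnp_id,fauc_id
Q1,Example A,1460,140699
Q2,Example B,,150000
Q3,Example C,2826,
"""

# Default parsing: the gaps force the id columns to a float dtype.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype keeps the ids exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"bnp_id": str, "fauc_id": str})

# Keep the rows that have both ids and map BNP id -> FAUC id.
both = as_text.dropna(subset=["bnp_id", "fauc_id"])
mapping = dict(zip(both["bnp_id"], both["fauc_id"]))

print(as_numbers["bnp_id"].dtype)
print(mapping)
```

Only the first invented row has both ids, so it is the only one that ends up in `mapping`.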


In [90]:
bnp_fauc_dict

{'239156': '182939',
 '1460': '140699',
 '108598': '215040',
 '15765': '186608',
 '178095': '227536',
 '95880': '165553',
 '101830': '132510',
 '105505': '231020',
 '83928': '165113',
 '92687': '212876',
 '43584': '233633',
 '104153': '203860',
 '100061': '240348',
 '43290': '228499',
 '23095': '221394',
 '735392': '224045',
 '130524': '206494',
 '32584': '223565',
 '93744': '283501',
 '241793': '149900',
 '86230': '241774',
 '169783': '254021',
 '29480': '239985',
 '314551': '245300',
 '190731': '128838',
 '101823': '167004',
 '2826': '141759',
 '60483': '161685',
 '116993': '238951',
 '5625': '149831',
 '93825': '173131',
 '12867': '151823',
 '37499': '156097',
 '92638': '128467',
 '116712': '167725',
 '853810': '147720',
 '55851': '208773',
 '92636': '283949',
 '7353': '174689',
 '131512': '253184',
 '122935': '250279',
 '88166': '296360',
 '36459': '165280',
 '268786': '213294',
 '68981': '192904',
 '112058': '180816',
 '1418331': '185507',
 '30165': '177304',
 '105800': '199620',


## Portuguese National Library authority records available locally


Download the authority records from https://dados.gov.pt/pt/datasets/catalogo-bnp-registos-de-autoridade/ and place the files in `extras/bnp/catalogoautoridades.marcxchange`.

In [91]:
from pathlib import Path

path = '../extras/bnp/catalogoautoridades.marcxchange'
authority_records = list(Path(path).rglob('*.xml'))
print([f.name for f in authority_records])


['authorities_1723900_to_1844400.xml', 'authorities_456204_to_913891.xml', 'authorities_1290322_to_1444155.xml', 'authorities_1_to_100936.xml', 'authorities_1444156_to_1586439.xml', 'authorities_1586454_to_1723898.xml', 'authorities_1152445_to_1290321.xml', 'authorities_100937_to_184478.xml', 'authorities_913896_to_1152444.xml', 'authorities_264875_to_456203.xml', 'authorities_184479_to_264874.xml']
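
The listing above comes back in filesystem order, not in id order. If a deterministic order is wanted, the id range encoded in each file name can be used as a sort key. The sketch below uses only the standard library; the helper name `id_range` is hypothetical, and the file names are copied from the listing. Note that processing the files in a different order would also change the row order of the outputs shown later in this notebook.

```python
import re

# File names copied from the listing above.
names = [
    "authorities_1723900_to_1844400.xml",
    "authorities_456204_to_913891.xml",
    "authorities_1290322_to_1444155.xml",
    "authorities_1_to_100936.xml",
    "authorities_1444156_to_1586439.xml",
    "authorities_1586454_to_1723898.xml",
    "authorities_1152445_to_1290321.xml",
    "authorities_100937_to_184478.xml",
    "authorities_913896_to_1152444.xml",
    "authorities_264875_to_456203.xml",
    "authorities_184479_to_264874.xml",
]


def id_range(name):
    """Return (first_id, last_id) parsed from an authorities_<a>_to_<b>.xml name."""
    match = re.fullmatch(r"authorities_(\d+)_to_(\d+)\.xml", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))


# Sort numerically by the first id, then by the last id.
ordered = sorted(names, key=id_range)
for name in ordered:
    print(name)
```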


In [23]:
!pip install lxml



We parse the authority records, keeping Portuguese authors (country code `PT` in field 102) whose dates begin before 1900, i.e. before the 20th century.
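
As a rough illustration of the record layout this filter relies on, the sketch below reads one MARCXchange record and pulls out the control number (001), the country (102 `$a`) and the name and dates (200 `$a`, `$b`, `$f`). It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages. The record itself is invented for illustration and is not a real BNP entry, and the helper `subfield_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXchange files; "mx" is just a local prefix.
NS = {"mx": "info:lc/xmlns/marcxchange-v1"}

# Invented record for illustration; not a real BNP entry.
SAMPLE = """<collection xmlns="info:lc/xmlns/marcxchange-v1">
  <record>
    <controlfield tag="001">0000000</controlfield>
    <datafield tag="102" ind1=" " ind2=" ">
      <subfield code="a">PT</subfield>
    </datafield>
    <datafield tag="200" ind1=" " ind2="1">
      <subfield code="a">Silva,</subfield>
      <subfield code="b">João da</subfield>
      <subfield code="f">1745-1822</subfield>
    </datafield>
  </record>
</collection>"""


def subfield_text(record, tag, code):
    """Text of the first matching subfield, or None when it is absent."""
    node = record.find(f"mx:datafield[@tag='{tag}']/mx:subfield[@code='{code}']", NS)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)
summaries = []
for record in root.findall("mx:record", NS):
    record_id = record.find("mx:controlfield[@tag='001']", NS).text
    country = subfield_text(record, "102", "a")
    surname = subfield_text(record, "200", "a") or ""
    forename = subfield_text(record, "200", "b") or ""
    dates = subfield_text(record, "200", "f")
    # Forename first, surname last, with the cataloguing commas removed.
    name = f"{forename.strip(',')} {surname.strip(',')}".strip()
    summaries.append((record_id, country, name, dates))

print(summaries)
```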

In [92]:
from lxml import etree

from timelinknb import current_time,current_machine, get_db
from ucalumni.config import default_db
from ucalumni.aluno import get_and_process_aluno
from ucalumni.extractors import get_extractors

get_extractors()

db_spec = default_db
db = get_db(db_spec)
print(current_machine,current_time,f'db={db_spec}')


df = pd.DataFrame(columns = ['name_bn', 'date_bn', 'qualification', 'bnp_id', 'fauc_id'])

xsl_file = '../extras/bnp/visbd-fauc.xsl'
xsl = etree.parse(xsl_file) 

marcxchange_ns = "info:lc/xmlns/marcxchange-v1"
nsmap = {None: marcxchange_ns}

for auth_file_name in authority_records:
    print("Parsing: ",auth_file_name)
    print()
    auth_file = etree.parse(auth_file_name)
    recs = auth_file.getroot()
    
    for rec in recs:
        cf001 = rec.find("controlfield[@tag = '001']",namespaces=nsmap)

        bnp_id = cf001.text
        url = f"http://urn.bn.pt/bibliografia/unimarc/xml?id={bnp_id}"

        country = rec.find("datafield[@tag = '102']/subfield[@code='a']",namespaces=nsmap)
        if country is not None:
            if country.text == 'PT':
                # Portuguese author
                dates = rec.find("datafield[@tag = '200']/subfield[@code='f']",namespaces=nsmap)
                f200_a = rec.find("datafield[@tag = '200']/subfield[@code='a']",namespaces=nsmap)
                f200_b = rec.find("datafield[@tag = '200']/subfield[@code='b']",namespaces=nsmap)
                f200_c = rec.find("datafield[@tag = '200']/subfield[@code='c']",namespaces=nsmap)
                
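                if dates is not None and dates.text and len(dates.text)>=4:
                    date_text = dates.text
                    # drop leading/trailing qualifier characters such as "ca", "fl.", "?" and spaces
                    date_text = date_text.strip("ca?fl. ")
                    # the first two characters are the century prefix, e.g. "17" for 1745
                    century = date_text[:2]
                    try:
                        icentury = int(century)
                    except ValueError:
                        print(f"Could not understand date: |{dates.text}| on record id: {bnp_id}")
                        continue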
                        
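                    if icentury < 19:  # dates starting before 1900
                        # reset the name for every record so a missing 200$a cannot reuse the previous one
                        name = ""
                        if f200_a is not None and f200_a.text:
                            name = f200_a.text.strip(",")
                        if f200_b is not None and f200_b.text:
                            name = (f200_b.text.strip(",") + " " + name).strip()
                        if f200_c is not None:
                            qualification =  f200_c.text
                        else:
                            qualification = None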

                        
                        id = None
                        
                        if bnp_id  in bnp_fauc_dict.keys():  # this person was previously found in fauc
                            id = bnp_fauc_dict[bnp_id]
                            aluno = get_and_process_aluno(id,db)
                            print()
                            print()
                            print(aluno.as_entry())
                            records = etree.parse(url)
                            transform = etree.XSLT(xsl)
                            print(str(transform(records)))
                            print()
                            
                        print(name,date_text,qualification,bnp_id,id)
                        row = pd.DataFrame.from_dict({ 'name_bn' : [name], 
                                             'date_bn' : [date_text], 
                                             'qualification' : [qualification],
                                             'bnp_id': bnp_id,
                                             'fauc_id':id}, orient='columns')

                        df = pd.concat( [df, row], axis=0)



                        #for cf in rec.findall("controlfield",namespaces=nsmap):
                        #    print(cf.get('tag'),cf.text)
                        #print("...")

                        # tags=['101','102','123','160','200','300','305','310','320','330','340','356']
                        tags=[]
                        for tag in tags: 
                            # use a name other than df so the results DataFrame is not overwritten
                            field = rec.find(f"datafield[@tag = '{tag}']",namespaces=nsmap)
                            if field is not None:
                                print(f"{field.get('tag'):3s} {field.get('ind1'):1s}{field.get('ind2'):1s}")
                                for sf in list(field):
                                    print(f"   ${sf.get('code')} {sf.text}",)
                        

df.set_index('bnp_id', drop=False, inplace=True)                       




mini-m1.local 2022-05-25 19:14:31.757152 db=('sqlite', 'fauc.db')
Parsing:  ../extras/bnp/catalogoautoridades.marcxchange/authorities_1723900_to_1844400.xml

Pedro Mariz de Sousa Sarmento 1745-1822 None 1724018 None
João Ernesto Cabral de Vasconcelos Seive do Canto 17-- None 1724099 None
João Manuel de Melo 17-- None 1724100 None
Jacinto Pereira de Brito 17-- None 1724140 None
Francisco de Brito Cação 16-- None 1724142 None
Pedro André Borges de Lima 17-- None 1724170 None
Francisco Xavier de Albuquerque 17-- None 1724199 None
Peter Becker 16-- None 1724287 None
Santos Rodrigues Lima 17-- None 1724289 None
Manuel Rodrigues Batalha 17-- None 1724416 None
Giovanni Poleni 1683-1761 None 1724538 None
João Ribeiro Gaio -1601 None 1724960 None
Luís Antonio de Azevedo 1755-1815 None 1725000 None
Martinho da França Pereira Coutinho 1821-1884 None 1725685 None
A. Cyrillo Soares 1883-1950 None 1726866 None
6º Marquês de Marialva 1775-1823 None 1727067 None
Tomásia Maria Micaela de Loureiro e Lac
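
The date test applied in the loop can be isolated as a small function, which makes its edge cases easier to see. This is a sketch of the same logic (strip qualifier characters, read the first two characters as an integer, keep values below 19). The function names are hypothetical. Most sample strings are dates that appear in the output above; `"ca 1500"` and `"séc. 17"` are invented to show the qualifier and failure cases.

```python
def century_prefix(raw_date):
    """Return the first two characters of the cleaned date as an int, or None."""
    # str.strip removes any of these characters from both ends of the string.
    cleaned = raw_date.strip("ca?fl. ")
    try:
        return int(cleaned[:2])
    except ValueError:
        return None


def starts_before_1900(raw_date):
    """True when the century prefix could be read and is below 19."""
    prefix = century_prefix(raw_date)
    return prefix is not None and prefix < 19


samples = ["1745-1822", "17--", "168-", "1883-1950", "-1601", "19--", "ca 1500", "séc. 17"]
results = {sample: starts_before_1900(sample) for sample in samples}
for sample, kept in results.items():
    print(f"{sample!r:14} prefix={century_prefix(sample)} kept={kept}")
```

A date with only a death year, such as `-1601`, starts with a hyphen, so its first two characters are read as the integer -1 and the record is kept. That matches the behaviour seen in the output above, where `João Ribeiro Gaio -1601` is listed.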

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14556 entries, 1724018 to 264798
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name_bn        14556 non-null  object
 1   date_bn        14556 non-null  object
 2   qualification  1402 non-null   object
 3   bnp_id         14556 non-null  object
 4   fauc_id        48 non-null     object
dtypes: object(5)
memory usage: 682.3+ KB
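
The frame summarised above is grown in the loop with one `pd.concat` call per record, which copies the accumulated rows each time. A possible alternative, sketched below under the assumption that pandas is installed, is to collect plain dictionaries in a list and build the frame once at the end. The rows are invented for illustration; the column names are the ones used in this notebook.

```python
import pandas as pd

columns = ["name_bn", "date_bn", "qualification", "bnp_id", "fauc_id"]

# Invented rows standing in for the values extracted from each record.
parsed = [
    ("Example Author A", "17--", None, "9000001", None),
    ("Example Author B", "1745-1822", "frade", "9000002", "123456"),
]

rows = []
for name, date_text, qualification, bnp_id, fauc_id in parsed:
    # One dictionary per kept record; appending to a list is cheap.
    rows.append({
        "name_bn": name,
        "date_bn": date_text,
        "qualification": qualification,
        "bnp_id": bnp_id,
        "fauc_id": fauc_id,
    })

# Build the frame once, then index it by BNP id as in the notebook.
df = pd.DataFrame(rows, columns=columns)
df.set_index("bnp_id", drop=False, inplace=True)
print(df)
```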


In [86]:
df.head()


| bnp_id (index) | name_bn | date_bn | qualification | bnp_id | fauc_id |
|---|---|---|---|---|---|
| 1724018 | Pedro Mariz de Sousa Sarmento | 1745-1822 | | 1724018 | |
| 1724099 | João Ernesto Cabral de Vasconcelos Seive do Canto | 17-- | | 1724099 | |
| 1724100 | João Manuel de Melo | 17-- | | 1724100 | |
| 1724140 | Jacinto Pereira de Brito | 17-- | | 1724140 | |
| 1724142 | Francisco de Brito Cação | 16-- | | 1724142 | |


In [87]:
df.tail()

| bnp_id (index) | name_bn | date_bn | qualification | bnp_id | fauc_id |
|---|---|---|---|---|---|
| 264567 | João Gomes Ferreira | 1851-1897 | | 264567 | |
| 264653 | Bernardo Silva | 1867-1948 | | 264653 | |
| 264655 | Domingos Duarte | 168- | | 264655 | |
| 264692 | Luis Francisco Soares de Sousa Falcão | 1715- | | 264692 | |
| 264798 | Caetano Costa Lima | 1835-1898 | | 264798 | |


In [88]:
df.to_csv('../inferences/wikidata/bnp_authorities.csv')