# Database overview notebook

> _Before using the `notebooks` for the first time, follow the instructions in the `README.md` file in this directory._

## Setup

In [2]:
from timelinknb import current_time, current_machine, get_mhk_db
import ucalumni.config as alumniconf

# The database name comes from the project configuration.
db_name = alumniconf.mhk_db_name
db = get_mhk_db(db_name)
print(current_machine, current_time, f'db={db_name}')



mini-m1.local 2022-05-10 22:50:53.809694 db=ucalumni


## Database status

In [4]:
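from sqlalchemy import select, text
from timelinknb import Session

# Count entities per class with a complete SQL statement,
# instead of wrapping a partial statement in select().
stmt = text('select class, count(*) as n from entities group by class')

with Session() as session:
    classes = session.execute(stmt)
    for c, n in classes:
        print(f'{n:6} | {c}')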


   235 | act
     1 | aregister
2850168 | attribute
    17 | class
155307 | person
205316 | relation
    24 | rperson
   235 | source
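
The per-class count above boils down to a single `GROUP BY` query. As a minimal sketch, assuming only that the `entities` table has a `class` column (as the query in the cell implies), the same aggregation can be tried against a throwaway in-memory SQLite database with the standard-library `sqlite3` module. The sample rows and the `count_by_class` helper are hypothetical and are not part of `timelinknb`.

```python
import sqlite3


def count_by_class(conn):
    """Return (class, n) rows, one per entity class, ordered by class name."""
    query = "SELECT class, COUNT(*) AS n FROM entities GROUP BY class ORDER BY class"
    return conn.execute(query).fetchall()


# Hypothetical sample data standing in for the real `entities` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, class TEXT)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?)",
    [("p1", "person"), ("p2", "person"), ("a1", "act"), ("s1", "source")],
)

counts = count_by_class(conn)
for cls, n in counts:
    print(f"{n:6} | {cls}")
```

The `ORDER BY` clause is an addition for readability; the notebook query leaves the row order to the database.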


In [3]:
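from sqlalchemy import select
from timelink.mhk.models.base import Person
from timelinknb import Session, pperson

# Show the first three persons in the database with pperson().
with Session() as session:
    person_ids = session.execute(
        select(Person.id).limit(3)
    )
    # Each row is a one-element tuple; avoid shadowing the builtin id().
    for (person_id,) in person_ids:
        pperson(person_id)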


n$"""
            Pedro Dias
            vide Nogueira
        """/m/id=162326/obs="""

            Id: 162326
            Código de referência: PT/AUC/ELU/UC-AUC/B/001-001/D/001033

            Nome        : Pedro Dias
            vide Nogueira
            Data inicial: 1565-10-01
            Data final  : 1572-07-03
            Filiação: João Fernandes
            Naturalidade: Coimbra
            Faculdade:

            Matrícula(s):
            Instituta e Cânones - 01.10.1565 até 1566
            03.10.1566 até 17.06.1567
            01.10.1567 até 14.06.1568
            01.10.1568 até 30.04.1569
            01.11.1569 até 08.07.1570
            01.10.1570 até 10.06.1571
            01.20.1571 até 03.07.1572

            Provou o tempo que se requer para se fazer Bacharel em Artes - 1565
            24.02.15650 recebeu o grau de Bacharel em Artes
            Exame Grau de Bacharel em Cânones - 19.07.1570


        """
  atr$código-de-referência/"PT/AUC/ELU/UC-AUC/B/001-001/D/00103

## Source files

In [5]:
from sqlalchemy import select
from pathlib import Path

from timelink.mhk.models.base import Source
from timelinknb import Session


# Kleio source files on disk, identified by file name without extension.
kleio_files = [f.stem for f in Path('../sources').rglob('*.cli')]
print("Number of kleio_files:", len(kleio_files))

stmt = select(Source.id, Source.updated)

with Session() as session:
    sources_in_db = [s.id for s in session.execute(stmt)]

# Compute each set difference once and reuse it.
not_in_db = sorted(set(kleio_files) - set(sources_in_db))
no_file = sorted(set(sources_in_db) - set(kleio_files))

print("Number of imported files:", len(sources_in_db))
print("Files not in the database:", len(not_in_db))
print(*not_in_db)
print("Imported sources with no file found:", len(no_file))
print(*no_file)


Number of kleio_files: 235
Number of imported files: 235
Files not in the database: 0

Imported sources with no file found: 0
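
The check above is a two-way set difference between file names and source ids. As a minimal sketch, assuming (as the notebook's comparison implies) that a source id equals the stem of its `.cli` file, the logic can be isolated in a small function and tried on a temporary directory. The `compare_sources` helper, the file names and the ids are hypothetical.

```python
from pathlib import Path
import tempfile


def compare_sources(source_dir, ids_in_db):
    """Compare .cli file stems under source_dir with source ids from the database.

    Returns (missing_from_db, missing_from_disk) as sorted lists.
    """
    file_stems = {f.stem for f in Path(source_dir).rglob("*.cli")}
    db_ids = set(ids_in_db)
    return sorted(file_stems - db_ids), sorted(db_ids - file_stems)


# Hypothetical directory layout and ids, only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "lista-A.cli").touch()
    (Path(tmp) / "sub" / "lista-B.cli").touch()
    not_imported, no_file = compare_sources(tmp, ["lista-A", "lista-C"])

print("Files not in the database:", not_imported)
print("Imported sources with no file found:", no_file)
```

Both lists being empty, as in the output above, means the files on disk and the imported sources match one to one by name.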



## Analyse attributes extracted from records

### Attributes in the database

In [8]:
from sqlalchemy import func, select
from timelinknb import get_attribute_table

attr_table = get_attribute_table()

# Equivalent SQL: select the_type, count(*) as tot from attributes group by the_type
stmt = select(attr_table.c.the_type, func.count().label('tot')).group_by('the_type')
print(stmt)
attributes_extracted = []
with Session() as session:
    nml = session.execute(stmt)
    for the_type, tot in nml:
        print(f'{tot:6} | {the_type}')
        attributes_extracted.append((the_type, tot))

# Types containing "." (e.g. "faculdade.ano") are treated as redundant; keep the others.
attr_not_redundand = [(t, c) for (t, c) in attributes_extracted if "." not in t]

SELECT attributes.the_type, count(*) AS tot 
FROM attributes GROUP BY attributes.the_type
105298 | código-de-referência
   356 | colegio
105298 | data-do-registo
    16 | errata
 53697 | exame
104638 | faculdade
  7167 | faculdade-original
104638 | faculdade.ano
 87112 | grau
 87112 | grau.ano
 40672 | instituta
 40672 | instituta.ano
  2444 | matricula-classe
  2444 | matricula-classe.ano
   114 | matricula-classe.obrigado
   114 | matricula-classe.obrigado.ano
    32 | matricula-classe.ordinário
    32 | matricula-classe.ordinário.ano
     2 | matricula-classe.voluntário
     2 | matricula-classe.voluntário.ano
   170 | matricula-curso
   170 | matricula-curso.ano
313032 | matricula-faculdade
313032 | matricula-faculdade.ano
  9475 | matricula-faculdade.obrigado
  9475 | matricula-faculdade.obrigado.ano
  6512 | matricula-faculdade.ordinário
  6512 | matricula-faculdade.ordinário.ano
  1546 | matricula-faculdade.voluntário
  1546 | matricula-faculdade.voluntário.ano
  1373 | matricul
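
In the listing above, dotted types such as `faculdade.ano` appear with the same count as their undotted counterpart, which is presumably why the notebook treats them as redundant. The following is a minimal sketch of that split, assuming the part before the first "." names the base type. The `split_attribute_types` helper is hypothetical; the sample rows are copied from the listing.

```python
from collections import defaultdict


def split_attribute_types(type_counts):
    """Separate base attribute types from dotted (derived) ones.

    Returns (base, derived): base is a list of (type, count) pairs without a dot;
    derived maps each base type name to the list of its dotted variants.
    """
    base = []
    derived = defaultdict(list)
    for the_type, count in type_counts:
        if "." in the_type:
            parent = the_type.split(".", 1)[0]
            derived[parent].append((the_type, count))
        else:
            base.append((the_type, count))
    return base, dict(derived)


# A few rows copied from the listing above.
sample = [
    ("faculdade", 104638),
    ("faculdade.ano", 104638),
    ("matricula-classe", 2444),
    ("matricula-classe.ano", 2444),
    ("matricula-classe.obrigado", 114),
    ("matricula-classe.obrigado.ano", 114),
]
base, derived = split_attribute_types(sample)
print(base)
print(derived["matricula-classe"])
```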

In [12]:
import pandas as pd

df = pd.DataFrame.from_records(attr_not_redundand, columns=['attr','n'])
df.sort_values('attr')

|   | attr | n |
|---|------|---|
| 1 | colegio | 356 |
| 0 | código-de-referência | 105298 |
| 2 | data-do-registo | 105298 |
| 3 | errata | 16 |
| 4 | exame | 53697 |
| 5 | faculdade | 104638 |
| 6 | faculdade-original | 7167 |
| 7 | grau | 87112 |
| 8 | instituta | 40672 |
| 9 | matricula-classe | 2444 |


**Example of a complex record**

This record illustrates what the extraction algorithm currently handles: correction of the “Faculdade” value, religious order, titles, enrollment at class level, exam results and degrees.

In [31]:
from timelinknb.pandas import group_attributes

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_colwidth', 100)
df = group_attributes(['163686'])
# Drop attribute types whose name contains a literal "." (regex=False disables pattern matching).
df = df[~df['type'].str.contains(".", regex=False)]
df = df[['date', 'type', 'value', 'attr_obs']].sort_values(['date', 'type', 'value'])
df.fillna(" ")

| id | date | type | value | attr_obs |
|----|------|------|-------|----------|
| 163686 | 1798-10-15 | faculdade | Teologia | Faculdade corrigida |
| 163686 | 1798-10-15 | faculdade-original | Matemática | |
| 163686 | 1798-10-15 | matricula-faculdade | Filosofia | 15.10.1798 |
| 163686 | 1798-10-15 | matricula-faculdade | Matemática | (obrigado) |
| 163686 | 1798-10-15 | naturalidade | Rio Bom | |
| 163686 | 1798-10-15 | nome | José Doutél | |
| 163686 | 1798-10-15 | nome-apelido | Doutél | |
| 163686 | 1798-10-15 | nome-geografico | Rio Bom | |
| 163686 | 1798-10-15 | nome-nota | frei monge de São Bernardo | |
| 163686 | 1798-10-15 | nome-pai | António Venceslau Doutel | |
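
The filter in the cell above passes `regex=False` to `str.contains`. pandas interprets the pattern as a regular expression by default, and in a regular expression "." matches any character, so without the flag every non-empty type would match and be dropped. The difference can be sketched with the standard library alone; the short list of types is taken from the attribute listing earlier in the notebook.

```python
import re

types = ["faculdade", "faculdade.ano", "grau", "grau.ano"]

# Regular-expression match: "." matches any character, so every type matches.
regex_matches = [t for t in types if re.search(".", t)]

# Literal substring test: only the dotted types match.
literal_matches = [t for t in types if "." in t]

# Keeping only the types without a literal dot mirrors the notebook's filter.
kept = [t for t in types if "." not in t]

print(regex_matches)
print(literal_matches)
print(kept)
```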


João Pedro Ribeiro

In [33]:
df = group_attributes(['316297'])
df = df[~df['type'].str.contains(".", regex=False)]
df = df[['date', 'type', 'value', 'attr_obs']].sort_values(['date', 'type', 'value'])
df.fillna(" ")

| id | date | type | value | attr_obs |
|----|------|------|-------|----------|
| 316297 | 1774-11-15 | faculdade | Cânones | Faculdade corrigida |
| 316297 | 1774-11-15 | faculdade-original | Direito - Cânones | |
| 316297 | 1774-11-15 | matricula-classe | Curso jurídico, 1º ano | Vol. III, L. I, fl. 20v. |
| 316297 | 1774-11-15 | matricula-curso | Curso jurídico | Vol. III, L. I, fl. 20v. |
| 316297 | 1774-11-15 | naturalidade | Porto | |
| 316297 | 1774-11-15 | nome | João Pedro Ribeiro | |
| 316297 | 1774-11-15 | nome-apelido | Pedro Ribeiro | |
| 316297 | 1774-11-15 | nome-apelido | Ribeiro | |
| 316297 | 1774-11-15 | nome-geografico | Porto | |
| 316297 | 1774-11-15 | nome-pai | Pedro do Rosário Ribeiro | |
