> First time use: follow instructions in the README.md file in this directory.


**[PT]** Português

---

**[EN]** English


# Toponímia


Identificação e geolocalização dos topónimos

---

# Place names

Identification and geocoding of place names


## Setup

In [1]:
from timelink.api.database import TimelinkDatabase
from ucalumni.config import default_db_url

print(f"Creating TimelinkDatabase instance from {default_db_url}")
db = TimelinkDatabase(db_url=default_db_url)

Creating TimelinkDatabase instance from sqlite:///../database/sqlite3/fauc.db?check_same_thread=False


## Lista de lugares diferentes e número de ocorrências

---

## List of different places with number of occurrences

In [3]:
from timelink.pandas import attribute_values

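attribute = 'naturalidade'  # attribute holding the place of origin
period = ('1500-00-00', '1990-00-00')  # (start, end) of the date range

# One row per distinct attribute value, indexed by the value itself
places = attribute_values(attribute, dates_between=period, db=db)
# Copy the index into a regular column so it can be used as a comparison key later
places['place_name'] = places.index.values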
places.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11499 entries, Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       11499 non-null  int64 
 1   date_min    11499 non-null  object
 2   date_max    11499 non-null  object
 3   place_name  11499 non-null  object
dtypes: int64(1), object(3)
memory usage: 449.2+ KB


### Lugares principais

---

### Main locations

In [5]:
places.sort_values('count', ascending=False).head(10)



| value | count | date_min | date_max | place_name |
|---|---|---|---|---|
| Lisboa | 8784 | 1537-02-12 | 1916-07-19 | Lisboa |
| Coimbra | 5526 | 1537-00-00 | 1915-10-12 | Coimbra |
| Porto | 3391 | 1537-05-30 | 1917-10-22 | Porto |
| Braga | 1608 | 1540-01-21 | 1914-07-24 | Braga |
| Évora | 1072 | 1537-11-22 | 1910-10-10 | Évora |
| Viseu | 986 | 1537-00-00 | 1912-07-03 | Viseu |
| Guimarães | 980 | 1537-12-18 | 1912-07-18 | Guimarães |
| Lamego | 972 | 1537-00-00 | 1909-10-05 | Lamego |
| Aveiro | 790 | 1538-04-21 | 1913-10-13 | Aveiro |
| Vila Real | 765 | 1537-03-07 | 1909-11-09 | Vila Real |


### Lugares só com uma ocorrência

---

### Locations with just one occurrence

In [6]:
places[places['count'] == 1].info()

<class 'pandas.core.frame.DataFrame'>
Index: 7554 entries, - Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       7554 non-null   int64 
 1   date_min    7554 non-null   object
 2   date_max    7554 non-null   object
 3   place_name  7554 non-null   object
dtypes: int64(1), object(3)
memory usage: 295.1+ KB
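As a quick check on the two `info()` outputs above, the share of distinct place names that occur only once can be computed directly. This is a minimal sketch that hard-codes the two row counts reported above rather than querying the database; the variable names are illustrative.

```python
# Row counts copied from the info() outputs above
total_places = 11499  # distinct place names in the period
singletons = 7554     # place names with count == 1

share = singletons / total_places
print(f"{share:.1%} of distinct place names occur only once")
```

With the real data the same figure could be obtained from `(places['count'] == 1).mean()`.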


In [10]:
places[places['count']==1].sample(10)

| value | count | date_min | date_max | place_name |
|---|---|---|---|---|
| Vilões, Ourém | 1 | 1762-10-01 | 1762-10-01 | Vilões, Ourém |
| São Miguel, São João de Areias | 1 | 1730-10-30 | 1730-10-30 | São Miguel, São João de Areias |
| Eiras, Amadora | 1 | 1765-10-01 | 1765-10-01 | Eiras, Amadora |
| Côvo | 1 | 1687-10-01 | 1687-10-01 | Côvo |
| Lagarelhos | 1 | 1734-10-01 | 1734-10-01 | Lagarelhos |
| Quintiãs | 1 | 1823-10-21 | 1823-10-21 | Quintiãs |
| Alagoas | 1 | 1725-10-01 | 1725-10-01 | Alagoas |
| Alfazeirão | 1 | 1742-01-15 | 1742-01-15 | Alfazeirão |
| Couto de Capareiros | 1 | 1756-10-01 | 1756-10-01 | Couto de Capareiros |
| Orges | 1 | 1765-12-01 | 1765-12-01 | Orges |


# Identificação de topónimos

---

# Geocoding

https://craftingdh.netlify.app/tutorials/folium/

## Toponímia de Portugal Continental 1:200 000

### Serviço Nacional de Informação Geográfica

* Metadata: https://snig.dgterritorio.gov.pt/rndg/srv/por/catalog.search#/metadata/57479cf3-df10-47a0-9860-f7e3157596b1
* Data: http://mapas.dgterritorio.pt/ATOM-download/SCN200k/toponimia/toponimia.zip
* Public access without restrictions
* Whenever users publish or disseminate, in analogue or digital form, geographic information owned by the Direção-Geral do Território, even if partially adapted, they must give credit by including the text "Informação geográfica cedida pela Direção-Geral do Território"

Notes:
* Includes maps and layer information in dbf format.
* The data table does not include coordinates, but it does include an id (gml_id) that can be used to obtain the coordinates from other databases.
* May be useful for correcting spelling problems in the place names of the database.
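To illustrate the last note, a gazetteer of reference spellings can be used to suggest corrections for misspelled place names. The sketch below is only an illustration using the standard library `difflib`, not the method applied later in this notebook; the helper name `suggest_spelling`, the tiny gazetteer and the 0.85 cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical mini-gazetteer; in practice this would be the 'text' column of the DBF table
gazetteer = ["Figueiró dos Vinhos", "Castro Daire", "Alpedrinha"]

def suggest_spelling(name, reference, cutoff=0.85):
    """Return the closest reference spelling, or None if nothing is similar enough."""
    close = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return close[0] if close else None

for name in ["Figueró dos Vinhos", "Castre Daire", "Alagoas"]:
    print(name, "->", suggest_spelling(name, gazetteer))
```

Names with no sufficiently similar entry return `None`, so they can be set aside for manual review.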

### Convert the data to a Pandas DataFrame

In [11]:
!pip install simpledbf

Collecting simpledbf
  Downloading simpledbf-0.2.6.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: simpledbf
  Building wheel for simpledbf (pyproject.toml) ... done
  Created wheel for simpledbf: filename=simpledbf-0.2.6-py3-none-any.whl size=13784 sha256=35711c83222c74e40405863194e15eb86d61b9db67b9679506e6e627c5e7cbea
  Stored in directory: /Users/jrc/Library/Caches/pip/wheels/37/52/21/14be45b7c160488637e82d6a317f4379458bb4dd60be21d5fa
Successfully built simpledbf
Installing collected packages: simpledbf
Successfully installed simpledbf-0.2.6


In [13]:
import pandas as pd
from simpledbf import Dbf5

file_name = '../extras/geocoding/SNIG/portugal-continental-200k/Toponimia200k.dbf'
dbf = Dbf5(file_name)
print("Number of records:",dbf.numrec)
dbf.fields

Number of records: 7054


[('DeletionFlag', 'C', 1),
 ('gml_id', 'C', 254),
 ('beginLifes', 'C', 20),
 ('localId', 'C', 15),
 ('namespace', 'C', 35),
 ('versionId', 'N', 10),
 ('Integer', 'N', 10),
 ('mostDetail', 'N', 10),
 ('language', 'C', 3),
 ('sourceOfNa', 'C', 3),
 ('pronunciat', 'C', 254),
 ('text', 'C', 49),
 ('script', 'C', 4),
 ('relatedSpa', 'C', 27),
 ('relatedS_1', 'C', 45),
 ('relatedS_2', 'N', 10)]

These are included in the INE data but do not overlap.

In [14]:
topo200K_df = dbf.to_dataframe()
topo200K_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7054 entries, 0 to 7053
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   gml_id      7054 non-null   object 
 1   beginLifes  7054 non-null   object 
 2   localId     0 non-null      float64
 3   namespace   7054 non-null   object 
 4   versionId   7054 non-null   int64  
 5   Integer     7054 non-null   int64  
 6   mostDetail  7054 non-null   int64  
 7   language    7054 non-null   object 
 8   sourceOfNa  7054 non-null   object 
 9   pronunciat  0 non-null      float64
 10  text        7054 non-null   object 
 11  script      7054 non-null   object 
 12  relatedSpa  7054 non-null   object 
 13  relatedS_1  0 non-null      float64
 14  relatedS_2  0 non-null      float64
dtypes: float64(4), int64(3), object(8)
memory usage: 826.8+ KB


In [17]:
topo200K_df.sample(5)

| | gml_id | beginLifes | localId | namespace | versionId | Integer | mostDetail | language | sourceOfNa | pronunciat | text | script | relatedSpa | relatedS_1 | relatedS_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2241 | PT.GN.114541 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Água Formosa | Latn | CDG200k_populatedPlace2241 | | |
| 3879 | PT.GN.116179 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Fontinha | Latn | CDG200k_populatedPlace3526 | | |
| 1669 | PT.GN.113969 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Areias | Latn | CDG200k_populatedPlace1631 | | |
| 3511 | PT.GN.115811 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Santos Evos | Latn | CDG200k_populatedPlace3658 | | |
| 5460 | PT.GN.117760 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Lixa | Latn | CDG200k_populatedPlace5170 | | |


In [18]:
topo200K_df[topo200K_df.gml_id == 'PT.GN.50']

| | gml_id | beginLifes | localId | namespace | versionId | Integer | mostDetail | language | sourceOfNa | pronunciat | text | script | relatedSpa | relatedS_1 | relatedS_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


There is no column identifying the type of place name,
but the type can be inferred from the "relatedSpa" field,
which holds a unique code with a prefix.

In [19]:
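# str.strip() takes a set of characters, not a prefix: it removes the leading
# "CDG200k_" characters and the trailing digits, leaving only the type label
topo200K_df['inferredType'] = topo200K_df.relatedSpa.str.strip("CDG_k0123456789")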
topo200K_df.groupby(['inferredType'])['inferredType'].count()

inferredType
buildingForte          28
buildingSantuario      14
landformCabo           10
landformIlha           19
landformPonta          58
landformSerra         159
populatedPlace       6766
Name: inferredType, dtype: int64
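The character-set `strip` used above works for the codes in this file. A pattern-based variant makes the assumed layout explicit and reports codes that do not fit it. The sketch below assumes every `relatedSpa` code looks like `CDG200k_<type><number>`, as in the samples shown; `infer_type` is a hypothetical helper name and the code list is a small hand-copied sample.

```python
import re
from collections import Counter

# Assumed layout: fixed "CDG200k_" prefix, a type label, then a running number
CODE_PATTERN = re.compile(r"^CDG200k_(?P<type>\D+?)\d*$")

def infer_type(code):
    """Return the type label embedded in a relatedSpa code, or None if it does not fit."""
    match = CODE_PATTERN.match(code.strip())
    return match.group("type") if match else None

# Codes copied from the sample rows shown in this notebook
codes = [
    "CDG200k_populatedPlace2241",
    "CDG200k_populatedPlace3526",
    "CDG200k_buildingSantuario28",
]
type_counts = Counter(infer_type(code) for code in codes)
print(type_counts)
```

Returning `None` for unexpected codes makes deviations from the assumed layout visible instead of silently producing a partly stripped label.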

In [20]:
topo200K_df[topo200K_df.inferredType.str.contains('Santuario')]

| | gml_id | beginLifes | localId | namespace | versionId | Integer | mostDetail | language | sourceOfNa | pronunciat | text | script | relatedSpa | relatedS_1 | relatedS_2 | inferredType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6680 | PT.GN.118980 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Fátima | Latn | CDG200k_buildingSantuario28 | | | buildingSantuario |
| 6681 | PT.GN.118981 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Imaculado Coração de Maria | Latn | CDG200k_buildingSantuario29 | | | buildingSantuario |
| 6682 | PT.GN.118982 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | N. Srª da Conceição | Latn | CDG200k_buildingSantuario30 | | | buildingSantuario |
| 6683 | PT.GN.118983 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | N. Srª dos Milagres | Latn | CDG200k_buildingSantuario31 | | | buildingSantuario |
| 6684 | PT.GN.118984 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | S. Bartolumeu | Latn | CDG200k_buildingSantuario32 | | | buildingSantuario |
| 6685 | PT.GN.118985 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Srª de Nazo | Latn | CDG200k_buildingSantuario33 | | | buildingSantuario |
| 6686 | PT.GN.118986 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | N. Srª dos Anúncios | Latn | CDG200k_buildingSantuario34 | | | buildingSantuario |
| 6687 | PT.GN.118987 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Cristo Rei | Latn | CDG200k_buildingSantuario35 | | | buildingSantuario |
| 6688 | PT.GN.118988 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Senhora da Rocha | Latn | CDG200k_buildingSantuario36 | | | buildingSantuario |
| 6689 | PT.GN.118989 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Srº da Pedra | Latn | CDG200k_buildingSantuario37 | | | buildingSantuario |


In [21]:
topo200K_df.sample(10)

| | gml_id | beginLifes | localId | namespace | versionId | Integer | mostDetail | language | sourceOfNa | pronunciat | text | script | relatedSpa | relatedS_1 | relatedS_2 | inferredType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2501 | PT.GN.114801 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Serra das Alhadas | Latn | CDG200k_populatedPlace2606 | | | populatedPlace |
| 561 | PT.GN.112861 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | São Bartolomeu | Latn | CDG200k_populatedPlace332 | | | populatedPlace |
| 590 | PT.GN.112890 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Aldeia Ruiva | Latn | CDG200k_populatedPlace361 | | | populatedPlace |
| 129 | PT.GN.112429 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Arzinha | Latn | CDG200k_populatedPlace129 | | | populatedPlace |
| 6953 | PT.GN.119253 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Vila Nova de Cerveira | Latn | CDG200k_populatedPlace6726 | | | populatedPlace |
| 3236 | PT.GN.115536 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Vila Nova | Latn | CDG200k_populatedPlace3362 | | | populatedPlace |
| 4239 | PT.GN.116539 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Travassos | Latn | CDG200k_populatedPlace4407 | | | populatedPlace |
| 4463 | PT.GN.116763 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Sarzeda | Latn | CDG200k_populatedPlace4131 | | | populatedPlace |
| 5482 | PT.GN.117782 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | Covilhã | Latn | CDG200k_populatedPlace5192 | | | populatedPlace |
| 3369 | PT.GN.115669 | 2015-01-01T00:00:00Z | | http://id.igeo.pt/so/GN/NamedPlaced | 2015 | 577791 | 144447 | Por | DGT | | São Bento | Latn | CDG200k_populatedPlace3495 | | | populatedPlace |


In [22]:
topo200k_file = '../inferences/places/topo_200k.csv'
topo200K_df[['gml_id','text','sourceOfNa']].to_csv(topo200k_file,index=False)

### Cross-reference with place names in the local database

In [23]:
!pip install recordlinkage

Collecting recordlinkage
  Downloading recordlinkage-0.16-py3-none-any.whl.metadata (8.1 kB)
Collecting jellyfish>=1 (from recordlinkage)
  Downloading jellyfish-1.1.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.6 kB)
Collecting scipy>=1 (from recordlinkage)
  Downloading scipy-1.14.1-cp311-cp311-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting scikit-learn>=1 (from recordlinkage)
  Downloading scikit_learn-1.5.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (13 kB)
Collecting joblib (from recordlinkage)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn>=1->recordlinkage)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading recordlinkage-0.16-py3-none-any.whl (926 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 926.9/926.9 kB 18.4 MB/s eta 0:00:00
Downloading jellyfish-1.1.0-cp311-cp311-macosx_11_0_arm64.whl (303 kB)
Downloading scik

In [24]:
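import recordlinkage
from recordlinkage.preprocessing import clean

# Sorted-neighbourhood indexing: pair only names that sit close together
# when the values of 'place_name' and 'text' are sorted
indexer = recordlinkage.index.SortedNeighbourhood('place_name', 'text', window=11)
candidates = indexer.index(places, topo200K_df)
print(len(candidates))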

52284
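The 52284 candidate pairs are a small fraction of the roughly 81 million pairs in the full 11499 × 7054 product. The sketch below illustrates the idea behind sorted-neighbourhood indexing in plain Python. It is a simplified illustration under my own assumptions, not the `recordlinkage` implementation; the function name and the toy lists are hypothetical.

```python
from collections import defaultdict

def sorted_neighbourhood_pairs(left, right, window=3):
    """Pair items whose values lie within window // 2 positions of each other
    in the sorted list of all distinct values from both sides."""
    keys = sorted(set(left) | set(right))
    rank = {key: position for position, key in enumerate(keys)}

    # Group the right-hand rows by the rank of their value
    right_by_rank = defaultdict(list)
    for j, value in enumerate(right):
        right_by_rank[rank[value]].append(j)

    half = window // 2
    pairs = []
    for i, value in enumerate(left):
        centre = rank[value]
        for neighbour in range(centre - half, centre + half + 1):
            for j in right_by_rank.get(neighbour, []):
                pairs.append((i, j))
    return pairs

left = ["Marmeleira", "Carvalhos", "Alagoas"]
right = ["Marmeleiro", "Carvalhosa", "Lixa", "Fontinha"]
pairs = sorted_neighbourhood_pairs(left, right, window=3)
print(len(pairs), "candidate pairs instead of", len(left) * len(right))
```

A larger window yields more candidates and fewer missed matches, at the cost of more string comparisons in the next step. Because the pairing depends on sort order, names that differ in their first letters are unlikely to be paired.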


In [None]:
compare = recordlinkage.Compare()
# Available methods: 'jaro', 'jarowinkler', 'levenshtein', 'damerau_levenshtein',
# 'qgram', 'cosine', 'smith_waterman', 'lcs'
compare.string('place_name', 'text',
    method='damerau_levenshtein',
    threshold=0.90,  # similarity at or above the threshold scores 1.0, otherwise 0.0
    label='score')
features = compare.compute(candidates, places, topo200K_df)
features.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 52284 entries, ('Guimarães', np.int64(2792)) to ('Óbidos, Brasil', np.int64(6679))
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   score   52284 non-null  float64
dtypes: float64(1)
memory usage: 1.2+ MB
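To make the 0.90 threshold concrete, the sketch below computes a normalised Damerau-Levenshtein similarity in plain Python. It uses the optimal string alignment variant and assumes the similarity is normalised by the length of the longer string; the library's exact computation may differ in edge cases, and the function names are illustrative.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def similarity(a, b):
    """Similarity in [0, 1], normalised by the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else (longest - osa_distance(a, b)) / longest

for a, b in [("Alperdinha", "Alpedrinha"), ("Castre Daire", "Castro Daire")]:
    print(a, "|", b, "|", osa_distance(a, b), "|", round(similarity(a, b), 3))
```

Under this normalisation a single edit in a ten-character name gives a similarity of exactly 0.9, so such pairs sit right at the threshold, while a single edit in a name shorter than ten characters falls below it. Counting the swapped letters in "Alperdinha" as one edit rather than two is what distinguishes this measure from plain Levenshtein.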


### Total place names identified

In [26]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

1.0     3524
0.0    48760
Name: count, dtype: int64

#### Check inferences

In [27]:
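# Keep only the candidate pairs that passed the threshold
potential_matches = features[features.sum(axis=1) > 0].reset_index()

# 'level_1' holds the row label in topo200K_df; look up the gazetteer name and id
matched_rows = potential_matches['level_1']
potential_matches['topo'] = topo200K_df.loc[matched_rows, 'text'].values
potential_matches['topo_id'] = topo200K_df.loc[matched_rows, 'gml_id'].values
# Show the matches where the spelling differs between the two sources
potential_matches[potential_matches.value != potential_matches.topo].head(50)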

| | value | level_1 | score | topo | topo_id |
|---|---|---|---|---|---|
| 0 | Marmeleira | 3106 | 1.0 | Marmeleiro | PT.GN.115406 |
| 1 | Souto Covo | 4944 | 1.0 | Souto Novo | PT.GN.117244 |
| 2 | Souto Covo | 6484 | 1.0 | Souto Novo | PT.GN.118784 |
| 3 | Escalas de Baixo | 2187 | 1.0 | Escalos de Baixo | PT.GN.114487 |
| 4 | Aldeia Nossa do Cabo | 2494 | 1.0 | Aldeia Nova do Cabo | PT.GN.114794 |
| 5 | Carvalhos | 4039 | 1.0 | Carvalhosa | PT.GN.116339 |
| 6 | Carvalhos | 5352 | 1.0 | Carvalhosa | PT.GN.117652 |
| 7 | Reguengo de Monsaraz | 270 | 1.0 | Reguengos de Monsaraz | PT.GN.112570 |
| 8 | Santa Marinha de Zêzere | 4941 | 1.0 | Santa Marinha do Zêzere | PT.GN.117241 |
| 9 | Carvalheira | 4489 | 1.0 | Carvalheiro | PT.GN.116789 |
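Some place names match more than one gazetteer entry (for example "Souto Covo" pairs with two different "Souto Novo" ids), so a match alone does not identify a single location. The sketch below separates unambiguous from ambiguous matches, using a few rows copied from the table above; the variable names are illustrative.

```python
from collections import defaultdict

# (value, topo, topo_id) rows copied from the output above
matches = [
    ("Marmeleira", "Marmeleiro", "PT.GN.115406"),
    ("Souto Covo", "Souto Novo", "PT.GN.117244"),
    ("Souto Covo", "Souto Novo", "PT.GN.118784"),
    ("Carvalhos", "Carvalhosa", "PT.GN.116339"),
    ("Carvalhos", "Carvalhosa", "PT.GN.117652"),
    ("Reguengo de Monsaraz", "Reguengos de Monsaraz", "PT.GN.112570"),
]

# Collect every gazetteer id proposed for each place name
ids_by_place = defaultdict(list)
for value, _topo, topo_id in matches:
    ids_by_place[value].append(topo_id)

unambiguous = {place: ids[0] for place, ids in ids_by_place.items() if len(ids) == 1}
ambiguous = {place: ids for place, ids in ids_by_place.items() if len(ids) > 1}

print("unambiguous:", unambiguous)
print("ambiguous:", ambiguous)
```

Ambiguous names need additional context to be resolved, and even unambiguous pairs such as "Carvalheira" and "Carvalheiro" may be distinct places rather than spelling variants, so a manual check remains advisable.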


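Check how many of the place names previously recorded as not found (listed in `osm_not_found.csv`) can be matched against this gazetteer.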

In [28]:
from os.path import exists

# CSV listing the place names that were not found (column 'not_found')
not_found_file = '../inferences/places/osm_not_found.csv'

if exists(not_found_file):
    not_found_df = pd.read_csv(not_found_file)
else:
    # No file yet: start with an empty table
    not_found_df = pd.DataFrame(columns=['not_found'])

not_found = list(not_found_df['not_found'])


In [29]:
nfs = not_found_df['not_found'].values
not_found_but_in_topo = potential_matches[potential_matches['value'].isin(nfs)].sort_values(['value','topo'])
not_found_but_in_topo[not_found_but_in_topo['value'] != not_found_but_in_topo['topo']]

| | value | level_1 | score | topo | topo_id |
|---|---|---|---|---|---|
| 4 | Aldeia Nossa do Cabo | 2494 | 1.0 | Aldeia Nova do Cabo | PT.GN.114794 |
| 3491 | Alperdinha | 2438 | 1.0 | Alpedrinha | PT.GN.114738 |
| 3502 | Alvarelos | 3056 | 1.0 | Alvarelhos | PT.GN.115356 |
| 3492 | Carrapatos | 5660 | 1.0 | Carrapatas | PT.GN.117960 |
| 3503 | Carrazedo de Ansiães | 4667 | 1.0 | Carrazeda de Ansiães | PT.GN.116967 |
| 60 | Castre Daire | 4459 | 1.0 | Castro Daire | PT.GN.116759 |
| 3 | Escalas de Baixo | 2187 | 1.0 | Escalos de Baixo | PT.GN.114487 |
| 3514 | Figueró dos Vinhos | 2222 | 1.0 | Figueiró dos Vinhos | PT.GN.114522 |
| 36 | Marmaleira | 1400 | 1.0 | Marmeleira | PT.GN.113700 |
| 37 | Marmaleira | 2857 | 1.0 | Marmeleira | PT.GN.115157 |
