> First time use: follow instructions in the README.md file in this directory.


# Toponímia


Identificação e geolocalização dos topónimos

---

# Place names

Identification and geocoding of place names


## Setup

In [18]:
from timelink.api.database import TimelinkDatabase
from ucalumni.config import default_db_url

print(f"Creating TimelinkDatabase instance from {default_db_url}")
db = TimelinkDatabase(db_url=default_db_url)


Creating TimelinkDatabase instance from sqlite:///../database/sqlite3/fauc.db?check_same_thread=False


We use the results of another project that aims at geocoding and "administrative coding"
of place names in historical sources:
* https://github.com/joaquimrcarvalho/toponimia-portuguesa (currently private; contact joaquimcarvalho@mpu.edu.mo for access).

The results consist of two files:

* `inferences/places/gngc_names_geocoded.csv` with place names currently geocoded
* `inferences/places/gngc_changes.csv` with place names that require normalization for geocoding

These files are produced by the notebook https://github.com/joaquimrcarvalho/toponimia-portuguesa/blob/9b84d476e4fd4400c3c6d748a744c13168ed72a2/001-geocoding.ipynb

Get a fresh copy of those files and put them in `inferences/places/` before running this notebook.
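
Before running the cells below, it can help to confirm that both files are in place. This is a minimal sketch, assuming the files are reachable as `../inferences/places/` relative to the notebook (the prefix used by the `read_csv` call in this notebook). The helper name `missing_input_files` is hypothetical.

```python
from pathlib import Path

# Assumption: location of the files as seen from this notebook,
# matching the "../inferences/places/..." path passed to read_csv.
PLACES_DIR = Path("../inferences/places")
REQUIRED_FILES = ["gngc_names_geocoded.csv", "gngc_changes.csv"]


def missing_input_files(directory: Path, names: list[str]) -> list[str]:
    """Return the entries of `names` that do not exist as files in `directory`."""
    return [name for name in names if not (directory / name).is_file()]


missing = missing_input_files(PLACES_DIR, REQUIRED_FILES)
if missing:
    print("Missing files in", PLACES_DIR, ":", ", ".join(missing))
else:
    print("All input files found in", PLACES_DIR)
```

If the notebook is run from a different working directory, `PLACES_DIR` would need to be adjusted accordingly.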



### Check for updated files in the toponimia project

In [2]:
import pandas as pd
# read geocoded_places from inferences/places/gngc_names_geocoded.csv
geocoded_places = pd.read_csv("../inferences/places/gngc_names_geocoded.csv")
# convert geonamesid to a nullable integer (Int64)
geocoded_places["geonamesid"] = pd.to_numeric(geocoded_places["geonamesid"], errors='coerce').astype('Int64')
# trim columns name and name_normalized
geocoded_places["name"] = geocoded_places["name"].str.strip()
geocoded_places["name_normalized"] = geocoded_places["name_normalized"].str.strip()
geocoded_places.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11498 entries, 0 to 11497
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pop              11498 non-null  int64  
 1   date_in          11498 non-null  object 
 2   date_max         11498 non-null  object 
 3   name             11497 non-null  object 
 4   name_normalized  11497 non-null  object 
 5   id_ine_inspire   15 non-null     object 
 6   geonamesid       4978 non-null   Int64  
 7   context_0        11497 non-null  object 
 8   context_1        4387 non-null   object 
 9   context_2        281 non-null    object 
 10  context_3        116 non-null    object 
 11  country          11498 non-null  object 
 12  gaz_id_explicit  3 non-null      object 
 13  latitude         4977 non-null   float64
 14  longitude        4977 non-null   float64
 15  code             4923 non-null   object 
 16  dsg              4923 non-null   object 
 17  level       
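
The `geonamesid` column has many missing values (4978 non-null out of 11498 rows), which is why the cell above converts it with `pd.to_numeric(..., errors='coerce')` followed by `.astype('Int64')`: the nullable integer dtype keeps identifiers as integers while representing gaps as `<NA>`. A small illustration with made-up values (not taken from the data):

```python
import pandas as pd

# Toy values standing in for a raw geonamesid column read from CSV
raw = pd.Series(["101", None, "202.0", "n/a"])

# errors="coerce" turns unparseable entries into NaN (float64 result);
# astype("Int64") then converts to pandas' nullable integer dtype
ids = pd.to_numeric(raw, errors="coerce").astype("Int64")

print(ids.dtype)         # Int64
print(ids.isna().sum())  # 2
```

Without the `Int64` step the column would stay `float64`, and identifiers would display with a trailing `.0`.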

In [3]:

geocoded_places.sample(5)

Unnamed: 0,pop,date_in,date_max,name,name_normalized,id_ine_inspire,geonamesid,context_0,context_1,context_2,...,code,dsg,level,inside,address,geonames_name,gazetteer,total_score,date,wikidata_id
665,15,1630-10-16,1885-10-13,Belém,"Belém, Santa Maria de Belém, Lisboa, Lisboa",,2270978.0,Belém,Santa Maria de Belém,Lisboa,...,PRT.12.7.32_1,Santa Maria de Belém,3.0,PRT.12.7_1,"Belém, Santa Maria de Belém, Lisboa, Lisboa",Belém,geonames-pt,30.0,2024-11-10 14:52:29.134665,Q376866
9818,1,1868-10-01,1868-10-01,"São João de Aião, Porto","São João de Aião, Porto",,,São João de Aião,Porto,,...,,,,,,,,,,
10839,1,1824-10-02,1824-10-02,"Vale de Madeiros, Viseu","Vale de Madeiros, Viseu",,2733105.0,Vale de Madeiros,Viseu,,...,PRT.20.9.2_1,Canas de Senhorim,3.0,PRT.20.9_1,"Vale de Madeiros, Canas de Senhorim, Nelas, Viseu",Vale de Madeiros,geonames-pt,15.0,2024-11-10 14:52:32.396240,*notfound*
10799,1,1817-11-08,1817-11-08,"Vale Mendes, Vila Real","Vale Mendes, Vila Real",,,Vale Mendes,Vila Real,,...,,,,,,,,,,
8205,1,1729-12-05,1729-12-05,Praga,Praga,,,Praga,,,...,,,,,,,,,,


### Get the change list

Records where the birth place had to be changed in order to geocode the place name.

These changes include spelling corrections, names that changed in modern times, and qualification of ambiguous names by adding geographic context.

The list can also be inferred by comparing the columns `name` and `name_normalized` 
and keeping the rows where they differ.

Note that some changes involve only extra spaces in the name and so are not immediately evident.
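
As a sketch of the comparison described above, the following uses a toy DataFrame to select the rows where the two columns differ and to flag those that differ only in whitespace. Only the column names `name` and `name_normalized` come from the data; the rows and the helper `collapse_spaces` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Arrifana de Sousa", "Vieira  do Minho", "Praga"],
        "name_normalized": ["Penafiel", "Vieira do Minho", "Praga"],
    }
)

# Rows where the original and normalized names differ
changed = toy[toy["name"] != toy["name_normalized"]].copy()


def collapse_spaces(s: pd.Series) -> pd.Series:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return s.str.split().str.join(" ")


# True when the only difference between the two names is whitespace
changed["whitespace_only"] = (
    collapse_spaces(changed["name"]) == collapse_spaces(changed["name_normalized"])
)
print(changed)
```

Rows flagged `whitespace_only` are the ones that look identical on screen but still count as changes.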

In [6]:
# derive the change list: rows where name differs from name_normalized
changes = geocoded_places[geocoded_places.name != geocoded_places.name_normalized]
changes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 593 entries, 12 to 11371
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pop              593 non-null    int64  
 1   date_in          593 non-null    object 
 2   date_max         593 non-null    object 
 3   name             592 non-null    object 
 4   name_normalized  592 non-null    object 
 5   id_ine_inspire   11 non-null     object 
 6   geonamesid       488 non-null    Int64  
 7   context_0        592 non-null    object 
 8   context_1        259 non-null    object 
 9   context_2        148 non-null    object 
 10  context_3        107 non-null    object 
 11  country          593 non-null    object 
 12  gaz_id_explicit  1 non-null      object 
 13  latitude         488 non-null    float64
 14  longitude        488 non-null    float64
 15  code             487 non-null    object 
 16  dsg              487 non-null    object 
 17  level            4

Number of records involved in the changes

In [11]:
# sum column pop in df changes
print("Number of records involved in place name changes:", changes["pop"].sum())


Number of records involved in place name changes: 8569


In [12]:
changes.sample(5)

Unnamed: 0,pop,date_in,date_max,name,name_normalized,id_ine_inspire,geonamesid,context_0,context_1,context_2,...,code,dsg,level,inside,address,geonames_name,gazetteer,total_score,date,wikidata_id
61,202,1540-10-20,1770-10-01,Arrifana de Sousa,Penafiel,,2736469,Penafiel,,,...,PRT.15.11.24_1,Penafiel,3.0,PRT.15.11_1,"Penafiel, Penafiel, Penafiel, Porto",Penafiel,geonames-pt,20.0,2024-11-10 14:52:29.276160,Q49287885
7895,1,1583-10-18,1583-10-18,Pedrulha do Monte,"Pedrulha, Coimbra (Santa Cruz), Coimbra, Coimbra",,2736519,Pedrulha,Coimbra (Santa Cruz),Coimbra,...,PRT.7.3.13_1,Coimbra (Santa Cruz),3.0,PRT.7.3_1,"Pedrulha, Coimbra (Santa Cruz), Coimbra, Coimbra",Pedrulha,geonames-pt,30.0,2024-11-10 14:52:29.117772,*notfound*
646,16,1626-10-21,1842-10-31,Rio de Vide,Rio Vide,,8011848,Rio Vide,,,...,PRT.7.9.3_1,Rio Vide,3.0,PRT.7.9_1,"Rio Vide, Rio Vide, Miranda do Corvo, Coimbra",Rio Vide,geonames-pt,15.0,2024-11-10 14:52:30.878577,Q1024471
597,18,1779-10-20,1848-06-05,"Viana, Minho",Viana do Castelo,,2732773,Viana do Castelo,,,...,PRT.18.9.34_1,Viana do Castelo (Monserrate),3.0,PRT.18.9_1,"Viana do Castelo, Viana do Castelo (Monserrate...",Viana do Castelo,geonames-pt,20.0,2024-11-10 14:52:29.249437,Q49278394
313,40,1581-10-21,1876-06-06,Vieira,Vieira do Minho,,2732748,Vieira do Minho,,,...,PRT.4.11.20_1,Vieira do Minho,3.0,PRT.4.11_1,"Vieira do Minho, Vieira do Minho, Vieira do Mi...",Vieira do Minho,geonames-pt,20.0,2024-11-10 14:52:29.359670,*notfound*


## Create field information to add to the original record

We produce a text snippet to add to the main record at https://pesquisa.auc.uc.pt/details?id=264605

In [13]:
# iterate through the rows of the changes DataFrame
for index, row in changes.iterrows():
    geonamesid = row["geonamesid"]
    # only geocoded rows (those with a Geonames id) get a snippet
    if pd.isna(geonamesid):
        continue
    name = row["name_normalized"]
    geoname_name = row["geonames_name"]
    address = row["address"]
    country = row["country"]
    wikidata_id = row["wikidata_id"]
    snippet = f"Naturalidade designação alternativa: {name}\n"
    # missing values are NaN rather than None, so test with pd.notna
    if pd.notna(address) and address != name:
        snippet += f"Naturalidade contexto: {address}\n"
    if pd.notna(country):
        snippet += f"País: {country}\n"
    snippet += f"Geonames id: {geonamesid}\n"
    snippet += f"Geonames name: {geoname_name}\n"
    if pd.notna(wikidata_id) and wikidata_id != "*notfound*":
        snippet += f"Wikidata id: {wikidata_id}\n"

    # store the snippet in the "snippet" column of this row
    changes.loc[index, "snippet"] = snippet


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  changes.loc[index, "snippet"] = snippet
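
The message above is pandas' `SettingWithCopyWarning`: `changes` was created by filtering `geocoded_places`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` of the filtered frame and build the column with `apply`. The helper `build_snippet` is hypothetical and produces only a subset of the fields used in the cell above; the id is made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Correlos", "Praga"],
        "name_normalized": ["Currelos", "Praga"],
        "geonamesid": pd.array([123, pd.NA], dtype="Int64"),  # made-up id
        "country": ["Portugal", None],
    }
)

# .copy() makes the filtered frame independent of `toy`, so adding a
# column to it is unambiguous and raises no warning
toy_changes = toy[toy["name"] != toy["name_normalized"]].copy()


def build_snippet(row: pd.Series) -> str:
    """Build a reduced text snippet for one geocoded row."""
    lines = [f"Naturalidade designação alternativa: {row['name_normalized']}"]
    if pd.notna(row["country"]):
        lines.append(f"País: {row['country']}")
    lines.append(f"Geonames id: {row['geonamesid']}")
    return "\n".join(lines) + "\n"


# Only rows with a Geonames id receive a snippet; the others stay empty
geocoded = toy_changes["geonamesid"].notna()
toy_changes.loc[geocoded, "snippet"] = toy_changes[geocoded].apply(build_snippet, axis=1)
print(toy_changes.loc[0, "snippet"])
```

Because the copy is independent, the new `snippet` column appears only in `toy_changes` and not in `toy`.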


In [19]:
changes[['name', 'name_normalized', 'address', 'snippet']].sample(5)

| | name | name_normalized | address | snippet |
|---|---|---|---|---|
| 2023 | Brasil, Pernambuco | Recife, Pernambuco, Brasil | Recife, Recife, Pernambuco, Brasil | Naturalidade designação alternativa: Recife, P... |
| 7220 | Minas, Congonhas do Sabará | Congonhas, Brasil | Congonhas, Congonhas, Minas Gerais, Brasil | Naturalidade designação alternativa: Congonhas... |
| 9648 | São Tiago de Cacém | São Tiago de Cacém | | |
| 835 | Correlos | Currelos | Currelos, Currelos, Carregal do Sal, Viseu | Naturalidade designação alternativa: Currelos\... |
| 642 | Pedrulha | Pedrulha, Coimbra (Santa Cruz), Coimbra, Coimbra | Pedrulha, Coimbra (Santa Cruz), Coimbra, Coimbra | Naturalidade designação alternativa: Pedrulha,... |


In [20]:
from timelink.pandas import entities_with_attribute
from pandas import DataFrame

# iterate through the rows of changes where geonamesid is not null,
# starting with the place names that involve the most records
see_only = 15
for index, row in changes[changes.geonamesid.notnull()].sort_values('pop',ascending=False).head(see_only).iterrows():
    # get the original (non-normalized) place name
    naturalidade = row["name"]
    print("Naturalidade:", naturalidade)
    print(row["snippet"])
    people: DataFrame = entities_with_attribute(
        the_type="naturalidade",
        the_value=naturalidade,
        entity_type="person",
        dates_in=('1530-01-01', '2021-01-01'),
        show_elements=["name"],
        more_attributes=["uc-entrada", "uc-saida"],
        db=db)
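    # iterate through a sample of the people found
    print("Total de estudantes:", len(people))
    print("Amostra:")
    # sample at most 5 people (sample(5) fails when fewer are found) and
    # avoid reusing the outer loop variable `index`
    for person_idx, person in people.sample(min(5, len(people))).iterrows():
        print(person_idx, person["name"], person["uc-entrada"], person["uc-saida"])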
    print()


Naturalidade: Ilha da Madeira
Naturalidade designação alternativa: Funchal (Sé), Funchal (Sé), Funchal, Madeira
País: Portugal
Geonames id: 8014027
Geonames name: Funchal (Sé)
Wikidata id: Q2190091

Total de estudantes: 575
Amostra:
172253 Lourenço de Matos 1648-10-09 1650-10-15
176565 Simão José de Oliveira 1795-10-13 1795-10-13
181173 Manuel da Cunha Guterres 1745-10-01 1754-07-19
168012 Pantaleão de Sá e Freitas 1730-12-13 1735-05-18
241278 Bartolomeu Luís Pimenta 1741-10-01 1745-07-01

Naturalidade: Baía, Brasil
Naturalidade designação alternativa: Salvador, Brasil
Naturalidade contexto: Salvador, Salvador, Bahia, Brasil
País: Brasil
Geonames id: 3450554
Geonames name: Salvador
Wikidata id: Q32148682

Total de estudantes: 495
Amostra:
142178 José Pires de Carvalho e Albuquerque 1729-10-01 1734-07-21
151475 Manuel Teles Barreto 1675-11-09 1678-10-10
151926 João Borges de Barros 1685-10-01 1691-10-01
208724 Paulo Ferreira de Andrade Vargas 1752-10-01 1760-07-29
193834 Francisco Ramir