> First time use: follow instructions in the README.md file in this directory.


**[PT]** Português

---

**[EN]** English


# Georeferenciação com GeoNames


Identificação e geolocalização de topónimos com GeoNames.

Este bloco de notas utiliza informação disponibilizada por GeoNames em 
http://www.geonames.org segundo a licença [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)

---

# Georeferencing place names with GeoNames

Identification and geocoding of place names with GeoNames.

This notebook uses information made availabe be GeoNames
http://www.geonames.org under a [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/)



## Ligação à base de dados local

Para inicializar a base de dados
local ver [000-database-setup](000-database-setup.ipynb)

---

## Setup local database access  

To initialize the local database see [000-database-setup](000-database-setup.ipynb)


In [1]:
from timelink.api.database import TimelinkDatabase
from ucalumni.config import default_db_url

print(f"Creating TimelinkDatabase instance from {default_db_url}")
db = TimelinkDatabase(db_url=default_db_url)

Creating TimelinkDatabase instance from sqlite:///../database/sqlite3/fauc.db?check_same_thread=False


## Lista de lugares diferentes e número de ocorrências

---

## List of different places with number of occurrences

In [3]:
from timelink.pandas import attribute_values

attribute = 'naturalidade'
period = ('1500-00-00','1990-00-00')

places = attribute_values(attribute,dates_between=period, db=db)
places['place_name'] = places.index.values
places.reset_index()
places.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11499 entries, Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       11499 non-null  int64 
 1   date_min    11499 non-null  object
 2   date_max    11499 non-null  object
 3   place_name  11499 non-null  object
dtypes: int64(1), object(3)
memory usage: 449.2+ KB


### Lugares principais

---

### Main locations

In [4]:
places.sort_values('count', ascending=False).head(10)



Unnamed: 0_level_0,count,date_min,date_max,place_name
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Lisboa,8784,1537-02-12,1916-07-19,Lisboa
Coimbra,5526,1537-00-00,1915-10-12,Coimbra
Porto,3391,1537-05-30,1917-10-22,Porto
Braga,1608,1540-01-21,1914-07-24,Braga
Évora,1072,1537-11-22,1910-10-10,Évora
Viseu,986,1537-00-00,1912-07-03,Viseu
Guimarães,980,1537-12-18,1912-07-18,Guimarães
Lamego,972,1537-00-00,1909-10-05,Lamego
Aveiro,790,1538-04-21,1913-10-13,Aveiro
Vila Real,765,1537-03-07,1909-11-09,Vila Real


### Lugares só com uma ocorrência
---

### Locations with just one occurrence

In [5]:
places[places['count'] == 1].info()

<class 'pandas.core.frame.DataFrame'>
Index: 7554 entries, - Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       7554 non-null   int64 
 1   date_min    7554 non-null   object
 2   date_max    7554 non-null   object
 3   place_name  7554 non-null   object
dtypes: int64(1), object(3)
memory usage: 295.1+ KB


In [6]:
places[places['count']==1].head(10)

Unnamed: 0_level_0,count,date_min,date_max,place_name
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
- Lisboa,1,1658-10-02,1658-10-02,- Lisboa
- Vila Franca,1,1658-10-01,1658-10-01,- Vila Franca
-Lisboa,1,1723-10-01,1723-10-01,-Lisboa
"A de Barros, Caria",1,1765-10-01,1765-10-01,"A de Barros, Caria"
"A de Barros, Lamego",1,1624-10-10,1624-10-10,"A de Barros, Lamego"
A dos Francos,1,1745-10-01,1745-10-01,A dos Francos
"ALgaça, Poiares",1,1751-10-01,1751-10-01,"ALgaça, Poiares"
AZagães,1,1749-10-01,1749-10-01,AZagães
Abade,1,1747-12-14,1747-12-14,Abade
Abade de São Romão,1,1540-01-10,1540-01-10,Abade de São Romão


## Geonames
>The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge.

* http://www.geonames.org
* Dowloads em http://download.geonames.org/export/dump/

Description of available files in [readme.txt](../extras/gecoding/geonames/readme.txt)

To use this notebook you need the following files from geonames:

* one or more "XX.ZIP" for the countries of interest
* the file "featureCodes.txt" to 
To import geocode data download the files needed from the link above
into directory `../extras/geocoding/geonames/` (to use another directory 
change the variable `path_to_geonames` in the cell bellow  ).


In [18]:
# Collect geonames files
from pathlib import Path
from os.path import exists
path_to_geonames = '../extras/geocoding/geonames'

files = list(Path(path_to_geonames).rglob("[A-Z][A-Z].txt"))
[file.name for file in sorted(files)]


['AO.txt',
 'BR.txt',
 'CV.txt',
 'ES.txt',
 'GW.txt',
 'IE.txt',
 'MZ.txt',
 'PT.txt',
 'ST.txt',
 'TL.txt']

In [None]:
# from readme.txt
read_me = """
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see d, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int)
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
"""


In [19]:
lines = read_me.splitlines()
fields = [f.split(':')[0].strip().replace(' ','_') for f in lines if f != '']
fields

['geonameid',
 'name',
 'asciiname',
 'alternatenames',
 'latitude',
 'longitude',
 'feature_class',
 'feature_code',
 'country_code',
 'cc2',
 'admin1_code',
 'admin2_code',
 'admin3_code',
 'admin4_code',
 'population',
 'elevation',
 'dem',
 'timezone',
 'modification_date']

### Converter informação para Pandas DataFrame

In [20]:
 dtypes = {
    'geonameid':str,
    'admin1_code':str,
    'admin2_code':str,
    'admin3_code':str,
    'admin4_code':str,
 }

In [21]:
import pandas as pd

geonames_df = None
for file in [f for f in files if 'alternatenames' not in str(f.parent)] :
    print("Reading from ",file.name)
    df = pd.read_csv(file,sep='\t',names=fields,dtype=dtypes,header=0, low_memory=False, index_col='geonameid')
    if geonames_df is None:
        geonames_df = df.copy()
    else:
        geonames_df = pd.concat([geonames_df,df],axis=0)
geonames_df.info()

Reading from  MZ.txt
Reading from  TL.txt
Reading from  ST.txt
Reading from  PT.txt
Reading from  GW.txt
Reading from  AO.txt
Reading from  BR.txt
Reading from  CV.txt
Reading from  ES.txt
Reading from  IE.txt
<class 'pandas.core.frame.DataFrame'>
Index: 419185 entries, 345948 to 12493118
Data columns (total 18 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   name               419184 non-null  object 
 1   asciiname          419182 non-null  object 
 2   alternatenames     145930 non-null  object 
 3   latitude           419185 non-null  float64
 4   longitude          419185 non-null  float64
 5   feature_class      419185 non-null  object 
 6   feature_code       419161 non-null  object 
 7   country_code       419185 non-null  object 
 8   cc2                25133 non-null   object 
 9   admin1_code        418343 non-null  object 
 10  admin2_code        230887 non-null  object 
 11  admin3_code        122463 non-null 

Get the admin level 5 codes which are in separate file


In [22]:
admin_code5_exists = False
admin_code5_file = '../extras/geocoding/geonames/adminCode5.txt'
if exists(admin_code5_file):
    admin_code5_exists = True
    geonames_ac5  = pd.read_csv(admin_code5_file,sep='\t',names=['geonameid','admin5_code'],header=None, dtype={'geonameid':'str','admin5_code':'str'},index_col='geonameid')
    geonames_df = pd.merge(geonames_df, geonames_ac5, how='left',on='geonameid')
geonames_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 419185 entries, 345948 to 12493118
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   name               419184 non-null  object 
 1   asciiname          419182 non-null  object 
 2   alternatenames     145930 non-null  object 
 3   latitude           419185 non-null  float64
 4   longitude          419185 non-null  float64
 5   feature_class      419185 non-null  object 
 6   feature_code       419161 non-null  object 
 7   country_code       419185 non-null  object 
 8   cc2                25133 non-null   object 
 9   admin1_code        418343 non-null  object 
 10  admin2_code        230887 non-null  object 
 11  admin3_code        122463 non-null  object 
 12  admin4_code        3367 non-null    object 
 13  population         419185 non-null  int64  
 14  elevation          11182 non-null   float64
 15  dem                419185 non-null  int64  
 16  

Manter apenas topónimos povoados

---

Keep only populated places

In [23]:
geonames_df = geonames_df[geonames_df.population>0]
geonames_df.reset_index(inplace=True)
geonames_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30700 entries, 0 to 30699
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   geonameid          30700 non-null  object 
 1   name               30700 non-null  object 
 2   asciiname          30700 non-null  object 
 3   alternatenames     19274 non-null  object 
 4   latitude           30700 non-null  float64
 5   longitude          30700 non-null  float64
 6   feature_class      30700 non-null  object 
 7   feature_code       30700 non-null  object 
 8   country_code       30700 non-null  object 
 9   cc2                229 non-null    object 
 10  admin1_code        30697 non-null  object 
 11  admin2_code        29828 non-null  object 
 12  admin3_code        21253 non-null  object 
 13  admin4_code        153 non-null    object 
 14  population         30700 non-null  int64  
 15  elevation          403 non-null    float64
 16  dem                307

In [24]:
geonames_df[geonames_df.country_code == 'PT'].sample(10)

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date,admin5_code
2276,8011565,Remondes,Remondes,,41.38923,-6.78475,A,ADM3,PT,,5,408,40814,,212,,551,Europe/Lisbon,2019-07-24,
1597,8010886,Vila Ruiva,Vila Ruiva,,38.2553,-7.93251,A,ADM3,PT,,3,207,20704,,467,,197,Europe/Lisbon,2019-07-24,
2245,8011534,Fradizela,Fradizela,,41.65509,-7.18071,A,ADM3,PT,,5,407,40714,,234,,342,Europe/Lisbon,2019-07-24,
4858,8014148,Além da Ribeira,Alem da Ribeira,,39.66852,-8.40842,A,ADM3,PT,,18,1418,141816,,764,,123,Europe/Lisbon,2019-07-24,
3390,8012679,Viariz,Viariz,,41.17084,-7.97808,A,ADM3,PT,,17,1302,130220,,520,,787,Europe/Lisbon,2019-07-24,
2080,8011369,Portela das Cabras,Portela das Cabras,,41.66997,-8.49914,A,ADM3,PT,,4,313,31338,,278,,269,Europe/Lisbon,2019-07-24,
4862,8014152,Juncal do Campo,Juncal do Campo,,39.90892,-7.60114,A,ADM3,PT,,6,502,50210,,355,,261,Europe/Lisbon,2019-07-24,
1005,3372952,Lajes,Lajes,"Lagens,Lajens,Lajes,TER",38.76352,-27.10336,P,PPL,PT,,23,4302,430206,,3744,,67,Atlantic/Azores,2018-03-15,
726,2737437,Montemor-o-Velho,Montemor-o-Velho,"Montemor-o-Vel'ju,Montemor-o-Velho,Montemor-u-...",40.17287,-8.68616,P,PPLA2,PT,,7,610,61007,,3154,,20,Europe/Lisbon,2014-04-06,
687,2736337,Perelhal,Perelhal,"Parelhal,Perelhal",41.53075,-8.68982,P,PPL,PT,,4,302,30260,,1749,,61,Europe/Lisbon,2018-06-21,


### Get extra information (example)

#### Get feature codes

In [28]:
from os.path import exists

fcodes_exist = False
features_codes_file = '../extras/geocoding/geonames/featureCodes_en.txt'
if exists(features_codes_file):
    fcodes_exist = True
    geonames_fc  = pd.read_csv(features_codes_file,sep='\t',names=['fcode','fname','fdesc'],index_col='fcode', header=0)

In [29]:
geonames_fc.head()

Unnamed: 0_level_0,fname,fdesc
fcode,Unnamed: 1_level_1,Unnamed: 2_level_1
A.ADM1H,historical first-order administrative division,a former first-order administrative division
A.ADM2,second-order administrative division,a subdivision of a first-order administrative ...
A.ADM2H,historical second-order administrative division,a former second-order administrative division
A.ADM3,third-order administrative division,a subdivision of a second-order administrative...
A.ADM3H,historical third-order administrative division,a former third-order administrative division


#### Get admin codes

In [31]:
from os.path import exists

admin_code1_exists = False
admin_code1_file = '../extras/geocoding/geonames/admin1CodesASCII.txt'
if exists(admin_code1_file):
    admin_code1_exists = True
    geonames_ac1  = pd.read_csv(admin_code1_file,sep='\t',names=['acode1','ac1_name','ac1_name_ascii','geonames_id'],dtype={'geonames_id':'str'},index_col='geonames_id', header=0)

admin_code2_exists = False
admin_code2_file = '../extras/geocoding/geonames/admin2Codes.txt'
if exists(admin_code2_file):
    admin_code2_exists = True
    geonames_ac2  = pd.read_csv(admin_code1_file,sep='\t',names=['acode2','ac2_name','ac2_name_ascii','geonames_id'],dtype={'geonames_id':'str'},index_col='geonames_id', header=0)



In [32]:
geonames_ac1.head()

Unnamed: 0_level_0,acode1,ac1_name,ac1_name_ascii
geonames_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3039676,AD.05,Ordino,Ordino
3040131,AD.04,La Massana,La Massana
3040684,AD.03,Encamp,Encamp
3041203,AD.02,Canillo,Canillo
3041566,AD.07,Andorra la Vella,Andorra la Vella


In [33]:
geonames_ac1.loc[geonames_ac1.ac1_name == 'Coimbra']

Unnamed: 0_level_0,acode1,ac1_name,ac1_name_ascii
geonames_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2740636,PT.07,Coimbra,Coimbra


In [34]:
geonames_ac1.loc['2740636']

acode1              PT.07
ac1_name          Coimbra
ac1_name_ascii    Coimbra
Name: 2740636, dtype: object

In [35]:
place = 'Coimbra'
result = geonames_df[geonames_df.name == place]
for i,row in result.iterrows():
    name = row['name']
    fcode = f"{row.feature_class}.{row.feature_code}"
    if fcodes_exist:
        fcode_desc = geonames_fc.loc[fcode].fname
    else:
        fcodes = 'NA'

    admin_code = None
    for acode_column in ['admin4_code','admin3_code','admin2_code','admin1_code']:
        if type(row[acode_column]) is str:
            admin_code = row[acode_column]
            break
    if admin_code is None:
        admin_code = '(NA)'
    print(f" {row.country_code} {row.geonameid} {name} {fcode} {fcode_desc} {admin_code}")
    print(f"      {row.alternatenames}")
result

 PT 2740637 Coimbra P.PPLA seat of a first-order administrative division 060325
      CBP,Coimbra,Coímbra,Coïmbra,Koimbra,Koimbro,Koimpra,Koímbra,ke ying bu la,koinbura,Κόιμπρα,Коимбра,コインブラ,科英布拉
 PT 8010483 Coimbra A.ADM2 second-order administrative division 0603
      Coimbra,Coimbra Municipality,Coinvra,Conimbriga,Coímbra,Coïmbra,Gorad Kaimbra,Koimbra,Koimbro,Koimpra,Koímbra,ke ying bu la,ko xim bra,ko'imabra,koimbeula,koinbura,kwymbra,kwyymbra,qlmryt,qwymbrh,Κοΐμπρα,Горад Каімбра,Коимбра,Коїмбра,קוימברה,قلمرية,کوئیمبرا,کویمبرا,কোইমব্রা,โกอิมบรา,კოიმბრა,コインブラ,科英布拉,코임브라
 BR 6321278 Coimbra A.ADM2 second-order administrative division 3116704
      nan


Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date,admin5_code
859,2740637,Coimbra,Coimbra,"CBP,Coimbra,Coímbra,Coïmbra,Koimbra,Koimbro,Ko...",40.20564,-8.41955,P,PPLA,PT,,7,603,60325.0,,106582,,98,Europe/Lisbon,2019-02-26,
1195,8010483,Coimbra,Coimbra,"Coimbra,Coimbra Municipality,Coinvra,Conimbrig...",40.21026,-8.42683,A,ADM2,PT,,7,603,,,143396,,88,Europe/Lisbon,2020-02-07,
10237,6321278,Coimbra,Coimbra,,-20.84494,-42.79834,A,ADM2,BR,,15,3116704,,,7054,,740,America/Sao_Paulo,2015-07-20,


### Cruzar com topónimos da base local

In [36]:
!pip install recordlinkage

.bash_profile RUN!


In [37]:
import recordlinkage
from recordlinkage.preprocessing import clean

indexer = recordlinkage.index.SortedNeighbourhood('place_name','name',window=11)
candidates = indexer.index(places,geonames_df)
print(len(candidates))

88187


In [38]:
compare = recordlinkage.Compare()
compare.string('place_name','name',
    # ['jaro', 'jarowinkler', 'levenshtein', 'damerau_levenshtein', 'qgram', 'cosine', 'smith_waterman', 'lcs'].
    method='damerau_levenshtein',
    threshold=0.90,
    label='score')
compare.exact('place_name','name',
    label='equal')
features = compare.compute(candidates,places,geonames_df)
features.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 88187 entries, ('Porto', np.int64(8560)) to ('Óvoa, Viseu', np.int64(11379))
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   score   88187 non-null  float64
 1   equal   88187 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 2.7+ MB


In [39]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

2.0     4052
1.0      428
0.0    83707
Name: count, dtype: int64

In [40]:
potential_matches = features[features.sum(axis=1) >0 ].reset_index()

potential_matches['place_name']=potential_matches['value']
potential_matches.drop('value',axis=1, inplace=True)
potential_matches['geoname']=geonames_df.loc[potential_matches['level_1']]['name'].values
potential_matches['country']=geonames_df.loc[potential_matches['level_1']]['country_code'].values
potential_matches['pop']=geonames_df.loc[potential_matches['level_1']]['population'].values
potential_matches['geoname_id']=geonames_df.loc[potential_matches['level_1']]['geonameid'].values
potential_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4480 entries, 0 to 4479
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   level_1     4480 non-null   int64  
 1   score       4480 non-null   float64
 2   equal       4480 non-null   int64  
 3   place_name  4480 non-null   object 
 4   geoname     4480 non-null   object 
 5   country     4480 non-null   object 
 6   pop         4480 non-null   int64  
 7   geoname_id  4480 non-null   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 280.1+ KB


In [41]:
potential_matches.sort_values(['place_name','geoname','country','pop'],
                                        ascending=[True,True,True,False],
                                        inplace=True)
potential_matches.sample(10)


Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
13,5003,1.0,0,Pinheiro de Azere,Pinheiro de Ázere,PT,937,8014293
3317,3737,1.0,1,Fráguas,Fráguas,PT,905,8013027
3379,3079,1.0,1,Marrazes,Marrazes,PT,22528,8012368
2785,4256,1.0,1,Pinhão,Pinhão,PT,648,8013546
3943,1398,1.0,1,Nordeste,Nordeste,PT,4937,8010686
1848,312,1.0,1,Monchique,Monchique,PT,5421,2266268
2326,5121,1.0,1,Santa Cruz da Trapa,Santa Cruz da Trapa,PT,1313,8014411
2829,178,1.0,1,Seixal,Seixal,PT,656,2263117
463,2185,1.0,1,Poiares,Poiares,PT,411,8011474
3071,2423,1.0,1,Penha Garcia,Penha Garcia,PT,748,8011712


In [42]:
potential_matches.drop_duplicates(subset=['place_name','geoname','country'], keep='first',inplace=True)
potential_matches.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2706 entries, 3560 to 3172
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   level_1     2706 non-null   int64  
 1   score       2706 non-null   float64
 2   equal       2706 non-null   int64  
 3   place_name  2706 non-null   object 
 4   geoname     2706 non-null   object 
 5   country     2706 non-null   object 
 6   pop         2706 non-null   int64  
 7   geoname_id  2706 non-null   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 190.3+ KB


In [46]:
potential_matches.sample(10)

Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
1943,1540,1.0,1,Arada,Arada,PT,3318,8010829
3445,4622,1.0,1,Pinho,Pinho,PT,777,8013912
1381,5003,1.0,1,Pinheiro de Ázere,Pinheiro de Ázere,PT,937,8014293
1075,4105,1.0,1,Outeiro,Outeiro,PT,1234,8013395
3763,14803,1.0,1,Escobar,Escobar,ES,386,2517948
2648,4564,1.0,1,Vilar Seco,Vilar Seco,PT,745,8013854
3034,4100,1.0,1,Meixedo,Meixedo,PT,467,8013390
138,4962,1.0,0,Cernache de Bonjardim,Cernache do Bonjardim,PT,3052,8014252
3570,3392,1.0,1,Airães,Airães,PT,2486,8012681
3286,4594,1.0,1,Cárquere,Cárquere,PT,854,8013884


#### Verificar inferências

In [47]:
potential_matches[potential_matches.equal == 0]

Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
4387,3749,1.0,0,Abituteiras,Abitureiras,PT,972,8013039
144,957,1.0,0,Aboim da Nobrega,Aboim da Nóbrega,PT,987,2743436
4388,957,1.0,0,Aboim de Nóbrega,Aboim da Nóbrega,PT,987,2743436
94,948,1.0,0,Albergaria-a Velha,Albergaria-a-Velha,PT,7974,2743233
4416,1126,1.0,0,Albergaria-a-Velha,Albergaria-A-Velha,PT,25252,8010414
...,...,...,...,...,...,...,...,...
165,546,1.0,0,Vilar do Paraiso,Vilar do Paraíso,PT,14727,2732444
4379,3935,1.0,0,Vilarelhos,Vilarelho,PT,1125,8013225
4360,5443,1.0,0,Vilarinho da Castanheiro,Vilarinho da Castanheira,PT,415,8014733
4294,4809,1.0,0,Várzea de Moruge,Várzea de Meruge,PT,249,8014099


In [53]:
potential_matches.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2706 entries, 3560 to 3172
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   level_1     2706 non-null   int64  
 1   score       2706 non-null   float64
 2   equal       2706 non-null   int64  
 3   place_name  2706 non-null   object 
 4   geoname     2706 non-null   object 
 5   country     2706 non-null   object 
 6   pop         2706 non-null   int64  
 7   geoname_id  2706 non-null   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 190.3+ KB


Check how many not found we can match

In [52]:
from os.path import exists

not_found_file = '../inferences/places/osm_not_found.csv'

not_found_df: pd.DataFrame = None

if exists(not_found_file):
    not_found_df = pd.read_csv(not_found_file)
    not_found = list(not_found_df['not_found'])
else:
    not_found = []
    not_found_df = pd.DataFrame(columns=['not_found'])
not_found_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4539 entries, 0 to 4538
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   not_found  4538 non-null   object
dtypes: object(1)
memory usage: 35.6+ KB


In [55]:
nfs = not_found_df['not_found'].values
not_found_but_in_topo = potential_matches[potential_matches['place_name'].isin(nfs)].sort_values(['place_name','geoname'])
not_found_but_in_topo[not_found_but_in_topo['place_name'] != not_found_but_in_topo['geoname']]

Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
4307,5404,1.0,0,Aldeia Galega de Merciana,Aldeia Galega da Merceana,PT,2079,8014694
9,4989,1.0,0,Aldeia Nossa do Cabo,Aldeia Nova do Cabo,PT,600,8014279
4428,517,1.0,0,Aldeia do Rato,Aldeia do Mato,PT,441,2272084
56,4880,1.0,0,Almarge do Bispo,Almargem do Bispo,PT,8983,8014170
4452,3658,1.0,0,Alvarelos,Alvarelhos,PT,3151,8012948
31,5536,1.0,0,Alvarções do Corgo,Alvações do Corgo,PT,477,8014826
101,4033,1.0,0,Bertiandes,Bertiandos,PT,414,8013323
87,5436,1.0,0,Borba da Montanha,Borba de Montanha,PT,1294,8014726
104,10974,1.0,0,Campos dos Goitacases,Campos dos Goytacazes,BR,463545,6322015
4395,2192,1.0,0,Carrapatos,Carrapatas,PT,197,8011481
