> First time use: follow instructions in the README.md file in this directory.


**[PT]** Português

---

**[EN]** English


# Georeferenciação com GeoNames


Identificação e geolocalização de topónimos com GeoNames.

Este bloco de notas utiliza informação disponibilizada por GeoNames em 
http://www.geonames.org segundo a licença [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)

---

# Georeferencing place names with GeoNames

Identification and geocoding of place names with GeoNames.

This notebook uses information made availabe be GeoNames
http://www.geonames.org under a [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/)



## Ligação à base de dados local

Para inicializar a base de dados
local ver [000-database-setup](000-database-setup.ipynb)

---

## Setup local database access  

To initialize the local database see [000-database-setup](000-database-setup.ipynb)


In [101]:
from timelinknb import get_db

db_spec =  ('sqlite','fauc.db')
db = get_db(db_spec)

## Lista de lugares diferentes e número de ocorrências

---

## List of different places with number of occurrences

In [19]:
from timelinknb.pandas import attribute_values

attribute = 'naturalidade'
period = ('1500-00-00','1990-00-00')

places = attribute_values(attribute,dates_between=period)
places['place_name'] = places.index.values
places.reset_index()
places.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11499 entries, Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       11499 non-null  int64 
 1   date_min    11499 non-null  object
 2   date_max    11499 non-null  object
 3   place_name  11499 non-null  object
dtypes: int64(1), object(3)
memory usage: 449.2+ KB


### Lugares principais

---

### Main locations

In [102]:
places.sort_values('count', ascending=False).head(10)



Unnamed: 0_level_0,count,date_min,date_max,place_name
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Lisboa,8784,1537-02-12,1916-07-19,Lisboa
Coimbra,5526,1537-00-00,1915-10-12,Coimbra
Porto,3391,1537-05-30,1917-10-22,Porto
Braga,1608,1540-01-21,1914-07-24,Braga
Évora,1072,1537-11-22,1910-10-10,Évora
Viseu,986,1537-00-00,1912-07-03,Viseu
Guimarães,980,1537-12-18,1912-07-18,Guimarães
Lamego,972,1537-00-00,1909-10-05,Lamego
Aveiro,790,1538-04-21,1913-10-13,Aveiro
Vila Real,765,1537-03-07,1909-11-09,Vila Real


### Lugares só com uma ocorrência
---

### Locations with just one occurrence

In [103]:
places[places['count'] == 1].info()

<class 'pandas.core.frame.DataFrame'>
Index: 7554 entries, - Lisboa to Óvoa, Viseu
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   count       7554 non-null   int64 
 1   date_min    7554 non-null   object
 2   date_max    7554 non-null   object
 3   place_name  7554 non-null   object
dtypes: int64(1), object(3)
memory usage: 295.1+ KB


In [104]:
places[places['count']==1].head(10)

Unnamed: 0_level_0,count,date_min,date_max,place_name
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
- Lisboa,1,1658-10-02,1658-10-02,- Lisboa
- Vila Franca,1,1658-10-01,1658-10-01,- Vila Franca
-Lisboa,1,1723-10-01,1723-10-01,-Lisboa
"A de Barros, Caria",1,1765-10-01,1765-10-01,"A de Barros, Caria"
"A de Barros, Lamego",1,1624-10-10,1624-10-10,"A de Barros, Lamego"
A dos Francos,1,1745-10-01,1745-10-01,A dos Francos
"ALgaça, Poiares",1,1751-10-01,1751-10-01,"ALgaça, Poiares"
AZagães,1,1749-10-01,1749-10-01,AZagães
Abade,1,1747-12-14,1747-12-14,Abade
Abade de São Romão,1,1540-01-10,1540-01-10,Abade de São Romão


## Geonames
>The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge.

* http://www.geonames.org
* Dowloads em http://download.geonames.org/export/dump/

Description of available files in [readme.txt](../extras/gecoding/geonames/readme.txt)

To use this notebook you need the following files from geonames:

* one or more "XX.ZIP" for the countries of interest
* the file "featureCodes.txt" to 
To import geocode data download the files needed from the link above
into directory `../extras/geocoding/geonames/` (to use another directory 
change the variable `path_to_geonames` in the cell bellow  ).


In [177]:
# Collect geonames files
from pathlib import Path
from os.path import exists
path_to_geonames = '../extras/geocoding/geonames'

files = list(Path(path_to_geonames).rglob("[A-Z][A-Z].txt"))
[file.name for file in files]


['MZ.txt',
 'TL.txt',
 'ST.txt',
 'PT.txt',
 'MZ.txt',
 'AQ.txt',
 'TL.txt',
 'ST.txt',
 'PT.txt',
 'GW.txt',
 'BR.txt',
 'CV.txt',
 'GW.txt',
 'AO.txt',
 'BR.txt',
 'CV.txt']

In [178]:
# from readme.txt
read_me = """
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see d, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
"""


In [179]:
lines = read_me.splitlines()
fields = [f.split(':')[0].strip().replace(' ','_') for f in lines if f != '']
fields

['geonameid',
 'name',
 'asciiname',
 'alternatenames',
 'latitude',
 'longitude',
 'feature_class',
 'feature_code',
 'country_code',
 'cc2',
 'admin1_code',
 'admin2_code',
 'admin3_code',
 'admin4_code',
 'population',
 'elevation',
 'dem',
 'timezone',
 'modification_date']

### Converter informação para Pandas DataFrame

In [180]:
 dtypes = {'geonameid':str,
 'admin1_code':str,
 'admin2_code':str,
 'admin3_code':str,
 'admin4_code':str,
 }

In [184]:
import pandas as pd

geonames_df = None
for file in [f for f in files if 'alternatenames' not in str(f.parent)] :
    print("Reading from ",file.name)
    df = pd.read_csv(file,sep='\t',names=fields,dtype=dtypes,header=0, low_memory=False, index_col='geonameid')
    if geonames_df is None:
        geonames_df = df.copy()
    else:
        geonames_df = pd.concat([geonames_df,df],axis=0)
geonames_df.info()

Reading from  MZ.txt
Reading from  TL.txt
Reading from  ST.txt
Reading from  PT.txt
Reading from  GW.txt
Reading from  AO.txt
Reading from  BR.txt
Reading from  CV.txt
<class 'pandas.core.frame.DataFrame'>
Index: 304244 entries, 345948 to 12450777
Data columns (total 18 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   name               304244 non-null  object 
 1   asciiname          304244 non-null  object 
 2   alternatenames     102462 non-null  object 
 3   latitude           304244 non-null  float64
 4   longitude          304244 non-null  float64
 5   feature_class      304244 non-null  object 
 6   feature_code       304242 non-null  object 
 7   country_code       304244 non-null  object 
 8   cc2                20207 non-null   object 
 9   admin1_code        303563 non-null  object 
 10  admin2_code        118566 non-null  object 
 11  admin3_code        38450 non-null   object 
 12  admin4_code        0 non-nul

Get the admin level 5 codes which are in separate file


In [185]:
admin_code5_exists = False
admin_code5_file = '../extras/geocoding/geonames/adminCode5.txt'
if exists(admin_code5_file):
    admin_code5_exists = True
    geonames_ac5  = pd.read_csv(admin_code5_file,sep='\t',names=['geonameid','admin5_code'],header=None, dtype={'geonameid':'str','admin5_code':'str'},index_col='geonameid')
    geonames_df = pd.merge(geonames_df, geonames_ac5, how='left',on='geonameid')
geonames_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 304244 entries, 345948 to 12450777
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   name               304244 non-null  object 
 1   asciiname          304244 non-null  object 
 2   alternatenames     102462 non-null  object 
 3   latitude           304244 non-null  float64
 4   longitude          304244 non-null  float64
 5   feature_class      304244 non-null  object 
 6   feature_code       304242 non-null  object 
 7   country_code       304244 non-null  object 
 8   cc2                20207 non-null   object 
 9   admin1_code        303563 non-null  object 
 10  admin2_code        118566 non-null  object 
 11  admin3_code        38450 non-null   object 
 12  admin4_code        0 non-null       object 
 13  population         304244 non-null  int64  
 14  elevation          8058 non-null    float64
 15  dem                304244 non-null  int64  
 16  

Manter apenas topónimos povoados

---

Keep only populated places

In [186]:
geonames_df = geonames_df[geonames_df.population>0]
geonames_df.reset_index(inplace=True)
geonames_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13450 entries, 0 to 13449
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   geonameid          13450 non-null  object 
 1   name               13450 non-null  object 
 2   asciiname          13450 non-null  object 
 3   alternatenames     3085 non-null   object 
 4   latitude           13450 non-null  float64
 5   longitude          13450 non-null  float64
 6   feature_class      13450 non-null  object 
 7   feature_code       13450 non-null  object 
 8   country_code       13450 non-null  object 
 9   cc2                41 non-null     object 
 10  admin1_code        13449 non-null  object 
 11  admin2_code        13196 non-null  object 
 12  admin3_code        5193 non-null   object 
 13  admin4_code        0 non-null      object 
 14  population         13450 non-null  int64  
 15  elevation          87 non-null     float64
 16  dem                134

In [188]:
geonames_df[geonames_df.country_code == 'PT'].sample(10)

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date,admin5_code
4246,8013536,Cedros,Cedros,,39.48775,-31.17676,A,ADM3,PT,,23,4802,480202,,128,,484,Atlantic/Azores,2019-07-24,
2283,8011572,Urrós,Urros,,41.33858,-6.46128,A,ADM3,PT,,5,408,40821,,318,,656,Europe/Lisbon,2019-07-24,
5333,8014623,São João de Brito,Sao Joao de Brito,,38.75586,-9.1395,A,ADM3,PT,,14,1106,110642,,11727,,94,Europe/Lisbon,2019-07-24,
3417,8012707,Vila Fria,Vila Fria,,41.38778,-8.23849,A,ADM3,PT,,17,1303,130332,,629,,184,Europe/Lisbon,2019-07-24,
5305,8014595,Avelãs de Ambom,Avelas de Ambom,,40.61922,-7.23765,A,ADM3,PT,,11,907,90707,,69,,852,Europe/Lisbon,2019-07-24,
5537,8014827,Castanheiro do Sul,Castanheiro do Sul,,41.12844,-7.5096,A,ADM3,PT,,22,1815,181501,,439,,640,Europe/Lisbon,2019-07-24,
509,2271977,Algueirão,Algueirao,,38.79764,-9.3437,P,PPL,PT,,14,1111,111102,,66250,,176,Europe/Lisbon,2018-02-07,
724,2737398,Moreira,Moreira,Moreira,41.24756,-8.64788,P,PPL,PT,,17,1306,130609,,12890,,82,Europe/Lisbon,2018-03-01,
4798,8014088,Póvoa de São Miguel,Povoa de Sao Miguel,,38.24739,-7.35835,A,ADM3,PT,,3,210,21002,,888,,146,Europe/Lisbon,2019-07-24,
755,2737936,Manteigas,Manteigas,"Manteigas,Mantejgas,Мантейгас",40.4028,-7.53977,P,PPLA2,PT,,11,908,90802,,3900,,794,Europe/Lisbon,2014-03-06,


### Get extra information (example)

#### Get feature codes

In [189]:
from os.path import exists

fcodes_exist = False
features_codes_file = '../extras/toponimia/geonames/featureCodes_en.txt'
if exists(features_codes_file):
    fcodes_exist = True
    geonames_fc  = pd.read_csv(features_codes_file,sep='\t',names=['fcode','fname','fdesc'],index_col='fcode', header=0)

In [190]:
geonames_fc.head()

Unnamed: 0_level_0,fname,fdesc
fcode,Unnamed: 1_level_1,Unnamed: 2_level_1
A.ADM1H,historical first-order administrative division,a former first-order administrative division
A.ADM2,second-order administrative division,a subdivision of a first-order administrative ...
A.ADM2H,historical second-order administrative division,a former second-order administrative division
A.ADM3,third-order administrative division,a subdivision of a second-order administrative...
A.ADM3H,historical third-order administrative division,a former third-order administrative division


#### Get admin codes

In [191]:
from os.path import exists

admin_code1_exists = False
admin_code1_file = '../extras/toponimia/geonames/admin1CodesASCII.txt'
if exists(admin_code1_file):
    admin_code1_exists = True
    geonames_ac1  = pd.read_csv(admin_code1_file,sep='\t',names=['acode1','ac1_name','ac1_name_ascii','geonames_id'],dtype={'geonames_id':'str'},index_col='geonames_id', header=0)

admin_code2_exists = False
admin_code2_file = '../extras/toponimia/geonames/admin2Codes.txt'
if exists(admin_code2_file):
    admin_code2_exists = True
    geonames_ac2  = pd.read_csv(admin_code1_file,sep='\t',names=['acode2','ac2_name','ac2_name_ascii','geonames_id'],dtype={'geonames_id':'str'},index_col='geonames_id', header=0)



In [125]:
geonames_ac1.head()

Unnamed: 0_level_0,acode1,ac1_name,ac1_name_ascii
geonames_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3039676,AD.05,Ordino,Ordino
3040131,AD.04,La Massana,La Massana
3040684,AD.03,Encamp,Encamp
3041203,AD.02,Canillo,Canillo
3041566,AD.07,Andorra la Vella,Andorra la Vella


In [192]:
geonames_ac1.loc[geonames_ac1.ac1_name == 'Coimbra']

Unnamed: 0_level_0,acode1,ac1_name,ac1_name_ascii
geonames_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2740636,PT.07,Coimbra,Coimbra


In [193]:
geonames_ac1.loc['2740636']

acode1              PT.07
ac1_name          Coimbra
ac1_name_ascii    Coimbra
Name: 2740636, dtype: object

In [194]:
place = 'Coimbra'
result = geonames_df[geonames_df.name == place]
for i,row in result.iterrows():
    name = row['name']
    fcode = f"{row.feature_class}.{row.feature_code}"
    if fcodes_exist:
        fcode_desc = geonames_fc.loc[fcode].fname
    else:
        fcodes = 'NA'

    admin_code = None
    for acode_column in ['admin4_code','admin3_code','admin2_code','admin1_code']:
        if type(row[acode_column]) is str:
            admin_code = row[acode_column]
            break
    if admin_code is None:
        admin_code = '(NA)'
    print(f" {row.country_code} {row.geonameid} {name} {fcode} {fcode_desc} {admin_code}")
    print(f"      {row.alternatenames}")
result

 PT 2740637 Coimbra P.PPLA second-order administrative division 060325
      CBP,Coimbra,Coímbra,Coïmbra,Koimbra,Koimbro,Koimpra,Koímbra,ke ying bu la,koinbura,Κόιμπρα,Коимбра,コインブラ,科英布拉
 PT 8010483 Coimbra A.ADM2 second-order administrative division 0603
      Coimbra,Coimbra Municipality,Coinvra,Conimbriga,Coímbra,Coïmbra,Gorad Kaimbra,Koimbra,Koimbro,Koimpra,Koímbra,ke ying bu la,ko xim bra,ko'imabra,koimbeula,koinbura,kwymbra,kwyymbra,qlmryt,qwymbrh,Κοΐμπρα,Горад Каімбра,Коимбра,Коїмбра,קוימברה,قلمرية,کوئیمبرا,کویمبرا,কোইমব্রা,โกอิมบรา,კოიმბრა,コインブラ,科英布拉,코임브라
 BR 6321278 Coimbra A.ADM2 second-order administrative division 3116704
      nan


Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1_code,admin2_code,admin3_code,admin4_code,population,elevation,dem,timezone,modification_date,admin5_code
859,2740637,Coimbra,Coimbra,"CBP,Coimbra,Coímbra,Coïmbra,Koimbra,Koimbro,Ko...",40.20564,-8.41955,P,PPLA,PT,,7,603,60325.0,,106582,,98,Europe/Lisbon,2019-02-26,
1195,8010483,Coimbra,Coimbra,"Coimbra,Coimbra Municipality,Coinvra,Conimbrig...",40.21026,-8.42683,A,ADM2,PT,,7,603,,,143396,,88,Europe/Lisbon,2020-02-07,
10237,6321278,Coimbra,Coimbra,,-20.84494,-42.79834,A,ADM2,BR,,15,3116704,,,7054,,740,America/Sao_Paulo,2015-07-20,


### Cruzar com topónimos da base local

In [13]:
!pip install recordlinkage

[0m

In [195]:
import recordlinkage
from recordlinkage.preprocessing import clean

indexer = recordlinkage.index.SortedNeighbourhood('place_name','name',window=11)
candidates = indexer.index(places,geonames_df)
print(len(candidates))

69445


In [196]:
compare = recordlinkage.Compare()
compare.string('place_name','name',
    # ['jaro', 'jarowinkler', 'levenshtein', 'damerau_levenshtein', 'qgram', 'cosine', 'smith_waterman', 'lcs'].
    method='damerau_levenshtein', 
    threshold=0.90,
    label='score')
compare.exact('place_name','name',
    label='equal')
features = compare.compute(candidates,places,geonames_df)
features.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 69445 entries, ('Porto', 8560) to ('Óvoa, Viseu', 4814)
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   score   69445 non-null  float64
 1   equal   69445 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 2.0+ MB


In [197]:
features.sum(axis=1).value_counts().sort_index(ascending=False)

2.0     3824
1.0      433
0.0    65188
dtype: int64

In [198]:
potential_matches = features[features.sum(axis=1) >0 ].reset_index()

potential_matches['place_name']=potential_matches['value']
potential_matches.drop('value',axis=1, inplace=True)
potential_matches['geoname']=geonames_df.loc[potential_matches['level_1']]['name'].values
potential_matches['country']=geonames_df.loc[potential_matches['level_1']]['country_code'].values
potential_matches['pop']=geonames_df.loc[potential_matches['level_1']]['population'].values
potential_matches['geoname_id']=geonames_df.loc[potential_matches['level_1']]['geonameid'].values
potential_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4257 entries, 0 to 4256
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   level_1     4257 non-null   int64  
 1   score       4257 non-null   float64
 2   equal       4257 non-null   int64  
 3   place_name  4257 non-null   object 
 4   geoname     4257 non-null   object 
 5   country     4257 non-null   object 
 6   pop         4257 non-null   int64  
 7   geoname_id  4257 non-null   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 266.2+ KB


In [201]:
potential_matches.sort_values(['place_name','geoname','country','pop'],
                                        ascending=[True,True,True,False],
                                        inplace=True)
potential_matches.sample(10)


Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
2426,2583,1.0,1,Fajão,Fajão,PT,233,8011872
3855,2749,1.0,1,Sagres,Sagres,PT,1909,8012038
1271,4638,1.0,1,Carregal,Carregal,PT,393,8013928
3382,1402,1.0,1,Vila Franca do Campo,Vila Franca do Campo,PT,11229,8010690
1996,773,1.0,1,Leça da Palmeira,Leça da Palmeira,PT,17996,2738348
3224,3189,1.0,1,Marvila,Marvila,PT,38102,8012478
2191,2818,1.0,1,Maceira,Maceira,PT,229,8012107
2918,5348,1.0,1,Parada de Ester,Parada de Ester,PT,654,8014638
3085,2788,1.0,1,Cadafaz,Cadafaz,PT,140,8012077
1985,823,1.0,1,Figueiró,Figueiró,PT,4579,2739566


In [202]:
potential_matches.drop_duplicates(subset=['place_name','geoname','country'], keep='first',inplace=True)
potential_matches.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2580 entries, 3392 to 3026
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   level_1     2580 non-null   int64  
 1   score       2580 non-null   float64
 2   equal       2580 non-null   int64  
 3   place_name  2580 non-null   object 
 4   geoname     2580 non-null   object 
 5   country     2580 non-null   object 
 6   pop         2580 non-null   int64  
 7   geoname_id  2580 non-null   object 
dtypes: float64(1), int64(3), object(4)
memory usage: 181.4+ KB


In [206]:
potential_matches.sample(10)

Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
2757,4414,1.0,1,Vilarandelo,Vilarandelo,PT,984,8013704
3485,2747,1.0,1,Budens,Budens,PT,1520,8012036
3009,2800,1.0,1,Velosa,Velosa,PT,114,8012089
1134,4615,1.0,1,Vilarouco,Vilarouco,PT,328,8013905
836,4973,1.0,1,São Vicente da Beira,São Vicente da Beira,PT,1259,8014263
528,2509,1.0,1,Cernache,Cernache,PT,4048,8011798
3004,5450,1.0,1,Vale da Porca,Vale da Porca,PT,286,8014740
2993,1794,1.0,1,Tadim,Tadim,PT,1143,8011083
1625,5489,1.0,1,Couto do Mosteiro,Couto do Mosteiro,PT,1186,8014779
2153,3564,1.0,1,Campanhã,Campanhã,PT,32659,8012854


#### Verificar inferências

In [205]:
potential_matches[potential_matches.equal == 0]

Unnamed: 0,level_1,score,equal,place_name,geoname,country,pop,geoname_id
4164,3749,1.0,0,Abituteiras,Abitureiras,PT,972,8013039
148,957,1.0,0,Aboim da Nobrega,Aboim da Nóbrega,PT,987,2743436
4165,957,1.0,0,Aboim de Nóbrega,Aboim da Nóbrega,PT,987,2743436
105,948,1.0,0,Albergaria-a Velha,Albergaria-a-Velha,PT,7974,2743233
4196,1126,1.0,0,Albergaria-a-Velha,Albergaria-A-Velha,PT,25252,8010414
...,...,...,...,...,...,...,...,...
169,546,1.0,0,Vilar do Paraiso,Vilar do Paraíso,PT,14727,2732444
4154,3935,1.0,0,Vilarelhos,Vilarelho,PT,1125,8013225
4137,5443,1.0,0,Vilarinho da Castanheiro,Vilarinho da Castanheira,PT,415,8014733
4071,4809,1.0,0,Várzea de Moruge,Várzea de Meruge,PT,249,8014099


Check how many not found we can match

In [72]:
from os.path import exists

not_found_file = '../inferences/places/osm_not_found.csv'

not_found_df: pd.DataFrame = None

if exists(not_found_file):
    not_found_df = pd.read_csv(not_found_file)
    not_found = list(not_found_df['not_found'])
else:
    not_found = []
    not_found_df = pd.DataFrame(columns=['not_found'])


In [73]:
nfs = not_found_df['not_found'].values
not_found_but_in_topo = potential_matches[potential_matches['value'].isin(nfs)].sort_values(['value','topo'])
not_found_but_in_topo[not_found_but_in_topo['value'] != not_found_but_in_topo['topo']]

Unnamed: 0,value,level_1,score,topo,topo_id
82,Aldeia de Joane,2905,1.0,Aldeia de Joanes,PT.GN.115205
88,Alhos Vedras,768,1.0,Alhos Vedros,PT.GN.113068
3453,Arcos de Valedevez,6900,1.0,Arcos de Valdevez,PT.GN.119200
76,Avelãs de Caminha,3101,1.0,Avelãs de Caminho,PT.GN.115401
3457,Avelãs do Caminho,3101,1.0,Avelãs de Caminho,PT.GN.115401
3464,Cabeceiras de Bastos,5617,1.0,Cabeceiras de Basto,PT.GN.117917
3465,Celorico de Bastos,5155,1.0,Celorico de Basto,PT.GN.117455
3,Escalas de Baixo,2187,1.0,Escalos de Baixo,PT.GN.114487
3514,Figueró dos Vinhos,2222,1.0,Figueiró dos Vinhos,PT.GN.114522
3459,Freixiandas,2088,1.0,Freixianda,PT.GN.114388
