# GeoNames Exploration


All files are listed at https://download.geonames.org/export/dump/readme.txt. The pre-computed city lists are:

| File | Description |
|------|------|
| cities500.zip            | all cities with a population > 500 or seats of adm div down to PPLA4 (ca 185.000), see 'geoname' table for columns |
| cities1000.zip           | all cities with a population > 1000 or seats of adm div down to PPLA3 (ca 130.000), see 'geoname' table for columns | 
| cities5000.zip           | all cities with a population > 5000 or PPLA (ca 50.000), see 'geoname' table for columns | 
| cities15000.zip          | all cities with a population > 15000 or capitals (ca 25.000), see 'geoname' table for columns |

The main 'geoname' table has the following fields:

| Label | Description | Type |
|------------|------------|------|
| geonameid         | integer id of record in geonames database | integer |
| name              | name of geographical point (utf8) | varchar(200) |
| asciiname         | name of geographical point in plain ascii characters | varchar(200) |
| alternatenames    | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table | varchar(10000) |
| latitude          | latitude in decimal degrees (wgs84) | |
| longitude         | longitude in decimal degrees (wgs84) | |
| feature class     | see http://www.geonames.org/export/codes.html | char(1) |
| feature code      | see http://www.geonames.org/export/codes.html | varchar(10) |
| country code      | ISO-3166 2-letter country code | 2 characters |
| cc2               | alternate country codes, comma separated ISO-3166 2-letter country code | 200 characters |
| admin1 code       | fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code | varchar(20) |
| admin2 code       | code for the second administrative division, a county in the US, see file admin2Codes.txt | varchar(80) |
| admin3 code       | code for third level administrative division | varchar(20) |
| admin4 code       | code for fourth level administrative division | varchar(20) |
| population        | | bigint (8 byte int) |
| elevation         | in meters | integer |
| dem               | digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters. srtm processed by cgiar/ciat. | integer |
| timezone          | the iana timezone id (see file timeZone.txt) | varchar(40) |
| modification date | date of last modification in yyyy-MM-dd format | |
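
As a minimal sketch of how these fields map onto a dump file, the snippet below parses one tab-separated row with pandas. The snake_case column names in `GEONAME_COLUMNS` are an assumption made for illustration (the dump files have no header line), and the sample row reuses the Dubai values shown in the city table of this notebook, with the alternate names shortened.

```python
import io

import pandas as pd

# Hypothetical snake_case names for the 19 fields of the 'geoname' table.
GEONAME_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1_code", "admin2_code",
    "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]

# One illustrative row; empty strings stand for missing values.
sample_fields = [
    "292223", "Dubai", "Dubai", "DXB,Dabei,Dibai",
    "25.07725", "55.30927", "P", "PPLA",
    "AE", "", "03", "",
    "", "", "3478300", "",
    "24", "Asia/Dubai", "2022-06-14",
]
sample = "\t".join(sample_fields) + "\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=GEONAME_COLUMNS,
    # Read admin codes as strings so leading zeros such as "03" survive.
    dtype={
        "admin1_code": str, "admin2_code": str,
        "admin3_code": str, "admin4_code": str,
    },
    # Only treat empty fields as missing, so the literal "NA"
    # (Namibia's country code) is not turned into NaN.
    keep_default_na=False,
    na_values=[""],
)
row = df.iloc[0]
print(row["name"], row["country_code"], row["admin1_code"], row["population"])
```

Reading the admin codes as strings matters later on, because they are concatenated into dotted keys such as `AE.03`.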

In [1]:
import pystow
from pyobo import Obo, Term, Reference
from pyobo.struct import part_of
import pandas as pd
MODULE = pystow.module("mira", "geonames")

In [2]:
terms = {}

In [3]:
COUNTRIES_URL = "https://download.geonames.org/export/dump/countryInfo.txt"

countries_df = MODULE.ensure_csv(
    url=COUNTRIES_URL,
    read_csv_kwargs=dict(
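        skiprows=49,  # skip the leading comment lines so the "#ISO ..." header row becomes the column names
        keep_default_na=False,  # otherwise "NA" (the ISO code for Namibia) is parsed as NaN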
    ),
)
reorder = ["geonameid", *(c for c in countries_df.columns if c != "geonameid")]
countries_df = countries_df[reorder]
countries_df

Unnamed: 0,geonameid,#ISO,ISO3,ISO-Numeric,fips,Country,Capital,Area(in sq km),Population,Continent,tld,CurrencyCode,CurrencyName,Phone,Postal Code Format,Postal Code Regex,Languages,neighbours,EquivalentFipsCode
0,3041565,AD,AND,20,AN,Andorra,Andorra la Vella,468.0,77006,EU,.ad,EUR,Euro,376,AD###,^(?:AD)*(\d{3})$,ca,"ES,FR",
1,290557,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS,.ae,AED,Dirham,971,,,"ar-AE,fa,en,hi,ur","SA,OM",
2,1149361,AF,AFG,4,AF,Afghanistan,Kabul,647500.0,37172386,AS,.af,AFN,Afghani,93,,,"fa-AF,ps,uz-AF,tk","TM,CN,IR,TJ,PK,UZ",
3,3576396,AG,ATG,28,AC,Antigua and Barbuda,St. John's,443.0,96286,,.ag,XCD,Dollar,+1-268,,,en-AG,,
4,3573511,AI,AIA,660,AV,Anguilla,The Valley,102.0,13254,,.ai,XCD,Dollar,+1-264,,,en-AI,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,953987,ZA,ZAF,710,SF,South Africa,Pretoria,1219912.0,57779622,AF,.za,ZAR,Rand,27,####,^(\d{4})$,"zu,xh,af,nso,en-ZA,tn,st,ts,ss,ve,nr","ZW,SZ,MZ,BW,NA,LS",
248,895949,ZM,ZMB,894,ZA,Zambia,Lusaka,752614.0,17351822,AF,.zm,ZMW,Kwacha,260,#####,^(\d{5})$,"en-ZM,bem,loz,lun,lue,ny,toi","ZW,TZ,MZ,CD,NA,MW,AO",
249,878675,ZW,ZWE,716,ZI,Zimbabwe,Harare,390580.0,14439018,AF,.zw,ZWL,Dollar,263,,,"en-ZW,sn,nr,nd","ZA,MZ,BW,ZM",
250,8505033,CS,SCG,891,YI,Serbia and Montenegro,Belgrade,102350.0,10829175,EU,.cs,RSD,Dinar,381,#####,^(\d{5})$,"cu,hu,sq,sr","AL,HU,MK,RO,HR,BA,BG",
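
The hard-coded `skiprows=49` depends on how many comment lines precede the header in `countryInfo.txt`, which could change between releases. One possible alternative, sketched here under the assumption that the header is the only line starting with `#ISO` (as the first column name above suggests), is to locate that line instead. The `find_header_row` helper and the small in-memory sample are hypothetical.

```python
import io

import pandas as pd


def find_header_row(lines, marker="#ISO"):
    """Return the index of the first line that starts with ``marker``."""
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return index
    raise ValueError(f"no line starting with {marker!r} found")


# Tiny stand-in for countryInfo.txt: comment lines, then header, then data.
sample = (
    "# made-up comment line\n"
    "# another made-up comment line\n"
    "#ISO\tISO3\tCountry\tgeonameid\n"
    "AD\tAND\tAndorra\t3041565\n"
    "AE\tARE\tUnited Arab Emirates\t290557\n"
)

header_row = find_header_row(sample.splitlines())
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    skiprows=header_row,  # skip everything before the header line
    keep_default_na=False,
)
print(header_row, list(sample_df.columns))
```

With the real file, the lines would have to be read once before calling `read_csv`, so this trades a little extra work for not depending on a magic number.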


In [4]:
code_to_country = {}
cols = ["geonameid", "Country", "#ISO"]
for identifier, name, code in countries_df[cols].values:
    terms[identifier] = term = Term.from_triple("geonames", identifier, name)
    term.append_property("code", code)
    code_to_country[code] = term

In [5]:
ADMIN1_URL = "https://download.geonames.org/export/dump/admin1CodesASCII.txt"

admin1_df = MODULE.ensure_csv(
    url=ADMIN1_URL,
    read_csv_kwargs=dict(
        header=None,
        names=["code", "name", "asciiname", "geonames_id"],
    ),
)
admin1_df = admin1_df[["geonames_id", "code", "name", "asciiname"]]
admin1_df.head()

Unnamed: 0,geonames_id,code,name,asciiname
0,3039162,AD.06,Sant Julià de Loria,Sant Julia de Loria
1,3039676,AD.05,Ordino,Ordino
2,3040131,AD.04,La Massana,La Massana
3,3040684,AD.03,Encamp,Encamp
4,3041203,AD.02,Canillo,Canillo


In [6]:
code_to_admin1 = {}
cols = ["geonames_id", "name", "code"]
for identifier, name, code in admin1_df[cols].values:
    term = Term.from_triple("geonames", identifier, name)
    term.append_property("code", code)
    terms[identifier] = term
    code_to_admin1[code] = term
    
    country_code = code.split('.')[0]
    country_term = code_to_country[country_code]
    term.append_relationship(part_of, country_term)

In [7]:
ADMIN2_URL = "https://download.geonames.org/export/dump/admin2Codes.txt"
admin2_df = MODULE.ensure_csv(
    url=ADMIN2_URL,
    read_csv_kwargs=dict(
        header=None,
        names=["code", "name", "asciiname", "geonames_id"],
    ),
)
admin2_df = admin2_df[["geonames_id", "code", "name", "asciiname"]]
admin2_df.head()

Unnamed: 0,geonames_id,code,name,asciiname
0,12047239,AE.01.101,Abu Dhabi Municipality,Abu Dhabi Municipality
1,12047240,AE.01.102,Al Ain Municipality,Al Ain Municipality
2,12047241,AE.01.103,Al Dhafra,Al Dhafra
3,12047242,AE.04.701,Al Fujairah Municipality,Al Fujairah Municipality
4,12047243,AE.04.702,Dibba Al Fujairah Municipality,Dibba Al Fujairah Municipality


In [8]:
code_to_admin2 = {}
for identifier, name, code in admin2_df[["geonames_id", "name", "code"]].values:
    term = Term.from_triple("geonames", identifier, name)
    term.append_property("code", code)
    terms[identifier] = term
    code_to_admin2[code] = term
    
    admin1_code = code.rsplit('.', 1)[0]
    admin1_term = code_to_admin1[admin1_code]
    term.append_relationship(part_of, admin1_term)
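
The lookup `code_to_admin1[admin1_code]` would raise a `KeyError` if an admin2 code had no matching admin1 entry. A more defensive variant is sketched below with plain Python containers instead of `pyobo` terms. The `parent_code` helper, the set of known admin1 codes, and the `XX.99.1` code are illustrative assumptions.

```python
def parent_code(code):
    """Return the code one level up, e.g. 'AE.01.101' -> 'AE.01'.

    Returns None for a code without a dot (a bare country code).
    """
    head, sep, _tail = code.rpartition(".")
    return head if sep else None


# Stand-ins for the keys of code_to_admin1 and the admin2 codes.
known_admin1 = {"AE.01", "AE.04"}
admin2_codes = ["AE.01.101", "AE.04.701", "XX.99.1"]

links = {}    # admin2 code -> admin1 code it is part of
orphans = []  # admin2 codes whose parent is unknown
for code in admin2_codes:
    parent = parent_code(code)
    if parent in known_admin1:
        links[code] = parent
    else:
        orphans.append(code)

print(links)
print(orphans)
```

Collecting the orphans rather than failing on the first one makes it easier to judge how incomplete the admin1 table is.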

In [75]:
terms

{3041565: Term(reference=Reference(prefix='geonames', identifier=3041565, name='Andorra'), definition=None, provenance=[], relationships=defaultdict(<class 'list'>, {}), properties=defaultdict(<class 'list'>, {'code': ['AD']}), parents=[], synonyms=[], xrefs=[], xref_types=[], alt_ids=[], namespace=None, is_obsolete=None),
 290557: Term(reference=Reference(prefix='geonames', identifier=290557, name='United Arab Emirates'), definition=None, provenance=[], relationships=defaultdict(<class 'list'>, {}), properties=defaultdict(<class 'list'>, {'code': ['AE']}), parents=[], synonyms=[], xrefs=[], xref_types=[], alt_ids=[], namespace=None, is_obsolete=None),
 1149361: Term(reference=Reference(prefix='geonames', identifier=1149361, name='Afghanistan'), definition=None, provenance=[], relationships=defaultdict(<class 'list'>, {}), properties=defaultdict(<class 'list'>, {'code': ['AF']}), parents=[], synonyms=[], xrefs=[], xref_types=[], alt_ids=[], namespace=None, is_obsolete=None),
 3576396: 

In [14]:
CITIES_URL = "https://download.geonames.org/export/dump/cities15000.zip"
columns = [
    "geonames_id", "name", "asciiname", "synonyms", 
    "latitude", "longitude", "feature_class", "feature_code", "country_code", "cc2",  
    "admin1", "admin2", "admin3", "admin4",
    "population", "elevation", "dem", "timezone", "date_modified"
]
cities_df = pystow.ensure_zip_df(
    "mira", "geonames", 
    url=CITIES_URL, inner_path="cities15000.txt",
    read_csv_kwargs=dict(
        header=None,
        names=columns
    ),
)
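# Copy the filtered frame so the column assignment below does not trigger SettingWithCopyWarning
cities_df = cities_df[cities_df.population > 60_000].copy()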
cities_df.synonyms = cities_df.synonyms.str.split(",")
cities_df

Unnamed: 0,geonames_id,name,asciiname,synonyms,latitude,longitude,feature_class,feature_code,country_code,cc2,admin1,admin2,admin3,admin4,population,elevation,dem,timezone,date_modified
2,290594,Umm Al Quwain City,Umm Al Quwain City,"[Oumm al Qaiwain, Oumm al Qaïwaïn, Um al Kawai...",25.56473,55.55517,P,PPLA,AE,,07,,,,62747,,2,Asia/Dubai,2019-10-24
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"[Julfa, Khaimah, RAK City, RKT, Ra's al Khayma...",25.78953,55.94320,P,PPLA,AE,,05,,,,351943,,2,Asia/Dubai,2019-09-09
4,291580,Zayed City,Zayed City,"[Bid' Zayed, Bid’ Zayed, Madinat Za'id, Madina...",23.65416,53.70522,P,PPL,AE,,01,103,,,63482,,118,Asia/Dubai,2019-10-24
6,292223,Dubai,Dubai,"[DXB, Dabei, Dibai, Dibay, Doubayi, Dubae, Dub...",25.07725,55.30927,P,PPLA,AE,,03,,,,3478300,,24,Asia/Dubai,2022-06-14
9,292672,Sharjah,Sharjah,"[Al Sharjah, Ash 'Mariqah, Ash Shariqa, Ash Sh...",25.33737,55.41206,P,PPLA,AE,,06,,,,1274749,,6,Asia/Dubai,2023-02-27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26437,893697,Chinhoyi,Chinhoyi,"[Chinhoyi, Chinkhoi, Chinkhoji, Chinoyi, Cinho...",-17.36667,30.20000,P,PPLA,ZW,,05,,,,90800,,1153,Africa/Harare,2023-02-24
26438,894239,Chegutu,Chegutu,"[Cheguto, Chegutu, Hartley, New Hartley]",-18.13021,30.14074,P,PPL,ZW,,05,,,,65800,,1187,Africa/Harare,2022-10-07
26439,894701,Bulawayo,Bulawayo,"[BUQ, Bulavajas, Bulavajo, Bulavejo, Bulawayo,...",-20.15000,28.58333,P,PPLA,ZW,,09,,,,1200337,,1348,Africa/Harare,2023-02-24
26442,1085510,Epworth,Epworth,[Epworth],-17.89000,31.14750,P,PPLX,ZW,,10,,,,123250,,1508,Africa/Harare,2012-01-19


In [10]:
cols = ["geonames_id", "name", "synonyms", "country_code", "admin1", "admin2"]
for identifier, name, synonyms, country, admin1, admin2 in cities_df[cols].values:
    term = Term.from_triple("geonames", identifier, name)
    # Missing synonyms arrive as NaN (a float), so only iterate over real lists
    if synonyms and not isinstance(synonyms, float):
        for synonym in synonyms:
            term.append_synonym(synonym)
    if pd.notna(admin2):
        admin2_full = f"{country}.{admin1}.{admin2}"
        admin2_term = code_to_admin2.get(admin2_full)
        if admin2_term is not None:
            term.append_relationship(part_of, admin2_term)
        else:
            print("could not find admin2", admin2_full)
    elif pd.notna(admin1):
        admin1_full = f"{country}.{admin1}"
        admin1_term = code_to_admin1.get(admin1_full)
        if admin1_term is not None:
            term.append_relationship(part_of, admin1_term)
        else:
            print("could not find admin1", admin1_full)
    else:
        print("missing admin codes", identifier, name, country)

could not find admin2 AL.46.18
could not find admin2 CD.18.8335011
could not find admin2 CD.23.9179898
could not find admin2 CD.11.33
could not find admin2 CD.22.922772
missing admin codes 3513090 Willemstad CW
could not find admin2 DE.13.00
could not find admin2 DE.08.00
could not find admin2 DE.06.00
could not find admin2 DE.16.00
could not find admin2 DE.06.00
could not find admin2 DE.15.00
could not find admin2 DE.16.00
could not find admin2 DE.08.00
could not find admin2 DE.16.00
could not find admin2 DE.16.00
could not find admin2 DE.12.00
could not find admin2 DE.16.00
could not find admin2 DE.06.00
could not find admin2 DE.09.00
could not find admin2 DE.12.00
could not find admin2 DE.16.00
could not find admin2 DE.04.00
could not find admin2 DE.16.00
could not find admin2 DE.11.00
could not find admin2 DE.13.00
could not find admin2 DE.16.00
could not find admin2 DE.06.00
could not find admin2 DE.06.00
could not find admin2 DE.10.00
could not find admin2 DE.08.00
could not find
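
Many of the misses above share the same shape, for example German codes ending in `.00`. A quick way to surface such patterns is to collect the unresolved codes in a list instead of printing them, then count them by country. This is a sketch only: the `misses` list holds a handful of codes copied from the output above, not the full set.

```python
from collections import Counter

# A few unresolved admin2 codes copied from the output above.
misses = [
    "AL.46.18",
    "CD.18.8335011",
    "CD.11.33",
    "DE.13.00",
    "DE.08.00",
    "DE.06.00",
    "DE.16.00",
]

# The country code is the part before the first dot.
by_country = Counter(code.split(".")[0] for code in misses)

# Codes whose admin2 part is "00".
zero_suffix = [code for code in misses if code.endswith(".00")]

print(by_country.most_common())
print(zero_suffix)
```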

hierarchy.zip has three columns: parentId, childId, type. The type 'ADM' stands for the admin hierarchy modeled by the admin1-4 codes. The other entries are entered with the user interface. The relation between a toponym and the admin hierarchy is not included in the file; it can instead be built from the admin codes of the toponym.
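
The three-column layout described above can be illustrated with a small sketch that keeps only the `ADM` rows and groups children by parent. The rows are hand-made for illustration and reuse identifiers that appear earlier in this notebook; whether these exact rows occur in `hierarchy.txt` is an assumption, and the row without a type is hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Hand-made rows in (parentId, childId, type) order.
rows = [
    (3041565, 3039162, "ADM"),  # Andorra -> Sant Julià de Loria
    (3041565, 3039676, "ADM"),  # Andorra -> Ordino
    (290557, 292223, None),     # hypothetical entry without a type
]
sample_hierarchy = pd.DataFrame(rows, columns=["parent", "child", "relation"])

# Keep only the rows that model the admin hierarchy.
adm = sample_hierarchy[sample_hierarchy["relation"] == "ADM"]

children = defaultdict(list)
for parent, child in adm[["parent", "child"]].itertuples(index=False):
    children[int(parent)].append(int(child))

print(dict(children))
```

The same grouping could be used to attach `part_of` relationships, provided both identifiers are already present in `terms`.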

In [13]:
len(terms)

48961

In [None]:
# Download the hierarchy table (parentId, childId, type)
HIERARCHY_URL = "https://download.geonames.org/export/dump/hierarchy.zip"
hierarchy_df = MODULE.ensure_zip_df(
    url=HIERARCHY_URL, inner_path="hierarchy.txt",
    read_csv_kwargs=dict(
        header=None,
        names=["parent", "child", "relation"],
    ),
)
hierarchy_df

In [None]:
hierarchy_df.relation.unique()