GeoNames Data

- file: https://download.geonames.org/export/dump/cities5000.zip
- all cities with a population > 5000
- data definition: https://download.geonames.org/export/dump/readme.txt

In [1]:
import pandas as pd

In [2]:
cols = [
    "geonameid",
    "name",
    "asciiname",
    "alternatenames",
    "latitude",
    "longitude",
    "feature class",
    "feature code",
    "country code",
    "cc2",
    "admin1 code",
    "admin2 code",
    "admin3 code",
    "admin4 code",
    "population",
    "elevation",
    "dem",
    "timezone",
    "modification",
]

In [3]:
df = pd.read_csv(
    "https://download.geonames.org/export/dump/cities5000.zip",
    sep="\t",
    names=cols,
    na_values=[""],
    keep_default_na=False,
)
len(df)

58644
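The `na_values=[""]` and `keep_default_na=False` arguments matter here because pandas' default missing-value markers include the string `NA`, which is also the ISO country code for Namibia. The following is a minimal sketch on a made-up two-row sample (the rows and the two-column layout are illustrative assumptions, not taken from the dump):

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with two columns: name, country code
sample = "Windhoek\tNA\nOslo\tNO\n"
names = ["name", "country code"]

# Default parsing treats the string "NA" as a missing value
default = pd.read_csv(io.StringIO(sample), sep="\t", names=names)

# Only empty fields count as missing, so "NA" survives as text
strict = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=names,
    na_values=[""],
    keep_default_na=False,
)

default_missing = int(default["country code"].isna().sum())
strict_codes = strict["country code"].tolist()
print(default_missing, strict_codes)
```

With the default settings the Namibian row would lose its country code, which is presumably why the notebook overrides them.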

Filter the data source with the following criteria:

- seats of first-order administrative divisions with a population > 10,000
- seats of second- or third-order administrative divisions with a population > 500,000
- capital cities (current, dependency, and historical)

where the feature codes are:

- PPLA - seat of a first-order administrative division
- PPLA2 - seat of a second-order administrative division
- PPLA3 - seat of a third-order administrative division
- PPLC - capital of a political entity
- PPLCD - capital of a dependency or special area
- PPLCH - historical capital of a political entity

Refer to https://download.geonames.org/export/dump/featureCodes_en.txt for details.

In [4]:
criteria = (
    ((df["feature code"] == "PPLA") & (df["population"] > 10000))
    | ((df["feature code"].isin(["PPLA2", "PPLA3"])) & (df["population"] > 500_000))
    | (df["feature code"].isin(["PPLC", "PPLCD", "PPLCH"]))
)
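To see how the three branches of the mask combine, here is a small sketch on a hypothetical five-row frame (the names and populations are invented so that each branch is exercised once):

```python
import pandas as pd

# Hypothetical rows chosen to hit each branch of the filter
toy = pd.DataFrame(
    {
        "asciiname": ["A", "B", "C", "D", "E"],
        "feature code": ["PPLA", "PPLA", "PPLA2", "PPLA3", "PPLC"],
        "population": [20_000, 8_000, 600_000, 100_000, 5_500],
    }
)

toy_criteria = (
    # first-order seats need more than 10,000 inhabitants
    ((toy["feature code"] == "PPLA") & (toy["population"] > 10_000))
    # second/third-order seats need more than 500,000 inhabitants
    | (toy["feature code"].isin(["PPLA2", "PPLA3"]) & (toy["population"] > 500_000))
    # capitals are kept regardless of population
    | toy["feature code"].isin(["PPLC", "PPLCD", "PPLCH"])
)

kept = toy.loc[toy_criteria, "asciiname"].tolist()
print(kept)
```

Rows B and D fall below their population thresholds and are dropped, while the capital E is kept despite its small population.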

- keep the following columns (sorting by name happens in a later step):
  - asciiname
  - population
  - timezone
  - country code
  - latitude
  - longitude
- rename the columns to the following:
  - name
  - pop
  - timezone
  - country
  - lat
  - lon

In [5]:
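# keep only the rows matching the criteria and the columns of interest;
# .copy() avoids a SettingWithCopyWarning when columns are modified later
df = df.loc[
    criteria,
    ["asciiname", "population", "timezone", "country code", "latitude", "longitude"],
].copy()
# shorten the column names
df.columns = ["name", "pop", "timezone", "country", "lat", "lon"]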

Extra cleanup:

- remove leading single quotes from the name column
- round lat and lon to 6 significant digits
- sort case-insensitively by name, then by population in descending order
- drop duplicates on name and country, keeping the most populous entry

In [7]:
df["name"] = df["name"].str.lstrip("'")
df["lat"] = df["lat"].apply(lambda x: float('{:.6g}'.format(x)))
df["lon"] = df["lon"].apply(lambda x: float('{:.6g}'.format(x)))
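Note that the `g` format rounds to significant digits, not decimal places, so the number of decimals kept depends on the magnitude of the coordinate. A short sketch with a hypothetical helper `to_6_sig` and made-up coordinate values:

```python
def to_6_sig(x: float) -> float:
    """Round x to 6 significant digits via the 'g' format (hypothetical helper)."""
    return float("{:.6g}".format(x))


# Two digits before the decimal point leave four decimals
two_digit = to_6_sig(35.689487)
# Three digits before the decimal point leave only three decimals
three_digit = to_6_sig(139.691706)
# Leading zeros are not significant, so six decimals survive here
sub_one = to_6_sig(-0.1275123)

print(two_digit, three_digit, sub_one)
```

If uniform precision were preferred, `round(x, n)` would fix the number of decimals instead; the notebook's choice trades that for a compact representation.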

In [19]:
cities = df.sort_values(
    by=["name", "pop"],
    key=lambda x: x.str.lower() if x.name == "name" else x,
    ascending=[True, False],
)
cities = cities.drop_duplicates(subset=["name", "country"], keep="first")
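The effect of the case-insensitive sort followed by `drop_duplicates` can be checked on a small made-up frame. The rows below are illustrative assumptions, not real GeoNames records:

```python
import pandas as pd

# Hypothetical rows: mixed-case names and a repeated (name, country) pair
toy = pd.DataFrame(
    {
        "name": ["Victoria", "albany", "Victoria", "Albany", "Victoria"],
        "pop": [100, 70, 900, 50, 300],
        "country": ["CA", "AU", "CA", "US", "SC"],
    }
)

# The key function is applied to each sort column separately:
# names are lowercased for comparison, populations are left unchanged
ordered = toy.sort_values(
    by=["name", "pop"],
    key=lambda x: x.str.lower() if x.name == "name" else x,
    ascending=[True, False],
)

# Within each (name, country) pair the first row is the most populous one
deduped = ordered.drop_duplicates(subset=["name", "country"], keep="first")

result_names = deduped["name"].tolist()
result_pops = deduped["pop"].tolist()
print(result_names, result_pops)
```

Because the rows are sorted by descending population before deduplication, `keep="first"` retains the largest entry for each name and country. The duplicate check itself is still case-sensitive, so `albany` and `Albany` would remain distinct even within the same country.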

In [21]:
# export to gzipped csv
cities.to_csv("natal/data/cities.csv.gz", index=False, compression="gzip")
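Anyone reading the exported file back with pandas would likely want the same missing-value settings, otherwise the `NA` country code is lost again. A minimal round-trip sketch, assuming a temporary directory and a one-row sample rather than the real `natal/data/cities.csv.gz`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row sample standing in for the exported cities table
sample = pd.DataFrame({"name": ["Windhoek"], "country": ["NA"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cities.csv.gz")
    sample.to_csv(path, index=False, compression="gzip")
    # compression is inferred from the .gz suffix when reading
    restored = pd.read_csv(path, na_values=[""], keep_default_na=False)

restored_country = restored["country"].tolist()
print(restored_country)
```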