# Create an Authoritative List of Place Names from Pleiades

To disambiguate entities detected by NER, an authoritative list of ancient place names was extracted from the reference website Pleiades (https://pleiades.stoa.org/), a community-built gazetteer and graph of ancient places.

Pleiades provides several download formats. I adopted the JavaScript Object Notation (JSON) format, a comprehensive dump containing all attributes of all place, name, name variant, location, and connection objects in the database.

The version used is dated 19-05-2023 and was downloaded from https://atlantides.org/downloads/pleiades/json/.

The JSON file contains 40,037 place records under the `@graph` key. For each record, the main place name (`title`) was extracted together with its Pleiades ID. In addition, the `romanized` field of each entry in the record's `names` list was read to extract the name variants (if present), which were associated with the same Pleiades ID.
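As a minimal sketch of this traversal, the snippet below applies the same logic to a tiny hand-made sample rather than the real dump. The sample records are invented for illustration; only the keys `@graph`, `id`, `title`, `names`, and `romanized` come from the notebook, and `extract_place_names` is a hypothetical helper name.

```python
# Toy stand-in for the Pleiades dump: the record values are invented,
# only the key layout follows the notebook cells.
sample = {
    "@graph": [
        {
            "id": "1",
            "title": "Alpha",
            "names": [{"romanized": "Alpha"}, {"romanized": "Alfa"}],
        },
        {"id": "2", "title": "Beta", "names": []},
    ]
}

BASE_URL = "https://pleiades.stoa.org/places/"


def extract_place_names(dump):
    """Return (name, Pleiades URI) pairs for every title and romanized variant."""
    rows = []
    for place in dump["@graph"]:
        pleiades_id = BASE_URL + str(place["id"])
        rows.append((place["title"], pleiades_id))  # main name
        for name in place["names"]:  # empty list when there are no variants
            rows.append((name["romanized"], pleiades_id))
    return rows


rows = extract_place_names(sample)
print(len(rows))  # 4: two titles plus two variants
```

In this toy sample the first record yields the pair for "Alpha" twice, once from the title and once from an identical variant, which is the kind of redundancy the deduplication step addresses.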

In total, 76,625 ancient place names were extracted and written to a new CSV file. In some cases the `romanized` variant is identical to the `title` (main name). To avoid redundant matches, rows sharing the same name and Pleiades ID were eliminated using pandas `drop_duplicates`. The resulting dataset contains 43,752 ancient place names, each with its Pleiades ID.
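Because the `keep` argument of `drop_duplicates` decides whether one copy of a repeated pair survives, a small sketch on invented data may help. The column names match the notebook; the rows and the shortened IDs are made up for illustration.

```python
import pandas as pd

# Invented rows: "Alpha" appears twice for the same ID (title + identical variant).
df = pd.DataFrame(
    {
        "Place": ["Alpha", "Alpha", "Alfa", "Beta"],
        "Pleiades ID": ["p/1", "p/1", "p/1", "p/2"],
    }
)

subset = ["Place", "Pleiades ID"]

# keep=False drops every row that has a duplicate, so "Alpha" disappears entirely.
all_copies_dropped = df.drop_duplicates(subset=subset, keep=False)

# keep="first" retains one row per (name, ID) pair.
one_copy_kept = df.drop_duplicates(subset=subset, keep="first")

print(all_copies_dropped["Place"].tolist())  # ['Alfa', 'Beta']
print(one_copy_kept["Place"].tolist())       # ['Alpha', 'Alfa', 'Beta']
```

Which behaviour is wanted depends on whether a name identical to the title should remain in the list once. The notebook's own call uses `keep=False`.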

In [1]:
import csv
import pandas as pd
import json
import gzip

In [2]:
## open the JSON file from Pleiades
file_path = "/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Sources/Pleiades/Pleiades-Places-19052023.json.gz"

with gzip.open(file_path, 'rt', encoding='utf-8') as file:
    data = json.load(file)

In [3]:
## print the number of place records in the file
len(data['@graph'])

40037

In [10]:
## write a CSV file of place names and Pleiades IDs
## (the CSV-writing lines are commented out in this run; the pairs are printed instead)
#f = csv.writer(open("Pleiades_from_the_web.csv", "w", encoding="utf-8", newline=''))

## define column headers in the csv file
#f.writerow(["Place", "Pleiades ID"])

count = 0 ## count the number of place names extracted from the JSON file

## create an empty list to store rows
rows = []

for place in data['@graph']: ## for each place in the JSON file
    PleiadesID = 'https://pleiades.stoa.org/places/' + str(place['id']) ## get the ID
    PlaceName = place['title'] ## get the place name
    #rows.append([PlaceName, PleiadesID]) ## add to the CSV file
    print(PleiadesID)
    print(PlaceName)

    count += 1

    for name in place['names']: ## for each name variant (the list is empty if there are none)
        PlaceName_Variant = name['romanized'] ## get the place name variant
        #rows.append([PlaceName_Variant, PleiadesID]) ## add to the CSV file
        print(PleiadesID)
        print(PlaceName_Variant)

        count += 1

## write all rows at once
#f.writerows(rows)

https://pleiades.stoa.org/places/971720
Boldai tepe
https://pleiades.stoa.org/places/971721
Bubacene?
https://pleiades.stoa.org/places/971721
Bubacene
https://pleiades.stoa.org/places/971722
Burat tepe
https://pleiades.stoa.org/places/971725
Chahar Tut
https://pleiades.stoa.org/places/971727
Chaqalaq tepe
https://pleiades.stoa.org/places/971728
Chashma
https://pleiades.stoa.org/places/971729
Chehel Dukhtaran
https://pleiades.stoa.org/places/971730
Chim Qurghan
https://pleiades.stoa.org/places/971731
Chim tepe
https://pleiades.stoa.org/places/971732
Chirokchi tepe
https://pleiades.stoa.org/places/971734
Chorgul' tepe
https://pleiades.stoa.org/places/971735
Chorgul' tepe
https://pleiades.stoa.org/places/971736
Chul-i Abdan
https://pleiades.stoa.org/places/971739
Dal'verzin tepe
https://pleiades.stoa.org/places/971741
Dasht-i Archi
https://pleiades.stoa.org/places/971742
Degriz tepe
https://pleiades.stoa.org/places/971743
Deh Nahr-i Jadid
https://pleiades.stoa.org/places/971744
Deh Nau

In [11]:
count

76625
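Printing tens of thousands of lines can overwhelm the notebook output, so collecting the pairs and writing them in one go is usually more practical. The following is a minimal sketch with the standard `csv` module, not the code that produced the file loaded in the next cell. `write_place_rows` is a hypothetical helper, the example rows are invented, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_place_rows(rows, fileobj):
    """Write a header plus (place name, Pleiades ID) rows to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["Place", "Pleiades ID"])  # same headers as the notebook
    writer.writerows(rows)


# Invented rows for illustration only.
example_rows = [
    ["Alpha", "https://pleiades.stoa.org/places/1"],
    ["Alfa", "https://pleiades.stoa.org/places/1"],
]

buffer = io.StringIO()  # stand-in for open(path, "w", encoding="utf-8", newline="")
write_place_rows(example_rows, buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))  # 3: header plus two rows
```

With a real file, opening it in a `with` block ensures it is closed once the rows are written.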

In [12]:
## load the CSV file of place names and Pleiades IDs
Pleiades_Places = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/2.1.Pleiades_from_the_website.csv")

In [13]:
len(Pleiades_Places)

76625

In [14]:
## eliminate the duplicates of name-PleiadesID
## note: keep=False removes every row of a duplicated pair; keep='first' would retain one copy (and give a different row count)
Pleiades_Places = Pleiades_Places.drop_duplicates(subset=['Place', 'Pleiades ID'], keep=False)

In [15]:
len(Pleiades_Places)

43752

In [16]:
Pleiades_Places.to_csv('Pleiades_Places.csv', index=False)