# Finding Places in Text with the World Historical Gazetteer

## Source

[https://programminghistorian.org/en/lessons/finding-places-world-historical-gazetteer](https://programminghistorian.org/en/lessons/finding-places-world-historical-gazetteer) 

## Reflection

World Historical Gazetteer is a resource similar to Google Maps and ArcGIS in that it can map large datasets of locations. This lesson focused on the basics of cleaning a dataset in preparation for an external mapping/GIS tool, so that the resulting map depicts accurate information.

I had some issues with this lesson because some of the coding instructions and explanations were vague. I often lost track of what each of the imported libraries was doing. After cleaning up the files, I also ran into a problem with the World Historical Gazetteer itself: I couldn’t create a new dataset and upload the files I had processed in the lesson. At the time of writing, I’m still communicating with WHG to troubleshoot why my file was not accepted. This troubleshooting highlights a key concept I learned in this lesson: while there are tools to help visualize data, cleaning and formatting that data can be as challenging as it is crucial.

One particularly confusing aspect of the lesson was the section on Named Entity Linking. The process let me use the lesson's example sentence as a starting point for linking further data about its subjects (in this case Karl-Heinz Quade), but the linked data, a great resource, is never revisited later in the lesson.

From what I can infer from the remaining walkthrough, the actual World Historical Gazetteer resource seems intuitive and ideal for small-scale datasets focused on historical background. The lesson notes that many digital tools such as ArcGIS are developed to fit modern datasets rather than historical ones, so this resource is valuable for specifically history-based research. 


## Code

## Finding Places in Text with Python

In [None]:
text = "Siberia has many rivers"
for index, char in enumerate(text):
    print(index,char)

In [None]:
text = "Siberia has many rivers"
text.find("rivers")

In [None]:
text.find("Rivers")

In [None]:
text.find("y riv")
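The three `find` calls above behave differently: `str.find` returns the index of the first character of the match, it is case-sensitive, and it returns `-1` when nothing matches. A minimal sketch of a case-insensitive variant follows; `find_ignore_case` is a hypothetical helper name, not something from the lesson.

```python
def find_ignore_case(text: str, term: str) -> int:
    """Return the index of the first case-insensitive match, or -1."""
    # Lowercasing both sides lets "Rivers" match "rivers".
    return text.lower().find(term.lower())


text = "Siberia has many rivers"
print(text.find("rivers"))               # 17
print(text.find("Rivers"))               # -1, because the case differs
print(find_ignore_case(text, "Rivers"))  # 17
```

This is the same idea the `LOWER` token attribute expresses later in the notebook, where each place name is lowercased before matching.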

## Natural language processing

In [None]:
!pip install spacy

In [None]:
from spacy.lang.de import German
nlp = German()
doc = nlp("Berlin ist eine Stadt in Deutschland.")
for token in doc:
    print(token.i, token.text)

### Load the gazetteer

In [1]:
from pathlib import Path

In [2]:
# Read the whole gazetteer; the with-block closes the file afterwards.
with open("gazetteer.txt", encoding="utf-8") as file:
    text = file.read()
    print(file)

print("gazetteer.txt")
# print(text)

# Save a working copy of the gazetteer.
with open('gazetteer_test.txt', 'w', encoding='utf-8') as file:
    file.write(text)


<_io.TextIOWrapper name='gazetteer.txt' mode='r' encoding='utf-8'>
gazetteer.txt


In [3]:
gazetteer = Path('gazetteer_test.txt').read_text(encoding = 'utf-8')
# gazetteer = gazetteer.split("\n")

In [4]:
gazetteer = text.split("\n")

In [5]:
print(gazetteer)

['Armenien', 'Aserbaidshan', 'Aserbaidshen', 'Estland', 'Georgien', 'Kasachstan', 'Kirgisien', 'Lettland', 'Litauen', 'Moldawien', 'Russland', 'RSFSR', 'Kazakhstan', 'Turkmenien', 'Usbekistan', 'Ukraine', 'Weißrussland', 'Weissrussland', 'Abchasien', 'Akmola', 'Aktjubinsk', 'Alma Ata', 'Gurjew', 'Karaganda', 'Kostai', 'Ostkasachstan', 'Sudkasachstan', 'Siidkasachstan', 'DshalaLAbad', 'Frunse', 'Osch', 'Basarabeasca', 'Adygejien', 'Altai', 'Archangelsk', 'Astrachan', 'Baschkirien', 'Brjansk', 'Burjatien', 'Dagestan', 'Gorki', 'Gorkif Tschkalowsk', 'Irkutsk', 'Iwanowo', 'Jaroslawl', 'KabardinienBalkarien', 'Kalinin', 'Kaliningrad', 'Kalmykien', 'Kaluga', 'KaratschaiTscherkessien', 'Karelien', 'Kemerowo', 'Kislar', 'Kingissepp', 'Kirow', 'Komi', 'Krasnodar', 'Krasnoyarsk', 'Krim', 'Kuibyschew', 'Kurgan', 'Kursk', 'Leningrad', 'Marij El', 'Molotow', 'Mordowien', 'Moskau', 'Murmansk', 'Nordossetien', 'Nowgorod', 'Nowosibirsk', 'Omsk', 'Ordshonikidse', 'Orjol', 'Oijol', 'Pensa', 'Perm', 'Pri
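Splitting on `"\n"` keeps empty strings and stray whitespace as if they were place names, for example when the file ends with a newline. A minimal sketch of a more defensive loader follows; `load_gazetteer` is a hypothetical helper, and the demo writes a small temporary file rather than assuming `gazetteer.txt` exists.

```python
import tempfile
from pathlib import Path


def load_gazetteer(path: Path) -> list[str]:
    """Read one place name per line, dropping blanks and repeated names."""
    names = []
    seen = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "gazetteer_sample.txt"
    # Sample content with a blank line, padding, and a repeated entry.
    sample.write_text("Armenien\nEstland\n\n  Georgien \nEstland\n", encoding="utf-8")
    sample_gazetteer = load_gazetteer(sample)

print(sample_gazetteer)
```

The list keeps the order of the file, which makes it easier to compare against the original gazetteer by eye.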

### Matching Place Names

In [6]:
example_sentence = 'Karl-Heinz Quade ist von März 1944 bis August 1948 im Lager 150 in Grjasowez interniert.'

In [7]:
from spacy.lang.de import German
from spacy.matcher import Matcher

nlp = German()

doc = nlp(example_sentence)

In [8]:
matcher = Matcher(nlp.vocab)
for place in gazetteer:
    pattern = [{'LOWER': place.lower()}]
    matcher.add(place, [pattern])

In [9]:
matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end].text)

13 14 Grjasowez


In [10]:
pattern = [{'LOWER': 'lager'}, {'LIKE_NUM': True}]
matcher.add('LAGER_PATTERN', [pattern])

In [11]:
matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end].text)

10 12 Lager 150
13 14 Grjasowez
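
Each gazetteer pattern built in cell 8 describes a single token, so an entry containing a space, such as 'Alma Ata', cannot match: the tokenizer splits it into two tokens, and no single token has the lowercase form 'alma ata'. The sketch below shows one possible workaround. It assumes that splitting an entry on whitespace gives the same pieces the tokenizer would produce, which may not hold for hyphens or other punctuation. `build_pattern` is a hypothetical helper.

```python
def build_pattern(place: str) -> list[dict]:
    """Build one Matcher token description per whitespace-separated word."""
    return [{"LOWER": word} for word in place.lower().split()]


print(build_pattern("Grjasowez"))
print(build_pattern("Alma Ata"))

# In the notebook, the loop in cell 8 could then become (not run here,
# because it needs the `matcher` and `gazetteer` objects defined above):
#     for place in gazetteer:
#         if place.strip():  # skip empty entries, which yield no tokens
#             matcher.add(place, [build_pattern(place)])
```

A single-word entry still produces a one-token pattern, so existing matches such as Grjasowez should be unaffected.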


### Loading Text Files

In [13]:
for file in Path('folder_with_texts').iterdir():
    doc = nlp(file.read_text(encoding = 'utf-8'))
    matches = matcher(doc)
    for match_id, start, end in matches:
        print(file.name, start, end, doc[start:end].text)

gazetteer_test.txt 0 1 Armenien
gazetteer_test.txt 2 3 Aserbaidshan
gazetteer_test.txt 4 5 Aserbaidshen
gazetteer_test.txt 6 7 Estland
gazetteer_test.txt 8 9 Georgien
gazetteer_test.txt 10 11 Kasachstan
gazetteer_test.txt 12 13 Kirgisien
gazetteer_test.txt 14 15 Lettland
gazetteer_test.txt 16 17 Litauen
gazetteer_test.txt 18 19 Moldawien
gazetteer_test.txt 20 21 Russland
gazetteer_test.txt 22 23 RSFSR
gazetteer_test.txt 24 25 Kazakhstan
gazetteer_test.txt 26 27 Turkmenien
gazetteer_test.txt 28 29 Usbekistan
gazetteer_test.txt 30 31 Ukraine
gazetteer_test.txt 32 33 Weißrussland
gazetteer_test.txt 34 35 Weissrussland
gazetteer_test.txt 36 37 Abchasien
gazetteer_test.txt 38 39 Akmola
gazetteer_test.txt 40 41 Aktjubinsk
gazetteer_test.txt 45 46 Gurjew
gazetteer_test.txt 47 48 Karaganda
gazetteer_test.txt 49 50 Kostai
gazetteer_test.txt 51 52 Ostkasachstan
gazetteer_test.txt 53 54 Sudkasachstan
gazetteer_test.txt 55 56 Siidkasachstan
gazetteer_test.txt 57 58 DshalaLAbad
gazetteer_test.txt 5

place_texts.txt 16 17 Stalingrad
place_texts.txt 33 34 Workuta
place_texts.txt 35 36 Astrachan
place_texts.txt 65 66 Stalingrad
place_texts.txt 77 78 Stalingrad
place_texts.txt 104 105 Stalingrad
place_texts.txt 106 107 Saratow
place_texts.txt 180 181 Stalingrad
place_texts.txt 190 191 Jelabuga
place_texts.txt 193 194 Stalingrad
place_texts.txt 222 223 Stalingrad
place_texts.txt 240 241 Jelabuga
place_texts.txt 382 383 Stalingrad
place_texts.txt 396 397 Stalingrad
place_texts.txt 416 417 Stalingrad
place_texts.txt 456 457 Stalingrad
place_texts.txt 483 484 Stalingrad
place_texts.txt 521 522 Stalingrad
place_texts.txt 600 601 Moskau
place_texts.txt 649 650 Stalingrad
place_texts.txt 663 664 Stalingrad
place_texts.txt 715 716 Stalingrad
place_texts.txt 730 731 Stalingrad
place_texts.txt 794 795 Stalingrad
place_texts.txt 825 826 Russland
place_texts.txt 861 862 Stalingrad
place_texts.txt 880 881 Stalingrad
place_texts.txt 971 972 Stalingrad
place_texts.txt 1056 1057 Stalingrad
place_text
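
The output above includes `gazetteer_test.txt`, which suggests that the gazetteer copy sat in the same folder as the source texts and was matched against itself. The sketch below shows one way to restrict the loop to the intended files. It assumes the texts are `.txt` files and uses a hypothetical `iter_text_files` helper. The demo builds a throwaway folder rather than touching `folder_with_texts`.

```python
import tempfile
from pathlib import Path


def iter_text_files(folder: Path, skip: set[str]):
    """Yield .txt files in `folder` in name order, ignoring names in `skip`."""
    for path in sorted(folder.glob("*.txt")):
        if path.name not in skip:
            yield path


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Sample folder: one real text, the gazetteer copy, and a non-text file.
    for name in ("place_texts.txt", "gazetteer_test.txt", "notes.md"):
        (folder / name).write_text("Stalingrad", encoding="utf-8")
    selected = [p.name for p in iter_text_files(folder, skip={"gazetteer_test.txt"})]

print(selected)
```

Sorting the paths also makes the processing order predictable, since `iterdir` does not guarantee one.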

### Term Frequency

In [14]:
from collections import Counter

count_list = []
for match_id, start, end in matches:
    count_list.append(doc[start:end].text)

counter = Counter(count_list)

for term, count in counter.most_common(10):
    print(term, count)

Stalingrad 100
Moskau 55
Jelabuga 30
Selenodolsk 26
Kasan 25
Smolensk 16
Workuta 11
Russland 8
Gorki 8
Odessa 8
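
Because `matches` and `doc` are overwritten on every pass of the file loop, cell 14 appears to count only the matches from the last file read. If a total across all files is wanted, one option is to update a single `Counter` inside the loop. The sketch below uses made-up match lists standing in for the per-file results; the file names and counts are illustrative, not the lesson's data.

```python
from collections import Counter

# Hypothetical per-file match texts; in the notebook these would come from
# doc[start:end].text for each match in each file.
matches_by_file = {
    "diary_part_1.txt": ["Stalingrad", "Moskau", "Stalingrad"],
    "diary_part_2.txt": ["Moskau", "Kasan", "Stalingrad"],
}

total = Counter()
for file_name, place_names in matches_by_file.items():
    total.update(place_names)  # add this file's counts to the running total

for term, count in total.most_common(2):
    print(term, count)
```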


### Named Entity Recognition

In [None]:
!python -m spacy download de_core_news_sm

In [15]:
import spacy
nlp = spacy.load("de_core_news_sm")

doc = nlp(example_sentence)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Karl-Heinz Quade PER 0 2
Grjasowez LOC 13 14


### DisplaCy

In [16]:
from spacy import displacy
displacy.render(doc, style = "ent")

In [17]:
displacy.render(doc, jupyter = True, style = "ent")

In [18]:
displacy.render(doc, jupyter = True, style = "dep")

Couldn't save the image.

In [24]:
svg = displacy.render(doc, style="dep")

# output_path = Path("./sentence.svg") 
# # you can keep there only "dependency_plot.svg" if you want to save it in the same folder where you run the script 
# output_path.open("w", encoding="utf-8").write(svg)

type(svg)

NoneType

In [22]:
# svg = displacy.render(doc, jupyter = True, style = "dep")
# output_path = Path("sentence.svg")
# output_path.write(svg)

AttributeError: 'WindowsPath' object has no attribute 'write'
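
The `AttributeError` arises because `pathlib.Path` objects have no `write` method; `write_text` is the method that accepts a string. Separately, `type(svg)` is `NoneType` because inside Jupyter `displacy.render` displays the figure and returns nothing unless `jupyter=False` is passed, in which case it returns the markup as a string. The sketch below covers only the saving step. It uses a placeholder SVG string (an assumption standing in for the displaCy output) so that it runs without spaCy.

```python
import tempfile
from pathlib import Path

# Placeholder markup; in the notebook this would be the string returned by
# displacy.render(doc, style="dep", jupyter=False).
svg = '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>'

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "sentence.svg"
    # write_text opens, writes, and closes the file in one call.
    output_path.write_text(svg, encoding="utf-8")
    saved = output_path.read_text(encoding="utf-8")

print(saved == svg)
```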

### Named Entity Linking

In [25]:
!pip install spacy-dbpedia-spotlight



In [27]:
import spacy
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe('dbpedia_spotlight', config={'language_code': 'de'})

doc = nlp(example_sentence)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

Grjasowez DBPEDIA_ENT http://de.dbpedia.org/resource/Grjasowez
interniert DBPEDIA_ENT http://de.dbpedia.org/resource/Internierung


In [28]:
import requests
data = requests.get("http://de.dbpedia.org/data/Grjasowez.json").json()

In [29]:
print(data)

{'http://de.dbpedia.org/resource/Obnora': {'http://dbpedia.org/ontology/sourceConfluence': [{'type': 'uri', 'value': 'http://de.dbpedia.org/resource/Grjasowez'}]}, 'http://de.dbpedia.org/resource/Grjasowez': {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': [{'type': 'uri', 'value': 'http://schema.org/Place'}, {'type': 'uri', 'value': 'http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing'}, {'type': 'uri', 'value': 'http://www.w3.org/2002/07/owl#Thing'}, {'type': 'uri', 'value': 'http://dbpedia.org/ontology/Settlement'}, {'type': 'uri', 'value': 'http://dbpedia.org/ontology/PopulatedPlace'}, {'type': 'uri', 'value': 'http://dbpedia.org/ontology/Location'}, {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q486972'}, {'type': 'uri', 'value': 'http://dbpedia.org/ontology/Place'}], 'http://www.w3.org/2000/01/rdf-schema#label': [{'type': 'literal', 'value': 'Grjasowez', 'lang': 'de'}], 'http://www.w3.org/2000/01/rdf-schema#comment': [{'type': 'literal', 'value': 'Grjasowez (russis

In [30]:
data.keys()

dict_keys(['http://de.dbpedia.org/resource/Obnora', 'http://de.dbpedia.org/resource/Grjasowez', 'http://de.dbpedia.org/resource/Lew_Alexandrowitsch_Tschugajew', 'http://de.dbpedia.org/resource/Gryazovets', 'http://de.wikipedia.org/wiki/Grjasowez'])

Note: this output differs from the example output in the lesson, likely because of the earlier error.
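
The DBpedia response is a nested dictionary: resource URI, then property URI, then a list of value objects. The sketch below pulls one property out of it, using a trimmed sample that copies the shape of the printed `data` dictionary (only the label entry is reproduced). `get_values` is a hypothetical helper.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RESOURCE = "http://de.dbpedia.org/resource/Grjasowez"

# Trimmed sample in the same shape as the printed `data` dictionary.
sample_data = {
    RESOURCE: {
        RDFS_LABEL: [{"type": "literal", "value": "Grjasowez", "lang": "de"}],
    }
}


def get_values(data: dict, resource: str, prop: str) -> list[str]:
    """Return every value recorded for `prop` on `resource`, or []."""
    return [entry["value"] for entry in data.get(resource, {}).get(prop, [])]


labels = get_values(sample_data, RESOURCE, RDFS_LABEL)
missing = get_values(sample_data, RESOURCE, "http://example.org/missing")
print(labels)
print(missing)
```

Returning an empty list for an absent resource or property avoids a `KeyError` when the response differs from the lesson's example, as it did here.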

### Export Our Data

In [32]:
start_date = "1800" #YYYY-MM-DD
end_date = "2000"
source_title = "Karl-Heinz Quade Diary"

output_text = ""
column_header = "id\ttitle\ttitle_source\tstart\tend\n"  
output_text += column_header  

places_list = []
if matches:
    places_list.extend([ doc[start:end].text for match_id, start, end in matches ])
if doc.ents:
    places_list.extend([ ent.text for ent in doc.ents if ent.label_ == "GPE" or ent.label_ == "LOC"])

# remove duplicate place names by creating a list of names and then converting the list to a set
unique_places = set(places_list)

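# place_id avoids shadowing Python's built-in id() function
for place_id, place in enumerate(unique_places):
    output_text += f"{place_id}\t{place}\t{source_title}\t{start_date}\t{end_date}\n"
#     output_text += f"{place_id},{place},{source_title},{start_date},{end_date}\n"


filename = source_title.lower().replace(' ','_') + '.tsv'
# Write explicitly as UTF-8 so non-ASCII place names survive on any platform.
Path(filename).write_text(output_text, encoding='utf-8')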
print('created: ', filename)

created:  karl-heinz_quade_diary.tsv
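
Building the TSV by string concatenation works as long as no value contains a tab or a newline. The sketch below does the same export with the standard library's `csv` module, which handles quoting, and writes the file explicitly as UTF-8. The two place names are sample values and `write_places_tsv` is a hypothetical helper.

```python
import csv
import tempfile
from pathlib import Path


def write_places_tsv(path: Path, places, source_title, start, end) -> None:
    """Write one row per unique place name, sorted for a stable order."""
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "title", "title_source", "start", "end"])
        for row_id, place in enumerate(sorted(set(places))):
            writer.writerow([row_id, place, source_title, start, end])


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "karl-heinz_quade_diary.tsv"
    write_places_tsv(out, ["Moskau", "Grjasowez", "Moskau"],
                     "Karl-Heinz Quade Diary", "1800", "2000")
    lines = out.read_text(encoding="utf-8").splitlines()

print(lines)
```

Sorting the set before numbering keeps the ids the same from run to run, whereas iterating over a bare set can reorder the rows.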


## Uploading to the World Historical Gazetteer

Note: this section is to be completed once WHG responds about the upload error.

To do: write a short reflection on using the software.