# Source

Grunewald, Susan, and Andrew Janco. “Finding Places in Text with the World Historical Gazetteer.” Programming Historian, February 11, 2022. https://programminghistorian.org/en/lessons/finding-places-world-historical-gazetteer.

Note: The World Historical Gazetteer was down when I tried to access it, so I could not complete the final part of the lesson.

# Reflection

In this Programming Historian lesson, I learned how to process a text to recognize people and places and how to visualize the results. I also learned how to use spaCy to inspect the structure of a sentence and parse it syntactically, which I found especially interesting. The lesson was most helpful in showing me how to extract locations from a text and link them to real-world coordinates and external databases, which give further information about the places and people mentioned. For my final project, I will be extracting location information from a very large corpus, so knowing how to extract place names and link them to a gazetteer will be directly useful.

I was interested in the two approaches the lesson presents for extracting location data: matching the text against a list of known place names, and using a trained named entity recognition model that predicts which spans of text look like places. I was also struck by how easy it was to process text in German, a language I don’t understand, and still get the information I wanted, because spaCy handles the language-specific processing. As a result, I am much less intimidated by working with texts in languages I don’t know. While the most obvious use of this lesson will be finding locations for my final project, it should also be useful if I ever want to do linguistic analysis (though there are probably better tools for that).

# Code

In [1]:
# importing languages

from spacy.lang.de import German
from spacy.lang.en import English

# test: tokenize a German sentence with a blank German pipeline
nlp = German()
doc = nlp("Berlin ist eine Stadt in Deutschland.")
for token in doc:
    print(token.i, token.text)

0 Berlin
1 ist
2 eine
3 Stadt
4 in
5 Deutschland
6 .


In [4]:
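# load the gazetteer: one place name per line

from pathlib import Path

gazetteer = Path("gazetteer.txt").read_text(encoding="utf-8")
# drop blank lines and stray whitespace so no empty patterns reach the matcher
gazetteer = [line.strip() for line in gazetteer.splitlines() if line.strip()]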



In [36]:
#learning how to match place names

from spacy.lang.de import German
from spacy.matcher import Matcher

nlp = German()

doc = nlp("Karl-Heinz Quade ist von März 1944 bis August 1948 im Lager 150 in Grjasowez interniert.")


matcher = Matcher(nlp.vocab)

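# add one single-token pattern per place name to the matcher
# (names that contain spaces will not match, because each pattern covers one token)
for place in gazetteer: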
    pattern = [{'LOWER': place.lower()}]
    matcher.add(place, [pattern])


#find places
matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end].text)


# add a pattern for numbered camps such as "Lager 150"
pattern = [{'LOWER': 'lager'},   # the first token should be "lager"
           {'LIKE_NUM': True}]   # the second token should look like a number

# Add the pattern to the matcher
matcher.add("LAGER_PATTERN", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end].text)


13 14 Grjasowez
10 12 Lager 150
13 14 Grjasowez


In [37]:
# find place names in every file in the places folder (here, places.txt)

for file in Path("places").iterdir():
    doc = nlp(file.read_text())
    matches = matcher(doc)
    for match_id, start, end in matches:
        print(file.name, start, end, doc[start:end].text)

places.txt 16 17 Stalingrad
places.txt 33 34 Workuta
places.txt 35 36 Astrachan
places.txt 65 66 Stalingrad
places.txt 77 78 Stalingrad
places.txt 104 105 Stalingrad
places.txt 106 107 Saratow
places.txt 180 181 Stalingrad
places.txt 190 191 Jelabuga
places.txt 193 194 Stalingrad
places.txt 222 223 Stalingrad
places.txt 240 241 Jelabuga
places.txt 382 383 Stalingrad
places.txt 396 397 Stalingrad
places.txt 416 417 Stalingrad
places.txt 456 457 Stalingrad
places.txt 483 484 Stalingrad
places.txt 521 522 Stalingrad
places.txt 600 601 Moskau
places.txt 649 650 Stalingrad
places.txt 663 664 Stalingrad
places.txt 715 716 Stalingrad
places.txt 730 731 Stalingrad
places.txt 794 795 Stalingrad
places.txt 825 826 Russland
places.txt 861 862 Stalingrad
places.txt 880 881 Stalingrad
places.txt 971 972 Stalingrad
places.txt 1056 1057 Stalingrad
places.txt 1081 1082 Stalingrad
places.txt 1119 1120 Stalingrad
places.txt 1140 1141 Stalingrad
places.txt 1166 1167 Stalingrad
places.txt 1186 1187 Stalin

In [12]:
# count how often each place was matched
# (uses `matches` and `doc` from the last document processed above)


from collections import Counter

count_list = []

for match_id, start, end in matches:
    count_list.append(doc[start:end].text)

counter = Counter(count_list)

for term, count in counter.most_common(10):
    print(term,count)


Stalingrad 100
Moskau 55
Jelabuga 30
Selenodolsk 26
Kasan 25
Smolensk 16
Workuta 11
Russland 8
Gorki 8
Odessa 8


In [16]:
#using spacy for NER

import spacy
nlp = spacy.load("de_core_news_sm")

doc = nlp("Karl-Heinz Quade ist von März 1944 bis August 1948 im Lager 150 in Grjasowez interniert.")

#find entities in doc
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Karl-Heinz Quade PER 0 2
Grjasowez LOC 13 14


In [21]:
#using displaCy

from spacy import displacy


displacy.render(doc, jupyter=True, style="ent")

displacy.render(doc, jupyter=True, style="dep")



In [34]:
# named entity linking
import spacy_dbpedia_spotlight
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe('dbpedia_spotlight', config={'language_code': 'de'})

doc = nlp("Karl-Heinz Quade ist von März 1944 bis August 1948 im Lager 150 in Grjasowez interniert.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

# fetch the DBpedia record for Grjasowez (stored in `data`, not printed here)

import requests
data = requests.get("http://de.dbpedia.org/data/Grjasowez.json").json()



Grjasowez DBPEDIA_ENT http://de.dbpedia.org/resource/Grjasowez
interniert DBPEDIA_ENT http://de.dbpedia.org/resource/Internierung
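
The `data` dictionary fetched above is never inspected. As far as I can tell, DBpedia's `.json` endpoint returns RDF/JSON, which nests as subject URI, then predicate URI, then a list of objects that each carry a `value`. A minimal sketch of reading values out of that shape follows. The helper name `rdf_values` and the sample dictionary (example.org URIs, placeholder value) are assumptions for illustration and are not real DBpedia data:

```python
def rdf_values(data, subject, predicate):
    """Return all object values for one subject/predicate pair in RDF/JSON."""
    objects = data.get(subject, {}).get(predicate, [])
    return [obj["value"] for obj in objects if "value" in obj]


# Hypothetical stand-in for a parsed response
sample = {
    "http://example.org/resource/Place": {
        "http://example.org/property/label": [
            {"type": "literal", "value": "Example place", "lang": "en"}
        ]
    }
}

labels = rdf_values(sample,
                    "http://example.org/resource/Place",
                    "http://example.org/property/label")
missing = rdf_values(sample,
                     "http://example.org/resource/Place",
                     "http://example.org/property/unknown")
print(labels, missing)
```

Returning an empty list for a missing subject or predicate means the same call can be run over every linked entity without checking first whether a given property exists.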


In [38]:
#export

start_date = "1800"  # the lesson's format hint is YYYY-MM-DD; only the year is given here
end_date = "2000"
source_title = "Karl-Heinz Quade Diary"

output_text = ""
column_header = "id\ttitle\ttitle_source\tstart\tend\n"  
output_text += column_header  

places_list = []
# `matches` and `doc` must come from the same text, or the spans will be wrong
if matches:
    places_list.extend(doc[start:end].text for match_id, start, end in matches)
if doc.ents:
    places_list.extend(ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC"))

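# remove duplicate place names by converting the list to a set
unique_places = set(places_list)

# sort so the ids are stable between runs; `place_id` avoids shadowing the builtin `id`
for place_id, place in enumerate(sorted(unique_places)):
    output_text += f"{place_id}\t{place}\t{source_title}\t{start_date}\t{end_date}\n"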
    output_text += f"{id}\t{place}\t{source_title}\t{start_date}\t{end_date}\n"

filename = source_title.lower().replace(' ','_') + '.tsv'
Path(filename).write_text(output_text, encoding="utf-8")
print('created: ', filename)



created:  karl-heinz_quade_diary.tsv
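
Building the TSV by string concatenation works as long as no place name contains a tab or a newline. A minimal sketch of the same export using the standard library's `csv` module, which handles quoting. The function name `places_to_tsv` and the sample place names are illustrative; the column headers are the ones used in the cell above:

```python
import csv
import io


def places_to_tsv(places, source_title, start_date, end_date):
    """Render unique place names as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "title", "title_source", "start", "end"])
    # sorted(set(...)) removes duplicates and keeps ids stable between runs
    for place_id, place in enumerate(sorted(set(places))):
        writer.writerow([place_id, place, source_title, start_date, end_date])
    return buffer.getvalue()


tsv_text = places_to_tsv(["Moskau", "Kasan", "Moskau"],
                         "Karl-Heinz Quade Diary", "1800", "2000")
print(tsv_text)
```

The returned string can be written to disk with `Path(filename).write_text(tsv_text, encoding="utf-8")`, as in the cell above.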
