# Named Entity Recognition for *Sovetskaya etnografiya* with spaCy
This script is my result from the workshop "Natural Language Processing für Historiker:innen mit Flair und SpaCy" by Martin Dröge, HU Berlin, March 2022. It extracts place names from all issues of the Soviet ethnographic journal *Sovetskaya etnografiya* and saves them as CSV, pickle, and plain-text files. The place names are compared with a list of historical place names of the Tajik Soviet Socialist Republic to obtain a list of places in Tajikistan that are mentioned in the journal.

## Imports

In [1]:
import spacy
import os, glob
import pickle
import pandas as pd


## Loading the data

In [7]:
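rows = []
base_dir = os.getcwd()  # avoid the name `dir`, which shadows a Python built-in
for filepath in glob.glob(os.path.join(base_dir, 'textfiles-flag16', '*.txt')):
    filename = os.path.basename(filepath)
    # assumes the OCR text files are UTF-8 encoded; the with-block closes the file
    with open(filepath, encoding='utf-8') as f:
        rows.append({'filename': filename, 'rohtext': f.read()})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns=['filename', 'rohtext'])
print(df) # check df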

         filename                                            rohtext
0    1937_2_3.txt  СОВЕТСКАЯ \nЭТНОГРАФИЯ \n2-3 \n1937 \nИЗДАТЕЛЬ...
1      1986_6.txt  А К А Д Е М И Я  Н А У К  СССР\nОРДЕНА ДРУЖБЫ ...
2      1984_4.txt  ( / ОВЕТСКАЯ\nЭТНОГРАФИЯ\n1984\nА К А Д ЕМ И Я...
3      1979_4.txt  ISSN 0038-5050\nС о в е т с к а я\nЭТНОГРАФИЯ\...
4      1982_2.txt  ^  ^\n \nISSN 0038-5050\n( / ОВЕТСКАЯ\nЭТНОГРА...
..            ...                                                ...
185    1991_5.txt  Вологодская областная универсальная  научная б...
186    1977_6.txt  >\nР е д а к ц и о н н а я  к о л л е г и я :\...
187    1988_4.txt  АКАДЕМИЯ НАУК СССР\nО РДЕНА ДРУЖ БЫ НАРОДОВ И ...
188    1968_1.txt  ЭТНОГРАФИЯ\nИНСТИТУТ ЭТНОГРАФИИ ИМ. Н. Н. МИКЛ...
189    1975_4.txt  СОВЕТСКАЯ ! \nЭТНОГРАФИЯ !\n4\n^\n1975\n1\nИНС...

[190 rows x 2 columns]
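
The rows above come back in whatever order `glob` returns them. Judging from names such as `1937_2_3.txt`, the file names appear to encode the year followed by one or two issue numbers; under that assumption, a minimal sketch for sorting them chronologically could look like this (`parse_issue_filename` is a hypothetical helper, not part of the original notebook):

```python
import os

def parse_issue_filename(filename):
    """Split e.g. '1937_2_3.txt' into (1937, (2, 3)).

    Assumes the pattern YEAR_ISSUE[_ISSUE].txt seen in the output above.
    """
    stem = os.path.splitext(filename)[0]
    year, *issues = stem.split('_')
    return int(year), tuple(int(issue) for issue in issues)

# a few file names taken from the output above
names = ['1986_6.txt', '1937_2_3.txt', '1984_4.txt', '1979_4.txt']
ordered = sorted(names, key=parse_issue_filename)
print(ordered)
```

With a DataFrame, the same key could be applied to the `filename` column before sorting, for example by storing the parsed year in its own column.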


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   filename  190 non-null    object
 1   rohtext   190 non-null    object
dtypes: object(2)
memory usage: 3.1+ KB


## Doc object in a pandas DataFrame

### Function for creating the Doc object

In [4]:
def create_doc_object(text, nlp):
    '''
    Creates a spaCy Doc object with an already loaded language model.
    INPUT: text (string), nlp (loaded spaCy language model)
    RETURN: spacy.tokens.doc.Doc
    '''
    return nlp(text)

In [5]:
nlp = spacy.load('ru_core_news_md')

In [10]:
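# create sample df
rows = []
base_dir = os.getcwd()  # avoid the name `dir`, which shadows a Python built-in
for filepath in glob.glob(os.path.join(base_dir, 'textfiles-flag16', 'sample', '*.txt')):
    filename = os.path.basename(filepath)
    # assumes the OCR text files are UTF-8 encoded; the with-block closes the file
    with open(filepath, encoding='utf-8') as f:
        rows.append({'filename': filename, 'rohtext': f.read()})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns=['filename', 'rohtext'])
print(df) # check df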

      filename                                            rohtext
0   1990_2.txt  ISSN 0 0 3 8 -5 0 5 0\nГ О В ЕТС КАЯ\nЭТНОГРАФ...
1   1990_3.txt  э\nf  ^\n \nISSN 0 0 3 8 - 5 0 5 0\nЬ\n О  В  ...
2   1990_1.txt  АКАДЕМИЯ НАУК СССР\nОРДЕНА ДРУЖ БЫ  НАРОДОВ ИН...
3   1990_4.txt  АКАДЕМИЯ НАУК СССР\n)РДЕНА ДРУЖ Б Ы  НАРОДОВ И...
4   1990_5.txt  I S S N  0 0 3 8 - 5 0 5 0\n/ О В ЕТС КАЯ\nЭТН...
5   1990_6.txt  I S S N  0 0 3 8 - 5 0 5 0\n/ О В ЕТС КАЯ\nЭТН...
6   1991_1.txt  ISSN  0 0 3 8 -5 0 5 0\nu u o o - 3 U 3 U\nО 9...
7   1991_2.txt  Вологодская областная универсальная  научная б...
8   1991_3.txt  Вологодская областная универсальная  научная б...
9   1991_6.txt  L\nm -  \nIS S N  0 0 3 8 -5 0 5 0\nСОВЕТСКАЯ\...
10  1991_4.txt  Вологодская областная универсальная  научная б...
11  1991_5.txt  Вологодская областная универсальная  научная б...
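
The raw OCR text contains letter-spaced runs such as `А К А Д Е М И Я  Н А У К` or `0 0 3 8`, which the tokenizer later splits into single characters. One possible pre-processing step, shown here only as a rough sketch, is to rejoin such runs with a regular expression before the text is passed to spaCy. The function name and the threshold of at least three spaced characters are assumptions; the pattern would also merge three or more genuine one-letter words in a row, so results should be checked on real pages.

```python
import re

# three or more single characters, each separated by exactly one space
SPACED_RUN = re.compile(r'\b(?:\w ){2,}\w\b')

def collapse_letterspacing(text):
    """Join letter-spaced runs like 'Н А У К' into 'НАУК' (heuristic)."""
    return SPACED_RUN.sub(lambda match: match.group(0).replace(' ', ''), text)

sample_header = 'А К А Д Е М И Я  Н А У К  СССР'
sample_issn = 'ISSN 0 0 3 8 -5 0 5 0'
print(collapse_letterspacing(sample_header))
print(collapse_letterspacing(sample_issn))
```

Words that are separated by a double space in the OCR output stay separate, because the pattern only bridges single spaces.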


In [11]:
%%time

# create Doc objects
df.loc[:, 'doc_object'] = df.loc[:, 'rohtext'].apply(lambda text: create_doc_object(text, nlp))

CPU times: user 14min 36s, sys: 50.2 s, total: 15min 26s
Wall time: 15min 37s


In [12]:
# save as pickle

df.to_pickle('se_ner-sample.p')

### Reading the pickle file

In [13]:
df_p = pd.read_pickle('se_ner-sample.p')
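
Since creating the Doc objects took about 15 minutes for the twelve sample files, it can be convenient to compute the expensive step only when no pickle exists yet. The following is a minimal, library-independent sketch of that pattern; `load_or_compute`, `expensive_step`, and the cache file name are hypothetical stand-ins, not part of the original notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the pickled result if cache_path exists, else compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# cheap stand-in for the slow step, so the sketch runs without a language model
calls = []
def expensive_step():
    calls.append(1)
    return {'1990_1.txt': ['СССР', 'Сухуми']}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'demo-cache.p')  # hypothetical file name
    first = load_or_compute(cache_file, expensive_step)   # computes and saves
    second = load_or_compute(cache_file, expensive_step)  # loads from disk

print(len(calls))  # number of times the slow step actually ran
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.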

In [14]:
df_p.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   filename    12 non-null     object
 1   rohtext     12 non-null     object
 2   doc_object  12 non-null     object
dtypes: object(3)
memory usage: 416.0+ bytes


### Checks

In [15]:
type(df_p.loc[0, 'doc_object'])

spacy.tokens.doc.Doc

In [16]:
token_test = df_p.loc[0, 'doc_object']

In [17]:
for token in token_test[:10]:
    print(token.text, token.lemma_, token.pos_, token.ent_type_)

ISSN issn PROPN 
0 0 NUM 
0 0 NUM 
3 3 NUM 
8 8 NUM 
-5 -5 PUNCT 
0 0 NUM 
5 5 NUM 
0 0 NUM 

 
 SPACE 


## Tokenization

In [18]:

def tokenize(doc):
    '''
    Tokenizes text using Doc-Object
    INPUT: Doc-Object
    RETURN: list with tokens
    '''
    return [ token.text for token in doc if not token.is_punct ]

In [19]:
%%time

df.loc[:, 'tokens'] = df.loc[:, 'doc_object'].apply(tokenize)

CPU times: user 816 ms, sys: 29.7 ms, total: 846 ms
Wall time: 859 ms


In [20]:
df.loc[:, 'ntokens'] = df.loc[:, 'tokens'].apply(len)

In [21]:
df.loc[:, 'ntokens'].describe()

count        12.000000
mean     113146.000000
std       13251.053673
min       98998.000000
25%      105272.500000
50%      111018.500000
75%      115477.750000
max      148056.000000
Name: ntokens, dtype: float64

## NER

In [22]:
def extract_named_entities(doc, entity='PER'):
    '''
    Extracts named entities of one type from a Doc object.
    INPUT: Doc object, entity label (e.g. 'PER', 'ORG', 'LOC')
    RETURN: List with entity texts
    '''
    # doc.ents yields entity spans, not single tokens
    return [ ent.text for ent in doc.ents if ent.label_ == entity ]

In [23]:
%%time

entities = ['PER', 'ORG', 'LOC', 'MISC']

for entity in entities:
    df.loc[:, entity] = df.loc[:, 'doc_object'].apply(lambda doc: extract_named_entities(doc, entity=entity))

CPU times: user 606 ms, sys: 198 ms, total: 804 ms
Wall time: 810 ms
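
The loop above walks over `doc.ents` once per label. As an alternative sketch, the entities could be grouped by label in a single pass. The function below only relies on the `.text` and `.label_` attributes, so it is demonstrated with hypothetical stand-in objects instead of real spaCy spans; with real data one would pass `doc.ents`. The default label tuple is an assumption based on the columns used in this notebook.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities_by_label(ents, labels=('PER', 'ORG', 'LOC')):
    """Collect entity texts per label in one pass over the entities."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent.label_ in labels:
            grouped[ent.label_].append(ent.text)
    # plain dict with every requested label present, even if empty
    return {label: grouped[label] for label in labels}

# hypothetical stand-ins for entity spans (only .text and .label_ are used)
fake_ents = [
    SimpleNamespace(text='Сухуми', label_='LOC'),
    SimpleNamespace(text='Академия наук СССР', label_='ORG'),
    SimpleNamespace(text='Киев', label_='LOC'),
]
grouped = group_entities_by_label(fake_ents)
print(grouped)
```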


In [24]:
df.head(3).T

|  | 0 | 1 | 2 |
|---|---|---|---|
| filename | 1990_2.txt | 1990_3.txt | 1990_1.txt |
| rohtext | ISSN 0 0 3 8 -5 0 5 0\nГ О В ЕТС КАЯ\nЭТНОГРАФ... | э\nf ^\n \nISSN 0 0 3 8 - 5 0 5 0\nЬ\n О В ... | АКАДЕМИЯ НАУК СССР\nОРДЕНА ДРУЖ БЫ НАРОДОВ ИН... |
| doc_object | (ISSN, 0, 0, 3, 8, -5, 0, 5, 0, \n, Г, О, В, Е... | (э, \n, f, , ^, \n \n, ISSN, 0, 0, 3, 8, -, 5... | (АКАДЕМИЯ, НАУК, СССР, \n, ОРДЕНА, ДРУЖ, БЫ, ... |
| tokens | [ISSN, 0, 0, 3, 8, -5, 0, 5, 0, \n, Г, О, В, Е... | [э, \n, f, , ^, \n \n, ISSN, 0, 0, 3, 8, 5, 0... | [АКАДЕМИЯ, НАУК, СССР, \n, ОРДЕНА, ДРУЖ, БЫ, ... |
| ntokens | 108823 | 111274 | 105736 |
| PER | [Я. С., Ибн Х, В. В. Н, Самойлова, Б. В. Андри... | [Родионов, В. В. Мунтян, М. И., Нина Ивановна ... | [Л. Б., Б. П. Ш, Федор Кондратьевич Вовк (Волк... |
| ORG | [ЕТС КАЯ\nЭТНОГРАФИЯ\n1990\n•НАУКА*\n, ОД ЕРЖ ... | [БЫ НАРОДОВ ИНСТИТУТ ЭТНОГРАФИИ им., Верховно... | [АКАДЕМИЯ НАУК СССР\nОРДЕНА ДРУЖ БЫ НАРОДОВ И... |
| LOC | [Карачаево-Черкесской автономной, Северном \nТ... | [СССР, СССР, СССР, Узбекистана, Казахстана, Но... | [СССР, Сухуми, Абхазии, Камчатки, Киев, Ленинг... |
| MISC | [] | [] | [] |


In [147]:
check_dtype = df.loc[0, 'tokens']

In [148]:
type(check_dtype)

list

In [149]:
type(check_dtype[0])

str

In [None]:
# display locations

for location in df['LOC']:
    print(location)
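
Printing every list quickly becomes hard to read, because frequent names such as `СССР` appear many times. A frequency table gives a faster overview. The sketch below counts mentions across several lists with `collections.Counter`; the sample data is a small hypothetical excerpt, and with the real DataFrame one would pass `df['LOC']` instead.

```python
from collections import Counter

def count_locations(lists_of_locations):
    """Count how often each location string occurs across all texts."""
    counts = Counter()
    for lst in lists_of_locations:
        counts.update(lst)
    return counts

# hypothetical excerpt in the shape of df['LOC'] (one list per text)
sample_locs = [['СССР', 'СССР', 'Узбекистана'], ['СССР', 'Сухуми', 'Киев']]
loc_counts = count_locations(sample_locs)
for name, n in loc_counts.most_common(3):
    print(name, n)
```

Note that inflected forms such as `Узбекистана` are counted separately from their base form unless the names are normalized first.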



In [None]:
# ordered set keeps the order of first appearance: https://pypi.org/project/ordered-set/
from ordered_set import OrderedSet

# store all locations in a list (actually a list of lists, one list per text)
locations = df['LOC'].tolist()

# store unique locations in a set (original order is lost)
locs_unique = {loc for lst in locations for loc in lst}

# store unique locations in an ordered set (original order is kept)
locs_ordered = OrderedSet(loc for lst in locations for loc in lst)
print(locs_ordered)
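
If the extra dependency is not wanted, the same order-preserving deduplication can be sketched with the standard library alone, since Python dictionaries keep insertion order. The helper name `unique_in_order` and the sample data are hypothetical.

```python
from itertools import chain

def unique_in_order(lists_of_locations):
    """Flatten the lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(chain.from_iterable(lists_of_locations)))

# hypothetical excerpt in the shape of df['LOC'] (one list per text)
sample_locs = [['СССР', 'Сухуми', 'СССР'], ['Киев', 'Сухуми', 'Ленинград']]
unique_locs = unique_in_order(sample_locs)
print(unique_locs)
```

A plain list also has the advantage that a pickle of it can be read back without the `ordered_set` package being installed.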



## Saving the list of locations

### as CSV

In [26]:
df.to_csv('se_ner-results.csv', index=False)

In [27]:
df_ohne_doc = df.drop(['doc_object'], axis=1)

In [28]:
df_ohne_doc.to_csv('se_ner-results-withoutdocobject.csv', index=False)

### as pickle

In [29]:
df_ohne_doc.to_pickle('se_ner-results.p')

In [46]:
# Save locations to pickle 
with open('se_ner-locations.p', 'wb') as out_1p:
    pickle.dump(locations, out_1p)
with open('se_ner-locations-unique.p', 'wb') as out_2p:
    pickle.dump(locs_ordered, out_2p)


### as plain text

In [51]:
# Save locations to txt files (written as UTF-8, since the names are Cyrillic)
with open('se_ner-locations.txt', 'w', encoding='utf-8') as out_1:
    for lst in locations:
        for element in lst:
            out_1.write(element + '\n')

with open('se_ner-locations-unique.txt', 'w', encoding='utf-8') as out_2:
    for element in locs_ordered:
        out_2.write(element + '\n')
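
## Matching against a list of historical place names

The comparison with a list of historical place names of the Tajik SSR, mentioned in the introduction, is not part of the cells above. A minimal sketch of how such a comparison might work is given below. It assumes exact matching after whitespace and case normalization; the function names and both sample lists are hypothetical and do not reproduce the actual gazetteer.

```python
import re

def normalize(name):
    """Collapse whitespace (e.g. OCR line breaks) and ignore case."""
    return re.sub(r'\s+', ' ', name).strip().casefold()

def match_gazetteer(locations, gazetteer):
    """Return locations found in the gazetteer, deduplicated, in first-seen order."""
    known = {normalize(entry) for entry in gazetteer}
    matches = {}
    for loc in locations:
        key = normalize(loc)
        if key in known:
            matches.setdefault(key, loc)  # keep the first spelling encountered
    return list(matches.values())

# hypothetical sample data, not the actual lists used for the project
gazetteer_sample = ['Душанбе', 'Сталинабад', 'Ленинабад']
extracted_sample = ['СССР', 'Сталинабад', 'Киев', 'ДУШАНБЕ', 'Сталинабад']
tajik_locations = match_gazetteer(extracted_sample, gazetteer_sample)
print(tajik_locations)
```

Because Russian place names are inflected (the output above contains genitive forms such as `Узбекистана`), exact matching will miss many mentions; comparing lemmatized forms would probably be needed for usable results.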
