# Named Entity Recognition for Danish

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Danish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

---

## Dataset

The example text for Danish is *Evangelines Genvordigheder: Til Kvinder med rødt Haar* by Elinor Glyn [from Project Gutenberg](http://www.gutenberg.org/ebooks/33632).

**Here's a preview of spaCy's NER tagging of *Evangelines Genvordigheder: Til Kvinder med rødt Haar*.**

If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Danish NER is much worse at recognizing entities, and is especially bad at distinguishing different kinds of entities, like ORG vs. LOC vs. PER. You need a lot of examples to train a model to distinguish different entity types; currently, English is the only model that does a decent job of it.

You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [3]:
displacy.render(document, style="ent")

---

## NER with spaCy
If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

### Install spaCy

In [None]:
!pip install -U spacy

### Import Libraries

We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

We're also going to import the `Counter` class from the `collections` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also raising the pandas default display settings for the maximum number of rows and the maximum column width).
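
As a quick illustration of what `Counter` does, here is a minimal sketch. The `sample_names` list is a hypothetical stand-in for the entity strings we extract later in this lesson; it is not real output.

```python
from collections import Counter

# Hypothetical sample list, standing in for entity strings extracted later on
sample_names = ["Fru Carruthers", "Papa", "Fru Carruthers", "Mama", "Fru Carruthers", "Papa"]

# Counter tallies how many times each distinct item occurs
sample_tally = Counter(sample_names)

# most_common(2) returns the two most frequent (item, count) pairs
print(sample_tally.most_common(2))
# → [('Fru Carruthers', 3), ('Papa', 2)]
```

Calling `most_common()` with no argument returns every item, ordered from most to least frequent, which is the form used for the tallies further down.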

### Download Language Model

Next we need to download the Danish-language model (`da_core_news_md`), which will be processing and making predictions about our texts. You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

In [None]:
!python -m spacy download da_core_news_md

### Load Language Model

Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

**1.** We can import the model as a module and then load it from the module.

In [3]:
import da_core_news_md
nlp = da_core_news_md.load()

**2.** We can load the model by name.

In [4]:
#nlp = spacy.load('da_core_news_md')

If you just downloaded the model for the first time, it's advisable to use Option 1, because you can then use the model immediately. Otherwise, you'll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).

## Process Document

We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.

After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In the cell below, we open and read the example document. Then we run `nlp()` on the text and create our document.

In [5]:
filepath = '../texts/da.txt'
# Read the whole file; the with-statement closes it automatically
with open(filepath, encoding='utf-8') as file:
    text = file.read()
document = nlp(text)

## Get Named Entities

All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

In [6]:
document.ents

(﻿EVANGELINES,
 ELINOR GLYN
 
  ,
 KØBENHAVN,
 Eventyrerske,
 Hatte paa,
 Eventyrerske,
 Fru Carruthers,
 Godset,
 Fru Carruthers,
 Fru Carruthers,
 Papa,
 Mama --,
 Familie,
 Mama,
 Lyst,
 Mamas Fader,
 Lord,
 Moder,
 Mama,
 Papa,
 Officer,
 Indien,
 Eventyrerske,
 Fru Carruthers,
 Tante,
 Carruthers,
 Pjank,
 Godset,
 Diplomat,
 Paris,
 Rusland,
 England,
 Herre,
 Eventyrerske,
 Lyst,
 Tusind om Aaret,
 Liv,
 Hr. Carruthers,
 Tante,
 Sort,
 Brystet paa,
 Fru Carruthers Død,
 Doktor Garrison,
 Selskabslivet,
 Fru Carruthers,
 Sæsonen;,
 Schweiz,
 Efteraaret,
 London,
 Hjertetilfælde,
 Thomas,
 Familievase,
 Forbavselsen,
 Pund,
 Eventyrerske,
 Hr. Carruthers,
 Gudernes Skød,
 Mænd,
 Mænd,
 Bridge,
 Carruthers,
 Ex-Ambassadører,
 Korridoren,
 Selskabeligheden,
 London,
 Tennisbold,
 Piger,
 Gud,
 Paris,
 Hr. Carruthers,
 London,
 Koner,
 Carruthers,
 Ægteskabet,
 Hr. Carruthers,
 Christopher,
 Christopher,
 Christopher,
 Aarevis,
 Mænd,
 Metal,
 Gud,
 Fru Carruthers,
 Piger,
 Begyndels

Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through the `document.ents` with a simple `for` loop.

For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

In [7]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

﻿EVANGELINES MISC
ELINOR GLYN

  ORG
KØBENHAVN LOC
Eventyrerske MISC
Hatte paa PER
Eventyrerske LOC
Fru Carruthers PER
Godset PER
Fru Carruthers PER
Fru Carruthers PER
Papa PER
Mama -- MISC
Familie PER
Mama PER
Lyst ORG
Mamas Fader PER
Lord PER
Moder PER
Mama PER
Papa PER
Officer MISC
Indien LOC
Eventyrerske MISC
Fru Carruthers PER
Tante ORG
Carruthers PER
Pjank ORG
Godset PER
Diplomat PER
Paris LOC
Rusland LOC
England LOC
Herre PER
Eventyrerske ORG
Lyst PER
Tusind om Aaret MISC
Liv PER
Hr. Carruthers PER
Tante PER
Sort ORG
Brystet paa MISC
Fru Carruthers Død PER
Doktor Garrison PER
Selskabslivet LOC
Fru Carruthers PER
Sæsonen; MISC
Schweiz LOC
Efteraaret PER
London LOC
Hjertetilfælde PER
Thomas PER
Familievase PER
Forbavselsen PER
Pund ORG
Eventyrerske MISC
Hr. Carruthers PER
Gudernes Skød MISC
Mænd ORG
Mænd ORG
Bridge ORG
Carruthers PER
Ex-Ambassadører MISC
Korridoren LOC
Selskabeligheden ORG
London LOC
Tennisbold MISC
Piger ORG
Gud PER
Paris LOC
Hr. Carruthers PER
London LOC
Koner M

To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

In [8]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

Hatte paa
Fru Carruthers
Godset
Fru Carruthers
Fru Carruthers
Papa
Familie
Mama
Mamas Fader
Lord
Moder
Mama
Papa
Fru Carruthers
Carruthers
Godset
Diplomat
Herre
Lyst
Liv
Hr. Carruthers
Tante
Fru Carruthers Død
Doktor Garrison
Fru Carruthers
Efteraaret
Hjertetilfælde
Thomas
Familievase
Forbavselsen
Hr. Carruthers
Carruthers
Gud
Hr. Carruthers
Carruthers
Hr. Carruthers
Christopher
Christopher
Christopher
Gud
Fru Carruthers
Cicely Parkers
Hr. Carruthers
Hr. Carruthers
Hr. Barton
Fru Carruthers
Véronique
Carruthers
Carruthers
Jomfru
Øren
Hr. Carruthers
Sten
Hage
Hans Væsen
Hr. Barton
Hr. Barton
Hr. Carruthers
Hr. Barton
Hr. Carruthers
Tante
Carruthers
Øvelsen
Fru Carruthers
Fru Carruthers
Christopher
Liv
Hr. Barton
Tante
Hustru
Piges
Hr. Barton
Sindsbevægelser
Carruthers
Selvagtelse
Milady
Hr. Carruthers
Hr.
Carruthers
Herrens Tider
Tante
Faders
Fru Carruthers
Lampelyset
Øjne
Ulvs
Hr. Barton
Skænd
Claridges
Frøken Tomkins
Pige
Hr. Barton
Herre
Herre
Hr. Carruthers
Hr. Carruthers
Liv
Hr. Ca
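
The two loops above print entities one at a time. If you would rather collect them under their labels in a single pass, one possible approach is sketched below. The `sample_entities` list holds a few (text, label) pairs copied from the output above, and `group_by_label` is a hypothetical helper, not part of spaCy. With a real document you could build the pairs from `named_entity.text` and `named_entity.label_`.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above; with a real spaCy
# document these would come from named_entity.text and named_entity.label_
sample_entities = [
    ("KØBENHAVN", "LOC"),
    ("Fru Carruthers", "PER"),
    ("Godset", "PER"),
    ("Indien", "LOC"),
    ("Pjank", "ORG"),
    ("Fru Carruthers", "PER"),
]

def group_by_label(entities):
    """Collect entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

grouped_entities = group_by_label(sample_entities)
print(grouped_entities["PER"])
# → ['Fru Carruthers', 'Godset', 'Fru Carruthers']
```

Grouping this way makes the mislabeled items (such as `Godset` tagged as a person) easy to scan label by label.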

## NER with Long Texts or Many Texts

In [14]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [15]:
chunked_documents = list(nlp.pipe(text_chunks))
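
To see what the chunking step above does, here is a minimal sketch that wraps the same logic in a function and runs it on a short string. The name `chunk_text` and the sample sentence are hypothetical, chosen only for illustration, and the `max(1, ...)` guard for empty input is an addition that the cell above does not have.

```python
import math

def chunk_text(text, number_of_chunks):
    """Split text into at most number_of_chunks pieces of roughly equal length."""
    # max(1, ...) avoids a zero step size when the text is empty
    chunk_size = max(1, math.ceil(len(text) / number_of_chunks))
    return [text[start:start + chunk_size] for start in range(0, len(text), chunk_size)]

# Hypothetical short string, used only to make the behaviour easy to inspect
sample_text = "Fru Carruthers rejste til London."
sample_chunks = chunk_text(sample_text, 4)

print(sample_chunks)
# Joining the chunks back together recovers the original text exactly
print("".join(sample_chunks) == sample_text)
```

Because the cuts fall at fixed character positions, a chunk boundary can land in the middle of a word or name (here both "Carruthers" and "London" are split), so an entity that straddles a boundary may be missed or mislabeled.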

## Get People

To extract and count the people, we will use an `if` statement that will pull out entities only if their label matches "PER."

In [16]:
people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df

| | character | count |
|---|---|---|
| 0 | Robert | 120 |
| 1 | Lord Robert | 86 |
| 2 | Hr. Carruthers | 78 |
| 3 | Lady Ver | 69 |
| 4 | Lady Katherine | 46 |
| 5 | Fru Carruthers | 45 |
| 6 | Christopher | 45 |
| 7 | Lady Merrenden | 39 |
| 8 | Carruthers | 33 |
| 9 | Malcolm | 33 |


## Get Places

To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "LOC."

In [17]:
places = []
for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df

| | place | count |
|---|---|---|
| 0 | London | 22 |
| 1 | Paris | 21 |
| 2 | Vestibulen | 13 |
| 3 | Parken | 10 |
| 4 | Trappen | 8 |
| 5 | Middag | 8 |
| 6 | Vejen | 8 |
| 7 | Teatret | 7 |
| 8 | England | 6 |
| 9 | Huset | 6 |


## Get NER in Context

In [18]:
from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=None):
    # Default to people, organizations, and locations when no labels are given
    if not desired_ner_labels:
        desired_ner_labels = ['PER', 'ORG', 'LOC']

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Re-run the pipeline on the text of each sentence by itself
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword is in the entity text (ignoring capitalization)
            # and whether the entity has one of the desired labels
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Replace line breaks with spaces
                sentence_text = re.sub('\n', ' ', sentence.text)
                # Bold the entity, again ignoring capitalization; re.escape keeps characters
                # like "." in the entity text from being treated as regex syntax
                sentence_text = re.sub(re.escape(named_entity.text), f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)

                display(Markdown('---'))
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))
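
The highlighting step inside `get_ner_in_context` can also be looked at in isolation. The sketch below is one way to write it as a standalone helper. The name `bold_match` and the sample sentence are hypothetical. It assumes we want literal, case-insensitive matching, so it passes the phrase through `re.escape`, and it uses a function as the replacement so that the sentence keeps its own capitalization.

```python
import re

def bold_match(phrase, sentence_text):
    """Wrap every case-insensitive occurrence of phrase in Markdown bold markers."""
    # re.escape keeps characters such as "." from being read as regex syntax
    pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
    # A function replacement reuses the matched text, preserving its capitalization
    return pattern.sub(lambda match: f"**{match.group(0)}**", sentence_text)

# Hypothetical sample sentence, for illustration only
print(bold_match("Hr. Barton", "Saa kom Hr. Barton ind."))
# → Saa kom **Hr. Barton** ind.
```

Escaping matters for entities like `Hr. Carruthers`, where an unescaped `.` would match any character rather than a literal period.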

In [21]:
for document in chunked_documents:
    get_ner_in_context('Vejen', document)

---

**LOC**

  Jeg forsikrede ham om, at jeg ikke var det mindste sammensat, og at jeg kun vilde have, at alting skulde være ganske lige ud ad **Landevejen**, og at jeg gerne vilde være i Fred og ikke behøve at blive gift eller plages med at adlyde Folk.

---

**LOC**

  Hr. Barton ventede taalmodigt paa os i den hvide Dagligstue, og vi havde ikke siddet og spist Smaakager i mere end fem Minutter, da Lyden af en Vogn paa **Vejen udenfor** Vinduerne forstyrrede vor kunstige Konversation.

---

**LOC**

  Han satte sig paa Kanten af et Bord, der allerede i **Forvejen** var belæsset med Bøger; de fleste af dem væltede og faldt med et Brag ned paa Gulvet.

---

**LOC**

  Han fortalte mig ikke, hvad der var i **Vejen** med ham, men Jean sagde noget om det, da hun kom ind i mit Værelse, medens jeg tog Tøjet paa.

---

**LOC**

  Jeg drillede ham paa hele **Vejen** hjem, indtil han, da vi gik ind til Lunch, ikke vidste, om han stod paa Hovedet eller paa Benene!

---

**ORG**

  Lady Katherine og Fru Mackintosh gik med op i mit Værelse, da vi var **paa Vejen** op i Seng.

---

**LOC**

  "Han er et løjerligt Menneske," sagde Lord Robert, "og jeg er glad over, at De ikke har set ham -- jeg vil ikke have, at han kommer i **Vejen** for mig.

---

**ORG**

Lady Katherine kom og vimsede omkring og samlede dem allesammen sammen og mere eller mindre drev dem af Sted, og **paa Vejen** op ad Trappen sagde hun til mig, at jeg behøvede ikke at komme ned, hvis jeg hellere vilde lade være!

---

**LOC**

Men det er øjensynligt det, der er i **Vejen** med mig.

---

**LOC**

"Hvad i Alverden er der i **Vejen** med Robert?" sagde hun.

---

**LOC**

  Lady Ver sagde ikke et Ord paa **Vejen** hjem, og hun kyssede mig køligt, da hun gik ind i sit Værelse -- saa raabte hun ud:  "Jeg er træt, Slangepige -- tro ikke, at jeg er gnaven -- Godnat!" og jeg gik op i Seng.

---

**ORG**

  Hun var saa venlig imod mig **paa Vejen** tilbage; hun sagde, at hun var meget ked af at lade mig ene tilbage i Morgen, og at jeg nu maatte bestemme, hvad jeg vilde gøre, ellers vilde hun slet ikke rejse.

---

**LOC**

"Jeg -- hvad, aa, hvad er der i **Vejen**?"

---

**LOC**

Og han rejste sig og tog mig under Armen, han bød mig ikke sin, som i Bøgerne, og trak mig med sig ned ad **Vejen**.

---

**ORG**

  Saa tog vi af Sted, og vi satte Robert af **paa Vejen**, ved Vavasour House.

---

**LOC**

  "Jeg bryder mig ikke meget om knoppende Genier," fortsatte hun, "jeg foretrækker at vente, indtil de er blevet til noget -- lige meget hvorledes deres Oprindelse er -- saa har de **paa Vejen** opad erhvervet sig en vis ydre Opdragelse, og de støder ikke én saa meget.