# MSV / SS 2023 - Exercise 2

## 2.1 First Steps with spaCy

First, we need to import spaCy:

In [1]:
import spacy

spaCy provides pretrained models of various sizes for English and German, trained on different text data:

- English: en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf (see also https://spacy.io/models/en )
- German: de_core_news_sm, de_core_news_md, de_core_news_lg, de_dep_news_trf (see also https://spacy.io/models/de )

Before continuing, select New ‣ Terminal and download the language models:
```
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download de_core_news_sm
```

These are pretrained statistical models. Depending on the model, the text analyses differ in accuracy and speed.

We load the smallest models, plus the medium-sized English model:

In [2]:
nlp_en = spacy.load("en_core_web_sm") 
nlp_de = spacy.load("de_core_news_sm")
nlp_en_bigger = spacy.load("en_core_web_md")

``spacy.load()`` creates an instance of the ``Language`` class.

<img src="https://spacy.io/images/architecture.svg" alt="SpaCy architecture" width="350" />

Each instance of the ``Language`` class contains a language-specific processing pipeline.

<img src="https://spacy.io/images/pipeline.svg" alt="SpaCy architecture" width="500" />

When a text is processed with the nlp object, spaCy creates a ``Doc`` object (a Python sequence of tokens).

In [3]:
beispiel_1 = "Das Mädchen sah den Jungen mit dem Fernglas."
doc = nlp_de(beispiel_1)

In [4]:
for token in doc:
    print(token.text) # attribute: text

Das
Mädchen
sah
den
Jungen
mit
dem
Fernglas
.
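
Because a ``Doc`` behaves like a Python sequence, it supports ``len()``, indexing, and slicing. The following is a minimal sketch; it assumes a blank German pipeline (``spacy.blank("de")``, tokenizer only) so that no model download is needed, and the names ``nlp_blank_de`` and ``doc_seq`` are illustrative only.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components (assumption for this sketch)
nlp_blank_de = spacy.blank("de")
doc_seq = nlp_blank_de("Das Mädchen sah den Jungen mit dem Fernglas.")

n_tokens = len(doc_seq)   # number of tokens in the Doc
second = doc_seq[1]       # indexing returns a Token
span = doc_seq[4:8]       # slicing returns a Span

print(n_tokens)
print(second.text)
print(span.text)
```

A ``Span`` keeps a reference to its parent ``Doc``, so token attributes remain available on the slice.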


### Visualizing the pipeline output

In [5]:
import pandas as pd
pd.DataFrame({"Token": [token.text for token in doc],
              "Lemma": [token.lemma_ for token in doc],
              "POS": [token.pos_ for token in doc],
              "Tag": [token.tag_ for token in doc],
              "Morph": [list(token.morph) for token in doc],
              "Dep": [token.dep_ for token in doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,Das,der,DET,ART,"[Case=Nom, Definite=Def, Gender=Neut, Number=S...",nk
1,Mädchen,Mädchen,NOUN,NN,"[Case=Nom, Gender=Neut, Number=Sing]",sb
2,sah,sehen,VERB,VVFIN,"[Mood=Ind, Number=Sing, Person=3, Tense=Past, ...",ROOT
3,den,der,DET,ART,"[Case=Acc, Definite=Def, Gender=Masc, Number=S...",nk
4,Jungen,Junge,NOUN,NN,"[Case=Acc, Gender=Masc, Number=Sing]",oa
5,mit,mit,ADP,APPR,[],mo
6,dem,der,DET,ART,"[Case=Dat, Definite=Def, Gender=Masc, Number=S...",nk
7,Fernglas,Fernglas,NOUN,NN,"[Case=Dat, Gender=Masc, Number=Sing]",nk
8,.,--,PUNCT,$.,[],punct


### Visualization with displaCy

Dependency relations between words are typically drawn as directed, labeled edges.

In [6]:
from spacy import displacy
#displacy.render(doc, style="dep")
displacy.render(doc, style="dep", options = {"compact": True})

The dependency labels are defined in the documentation of the language model: https://spacy.io/models/de
Alternatively, the ``spacy.explain()`` function can be used:

In [7]:
print(spacy.explain("nk"))
print(spacy.explain("sb"))
print(spacy.explain("oa"))
print(spacy.explain("mo"))

noun kernel element
subject
accusative object
modifier


## 2.2 Tokenization and Lemmatization with spaCy

In [8]:
my_lyrics = "I can't get no satisfaction, 'Cause I try, and I try, and I try, and I try"
doc = nlp_en(my_lyrics)

### Tokenization

In [9]:
# print tokens
tokens = [token.text
         for token in doc]
print(tokens) 
len(tokens)

['I', 'ca', "n't", 'get', 'no', 'satisfaction', ',', "'Cause", 'I', 'try', ',', 'and', 'I', 'try', ',', 'and', 'I', 'try', ',', 'and', 'I', 'try']


22

"can't" was split into two tokens: spaCy recognizes both the root verb ("ca") and the negation ("n't").

#### Without punctuation

In [10]:
tokens = [token.text
         for token in doc
         if not token.is_punct]
print(tokens)
len(tokens)

['I', 'ca', "n't", 'get', 'no', 'satisfaction', "'Cause", 'I', 'try', 'and', 'I', 'try', 'and', 'I', 'try', 'and', 'I', 'try']


18

#### Troubleshooting

In [11]:
new_string = "I've tried a thousand times! Lemme see."
doc = nlp_en(new_string)

In [12]:
pd.DataFrame({"Token": [token.text for token in doc],
              "Lemma": [token.lemma_ for token in doc],
              "POS": [token.pos_ for token in doc],
              "Tag": [token.tag_ for token in doc],
              "Morph": [list(token.morph) for token in doc],
              "Dep": [token.dep_ for token in doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,I,I,PRON,PRP,"[Case=Nom, Number=Sing, Person=1, PronType=Prs]",nsubj
1,'ve,'ve,AUX,VBP,"[Mood=Ind, Tense=Pres, VerbForm=Fin]",aux
2,tried,try,VERB,VBN,"[Aspect=Perf, Tense=Past, VerbForm=Part]",ROOT
3,a,a,DET,DT,"[Definite=Ind, PronType=Art]",quantmod
4,thousand,thousand,NUM,CD,[NumType=Card],nummod
5,times,time,NOUN,NNS,[Number=Plur],npadvmod
6,!,!,PUNCT,.,[PunctType=Peri],punct
7,Lemme,Lemme,PROPN,NNP,[Number=Sing],nsubj
8,see,see,VERB,VBP,"[Tense=Pres, VerbForm=Fin]",ROOT
9,.,.,PUNCT,.,[PunctType=Peri],punct


#### Explaining the tokenizer rules

In [13]:
tok_exp = nlp_en.tokenizer.explain(new_string)
for t in tok_exp:
    print(t[1], "\t", t[0])

I 	 SPECIAL-1
've 	 SPECIAL-2
tried 	 TOKEN
a 	 TOKEN
thousand 	 TOKEN
times 	 TOKEN
! 	 SUFFIX
Lemme 	 TOKEN
see 	 TOKEN
. 	 SUFFIX



<img src="https://spacy.io/images/tokenization.svg" alt="SpaCy architecture" width="350" />


#### A new rule for the tokenizer

(<i>Lemme</i> as <i>Lem</i> <i>me</i>)

In [14]:
from spacy.symbols import ORTH
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp_en.tokenizer.add_special_case("lemme", special_case)
special_case = [{ORTH: "Lem"}, {ORTH: "me"}]
nlp_en.tokenizer.add_special_case("Lemme", special_case)

In [15]:
tok_exp = nlp_en.tokenizer.explain(new_string)
for t in tok_exp:
    print(t[1], "\t", t[0])

I 	 SPECIAL-1
've 	 SPECIAL-2
tried 	 TOKEN
a 	 TOKEN
thousand 	 TOKEN
times 	 TOKEN
! 	 SUFFIX
Lem 	 SPECIAL-1
me 	 SPECIAL-2
see 	 TOKEN
. 	 SUFFIX
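
The same mechanism can be tried on a fresh pipeline without modifying ``nlp_en``. This is a minimal sketch that assumes a blank English pipeline and the example string "gimme that"; the variable names are illustrative. Special cases match the exact string, which is presumably why both "lemme" and "Lemme" were registered above.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank_en = spacy.blank("en")  # tokenizer only (assumption for this sketch)

before = [t.text for t in nlp_blank_en("gimme that")]

# The ORTH values of a special case must concatenate to the original string
nlp_blank_en.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [t.text for t in nlp_blank_en("gimme that")]

print(before)
print(after)
```

Only texts processed after the rule was added are affected; existing ``Doc`` objects keep their old tokenization.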


### Lemmatization

In [18]:
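# Note: `doc` still holds the analysis from before the special cases were added
# (it was not re-created with nlp_en(new_string)), so "Lemme" is one token here.
tokens = [token.text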
         for token in doc
         if not token.is_punct]

lemmata = [token.lemma_ 
           for token in doc
           if not token.is_punct]

print(tokens[0:17])
print(lemmata[0:17])

['I', "'ve", 'tried', 'a', 'thousand', 'times', 'Lemme', 'see']
['I', "'ve", 'try', 'a', 'thousand', 'time', 'Lemme', 'see']


### Lemmatizer: lookup only?

"Bitte" can be an adverb (<i>Bitte rufen Sie mich an</i>, "Please call me") or a noun (<i>Ich hätte eine Bitte an Sie</i>, "I have a request for you"). How does the lemmatizer deal with this?

In [19]:
bitte_adv = "Bitte rufen Sie mich an"  # "Bitte" used as an adverb
bitte_adv_doc = nlp_de(bitte_adv)

pd.DataFrame({"Token": [token.text for token in bitte_adv_doc],
              "Lemma": [token.lemma_ for token in bitte_adv_doc],
              "POS": [token.pos_ for token in bitte_adv_doc],
              "Tag": [token.tag_ for token in bitte_adv_doc],
              "Morph": [list(token.morph) for token in bitte_adv_doc],
              "Dep": [token.dep_ for token in bitte_adv_doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,Bitte,bitte,ADV,ADV,[],mo
1,rufen,rufen,VERB,VVFIN,"[Mood=Ind, Number=Plur, Person=3, Tense=Pres, ...",ROOT
2,Sie,sie,PRON,PPER,"[Case=Nom, Number=Plur, Person=3, PronType=Prs]",sb
3,mich,mich,PRON,PPER,"[Case=Acc, Number=Sing, Person=1, PronType=Prs]",oa
4,an,an,ADP,PTKVZ,[],svp


In [20]:
bitte_noun = "Ich hätte eine Bitte an Sie"  # "Bitte" used as a noun
bitte_noun_doc = nlp_de(bitte_noun)

pd.DataFrame({"Token": [token.text for token in bitte_noun_doc],
              "Lemma": [token.lemma_ for token in bitte_noun_doc],
              "POS": [token.pos_ for token in bitte_noun_doc],
              "Tag": [token.tag_ for token in bitte_noun_doc],
              "Morph": [list(token.morph) for token in bitte_noun_doc],
              "Dep": [token.dep_ for token in bitte_noun_doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,Ich,ich,PRON,PPER,"[Case=Nom, Number=Sing, Person=1, PronType=Prs]",sb
1,hätte,haben,AUX,VAFIN,"[Mood=Sub, Number=Sing, Person=1, Tense=Past, ...",ROOT
2,eine,ein,DET,ART,"[Case=Acc, Definite=Ind, Gender=Fem, Number=Si...",nk
3,Bitte,Bitte,NOUN,NN,"[Case=Acc, Gender=Fem, Number=Sing]",oa
4,an,an,ADP,APPR,[],mnr
5,Sie,sie,PRON,PPER,"[Case=Acc, Number=Sing, Person=3, PronType=Prs]",nk


#### A new rule for the lemmatizer

In [21]:
contractions = "They're there right now. I've been there, it's great!"
contractions_doc = nlp_en(contractions)

pd.DataFrame({"Token": [token.text for token in contractions_doc],
              "Lemma": [token.lemma_ for token in contractions_doc],
              "POS": [token.pos_ for token in contractions_doc],
              "Tag": [token.tag_ for token in contractions_doc],
              "Morph": [list(token.morph) for token in contractions_doc],
              "Dep": [token.dep_ for token in contractions_doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,They,they,PRON,PRP,"[Case=Nom, Number=Plur, Person=3, PronType=Prs]",nsubj
1,'re,be,AUX,VBP,"[Mood=Ind, Tense=Pres, VerbForm=Fin]",ROOT
2,there,there,ADV,RB,[PronType=Dem],advmod
3,right,right,ADV,RB,[],advmod
4,now,now,ADV,RB,[],advmod
5,.,.,PUNCT,.,[PunctType=Peri],punct
6,I,I,PRON,PRP,"[Case=Nom, Number=Sing, Person=1, PronType=Prs]",nsubj
7,'ve,'ve,AUX,VBP,"[Mood=Ind, Tense=Pres, VerbForm=Fin]",aux
8,been,be,AUX,VBN,"[Tense=Past, VerbForm=Part]",ccomp
9,there,there,ADV,RB,[PronType=Dem],advmod


In [22]:
nlp_en.get_pipe("attribute_ruler").add([[{"TEXT": "'ve"}]], {"LEMMA": "have"})

In [23]:
contractions_doc = nlp_en(contractions)
pd.DataFrame({"Token": [token.text for token in contractions_doc],
              "Lemma": [token.lemma_ for token in contractions_doc],
              "POS": [token.pos_ for token in contractions_doc],
              "Tag": [token.tag_ for token in contractions_doc],
              "Morph": [list(token.morph) for token in contractions_doc],
              "Dep": [token.dep_ for token in contractions_doc]})

Unnamed: 0,Token,Lemma,POS,Tag,Morph,Dep
0,They,they,PRON,PRP,"[Case=Nom, Number=Plur, Person=3, PronType=Prs]",nsubj
1,'re,be,AUX,VBP,"[Mood=Ind, Tense=Pres, VerbForm=Fin]",ROOT
2,there,there,ADV,RB,[PronType=Dem],advmod
3,right,right,ADV,RB,[],advmod
4,now,now,ADV,RB,[],advmod
5,.,.,PUNCT,.,[PunctType=Peri],punct
6,I,I,PRON,PRP,"[Case=Nom, Number=Sing, Person=1, PronType=Prs]",nsubj
7,'ve,have,AUX,VBP,"[Mood=Ind, Tense=Pres, VerbForm=Fin]",aux
8,been,be,AUX,VBN,"[Tense=Past, VerbForm=Part]",ccomp
9,there,there,ADV,RB,[PronType=Dem],advmod


## 2.3 Corpora

To build data-driven representations of the meaning of a word, we need a <b>corpus</b>.<br/>
<br/>
A text corpus is a collection of texts in a particular language that is considered "representative" of that language in the statistical sense.
<br/>
We therefore import the ``nltk`` module and the Brown corpus.

In [24]:
import nltk
nltk.download("brown")
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     /Users/alessandra/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [25]:
from nltk.util import ngrams
from collections import Counter, defaultdict

#### Size of the Brown Corpus

In [26]:
brown_size = len(brown.words())
brown_size

1161192

#### Vocabulary size of the Brown Corpus

In [27]:
brown_voc_size = len(set(brown.words()))
brown_voc_size

56057

## 2.4 Maximum likelihood estimation from relative frequencies in a corpus

P(the|is in charge of) = C(is in charge of the) / C(is in charge of)

In [28]:
fourgrams = Counter(list(ngrams(brown.words(),4)))
fivegrams = Counter(list(ngrams(brown.words(),5)))

print("Frequency of 'is in charge of': " + str(fourgrams[('is', 'in', 'charge', 'of')]))
print("Frequency of 'is in charge of the': " + str(fivegrams[('is', 'in', 'charge', 'of', 'the')]))

Frequency of 'is in charge of': 2
Frequency of 'is in charge of the': 0
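
With a count of 0 in the numerator, the maximum likelihood estimate for this continuation is 0, even though the phrase is perfectly plausible. The estimate itself can be sketched in plain Python on a made-up toy corpus; the helper names ``ngram_counts`` and ``mle_prob`` are hypothetical, and returning 0.0 for an unseen history is a convention chosen for this sketch.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def mle_prob(tokens, history, word):
    """P(word | history) = C(history + word) / C(history)."""
    history = tuple(history)
    hist_count = ngram_counts(tokens, len(history))[history]
    if hist_count == 0:
        return 0.0  # the MLE is undefined for an unseen history; 0.0 is a choice made here
    full_count = ngram_counts(tokens, len(history) + 1)[history + (word,)]
    return full_count / hist_count

toy = "she is in charge of the lab and he is in charge of sales".split()
p_the = mle_prob(toy, ["in", "charge", "of"], "the")
p_budget = mle_prob(toy, ["in", "charge", "of"], "budget")
print(p_the, p_budget)
```

In the toy corpus "in charge of" occurs twice and is followed by "the" once, so the first estimate is 1/2; "budget" never follows it, so the second estimate is 0.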


#### Markov assumption

P(the|in charge of) = C(in charge of the) / C(in charge of)
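
In general form, an N-gram model approximates the full history by the last N-1 words and estimates the resulting probability from counts. The notation below is a sketch: w_i denotes the i-th word and C(·) a corpus count, as in the formulas above.

```latex
P(w_n \mid w_1 \dots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \dots w_{n-1})
= \frac{C(w_{n-N+1} \dots w_{n-1}\, w_n)}{C(w_{n-N+1} \dots w_{n-1})}
```

The example above uses N = 4: the history "is in charge of" is shortened to "in charge of".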

In [29]:
trigrams = Counter(list(ngrams(brown.words(),3)))

print("Frequency of 'in charge of the': " + str(fourgrams[('in', 'charge', 'of', 'the')]))
print("Frequency of 'in charge of': " + str(trigrams[('in', 'charge', 'of')]))
print("P of 'the'|'in charge of': " + str(fourgrams[('in', 'charge', 'of', 'the')]/trigrams[('in', 'charge', 'of')]))

Frequency of 'in charge of the': 8
Frequency of 'in charge of': 16
P of 'the'|'in charge of': 0.5


In [31]:
bigrams = Counter(list(ngrams(brown.words(),2)))

print("Frequency of 'charge of': " + str(bigrams[('charge', 'of')]))
print("Frequency of 'charge of the': " + str(trigrams[('charge', 'of', 'the')]))
print("P of 'the'|'charge of': " + str(trigrams[('charge', 'of', 'the')]/bigrams[('charge', 'of')]))

Frequency of 'charge of': 29
Frequency of 'charge of the': 12
P of 'the'|'charge of': 0.41379310344827586


## 2.5 N-Grams

In [38]:
import re

#### Zerograms

In [39]:
P_Chinese = 1/brown_voc_size
print("P('Chinese')  = " + str(P_Chinese))

P('Chinese')  = 1.7838985318515083e-05


approx. 0.000018

#### Unigrams


In [34]:
from nltk import FreqDist
brown_FD = FreqDist(brown.words())
brown_FD

FreqDist({'the': 62713, ',': 58334, '.': 49346, 'of': 36080, 'and': 27915, 'to': 25732, 'a': 21881, 'in': 19536, 'that': 10237, 'is': 10011, ...})

In [35]:
F_Chinese = brown_FD['Chinese']
F_Chinese

56

In [36]:
P_Chinese = F_Chinese/brown_size
print("P('Chinese')  = " + str(P_Chinese))

P('Chinese')  = 4.822630538274463e-05


approx. 0.000048

#### Bigrams

In [43]:
from nltk.util import bigrams
#list(bigrams(brown.words()))
my_lyrics = "<s> I want to eat Chinese food </s>"
my_lyrics_tokenized = re.split(" ",my_lyrics)
list(bigrams(my_lyrics_tokenized))

[('<s>', 'I'),
 ('I', 'want'),
 ('want', 'to'),
 ('to', 'eat'),
 ('eat', 'Chinese'),
 ('Chinese', 'food'),
 ('food', '</s>')]
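
The sentence boundary markers can also be added programmatically instead of being typed into the string. A minimal sketch in plain Python, with the hypothetical helper name ``padded_bigrams``:

```python
def padded_bigrams(sentence):
    """Split on whitespace, add <s> and </s>, and return adjacent word pairs."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

pairs = padded_bigrams("I want to eat Chinese food")
print(pairs)
```

The markers let the model assign probabilities to a word starting or ending a sentence, e.g. P(I|\<s\>) and P(\</s\>|food).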

## 2.6 Simple language modeling with n-grams: Berkeley Restaurant Project data (Jurafsky & Martin, Chapter 3)

In [44]:
import nltk
import pandas as pd
import numpy as np

brp_unigrams = {'I': 2533,
                'want': 927,
                'to': 2417,
                'eat' : 746,
                'chinese' : 158,
                'food' : 1093,
                'lunch': 341,
                'spend': 278}

print(brp_unigrams)

{'I': 2533, 'want': 927, 'to': 2417, 'eat': 746, 'chinese': 158, 'food': 1093, 'lunch': 341, 'spend': 278}


In [45]:
brp_big_fq = [(5, 827, 0, 9, 0, 0, 0, 2),
              (2, 0, 608, 1, 6, 6, 5, 1),
              (2, 0, 4, 686, 2, 0, 6, 211),
              (0, 0, 2, 0, 16, 2, 42, 0),
              (1, 0, 0, 0, 0, 82, 1, 0),
              (15, 0, 15, 0, 1, 4, 0, 0),
              (2, 0, 0, 0, 0, 1, 0, 0),
              (1, 0, 1, 0, 0, 0, 0, 0)]  

brp_bigrams = pd.DataFrame(brp_big_fq, columns = ['I' , 'want', 'to' , 'eat', 'chinese', 'food', 'lunch', 'spend'], index=['I' , 'want', 'to' , 'eat', 'chinese', 'food', 'lunch', 'spend'])

print(brp_bigrams)

          I  want   to  eat  chinese  food  lunch  spend
I         5   827    0    9        0     0      0      2
want      2     0  608    1        6     6      5      1
to        2     0    4  686        2     0      6    211
eat       0     0    2    0       16     2     42      0
chinese   1     0    0    0        0    82      1      0
food     15     0   15    0        1     4      0      0
lunch     2     0    0    0        0     1      0      0
spend     1     0    1    0        0     0      0      0


##### Rows = first word, columns = second word: (I, I), (I, want), (I, to), (I, eat)...
##### P(Chinese food) >> P(food Chinese)

Normalizing the frequencies: each row is divided by the unigram count of its first word

In [46]:
brp_bigrams_norm = brp_bigrams.div(brp_unigrams, axis='index')
brp_bigrams_norm

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,0.001974,0.32649,0.0,0.003553,0.0,0.0,0.0,0.00079
want,0.002157,0.0,0.655879,0.001079,0.006472,0.006472,0.005394,0.001079
to,0.000827,0.0,0.001655,0.283823,0.000827,0.0,0.002482,0.087298
eat,0.0,0.0,0.002681,0.0,0.021448,0.002681,0.0563,0.0
chinese,0.006329,0.0,0.0,0.0,0.0,0.518987,0.006329,0.0
food,0.013724,0.0,0.013724,0.0,0.000915,0.00366,0.0,0.0
lunch,0.005865,0.0,0.0,0.0,0.0,0.002933,0.0,0.0
spend,0.003597,0.0,0.003597,0.0,0.0,0.0,0.0,0.0


#### Some given probabilities

In [47]:
P_s_i = 0.25 # P(i|<s>) = 0.25
P_want_english = 0.0011 #P(english|want) = 0.0011
P_english_food = 0.5 # P(food|english) = 0.5 
P_food_s = 0.68 # P(</s>|food) = 0.68

#### P(\<s\> i want english food \</s\>) vs. P(\<s\> i want chinese food \</s\>)
P(\<s\> i want english food \</s\>) = P(i|\<s\>)\*P(want|i)\*P(english|want)\*P(food|english)\*P(\</s\>|food)

In [48]:
# P(<s> i want english food </s>) = P(i|<s>)*P(want|i)*P(english|want)*P(food|english)*P(</s>|food)
P_eng = P_s_i * brp_bigrams_norm.loc['I','want'] * P_want_english * P_english_food * P_food_s
print("P(<s> i want english food </s>) = " + '{:f}'.format(P_eng)) # no scientific notation

P(<s> i want english food </s>) = 0.000031


P(\<s\> i want chinese food \</s\>) = P(i|\<s\>)\*P(want|i)\*P(chinese|want)\*P(food|chinese)\*P(\</s\>|food)

In [50]:
# P(<s> i want chinese food </s>) = P(i|<s>)*P(want|i)*P(chinese|want)*P(food|chinese)*P(</s>|food)
P_chi = P_s_i * brp_bigrams_norm.loc['I','want'] * brp_bigrams_norm.loc['want','chinese']  * brp_bigrams_norm.loc['chinese','food'] * P_food_s
print("P(<s> i want chinese food </s>) = " + '{:f}'.format(P_chi)) # no scientific notation

P(<s> i want chinese food </s>) = 0.000186
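
Both products follow the same pattern, so they can be written as one function over the padded word sequence. This is a sketch with a hypothetical helper ``sentence_prob`` and a plain dictionary of bigram probabilities; the values are the ones used above (given probabilities plus the normalized counts).

```python
import math

def sentence_prob(words, bigram_prob):
    """Multiply P(w_i | w_{i-1}) over all adjacent pairs of the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    return math.prod(bigram_prob[(w1, w2)] for w1, w2 in zip(padded, padded[1:]))

bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 827 / 2533,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("want", "chinese"): 6 / 927,
    ("chinese", "food"): 82 / 158,
    ("food", "</s>"): 0.68,
}

p_english = sentence_prob("i want english food".split(), bigram_prob)
p_chinese = sentence_prob("i want chinese food".split(), bigram_prob)
print(f"{p_english:f} {p_chinese:f}")
```

A bigram that is missing from the dictionary raises a ``KeyError`` here; how to treat unseen bigrams is a separate modeling decision.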


### Problem: Underflow ###
Small numbers, rounding problems<br/>
→ Compute everything in log space<br/>
<br/>
log P(\<s\> i want english food \</s\>) = log P(i|\<s\>) + log P(want|i) + log P(english|want) + log P(food|english) + log P(\</s\>|food)<br/>
log P(\<s\> i want chinese food \</s\>) = log P(i|\<s\>) + log P(want|i) + log P(chinese|want) + log P(food|chinese) + log P(\</s\>|food)<br/>

In [52]:
logP_eng = np.log(P_s_i) + np.log(brp_bigrams_norm.loc['I','want']) + np.log(P_want_english) + np.log(P_english_food) + np.log(P_food_s)
print("log P(<s> i want english food </s>) = " + str(logP_eng)) 
P_eng = np.exp(logP_eng)
print("P(<s> i want english food </s>) = exp(log P(<s> i want english food </s>)) = " + '{:f}'.format(P_eng)) # no scientific notation

log P(<s> i want english food </s>) = -10.396904076647616
P(<s> i want english food </s>) = exp(log P(<s> i want english food </s>)) = 0.000031


In [53]:
logP_chi = np.log(P_s_i) + np.log(brp_bigrams_norm.loc['I','want']) + np.log(brp_bigrams_norm.loc['want','chinese']) + np.log(brp_bigrams_norm.loc['chinese','food']) + np.log(P_food_s)
print("log P(<s> i want chinese food </s>) = " + str(logP_chi)) 
P_chi = np.exp(logP_chi)
print("P(<s> i want chinese food </s>) = exp(log P(<s> i want chinese food </s>)) = " + '{:f}'.format(P_chi)) # no scientific notation

log P(<s> i want chinese food </s>) = -8.587381679010372
P(<s> i want chinese food </s>) = exp(log P(<s> i want chinese food </s>)) = 0.000186
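
With only five factors the product is still representable, so both computations agree. The difference shows up for long sequences. The sketch below uses 100 made-up factors of 0.00001 to illustrate the effect; the numbers are not taken from the corpus.

```python
import math

probs = [1e-5] * 100  # 100 made-up probabilities

product = 1.0
for p in probs:
    product *= p  # eventually drops below the smallest representable float

log_sum = sum(math.log(p) for p in probs)  # stays a finite negative number

print(product)
print(log_sum)
```

The product underflows to 0.0, while the sum of logs remains usable for comparing sequences.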


### Problem: unseen data

P(\<s\> i want dutch food \</s\>) = P(i|\<s\>)\*P(want|i)\*P(dutch|want)\*P(food|dutch)\*P(\</s\>|food)<br/>
P(food|dutch) = 0<br/>
P(\<s\> i want dutch food \</s\>) = 0<br/>

#### Laplace Smoothing
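
Add-one (Laplace) smoothing adds 1 to every bigram count and, to keep each row a probability distribution, adds the vocabulary size V to the denominator. As a sketch in the notation used above, with c* denoting the adjusted count that the smoothed probability corresponds to:

```latex
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
\qquad
c^{*}(w_{n-1} w_n) = \frac{\bigl(C(w_{n-1} w_n) + 1\bigr)\, C(w_{n-1})}{C(w_{n-1}) + V}
```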


In [54]:
brp_bigrams

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,5,827,0,9,0,0,0,2
want,2,0,608,1,6,6,5,1
to,2,0,4,686,2,0,6,211
eat,0,0,2,0,16,2,42,0
chinese,1,0,0,0,0,82,1,0
food,15,0,15,0,1,4,0,0
lunch,2,0,0,0,0,1,0,0
spend,1,0,1,0,0,0,0,0


In [55]:
brp_bigrams_laplace = brp_bigrams.copy()
brp_bigrams_laplace += 1
brp_bigrams_laplace

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,6,828,1,10,1,1,1,3
want,3,1,609,2,7,7,6,2
to,3,1,5,687,3,1,7,212
eat,1,1,3,1,17,3,43,1
chinese,2,1,1,1,1,83,2,1
food,16,1,16,1,2,5,1,1
lunch,3,1,1,1,1,2,1,1
spend,2,1,2,1,1,1,1,1


In [56]:
brp_unigrams

{'I': 2533,
 'want': 927,
 'to': 2417,
 'eat': 746,
 'chinese': 158,
 'food': 1093,
 'lunch': 341,
 'spend': 278}

In [57]:
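W = 1446  # vocabulary size (number of word types) of the Berkeley Restaurant Project corpus
brp_unigrams_laplace = brp_unigrams.copy()
# Adding 1 to every bigram count increases each row total by W
for k in brp_unigrams_laplace:
    brp_unigrams_laplace[k] += W
brp_unigrams_laplace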

{'I': 3979,
 'want': 2373,
 'to': 3863,
 'eat': 2192,
 'chinese': 1604,
 'food': 2539,
 'lunch': 1787,
 'spend': 1724}

In [58]:
brp_bigrams_norm_laplace = brp_bigrams_laplace.div(brp_unigrams_laplace, axis='index')
brp_bigrams_norm_laplace

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,0.001508,0.208092,0.000251,0.002513,0.000251,0.000251,0.000251,0.000754
want,0.001264,0.000421,0.256637,0.000843,0.00295,0.00295,0.002528,0.000843
to,0.000777,0.000259,0.001294,0.177841,0.000777,0.000259,0.001812,0.05488
eat,0.000456,0.000456,0.001369,0.000456,0.007755,0.001369,0.019617,0.000456
chinese,0.001247,0.000623,0.000623,0.000623,0.000623,0.051746,0.001247,0.000623
food,0.006302,0.000394,0.006302,0.000394,0.000788,0.001969,0.000394,0.000394
lunch,0.001679,0.00056,0.00056,0.00056,0.00056,0.001119,0.00056,0.00056
spend,0.00116,0.00058,0.00116,0.00058,0.00058,0.00058,0.00058,0.00058


In [59]:
# Adjusted counts: smoothed probability times the original unigram count of the first word
brp_bigrams_newcounts_laplace = brp_bigrams_norm_laplace.multiply(brp_unigrams, axis='index')

In [60]:
# Round the adjusted counts to two decimals for display
brp_bigrams_newcounts_laplace = brp_bigrams_newcounts_laplace.round(2)
brp_bigrams_newcounts_laplace

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,3.82,527.1,0.64,6.37,0.64,0.64,0.64,1.91
want,1.17,0.39,237.9,0.78,2.73,2.73,2.34,0.78
to,1.88,0.63,3.13,429.84,1.88,0.63,4.38,132.64
eat,0.34,0.34,1.02,0.34,5.79,1.02,14.63,0.34
chinese,0.2,0.1,0.1,0.1,0.1,8.18,0.2,0.1
food,6.89,0.43,6.89,0.43,0.86,2.15,0.43,0.43
lunch,0.57,0.19,0.19,0.19,0.19,0.38,0.19,0.19
spend,0.32,0.16,0.32,0.16,0.16,0.16,0.16,0.16
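
A single cell of this table can be checked by hand. The sketch below recomputes the entry for "chinese food" in plain Python; the helper names ``laplace_prob`` and ``adjusted_count`` are hypothetical, and ``V`` is the same vocabulary size as ``W`` above.

```python
V = 1446  # vocabulary size, same value as W above

def laplace_prob(bigram_count, history_count, vocab_size):
    """Add-one estimate of P(w | history)."""
    return (bigram_count + 1) / (history_count + vocab_size)

def adjusted_count(bigram_count, history_count, vocab_size):
    """Count that the smoothed probability corresponds to: P_Laplace * C(history)."""
    return laplace_prob(bigram_count, history_count, vocab_size) * history_count

# "chinese food": C(chinese food) = 82, C(chinese) = 158
p_smoothed = laplace_prob(82, 158, V)
c_star = adjusted_count(82, 158, V)
discount = c_star / 82  # fraction of the original count that remains

print(round(p_smoothed, 6), round(c_star, 2), round(discount, 2))
```

The adjusted count drops from 82 to roughly 8: because V is large compared with C(chinese) = 158, add-one smoothing moves most of the probability mass of this row to unseen bigrams.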


In [61]:
brp_bigrams

Unnamed: 0,I,want,to,eat,chinese,food,lunch,spend
I,5,827,0,9,0,0,0,2
want,2,0,608,1,6,6,5,1
to,2,0,4,686,2,0,6,211
eat,0,0,2,0,16,2,42,0
chinese,1,0,0,0,0,82,1,0
food,15,0,15,0,1,4,0,0
lunch,2,0,0,0,0,1,0,0
spend,1,0,1,0,0,0,0,0


## Homework

### Exercise 1

- "on the other hand"
- "on the other end"

Which word sequence is more probable? Estimate the probabilities with n-grams (maximum likelihood estimation from relative frequencies in the Brown corpus). Pay attention to upper and lower case.

### Exercise 2

Find all bigrams and trigrams that start with "eat" (e.g. "eat chicken", "eat French fries") and sort them by frequency.

### Exercise 3

How probable is "i want chinese food" with add-1 smoothing?<br/>
(You will also need P(i|\<s\>) = 0.19 and P(\</s\>|food) = 0.40) <br/>
<br/>
How does the add-1 smoothed probability compare to the probability without smoothing?