# MSV / SS 2023 - Exercise 8

We use texts from the Harry Potter books.

Sources:

(Data) https://github.com/amisha-jodhani/text-generator-harry-potter

(Code) https://www.nltk.org/api/nltk.lm.html and https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk

In [1]:
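import os

import nltk

# Assumes the notebook runs from the directory that contains "harrypotter/".
corpus_dir = "harrypotter/"
all_files = os.listdir(corpus_dir)

hp_corpus = ""

# Concatenate every .txt file in the directory into one long string.
for filename in all_files:
    if filename.endswith(".txt"):
        with open(os.path.join(corpus_dir, filename), "r") as f:
            text = f.read()
            hp_corpus += text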

In [14]:
hp_corpus[0:300]

'Chapter 1: The Other Minister\nIt was nearing midnight and the Prime Minister was sitting alone in his office, reading a long memo that was slipping through his brain without leaving the slightest trace of meaning behind. He was waiting for a call from the President of a far distant country, and betw'

## 8.1 Language generation with n-grams

#### 1. Tokenization and padding

In [3]:
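# Both tokenizers need NLTK's Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead).
from nltk import word_tokenize, sent_tokenize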

In [4]:
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                  for sent in sent_tokenize(hp_corpus)]

In [5]:
print(tokenized_text[5:15])

[['how', 'on', 'earth', 'was', 'his', 'government', 'supposed', 'to', 'have', 'stopped', 'that', 'bridge', 'collapsing', '?'], ['it', 'was', 'outrageous', 'for', 'anybody', 'to', 'suggest', 'that', 'they', 'were', 'not', 'spending', 'enough', 'on', 'bridges', '.'], ['the', 'bridge', 'was', 'fewer', 'than', 'ten', 'years', 'old', ',', 'and', 'the', 'best', 'experts', 'were', 'at', 'a', 'loss', 'to', 'explain', 'why', 'it', 'had', 'snapped', 'cleanly', 'in', 'two', ',', 'sending', 'a', 'dozen', 'cars', 'into', 'the', 'watery', 'depths', 'of', 'the', 'river', 'below', '.'], ['and', 'how', 'dare', 'anyone', 'suggest', 'that', 'it', 'was', 'lack', 'of', 'policemen', 'that', 'had', 'resulted', 'in', 'those', 'two', 'very', 'nasty', 'and', 'well-publicized', 'murders', '?'], ['or', 'that', 'the', 'government', 'should', 'have', 'somehow', 'foreseen', 'the', 'freak', 'hurricane', 'in', 'the', 'west', 'country', 'that', 'had', 'caused', 'so', 'much', 'damage', 'to', 'both', 'people', 'and', 'pr

In [6]:
len(tokenized_text)

80740

In [7]:
# Hold out the last 0.1% of the sentences; everything before index `split` is training data.
split = len(tokenized_text) - len(tokenized_text) // 1000
split

80660

In [35]:
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3

training_ngrams, padded_sentences_training = padded_everygram_pipeline(n, tokenized_text[:split])
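
A note on the two return values: as far as the NLTK source shows, `padded_everygram_pipeline` returns lazy iterators, so each of them can be consumed only once. This is presumably why the pipeline is called again right before training, after the one-sentence demonstration. The following minimal sketch illustrates the general Python behaviour with a hypothetical stand-in generator (`lazy_bigrams` is not an NLTK function):

```python
from typing import Iterator, List, Tuple


def lazy_bigrams(sentences: List[List[str]]) -> Iterator[Tuple[str, ...]]:
    """Hypothetical stand-in for a lazily evaluated n-gram stream."""
    return (
        tuple(sent[i:i + 2])
        for sent in sentences
        for i in range(len(sent) - 1)
    )


stream = lazy_bigrams([["a", "b", "c"]])

first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing is left to iterate over

print(first_pass)
print(second_pass)
```

If a model trained on such a stream ends up with an empty vocabulary, an already consumed iterator is a likely cause; calling the pipeline again creates fresh iterators.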

#### What did we do? An example with one sentence

In [9]:
tokenized_text[5]

['how',
 'on',
 'earth',
 'was',
 'his',
 'government',
 'supposed',
 'to',
 'have',
 'stopped',
 'that',
 'bridge',
 'collapsing',
 '?']

In [34]:
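# Same pipeline with n=2 on a single sentence, so that the result is easy to inspect.
ngrams_1s, padded_sentences_1s = padded_everygram_pipeline(2, tokenized_text[5:6])

# First return value: all 1-grams and 2-grams of each padded sentence.
for ngramlize_sent in ngrams_1s:
    print(list(ngramlize_sent))
    print()

# Second return value: the padded tokens as one flat stream.
list(padded_sentences_1s)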

[('<s>',), ('<s>', 'how'), ('how',), ('how', 'on'), ('on',), ('on', 'earth'), ('earth',), ('earth', 'was'), ('was',), ('was', 'his'), ('his',), ('his', 'government'), ('government',), ('government', 'supposed'), ('supposed',), ('supposed', 'to'), ('to',), ('to', 'have'), ('have',), ('have', 'stopped'), ('stopped',), ('stopped', 'that'), ('that',), ('that', 'bridge'), ('bridge',), ('bridge', 'collapsing'), ('collapsing',), ('collapsing', '?'), ('?',), ('?', '</s>'), ('</s>',)]



['<s>',
 'how',
 'on',
 'earth',
 'was',
 'his',
 'government',
 'supposed',
 'to',
 'have',
 'stopped',
 'that',
 'bridge',
 'collapsing',
 '?',
 '</s>']
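
The output above can be reproduced by hand. The following is a minimal pure-Python sketch of the two steps (padding, then listing all n-grams up to length n); the helper names `pad_sentence` and `all_ngrams_up_to` are hypothetical and are not NLTK's implementation, and the short example sentence is invented for illustration:

```python
from typing import Iterator, List, Tuple


def pad_sentence(tokens: List[str], n: int) -> List[str]:
    """Add n-1 start symbols and n-1 end symbols around a sentence."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)


def all_ngrams_up_to(tokens: List[str], max_len: int) -> Iterator[Tuple[str, ...]]:
    """Yield every n-gram of length 1..max_len, ordered by start position."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                yield tuple(tokens[start:start + length])


sentence = ["how", "on", "earth", "?"]
padded = pad_sentence(sentence, n=2)
grams = list(all_ngrams_up_to(padded, max_len=2))

print(padded)
print(grams)
```

With n=2, one `<s>` and one `</s>` are added; with n=3, as in the training cell, each sentence gets two of each.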

#### 2. Training a 3-gram model

In [36]:
from nltk.lm import MLE
model = MLE(n)  # Initialize the MLE model with n=3; the vocabulary (vocab) is still empty

In [37]:
len(model.vocab)

0

In [38]:
model.fit(training_ngrams, padded_sentences_training)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 25823 items>


In [39]:
len(model.vocab)

25823

In [40]:
print(model.counts)

<NgramCounter with 3 ngram orders and 4896339 ngrams>


#### N-grams with "Potter"

In [41]:
model.counts['harry']

18014

In [42]:
model.counts[['harry']]['potter']

372

In [43]:
model.score('potter', 'harry'.split())  # P('potter'|'harry')

0.02065060508493394

In [44]:
model.counts[['boy']]['who'] 

50

In [45]:
model.counts[['boy', 'who']]['lived'] 

12

In [46]:
model.score('lived', 'boy who'.split())  # P('lived'|'boy who')

0.24
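
The scores above are plain relative frequencies: the count of the full n-gram divided by the count of its context, i.e. 372 / 18014 and 12 / 50. A minimal sketch of this maximum-likelihood estimate on an invented toy corpus (the function `mle_score` and the toy sentences are hypothetical, and padding is left out for brevity):

```python
from typing import List, Sequence


def mle_score(word: str, context: Sequence[str], sentences: List[List[str]]) -> float:
    """Relative frequency of `word` among all tokens that follow `context`."""
    context = tuple(context)
    k = len(context)
    context_total = 0  # how often the context is followed by any token
    ngram_total = 0    # how often the context is followed by `word`
    for sent in sentences:
        for i in range(len(sent) - k):
            if tuple(sent[i:i + k]) == context:
                context_total += 1
                if sent[i + k] == word:
                    ngram_total += 1
    return ngram_total / context_total if context_total else 0.0


toy_sentences = [
    ["the", "boy", "who", "lived"],
    ["the", "boy", "who", "ran"],
    ["the", "boy", "slept"],
]

p_lived = mle_score("lived", ["boy", "who"], toy_sentences)  # 1 of 2 continuations
p_who = mle_score("who", ["boy"], toy_sentences)             # 2 of 3 continuations

print(p_lived, p_who)
```

Because the estimate uses raw counts only, any n-gram that never occurs in the training data receives probability 0.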

#### 3. Generation

This function makes the generated text more readable: it skips the start padding, stops at the first end-of-sentence symbol, and detokenizes the result.

In [47]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)
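
To make the idea behind generation concrete: the next word is drawn at random, weighted by how often it followed the current context in the training data. The following is a simplified bigram sketch on an invented toy corpus, not NLTK's actual `generate` implementation; the names `train_bigram_counts` and `sample_sentence` are hypothetical:

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram_counts(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Count, for every token, which tokens follow it (with sentence padding)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sentence(counts: Dict[str, Counter], max_words: int, seed: int) -> str:
    """Sample word by word until '</s>' is drawn or max_words is reached."""
    rng = random.Random(seed)
    prev = "<s>"
    words: List[str] = []
    for _ in range(max_words):
        candidates = sorted(counts[prev])  # sorted so that a seed is reproducible
        weights = [counts[prev][w] for w in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)


toy_sentences = [
    ["harry", "looked", "at", "ron"],
    ["ron", "looked", "at", "harry"],
    ["harry", "smiled"],
]
bigram_counts = train_bigram_counts(toy_sentences)

for seed in range(3):
    print(sample_sentence(bigram_counts, max_words=10, seed=seed))
```

As with `random_seed` in `generate_sent`, the same seed always yields the same sentence, and different seeds usually yield different ones.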

In [48]:
for i in range(0, 10):
    print(generate_sent(model, 10000, random_seed=i))

the security i've got no proof at all.
3 re going to happen before it sped across the stone floor for a moment later, to the bathroom, echoing voice.
were walking.
"i might never come and stay ." voldemort moved slowly toward the concealed printing press blocking the stairs, slipped on his back to politeness, "sooner you ask me, we ’ ll be all that hap-pened between james and lily.
a double-decker bus rumbled by and large, old-fashioned, red-brick department store called purge & dowse ltd.
in the same way as your brain and you resorted to crude and badly judged meas-ures such as he was standing behind quirrell.
shaggy head behind him, lost his powers, had come to tell me it was always ill at the staff table set at a terrible scream.
his black eyes gleaming.

cup.


In [49]:
for i in range(42, 52):
    print(generate_sent(model, 10000, random_seed=i))

itself, if i remember it well!"
, pettigrew, and she walked back to azkaban...but too late: there was only their heads, incapable of saying it for months."
at once from the distant bangs of battle, enter the triwizard tournament.
"i'm "" aren ’ t, and dobby thinks, and there was a little patience with it, sibyll?"
this is not touched by the neck to ankles to the sitting room, where it would suffice.
a large display near the fence, looking alarmed, and whispered, bristling with anger.
harry, shaking her magnificent head.
, feeling around in my bag!"
face.
"well, my scar hurt when nobody has seen it alive.


## 8.2 Language generation with RNNs


https://santiviquez-harry-potter-rnn-app-bjb9w6.streamlit.app/

## 8.3 Language generation with LSTM networks (today: training data preparation only)

https://colab.research.google.com/drive/1lKOwObRmjANWHhNHx4Urk4H7UZFYG74U?usp=sharing

## 8.4 Text classification with LSTM networks (spaCy and Keras)

https://colab.research.google.com/drive/1Am5IuGfGUSFIk17ZBaER4d4v14E3o7B0?usp=sharing

Based on:
- Duygu Altinok, Mastering spaCy, Chapter 8. Code and data: https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter08