## Sample Data for the Tasks

### The Python csv library

A CSV reader converts line-based input (a file or a list of strings) into Python objects, one line at a time.
A few parameters are of particular interest when creating a reader:
- **delimiter**: the field separator. The default is `','`.
- **skipinitialspace**: if true, whitespace immediately following the delimiter is ignored. The default is `False`.
- **fieldnames**: a list of column names for `DictReader` (needed when the first line does not supply them).
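
The effect of these parameters can be illustrated with a minimal sketch. The two sample rows below are hypothetical (they are not the exercise data) and deliberately have no header line, so the column names are passed via `fieldnames`.

```python
from csv import DictReader

# Hypothetical sample rows without a header line (not part of the exercise data).
rows = [
    "1; Oldest Known Song",
    "2; Bohemian Rhapsody",
]

# Without skipinitialspace, the blank after ';' remains part of the value.
raw = list(DictReader(rows, delimiter=";", fieldnames=["id", "title"]))

# With skipinitialspace=True, the blank after the delimiter is dropped.
clean = list(DictReader(rows, delimiter=";", fieldnames=["id", "title"],
                        skipinitialspace=True))

print(repr(raw[0]["title"]))    # ' Oldest Known Song'
print(repr(clean[0]["title"]))  # 'Oldest Known Song'
```

Because `fieldnames` is given explicitly, the first row is treated as data rather than as a header.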


In [1]:
from csv import DictReader

doc_lines= [
"id;title;text",
"1;Oldest Known Song; The oldest known song is the Hurrian Hymn No. 6, which dates back to around 1400 BCE.",
"2;Bohemian Rhapsody; The iconic song Bohemian Rhapsody, released in 1975, is known for its unique structure and lack of a traditional chorus.",
"3;Mozart's Early Start; Wolfgang Amadeus Mozart composed his first piece of music at the age of five.",
"4;Eye of the Tiger; The iconic rock song Eye of the Tiger is known for its motivational lyrics and driving rhythm.",
"5;Music and the Brain; Listening to music can stimulate the brain and improve memory, mood, and cognitive function.",
"6;Universal Language; Music is often referred to as a universal language because it can convey emotions and stories without words.",
"7;Largest Orchestra; The largest orchestra ever assembled consisted of 8,097 musicians and performed in Frankfurt, Germany, in 2019.",
"8;Music Therapy; Music therapy is used to help patients with various conditions, including depression, anxiety, and chronic pain.",
"9;Birds and Music; Some birds, like the lyrebird, can mimic musical instruments and human-made sounds.",
"10;Quintet; A team of five people can form a band and create music as a quintet."
]
documents=list(DictReader(doc_lines, delimiter=';', skipinitialspace=True))
print(f"read {len(documents)} documents")

question_lines = [
"question;doc;method",  # header line; the comma keeps it separate from the first question
"tell me the oldest song title;1;keyword-search",
"what was the first vocal ever sung;1;synonyms",
"can animals make music;9;meronyms",
"what was the first song;1;word-vector-search",
"can music bring me back to an active life;8;passage-retrieval",
"can a five years old make music;3;passage-retrieval",
"is there music about animals;4;passage-retrieval",
]
questions=list(DictReader(question_lines, delimiter=';', skipinitialspace=True))
print(f"read {len(questions)} questions")


read 10 documents
read 7 questions


## Task 1: Keyword Search

Stopword removal, lowercasing, and lemmatization are essential for keyword search.
We will use keyword search repeatedly in what follows as a baseline for comparison with the other methods.
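
How the three preprocessing steps feed a keyword search can be sketched in plain Python. This is an illustration under stated assumptions, not the reference solution: `STOP_WORDS` is a tiny hand-picked list standing in for NLTK's `stopwords.words('english')`, `normalize` is a crude plural stripper standing in for real lemmatization, and the helper names `keywords` and `keyword_search` are hypothetical.

```python
import re

# Assumption: tiny hand-picked stopword list; the exercise uses NLTK's list instead.
STOP_WORDS = {"the", "a", "an", "is", "me", "tell", "of", "for", "its",
              "and", "in", "to", "can", "what", "was"}

def normalize(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def keywords(text):
    """Lowercase, split into alphanumeric tokens, drop stopwords, normalize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {normalize(t) for t in tokens if t not in STOP_WORDS}

def keyword_search(query, docs):
    """Return ids of docs sharing keywords with the query, best match first."""
    query_terms = keywords(query)
    scored = []
    for doc in docs:
        overlap = query_terms & keywords(doc["title"] + " " + doc["text"])
        if overlap:
            scored.append((len(overlap), doc["id"]))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
    return [doc_id for _, doc_id in scored]

# Two rows taken from the sample documents above.
docs = [
    {"id": "1", "title": "Oldest Known Song",
     "text": "The oldest known song is the Hurrian Hymn No. 6, "
             "which dates back to around 1400 BCE."},
    {"id": "9", "title": "Birds and Music",
     "text": "Some birds, like the lyrebird, can mimic musical "
             "instruments and human-made sounds."},
]
hits = keyword_search("tell me the oldest song title", docs)
print(hits)
```

Scoring by the number of shared keywords is the simplest possible ranking. It is chosen here only to keep the sketch short.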

#### API
- Tokenization (splitting a text into words)
- Finding the synonyms of a word
- Determining the POS tag of a word
- Forming the lemma of a word

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words=set(stopwords.words('english'))
print(stop_words)
help(word_tokenize)

{'down', 'o', 'so', 'how', 'under', 'ma', 'mightn', 'had', 'theirs', 'those', 'than', 'can', 'against', 'or', "haven't", "we've", 'i', 'themselves', 'we', 'doing', 'what', "don't", 'some', 'am', "hadn't", 'for', 'out', "shan't", 'again', 'haven', 'that', 'by', 'has', 'hers', "we'll", "they're", 'once', 'but', 'couldn', 'here', "they'd", 'wasn', "you'd", 'aren', 'do', 'm', 'yours', 'my', 'myself', 'same', 'until', 'whom', 't', 'needn', 'y', 'these', 'you', "it'd", 'if', 'his', 'be', "i'll", "needn't", 'herself', 'and', 'just', 'an', 'few', 'have', "isn't", 'he', "they'll", "you've", 'wouldn', 'about', 'himself', "weren't", 'did', 'who', "aren't", 'hasn', "he'd", "he'll", 'it', 'this', 're', 'through', "wasn't", 'itself', "it'll", 'the', "doesn't", 've', "it's", 'd', "you'll", "i'm", "mightn't", 'not', 'now', 'such', "we're", "they've", 'is', "mustn't", 'yourself', 'more', "she'll", 'other', 'then', 'won', "won't", 'at', 'your', 'our', 'up', 'were', 'will', 'was', "wouldn't", 'during', "

In [3]:
import nltk
help(nltk.pos_tag)
nltk.help.upenn_tagset()

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +NORMALIZE_WHITESPACE
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal') # doctest: +NORMALIZE_WHITESPACE
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.

    :param tokens: Sequence of tokens to be tagged
    :type t

In [4]:
from nltk.stem import WordNetLemmatizer
help(WordNetLemmatizer.lemmatize)

Help on function lemmatize in module nltk.stem.wordnet:

lemmatize(self, word: str, pos: str = 'n') -> str
    Lemmatize `word` by picking the shortest of the possible lemmas,
    using the wordnet corpus reader's built-in _morphy function.
    Returns the input word unchanged if it cannot be found in WordNet.

    >>> from nltk.stem import WordNetLemmatizer as wnl
    >>> print(wnl().lemmatize('dogs'))
    dog
    >>> print(wnl().lemmatize('churches'))
    church
    >>> print(wnl().lemmatize('aardwolves'))
    aardwolf
    >>> print(wnl().lemmatize('abaci'))
    abacus
    >>> print(wnl().lemmatize('hardrock'))
    hardrock

    :param word: The input word to lemmatize.
    :type word: str
    :param pos: The Part Of Speech tag. Valid options are `"n"` for nouns,
        `"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"`
        for satellite adjectives.
    :type pos: str
    :return: The shortest lemma of `word`, for the given `pos`.
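
`pos_tag` returns Penn Treebank tags such as `'JJ'` or `'VBZ'`, whereas `lemmatize` expects one of `'n'`, `'v'`, `'a'`, `'r'`, `'s'`. A small mapping between the two could look like the sketch below. The helper name `penn_to_wordnet` is hypothetical, and falling back to `'n'` for all other tags is an assumption that mirrors the default of `lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a value for the `pos` argument of lemmatize()."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # assumption: treat everything else as a noun

# Tagged tokens copied from the pos_tag docstring shown earlier.
tagged = [("John", "NNP"), ("big", "JJ"), ("idea", "NN"),
          ("is", "VBZ"), ("n't", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

With NLTK and the WordNet data installed, each pair could then be passed on as `WordNetLemmatizer().lemmatize(word.lower(), pos=penn_to_wordnet(tag))`.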



## Task 2: Query Expansion with NLTK

## Task 3: Query Expansion with word2vec, BERT

## Task 4: Passage Retrieval with Sentence BERT

## Task 5: Re-Ranking

## Task 6: Semantic Text Similarity

## Task 7: Pseudo Relevance Feedback