# Natural Language Processing with nltk & spaCy

<div>
<img src="../../images/lab02/spacy_landingpage.png" width="700"/>
</div>

_(Adapted from [spaCy 101](https://spacy.io/usage/spacy-101))_


# 1. Overview


Before we begin, we need to download some extra tools.

First, update your virtual environment from last time to include the new dependencies by running the following in your terminal:

```bash
uv sync
```

<br>

Next, download the spaCy model that we will work with:

```bash
uv run --with pip spacy download en_core_web_lg
```


The **spaCy** library provides a variety of linguistic annotations to give the user insights into the grammatical structure of a text snippet. This includes the word types, like the parts of speech, and how the words are related to each other.


In [None]:
import spacy
import pandas as pd
from IPython.display import display
from typing import Set, Literal, Tuple


nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

<div>
<img src="../../images/lab02/spacy_pos.png" width="1000"/>
</div>

- **`Text`**: The original token text
- **`Lemma`**: The base form of the token
- **`POS`**: The simple **Universal Part of Speech** tags
- **`Tag`**: The detailed part-of-speech tag
- **`Dep`**: Syntactic dependency, meaning the relationship between tokens.
- **`Shape`**: The word shape, i.e. capitalization, punctuation, digits
- **`is alpha`**: Does the token consist of alphabetic characters?
- **`is stop`**: Is the token part of a stop list?


In [None]:
doc = nlp("The Dresden  University of Technology (the Collaborative University 😃) is known for their excellence.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

## In a Nutshell: how does the Tokenization process work?

As we know from the previous exercise, **tokenization** is the task of splitting a text into meaningful segments, known as **tokens**. One important detail: spaCy's tokenization is **non-destructive**, meaning that we can always reconstruct the original input from the tokenized output. Information such as **whitespace** is preserved in the tokens, and nothing is added or removed during tokenization.
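To make the idea of non-destructive splitting concrete, here is a minimal sketch in plain Python. It is not how spaCy implements it (spaCy keeps this information on each token, for example in `token.whitespace_` and `token.text_with_ws`); the regular expression and the sample sentence are assumptions chosen only for illustration.

```python
import re

text = "The Dresden  University of Technology is known for its excellence."

# Destructive: str.split() discards the whitespace, so the double space is lost.
lossy_tokens = text.split()
print(" ".join(lossy_tokens) == text)  # False

# Non-destructive (toy version): keep every chunk together with the
# whitespace that follows it; leading whitespace becomes its own chunk.
chunks = re.findall(r"\S+\s*|^\s+", text)
print(chunks[:3])               # ['The ', 'Dresden  ', 'University ']
print("".join(chunks) == text)  # True
```

Because each chunk carries its trailing whitespace, joining the chunks with an empty string gives back exactly the original input.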

<div>
<img src="../../images/lab02/spacy_tokenization.png" width="1000"/>
</div>

1. In the very first step, the raw text is split on whitespace characters, similar to how **`text.split(' ')`** acts.
2. Next, the tokenizer processes the text **from left to right**, performing two checks on each substring:
   <br></br>
   - **Does the current substring match a tokenizer exception rule?** For example, "**`don't`**" contains no whitespace but should still be split into the two tokens "do" and "n't", while something like "U.K." should always remain a single token.
   - **Can a prefix, suffix or infix be split off?** For example, punctuation such as commas, periods, hyphens or quotes.

If we have a match, the rule is applied and the tokenizer continues the loop, moving to the next substring.
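The loop described above can be sketched in a few lines of plain Python. This is a heavily simplified toy, not spaCy's actual tokenizer: the rule tables `SPECIAL_CASES`, `PREFIXES` and `SUFFIXES` and the function names are hypothetical, infixes are ignored, and whitespace is discarded for brevity.

```python
# Hypothetical, tiny rule tables; real language data is far larger.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ("(", '"', "$")
SUFFIXES = (")", '"', ",", ".", "!", "?")


def split_substring(substring):
    """Split one whitespace-delimited substring into tokens."""
    head, tail = [], []
    while substring:
        if substring in SPECIAL_CASES:
            # Check 1: an exception rule matches, so apply it as-is.
            head.extend(SPECIAL_CASES[substring])
            substring = ""
        elif substring.startswith(PREFIXES):
            # Check 2a: split off a one-character prefix and loop again.
            head.append(substring[0])
            substring = substring[1:]
        elif substring.endswith(SUFFIXES):
            # Check 2b: split off a one-character suffix and loop again.
            tail.insert(0, substring[-1])
            substring = substring[:-1]
        else:
            # Nothing left to split: the remainder is a single token.
            head.append(substring)
            substring = ""
    return head + tail


def toy_tokenize(text):
    tokens = []
    for substring in text.split():  # Step 1: split on whitespace
        tokens.extend(split_substring(substring))
    return tokens


print(toy_tokenize("I don't live in the U.K., sadly."))
# ['I', 'do', "n't", 'live', 'in', 'the', 'U.K.', ',', 'sadly', '.']
```

Note how "U.K.," first loses its trailing comma as a suffix, after which the remainder "U.K." matches the exception rule and stays in one piece.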

<div>
<img src="../../images/lab02/spacy_tokenization_visualised.png" width="800"/>
</div>

Note that tokenizer exceptions depend strongly on the specifics of the individual language, so always load the model for the language of your text to get the best results.


We can also easily define our own special cases:


In [None]:
from spacy.symbols import ORTH

doc = nlp("gimme that")
print([token.text for token in doc])

In [None]:
# Create the special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

print([token.text for token in nlp("gimme that")])

## 2. Now, we will apply the knowledge from the previous lectures


First, we need to download a dataset. This one "only" has 3,000,000 entries; the original had ~30M rows with a total size of 20 GiB. If any of you has enough RAM (and the patience) to work with the full dataset, see [here](https://ir-datasets.com/car.html#car/v1.5).

You can decide how many rows to load by choosing a value for **`nrows`**, where
$0 \lt n_{rows} \leq 3{,}000{,}000$


In [None]:
data = pd.read_csv("hf://datasets/jembie/carv1.5_reduced/carv1.5_reduced.csv", nrows=1_000)
display(data)

## What are we working with?

<div>
<img src="../../images/lab02/trec_homepage.png" width="1000"/>
</div>

TREC (Text REtrieval Conference) is a long-running research initiative that evaluates and advances information retrieval systems by organizing shared tasks and datasets. The [TREC 2017](https://trec-car.cs.unh.edu/) Complex Answer Retrieval (CAR) track describes its focus as follows:

> "This track encourages research for answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results."

#### Task:

Do some preprocessing on our data: for each entry in the **`text`** column, collect all stemmed and lowercased tokens that are part of the English vocabulary. In the end, we want a pandas **`DataFrame`** that has the following column structure when calling **`data.columns`**:

```py
Index(['doc_id', 'text', 'tokens', 'tokens_count'], dtype='object')
```

The related documentation should (hopefully) help you out: [spaCy processing pipelines](https://spacy.io/usage/processing-pipelines/)


In [None]:
import spacy
from collections import Counter

# Load large English model
nlp = spacy.load("en_core_web_lg")
vocab = nlp.vocab

english_vocabulary = set(vocab.strings)


def preprocess_pipeline(texts, english_vocab) -> Tuple[..., ...]:
    for doc in nlp.pipe(texts, batch_size=50, n_process=4):
        # Only permit tokens that are in the English vocabulary, not a stop word, and not numeric
        ...

    return ..., ...


# Apply preprocessing

data["tokens"] = ...
data["tokens_count"] = ...

In [None]:
# You can inspect the columns here

#### Some basic post-processing

Since our pipeline is very simple, the tokenizer may still have run into problems along the way. Remove the entries where tokenization failed, i.e. where the **`tokens`** column contains an empty list.


### 3. Build the Inverted Index

Build the inverted index, where the key (or the index of the DataFrame, if you prefer to use pandas) is each term we have collected, mapped to a list of the IDs of the documents that contain it. Although our current model cannot yet rank by relevance (we have no numerical measure of how "important" a piece of text is for a given query), we still want to collect useful metadata such as **`Document Frequency`** and **`Total Term Frequency`** for future improvements. For now, we do not need to store the frequency of each term within individual documents; in other words, our postings lists will simply contain document IDs.

**You must implement the Inverted Index with a Document Frequency entry for each term. The Total Term Frequency entry is optional for this lab.**
<br></br>
As a quick reminder, here is the inverted index structure:

<div>
<img src="../../images/lab02/Inverted_Index.svg">
</div>

(_Adapted from_ [_De Paul University CSC575_](http://facweb.cs.depaul.edu/mobasher/classes/CSC575/Assignments/assign1-2023.html))
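As a rough illustration of the structure above, here is a minimal sketch on a hypothetical toy corpus, using only the standard library. The names `toy_docs`, `build_inverted_index` and the field names are assumptions made for this example; it is one possible layout, not the required solution for the lab data.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc_id -> list of already preprocessed tokens
toy_docs = {
    "d1": ["apple", "buy", "startup"],
    "d2": ["apple", "apple", "pie"],
    "d3": ["startup", "fund"],
}


def build_inverted_index(docs):
    postings = defaultdict(set)  # term -> set of doc_ids that contain it
    total_tf = Counter()         # term -> occurrences across the whole corpus
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
            total_tf[token] += 1
    return {
        term: {
            "doc_ids": sorted(doc_ids),
            "document_frequency": len(doc_ids),  # number of distinct documents
            "total_term_frequency": total_tf[term],
        }
        for term, doc_ids in postings.items()
    }


index = build_inverted_index(toy_docs)
print(index["apple"])
# {'doc_ids': ['d1', 'd2'], 'document_frequency': 2, 'total_term_frequency': 3}
```

Using a set for the postings means a term that occurs twice in the same document ("apple" in `d2`) is counted once for the document frequency but twice for the total term frequency. A nested dictionary like this can be turned into a DataFrame with the terms as its index via `pd.DataFrame.from_dict(index, orient="index")`.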


In [None]:
# Perhaps you might create the very basic mapping of (terms -> doc_ids)
# via a defaultdict and then add the meta information when transforming
# it into a DataFrame
from collections import defaultdict

In [None]:
# Make sure that the 'terms' column is the new index of the DataFrame

### 4. Implement Boolean Retrieval for our Inverted Index, so we can do: AND, OR & NOT Queries

For now, we only want fairly simple querying, so for example:

```py
search("Apples are great", operation="AND")
# --> "apples AND are AND great" is our query
```

should query our tokens as follows: `
