<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/01b_text_reps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular data & text representation

In this notebook, we'll learn how to work with tabular data using the `pandas` library. We will also explore some basic text preprocessing and representation techniques.

## Intro to `pandas`

Before we delve into the `pandas` library, we first need to understand how to load libraries. Libraries are collections of functions and other objects, provided by the Python community for us to use (you might be familiar with packages, the R equivalent). To import one into our session, we simply use the `import` statement.

In [None]:
import pandas

Then we can use its functions and objects. For pandas, arguably the most relevant class is the `DataFrame`. This object stores data in a two-dimensional table with rows and columns. We can pass a dictionary to the `DataFrame` constructor to define a basic table: each key becomes a column name and each list holds that column's values.

In [None]:
df = pandas.DataFrame({
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12],
})

In [None]:
df

To select a column, we can use either attribute access (`.`) or brackets.

In [None]:
df.c

In [None]:
df['c']

Similarly, we can select rows by subsetting with brackets:

In [None]:
df[df.a > 4]

To select specific rows, you can also use `df.index`:

In [None]:
df[df.index == 1]

To select rows and columns at the same time, you can subset sequentially:

In [None]:
df.c[df.a > 4]

*Exercise: select the first observation of the b column.*

To assign values, you can use the bracket logic:

In [None]:
df['a'] = 1
df

In [None]:
df[df.index == 0] = 2
df

You can create new columns the same way:

In [None]:
df['d'] = [5,25,125]

In [None]:
df

To assign a value to a specific cell, you should use `.loc`, which takes a row selector and a column name:

In [None]:
df.loc[df.index == 2, 'c'] = 500

In [None]:
df
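The same `.loc` indexer can also select rows and columns in a single step. The following is a minimal sketch on a hypothetical frame called `scores` (not the notebook's `df`), showing selection and assignment with a boolean row mask and a column name:

```python
import pandas

# Hypothetical frame, separate from the notebook's `df`
scores = pandas.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]})

# Select rows where a > 4 and only column 'c', in one step
subset = scores.loc[scores.a > 4, "c"]

# Assign through the same indexer; this modifies `scores` itself
scores.loc[scores.a > 4, "c"] = 0

print(subset.tolist())
print(scores)
```

Assigning through one `.loc` call is generally the safer choice compared to sequential subsetting such as `scores.c[scores.a > 4] = 0`, which pandas may apply to a temporary copy rather than to the original frame.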

Often more useful is arithmetic on entire dataframes: an operation with a single number is applied to every cell.

In [None]:
df/2
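Arithmetic also works between columns, element by element. A small sketch, assuming a hypothetical frame `prices` with made-up column names:

```python
import pandas

# Hypothetical data, used only for illustration
prices = pandas.DataFrame({"unit_price": [2.0, 3.5, 1.0], "quantity": [3, 2, 10]})

# Element-wise product of two columns, stored as a new column
prices["total"] = prices["unit_price"] * prices["quantity"]

# A scalar is applied to every cell of the frame
halved = prices / 2

print(prices)
print(halved)
```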

*Exercise: create a new column 'e', which is the difference between the d and the c columns.*

Lastly, you can store a dataframe in a local file, e.g. a CSV file, and load it again later.

In [None]:
df.to_csv("test_file.csv", index=False)

In [None]:
df_load = pandas.read_csv("test_file.csv")
df_load
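To see what `index=False` does, the sketch below writes a small hypothetical frame twice, once without and once with the row index, and reads both files back. The temporary directory and file name are assumptions made for the example.

```python
import os
import tempfile

import pandas

# Hypothetical frame, used only for this round trip
original = pandas.DataFrame({"a": [1, 2], "b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "round_trip.csv")

    # index=False: the row labels are not written to the file
    original.to_csv(path, index=False)
    without_index = pandas.read_csv(path)

    # Default behaviour: the row labels are written as an extra first column
    original.to_csv(path)
    with_index = pandas.read_csv(path)

print(without_index.columns.tolist())
print(with_index.columns.tolist())
```

The second frame comes back with one additional column holding the old row labels, which is usually not wanted when the index carries no information.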

More: [pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Bag-of-words (BoW)

Now that we understand how to work with dataframes, we can create a document-term matrix (DTM).

In [None]:
documents = [
    "This is the first document.",
    "This is the second document.",
    "This is the third document.",
    "A completely completely unrelated text."
]

First, we need to create a vocabulary of unique terms (the columns of our DTM):

In [None]:
def vocab_generator(documents):
    vocab = []
    for doc in documents:
        tokens = doc.split()
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
    return vocab

*Exercise: look at each line in the function above: what is happening here?*

In [None]:
vocab_generator(documents)
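The explicit loop is easy to follow, but the same idea can be written more compactly. The sketch below is one possible alternative, not a replacement for the function above; the name `vocab_from_dict` and the two example sentences are made up for illustration. It relies on dictionary keys being unique and keeping their insertion order (Python 3.7+).

```python
def vocab_from_dict(documents):
    # Flatten all documents into one stream of whitespace-separated tokens
    tokens = (token for doc in documents for token in doc.split())
    # Dictionary keys are unique and preserve the order of first appearance
    return list(dict.fromkeys(tokens))


example_docs = ["This is the first document.", "This is the second document."]
print(vocab_from_dict(example_docs))
# → ['This', 'is', 'the', 'first', 'document.', 'second']
```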

This looks ok, but it probably makes sense to remove the punctuation and lowercase all tokens.

*Exercise: create a function to remove punctuation and lowercase all tokens. Apply it to the documents, creating a new object with the processed documents called `processed_docs`.*

In [None]:
## your code here

We might also remove some stopwords when creating the vocabulary.

In [None]:
def vocab_generator(documents, stop_words=("this", "is", "the", "a")):
    vocab = []
    for doc in documents:
        tokens = doc.split()
        for token in tokens:
            if token not in vocab and token not in stop_words:
                vocab.append(token)
    return vocab

In [None]:
vocab = vocab_generator(processed_docs)

Let's create a document-term matrix! We start with one row per document and one column per token in the vocabulary, all filled with zeros.

In [None]:
import pandas as pd
# One row per document, one column per vocabulary term, initialised to zero
dfm = pd.DataFrame([[0] * len(vocab)] * len(documents), columns=vocab)
dfm

Note that we `import pandas as pd` here. Libraries can be renamed at import, and `pd` is the customary way to refer to this frequently used package.
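When it comes to filling a document-term matrix with counts, the standard library's `collections.Counter` may be a helpful building block. A minimal sketch on a single made-up sentence (the variable names are hypothetical):

```python
from collections import Counter

# Hypothetical, already lowercased sentence without punctuation
sentence = "the cat saw the other cat"

# Counter maps each token to the number of times it occurs
counts = Counter(sentence.split())

print(counts["cat"])  # → 2
print(counts["dog"])  # → 0, missing tokens count as zero instead of raising an error
```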

*Exercise: count the number of occurrences of each term in each document and assign them to the DTM.*

In [None]:
## your code here

## Processing text with `nltk`

A particularly useful library for basic text operations is NLTK (short for Natural Language Toolkit). We can use it to stem or lemmatize texts.

In [None]:
from nltk.stem.snowball import SnowballStemmer

# Initialize stemmer for English
stemmer = SnowballStemmer('english')

# Test words with different morphological patterns
test_words = [
    'quick', 'quickly',
    'university', 'universities',
    'running', 'runs', 'ran'
]

for word in test_words:
    stem = stemmer.stem(word)
    print(stem)
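To make it easier to see which word forms the stemmer merges, the sketch below groups words by their stem. The word list and the `groups` dictionary are illustrative choices, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Illustrative word list mixing regular and irregular forms
words = ["running", "runs", "ran", "university", "universities"]

# Map each stem to the list of words that were reduced to it
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, members)
```

Because stemming works by stripping suffixes, regular forms tend to end up in the same group, while irregular forms such as "ran" are generally not connected to their base form.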

The package also contains a lemmatizer. To use it, we first need to download the necessary resources.

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer as wnl

lemmatizer = wnl()

for word in test_words:
    lemma = lemmatizer.lemmatize(word)
    print(lemma)

By default, the lemmatizer treats every word as a noun, so the lemma it returns depends on the part of speech it is given. This is most useful alongside a part-of-speech tagger that identifies whether a given term is a verb, noun, adjective, etc. We can also pass the part of speech by hand:

In [None]:
from nltk.corpus import wordnet
lemma = lemmatizer.lemmatize("ran", wordnet.VERB)
lemma
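Part-of-speech taggers for English often return Penn Treebank tags such as `VBD` or `NNS`, whereas the lemmatizer expects a WordNet part-of-speech letter. The helper below is a hypothetical sketch of that translation; the function name `penn_to_wordnet` is made up, and the letters are written out as plain strings under the assumption that they match `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV`, and `wordnet.NOUN`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # everything else falls back to noun


for tag in ["VBD", "JJ", "RB", "NNS"]:
    print(tag, penn_to_wordnet(tag))
```

The returned letter can be passed as the second argument to `lemmatizer.lemmatize`. The tags themselves could come from a tagger such as `nltk.pos_tag`, which needs its own resource download.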

*Exercise: Create your own word list and explore the differences between the two approaches further.*