# INFO 4271 - Exercise 2 - Text Representation

Issued: April 23, 2024

Due: April 29, 2024

Please submit this completed sheet via Ilias by the due date.

---

# 1. Bag-of-Words Models
In class we discussed BOW vectorization models under which documents are represented via term frequency counts.

a) Construct term frequency BOW representations for the following sentences:

- "The government is open."
- "The government is closed."
- "Long live Mickey Mouse, emperor of all!"
- "Darn! This will break."

In [2]:
import string

corpus = [['The government is open.'], ['The government is closed.'], ['Long live Mickey Mouse, emperor of all!'], ['Darn! This will break.']]

#Turn a corpus of arbitrary texts into term-frequency weighted BOW vectors.
def TF(corpus):
    vecs = []

    # Collect every distinct word, normalized (lowercased, punctuation stripped)
    all_words = []
    for text in corpus:
        words = text[0].split()
        for word in words:
            word = word.lower().translate(str.maketrans('', '', string.punctuation))
            if word not in all_words:
                all_words.append(word)

    for text in corpus:
        fwbog_dict = {}
        for word in all_words:
            fwbog_dict[word] = 0

        words = text[0].split()
        fwbog = get_fwbog(words)
        # Copy the actual counts so that repeated words get a frequency > 1
        for word, count in fwbog.items():
            fwbog_dict[word] = count
        print(fwbog_dict)
        vecs.append(fwbog_dict)
    return vecs

def get_fwbog(words):
    bow = {}
    for word in words:
        word = word.lower().translate(str.maketrans('', '', string.punctuation))
        if word in bow:
            bow[word] += 1
        else:
            bow[word] = 1
    return bow

print(TF(corpus))

{'the': 1, 'government': 1, 'is': 1, 'open': 1, 'closed': 0, 'long': 0, 'live': 0, 'mickey': 0, 'mouse': 0, 'emperor': 0, 'of': 0, 'all': 0, 'darn': 0, 'this': 0, 'will': 0, 'break': 0}
{'the': 1, 'government': 1, 'is': 1, 'open': 0, 'closed': 1, 'long': 0, 'live': 0, 'mickey': 0, 'mouse': 0, 'emperor': 0, 'of': 0, 'all': 0, 'darn': 0, 'this': 0, 'will': 0, 'break': 0}
{'the': 0, 'government': 0, 'is': 0, 'open': 0, 'closed': 0, 'long': 1, 'live': 1, 'mickey': 1, 'mouse': 1, 'emperor': 1, 'of': 1, 'all': 1, 'darn': 0, 'this': 0, 'will': 0, 'break': 0}
{'the': 0, 'government': 0, 'is': 0, 'open': 0, 'closed': 0, 'long': 0, 'live': 0, 'mickey': 0, 'mouse': 0, 'emperor': 0, 'of': 0, 'all': 0, 'darn': 1, 'this': 1, 'will': 1, 'break': 1}
[{'the': 1, 'government': 1, 'is': 1, 'open': 1, 'closed': 0, 'long': 0, 'live': 0, 'mickey': 0, 'mouse': 0, 'emperor': 0, 'of': 0, 'all': 0, 'darn': 0, 'this': 0, 'will': 0, 'break': 0}, {'the': 1, 'government': 1, 'is': 1, 'open': 0, 'closed': 1, 'long':

b) Extend the term frequency model by an inverse document frequency (IDF) component. Estimate IDFs based on the Reuters 21578 collection.

In [3]:
import nltk
from nltk.corpus import reuters
import math

#Download the documents
nltk.download("reuters")
documents = reuters.fileids()

docs = list(filter(lambda doc: doc.startswith("train"), documents))
print(str(len(docs)) + " total train documents")

#To access the content of a news article, we can use the reuters.words() function
# print("The first document contains "+str(len(reuters.words(docs[0])))+" words.\nHere they are:")
# for word in reuters.words(docs[0]):
    # print(word)

#Estimate inverse document frequencies based on a corpus of documents.
# The format of the corpus needs to be like the one in exercise a)
def IDF(corpus):
    idfs = {}
    all_vec = {}
    vecs = []
    for text in corpus:
        words = text[0].split()
        vecs.append(get_fwbog(words))
        
    for vec in vecs:
        for word in vec:
            if word in all_vec:
                all_vec[word] += 1
            else:
                all_vec[word] = 1
                
    # IDF = ln(N / df), using the natural log so the values match TFIDF below
    for word, doc_count in all_vec.items():
        idfs[word] = math.log(len(corpus) / doc_count)
    print(idfs)
    return idfs

# Turn a corpus of arbitrary texts into TF-IDF weighted BOW vectors.
# The format of the corpus needs to be like the one in exercise a)
def TFIDF(corpus):
    vecs = []
    # First get a frequency-weighted bag of words for every text in fwbogs
    fwbogs = []
    for text in corpus:
        words = text[0].split()
        fwbogs.append(get_fwbog(words))

    # Secondly calculate the TF-IDF
    for i, text in enumerate(corpus):
        tf_idf_vec = {}
        words = text[0].split()

        # Document frequency: count the documents that contain the word
        for word in words:
            word = word.lower().translate(str.maketrans('', '', string.punctuation))
            document_count = 0
            for fwbog in fwbogs:
                if word in fwbog:
                    document_count += 1
            # Add word to tf_idf_vec
            word_fwbogs = fwbogs[i]
            tf_idf_vec[word] = word_fwbogs[word] * math.log(len(corpus) / document_count)
        print(tf_idf_vec)
        vecs.append(tf_idf_vec)
        # print(f"done {(i+1)} of {len(corpus)}")
    return vecs

def format_reuter_docs(docs):
    # Join each document's tokens into one string, wrapped in a list as in exercise a)
    formatted_reuter_docs = []
    for doc in docs:
        formatted_reuter_docs.append([" ".join(reuters.words(doc))])
    return formatted_reuter_docs

# My understanding of the task is that we should compute the IDF/TF-IDF scores for the whole Reuters collection.
# IDF is estimated on all training documents; TF-IDF only uses the first 50 of them, because printing the full output crashed my computer.
print("IDF:")
print(IDF(format_reuter_docs(docs)))
print("TF-IDF with corpus from 1a):")
print(TFIDF(corpus))
print("TF-IDF with first 50 documents of reuters corpus:")
print(TFIDF(format_reuter_docs(docs[:50])))

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\nic0m\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


7769 total train documents
IDF:
TF-IDF with corpus from 1a):
{'the': 0.6931471805599453, 'government': 0.6931471805599453, 'is': 0.6931471805599453, 'open': 1.3862943611198906}
{'the': 0.6931471805599453, 'government': 0.6931471805599453, 'is': 0.6931471805599453, 'closed': 1.3862943611198906}
{'long': 1.3862943611198906, 'live': 1.3862943611198906, 'mickey': 1.3862943611198906, 'mouse': 1.3862943611198906, 'emperor': 1.3862943611198906, 'of': 1.3862943611198906, 'all': 1.3862943611198906}
{'darn': 1.3862943611198906, 'this': 1.3862943611198906, 'will': 1.3862943611198906, 'break': 1.3862943611198906}
[{'the': 0.6931471805599453, 'government': 0.6931471805599453, 'is': 0.6931471805599453, 'open': 1.3862943611198906}, {'the': 0.6931471805599453, 'government': 0.6931471805599453, 'is': 0.6931471805599453, 'closed': 1.3862943611198906}, {'long': 1.3862943611198906, 'live': 1.3862943611198906, 'mickey': 1.3862943611198906, 'mouse': 1.3862943611198906, 'emperor': 1.3862943611198906, 'of': 1
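
One possible reading of task b) is to estimate the IDF values on a reference collection and then apply them to the sentences from a). The following is a minimal sketch of that reading, not the submitted solution. A small toy list stands in for the Reuters documents so that it runs without a download. The helper names `tokenize`, `estimate_idf` and `tfidf`, as well as the `default_idf` fallback for terms unseen in the reference collection, are assumptions made for illustration.

```python
import math
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop empty tokens."""
    table = str.maketrans('', '', string.punctuation)
    return [w for w in (t.translate(table) for t in text.lower().split()) if w]

def estimate_idf(reference_texts):
    """Natural-log IDF per term: ln(N / df), df = number of texts containing the term."""
    df = Counter()
    for text in reference_texts:
        df.update(set(tokenize(text)))  # count each term once per text
    n = len(reference_texts)
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf(text, idf, default_idf):
    """TF-IDF for one text; terms unseen in the reference texts get default_idf."""
    tf = Counter(tokenize(text))
    return {term: count * idf.get(term, default_idf) for term, count in tf.items()}

# Toy reference collection standing in for the Reuters documents (assumption).
reference = [
    "The government is open.",
    "The market is closed.",
    "The mouse is long.",
    "Oil prices will break records.",
]
idf = estimate_idf(reference)
# Assumption: treat unseen terms as if they occurred in exactly one reference text.
default_idf = math.log(len(reference))

vec = tfidf("The government is closed.", idf, default_idf)
print(vec)
```

With this setup, frequent terms such as "the" and "is" receive a low weight because they occur in three of the four reference texts, while "government" and "closed" receive the maximum weight ln(4).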

c) Bag-of-words models are order invariant. They do not retain the ordering in which terms occur in the document. Is there any way to include term order information in these models? Justify your answer below.

We could use n-grams, i.e. treat sequences of n adjacent words as terms, which preserves local word order. However, this greatly increases the vocabulary size, which makes all operations more computationally expensive.
We could also use word2vec, a word embedding method that maps each word to a dense N-dimensional vector learned from the words in its local context window. This captures which words occur close to each other, but the words inside a context window are themselves treated as an unordered set, so it reflects term proximity rather than the exact ordering of terms.
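
The n-gram idea can be sketched as follows, assuming the same normalization as in exercise 1a (lowercasing, punctuation stripped, whitespace tokenization). The helper names `tokenize` and `ngram_bow` and the two example sentences are illustrative and not part of the exercise.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop empty tokens."""
    table = str.maketrans('', '', string.punctuation)
    return [w for w in (t.translate(table) for t in text.lower().split()) if w]

def ngram_bow(text, n=2):
    """Count contiguous n-grams; each n-gram (a tuple of words) acts as one term."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

s1 = "the dog bites the man"
s2 = "the man bites the dog"

# Unigram counts are identical, so a plain BOW cannot tell the sentences apart.
print(ngram_bow(s1, n=1) == ngram_bow(s2, n=1))
# Bigram counts differ, because bigrams retain local word order.
print(ngram_bow(s1, n=2) == ngram_bow(s2, n=2))
```

The price is a larger vocabulary: every distinct word pair becomes its own dimension.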

# 2. Topic Models
Topic models represent textual documents in terms of their distribution of latent topics. Imagine you have trained a 10-topic LDA model. Each topic is a frequency distribution over thousands of terms. Is there a good way of illustrating the meaning of the learned topics to a human? Discuss the advantages and disadvantages of some of the possible options below.

- Frequency-based list
We can show a human the top n terms of a topic as a sparse representation of the overall topic.
    - Advantages: 
        - Gives a good overview of the most frequently mentioned content in the topic
        - Shows the most dominant themes clearly
    - Disadvantages:
        - The human cannot grasp the full content of the topic, because we only show part of the information.
        - We implicitly assume that low-frequency words are unimportant for the topic, which may not be true
- Visualization in word clouds
We can visualize the terms of a topic graphically, for example as word clouds in which more frequently occurring terms are drawn in a larger font size.
    - Advantages:
        - Humans instantly get a good impression of the most frequently mentioned terms, but can also take a closer look to get a grasp of less frequent terms in the topic
        - Aesthetically interesting for humans
    - Disadvantages:
        - If many terms are about equally frequent, we get a cluttered word cloud that isn't helpful for comprehending the information
- N-dimensional visualization
We can visualize the topics by representing them in an N-dimensional coordinate space.
    - Advantages:
        - We can visually group related terms, which makes it easy to understand relationships between terms
    - Disadvantages:
        - Displaying terms in more than two dimensions can be confusing and may not help in grasping the terms in the topic
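
The frequency-based list option can be sketched in a few lines. This is a minimal illustration that assumes each topic is available as a dictionary mapping terms to probabilities. The two toy topics and the helper name `top_terms` are hypothetical; a trained LDA model would supply thousands of terms per topic.

```python
def top_terms(topic, n=5):
    """Return the n highest-probability (term, probability) pairs of one topic."""
    return sorted(topic.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical toy topics for illustration only.
topics = {
    "topic_0": {"oil": 0.30, "barrel": 0.20, "price": 0.15, "opec": 0.10, "the": 0.05},
    "topic_1": {"bank": 0.25, "rate": 0.22, "loan": 0.12, "price": 0.08, "the": 0.05},
}

for name, dist in topics.items():
    terms = ", ".join(f"{term} ({p:.2f})" for term, p in top_terms(dist, n=3))
    print(f"{name}: {terms}")
```

The cut-off `n` makes the disadvantage noted above visible: everything below the top n terms, such as "opec" in the first toy topic, is hidden from the reader.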