# INFO 4271 - Exercise 2 - Text Representation

Issued: April 23, 2024

Due: April 29, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Bag-of-Words Models
In class we discussed BOW vectorization models under which documents are represented via term frequency counts.

a) Construct term frequency BOW representations for the following sentences:

- "The government is open."
- "The government is closed."
- "Long live Mickey Mouse, emperor of all!"
- "Darn! This will break."

In [2]:
import string

corpus = [['The government is open.'], ['The government is closed.'], ['Long live Mickey Mouse, emperor of all!'], ['Darn! This will break.']]

#Turn a corpus of arbitrary texts into term-frequency weighted BOW vectors.
def TF(corpus):
    vecs = []
    #TODO: Implement me!
    for text in corpus:
        text[0] = text[0].translate(str.maketrans('', '', string.punctuation))
        text[0] = text[0].lower()
        words = text[0].split()
        bow = {}
        for word in words:
            if word in bow:
                bow[word] += 1
            else:
                bow[word] = 1
        vecs.append(bow)
    return vecs

print(TF(corpus))

[{'the': 1, 'government': 1, 'is': 1, 'open': 1}, {'the': 1, 'government': 1, 'is': 1, 'closed': 1}, {'long': 1, 'live': 1, 'mickey': 1, 'mouse': 1, 'emperor': 1, 'of': 1, 'all': 1}, {'darn': 1, 'this': 1, 'will': 1, 'break': 1}]


b) Extend the term frequency model by an inverse document frequency (IDF) component. Estimate IDFs based on the Reuters 21578 collection.

In [10]:
import nltk
from nltk.corpus import reuters
import math

#Download the documents
nltk.download("reuters")
documents = reuters.fileids()

docs = list(filter(lambda doc: doc.startswith("train"),documents));
print(str(len(docs)) + " total train documents");

#To access the content of a news article, we can use the reuters.words() function
print("The first document contains "+str(len(reuters.words(docs[0])))+" words.\nHere they are:")
# for word in reuters.words(docs[0]):
    # print(word)

#Estimate inverse document frequencies based on a corpus of documents.
def IDF(corpus):
    idfs = {}
    all_vec = {}
    vecs = []
    for text in corpus:
        words = reuters.words(text)
        bow = {}
        for word in words:
            word = word.lower().translate(str.maketrans('', '', string.punctuation))
            if word in bow:
                bow[word] += 1
            else:
                bow[word] = 1
        vecs.append(bow)
        
    for vec in vecs:
        for word in vec:
            if word in all_vec:
                all_vec[word] += 1
            else:
                all_vec[word] = 1
                
    for text in corpus:
        words = reuters.words(text)
        for word in words:
            word = word.lower().translate(str.maketrans('', '', string.punctuation))
            if word not in idfs:
                idfs[word] = math.log10(len(corpus) / all_vec[word])
    print(idfs)
    return idfs

#Turn a corpus of arbitrary texts into TF-IDF weighted BOW vectors.
def TFIDF(corpus):
    vecs = []
    # First get a frequency-weighted bag of words for every text in fwbogs
    fwbogs = []
    for text in corpus:
        words = reuters.words(text)
        fwbogs.append(fwbog(words))

    # Secondly calculate the TF-IDF
    for i, text in enumerate(corpus):
        tf_idf_vec = {}
        words = reuters.words(text)

        # Count frequency in all documents
        for word in words:
            word = word.lower().translate(str.maketrans('', '', string.punctuation))
            document_count = 0
            for fwbog in fwbogs:
                if word in fwbog:
                    document_count += 1
            # Add word to tf_idf_vec
            word_fwbogs = fwbogs[i]
            print(word_fwbogs)
            tf_idf_vec[word] = word_fwbogs[word] * math.log10(len(corpus) / document_count)
        vecs.append(tf_idf_vec)
    print(vecs)

def fwbog(words):
    bow = {}
    for word in words:
        word = word.lower().translate(str.maketrans('', '', string.punctuation))
        if word in bow:
            bow[word] += 1
        else:
            bow[word] = 1
    return bow

TFIDF(docs)

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\nic0m\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


7769 total train documents
The first document contains 633 words.
Here they are:
['BAHIA', 'COCOA', 'REVIEW', 'Showers', 'continued', ...]


TypeError: '<' not supported between instances of 'str' and 'int'

c) Bag-of-words models are order invariant. They do not retain the ordering in which terms occur in the document. Is there any way to include term order information in these models? Justify your answer below.

# 2. Topic Models
Topic models represent textual documents in terms of their distribution of latent topics. Imagine you have trained a 10-topic LDA model. Each topic is a frequency distribution over thousands of terms. Is there a good way of illustrating the meaning of the learned topics to a human? Discuss the advantages and disadvantages of some of the possible options below.