# <center>HW #1: Analyze Documents by Numpy</center>

### Thanapoom Phatthanaphan <br> CWID: 20011296

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets). In general, it is much easier to complete the requirements in the order shown in the problem description
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT a group assignment**. 
    - Never ever copy others' work (even with minor modification, e.g. changing variable names)
    - If you generate code using large language models (although it is not encouraged), make sure to adapt the generated code so that it meets all requirements and is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

**Problem Description**

In this assignment, you'll write functions to analyze an article to find out the word distributions and key concepts. 

The packages you'll need for this assignment include `numpy` and `string`. Some useful functions:
- string, list, dictionary: `split`,`join`, `count`, `index`,`strip`
- numpy: `sum`, `where`,`log`, `argsort`,`argmin`, `argmax` 

## Q1. Define a function to analyze word counts in a document


Define a function named `tokenize(doc)` which processes an input document (denoted as `doc`) as follows: 

* First convert the document to lower case.
* Split the document into a list of tokens by **whitespace** (spaces, tabs, and new lines). For example, `hello, it's a helloooo world!` -> `["hello,", "it's", "a", "helloooo", "world!"]` 
* Remove leading or trailing punctuation from each token. For example, `world!` -> `world`, but `it's` is not changed as the punctuation is in the middle. 
    - Hint: you can import the module *string*, use `string.punctuation` to get a string of punctuation characters (say `puncts`), and then use the function `strip(puncts)` to remove leading or trailing punctuation in each token
* Find the count of each unique `non-empty` token and save the counts as a dictionary, named `vocab`, e.g., `{"hello": 1, "a": 1, ...}` 
* Return the dictionary
    

In [1]:
import numpy as np
import string
# add your import statement

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
def tokenize(doc):
    
    vocab = {}
    
    # Convert the text to lower case
    lowercased_doc = doc.lower()
    
    # Get a string containing all punctuation characters
    puncts = string.punctuation
    
    # Split the document into tokens (words)
    tokens = lowercased_doc.split()
    
    # Remove leading and trailing punctuations from each token
    cleaned_tokens = [token.strip(puncts) for token in tokens]
    
    # Count each token in the document
    for token in cleaned_tokens:
        if token:
            vocab[token] = vocab.get(token, 0) + 1
        
    return vocab

In [3]:
# Test function

doc = "Hello , it's a helloooo world!"
vocab = tokenize(doc)
vocab

{'hello': 1, "it's": 1, 'a': 1, 'helloooo': 1, 'world': 1}

## Q2: Split unusual words into common pieces 


Notice that some words contain extra characters or punctuation. Next we'll find the common subwords in each word (e.g., split "helloooo" into "hello" and "ooo").

**Q2.1.** Define a function `get_pair_count(vocab)` to count the frequency of pairs of adjacent subwords in a word as follows:


- The input is a dictionary (denoted as `vocab`) which maps each word to its count. Each word consists of subwords delimited by spaces. For example, at the beginning, we treat each character as a subword. Thus, the `vocab` from Q1 becomes `{"h e l l o":1, "a":1, ...}`
- Count any pair of consecutive subwords in each word and create a new dictionary to note down the total count of each pair across all the words, e.g. `{"e l": 2}`.
- Return the dictionary for the subword pairs.
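
The counting step can be read in more than one way. The following is a minimal sketch under two assumptions that are not confirmed by the description: each pair is weighted by the count of the word it appears in, and pairs are stored under space-joined string keys as in the `{"e l": 2}` example. The name `count_pairs_weighted` is hypothetical.

```python
from collections import Counter

def count_pairs_weighted(vocab):
    """Count adjacent subword pairs, weighting each pair by its word's count."""
    pairs = Counter()
    for word, count in vocab.items():
        subwords = word.split()
        # zip pairs each subword with its right-hand neighbour
        for left, right in zip(subwords, subwords[1:]):
            pairs[f"{left} {right}"] += count
    return dict(pairs)

pair_demo = count_pairs_weighted({"h e l l o": 1, "w o r l d": 2})
print(pair_demo)
```

With this weighting, every pair inside `w o r l d` gets a count of 2 because the word itself occurs twice. A single-subword entry such as `"a"` contributes no pairs.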

In [4]:
def get_pair_count(vocab):
    
    pairs = {}
    
    # Iterate through each word in the vocab
    for word, count in vocab.items():
        if len(word) > 1:
            chars = word.split()

            # Get a pair of the chars
            for i in range(len(chars) - 1):
                pair = (chars[i], chars[i + 1])
                pairs[pair] = pairs.get(pair, 0) + 1
            
    return pairs

In [5]:
# Test

# At the start, treat each character as a subword. 
# Add spaces as delimiters of subwords in each word 

init_vocab = {' '.join(list(word)) : count for word, count in vocab.items()}
init_vocab

pairs = get_pair_count(init_vocab)
pairs

{'h e l l o': 1, "i t ' s": 1, 'a': 1, 'h e l l o o o o': 1, 'w o r l d': 1}

{('h', 'e'): 2,
 ('e', 'l'): 2,
 ('l', 'l'): 2,
 ('l', 'o'): 2,
 ('i', 't'): 1,
 ('t', "'"): 1,
 ("'", 's'): 1,
 ('o', 'o'): 3,
 ('w', 'o'): 1,
 ('o', 'r'): 1,
 ('r', 'l'): 1,
 ('l', 'd'): 1}

**Q2.2**. Define a function `merge_subwords(pair, vocab)` as follows:


- The inputs include a subword pair (denoted as `pair`), and the original vocabulary dictionary (denoted as `vocab`).
- For each word in `vocab`, if it contains `pair`, remove the space delimiter between the pair. Now this pair becomes a new subword. 
    - Hint: if you know regular expressions, feel free to use them here. Otherwise, you can simply use the function `replace`. Don't worry about some minor cross-boundary issues, e.g., `('hell', 'o')` may be matched with `hell oo`.
- Return the new vocabulary dictionary
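
For readers who want to avoid the cross-boundary issue mentioned in the hint, here is a sketch of a regular-expression variant. It assumes subwords are always separated by single spaces. The name `merge_pair_regex` is hypothetical and not part of the assignment.

```python
import re

def merge_pair_regex(pair, vocab):
    """Merge `pair` only where both parts are complete subwords."""
    bigram = re.escape(" ".join(pair))
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side,
    # so ('hell', 'o') does not match inside 'hell oo'
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    # a function replacement avoids backslash escapes being interpreted
    return {pattern.sub(lambda m: merged, word): count
            for word, count in vocab.items()}

regex_demo = merge_pair_regex(("hell", "o"), {"hell o": 1, "hell oo oo": 1})
print(regex_demo)
```

Here `hell o` becomes `hello`, while `hell oo oo` is left unchanged because the `o` is followed by another non-space character.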

In [6]:
def merge_subwords(pair, vocab):
    
    # Initialize the output vocab
    vocab_out = {}
    
    # Remove the space between the two subwords of the pair in each word
    for word, count in vocab.items():
        new_word = word.replace(f'{pair[0]} {pair[1]}', f'{pair[0]}{pair[1]}')
        vocab_out[new_word] = count
        
    return vocab_out

In [7]:
# Test

pair = ('h', 'e')

# replace all 'h e' substrings by 'he'
new_vocab = merge_subwords(pair, init_vocab)

new_vocab

{'he l l o': 1, "i t ' s": 1, 'a': 1, 'he l l o o o o': 1, 'w o r l d': 1}

**Q2.3**. Define a function `subword_tokenize(doc, num_merges = 5)` to put all functions together.


- The inputs include a document (denoted as `doc`) and the number of times to merge subwords.
- Call `tokenize(doc)` to get the initial vocabulary dictionary, denoted as `vocab`
- For each word in `vocab`, add a space delimiter between characters to indicate that each character is treated as a subword initially. Save these characters into a list named `subwords`
- Repeat the following steps `num_merges` times:
    - Call `get_pair_count(vocab)` to get the frequency of each subword pair across the words
    - Find the subword pair with the highest count, denoted as `pair`. If there is a tie, take any pair.
    - Call `merge_subwords(pair, vocab)` to merge the selected subwords and update the vocabulary `vocab`. Add the new subword into the list `subwords`.
- Finally, split each word in `vocab` by space to generate a new dictionary for the count of each subword.
- Return the subword dictionary and also `subwords` list.
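
The final step of the list above (turning the merged vocabulary into per-subword counts) can be sketched as follows. This assumes each subword inherits the count of the word it occurs in. The helper name `count_subwords` is hypothetical.

```python
def count_subwords(vocab):
    """Split each space-delimited word and total the counts per subword."""
    counts = {}
    for word, count in vocab.items():
        for subword in word.split():
            counts[subword] = counts.get(subword, 0) + count
    return counts

merged_vocab = {"hello": 1, "world": 2, "i t ' s": 1, "a": 1, "helloo oo": 1}
subword_counts = count_subwords(merged_vocab)
print(subword_counts)
```

Fully merged words such as `world` appear as a single subword, while partly merged words such as `helloo oo` contribute one entry per remaining piece.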

In [8]:
def subword_tokenize(doc, num_merges = 5):
    
    vocab_out = {}
    subwords = []
    
    # Get the initial vocabulary dictionary
    vocab = tokenize(doc)
    
    # Add a space delimiter between characters
    for word, count in vocab.items():
        vocab_out[' '.join(list(word))] = count
    
    for merge_count in range(num_merges):
        
        # Treat each character as a subword initially
        for word in vocab_out.keys():
            for char in word.split():
                if char not in subwords:
                    subwords.append(char)

        # Get the frequency of subword pair across the words
        pairs = get_pair_count(vocab_out)

        # Descending sort to get the highest count of the subword pair
        sorted_pairs = sorted(pairs.items(), key=lambda item: item[1], reverse=True)

        # Merge the selected subwords and update the vocabulary
        vocab_out = merge_subwords(sorted_pairs[0][0], vocab_out)

        # Print the output
        print(f'Merge: #{merge_count + 1}')
        print(f'Pair: {sorted_pairs[0][0]}')
        print(f'Vocab: {vocab_out}')
        print(f'Subwords: {subwords}')
    
    return vocab_out, subwords

In [9]:
# test
# for debugging, you can print out the result of each merge as shown below.

doc = "Hello world, it's a helloooo world!"
vocab_out, subwords = subword_tokenize(doc, num_merges = 9)

print("vocab:")
vocab_out

print("subwords:")
subwords


Merge: #1
Pair: ('o', 'o')
Vocab: {'h e l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'h e l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a']
Merge: #2
Pair: ('h', 'e')
Vocab: {'he l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'he l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo']
Merge: #3
Pair: ('he', 'l')
Vocab: {'hel l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hel l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he']
Merge: #4
Pair: ('hel', 'l')
Vocab: {'hell o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hell oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he', 'hel']
Merge: #5
Pair: ('hell', 'o')
Vocab: {'hello': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'helloo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he', 'hel', 'hell']
Merge: #6
Pair: ('w', 'o')
Vocab: {'hello': 1, 'wo

{'hello': 1, 'world': 2, "i t ' s": 1, 'a': 1, 'helloo oo': 1}

subwords:


['h',
 'e',
 'l',
 'o',
 'w',
 'r',
 'd',
 'i',
 't',
 "'",
 's',
 'a',
 'oo',
 'he',
 'hel',
 'hell',
 'hello',
 'helloo',
 'wo',
 'wor',
 'worl']

## Q3. Generate a document term matrix (DTM) as a numpy array


Define a function `get_dtm(docs)` as follows:
- The input is a list of documents, denoted as `docs`
- For each document, call `tokenize(doc)` defined in **Q1** (let's only use the simple version for now) to get the vocabulary dictionary 
- Pool the keys from all the dictionaries to get a list of unique words, denoted as `unique_words` 
- Create a numpy array (denoted as `dtm`) with a shape of (# of documents x # of unique words), and set the initial values to 0. 
- Fill cell `dtm[i,j]` with the count of the `j`th word in the `i`th document 
- Return `dtm` and `unique_words`
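
As an illustration of filling each cell with a word count rather than a presence flag, here is a small self-contained sketch. `simple_tokenize` is a stand-in for the Q1 tokenizer and `build_count_dtm` is a hypothetical name. Neither is the required solution.

```python
import string
import numpy as np

def simple_tokenize(doc):
    """Stand-in for the Q1 tokenizer: lower-case, split, strip punctuation."""
    vocab = {}
    for token in doc.lower().split():
        token = token.strip(string.punctuation)
        if token:
            vocab[token] = vocab.get(token, 0) + 1
    return vocab

def build_count_dtm(docs):
    vocabs = [simple_tokenize(doc) for doc in docs]
    # dict.fromkeys keeps first-seen order while dropping duplicates
    unique_words = list(dict.fromkeys(w for v in vocabs for w in v))
    col = {w: j for j, w in enumerate(unique_words)}
    dtm = np.zeros((len(docs), len(unique_words)))
    for i, vocab in enumerate(vocabs):
        for word, count in vocab.items():
            dtm[i, col[word]] = count  # store the count, not just 1
    return dtm, unique_words

dtm_demo, words_demo = build_count_dtm(["a b a", "b c"])
print(words_demo)
print(dtm_demo)
```

In the first toy document the word `a` occurs twice, so its cell holds 2. The word-to-column dictionary also avoids repeated list searches when the vocabulary grows.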

In [10]:
def get_dtm(docs):
    
    # get all words
    unique_words = []
    words_loc = {}
    words_each_doc = []
    for doc in docs:
        temp_list_words = []
        vocab = tokenize(doc)
        for word, count in vocab.items():
            temp_list_words.append(word)
            if word not in unique_words:
                unique_words.append(word)
                words_loc[word] = len(unique_words) - 1
        words_each_doc.append(temp_list_words)
    
    # Create a numpy array
    dtm = np.zeros((len(docs), len(unique_words)))
    
    # Fill the cell with the count of the word in the document
    for i in range(len(docs)):
        for j, word in enumerate(unique_words):
            if word in words_each_doc[i]:
                dtm[i, j] += 1
            
    return dtm, unique_words

In [11]:
docs = ["Hello , it's a helloooo world!",
       "Again, it is hello world!"]

dtm, words = get_dtm(docs)
dtm
words

array([[1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 1., 1., 1.]])

['hello', "it's", 'a', 'helloooo', 'world', 'again', 'it', 'is']

In [12]:
# A test document collection. This document can be found at https://hbr.org/2022/04/the-power-of-natural-language-processing

# treat each paragraph as a document

docs = open("chatgpt.txt", 'r').readlines()

dtm, words = get_dtm(docs)

In [13]:
dtm.shape

# check words in a paragraph
p = 0 # paragraph id
docs[p]
[w for i,w in enumerate(words) if dtm[p][i]>0]

sorted(words)

(26, 314)

"Ethan Mollick has a message for the humans and the machines: can't we all just get along?\n"

['ethan',
 'mollick',
 'has',
 'a',
 'message',
 'for',
 'the',
 'humans',
 'and',
 'machines',
 "can't",
 'we',
 'all',
 'just',
 'get',
 'along']

['22-year-old',
 'a',
 'a.i',
 'about',
 'abroad',
 'academic',
 'access',
 'acknowledge',
 'adapt',
 'admits',
 'adopted',
 'after',
 'again',
 'against',
 'agrees',
 'all',
 'allowing',
 'almost',
 'along',
 'already',
 'alternates',
 'an',
 'and',
 'anxiety',
 'any',
 'app',
 'are',
 'artificial',
 'as',
 'asked',
 'asking',
 'assessments',
 'associate',
 'at',
 'away',
 'b',
 'b-minus',
 'banned',
 'be',
 'been',
 'before',
 'behalf',
 'believes',
 'between',
 'bot',
 "bot's",
 'but',
 'by',
 'calculators',
 'can',
 "can't",
 'capability',
 'challenge',
 'change',
 'changed',
 'changes',
 'chatbot',
 'chatgpt',
 'cheating',
 'check',
 'cites',
 'class',
 'classes',
 'classroom',
 'code',
 'come',
 'company',
 'compose',
 'computer',
 'concerns',
 'convinced',
 'core',
 'could',
 "couldn't",
 'course',
 'crashed',
 'created',
 'deserve',
 'despite',
 'detect',
 'did',
 "didn't",
 'differently',
 'districts',
 'do',
 "don't",
 'earlier',
 'early',
 'educators',
 'edward',
 'emerging'

## Q4 Analyze DTM Array (4 points)


**Don't use any loop in this task**. You should use array operations to take advantage of high performance computing.

Define a function named `analyze_dtm(dtm, words, docs)` as follows:
- It takes an array `dtm`, an array of `words`, and an array of documents (denoted `docs`) as inputs, where `dtm` is the array you created from `docs` in Q3 with a shape of $(m \times n)$, and `words` corresponds to the columns of `dtm`.
- Calculate the document frequency for each word $j$, i.e., how many documents contain word $j$. Save the result to array $df$. $df$ has shape of $(n,)$ or $(1, n)$. 
- Normalize the word count per paragraph: divide the word count, i.e., $dtm_{i,j}$, by the total number of words in document $i$. Save the result as an array named $tf$. $tf$ has shape of $(m,n)$. 
* For each $dtm_{i,j}$, calculate $tfidf_{i,j} = \frac{tf_{i, j}}{1+\log(df_j)}$, i.e., divide each normalized word count by 1 plus the log of the document frequency of the word (adding 1 keeps the denominator from being 0 when $df_j = 1$).  $tfidf$ has shape of $(m,n)$ 
* Print out the following:
    
    - the total number of words in the documents represented by `dtm` 
    - the number of documents and the number of unique words
    - the top 10 most frequent words in the documents    
    - top-5 words that show up in the most documents, i.e. words with the top 5 largest $df$ values (print words first, then their values. ) 
    - the longest document in terms of the number of words. Print out this document.
    - top-5 words with the largest $tfidf$ values in the longest document (show words and values) 
    - documents that contain the word `intelligence`.

Note, for all the steps, **do not use any loop**. Just use array functions and broadcasting for high performance computation.

Your answer may be different from the example output, since words may have the same values in the dtm but are kept in different positions
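
To make the shapes concrete, the sketch below applies the formulas above to a tiny hand-made matrix (`toy_dtm` is made up for illustration) using only array operations and broadcasting.

```python
import numpy as np

# 2 documents x 3 words, made-up counts
toy_dtm = np.array([[2., 1., 0.],
                    [0., 1., 1.]])

df = np.sum(toy_dtm > 0, axis=0)                    # shape (n,)
tf = toy_dtm / toy_dtm.sum(axis=1, keepdims=True)   # (m, n) / (m, 1)
tfidf = tf / (1 + np.log(df))                       # (m, n) / (n,)

# index of the longest document and its two highest-tfidf columns
longest = np.argmax(toy_dtm.sum(axis=1))
top = np.argsort(tfidf[longest])[::-1][:2]

print(df)
print(tf)
print(tfidf)
print(longest, top)
```

`keepdims=True` keeps the row totals as a column of shape `(m, 1)`, so each row is divided by its own total. The `(n,)` array `df` broadcasts across rows, so each column is divided by its own denominator.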

In [14]:
def analyze_dtm(dtm, words, docs):
    
    # Calculate the document frequency for each word
    df = np.sum(dtm>0, axis=0)
    
    # Normalize the word count per paragraph: divides word count by the total number of words in document
    tf = dtm / np.sum(dtm, axis=1, keepdims=True)
    
    # Divide each normalized word count by the log of the document frequency of the word
    tfidf = tf / (1 + np.log(df))
    
    # The total number of words in the documents represented by dtm
    total_words = np.sum(dtm)
    print("The total number of words in the documents represented by dtm:", total_words)
    
    # The number of documents and the number of unique words
    num_docs, num_unique_words = dtm.shape
    print("\nThe number of documents:", num_docs)
    print("\nThe number of unique words:", num_unique_words)
    
    # The most frequent top 10 words in this document
    ind_words_dtm = np.argsort(np.sum(dtm, axis=0))[::-1]
    top10_words = [words[i] for i in ind_words_dtm[:10]]
    print("\nThe most frequent top 10 words in this document:", top10_words)
    
    # The top-5 words that show in most of the documents
    ind_words_df = np.argsort(df)[::-1]
    print("\nThe top-5 words that show in most of the documents:")
    for i in ind_words_df[:5]:
        print(f"{words[i]}: {df[i]}")
    
    # The longest document in terms of the number of words
    num_words_each_doc = np.argsort(np.sum(dtm, axis=1))[::-1]
    print("\nThe longest document in terms of the number of words:\n", docs[num_words_each_doc[0]])
    
    # The top-5 words with the largest tfidf values in the longest document (show words and values)
    ind_words_tfidf = np.argsort(tfidf[num_words_each_doc[0]])[::-1]
    print("\nThe top-5 words with the largest tfidf values in the longest document:")
    for i in ind_words_tfidf[:5]:
        print(f"{words[i]}: {tfidf[num_words_each_doc[0]][i]}")
    
    # Documents that contain the word "intelligence" (works for any number of matches)
    doc_ids = np.where(dtm[:, words == 'intelligence'].sum(axis=1) > 0)[0]
    print("\nDocuments that contain 'intelligence' word:")
    print("Document No.:", doc_ids)
    print(docs[doc_ids])
    
    return None

In [15]:
words = np.array(words)
docs = np.array(docs)

analyze_dtm(dtm, words, docs)

The total number of words in the documents represented by dtm: 624.0

The number of documents: 26

The number of unique words: 314

The most frequent top 10 words in this document: ['the', 'and', 'to', 'a', 'in', 'it', 'that', 'he', 'chatgpt', 'for']

The top-5 words that show in most of the documents:
the: 20
and: 15
to: 14
a: 12
in: 11

The longest document in terms of the number of words:
 """I think everybody is cheating ... I mean, it's happening. So what I'm asking students to do is just be honest with me,"" he said. ""Tell me what they use ChatGPT for, tell me what they used as prompts to get it to do what they want, and that's all I'm asking from them. We're in a world where this is happening, but now it's just going to be at an even grander scale."""


The top-5 words with the largest tfidf values in the longest document:
what: 0.019230769230769232
me: 0.019230769230769232
everybody: 0.019230769230769232
mean: 0.019230769230769232
it's: 0.019230769230769232

Documents that con

## Q5 (Bonus). Generating DTM by subword tokenization (2 points)

Assume you only need to keep the top N most frequent words (e.g., N = 200) in the collection of documents. Redo Q3-Q4 as follows:

- Use the subword tokenization you developed in Q2 to tokenize documents
- Generate a dtm with only the top-N most frequent words in the entire collection.
- Then analyze the dtm as in Q4.


Describe and implement your ideas. Again, no loop should be used in your solution to Q4. **Don't just submit code. You need to explain your idea as markdowns. No score will be given if only code is submitted**
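
One possible way to restrict a dtm to its top-N columns without a loop is sketched below. It only covers the column-selection step, not the subword tokenization, and the name `keep_top_n` is hypothetical.

```python
import numpy as np

def keep_top_n(dtm, words, n):
    """Keep only the n columns with the largest total counts."""
    totals = dtm.sum(axis=0)
    # stable sort on negated totals: ties keep their original column order
    top_idx = np.argsort(-totals, kind="stable")[:n]
    return dtm[:, top_idx], np.asarray(words)[top_idx]

toy_dtm = np.array([[2., 1., 0.],
                    [0., 1., 1.]])
small_dtm, small_words = keep_top_n(toy_dtm, ["a", "b", "c"], n=2)
print(small_words)
print(small_dtm)
```

The reduced `small_dtm` and `small_words` could then be passed to an analysis function in place of the full arrays.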

In [None]:
# For Q5:
# To generate a dtm with only the top-N most frequent words in the entire collection,
# we can reuse the Q4 function with a few changes because that function already
# builds a list of word indices sorted in descending order of frequency.
# Therefore, to get only the top N, we pass N as an extra argument.
# Below is a copy of the Q4 function with the changed points marked for Q5.
# Please see the test code below.
    
def analyze_dtm_forQ5(dtm, words, docs, N):
    
    # Calculate the document frequency for each word
    df = np.sum(dtm>0, axis=0)
    
    # Normalize the word count per paragraph: divides word count by the total number of words in document
    tf = dtm / np.sum(dtm, axis=1, keepdims=True)
    
    # Divide each normalized word count by the log of the document frequency of the word
    tfidf = tf / (1 + np.log(df))
    
    # The total number of words in the documents represented by dtm
    total_words = np.sum(dtm)
    print("The total number of words in the documents represented by dtm:", total_words)
    
    # The number of documents and the number of unique words
    num_docs, num_unique_words = dtm.shape
    print("\nThe number of documents:", num_docs)
    print("\nThe number of unique words:", num_unique_words)
    
    # The most frequent top N words in this document
    # ** Changed from top 10 to top N words by passing N as an argument **
    ind_words_dtm = np.argsort(np.sum(dtm, axis=0))[::-1]
    topN_words = [words[i] for i in ind_words_dtm[:N]]
    print("\nThe most frequent top N words in this document:", topN_words)
    
    # The top-N words that show in most of the documents
    # ** Changed from top 5 to top N words by passing N as an argument **
    ind_words_df = np.argsort(df)[::-1]
    print("\nThe top-N words that show in most of the documents:")
    for i in ind_words_df[:N]:
        print(f"{words[i]}: {df[i]}")
    
    # The longest document in terms of the number of words
    num_words_each_doc = np.argsort(np.sum(dtm, axis=1))[::-1]
    print("\nThe longest document in terms of the number of words:\n", docs[num_words_each_doc[0]])
    
    # The top-N words with the largest tfidf values in the longest document (show words and values)
    # ** Changed from top 5 to top N words by passing N as an argument **
    ind_words_tfidf = np.argsort(tfidf[num_words_each_doc[0]])[::-1]
    print("\nThe top-N words with the largest tfidf values in the longest document:")
    for i in ind_words_tfidf[:N]:
        print(f"{words[i]}: {tfidf[num_words_each_doc[0]][i]}")
    
    # Documents that contain the word "intelligence" (works for any number of matches)
    doc_ids = np.where(dtm[:, words == 'intelligence'].sum(axis=1) > 0)[0]
    print("\nDocuments that contain 'intelligence' word:")
    print("Document No.:", doc_ids)
    print(docs[doc_ids])
    
    return None

# Put everything together and test using main block

In [16]:
# best practice to test your class
# if your script is exported as a module,
# the following part is ignored
# this is equivalent to main() in Java

if __name__ == "__main__":  
    
    
    print("\n=======Q1: =========\n")
    doc = "Hello , it's a helloooo world!"
    vocab = tokenize(doc)
    print(vocab)
    
    print("\n=======Q2: =========\n")
    doc = "Hello world, it's a helloooo world!"

    vocab_out, subwords = subword_tokenize(doc, num_merges = 9)

    print("vocab:")
    print(vocab_out)

    print("subwords:")
    print(subwords)
    
    print("\n=======Q3: =========\n")
    
    docs = ["Hello , it's a helloooo world!",
       "Again, it is hello world!"]

    dtm, words = get_dtm(docs)
    print(dtm)
    print(words)
    
    print("\n=======Q4: =========\n")
    
    docs = open("chatgpt.txt", 'r').readlines()

    dtm, words = get_dtm(docs)

    words = np.array(words)
    docs = np.array(docs)

    analyze_dtm(dtm, words, docs)
    
    print("\n=======Q5: BONUS =========\n")
    
    # To generate a dtm with only the top-N most frequent words in the entire collection,
    # we can reuse the Q4 function with a few changes because that function already
    # builds a list of word indices sorted in descending order of frequency.
    # Therefore, to get only the top N, we pass N as an extra argument.
    # Below is a copy of the Q4 function with the changed points marked for Q5.
    # Please see the test code below.
    
def analyze_dtm_forQ5(dtm, words, docs, N):
    
    # Calculate the document frequency for each word
    df = np.sum(dtm>0, axis=0)
    
    # Normalize the word count per paragraph: divides word count by the total number of words in document
    tf = dtm / np.sum(dtm, axis=1, keepdims=True)
    
    # Divide each normalized word count by the log of the document frequency of the word
    tfidf = tf / (1 + np.log(df))
    
    # The total number of words in the documents represented by dtm
    total_words = np.sum(dtm)
    print("The total number of words in the documents represented by dtm:", total_words)
    
    # The number of documents and the number of unique words
    num_docs, num_unique_words = dtm.shape
    print("\nThe number of documents:", num_docs)
    print("\nThe number of unique words:", num_unique_words)
    
    # The most frequent top N words in this document
    # ** Changed from top 10 to top N words by passing N as an argument **
    ind_words_dtm = np.argsort(np.sum(dtm, axis=0))[::-1]
    topN_words = [words[i] for i in ind_words_dtm[:N]]
    print("\nThe most frequent top N words in this document:", topN_words)
    
    # The top-N words that show in most of the documents
    # ** Changed from top 5 to top N words by passing N as an argument **
    ind_words_df = np.argsort(df)[::-1]
    print("\nThe top-N words that show in most of the documents:")
    for i in ind_words_df[:N]:
        print(f"{words[i]}: {df[i]}")
    
    # The longest document in terms of the number of words
    num_words_each_doc = np.argsort(np.sum(dtm, axis=1))[::-1]
    print("\nThe longest document in terms of the number of words:\n", docs[num_words_each_doc[0]])
    
    # The top-N words with the largest tfidf values in the longest document (show words and values)
    # ** Changed from top 5 to top N words by passing N as an argument **
    ind_words_tfidf = np.argsort(tfidf[num_words_each_doc[0]])[::-1]
    print("\nThe top-N words with the largest tfidf values in the longest document:")
    for i in ind_words_tfidf[:N]:
        print(f"{words[i]}: {tfidf[num_words_each_doc[0]][i]}")
    
    # Documents that contain the word "intelligence" (works for any number of matches)
    doc_ids = np.where(dtm[:, words == 'intelligence'].sum(axis=1) > 0)[0]
    print("\nDocuments that contain 'intelligence' word:")
    print("Document No.:", doc_ids)
    print(docs[doc_ids])
    
    return None

# We additionally specify N number in a code to get top-N words
analyze_dtm_forQ5(dtm, words, docs, N=200)



{'hello': 1, "it's": 1, 'a': 1, 'helloooo': 1, 'world': 1}


Merge: #1
Pair: ('o', 'o')
Vocab: {'h e l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'h e l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a']
Merge: #2
Pair: ('h', 'e')
Vocab: {'he l l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'he l l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo']
Merge: #3
Pair: ('he', 'l')
Vocab: {'hel l o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hel l oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he']
Merge: #4
Pair: ('hel', 'l')
Vocab: {'hell o': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'hell oo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he', 'hel']
Merge: #5
Pair: ('hell', 'o')
Vocab: {'hello': 1, 'w o r l d': 2, "i t ' s": 1, 'a': 1, 'helloo oo': 1}
Subwords: ['h', 'e', 'l', 'o', 'w', 'r', 'd', 'i', 't', "'", 's', 'a', 'oo', 'he', 'h

NameError: name 'analyze_dtm_forQ5' is not defined