## Instruction

> 1. Rename Assignment-01-###.ipynb where ### is your student ID.
> 2. The deadline for Assignment-01 is 23:59, 03-31-2024.
>
> 3. In this assignment, you will
>    1) explore Wikipedia text data
>    2) build language models
>    3) build NB and LR classifiers

## Task0 - Download datasets
> Download the preprocessed data, enwiki-train.json and enwiki-test.json, from the Assignment-01 folder. In each data file, every line contains one Wikipedia page with the attributes title, label, and text. There are 1000 records in the train file and 100 records in the test file, spread over ten categories.

## Task1 - Data exploring and preprocessing

> 1) Print out how many documents are in each class  (for both train and test dataset)

In [None]:
import json
from typing import Callable

################################################################ 
###         define the function we need for later use        ###
################################################################

def load_json(file_path: str) -> list:
    """
    Fetch the data from `.json` file and concat them into a list.

    Input:
    - file_path: The relative file path of the `.json` file

    Returns:
    - join_data_list: A list containing the data, with the format of [{'title':<>, 'label':<>, 'text':<>}, {}, ...]
    """
    join_data_list = []
    with open(file_path, "r") as json_file:
        for line in json_file:
            line = line.strip()
            # make sure the line is not empty
            if line: 
                join_data_list.append(json.loads(line))
    return join_data_list

def iterate_line_in_list(data_list: list, f: Callable) -> dict:
    """
    Iterate over `data_list` and accumulate a per-class total.

    Input:
    - data_list: A list containing (train/test) data, with the format of [{'title':<>, 'label':<>, 'text':<>}, {}, ...]
    - f: A function mapping the text of a document to a number (e.g. 1 per document, its sentence count, or its token count)

    Output:
    - class_dict: A dictionary with (key, value) as (<class>, <sum of f(text) over the documents in that class>)
    """
    class_dict = {}
    for line in data_list:
        line_class = line['label']
        class_dict[line_class] = class_dict.get(line_class, 0) + f(line['text'])  # if the class doesn't exist, set the value as 0
    return class_dict

################################################################ 
###                        end define                        ###
################################################################

def count_docs(text):
    return 1

def print_docs_in_class(class_dict: dict, type: str = "train") -> None:
    print("The number of documents in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are {:>3} documents in class {:>10}".format(_times, _class))
    print('-'*50)


# Fetch data from the json file
    
train_file_path, test_file_path = "enwiki-train.json", "enwiki-test.json"
train_data_list, test_data_list = map(load_json, [train_file_path, test_file_path])

# print out the number of documents of each class in train and test dataset

train_docs_num = iterate_line_in_list(train_data_list, count_docs)
test_docs_num = iterate_line_in_list(test_data_list, count_docs)

print_docs_in_class(train_docs_num)
print_docs_in_class(test_docs_num, "test")

> 2) Print out the average number of sentences in each class.
>    You may need to use sentence tokenization of NLTK.
>    (for both train and test dataset)


In [None]:
from nltk.tokenize import sent_tokenize

def count_sents(text):
    return len(sent_tokenize(text))

def print_ave_sents_in_class(class_dict: dict, type: str = "train"):
    # get the dict of number of documents in each class based on the input type
    if type == "train":
        docs_num_class = train_docs_num
    elif type == "test":
        docs_num_class = test_docs_num
    else:
        raise ValueError("type must be 'train' or 'test', got {!r}".format(type))
    
    # print the result
    print("The average number of sentences in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are on average {:>7.2f} sentences in class {:>10}".format(_times / docs_num_class[_class], _class))
    print('-'*50)


train_ave_sents = iterate_line_in_list(train_data_list, count_sents)
test_ave_sents = iterate_line_in_list(test_data_list, count_sents)

print_ave_sents_in_class(train_ave_sents)
print_ave_sents_in_class(test_ave_sents, "test")


> 3) Print out the average number of tokens in each class
>    (for both train and test dataset)

In [None]:
from nltk.tokenize import word_tokenize

def count_tokens(text):
    return len(word_tokenize(text))

def print_ave_tokens_in_class(class_dict: dict, type: str = "train"):
    # get the dict of number of documents in each class based on the input type
    if type == "train":
        docs_num_class = train_docs_num
    elif type == "test":
        docs_num_class = test_docs_num
    else:
        raise ValueError("type must be 'train' or 'test', got {!r}".format(type))
    
    # print the result
    print("The average number of tokens in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are on average {:>8.2f} tokens in class {:>10}".format(_times / docs_num_class[_class], _class))
    print('-'*50)

train_ave_tokens = iterate_line_in_list(train_data_list, count_tokens)
test_ave_tokens = iterate_line_in_list(test_data_list, count_tokens)

print_ave_tokens_in_class(train_ave_tokens)
print_ave_tokens_in_class(test_ave_tokens, "test")

> 4) For each sentence in the document, remove punctuation and other special characters so that each sentence contains only English words and numbers. To make your life easier, you can convert all words to lower case. For each class, print out the first article's name and its first 40 processed words. (for both train and test dataset)

In [None]:
import re
from copy import deepcopy

def clean_doc(document: str) -> list:
    document = document.lower()
    cleaned_document = []
    sentences = sent_tokenize(document)
    for sentence in sentences:
        # remove punctuations and special characters
        sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)
        # remove extra whitespaces
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        cleaned_document.append(sentence)
    return cleaned_document

def process_data_list(data_list: list, type: str = "train") -> list:
    explored = []
    print("The result of the " + type + " data list:")
    # process the data_list
    for line in data_list:
        class_label = line["label"]
        former_line_text = line["text"]                         # former text
        line["sentences"] = clean_doc(line["text"])             # cleaned sentences list
        line["text"] = ". ".join(line["sentences"]) + "."       # join the sentence list with ". " to generate the processed text
        if class_label not in explored:
            explored.append(class_label)
            # print the result
            print()
            print("The first article's name of class {:>10} is {:>20}".format(class_label, line["title"]))
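            # show the first 40 words (not the first 40 characters) before and after cleaning
            print("The cleaned text is: [{}] ==> [{}]".format(" ".join(former_line_text.split()[:40]), " ".join(" ".join(line["sentences"]).split()[:40])))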
    print("-"*120)
    return data_list

# make a deep copy of the original data list to avoid overwriting it
cleaned_train_data_list = deepcopy(train_data_list)
cleaned_test_data_list = deepcopy(test_data_list)

# process the copied data list in place with `process_data_list`
cleaned_train_data_list = process_data_list(cleaned_train_data_list)
cleaned_test_data_list = process_data_list(cleaned_test_data_list, "test")

## Task2 - Build language models

> 1) Based on the training dataset, build unigram, bigram, and trigram language models using Add-one smoothing technique. It is encouraged to implement models by yourself. If you use public code, please cite it.
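
As a minimal sketch of add-one smoothing, independent of the `NGramModels` class, the following estimates a bigram probability as (count(w1, w2) + 1) / (count(w1) + V) on a toy corpus, where V is the vocabulary size. The function name `addone_bigram_prob` and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter

def addone_bigram_prob(sentences, w1, w2):
    """P(w2 | w1) with add-one smoothing on whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        # pad each sentence with start and end markers
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    # unseen bigrams have count 0 and still receive non-zero probability
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

toy = ["the cat sat", "the dog sat"]   # hypothetical toy corpus
seen = addone_bigram_prob(toy, "the", "cat")
unseen = addone_bigram_prob(toy, "the", "sat")
print(seen, unseen)
```

The unseen bigram ("the", "sat") receives a smaller but non-zero probability, which is what keeps the perplexity finite on test text.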


In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from itertools import product
import math

class NGramModels(object):
    def __init__(self, n, laplace=1) -> None:
        self.n = n
        self.laplace = laplace
        self._model = None
        self._tokens = None
        self._vocab = None
        self._masks = list(reversed(list(product((0,1), repeat=n))))
    
    def _preprocess(self, sentences: list) -> list:
        """
        Preprocess the raw text by adding (n-1)*"<s>" (or one single <s>) on the front of the sentence 
        and replacing the tokens which occur only once with "<UNK>".

        Input:
        - sentences: A list with each element as a `sent_tokenized` sentence.

        Return:
        - tokens: A list containing the processed tokens
        """
        sos = "<s> " * (self.n - 1) if self.n > 1 else "<s> "
        tokenized_sentences = ['{}{} {}'.format(sos, sent, "</s>").split() for sent in sentences]
        tokenized_sentences = [word for sublist in tokenized_sentences for word in sublist]  # flatten
        # Replace tokens which appear only once in the corpus with <UNK>
        vocab = nltk.FreqDist(tokenized_sentences)
        tokens = [token if vocab[token] > 1 else "<UNK>" for token in tokenized_sentences]
        return tokens
    
    def _smooth(self) -> dict:
        """
        Estimate n-gram probabilities with Laplace (add-k) smoothing.

        Return:
        - A dictionary {<ngram>: <probability>, ...} holding the smoothed probability of each observed n-gram
        """
        vocab_size = len(self._vocab)
        if self.n == 1:         # unigram: (count + k) / (N + k * V), where N is the total number of tokens
            total_tokens = len(self._tokens)
            return {(unigram,): (count + self.laplace) / (total_tokens + self.laplace * vocab_size) for unigram, count in self._vocab.items()}
        else:
            n_grams = nltk.ngrams(self._tokens, self.n)
            n_vocab = nltk.FreqDist(n_grams)
            n_minus_one_grams = nltk.ngrams(self._tokens, self.n-1)
            n_minus_one_vocab = nltk.FreqDist(n_minus_one_grams)
            return {ngram: (n_freq + self.laplace) / (n_minus_one_vocab[ngram[:-1]] + self.laplace * vocab_size) for ngram, n_freq in n_vocab.items()}

    def train(self, sentences: list) -> None:
        """
        Train the model based on the given raw text.

        Input:
        - sentences: A list with each element as a `sent_tokenized` sentence.
        """
        tokens = self._preprocess(sentences)
        self._tokens = tokens
        self._vocab = nltk.FreqDist(self._tokens)
        self._model = self._smooth()

    def _find_match(self, ngram: tuple) -> tuple:
        """
        Find the closest n-gram known to the trained model by progressively masking tokens of `ngram` with "<UNK>".

        Input: 
        - ngram: A tuple representing a test n-gram

        Return:
        - tokens: The first masked variant of `ngram` found in the trained model (None if no variant is found)
        """
        mask = lambda ngram, bitmask: tuple((token if flag == 1 else "<UNK>" for token, flag in zip(ngram, bitmask)))
        possible_tokens = [mask(ngram, bitmask) for bitmask in self._masks]
        for tokens in possible_tokens:
            if tokens in self._model:
                return tokens

    def perplexity(self, test_sentences: list) -> float:
        """
        Compute the perplexity of the given `test_sentences` based on the train tokens.

        Input:
        - test_sentences: A list containing the test sentences
        
        Return:
        - perplexity: The perplexity of the test material, i.e. the exponential of the negative
                      average log probability per test token.
        """
        test_tokens = self._preprocess(test_sentences)
        test_ngrams = nltk.ngrams(test_tokens, self.n)
        known_ngrams  = (self._find_match((ngram,)) if isinstance(ngram, str) else self._find_match(ngram) for ngram in test_ngrams)
        probabilities = [self._model[ngram] for ngram in known_ngrams]

        return math.exp((-1 / len(test_tokens)) * sum(map(math.log, probabilities)))

    def _best_candidate(self, prev: tuple, i: int, blacklist: list = None) -> tuple:
        """
        Find the best candidate from the trained model based on the previous text and blacklist.

        Input:
        - prev: A tuple containing the information of the previous text 
        - i: current index
        - blacklist: A list of values that can't be taken

        Return:
        - candidate: A tuple with the format (<candidate_token>, <prob>)
        """
        # copy instead of mutating the caller's list, and avoid a shared mutable default argument
        blacklist = list(blacklist or []) + ["<UNK>"]
        candidates = [(ngram[-1], prob) for ngram, prob in self._model.items() if ngram[:-1] == prev] # find the candidates based on the trained model
        candidates = [candidate for candidate in candidates if candidate[0] not in blacklist]         # filter out the candidates in the blacklist
        candidates = sorted(candidates, key=lambda candidate: candidate[1], reverse=True)             # sort the candidates by probability
        if len(candidates) == 0:
            return ("</s>", 1)
        # at the start of a sentence take the i-th best candidate so that the generated sentences differ;
        # clamp the index so it never runs past the end of the candidate list
        index = 0 if prev != () and prev[-1] != "<s>" else min(i, len(candidates) - 1)
        return candidates[index]

    def generate(self, num: int, min_len: int=12, max_len: int=24):
        """
        Generate sentences based on the trained model for a given number of sentences, minimum length and maximum length

        Input:
        - num: The number of sentences we need to generate
        - min_len: The minimum length of the generated sentence
        - max_len: The maximum length of the generated sentence

        Return (Yield):
        - The generated sentence one by one
        """
        for i in range(num):
            sent, prob = ["<s>"] * max(1, self.n - 1), 1
            while sent[-1] != "</s>":
                prev = () if self.n == 1 else tuple(sent[-(self.n-1):])
                blacklist = sent + (["</s>"] if len(sent) < min_len else [])
                next_token, next_prob = self._best_candidate(prev, i, blacklist)
                sent.append(next_token)
                prob *= next_prob
                
                if len(sent) >= max_len:
                    sent.append("</s>")

            yield ' '.join(sent), -1/math.log(prob)

# generate the train sentences from `cleaned_train_data_list`
train_sentences = []
for each in cleaned_train_data_list:
    train_sentences.extend(each["sentences"]) 

# unigram language model
unigramModel = NGramModels(1)
unigramModel.train(train_sentences)
print("The unigram language model has been successfully built!")
# bigram language model
bigramModel = NGramModels(2)
bigramModel.train(train_sentences)
print("The bigram language model has been successfully built!")
# trigram language model
trigramModel = NGramModels(3)
trigramModel.train(train_sentences)
print("The trigram language model has been successfully built!")

            

> 2) Report the perplexity of these 3 trained models on the testing dataset and explain your findings. 
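
For reference, perplexity over N predicted tokens is exp(-(1/N) * sum(log p_i)). The sketch below computes it from a plain list of probabilities. The helper name `perplexity_from_probs` and the toy inputs are assumptions for illustration and are separate from the `NGramModels.perplexity` method.

```python
import math

def perplexity_from_probs(probabilities):
    """Exponential of the negative mean log-probability of the predicted tokens."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

# a model that always assigns probability 1/4 behaves like a uniform choice among 4 options
uniform_pp = perplexity_from_probs([0.25] * 8)
# higher token probabilities give lower perplexity
confident_pp = perplexity_from_probs([0.5, 0.5])
unsure_pp = perplexity_from_probs([0.1, 0.1])
print(uniform_pp, confident_pp, unsure_pp)
```

Lower perplexity means the model found the test text less surprising, which is the quantity to compare across the three n-gram orders.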

In [None]:
# generate the test sentences from `cleaned_test_data_list`
test_sentences = []
for each in cleaned_test_data_list:
    test_sentences.extend(each["sentences"]) 

# compute the perplexity of the test dataset
u_perp = unigramModel.perplexity(test_sentences)
b_perp = bigramModel.perplexity(test_sentences)
t_perp = trigramModel.perplexity(test_sentences)

print("The perplexity of the testing dataset in unigram language model is {:>7.2f}".format(round(u_perp, 2)))
print("The perplexity of the testing dataset in bigram language model is {:>7.2f}".format(round(b_perp, 2)))
print("The perplexity of the testing dataset in trigram language model is {:>7.2f}".format(round(t_perp, 2)))


> 3) Use each built model to generate five sentences and explain these generated patterns.


In [None]:
num_sentence = 5
for i, model in enumerate([unigramModel, bigramModel, trigramModel]):
    print("-" * 50)
    print("For the {}-gram language model:".format(i+1))
    for sentence, prob in model.generate(num_sentence):
        print("{} ({:.5f})".format(sentence, prob))



## Task3 - Build NB/LR classifiers

> 1) Build a Naive Bayes classifier (with Laplace smoothing) and test your model on test dataset
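
One possible starting point is a multinomial Naive Bayes model over token counts, sketched below in plain Python. This is a sketch under stated assumptions rather than a reference solution: the function names `train_nb` and `predict_nb`, the `alpha` parameter, and the toy documents and labels are all hypothetical, and out-of-vocabulary test tokens are simply skipped.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns (log priors, log likelihoods, vocabulary)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Laplace smoothing: add alpha to every word count in the class
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior score."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# hypothetical toy data, not taken from the enwiki files
toy_docs = [["goal", "match", "team"], ["film", "actor", "scene"], ["team", "coach"]]
toy_labels = ["Sports", "Film", "Sports"]
nb_model = train_nb(toy_docs, toy_labels)
pred_sports = predict_nb(nb_model, ["team", "match", "unknownword"])
pred_film = predict_nb(nb_model, ["actor", "film"])
print(pred_sports, pred_film)
```

Working in log space avoids numerical underflow when the documents contain thousands of tokens.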

In [None]:
# Your code goes to here




> 2) Build an LR classifier. This question may be challenging: features are not provided for the samples, so use your own method to build useful features. You may need to split the training dataset into train and validation sets so that the relevant hyperparameters can be tuned. 
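
As one hedged illustration of building features and fitting a multi-class LR model, the sketch below uses bag-of-words counts and batch gradient descent on softmax regression, written in plain Python. The names `featurize`, `train_softmax_lr`, `predict_lr`, the hyperparameters `lr`, `epochs`, `l2`, and the toy vocabulary and documents are assumptions for illustration; other feature choices (for example TF-IDF) are equally valid.

```python
import math

def featurize(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary list."""
    return [tokens.count(w) for w in vocab]

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_softmax_lr(X, y, n_classes, lr=0.1, epochs=200, l2=0.0):
    """Batch gradient descent on multinomial logistic regression with a bias term."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            xb = x + [1.0]                # append the bias input
            probs = softmax([sum(w * v for w, v in zip(W[c], xb)) for c in range(n_classes)])
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(xb):
                    grad[c][j] += err * v
        for c in range(n_classes):
            for j in range(n_features + 1):
                W[c][j] -= lr * (grad[c][j] / len(X) + l2 * W[c][j])
    return W

def predict_lr(W, x):
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(Wc, xb)) for Wc in W]
    return scores.index(max(scores))

# hypothetical toy data with integer class ids 0 and 1
vocab = ["team", "match", "film", "actor"]
docs = [["team", "match"], ["film", "actor"], ["team", "team"], ["actor", "film", "film"]]
labels = [0, 1, 0, 1]
X = [featurize(d, vocab) for d in docs]
W = train_softmax_lr(X, labels, n_classes=2)
print(predict_lr(W, featurize(["match", "team"], vocab)))
```

In practice, values such as `lr`, `epochs`, and `l2` would be chosen by comparing scores on a held-out validation split of the training data.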

In [None]:
# Your code goes to here




> 3) Report the Micro-F1 score and Macro-F1 score for these classifiers on the testing dataset and explain your results.
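
For clarity on the two metrics: Micro-F1 pools true positives, false positives, and false negatives over all classes before computing F1, while Macro-F1 averages the per-class F1 scores with equal weight. A minimal sketch follows; the function name `f1_scores` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1    # predicted class gains a false positive
            fn[t] += 1    # true class gains a false negative
    labels = set(y_true) | set(y_pred)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# hypothetical toy labels
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(micro_f1, macro_f1)
```

For single-label classification, Micro-F1 equals accuracy, so a gap between the two scores indicates that some classes are predicted noticeably worse than others.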

In [None]:
# Your code goes to here


