## Instruction

> 1. Rename the notebook to Assignment-01-###.ipynb, where ### is your student ID.
> 2. The deadline for Assignment-01 is 23:59 on 03-31-2024.
>
> 3. In this assignment, you will
>    1) explore Wikipedia text data
>    2) build language models
>    3) build NB and LR classifiers

## Task0 - Download datasets
> Download the preprocessed data files, enwiki-train.json and enwiki-test.json, from the Assignment-01 folder. In each data file, every line contains one Wikipedia page with three attributes: title, label, and text. There are 1000 records in the train file and 100 records in the test file, covering ten categories.

## Task1 - Data exploring and preprocessing

> 1) Print out how many documents are in each class  (for both train and test dataset)

In [45]:
import json
from typing import Callable

################################################################ 
###         define the function we need for later use        ###
################################################################

def load_json(file_path: str) -> list:
    """
    Read a JSON Lines (`.json`) file and collect its records into a list.

    Input:
    - file_path: The relative file path of the `.json` file

    Returns:
    - join_data_list: A list containing the data, with the format of [{'title':<>, 'label':<>, 'text':<>}, {}, ...]
    """
    join_data_list = []
    with open(file_path, "r") as json_file:
        for line in json_file:
            line = line.strip()
            # skip empty lines
            if line: 
                join_data_list.append(json.loads(line))
    return join_data_list

def iterate_line_in_list(data_list: list, f: Callable) -> dict:
    """
    Iterate over `data_list` and accumulate a per-class total of `f(text)`.

    Input:
    - data_list: A list containing (train/test) data, with the format of [{'title':<>, 'label':<>, 'text':<>}, {}, ...]
    - f: A function that maps the `text` of one record to a count (documents, sentences, tokens, etc.)

    Returns:
    - class_dict: A dict mapping each class label to the summed value of `f` over that class
    """
    class_dict = {}
    for line in data_list:
        line_class = line['label']
        class_dict[line_class] = class_dict.get(line_class, 0) + f(line['text'])  # if the class doesn't exist, set the value as 0
    return class_dict

################################################################ 
###                        end define                        ###
################################################################

def count_docs(text):
    return 1

def print_docs_in_class(class_dict: dict, type: str = "train") -> None:
    print("The number of documents in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are {:>3} documents in class {:>10}".format(_times, _class))
    print('-'*50)


# Fetch data from the json file
    
train_file_path, test_file_path = "enwiki-train.json", "enwiki-test.json"
train_data_list, test_data_list = map(load_json, [train_file_path, test_file_path])

# print out the number of documents of each class in train and test dataset

train_docs_num = iterate_line_in_list(train_data_list, count_docs)
test_docs_num = iterate_line_in_list(test_data_list, count_docs)

print_docs_in_class(train_docs_num), print_docs_in_class(test_docs_num, "test")

The number of documents in each class for train dataset is: 

There are 100 documents in class       Film
There are 100 documents in class       Book
There are 100 documents in class Politician
There are 100 documents in class     Writer
There are 100 documents in class       Food
There are  70 documents in class      Actor
There are  80 documents in class     Animal
There are 130 documents in class   Software
There are 100 documents in class     Artist
There are 120 documents in class    Disease
--------------------------------------------------
The number of documents in each class for test dataset is: 

There are  10 documents in class       Film
There are  10 documents in class       Book
There are  10 documents in class Politician
There are  10 documents in class     Writer
There are  10 documents in class       Food
There are  10 documents in class      Actor
There are  10 documents in class     Animal
There are  10 documents in class   Software
There are  10 documents in class  

(None, None)
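
The per-class document count can also be obtained with `collections.Counter` from the standard library. The following is a minimal sketch, not the notebook's method; `count_docs_per_class` and `toy_records` are hypothetical names, and the toy records only mimic the shape of the assignment data.

```python
from collections import Counter

def count_docs_per_class(data_list):
    """Return a Counter mapping each label to its number of documents."""
    return Counter(record["label"] for record in data_list)

# Hypothetical toy records in the same shape as the assignment data.
toy_records = [
    {"title": "A", "label": "Film", "text": "..."},
    {"title": "B", "label": "Book", "text": "..."},
    {"title": "C", "label": "Film", "text": "..."},
]

toy_counts = count_docs_per_class(toy_records)
for label, n_docs in toy_counts.items():
    print("There are {:>3} documents in class {:>10}".format(n_docs, label))
```

Calling the two print helpers as separate statements, instead of joining them with a comma, would also avoid the trailing `(None, None)` tuple that the notebook echoes.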

> 2) Print out the average number of sentences in each class.
>    You may need to use sentence tokenization of NLTK.
>    (for both train and test dataset)


In [35]:
from nltk.tokenize import sent_tokenize

def count_sents(text):
    return len(sent_tokenize(text))

def print_ave_sents_in_class(class_dict: dict, type: str = "train"):
    # get the dict of number of documents in each class based on the input type
    if type == "train":
        docs_num_class = train_docs_num
    elif type == "test":
        docs_num_class = test_docs_num
    else:
        raise ValueError("type must be 'train' or 'test'")
    
    # print the result
    print("The average number of sentences in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are average {:>7.2f} sentences in class {:>10}".format(_times / docs_num_class[_class], _class))
    print('-'*50)


train_ave_sents = iterate_line_in_list(train_data_list, count_sents)
test_ave_sents = iterate_line_in_list(test_data_list, count_sents)

print_ave_sents_in_class(train_ave_sents), print_ave_sents_in_class(test_ave_sents, "test")


The average number of sentences in each class for train dataset is: 

There are average  438.56 sentences in class       Film
There are average  400.36 sentences in class       Book
There are average  706.20 sentences in class Politician
There are average  420.32 sentences in class     Writer
There are average  175.24 sentences in class       Food
There are average   76.70 sentences in class      Actor
There are average   70.38 sentences in class     Animal
There are average  260.95 sentences in class   Software
There are average  306.47 sentences in class     Artist
There are average  404.90 sentences in class    Disease
--------------------------------------------------
The average number of sentences in each class for test dataset is: 

There are average  364.70 sentences in class       Film
There are average  295.90 sentences in class       Book
There are average  597.60 sentences in class Politician
There are average  294.90 sentences in class     Writer
There are average  107.60 

(None, None)

> 3) Print out the average number of tokens in each class
>    (for both train and test dataset)

In [36]:
from nltk.tokenize import word_tokenize

def count_tokens(text):
    return len(word_tokenize(text))

def print_ave_tokens_in_class(class_dict: dict, type: str = "train"):
    # get the dict of number of documents in each class based on the input type
    if type == "train":
        docs_num_class = train_docs_num
    elif type == "test":
        docs_num_class = test_docs_num
    else:
        raise ValueError("type must be 'train' or 'test'")
    
    # print the result
    print("The average number of tokens in each class for " + type + " dataset is: \n")
    for _class, _times in class_dict.items():
        print("There are average {:>8.2f} tokens in class {:>10}".format(_times / docs_num_class[_class], _class))
    print('-'*50)

train_ave_tokens = iterate_line_in_list(train_data_list, count_tokens)
test_ave_tokens = iterate_line_in_list(test_data_list, count_tokens)

print_ave_tokens_in_class(train_ave_tokens), print_ave_tokens_in_class(test_ave_tokens, "test")

The average number of tokens in each class for train dataset is: 

There are average 11895.28 tokens in class       Film
There are average 10540.51 tokens in class       Book
There are average 18644.30 tokens in class Politician
There are average 11849.91 tokens in class     Writer
There are average  3904.15 tokens in class       Food
There are average  1868.84 tokens in class      Actor
There are average  1521.92 tokens in class     Animal
There are average  6302.30 tokens in class   Software
There are average  8212.91 tokens in class     Artist
There are average  9322.96 tokens in class    Disease
--------------------------------------------------
The average number of tokens in each class for test dataset is: 

There are average  9292.90 tokens in class       Film
There are average  7711.10 tokens in class       Book
There are average 15204.30 tokens in class Politician
There are average  8499.40 tokens in class     Writer
There are average  2445.50 tokens in class       Food
There 

(None, None)
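
The sentence and token averages both follow the same pattern: sum a per-document count over each class, then divide by the number of documents in that class. A sketch that computes both quantities in one pass, without relying on the global `train_docs_num` / `test_docs_num` dictionaries, might look like this. `average_per_class` and `whitespace_token_count` are hypothetical names, and the whitespace splitter is only a stand-in for `word_tokenize` so that the example runs without NLTK.

```python
from collections import defaultdict

def average_per_class(data_list, count_fn):
    """Average count_fn(text) over the documents of each class."""
    totals = defaultdict(int)   # summed counts per label
    n_docs = defaultdict(int)   # number of documents per label
    for record in data_list:
        totals[record["label"]] += count_fn(record["text"])
        n_docs[record["label"]] += 1
    return {label: totals[label] / n_docs[label] for label in totals}

def whitespace_token_count(text):
    # Stand-in for len(word_tokenize(text)); splits on whitespace only.
    return len(text.split())

# Hypothetical toy records.
toy_records = [
    {"label": "Film", "text": "a b c d"},
    {"label": "Film", "text": "a b"},
    {"label": "Book", "text": "x y z"},
]

toy_averages = average_per_class(toy_records, whitespace_token_count)
for label, average in toy_averages.items():
    print("There are average {:>8.2f} tokens in class {:>10}".format(average, label))
```

Passing `count_sents` or `count_tokens` from the cells above as `count_fn` would give the NLTK-based averages.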

> 4) For each sentence in the document, remove punctuation and other special characters so that each sentence contains only English words and numbers. To make your life easier, you can convert all words to lowercase. For each class, print out the first article's name and the first 40 processed words. (for both train and test dataset)

In [51]:
import re
from copy import deepcopy

def clean_doc(document: str) -> str:
    document = document.lower()
    # remove punctuation and special characters (non-ASCII letters such as 'é' are dropped as well)
    cleaned_document = re.sub(r'[^a-zA-Z0-9\s]', '', document)
    # remove extra whitespaces
    cleaned_document = re.sub(r'\s+', ' ', cleaned_document).strip()
    return cleaned_document

def process_data_list(data_list: list, type: str = "train") -> list:
    explored = []
    print("The result of the " + type + " data list:")
    # process the data_list
    for line in data_list:
        class_label = line["label"]
        former_line_text = line["text"]            # former text
        line["text"] = clean_doc(line["text"])     # cleaned text
        if class_label not in explored:
            explored.append(class_label)
            # print the first 40 characters (not words) before and after cleaning
            print()
            print("The first article's name of class {:>10} is {:>20}".format(class_label, line["title"]))
            print("The cleaned text is: [{}] ==> [{}]".format(former_line_text[:40], line["text"][:40]))
    print("-"*120)
    return data_list

# make a deep copy of the original data lists to avoid overwriting them
cleaned_train_data_list = deepcopy(train_data_list)
cleaned_test_data_list = deepcopy(test_data_list)

# process the copied data lists in place with `process_data_list`
cleaned_train_data_list = process_data_list(cleaned_train_data_list)
cleaned_test_data_list = process_data_list(cleaned_test_data_list, "test")

The result of the train data list:

The first article's name of class       Film is         Citizen_Kane
The cleaned text is: [Citizen Kane is a 1941 American drama fi] ==> [citizen kane is a 1941 american drama fi]

The first article's name of class       Book is The_Spirit_of_the_Age
The cleaned text is: [The Spirit of the Age (full title "The S] ==> [the spirit of the age full title the spi]

The first article's name of class Politician is    Charles_de_Gaulle
The cleaned text is: [Charles André Joseph Marie de Gaulle (; ] ==> [charles andr joseph marie de gaulle 22 n]

The first article's name of class     Writer is        Mircea_Eliade
The cleaned text is: [Mircea Eliade (; – April 22, 1986) was a] ==> [mircea eliade april 22 1986 was a romani]

The first article's name of class       Food is       Korean_cuisine
The cleaned text is: [ 
Korean cuisine has evolved through cen] ==> [korean cuisine has evolved through centu]

The first article's name of class      Actor is       Roma
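
The cell above slices the first 40 characters of each text, whereas the task asks for the first 40 words. One possible way to take a word-level preview is sketched below. `first_n_words` and the sample string are hypothetical; the cleaning step mirrors `clean_doc` but is repeated here so the example is self-contained.

```python
import re

def first_n_words(text, n=40):
    """Lowercase, keep only ASCII letters, digits and whitespace, return the first n words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split()[:n])

# Hypothetical sample string, not taken from the dataset.
sample = "Hello, World! It's 2024 -- time to test: one, two, three."
preview = first_n_words(sample, n=5)
print(preview)
```

In `process_data_list`, the character slice `line["text"][:40]` could be replaced by a call of this kind.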

## Task2 - Build language models

> 1) Based on the training dataset, build unigram, bigram, and trigram language models using the add-one smoothing technique. You are encouraged to implement the models yourself. If you use public code, please cite it.


In [None]:
# Your code goes here
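
With add-one (Laplace) smoothing, the probability of a word `w` after a history `h` is estimated as `(count(h, w) + 1) / (count(h) + |V|)`, where `|V|` is the vocabulary size. The class below is a minimal sketch of that estimate for any order `n`, not a reference solution. The class name, the `<s>` / `</s>` padding markers, and the `<unk>` token for out-of-vocabulary words are all assumptions made for illustration.

```python
from collections import Counter

class AddOneNGram:
    """n-gram language model with add-one (Laplace) smoothing."""

    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()    # counts of (context + word) tuples
        self.context_counts = Counter()  # counts of context tuples
        self.vocab = {"<unk>"}           # assumed placeholder for unseen words

    def _pad(self, tokens):
        # n-1 start markers plus one end marker per sentence.
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def fit(self, sentences):
        for tokens in sentences:
            padded = self._pad(tokens)
            self.vocab.update(padded[self.n - 1:])  # words that can be predicted
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.ngram_counts[context + (padded[i],)] += 1
                self.context_counts[context] += 1
        return self

    def prob(self, word, context=()):
        # Keep only the last n-1 tokens of the context.
        context = tuple(context)[-(self.n - 1):] if self.n > 1 else ()
        word = word if word in self.vocab else "<unk>"
        numerator = self.ngram_counts[context + (word,)] + 1
        denominator = self.context_counts[context] + len(self.vocab)
        return numerator / denominator

# Hypothetical toy corpus: each sentence is a list of tokens.
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigram = AddOneNGram(1).fit(toy_sentences)
bigram = AddOneNGram(2).fit(toy_sentences)
trigram = AddOneNGram(3).fit(toy_sentences)
print(bigram.prob("cat", ("the",)))
```

Because every word in the vocabulary receives one extra count, the smoothed probabilities for a fixed context still sum to one.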




> 2) Report the perplexity of these three trained models on the test dataset and explain your findings.

In [None]:
# Your code goes here
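
Perplexity is the exponential of the negative average log-probability that a model assigns to each predicted token: `PP = exp(-(1/N) * sum(log P(w_i | history)))`. The sketch below computes it for an add-one smoothed bigram model trained on a toy corpus. The function names, the padding markers, and the `<unk>` handling are assumptions for illustration, and the end-of-sentence marker is counted as a predicted token.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect bigram counts, context counts and the vocabulary."""
    bigrams, contexts, vocab = Counter(), Counter(), {"<unk>"}
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded[1:])
        for prev, word in zip(padded, padded[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def bigram_prob(model, prev, word):
    """Add-one smoothed P(word | prev)."""
    bigrams, contexts, vocab = model
    word = word if word in vocab else "<unk>"
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def perplexity(model, sentences):
    """exp of the negative mean log-probability over all predicted tokens."""
    log_prob_sum, n_predictions = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_prob_sum += math.log(bigram_prob(model, prev, word))
            n_predictions += 1
    return math.exp(-log_prob_sum / n_predictions)

# Hypothetical toy corpus.
toy_model = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen_pp = perplexity(toy_model, [["the", "cat", "sat"]])
unseen_pp = perplexity(toy_model, [["dog", "the", "cat"]])
print(seen_pp, unseen_pp)
```

A sentence whose bigrams appear in the training data receives a lower perplexity than one built from unseen bigrams, which is the comparison the task asks you to interpret across the three models.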




> 3) Use each trained model to generate five sentences and explain the patterns you observe in the generated text.


In [None]:
# Your code goes here
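
One common way to generate text from an n-gram model is to start from the start marker and repeatedly sample the next word in proportion to its smoothed probability until the end marker is drawn. The following sketch does this for a bigram model. The function names, the `max_len` cut-off, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Collect successor counts for each previous token."""
    successors = defaultdict(Counter)
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        vocab.update(padded[1:])
        for prev, word in zip(padded, padded[1:]):
            successors[prev][word] += 1
    return successors, sorted(vocab)

def generate_sentence(successors, vocab, rng, max_len=20):
    """Sample tokens from the add-one smoothed bigram distribution until </s>."""
    prev, output = "<s>", []
    for _ in range(max_len):
        # Add-one counts; the shared denominator cancels when sampling by weight.
        weights = [successors[prev][w] + 1 for w in vocab]
        word = rng.choices(vocab, weights=weights, k=1)[0]
        if word == "</s>":
            break
        output.append(word)
        prev = word
    return output

# Hypothetical toy corpus and a seeded generator for repeatable sampling.
successors, vocab = train_bigram_counts([["the", "cat", "sat"], ["the", "dog", "sat"]])
rng = random.Random(0)
generated = [generate_sentence(successors, vocab, rng) for _ in range(5)]
for tokens in generated:
    print(" ".join(tokens))
```

With add-one smoothing every vocabulary word has non-zero probability in every context, so samples from a small corpus can look noisy; higher-order models trained on more data tend to produce locally more fluent text.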




## Task3 - Build NB/LR classifiers

> 1) Build a Naive Bayes classifier (with Laplace smoothing) and test your model on the test dataset.

In [None]:
# Your code goes here
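
A multinomial Naive Bayes classifier scores each class as the log prior plus the sum of log word likelihoods, where Laplace smoothing estimates each likelihood as `(count(word, class) + 1) / (total_words(class) + |V|)`. The sketch below is one possible implementation on toy data, not a reference solution. The function names are hypothetical, and skipping words never seen in training is an assumed design choice.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class names."""
    class_counts = Counter(labels)       # documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for token in tokens:
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
toy_docs = [
    ["film", "director", "cast"],
    ["film", "actor", "scene"],
    ["novel", "author", "chapter"],
    ["novel", "plot", "author"],
]
toy_labels = ["Film", "Film", "Book", "Book"]
toy_nb = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_nb, ["director", "film"]))
```

Working in log space avoids numerical underflow, which matters for documents with thousands of tokens such as the Wikipedia pages in this dataset.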




> 2) Build an LR (logistic regression) classifier. This question is more challenging because features are not provided for the samples; design your own method to build useful features. You may need to split the training dataset into train and validation sets so that hyperparameters can be tuned.

In [None]:
# Your code goes here
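
Because the training classes are not all the same size, a stratified split keeps the class proportions roughly equal in the train and validation parts. The sketch below shows one way to do this with the standard library only. `stratified_split`, the 20% validation fraction, and the seed are assumptions for illustration, and the feature extraction and the classifier itself are not covered here.

```python
import random
from collections import defaultdict

def stratified_split(data_list, valid_fraction=0.2, seed=0):
    """Split records into train/validation parts, class by class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in data_list:
        by_label[record["label"]].append(record)

    train_part, valid_part = [], []
    for label, records in by_label.items():
        rng.shuffle(records)  # shuffles the per-class list, not data_list itself
        # At least one validation record per class (assumed design choice).
        n_valid = max(1, round(len(records) * valid_fraction))
        valid_part.extend(records[:n_valid])
        train_part.extend(records[n_valid:])
    return train_part, valid_part

# Hypothetical toy records: 10 of one class and 5 of another.
toy_records = (
    [{"label": "Film", "text": "film %d" % i} for i in range(10)]
    + [{"label": "Book", "text": "book %d" % i} for i in range(5)]
)
toy_train, toy_valid = stratified_split(toy_records)
print(len(toy_train), len(toy_valid))
```

The validation part can then be used to compare feature choices or regularization strengths before the final evaluation on the test set.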




> 3) Report the Micro-F1 and Macro-F1 scores for these classifiers on the test dataset and explain your results.

In [None]:
# Your code goes here
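
Micro-F1 pools the true positives, false positives, and false negatives of all classes before computing F1, while Macro-F1 computes F1 per class and takes the unweighted mean. The following sketch computes both from two label lists. `f1_scores` and the toy labels are hypothetical, and a class with no true or predicted instances is assigned an F1 of 0 by assumption.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# Hypothetical toy labels.
toy_true = ["Film", "Film", "Book", "Book", "Food"]
toy_pred = ["Film", "Book", "Book", "Book", "Film"]
toy_micro, toy_macro = f1_scores(toy_true, toy_pred)
print("Micro-F1: {:.4f}, Macro-F1: {:.4f}".format(toy_micro, toy_macro))
```

When every document has exactly one true label and one predicted label, Micro-F1 equals accuracy, so a gap between the two scores indicates that some classes are classified much worse than others.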


