# Homework Lab 2: Text Preprocessing with Vietnamese
**Overview:** In this exercise, we will build a text preprocessing program for Vietnamese.

Import the necessary libraries. We use the underthesea library for Vietnamese word tokenization. To install it, uncomment and run the `pip` command in the next cell, or follow the instructions on the [project page](https://github.com/undertheseanlp/underthesea).

In [312]:
# !pip install underthesea

In [313]:
import glob
import os
import codecs
import sys
import re
from underthesea import word_tokenize

## Question 1: Create a Corpus and Survey the Data

The data in this section is partially extracted from the [VNTC](https://github.com/duyvuleo/VNTC) dataset. VNTC is a Vietnamese news dataset covering various topics. In this section, we will only process the science topic from VNTC. We will create a corpus from both the train and test directories. Complete the following program:

- Write `sentences_list` to a file named `dataset_name + '.txt'`, with each element (one document) on a separate line.
- Check how many documents are in the corpus.
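
As a rough sketch of the two tasks above, the snippet below reads UTF-16 files, collapses every run of whitespace (including stray line breaks) into a single space so that each document stays on one line, and then counts the lines written. The helper names `read_document` and `write_corpus` and the two sample files are made up for illustration; they are not part of the exercise template.

```python
import tempfile
from pathlib import Path


def read_document(file_path, encoding="utf-16"):
    """Read one file and collapse all whitespace (including newlines) to single spaces."""
    text = Path(file_path).read_text(encoding=encoding)
    return " ".join(text.split())


def write_corpus(documents, out_path):
    """Write one document per line, encoded as UTF-8."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for doc in documents:
            out_file.write(doc + "\n")


# Tiny self-contained demo: two made-up documents in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    (tmp / "a.txt").write_text("Khoa học\r\nvũ trụ", encoding="utf-16")
    (tmp / "b.txt").write_text("Vệ tinh\nnghiên cứu", encoding="utf-16")

    docs = [read_document(p) for p in sorted(tmp.glob("*.txt"))]
    corpus_path = tmp / "demo_corpus.txt"
    write_corpus(docs, corpus_path)

    # The number of lines in the corpus file should equal the number of documents.
    with open(corpus_path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)

print(len(docs), line_count)
```

If the two numbers differ, some document probably still contains a line break.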


In [314]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [315]:
dataset_name = "VNTC_khoahoc"
path = ['/content/drive/MyDrive/VNTC_khoahoc/Train_Full/', '/content/drive/MyDrive/VNTC_khoahoc/Test_Full/']

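train_labels = os.listdir(path[0])
test_labels = os.listdir(path[1])
# os.listdir() returns entries in arbitrary order, so compare sorted copies
if sorted(train_labels) == sorted(test_labels):
    print("train labels = test labels")
else:
    print("train labels differ from test labels")
# Define folder_list in both cases so the loop below never hits a NameError
folder_list = [train_labels, test_labels]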

doc_num = 0
sentences_list = []
meta_data_list = []
for i in range(2):
    for folder_name in folder_list[i]:
        folder_path = path[i] + folder_name
        if folder_name[0] != ".":
            for file_name in glob.glob(os.path.join(folder_path, '*.txt')):
                # Read the raw bytes and decode them from UTF-16;
                # the with-block closes the file (the original `f.close` never called close)
                with open(file_name, 'rb') as f:
                    file_content = f.read().decode("utf-16").replace("\r\n", " ")
                sentences_list.append(file_content.strip())
                # Count the number of documents
                doc_num += 1

#### YOUR CODE HERE ####

# Write sentences_list to a file (dataset_name.txt)
with open(f'{dataset_name}.txt', 'w', encoding = 'utf-8') as out_file:
  for sentence in sentences_list:
    out_file.write(f"{sentence}\n")

# Number of documents in the corpus
print(f"Number of documents in the corpus: {doc_num}")
#### END YOUR CODE #####

train labels = test labels
Number of documents in the corpus: 3916


## Question 2: Write Preprocessing Functions







### Question 2.1: Write a Function to Clean Text
Hint:
- The text should only retain the following characters: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?\'\
- Then trim the whitespace in the input text.
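
One thing to watch for with a character whitelist like this: it lists precomposed (NFC) Vietnamese letters, so text stored in decomposed form (NFD) would lose its combining marks during cleaning. Whether the VNTC files are already in NFC is not stated here, so treat the following as a precaution. The sketch uses a deliberately reduced allowed set (`ALLOWED`, an illustrative name) instead of the full list above.

```python
import re
import unicodedata

# Reduced allowed set for illustration only; the exercise uses the full list above.
ALLOWED = "a-zA-Z0-9ệ"
pattern = re.compile(f"[^{ALLOWED}]")

word_nfc = unicodedata.normalize("NFC", "việt")  # 'ệ' is one code point
word_nfd = unicodedata.normalize("NFD", "việt")  # 'e' plus two combining marks

cleaned_nfc = pattern.sub(" ", word_nfc)
cleaned_nfd = pattern.sub(" ", word_nfd)
# Normalizing to NFC before cleaning keeps the letter intact.
cleaned_fixed = pattern.sub(" ", unicodedata.normalize("NFC", word_nfd))

print(repr(cleaned_nfc))
print(repr(cleaned_nfd))
print(repr(cleaned_fixed))
```

In the decomposed case the two combining marks are each replaced by a space, which silently turns one word into two fragments.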

In [316]:
def clean_str(string):
    #### YOUR CODE HERE ####
    # define the allowed characters (used as the body of a regex character class)
    allowed_chars = r"aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?'\""

    # replace every character outside the allowed set with a space
    # (do not re.escape() allowed_chars: that would break the 0-9 range)
    cleaned_text = re.sub(f"[^{allowed_chars}]", ' ', string)

    # trim the whitespace in the input text
    return cleaned_text.strip()
    #### END YOUR CODE #####

### Question 2.2: Write a Function to Convert Text to Lowercase

In [317]:
# make all text lowercase
def text_lowercase(string):

    #### YOUR CODE HERE ####

    # convert the entire string to lowercase
    text = string.lower()

    return text
    #### END YOUR CODE #####

### Question 2.3: Tokenize Words
Hint: Use the `word_tokenize()` function imported above with two parameters: `strings` and `format="text"`.
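
To make the expected result concrete without needing underthesea installed, here is a small stand-in that mimics the shape of the `format="text"` output as it appears in the sample output of Question 3: one string in which the syllables of a multi-syllable word are joined by underscores. The function `to_text_format` and the hand-written token list are hypothetical; this is not underthesea's implementation.

```python
def to_text_format(tokens):
    """Join tokens into one string, marking multi-syllable words with underscores."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Hand-written segmentation for illustration, not actual tokenizer output.
tokens = ["trung quốc", "phóng", "vệ tinh", "nghiên cứu"]
text = to_text_format(tokens)
print(text)  # trung_quốc phóng vệ_tinh nghiên_cứu
```

Because the words end up separated by single spaces, later steps can recover them with a plain `split()`.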


In [318]:
def tokenize(strings):
    #### YOUR CODE HERE ####

    # tokenizing the text with format = "text"
    tokenized_text = word_tokenize(strings, format="text")

    return tokenized_text
    #### END YOUR CODE #####

### Question 2.4: Remove Stop Words
To remove stop words, we use a list of Vietnamese stop words stored in the file `./vietnamese-stopwords.txt`. Complete the following program:
- Check each word in the text (`strings`). If a word is not in the stop words list, add it to `doc_words`.
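
The filtering step can be sketched as below. Since the tokenized text joins multi-syllable words with underscores, the sketch also rewrites multi-word stop-word entries in the same way. This assumes the stop-word file may list such entries with spaces, which is not confirmed here. `load_stopwords`, `filter_stopwords`, and the sample entries are illustrative names and data.

```python
def load_stopwords(lines):
    """Build a stop-word set, joining multi-syllable entries with underscores."""
    return {"_".join(line.split()) for line in lines if line.strip()}


def filter_stopwords(text, stop_words):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Made-up entries; a real list would come from the stop-word file.
sample_lines = ["và", "của", "có thể", ""]
stop_words = load_stopwords(sample_lines)

filtered = filter_stopwords("vệ_tinh của trung_quốc có_thể phóng và bay", stop_words)
print(filtered)  # vệ_tinh trung_quốc phóng bay
```

Using a `set` keeps each membership test at roughly constant time, and building it once (rather than inside the filter function) avoids re-reading the file for every document.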


In [319]:
stopwords_file = '/vietnamese-stopwords.txt'
def remove_stopwords(strings):
    #### YOUR CODE HERE ####
    # load the stop words into a set (note: the file is re-read on every call)
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        stop_words = set(f.read().splitlines())

    # keep only the words that are not in the stop-word list
    doc_words = [word for word in strings.split() if word not in stop_words]

    return ' '.join(doc_words)
    #### END YOUR CODE #####

### Question 2.5: Build a Preprocessing Function
Hint: Call the functions `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords` in order, then return the result from the function.
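
Chaining the steps "in order" simply means feeding each function the output of the previous one. The sketch below shows that pattern with three stand-in string functions so that it runs on its own; `run_pipeline` is an illustrative helper, not something the exercise requires.

```python
from functools import reduce


def run_pipeline(text, steps):
    """Apply each step to the output of the previous one, left to right."""
    return reduce(lambda value, step: step(value), steps, text)


# Stand-in steps: trim, lowercase, collapse repeated whitespace.
steps = [str.strip, str.lower, lambda s: " ".join(s.split())]
result = run_pipeline("  Trung Quốc   PHÓNG vệ tinh ", steps)
print(result)  # trung quốc phóng vệ tinh
```

With the real functions the list would be `[clean_str, text_lowercase, tokenize, remove_stopwords]`. The order matters: stop words can only be matched after the text has been lowercased and tokenized.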


In [320]:
def text_preprocessing(strings):
    #### YOUR CODE HERE ####
    text = clean_str(strings)
    text = text_lowercase(text)
    text = tokenize(text)
    text = remove_stopwords(text)

    return text
    #### END YOUR CODE #####

## Question 3: Perform Preprocessing
Now, we will read the corpus from the file created in Question 1. After that, we will call the preprocessing function for each document in the corpus.

Hint: Call the `text_preprocessing()` function with `doc_content` as the input parameter and save the result in the variable `temp1`.
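
A minimal sketch of this loop, assuming a one-document-per-line UTF-8 file, is shown below. It takes the preprocessing function as a parameter so that it can be tried with a stand-in (`str.lower`), and it skips blank lines so they do not turn into empty documents. `preprocess_corpus` and the demo file are illustrative only.

```python
import tempfile
from pathlib import Path


def preprocess_corpus(corpus_path, preprocess):
    """Apply `preprocess` to every non-empty line of a one-document-per-line file."""
    results = []
    with open(corpus_path, encoding="utf-8") as corpus_file:
        for line in corpus_file:
            doc_content = line.strip()
            if doc_content:  # skip blank lines
                results.append(preprocess(doc_content))
    return results


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("Tài liệu MỘT\n\nTài liệu HAI\n", encoding="utf-8")
    demo_docs = preprocess_corpus(demo_path, str.lower)

print(len(demo_docs))
print(demo_docs[0])
```

Iterating over the file object reads one line at a time, so the whole corpus never has to be held in memory as raw text.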


In [321]:
#### YOUR CODE HERE ####
clean_docs = []
file_path = f'{dataset_name}.txt'

with open(file_path, 'r', encoding='utf-8') as file:
  for line in file:
    # remove unnecessary spaces
    doc_content = line.strip()
    # call preprocessing function for each document in the corpus
    temp1 = text_preprocessing(doc_content)
    # append to the clean_docs list
    clean_docs.append(temp1)

#### END YOUR CODE #####
print("\nlength of clean_docs = ", len(clean_docs))
print('clean_docs[0]:\n' + clean_docs[0])


length of clean_docs =  3916
clean_docs[0]:
trung_quốc phóng_vệ_tinh nghiên_cứu khoa_học 27 4 , trung_quốc phóng thành_công vệ_tinh nghiên_cứu quỹ_đạo , phóng_vệ_tinh đầu_tiên trung_quốc vệ_tinh nghiên_cứu khoa_học trung_quốc đồng_thời , thành_công món quà nhân_dịp kỷ_niệm 50 chương_trình phi_hành_vũ_trụ trung_quốc 6 48 phút 27 4 , trung_quốc phóng vệ_tinh nghiên_cứu long march 4 b tên_lửa trung_tâm phóng_vệ tinh_taiyuan quỹ đạo_vệ_tinh nghiên_cứu 2,7 sử_dụng thí_nghiệm khoa_học , khảo_sát tài_nguyên đất , đánh_giá cây_trồng ngăn_ngừa , thảm_họa trái_đất quan_chức viện nghiên_cứu công_nghệ phi_hành_vũ_trụ thượng_hải trung_quốc thành_công khởi_đầu phóng_vệ_tinh đầu_tiên đồng_thời , món quà tặng lễ kỷ_niệm 50 chương_trình phi_hành_vũ_trụ trung_quốc tiết_lộ , , trung_quốc phóng vệ_tinh liên_lạc vệ_tinh_thí_nghiệm khoa_học trung_quốc liên_tục phóng vệ_tinh_thám_hiểm_vũ_trụ 2003 trung_quốc thành_công đi vũ_trụ


## Question 4: Save Preprocessed Data
Hint: Save the preprocessed data to a file named `dataset_name + '.clean.txt'`, where each document is written on a separate line.
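
One way to double-check the saved file is a round trip: write the documents, read them back, and compare. The sketch below also refuses to write a document that contains a line break, since that would split it across two lines. `save_documents` and `load_documents` are hypothetical helper names and the two sample documents are made up.

```python
import tempfile
from pathlib import Path


def save_documents(documents, out_path):
    """Write one document per line; reject documents that contain a line break."""
    for index, doc in enumerate(documents):
        if "\n" in doc or "\r" in doc:
            raise ValueError(f"document {index} contains a line break")
    with open(out_path, "w", encoding="utf-8") as out_file:
        for doc in documents:
            out_file.write(doc + "\n")


def load_documents(in_path):
    """Read the file back as a list of documents, one per line."""
    with open(in_path, encoding="utf-8") as in_file:
        return [line.rstrip("\n") for line in in_file]


sample_docs = ["vệ_tinh nghiên_cứu", "khoa_học 2003"]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "demo.clean.txt"
    save_documents(sample_docs, out_path)
    restored = load_documents(out_path)

print(restored == sample_docs)
```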


In [322]:
#### YOUR CODE HERE ####
# define the result name
result = f"{dataset_name}.clean.txt"

# open the file and write each document on a separate line
with open(result, 'w', encoding='utf-8') as f:
  for doc in clean_docs:
    f.write(doc + '\n')

# report the name of the output file
print(f"Preprocessed data: {result}")
#### END YOUR CODE #####

Preprocessed data: VNTC_khoahoc.clean.txt
