## **0. Download the dataset**
**Note:** If you cannot use the gdown command to download the dataset because the download quota has been exceeded, download the dataset manually and upload it to your own Google Drive. Then use the commands below to copy the data file into Colab:
```python
from google.colab import drive

drive.mount('/content/drive')
!cp /path/to/dataset/on/your/drive .
```

In [None]:
# https://drive.google.com/file/d/1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R/view?usp=sharing
!gdown 1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R

## **1. Import the required libraries**

In [None]:
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

## **2. Read the dataset**

In [None]:
# DATASET_PATH = '/content/2cls_spam_text_cls.csv'
DATASET_PATH = '2cls_spam_text_cls.csv'
df = pd.read_csv(DATASET_PATH)
df

In [None]:
messages = df['Message'].values.tolist()
labels = df['Category'].values.tolist()

In [None]:
messages

## **3. Prepare the dataset**

### **3.1. Process the label data**

In [None]:
le = LabelEncoder()
y = le.fit_transform(labels)
print(f'Classes: {le.classes_}')
print(f'Encoded labels: {y}')
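
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping `LabelEncoder` produces: the unique class names are sorted and each label is replaced by the position of its class. The label values `'ham'` and `'spam'` and the helper names `fit_classes`, `encode`, and `decode` are assumptions for illustration; the real classes are whatever appears in the `Category` column.

```python
def fit_classes(labels):
    """Return the sorted unique class names (mirrors LabelEncoder.classes_)."""
    return sorted(set(labels))

def encode(labels, classes):
    """Replace each label with the index of its class name."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]

def decode(codes, classes):
    """Map integer codes back to class names (mirrors inverse_transform)."""
    return [classes[code] for code in codes]

# Hypothetical label values, used only for this illustration
sample_labels = ['ham', 'spam', 'ham', 'ham']
classes = fit_classes(sample_labels)
codes = encode(sample_labels, classes)

print(classes)
print(codes)
print(decode(codes, classes))
```

Because the classes are sorted, the integer assigned to each class depends only on the set of class names, not on the order in which they appear in the data.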

### **3.2. Process the feature data**

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# It's good practice to ensure the necessary NLTK data is downloaded
# nltk.download('punkt')
# nltk.download('stopwords')

def punctuation_removal(text):
    """Removes punctuation from text."""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def remove_stopwords(tokens):
    """Removes English stopwords from a list of tokens."""
    stop_words = set(stopwords.words('english'))
    # The list comprehension is the most efficient way to do this.
    return [token for token in tokens if token not in stop_words]

def stemming(tokens):
    """Applies Porter Stemmer to a list of tokens."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

def preprocess_text(text):
    """Applies the full preprocessing pipeline to a single text string."""

    # 1. Lowercase the text
    text = text.lower()

    # 2. Remove punctuation
    text = punctuation_removal(text)

    # 3. Tokenize the cleaned text
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    tokens = remove_stopwords(tokens)

    # 5. Apply stemming
    tokens = stemming(tokens)

    return tokens

# --- Example Usage ---
# sample_texts = [
#     "This is the first document; it is amazing!",
#     "Here is the second one, which is also interesting."
# ]

# processed_data = [preprocess_text(text) for text in sample_texts]
# print(processed_data)
# Output: [['first', 'document', 'amaz'], ['second', 'one', 'also', 'interest']]
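
One consequence of removing punctuation before tokenizing is that contractions and hyphenated words collapse into single tokens. The stdlib-only sketch below repeats the `punctuation_removal` step and uses `str.split` as a simplified stand-in for `word_tokenize` (an assumption made so the example runs without NLTK data); the sample message is made up.

```python
import string

def punctuation_removal(text):
    """Removes punctuation from text (same logic as the cell above)."""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Made-up message, for illustration only
sample = "Don't miss out: FREE entry, text WIN-NOW to 80086!"

cleaned = punctuation_removal(sample.lower())
tokens = cleaned.split()  # simplified stand-in for word_tokenize

print(cleaned)
print(tokens)
```

Here `don't` becomes `dont` and `win-now` becomes `winnow`. A token such as `dont` may then no longer match a stopword-list entry spelled with an apostrophe, so it can survive the stopword filter.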

In [None]:
processed_messages = [preprocess_text(message) for message in messages]
processed_messages

In [None]:
def create_dictionary(messages):
    """Creates a vocabulary of unique words from a list of tokenized messages."""
    all_words = []

    for tokens in messages:
        # Extend the list with all tokens from the current message
        all_words.extend(tokens)
    # Deduplicate with a set, then sort so the vocabulary order is reproducible
    dictionary = sorted(set(all_words))

    return dictionary

def create_features(tokens, dictionary):
    features = np.zeros(len(dictionary))

    for token in tokens:
        if token in dictionary:
            features[dictionary.index(token)] += 1

    return features
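
`dictionary.index(token)` scans the vocabulary list for every token, so building the feature matrix can become slow as the vocabulary grows. A minimal sketch of one alternative, assuming a token-to-index mapping that is built once, is shown below. The names `build_index` and `create_features_fast` are hypothetical, and the code uses plain lists so that it runs without NumPy; wrapping the result in `np.array` would match the output type of `create_features`.

```python
def build_index(dictionary):
    """Map each vocabulary word to its column position (built once)."""
    return {word: i for i, word in enumerate(dictionary)}

def create_features_fast(tokens, word_to_index):
    """Count how often each vocabulary word occurs in `tokens`."""
    features = [0.0] * len(word_to_index)
    for token in tokens:
        position = word_to_index.get(token)  # None for out-of-vocabulary tokens
        if position is not None:
            features[position] += 1
    return features

# Hypothetical four-word vocabulary, for illustration only
toy_dictionary = ['call', 'free', 'meet', 'prize']
word_to_index = build_index(toy_dictionary)

vector = create_features_fast(['free', 'prize', 'free', 'unknown'], word_to_index)
print(vector)
```

The dictionary lookup takes roughly constant time per token, whereas `list.index` is linear in the vocabulary size. Tokens that are not in the vocabulary are skipped, as in the original function.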

In [None]:
dictionary = create_dictionary(processed_messages)
dictionary

In [None]:
X = np.array([create_features(tokens, dictionary) for tokens in processed_messages])
X

### **3.3. Split the data into train/val/test**

In [None]:
VAL_SIZE = 0.2
TEST_SIZE = 0.125
SEED = 0

X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=VAL_SIZE,
                                                  shuffle=True,
                                                  random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train,
                                                    test_size=TEST_SIZE,
                                                    shuffle=True,
                                                    random_state=SEED)
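
The second call splits only what remains after the validation set has been held out, so `TEST_SIZE` is a fraction of that remainder rather than of the full dataset. The short sketch below works out the resulting proportions; the helper name `split_fractions` is hypothetical, and the exact row counts will also depend on how `train_test_split` rounds.

```python
def split_fractions(val_size, test_size):
    """Return (train, val, test) fractions of the full dataset for a two-step split."""
    remaining = 1.0 - val_size             # left after holding out validation
    test_fraction = remaining * test_size  # test is carved out of the remainder
    train_fraction = remaining - test_fraction
    return train_fraction, val_size, test_fraction

train_frac, val_frac, test_frac = split_fractions(val_size=0.2, test_size=0.125)
print(f'train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}')
```

With `VAL_SIZE = 0.2` and `TEST_SIZE = 0.125`, this gives approximately a 70/20/10 train/val/test split.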

## **4. Train the model**

In [None]:
%%time
model = GaussianNB()
print('Start training...')
model = model.fit(X_train, y_train)
print('Training completed!')

## **5. Evaluate the model**

In [None]:
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)
val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Val accuracy: {val_accuracy}')
print(f'Test accuracy: {test_accuracy}')
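
With its default arguments, `accuracy_score` returns the fraction of predictions that match the true labels. The sketch below reproduces that computation in plain Python on made-up labels (the lists are illustrative, not results from this notebook). It also counts the four confusion outcomes, which can be informative when one class is much rarer than the other. The helper names `accuracy` and `confusion_counts`, and the choice of `1` as the positive class, are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts['tp' if t == positive else 'fp'] += 1
        else:
            counts['fn' if t == positive else 'tn'] += 1
    return counts

# Made-up labels, for illustration only
y_true_toy = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 1, 1, 0, 0, 1, 0]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
toy_counts = confusion_counts(y_true_toy, y_pred_toy)
print(toy_accuracy)
print(toy_counts)
```

In this toy case six of eight predictions are correct, but one of the three positive examples is missed, which the accuracy figure alone does not reveal.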

## **6. Make predictions**

In [None]:
def predict(text, model, dictionary):
    processed_text = preprocess_text(text)
    # Build features from the preprocessed tokens, not from the raw string
    features = create_features(processed_text, dictionary)
    features = np.array(features).reshape(1, -1)
    prediction = model.predict(features)
    prediction_cls = le.inverse_transform(prediction)[0]

    return prediction_cls

In [None]:
test_input = 'I am actually thinking a way of doing something useful'
prediction_cls = predict(test_input, model, dictionary)
print(f'Prediction: {prediction_cls}')