## Table of contents
#### 1 noise removal: 
punctuation(d·∫•u c√¢u), stop words (stopwordsVN), URLs and Special Characters, date time, m√£ s·ªë t√™n gi·∫£ng vi√™n/ ph√≤ng d·∫°ng wzjwz
#### 2 normalization: 
Elongation, accent-marks((e.g., convert all variations of "√°" to a single form)), emoji, normalize Unicode text to NFC form, lowercase, num2word, handle negation(d√°n nh√£n cho t·ª´ ph·ªß ƒë·ªãnh)
#### 3 word segmentation
#### 4 remove non-word chars
remove all character are not word or underscore
#### 5 drop unique word
it should be processed when all above step done
#### 6 handle imbalanced data

In [106]:
pip install underthesea pyvi num2words py_vncorenlp pyvi

Note: you may need to restart the kernel to use updated packages.


In [107]:
import pandas as pd
import numpy as np

import unicodedata
from underthesea import word_tokenize
from pyvi import ViTokenizer

import re
import string

## 1. Noise removal

In [108]:
# X·ª≠ l√≠ d·∫•u c√¢u
def remove_punctuation(text):
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(translator)
    return text

In [225]:
# Stopwords
import requests

raw_url = 'https://raw.githubusercontent.com/lavibula/SentimentAnalysis-with-Vietnamese-reviews/topic/data-preparation/vietnamese-stopwords-dash_filtered.txt'
response = requests.get(raw_url)

if response.status_code == 200:
    content = response.text
else:
    print('Failed to retrieve the file from GitHub:', response.status_code)

# Stopwords
sw = content.split('\n')
def remove_stopword(text):
    text = " ".join(x for x in text.split() if x not in sw)
    return text

In [227]:
'nh·ªè' in sw

False

In [110]:
# # URLs and special character

import re

def clean_text(text):
    # Remove URLs, special characters and date time, m√£ s·ªë t√™n gi·∫£ng vi√™n/ ph√≤ng d·∫°ng wzjwz, repeating chars

    # URLs
    text = re.sub(r"https?://\S+", "", text)
    
    # special characters
    text = re.sub(r"[!@#$%/^&*(]", "", text)
    
    # dates in the format dd/mm/yyyy or dd-mm-yyyy or dd.mm.yyyy or dd_mm_yy
    text = re.sub(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', '', text)
    text = re.sub(r'\b\d{1,2}[_]\d{1,2}[_]\d{2}\b', '', text)
    
    # m√£ s·ªë t√™n gi·∫£ng vi√™n/ ph√≤ng d·∫°ng wzjwz
    text = re.sub(r'\b(wzjwz\w*)\b', '', text) 
    
    return text


## 2. Normalization

In [111]:
# Remove c√°c k√Ω t·ª± k√©o d√†i: vd: ƒë·∫πppppppp
def remove_elongated_chars(text):
    replacements = {
       'a' : '√†√°·∫£√£·∫°ƒÉ·∫±·∫Ø·∫≥·∫µ·∫∑√¢·∫ß·∫•·∫©·∫´·∫≠' ,
       'e' : '√®√©·∫ª·∫Ω·∫π√™·ªÅ·∫ø·ªÉ·ªÖ·ªá' ,
       'i' : '√¨√≠·ªâƒ©·ªã' ,
       'o' : '√≤√≥·ªè√µ·ªç√¥·ªì·ªë·ªï·ªó·ªô∆°·ªù·ªõ·ªü·ª°·ª£' ,
       'u' : '√π√∫·ªß≈©·ª•∆∞·ª´·ª©·ª≠·ªØ·ª±' ,
       'y' : '·ª≥√Ω·ª∑·ªπ·ªµ' ,
       'd' : 'ƒë' ,
       'A' : '√Ä√Å·∫¢√É·∫†ƒÇ·∫∞·∫Æ·∫≤·∫¥·∫∂√Ç·∫¶·∫§·∫®·∫™·∫¨' ,
       'E' : '√à√â·∫∫·∫º·∫∏√ä·ªÄ·∫æ·ªÇ·ªÑ·ªÜ' ,
       'I' : '√å√ç·ªàƒ®·ªä' ,
       'O' : '√í√ì·ªé√ï·ªå√î·ªí·ªê·ªî·ªñ·ªò∆†·ªú·ªö·ªû·ª†·ª¢' ,
       'U' : '√ô√ö·ª¶≈®·ª§∆Ø·ª™·ª®·ª¨·ªÆ·ª∞' ,
       'Y' : '·ª≤√ù·ª∂·ª∏·ª¥' ,
       'D' : 'ƒê' 
    }
    
    for char, replacements_str in replacements.items():
        pattern = rf"({char})\1+"
        text = re.sub(pattern, ' ' , text)
    pattern = rf"(\w)\1+"
    text = re.sub(pattern, r'\1', text)
    return text


In [112]:
#S·ª≠ d·ª•ng th∆∞ vi·ªán unicodedata ƒë·ªÉ chuy·ªÉn ƒë·ªïi c√°c k√Ω t·ª± Unicode 
#t∆∞∆°ng ƒë∆∞∆°ng th√†nh d·∫°ng chu·∫©n. V√≠ d·ª•: "Ho√†" th√†nh "H√≤a".
def normalize_unicode(text):
    return unicodedata.normalize("NFC", text)

In [113]:
from num2words import num2words

def handle_number(text):
    # X·ª≠ l√Ω chuy·ªÉn s·ªë th√†nh vƒÉn b·∫£n
    words = text.split()
    cleaned_words = []
    for word in words:
        try:
            # N·∫øu t·ª´ k·∫øt th√∫c b·∫±ng d·∫•u ch·∫•m, lo·∫°i b·ªè d·∫•u ch·∫•m v√† x·ª≠ l√Ω s·ªë
            if word.endswith( '.' ):
                num = int(word.replace( ',' , '' )[:-1])
                word = num2words(num, lang= 'vi' ) + '.' 
            else:
                # N·∫øu t·ª´ ch·ª©a d·∫•u ph·∫©y, lo·∫°i b·ªè d·∫•u ph·∫©y v√† x·ª≠ l√Ω s·ªë
                if ',' in word:
                    word = num2words(float(word), lang= 'vi' )
                elif '.' in word:
                    parts = word.split( '.' ) 
                    num = '' .join(parts[0:])
                    word = num2words(int(num), lang= 'vi' )
                else:
                    num = int(word)
                    word = num2words(num, lang= 'vi' )
        except ValueError:
            # N·∫øu kh√¥ng th·ªÉ chuy·ªÉn ƒë·ªïi, gi·ªØ nguy√™n t·ª´
            pass
        cleaned_words.append(word)
    
    # K·∫øt h·ª£p c√°c t·ª´ th√†nh c√¢u
    cleaned_text = ' ' .join(cleaned_words)
    
    return cleaned_text

In [114]:
#Lowercase
def lowercase(text):
    return text.lower()

In [115]:
from pyvi import ViTokenizer

def handle_negation(text):
    
    not_words = {"kh√¥ng", 'kh√¥ng h·ªÅ', "ch·∫≥ng", "ch∆∞a", "kh√¥ng ph·∫£i", "ch·∫£", "m·∫•t",
                 "thi·∫øu", "ƒë·∫øch", "ƒë√©o", "k√©m", "n·ªè", "not",
                 "b·ªõt", "kh√¥ng bao gi·ªù", "ch∆∞a bao gi·ªù"}
    # S·∫Øp x·∫øp c√°c t·ª´ t·ª´ d√†i nh·∫•t ƒë·∫øn ng·∫Øn nh·∫•t
    not_words = sorted(not_words, key=len, reverse=True)

    # Thay th·∫ø c√°c t·ª´ trong text b·∫±ng 'NOT'
    pattern = r'\b(?:' + '|'.join(re.escape(word) for word in not_words) + r')\b'
    text = re.sub(pattern, 'NOT', text, flags=re.IGNORECASE)

    return text

In [116]:
# V√≠ d·ª• s·ª≠ d·ª•ng
text = "T√¥i kh√¥ng h·ªÅ mu·ªën ƒëi ch∆°i, ch∆∞a bao gi·ªù!"
processed_text = handle_negation(text)
print(processed_text)

T√¥i NOT mu·ªën ƒëi ch∆°i, NOT!


## 3. Word segmentation

In [119]:
from pyvi import ViTokenizer
def Word_segmentation(text):
    text = ViTokenizer.tokenize(text)
    return text

In [120]:
# V√≠ d·ª• s·ª≠ d·ª•ng
text = "gi√°o vi√™n c·∫ßn l√™n l·ªõp th∆∞·ªùng xuy√™n h∆°n v√† d·∫°y nh·ªØng ki·∫øn th·ª©c thi·∫øt th·ª±c v·ªõi m√¥n h·ªçc h∆°n .!"
processed_text = Word_segmentation(text)
print(processed_text)

gi√°o_vi√™n c·∫ßn l√™n_l·ªõp th∆∞·ªùng_xuy√™n h∆°n v√† d·∫°y nh·ªØng ki·∫øn_th·ª©c thi·∫øt_th·ª±c v·ªõi m√¥n_h·ªçc h∆°n . !


In [164]:
# V√≠ d·ª• s·ª≠ d·ª•ng
text = "n·ªôi_dung m√¥n_h·ªçc NOT tr·ªçng_t√¢m"
processed_text = drop_unique_word(Word_segmentation(text))
print(processed_text)

1450
n·ªôi_dung m√¥n_h·ªçc  tr·ªçng_t√¢m


In [159]:
# V√≠ d·ª• s·ª≠ d·ª•ng
text = "ch∆∞a √°p d·ª•ng c√¥ng ngh·ªá th√¥ng tin v√† c√°c thi·∫øt b·ªã h·ªó tr·ª£ cho vi·ªác gi·∫£ng d·∫°y ."
processed_text = drop_unique_word(Word_segmentation(text))
print(processed_text)

1450
ch∆∞a √°p_d·ª•ng c√¥ng_ngh·ªá th√¥ng_tin v√† c√°c thi·∫øt_b·ªã h·ªó_tr·ª£ cho vi·ªác gi·∫£ng_d·∫°y .


## 4. Remove non-word chars

In [121]:
import re

def remove_non_word_chars(text):
    # remove non-word characters
    cleaned_text = re.sub(r'[^\w\s]', '', text)
    return cleaned_text

text = "This is a sample_text! With punctuation& and symbols. üå∑"
cleaned_text = remove_non_word_chars(text)
print(cleaned_text)

This is a sample_text With punctuation and symbols 


## 5. Drop unique word

In [179]:
import pandas as pd

def return_unique_word(df1, column_name):
    
    df = df1.copy()
    df[column_name] = df[column_name].apply(remove_punctuation)
    df[column_name] = df[column_name].apply(clean_text)
    df[column_name] = df[column_name].apply(Word_segmentation)

    all_words_series = df[column_name].str.split(expand=True).stack()

    word_counts = all_words_series.value_counts()
    df_word_counts = pd.DataFrame({'count': word_counts}).reset_index()
    df_word_counts = df_word_counts[df_word_counts['count'] == 1]
    
    return df_word_counts['index'].values


In [180]:
from pyvi import ViTokenizer

def drop_unique_word(unique_word, text):
#     print(len(unique_word))
    # Thay th·∫ø c√°c t·ª´ trong text b·∫±ng ''
    pattern = r'\b(?:' + '|'.join(re.escape(word) for word in unique_word) + r')\b'
    text = re.sub(pattern, '', text, flags=re.IGNORECASE)

    return text

## Preprocess data

In [168]:
def process_text(text):
    
    cleaned_text = clean_text(text)
    cleaned_text = handle_number(cleaned_text)
    cleaned_text = remove_punctuation(cleaned_text)
    cleaned_text = lowercase(cleaned_text)
    
    cleaned_text = remove_elongated_chars(cleaned_text)
    cleaned_text = normalize_unicode(cleaned_text)
    cleaned_text = Word_segmentation(cleaned_text)
    cleaned_text = handle_negation(cleaned_text)

    cleaned_text = remove_non_word_chars(cleaned_text)

    #remove n·ªët nh·ªØng k√Ω t·ª± th·ª´a th√£i
    cleaned_text = cleaned_text.replace(u'  ', u' ')
    cleaned_text = cleaned_text.replace(u'"', u' ')
    cleaned_text = cleaned_text.replace(u'Ô∏è', u'')
    
    cleaned_text = remove_stopword(cleaned_text)
    
    return cleaned_text

In [26]:
from datasets import load_dataset

dataset = load_dataset("uitnlp/vietnamese_students_feedback")

Downloading data: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 475k/475k [00:00<00:00, 1.46MB/s]
Downloading data: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 63.3k/63.3k [00:00<00:00, 268kB/s]
Downloading data: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 134k/134k [00:00<00:00, 427kB/s]


Generating train split:   0%|          | 0/11426 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1583 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3166 [00:00<?, ? examples/s]

In [173]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'sentiment', 'topic'],
        num_rows: 11426
    })
    validation: Dataset({
        features: ['sentence', 'sentiment', 'topic'],
        num_rows: 1583
    })
    test: Dataset({
        features: ['sentence', 'sentiment', 'topic'],
        num_rows: 3166
    })
})

In [4]:
dataset['train']

Dataset({
    features: ['sentence', 'sentiment', 'topic'],
    num_rows: 11426
})

In [252]:
df = dataset['validation'].to_pandas()
df
# sentiment: negative, neutral, positive
# topic: lecturer, training program, facility, others

Unnamed: 0,sentence,sentiment,topic
0,gi√°o tr√¨nh ch∆∞a c·ª• th·ªÉ .,0,1
1,gi·∫£ng bu·ªìn ng·ªß .,0,0
2,"gi√°o vi√™n vui t√≠nh , t·∫≠n t√¢m .",2,0
3,"gi·∫£ng vi√™n n√™n giao b√†i t·∫≠p nhi·ªÅu h∆°n , chia n...",0,0
4,"gi·∫£ng vi√™n c·∫ßn gi·∫£ng b√†i chi ti·∫øt h∆°n , ƒëi s√¢u...",0,0
...,...,...,...
1578,h∆∞·ªõng d·∫´n lab m∆° h·ªì .,0,0
1579,th·∫ßy cho ch√∫ng em nh·ªØng b√†i t·∫≠p mang t√≠nh th·ª±c...,2,0
1580,th·∫ßy kh√¥ng d·∫°y nhi·ªÅu ch·ªß y·∫øu cho sinh vi√™n t·ª± ...,0,0
1581,em mu·ªën ƒë·ªïi t√™n m√¥n h·ªçc v√¨ t√™n m√¥n l√† l·∫≠p tr√¨n...,0,1


In [155]:
df['sentence'][:10]

0                            slide gi√°o tr√¨nh ƒë·∫ßy ƒë·ªß .
1       nhi·ªát t√¨nh gi·∫£ng d·∫°y , g·∫ßn g≈©i v·ªõi sinh vi√™n .
2                 ƒëi h·ªçc ƒë·∫ßy ƒë·ªß full ƒëi·ªÉm chuy√™n c·∫ßn .
3    ch∆∞a √°p d·ª•ng c√¥ng ngh·ªá th√¥ng tin v√† c√°c thi·∫øt ...
4    th·∫ßy gi·∫£ng b√†i hay , c√≥ nhi·ªÅu b√†i t·∫≠p v√≠ d·ª• ng...
5    gi·∫£ng vi√™n ƒë·∫£m b·∫£o th·ªùi gian l√™n l·ªõp , t√≠ch c·ª±...
6    em s·∫Ω n·ª£ m√¥n n√†y , nh∆∞ng em s·∫Ω h·ªçc l·∫°i ·ªü c√°c h...
7    th·ªùi l∆∞·ª£ng h·ªçc qu√° d√†i , kh√¥ng ƒë·∫£m b·∫£o ti·∫øp th...
8    n·ªôi dung m√¥n h·ªçc c√≥ ph·∫ßn thi·∫øu tr·ªçng t√¢m , h·∫ßu...
9    c·∫ßn n√≥i r√µ h∆°n b·∫±ng c√°ch tr√¨nh b√†y l√™n b·∫£ng th...
Name: sentence, dtype: object

In [170]:
df['sentence'][:10].apply(process_text)

0                              slide gi√°o_tr√¨nh ƒë·∫ßy_ƒë·ªß
1               nhi·ªát_t√¨nh gi·∫£ng_d·∫°y g·∫ßn_g≈©i sinh_vi√™n
2                         ƒëi h·ªçc ƒë·∫ßy_ƒë·ªß ful chuy√™n_c·∫ßn
3    NOT √°p_d·ª•ng c√¥ng_ngh·ªá th√¥ng_tin thi·∫øt_b·ªã gi·∫£ng...
4                         th·∫ßy gi·∫£ng b√†i_t·∫≠p v√≠_d·ª• l·ªõp
5    gi·∫£ng_vi√™n l√™n_l·ªõp t√≠ch_c·ª±c tr·∫£_l·ªùi c√¢u sinh_v...
6                            n·ª£ m√¥n h·ªçc h·ªçc_k·ª≥ k·∫ø_ti·∫øp
7                 th·ªùi_l∆∞·ª£ng h·ªçc NOT ti·∫øp_thu hi·ªáu_qu·∫£
8    n·ªôi_dung m√¥n_h·ªçc NOT tr·ªçng_t√¢m h·∫ßu_nh∆∞ kh√°i_qu...
9                         tr√¨nh_b√†y b·∫£ng thay_v√¨ slide
Name: sentence, dtype: object

In [240]:
from tqdm import tqdm
def process_df(df):   
    column_name = 'sentence_preprocessed'
    df[column_name] = df['sentence'].apply(process_text)


In [253]:
process_df(df)
df

Unnamed: 0,sentence,sentiment,topic,sentence_preprocessed
0,gi√°o tr√¨nh ch∆∞a c·ª• th·ªÉ .,0,1,gi√°o_tr√¨nh NOT
1,gi·∫£ng bu·ªìn ng·ªß .,0,0,gi·∫£ng bu·ªìn_ng·ªß
2,"gi√°o vi√™n vui t√≠nh , t·∫≠n t√¢m .",2,0,gi√°o_vi√™n vui_t√≠nh t·∫≠n_t√¢m
3,"gi·∫£ng vi√™n n√™n giao b√†i t·∫≠p nhi·ªÅu h∆°n , chia n...",0,0,gi·∫£ng_vi√™n giao b√†i_t·∫≠p nhi·ªÅu chia b√†i_t·∫≠p gi·∫£...
4,"gi·∫£ng vi√™n c·∫ßn gi·∫£ng b√†i chi ti·∫øt h∆°n , ƒëi s√¢u...",0,0,gi·∫£ng_vi√™n c·∫ßn gi·∫£ng chi_ti·∫øt ƒëi_s√¢u code ch·∫°y...
...,...,...,...,...
1578,h∆∞·ªõng d·∫´n lab m∆° h·ªì .,0,0,h∆∞·ªõng_d·∫´n lab m∆°_h·ªì
1579,th·∫ßy cho ch√∫ng em nh·ªØng b√†i t·∫≠p mang t√≠nh th·ª±c...,2,0,th·∫ßy b√†i_t·∫≠p th·ª±c_h√†nh th·ª±c_ti·ªÖn h√†i_l√≤ng
1580,th·∫ßy kh√¥ng d·∫°y nhi·ªÅu ch·ªß y·∫øu cho sinh vi√™n t·ª± ...,0,0,th·∫ßy NOT d·∫°y nhi·ªÅu ch·ªß_y·∫øu sinh_vi√™n
1581,em mu·ªën ƒë·ªïi t√™n m√¥n h·ªçc v√¨ t√™n m√¥n l√† l·∫≠p tr√¨n...,0,1,ƒë·ªïi m√¥n_h·ªçc m√¥n l·∫≠p_tr√¨nh c fraction cplusplus...


In [254]:
# unique word
column_name = 'sentence_preprocessed'
unique_word = return_unique_word(df, column_name)
text_dropped = []
for text in tqdm(df[column_name], desc='Processing'):
    text_dropped.append(drop_unique_word(unique_word, text))

Processing: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1583/1583 [00:01<00:00, 948.53it/s]


In [255]:
df['sentence_dropped'] = text_dropped

In [256]:
df

Unnamed: 0,sentence,sentiment,topic,sentence_preprocessed,sentence_dropped
0,gi√°o tr√¨nh ch∆∞a c·ª• th·ªÉ .,0,1,gi√°o_tr√¨nh NOT,gi√°o_tr√¨nh NOT
1,gi·∫£ng bu·ªìn ng·ªß .,0,0,gi·∫£ng bu·ªìn_ng·ªß,gi·∫£ng bu·ªìn_ng·ªß
2,"gi√°o vi√™n vui t√≠nh , t·∫≠n t√¢m .",2,0,gi√°o_vi√™n vui_t√≠nh t·∫≠n_t√¢m,gi√°o_vi√™n vui_t√≠nh t·∫≠n_t√¢m
3,"gi·∫£ng vi√™n n√™n giao b√†i t·∫≠p nhi·ªÅu h∆°n , chia n...",0,0,gi·∫£ng_vi√™n giao b√†i_t·∫≠p nhi·ªÅu chia b√†i_t·∫≠p gi·∫£...,gi·∫£ng_vi√™n giao b√†i_t·∫≠p nhi·ªÅu chia b√†i_t·∫≠p gi·∫£...
4,"gi·∫£ng vi√™n c·∫ßn gi·∫£ng b√†i chi ti·∫øt h∆°n , ƒëi s√¢u...",0,0,gi·∫£ng_vi√™n c·∫ßn gi·∫£ng chi_ti·∫øt ƒëi_s√¢u code ch·∫°y...,gi·∫£ng_vi√™n c·∫ßn gi·∫£ng chi_ti·∫øt ƒëi_s√¢u code ch·∫°y...
...,...,...,...,...,...
1578,h∆∞·ªõng d·∫´n lab m∆° h·ªì .,0,0,h∆∞·ªõng_d·∫´n lab m∆°_h·ªì,h∆∞·ªõng_d·∫´n lab m∆°_h·ªì
1579,th·∫ßy cho ch√∫ng em nh·ªØng b√†i t·∫≠p mang t√≠nh th·ª±c...,2,0,th·∫ßy b√†i_t·∫≠p th·ª±c_h√†nh th·ª±c_ti·ªÖn h√†i_l√≤ng,th·∫ßy b√†i_t·∫≠p th·ª±c_h√†nh th·ª±c_ti·ªÖn h√†i_l√≤ng
1580,th·∫ßy kh√¥ng d·∫°y nhi·ªÅu ch·ªß y·∫øu cho sinh vi√™n t·ª± ...,0,0,th·∫ßy NOT d·∫°y nhi·ªÅu ch·ªß_y·∫øu sinh_vi√™n,th·∫ßy NOT d·∫°y nhi·ªÅu ch·ªß_y·∫øu sinh_vi√™n
1581,em mu·ªën ƒë·ªïi t√™n m√¥n h·ªçc v√¨ t√™n m√¥n l√† l·∫≠p tr√¨n...,0,1,ƒë·ªïi m√¥n_h·ªçc m√¥n l·∫≠p_tr√¨nh c fraction cplusplus...,ƒë·ªïi m√¥n_h·ªçc m√¥n l·∫≠p_tr√¨nh c fraction h·ªçc


In [257]:
df[['sentence', 'sentiment', 'topic', 'sentence_dropped']].to_csv('Validation.csv')