# Vietnamese Sentiment Analysis

## About the team:
Information Retrieval (CS336.K11.KHCL) - University of Information Technology
- Lê Nhất Minh (17520751)
- Đặng Khắc Lộc (17520694)

## Project methods:
Stage 1: Basic Machine Learning Algorithms

Word Embedding: TF-IDF Vectorizer
- Case 1: Logistic Regression
- Case 2: Support Vector Machine
- Case 3: Multinomial Naive Bayes
- Case 4: Bernoulli Naive Bayes

Stage 2: Deep Learning Algorithms

Word Embedding: word2vec, TF-IDF Vectorizer
- Case 1: CNN
- Case 2: LSTM



# Setting up Google Colab

In [0]:
!apt-get update -qq > /dev/null 2>&1
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse

In [0]:
# Mount Google Drive (this prompts for authorization).
# This only needs to be done once per notebook session.
from google.colab import drive
drive.mount('/gdrive/')

## READ THE NOTES / INSTRUCTIONS ABOVE CAREFULLY BEFORE USE ##
%cd /gdrive/My Drive/Vietnamese_Sentiment_Analysis

# Import libraries and dataset

In [0]:
# Install the pyvi library (Vietnamese tokenizer)
!pip install pyvi

In [0]:
from pyvi import ViTokenizer
from gensim.models import Word2Vec
import pandas as pd
import numpy as np
import glob
from collections import Counter
from string import punctuation

In [5]:
data = pd.read_csv("/gdrive/My Drive/Vietnamese_Sentiment_Analysis/Dataset/VLSP2016_SA-new.csv")
data.head()

| Unnamed: 0 | content | class |
|---|---|---|
| 0 | Đang xài MX1. Dùng bình thường ngon, pin trâu.... | positive |
| 1 | Qủa pin ngon, sạc lại được, bền. Riêng em dùng... | positive |
| 2 | cũng đang xài 1 con logitech bluetooth tầm thấ... | positive |
| 3 | Logitech pin trâu thôi rồi, mua 1 con B175  cu... | positive |
| 4 | Em có con chuột không dây 150k cũng đầy đủ nút... | positive |


# Cleaning and Preprocessing Text Data 

In [0]:
y = data["class"].values
X_text = data["content"].values

### Tokenize text

In [0]:
# Word segmentation: tokenize each comment separately.
# (Calling str() on the whole column would tokenize its truncated printout instead.)
data['content'] = data['content'].str.lower()
comments = [ViTokenizer.tokenize(str(text)) for text in data['content']]

### Create Vietnamese stop words

In [8]:
# Note: split() breaks on any whitespace, so any multi-word entry
# in the file is stored as separate syllables.
with open("/gdrive/My Drive/Vietnamese_Sentiment_Analysis/Dataset/vietnamese-stopwords.txt", encoding="utf-8") as f:
    stop_word = f.read().split()
# Treat punctuation marks as stop words too
stop_word = stop_word + list(punctuation)
print(stop_word)

['a', 'lô', 'a', 'ha', 'ai', 'ai', 'ai', 'ai', 'nấy', 'ai', 'đó', 'alô', 'amen', 'anh', 'anh', 'ấy', 'ba', 'ba', 'ba', 'ba', 'bản', 'ba', 'cùng', 'ba', 'họ', 'ba', 'ngày', 'ba', 'ngôi', 'ba', 'tăng', 'bao', 'giờ', 'bao', 'lâu', 'bao', 'nhiêu', 'bao', 'nả', 'bay', 'biến', 'biết', 'biết', 'bao', 'biết', 'bao', 'nhiêu', 'biết', 'chắc', 'biết', 'chừng', 'nào', 'biết', 'mình', 'biết', 'mấy', 'biết', 'thế', 'biết', 'trước', 'biết', 'việc', 'biết', 'đâu', 'biết', 'đâu', 'chừng', 'biết', 'đâu', 'đấy', 'biết', 'được', 'buổi', 'buổi', 'làm', 'buổi', 'mới', 'buổi', 'ngày', 'buổi', 'sớm', 'bà', 'bà', 'ấy', 'bài', 'bài', 'bác', 'bài', 'bỏ', 'bài', 'cái', 'bác', 'bán', 'bán', 'cấp', 'bán', 'dạ', 'bán', 'thế', 'bây', 'bẩy', 'bây', 'chừ', 'bây', 'giờ', 'bây', 'nhiêu', 'bèn', 'béng', 'bên', 'bên', 'bị', 'bên', 'có', 'bên', 'cạnh', 'bông', 'bước', 'bước', 'khỏi', 'bước', 'tới', 'bước', 'đi', 'bạn', 'bản', 'bản', 'bộ', 'bản', 'riêng', 'bản', 'thân', 'bản', 'ý', 'bất', 'chợt', 'bất', 'cứ', 'bất', 'giác', 
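
The printed list suggests that multi-syllable stop words are split into single syllables, while ViTokenizer joins the syllables of a word with underscores. The following is a minimal sketch of one way to build a matching list, assuming the stop-word file stores one entry per line. The helper `load_stop_words` and the sample string `raw` are hypothetical and only stand in for the real file.

```python
from string import punctuation


def load_stop_words(raw_text):
    """Build a stop-word set with one entry per line, joining syllables with '_'."""
    words = set()
    for line in raw_text.splitlines():
        entry = line.strip()
        if entry:
            # "a lô" -> "a_lô", matching underscore-joined tokenizer output
            words.add("_".join(entry.split()))
    # Treat punctuation marks as stop words too
    return words | set(punctuation)


# Stand-in for the file contents (assumed format: one stop word per line)
raw = "a lô\na ha\nai\nbao giờ\n"
stop_words = load_stop_words(raw)
print(sorted(w for w in stop_words if w not in punctuation))
```

A set also makes the later `word not in stop_word` check faster than a list lookup.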

### Clean data

In [0]:
# Clean the data: drop stop words and keep only alphabetic tokens
# or underscore-joined compounds produced by the tokenizer
sentences = []
for comment in comments:
    sent = []
    for word in comment.split(" "):
        if word not in stop_word:
            if "_" in word or word.isalpha():
                sent.append(word)
    sentences.append(" ".join(sent))
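
The filtering rule above can be sketched as a standalone function, which makes it easy to try on a single comment. The name `clean_sentence` and the small stop-word set are illustrative assumptions, not part of the original notebook.

```python
def clean_sentence(tokenized, stop_words):
    """Keep tokens that are not stop words and are alphabetic or '_'-joined compounds."""
    kept = []
    for word in tokenized.split():
        if word in stop_words:
            continue
        if "_" in word or word.isalpha():
            kept.append(word)
    return " ".join(kept)


# Illustrative stop words and an already tokenized, lower-cased comment
stop_words = {"đang", ".", ","}
cleaned = clean_sentence("đang xài mx1 . pin_trâu , dùng bình_thường", stop_words)
print(cleaned)
```

Here `mx1` is dropped because it mixes letters and digits, while `pin_trâu` survives because of the underscore.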

### Word embedding with TfidfVectorizer

In [0]:
# Convert text to TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer
#tf = TfidfVectorizer(min_df=5,max_df= 0.8,max_features=3000,sublinear_tf=True)
tf = TfidfVectorizer()
#tf.fit(sentences)
# Note: the vectorizer is fitted on the raw X_text, not on the cleaned `sentences`
X = tf.fit_transform(X_text)

In [11]:
X.shape

(5100, 6712)
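
The matrix above has one row per comment and one column per vocabulary term. As a small sketch of how `TfidfVectorizer` treats tokenized text, the toy corpus below (made up for illustration) shows that underscore-joined compounds are kept as single features, because the default token pattern counts `_` as a word character.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for cleaned, tokenized comments
docs = [
    "pin_trâu dùng bình_thường",
    "pin yếu dùng chán",
    "chuột không_dây dùng ổn",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # compounds appear as single terms
```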

# Load Machine Learning / Deep Learning algorithms

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50,shuffle=True)

In [14]:
print("X_train length:",X_train.shape[0])
print("X_test length:",X_test.shape[0])
print("y_train length:",y_train.shape[0])
print("y_test length:",y_test.shape[0])

X_train length: 3570
X_test length: 1530
y_train length: 3570
y_test length: 1530
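
With `test_size=0.3`, the 5100 rows are divided into 3570 training rows and 1530 test rows. The sketch below reproduces these sizes on placeholder arrays and additionally passes `stratify`, which the original cell does not use, to keep the class proportions identical in both parts. The evenly distributed labels are an assumption made only for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 5100 rows with three made-up, evenly distributed labels
X_demo = np.arange(5100).reshape(-1, 1)
y_demo = np.array(["negative", "neutral", "positive"] * 1700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50, shuffle=True,
    stratify=y_demo,  # assumption: not part of the original cell
)

test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(X_tr.shape[0], X_te.shape[0])
print(test_counts)
```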


## Logistic Regression

In [40]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(X_train,y_train)
pred = classifier.predict(X_test)
print("Accuracy score",accuracy_score(y_test,pred))
print("\n")
print(classification_report(y_test, pred))



Accuracy score 0.6816993464052288


              precision    recall  f1-score   support

    negative       0.67      0.67      0.67       499
     neutral       0.67      0.65      0.66       549
    positive       0.71      0.73      0.72       482

    accuracy                           0.68      1530
   macro avg       0.68      0.68      0.68      1530
weighted avg       0.68      0.68      0.68      1530
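
To read the report: the F1 score is the harmonic mean of precision and recall, the macro average is the plain mean over classes, and the weighted average weights each class by its support. The sketch below recomputes these from the rounded values printed above, so the results are approximate.

```python
# Per-class (precision, recall, support) copied from the report above
report = {
    "negative": (0.67, 0.67, 499),
    "neutral": (0.67, 0.65, 549),
    "positive": (0.71, 0.73, 482),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total = sum(support for _, _, support in report.values())

macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[label] * report[label][2] for label in report) / total

for label, score in f1_scores.items():
    print(f"{label}: {score:.2f}")
print(f"macro avg: {macro_f1:.2f}, weighted avg: {weighted_f1:.2f}")
```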



In [41]:
# Predict class example
classifier = LogisticRegression(solver='lbfgs')
classifier.fit(X, data['class'])

sentence = input("Your test sentence: ")
print("Predicted class:",classifier.predict(tf.transform([sentence]))[0])



Your test sentence: Tại sao có máy tính ở nhà rồi còn mua laptop làm gì nữa ?
Predicted class: neutral
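
The prediction cell has to call `tf.transform` by hand before `predict`. One possible alternative is to bundle the vectorizer and the classifier in a scikit-learn pipeline, so that raw sentences can be passed in directly. This is only a sketch on a tiny made-up training set, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, only to make the sketch runnable
texts = [
    "pin trâu dùng rất ngon",
    "máy chạy mượt rất thích",
    "pin yếu dùng rất chán",
    "máy lỗi nhiều rất tệ",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"))
model.fit(texts, labels)

# The pipeline vectorizes the raw sentence before classifying it
prediction = model.predict(["pin trâu máy chạy mượt"])[0]
print("Predicted class:", prediction)
```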


## Multinomial Naive Bayes

In [38]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train,y_train)
pred = classifier.predict(X_test)
print("Accuracy score: ",accuracy_score(y_test,pred))
print("\n")
print(classification_report(y_test,pred))

Accuracy score:  0.6699346405228758


              precision    recall  f1-score   support

    negative       0.63      0.66      0.65       499
     neutral       0.66      0.64      0.65       549
    positive       0.72      0.72      0.72       482

    accuracy                           0.67      1530
   macro avg       0.67      0.67      0.67      1530
weighted avg       0.67      0.67      0.67      1530



In [39]:
# Predict class example
classifier = MultinomialNB()
classifier.fit(X,data['class'])

sentence = input("Your test sentence: ")
print("Predicted class: ", classifier.predict(tf.transform([sentence]))[0])

Your test sentence: Tại sao có máy tính ở nhà rồi còn mua laptop làm gì nữa ?
Predicted class:  neutral


## Bernoulli Naive Bayes

In [19]:
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()
classifier.fit(X_train,y_train)
pred = classifier.predict(X_test)
print("Accuracy score",accuracy_score(y_test,pred))
print("\n")
print(classification_report(y_test, pred))

Accuracy score 0.565359477124183


              precision    recall  f1-score   support

    negative       0.62      0.56      0.59       499
     neutral       0.63      0.38      0.47       549
    positive       0.51      0.78      0.62       482

    accuracy                           0.57      1530
   macro avg       0.58      0.57      0.56      1530
weighted avg       0.59      0.57      0.55      1530



In [37]:
# Predict class example
classifier = BernoulliNB()
classifier.fit(X,data['class'])

sentence = input("Your test sentence: ")
print("Predicted class: ", classifier.predict(tf.transform([sentence]))[0])

Your test sentence: Tại sao có máy tính ở nhà rồi còn mua laptop làm gì nữa ?
Predicted class:  negative
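
`BernoulliNB` models each feature as present or absent and, with its default `binarize=0.0`, turns every positive value into 1. The TF-IDF weights therefore have no effect on this model. The sketch below, on made-up sentences, compares a model trained on TF-IDF features with one trained on binary term counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up sentences and labels for illustration
texts = [
    "pin trâu dùng ngon",
    "máy mượt dùng thích",
    "pin yếu dùng chán",
    "máy lỗi dùng tệ",
]
labels = ["positive", "positive", "negative", "negative"]
queries = ["pin trâu", "máy lỗi"]

tfidf = TfidfVectorizer()
binary = CountVectorizer(binary=True)

nb_tfidf = BernoulliNB().fit(tfidf.fit_transform(texts), labels)
nb_binary = BernoulliNB().fit(binary.fit_transform(texts), labels)

# Both models see the same presence/absence pattern after binarization
proba_tfidf = nb_tfidf.predict_proba(tfidf.transform(queries))
proba_binary = nb_binary.predict_proba(binary.transform(queries))
print(np.allclose(proba_tfidf, proba_binary))
```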


## Support Vector Machine

In [33]:
from sklearn.svm import SVC
classifier = SVC(gamma=0.1,C=1,kernel='rbf')
classifier.fit(X_train,y_train)
pred = classifier.predict(X_test)
print("Accuracy score",accuracy_score(y_test,pred))
print("\n")
print(classification_report(y_test, pred))

Accuracy score 0.6601307189542484


              precision    recall  f1-score   support

    negative       0.61      0.70      0.65       499
     neutral       0.62      0.66      0.64       549
    positive       0.79      0.62      0.70       482

    accuracy                           0.66      1530
   macro avg       0.67      0.66      0.66      1530
weighted avg       0.67      0.66      0.66      1530



In [36]:
# Predict class example
classifier = SVC(gamma=0.1,C=1,kernel='rbf')
classifier.fit(X,data['class'])

sentence = input("Your test sentence: ")
print("Predicted class: ", classifier.predict(tf.transform([sentence]))[0])

Your test sentence: Tại sao có máy tính ở nhà rồi còn mua laptop làm gì nữa ? 
Predicted class:  neutral
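
The SVM above uses fixed values `gamma=0.1` and `C=1`. A common way to choose such values is a cross-validated grid search. The sketch below shows the mechanics on a tiny made-up dataset. The grid values and `cv=2` are illustrative assumptions, not settings from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up dataset, only to make the sketch runnable
texts = [
    "pin trâu dùng ngon", "máy mượt dùng thích",
    "chuột nhạy dùng ổn", "màn hình đẹp rất thích",
    "pin yếu dùng chán", "máy lỗi dùng tệ",
    "chuột hỏng rất chán", "màn hình mờ rất tệ",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC(kernel="rbf")),
])
param_grid = {
    "svc__C": [0.1, 1, 10],        # illustrative values
    "svc__gamma": [0.01, 0.1, 1],  # illustrative values
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means it is refitted on the training part of each fold, so the validation part does not influence the vocabulary or IDF weights.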
