### Feature extractions

This script responds to features extracted from two sentimental corpora, kt4.0 (ours) and wisesight. By combining train data from both corpora, we expect to see an improvement in the wisesight corpus' classification performance.

For both datasets, random stratify hold-out was performed with 80:20 ratio for train and test set. Feature engineering was carried out including dictionary-based (i.e., using list of good and bad Thai words), word count approches. Next, several feature extraction methods were applied and output as a joblib objects as follows:  

* BOW1, BOW2
* TF-IDF1, TF-IDF2
* Word2Vec pretrained from Thai wiki. (300 dimension)
* POS_tagging with flatten dataframe

Dependencies
* pythainlp >= 3.06dev
* python >= 3.8.8
* gensim >= 4.1.2
* scikit-learn >= 1.0.2
* joblib = 1.1.0
* dill = 0.31

The output vectors will be carried out in the next experiment.  
pree.t@cmu.ac.th  

In [1]:
import pandas as pd
import numpy as np
import pythainlp
from pythainlp.ulmfit import process_thai

from matplotlib import pyplot as plt

plt.rcParams['font.family'] = 'tahoma'

## Load original datasets

In [2]:
import os
os.path.dirname(os.getcwd())

data_path_kt = os.path.dirname(os.getcwd()) + '\\data\kt4.0\\'
data_path_ws = os.path.dirname(os.getcwd()) + '\\data\wisesight\\'
df_kt = pd.read_csv(data_path_kt + 'pantip_cleaned_1.csv')

# we use the original wisesight corpus and reconstruct a new dataframe
texts = []
targets = []

with open(str(data_path_ws) + '/' + 'neg.txt', encoding='utf-8') as f:
    for line in f:
        texts.append(line.strip())
        targets.append('neg')

with open(str(data_path_ws) + '/' + 'neu.txt', encoding='utf-8') as f:
    for line in f:
        texts.append(line.strip())
        targets.append('neu')

with open(str(data_path_ws) + '/' + 'pos.txt', encoding='utf-8') as f:
    for line in f:
        texts.append(line.strip())
        targets.append('pos')

with open(str(data_path_ws) + '/' + 'q.txt', encoding='utf-8') as f:
    for line in f:
        texts.append(line.strip())
        targets.append('q')
        
df_ws = pd.DataFrame({'texts': texts, 'targets': targets})
df_ws.to_csv('wisesight.csv', index=False)
df_kt.shape, df_ws.shape

((60081, 14), (26737, 2))

In [3]:
df_kt.head(10)

Unnamed: 0,post_id,post_date,user_id,user_name,text,tag,emotion,length,num_sent,sent_length,label,label_1,label_2,vote
0,39839097,2020-04-25 13:24:00,https://pantip.com/profile/5798163,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 5798163,[CR] ‡πÅ‡∏õ‡∏±‡∏á‡∏û‡∏±‡∏ü‡∏Ñ‡∏∏‡∏°‡∏°‡∏±‡∏ô ‡∏à‡∏±‡∏î‡πÄ‡∏ï‡πá‡∏°‡πÄ‡∏ô‡∏∑‡πâ‡∏≠‡∏ö‡∏≤‡∏á‡πÄ‡∏ö‡∏≤,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,36,3,14,2,2,2,pos
1,39839097,2020-04-25 13:24:00,https://pantip.com/profile/5798163,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 5798163,‡πÑ‡∏°‡πà‡∏≠‡∏∏‡∏î‡∏ï‡∏±‡∏ô ‡πÅ‡∏ï‡πà‡∏õ‡∏Å‡∏õ‡∏¥‡∏î‡πÅ‡∏ô‡πà‡∏ô‡∏°‡∏≤‡∏Å,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,36,3,8,2,2,2,pos
2,39839097,2020-04-25 13:24:00,https://pantip.com/profile/5798163,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 5798163,‡∏£‡∏µ‡∏ß‡∏¥‡∏ß‡πÅ‡∏õ‡πâ‡∏á Lady Audrey Ready All Day ‡∏à‡πâ‡∏≤,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,36,3,14,2,2,1,pos
3,39838736,2020-04-25 10:52:00,https://pantip.com/profile/5730006,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 5730006,‡∏Ç‡∏≠‡∏ö‡∏ï‡∏≤‡∏î‡∏≥‡∏°‡∏≤‡∏Å‡∏Ñ‡πà‡∏∞ ‡∏Ñ‡∏≠‡∏£‡πå‡πÄ‡∏•‡πá‡∏Ñ‡πÄ‡∏ï‡∏≠‡∏£‡πå‡∏Å‡πá‡πÄ‡∏≠‡∏≤‡πÑ‡∏°‡πà‡∏≠‡∏¢‡∏π‡πà,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,15,2,13,1,3,3,neg
4,39837384,2020-04-24 20:39:00,https://pantip.com/profile/4975838,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 4975838,‡πÄ‡∏≠‡∏≤aloe Vera ‡πÅ‡∏ä‡πà‡∏ï‡∏π‡πâ‡πÄ‡∏¢‡πá‡∏ô ‡∏à‡∏ô‡∏Å‡∏•‡∏≤‡∏¢‡πÄ‡∏õ‡πá‡∏ô‡∏ô‡πâ‡∏≥‡πÅ‡∏Ç‡πá‡∏á,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,11,1,11,1,1,3,neu
5,39838990,2020-04-25 12:36:00,https://pantip.com/profile/5655853,chdewxx,[SR] ‡πÑ‡∏≠‡πÄ‡∏ó‡∏° #‡πÄ‡∏ã‡∏£‡∏±‡πà‡∏°‡∏™‡∏¥‡∏ß ‡∏•‡∏î‡∏™‡∏¥‡∏ß ‡∏™‡∏¥‡∏ß‡∏≠‡∏∏‡∏î‡∏ï‡∏±‡∏ô ‡∏™‡∏¥‡∏ß‡∏ú‡∏î ‡∏ö‡∏≥...,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,29,1,29,2,2,2,pos
6,39838619,2020-04-25 10:01:00,https://pantip.com/profile/5656639,‡∏Ñ‡∏π‡∏à‡∏≠‡∏á‡∏¢‡∏≠‡∏ô‡πÅ‡∏•‡∏∞‡∏ß‡∏µ‡∏£‡∏¢‡∏≤,‡∏£‡∏ö‡∏Å‡∏ß‡∏ô‡∏™‡∏≤‡∏ß‡πÜ‡∏ä‡πà‡∏ß‡∏¢‡πÅ‡∏ô‡∏∞‡∏ô‡∏≥‡∏™‡∏Å‡∏¥‡∏ô‡πÅ‡∏Ñ‡∏£‡πå ‡∏ó‡∏µ‡πà‡∏ä‡πà‡∏ß‡∏¢‡πÉ‡∏´‡πâ‡∏ú‡∏¥‡∏ß‡∏´‡∏ô‡πâ‡∏≤‡∏Ç‡∏≤...,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,23,1,23,2,2,1,pos
7,39837266,2020-04-24 19:58:00,https://pantip.com/profile/632132,‡∏´‡∏°‡∏π‡∏Å‡∏•‡∏°‡∏≠‡∏≤‡∏£‡∏°‡∏ì‡πå‡∏î‡∏µ,‡∏ó‡∏î‡∏•‡∏≠‡∏á‡πÉ‡∏ä‡πâ ‡πÅ‡∏Ñ‡∏£‡∏≠‡∏ó‡∏ß‡∏¥‡∏ï‡∏ã‡∏µ‡∏´‡∏ô‡πâ‡∏≤‡πÉ‡∏™,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,14,2,9,1,2,2,pos
8,39837266,2020-04-24 19:58:00,https://pantip.com/profile/632132,‡∏´‡∏°‡∏π‡∏Å‡∏•‡∏°‡∏≠‡∏≤‡∏£‡∏°‡∏ì‡πå‡∏î‡∏µ,‡πÉ‡∏ô 1 ‡∏™‡∏±‡∏õ‡∏î‡∏≤‡∏´‡πå,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,14,2,5,1,1,1,neu
9,39835926,2020-04-24 12:03:00,https://pantip.com/profile/3826851,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 3826851,"‡∏ß‡∏¥‡∏ò‡∏µ‡πÄ‡∏•‡∏∑‡∏≠‡∏Å ""‡∏£‡∏≠‡∏á‡∏û‡∏∑‡πâ‡∏ô"" ‡πÅ‡∏•‡∏∞ ""‡∏Ñ‡∏≠‡∏ô‡∏ã‡∏µ‡∏•‡πÄ‡∏•‡∏≠‡∏£‡πå""",‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,14,1,14,2,2,2,pos


In [4]:
df_kt.describe()

Unnamed: 0,post_id,length,num_sent,sent_length,label,label_1,label_2
count,60081.0,60081.0,60081.0,60081.0,60081.0,60081.0,60081.0
mean,39649360.0,116.994574,8.502172,13.978329,1.577304,1.362644,1.662156
std,155991.9,118.647716,7.575442,12.083572,0.777527,0.639271,0.800034
min,39172830.0,3.0,1.0,3.0,1.0,1.0,1.0
25%,39587550.0,31.0,3.0,6.0,1.0,1.0,1.0
50%,39689290.0,72.0,6.0,10.0,1.0,1.0,1.0
75%,39769470.0,159.0,11.0,17.0,2.0,2.0,2.0
max,39839700.0,499.0,44.0,301.0,3.0,3.0,3.0


In [5]:
df_ws.head(10)

Unnamed: 0,texts,targets
0,‚òπÔ∏è,neg
1,üòî,neg
2,üòû,neg
3,üò•,neg
4,‡∏£‡∏≥,neg
5,No‡πÜ,neg
6,Rip,neg
7,T_T,neg
8,‡∏Å‡∏≤‡∏Å,neg
9,‡πÇ‡∏Å‡∏á,neg


In [6]:
df_ws.describe()

Unnamed: 0,texts,targets
count,26737,26737
unique,26713,4
top,‡∏≠‡∏∏‡∏î‡∏£‡∏°‡∏µ‡πÑ‡∏´‡∏°‡∏Ñ‡πà‡∏∞,neu
freq,2,14561


# Train-test split 80/20

In [7]:
# random stratified split train and test set 80/20
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
all_df_kt, test_df_kt = train_test_split(df_kt, test_size=0.2, random_state=42, shuffle = True)
all_df_kt.shape, test_df_kt.shape

((48064, 14), (12017, 14))

In [8]:
# class distribution
all_df_kt.vote.value_counts() / all_df_kt.shape[0]

neu    0.632136
pos    0.206620
neg    0.161243
Name: vote, dtype: float64

In [9]:
all_df_ws, test_df_ws = train_test_split(df_ws, test_size=0.2, random_state=42)
all_df_ws.shape, test_df_ws.shape

((21389, 2), (5348, 2))

In [10]:
# class distribution
all_df_ws.targets.value_counts() / all_df_ws.shape[0]

neu    0.544860
neg    0.253588
pos    0.179345
q      0.022208
Name: targets, dtype: float64

In [11]:
# clean and word tokenize
all_df_kt['processed'] = all_df_kt['text'].apply(str).apply(process_thai)
test_df_kt['processed'] = test_df_kt['text'].apply(str).apply(process_thai)

all_df_ws['processed'] = all_df_ws['texts'].apply(str).apply(process_thai)
test_df_ws['processed'] = test_df_ws['texts'].apply(str).apply(process_thai)

## Feature engineering: dictionary-based, word count, and unique word count

In [12]:
# Create new feature with dict-based approach
# load list of our custom positive and negative words
with open(os.path.dirname(os.getcwd()) + '\\data\\' + 'pos_words.txt', encoding='UTF-8') as f:
    pos_words = [line.rstrip('\n') for line in f]

with open(os.path.dirname(os.getcwd()) + '\\data\\' + 'neg_words.txt', encoding='UTF-8') as f:
    neg_words = [line.rstrip('\n') for line in f]
pos_words = list(set(pos_words))
neg_words = list(set(neg_words))

In [13]:
def cal_sentiment(sentence):
    senti = 0
    words = [word.lower() for word in sentence]
    for word in words:
        if word in pos_words:
            senti += 1
        elif word in neg_words:
            senti -= 1
    return senti

In [14]:
all_df_kt['cal_sentiment'] = all_df_kt['processed'].apply(cal_sentiment)
all_df_ws['cal_sentiment'] = all_df_ws['processed'].apply(cal_sentiment)

test_df_kt['cal_sentiment'] = test_df_kt['processed'].apply(cal_sentiment)
test_df_ws['cal_sentiment'] = test_df_ws['processed'].apply(cal_sentiment)
all_df_kt.head()

Unnamed: 0,post_id,post_date,user_id,user_name,text,tag,emotion,length,num_sent,sent_length,label,label_1,label_2,vote,processed,cal_sentiment
43487,39420424,2019-11-19 12:36:00,https://pantip.com/profile/3561069,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 3561069,‡∏ï‡∏≤‡∏°‡∏´‡∏±‡∏ß‡∏Ç‡πâ‡∏≠ ‡πÅ‡∏•‡∏∞‡∏ï‡∏≤‡∏°‡∏£‡∏π‡∏õ‡∏õ‡∏£‡∏∞‡∏Å‡∏≠‡∏ö‡πÄ‡∏•‡∏¢‡∏Ñ‡∏£‡∏±‡∏ö Zinc Vistra 2...,‡∏≠‡∏≤‡∏´‡∏≤‡∏£‡πÄ‡∏™‡∏£‡∏¥‡∏°,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,197,11,18,1,1,1,neu,"[‡∏ï‡∏≤‡∏°, ‡∏´‡∏±‡∏ß‡∏Ç‡πâ‡∏≠, ‡πÅ‡∏•‡∏∞, ‡∏ï‡∏≤‡∏°, ‡∏£‡∏π‡∏õ‡∏õ‡∏£‡∏∞‡∏Å‡∏≠‡∏ö, ‡πÄ‡∏•‡∏¢, ‡∏Ñ‡∏£‡∏±‡∏ö, ...",0
31804,39711716,2020-03-13 15:14:00,https://pantip.com/profile/162639,PANTIP CREW,‡πÅ‡∏•‡∏∞‡∏¢‡∏±‡∏á‡πÄ‡∏õ‡πá‡∏ô Expert Account ‡πÉ‡∏ô‡∏ô‡∏≤‡∏°‡∏•‡πá‡∏≠‡∏Å‡∏≠‡∏¥‡∏ô pholfoo...,‡∏≠‡∏≤‡∏´‡∏≤‡∏£,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 5 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,154,8,43,1,2,2,pos,"[‡πÅ‡∏•‡∏∞, ‡∏¢‡∏±‡∏á, ‡πÄ‡∏õ‡πá‡∏ô, expert, account, ‡πÉ‡∏ô‡∏ô‡∏≤‡∏°, ‡∏•‡πá‡∏≠‡∏Å‡∏≠...",0
3000,39646166,2020-02-17 16:39:00,https://pantip.com/profile/5746727,‡∏™‡∏°‡∏≤‡∏ä‡∏¥‡∏Å‡∏´‡∏°‡∏≤‡∏¢‡πÄ‡∏•‡∏Ç 5746727,‡∏≠‡∏¢‡∏≤‡∏Å‡∏ú‡∏¥‡∏ß‡∏Ç‡∏≤‡∏ß‡πÉ‡∏™ ‡πÅ‡∏ï‡πà‡∏Ç‡∏µ‡πâ‡πÄ‡∏Å‡∏µ‡∏¢‡∏à‡∏ó‡∏≤‡∏Ñ‡∏£‡∏µ‡∏°,‡∏≠‡∏≤‡∏´‡∏≤‡∏£‡πÄ‡∏™‡∏£‡∏¥‡∏°,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,16,2,9,1,1,3,neu,"[‡∏≠‡∏¢‡∏≤‡∏Å, ‡∏ú‡∏¥‡∏ß‡∏Ç‡∏≤‡∏ß, ‡πÉ‡∏™, ‡πÅ‡∏ï‡πà, ‡∏Ç‡∏µ‡πâ‡πÄ‡∏Å‡∏µ‡∏¢‡∏à, ‡∏ó‡∏≤, ‡∏Ñ‡∏£‡∏µ‡∏°]",0
8760,39763266,2020-03-31 19:43:00,https://pantip.com/profile/567821,sugarsand,‡πÅ‡∏Å‡πâ‡πÑ‡∏Ç‡∏Ç‡πâ‡∏≠‡∏Ñ‡∏ß‡∏≤‡∏°‡πÄ‡∏°‡∏∑‡πà‡∏≠ 31 ‡∏°‡∏µ‡∏ô‡∏≤‡∏Ñ‡∏° ‡πÄ‡∏ß‡∏•‡∏≤ 19:55 ‡∏ô.,‡πÄ‡∏Ñ‡∏£‡∏∑‡πà‡∏≠‡∏á‡∏™‡∏≥‡∏≠‡∏≤‡∏á,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,96,3,16,1,1,1,neu,"[‡πÅ‡∏Å‡πâ‡πÑ‡∏Ç, ‡∏Ç‡πâ‡∏≠‡∏Ñ‡∏ß‡∏≤‡∏°, ‡πÄ‡∏°‡∏∑‡πà‡∏≠, 31, ‡∏°‡∏µ‡∏ô‡∏≤‡∏Ñ‡∏°, ‡πÄ‡∏ß‡∏•‡∏≤, 19, ...",0
27651,39762185,2020-03-31 20:48:00,https://pantip.com/profile/674738,lantaolhin,‡πÄ‡∏û‡∏¥‡πà‡∏á‡∏£‡∏π‡πâ‡πÄ‡∏´‡∏ï‡∏∏‡∏ú‡∏•‡∏ó‡∏µ‡πà‡πÅ‡∏ó‡πâ‡∏à‡∏£‡∏¥‡∏á‡∏Å‡πá‡∏ß‡∏±‡∏ô‡∏ô‡∏µ‡πâ‡πÄ‡∏≠‡∏á,‡∏≠‡∏≤‡∏´‡∏≤‡∏£,‡∏ñ‡∏π‡∏Å‡πÉ‡∏à 0 ‡∏Ç‡∏≥‡∏Å‡∏•‡∏¥‡πâ‡∏á 0 ‡∏´‡∏•‡∏á‡∏£‡∏±‡∏Å 0 ‡∏ã‡∏∂‡πâ‡∏á 0 ‡∏™‡∏¢‡∏≠‡∏á 0 ‡∏ó‡∏∂‡πà‡∏á 0,21,3,9,1,1,1,neu,"[‡πÄ‡∏û‡∏¥‡πà‡∏á, ‡∏£‡∏π‡πâ, ‡πÄ‡∏´‡∏ï‡∏∏‡∏ú‡∏•, ‡∏ó‡∏µ‡πà, ‡πÅ‡∏ó‡πâ‡∏à‡∏£‡∏¥‡∏á, ‡∏Å‡πá, ‡∏ß‡∏±‡∏ô‡∏ô‡∏µ‡πâ,...",0


In [15]:
all_df_ws.head()

Unnamed: 0,texts,targets,processed,cal_sentiment
25232,Nissan Silvia S14 ‡∏´‡∏ô‡πâ‡∏≤‡∏´‡∏°‡∏π‡∏ó‡∏µ‡πà‡πÄ‡∏ó‡πà‡∏´‡πå‡πÑ‡∏°‡πà‡πÄ‡∏´‡∏°‡∏∑‡∏≠‡∏ô‡πÉ‡∏Ñ‡∏£ ...,pos,"[nissan, silvia, s, 14, ‡∏´‡∏ô‡πâ‡∏≤, ‡∏´‡∏°‡∏π, ‡∏ó‡∏µ‡πà, ‡πÄ‡∏ó‡πà‡∏´‡πå,...",0
11150,‡πÑ‡∏õ‡∏î‡∏π‡∏´‡∏ô‡∏±‡∏á‡∏Å‡∏±‡∏ô‡∏°‡∏±‡πâ‡∏¢‡∏à‡πâ‡∏∞‡∏ß‡∏¥,neu,"[‡πÑ‡∏õ‡∏î‡∏π, ‡∏´‡∏ô‡∏±‡∏á, ‡∏Å‡∏±‡∏ô, ‡∏°‡∏±‡πâ‡∏¢, ‡∏à‡πâ‡∏∞, ‡∏ß‡∏¥]",0
23799,‡∏î‡∏±‡∏ô.....‡πÜ‡∏Ñ‡∏£‡∏±‡∏ö....‡πÄ‡∏≠‡∏≤‡∏î‡πâ‡∏ß‡∏¢,pos,"[‡∏î‡∏±‡∏ô, ., xxrep, ‡πÜ, ‡∏Ñ‡∏£‡∏±‡∏ö, ., xxrep, ‡πÄ‡∏≠‡∏≤, ‡∏î‡πâ‡∏ß‡∏¢]",0
19312,‡∏Ñ‡∏∑‡∏ô‡∏ô‡∏µ‡πâ Chang Carnival World of Illusion ‡∏ó‡∏µ‡πà‡∏•‡∏≤‡∏ô...,neu,"[‡∏Ñ‡∏∑‡∏ô‡∏ô‡∏µ‡πâ, chang, carnival, world, of, illusion,...",0
12291,‡∏à‡∏±‡∏î‡πÑ‡∏´‡πâ‡∏ô‡πâ‡∏≤‡∏´‡∏ô‡πà‡∏≠‡∏¢‡∏î‡∏¥‡∏ï‡∏±‡∏á‡∏≠‡∏≠‡∏Å‡∏≠‡πà‡∏∞,neu,"[‡∏à‡∏±‡∏î, ‡πÑ‡∏´‡πâ, ‡∏ô‡πâ‡∏≤, ‡∏´‡∏ô‡πà‡∏≠‡∏¢, ‡∏î‡∏¥, ‡∏ï‡∏±‡∏á, ‡∏≠‡∏≠‡∏Å, ‡∏≠‡πà‡∏∞]",0


In [16]:
# see the correlation between the above computational sentiment and human ratings 
# for train set
all_df_kt['targets_codes'] = all_df_kt['vote'].astype('category').cat.codes
all_df_ws['targets_codes'] = all_df_ws['targets'].astype('category').cat.codes

# for test set
test_df_kt['targets_codes'] = test_df_kt['vote'].astype('category').cat.codes
test_df_ws['targets_codes'] = test_df_ws['targets'].astype('category').cat.codes

print(all_df_kt['targets_codes'].corr(all_df_kt['cal_sentiment']), \
      all_df_ws['targets_codes'].corr(all_df_ws['cal_sentiment']))

0.17576646507820368 0.11384882038831483


In [17]:
# create word count feature
all_df_kt = all_df_kt.rename(columns = {'sent_length':'wc'})
all_df_ws['wc'] =  all_df_ws['processed'].map(len)

test_df_kt = all_df_kt.rename(columns = {'sent_length':'wc'})
test_df_ws['wc'] =  test_df_ws['processed'].map(len)

In [18]:
# see if the a correlation with word count
print(all_df_kt['targets_codes'].corr(all_df_kt['wc']), \
      all_df_ws['targets_codes'].corr(all_df_ws['wc']))

-0.01440540432656602 -0.0728353149652106


In [19]:
#all_df_kt.to_csv('all_df_kt.csv',  index=False)
#test_df_kt.to_csv('test_df_kt.csv',  index=False)

all_df_ws.to_csv('all_df_ws.csv', index=False)
test_df_ws.to_csv('test_df_ws.csv', index=False)

## Bag of words 

In [20]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import joblib

# BOW with unigram and bigrams
bow1 = CountVectorizer(ngram_range=(1, 1))
bow2 = CountVectorizer(ngram_range=(2, 2))

bow1_fit_ws = bow1.fit(all_df_ws['processed'].apply(str))
bow1_fit_ws_kt = bow1_fit_ws.fit(all_df_kt['processed'].apply(str))

bow2_fit_ws = bow2.fit(all_df_ws['processed'].apply(str))
bow2_fit_ws_kt = bow2_fit_ws.fit(all_df_kt['processed'].apply(str))

joblib.dump(bow1_fit_ws, 'bow1_fit_ws.pkl')
joblib.dump(bow1_fit_ws_kt, 'bow1_fit_ws_kt.pkl')
joblib.dump(bow2_fit_ws, 'bow2_fit_ws.pkl')
joblib.dump(bow2_fit_ws_kt, 'bow2_fit_ws_kt.pkl')

['bow2_fit_ws_kt.pkl']

## TF-IDF

In [21]:
tfidf1 = TfidfVectorizer(ngram_range=(1, 1), min_df=20, sublinear_tf=True)
tfidf2 = TfidfVectorizer(ngram_range=(2, 2), min_df=20, sublinear_tf=True)

tfidf1_fit_ws = tfidf1.fit(all_df_ws['processed'].apply(str))
tfidf1_fit_ws_kt = tfidf1_fit_ws.fit(all_df_kt['processed'].apply(str))

tfidf2_fit_ws = tfidf2.fit(all_df_ws['processed'].apply(str))
tfidf2_fit_ws_kt = tfidf2_fit_ws.fit(all_df_kt['processed'].apply(str))

joblib.dump(tfidf1_fit_ws, 'tfidf1_fit_ws.pkl')
joblib.dump(tfidf1_fit_ws_kt, 'tfidf1_fit_ws_kt.pkl')
joblib.dump(tfidf2_fit_ws, 'tfidf2_fit_ws.pkl')
joblib.dump(tfidf2_fit_ws_kt, 'tfidf2_fit_ws_kt.pkl')

['tfidf2_fit_ws_kt.pkl']

## Word2vec

In [22]:
from collections import Counter, defaultdict

class TfidfEmbeddingVectorizer(object):
    def __init__(self, model):
        w2v = {w: vec for w, vec in zip(model.wv.index_to_key, model.wv.vectors)}
        self.word2vec = w2v
        self.word2weight = None
        self.dim = model.vector_size
    
    
    def fit(self, X):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

In [23]:
import gensim
from pythainlp import word_vector
from gensim.models import Word2Vec, KeyedVectors

# create word2vec models and intersect with the thai wiki pretrained one.

w2v_model_ws = Word2Vec(vector_size=300, min_count=1, window = 5, workers=4)
w2v_model_ws.build_vocab(all_df_ws['processed'])
total_examples = w2v_model_ws.corpus_count
total_examples

21389

In [24]:
w2v_thwiki = word_vector.get_model()
w2v_model_ws.build_vocab(w2v_thwiki.index_to_key, update=True)
w2v_model_ws.wv.vectors_lockf = np.ones(len(w2v_model_ws.wv))
w2v_model_ws.wv.intersect_word2vec_format('thai2vec.bin', binary=True, lockf=1.0)
total_examples = w2v_model_ws.corpus_count
total_examples

51358

In [25]:
w2v_model_ws.train(all_df_ws['processed'], total_examples=total_examples, epochs=100)

(42542754, 47151500)

In [26]:
w2v_model_ws.wv.most_similar("‡∏ö‡∏∞‡∏´‡∏°‡∏µ‡πà")

[('‡∏Ç‡πâ‡∏≤‡∏ß‡∏°‡∏±‡∏ô‡πÑ‡∏Å‡πà', 0.39422789216041565),
 ('‡∏Å‡∏∏‡πâ‡∏á', 0.37291258573532104),
 ('‡πÑ‡πà‡∏Ç‡πà', 0.36208683252334595),
 ('‡πÑ‡∏Ç‡πà‡πÄ‡∏Ñ‡πá‡∏°', 0.3498747944831848),
 ('üç£', 0.3467874228954315),
 ('‡∏•‡∏π‡∏Å‡∏ä‡∏¥‡πâ‡∏ô', 0.34412676095962524),
 ('‡πÅ‡∏´‡∏ô‡∏°', 0.34077826142311096),
 ('‡∏´‡∏°‡∏π‡πÅ‡∏î‡∏á', 0.3383786678314209),
 ('‡πÉ‡∏™‡πâ', 0.32756781578063965),
 ('‡∏¢‡∏≥', 0.32196688652038574)]

In [27]:
import dill # we use dill instead of joblim because the lambda and dependecie in class TfidfEmbeddingVectorizer
# now we have our w2v mmodel, we need to convert single vector
w2v_tfidf_emb_ws = TfidfEmbeddingVectorizer(w2v_model_ws)
w2v_tifdf_fit_ws = w2v_tfidf_emb_ws.fit(all_df_ws['processed'])
dill.dump(w2v_tifdf_fit_ws, open('w2v-tfidf_fit_ws.pkl', 'wb'))

In [28]:
# train on new words from kt4.0 corpus
w2v_model_ws.build_vocab(all_df_kt['processed'], update = True)
total_examples = w2v_model_ws.corpus_count
total_examples
w2v_model_ws.train(all_df_kt['processed'], total_examples=total_examples, epochs=100)

w2v_tfidf_emb_ws_kt = TfidfEmbeddingVectorizer(w2v_model_ws)
w2v_tifdf_fit_ws_kt = w2v_tfidf_emb_ws_kt.fit(all_df_ws['processed'])
dill.dump(w2v_tifdf_fit_ws_kt, open('w2v-tfidf_fit_ws_kt.pkl', 'wb'))

In [29]:
w2v_model_ws.wv.most_similar("‡∏ö‡∏∞‡∏´‡∏°‡∏µ‡πà")

[('‡∏¢‡πà‡∏≤‡∏á', 0.3160748779773712),
 ('‡πÑ‡∏™‡πâ‡∏Å‡∏£‡∏≠‡∏Å', 0.2988615930080414),
 ('‡πÑ‡∏Å‡πà‡∏ö‡πâ‡∏≤‡∏ô', 0.28489479422569275),
 ('‡∏ä‡∏µ‡∏™', 0.28290531039237976),
 ('‡∏Å‡∏∞‡∏ó‡∏¥', 0.2817407250404358),
 ('‡∏ã‡∏∏‡∏õ', 0.2725837826728821),
 ('üçö', 0.272504061460495),
 ('‡πÅ‡∏°‡∏Ñ‡πÇ‡∏Ñ‡∏£', 0.261968731880188),
 ('‡∏´‡∏±‡∏ß‡πÑ‡∏ä‡πÄ‡∏ó‡πâ‡∏≤', 0.2597872018814087),
 ('‡πÄ‡∏Ñ‡πâ‡∏°', 0.2591487765312195)]

## POS_Tag


In [30]:
def flatten(x):
    l = list(sum(x, ()))
    return l

In [31]:
from pythainlp.tag import pos_tag_sents

# we used a POS tag with the orchid_ud feature that represented a type of word in a sentence in one-hot vector form
# flatten the list of tuple in series was applied for feature vectors
all_df_ws['POSTags'] = pos_tag_sents(all_df_ws['texts'].apply(str).apply(process_thai).tolist(), \
                                     corpus='orchid_ud')
all_df_ws['POSTags'] = all_df_ws['POSTags'].apply(flatten)

# TODO: concate word with pos (‡∏°‡∏±‡∏ô_ADV)

In [32]:
pos1 = CountVectorizer(ngram_range=(1, 1))
pos2 = CountVectorizer(ngram_range=(2, 2))

pos1_fit_ws = pos1.fit(all_df_ws['processed'].apply(str))
pos1_fit_ws_kt = pos1_fit_ws.fit(all_df_kt['processed'].apply(str))

pos2_fit_ws = pos2.fit(all_df_ws['processed'].apply(str))
pos2_fit_ws_kt = pos2_fit_ws.fit(all_df_kt['processed'].apply(str))

joblib.dump(pos1_fit_ws, 'pos1_fit_ws.pkl')
joblib.dump(pos1_fit_ws_kt, 'pos1_fit_ws_kt.pkl')
joblib.dump(pos2_fit_ws, 'pos2_fit_ws.pkl')
joblib.dump(pos2_fit_ws_kt, 'pos2_fit_ws_kt.pkl')

['pos2_fit_ws_kt.pkl']

## Train-valid split

In [33]:
train_df_ws, valid_df_ws = train_test_split(all_df_ws, test_size=0.15, random_state=42)

In [34]:
text_train_ws = tfidf2_fit_ws.transform(train_df_ws['texts'].apply(str))
text_valid_ws = tfidf2_fit_ws.transform(valid_df_ws['texts'].apply(str))
text_test_ws = tfidf2_fit_ws.transform(test_df_ws['texts'].apply(str))
text_train_ws.shape, text_valid_ws.shape, text_test_ws.shape

((18180, 2395), (3209, 2395), (5348, 2395))

In [35]:
valid_df_ws.targets.value_counts() / valid_df_ws.shape[0]

neu    0.548146
neg    0.246183
pos    0.180118
q      0.025553
Name: targets, dtype: float64

In [36]:
# The calculated sentiment and word count features might be useful, so we concat them to the text feature
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler_fit = scaler.fit(np.asarray(all_df_ws['cal_sentiment']).reshape(-1, 1))
print(scaler_fit.mean_, scaler_fit.var_)

cal_sent_train = scaler_fit.transform(np.asarray(train_df_ws['cal_sentiment']).reshape(-1, 1).astype(float))
cal_sent_valid = scaler_fit.transform(np.asarray(valid_df_ws['cal_sentiment']).reshape(-1, 1).astype(float))
cal_sent_test = scaler_fit.transform(np.asarray(test_df_ws['cal_sentiment']).reshape(-1, 1).astype(float))
cal_sent_train.shape, cal_sent_valid.shape

[0.07882556] [0.75324233]


((18180, 1), (3209, 1))

In [37]:
scaler_fit = scaler.fit(np.asarray(all_df_ws['wc']).reshape(-1, 1))
print(scaler_fit.mean_, scaler_fit.var_)

num_train = scaler_fit.transform(np.asarray(train_df_ws['wc']).reshape(-1, 1).astype(float))
num_valid = scaler_fit.transform(np.asarray(valid_df_ws['wc']).reshape(-1, 1).astype(float))
num_test = scaler_fit.transform(np.asarray(test_df_ws['wc']).reshape(-1, 1).astype(float))
num_train.shape, num_valid.shape


[22.04474262] [1169.8183263]


((18180, 1), (3209, 1))

## Test the extracted features with Logistic Regression

In [38]:
y_train = train_df_ws['targets']
y_valid = valid_df_ws['targets']

# concat text vector and generated feature vectors
X_train = np.concatenate([text_train_ws.toarray(), cal_sent_train, num_train], axis=1)
X_valid = np.concatenate([text_valid_ws.toarray(), cal_sent_valid, num_valid], axis=1)
X_test = np.concatenate([text_test_ws.toarray(), cal_sent_test, num_test], axis=1)

#X_train = text_train_ws.toarray()
#X_valid = text_valid_ws.toarray()
#X_test = text_test_ws.toarray()

#X_train = text_train_ws
#X_valid  = text_valid_ws
#X_test = text_test_ws

X_train.shape, X_valid.shape

((18180, 2397), (3209, 2397))

In [39]:
# quick test with out cv
#fit logistic regression models
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=2., penalty="l2", solver="liblinear", dual=False, multi_class="ovr")
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
#y_pred = model.predict(X_valid)

0.5687129947023996

In [40]:
#np.unique(y_pred) 
# this might cause from label 'q' is not present in the y_pred (due to the severe imbalance class)

In [41]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, classification_report
def build_model(model):
    scores = (cross_val_score(model, X_train, y_train, cv = 5).mean())
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    acc_sc = accuracy_score(y_valid, y_pred)
    pre_sc = precision_score(y_valid, y_pred, average='weighted')
    rec_sc = recall_score(y_valid, y_pred, average='weighted')
    f1_sc = f1_score(y_valid, y_pred, average='weighted')
    print('Accuracy :',acc_sc)
    print('Confusion Matrix :\n', confusion_matrix(y_valid, y_pred))
    print('Precision :', pre_sc)
    print('Recall :', rec_sc)
    print('F1-score :', f1_sc)
    print('Classification Report :\n', classification_report(y_valid, y_pred))
    print('Average accuracy of k-fold (5-fold) :', scores ,'\n')

In [42]:
build_model(model)

Accuracy : 0.5687129947023996
Confusion Matrix :
 [[ 112  656   22    0]
 [  65 1677   17    0]
 [  20  522   36    0]
 [   0   80    2    0]]
Precision : 0.5373724095592892
Recall : 0.5687129947023996
F1-score : 0.467336708171298
Classification Report :
               precision    recall  f1-score   support

         neg       0.57      0.14      0.23       790
         neu       0.57      0.95      0.71      1759
         pos       0.47      0.06      0.11       578
           q       0.00      0.00      0.00        82

    accuracy                           0.57      3209
   macro avg       0.40      0.29      0.26      3209
weighted avg       0.54      0.57      0.47      3209

Average accuracy of k-fold (5-fold) : 0.5613861386138613 



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [43]:
#np.unique(y_pred) 
# this might cause from label 'q' is not present in the y_pred (due to the severe imbalance class)