<a href="https://colab.research.google.com/github/kwanglo/mge51101-20195171/blob/master/final_project/02_multisentiment_ML_W2V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Multi-sentiment classification using machine learning**

In this section, we build multi-sentiment classification models using classical machine learning techniques. <br>
<br>
**Applied embedding:** fastText Korean vectors pre-trained on Wikipedia <br>
**Applied vectorizers:** <br>
Count vectors, word-level TF-IDF, N-gram TF-IDF, character-level TF-IDF
<br>

**Applied machine learning models:** <br>
Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest
<br>


1. Link to Google Drive
2. Import required libraries
3. Load the prepared dataset
4. Split each of the train, validation, and test sets into sentences and labels

In [None]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [None]:
!pip3 install konlpy
!pip3 install soynlp

In [None]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import xgboost, string
import pandas as pd
import numpy as np
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

Using TensorFlow backend.


In [None]:
path='/gdrive/My Drive/Colab Notebooks/Final Project/dataset/'

train_data = pd.read_csv(path+"sentiment_train.csv")
valid_data = pd.read_csv(path+"sentiment_valid.csv")
test_data = pd.read_csv(path+"sentiment_test.csv")

In [None]:
test_data.head()

Unnamed: 0,Sentence,Emotion
0,약은 최대한 안먹으려고 하는데좋은 음시있나요?0,1
1,몸무게 1키로찌는건 아니겠죠?,1
2,보통 가진통도 이렇게 오래가나요?,1
3,여자가 술취해서 먼저 전화하는거 짜증나요???,1
4,아무래도 무리겠죠?,1


In [None]:
train_x, train_y = train_data['Sentence'], train_data['Emotion']
valid_x, valid_y = valid_data['Sentence'], valid_data['Emotion']
test_x, test_y = test_data['Sentence'], test_data['Emotion']

# Preprocessing before training

1. Import vectorizer
2. Set stopwords and build clean dataset
3. Tokenize dataset
4. Import pre-trained embedding from fastText

In [None]:
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(train_data['Sentence'])

# transform the training, validation, and test data using the fitted count vectorizer
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)
xtest_count =  count_vect.transform(test_x)
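
To see what the `token_pattern=r'\w{1,}'` setting changes, here is a minimal sketch on two made-up sentences (they are illustrative and not taken from the dataset). scikit-learn's default pattern keeps only tokens of two or more word characters, whereas `\w{1,}` also keeps single-character tokens, which can matter for Korean text where one-syllable words are common.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentences, not from the project dataset
toy_sentences = ["이 약 먹어도 되나요", "약 먹고 잠 자요"]

# Default pattern: tokens need at least two word characters
default_vect = CountVectorizer(analyzer='word')
# Pattern used in this notebook: single-character tokens are kept as well
single_char_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

default_vocab = sorted(default_vect.fit(toy_sentences).vocabulary_)
single_char_vocab = sorted(single_char_vect.fit(toy_sentences).vocabulary_)

print(default_vocab)      # only the multi-character tokens
print(single_char_vocab)  # additionally contains '이', '약', '잠'
```
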

In [None]:
#TF-IDF vectorizer
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=15000)
tfidf_vect.fit(train_data['Sentence'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
xtest_tfidf =  tfidf_vect.transform(test_x)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=15000)
tfidf_vect_ngram.fit(train_data['Sentence'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)
xtest_tfidf_ngram =  tfidf_vect_ngram.transform(test_x)

# character-level tf-idf (token_pattern only applies to analyzer='word', so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=15000)
tfidf_vect_ngram_chars.fit(train_data['Sentence'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(train_x) 
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x)
xtest_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(test_x)
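
The difference between the word-level and character-level n-gram features can be seen on a tiny example. The sketch below uses made-up inputs and the same `ngram_range=(2, 3)` as the cell above; it only lists which features each analyzer extracts and makes no claim about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams: every 2- and 3-character window of the text
char_vect = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
char_vect.fit(["좋아요"])  # illustrative input
char_ngrams = sorted(char_vect.vocabulary_)

# Word n-grams: every run of 2 or 3 consecutive tokens
word_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2, 3))
word_vect.fit(["약 먹고 잠 자요"])  # illustrative input
word_ngrams = sorted(word_vect.vocabulary_)

print(char_ngrams)
print(word_ngrams)
```

A four-token sentence yields only five word n-grams, each of which has to reappear verbatim in another sentence to be useful, while character n-grams are shared across many different words.
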



In [None]:
stop_words_set = pd.read_csv(path+'stopwords100.txt', header=0, delimiter='\t', quoting=3)
stop_words = list(stop_words_set['aa'])
stop_words2 = ['은', '는', '이', '가', '하', '아', '것', '들','의', '있', '되', '수', '보', '주', '등', '한']
# append the additional particles and stems to the loaded stop-word list
stop_words.extend(stop_words2)

In [None]:
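from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.normalizer import only_hangle, repeat_normalize
import re
from konlpy.tag import Okt

# note: this function name shadows sklearn's `preprocessing` module imported earlier
def preprocessing(text, okt, remove_stopwords=False, stop_words=None):
    # keep only Hangul characters and limit repeated characters to two
    text = only_hangle(text)
    text = repeat_normalize(text, num_repeats=2)

    # split into morphemes, reducing each one to its stem
    text_token = okt.morphs(text, stem=True)

    if remove_stopwords and stop_words:
        text_token = [token for token in text_token if token not in stop_words]

    return text_token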
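
The cleaning steps can be tried without installing soynlp or KoNLPy. The sketch below is a rough stand-in under stated assumptions: `keep_hangul` is a hypothetical helper that approximates `only_hangle` with a regular expression, and a plain whitespace split replaces `okt.morphs`, so it does not perform real morphological analysis.

```python
import re

# Hypothetical stand-in for soynlp's only_hangle:
# replace every run of non-Hangul characters with one space
NON_HANGUL = re.compile('[^ㄱ-ㅎㅏ-ㅣ가-힣]+')

def keep_hangul(text):
    return NON_HANGUL.sub(' ', text).strip()

def remove_stopwords(tokens, stop_words):
    # a set makes each membership check constant time
    stop_set = set(stop_words)
    return [token for token in tokens if token not in stop_set]

cleaned = keep_hangul("약은 먹어도 되나요?? lol 123")  # illustrative input
tokens = cleaned.split()  # whitespace split stands in for okt.morphs
filtered = remove_stopwords(tokens, ['약은'])

print(cleaned)
print(filtered)
```
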

In [None]:
# Preprocessing - train, validation, and test data
okt = Okt()

def clean_sentences(sentences):
    # non-string entries (e.g. NaN) become empty token lists
    return [preprocessing(sentence, okt, remove_stopwords=True, stop_words=stop_words)
            if isinstance(sentence, str) else []
            for sentence in sentences]

clean_train_data = clean_sentences(train_data['Sentence'])
clean_valid_data = clean_sentences(valid_data['Sentence'])
clean_test_data = clean_sentences(test_data['Sentence'])

In [None]:
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_train_data)
train_sequences = tokenizer.texts_to_sequences(clean_train_data)
valid_sequences = tokenizer.texts_to_sequences(clean_valid_data)
test_sequences = tokenizer.texts_to_sequences(clean_test_data)

word_index = tokenizer.word_index

MAX_SEQUENCE_LENGTH = 70
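
`MAX_SEQUENCE_LENGTH` is defined here but the padding step itself is not shown in this notebook. Assuming the index sequences are later brought to a fixed length, the sketch below illustrates the idea in plain Python. `pad_to_length` is a hypothetical helper; it pads and truncates at the front, which, to our understanding, matches the defaults of Keras `pad_sequences`.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` and keep only its last `maxlen` ids."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Toy index sequences, maxlen shortened to 4 for readability
toy_sequences = [[5, 3], [1, 2, 3, 4, 5]]
toy_padded = pad_to_length(toy_sequences, maxlen=4)
print(toy_padded)
```

Index 0 is free to serve as the padding value because the tokenizer's word indices start at 1.
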

In [None]:
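embeddings_index = {}
with open(path+'wiki.ko.vec', encoding='utf-8') as f:
    next(f)  # skip the header line (vocabulary size and vector dimension)
    for line in f:
        values = line.rstrip().split(' ')
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# row i holds the vector of the word with index i;
# words without a pre-trained vector keep an all-zero row
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector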
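
The lookup logic above can be checked on a toy file before loading the full fastText vectors. The sketch below is self-contained and uses an in-memory stand-in for `wiki.ko.vec` with made-up 3-dimensional vectors; the `toy_*` names and the coverage figure are illustrative assumptions, not part of the original notebook.

```python
import io
import numpy as np

# Toy stand-in for a .vec file: a "<num_words> <dim>" header, then one word per line
toy_vec_file = io.StringIO("2 3\n약 0.1 0.2 0.3\n잠 0.4 0.5 0.6\n")

toy_embeddings = {}
next(toy_vec_file)  # skip the header line
for line in toy_vec_file:
    values = line.rstrip().split(' ')
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# Word indices start at 1, as with a Keras-style tokenizer; row 0 stays empty
toy_word_index = {'약': 1, '먹다': 2}
toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Share of vocabulary words that received a pre-trained vector
coverage = sum(word in toy_embeddings for word in toy_word_index) / len(toy_word_index)
print(toy_matrix)
print(coverage)
```

Checking the coverage on the real vocabulary is a cheap way to see how many rows of the matrix remain all-zero.
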

# Model building

1. Set train and test models
2. Get machine learning models from scikit learn
3. Train!

In [None]:
# Original model
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # accuracy against the global validation labels (y_true first, then y_pred)
    return metrics.accuracy_score(valid_y, predictions)

In [None]:
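# From other reference
def test_model(classifier, X_train, y_train, X_test):
    # fit on the training set, then predict the test set
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # accuracy against the global test labels (y_true first, then y_pred)
    return metrics.accuracy_score(test_y, y_pred)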
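
Both helpers above read their labels from global variables and refit the classifier for each split. As an alternative sketch (not the method used for the results in this notebook), the hypothetical `fit_and_score` below fits once and takes every evaluation set together with its labels explicitly. The toy English sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def fit_and_score(classifier, X_train, y_train, eval_sets):
    """Fit once, then return the accuracy on each named (X, y) evaluation set."""
    classifier.fit(X_train, y_train)
    return {name: accuracy_score(y, classifier.predict(X))
            for name, (X, y) in eval_sets.items()}

# Made-up data, only to show the calling convention
toy_train, toy_train_y = ["good good", "bad bad", "good nice", "bad awful"], [1, 0, 1, 0]
toy_valid, toy_valid_y = ["good", "bad"], [1, 0]
toy_test, toy_test_y = ["nice", "awful"], [1, 0]

vect = CountVectorizer().fit(toy_train)
scores = fit_and_score(
    MultinomialNB(),
    vect.transform(toy_train), toy_train_y,
    {"valid": (vect.transform(toy_valid), toy_valid_y),
     "test": (vect.transform(toy_test), toy_test_y)},
)
print(scores)
```
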

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

In [None]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
# Naive Bayes on Count Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xtest_count)

print("NB, Count Vectors Accuracy")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Word Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xtest_tfidf)
print("NB, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
print("NB, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Character Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
print("NB, CharLevel Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

NB, Count Vectors Accuracy
Validation Accuracy:  0.38013571869216534
Test Accuracy:  0.37424425634824665
NB, WordLevel TF-IDF
Validation Accuracy:  0.36964836520666255
Test Accuracy:  0.3653480739333218
NB, N-Gram Vectors
Validation Accuracy:  0.16681061073411474
Test Accuracy:  0.1731732596303334
NB, CharLevel Vectors
Validation Accuracy:  0.48834053053670573
Test Accuracy:  0.4920538953187079


After training, the F1-score and confusion matrix were computed, as described in the proposal.

In [None]:
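def f_scores(classifier, X_train, y_train, X_test):
    # fit the classifier and return its test-set predictions;
    # precision, recall, and F1-score are computed from them afterwards
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    return y_pred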

In [None]:
#F1-score 
y_pred = f_scores(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)

precision_recall_fscore_support(test_y, y_pred, average='weighted')
# returns (precision, recall, F1-score, support); support is None when an average is requested

(0.481089554780168, 0.4920538953187079, 0.47989521354227416, None)

In [None]:
#Confusion Matrix
confusion_matrix(test_y, y_pred)

array([[ 230,  158,  328,  234,   99,  166,  234],
       [  70,  918,  210,   50,  261,   71,   60],
       [  92,  215,  989,  115,  115,  133,  110],
       [  59,   89,  226,  854,   77,   45,  350],
       [  36,  284,  115,   69,  943,   81,   52],
       [  60,   56,  201,   51,  105, 1287,   51],
       [ 119,   94,  257,  526,   89,   68,  476]])
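
The weighted scores reported above can be recovered from this confusion matrix alone. The sketch below uses NumPy and the matrix exactly as printed (rows are true classes, columns are predicted classes). The class indices are positions in the matrix; which emotion each index stands for is not stated here, so no names are assumed.

```python
import numpy as np

# Naive Bayes, character-level TF-IDF, test set (copied from the output above)
cm = np.array([
    [230, 158, 328, 234,  99,  166, 234],
    [ 70, 918, 210,  50, 261,   71,  60],
    [ 92, 215, 989, 115, 115,  133, 110],
    [ 59,  89, 226, 854,  77,   45, 350],
    [ 36, 284, 115,  69, 943,   81,  52],
    [ 60,  56, 201,  51, 105, 1287,  51],
    [119,  94, 257, 526,  89,   68, 476],
])

true_pos = np.diag(cm)
support = cm.sum(axis=1)      # number of true samples per class
predicted = cm.sum(axis=0)    # number of predictions per class

recall = true_pos / support
precision = true_pos / predicted
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)  # equals accuracy by construction
weighted_f1 = np.average(f1, weights=support)

# Largest off-diagonal cell: the most frequent single confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(np.round(recall, 3))
print(weighted_precision, weighted_recall, weighted_f1)
print(true_cls, pred_cls, off_diag[true_cls, pred_cls])
```

The per-class recall makes visible what the single weighted number hides: the classes at indices 0 and 6 are recognized far less often than the class at index 5, and the classes at indices 3 and 6 are frequently mistaken for each other.
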

In [None]:
# Linear Classifier on Count Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count)
print("LR, Count Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Word Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xtest_tfidf)
print("LR, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)

print("LR, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Character Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
print("LR, CharLevel Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR, Count Vectors
Validation Accuracy:  0.3731030228254164
Test Accuracy:  0.3667300051822422
LR, WordLevel TF-IDF
Validation Accuracy:  0.36557680444170265
Test Accuracy:  0.35792019347037485
LR, N-Gram Vectors
Validation Accuracy:  0.16693399136335596
Test Accuracy:  0.17256866470893073



LR, CharLevel Vectors
Validation Accuracy:  0.4983343615052437
Test Accuracy:  0.4931767144584557



In [None]:
#F1-score 
y_pred = f_scores(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)

precision_recall_fscore_support(test_y, y_pred, average='weighted')
# returns (precision, recall, F1-score, support); support is None when an average is requested


(0.4896338402343924, 0.4931767144584557, 0.4904485878130712, None)
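
The Logistic Regression runs in this section stop at the solver's iteration limit, as the "ITERATIONS REACHED LIMIT" message indicates. The message itself suggests raising `max_iter`. The sketch below shows that option on synthetic data from `make_classification`; the value 1000 is an arbitrary assumption, it has not been rerun on this project's data, and the accuracies reported here might change with it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, only to demonstrate the parameter
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

model = LogisticRegression(max_iter=1000)  # larger iteration budget than the default
model.fit(X_toy, y_toy)

# n_iter_ records how many iterations the solver actually used
iterations_used = int(model.n_iter_.max())
hit_limit = iterations_used >= model.max_iter
print(iterations_used, hit_limit)
```

Comparing `n_iter_` with `max_iter` after fitting is a simple way to confirm that the solver stopped because it converged and not because it ran out of iterations.
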

In [None]:
#Confusion Matrix
confusion_matrix(test_y, y_pred)

array([[ 376,  103,  266,  221,   78,  160,  245],
       [ 115,  859,  218,   51,  256,   62,   79],
       [ 167,  170,  937,  108,  106,  129,  152],
       [ 128,   66,  183,  776,   63,   44,  440],
       [  70,  249,  115,   75,  923,   84,   64],
       [ 105,   44,  157,   52,   92, 1292,   69],
       [ 210,   72,  202,  452,   78,   68,  547]])

In [None]:
# SVM on Ngram Level TF IDF Vectors
valid_accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
print("SVM, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

SVM, N-Gram Vectors
Validation Accuracy:  0.16594694632942628
Test Accuracy:  0.17127310416306787
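
To put the word n-gram accuracies in context, it helps to compare them with trivial baselines. The short calculation below takes the per-class test counts from the row sums of the confusion matrices printed in this section and computes the accuracy of uniform guessing and of always predicting the largest class. Treating these as the relevant baselines is our assumption.

```python
# Row sums of the test-set confusion matrices shown in this section
support = [1449, 1640, 1769, 1700, 1580, 1811, 1629]

n_classes = len(support)
uniform_chance = 1 / n_classes                     # guessing each class equally often
majority_baseline = max(support) / sum(support)    # always predicting the largest class

svm_ngram_test_accuracy = 0.17127310416306787      # reported above
margin_over_majority = svm_ngram_test_accuracy - majority_baseline

print(round(uniform_chance, 4), round(majority_baseline, 4), round(margin_over_majority, 4))
```

The word n-gram models sit only slightly above the majority-class baseline, which suggests that these features carry little usable signal in this setup.
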


In [None]:
# RF on Count Vectors
valid_accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xtest_count)
print("RF, Count Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# RF on Word Level TF IDF Vectors
valid_accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xtest_tfidf)
print( "RF, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

RF, Count Vectors
Validation Accuracy:  0.30475015422578655
Test Accuracy:  0.31188460874071516
RF, WordLevel TF-IDF
Validation Accuracy:  0.3230104873534855
Test Accuracy:  0.32276731732596303


Across all results, Naive Bayes and Logistic Regression with the character-level TF-IDF vectorizer were the top two models, with test accuracies of about 0.492 and 0.493 and weighted F1-scores of about 0.480 and 0.490, respectively.
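
The ranking behind this conclusion can be reproduced from the test accuracies printed in this section. The sketch below collects those numbers in a dictionary (the short model and feature names are our own labels) and sorts them.

```python
# Test accuracies as reported in the outputs of this section
reported_test_accuracy = {
    ("NB", "count"): 0.37424425634824665,
    ("NB", "word TF-IDF"): 0.3653480739333218,
    ("NB", "n-gram TF-IDF"): 0.1731732596303334,
    ("NB", "char TF-IDF"): 0.4920538953187079,
    ("LR", "count"): 0.3667300051822422,
    ("LR", "word TF-IDF"): 0.35792019347037485,
    ("LR", "n-gram TF-IDF"): 0.17256866470893073,
    ("LR", "char TF-IDF"): 0.4931767144584557,
    ("SVM", "n-gram TF-IDF"): 0.17127310416306787,
    ("RF", "count"): 0.31188460874071516,
    ("RF", "word TF-IDF"): 0.32276731732596303,
}

# Sort from best to worst test accuracy
ranking = sorted(reported_test_accuracy.items(), key=lambda item: item[1], reverse=True)
top_two = [name for name, _ in ranking[:2]]

for (model, features), acc in ranking:
    print(f"{model:>3}  {features:<14} {acc:.4f}")
```
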