<a href="https://colab.research.google.com/github/kwanglo/mge51101-20195171/blob/master/final_project/03_Utterance_ML_W2V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Utterance classification using machine learning**

In this section, we build utterance classification models using machine learning techniques. <br>
<br>
**Applied embedding:** fastText Korean word vectors pre-trained on Wikipedia (`wiki.ko.vec`)<br>
**Applied vectorizers:** <br>
Count vectors, word-level TF-IDF, n-gram TF-IDF, character-level TF-IDF
<br>

**Applied machine learning models:** <br>
Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest
<br>

1. Link to Google Drive
2. Import required libraries
3. Load the prepared dataset
4. Split into train, validation, and test sets

In [1]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [None]:
!pip3 install konlpy
!pip3 install soynlp

In [3]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import xgboost, string
import pandas as pd
import numpy as np
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

Using TensorFlow backend.


In [10]:
path='/gdrive/My Drive/Colab Notebooks/Final Project/dataset/'
rnd_num = 2020

train = pd.read_csv(path+"fci_train.csv")
valid_data = train.sample(frac=0.3, random_state=rnd_num)
train_data = train.drop(valid_data.index)
test_data = pd.read_csv(path+"fci_test.csv")

In [11]:
valid_data.head(3)

Unnamed: 0,label,text
13746,1,저도 곧 주부가
28756,2,동해와 남해 중
4996,1,규정 속도를 지키


In [12]:
valid_data.groupby('label').count()

label,text
0,1635
1,4960
2,4804
3,3494
4,487
5,302
6,858
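The counts above show a clear class imbalance in the validation split. As a rough reference point, the sketch below copies those counts and computes each label's share, plus the accuracy that a trivial "always predict the largest class" baseline would reach on this split. The `counts` dictionary is transcribed from the output above; the baseline is only an illustration and is not something the notebook evaluates.

```python
# Validation-set label counts, copied from the groupby output above.
counts = {0: 1635, 1: 4960, 2: 4804, 3: 3494, 4: 487, 5: 302, 6: 858}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label would score
# exactly that label's share as its accuracy.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"label {label}: {share:.3f}")
print(f"majority baseline (label {majority_label}): {majority_baseline:.3f}")
```

Any model accuracy reported later can be read against this baseline of roughly 0.30.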


In [13]:
train_x, train_y = train_data['text'], train_data['label']
valid_x, valid_y = valid_data['text'], valid_data['label']
test_x, test_y = test_data['text'], test_data['label']
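`train.sample(frac=0.3)` draws the validation rows uniformly at random, so rare labels can end up slightly over- or under-represented. One alternative, sketched here in plain Python on made-up labels, is a stratified split that takes the same fraction from every label. The helper name `stratified_split` and the toy data are assumptions for illustration; the notebook itself keeps the plain random sample.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_frac=0.3, seed=2020):
    """Return (train_idx, valid_idx) with about valid_frac of each label in valid."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_frac)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Toy labels: 20 rows of class 0 and 10 rows of class 1.
toy_labels = [0] * 20 + [1] * 10
train_idx, valid_idx = stratified_split(toy_labels)
print(len(train_idx), len(valid_idx))
```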

# Preprocessing before training

1. Import vectorizer
2. Set stopwords and build clean dataset
3. Tokenize dataset
4. Import pre-trained embedding from fastText

In [19]:
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(train_data['text'])

# transform the training, validation, and test data using the count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)
xtest_count =  count_vect.transform(test_x)
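To make the count-vector step concrete, here is a minimal plain-Python sketch of what a bag-of-words count with the `\w{1,}` token pattern amounts to: extract tokens with the regex, then count them per document. It leaves out details of `CountVectorizer` such as its default lowercasing and sparse output. The function name `count_tokens` and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w{1,}")  # same pattern passed to the vectorizers

def count_tokens(document):
    """Count regex-matched word tokens in one document."""
    return Counter(TOKEN_PATTERN.findall(document))

doc_counts = count_tokens("규정 속도를 지키자, 규정 속도!")
print(doc_counts)
```

Without morphological analysis, `속도를` and `속도` are counted as different tokens, which may partly explain why character-level features do better on this data.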

In [20]:
#TF-IDF vectorizer
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=15000)
tfidf_vect.fit(train_data['text'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
xtest_tfidf =  tfidf_vect.transform(test_x)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=15000)
tfidf_vect_ngram.fit(train_data['text'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)
xtest_tfidf_ngram =  tfidf_vect_ngram.transform(test_x)

# character-level tf-idf (token_pattern only applies when analyzer='word', so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=15000)
tfidf_vect_ngram_chars.fit(train_data['text'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(train_x) 
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x)
xtest_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(test_x)
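The character-level vectorizer uses overlapping character n-grams of length 2 and 3 rather than whole words. The sketch below enumerates those n-grams for a single string in plain Python. It is a simplified illustration: the helper name `char_ngrams` is hypothetical, and scikit-learn's own analyzer additionally normalizes whitespace and lowercases the text before extracting n-grams.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """List every character n-gram of length n_min..n_max, shorter lengths first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

grams = char_ngrams("속도를")
print(grams)
```

Because `속도를` contains the bigram `속도`, it shares a feature with the bare noun `속도` even though the two differ at the word level.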



In [21]:
stop_words_set = pd.read_csv(path+'stopwords100.txt',header = 0, delimiter = '\t', quoting = 3)
stop_words = list(stop_words_set['aa'])
stop_words2 = ['은', '는', '이', '가', '하', '아', '것', '들','의', '있', '되', '수', '보', '주', '등', '한']
# add the hand-picked particles and stems to the loaded stopword list
stop_words.extend(stop_words2)

In [22]:
from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.normalizer import *
import re
from konlpy.tag import Okt

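def preprocessing(text, okt, remove_stopwords=False, stop_words=None):
    # keep only Hangul characters, then shorten runs of a repeated character to at most two
    text = only_hangle(text)
    text = repeat_normalize(text, num_repeats=2)

    # split into morphemes with Okt, reducing each token to its stem
    text_token = okt.morphs(text, stem=True)

    if remove_stopwords:
        stop_words = set(stop_words or [])
        text_token = [token for token in text_token if token not in stop_words]

    return text_token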

In [23]:
# Preprocessing - train, validation, and test data
okt = Okt()

def clean_texts(texts):
    # non-string entries (such as NaN) become empty token lists
    return [
        preprocessing(text, okt, remove_stopwords=True, stop_words=stop_words)
        if isinstance(text, str) else []
        for text in texts
    ]

clean_train_data = clean_texts(train_data['text'])
clean_valid_data = clean_texts(valid_data['text'])
clean_test_data = clean_texts(test_data['text'])

In [24]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_train_data)
train_sequences = tokenizer.texts_to_sequences(clean_train_data)
valid_sequences = tokenizer.texts_to_sequences(clean_valid_data)
test_sequences = tokenizer.texts_to_sequences(clean_test_data)

word_index = tokenizer.word_index

MAX_SEQUENCE_LENGTH = 70
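`MAX_SEQUENCE_LENGTH` is defined here, but the padding step that would use it does not appear in this notebook. As a hedged illustration of what that step would do, the following plain-Python sketch pads or truncates each index sequence to a fixed length, assuming the Keras `pad_sequences` defaults (zeros added at the front, extra tokens dropped from the front). The helper `pad_to_length` is hypothetical and not a substitute for the Keras function.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad with `value`, or keep the last `maxlen` items (maxlen must be positive)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[5, 3, 8, 1], [7], []]
padded = pad_to_length(toy_sequences, maxlen=3)
print(padded)
```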

In [25]:
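embeddings_index = {}
# wiki.ko.vec is a UTF-8 text file: a "count dim" header line, then one "word v1 ... v300" line per word
with open(path+'wiki.ko.vec', encoding='utf-8') as f:
    next(f)  # skip the header line
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')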
    
    
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
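Words missing from the fastText vocabulary keep an all-zero row in `embedding_matrix`, so it can be useful to know how much of the vocabulary is actually covered. The sketch below shows one way to measure that on toy data in plain Python. `load_vec`, `coverage`, and the three-dimensional toy vectors are illustrative assumptions; the parsing mirrors the `.vec` text layout of a header line followed by one word and its values per line.

```python
import io

def load_vec(lines):
    """Parse fastText .vec text lines into {word: [float, ...]}, skipping the header."""
    lines = iter(lines)
    next(lines)  # header: "<vocab_size> <dimension>"
    vectors = {}
    for line in lines:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def coverage(word_index, vectors):
    """Fraction of tokenizer words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in vectors)
    return found / len(word_index)

# Toy stand-ins for wiki.ko.vec and tokenizer.word_index.
toy_vec = io.StringIO("2 3\n속도 0.1 0.2 0.3\n규정 0.4 0.5 0.6\n")
toy_word_index = {"속도": 1, "규정": 2, "지키다": 3, "주부": 4}

vectors = load_vec(toy_vec)
toy_coverage = coverage(toy_word_index, vectors)
print(toy_coverage)
```

With the real data, the same check would be `coverage(word_index, embeddings_index)`.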

# Model building

1. Define the train and test helper functions
2. Import machine learning models from scikit-learn
3. Train!

In [26]:
# Existing model
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    # score against the global valid_y; accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(valid_y, predictions)

In [27]:
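# Adapted from another reference
def test_model(classifier, X_train, y_train, X_test):
    # fit on the training data, then predict the test set
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # score against the global test_y; accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(test_y, y_pred)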
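`train_model` and `test_model` each refit the classifier before scoring, so every configuration is trained twice. A possible consolidation, sketched below, fits once and scores on both splits, taking the true labels as arguments instead of reading the globals `valid_y` and `test_y`. The helper `fit_and_score` and the toy `MajorityClassifier` are assumptions made so the example runs without scikit-learn; any estimator with `fit` and `predict` would work in its place.

```python
from collections import Counter

class MajorityClassifier:
    """Toy stand-in for an estimator: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_and_score(classifier, X_train, y_train, splits):
    """Fit once, then return {split_name: accuracy} for each (X, y) pair in splits."""
    classifier.fit(X_train, y_train)
    return {name: accuracy(y, classifier.predict(X)) for name, (X, y) in splits.items()}

scores = fit_and_score(
    MajorityClassifier(),
    X_train=[[0], [1], [2]],
    y_train=[1, 1, 2],
    splits={
        "valid": ([[3], [4]], [1, 2]),
        "test": ([[5], [6], [7], [8]], [1, 1, 1, 2]),
    },
)
print(scores)
```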

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

In [29]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [30]:
# Naive Bayes on Count Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xtest_count)

print("NB, Count Vectors Accuracy")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Word Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xtest_tfidf)
print("NB, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
print("NB, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Naive Bayes on Character Level TF IDF Vectors
valid_accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
test_accuracy = test_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
print("NB, CharLevel Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

NB, Count Vectors Accuracy
Validation Accuracy:  0.5965538089480048
Test Accuracy:  0.5706583891520993
NB, WordLevel TF-IDF
Validation Accuracy:  0.5854897218863362
Test Accuracy:  0.5618363012579644
NB, N-Gram Vectors
Validation Accuracy:  0.469770253929867
Test Accuracy:  0.4378369547459565
NB, CharLevel Vectors
Validation Accuracy:  0.6064087061668681
Test Accuracy:  0.5843816369874204


After training, the F1-score and confusion matrix were computed on the test set, as described in the proposal.

In [31]:
def f_scores(classifier, X_train, y_train, X_test):
  
  classifier.fit(X_train, y_train)
  y_pred = classifier.predict(X_test)

  return y_pred

In [32]:
#F1-score 
y_pred = f_scores(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)

precision_recall_fscore_support(test_y, y_pred, average='weighted')
#Precision / Recall / F1_score

(0.6356099106975589, 0.5843816369874204, 0.5426524287039687, None)

In [36]:
#Confusion Matrix
confusion_matrix(test_y, y_pred)

array([[  21,  485,   66,   26,    0,    1,    1],
       [   3, 1518,  210,   90,    1,    1,    7],
       [   0,  244, 1285,  253,    2,    0,    2],
       [   1,  324,  352,  618,    0,    1,    0],
       [   1,  130,   23,    3,   13,    0,    4],
       [   0,   70,   14,    6,    0,   18,    0],
       [   0,  160,   41,   21,    0,    1,  104]])
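Reading the matrix row by row gives the recall of each class (rows are true labels and columns are predictions, following scikit-learn's convention). The sketch below copies the matrix above and computes per-class recall and overall accuracy in plain Python; the variable names are illustrative.

```python
# Naive Bayes (character-level TF-IDF) confusion matrix, copied from the output above.
nb_matrix = [
    [21, 485, 66, 26, 0, 1, 1],
    [3, 1518, 210, 90, 1, 1, 7],
    [0, 244, 1285, 253, 2, 0, 2],
    [1, 324, 352, 618, 0, 1, 0],
    [1, 130, 23, 3, 13, 0, 4],
    [0, 70, 14, 6, 0, 18, 0],
    [0, 160, 41, 21, 0, 1, 104],
]

support = [sum(row) for row in nb_matrix]                    # true examples per class
correct = [nb_matrix[i][i] for i in range(len(nb_matrix))]   # diagonal entries

recall = [c / s for c, s in zip(correct, support)]
accuracy = sum(correct) / sum(support)

for label, r in enumerate(recall):
    print(f"label {label}: recall {r:.3f}")
print(f"accuracy {accuracy:.4f}")
```

Label 0 is recovered in only 21 of its 600 test utterances, and 485 of the misses land in column 1, so this model leans heavily toward the larger classes.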

In [33]:
# Linear Classifier on Count Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_count, train_y, xtest_count)
print("LR, Count Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Word Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xtest_tfidf)
print("LR, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)

print("LR, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# Linear Classifier on Character Level TF IDF Vectors
valid_accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
test_accuracy = test_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)
print("LR, CharLevel Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

LR, Count Vectors
Validation Accuracy:  0.6105199516324062
Test Accuracy:  0.5704950171540598
LR, WordLevel TF-IDF
Validation Accuracy:  0.5984280532043531
Test Accuracy:  0.5713118771442575
LR, N-Gram Vectors
Validation Accuracy:  0.4741837968561064
Test Accuracy:  0.43832707074007515
LR, CharLevel Vectors
Validation Accuracy:  0.708645707376058
Test Accuracy:  0.6593693840875674


In [37]:
#F1-score 
y_pred = f_scores(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xtest_tfidf_ngram_chars)

precision_recall_fscore_support(test_y, y_pred, average='weighted')
#Precision / Recall / F1_score


(0.6673723776924843, 0.6593693840875674, 0.644568252463745, None)

In [38]:
#Confusion Matrix
confusion_matrix(test_y, y_pred)

array([[ 533,   54,   11,    1,    0,    0,    1],
       [  83, 1450,  176,  103,    4,    3,   11],
       [  32,  257, 1266,  225,    4,    0,    2],
       [  23,  297,  359,  611,    1,    3,    2],
       [   7,  109,   19,   12,   21,    0,    6],
       [   4,   55,    8,    7,    0,   33,    1],
       [  17,  127,   38,   22,    0,    1,  122]])
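The weighted precision, recall, and F1 reported above can be reproduced from this confusion matrix alone: compute each metric per class, then average with weights equal to each class's number of true examples. The following plain-Python sketch does that for the matrix above. The function name `weighted_prf` is an assumption, and the code is meant as a check on the arithmetic rather than a replacement for `precision_recall_fscore_support`.

```python
def weighted_prf(matrix):
    """Support-weighted precision, recall, and F1 from a confusion matrix (rows = true)."""
    n = len(matrix)
    support = [sum(row) for row in matrix]                              # true count per class
    predicted = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted count per class
    total = sum(support)

    precision = recall = f1 = 0.0
    for k in range(n):
        tp = matrix[k][k]
        p_k = tp / predicted[k] if predicted[k] else 0.0
        r_k = tp / support[k] if support[k] else 0.0
        denom = support[k] + predicted[k]
        f_k = 2 * tp / denom if denom else 0.0   # equals 2*p_k*r_k / (p_k + r_k)
        weight = support[k] / total
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return precision, recall, f1

# Logistic Regression (character-level TF-IDF) confusion matrix, copied from above.
lr_matrix = [
    [533, 54, 11, 1, 0, 0, 1],
    [83, 1450, 176, 103, 4, 3, 11],
    [32, 257, 1266, 225, 4, 0, 2],
    [23, 297, 359, 611, 1, 3, 2],
    [7, 109, 19, 12, 21, 0, 6],
    [4, 55, 8, 7, 0, 33, 1],
    [17, 127, 38, 22, 0, 1, 122],
]

precision, recall, f1 = weighted_prf(lr_matrix)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Support-weighted recall is the same quantity as overall accuracy, which is why the second value of the reported tuple matches the test accuracy.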

In [34]:
# SVM on Ngram Level TF IDF Vectors
valid_accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
test_accuracy = test_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xtest_tfidf_ngram)
print("SVM, N-Gram Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

SVM, N-Gram Vectors
Validation Accuracy:  0.4738210399032648
Test Accuracy:  0.43293579480477046


In [35]:
# RF on Count Vectors
valid_accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
test_accuracy = test_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xtest_count)
print("RF, Count Vectors")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

# RF on Word Level TF IDF Vectors
valid_accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
test_accuracy = test_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xtest_tfidf)
print( "RF, WordLevel TF-IDF")
print("Validation Accuracy: ", valid_accuracy)
print("Test Accuracy: ", test_accuracy)

RF, Count Vectors
Validation Accuracy:  0.48137847642079806
Test Accuracy:  0.42297010292435877
RF, WordLevel TF-IDF
Validation Accuracy:  0.5992744860943168
Test Accuracy:  0.5252409736971083


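Across all results, Logistic Regression and Naive Bayes with the character-level TF-IDF vectorizer achieved the two highest test accuracies (0.659 and 0.584), with weighted F1-scores of 0.645 and 0.543, so these were selected as the top two models.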
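To make the comparison explicit, the sketch below collects the test accuracies reported in the outputs above and sorts them. The numbers are copied from this notebook and rounded to four decimals; the dictionary layout is just one convenient way to tabulate them.

```python
# Test accuracies reported above, keyed by (model, features), rounded to 4 decimals.
test_accuracy = {
    ("NB", "count"): 0.5707,
    ("NB", "word tf-idf"): 0.5618,
    ("NB", "n-gram tf-idf"): 0.4378,
    ("NB", "char tf-idf"): 0.5844,
    ("LR", "count"): 0.5705,
    ("LR", "word tf-idf"): 0.5713,
    ("LR", "n-gram tf-idf"): 0.4383,
    ("LR", "char tf-idf"): 0.6594,
    ("SVM", "n-gram tf-idf"): 0.4329,
    ("RF", "count"): 0.4230,
    ("RF", "word tf-idf"): 0.5252,
}

# Sort from highest to lowest test accuracy.
ranked = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
for (model, features), acc in ranked:
    print(f"{model:<4}{features:<15}{acc:.4f}")
```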