2021 한국정보처리학회 춘계학술대회 발표 논문 

조희련_1, 임현열_2, 차준우_1, 이유미_1 (1_중앙대학교 인문콘텐츠연구소, 2_중앙대학교 다빈치교양대학)

"KoBERT, 나이브 베이즈, 로지스틱 회귀의 한국어 쓰기 답안지 점수 구간 예측 성능 비교"

나이브 베이즈(Naive Bayes: NB)와 로지스틱 회귀(Logistic Regression) 실험 코드입니다.

In [1]:
!pip install konlpy

Collecting konlpy
[?25l  Downloading https://files.pythonhosted.org/packages/85/0e/f385566fec837c0b83f216b2da65db9997b35dd675e107752005b7d392b1/konlpy-0.5.2-py2.py3-none-any.whl (19.4MB)
[K     |████████████████████████████████| 19.4MB 213kB/s 
Collecting JPype1>=0.7.0
[?25l  Downloading https://files.pythonhosted.org/packages/cd/a5/9781e2ef4ca92d09912c4794642c1653aea7607f473e156cf4d423a881a1/JPype1-1.2.1-cp37-cp37m-manylinux2010_x86_64.whl (457kB)
[K     |████████████████████████████████| 460kB 38.7MB/s 
Collecting beautifulsoup4==4.6.0
[?25l  Downloading https://files.pythonhosted.org/packages/9e/d4/10f46e5cfac773e22707237bfcd51bbffeaf0a576b0a847ec7ab15bd7ace/beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
[K     |████████████████████████████████| 92kB 12.2MB/s 
[?25hCollecting colorama
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Installing collected packages: JPype1, 

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from konlpy.tag import Komoran
from sklearn.metrics import accuracy_score

In [4]:
def get_file(file_name):
    with open(file_name) as f:
        data = pd.read_csv(f, delimiter="\t", quotechar='"')
    return data

def vectorize(train, val, test):
    parser = Komoran()

    temp_train = []
    for doc in train:
        temp_train.append(parser.morphs(doc))
    result_train = [' '.join(tokens) for tokens in temp_train]

    temp_val = []
    for doc in val:
        temp_val.append(parser.morphs(doc.replace("[[문단]] ","")))
    result_val = [' '.join(tokens) for tokens in temp_val]

    temp_test = []
    for doc in test:
        temp_test.append(parser.morphs(doc))
    result_test = [' '.join(tokens) for tokens in temp_test]

    vect = CountVectorizer()
    X_train = vect.fit_transform(result_train)
    X_val = vect.transform(result_val)
    X_test = vect.transform(result_test)

    return X_train, X_val, X_test

### 실험 데이터는 아래의 URL에서 다운로드 받을 수 있습니다.

http://aihumanities.org/ko/archive/data/?vid=1

압축해제 하신 후 colab의 sample_data 폴더에 올려 주세요.

이 때 폴더 구조와 파일의 위치(예시)는 다음과 같습니다.

`sample_data/job/train_0.txt `


In [None]:
# 코멘트 아웃하 나이브 베이즈와 로지스틱 회귀 실험 결과를 취득

#clf = MultinomialNB()
clf = LogisticRegression(random_state=0, max_iter=1000)

folders = ["job_plus_success"]

for folder in folders:
    print("======================")
    print("result_{}".format(folder))
    print("======================")
    avg_acc_train = []
    avg_acc_val = []
    avg_acc_test = []    
    for i in range(7):
        train_data_file = "sample_data/{}/train_{}.txt".format(folder, i)
        val_data_file = "sample_data/{}/val_{}.txt".format(folder, i)
        test_data_file = "sample_data/{}/test_{}.txt".format(folder, i)

        data_train = get_file(train_data_file)
        train_doc = data_train["document"].str.replace("[[문단]] ","")
        train_label = data_train["label"]

        data_val = get_file(val_data_file)
        val_doc = data_val["document"].str.replace("[[문단]] ","")
        val_label = data_val["label"]

        data_test = get_file(test_data_file)
        test_doc = data_test["document"].str.replace("[[문단]] ","")
        test_label = data_test["label"]
        X_train, X_val, X_test = vectorize(train_doc, val_doc, test_doc)


        clf.fit(X_train, train_label)
        pred_train = clf.predict(X_train)
        pred_val = clf.predict(X_val)
        pred_test = clf.predict(X_test)

        '''
        print("X_test", X_test.shape)
        print("y_test", len(test_label))
        print("X_val", X_val.shape)
        print("y_val", len(val_label))
        print("X_train", X_train.shape)
        print("y_train", len(train_label))
        '''

        acc_train = accuracy_score(pred_train, train_label)
        avg_acc_train.append(acc_train)

        acc_val = accuracy_score(pred_val, val_label)
        avg_acc_val.append(acc_val)

        acc_test = accuracy_score(pred_test, test_label)
        avg_acc_test.append(acc_test)

        print("acc_train:", round(acc_train, 5))
        print("acc_val:", round(acc_val, 5))
        print("acc_test:", round(acc_test, 5))
        print("-------------------")

    avg_train = sum(avg_acc_train) / len(avg_acc_train)
    avg_val = sum(avg_acc_val) / len(avg_acc_val)
    avg_test = sum(avg_acc_test) / len(avg_acc_test)

    print("AVG_TRAIN:", round(avg_train, 5))
    print("AVG_VAL:", round(avg_val, 5))
    print("AVG_TEST:", round(avg_test, 5))

### 로지스틱 회귀와 나이브 베이즈 모델 구축 시 사용된 특징 단어 확인 및 로지스틱 회귀에서의 각 클래스 별 특징 단어 상위 10위 표시

In [8]:
# https://medium.com/@cristhianboujon/how-to-list-the-most-common-words-from-text-corpus-using-scikit-learn-dad4d0cab41d

train_data_file = "sample_data/job_plus_econ/train_6.txt"
data_train = get_file(train_data_file)
train_doc = data_train["document"].str.replace("[[문단]] ","", regex=True)
train_label = data_train["label"]

parser = Komoran()

temp_train = []
for doc in train_doc:
    temp_train.append(parser.morphs(doc))
result_train = [' '.join(tokens) for tokens in temp_train]

vect = CountVectorizer()
X_train = vect.fit_transform(result_train)

In [9]:
sum_words = X_train.sum(axis=0)

In [10]:
sum_words

matrix([[ 3,  2,  1, ...,  1,  1, 30]], dtype=int64)

In [11]:
words_freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)

In [None]:
words_freq

In [13]:
len(words_freq)

1775

In [14]:
clf = LogisticRegression(random_state=0, max_iter=1000)

In [15]:
clf.fit(X_train, train_label)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
clf.classes_

array([0, 1, 2, 3])

In [22]:
weight = clf.coef_
weight

array([[-0.01874652,  0.02591283, -0.00650765, ..., -0.01036387,
         0.02606569, -0.12337049],
       [ 0.04782676, -0.00297888,  0.03004586, ..., -0.00844004,
        -0.00727872, -0.05334802],
       [ 0.00200671, -0.01274985, -0.01345546, ...,  0.03086479,
        -0.0139874 ,  0.26265692],
       [-0.03108694, -0.01018411, -0.01008275, ..., -0.01206088,
        -0.00479957, -0.0859384 ]])

In [18]:
import numpy as np

In [19]:
# https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array
# 레이블이 '3'인 경우
sel_weights = np.argsort(-weight[3])[:10]

In [20]:
vocab_idx = {y:x for x,y in vect.vocabulary_.items()}

In [21]:
for w in sel_weights:
    print(vocab_idx[w])

직성
가능
발전
아무리
에서
여유
조건
ㄴ다면
지만
그리고
