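# Sentence Type Classification

 - Dacon competition: [Sentence Type Classification AI Competition](https://dacon.io/competitions/official/236037/overview/description)
 1. Load the data
 2. Preprocess and augment the data
     - Remove special characters
     - Random swap [(reference: catSirup's KorEDA)](https://github.com/catSirup/KorEDA)
 3. Build the dataset
 4. Define and train the model
     - Uses KR-Medium [(Hugging Face link)](https://huggingface.co/snunlp/KR-Medium)
     - Uses Focal Loss and Asymmetric Loss to counter the imbalanced label distribution
     - Trains on 5 KFold splits
     - Predicts the final result by soft voting over the 5 models
 5. Predict and submit
     - Final public score: 0.7352 (78th place)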

### 1. Load the data

In [1]:
import re
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from tqdm import tqdm
from transformers import TFAutoModel, AutoTokenizer, BertTokenizer
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

import warnings
warnings.filterwarnings(action='ignore')

seed = 52
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)

In [3]:
file_path = '/kaggle/input/sentence/'
file_name = 'train.csv'
df = pd.read_csv(file_path + file_name)
df_test = pd.read_csv(file_path + 'test.csv')
df.head()

Unnamed: 0,ID,문장,유형,극성,시제,확실성,label
0,TRAIN_00000,0.75%포인트 금리 인상은 1994년 이후 28년 만에 처음이다.,사실형,긍정,현재,확실,사실형-긍정-현재-확실
1,TRAIN_00001,이어 ＂앞으로 전문가들과 함께 4주 단위로 상황을 재평가할 예정＂이라며 ＂그 이전이...,사실형,긍정,과거,확실,사실형-긍정-과거-확실
2,TRAIN_00002,정부가 고유가 대응을 위해 7월부터 연말까지 유류세 인하 폭을 30%에서 37%까지...,사실형,긍정,미래,확실,사실형-긍정-미래-확실
3,TRAIN_00003,"서울시는 올해 3월 즉시 견인 유예시간 60분을 제공하겠다고 밝혔지만, 하루 만에 ...",사실형,긍정,과거,확실,사실형-긍정-과거-확실
4,TRAIN_00004,익사한 자는 사다리에 태워 거꾸로 놓고 소금으로 코를 막아 가득 채운다.,사실형,긍정,현재,확실,사실형-긍정-현재-확실


### 2. Preprocess and augment the data

In [4]:
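# Simple preprocessing: keep only Hangul, English letters, digits and whitespace
# (raw strings avoid the invalid-escape warning for \s)
df['문장'] = df['문장'].apply(lambda x: re.sub(r'[^A-Za-z0-9가-힣\s]', '', x))
df_test['문장'] = df_test['문장'].apply(lambda x: re.sub(r'[^A-Za-z0-9가-힣\s]', '', x))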
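
To make the effect of this filter concrete, here is a minimal standalone sketch that applies the same character class to the first training sentence. The helper name `clean_sentence` is hypothetical and only used for illustration.

```python
import re

# Same character class as the cell above: keep ASCII letters, digits,
# Hangul syllables and whitespace; everything else is dropped.
KEEP_PATTERN = re.compile(r'[^A-Za-z0-9가-힣\s]')

def clean_sentence(text: str) -> str:
    """Remove every character outside the allowed set (hypothetical helper)."""
    return KEEP_PATTERN.sub('', text)

sample = '0.75%포인트 금리 인상은 1994년 이후 28년 만에 처음이다.'
cleaned = clean_sentence(sample)
print(cleaned)
```

Note that the decimal point and the percent sign are removed too, so `0.75%` becomes `075`. This is why the augmented sentences further down start with `075포인트`.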

In [5]:
# Easy Data Augmentation
# Reference: https://github.com/catSirup/KorEDA/blob/master/eda.py
import random

def random_swap(words, n=3):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)

    return new_words

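def swap_word(new_words):
    # Nothing to swap for empty or single-word input (randint(0, -1) would raise)
    if len(new_words) < 2:
        return new_words
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1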
    counter = 0

    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        if counter > 3:
            return new_words

    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1]
    return new_words


def augment_data(sentence, alpha_rs=0.1, num_aug=9):
    words = sentence.split(' ')
    # Compare by value: `is not ""` tests identity and triggers a SyntaxWarning
    words = [word for word in words if word != ""]
    num_words = len(words)

    augmented_sentences = []
    num_new_per_technique = int(num_aug/4) + 1

    n_rs = max(1, int(alpha_rs*num_words))

    # Random swap (RS): generate num_new_per_technique swapped variants
    for _ in range(num_new_per_technique):
        a_words = random_swap(words, n_rs)
        augmented_sentences.append(" ".join(a_words))

    random.shuffle(augmented_sentences)

    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
        
    return augmented_sentences

In [6]:
augmentation = df['문장'].apply(lambda x:augment_data(x))
augmentation[0]

['075포인트 금리 인상은 1994년 이후 처음이다 만에 28년',
 '만에 금리 인상은 1994년 이후 28년 075포인트 처음이다',
 '이후 금리 인상은 1994년 075포인트 28년 만에 처음이다']
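
The outputs above keep the same words and only change their order. The sketch below illustrates that property with a simplified random swap. It is a variation for illustration, not the KorEDA code: it draws two distinct indices with `random.Random.sample` instead of the retry loop used in `swap_word`, and the names `swap_once` and `random_swap_sketch` are hypothetical.

```python
import random

def swap_once(words, rng):
    """Swap two randomly chosen positions in place; no-op for fewer than 2 words."""
    if len(words) < 2:
        return words
    i, j = rng.sample(range(len(words)), 2)  # two distinct indices
    words[i], words[j] = words[j], words[i]
    return words

def random_swap_sketch(sentence, n_swaps=1, seed=0):
    """Return the sentence with n_swaps random word swaps applied."""
    rng = random.Random(seed)  # local RNG so the example is reproducible
    words = sentence.split()
    for _ in range(n_swaps):
        swap_once(words, rng)
    return " ".join(words)

original = '075포인트 금리 인상은 1994년 이후 28년 만에 처음이다'
augmented = random_swap_sketch(original, n_swaps=1, seed=0)
print(augmented)
```

Because the word multiset is unchanged, the labels of the original sentence are reused for each augmented copy, as the next cell does.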

In [7]:
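df_temp = df.copy()
# With the defaults, augment_data returns 3 variants per sentence (int(9/4) + 1)
for i in range(3):
    temp = df.copy()
    temp['문장'] = [aug[i] for aug in augmentation]
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    df_temp = pd.concat([df_temp, temp])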

In [8]:
train = df_temp.drop_duplicates(keep='first').sample(frac=1).reset_index(drop=True)
train

Unnamed: 0,ID,문장,유형,극성,시제,확실성,label
0,TRAIN_09966,무료로 즐기는 모바일 MMORPG라는 특성에 걸맞게 다른 요소들도 함께 즐기는 이용...,사실형,긍정,과거,확실,사실형-긍정-과거-확실
1,TRAIN_15274,오십견 없다 파열로 치료를 받고 있지만 차도가 회전근개,사실형,부정,현재,확실,사실형-부정-현재-확실
2,TRAIN_12061,기생충은 101년간의 한국영화 역사상 처음으로 시상식에서 아카데미 수상했다,사실형,긍정,과거,확실,사실형-긍정-과거-확실
3,TRAIN_06138,4위 첼시와 5위 맨체스터 유나이티드는 조직력이 예전만 못하지만 여전히 관록 있는 팀이다,추론형,긍정,현재,확실,추론형-긍정-현재-확실
4,TRAIN_13489,중산층 붕괴 원인 문제도 꼽힌 정보기술사회 도래에 따른 일자리 2위로 앞으로 우리 ...,추론형,긍정,현재,확실,추론형-긍정-현재-확실
...,...,...,...,...,...,...,...
64717,TRAIN_04247,두 여성과 커티스의 위치는 계단을 두고 나뉜다,사실형,긍정,현재,확실,사실형-긍정-현재-확실
64718,TRAIN_08223,애나 던런Anna Donlon 발로란트 책임 프로듀서는 발로란트는 정밀한 작전 박진...,사실형,긍정,현재,확실,사실형-긍정-현재-확실
64719,TRAIN_11100,모두가 제목이 대통령의 사람들All The Presidents Men이다,사실형,긍정,현재,확실,사실형-긍정-현재-확실
64720,TRAIN_09933,네이처셀 관계자는 수험생과 학생의 기억력 개선 꾸준한 건망증 및 치매 예방용 제품으...,사실형,긍정,과거,확실,사실형-긍정-과거-확실


### 3. Build the dataset

In [9]:
# Define the tokenizer
model_ckpt = 'snunlp/KR-Medium'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [10]:
# Build the dataset
MAX_LEN = 200

def bert_tokenizer(sent, max_len):
    encoded_dict = tokenizer.encode_plus(
        text = sent,
        add_special_tokens = True,      
        max_length = max_len,           
        padding = 'max_length',         # replaces the deprecated pad_to_max_length=True
        return_attention_mask = True,   
        truncation = True
    )
    
    input_id = encoded_dict['input_ids']
    attention_mask = encoded_dict['attention_mask']
    token_type_id = encoded_dict['token_type_ids']
    
    return input_id, attention_mask, token_type_id

# Build the model inputs (input ids, attention masks, token type ids)
def build_data(doc, max_len):
    x_ids = []
    x_msk = []
    x_typ = []

    for sent in tqdm(doc):
        input_id, attention_mask, token_type_id = bert_tokenizer(sent, max_len)
        x_ids.append(input_id)
        x_msk.append(attention_mask)
        x_typ.append(token_type_id)

    x_ids = np.array(x_ids, dtype=int)
    x_msk = np.array(x_msk, dtype=int)
    x_typ = np.array(x_typ, dtype=int)

    return x_ids, x_msk, x_typ
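
As a rough picture of what fixed-length padding and truncation produce, the following sketch pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is a conceptual illustration under stated assumptions, not the tokenizer's implementation: the token ids and `pad_id=0` are made up, and a real BERT tokenizer also keeps the closing special token when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids)[:max_len]      # truncation: drop tokens past max_len
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_len - len(ids)           # 0 when the input was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Hypothetical token ids for a short sentence
ids, mask = pad_or_truncate([2, 15, 27, 3], max_len=6)
print(ids)
print(mask)
```

Every row then has the same length, which is what lets `build_data` stack the results into rectangular NumPy arrays.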

In [13]:
# Build the labels
type_ohe = OneHotEncoder().fit(train['유형'].values.reshape(-1, 1))
polar_ohe = OneHotEncoder().fit(train['극성'].values.reshape(-1, 1))
tense_ohe = OneHotEncoder().fit(train['시제'].values.reshape(-1, 1))
certain_ohe = OneHotEncoder().fit(train['확실성'].values.reshape(-1, 1))

def make_label(doc):
    types = type_ohe.transform(doc['유형'].values.reshape(-1, 1)).toarray()
    polars = polar_ohe.transform(doc['극성'].values.reshape(-1, 1)).toarray()
    tenses = tense_ohe.transform(doc['시제'].values.reshape(-1, 1)).toarray()
    certains = certain_ohe.transform(doc['확실성'].values.reshape(-1, 1)).toarray()
    
    return [types, polars, tenses, certains]
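
The sketch below mimics, in plain NumPy, how a one-hot encoder with default settings assigns columns: the categories are the sorted unique values, and decoding takes the argmax of each row. It is an illustration under that assumption rather than scikit-learn's code, and the helper names `fit_categories`, `one_hot` and `decode` are hypothetical.

```python
import numpy as np

def fit_categories(values):
    """Sorted unique values; these define the column order of the encoding."""
    return np.unique(values)

def one_hot(values, categories):
    """Encode each value as a row with a single 1 in its category's column."""
    # searchsorted returns each value's column index because categories is sorted
    idx = np.searchsorted(categories, values)
    return np.eye(len(categories))[idx]

def decode(scores, categories):
    """Map rows of scores (one-hot or probabilities) back to categories via argmax."""
    return categories[np.argmax(scores, axis=1)]

tenses = np.array(['현재', '과거', '미래', '과거'])
cats = fit_categories(tenses)
encoded = one_hot(tenses, cats)
print(cats)
print(encoded)
print(decode(encoded, cats))
```

Since the column order comes from sorting the label strings, checking the fitted encoder's `categories_` attribute is safer than assuming a hand-written index-to-label mapping.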

### 4. Define and train the model

In [11]:
bert = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)
bert.trainable = True

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'bert.embeddings.position_ids', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the 

In [12]:
# Define the custom model: a shared BERT encoder with one softmax head per label
class CustomModel(tf.keras.Model):
    def __init__(self):
        super(CustomModel, self).__init__()
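        # Load a fresh encoder per instance so the K-fold models do not share fine-tuned weights
        self.bert_layer = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)
        self.bert_layer.trainable = True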
        self.type_out = tf.keras.Sequential(
            [Dropout(0.2),
             Dense(256, activation='relu'),
             Dense(4, activation='softmax')]
        )
        self.polar_out = tf.keras.Sequential(
            [Dropout(0.2),
             Dense(256, activation='relu'),
             Dense(3, activation='softmax')]
        )
        self.tense_out = tf.keras.Sequential(
            [Dropout(0.2),
             Dense(256, activation='relu'),
             Dense(3, activation='softmax')]
        )
        self.certain_out = tf.keras.Sequential(
            [Dropout(0.2),
             Dense(256, activation='relu'),
             Dense(2, activation='softmax')]
        )
        
    def call(self, inputs):
        bert_output = self.bert_layer(inputs)[1]
        type_output = self.type_out(bert_output)
        polar_output = self.polar_out(bert_output)
        tense_output = self.tense_out(bert_output)
        certain_output = self.certain_out(bert_output)
        
        # output shape: [[1, 0, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0]]
        return type_output, polar_output, tense_output, certain_output


def focal_loss(gamma=2., alpha=0.25):
    def focal_loss_fixed(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-8, 1-1e-8)
        pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
        pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
        return -tf.reduce_sum(alpha * tf.pow(1. - pt_1, gamma) * tf.math.log(pt_1)) \
               -tf.reduce_sum((1 - alpha) * tf.pow(pt_0, gamma) * tf.math.log(1. - pt_0))
    return focal_loss_fixed


def asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1):
    def asymmetric_loss_fixed(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-8, 1 - 1e-8)
        p_t = tf.where(tf.equal(y_true, 1), y_pred, 1 - y_pred)
        theta_t = tf.where(tf.equal(y_true, 1), theta * tf.ones_like(y_true), (1 - theta) * tf.ones_like(y_true))
        gamma_t = tf.where(tf.equal(y_true, 1), gamma_pos * tf.ones_like(y_true), gamma_neg * tf.ones_like(y_true))
        return -tf.reduce_sum(theta_t * tf.pow(1. - p_t, gamma_t) * tf.math.log(p_t))
    return asymmetric_loss_fixed


# model = CustomModel()

# Using Focal Loss
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#               loss=[focal_loss(gamma=2., alpha=0.25), 
#                     focal_loss(gamma=2., alpha=0.25), 
#                     focal_loss(gamma=2., alpha=0.25), 
#                     focal_loss(gamma=2., alpha=0.25)])

# Using Asymmetric Loss
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#               loss=[asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
#                     asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
#                     asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
#                     asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1)])
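
To see how the two losses above treat easy and hard examples, here is a NumPy mirror of both functions evaluated on a confident and an uncertain prediction. It is a sketch for inspection only: the `_np` function names and the toy probabilities are assumptions, while the formulas follow the TensorFlow definitions in this cell.

```python
import numpy as np

def focal_loss_np(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-8):
    """NumPy mirror of focal_loss above, summed over all entries."""
    p = np.clip(y_pred, eps, 1 - eps)
    pos = -alpha * (1 - p) ** gamma * np.log(p)        # entries where y_true == 1
    neg = -(1 - alpha) * p ** gamma * np.log(1 - p)    # entries where y_true == 0
    return float(np.sum(np.where(y_true == 1, pos, neg)))

def asymmetric_loss_np(y_true, y_pred, theta=0.5, gamma_neg=4, gamma_pos=1, eps=1e-8):
    """NumPy mirror of asymmetric_loss above, summed over all entries."""
    p = np.clip(y_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)              # probability of the true state
    theta_t = np.where(y_true == 1, theta, 1 - theta)
    gamma_t = np.where(y_true == 1, gamma_pos, gamma_neg)
    return float(np.sum(-theta_t * (1 - p_t) ** gamma_t * np.log(p_t)))

y_true = np.array([[1, 0, 0]])
easy = np.array([[0.9, 0.05, 0.05]])   # confident and correct
hard = np.array([[0.4, 0.3, 0.3]])     # uncertain

for name, fn in [('focal', focal_loss_np), ('asymmetric', asymmetric_loss_np)]:
    print(name, fn(y_true, easy), fn(y_true, hard))
```

Both losses shrink the contribution of the confident example much more than plain cross-entropy would, which is the mechanism intended to keep the frequent, easy classes from dominating training.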

In [14]:
# Train 5 models with 5-fold KFold; the final prediction comes from soft voting over them
X = train
kf = KFold(n_splits=5, shuffle=True, random_state=seed)

for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f'----- Fold {i} -----')
    train_data, val_data = train.iloc[train_idx, :], train.iloc[val_idx, :]
    x_train, y_train = build_data(train_data['문장'], MAX_LEN), make_label(train_data)
    x_val, y_val = build_data(val_data['문장'], MAX_LEN), make_label(val_data)
    
    model = CustomModel()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=[asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
                    asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
                    asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1), 
                    asymmetric_loss(theta=0.5, gamma_neg=4, gamma_pos=1)])
    
    early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='min', restore_best_weights=True)
    checkpoint_filepath = '/kaggle/working/best_model' + str(i)
    model_checkpoint = ModelCheckpoint(checkpoint_filepath, monitor='val_loss', save_best_only=True, verbose=1, mode='min', save_weights_only=True)
    
    # validation_data is documented as a tuple (x_val, y_val)
    history = model.fit(x_train, y_train, epochs=100, batch_size=32, validation_data=(x_val, y_val), callbacks=[early_stopping, model_checkpoint])
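
The splitting step can be pictured with the small NumPy sketch below, which shuffles the row indices and holds out one of five chunks per fold. It only illustrates the idea: the function name `kfold_indices` is hypothetical, and the exact index assignment differs from what scikit-learn's `KFold` produces for the same seed.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=52):
    """Yield (train_idx, val_idx) pairs built from one shuffled index array."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        val_idx = folds[k]  # fold k is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=23, n_splits=5))
for k, (train_idx, val_idx) in enumerate(splits):
    print(k, len(train_idx), len(val_idx))
```

Each row is used for validation exactly once, so the five models see different training subsets, which is what makes combining them at prediction time worthwhile.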

### 5. Predict and submit

In [15]:
test_doc = df_test['문장']
x_test_ids, x_test_msk, x_test_typ = build_data(test_doc, MAX_LEN)
x_test = [x_test_ids, x_test_msk, x_test_typ]

# ans_type = {0: '사실형', 1: '추론형', 2: '대화형', 3: '예측형'}
# ans_polar = {0: '긍정', 1: '부정', 2: '미정'}
# ans_tense = {0: '과거', 1: '현재', 2: '미래'}
# ans_certain = {0: '확실', 1: '불확실'}

100%|████████████████████████████████████████████████████████████████████████████| 7090/7090 [00:01<00:00, 5651.19it/s]


In [None]:
predictions = []
trained_model = CustomModel()
for i in range(5):
    save_model_path = f'/kaggle/input/sentence-clf-model/best_model{i}'
    trained_model.load_weights(save_model_path)
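    # predict() runs in batches; calling the model on all test rows at once can exhaust memory
    prediction = trained_model.predict(x_test, batch_size=32)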
    predictions.append(prediction)

In [33]:
results = []
# Perform soft voting: sum each head's class probabilities over the 5 models
for pred in zip(*predictions):
    results.append(np.sum(pred, axis=0))
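
As a toy illustration of soft voting for a single output head, the sketch below sums class probabilities over models and takes the argmax per sample. The probabilities are invented for the example and `soft_vote` is a hypothetical helper name.

```python
import numpy as np

def soft_vote(per_model_probs):
    """Sum class probabilities over models (axis 0), then pick the best class per sample."""
    summed = np.sum(per_model_probs, axis=0)   # shape: (n_samples, n_classes)
    return summed, np.argmax(summed, axis=1)

# Three hypothetical models, two samples, three classes
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]],
])
summed, winners = soft_vote(probs)
print(summed)
print(winners)
```

Summing and averaging give the same argmax, so the notebook's `np.sum` needs no division by the number of models.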

In [34]:
submit = []
for pred in zip(*results):
    tp = type_ohe.inverse_transform(pred[0].reshape(1, -1))
    pl = polar_ohe.inverse_transform(pred[1].reshape(1, -1))
    tn = tense_ohe.inverse_transform(pred[2].reshape(1, -1))
    ct = certain_ohe.inverse_transform(pred[3].reshape(1, -1))
    
    tp = tp[0][0]
    pl = pl[0][0]
    tn = tn[0][0]
    ct = ct[0][0]
    
    submit.append(f"{tp}-{pl}-{tn}-{ct}")

submit[:5]

# Copy so that adding the label column does not modify df_test
df_submit = df_test.copy()
df_submit['label'] = submit
df_submit = df_submit.drop(['문장'], axis=1)
df_submit.head()

Unnamed: 0,ID,label
0,TEST_0000,사실형-긍정-현재-확실
1,TEST_0001,사실형-긍정-현재-확실
2,TEST_0002,사실형-긍정-과거-확실
3,TEST_0003,사실형-긍정-현재-확실
4,TEST_0004,사실형-긍정-과거-확실


In [35]:
df_submit.to_csv('submit_KFold_Final.csv', index=False)