<a href="https://colab.research.google.com/github/kwanglo/mge51101-20195171/blob/master/final_project/01_Data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing for preprocessing

First, we link our Colab notebook to Google Drive. <br>
Then we install and import the necessary libraries. <br>
Finally, before processing any data, we set the dataset path.

In [31]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


In [32]:
!pip3 install konlpy
!pip3 install soynlp



In [33]:
import os
import re

from sklearn import datasets, model_selection

import pandas as pd
import numpy as np

In [34]:
path='/gdrive/My Drive/Colab Notebooks/Final Project/dataset/'

# Data Preprocessing - Multi-sentiment

The sentiment and utterance datasets go through the same preprocessing. <br>
We first check for unnecessary columns and drop any we find.<br>
Then we check the balance between labels and split each label's rows separately.<br>

In [35]:
df = pd.read_excel(path+"한국어_단발성_대화_데이터셋.xlsx")

In [36]:
df.head(2)

Unnamed: 0,Sentence,Emotion,Unnamed: 2,Unnamed: 3,Unnamed: 4,공포,5468
0,언니 동생으로 부르는게 맞는 일인가요..??,공포,,,,놀람,5898.0
1,그냥 내 느낌일뿐겠지?,공포,,,,분노,5665.0


In [37]:
df = df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4','공포',5468])
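An alternative to listing every stray column is to keep only the two columns we actually need. The sketch below uses a small toy frame (`toy` is made up for illustration and only mimics the layout shown by `df.head(2)`), so treat it as one possible approach rather than the cell we ran.

```python
import pandas as pd

# Toy frame mimicking the spreadsheet layout: two useful columns plus leftovers
toy = pd.DataFrame({
    "Sentence": ["언니 동생으로 부르는게 맞는 일인가요..??", "그냥 내 느낌일뿐겠지?"],
    "Emotion": ["공포", "공포"],
    "Unnamed: 2": [None, None],
    "공포": ["놀람", "분노"],
    5468: [5898.0, 5665.0],
})

# Select the needed columns instead of naming every column to drop
clean = toy[["Sentence", "Emotion"]].copy()
print(clean.columns.tolist())
```

Selecting columns by name keeps working even if the spreadsheet gains more leftover columns later.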

In [38]:
df.groupby("Emotion").count()

Emotion,Sentence
공포,5468
놀람,5898
분노,5665
슬픔,5267
중립,4830
행복,6037
혐오,5429


In [39]:
# Map each emotion label (string) to an integer class id
df.loc[(df.Emotion == "중립"),"Emotion"] = 0  # neutral
df.loc[(df.Emotion == "공포"),"Emotion"] = 1  # fear
df.loc[(df.Emotion == "놀람"),"Emotion"] = 2  # surprise
df.loc[(df.Emotion == "분노"),"Emotion"] = 3  # anger
df.loc[(df.Emotion == "슬픔"),"Emotion"] = 4  # sadness
df.loc[(df.Emotion == "행복"),"Emotion"] = 5  # happiness
df.loc[(df.Emotion == "혐오"),"Emotion"] = 6  # disgust
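The same label-to-integer conversion can be written with a single dictionary and `Series.map`. This is a minimal sketch on a toy frame; `EMOTION_TO_ID` and `toy` are names introduced here for illustration, with the ids copied from the cell above.

```python
import pandas as pd

# Same label ids as the cell above, collected in one place
EMOTION_TO_ID = {"중립": 0, "공포": 1, "놀람": 2, "분노": 3,
                 "슬픔": 4, "행복": 5, "혐오": 6}

toy = pd.DataFrame({"Sentence": ["a", "b", "c"],
                    "Emotion": ["공포", "중립", "행복"]})

# map() returns NaN for any label missing from the dictionary,
# so counting NaN values is a quick check for unexpected labels
toy["Emotion"] = toy["Emotion"].map(EMOTION_TO_ID)
unmapped = int(toy["Emotion"].isna().sum())
print(toy["Emotion"].tolist(), unmapped)
```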

In [40]:
data = df

In [41]:
# Split the data by numeric label
# Neutral
Neutral = data[data["Emotion"] == 0]
# Fear
Fear = data[data["Emotion"] == 1]
# Surprise
Surprise = data[data["Emotion"] == 2]
# Anger
Anger = data[data["Emotion"] == 3]
# Sadness
Sad = data[data["Emotion"] == 4]
# Happiness
Happy = data[data["Emotion"] == 5]
# Disgust
Disgust = data[data["Emotion"] == 6]

In [42]:
rnd_num = 2020
Fear_train = Fear.sample(frac=0.7, random_state=rnd_num)
Fear_test = Fear.drop(Fear_train.index)

Surprise_train = Surprise.sample(frac=0.7, random_state=rnd_num)
Surprise_test = Surprise.drop(Surprise_train.index)

Anger_train = Anger.sample(frac=0.7, random_state=rnd_num)
Anger_test = Anger.drop(Anger_train.index)

Sad_train = Sad.sample(frac=0.7, random_state=rnd_num)
Sad_test = Sad.drop(Sad_train.index)

Neutral_train = Neutral.sample(frac=0.7, random_state=rnd_num)
Neutral_test = Neutral.drop(Neutral_train.index)

Happy_train = Happy.sample(frac=0.7, random_state=rnd_num)
Happy_test = Happy.drop(Happy_train.index)

Disgust_train = Disgust.sample(frac=0.7, random_state=rnd_num)
Disgust_test = Disgust.drop(Disgust_train.index)

train = pd.concat([Fear_train,Surprise_train,Anger_train,Sad_train,Neutral_train,Happy_train,Disgust_train])
test = pd.concat([Fear_test,Surprise_test,Anger_test,Sad_test,Neutral_test,Happy_test,Disgust_test])
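The per-label 70/30 split above can also be sketched more compactly with `groupby(...).sample(...)`, which draws the same fraction from every label. The toy frame below is made up for illustration, and this is an alternative formulation under the assumption of pandas 1.1 or newer, not the cell we ran.

```python
import pandas as pd

# Toy data: 3 labels with 10 rows each
toy = pd.DataFrame({
    "Sentence": [f"sentence {i}" for i in range(30)],
    "Emotion": [i % 3 for i in range(30)],
})

# Sample 70% of the rows within each label, then use the remainder as test
toy_train = toy.groupby("Emotion").sample(frac=0.7, random_state=2020)
toy_test = toy.drop(toy_train.index)

print(toy_train.groupby("Emotion").size().tolist())
print(toy_test.groupby("Emotion").size().tolist())
```

Because the test rows are whatever is left after dropping the sampled index, the two sets cannot overlap.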

In [43]:
train.groupby("Emotion").count()

Emotion,Sentence
0,3381
1,3828
2,4129
3,3965
4,3687
5,4226
6,3800


# Preprocessing part 2

In [44]:
from soynlp.normalizer import only_hangle, repeat_normalize
from konlpy.tag import Okt

# Create the Okt tagger once instead of rebuilding it on every call
okt = Okt()

def tokenizer(text):
    """Keep Hangul only, collapse repeated characters, then return stemmed morphemes."""
    text = only_hangle(text)
    text = repeat_normalize(text, num_repeats=2)
    return okt.morphs(text, stem=True)
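To make the two normalization steps in the tokenizer easier to follow, here is a rough standard-library approximation. `keep_hangul` and `collapse_repeats` are hypothetical helpers written for this sketch; soynlp's `only_hangle` and `repeat_normalize` may differ in detail, and the morpheme analysis done by Okt is not reproduced here.

```python
import re

def keep_hangul(text):
    # Replace every run of non-Hangul characters with a single space
    return re.sub(r"[^ㄱ-ㅎㅏ-ㅣ가-힣]+", " ", text).strip()

def collapse_repeats(text, num_repeats=2):
    # Shorten any character repeated 3+ times to num_repeats copies
    return re.sub(r"(\S)\1{2,}", lambda m: m.group(1) * num_repeats, text)

raw = "와ㅋㅋㅋㅋㅋ 진짜 좋아요!!! lol"
normalized = collapse_repeats(keep_hangul(raw))
print(normalized)
```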

In [45]:
stop_words_set = pd.read_csv(path+'stopwords100.txt',header = 0, delimiter = '\t', quoting = 3)
stop_words= (list(stop_words_set['aa']))
stop_words2 = ['은', '는', '이', '가', '하', '아', '것', '들','의', '있', '되', '수', '보', '주', '등', '한']
# Add the hand-picked particles and stems to the list loaded from the file
stop_words.extend(stop_words2)
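A stop-word list like the one above is used to discard uninformative tokens after tokenization. The sketch below shows that idea on a hand-written token list; `tokens` is made up for illustration and is not real Okt output, and only a subset of the stop words is included.

```python
# A few of the hand-picked stop words from the cell above
stop_words_demo = {'은', '는', '이', '가', '의'}

# Hand-written tokens standing in for tokenizer output
tokens = ['나', '는', '학교', '에', '가다']

# Keep only the tokens that are not stop words
filtered = [tok for tok in tokens if tok not in stop_words_demo]
print(filtered)
```

Using a set for the lookup keeps each membership check cheap even when the stop-word list grows.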

In [46]:
import torch
from torchtext import data
from torchtext import datasets
SEED = 3432

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


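# Text field: tokenize with our function, drop stop words, and also return sequence lengths
TEXT = data.Field(tokenize=tokenizer, stop_words = stop_words, include_lengths = True)
# Label field for the emotion class ids
LABEL = data.LabelField()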

In [50]:
from sklearn.model_selection import train_test_split

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    train['Sentence'],
    train['Emotion'],
    random_state=2020,
    test_size=0.3)
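Since the validation set is carved out of the 70% training pool, the overall proportions are roughly 49% train, 21% validation, and 30% test. The arithmetic below uses the per-label counts reported in the outputs above and assumes `train_test_split` rounds the held-out size up, which is our understanding of scikit-learn's behaviour.

```python
import math

# Per-label totals and training-pool counts from the groupby outputs above (labels 0..6)
total_counts = [4830, 5468, 5898, 5665, 5267, 6037, 5429]
train_pool_counts = [3381, 3828, 4129, 3965, 3687, 4226, 3800]

n_total = sum(total_counts)
n_pool = sum(train_pool_counts)
n_test = n_total - n_pool

# Assumption: the validation size is ceil(test_size * n_pool)
n_valid = math.ceil(0.3 * n_pool)
n_train = n_pool - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```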

In [51]:
train_set = pd.concat([train_inputs, train_labels], axis=1)
valid_set = pd.concat([validation_inputs, validation_labels], axis=1)
test_set = test

In [None]:
train_set.to_csv(path+'sentiment_train.csv',index=False, encoding='utf-8')
valid_set.to_csv(path+'sentiment_valid.csv',index=False, encoding='utf-8')
test_set.to_csv(path+'sentiment_test.csv',index=False, encoding='utf-8')

# Data Preprocessing - Utterance

For the utterance data, the data provider has already split the data into train and test sets.<br>
So we can load the train and test files separately.

In [52]:
df = pd.read_csv(path+"fci_data.csv")
train_set = pd.read_csv(path+"fci_train.csv")
test_set = pd.read_csv(path+"fci_test.csv")

In [53]:
train_set.head(2)

Unnamed: 0,label,text
0,0,만화
1,0,이치가


In [54]:
df.groupby("label").count()
#Fragments(FR) - 0
#Statements(S) - 1
#Questions(Q) - 2
#Commands(C) - 3
#Rhetorical questions(RQ) - 4
#Rhetorical commands(RC) - 5
#Intonation-dependent utterances(IU) - 6

label,text
0,6009
1,18300
2,17869
3,12968
4,1745
5,1087
6,3277
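Unlike the sentiment data, the utterance labels are far from balanced. A quick way to see this is to turn the counts above into shares; the sketch below hard-codes the counts from the output, with label abbreviations taken from the comments in the cell.

```python
# Counts copied from the groupby output above, keyed by label id
label_names = {0: "FR", 1: "S", 2: "Q", 3: "C", 4: "RQ", 5: "RC", 6: "IU"}
label_counts = {0: 6009, 1: 18300, 2: 17869, 3: 12968, 4: 1745, 5: 1087, 6: 3277}

n_rows = sum(label_counts.values())
shares = {label: count / n_rows for label, count in label_counts.items()}

largest = max(shares, key=shares.get)
smallest = min(shares, key=shares.get)

for label, share in shares.items():
    print(f"{label_names[label]:>2}: {share:.1%}")
```

With the numbers above, statements (S) are the largest class and rhetorical commands (RC) the smallest, which is worth keeping in mind when choosing evaluation metrics later.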


Both datasets are now ready. We can load them from other Colab notebooks as needed to build the training, validation, and test sets for our models.
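Loading a saved split in another notebook is a plain `pd.read_csv` call on the same path. The sketch below demonstrates the round trip in a temporary directory so it runs anywhere; the toy row and `tmp_dir` are stand-ins for the real files under the Drive `path`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Sentence": ["그냥 내 느낌일뿐겠지?"], "Emotion": [1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sentiment_train.csv")
    # Write without the index, as in the export cell above
    toy.to_csv(csv_path, index=False, encoding="utf-8")
    # Read it back the way another notebook would
    loaded = pd.read_csv(csv_path, encoding="utf-8")

print(loaded.head())
```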