# **Mini Project 4: 1:1 Inquiry Type Classifier**
# Step 3: Text classification

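### Problem definition
> Classifying 1:1 inquiry content<br>
> 1. Analyze the inquiry content
> 2. Evaluate the performance of inquiry classification models

### Training data
> * 1:1 inquiry content data: train.csv

### Variables
> * text: inquiry content
> * label: inquiry type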

### References
> * Machine Learning
>> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
> * Deep Learning
>> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
>> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
>> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)
>> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)

## 1. 개발 환경 설정

### 1-1. 라이브러리 설치

In [None]:
# Install the required libraries first.
!pip install konlpy pandas seaborn wordcloud python-mecab-ko wget transformers

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting python-mecab-ko
  Downloading python_mecab_ko-1.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (573 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting JPype1>=0.7.0 (from konlpy)
  Downloading JPype1-1.4.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)

### 1-2. 라이브러리 import

In [None]:
import os

import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import wget
from IPython.display import display
from mecab import MeCab

In [None]:
# The runtime must be restarted after running this cell
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  fonts-nanum
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 10.3 MB of archives.
After this operation, 34.1 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-nanum all 20200506-1 [10.3 MB]
Fetched 10.3 MB in 0s (25.0 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package fonts-nanum.
(Reading database ... 120875 files and direc

### 1-3. Korean font setup

In [None]:
!sudo apt-get install -y fonts-nanum

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
fonts-nanum is already the newest version (20200506-1).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


In [None]:
FONT_PATH = '/usr/share/fonts/truetype/nanum/NanumGothic.ttf'
font_name = fm.FontProperties(fname=FONT_PATH, size=10).get_name()
print(font_name)
plt.rcParams['font.family']=font_name
assert plt.rcParams['font.family'] == [font_name], "The Korean font has not been set."

NanumGothic


### 1-4. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Load the preprocessed data
* Load the data that was preprocessed on days 1 and 2.
* Use scipy.sparse.load_npz for sparse data.
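
As a rough illustration of the sparse-file round trip mentioned above, the sketch below saves a small CSR matrix with `scipy.sparse.save_npz` and reads it back with `scipy.sparse.load_npz`. The toy matrix and the file name `X_tfidf_demo.npz` are assumptions made for this example and are not files from the project.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy stand-in for a TF-IDF matrix: mostly zeros, stored in CSR format.
dense = np.array([[0.0, 0.5, 0.0],
                  [0.7, 0.0, 0.0]])
x_sparse = scipy.sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    npz_path = os.path.join(tmp_dir, "X_tfidf_demo.npz")
    scipy.sparse.save_npz(npz_path, x_sparse)   # write the sparse matrix
    x_loaded = scipy.sparse.load_npz(npz_path)  # read it back

# The round trip should preserve both shape and values.
round_trip_ok = np.array_equal(x_loaded.toarray(), dense)
print(x_loaded.shape, round_trip_ok)
```

Dense arrays such as the sequence and label files can stay in `.npy` format and be read with `np.load`, as the cells below do.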

In [None]:
PATH = '/content/drive/MyDrive/content_classification'
def file_path(path):
    return os.path.join(PATH, path)

In [None]:
import numpy as np
import scipy.sparse

In [None]:
x_tr_tfidf = scipy.sparse.load_npz(file_path("X_tfidf_train.npz"))
x_val_tfidf = scipy.sparse.load_npz(file_path("X_tfidf_val.npz"))


In [None]:
X_mor_tr_seq = np.load(file_path("X_mor_sequence_train.npy"))
X_mor_val_seq = np.load(file_path("X_mor_sequence_val.npy"))

In [None]:
Y_tr = np.load(file_path("y_train.npy"))
Y_val = np.load(file_path("y_val.npy"))

TypeError: ignored

In [None]:
print("X_train tfidf shape:", x_tr_tfidf.shape)
print("X_val tfidf  shape:", x_val_tfidf.shape)


X_train tfidf shape: (2964, 10167)
X_val tfidf  shape: (742, 10167)


In [None]:
print("X_mor_train seq shape:", X_mor_tr_seq.shape)
print("X_mor_val seq shape:", X_mor_val_seq.shape)

X_mor_train seq shape: (2964, 400)
X_mor_val seq shape: (742, 400)


In [None]:
print("y_train shape:", Y_tr.shape)  # training labels
print("y_val shape:", Y_val.shape)   # validation labels

y_train shape: (2964,)
y_val shape: (742,)


## 3. Machine Learning(N-grams)
* Train three or more machine learning models on the N-gram preprocessed data and analyze their performance
> * [sklearn-tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

### 3-1. Model 1

In [None]:

from sklearn.metrics import confusion_matrix, classification_report

In [None]:

from lightgbm import LGBMClassifier

In [None]:
%%time
model = LGBMClassifier(random_state=2023)
model.fit(x_tr_tfidf, Y_tr)

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 32121
[LightGBM] [Info] Number of data points in the train set: 2964, number of used features: 953
[LightGBM] [Info] Start training from score -0.855428
[LightGBM] [Info] Start training from score -1.612479
[LightGBM] [Info] Start training from score -1.638187
[LightGBM] [Info] Start training from score -1.863068
[LightGBM] [Info] Start training from score -3.650490
CPU times: user 6.44 s, sys: 22 ms, total: 6.46 s
Wall time: 6.55 s


In [None]:
y_pred = model.predict(x_val_tfidf)
print(classification_report(Y_val, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       325
           1       0.76      0.76      0.76       141
           2       0.74      0.72      0.73       152
           3       0.83      0.79      0.81       101
           4       0.96      0.96      0.96        23

    accuracy                           0.80       742
   macro avg       0.82      0.82      0.82       742
weighted avg       0.80      0.80      0.80       742



### 3-2. Model 2

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(random_state=2023)
model.fit(x_tr_tfidf, Y_tr)

In [None]:
y_pred = model.predict(x_val_tfidf)
print(classification_report(Y_val, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.95      0.84       325
           1       0.88      0.70      0.78       141
           2       0.81      0.61      0.70       152
           3       0.84      0.78      0.81       101
           4       1.00      0.70      0.82        23

    accuracy                           0.80       742
   macro avg       0.86      0.75      0.79       742
weighted avg       0.81      0.80      0.80       742



### 3-3. Model 3
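
Model 3 is left blank in this notebook. One possible candidate, shown here only as a minimal sketch, is a linear model such as `LogisticRegression`, which is a common baseline for sparse TF-IDF features. The data below is synthetic (`make_split` and the `*_demo` names are hypothetical) so that the cell runs on its own. In the notebook, `x_tr_tfidf`, `Y_tr`, `x_val_tfidf`, and `Y_val` would presumably be passed instead, and the scores would differ.

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic stand-in for the TF-IDF features and labels (assumption for this demo).
rng = np.random.default_rng(2023)
n_classes, n_features = 3, 30

def make_split(n_samples):
    """Build a small sparse feature matrix where each class has its own feature block."""
    y = rng.integers(0, n_classes, size=n_samples)
    X = rng.random((n_samples, n_features)) * 0.1
    for i, label in enumerate(y):
        X[i, label * 10:(label + 1) * 10] += 1.0
    return scipy.sparse.csr_matrix(X), y

X_demo_tr, y_demo_tr = make_split(120)
X_demo_val, y_demo_val = make_split(40)

# max_iter is raised because the default can be too low for convergence.
model_lr = LogisticRegression(max_iter=1000, random_state=2023)
model_lr.fit(X_demo_tr, y_demo_tr)

y_demo_pred = model_lr.predict(X_demo_val)
print(classification_report(y_demo_val, y_demo_pred))
```

Printing the same `classification_report` keeps the result directly comparable with Model 1 and Model 2.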

### 3-4. Hyperparameter Tuning(Optional)
* Manual Search, Grid search, Bayesian Optimization, TPE...
> * [grid search tutorial sklearn](https://scikit-learn.org/stable/modules/grid_search.html)
> * [optuna tutorial](https://optuna.org/#code_examples)
> * [ray-tune tutorial](https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html)
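
As one way to approach the grid search option listed above, the sketch below runs `GridSearchCV` over a small `RandomForestClassifier` grid. The synthetic data, the grid values, and the choice of `f1_macro` as the scoring metric are all assumptions for illustration; a real search would use `x_tr_tfidf` and `Y_tr` and likely a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data (assumption): feature 0 carries the class signal.
rng = np.random.default_rng(2023)
y_demo = rng.integers(0, 2, size=80)
X_demo = rng.random((80, 8))
X_demo[:, 0] += y_demo

# Hypothetical grid; every combination is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=2023),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the data passed to `fit`, which can then be scored on the validation split.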

## 4. Deep Learning(Sequence)
* Train three or more deep learning models (DNN, 1-D CNN, LSTM, etc.) on the sequence-preprocessed data and analyze their performance
> * [Google Tutorial](https://developers.google.com/machine-learning/guides/text-classification)
> * [Tensorflow Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
> * [Keras-tutorial](https://keras.io/examples/nlp/text_classification_from_scratch/)

In [None]:
import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Input, Dense, Flatten, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

### 4-1. DNN

In [None]:
# Clear the session
K.clear_session()

# Build the model
il = Input(shape=(400, ))

hl = Dense(64, activation='swish')(il)
hl = Dense(128, activation='swish')(hl)
hl = Dense(256, activation='swish')(hl)
ol = Dense(5, activation='softmax')(hl)

# Declare the model
model = Model(il, ol)

# Compile
model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

# Print the summary
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 400)]             0         
                                                                 
 dense (Dense)               (None, 64)                25664     
                                                                 
 dense_1 (Dense)             (None, 128)               8320      
                                                                 
 dense_2 (Dense)             (None, 256)               33024     
                                                                 
 dense_3 (Dense)             (None, 5)                 1285      
                                                                 
Total params: 68293 (266.77 KB)
Trainable params: 68293 (266.77 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:

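# Stop early when val_loss stops improving and restore the best weights
es = EarlyStopping(patience=20, restore_best_weights=True, verbose=1)
# Halve the learning rate when val_loss plateaus
lr_reduction = ReduceLROnPlateau(factor=0.5, patience=5, min_lr=0.000001)
# Train on the sequence arrays loaded in section 2
history = model.fit(X_mor_tr_seq, Y_tr, validation_data=(X_mor_val_seq, Y_val),
                    epochs=1000, callbacks=[es, lr_reduction])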

### 4-2. 1-D CNN
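
This section is empty in the notebook. A minimal sketch of a 1-D CNN for the same input shape might look like the following. The sequence length (400) and number of classes (5) come from the earlier cells, while `VOCAB_SIZE`, the embedding size, and the filter settings are assumptions that would need to match the actual tokenizer and be tuned.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalMaxPooling1D, Input)
from tensorflow.keras.models import Model

MAX_LEN = 400       # sequence length of the loaded (N, 400) arrays
NUM_CLASSES = 5     # labels 0-4 in the classification reports
VOCAB_SIZE = 10000  # assumption: must be larger than the highest token index

il = Input(shape=(MAX_LEN,))
# Map each token index to a dense vector before convolving over the sequence.
hl = Embedding(input_dim=VOCAB_SIZE, output_dim=64)(il)
hl = Conv1D(filters=128, kernel_size=5, activation='relu')(hl)
# Keep the strongest response of each filter across all positions.
hl = GlobalMaxPooling1D()(hl)
hl = Dropout(0.3)(hl)
ol = Dense(NUM_CLASSES, activation='softmax')(hl)

cnn_model = Model(il, ol)
cnn_model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
cnn_model.summary()
```

The model could then be trained with the same `fit` call and callbacks used for the DNN.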

### 4-3. LSTM
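
This section is also empty in the notebook. The sketch below shows one possible LSTM classifier under the same assumptions as the other models: sequences of length 400 and five classes, taken from the earlier cells. `VOCAB_SIZE` and the layer sizes are hypothetical, and `mask_zero=True` assumes that index 0 is the padding token, which should be checked against the preprocessing step.

```python
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input
from tensorflow.keras.models import Model

MAX_LEN = 400       # sequence length of the loaded (N, 400) arrays
NUM_CLASSES = 5     # labels 0-4 in the classification reports
VOCAB_SIZE = 10000  # assumption: must be larger than the highest token index

il = Input(shape=(MAX_LEN,))
# mask_zero=True lets the LSTM skip padded positions (assumes padding index 0).
hl = Embedding(input_dim=VOCAB_SIZE, output_dim=64, mask_zero=True)(il)
# The LSTM reads the sequence and returns its final hidden state.
hl = LSTM(64)(hl)
ol = Dense(NUM_CLASSES, activation='softmax')(hl)

lstm_model = Model(il, ol)
lstm_model.compile(loss='sparse_categorical_crossentropy',
                   optimizer='adam', metrics=['accuracy'])
lstm_model.summary()
```

With sequences this long, training an LSTM is typically slower than the DNN or the 1-D CNN, so a smaller `epochs` value with early stopping may be practical.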

## 5. Using a pre-trained model(Optional)
* Fine-tune a Korean pre-trained model and analyze its performance
> * [BERT-tutorial](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
> * [HuggingFace-Korean](https://huggingface.co/models?language=korean)