## Before running anything, be sure to switch the runtime type to GPU!

## Building the fine-tuning dataset

First, a suitable dataset is needed. I decided to use the paper-summarization data provided by AI-Hub as the training dataset. Download the file into a suitable Google Drive directory.

Source: https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=582


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import os
from tqdm import tqdm

Load the saved file as JSON.

In [None]:
with open("/content/drive/MyDrive/kubig/kucon/논문요약_0225_5_1.json", "r") as st_json:
    data = json.load(st_json)

In [None]:
print("Number of records:", data['totalcount']) # the dataset holds 32,000 records

Number of records: 32000


This function converts the JSON structure into a DataFrame.

In [None]:
def json_to_df(data):
  # Build one row per document. 'orginal_text' is spelled as in the source JSON.
  rows = []
  for doc in data['data'][:data['totalcount']]:
    entire = doc['summary_entire'][0]   # summary of the whole abstract
    rows.append({'id': doc['doc_id'],
                 'category': doc['ipc'],
                 'text': entire['orginal_text'],
                 'summary': entire['summary_text']})
  return pd.DataFrame(rows, columns=['id', 'category', 'text', 'summary'])
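
As a quick sanity check of the layout `json_to_df` expects, the sketch below runs the same field lookups on a hand-made record. The sample values are invented for illustration; only the key names (`totalcount`, `data`, `doc_id`, `ipc`, `summary_entire`, `orginal_text`, `summary_text`) come from the function above. `extract_rows` is a hypothetical helper, not part of the dataset tooling, and skipping records without a summary is an added assumption.

```python
# Hypothetical miniature of the JSON layout; all values are made up.
sample = {
    "totalcount": 1,
    "data": [
        {
            "doc_id": "A0001",
            "ipc": "example-category",
            "summary_entire": [
                {"orginal_text": "Full abstract text.", "summary_text": "Short summary."}
            ],
        }
    ],
}


def extract_rows(data):
    """Return one dict per document, skipping records without a whole-text summary."""
    rows = []
    for doc in data["data"]:
        entire = doc.get("summary_entire") or []
        if not entire:
            continue  # no summary available for this record
        rows.append({
            "id": doc["doc_id"],
            "category": doc["ipc"],
            "text": entire[0]["orginal_text"],
            "summary": entire[0]["summary_text"],
        })
    return rows


rows = extract_rows(sample)
print(rows[0]["summary"])
```

Passing `rows` to `pd.DataFrame` would give the same four columns as `json_to_df`.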

Fine-tuning on all 32,000 records takes more than an hour per epoch, so only 100 records are used here as an example.

In [None]:
df = json_to_df(data)
df_ = df[:100] # use only a subset so fine-tuning does not take too long

In [None]:
# Split into train and test (validation) sets

length_data = len(df_)     # number of rows
split_ratio = 0.7           # 70% train / 30% validation
length_train = round(length_data * split_ratio)
length_validation = length_data - length_train
print("Data length :", length_data)
print("Train data length :", length_train)
print("Validation data length :", length_validation)

Data length : 100
Train data length : 70
Validation data length : 30
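
The split arithmetic above can be wrapped in a small helper. This is only a sketch: `split_sizes` and `shuffled_split` are hypothetical names, and the shuffled variant is an addition that the notebook does not use (it simply takes the first 70 rows).

```python
import random


def split_sizes(n_rows, ratio=0.7):
    """Return (train_size, validation_size) for a given train ratio."""
    n_train = round(n_rows * ratio)
    return n_train, n_rows - n_train


def shuffled_split(n_rows, ratio=0.7, seed=0):
    """Shuffle row positions reproducibly, then cut them at the train size."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_train, _ = split_sizes(n_rows, ratio)
    return positions[:n_train], positions[n_train:]


train_size, val_size = split_sizes(100)
print(train_size, val_size)  # 70 30, the same sizes as printed above
train_idx, val_idx = shuffled_split(100)
```

With the shuffled positions, `df_.iloc[train_idx]` and `df_.iloc[val_idx]` would select the two subsets.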


The columns must be named `news` and `summary` as below; otherwise fine-tuning fails with an error.

In [None]:
df_ = df_.iloc[:,2:]
df_.columns = ['news','summary']

train = df_[:length_train]
test = df_[length_train:]

The fine-tuning data has to be supplied as tsv files, so run the code below.

Keep the saved files for later.

In [None]:
train.to_csv('train.tsv', sep='\t', encoding='utf-8')

In [None]:
test.to_csv('test.tsv', sep='\t', encoding='utf-8')
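
Before moving on, it may be worth confirming that the written files really contain the `news` and `summary` columns. The sketch below uses only the standard `csv` module; `check_tsv` is a hypothetical helper. Note that `to_csv` as called above also writes the row index as an unnamed first column, which this check simply ignores.

```python
import csv
import os


def check_tsv(path, required=("news", "summary")):
    """Return the number of data rows, or raise if a required column is missing."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        header = reader.fieldnames or []
        missing = [col for col in required if col not in header]
        if missing:
            raise ValueError(f"{path} is missing columns: {missing}")
        return sum(1 for _ in reader)


# Only inspect the files if they exist in the current working directory.
for name in ("train.tsv", "test.tsv"):
    if os.path.exists(name):
        print(name, check_tsv(name), "rows")
```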

## Cloning the repository used for fine-tuning


In [None]:
!pip install git+https://github.com/SKT-AI/KoBART

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/SKT-AI/KoBART
  Cloning https://github.com/SKT-AI/KoBART to /tmp/pip-req-build-6t3xrzpf
  Running command git clone --filter=blob:none --quiet https://github.com/SKT-AI/KoBART /tmp/pip-req-build-6t3xrzpf
  Resolved https://github.com/SKT-AI/KoBART to commit 30c5eb7b593828d6ec2d767eeedb2f2ed02c5c2a
  Preparing metadata (setup.py) ... done
Collecting boto3
  Downloading boto3-1.26.74-py3-none-any.whl (132 kB)
Collecting pytorch-lightning==1.2.1
  Downloading pytorch_lightning-1.2.1-py3-none-any.whl (814 kB)
Collecting torch==1.7.1
  Downloading torch-1.7.1-cp38-cp38-manylinux1_x86_64.whl (776.8 MB)

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Open git bash in whichever local folder you like and run

`$ git clone https://github.com/seujung/KoBART-summarization.git`

Once the folder has been cloned, upload it to the desired location in Google Drive.

I'm not sure whether the repository can also be cloned directly into Drive.
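
Cloning straight into the mounted Drive may well work, since a mounted Drive appears as an ordinary directory under `/content/drive`. The sketch below has not been tested in this notebook's environment; the destination path is an assumption taken from the `%cd` cell that follows, and `clone_repo` is a hypothetical helper. By default it only builds the command (a dry run) and does not touch the network.

```python
import subprocess


def clone_repo(repo_url, dest_dir, dry_run=True):
    """Build the `git clone` command; run it only when dry_run is False."""
    cmd = ["git", "clone", repo_url, dest_dir]
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


cmd = clone_repo(
    "https://github.com/seujung/KoBART-summarization.git",
    "/content/drive/MyDrive/kubig/kucon/KoBART-summarization",  # assumed target
)
print(" ".join(cmd))
```

Calling it with `dry_run=False` after `drive.mount` would perform the clone; in a notebook cell the equivalent is `!git clone` followed by the URL and the destination.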

Then use the `cd` command to move into the directory where the folder was placed.

In [None]:
%cd /content/drive/MyDrive/kubig/kucon/KoBART-summarization

/content/drive/MyDrive/kubig/kucon/KoBART-summarization


## Setting up the environment before running the fine-tuning code

Install the packages needed for training. The `requirements.txt` file is inside the cloned folder.

In [None]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.10.0
  Downloading torch-1.10.0-cp38-cp38-manylinux1_x86_64.whl (881.9 MB)
Collecting transformers==4.8.2
  Downloading transformers-4.8.2-py3-none-any.whl (2.5 MB)
Collecting pytorch-lightning==1.3.8
  Downloading pytorch_lightning-1.3.8-py3-none-any.whl (813 kB)
Collecting streamlit==1.1.0
  Downloading streamlit-1.1.0-py2.py3-none-any.whl (8.3 MB)
Collecting huggingface-hub==0.

Then save the tsv files created earlier under `/KoBART-summarization/data`.
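
The copy can also be done from the notebook instead of by hand. This is a sketch under assumptions: `copy_tsvs` is a hypothetical helper, and the example call assumes the tsv files were written to `/content` (the working directory before the `%cd`), which the notebook does not state.

```python
import shutil
from pathlib import Path


def copy_tsvs(src_dir, dst_dir, names=("train.tsv", "test.tsv")):
    """Copy the named tsv files into dst_dir and return the paths written."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(src_dir) / name
        if src.exists():  # silently skip files that were never written
            copied.append(Path(shutil.copy(src, dst / name)))
    return copied


# Example call (assumed paths), run from inside the cloned repository:
# copy_tsvs("/content", "data")
```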

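Fine-tuning itself takes a single command, but a few more commands apparently have to be run first to avoid errors. The errors that appear can differ from one environment to another, so the steps below may not work everywhere.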
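
Because these errors come from version mismatches, it can help to print what is actually installed before and after each pin. The sketch below uses the standard `importlib.metadata` module (Python 3.8+); `installed_versions` is a hypothetical helper, and the package list is taken from the install commands in this notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ["torch", "torchtext", "torchmetrics", "pytorch-lightning", "transformers"]
)
for name, version in report.items():
    print(f"{name}: {version}")
```

Note that an already imported module keeps its old version until the runtime is restarted, so a restart after pinning may be needed.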

In [None]:
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp38-cp38-linux_x86_64.whl (735.4 MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp38-cp38-linux_x86_64.whl (12.8 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.10.0
    Uninstalling torch-1.10.0:
      Successfully uninstalled torch-1.10.0
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.14.1+cu

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
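
The message above suggests that no GPU was attached to the runtime when this cell ran, in which case the `--gpus 1` option of the training command would presumably not work. A small guard like the following could catch this early. It is a sketch: `gpu_visible` is a hypothetical helper, and it assumes `nvidia-smi` exits with a non-zero status when it cannot reach the driver.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True when `nvidia-smi` exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0


if not gpu_visible():
    print("No GPU detected: switch the runtime type to GPU before training.")
```

Once torch is installed, `torch.cuda.is_available()` is another common way to make the same check.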



The cell below fixes the error `ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/usr/local/lib/python3.8/dist-packages/torchmetrics/utilities/data.py)`.

In [None]:
!pip install torchmetrics==0.6.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchmetrics==0.6.0
  Downloading torchmetrics-0.6.0-py3-none-any.whl (329 kB)
Installing collected packages: torchmetrics
  Attempting uninstall: torchmetrics
    Found existing installation: torchmetrics 0.11.1
    Uninstalling torchmetrics-0.11.1:
      Successfully uninstalled torchmetrics-0.11.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kobart 0.5.1 requires pytorch-lightning==1.2.1, but you have pytorch-lightning 1.3.8 which is incompatible.
kobart 0.5.1 requires transformers==4.3.3, but you have transfor


The cell below fixes the error `OSError: /usr/local/lib/python3.8/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol:`.

It brings torchtext in line with the installed torch version.

In [None]:
!pip install torchtext==0.8.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.8.0
  Downloading torchtext-0.8.0-cp38-cp38-manylinux1_x86_64.whl (7.0 MB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.14.1
    Uninstalling torchtext-0.14.1:
      Successfully uninstalled torchtext-0.14.1
Successfully installed torchtext-0.8.0


The cell below fixes the error `AttributeError: 'Trainer' object has no attribute '_data_connector'`.

The error occurs because the installed pytorch_lightning version is incompatible.

In [None]:
!pip install pytorch_lightning==1.5.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch_lightning==1.5.2
  Downloading pytorch_lightning-1.5.2-py3-none-any.whl (1.0 MB)
Collecting pyDeprecate==0.3.1
  Downloading pyDeprecate-0.3.1-py3-none-any.whl (10 kB)
Installing collected packages: pyDeprecate, pytorch_lightning
  Attempting uninstall: pyDeprecate
    Found existing installation: pyDeprecate 0.3.0
    Uninstalling pyDeprecate-0.3.0:
      Successfully uninstalled pyDeprecate-0.3.0
  Attempting uninstall: pytorch_lightning
    Found existing installation: pytorch-lightning 1.3.8
    Uninstalling pytorch-lightning-1.3.8:
      Successfully uninstalled pytorch-lightning-1.3.8
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following depende

## Start fine-tuning!

In [None]:
# Command for running on a GPU.
# This is the line that performs the fine-tuning; adjust max_epochs or batch_size as needed.
!python train.py --gradient_clip_val 1.0 --max_epochs 5 --default_root_dir logs --gpus 1 --batch_size 8 --num_workers 4

If the command above finished normally, a ckpt (checkpoint) file for each epoch should have been saved under the logs directory.

The command below loads a saved checkpoint and exports the model as a bin file.

For hparams, use the file ./logs/tb_logs/default/version_0/hparams.yaml.

For model_binary, use one of the .ckpt files in ./logs/model_chp; picking one with a low loss should be fine.
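
Picking the checkpoint with the lowest loss can be automated by reading the `val_loss` value out of the file names. The sketch below assumes names of the form `epoch=01-val_loss=0.717.ckpt`, as in the command that follows; `best_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches names such as "epoch=01-val_loss=0.717.ckpt".
CKPT_PATTERN = re.compile(r"val_loss=(\d+(?:\.\d+)?)\.ckpt$")


def best_checkpoint(paths):
    """Return the path whose file name carries the lowest val_loss, or None."""
    scored = []
    for path in paths:
        match = CKPT_PATTERN.search(Path(path).name)
        if match:  # files without a val_loss in their name are ignored
            scored.append((float(match.group(1)), str(path)))
    return min(scored)[1] if scored else None


candidates = sorted(Path("./logs/model_chp").glob("*.ckpt"))
print(best_checkpoint(candidates))
```

The printed path can then be passed to `--model_binary`.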

In [None]:
!python get_model_binary.py --hparams ./logs/tb_logs/default/version_0/hparams.yaml --model_binary ./logs/model_chp/epoch=01-val_loss=0.717.ckpt

2023-02-20 06:48:09.374980: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-20 06:48:13.297680: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-20 06:48:13.298006: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  hparams = yaml.load(f)
Downloading: 

If the command above worked, a config.json file and a pytorch_model.bin file are created inside the kobart_summary folder.
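
A quick way to confirm the export is to check for the two files. This is a minimal sketch; `missing_files` is a hypothetical helper, and the relative path assumes the current directory is the cloned repository.

```python
from pathlib import Path


def missing_files(model_dir, names=("config.json", "pytorch_model.bin")):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in names if not (root / name).is_file()]


missing = missing_files("kobart_summary")
print("missing:", missing if missing else "nothing")
```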


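`streamlit` is a Python module for building demo web apps. Normally, running the cell below prints a URL, and opening that URL shows the demo page.

Here, though, the URL is generated but the demo page does not open, presumably because the Colab VM's port cannot be reached from an outside browser.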

In [None]:
# Serve the model as a demo page
!streamlit run infer.py

2023-02-20 06:51:07.384 INFO    numexpr.utils: NumExpr defaulting to 2 threads.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://35.245.58.120:8501

/content/drive/MyDrive/kubig/kucon/KoBART-summarization/.cache/kobart_base_tokenizer_cased_cf74400bce.zip[██████████████████████████████████████████████████]
  Stopping...
  Stopping...
