<a href="https://colab.research.google.com/github/junieberry/NLP-withPyTorch/blob/main/05_CBOW_Frankenstein/05_CBOW_Frankenstein_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
import os
from argparse import Namespace
import collections
import nltk.data
import numpy as np
import pandas as pd
import re
import string
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

In [10]:
nltk.download('punkt')  # fetch only the Punkt sentence tokenizer, without the interactive prompt

True

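## 5.2 Learning CBOW Embeddings

CBOW removes the center word from a context window and predicts that missing word from the remaining context words.

`nn.Embedding` maps each token's integer ID to a vector.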
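To make the ID-to-vector mapping concrete, here is a minimal plain-Python sketch of an embedding lookup. It is a conceptual stand-in, not PyTorch's implementation: the vocabulary, the table values, and the helper names `embed` and `cbow_context_vector` are all hypothetical, and summing the context vectors is only one common way to combine them (an assumption, not something this notebook specifies).

```python
# Hypothetical toy vocabulary: token -> integer ID (not taken from the notebook)
vocab = {"<MASK>": 0, "the": 1, "monster": 2, "fled": 3}

# A 4 x 3 lookup table: row i holds the vector for token ID i.
# In nn.Embedding these values would be learned; here they are fixed by hand.
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def embed(token_ids):
    """Look up one vector per integer ID."""
    return [embedding_table[i] for i in token_ids]

def cbow_context_vector(context_tokens):
    """Sum the context embeddings element-wise (assumed combination rule)."""
    vectors = embed([vocab[token] for token in context_tokens])
    return [sum(column) for column in zip(*vectors)]

# Context for the hidden center word "monster" in "the monster fled"
context_vector = cbow_context_vector(["the", "fled"])
print(context_vector)
```

A model would then feed a vector like `context_vector` to a classifier over the vocabulary to predict the hidden center word.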

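### 5.2.1 The Frankenstein Dataset

1. Build a Dataset from Mary Shelley's novel Frankenstein
2. Preprocess the text (split into sentences, lowercase, remove punctuation other than `. , ! ?`)
3. Group the dataset into windows
4. Split into training, validation, and test sets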


In [3]:
cd /content/drive/MyDrive/nlp-with-pytorch/chapter_5/5_2_CBOW/data

/content/drive/MyDrive/nlp-with-pytorch/chapter_5/5_2_CBOW/data


In [4]:
!chmod 755 get-all-data.sh
!./get-all-data.sh

Trying to fetch /content/drive/MyDrive/nlp-with-pytorch/chapter_5/5_2_CBOW/data/books/frankenstein.txt
14it [00:00, 2241.66it/s]
Trying to fetch /content/drive/MyDrive/nlp-with-pytorch/chapter_5/5_2_CBOW/data/books/frankenstein_with_splits.csv
109it [00:00, 4003.14it/s]


In [12]:
cd ..

/content/drive/My Drive/nlp-with-pytorch/chapter_5/5_2_CBOW


Preprocessing

In [6]:
args = Namespace(
    raw_dataset_txt="data/books/frankenstein.txt",
    window_size=5,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/books/frankenstein_with_splits.csv",
    seed=1337
)

In [13]:
# Split the raw book text into sentences
# Requires the NLTK Punkt data to be downloaded beforehand
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open(args.raw_dataset_txt) as fp:
    book = fp.read()
sentences = tokenizer.tokenize(book)

In [15]:

print(len(sentences), "sentences")
print("Sample:", sentences[0])

3427 sentences
Sample: Frankenstein,

or the Modern Prometheus


by

Mary Wollstonecraft (Godwin) Shelley


Letter 1


St. Petersburgh, Dec. 11th, 17--

TO Mrs. Saville, England

You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.


In [16]:
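def preprocess_text(text):
    # Lowercase the whole sentence
    text = text.lower()
    # Surround . , ! ? with spaces so each becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Collapse every run of other characters (digits, newlines, quotes, ...) into one space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text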
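As a quick check of what this preprocessing does, the sketch below reapplies the same three steps to a made-up sample string (the sample is an assumption, not a line quoted from the book). Note that the result keeps a trailing space, so a later `split(' ')` yields an empty final token.

```python
import re

def preprocess_text(text):
    # Same three steps as the notebook cell: lowercase, isolate . , ! ?,
    # then collapse all other non-letter runs into a single space
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

sample = "You will rejoice, Mrs. Saville!"  # hypothetical input
cleaned = preprocess_text(sample)
print(repr(cleaned))             # 'you will rejoice , mrs . saville ! '
print(cleaned.split(' ')[-1] == '')  # True: the trailing space creates an empty token
```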

In [17]:
cleaned_sentences = [preprocess_text(sentence) for sentence in sentences]

In [18]:
# Global vars
MASK_TOKEN = "<MASK>"

In [19]:
# Create windows
# Pad each sentence with MASK_TOKEN on both sides so that every real token
# can sit at the center of a window of 2 * window_size + 1 tokens
flatten = lambda outer_list: [item for inner_list in outer_list for item in inner_list]
padding = [MASK_TOKEN] * args.window_size
windows = flatten([
    list(nltk.ngrams(padding + sentence.split(' ') + padding, args.window_size * 2 + 1))
    for sentence in tqdm_notebook(cleaned_sentences)
])

# Create cbow data
data = []
for window in tqdm_notebook(windows):
    # The center token is the prediction target
    target_token = window[args.window_size]
    # The context is every other token in the window, excluding padding
    context = [token for i, token in enumerate(window)
               if token != MASK_TOKEN and i != args.window_size]
    data.append([' '.join(context), target_token])
    
            
# Convert to dataframe
cbow_data = pd.DataFrame(data, columns=["context", "target"])
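The windowing logic is easier to follow on a tiny input. The sketch below is a plain-Python stand-in for the `nltk.ngrams` call, using a made-up four-word sentence and a window size of 2 instead of the notebook's 5; the helper name `make_cbow_pairs` is hypothetical.

```python
MASK_TOKEN = "<MASK>"

def make_cbow_pairs(sentence, window_size):
    """Return (context, target) pairs, one per token in the sentence."""
    tokens = sentence.split(' ')
    padding = [MASK_TOKEN] * window_size
    padded = padding + tokens + padding
    width = 2 * window_size + 1
    pairs = []
    # Slide a fixed-width window over the padded sentence
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        target = window[window_size]  # center token
        context = [token for i, token in enumerate(window)
                   if token != MASK_TOKEN and i != window_size]
        pairs.append((' '.join(context), target))
    return pairs

pairs = make_cbow_pairs("i saw the monster", window_size=2)
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
# 'saw the' -> 'i'
# 'i the monster' -> 'saw'
# 'i saw monster' -> 'the'
# 'saw the' -> 'monster'
```

Because of the padding, each sentence produces exactly one window per token, and tokens near the sentence boundaries simply get shorter contexts.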


In [20]:
# Create split data
# Rows are assigned by position (no shuffling): the first 70% go to train,
# the next 15% to val, and the remainder to test
n = len(cbow_data)
def get_split(row_num):
    if row_num <= n*args.train_proportion:
        return 'train'
    elif row_num <= n*args.train_proportion + n*args.val_proportion:
        return 'val'
    else:
        return 'test'
cbow_data['split'] = cbow_data.apply(lambda row: get_split(row.name), axis=1)
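To see how many rows each split receives, the sketch below applies the same positional rule to plain row numbers, without pandas. It assumes `n = 90698`, the window count shown in the progress-bar output of this run, and the 0.7 / 0.15 proportions from `args`.

```python
from collections import Counter

n = 90698               # number of windows reported in this notebook run
train_proportion = 0.7
val_proportion = 0.15

def get_split(row_num):
    # Same thresholds as the notebook cell, applied to the row position
    if row_num <= n * train_proportion:
        return 'train'
    elif row_num <= n * train_proportion + n * val_proportion:
        return 'val'
    else:
        return 'test'

counts = Counter(get_split(row_num) for row_num in range(n))
print(dict(counts))  # {'train': 63489, 'val': 13605, 'test': 13604}
```

Since rows are assigned in order, the test set comes entirely from the end of the book; the `seed` in `args` plays no part in this split.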

In [21]:
cbow_data.head()

|   | context | target | split |
|---|---------|--------|-------|
| 0 | , or the modern prometheus | frankenstein | train |
| 1 | frankenstein or the modern prometheus by | , | train |
| 2 | frankenstein , the modern prometheus by mary | or | train |
| 3 | frankenstein , or modern prometheus by mary wo... | the | train |
| 4 | frankenstein , or the prometheus by mary wolls... | modern | train |


In [None]:
# Write split data to file
cbow_data.to_csv(args.output_munged_csv, index=False)