### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) movie review data to practice loading data with torchtext. The data is provided in the attached polarity.tsv file.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes loading the data simpler.

### Load packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install torchtext
!pip install spacy
!python -m spacy download en

In [3]:
import torch
import pandas as pd
import numpy as np
from torchtext import datasets
from torchtext.legacy import data
import re

In [4]:
# Explore the data
# Each row holds a review text and a class label; the label marks a positive or a negative review
input_data = pd.read_csv('/content/drive/MyDrive/NLP_DeepLearning/polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

Unnamed: 0,text,label
0,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,1
3,jaws is a rare film that grabs your attentio...,1
4,moviemaking is a lot like being the general ma...,1
...,...,...
1995,"if anything , "" stigmata "" should be taken as ...",0
1996,"john boorman's "" zardoz "" is a goofy cinematic...",0
1997,the kids in the hall are an acquired taste .it...,0
1998,there was a time when john carpenter was a gre...,0


### Build a pipeline to generate the data

Define the preprocessing function and create the Fields for text and label

In [5]:
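# preprocessing: runs on the token list, after tokenization and lowercasing
def remove_non_char(text):
  # strip every non-letter character, then keep only tokens with more than 2 letters left
  cleaned = [re.sub('[^A-Za-z]', '', word) for word in text]
  return [word for word in cleaned if len(word) > 2]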
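
As a quick check of what this preprocessing step does, the sketch below repeats the same filter and applies it to a hand-written token list. The sample tokens are made up for illustration; in the pipeline the list would come from the spaCy tokenizer after lowercasing.

```python
import re

def remove_non_char(tokens):
    # Same filter as the cell above: keep letters only,
    # then drop tokens with 2 or fewer letters left.
    cleaned = (re.sub('[^A-Za-z]', '', tok) for tok in tokens)
    return [tok for tok in cleaned if len(tok) > 2]

# Hypothetical tokens, roughly as a tokenizer might produce them.
sample_tokens = ["you", "'ve", "got", "mail", "works", "a", "lot", "better", ",", "1998", "!"]
print(remove_non_char(sample_tokens))
```

Punctuation, numbers, and short fragments such as `'ve` and `a` are dropped, so only alphabetic tokens of three or more letters reach the vocabulary.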

In [6]:
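# Create the Fields and Dataset
### <your code> ###
# Note: dtype=torch.float64 makes the token indices come out as floats;
# embedding layers expect integer indices, so the default torch.long is the usual choice.
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize='spacy', preprocessing=remove_non_char)
label_field = data.Field(sequential=False, use_vocab=False)
fields = {'text': ('t', text_field), 'label': ('l', label_field)}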

### Method 1

Build the examples first, then create the dataset and the vocabulary<br>
data.Example.fromlist()<br>
data.Dataset()<br>
data.Iterator()<br>

In [None]:
# Build the examples and shuffle their order
### <your code> ###
input_data = input_data.sample(frac=1)
examples=[]
for (text,label) in input_data.itertuples(index=False):
  examples.append(data.Example.fromlist(data = [text,label],
                    fields = [('text',text_field),('label',label_field)]))

# Split the examples 8:2
### <your code> ###
split_idx = int(len(examples)*0.8)
train_ex = examples[:split_idx]
test_ex = examples[split_idx:]
# Create the training and testing datasets
### <your code> ###
train_data = data.Dataset(examples=train_ex, fields={'text':text_field, 'label':label_field})
test_data = data.Dataset(examples=test_ex, fields={'text':text_field, 'label':label_field})

train_data[0].label, train_data[0].text
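
The shuffle-then-slice split can be checked on a toy list. This sketch uses plain integers in place of `data.Example` objects and a fixed random seed (both are assumptions for illustration) to confirm that an 8:2 split puts every example in exactly one of the two sets.

```python
import random

# Stand-ins for the Example objects; the real code slices the `examples` list.
examples = list(range(10))

random.seed(0)            # fixed seed so the shuffle is reproducible
random.shuffle(examples)  # plays the role of input_data.sample(frac=1)

split_idx = int(len(examples) * 0.8)
train_ex, test_ex = examples[:split_idx], examples[split_idx:]

print(len(train_ex), len(test_ex))
print(set(train_ex) & set(test_ex))  # empty set: no example is in both splits
```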

In [8]:
# Build the vocabulary
### <your code> ###
text_field.build_vocab(train_data)
print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', 'the', 'and', 'that', 'with', 'for', 'his', 'this', 'film'] 
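
The printed vocabulary starts with the special tokens `<unk>` and `<pad>`, followed by the most frequent words. The sketch below imitates that layout in plain Python on a made-up toy corpus. It is a simplified illustration, not torchtext's implementation; the helper name `build_vocab_sketch` and the alphabetical tie-breaking are assumptions of this example.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, specials=("<unk>", "<pad>")):
    # Count every token across the training texts.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered                     # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # string -> index
    return itos, stoi

# Hypothetical tokenized training texts.
texts = [["the", "film", "was", "great"],
         ["the", "plot", "was", "thin"],
         ["the", "film"]]

itos, stoi = build_vocab_sketch(texts)
print(itos[:5])
# Words missing from the vocabulary fall back to the <unk> index.
print(stoi["film"], stoi.get("unseen", stoi["<unk>"]))
```

Building the vocabulary from the training split only, as the cell above does, keeps words that occur only in the test split mapped to `<unk>`.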



In [9]:
# create iterator for training and testing data
train_iter = data.Iterator(dataset=train_data,
                    batch_size=4,
                    repeat=False,
                    sort_key=lambda x:len(x.text))

In [10]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[1.7310e+03, 9.0000e+01, 1.5960e+03, 4.4000e+01],
        [6.0000e+00, 5.0000e+01, 6.0000e+00, 2.5400e+02],
        [6.0990e+03, 8.1500e+02, 1.9448e+04, 1.2000e+01],
        ...,
        [1.0000e+00, 1.0000e+00, 5.5600e+02, 1.0000e+00],
        [1.0000e+00, 1.0000e+00, 4.7800e+02, 1.0000e+00],
        [1.0000e+00, 1.0000e+00, 3.1100e+02, 1.0000e+00]], dtype=torch.float64) torch.Size([607, 4])
tensor([1, 1, 0, 0]) torch.Size([4])
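
The text batch has shape `[607, 4]`: the first dimension is the length of the longest review in the batch and the second is the batch size. The trailing rows of `1` are the `<pad>` index, which fills the shorter reviews up to that length. A plain-Python sketch of this padding and sequence-first layout, using made-up index sequences rather than the real data:

```python
PAD_IDX = 1  # index of '<pad>' in the vocabulary built earlier

def pad_batch(sequences, pad_idx=PAD_IDX):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # Transpose to (max_len, batch_size): one column per example.
    return [list(row) for row in zip(*padded)]

# Hypothetical token-index sequences of different lengths.
batch = [[5, 9, 4], [7, 2], [8, 3, 6, 10]]
matrix = pad_batch(batch)

print(len(matrix), len(matrix[0]))  # (max_len, batch_size)
for row in matrix:
    print(row)
```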


### Method 2

data.TabularDataset()<br>
data.BucketIterator()<br>

In [None]:
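# TabularDataset reads the file directly, so shuffling the input_data DataFrame is not needed here;
# split() shuffles the examples before splitting them.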
All_data = data.TabularDataset(path='/content/drive/MyDrive/NLP_DeepLearning/polarity.tsv', 
              format='tsv',
              fields=[('text', text_field), ('label', label_field)])
Train_data, Test_data = All_data.split(split_ratio=0.8)
Train_data[0].label, Train_data[0].text

In [12]:
# Build the vocabulary
### <your code> ###
text_field.build_vocab(Train_data)
print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', 'the', 'and', 'that', 'with', 'for', 'film', 'his', 'this'] 



In [13]:
Train_iterator, Test_iterator = data.BucketIterator.splits((Train_data,Test_data),batch_size=4)

In [14]:
for train_batch in Train_iterator:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[6.4700e+02, 1.7000e+01, 7.9000e+01, 7.9000e+01],
        [3.0000e+01, 1.2000e+02, 2.0200e+02, 1.9300e+02],
        [1.5210e+03, 1.0000e+02, 2.0000e+00, 1.2340e+03],
        ...,
        [1.0000e+00, 1.0000e+00, 1.0000e+00, 1.6950e+03],
        [1.0000e+00, 1.0000e+00, 1.0000e+00, 3.4270e+03],
        [1.0000e+00, 1.0000e+00, 1.0000e+00, 1.8400e+03]], dtype=torch.float64) torch.Size([637, 4])
tensor([0, 0, 0, 1]) torch.Size([4])
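
`data.BucketIterator` is documented as batching examples of similar lengths together so that less padding is needed. The grouping relies on a `sort_key` that returns each example's length, such as the `lambda x: len(x.text)` used in Method 1. The sketch below illustrates the idea on made-up review lengths by comparing the padding needed for arrival-order batches with that for length-sorted batches. It is a simplification and not torchtext's actual bucketing procedure; `padding_needed` is a hypothetical helper.

```python
def padding_needed(lengths, batch_size):
    # Total pad tokens when each batch is padded to its longest sequence.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical review lengths, in tokens.
lengths = [607, 120, 450, 95, 300, 610, 110, 440]

unsorted_pad = padding_needed(lengths, batch_size=4)
bucketed_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, bucketed_pad)
```

Grouping by length cuts the number of `<pad>` tokens per batch, which matters here because the longest review in a batch of 4 can run past 600 tokens.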
