### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is provided in the attached polarity.tsv.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) for this assignment; it makes reading the data simpler.
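
To make the hint concrete, here is a minimal standard-library sketch of what a tabular reader does: it maps each tab-separated column onto a named field. The sample rows and the `read_tsv` helper are hypothetical and stand in for polarity.tsv and for the loader. In the legacy torchtext API, `TabularDataset` is given a path, a format such as `'tsv'`, and a list of `(name, Field)` pairs that plays the same role as `field_names` here.

```python
import csv
import io

# Hypothetical two-row sample standing in for polarity.tsv (text<TAB>label).
SAMPLE_TSV = (
    "a rare film that grabs your attention\t1\n"
    "the plot never comes together\t0\n"
)


def read_tsv(handle, field_names):
    """Yield one dict per row, mapping each column onto its field name."""
    # QUOTE_NONE keeps quote characters inside the review text as plain characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(field_names, row))


rows = list(read_tsv(io.StringIO(SAMPLE_TSV), ["text", "label"]))
print(rows[0]["text"], rows[0]["label"])
```

Every value comes back as a string, so labels still need converting (or indexing through a vocabulary) before training.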

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
from torchtext import data, datasets

In [2]:
# Explore the data
# Each row holds a review text and a class label; the classes are positive and negative reviews
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

| | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1.0 |
| 1 | every now and then a movie comes along from a ... | 1.0 |
| 2 | you've got mail works alot better than it dese... | 1.0 |
| 3 | jaws is a rare film that grabs your attentio... | 1.0 |
| 4 | moviemaking is a lot like being the general ma... | 1.0 |
| ... | ... | ... |
| 533 | capsule : trippy , hyperspeed action machine f... | 1.0 |
| 534 | the heartbreak kid ( reviewed on aug . 26th/19... | 1.0 |
| 535 | it's a curious thing - i've found that when wi... | 1.0 |
| 536 | i'll be the first to admit i didn't expect muc... | 1.0 |


### Build a pipeline to generate the data

In [3]:
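# Create the Fields that the Dataset will be built from
# Note: dtype=torch.float64 makes the token-index tensors floating point (visible in the batch output of In [7]);
# the Field default, torch.long, is what an nn.Embedding layer expects.
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize='spacy')
label_field = data.Field(sequential=False)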
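
As a rough illustration of what the text field does to each review before indexing, the sketch below tokenizes a string and lowercases the tokens. It is a sketch under assumptions: `preprocess` is a hypothetical helper, and `str.split` is only a stand-in for the spaCy tokenizer, which splits punctuation and whitespace differently.

```python
def preprocess(text, tokenize=str.split, lower=True):
    """Tokenize a raw string, then optionally lowercase every token."""
    tokens = tokenize(text)
    if lower:
        tokens = [token.lower() for token in tokens]
    return tokens


print(preprocess("Jaws is a RARE film"))  # ['jaws', 'is', 'a', 'rare', 'film']
```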

In [4]:
# Build the examples and shuffle them
examples = []
for (text, label) in input_data.values:
    examples.append(data.Example.fromlist(data=[text, label],
                                          fields=[('text', text_field),
                                                  ('label', label_field)]))
np.random.shuffle(examples)

# Split the examples 8:2
train_ex = examples[:int(len(examples)*0.8)]
test_ex = examples[int(len(examples)*0.8):]

# Create the training and testing datasets
train_data = data.Dataset(examples=train_ex, fields={'text': text_field, 'label': label_field})
test_data = data.Dataset(examples=test_ex, fields={'text': text_field, 'label': label_field})

train_data[0].label, train_data[0].text

(1.0,
 [' ',
  'when',
  'will',
  'the',
  'devil',
  'take',
  'me',
  '?',
  ' ',
  'he',
  'asks',
  'rhetorically',
  'in',
  'lulling',
  'voice',
  'over',
  '.the',
  'spoiled',
  'title',
  'character',
  'of',
  '_',
  'onegin',
  '_',
  '(',
  'pronounced',
  'oh',
  '-',
  'negg',
  '-',
  'in',
  ')',
  'is',
  'waiting',
  'on',
  'death',
  'to',
  'relieve',
  'him',
  'after',
  'a',
  'lifetime',
  'of',
  'rapacious',
  'behaviour',
  '.martha',
  'fiennes',
  "'",
  'debut',
  'feature',
  'is',
  '(',
  'quite',
  'literally',
  ')',
  'filmed',
  'poetry',
  '(',
  'it',
  "'s",
  'based',
  'on',
  'an',
  'epic',
  'russian',
  'poem',
  'by',
  'alexander',
  'pushkin',
  ')',
  ',',
  'a',
  'profound',
  'study',
  'of',
  'regret',
  ',',
  'of',
  'how',
  'we',
  'confuse',
  'shame',
  'with',
  'guilt',
  '.when',
  'we',
  'first',
  'meet',
  'eugene',
  'onegin',
  '(',
  'ralph',
  ',',
  'acting',
  'for',
  'his',
  'sister',
  ';',
  'another',
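
The shuffle-and-split step from In [4] can be sketched on its own with the standard library. `shuffle_split` is a hypothetical helper, and the fixed seed is an added assumption (the notebook does not seed its shuffle), included so that the same split comes back on every run.

```python
import random


def shuffle_split(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` with a fixed seed, then cut it at train_ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]


train, test = shuffle_split(range(10))
print(len(train), len(test))  # 8 2
```

Computing the cut index once also avoids repeating the `int(len(...) * 0.8)` expression for both slices.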
  

In [5]:
# Build the vocabularies
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)

print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', 'is', 'in'] 
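
The ordering in this output (special tokens first, then tokens by descending frequency) can be reproduced with a small sketch. `build_vocab` and `lookup` below are hypothetical helpers, not torchtext functions, and the alphabetical tie-break is an assumption about how equal counts are ordered.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent tokens first; equal counts are ordered alphabetically.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    itos = list(specials) + ordered
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def lookup(stoi, token):
    """Map a token to its index, falling back to the <unk> index."""
    return stoi.get(token, stoi["<unk>"])


itos, stoi = build_vocab([["the", "film", "is", "the", "best"], ["the", "film"]])
print(itos)  # ['<unk>', '<pad>', 'the', 'film', 'best', 'is']
print(lookup(stoi, "unseen"))  # 0
```

Because the vocabulary is built from the training split only, any token that appears only in the test split maps to `<unk>`.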



In [6]:
# Create iterators for the training and testing data
train_iter, test_iter = data.Iterator.splits(
    datasets=(train_data, test_data),
    batch_sizes=(3, 3),
    repeat=False,
    sort_key=lambda ex: len(ex.text),
)

In [7]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[3.0000e+00, 5.7000e+03, 2.6000e+01],
        [1.2110e+03, 1.8540e+03, 1.2620e+03],
        [4.8000e+02, 8.0000e+00, 2.0000e+00],
        ...,
        [1.0000e+00, 1.3600e+02, 1.0000e+00],
        [1.0000e+00, 5.1800e+02, 1.0000e+00],
        [1.0000e+00, 3.2580e+03, 1.0000e+00]], dtype=torch.float64) torch.Size([696, 3])
tensor([1, 1, 1]) torch.Size([3])
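
The shape `[696, 3]` and the trailing `1` values follow from padding: every review in the batch is extended to the length of the longest one using the `<pad>` index (1 in the vocabulary above), and the result is laid out sequence-first, one column per example. A minimal pure-Python sketch, using a hypothetical `pad_batch` helper and made-up token indices:

```python
PAD_INDEX = 1  # position of '<pad>' in the vocabulary printed earlier


def pad_batch(sequences, pad_index=PAD_INDEX):
    """Pad index lists to a common length and return them as [max_len, batch_size]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that each row is a time step and each column is one example.
    return [list(step) for step in zip(*padded)]


batch = pad_batch([[3, 1211, 480], [5700, 1854, 8, 136, 518], [26, 1262]])
print(len(batch), len(batch[0]))  # 5 3
print(batch[-1])  # [1, 518, 1]
```

Only the longest example reaches the last row without padding, which matches the pattern in the final rows of the tensor above.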
