### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) movie reviews to practice loading data with torchtext. The data is provided in the attached polarity.tsv.

Hint: for this assignment you can try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes reading the data simpler.
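As a rough illustration of the hint, the sketch below reads the TSV with `TabularDataset` instead of building each `Example` by hand. It makes several assumptions that are not part of the assignment: the helper name `load_polarity` is hypothetical, `str.split` stands in for the spaCy tokenizer to keep the example short, the file is assumed to have no header row (matching the `header=None` used with pandas in this notebook), and the import falls back between `torchtext.legacy` (torchtext 0.9 to 0.11) and `torchtext` (0.8 and earlier).

```python
def load_polarity(path="./polarity.tsv", train_ratio=0.8):
    """Hypothetical helper: load a two-column TSV (text, label) and split it."""
    # The Field/TabularDataset API lives in different modules depending on the version.
    try:
        from torchtext.legacy import data  # torchtext 0.9 to 0.11
    except ImportError:
        from torchtext import data  # torchtext 0.8 and earlier

    # Assumption: whitespace tokenization instead of the spaCy tokenizer.
    text_field = data.Field(sequential=True, lower=True, tokenize=str.split)
    label_field = data.Field(sequential=False)

    # Fields are listed in the same order as the columns of the file.
    dataset = data.TabularDataset(
        path=path,
        format="tsv",
        fields=[("text", text_field), ("label", label_field)],
        skip_header=False,  # assumption: the file has no header row
    )

    # Random split into training and testing portions.
    train_data, test_data = dataset.split(split_ratio=train_ratio)
    return train_data, test_data, text_field, label_field
```

Calling `load_polarity()` would return the two datasets together with the fields, so `build_vocab` and the iterators can be used on them in the same way as in the cells that follow.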

### Load packages

In [1]:
import torch
import pandas as pd
import numpy as np
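# Field/Example/Iterator are the legacy torchtext API
# (torchtext <= 0.8; available as torchtext.legacy in 0.9 to 0.11).
from torchtext import data, datasets
import spacy
# The 'en' shortcut works with spaCy 2; spaCy 3 requires a full
# model name such as 'en_core_web_sm'.
spacy_en = spacy.load('en')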

In [2]:
# Explore the data
# Each row holds a review text and a label; the label marks a positive (1) or negative (0) review
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

|  | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste .it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


In [3]:
input_data[:5]

|  | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |


### Build a pipeline to generate the data

In [4]:
def tokenizer(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [5]:
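# Create the Fields
# Note: dtype=torch.float64 turns the token indices into floats (see the batch
# printed at the end); the default torch.long is what nn.Embedding expects.
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize=tokenizer)
# sequential=False still builds a vocabulary (with '<unk>' at index 0), so the
# label tensor holds vocabulary indices rather than the raw 0/1 values.
label_field = data.Field(sequential=False)
### <your code> ###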

In [6]:
# Build the examples and shuffle them
### <your code> ###
examples = []
for (text, label) in input_data.values:
    examples.append(data.Example.fromlist(data=[text, label],
                                          fields=[('text', text_field),
                                                  ('label', label_field)]))
np.random.shuffle(examples)  # in-place shuffle; no seed is set, so the order differs per run
# Split the examples 8:2
### <your code> ###
split_idx = int(len(examples) * 0.8)
train_ex = examples[:split_idx]
test_ex = examples[split_idx:]

# Create the training and testing datasets
### <your code> ###
train_data = data.Dataset(examples=train_ex, fields={'text': text_field, 'label': label_field})
test_data = data.Dataset(examples=test_ex, fields={'text': text_field, 'label': label_field})

train_data[0].label, train_data[0].text

(0,
 ['everybody',
  'in',
  'this',
  'film',
  "'s",
  'thinking',
  'of',
  'alicia',
  '.no',
  ',',
  'this',
  'is',
  'not',
  'a',
  'documentary',
  'on',
  'those',
  'of',
  'us',
  'after',
  'we',
  'first',
  'saw',
  'the',
  '"',
  'cryin',
  "'",
  '"',
  'video',
  '.this',
  'is',
  'one',
  'of',
  'those',
  'erotic',
  'thrillers',
  ',',
  'but',
  'not',
  'like',
  'one',
  'starring',
  'shannon',
  'whirry',
  'or',
  'shannon',
  'tweed',
  '.first',
  'off',
  ',',
  'there',
  "'s",
  'zero',
  'sex',
  ',',
  'almost',
  'no',
  'nudity',
  ',',
  'and',
  'it',
  "'s",
  'not',
  'as',
  'well',
  '-',
  'plotted',
  'as',
  'one',
  'of',
  'those',
  'tweed',
  'flicks',
  '.well',
  ',',
  'anyway',
  '.the',
  '"',
  'plot',
  '.',
  '"',
  'alicia',
  'plays',
  ',',
  'well',
  ',',
  'the',
  'babysitter',
  ',',
  'who',
  'is',
  'taking',
  'care',
  'of',
  'some',
  'kids',
  'one',
  'night',
  'while',
  'the',
  'parents',
  '(',
  'j',
  

In [7]:
# Build the vocabulary
### <your code> ###
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)
print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', 'is', 'in']
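To make the printed vocabulary easier to read, here is a minimal pure-Python sketch of what building a vocabulary amounts to: count the tokens, put the special tokens first, then order the rest by descending frequency. It is an illustration under stated assumptions, not torchtext's implementation; the function name `build_vocab`, the toy `corpus`, and the alphabetical tie-breaking are all chosen for this example.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for a list of token lists (illustrative only)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent tokens first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Special tokens take the lowest indices, which is why '<unk>' is 0 and '<pad>' is 1.
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


# Toy corpus, not taken from polarity.tsv.
corpus = [
    ["the", "film", "is", "good"],
    ["the", "film", "is", "bad"],
    ["the", "plot"],
]
itos, stoi = build_vocab(corpus)

# Tokens that are not in the vocabulary fall back to the '<unk>' index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["the", "plot", "twist"]]
print(itos)
print(encoded)
```

In this sketch `itos` maps an index to its token and `stoi` is the reverse lookup, the same pair of attributes printed from `text_field.vocab` above.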



In [8]:
# create iterator for training and testing data
train_iter, test_iter = data.Iterator.splits(datasets = (train_data,test_data),
                                            batch_size=3,
                                            repeat=False,
                                            sort_key=lambda ex:len(ex.text))

In [9]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[8.5000e+02, 2.9000e+01, 4.2710e+03],
        [2.1240e+03, 3.3600e+02, 1.3002e+04],
        [3.7000e+03, 1.7500e+02, 1.6000e+01],
        ...,
        [7.0000e+00, 1.0000e+00, 1.0000e+00],
        [9.2200e+02, 1.0000e+00, 1.0000e+00],
        [1.7000e+01, 1.0000e+00, 1.0000e+00]], dtype=torch.float64) torch.Size([1057, 3])
tensor([2, 2, 2]) torch.Size([3])
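The text tensor above has shape `[1057, 3]`: the first dimension is the length of the longest review in the batch, the second is `batch_size`, and the trailing `1.0` entries are the `<pad>` index filling up the shorter reviews. The sketch below reproduces that layout in plain Python as an illustration only; `pad_batch` is a hypothetical helper and the token indices are made up.

```python
def pad_batch(sequences, pad_index=1):
    """Pad index sequences to equal length and return them as (max_len, batch_size)."""
    max_len = max(len(seq) for seq in sequences)
    # Append the pad index to every sequence shorter than the longest one.
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that each column is one example, matching the printed tensor.
    return [list(column) for column in zip(*padded)]


# Three made-up index sequences of different lengths.
batch = pad_batch([[5, 6, 7, 8], [9, 10], [11]])
for row in batch:
    print(row)
print((len(batch), len(batch[0])))
```

Reading the result column by column gives back each original sequence followed by its padding.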
