### Assignment goal: become proficient at reading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice reading data with torchtext. The data is provided in the attached polarity.tsv.

Hint: try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) for this assignment; it makes reading the data simpler.
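As a rough picture of what a tabular reader has to do with this file, the sketch below parses tab-separated rows into `text`/`label` records using only the standard library. It is not how `TabularDataset` is implemented; the `read_tsv` helper and the two sample rows are made up for illustration, and treating quote characters literally (`csv.QUOTE_NONE`) is an assumption chosen here because the reviews contain `"` characters.

```python
import csv
import io

# Hypothetical two-row sample in the same "text<TAB>label" layout as polarity.tsv.
sample_tsv = "a gripping and funny film\t1\nflat , dull and overlong\t0\n"

def read_tsv(handle):
    """Return a list of {'text': ..., 'label': ...} dicts, one per TSV row."""
    # QUOTE_NONE keeps any " characters in the text as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{"text": text, "label": label} for text, label in reader]

rows = read_tsv(io.StringIO(sample_tsv))
print(rows[0]["label"], rows[0]["text"])
```

To try it on the real file, the same helper could be called on `open('polarity.tsv', newline='', encoding='utf-8')`, assuming the file has no header row.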

### Load packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days-part2/D06_Torchtext_NLP_text_process')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days-part2/D06_Torchtext_NLP_text_process


In [3]:
!pip install torchtext
!pip install spacy
!python -m spacy download en


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [4]:
import torch
import pandas as pd
import numpy as np
#from torchtext import data, datasets
from torchtext.legacy import data, datasets
import spacy
spacy.load('en')
#spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x7f571f2bbd90>

In [5]:
# Explore the data
# Each row holds a review text and a class label; the label marks a positive or negative review
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

Unnamed: 0,text,label
0,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,1
3,jaws is a rare film that grabs your attentio...,1
4,moviemaking is a lot like being the general ma...,1
...,...,...
1995,"if anything , "" stigmata "" should be taken as ...",0
1996,"john boorman's "" zardoz "" is a goofy cinematic...",0
1997,the kids in the hall are an acquired taste .it...,0
1998,there was a time when john carpenter was a gre...,0


### 建立Pipeline生成資料

In [6]:
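# Build the Fields and the Dataset
### <your code> ###
# Note: dtype=torch.float64 makes the text batches float tensors (see the batch output at the end).
# Token indices are normally left as the default torch.long, which layers such as nn.Embedding expect.
text_field = data.Field(sequential=True, dtype=torch.float64, lower=True, tokenize='spacy')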
label_field = data.Field(sequential=False)
input_data = data.TabularDataset(path='polarity.tsv', 
                 format='tsv', 
                 fields=[('text', text_field), ('label', label_field)])
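To make the role of the text `Field` more concrete, here is a minimal sketch of the per-row preprocessing it is configured for: tokenize, then lowercase. The `preprocess` function is hypothetical, and whitespace splitting stands in for the spaCy tokenizer, so the tokens will not match spaCy's output on real reviews.

```python
def preprocess(text, tokenize=str.split, lower=True):
    """Tokenize a raw string, then optionally lowercase each token."""
    tokens = tokenize(text)
    return [tok.lower() for tok in tokens] if lower else tokens

# One toy example with the same two attributes the dataset examples carry.
example = {"text": preprocess("Jackie Chan is one of the newest stars ."), "label": "1"}
print(example["text"])
```

The label stays a plain string at this stage; it only becomes an integer once a vocabulary has been built for the label field.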


In [7]:
# Get the examples and shuffle them
### <your code> ###
examples = input_data.examples
np.random.shuffle(examples)

# Split the examples 8:2
### <your code> ###
split_idx = int(len(examples)*0.8)
train_ex = examples[:split_idx]
test_ex = examples[split_idx:]

# Build the training and testing datasets
### <your code> ###
train_data = data.Dataset(examples=train_ex, fields={'text':text_field, 'label': label_field})
test_data = data.Dataset(examples=test_ex, fields={'text':text_field, 'label':label_field})

train_data[0].label, train_data[0].text

('1',
 ['mpaa',
  ':',
  'not',
  'rated',
  '(',
  'though',
  'i',
  'feel',
  'it',
  'would',
  'likely',
  'be',
  'pg',
  ',',
  'for',
  'martial',
  '-',
  'arts',
  'violence',
  '.',
  ')',
  'with',
  'three',
  'movies',
  'already',
  '(',
  're',
  ')',
  'released',
  'theatrically',
  'in',
  'america',
  ',',
  'and',
  'at',
  'least',
  'three',
  'more',
  'on',
  'their',
  'way',
  ',',
  'jackie',
  'chan',
  'is',
  'one',
  'of',
  'the',
  'newest',
  '"',
  'hot',
  'properties',
  '"',
  'in',
  'action',
  'adventure',
  'stardom',
  ',',
  'and',
  'it',
  "'s",
  'just',
  'about',
  'time',
  '.for',
  'over',
  'twenty',
  '-',
  'five',
  'years',
  ',',
  'jackie',
  "'s",
  'been',
  'starring',
  'in',
  'martial',
  'arts',
  'and',
  'action',
  'movies',
  'in',
  'hong',
  'kong',
  ',',
  'thrilling',
  'audiences',
  'with',
  'both',
  'an',
  'incredible',
  'grasp',
  'of',
  'acrobatics',
  'and',
  'martial',
  'arts',
  'and',
  'a',
  ...
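The cell above shuffles `input_data.examples` in place with an unseeded `np.random.shuffle`, so the split changes on every run. As a possible alternative, the sketch below shuffles a copy with a fixed seed and then splits it 8:2. The `shuffle_split` helper and its `seed` parameter are illustrative names, not part of torchtext.

```python
import random

def shuffle_split(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of items with a fixed seed and split it by train_ratio."""
    shuffled = list(items)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, does not affect global state
    split_idx = int(len(shuffled) * train_ratio)
    return shuffled[:split_idx], shuffled[split_idx:]

train_part, test_part = shuffle_split(range(10))
print(len(train_part), len(test_part))
```

Passing `input_data.examples` instead of `range(10)` would give two lists that can be handed to `data.Dataset` exactly as in the cell above.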

In [8]:
# Build the vocabulary
### <your code> ###
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)

print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', 'is', 'in'] 
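The `itos` list above starts with the special tokens `<unk>` and `<pad>`, followed by tokens in descending frequency. A minimal sketch of that idea with `collections.Counter` is shown below. The `build_vocab` function is a stand-in written for this note, not torchtext's implementation, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Build index-to-string and string-to-index mappings from tokenized texts."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, breaking ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab([["the", "film", "is", "good"], ["the", "plot", "is", "thin"]])
unk = stoi["<unk>"]
print(itos[:4])
# Tokens that never appeared in the training texts fall back to the <unk> index.
print([stoi.get(tok, unk) for tok in ["the", "ending"]])
```

This is also why the vocabulary is built from `train_data` only: words that appear solely in the test set should map to `<unk>`, as they would for unseen text.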



In [9]:
# create iterator for training and testing data
##train_iter, test_iter = ### <your code> ###
train_iter, test_iter = data.Iterator.splits(datasets=(train_data, test_data),
                        batch_sizes=(3, 3),
                        repeat=False,  
                        sort_key = lambda ex: len(ex.text))


In [10]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[1.5200e+02, 6.4700e+02, 4.6800e+02],
        [4.4400e+02, 4.0000e+00, 7.2000e+01],
        [3.3340e+03, 1.6200e+02, 5.9150e+03],
        ...,
        [1.2900e+02, 1.0000e+00, 1.0000e+00],
        [1.1300e+02, 1.0000e+00, 1.0000e+00],
        [1.7000e+01, 1.0000e+00, 1.0000e+00]], dtype=torch.float64) torch.Size([1185, 3])
tensor([1, 1, 2]) torch.Size([3])
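In the batch above, the text tensor has shape `[1185, 3]`, that is, sequence length first and batch size second, and the trailing `1.0` values correspond to index 1, which is `<pad>` in the vocabulary. The sketch below reproduces that layout with plain lists; `pad_batch` and the toy index lists are hypothetical.

```python
def pad_batch(sequences, pad_idx=1):
    """Pad index lists to the longest length and return rows laid out as [max_len][batch]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # Transpose so each row is one time step across the batch (sequence-first layout).
    return [list(step) for step in zip(*padded)]

batch = pad_batch([[5, 6, 7, 8], [9, 4], [3, 2, 10]])
print(len(batch), len(batch[0]))
print(batch[-1])
```

Every sequence is padded to the longest review in its batch, so one long review sharing a batch with short ones produces many `<pad>` rows. Grouping examples of similar length into the same batch is a common way to reduce that padding.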
