### Assignment goal: become proficient at reading data with PyTorch Dataset and DataLoader

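This assignment uses the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset to practice custom data loading with PyTorch's Dataset and DataLoader.
The downloaded data is split into train and test. Because the goal of this assignment is only to read data, we practice on the train portion.
(Please download the data from the IMDB page first.)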

### Load packages

In [6]:
# Import torch and other required modules
import glob
import torch
import re
import nltk
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_svmlight_file
from nltk.corpus import stopwords

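nltk.download('stopwords')  # download the stopword list
nltk.download('punkt')  # download the corpus required by word_tokenize
# Note: this rebinds the name `stopwords` from the nltk module to the English stopword collection.
# A set is used so that the membership tests below are fast.
stopwords = set(stopwords.words('english'))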

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar.gz


### Explore and preprocess the data
The train data is split into pos (positive) and neg (negative) reviews. This polarity is the label.

In [7]:
# Read the vocabulary file, which lists every token that appears in the reviews
with open('aclImdb/imdb.vocab', 'r', encoding='utf-8') as f:
    vocab = f.read().split('\n')

# Remove stopwords using the nltk list: filler words carry little useful signal and may hurt model training
print(f"vocab length before removing stopwords: {len(vocab)}")
vocab = [word for word in vocab if word not in stopwords]
print(f"vocab length after removing stopwords: {len(vocab)}")

# Convert the vocabulary into a dictionary that maps each word to its index
vocab_dic = {}
idx = 0
for word in vocab:
    if word not in vocab_dic:
        vocab_dic[word] = idx
        idx += 1
vocab_dic

vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356


{'movie': 0,
 'film': 1,
 'one': 2,
 '!': 3,
 'like': 4,
 '?': 5,
 'good': 6,
 'even': 7,
 'time': 8,
 'would': 9,
 'story': 10,
 'really': 11,
 'see': 12,
 'well': 13,
 'much': 14,
 'get': 15,
 'people': 16,
 'bad': 17,
 'also': 18,
 'great': 19,
 'first': 20,
 'made': 21,
 'way': 22,
 'make': 23,
 'could': 24,
 'movies': 25,
 'think': 26,
 'characters': 27,
 'character': 28,
 'watch': 29,
 'films': 30,
 'two': 31,
 'many': 32,
 'seen': 33,
 'acting': 34,
 'never': 35,
 'plot': 36,
 'little': 37,
 'love': 38,
 'best': 39,
 'life': 40,
 'show': 41,
 'know': 42,
 'ever': 43,
 'better': 44,
 'man': 45,
 'still': 46,
 'end': 47,
 'say': 48,
 'scene': 49,
 'scenes': 50,
 'go': 51,
 'something': 52,
 'back': 53,
 "i'm": 54,
 'watching': 55,
 'real': 56,
 'though': 57,
 'thing': 58,
 'years': 59,
 'actors': 60,
 'director': 61,
 'another': 62,
 'nothing': 63,
 'new': 64,
 'funny': 65,
 'actually': 66,
 'work': 67,
 'makes': 68,
 'find': 69,
 'look': 70,
 'old': 71,
 'going': 72,
 'lot': 73,


In [8]:
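# Pack the data into (x, y) pairs, where x is the file path of a review and y is 1 (positive) or 0 (negative)
# x is stored as a file path so that you can practice not reading all the data in at once. If your machine
# has enough memory (the data files are not very large), reading everything up front reduces I/O time during training and speeds it up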
review_pairs = []
for tag in ['pos', 'neg']:
    label = int(tag=='pos')
    for file in glob.glob(f'./aclImdb/train/{tag}/*.txt'):
        review_pairs.append((file, label))
        
print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('./aclImdb/train/pos/6836_9.txt', 1), ('./aclImdb/train/pos/10091_7.txt', 1)]
Total reviews: 25000
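
The comment in the cell above mentions the alternative of reading everything into memory up front. A minimal sketch of that idea follows, assuming the files fit in memory. The class name `InMemoryReviews` and the two throwaway files are hypothetical and only serve the illustration. The rest of this assignment keeps the lazy, path-based approach.

```python
import os
import tempfile
from torch.utils.data import Dataset


class InMemoryReviews(Dataset):
    """Hypothetical eager variant: reads every file once in __init__."""

    def __init__(self, pairs):
        self.samples = []
        for path, label in pairs:
            with open(path, "r", encoding="utf-8") as f:
                self.samples.append((f.read(), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # No disk access here: the text is already in memory
        return self.samples[idx]


# Two throwaway files stand in for real review files
tmp_dir = tempfile.mkdtemp()
toy_pairs = []
for name, text, label in [("a.txt", "great film", 1), ("b.txt", "bad plot", 0)]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    toy_pairs.append((path, label))

eager_dst = InMemoryReviews(toy_pairs)
print(len(eager_dst), eager_dst[0])  # 2 ('great film', 1)
```

The trade-off is a slower start and higher memory use in exchange for no file I/O inside `__getitem__`.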


### Build a Dataset and DataLoader to read the data
We need two helper functions here: one that reads and cleans a review (load_review), and one that builds the bag-of-words (BoW) vector (generate_bow).
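
To make the BoW idea concrete before the helpers are defined, here is a small sketch on a made-up three-word vocabulary. The names `toy_vocab_dic`, `tokens` and `bow` are illustrative only. Each position of the vector counts how often the corresponding vocabulary word occurs in the review.

```python
import numpy as np

# Hypothetical three-word vocabulary: word -> position in the vector
toy_vocab_dic = {"movie": 0, "good": 1, "bad": 2}
tokens = ["good", "movie", "good", "unknownword"]

bow = np.zeros(len(toy_vocab_dic))
for token in tokens:
    if token in toy_vocab_dic:  # words outside the vocabulary are skipped
        bow[toy_vocab_dic[token]] += 1

print(bow)  # [1. 2. 0.]
```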

In [9]:
def load_review(review_path):
    
    with open(review_path, 'r', encoding='utf-8') as f:
        review = f.read()
        
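    # Replace non-word characters with spaces, tokenize, and remove stopwords
    # (a raw string avoids the invalid-escape warning for '\W')
    review = re.sub(r'\W', ' ', review)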
    review = nltk.word_tokenize(review)
    review = [word for word in review if word not in stopwords]
    return review
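
As a quick illustration of what the `\W` substitution does, the sketch below runs it on a made-up sentence. Here `str.split` is only a dependency-free stand-in for `nltk.word_tokenize`, so the tokens it produces may differ from nltk's on other inputs.

```python
import re

sample = "It's great!<br />"

# Every character that is not a letter, digit or underscore becomes a space
cleaned = re.sub(r'\W', ' ', sample)

# Whitespace split as a simple stand-in for nltk.word_tokenize
tokens = cleaned.split()

print(tokens)  # ['It', 's', 'great', 'br']
```

Two things are visible here: HTML remnants such as `<br />` leave a `br` token behind, and the text is not lowercased, so a capitalized word like `It` will not match a lowercase stopword or vocabulary entry.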

In [10]:
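def generate_bow(review, vocab_dic):
    bag_vector = np.zeros(len(vocab_dic))
    for word in review:
        # Compare against None: index 0 is a valid position but is falsy,
        # so a plain truthiness check would never count the first vocabulary word
        word_idx = vocab_dic.get(word)
        if word_idx is not None:
            bag_vector[word_idx] += 1

    return bag_vector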

In [11]:
class dataset(Dataset):
    '''Custom dataset that loads reviews and their labels.
    Parameters
    ----------
    data_dirs: list
        list of (review file path, label) pairs
    vocab: dict
        mapping from each word to its index in the BoW vector
    '''
    def __init__(self, data_dirs, vocab):
        self.data_dirs = data_dirs
        self.vocab = vocab

    def __len__(self):
        return len(self.data_dirs)

    def __getitem__(self, idx):
        review_path, tag = self.data_dirs[idx]
        review = load_review(review_path)
        bag_vector = generate_bow(review, self.vocab)
        return (bag_vector, tag)

In [12]:
# Build the custom dataset
custom_dst = dataset(review_pairs, vocab_dic)
custom_dst[10]

(array([0., 1., 1., ..., 0., 0., 0.]), 1)

In [13]:
# Build the dataloader
custom_dataloader = DataLoader(custom_dst, batch_size=4, shuffle=True)
next(iter(custom_dataloader))

[tensor([[0., 1., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 5., 1.,  ..., 0., 0., 0.],
         [0., 0., 2.,  ..., 0., 0., 0.]], dtype=torch.float64),
 tensor([0, 1, 0, 0])]
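
The batch above has dtype=torch.float64 because `np.zeros` creates float64 arrays and the default collate function keeps that dtype. Most PyTorch layers use float32 by default, so one option is to cast inside `__getitem__`. The following is a sketch on a hypothetical toy dataset (`ToyBowDataset`, with all-zero vectors) rather than the IMDB data, and it also shows iterating over every batch.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyBowDataset(Dataset):
    """Hypothetical stand-in that returns float32 BoW vectors."""

    def __init__(self, vectors, labels):
        self.vectors = vectors
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Cast from NumPy's default float64 to float32 before collation
        return self.vectors[idx].astype(np.float32), self.labels[idx]


vectors = [np.zeros(5) for _ in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
toy_loader = DataLoader(ToyBowDataset(vectors, labels), batch_size=4, shuffle=True)

# 8 samples with batch_size=4 give two batches
for batch_x, batch_y in toy_loader:
    print(batch_x.shape, batch_x.dtype, batch_y.shape)
```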