### Assignment goal: get comfortable reading data with a custom collate_fn and sampler

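This assignment uses the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset together with PyTorch's Dataset and DataLoader to build a customized data-loading pipeline.
The downloaded data is split into train and test. Because the goal of this assignment is to practice reading data, we use only the train portion.
(Please download the data from the IMDB page first.)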

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days-part2/D05_Pytorch_dataset_free_read')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days-part2/D05_Pytorch_dataset_free_read


### Load packages

In [3]:
# Import torch and other required modules
import nltk
nltk.download('stopwords')  # download the stopword lists
nltk.download('punkt')  # download the corpus that word_tokenize needs

import glob
import re

import numpy as np
import torch
from nltk.corpus import stopwords
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader, RandomSampler


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Explore and preprocess the data
In this assignment we use the pos and neg folders of the train data.


In [4]:
# Read the vocabulary file; it contains every word that appears in the reviews
###<your code>###
with open(os.path.join('../dataset/aclImdb', 'imdb.vocab'), encoding='utf-8') as f:
    vocab = [line.strip() for line in f]

# Remove stopwords with the nltk stopword list: they carry little useful
# information and can also hurt model training
print(f"vocab length before removing stopwords: {len(vocab)}")
### <your code> ###
en_stopwords = set(stopwords.words('english'))
vocab = [word for word in vocab if word not in en_stopwords]

print(f"vocab length after removing stopwords: {len(vocab)}")

# Convert the vocabulary list into a word -> index dictionary
### <your code> ###
vocab_dic = {word: idx for idx, word in enumerate(vocab)}

vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356
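
To make the two steps in this cell concrete, here is a minimal sketch on a toy vocabulary. The word list and the stopword set are made-up stand-ins (assumptions), not the contents of `imdb.vocab` or of the NLTK stopword list.

```python
# Toy stand-ins (hypothetical): not the real imdb.vocab or NLTK stopwords
toy_vocab = ["the", "movie", "and", "film", "is", "great"]
toy_stopwords = {"the", "and", "is"}

# Keep only words that are not stopwords, preserving the original order
filtered_vocab = [word for word in toy_vocab if word not in toy_stopwords]

# Map each remaining word to its position in the filtered list
toy_vocab_dic = {word: idx for idx, word in enumerate(filtered_vocab)}

print(filtered_vocab)  # ['movie', 'film', 'great']
print(toy_vocab_dic)   # {'movie': 0, 'film': 1, 'great': 2}
```

Note that the first remaining word gets index 0, which matters whenever a lookup result is used in a truth test.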


In [5]:
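# Pack the data into (x, y) pairs, where x is the file path of a review and
# y is 1 for a positive review or 0 for a negative one.
# x is a file path so that you can practice not loading all the data at once.
# If memory allows (the files are not very large in total), loading everything
# up front reduces I/O time during training and speeds training up.

### <your code> ###
# Example path: /aclImdb/train/pos/11562_9.txt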
review_pairs = []

review_pos = glob.glob("../dataset/aclImdb/train/pos/*.txt")
review_neg = glob.glob("../dataset/aclImdb/train/neg/*.txt")
review_all = review_pos + review_neg
y = [1]*len(review_pos) + [0]*len(review_neg)
review_pairs = list(zip(review_all, y))

print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('../dataset/aclImdb/train/pos/11562_9.txt', 1), ('../dataset/aclImdb/train/pos/11581_10.txt', 1)]
Total reviews: 25000
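
The trade-off mentioned in the comments above (store paths and read lazily, or read everything once) can be sketched as follows. The helper names `build_review_pairs` and `preload_reviews`, the file name `0_1.txt`, and the temporary directory are all hypothetical; the sketch creates its own tiny directory tree so it runs without the IMDB files.

```python
import glob
import os
import tempfile

def build_review_pairs(root):
    """Return (path, label) pairs: 1 for files under pos/, 0 for files under neg/."""
    pos = sorted(glob.glob(os.path.join(root, "pos", "*.txt")))
    neg = sorted(glob.glob(os.path.join(root, "neg", "*.txt")))
    return [(p, 1) for p in pos] + [(p, 0) for p in neg]

def preload_reviews(pairs):
    """Eager variant: read every file once and keep (text, label) in memory."""
    loaded = []
    for path, label in pairs:
        with open(path, encoding="utf-8") as f:
            loaded.append((f.read(), label))
    return loaded

# Build a throwaway directory tree so the sketch runs without the IMDB files
with tempfile.TemporaryDirectory() as root:
    for sub, text in [("pos", "loved it"), ("neg", "hated it")]:
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    pairs = build_review_pairs(root)
    loaded = preload_reviews(pairs)

print([label for _, label in pairs])  # [1, 0]
print(loaded)                         # [('loved it', 1), ('hated it', 0)]
```

The lazy version keeps only short path strings in memory and pays the file-read cost inside `__getitem__`; the eager version pays it once before training starts.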


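### Build the Dataset, DataLoader, Sampler, and collate_fn to read the data
We need two helper functions here: one that reads and cleans a review (load_review), and one that generates the word vector (generate_vec). Note that the intended way to produce the vector is to simply tokenize the text (so that the resulting sequences differ in length) rather than to use BoW.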

In [6]:
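def load_review(review_path):
    """Read one review file, replace non-word characters with spaces, and tokenize it."""
    ###<your code>###
    with open(review_path, encoding='utf-8') as f:
        review = f.read()
    review = re.sub(r'\W', ' ', review)
    review = nltk.word_tokenize(review)
    return review

def generate_vec(review, vocab_dic):
    """Count how often each vocabulary word occurs in the tokenized review."""
    ### <your code> ###
    # One slot per vocabulary word, so the result always has length len(vocab_dic)
    bag_vector = np.zeros(len(vocab_dic))
    for word in review:
        # Compare against None: a plain truth test would skip the word at index 0
        idx = vocab_dic.get(word)
        if idx is not None:
            bag_vector[idx] += 1

    return bag_vector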
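
For comparison, here is a minimal sketch of a variable-length encoding, where each review becomes the sequence of its token indices instead of a count vector. The function name `tokens_to_indices` and the toy vocabulary are hypothetical. Shifting indices by 1 (so that 0 stays free for padding), lowercasing, and using `str.split` in place of `nltk.word_tokenize` are assumptions of this sketch, not something the notebook does.

```python
import re

def tokens_to_indices(tokens, vocab_dic):
    """Map each in-vocabulary token to its index, keeping order and repeats.

    Indices are shifted by 1 so that 0 stays free for padding (an assumption
    of this sketch).
    """
    return [vocab_dic[tok] + 1 for tok in tokens if tok in vocab_dic]

toy_vocab_dic = {"movie": 0, "film": 1, "great": 2}  # hypothetical vocabulary

# Same cleaning idea as load_review, but with str.split as a simple tokenizer
short = re.sub(r"\W", " ", "Great movie!").lower().split()
longer = re.sub(r"\W", " ", "a great film, a great movie").lower().split()

print(tokens_to_indices(short, toy_vocab_dic))   # [3, 1]
print(tokens_to_indices(longer, toy_vocab_dic))  # [3, 2, 3, 1]
```

With this encoding the two reviews produce sequences of length 2 and 4, so a batch of them needs padding before it can be stacked into one tensor.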

In [7]:
# Build a custom dataset

class dataset(Dataset):
    '''custom dataset to load reviews and labels
    Parameters
    ----------
    data_dirs: list
        list of (review file path, label) pairs
    vocab: dict
        mapping from each vocabulary word to its index
    '''
    ### <your code> ###
    def __init__(self, data_dirs, vocab):
        ###<your code>###
        self.data_dirs = data_dirs
        self.vocab = vocab

    def __len__(self):
        ###<your code>###
        return len(self.data_dirs)

    def __getitem__(self, idx):
        ###<your code>###
        review_path, label = self.data_dirs[idx]
        review = load_review(review_path)
        bag_vector = generate_vec(review, self.vocab)

        return bag_vector, label
    

# Build a custom collate_fn that pads texts of different lengths with 0 to a common length
def collate_fn(batch):
    ### <your code> ###
    reviews, labels = zip(*batch)
    lengths = torch.tensor([len(review) for review in reviews], dtype=torch.long)
    labels = torch.tensor(labels, dtype=torch.long)
    reviews = pad_sequence([
        torch.as_tensor(review, dtype=torch.long) for review in reviews
    ], batch_first=True, padding_value=0)

    return reviews, labels, lengths
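
The padding step is easier to see on sequences that really do differ in length. The sketch below is a self-contained variant; the name `pad_collate` and the toy batch of index sequences are hypothetical and only illustrate what `pad_sequence` does with `batch_first=True`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    """Pad variable-length index sequences with 0 and keep the true lengths."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences], dtype=torch.long)
    padded = pad_sequence(
        [torch.tensor(seq, dtype=torch.long) for seq in sequences],
        batch_first=True,
        padding_value=0,
    )
    return padded, torch.tensor(labels, dtype=torch.long), lengths

# Hypothetical batch of (index sequence, label) items with different lengths
toy_batch = [([3, 1], 1), ([3, 2, 3, 1], 0), ([2], 1)]
padded, labels, lengths = pad_collate(toy_batch)

print(padded.tolist())   # [[3, 1, 0, 0], [3, 2, 3, 1], [2, 0, 0, 0]]
print(lengths.tolist())  # [2, 4, 1]
```

Returning `lengths` alongside the padded tensor lets later code tell real tokens from padding, for example when masking.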

In [8]:
# Use PyTorch's RandomSampler to draw the indices and build the DataLoader
### <your code> ###
custom_dst = dataset(review_pairs, vocab_dic)
custom_dataloader = DataLoader(custom_dst,
                               batch_size=4,
                               sampler=RandomSampler(custom_dst),
                               collate_fn=collate_fn)

next(iter(custom_dataloader))

(tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 8, 1,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 1,  ..., 0, 0, 0]]),
 tensor([1, 1, 0, 0]),
 tensor([89356, 89356, 89356, 89356]))
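
To see what the sampler contributes, the following sketch compares the index order produced by `SequentialSampler` and `RandomSampler` on a toy dataset. The ten-element list and the seed value are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Any object with __len__ and __getitem__ can stand in for a dataset here
data = list(range(10))

# SequentialSampler yields indices in order
sequential_order = list(SequentialSampler(data))

# RandomSampler yields each index once per epoch, in shuffled order;
# the seeded generator (seed chosen arbitrarily) makes the order repeatable
generator = torch.Generator().manual_seed(0)
random_order = list(RandomSampler(data, generator=generator))

print(sequential_order)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sorted(random_order))  # the same ten indices, each drawn exactly once

# The DataLoader groups the sampled indices into batches of 4; the last one is smaller
loader = DataLoader(data, batch_size=4, sampler=RandomSampler(data))
batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)           # [4, 4, 2]
```

Because the indices are drawn at random, the batch printed in the cell above will differ from run to run unless a seeded generator is supplied.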