### Assignment goal: practice data loading with a custom collate_fn and sampler

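This assignment uses the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset to practice customized data loading with PyTorch's Dataset and DataLoader.
The downloaded data is split into train and test. Because the goal of this assignment is only to read data, we work with a single split, the test split.
(Please download the data from IMDB first.)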

### Load packages

In [1]:
# Import torch and other required modules
import glob
import torch
import re
import nltk
import numpy as np
from torch.utils.data import Dataset, DataLoader, RandomSampler
from sklearn.datasets import load_svmlight_file
from nltk.corpus import stopwords
import os

nltk.download('stopwords')  # download the stopword list
nltk.download('punkt')  # download the tokenizer data that word_tokenize needs

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf /content/aclImdb_v1.tar.gz

### Data exploration and preprocessing
This assignment uses the pos and neg reviews from the test split.


In [3]:
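# Read the vocabulary file; it lists every token that appears in the reviews
###<your code>###
vocab = []
vocab_dic = {}
word_id = 0
with open('./aclImdb/imdb.vocab','r',encoding='utf-8') as f:
  for word in f:
    vocab.append(word.strip())
# set() removes duplicates but does not keep the file order, so ids can differ between runs
vocab = list(set(vocab))
print(f"vocab length before removing stopwords: {len(vocab)}")
# Remove stopwords with nltk; they carry little useful information and may hurt model training
###<your code>###
stop_words = set(stopwords.words('english'))
vocab = [word for word in vocab if word not in stop_words]
print(f"vocab length after removing stopwords: {len(vocab)}")

# Convert the vocabulary into a word -> id dictionary
# Ids start at 0, the same value collate_fn uses for padding; start at 1 if padding must stay distinguishable
### <your code>###
for word in vocab:
  if word not in vocab_dic:
    vocab_dic[word]=word_id
    word_id+=1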

vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356
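
The padding step later in this assignment fills with 0, while the mapping above also gives id 0 to a real word. A minimal sketch of one way to keep the two apart is shown here. It assumes a hypothetical helper `build_vocab_dic` and a toy word list, neither of which is in the original notebook. It sorts the words for a reproducible order and starts the ids at 1:

```python
def build_vocab_dic(words, stop_words, pad_id=0):
    """Map each non-stopword to an integer id, keeping pad_id unused.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # sorted(set(...)) removes duplicates and fixes the order, so ids are reproducible
    kept = sorted(set(words) - set(stop_words))
    # enumerate from pad_id + 1 so that no real word receives the padding id
    return {word: i for i, word in enumerate(kept, start=pad_id + 1)}


toy_words = ["movie", "the", "great", "movie", "plot"]
toy_stop_words = {"the"}
toy_vocab_dic = build_vocab_dic(toy_words, toy_stop_words)
print(toy_vocab_dic)  # {'great': 1, 'movie': 2, 'plot': 3}
```

Whether a reserved padding id matters depends on the downstream model; with an embedding layer it usually does, because the padded positions would otherwise be read as a real word.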


In [4]:
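# Pack the data into (x, y) pairs, where x is the path of a review file and y is 1 (positive) or 0 (negative).
# x is kept as a file path so that you practice not loading all the data at once. If memory allows
# (the files are small), reading everything up front reduces I/O time during training and speeds it up.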

###<your code>###
review_pairs = []
folder_list = {'pos':1,'neg':0}
path = './aclImdb/test'
for folder in folder_list:
  glob_pattern = os.path.join(path, folder, '*.txt')
  for file_path in glob.glob(glob_pattern):
    review_pairs.append((file_path,folder_list[folder]))

print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('./aclImdb/test/pos/8274_10.txt', 1), ('./aclImdb/test/pos/10571_10.txt', 1)]
Total reviews: 25000
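
The comments in the cell above contrast keeping file paths (lazy loading) with reading everything up front. As a rough sketch of the second option, assuming the corpus fits in memory, a hypothetical `preload_reviews` helper could read each file once and keep the text next to its label. The temporary files below only stand in for the real review files.

```python
import os
import tempfile


def preload_reviews(pairs):
    """Read every file in (path, label) pairs once and return (text, label) pairs.

    Hypothetical helper; it trades memory for less disk I/O during training.
    """
    loaded = []
    for file_path, label in pairs:
        with open(file_path, "r", encoding="utf-8") as f:
            loaded.append((f.read(), label))
    return loaded


# Toy stand-in for the review folder: two small text files in a temp directory
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_pairs = []
    for name, text, label in [("a.txt", "great movie", 1), ("b.txt", "dull plot", 0)]:
        file_path = os.path.join(tmp_dir, name)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text)
        toy_pairs.append((file_path, label))
    toy_loaded = preload_reviews(toy_pairs)

print(toy_loaded)  # [('great movie', 1), ('dull plot', 0)]
```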


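### Build the Dataset, DataLoader, Sampler, and collate_fn to read the data
We need two helper functions here: one that reads and cleans a review (load_review) and one that turns it into a vector
(generate_vec). Note that the vector is produced by simply tokenizing the text and mapping each token to its id. BoW is deliberately not used, so that the resulting sequences have different lengths.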

In [5]:
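def load_review(review_path):
  # Read one review file and return its cleaned tokens
  ###<your code>###
  with open(review_path,'r',encoding='utf-8') as f:
    review = f.read()
  review = nltk.word_tokenize(review)
  # Drop stopwords and tokens of length <= 2, then strip every character except letters and '-'
  review = [re.sub(r'[^A-Za-z\-]+','',word) for word in review if word not in stop_words and len(word)>2]
  return review

def generate_vec(review, vocab_dic):
  # Map each token to its id; tokens that are not in vocab_dic are skipped
  ### <your code> ###
  review_vector = [vocab_dic[word] for word in review if word in vocab_dic]
  return review_vector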
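
To make the behaviour of generate_vec concrete, here is a small sketch on a toy vocabulary (the words and ids are made up for illustration). It repeats the same lookup as the cell above. Tokens without an id are dropped silently, and the lookup is case-sensitive, so the output can be shorter than the input.

```python
def generate_vec(review, vocab_dic):
    # Same lookup as in the cell above: keep only tokens that have an id
    return [vocab_dic[word] for word in review if word in vocab_dic]


toy_vocab_dic = {"great": 0, "movie": 1, "plot": 2}  # made-up ids
tokens = ["Great", "movie", "weak", "plot", "movie"]

vec = generate_vec(tokens, toy_vocab_dic)
print(vec)  # [1, 2, 1]  ("Great" and "weak" have no id and are dropped)
```

If the vocabulary file is all lowercase, which is an assumption worth checking against imdb.vocab, lowercasing the tokens before the lookup would keep capitalized words from being discarded.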

In [6]:
# Build the custom dataset

class dataset(Dataset):
  '''custom dataset to load reviews and labels
  Parameters
  ----------
  data_pairs: list
      list of (review file path, label) tuples
  vocab: dict
      mapping from word to word id
  '''
  ### <your code> ###
  def __init__(self,data_pairs,vocab):
    self.data_pairs = data_pairs
    self.vocab = vocab
  
  def __len__(self):
    return len(self.data_pairs)

  def __getitem__(self,idx):
    path,label = self.data_pairs[idx]
    review = load_review(path)
    review_vector = generate_vec(review, self.vocab)
    return torch.tensor(review_vector), torch.tensor(label)
    
# Build the custom collate_fn: pad reviews of different lengths with 0 so they all share one length
def collate_fn(batch):
  ### <your code> ###
  corpus, labels = zip(*batch) 
    
  ### create pads for corpus ###
  lengths = [len(review) for review in corpus]
  max_length = max(lengths)

  batch_corpus = []
  
  for i in range(len(corpus)):
    # pad with zeros; torch.zeros defaults to float32, so the ids become floats
    tmp_pads = torch.zeros(max_length)
    tmp_pads[:lengths[i]] = corpus[i]
    batch_corpus.append(tmp_pads.view(1,-1))

  return torch.cat(batch_corpus,dim=0), torch.tensor(labels) , torch.tensor(lengths)
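
The loop above pads into a buffer created by torch.zeros, which defaults to float32, so the word ids come back as floats. As an alternative sketch (not the notebook's own solution), torch.nn.utils.rnn.pad_sequence pads a list of tensors while keeping their integer dtype. `ToyDataset` and `pad_collate` are hypothetical names, the id sequences are made up, and the seeded generator passed to RandomSampler is only there to make the toy run repeatable.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in for the review dataset: variable-length id sequences."""

    def __init__(self):
        self.samples = [([5, 3], 1), ([7], 0), ([2, 9, 4], 1), ([8, 6, 1, 3], 0)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ids, label = self.samples[idx]
        return torch.tensor(ids, dtype=torch.long), torch.tensor(label)


def pad_collate(batch):
    corpus, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in corpus])
    # pad_sequence keeps the input dtype (long here) and fills with padding_value
    padded = pad_sequence(list(corpus), batch_first=True, padding_value=0)
    return padded, torch.stack(labels), lengths


toy_dataset = ToyDataset()
# A seeded generator makes the shuffled order repeatable
sampler = RandomSampler(toy_dataset, generator=torch.Generator().manual_seed(0))
loader = DataLoader(toy_dataset, batch_size=2, sampler=sampler, collate_fn=pad_collate)

padded, labels, lengths = next(iter(loader))
print(padded.dtype, padded.shape, lengths)
```

Returning the lengths alongside the padded batch, as the original collate_fn also does, lets a later model mask or pack the padded positions.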



In [7]:
# Use PyTorch's RandomSampler to draw indices and build the DataLoader
### <your code> ###
custom_dataset = dataset(review_pairs,vocab_dic)
custom_dataloader = DataLoader(custom_dataset, 
                batch_size=4, 
                sampler=RandomSampler(custom_dataset), 
                collate_fn=collate_fn)
next(iter(custom_dataloader))

(tensor([[77078., 38213., 19969., 38835., 30956.,   813.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
              0.],
         [38213., 25526., 42197., 40206., 16532., 89194., 61853., 81324., 38213.,
           1097., 38213., 59376., 83610.,   407., 55985., 57415., 38611., 61853.,
          72759.,  6919., 81324., 60020., 67104., 85964., 43822., 87576., 74176.,
          54098., 19969., 85396.,  6635., 43822., 47502., 79574., 38199., 40975.,
          24020., 79574., 16822., 21779., 38213., 43822., 74967.,  2790.,  5084