### Assignment goal: get comfortable reading data with the PyTorch Dataset and DataLoader

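This assignment uses the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset to practice customized data loading with
PyTorch's Dataset and DataLoader.
The downloaded data is split into train and test. Because the goal of this assignment is reading data, we only use the train portion for practice.
(Please download the data from the IMDB page first.)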

### Load packages

In [11]:
# Import torch and other required modules
import glob
import torch
import re
import os
import nltk
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import load_svmlight_file
from nltk.corpus import stopwords

nltk.download('stopwords') # download the stopword lists
nltk.download('punkt') # download the tokenizer data that word_tokenize needs

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xvf /content/aclImdb_v1.tar.gz

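### Explore and preprocess the data
The train data is split into pos (positive) and neg (negative) folders, holding positive and negative reviews respectively. The folder a review sits in is its label.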

In [13]:
# Read the vocabulary file, which lists every token that appears in the reviews
###<your code>###
vocab = []
vocab_dic = {}
word_id = 0  # named word_id to avoid shadowing the built-in id()
with open('./aclImdb/imdb.vocab','r',encoding='utf-8') as f:
  for word in f:
    vocab.append(word.strip())
# Note: set() does not preserve order, so the ids assigned below can differ between runs
vocab = list(set(vocab))
print(f"vocab length before removing stopwords: {len(vocab)}")
# Remove stopwords with the NLTK list; they carry little useful signal and can hurt model training
###<your code>###
stop_words = set(stopwords.words('english'))
vocab = [word for word in vocab if word not in stop_words]
print(f"vocab length after removing stopwords: {len(vocab)}")

# Convert the vocabulary into a word -> id dictionary
### <your code>###
for word in vocab:
  if word not in vocab_dic:
    vocab_dic[word]=word_id
    word_id+=1
vocab_dic

vocab length before removing stopwords: 89527
vocab length after removing stopwords: 89356


{'ukulele': 0,
 'squirrelly': 1,
 'saturated': 2,
 'sensationalised': 3,
 'out-surf': 4,
 'interacted': 5,
 'thuggish': 6,
 'svengali': 7,
 'microphone': 8,
 'promotional': 9,
 'bakula': 10,
 'foresee': 11,
 'cupboard': 12,
 'leased': 13,
 'exemplar': 14,
 'luckier': 15,
 'oven': 16,
 'andrews': 17,
 'yosemite': 18,
 'inflective': 19,
 'export': 20,
 'bare-bones': 21,
 'pokeball': 22,
 'spooking': 23,
 'bolivarian': 24,
 'too-sweet-to-be-believed': 25,
 'hamnet': 26,
 'pensacola-based': 27,
 'soule': 28,
 'jordanian': 29,
 'banton': 30,
 'chojnacki': 31,
 'twilight-zone': 32,
 'honolulu': 33,
 'medalist': 34,
 'ponto': 35,
 'soupcon': 36,
 'multizillion-dollar': 37,
 'infection': 38,
 'best-selling': 39,
 'anti-military': 40,
 'harp': 41,
 'sing-alongs': 42,
 'mcewan': 43,
 'terrorized': 44,
 'chix': 45,
 'outcroppings': 46,
 'invincibility': 47,
 'dietrich-type': 48,
 'lowensohn': 49,
 "matrix'-style": 50,
 'moll-with-a-heart-of-gold': 51,
 'quigon': 52,
 'hedy': 53,
 'nelsan': 54,
 ...
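Because the vocabulary is passed through `set`, the order of the words, and therefore the id each word receives, can differ between Python sessions (string hashing is randomized by default). If reproducible ids matter, for example when a trained model is reloaded later, one option is to deduplicate while preserving file order. The following is only a minimal sketch of that idea; the word list and the one-word stopword set are hypothetical stand-ins for `imdb.vocab` and the NLTK list.

```python
# Hypothetical stand-in for the lines of imdb.vocab (contains one duplicate).
words = ["the", "movie", "great", "the", "plot"]
stop_words = {"the"}  # toy stopword set, not the NLTK list

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_words = [w for w in dict.fromkeys(words) if w not in stop_words]

# enumerate assigns consecutive ids in that stable order.
vocab_dic = {word: idx for idx, word in enumerate(unique_words)}
print(vocab_dic)  # {'movie': 0, 'great': 1, 'plot': 2}
```

With this approach the same input file always yields the same word-to-id mapping.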

In [14]:
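# Pack the data into (x, y) pairs, where x is the file path of a review and y is 1 (positive) or 0 (negative)
# x is a file path so that you can practice not loading all the data at once. If your machine has enough memory
# (the files are not very large), reading everything up front cuts I/O time during training and speeds it up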

###<your code>###
review_pairs = []
folder_list = {'pos':1,'neg':0}
path = './aclImdb/train'
for folder in folder_list:
  glob_pattern = os.path.join(path, folder, '*.txt')
  for file_path in glob.glob(glob_pattern):
    review_pairs.append((file_path,folder_list[folder]))

print(review_pairs[:2])
print(f"Total reviews: {len(review_pairs)}")

[('./aclImdb/train/pos/8274_10.txt', 1), ('./aclImdb/train/pos/406_8.txt', 1)]
Total reviews: 25000
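The comment in the cell above mentions the alternative of reading everything into memory once. A minimal sketch of that option follows; `preload_reviews` is a hypothetical helper that is not part of the assignment, and two temporary files stand in for real reviews so that the example runs on its own.

```python
import os
import tempfile

def preload_reviews(pairs):
    """Read every review file once and return (text, label) pairs held in memory."""
    loaded = []
    for file_path, label in pairs:
        with open(file_path, "r", encoding="utf-8") as f:
            loaded.append((f.read(), label))
    return loaded

# Toy demonstration: two temporary files stand in for IMDB reviews.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_pairs = []
    for name, text, label in [("a.txt", "great movie", 1), ("b.txt", "dull plot", 0)]:
        file_path = os.path.join(tmp_dir, name)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text)
        toy_pairs.append((file_path, label))
    in_memory = preload_reviews(toy_pairs)

print(in_memory)  # [('great movie', 1), ('dull plot', 0)]
```

The trade-off is memory for speed: each file is read exactly once, but all review text stays resident for the lifetime of the list.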


### Build a Dataset and DataLoader to read the data
We need two helper functions here: one that reads and cleans a review (load_review), and one that turns a review into a bag-of-words (BoW) vector
(generate_bow).
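As a quick illustration of the bag-of-words idea: each review becomes a vector with one slot per vocabulary word, and each slot holds how many times that word occurs. The sketch below uses a hypothetical three-word vocabulary rather than the IMDB one.

```python
import numpy as np

# Toy vocabulary: word -> vector index (hypothetical, not the IMDB vocabulary).
toy_vocab = {"great": 0, "movie": 1, "plot": 2}
tokens = ["great", "movie", "great", "acting"]  # "acting" is out of vocabulary

bow = np.zeros(len(toy_vocab))
for token in tokens:
    # Test membership rather than truthiness, since index 0 is a valid id.
    if token in toy_vocab:
        bow[toy_vocab[token]] += 1

print(bow)  # [2. 1. 0.]
```

Out-of-vocabulary tokens are simply skipped, and word order is discarded.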

In [15]:
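def load_review(review_path):
  with open(review_path,'r',encoding='utf-8') as f:
    content = f.read()

  # Tokenize, strip non-alphabetic characters (hyphens are kept), and drop stopwords
  ###<your code>###
  # Lowercase first: the vocabulary and the NLTK stopword list are both lowercase
  review = nltk.word_tokenize(content.lower())
  review = [re.sub(r'[^a-z\-]+','',word) for word in review]
  # Filter after cleaning so tokens that became empty or too short are dropped
  review = [word for word in review if word not in stop_words and len(word)>2]

  return review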

In [16]:
def generate_bow(review, vocab_dic):
  bag_vector = np.zeros(len(vocab_dic))
  for word in review:
    # Check membership explicitly: vocab_dic.get(word) is falsy for the word with id 0
    if word in vocab_dic:
      bag_vector[vocab_dic[word]] += 1

  return bag_vector

In [17]:
class dataset(Dataset):
  '''Custom dataset that loads reviews and their labels.
  Parameters
  ----------
  data_pairs: list
      list of (review_path, label) tuples
  vocab: dict
      mapping from each vocabulary word to its word id
  '''
  def __init__(self, data_pairs, vocab):
    ###<your code>###
    self.data_pairs = data_pairs
    self.vocab = vocab
  def __len__(self):
    ###<your code>###
    return len(self.data_pairs)

  def __getitem__(self, idx):
    ###<your code>###
    review_path, label = self.data_pairs[idx]
    review = load_review(review_path)
    bow = generate_bow(review,self.vocab)
    return bow,label


In [18]:
# Build the custom dataset
###<your code>###
custom_dst = dataset(review_pairs,vocab_dic)
custom_dst[10]

(array([0., 0., 0., ..., 0., 0., 0.]), 1)

In [19]:
# Build the dataloader
###<your code>###
custom_dataloader = DataLoader(custom_dst,batch_size=4,shuffle=True)
next(iter(custom_dataloader))

[tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]], dtype=torch.float64),
 tensor([0, 1, 0, 1])]
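The batch above comes back as `torch.float64` because `np.zeros` produces 64-bit floats by default, while layers in `torch.nn` use `float32` parameters by default. One possible way to handle this, sketched below under that assumption, is to convert inside `__getitem__`. `ToyBowDataset` is a hypothetical stand-in that generates random count vectors so the example runs without the IMDB files.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyBowDataset(Dataset):
    """Hypothetical stand-in for the review dataset: random count vectors and labels."""

    def __init__(self, n_samples=8, vocab_size=5):
        rng = np.random.default_rng(0)
        self.features = rng.integers(0, 3, size=(n_samples, vocab_size))
        self.labels = rng.integers(0, 2, size=n_samples)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert explicitly so the default collate function yields float32 batches.
        bow = torch.tensor(self.features[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return bow, label

loader = DataLoader(ToyBowDataset(), batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, x.dtype, y.dtype)  # torch.Size([4, 5]) torch.float32 torch.int64
```

Converting per sample keeps the training loop free of dtype casts; casting the whole batch with `.float()` after loading would work as well.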