# **AAI0026 Practice : SASRec**
## **Self-Attentive Sequential Recommendation**

<img width="500" alt="1" src="https://drive.google.com/uc?id=1En4xwSyUKfWVpRMfvJBgEXKHaIMLXvjt">
</p>

In this Colab notebook, we will build a sequential recommendation model called Self-Attentive Sequential Recommendation (SASRec) in PyTorch and apply it to the MovieLens dataset, a widely used benchmark for evaluating collaborative filtering algorithms.


Unlike Markov-chain-based (MC), CNN-based, and RNN-based models, SASRec relies on a Transformer-style self-attention encoder that builds a new representation of the sequence of items a user has interacted with. In the experiments reported for SASRec, it outperforms the MC-based models and the CNN/RNN-based approaches.

As shown in the figure above, (1) the item sequence first goes through (2) an Embedding Layer and then through a block consisting of (3) a Self-Attention Layer (S-A layer) and (4) a Point-Wise Feed-Forward Network (P-W FFN). In this course, we stack two such blocks, which are then followed by (5) a Prediction Layer.

We will go through these steps, from loading the input data to predicting a user's next item.

Have fun!

---

**Note**: Make sure to **sequentially run all the cells in each section**, so that the intermediate variables / packages carry over to the next cell.

# Device
You might need to use a GPU for this Colab.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Installation

In [1]:
import os
import sys
import copy
import torch
import time
import random
import numpy as np

from collections import defaultdict
from multiprocessing import Process, Queue


In [2]:
torch.__version__

'2.1.0+cu118'

# 1 Input Action Sequence
- Load MovieLens Dataset

MovieLens has several versions available. We will use the MovieLens 1M dataset, which has about 1 million ratings from about 6,000 users on about 4,000 movies. Each line of the original dataset file consists of a user ID, a movie ID, the user's rating of the movie, and a timestamp. However, given that SASRec is designed to learn from implicit feedback (the fact that a user interacted with a movie), we will only use the user IDs and movie IDs in our dataset.

## Train/Validation/Test Data Generation

<img width="700" alt="1" src="https://drive.google.com/uc?id=1G-S7FpUzlykFBNmVh-CtLUpmHh0-Y5XG">
</p>

We will begin by loading the dataset into a Python dictionary whose keys are user IDs and whose values are the lists of movie IDs each user interacted with. Then we will split the dataset into training/validation/test data. The second-last and the last movie IDs of each user are assigned to the validation and test data respectively, while all the remaining ones constitute the training data.
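
To make the split concrete, here is a minimal sketch on a toy interaction log. The helper name `leave_last_two_out` and the toy data are made up for illustration. The rule follows the description above, with the added assumption that users with fewer than three interactions keep everything in training.

```python
from collections import defaultdict

def leave_last_two_out(lines):
    """Split 'user item' lines into per-user train/valid/test lists (illustrative helper)."""
    history = defaultdict(list)
    for line in lines:
        user, item = map(int, line.split())
        history[user].append(item)  # file order is assumed to be chronological

    train, valid, test = {}, {}, {}
    for user, items in history.items():
        if len(items) < 3:
            # Too short to hold anything out: keep all interactions for training.
            train[user], valid[user], test[user] = list(items), [], []
        else:
            train[user] = items[:-2]
            valid[user] = [items[-2]]
            test[user] = [items[-1]]
    return train, valid, test

toy_log = ["1 10", "1 11", "1 12", "1 13", "2 20", "2 21"]
train, valid, test = leave_last_two_out(toy_log)
print(train)  # {1: [10, 11], 2: [20, 21]}
print(valid)  # {1: [12], 2: []}
print(test)   # {1: [13], 2: []}
```

User 1 has four interactions, so the last two are held out; user 2 has only two, so nothing is held out.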

In [4]:
def data_partition(data_path):
    # TODO: Implement this function that takes a text file,
    # convert the text file to the dictionary,
    # and then split the dictionary to the training, validation, and test data

    usernum = 0
    itemnum = 0
    User = defaultdict(list) # create a dictionary with default value as empty list
    user_train = {}
    user_valid = {}
    user_test = {}

    os.system(f'wget {data_path} -O data.txt')
    with open('data.txt', 'r') as f:
        for line in f:
            u, i = map(int, line.rstrip().split(' '))
            User[u].append(i)

            ############# Your code here ############
            ## Note:
            ## 1. Compute the number 'usernum' of users, and
            ##    the number 'itemnum' of items from the dictionary 'User'
            ## 2. Each line of 'data.txt' consists of a user ID and an item ID
            ## 3. A user ID starts from 1. User IDs are sorted in ascending order
            ## 4. Hint: use max() function
            usernum = max(u, usernum)
            itemnum = max(i, itemnum)

            #########################################

    # The second-last and the last movies are for validation and test data respectively
    for user in User:
        nfeedback = len(User[user])
        if nfeedback < 3:
            user_train[user] = User[user]
            user_valid[user] = []
            user_test[user] = []
        else:
            user_train[user] = User[user][:-2]
            user_valid[user] = []
            user_valid[user].append(User[user][-2])
            user_test[user] = []
            user_test[user].append(User[user][-1])
    return [user_train, user_valid, user_test, usernum, itemnum]

## Question 1: What is the number of Users and Items?

In [5]:
data_path = 'https://drive.google.com/uc?id=1sjeWz4pXkVGmy__Tr8zCtGV6rPGZfgLy'
user_train, user_valid, user_test, usernum, itemnum = data_partition(data_path)

print(f'number of users: {usernum}, number of items: {itemnum}')

number of users: 6040, number of items: 3416


## Question 2: What is the ID of the movie that User 26 has watched most recently?
### (use dictionary user_test)

In [6]:
i_th = 26
print('training data: ',user_train[i_th])
print('validation: ',user_valid[i_th])
print('test: ',user_test[i_th])

training data:  [1100, 1135, 280, 242, 143, 862, 984, 47, 188, 693, 1703, 904, 4, 101, 527, 202, 1491, 1388, 680, 1704, 145, 150, 153, 278, 675, 255, 158, 903, 29, 161, 879, 673, 396, 27, 247, 159, 62, 1067, 176, 1074, 1705, 848, 809, 558, 813, 212, 586, 216, 1706, 1222, 1688, 817, 1707, 186, 853, 208, 949, 1136, 252, 1408, 1570, 139, 951, 877, 115, 1148, 1342, 1580, 410, 243, 1708, 1709, 241, 1, 752, 810, 368, 39, 808, 172, 203, 870, 180, 173, 804, 200, 209, 224, 389, 205, 191, 213, 123, 742, 971, 112, 126, 82, 802, 1710, 829, 1711, 893, 9, 234, 420, 461, 601, 1712, 653, 15, 297, 16, 254, 13, 149, 237, 244, 362, 1620, 613, 1106, 864, 1713, 20, 250, 1714, 681, 14, 22, 1102, 1427, 32, 24, 1419, 340, 717, 697, 866, 1616, 1715, 1716, 157, 167, 256, 641, 122, 818, 283, 179, 1331, 3, 134, 1104, 257, 109, 42, 624, 792, 1717, 52, 392, 800, 409, 795, 956, 1718, 827, 1719, 1720, 103, 636, 801, 1721, 466, 538, 44, 505, 46, 650, 842, 784, 707, 111, 498, 45, 2, 51, 775, 390, 175, 64, 1211, 843, 806, 76, 1

In [7]:
print(f'The last movie that User 26 has watched: {user_test[26]}')

The last movie that User 26 has watched: [901]


## Sampler for Batch Generation


<img width="800" alt="1" src="https://drive.google.com/uc?id=1cbB_RfDtPN3kGOcO_FDEz3YbXt6y1a39">
</p>

As each user has interacted with a different number of movies, we transform every training sequence into a fixed-length sequence of the maximum length (maxlen=200). If a sequence is longer than 200, we retain only the most recent 200 movies. If it is shorter than 200, we repeatedly add a padding item (0) to the left until the length becomes 200. A constant zero vector is used as the embedding for the padding item.

- seq: Input NumPy ndarray (batch_size*maxlen) for the Embedding Layer.
- pos: Ground-truth NumPy ndarray (batch_size*maxlen). It is `seq` shifted one step to the left: the first movie is dropped and the last movie of the training sequence is appended, since our task is to predict the next item at every position.
- neg: Negative-item NumPy ndarray (batch_size*maxlen), randomly sampled from the movies that the user has not interacted with.

NumPy ndarrays seq, pos, and neg have the same shape, and pos and neg will be used when computing the loss at the Prediction Layer.
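
A small worked example may help here. The sketch below builds `seq`, `pos`, and `neg` for one toy user with `maxlen=5`. The function name `build_training_row` and the toy values are assumptions for illustration, and it uses Python's `random` module rather than the notebook's sampler.

```python
import random
import numpy as np

def build_training_row(history, itemnum, maxlen, rng):
    """Left-pad one user's history into fixed-length seq/pos/neg arrays."""
    seq = np.zeros(maxlen, dtype=np.int32)  # inputs, 0 = padding
    pos = np.zeros(maxlen, dtype=np.int32)  # next item at every position
    neg = np.zeros(maxlen, dtype=np.int32)  # one non-interacted item per position
    seen = set(history)

    nxt = history[-1]
    idx = maxlen - 1
    for item in reversed(history[:-1]):  # fill from the right, so padding ends up on the left
        seq[idx] = item
        pos[idx] = nxt
        candidate = rng.randint(1, itemnum)  # inclusive on both ends
        while candidate in seen:             # resample until the item is unseen
            candidate = rng.randint(1, itemnum)
        neg[idx] = candidate
        nxt = item
        idx -= 1
        if idx == -1:
            break  # history longer than maxlen: keep only the most recent items
    return seq, pos, neg

rng = random.Random(0)
seq, pos, neg = build_training_row([3, 7, 2, 9], itemnum=20, maxlen=5, rng=rng)
print(seq)  # [0 0 3 7 2]
print(pos)  # [0 0 7 2 9]
print(neg)  # padding positions stay 0; the rest are items the user has not seen
```

Note that `pos` at each position is the item that follows the corresponding entry of `seq`, and that padded positions stay 0 in all three arrays.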

In [8]:
#Negative Sampling: select items not in the set of positive items
def random_neq(l, r, s):
    ##TODO: Implement this function that takes two bounds 'l', 'r', and a set 's',
    ##and returns a random integer t in the half-open range [l, r)
    ##that is not in set 's', resampling until such a number is found

    t = np.random.randint(l, r)

    ############# Your code here ############
    ## Note:
    ## 1. to ensure that t is not in set 's', use while loop
    ## 2. repeat generating random numbers until t is not in set 's'
    ## 3. (~3 lines of code)
    while t in s:
        t = np.random.randint(l, r)
    return t

    #########################################

#generate seq, pos, neg numpy ndarrays.
#they are the fixed-length sequences ('maxlen')

def sample_function(user_train, usernum, itemnum, batch_size, maxlen, result_queue, SEED):
    def sample():
    ##TODO: Implement this function that takes arguments,
    ##and return numpy ndarrays user, seq, pos and neg

        user = np.random.randint(1, usernum + 1)
        while len(user_train[user]) <= 1: user = np.random.randint(1, usernum + 1)

        seq = np.zeros([maxlen], dtype=np.int32) #zeros for padding
        pos = np.zeros([maxlen], dtype=np.int32)
        neg = np.zeros([maxlen], dtype=np.int32)
        nxt = user_train[user][-1]
        idx = maxlen - 1
        # print(maxlen)
        ts = set(user_train[user]) #positive item set
        for i in reversed(user_train[user][:-1]): #reverse for padding from the left

            seq[idx] = i
            pos[idx] = nxt

            if nxt != 0:
            ############# Your code here ############
            ##Note:
            ##1. generate neg Numpy ndarray by using random_neq(l,r,s)
            ##2. fill in neg[idx] by using the for loop we are in
            ##3. remember item ID starts from 1
            ##4. (~1 line of code)
                neg[idx] = random_neq(1, itemnum+1, ts) # item 1 ~ 3416

            #########################################
            nxt = i
            idx -= 1
            if idx == -1: break

        return (user, seq, pos, neg)

    # Make sure the random numbers generated are the same every time for a given SEED
    np.random.seed(SEED)

    while True:
        one_batch = []
        for i in range(batch_size):
            one_batch.append(sample())
        result_queue.put(zip(*one_batch))

class WarpSampler(object):
    def __init__(self, User, usernum, itemnum, batch_size=64, maxlen=10, n_workers=1):
    ##TODO: Implement this function that takes arguments,
    ##and process sampling in parallel using multiple worker processes

        self.result_queue = Queue(maxsize=n_workers*10)
        self.processors = []

        for i in range(n_workers):

            ############# Your code here ###################################
            ##Note:
            ##1. append "Process(target, args)" object in self.processors
            ##2. set target to "sample_function"
            ##3. create a tuple of all arguments of sample_function,
            ##   with result_queue='self.result_queue' and SEED='np.random.randint(2e9)'.
            ##4. set args to this tuple
            self.processors.append(
                Process(target=sample_function, args=(User,usernum,itemnum,batch_size,maxlen,self.result_queue,np.random.randint(2e9)))
            )

            #################################################################

            self.processors[-1].daemon = True #doesn't prevent the program from exiting
            self.processors[-1].start() #start the worker process

    #get result_queue when it becomes available, without waiting for all processes to finish
    def next_batch(self):
        return self.result_queue.get()

    def close(self):
        for p in self.processors:
            p.terminate()
            p.join()

## Question 3: Does our sampler work well for User 26?

In [9]:
sampler = WarpSampler(user_train, usernum, itemnum, batch_size=64, maxlen=10, n_workers=3)

for _ in range(1):  # Check one batch
    u, seq, pos, neg = sampler.next_batch()

    ############# Your code here ###################################
    ##note:
    ##1.use any() function
    ##2.check if neg[26] items are in pos[26] (index 26 is a row of the batch, i.e. the sample for user u[26])

    same_item = any([i in pos[26] for i in neg[26]])

    #################################################################
    if same_item==False:
      print("Works Well!")

Works Well!


# 2 SASRec Model

Now we will implement our SASRec model!

Please see the following:

1.   Embedding Layer (Item Embedding, Position Embedding):
<img width="700" alt="1" src="https://drive.google.com/uc?id=16w1sfleDGnnKEQvL-wwPO4REwt3te1TA">
</p>

  We obtain an `item embedding` and a learnable `position embedding` from this layer. Every item ID in `seq` is mapped to its item embedding, which gives a Tensor (M) of `item embeddings` with shape batch_size(128) x max_len(200) x d(50). Since the self-attention model contains no recurrent or convolutional module, it cannot by itself tell the positions of previous items. Hence we inject a `position embedding` that represents the position of each value in `seq`, i.e., max_len (200) positions ranging over 0~199. The positions are encoded into a Tensor (P) with the same shape as M. The final `input embedding (E=M+P)` serves as Key and Value, and the layer normalization of the input embedding serves as Query.

2.   Self-Attention Layer:
<img width="700" alt="1" src="https://drive.google.com/uc?id=1UuTRCQuCQjf3p49wxPO4z3_95xGE0SqR">
</p>

     - Feed (Q,K,V) into the Attention layer and compute `attention scores(S)`
     - When taking the dot product of Query and Key (Q*K), `Attention Masking` is needed so that a position cannot attend to later items, which would be cheating
     - Residual Connection
     - Layer Normalization

3.   Pointwise Feed Forward Network:
<img width="700" alt="1" src="https://drive.google.com/uc?id=1bij9ohJtxj4UhpB9OcjRnK8rw9bQLqVz">
</p>

  The Feed Forward Network consists of two 1D convolution layers with kernel size 1, each of which acts as a linear transformation applied to every position independently. Before passing through a 1D convolution layer, the input data needs to be transposed so that the kernel moves across time steps. The two convolution layers are linked by a `ReLU activation`, and their output is added to the input data as a Residual Connection. After the Forward Layer, the same masking (timeline mask) that was applied to the input embeddings is applied again so that the padding positions stay zero.

4.   Prediction Layer:
<img width="700" alt="1" src="https://drive.google.com/uc?id=14ACxSsuyxeKdtA8QG2HN_XGzV0-p_xSc">
</p>

  Finally, the Prediction Layer generates `pos_logits` and `neg_logits`, i.e., the scores of the movies in the positive and negative sets respectively for each user.
  We compute the element-wise product between the positive item embeddings and the output features of the stacked blocks (`log_feats`, which the code comments call attention scores) and sum it over the embedding dimension, which amounts to a dot product, to generate `pos_logits`. `neg_logits` are generated the same way from the negative item embeddings.
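
The attention masking in step 2 can be illustrated with a hand-rolled single-head attention. This is only a sketch under simplifying assumptions (no learned projections, no dropout, toy sizes); the notebook itself relies on `torch.nn.MultiheadAttention`. The mask is built the same way as in the model code, with `True` marking positions that must not be attended to.

```python
import torch

torch.manual_seed(0)
seq_len, d = 4, 8
x = torch.randn(seq_len, d)  # toy input embeddings for one sequence

# True above the diagonal: position t may not look at positions > t.
causal_mask = ~torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = x @ x.T / d ** 0.5                              # raw query-key similarities
scores = scores.masked_fill(causal_mask, float("-inf"))  # block future positions
weights = torch.softmax(scores, dim=-1)                  # each row sums to 1 over allowed positions
output = weights @ x                                     # weighted sum of values

print(causal_mask.int())
print(weights)  # the upper triangle is exactly 0
```

Because the first position can only attend to itself, its output equals its own input in this simplified setting.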



## (1) Build Layers for SASRec
- Embedding Layer
- Self-Attention Layer
- Forward Layer
- Prediction Layer
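
As a rough illustration of the Embedding Layer described earlier, the sketch below combines an item embedding and a position embedding on a toy batch. The sizes and the variable names `M`, `P`, and `E` are chosen for illustration only; the class below is the actual assignment code.

```python
import torch

torch.manual_seed(0)
item_num, maxlen, hidden = 10, 5, 8

item_emb = torch.nn.Embedding(item_num + 1, hidden, padding_idx=0)  # row 0 is the padding item
pos_emb = torch.nn.Embedding(maxlen, hidden)                        # one learnable vector per position

seq = torch.tensor([[0, 0, 3, 7, 2],
                    [1, 4, 6, 8, 9]])                       # batch of 2 left-padded sequences
positions = torch.arange(maxlen).expand(seq.shape[0], -1)   # 0..maxlen-1 for every row

M = item_emb(seq) * hidden ** 0.5   # item embeddings, scaled by sqrt(d)
P = pos_emb(positions)              # position embeddings, same shape as M
E = M + P
E = E * (seq != 0).unsqueeze(-1)    # zero out padded positions again

print(E.shape)  # torch.Size([2, 5, 8])
```

The final masking step matters because adding `P` would otherwise make padded positions non-zero.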

In [15]:
class SASRec(torch.nn.Module):
    def __init__(self, user_num, item_num, args):
        super(SASRec, self).__init__()

        ##TODO: stack embedding layer, attention layer and forward layer,
        ##define the function that computes Attention Scores(log_feat)
        ##and get pos_logits, neg_logits from FFN and Prediction

        self.user_num = user_num
        self.item_num = item_num
        self.dev = args['device']

        self.item_emb = torch.nn.Embedding(self.item_num+1, args['hidden_units'], padding_idx=0)

        ##################Your code here ##################
        ## Note:
        ## 1. Define position embedding and dropout
        ## 2. See how item embedding is created above
        ## 3. use args['maxlen'], args['hidden_units'], args['dropout_rate']
        ## 4. position embedding doesn't need padding idx

        self.pos_emb = torch.nn.Embedding(args['maxlen'], args['hidden_units']) # learnable pos embedding
        self.emb_dropout = torch.nn.Dropout(args['dropout_rate'])

        ###################################################

        self.attention_layernorms = torch.nn.ModuleList() # to be Query for self-attention
        self.attention_layers = torch.nn.ModuleList() #multi-head for self-attention
        self.forward_layernorms = torch.nn.ModuleList()
        self.forward_layers = torch.nn.ModuleList()
        self.last_layernorm = torch.nn.LayerNorm(args['hidden_units'], eps=1e-8)

        for _ in range(args['num_blocks']): #stacks 2 blocks

            new_attn_layernorm = torch.nn.LayerNorm(args['hidden_units'], eps=1e-8)
            self.attention_layernorms.append(new_attn_layernorm)

            # MultiheadAttention args: embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=False, device=None, dtype=None
            new_attn_layer =  torch.nn.MultiheadAttention(args['hidden_units'],
                                                            args['num_heads'],
                                                            args['dropout_rate'])
            self.attention_layers.append(new_attn_layer)

            new_fwd_layernorm = torch.nn.LayerNorm(args['hidden_units'], eps=1e-8)
            self.forward_layernorms.append(new_fwd_layernorm)

            new_fwd_layer = PointWiseFeedForward(args['hidden_units'], args['dropout_rate'])
            self.forward_layers.append(new_fwd_layer)

        # Defined once, outside the block loop
        self.pos_sigmoid = torch.nn.Sigmoid()
        self.neg_sigmoid = torch.nn.Sigmoid()

    def log2feats(self, log_seqs):
        ## TODO: Implement this function that takes log_seqs,
        ## and make input tensor(item embedding + position embedding) for self-attention layer
        ## then return attention scores(log_feats) tensor from self-attention layer
        # print("[1]:",log_seqs.shape)
        # print(log_seqs)
        seqs = self.item_emb(torch.LongTensor(log_seqs).to(self.dev))
        # print("[2]:",seqs.shape)
        seqs *= self.item_emb.embedding_dim ** 0.5 #scaling to stabilize the training process

        positions = np.tile(np.array(range(log_seqs.shape[1])), [log_seqs.shape[0], 1])

        ##################Your code here ##################
        ## Note:
        ## 1. Sum position embedding to input embedding(seqs)
        ## 2. Proceed the dropout after
        ## 3. (~2 lines of code)
        seqs += self.pos_emb(torch.LongTensor(positions).to(self.dev))
        seqs = self.emb_dropout(seqs)

        ###################################################

        timeline_mask = torch.BoolTensor(log_seqs == 0).to(self.dev)
        seqs *= ~timeline_mask.unsqueeze(-1) # broadcast in last dim

        tl = seqs.shape[1] # time dim len for enforce causality
        attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev))
        #to mask upper triangular part

        for i in range(len(self.attention_layers)):
            seqs = torch.transpose(seqs, 0, 1)

           ##################Your code here ##################
           ## Note:
           ## 1. get Q by attention_layernorms[]()
           ## 2. get multihead attention outputs (mha_outputs): sum of the weighted V
           ##    by attention_layers[]() with using attention_mask
           ## 3. key, value = seqs

            Q = self.attention_layernorms[i](seqs)
            mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs, attn_mask = attention_mask)

            ###################################################
            seqs = Q + mha_outputs
            seqs = torch.transpose(seqs, 0, 1)

            seqs = self.forward_layernorms[i](seqs)
            seqs = self.forward_layers[i](seqs)
            seqs *=  ~timeline_mask.unsqueeze(-1)

        log_feats = self.last_layernorm(seqs)

        return log_feats #Attention Scores

    def forward(self, user_ids, log_seqs, pos_seqs, neg_seqs):
        log_feats = self.log2feats(log_seqs)

        pos_embs = self.item_emb(torch.LongTensor(pos_seqs).to(self.dev))
        neg_embs = self.item_emb(torch.LongTensor(neg_seqs).to(self.dev))

        ##################Your code here ##################
        ## Note: get pos_logits and neg_logits
        ## 1.compute Attention Scores * Value(pos_embs/neg_embs)
        ## Hint: use sum(dim=-1)
        ## (~2 lines of code)

        pos_logits = (log_feats * pos_embs).sum(dim=-1)
        neg_logits = (log_feats * neg_embs).sum(dim=-1)

        ###################################################

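        # Probabilities are computed for reference only; the raw logits are returned
        # because the training loop uses BCEWithLogitsLoss, which applies the sigmoid itself.
        pos_pred = self.pos_sigmoid(pos_logits)
        neg_pred = self.neg_sigmoid(neg_logits)

        return pos_logits, neg_logits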

    def predict(self, user_ids, log_seqs, item_indices):
        log_feats = self.log2feats(log_seqs)

        final_feat = log_feats[:, -1, :] # the final Attention Scores for the prediction

        item_embs = self.item_emb(torch.LongTensor(item_indices).to(self.dev))

        logits = item_embs.matmul(final_feat.unsqueeze(-1)).squeeze(-1) #dot product between item's embedding and final feature
                                                                        #squeeze to make dimensions match
        preds = self.pos_sigmoid(logits)

        return preds

## Question 4: What does the output (preds) of SASRec mean?

(1) list of next items to recommend for the users

(2) Probability of each item to recommend for the users

(3) Yes/No on each item to recommend for the users

(4) Tensor of next items to recommend for the users

Answer: 2 (the probability of each item to recommend to the user)

## (2) Pointwise Feed Forward Network
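
One way to see why a 1D convolution with `kernel_size=1` acts as a point-wise linear transformation is to compare it with a `torch.nn.Linear` layer that shares the same weights. The following is a sketch with toy sizes and is not part of the assignment.

```python
import torch

torch.manual_seed(0)
batch, maxlen, hidden = 2, 5, 8
x = torch.randn(batch, maxlen, hidden)

conv = torch.nn.Conv1d(hidden, hidden, kernel_size=1)
linear = torch.nn.Linear(hidden, hidden)
with torch.no_grad():
    # Conv1d weight has shape (out, in, 1); dropping the last dim gives a Linear weight.
    linear.weight.copy_(conv.weight.squeeze(-1))
    linear.bias.copy_(conv.bias)

# Conv1d expects (batch, channels, length), hence the transposes.
conv_out = conv(x.transpose(-1, -2)).transpose(-1, -2)
linear_out = linear(x)

print(conv_out.shape)                                   # torch.Size([2, 5, 8])
print(torch.allclose(conv_out, linear_out, atol=1e-5))  # True
```

Both layers apply the same matrix to every time step independently, which is what "point-wise" refers to.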


In [17]:
class PointWiseFeedForward(torch.nn.Module):
    def __init__(self, hidden_units, dropout_rate):

        super(PointWiseFeedForward, self).__init__()

        ##################Your code here ##################
        ## Note:
        ## 1. stack 2 Convolution Layers
        ## 2. after each convolution layer, proceed dropout
        ## 3. add relu after the first dropout and before the second convolution
        ## Hint: use torch.nn.Conv1d with kernel_size=1

        self.conv1 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.dropout1 = torch.nn.Dropout(p=dropout_rate)
        self.relu = torch.nn.ReLU()
        self.conv2 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.dropout2 = torch.nn.Dropout(p=dropout_rate)

        ###################################################

    def forward(self, inputs):
        outputs = self.dropout2(self.conv2(self.relu(self.dropout1(self.conv1(inputs.transpose(-1, -2)))))) #transpose so kernel can pass by time steps
        outputs = outputs.transpose(-1, -2) # return it back
        outputs += inputs #Residual Connection

        # print("final output shape : ",outputs.shape)
        return outputs


## Question 5: The output of the two convolution layers has shape batch_size(128) x d(50) x max_len(200). What will be the shape of the final output of the forward function after the transpose and the residual connection?

Shape of final outputs: [128, 200, 50] # This differed from the parameter in args below, so max len was unified to 200.

# 3 Evaluate on Test/Validation data
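
The ranking metrics used below can be illustrated on a single user. In this sketch the score array is made up: index 0 plays the role of the ground-truth item and the remaining entries stand in for sampled negatives (three here instead of 100). The helper name `hit_and_ndcg_at_k` is hypothetical.

```python
import numpy as np

def hit_and_ndcg_at_k(scores, k=10):
    """Return (hit, ndcg) for one user; scores[0] belongs to the ground-truth item."""
    # argsort of the negated scores orders items from best to worst;
    # a second argsort turns that ordering into each item's 0-based rank.
    rank = int((-np.asarray(scores)).argsort().argsort()[0])
    if rank < k:
        return 1.0, float(1.0 / np.log2(rank + 2))
    return 0.0, 0.0

scores = [0.7, 0.9, 0.2, 0.4]  # the ground truth is second best, so its rank is 1
hit, ndcg = hit_and_ndcg_at_k(scores, k=10)
print(hit, round(ndcg, 4))  # 1.0 0.6309
```

With a single relevant item per user, NDCG@k reduces to `1 / log2(rank + 2)` when the item lands in the top k and 0 otherwise.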


In [12]:
#randomly sample 100 negative items and rank these items with the ground truth item. Based on the rankings we can evaluate Hit@10 and NDCG@10
def evaluate(model, dataset, selT, args):
    [train, valid, test, usernum, itemnum] = copy.deepcopy(dataset)

    NDCG = 0.0 #Normalized Discounted Cumulative Gain: evaluates the ranked quality of recommendations
               #by considering both position and relevance of the ground truth item in the recommendation list
    HR = 0.0 #Hit Rate@k: measures the fraction of times the ground truth next item is among the top-k recommendations
    valid_user = 0.0

    if selT==True:
      VT=test
    else:
      VT=valid

    users = range(1,usernum+1)
    for u in users:
        if len(train[u]) < 1 or len(VT[u]) < 1: continue

        seq = np.zeros([args['maxlen']], dtype=np.int32)
        idx = args['maxlen'] - 1
        if selT == True:
          seq[idx] = valid[u][0]
          idx -= 1

        for i in reversed(train[u]):
            seq[idx] = i
            idx -= 1
            if idx == -1: break

        rated = set(train[u])
        rated.add(0)
        item_idx = [VT[u][0]]

        for _ in range(100):
            t = np.random.randint(1, itemnum + 1)
            while t in rated: t = np.random.randint(1, itemnum + 1)
            item_idx.append(t)

        predictions = -model.predict(*[np.array(l) for l in [[u], [seq], item_idx]])
        predictions = predictions[0]

        ##################Your code here ##################
        ## Note:
        ## 1. create a rank by sorting 'predictions'
        ## Hint: use argsort()

        rank = predictions.argsort().argsort()[0].item()

        ###################################################

        valid_user += 1

        if rank < 10:
            NDCG += 1 / np.log2(rank + 2)
            HR += 1
        if valid_user % 100 == 0:
            print('.', end="")
            sys.stdout.flush()

    return NDCG / valid_user, HR / valid_user

## Prediction

Now we will train and evaluate our SASRec model!

*Note: evaluation will take quite a while without a GPU (~ 15 minutes)*
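
Before running the training cell, it may help to see how the loss treats padded positions. The sketch below uses made-up logits for one sequence of length 4 and follows the masking idea of the training loop, where only positions with `pos != 0` contribute. The tensor values are purely illustrative.

```python
import numpy as np
import torch

pos = np.array([[0, 0, 7, 2]])                       # ground-truth items; 0 marks padding
pos_logits = torch.tensor([[5.0, -5.0, 2.0, 0.5]])   # scores for the true next items
neg_logits = torch.tensor([[5.0, -5.0, -1.0, 0.0]])  # scores for the sampled negatives

bce = torch.nn.BCEWithLogitsLoss()
indices = np.where(pos != 0)                         # keep only real (non-padded) positions

loss = bce(pos_logits[indices], torch.ones(2))          # positives should score high
loss = loss + bce(neg_logits[indices], torch.zeros(2))  # negatives should score low
print(loss.item())
```

The logits at the two padded positions never enter the loss, so their values have no effect on the gradients.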

In [13]:
# Please do not change the args
args={
    'datapath':'https://drive.google.com/uc?id=1sjeWz4pXkVGmy__Tr8zCtGV6rPGZfgLy',
    'batch_size': 128,
    'lr': 0.001,
    'maxlen': 200, # changed to 200 to match the question
    'hidden_units': 50,
    'num_blocks': 2,
    'num_epochs': 101,
    'num_heads': 1,
    'dropout_rate': 0.5,
    'l2_emb': 0.0,
    'device': 'cpu',
}

In [18]:
if __name__ == '__main__':
    # global dataset
    dataset = data_partition(args['datapath'])

    [user_train, user_valid, user_test, usernum, itemnum] = dataset
    # print(usernum, itemnum)
    num_batch = len(user_train) // args['batch_size']
    # print(num_batch)

    sampler = WarpSampler(user_train, usernum, itemnum, batch_size=args['batch_size'], maxlen=args['maxlen'], n_workers=3)
    model = SASRec(usernum, itemnum, args).to(args['device'])

    for name, param in model.named_parameters():
        try: torch.nn.init.xavier_normal_(param.data)
        except ValueError: pass  # 1-D parameters (biases, LayerNorm) cannot be Xavier-initialized

    model.train()

    epoch_start_idx = 1
    bce_criterion = torch.nn.BCEWithLogitsLoss()
    adam_optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'], betas=(0.9, 0.98))

    T = 0.0
    t0 = time.time()

    for epoch in range(epoch_start_idx, args['num_epochs'] + 1):
        for step in range(num_batch):
        ##################Your code here ##################
        ## Note: implement a training loop
        ## 1. get u, seq, pos, neg from sampler.next_batch()
        ##    (see how we defined the WarpSampler)
        ## 2. convert these to np.array objects.
        ## 3. predict pos_logits and neg_logits by running model() with u, seq, pos, neg

            u, seq, pos, neg = sampler.next_batch()
            u, seq, pos, neg = np.array(u), np.array(seq), np.array(pos), np.array(neg)
            # print("[0]:",seq.shape)
            pos_logits, neg_logits = model(u, seq, pos, neg)

         ###################################################

            pos_labels, neg_labels = torch.ones(pos_logits.shape, device=args['device']), torch.zeros(neg_logits.shape, device=args['device'])
            adam_optimizer.zero_grad()
            indices = np.where(pos != 0)
            loss = bce_criterion(pos_logits[indices], pos_labels[indices])
            loss += bce_criterion(neg_logits[indices], neg_labels[indices])
            for param in model.item_emb.parameters(): loss += args['l2_emb'] * torch.norm(param)
            loss.backward()
            adam_optimizer.step()

        if epoch % 20 == 0:
            model.eval()
            t1 = time.time() - t0
            T += t1
            print('Evaluating', end='')
            t_test = evaluate(model, dataset, True, args)
            t_valid = evaluate(model, dataset, False, args)
            print('epoch:%d, time: %f(s), valid (NDCG@10: %.4f, HR@10: %.4f), test (NDCG@10: %.4f, HR@10: %.4f)'
                    % (epoch, T, t_valid[0], t_valid[1], t_test[0], t_test[1]))
            t0 = time.time()
            model.train()

    sampler.close()
    print("Done")



Evaluating........................................................................................................................epoch:20, time: 683.676578(s), valid (NDCG@10: 0.4830, HR@10: 0.7492), test (NDCG@10: 0.4638, HR@10: 0.7250)
Evaluating........................................................................................................................epoch:40, time: 1365.203602(s), valid (NDCG@10: 0.5505, HR@10: 0.8013), test (NDCG@10: 0.5232, HR@10: 0.7791)
Evaluating........................................................................................................................epoch:60, time: 2035.611916(s), valid (NDCG@10: 0.5690, HR@10: 0.8118), test (NDCG@10: 0.5465, HR@10: 0.7919)
Evaluating........................................................................................................................epoch:80, time: 2704.991881(s), valid (NDCG@10: 0.5797, HR@10: 0.8262), test (NDCG@10: 0.5587, HR@10: 0.7995)
Evaluating...............................

## Question 6: What are the test NDCG@10 and HR@10 for SASRec?

NDCG@10: 0.5655

HR@10: 0.8061

# Submission

In order to get credit, you need to submit the `ipynb` file to LMS.

To get this file, click `File` and `Download .ipynb`. Please make sure that the output of each cell is available in your `ipynb` file.