# Lab3: Recommendation System and BERT
## AWS S3
Amazon S3 (Simple Storage Service) is AWS’s object storage service.
- Object = data + metadata

Here are some key terms:

- **Bucket**: the top-level container (globally unique name, for example, `de300-tutorial-data`).

- **Object**: one stored file (e.g., `movies.dat`).

- **Key**: the path-like name of the object inside a bucket. In the URI `s3://de300-tutorial-data/datasets/movielens/ml-1m.zip`, the bucket is `de300-tutorial-data` and the key is `datasets/movielens/ml-1m.zip`.

- **Prefix**: the leading part of a key, used for organization (e.g., `datasets/movielens/`). S3 doesn’t have real folders; the console shows folder-like views using prefixes.
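
To make the bucket/key/prefix terminology concrete, here is a minimal sketch that splits an S3 URI using only the Python standard library. The helper names `split_s3_uri` and `parent_prefix` are made up for illustration and are not part of any AWS SDK; the sketch also assumes keys that contain no `?` or `#` characters, which `urlparse` would otherwise treat as URL syntax.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket; the path without its leading "/" is the key
    return parsed.netloc, parsed.path.lstrip("/")


def parent_prefix(key):
    """Return the folder-like prefix of a key (everything up to the last '/')."""
    head, sep, _ = key.rpartition("/")
    return head + sep


bucket, key = split_s3_uri("s3://de300-tutorial-data/datasets/movielens/ml-1m.zip")
print(bucket)              # de300-tutorial-data
print(key)                 # datasets/movielens/ml-1m.zip
print(parent_prefix(key))  # datasets/movielens/
```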

We can upload, download, and modify files on S3 with the AWS CLI, which can be installed by following the [AWS CLI installation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) instructions.

Before using the AWS CLI to access data on S3, set up credentials on the device (your own laptop or an EC2 instance). One way to do so is to go to [https://nu-sso.awsapps.com/](https://nu-sso.awsapps.com/), click `Access Keys`, and copy the AWS environment variables into your shell, e.g.,
```
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_SESSION_TOKEN="<your-session-token>"
```

Some commands that are frequently used:
1. List the objects and sub-prefixes directly under a prefix (add `--recursive` to list everything below it)
   ```
   aws s3 ls s3://dinglin-winter26/
   ```
2. Copy ONE object from S3 to local
   ```
   aws s3 cp s3://dinglin-winter26/lab4/ml-1m/movies.dat ./movies.dat
   ```

3. Copy a “folder” (prefix) recursively
   ```
   aws s3 cp s3://dinglin-winter26/lab4/ml-1m/ ./lab4-data/ --recursive
   ```

4. Copy objects and “folders” (prefixes) between buckets
   ```
   aws s3 cp s3://dinglin-winter26/lab4/ml-1m/movies.dat s3://DEST_BUCKET/path/to/movies.dat
   aws s3 cp s3://dinglin-winter26/lab4/ml-1m/ s3://DEST_BUCKET/some/prefix/ --recursive
   ```

## Recommendation System
### Dataset Intro
MovieLens 1M is a classic benchmark dataset for recommender systems released by the GroupLens Research Project. It contains 1,000,209 ratings (1–5 stars, whole-star only) from 6,040 users on movie IDs up to 3,952, and each user has at least 20 ratings.

### Data Loading
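We will read `ratings.dat` and `movies.dat` from the `ml-1m/` data, which the recursive copy above placed in `./lab4-data/`. These files use `::` as a delimiter. The default C parser in pandas only supports single-character separators, so we set `engine='python'` in `pandas.read_csv`.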

- ratings.dat: (user_id, movie_id, rating, timestamp)
- movies.dat: (movie_id, title, genres)

In [1]:
import pandas as pd

ratings = pd.read_csv("lab4-data/ratings.dat", sep="::", engine="python",
                      names=["user_id","movie_id","rating","timestamp"])
movies  = pd.read_csv("lab4-data/movies.dat",  sep="::", engine="python",
                      names=["movie_id","title","genres"], encoding="latin-1")

ratings.head(), movies.head()

(   user_id  movie_id  rating  timestamp
 0        1      1193       5  978300760
 1        1       661       3  978302109
 2        1       914       3  978301968
 3        1      3408       4  978300275
 4        1      2355       5  978824291,
    movie_id                               title                        genres
 0         1                    Toy Story (1995)   Animation|Children's|Comedy
 1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
 2         3             Grumpier Old Men (1995)                Comedy|Romance
 3         4            Waiting to Exhale (1995)                  Comedy|Drama
 4         5  Father of the Bride Part II (1995)                        Comedy)

Implicit-feedback recommendation models usually need positive interactions rather than 1–5 star ratings. Here we treat ratings ≥ 4 as a positive signal and create a binary column `value = 1.0`.

In [2]:
pos = ratings[ratings["rating"] >= 4].copy()
pos["value"] = 1.0

To evaluate recommendation, we hold out each user’s most recent positive interaction as the test item (using timestamp). The remaining positives become training data.

In [3]:
pos = pos.sort_values(["user_id","timestamp"])
test = pos.groupby("user_id").tail(1)
train = pos.drop(test.index)

train.shape, test.shape


((569243, 5), (6038, 5))
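
The leave-last-out split above can be illustrated on a few hand-made rows. This is a dependency-free sketch, not the pandas code used in the lab: the tuples and the function name `leave_last_out` are invented for illustration, and ties in timestamp keep the first row seen, which may differ from the `sort_values(...).tail(1)` behavior.

```python
# Toy positive interactions: (user_id, movie_id, timestamp). Values are made up.
positives = [
    (1, 10, 100), (1, 11, 300), (1, 12, 200),
    (2, 10, 50),
    (3, 11, 10), (3, 13, 20),
]


def leave_last_out(interactions):
    """Hold out each user's most recent interaction; the rest is training data."""
    latest = {}
    for row in interactions:
        user, _, ts = row
        # keep the row with the largest timestamp per user
        if user not in latest or ts > latest[user][2]:
            latest[user] = row
    test_rows = sorted(latest.values())
    held_out = set(test_rows)
    train_rows = [row for row in interactions if row not in held_out]
    return train_rows, test_rows


train_toy, test_toy = leave_last_out(positives)
print(test_toy)   # [(1, 11, 300), (2, 10, 50), (3, 13, 20)]
print(train_toy)  # [(1, 10, 100), (1, 12, 200), (3, 11, 10)]
```

Note that user 2 has a single positive interaction, so it ends up in the test set with nothing left for training. Cases like this are one reason an evaluation loop may need to skip users or movies that never appear in the training data.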

Before building sophisticated models, it’s useful to compute a simple baseline: recommend the most frequently liked movies in the training set. This gives you a sanity check for your pipeline.

In [4]:
top_movies = train.groupby("movie_id")["value"].sum().sort_values(ascending=False)
top10 = top_movies.head(10).index.tolist()
movies[movies["movie_id"].isin(top10)][["movie_id","title","genres"]]


| | movie_id | title | genres |
|---|---|---|---|
| 257 | 260 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi |
| 589 | 593 | Silence of the Lambs, The (1991) | Drama\|Thriller |
| 604 | 608 | Fargo (1996) | Crime\|Drama\|Thriller |
| 1178 | 1196 | Star Wars: Episode V - The Empire Strikes Back... | Action\|Adventure\|Drama\|Sci-Fi\|War |
| 1180 | 1198 | Raiders of the Lost Ark (1981) | Action\|Adventure |
| 1192 | 1210 | Star Wars: Episode VI - Return of the Jedi (1983) | Action\|Adventure\|Romance\|Sci-Fi\|War |
| 1959 | 2028 | Saving Private Ryan (1998) | Action\|Drama\|War |
| 2502 | 2571 | Matrix, The (1999) | Action\|Sci-Fi\|Thriller |
| 2693 | 2762 | Sixth Sense, The (1999) | Thriller |
| 2789 | 2858 | American Beauty (1999) | Comedy\|Drama |
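
A popularity baseline is most useful as a sanity check when it is scored the same way as the real model. The following sketch shows one way that could look on toy data, using only the standard library. The dictionaries and the function name `popularity_recall_at_k` are hypothetical and not part of the lab code.

```python
from collections import Counter

# Hypothetical toy data: user -> movies liked in training, user -> held-out movie
train_likes = {
    "u1": ["A", "B"],
    "u2": ["A", "C"],
    "u3": ["A", "B", "D"],
    "u4": ["C"],
}
held_out = {"u1": "C", "u2": "B", "u3": "E", "u4": "A"}


def popularity_recall_at_k(train_likes, held_out, k):
    """Recall@K when every user gets the K most-liked movies they have not seen."""
    counts = Counter(m for liked in train_likes.values() for m in liked)
    ranked = [m for m, _ in counts.most_common()]  # most liked first
    hits = 0
    for user, target in held_out.items():
        seen = set(train_likes.get(user, []))
        recs = [m for m in ranked if m not in seen][:k]
        hits += target in recs
    return hits / len(held_out)


print(popularity_recall_at_k(train_likes, held_out, k=2))  # 0.75
```

If a learned model cannot beat this kind of baseline on the same split, that usually points to a bug in the pipeline rather than a weak model.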


## ALS (Collaborative Filtering)
Many recommenders (including the ALS model in `implicit`) expect a sparse interaction matrix. We map raw IDs to contiguous indices:

- `user2idx`: user_id → row index
- `item2idx`: movie_id → column index

Then we create a sparse matrix `X_ui` with shape `[num_users, num_items]`.

In [5]:
import numpy as np
from scipy.sparse import coo_matrix

user_ids = train["user_id"].unique()
movie_ids = train["movie_id"].unique()

user2idx = {u:i for i,u in enumerate(user_ids)}
movie_ids = np.array(movie_ids)  # ensure it's a numpy array
item2idx = {int(m): i for i, m in enumerate(movie_ids)}
idx2item = {i: int(m) for i, m in enumerate(movie_ids)}

rows = train["user_id"].map(user2idx).to_numpy()
cols = train["movie_id"].map(item2idx).to_numpy()
data = train["value"].to_numpy().astype(np.float32)

X_ui = coo_matrix((data, (rows, cols)), shape=(len(user_ids), len(movie_ids))).tocsr()
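
The ID-to-index mapping used above can be seen in isolation on a handful of pairs. This is a plain-Python sketch with made-up IDs and a hypothetical helper `build_index`; it only produces the `(row, col, value)` triplets and the shape that a COO sparse-matrix constructor would consume, and does not build a SciPy matrix itself.

```python
# Raw IDs are not contiguous (MovieLens movie IDs have gaps), so map them to 0..n-1.
raw_pairs = [(10, 1193), (10, 661), (42, 1193), (77, 3408)]  # made-up (user_id, movie_id)


def build_index(ids):
    """Assign contiguous indices in order of first appearance."""
    index = {}
    for x in ids:
        index.setdefault(x, len(index))
    return index


u2i = build_index(u for u, _ in raw_pairs)
m2i = build_index(m for _, m in raw_pairs)

# COO triplets (row, col, value) plus the matrix shape
coo = [(u2i[u], m2i[m], 1.0) for u, m in raw_pairs]
shape = (len(u2i), len(m2i))

print(u2i)    # {10: 0, 42: 1, 77: 2}
print(m2i)    # {1193: 0, 661: 1, 3408: 2}
print(coo)    # [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (2, 2, 1.0)]
print(shape)  # (3, 3)
```

Keeping the inverse mapping (index → raw ID) around matters later, because the model returns column indices that have to be translated back into movie IDs.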


In [7]:
# %pip install implicit

In [6]:
import implicit

model = implicit.als.AlternatingLeastSquares(
    factors=64, regularization=0.01, iterations=20
)

# implicit >= 0.5 expects a user-item matrix (rows = users, columns = items)
model.fit(X_ui)


100%|██████████| 20/20 [00:00<00:00, 24.03it/s]


In [8]:
u = int(user_ids[0])
u_idx = int(user2idx[u])

# IMPORTANT: pass X_ui[u_idx] (1 row), not the full X_ui
recs = model.recommend(u_idx, X_ui[u_idx], N=10)

# Handle implicit returning either (ids, scores) OR list of pairs
if isinstance(recs, tuple) and len(recs) == 2:
    item_ids, scores = recs
else:
    item_ids = [i for i, s in recs]
    scores  = [s for i, s in recs]

rec_movie_ids = [idx2item[int(i)] for i in item_ids]

rec_df = movies.set_index("movie_id").loc[rec_movie_ids][["title", "genres"]].reset_index()
rec_df["score"] = list(scores)
rec_df


| | movie_id | title | genres | score |
|---|---|---|---|---|
| 0 | 34 | Babe (1995) | Children's\|Comedy\|Drama | 0.58017 |
| 1 | 364 | Lion King, The (1994) | Animation\|Children's\|Musical | 0.557676 |
| 2 | 318 | Shawshank Redemption, The (1994) | Drama | 0.473276 |
| 3 | 2081 | Little Mermaid, The (1989) | Animation\|Children's\|Comedy\|Musical\|Romance | 0.407703 |
| 4 | 3471 | Close Encounters of the Third Kind (1977) | Drama\|Sci-Fi | 0.381777 |
| 5 | 1225 | Amadeus (1984) | Drama | 0.36608 |
| 6 | 1282 | Fantasia (1940) | Animation\|Children's\|Musical | 0.361515 |
| 7 | 593 | Silence of the Lambs, The (1991) | Drama\|Thriller | 0.35366 |
| 8 | 2078 | Jungle Book, The (1967) | Animation\|Children's\|Comedy\|Musical | 0.331318 |
| 9 | 1947 | West Side Story (1961) | Musical\|Romance | 0.331311 |


### Evaluation: Recall@K and NDCG@K
We evaluate each user by checking whether the held-out test movie appears in the top-K recommendations:

- Recall@K: fraction of users whose held-out item is retrieved in top-K.
- NDCG@K: like Recall@K, but gives higher credit when the item appears closer to rank 1.

In [9]:
import math

def ndcg_at_k(rank, k):
    if rank is None or rank >= k:
        return 0.0
    return 1.0 / math.log2(rank + 2)

def _get_rec_items(recs):
    # implicit may return (item_ids, scores) OR list of (id, score)
    if isinstance(recs, tuple) and len(recs) == 2:
        item_ids = recs[0]
        return [int(i) for i in item_ids]
    else:
        return [int(i) for i, _ in recs]

def eval_model(model, X_ui, test_df, K=10):
    hits, ndcgs, n = 0, 0.0, 0
    for u, gt in zip(test_df["user_id"], test_df["movie_id"]):
        if (u not in user2idx) or (gt not in item2idx):
            continue

        u_idx = int(user2idx[u])

        # IMPORTANT: pass only this user's row
        recs = model.recommend(u_idx, X_ui[u_idx], N=K)

        rec_items = _get_rec_items(recs)
        gt_idx = int(item2idx[gt])

        if gt_idx in rec_items:
            hits += 1
            ndcgs += ndcg_at_k(rec_items.index(gt_idx), K)

        n += 1

    return {"Recall@K": hits / max(n, 1), "NDCG@K": ndcgs / max(n, 1), "Users": n}

eval_model(model, X_ui, test, K=10)

{'Recall@K': 0.0946304275770633, 'NDCG@K': 0.04763869188980648, 'Users': 6034}
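
As a quick check of the metric definitions, here is a small worked example on made-up ranked lists. It assumes one held-out item per user, as in this lab, so the ideal DCG is 1 and NDCG reduces to `1 / log2(rank + 2)` for a 0-based rank. The function name `hit_and_ndcg` is illustrative only.

```python
import math


def hit_and_ndcg(ranked_items, target, k):
    """Return (hit, ndcg) for one user with a single held-out item."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0, 0.0
    rank = top_k.index(target)            # 0-based position in the list
    return 1, 1.0 / math.log2(rank + 2)   # rank 0 -> 1.0, rank 2 -> 0.5


# Made-up ranked lists and held-out items for three users
cases = [
    (["a", "b", "c"], "a"),   # hit at rank 0
    (["a", "b", "c"], "c"),   # hit at rank 2
    (["a", "b", "c"], "z"),   # miss
]
results = [hit_and_ndcg(r, t, k=3) for r, t in cases]
recall = sum(h for h, _ in results) / len(results)
ndcg = sum(n for _, n in results) / len(results)
print(round(recall, 3), round(ndcg, 3))  # 0.667 0.5
```

Both metrics count the same hits; NDCG@K is lower because the hit at rank 2 earns only half credit.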

## Content Embedding with BERT Model

In [10]:
movies["text"] = movies["title"].fillna("") + " [SEP] " + movies["genres"].fillna("")
movies["text"].head()


0    Toy Story (1995) [SEP] Animation|Children's|Co...
1    Jumanji (1995) [SEP] Adventure|Children's|Fantasy
2         Grumpier Old Men (1995) [SEP] Comedy|Romance
3          Waiting to Exhale (1995) [SEP] Comedy|Drama
4      Father of the Bride Part II (1995) [SEP] Comedy
Name: text, dtype: object

In [11]:
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").to(device)
bert.eval()

@torch.no_grad()
def encode_texts(texts, batch_size=64, max_len=64):
    embs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inp = tok(batch, padding=True, truncation=True, max_length=max_len, return_tensors="pt").to(device)
        out = bert(**inp).last_hidden_state[:,0,:]  # [CLS]
        embs.append(out.cpu())
    return torch.cat(embs, dim=0)

item_emb = encode_texts(movies["text"].tolist())
item_emb.shape


Loading weights: 100%|██████████| 199/199 [00:00<00:00, 348.33it/s, Materializing param=pooler.dense.weight]                               
BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


torch.Size([3883, 768])

### Save item embedding as tensor

In [12]:
save_path = "./lab4-data/item_emb.pt"
torch.save(item_emb, save_path)

# later
item_emb2 = torch.load(save_path, map_location="cpu")
print(item_emb2.shape)


torch.Size([3883, 768])


In [13]:
payload = {
    "item_emb": item_emb,                      # [num_items, hidden_dim]
    "movie_id": movies["movie_id"].to_numpy(), # same order as embeddings
}
torch.save(payload, "./lab4-data/item_emb_with_ids.pt")

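# later
# weights_only=False is needed because the payload holds a NumPy array, not only tensors;
# it unpickles arbitrary objects, so only load files you created or trust
obj = torch.load("./lab4-data/item_emb_with_ids.pt", map_location="cpu", weights_only=False)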
item_emb = obj["item_emb"]
movie_id = obj["movie_id"]

In [14]:
import torch.nn.functional as F

# normalize for cosine similarity
E = F.normalize(item_emb, p=2, dim=1)

movieid_to_row = {mid:i for i,mid in enumerate(movies["movie_id"].tolist())}

def similar_movies(movie_id, topk=10):
    i = movieid_to_row[movie_id]
    sims = (E @ E[i]).numpy()
    top = sims.argsort()[::-1][1:topk+1]  # exclude itself
    return movies.iloc[top][["movie_id","title","genres"]]

similar_movies(movie_id=1, topk=10)  # Toy Story is movie_id=1 in ml-1m


| | movie_id | title | genres |
|---|---|---|---|
| 3045 | 3114 | Toy Story 2 (1999) | Animation\|Children's\|Comedy |
| 3682 | 3751 | Chicken Run (2000) | Animation\|Children's\|Comedy |
| 1050 | 1064 | Aladdin and the King of Thieves (1996) | Animation\|Children's\|Comedy |
| 584 | 588 | Aladdin (1992) | Animation\|Children's\|Comedy\|Musical |
| 2286 | 2355 | Bug's Life, A (1998) | Animation\|Children's\|Comedy |
| 1743 | 1806 | Paulie (1998) | Adventure\|Children's\|Comedy |
| 2069 | 2138 | Watership Down (1978) | Animation\|Children's\|Drama\|Fantasy |
| 591 | 595 | Beauty and the Beast (1991) | Animation\|Children's\|Musical |
| 2072 | 2141 | American Tail, An (1986) | Animation\|Children's\|Comedy |
| 451 | 455 | Free Willy (1993) | Adventure\|Children's\|Drama |
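
The reason `E @ E[i]` yields cosine similarities is that the rows of `E` were L2-normalized first. A tiny standard-library sketch, with made-up 2-D vectors standing in for the 768-dimensional embeddings, shows the equivalence:

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After L2-normalization, a plain dot product equals the cosine similarity
print(round(cosine(a, b), 6))                           # 0.96
print(round(dot(l2_normalize(a), l2_normalize(b)), 6))  # 0.96
```

Normalizing once up front means every later similarity query is a single matrix-vector product.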


In [15]:
from collections import defaultdict
import numpy as np

liked = train.groupby("user_id")["movie_id"].apply(list).to_dict()

def recommend_for_user_content(user_id, topk=10):
    mids = liked.get(user_id, [])
    mids = [m for m in mids if m in movieid_to_row]
    if not mids: 
        return None
    rows = [movieid_to_row[m] for m in mids]
    u = E[rows].mean(dim=0, keepdim=True)
    sims = (E @ u[0]).numpy()
    # filter already seen
    seen_rows = set(rows)
    candidates = [i for i in sims.argsort()[::-1] if i not in seen_rows]
    top = candidates[:topk]
    return movies.iloc[top][["movie_id","title","genres"]]

recommend_for_user_content(user_ids[0], topk=10)


| | movie_id | title | genres |
|---|---|---|---|
| 3038 | 3107 | Backdraft (1991) | Action\|Drama |
| 3099 | 3168 | Easy Rider (1969) | Adventure\|Drama |
| 1096 | 1112 | Palookaville (1996) | Action\|Drama |
| 651 | 657 | Yankee Zulu (1994) | Comedy\|Drama |
| 1885 | 1954 | Rocky (1976) | Action\|Drama |
| 2857 | 2926 | Hairspray (1988) | Comedy\|Drama |
| 2508 | 2577 | Metroland (1997) | Comedy\|Drama |
| 3096 | 3165 | Boiling Point (1993) | Action\|Drama |
| 533 | 537 | Sirens (1994) | Comedy\|Drama |
| 3457 | 3526 | Parenthood (1989) | Comedy\|Drama |


In [16]:
# item2idx maps movie_id -> item_idx used by X_ui / eval_model
idx2item = {i:m for m,i in item2idx.items()}

# Build a matrix of embeddings in item-index order (0..num_items-1), so that
# row i of E_items is the embedding of the movie in column i of X_ui.
# Uses `item_emb`: torch tensor [num_movies_all, 768], in the same row order as `movies`.
E_all = F.normalize(item_emb, p=2, dim=1)

# movie_id -> row index in `movies` / E_all
movieid_to_row = {mid:i for i,mid in enumerate(movies["movie_id"].tolist())}

num_items = len(item2idx)
dim = E_all.shape[1]
E_items = torch.empty((num_items, dim), dtype=E_all.dtype)

missing = 0
for item_idx in range(num_items):
    mid = idx2item[item_idx]
    row = movieid_to_row.get(mid, None)
    if row is None:
        missing += 1
        E_items[item_idx] = 0
    else:
        E_items[item_idx] = E_all[row]

print("missing movies:", missing, "out of", num_items)

missing movies: 0 out of 3530


## Lab Assignment
1. Complete the `recommend` method of the BERT content-embedding model below.
2. Try at least 2 different pre-trained models for embedding and compare their performance.

In [None]:
class BertContentModel:
    def __init__(self, E_items):
        # E_items: torch tensor [num_items, dim], already normalized
        self.E = F.normalize(E_items, p=2, dim=1)

    def recommend(self, u_idx, X_ui, N=10):
        # Get items user interacted with from sparse matrix row
        row = X_ui  # already this user's row: eval_model passes X_ui[u_idx], so do not index by u_idx again
        seen = row.indices  # item indices seen in training interactions

        # Edge cases: the user has no seen movies
        if len(seen) == 0:
            return []

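        # User embedding = mean of seen item embeddings
        # TODO: compute `scores`, a 1-D tensor with one similarity score per item

        # Top-N
        # TODO: fill `top_idx` with the indices of the N best-scoring unseen items
        top_idx = []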

        return [(int(i), float(scores[i])) for i in top_idx]

In [None]:
bert_model = BertContentModel(E_items)

In [None]:
eval_model(bert_model, X_ui, test, K=10)