### Imports
This section imports the libraries used in the notebook:
- **pandas / numpy** for data loading and matrix operations
- **TruncatedSVD (scikit-learn)** for matrix factorization
- **math** for the logarithms used in ranking metrics (nDCG)
- **os / textwrap** for building file paths and wrapping long printed output

In [135]:
import os
import math
import textwrap
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

### Configuration
This section defines key parameters used throughout the notebook:
- `DATA_DIR`: local folder containing Goodbooks-10K files (`ratings.csv`, `books.csv`)
- `K`: the Top-K cutoff for recommendations and ranking metrics
- `RELEVANT_THRESHOLD`: threshold used to define “relevant” items in the test set (rating ≥ 4.0)
- `RANDOM_STATE`: ensures reproducible splits and sampling

In [136]:
DATA_DIR = "./data"
K = 5
RELEVANT_THRESHOLD = 4.0
RANDOM_STATE = 42

### Data Loading
This section loads:
- `ratings.csv` containing user–book ratings (user_id, book_id, rating)
- `books.csv` containing book metadata (book_id, title)

The ratings data is the main input for training and evaluation.  
The books metadata is used to convert `book_id` values into readable book titles for outputs.

In [137]:
ratings = pd.read_csv(os.path.join(DATA_DIR, "ratings.csv"))
ratings = ratings[["user_id", "book_id", "rating"]]

ratings.head()

,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


In [138]:
books_df = pd.read_csv(os.path.join(DATA_DIR, "books.csv"))
books_df = books_df[["book_id", "title"]]

books_df.head()

,book_id,title
0,1,"The Hunger Games (The Hunger Games, #1)"
1,2,Harry Potter and the Sorcerer's Stone (Harry P...
2,3,"Twilight (Twilight, #1)"
3,4,To Kill a Mockingbird
4,5,The Great Gatsby


### Train-Test Split (Offline Evaluation Setup)
This section splits the ratings data into:
- **Training set (80%)**: used to learn latent factors (SVD)
- **Test set (20%)**: held out for offline evaluation

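This approximates a deployment scenario in which a model is trained on known interactions and evaluated on unseen ones. The split is random per rating, not chronological: each rating lands in the training set with probability 0.8, so the resulting sizes are only approximately 80/20.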

In [139]:
np.random.seed(RANDOM_STATE)

mask = np.random.rand(len(ratings)) < 0.8
train_df = ratings[mask]
test_df = ratings[~mask]

print("Train size:", len(train_df))
print("Test size:", len(test_df))


Train size: 4781611
Test size: 1194868
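
The mask-based split can be checked on a small synthetic table. The sketch below is an illustration only: the `toy_*` names and the generated data are assumptions, not part of the notebook. It confirms that every row lands in exactly one set and shows how to spot users whose ratings all fell into the test set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings table (not the Goodbooks-10K data).
toy_rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "user_id": toy_rng.integers(1, 21, size=200),
    "book_id": toy_rng.integers(1, 51, size=200),
    "rating": toy_rng.integers(1, 6, size=200),
})

# Same pattern as the notebook cell: one uniform draw per row, kept if < 0.8.
np.random.seed(42)
toy_mask = np.random.rand(len(toy)) < 0.8
toy_train, toy_test = toy[toy_mask], toy[~toy_mask]

# Every row ends up in exactly one of the two sets.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train.index.intersection(toy_test.index).empty

# Users that appear only in the test set have no training rows at all.
test_only_users = set(toy_test["user_id"]) - set(toy_train["user_id"])
train_fraction = len(toy_train) / len(toy)
print(f"train fraction: {train_fraction:.3f}, test-only users: {len(test_only_users)}")
```

Users found in `test_only_users` would have no row in the training matrix, which is why the evaluation loop later skips any user missing from the prediction matrix.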


### User–Item Rating Matrix Construction
This section converts the training interactions into a user–item matrix:
- Rows = users
- Columns = books
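- Values = ratings (user–book pairs without a rating are filled with 0)

This matrix is the input to the matrix factorization method. Because unrated pairs are stored as 0, the factorization treats them as observed zeros rather than as missing values.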

In [140]:
train_matrix = train_df.pivot_table(
    index="user_id",
    columns="book_id",
    values="rating"
)

train_matrix = train_matrix.fillna(0)

train_matrix.shape

(53424, 10000)
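
To make the pivot step concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). It shows two behaviors worth knowing: `pivot_table` averages duplicate user–book pairs by default, and `fillna(0)` makes an unrated pair indistinguishable from a stored zero.

```python
import pandas as pd

# Made-up interactions; user 2 rated book 10 twice.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "book_id": [10, 20, 10, 10, 30],
    "rating":  [5, 3, 4, 2, 1],
})

# Default aggfunc is the mean, so the duplicate pair (2, 10) becomes 3.0.
toy_matrix = toy.pivot_table(
    index="user_id", columns="book_id", values="rating"
).fillna(0)
print(toy_matrix)

# Share of cells that hold an actual rating.
density = (toy_matrix.to_numpy() != 0).mean()
print(f"density: {density:.2f}")
```

At the notebook's shape of (53424, 10000), a dense matrix holds about 534 million cells, roughly 4.3 GB if stored as float64, of which at most about 0.9% can be nonzero given the 4,781,611 training rows.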

### AI Method: Matrix Factorization (Truncated SVD)
This section applies **Truncated Singular Value Decomposition (SVD)** to factorize the user–item matrix into latent factors.

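Conceptually:
- The rating matrix `R` is approximated by a rank-50 product `R ≈ U × V`
- `U` (`user_factors`, one row per user) represents latent user preferences; `fit_transform` returns it already scaled by the singular values
- `V` (`item_factors`, taken from `svd.components_`, one column per book) represents latent item (book) characteristics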

These latent factors capture hidden patterns in user taste and book similarity.

In [141]:
svd = TruncatedSVD(n_components=50, random_state=RANDOM_STATE)
user_factors = svd.fit_transform(train_matrix)
item_factors = svd.components_
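
The factorization can be illustrated with plain NumPy on a tiny matrix. This is a sketch under assumptions: it uses an exact SVD via `np.linalg.svd` rather than scikit-learn's `TruncatedSVD` (which uses a randomized solver by default, so its factors are approximate and may differ in sign), and all `toy_*` names are hypothetical.

```python
import numpy as np

toy_rng = np.random.default_rng(0)
toy_R = toy_rng.integers(0, 6, size=(6, 5)).astype(float)  # toy rating matrix

# Exact SVD: toy_R = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(toy_R, full_matrices=False)

toy_k = 2
toy_user_factors = U[:, :toy_k] * s[:toy_k]  # U_k scaled by singular values
toy_item_factors = Vt[:toy_k, :]             # V_k transposed: one column per item

# Rank-k reconstruction of the matrix.
toy_R_hat = toy_user_factors @ toy_item_factors

# Frobenius error of the best rank-k approximation equals the root of the
# sum of the squared singular values that were dropped.
err = np.linalg.norm(toy_R - toy_R_hat)
expected = np.sqrt(np.sum(s[toy_k:] ** 2))
print(err, expected)
```

Keeping more components lowers the reconstruction error but also reproduces more of the zero-filled entries, so the number of components trades fit against generalization.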

### Prediction Reconstruction
This section reconstructs predicted preference scores using matrix multiplication of latent factors.

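After factorization:
- The model produces a predicted preference score for each user–book pair
- These predicted scores are used to rank books for recommendations

Because unrated pairs were filled with 0 before factorization, the scores approximate the zero-filled matrix. They are not calibrated to the 1–5 rating scale and are meaningful mainly as a ranking signal.

With the reconstructed prediction matrix, Top-K ranking for any user reduces to sorting one row.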

In [142]:
predicted_matrix = np.dot(user_factors, item_factors)

predicted_df = pd.DataFrame(
    predicted_matrix,
    index=train_matrix.index,
    columns=train_matrix.columns
)

predicted_df.head()

user_id / book_id,1,2,3,4,5,6,7,8,9,10,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,10000
1,0.300721,-0.062356,0.264478,0.042926,0.579756,0.029156,-0.581507,0.02575,0.027302,3.309465,...,0.042989,-0.006351,-0.004377,-0.011845,0.012702,0.001653,0.004196,0.002021,0.009616,0.002599
2,-0.548098,4.337417,-0.032157,0.559053,2.708069,0.856055,0.561822,1.585513,0.971445,2.739162,...,0.004753,0.00141,0.00847,-0.014645,-0.000155,-0.004492,0.030056,0.007645,0.031538,0.02822
3,-0.223258,0.124348,-0.089531,0.024713,-0.072898,-0.198892,0.164668,0.388168,-0.304063,-0.137401,...,0.004923,0.002779,0.000101,0.002096,0.006676,-0.002718,0.002841,-0.001402,0.007382,0.005537
4,-0.127423,4.969884,-0.218483,3.996812,-0.227493,-0.623645,2.009626,1.863415,1.516681,3.612967,...,0.018378,-0.024718,0.002865,0.017941,0.02016,0.012425,0.012303,0.021077,0.017781,0.022653
5,0.070524,-0.158815,-0.086455,-0.051436,0.401914,1.765862,-0.11804,0.202794,0.289053,-0.207354,...,0.005421,0.00879,0.004993,-0.003444,-0.000586,0.004818,0.003895,0.001909,0.006043,-0.001525
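
The full prediction matrix has the same 53424 × 10000 shape as the training matrix. When only a few users need recommendations, one row of scores can be computed on demand instead. The following is a minimal sketch, not the notebook's method: `top_k_unseen` and the toy arrays are hypothetical, and the returned indices are column positions that would still need mapping back to `book_id` values (for example through `train_matrix.columns`).

```python
import numpy as np

def top_k_unseen(user_vec, item_factors, seen_cols, k):
    """Return (column positions, scores) of the k best unseen items."""
    scores = user_vec @ item_factors                  # shape: (n_items,)
    seen = np.array(list(seen_cols), dtype=int)
    scores[seen] = -np.inf                            # exclude training items
    k = min(k, scores.size)
    top = np.argpartition(-scores, k - 1)[:k]         # unordered top-k
    top = top[np.argsort(-scores[top])]               # sort by score, descending
    return top, scores[top]

# Toy factors: 2 latent dimensions, 4 items.
toy_user = np.array([1.0, 0.5])
toy_items = np.array([
    [1.0, 0.0, 2.0, 0.5],
    [0.0, 4.0, 1.0, 2.0],
])

# Raw scores are [1.0, 2.0, 2.5, 1.5]; column 2 is treated as already seen.
idx, vals = top_k_unseen(toy_user, toy_items, seen_cols={2}, k=2)
print(idx, vals)
```

`np.argpartition` avoids a full sort of all item scores, which matters more as the catalogue grows.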


### Evaluation Metrics (Ranking-Based)
This section defines ranking metrics used to evaluate recommendation quality:
- **MAP@K**: rewards ranking relevant items early and consistently
- **nDCG@K**: rewards placing relevant items near the top with position-based discounting
- **MRR@K**: focuses on how early the first relevant recommendation appears

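Each metric takes the Top-K list as a binary “relevance vector” (1 = relevant, 0 = not relevant). Because the functions below receive only this vector, the ideal ranking in nDCG@K and the denominator in MAP@K are based on the hits inside the Top-K list, not on the user's total number of relevant test items.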

In [143]:
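def dcg_at_k(rels, k):
    # A hit at 0-based position i contributes 1 / log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    # The ideal ordering is built from `rels` itself, so the score is relative
    # to the best reordering of the retrieved items.
    ideal = sorted(rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0

def average_precision_at_k(rels, k):
    # Accumulate precision@i at every rank i that holds a relevant item.
    score, hits = 0.0, 0
    for i, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    # sum(rels) counts hits in the list passed in, not all relevant test items.
    return score / min(sum(rels), k) if sum(rels) > 0 else 0

def reciprocal_rank_at_k(rels, k):
    # 1 / rank of the first relevant item, or 0 if none is in the top k.
    for i, rel in enumerate(rels[:k], start=1):
        if rel:
            return 1 / i
    return 0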
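
A worked example helps check the arithmetic. The sketch below recomputes the three metrics by hand for the relevance vector `[0, 0, 1, 0, 0]` (a single hit at rank 3), following the same formulas as the functions above; the `toy_*` names are illustrative only.

```python
import math

toy_rels = [0, 0, 1, 0, 0]   # one relevant item, at rank 3
toy_k = 5

# DCG: the hit sits at 0-based position 2, so it contributes 1 / log2(4) = 0.5.
dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(toy_rels[:toy_k]))

# Ideal DCG as defined above: the same vector sorted descending, [1, 0, 0, 0, 0].
ideal = sorted(toy_rels, reverse=True)
idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:toy_k]))
ndcg = dcg / idcg                      # 0.5 / 1.0

# AP: precision at each hit rank, divided by the number of hits in the list.
hit_ranks = [i for i, rel in enumerate(toy_rels[:toy_k], start=1) if rel]
precisions = [(n + 1) / rank for n, rank in enumerate(hit_ranks)]
ap = sum(precisions) / min(sum(toy_rels), toy_k)   # (1/3) / 1

# RR: reciprocal of the first hit's rank.
rr = 1 / hit_ranks[0]                  # 1/3

print(f"nDCG={ndcg:.3f}  AP={ap:.3f}  RR={rr:.3f}")
```

With a single hit, AP and RR coincide, and nDCG depends only on how far the hit is from the top.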


### Offline Testing and Evaluation Procedure
This section evaluates the recommender using the held-out test set.

For each user:
1. Identify **relevant test books** where rating ≥ `RELEVANT_THRESHOLD`
2. Exclude books already seen in the training set
3. Rank unseen books by predicted score
4. Compute MAP@K, nDCG@K, and MRR@K using the relevance vector

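Final results are obtained by averaging scores across all evaluated users. Users with no row in the prediction matrix, or with no relevant books in the test set, are skipped and do not count toward the average.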

In [144]:
map_scores, ndcg_scores, mrr_scores = [], [], []

for user in test_df["user_id"].unique():
    
    if user not in predicted_df.index:
        continue

    user_test = test_df[test_df["user_id"] == user]
    relevant = set(user_test[user_test["rating"] >= RELEVANT_THRESHOLD]["book_id"])

    if len(relevant) == 0:
        continue

    user_predictions = predicted_df.loc[user]

    # Remove items already seen in training
    seen_items = set(train_df[train_df["user_id"] == user]["book_id"])
    user_predictions = user_predictions.drop(seen_items, errors="ignore")

    ranked = user_predictions.sort_values(ascending=False).index.tolist()[:K]

    rels = [1 if iid in relevant else 0 for iid in ranked]

    map_scores.append(average_precision_at_k(rels, K))
    ndcg_scores.append(ndcg_at_k(rels, K))
    mrr_scores.append(reciprocal_rank_at_k(rels, K))

metrics = {
    "MAP@k": np.mean(map_scores),
    "nDCG@k": np.mean(ndcg_scores),
    "MRR@k": np.mean(mrr_scores)
}

metrics


{'MAP@k': np.float64(0.5271576037259914),
 'nDCG@k': np.float64(0.5909082119863426),
 'MRR@k': np.float64(0.5548188140899463)}
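
The loop above filters `train_df` and `test_df` once per user, which rescans both tables on every iteration. One possible alternative, sketched here on made-up data with hypothetical `toy_*` names, is to group once up front and look up each user's sets in a dictionary. The resulting sets should match what the per-user filters produce.

```python
import pandas as pd

toy_train = pd.DataFrame({"user_id": [1, 1, 2], "book_id": [10, 20, 10]})
toy_test = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "book_id": [30, 20, 30, 10],
    "rating":  [5, 4, 2, 5],
})
toy_threshold = 4.0

# One pass over each table: user_id -> set of book_ids.
seen_by_user = toy_train.groupby("user_id")["book_id"].apply(set).to_dict()
relevant_by_user = (
    toy_test[toy_test["rating"] >= toy_threshold]
    .groupby("user_id")["book_id"]
    .apply(set)
    .to_dict()
)

for user, relevant in relevant_by_user.items():
    # Users absent from training simply get an empty "seen" set here.
    seen = seen_by_user.get(user, set())
    print(user, "relevant:", relevant, "seen:", seen)
```

Users with no relevant test item never appear in `relevant_by_user`, so the explicit empty-set check from the loop is no longer needed in this variant.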

In [145]:
print(f"MAP@{K}:  {metrics['MAP@k']:.4f}")
print(f"nDCG@{K}: {metrics['nDCG@k']:.4f}")
print(f"MRR@{K}:  {metrics['MRR@k']:.4f}")

MAP@5:  0.5272
nDCG@5: 0.5909
MRR@5:  0.5548
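
When reading these numbers, note that the metric functions see only the Top-K relevance vector. A common alternative normalizes by the user's total number of relevant test items, which gives lower scores when a user has many relevant books but only a few are retrieved. The sketch below shows that variant; the function names and the `n_relevant` parameter are hypothetical and not part of the notebook.

```python
import math

def ndcg_at_k_full(rels, n_relevant, k):
    """nDCG@k whose ideal ranking puts min(n_relevant, k) hits at the top."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(n_relevant, k)))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision_at_k_full(rels, n_relevant, k):
    """AP@k divided by min(n_relevant, k) instead of the hits in the list."""
    score, hits = 0.0, 0
    for i, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    denom = min(n_relevant, k)
    return score / denom if denom > 0 else 0.0

# One hit at rank 3, for a user with 12 relevant books in the test set.
toy_rels = [0, 0, 1, 0, 0]
ndcg_full = ndcg_at_k_full(toy_rels, n_relevant=12, k=5)
ap_full = average_precision_at_k_full(toy_rels, n_relevant=12, k=5)
print(f"nDCG@5={ndcg_full:.4f}  AP@5={ap_full:.4f}")
```

For this vector the notebook's definitions give nDCG@5 = 0.5 and AP@5 ≈ 0.333, while the variant gives about 0.17 and 0.067, so scores from the two conventions are not directly comparable.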


### Input/Output Examples (Random Users)
This section generates real examples for 3 random users.

The function:
- randomly selects users from the test set
- generates Top-5 recommendations (book_id + title + predicted rating)
- computes a relevance vector by comparing recommendations with relevant test items

This produces concrete “input → output” examples that demonstrate the AI task.

In [146]:
def generate_real_examples(k=5, n_examples=3):
    
    examples = []
    rng = np.random.default_rng(RANDOM_STATE)
    
    # Get users that appear in test set
    test_users = test_df["user_id"].unique()
    
    count = 0
    
    rng.shuffle(test_users)

    for user in test_users:
        
        if user not in predicted_df.index:
            continue
        
        user_test = test_df[test_df["user_id"] == user]
        relevant = set(
            user_test[user_test["rating"] >= RELEVANT_THRESHOLD]["book_id"]
        )
        
        if len(relevant) == 0:
            continue
        
        user_predictions = predicted_df.loc[user]
        
        # Remove seen items
        seen_items = set(train_df[train_df["user_id"] == user]["book_id"])
        user_predictions = user_predictions.drop(seen_items, errors="ignore")
        
        ranked = user_predictions.sort_values(ascending=False).head(k)
        
        rels = [1 if iid in relevant else 0 for iid in ranked.index]

        id_to_title = dict(zip(books_df["book_id"], books_df["title"]))
        top_books = []
        
        for iid, score in ranked.items():
            top_books.append({
                "book_id": iid,
                "title": id_to_title.get(iid, "Unknown"),
                "predicted_rating": round(float(score), 3)
            })
            
        example = {
            "user_id": user,
            "top_k_books": top_books,
            "relevance_vector": rels,
            "relevant_books_in_test": list(relevant)
        }
        
        examples.append(example)
        count += 1
        
        if count == n_examples:
            break
    
    return examples

In [147]:
examples = generate_real_examples(k=5, n_examples=3)

for ex in examples:
    print("=" * 80)
    print(f"User ID: {ex['user_id']}")
    print("-" * 60)
    print(f"Output: Top-{len(ex['top_k_books'])} recommended books:")
    
    for i, book in enumerate(ex["top_k_books"], start=1):
        print(
            f"{i}. {book['title']} "
            f"(Book ID: {book['book_id']}, "
            f"Predicted Rating: {book['predicted_rating']})"
        )
    
    print("-" * 60)
    print("Relevance Vector:", ex["relevance_vector"])
    print("Relevant Books in Test:", ex["relevant_books_in_test"])


User ID: 50462
------------------------------------------------------------
Output: Top-5 recommended books:
1. Jane Eyre (Book ID: 43, Predicted Rating: 3.128)
2. Wuthering Heights (Book ID: 63, Predicted Rating: 3.08)
3. Romeo and Juliet (Book ID: 29, Predicted Rating: 2.949)
4. Macbeth (Book ID: 154, Predicted Rating: 2.864)
5. Sense and Sensibility (Book ID: 76, Predicted Rating: 2.279)
------------------------------------------------------------
Relevance Vector: [0, 0, 1, 0, 0]
Relevant Books in Test: [804, 357, 1607, 265, 650, 301, 2355, 180, 55, 248, 29, 767]
User ID: 4542
------------------------------------------------------------
Output: Top-5 recommended books:
1. Brave New World (Book ID: 55, Predicted Rating: 2.738)
2. The Catcher in the Rye (Book ID: 8, Predicted Rating: 2.533)
3. Emma (Book ID: 171, Predicted Rating: 2.049)
4. The Curious Incident of the Dog in the Night-Time (Book ID: 60, Predicted Rating: 1.649)
5. Lord of the Flies (Book ID: 28, Predicted Rating: 1.6

### Input/Output For Given User ID
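This section generates recommendations for a given user ID.

The function:
- generates Top-`k` recommendations (default `k=5`; book_id + title + predicted score)
- computes a relevance vector by comparing recommendations with relevant test items
- prints the result and returns `None`; it also returns early with a message if the user is missing from the prediction matrix or the test set, or has no relevant test items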

In [148]:
def recommend_for_user(user_id, k=5):
    # Check if user exists in prediction matrix
    if user_id not in predicted_df.index:
        print(f"User {user_id} not found in prediction matrix.")
        return None
    
    # Check if user exists in test set
    if user_id not in test_df["user_id"].values:
        print(f"User {user_id} not found in test set.")
        return None
    
    user_test = test_df[test_df["user_id"] == user_id]
    relevant = set(
        user_test[user_test["rating"] >= RELEVANT_THRESHOLD]["book_id"]
    )
    
    if len(relevant) == 0:
        print(f"User {user_id} has no relevant items in test set (rating >= {RELEVANT_THRESHOLD}).")
        return None
    
    user_predictions = predicted_df.loc[user_id]
    
    # Remove seen items
    seen_items = set(train_df[train_df["user_id"] == user_id]["book_id"])
    user_predictions = user_predictions.drop(seen_items, errors="ignore")
    
    ranked = user_predictions.sort_values(ascending=False).head(k)
    
    rels = [1 if iid in relevant else 0 for iid in ranked.index]

    id_to_title = dict(zip(books_df["book_id"], books_df["title"]))
    top_books = []
    
    for iid, score in ranked.items():
        top_books.append({
            "book_id": iid,
            "title": id_to_title.get(iid, "Unknown"),
            "predicted_rating": round(float(score), 3)
        })
    
    print("=" * 80)
    print(f"User ID: {user_id}")
    print("-" * 60)
    print(f"Output: Top-{k} recommended books:")
    
    for i, book in enumerate(top_books, start=1):
        print(
            f"{i}. {book['title']} "
            f"(Book ID: {book['book_id']}, "
            f"Predicted Rating: {book['predicted_rating']})"
        )
    
    print("-" * 60)
    print("Relevance Vector:", rels)
    print(textwrap.fill(f"Relevant Books in Test: {list(relevant)}", width=85))

In [149]:
recommend_for_user(user_id=45485)

User ID: 45485
------------------------------------------------------------
Output: Top-5 recommended books:
1. The Lovely Bones (Book ID: 22, Predicted Rating: 3.417)
2. The Secret Life of Bees (Book ID: 57, Predicted Rating: 3.188)
3. The Joy Luck Club (Book ID: 118, Predicted Rating: 3.026)
4. Animal Farm (Book ID: 14, Predicted Rating: 2.985)
5. A Thousand Splendid Suns (Book ID: 67, Predicted Rating: 2.783)
------------------------------------------------------------
Relevance Vector: [1, 0, 0, 0, 0]
Relevant Books in Test: [2, 3, 515, 5, 661, 22, 534, 409, 160, 2722, 162, 2215, 45,
6190, 180, 825, 317, 63, 2367, 457, 76, 848, 339, 83, 471, 727, 90, 476, 1373, 98,
5860, 111, 495, 1393, 115, 372, 4597]


In [150]:
recommend_for_user(user_id=38897)

User ID: 38897
------------------------------------------------------------
Output: Top-5 recommended books:
1. Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) (Book ID: 18, Predicted Rating: 4.585)
2. A Storm of Swords (A Song of Ice and Fire, #3) (Book ID: 135, Predicted Rating: 3.967)
3. The Fellowship of the Ring (The Lord of the Rings, #1) (Book ID: 19, Predicted Rating: 3.285)
4. The Adventures of Huckleberry Finn (Book ID: 58, Predicted Rating: 2.928)
5. Hamlet (Book ID: 125, Predicted Rating: 2.778)
------------------------------------------------------------
Relevance Vector: [1, 1, 1, 1, 0]
Relevant Books in Test: [1, 131, 4, 135, 903, 138, 14, 18, 19, 20, 278, 6167, 3739,
160, 298, 2102, 58, 191, 194, 204, 479, 225, 740, 503]


In [151]:
recommend_for_user(user_id=41440, k=5)

User ID: 41440
------------------------------------------------------------
Output: Top-5 recommended books:
1. The Secret Garden (Book ID: 93, Predicted Rating: 2.271)
2. Where the Wild Things Are (Book ID: 102, Predicted Rating: 1.998)
3. The Very Hungry Caterpillar Board Book (Book ID: 296, Predicted Rating: 1.782)
4. The Giving Tree (Book ID: 85, Predicted Rating: 1.697)
5. Oh, The Places You'll Go! (Book ID: 358, Predicted Rating: 1.584)
------------------------------------------------------------
Relevance Vector: [0, 0, 0, 1, 0]
Relevant Books in Test: [8213, 2582, 798, 6945, 7590, 681, 2222, 4287, 2625, 1478,
1998, 1488, 1747, 85, 863, 740, 1253, 998, 493, 6142]
