# AIE425 Intelligent Recommender Systems - Course Project
### 3.2. Part 1: PCA Method with Mean-Filling


**Name:** Menna Salem Elsayed

**ID:** 221101277

In [1]:
import pandas as pd
import numpy as np
# Load the raw ratings file
ratings_df = pd.read_csv(r"D:\project IRS\ratings.csv")

In [2]:
# Count ratings per user and per item
user_counts = ratings_df['userId'].value_counts()
item_counts = ratings_df['movieId'].value_counts()

# Filter items/users with at least 20 ratings (quality threshold)
valid_items = item_counts[item_counts >= 20].index.tolist()
valid_users = user_counts[user_counts >= 20].index.tolist()

print(f"Valid items (≥20 ratings): {len(valid_items):,}")
print(f"Valid users (≥20 ratings): {len(valid_users):,}")

# RANDOM SAMPLE - INCREASED sizes to ensure >100K ratings
np.random.seed(42)
N_ITEMS = 800  
N_USERS = 15000 

sample_items = set(np.random.choice(valid_items, size=min(N_ITEMS, len(valid_items)), replace=False))
sample_items.add(2)     
sample_items.add(8860)  
sample_users = set(np.random.choice(valid_users, size=min(N_USERS, len(valid_users)), replace=False))

filtered_df = ratings_df[
    (ratings_df['userId'].isin(sample_users)) & 
    (ratings_df['movieId'].isin(sample_items))
]

n_ratings = len(filtered_df)
n_users = filtered_df['userId'].nunique()
n_items = filtered_df['movieId'].nunique()

print("\n" + "="*60)
print("FILTERED DATASET (Random Sample)")
print("="*60)
print(f"Ratings: {n_ratings:,} (>100K) {'✓' if n_ratings > 100000 else '✗'}")
print(f"Users: {n_users:,} (>10K) {'✓' if n_users > 10000 else '✗'}")
print(f"Items: {n_items:,} (≥500) {'✓' if n_items >= 500 else '✗'}")

Valid items (≥20 ratings): 13,132
Valid users (≥20 ratings): 138,493

FILTERED DATASET (Random Sample)
Ratings: 142,888 (>100K) ✓
Users: 14,201 (>10K) ✓
Items: 800 (≥500) ✓
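
The count-based filtering used above can be illustrated on a tiny, made-up ratings table. This is a minimal sketch, assuming a hypothetical `toy` DataFrame and a threshold of 2 instead of 20; none of the values come from the course dataset.

```python
import pandas as pd

# Hypothetical toy ratings (not the course dataset)
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 3.5],
})

MIN_RATINGS = 2  # assumed threshold for the toy example (the notebook uses 20)

# Count ratings per user and per item, then keep those meeting the threshold
toy_user_counts = toy["userId"].value_counts()
toy_item_counts = toy["movieId"].value_counts()
keep_users = toy_user_counts[toy_user_counts >= MIN_RATINGS].index
keep_items = toy_item_counts[toy_item_counts >= MIN_RATINGS].index

# Keep only rows whose user AND item both pass the threshold
toy_filtered = toy[toy["userId"].isin(keep_users) & toy["movieId"].isin(keep_items)]
print(toy_filtered)
```

Because the counts are taken before the rows are filtered, a user or item that passes the threshold can still end up with fewer ratings inside the filtered sample.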


## Define Target Items (I1 and I2)

In [3]:
I1 = 2     # Popular item
I2 = 8860  # Less popular item
target_items = [I1, I2]
print("="*60)
print("TARGET ITEM SELECTION")
print("="*60)

# Show popularity in original dataset
ratings_original = pd.read_csv(r'D:\project IRS\section1_statistical_analysis\ratings_statistics.csv')
I1_count = len(ratings_original[ratings_original['movieId'] == I1])
I2_count = len(ratings_original[ratings_original['movieId'] == I2])
all_counts = ratings_original['movieId'].value_counts()
I1_pct = (all_counts < I1_count).mean() * 100
I2_pct = (all_counts < I2_count).mean() * 100

print(f"\nI1 (movieId={I1}): {I1_count:,} ratings ({I1_pct:.1f}th percentile) - POPULAR")
print(f"I2 (movieId={I2}): {I2_count:,} ratings ({I2_pct:.1f}th percentile) - LESS POPULAR")


TARGET ITEM SELECTION

I1 (movieId=2): 22,243 ratings (99.5th percentile) - POPULAR
I2 (movieId=8860): 1,306 ratings (90.0th percentile) - LESS POPULAR


In [4]:
# Verify target items in filtered data
print(f"I1 ({I1}) in data: {'✓' if I1 in filtered_df['movieId'].values else '✗'}")
print(f"I2 ({I2}) in data: {'✓' if I2 in filtered_df['movieId'].values else '✗'}")

I1 (2) in data: ✓
I2 (8860) in data: ✓


In [5]:
# Ensure basic items and matrices are defined
df = filtered_df
I1, I2 = 2, 8860
target_items = [I1, I2]

# Base Rating Matrices
R = df[df['movieId'].isin(target_items)].pivot_table(index='userId', columns='movieId', values='rating')
R_all = df.pivot_table(index='userId', columns='movieId', values='rating')

In [6]:
missing_I1 = R[I1].isna().sum()
missing_I2 = R[I2].isna().sum()

print("Number of Missing Ratings:")
print(f"I1 ({I1}) Missing Ratings = {missing_I1}")
print(f"I2 ({I2}) Missing Ratings = {missing_I2}")

Number of Missing Ratings:
I1 (2) Missing Ratings = 59
I2 (8860) Missing Ratings = 2278


In [7]:
print("Number of Ratings for Each Target Item")

for label, item in zip(["I1", "I2"], target_items):
    num_ratings = df[df["movieId"] == item]["rating"].count()
    print(f"{label} (movieId={item}): Number of Ratings = {num_ratings}")

Number of Ratings for Each Target Item
I1 (movieId=2): Number of Ratings = 2350
I2 (movieId=8860): Number of Ratings = 131


In [8]:
users_I1 = R[I1].notna().sum()
users_I2 = R[I2].notna().sum()

print("Number of Users Who Rated Each Target Item:")
print(f"I1 ({I1}) rated by {users_I1} users")
print(f"I2 ({I2}) rated by {users_I2} users")

Number of Users Who Rated Each Target Item:
I1 (2) rated by 2350 users
I2 (8860) rated by 131 users


In [9]:
# Users with a missing rating for each target item (used later for the predictions)
users_missing_I1 = R[R[I1].isna()].index.tolist()
users_missing_I2 = R[R[I2].isna()].index.tolist()

print("Users with Missing Ratings:")
print(f"\nI1 ({I1}) - Users with missing ratings:")
print(users_missing_I1)

print(f"\nI2 ({I2}) - Users with missing ratings:")
print(users_missing_I2)

Users with Missing Ratings:

I1 (2) - Users with missing ratings:
[2274, 7036, 8110, 8636, 13757, 14196, 14342, 16270, 17877, 19574, 19829, 22483, 22727, 28479, 29020, 34750, 35071, 35128, 35856, 38702, 39579, 40304, 49666, 50079, 53074, 55798, 56358, 61162, 62218, 68021, 68155, 69290, 76688, 81186, 81958, 84239, 84410, 88616, 89583, 90998, 93473, 93512, 94536, 94956, 97957, 100199, 101167, 108206, 109173, 110016, 118625, 120427, 123935, 124487, 130561, 132643, 134154, 134637, 138289]

I2 (8860) - Users with missing ratings:
[34, 170, 326, 370, 444, 457, 518, 554, 586, 648, 684, 694, 703, 725, 737, 789, 812, 847, 1136, 1181, 1255, 1302, 1313, 1339, 1374, 1516, 1525, 1568, 1579, 1643, 1672, 1743, 1825, 1887, 1927, 1959, 2155, 2306, 2487, 2671, 2676, 2705, 2755, 2807, 2839, 2875, 2925, 3170, 3192, 3264, 3318, 3348, 3352, 3405, 3450, 3457, 3483, 3596, 3639, 3646, 3663, 3792, 3935, 3948, 4083, 4172, 4229, 4244, 4392, 4410, 4418, 4525, 4528, 4532, 4533, 4549, 4574, 4577, 4581, 4600, 4605, 4

### 1. Mean rating for target items (I1, I2)

In [10]:
R = (
    df[df["movieId"].isin(target_items)]
    .pivot_table(index="userId", columns="movieId", values="rating")
)

print("\n User–Item Rating Matrix (R)")
print(R.head())


 User–Item Rating Matrix (R)
movieId  2     8860
userId             
34        3.0   NaN
124       2.0   3.0
170       3.0   NaN
326       3.0   NaN
370       4.0   NaN


In [11]:
item_means_target = R.mean(axis=0)

print("\n mean Rating for Each Target Item")
print(item_means_target)


 mean Rating for Each Target Item
movieId
2       3.213404
8860    3.152672
dtype: float64


### 2. Mean-filling missing ratings

In [12]:
Rf = R.fillna(item_means_target)

print("\nRatings AFTER Mean-Filling (Rf)")
print(Rf.head())
print("\nMissing values after mean-filling:")
print(Rf.isna().sum())


Ratings AFTER Mean-Filling (Rf)
movieId  2         8860
userId                 
34        3.0  3.152672
124       2.0  3.000000
170       3.0  3.152672
326       3.0  3.152672
370       4.0  3.152672

Missing values after mean-filling:
movieId
2       0
8860    0
dtype: int64
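
Mean-filling relies on `fillna` matching a Series of item means to the column labels. The sketch below shows this on a made-up 3-user matrix; `R_toy` and its values are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user, 2-item rating matrix with gaps
R_toy = pd.DataFrame(
    {2: [3.0, 2.0, np.nan], 8860: [np.nan, 3.0, 5.0]},
    index=pd.Index([34, 124, 170], name="userId"),
)

item_means_toy = R_toy.mean(axis=0)    # NaN entries are skipped by default
Rf_toy = R_toy.fillna(item_means_toy)  # Series keys are matched to column labels

print(item_means_toy)
print(Rf_toy)
```

Filling a column with its own mean leaves that column's mean unchanged, which is why the same item means can be reused for centering afterwards.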


### 3. Calculate the average rating for each item

In [13]:
item_means_all = R_all.mean(axis=0)

print("Step 3: Average Rating for Each Item (first 5):")
print(item_means_all.head())

Step 3: Average Rating for Each Item (first 5):
movieId
2      3.213404
34     3.622133
58     3.958824
75     2.500000
106    3.227273
dtype: float64


### 4. Centering (difference from item mean)

In [14]:
#Each value = Rating - Mean(Item)
X_centered = Rf - item_means_target

print("\n Centered Rating Matrix")
print(X_centered.head().round(2))



 Centered Rating Matrix
movieId  2     8860
userId             
34      -0.21  0.00
124     -1.21 -0.15
170     -0.21  0.00
326     -0.21  0.00
370      0.79  0.00


In [15]:
# A. Centering target items (using item means from Step 1)
X_centered = Rf - item_means_target

# B. Mean-filling and Centering full dataset (needed for Step 5)
Rf_all = R_all.fillna(item_means_all)
X_centered_all = Rf_all - item_means_all

print("Step 4: Centering completed.")
print("X_centered_all shape:", X_centered_all.shape)
print("Sample centered ratings (I1, I2):")
print(X_centered.head().round(2))

Step 4: Centering completed.
X_centered_all shape: (14201, 800)
Sample centered ratings (I1, I2):
movieId  2     8860
userId             
34      -0.21  0.00
124     -1.21 -0.15
170     -0.21  0.00
326     -0.21  0.00
370      0.79  0.00
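
After centering, every mean-filled entry becomes zero and each column mean is zero up to floating-point rounding. The following sketch checks this on a small made-up matrix; `Rf_toy` is a hypothetical stand-in for the mean-filled matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical mean-filled matrix (each gap already replaced by its item mean)
Rf_toy = pd.DataFrame(
    {2: [3.0, 2.0, 2.5], 8860: [4.0, 3.0, 5.0]},
    index=pd.Index([34, 124, 170], name="userId"),
)

# Subtract each item's mean from its column (the Series aligns on column labels)
X_toy = Rf_toy - Rf_toy.mean(axis=0)

print(X_toy)
# Column means of a centered matrix should be zero up to rounding error
print(np.allclose(X_toy.mean(axis=0), 0.0))
```

On real data the subtraction rarely cancels exactly, so filled entries may print as values on the order of 1e-16 instead of 0.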


### 5. Compute the covariance for each two items

The covariance between two items $i$ and $j$ is calculated as:
$$cov(i, j) = \frac{\sum_{u=1}^{N} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{N-1}$$

Where:
- $R_{u,i}$ is the rating of user $u$ for item $i$.
- $\bar{R}_i$ is the average rating of item $i$.
- $N$ is the number of users.
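
The formula can be checked numerically on a small matrix. This is a minimal sketch, assuming a hypothetical complete 4-user, 3-item matrix `Rf_toy`; it compares the matrix-product form with pandas' built-in sample covariance.

```python
import numpy as np
import pandas as pd

# Hypothetical complete rating matrix: 4 users x 3 items
Rf_toy = pd.DataFrame(
    [[4.0, 3.0, 5.0],
     [2.0, 3.0, 1.0],
     [5.0, 4.0, 4.0],
     [3.0, 2.0, 2.0]],
    columns=[10, 20, 30],
)

X = Rf_toy - Rf_toy.mean(axis=0)   # center each item column
N = X.shape[0]                     # number of users
cov_manual = (X.T @ X) / (N - 1)   # formula from the text

# Cross-check against pandas' sample covariance (which also divides by N - 1)
cov_pandas = Rf_toy.cov()
print(cov_manual.round(3))
print(np.allclose(cov_manual, cov_pandas))
```

With mean-filled data, a filled entry contributes zero to the numerator but still counts in $N$, so items with few real ratings tend to get small covariance values.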

In [16]:
N_users = X_centered_all.shape[0]

# Matrix multiplication: (Centered^T @ Centered) / (N-1)
cov_matrix_all = (X_centered_all.T @ X_centered_all) / (N_users - 1)


### 6. Generate the covariance matrix

In [17]:
print("Step 6: Covariance Matrix (Sample columns):")
print(cov_matrix_all.iloc[:5, :5])
print("Covariance Matrix Shape:", cov_matrix_all.shape)
print("\nCOVARIANCE BETWEEN I1 AND I2:")
print(cov_matrix_all.loc[[I1, I2], [I1, I2]])
print("Sub-matrix Shape:", cov_matrix_all.loc[[I1, I2], [I1, I2]].shape)

Step 6: Covariance Matrix (Sample columns):
movieId       2         34        58        75        106
movieId                                                  
2        0.147763  0.027919  0.004555  0.001036  0.000429
34       0.027919  0.298167  0.007175 -0.000425  0.000527
58       0.004555  0.007175  0.085809  0.000655  0.000680
75       0.001036 -0.000425  0.000655  0.007535  0.000000
106      0.000429  0.000527  0.000680  0.000000  0.003006
Covariance Matrix Shape: (800, 800)

COVARIANCE BETWEEN I1 AND I2:
movieId      2         8860
movieId                    
2        0.147763  0.001742
8860     0.001742  0.007567
Sub-matrix Shape: (2, 2)


### 7. Determine the top-5 and top-10 peers for each of the target items (I1 and I2) using the transformed representation (covariance matrix)

In [18]:
# Top-5 and Top-10 Peers using Covariance Matrix
def get_top_peers(cov_matrix_all, target_item, top_k):
    # Get covariance values for the target item
    cov_values = cov_matrix_all[target_item].copy()

    # Remove self-covariance
    cov_values = cov_values.drop(target_item)

    # Sort by similarity (covariance) descending
    top_peers = cov_values.sort_values(ascending=False).head(top_k)

    return top_peers
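
The peer selection can be tried on a tiny covariance matrix. The sketch below uses a hypothetical helper `top_peers`, which mirrors the function above but uses `nlargest` instead of sorting (an assumption made for brevity), and a made-up 4-item matrix.

```python
import pandas as pd

def top_peers(cov_matrix, target_item, top_k):
    """Return the top_k items with the largest covariance to target_item."""
    cov_values = cov_matrix[target_item].drop(target_item)  # exclude self-covariance
    return cov_values.nlargest(top_k)

# Hypothetical symmetric 4-item covariance matrix
items = [1, 2, 3, 4]
cov_toy = pd.DataFrame(
    [[0.50, 0.10, 0.30, -0.20],
     [0.10, 0.40, 0.05, 0.00],
     [0.30, 0.05, 0.60, 0.02],
     [-0.20, 0.00, 0.02, 0.70]],
    index=items, columns=items,
)

print(top_peers(cov_toy, target_item=1, top_k=2))
```

Ranking by raw covariance tends to favour items with high variance; dividing by the items' standard deviations (correlation) would be one way to normalise, if that is desired.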


In [19]:
# Target Item I1

top_5_I1 = get_top_peers(cov_matrix_all, I1, 5)
top_10_I1 = get_top_peers(cov_matrix_all, I1, 10)

print(f"\nTop 5 Peers for I1 (movieId={I1}):")
for i, (pid, cov) in enumerate(top_5_I1.items(), 1):
    print(f"  {i}. movieId={pid}, covariance={cov:.4f}")

print(f"\nTop 10 Peers for I1 (movieId={I1}):")
for i, (pid, cov) in enumerate(top_10_I1.items(), 1):
    print(f"  {i}. movieId={pid}, covariance={cov:.4f}")



Top 5 Peers for I1 (movieId=2):
  1. movieId=586, covariance=0.0320
  2. movieId=34, covariance=0.0279
  3. movieId=788, covariance=0.0218
  4. movieId=1097, covariance=0.0206
  5. movieId=410, covariance=0.0184

Top 10 Peers for I1 (movieId=2):
  1. movieId=586, covariance=0.0320
  2. movieId=34, covariance=0.0279
  3. movieId=788, covariance=0.0218
  4. movieId=1097, covariance=0.0206
  5. movieId=410, covariance=0.0184
  6. movieId=2571, covariance=0.0172
  7. movieId=256, covariance=0.0169
  8. movieId=2028, covariance=0.0145
  9. movieId=919, covariance=0.0114
  10. movieId=2424, covariance=0.0108


In [20]:
# Target Item I2
top_5_I2 = get_top_peers(cov_matrix_all, I2, 5)
top_10_I2 = get_top_peers(cov_matrix_all, I2, 10)

print(f"\nTop 5 Peers for I2 (movieId={I2}):")
for i, (pid, cov) in enumerate(top_5_I2.items(), 1):
    print(f"  {i}. movieId={pid}, covariance={cov:.4f}")

print(f"\nTop 10 Peers for I2 (movieId={I2}):")
for i, (pid, cov) in enumerate(top_10_I2.items(), 1):
    print(f"  {i}. movieId={pid}, covariance={cov:.4f}")


Top 5 Peers for I2 (movieId=8860):
  1. movieId=2, covariance=0.0017
  2. movieId=2710, covariance=0.0013
  3. movieId=1097, covariance=0.0012
  4. movieId=3175, covariance=0.0012
  5. movieId=2571, covariance=0.0012

Top 10 Peers for I2 (movieId=8860):
  1. movieId=2, covariance=0.0017
  2. movieId=2710, covariance=0.0013
  3. movieId=1097, covariance=0.0012
  4. movieId=3175, covariance=0.0012
  5. movieId=2571, covariance=0.0012
  6. movieId=7004, covariance=0.0012
  7. movieId=1784, covariance=0.0012
  8. movieId=256, covariance=0.0012
  9. movieId=2699, covariance=0.0011
  10. movieId=1407, covariance=0.0011


### 8. Determine the reduced dimensional space for each user using the top-5 peers

### 9. Predict the original missing ratings for each of the target items (I1 and I2) using the top-5 peers

In [21]:
top5_I1 = get_top_peers(cov_matrix_all, I1, 5).index
top5_I2 = get_top_peers(cov_matrix_all, I2, 5).index
cov_I1_5 = cov_matrix_all.loc[top5_I1, top5_I1]
cov_I2_5 = cov_matrix_all.loc[top5_I2, top5_I2]

In [22]:
# Reduced space for I1 (Top-5 peers)
R_I1_5 = Rf_all[top5_I1]

# Centering
R_I1_5_centered = R_I1_5 - R_I1_5.mean(axis=0)

print("Reduced dimensional space for I1 (Top-5):")
print(R_I1_5_centered.head())


Reduced dimensional space for I1 (Top-5):
movieId      586           34        788           1097      410 
userId                                                           
11       0.905311  4.440892e-16  0.000000  7.571757e-01  1.045025
12       0.000000  3.778670e-01  1.087672 -4.440892e-16  0.000000
21       0.000000  4.440892e-16  0.000000  2.571757e-01  0.000000
33       0.000000  4.440892e-16  0.000000 -4.440892e-16  0.000000
34       0.000000 -6.221330e-01  0.000000 -4.440892e-16  1.045025


In [23]:
# Reduced space for I2 (Top-5 peers)
R_I2_5 = Rf_all[top5_I2]

# Centering
R_I2_5_centered = R_I2_5 - R_I2_5.mean(axis=0)

print("Reduced dimensional space for I2 (Top-5):")
print(R_I2_5_centered.head())


Reduced dimensional space for I2 (Top-5):
movieId          2     2710          1097          3175          2571
userId                                                               
11       8.881784e-16   0.0  7.571757e-01 -4.440892e-16  8.049766e-01
12       8.881784e-16   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16
21       8.881784e-16   0.0  2.571757e-01 -4.440892e-16 -8.881784e-16
33       8.881784e-16   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16
34      -2.134043e-01   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16


In [24]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(u, v):
    return np.dot(u, v) / (norm(u) * norm(v))


def predict_missing_ratings_actual(
    target_item,
    users_missing,
    reduced_users,
    ratings_matrix,
    item_means
):
    predictions = {}

    # Users who rated the target item
    users_rated = ratings_matrix[ratings_matrix[target_item].notna()].index

    for u in users_missing:
        u_vec = reduced_users.loc[u].values

        num = 0
        den = 0

        for v in users_rated:
            v_vec = reduced_users.loc[v].values
            sim = cosine_similarity(u_vec, v_vec)

            num += sim * ratings_matrix.loc[v, target_item]
            den += abs(sim)

        if den != 0:
            # similarity-weighted average of the neighbours' values in ratings_matrix
            pred_centered = num / den

            # convert to actual rating
            pred_actual = item_means[target_item] + pred_centered

            # clip to valid range [1, 5]
            predictions[u] = min(5, max(1, pred_actual))
        else:
            predictions[u] = np.nan

    return predictions
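
In the function above, `num / den` is a similarity-weighted average of the values in `ratings_matrix`, and the item mean is added afterwards. A common alternative averages the neighbours' deviations from the item mean instead, so that the added mean is not counted twice. The following is a minimal sketch of that variant on made-up data; `predict_centered` and its arguments are hypothetical names, and this is not the function that produced the outputs in this notebook.

```python
import numpy as np

def predict_centered(u_vec, neighbor_vecs, neighbor_ratings, item_mean):
    """Similarity-weighted prediction using deviations from the item mean."""
    u_norm = np.linalg.norm(u_vec)
    n_norms = np.linalg.norm(neighbor_vecs, axis=1)
    denom = u_norm * n_norms
    dots = neighbor_vecs @ u_vec
    # Cosine similarity; pairs involving a zero-norm vector get similarity 0
    sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

    weight = np.abs(sims).sum()
    if weight == 0:
        return item_mean  # no usable neighbour: fall back to the item mean

    deviation = np.sum(sims * (neighbor_ratings - item_mean)) / weight
    return float(np.clip(item_mean + deviation, 1.0, 5.0))

# Hypothetical data: one target user, three neighbours who rated the item
u = np.array([1.0, 0.0])
neighbors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ratings = np.array([4.0, 3.0, 2.0])

print(predict_centered(u, neighbors, ratings, item_mean=3.0))
```

Note also that when a mean-filled matrix is passed as `ratings_matrix`, `notna()` is true for every user, so the neighbour set includes users whose value for the target item is the filled mean.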



### Predictions of the original missing ratings for target item I1 using the top-5 peers

In [25]:
pred_I1_top5_actual = predict_missing_ratings_actual(
    target_item=I1,
    users_missing=users_missing_I1,
    reduced_users=R_I1_5_centered,
    ratings_matrix=Rf_all,
    item_means=item_means_all
)

print("Final Predicted Ratings for I1 (Top-5):")
for u, r in pred_I1_top5_actual.items():
    print(f"User {u} → {r:.2f}")


Final Predicted Ratings for I1 (Top-5):
User 2274 → 1.65
User 7036 → 5.00
User 8110 → 5.00
User 8636 → 1.89
User 13757 → 1.36
User 14196 → 2.52
User 14342 → 4.29
User 16270 → 1.00
User 17877 → 5.00
User 19574 → 1.27
User 19829 → 1.11
User 22483 → 5.00
User 22727 → 2.12
User 28479 → 5.00
User 29020 → 5.00
User 34750 → 1.27
User 35071 → 5.00
User 35128 → 2.16
User 35856 → 5.00
User 38702 → 5.00
User 39579 → 3.91
User 40304 → 5.00
User 49666 → 5.00
User 50079 → 5.00
User 53074 → 3.60
User 55798 → 2.16
User 56358 → 1.14
User 61162 → 2.33
User 62218 → 5.00
User 68021 → 5.00
User 68155 → 5.00
User 69290 → 5.00
User 76688 → 1.63
User 81186 → 5.00
User 81958 → 3.25
User 84239 → 5.00
User 84410 → 1.32
User 88616 → 5.00
User 89583 → 5.00
User 90998 → 5.00
User 93473 → 5.00
User 93512 → 1.36
User 94536 → 5.00
User 94956 → 5.00
User 97957 → 5.00
User 100199 → 1.27
User 101167 → 5.00
User 108206 → 3.70
User 109173 → 2.86
User 110016 → 5.00
User 118625 → 5.00
User 120427 → 3.23
User 123935 → 5.00
Us

### Predictions of the original missing ratings for target item I2 using the top-5 peers

In [26]:
def predict_missing_ratings_actual_fast(
    target_item,
    users_missing,
    reduced_users,
    ratings_matrix,
    item_means,
    K=30
):
    predictions = {}

    # Users who rated the target item
    users_rated = ratings_matrix[ratings_matrix[target_item].notna()].index

    # Matrices
    U_missing = reduced_users.loc[users_missing].values
    U_rated   = reduced_users.loc[users_rated].values
    ratings   = ratings_matrix.loc[users_rated, target_item].values

    # Normalize (Cosine similarity)
    U_missing_norm = U_missing / np.linalg.norm(U_missing, axis=1, keepdims=True)
    U_rated_norm   = U_rated   / np.linalg.norm(U_rated, axis=1, keepdims=True)

    mean_item = item_means[target_item]

    for i, u in enumerate(users_missing):

        sims = U_missing_norm[i] @ U_rated_norm.T

        # Top-K similar users
        topk_idx = np.argsort(np.abs(sims))[-K:]

        num = np.sum(sims[topk_idx] * ratings[topk_idx])
        den = np.sum(np.abs(sims[topk_idx]))

        if den != 0:
            pred_centered = num / den
            pred_actual = mean_item + pred_centered
            predictions[u] = min(5, max(1, pred_actual))
        else:
            predictions[u] = mean_item

    return predictions
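
Row normalisation divides each vector by its norm, which yields NaN for a row whose norm is exactly zero. The sketch below shows one way to guard against this and to compute all similarities in a single matrix product; `normalize_rows` and the tolerance `eps` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def normalize_rows(M, eps=1e-12):
    """Scale each row to unit length; rows with (near-)zero norm stay zero."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    safe = np.where(norms > eps, norms, 1.0)  # avoid division by zero
    return np.where(norms > eps, M / safe, 0.0)

# Hypothetical reduced vectors: 2 users without a rating, 3 users with one
U_missing = np.array([[1.0, 0.0], [0.0, 0.0]])
U_rated = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

# All pairwise cosine similarities in one matrix product
sims = normalize_rows(U_missing) @ normalize_rows(U_rated).T
print(sims)

# Indices of the K most similar rated users (by |similarity|) for each row
K = 2
topk_idx = np.argsort(np.abs(sims), axis=1)[:, -K:]
print(topk_idx)
```

Rows made up only of rounding residue (values near 1e-16, as in the printed reduced matrices) have a tiny but nonzero norm, so a tolerance such as the assumed `eps` treats them as zero instead of normalising noise.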

In [27]:
pred_I2_top5_actual = predict_missing_ratings_actual_fast(
    target_item=I2,
    users_missing=users_missing_I2,
    reduced_users=R_I2_5_centered,
    ratings_matrix=Rf_all,
    item_means=item_means_all,
    K=30
)

print("Total predictions:", len(pred_I2_top5_actual))
print("Final Predicted Ratings for I2 (Top-5):")
for u, r in pred_I2_top5_actual.items():
    print(f"User {u} → {r:.2f}")


Total predictions: 2278
Final Predicted Ratings for I2 (Top-5):
User 34 → 3.15
User 170 → 5.00
User 326 → 3.15
User 370 → 4.62
User 444 → 3.15
User 457 → 2.68
User 518 → 3.15
User 554 → 5.00
User 586 → 3.55
User 648 → 3.91
User 684 → 3.15
User 694 → 2.56
User 703 → 2.94
User 725 → 3.15
User 737 → 1.48
User 789 → 2.40
User 812 → 3.53
User 847 → 5.00
User 1136 → 3.15
User 1181 → 3.15
User 1255 → 3.15
User 1302 → 3.15
User 1313 → 3.15
User 1339 → 5.00
User 1374 → 3.97
User 1516 → 3.46
User 1525 → 2.28
User 1568 → 4.01
User 1579 → 3.15
User 1643 → 2.06
User 1672 → 4.22
User 1743 → 3.39
User 1825 → 3.91
User 1887 → 5.00
User 1927 → 4.42
User 1959 → 3.29
User 2155 → 3.15
User 2306 → 3.15
User 2487 → 2.94
User 2671 → 3.15
User 2676 → 5.00
User 2705 → 3.15
User 2755 → 5.00
User 2807 → 3.15
User 2839 → 3.15
User 2875 → 3.15
User 2925 → 5.00
User 3170 → 3.76
User 3192 → 3.15
User 3264 → 5.00
User 3318 → 5.00
User 3348 → 3.71
User 3352 → 5.00
User 3405 → 3.15
User 3450 → 3.38
User 3457 → 4.66
User 3483 

### 10. Determine the reduced dimensional space for each user using the top-10 peers

### 11. Predict the original missing ratings for each of the target items (I1 and I2) using the top-10 peers

In [28]:
top10_I1 = get_top_peers(cov_matrix_all, I1, 10).index
top10_I2 = get_top_peers(cov_matrix_all, I2, 10).index
cov_I1_10 = cov_matrix_all.loc[top10_I1, top10_I1]
cov_I2_10 = cov_matrix_all.loc[top10_I2, top10_I2]

In [29]:
# Reduced space for I1 (Top-10 peers)
R_I1_10 = Rf_all[top10_I1]

# Centering
R_I1_10_centered = R_I1_10 - R_I1_10.mean(axis=0)

print("Reduced dimensional space for I1 (Top-10):")
print(R_I1_10_centered.head())

Reduced dimensional space for I1 (Top-10):
movieId      586           34        788           1097      410   \
userId                                                              
11       0.905311  4.440892e-16  0.000000  7.571757e-01  1.045025   
12       0.000000  3.778670e-01  1.087672 -4.440892e-16  0.000000   
21       0.000000  4.440892e-16  0.000000  2.571757e-01  0.000000   
33       0.000000  4.440892e-16  0.000000 -4.440892e-16  0.000000   
34       0.000000 -6.221330e-01  0.000000 -4.440892e-16  1.045025   

movieId          2571          256           2028          919           2424  
userId                                                                         
11       8.049766e-01  2.306893e+00  9.356202e-01 -4.440892e-16 -8.881784e-16  
12      -8.881784e-16 -8.881784e-16  8.881784e-16 -4.440892e-16 -8.881784e-16  
21      -8.881784e-16 -8.881784e-16  8.881784e-16 -4.440892e-16 -8.881784e-16  
33      -8.881784e-16 -8.881784e-16  8.881784e-16 -4.440892e-16 -8.881784

In [30]:
# Reduced space for I2 (Top-10 peers)
R_I2_10 = Rf_all[top10_I2]

# Centering
R_I2_10_centered = R_I2_10 - R_I2_10.mean(axis=0)

print("Reduced dimensional space for I2 (Top-10):")
print(R_I2_10_centered.head())


Reduced dimensional space for I2 (Top-10):
movieId          2     2710          1097          3175          2571  \
userId                                                                  
11       8.881784e-16   0.0  7.571757e-01 -4.440892e-16  8.049766e-01   
12       8.881784e-16   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16   
21       8.881784e-16   0.0  2.571757e-01 -4.440892e-16 -8.881784e-16   
33       8.881784e-16   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16   
34      -2.134043e-01   0.0 -4.440892e-16 -4.440892e-16 -8.881784e-16   

movieId          7004  1784          256           2699          1407  
userId                                                                 
11      -4.440892e-16   0.0  2.306893e+00 -4.440892e-16 -4.440892e-16  
12      -4.440892e-16   0.0 -8.881784e-16 -4.440892e-16 -4.440892e-16  
21      -4.440892e-16   0.0 -8.881784e-16 -4.440892e-16 -4.440892e-16  
33      -4.440892e-16   0.0 -8.881784e-16 -4.440892e-16 -4.440892e-16  
34      -4.44

In [33]:
pred_I1_top10_actual = predict_missing_ratings_actual(
    target_item=I1,
    users_missing=users_missing_I1,
    reduced_users=R_I1_10_centered,   
    ratings_matrix=Rf_all,
    item_means=item_means_all
)

print("Final Predicted Ratings for I1 (Top-10):")
for u, r in pred_I1_top10_actual.items():
    print(f"User {u} → {r:.2f}")



Final Predicted Ratings for I1 (Top-10):
User 2274 → 2.47
User 7036 → 4.28
User 8110 → 4.90
User 8636 → 3.37
User 13757 → 2.19
User 14196 → 4.59
User 14342 → 3.08
User 16270 → 1.53
User 17877 → 3.74
User 19574 → 2.81
User 19829 → 2.19
User 22483 → 2.72
User 22727 → 4.34
User 28479 → 4.38
User 29020 → 3.04
User 34750 → 2.31
User 35071 → 4.10
User 35128 → 4.31
User 35856 → 3.04
User 38702 → 3.74
User 39579 → 3.29
User 40304 → 2.41
User 49666 → 3.56
User 50079 → 3.94
User 53074 → 2.56
User 55798 → 1.80
User 56358 → 2.07
User 61162 → 3.28
User 62218 → 1.80
User 68021 → 2.34
User 68155 → 3.39
User 69290 → 4.48
User 76688 → 4.10
User 81186 → 2.18
User 81958 → 3.81
User 84239 → 2.41
User 84410 → 2.41
User 88616 → 4.90
User 89583 → 4.90
User 90998 → 1.73
User 93473 → 4.58
User 93512 → 4.61
User 94536 → 3.21
User 94956 → 4.90
User 97957 → 4.84
User 100199 → 3.79
User 101167 → 4.04
User 108206 → 4.57
User 109173 → 4.58
User 110016 → 2.41
User 118625 → 3.53
User 120427 → 2.42
User 123935 → 2.41
U

In [34]:
pred_I2_top10_actual = predict_missing_ratings_actual_fast(
    target_item=I2,
    users_missing=users_missing_I2,
    reduced_users=R_I2_10_centered,   # Top-10
    ratings_matrix=Rf_all,
    item_means=item_means_all,
    K=30                              
)
print("Final Predicted Ratings for I2 (Top-10):")
for u, r in pred_I2_top10_actual.items():
    print(f"User {u} → {r:.2f}")

Final Predicted Ratings for I2 (Top-10):
User 34 → 3.15
User 170 → 5.00
User 326 → 3.15
User 370 → 3.99
User 444 → 3.15
User 457 → 2.49
User 518 → 3.15
User 554 → 4.21
User 586 → 2.69
User 648 → 5.00
User 684 → 3.15
User 694 → 1.53
User 703 → 2.73
User 725 → 3.15
User 737 → 1.06
User 789 → 2.38
User 812 → 4.20
User 847 → 5.00
User 1136 → 3.15
User 1181 → 3.15
User 1255 → 3.15
User 1302 → 3.15
User 1313 → 3.15
User 1339 → 4.84
User 1374 → 3.78
User 1516 → 3.30
User 1525 → 2.31
User 1568 → 3.79
User 1579 → 3.15
User 1643 → 3.55
User 1672 → 4.56
User 1743 → 4.58
User 1825 → 2.75
User 1887 → 4.45
User 1927 → 2.52
User 1959 → 2.72
User 2155 → 4.99
User 2306 → 5.00
User 2487 → 4.14
User 2671 → 3.15
User 2676 → 3.97
User 2705 → 3.15
User 2755 → 5.00
User 2807 → 3.15
User 2839 → 3.15
User 2875 → 5.00
User 2925 → 3.20
User 3170 → 4.01
User 3192 → 5.00
User 3264 → 3.17
User 3318 → 3.98
User 3348 → 3.35
User 3352 → 5.00
User 3405 → 3.40
User 3450 → 2.97
User 3457 → 1.91
User 3483 → 2.33
User 3596 → 4.83
U

### 12. Compare the results of step 9 with the results of step 11

In [35]:
import pandas as pd

# I1 comparison
compare_I1 = pd.DataFrame({
    "Top-5": pd.Series(pred_I1_top5_actual),
    "Top-10": pd.Series(pred_I1_top10_actual)
})

# I2 comparison
compare_I2 = pd.DataFrame({
    "Top-5": pd.Series(pred_I2_top5_actual),
    "Top-10": pd.Series(pred_I2_top10_actual)
})



In [36]:
print("=" * 70)
print("NUMERICAL COMPARISON")
print("=" * 70)

print("\nI1:")
print(f"Top-5  → Mean = {compare_I1['Top-5'].mean():.3f}, Std = {compare_I1['Top-5'].std():.3f}")
print(f"Top-10 → Mean = {compare_I1['Top-10'].mean():.3f}, Std = {compare_I1['Top-10'].std():.3f}")

print("\nI2:")
print(f"Top-5  → Mean = {compare_I2['Top-5'].mean():.3f}, Std = {compare_I2['Top-5'].std():.3f}")
print(f"Top-10 → Mean = {compare_I2['Top-10'].mean():.3f}, Std = {compare_I2['Top-10'].std():.3f}")


NUMERICAL COMPARISON

I1:
Top-5  → Mean = 3.674, Std = 1.596
Top-10 → Mean = 3.332, Std = 1.050

I2:
Top-5  → Mean = 3.523, Std = 0.948
Top-10 → Mean = 3.465, Std = 0.986
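
Means and standard deviations summarise each prediction set separately. Per-user differences and the share of predictions sitting at the clipping bounds can show more directly how the two sets differ. The sketch below uses made-up predictions; the dictionaries and the `summary` keys are hypothetical.

```python
import pandas as pd

# Hypothetical predictions keyed by userId (not the notebook's values)
pred_top5 = {101: 5.0, 102: 1.5, 103: 3.0, 104: 4.5}
pred_top10 = {101: 4.0, 102: 2.5, 103: 3.0, 104: 4.0}

compare = pd.DataFrame({"Top-5": pd.Series(pred_top5),
                        "Top-10": pd.Series(pred_top10)})
compare["abs_diff"] = (compare["Top-5"] - compare["Top-10"]).abs()

summary = {
    "mean_abs_diff": compare["abs_diff"].mean(),
    "max_abs_diff": compare["abs_diff"].max(),
    # Share of predictions equal to the clipping bounds 1.0 or 5.0
    "share_at_bounds_top5": compare["Top-5"].isin([1.0, 5.0]).mean(),
    "share_at_bounds_top10": compare["Top-10"].isin([1.0, 5.0]).mean(),
}
print(compare)
print(summary)
```

The same pattern could be applied to `compare_I1` and `compare_I2` from the cells above.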


## Conclusion

This notebook implemented the PCA method with mean-filling for rating prediction using:
- **800 items and 14,201 users** (142,888 ratings), giving an 800 × 800 covariance matrix
- **Target items**: I1 = movieId 2, I2 = movieId 8860

PCA with mean-filling was used to predict the missing ratings for items I1 and I2.

- The covariance matrix helped identify relationships between items.

- Dimensionality reduction (keeping only the top-5 or top-10 peer items for each target) made the data easier to handle.

- The method produced predictions for the missing ratings, but they are biased: many I2 predictions equal the item mean (3.15), and many I1 predictions with the top-5 peers are clipped at 5.00.


## Key Findings

- Mean-filling removed all missing values before the covariance computation.

- The 800 × 800 covariance matrix was used to rank the peers of each target item.

- The top-peer columns reduced each user's representation from 800 items to 5 or 10.

- Eigenvalues and eigenvectors of the covariance matrix were not computed in this notebook.

- Top-5 peers: I1 mean = 3.674, std = 1.596; I2 mean = 3.523, std = 0.948.

- Top-10 peers: I1 mean = 3.332, std = 1.050; I2 mean = 3.465, std = 0.986. For I1 the spread of the predictions is noticeably smaller than with the top-5 peers, while the I2 results change little.
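
Eigenvalues and eigenvectors are mentioned in the findings, but no cell in this notebook computes them. A minimal sketch of that step is given below, assuming a hypothetical 3 × 3 covariance matrix and made-up centered rows; it is an illustration of the standard eigendecomposition route, not a result of this project.

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (symmetric)
cov_toy = np.array([[2.0, 0.8, 0.2],
                    [0.8, 1.0, 0.1],
                    [0.2, 0.1, 0.5]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(cov_toy)
order = np.argsort(eigvals)[::-1]     # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()   # variance share per component
k = 2
W = eigvecs[:, :k]                    # top-k principal directions

# Project hypothetical centered user rows onto the top-k components
X_centered_toy = np.array([[0.5, -0.2, 0.1],
                           [-1.0, 0.3, 0.0]])
Z = X_centered_toy @ W
print(explained.round(3), Z.shape)
```

In the notebook's setting, `cov_matrix_all.values` and `X_centered_all.values` would take the place of the toy arrays.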

