# Project 3: Movie Recommendations
## Manas Gandhi (manaspg2), Neeya Devanagondi (neeyati2), Rahul Kasibhatla (rahulk8)


### Problem (A) - 10 most popular movies

We define "popularity" as the highest average rating among movies that have received at least a threshold number of ratings.

We set that threshold to 1000 ratings. This filters out niche movies whose high average rating rests on only a handful of ratings.
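
To illustrate why the threshold matters, here is a minimal sketch on made-up data. The toy ratings, the `toy_` names, and the toy threshold of 3 are assumptions for illustration only and are not part of the MovieLens data.

```python
import pandas as pd

# Toy ratings: movie 1 is widely rated, movie 2 has only two (perfect) ratings.
toy_ratings = pd.DataFrame({
    "MovieID": [1, 1, 1, 1, 2, 2],
    "Rating":  [5, 4, 4, 5, 5, 5],
})

toy_stats = toy_ratings.groupby("MovieID")["Rating"].agg(["mean", "count"])
toy_stats.columns = ["AvgRating", "RatingCount"]

# Without a threshold, the niche movie (ID 2, mean 5.0) ranks first.
unfiltered_top = toy_stats.sort_values("AvgRating", ascending=False).index[0]

# With a toy threshold of 3 ratings, only movie 1 (mean 4.5) remains.
toy_threshold = 3
filtered = toy_stats[toy_stats["RatingCount"] >= toy_threshold]
filtered_top = filtered.sort_values("AvgRating", ascending=False).index[0]

print(unfiltered_top, filtered_top)
```

The same pattern, with a threshold of 1000, is what the cells below apply to the full ratings table.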

In [30]:
import pandas as pd

In [31]:
#load data

data_folder = './ml-1m/'
ratings_file = f'{data_folder}ratings.dat'
movies_file = f'{data_folder}movies.dat'

#col names from the dataset's README - can't read them directly because the .dat files have no header row
r_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']
m_cols = ['MovieID', 'Title', 'Genres']

#separator = ::
ratings = pd.read_csv(ratings_file, sep='::', engine='python', names=r_cols)
movies = pd.read_csv(movies_file, sep='::', engine='python', names=m_cols, encoding='ISO-8859-1')

In [32]:
#calculate movie statistics
movie_stats = ratings.groupby('MovieID')['Rating'].agg(['mean', 'count'])
movie_stats.columns = ['AvgRating', 'RatingCount']

In [33]:
#merge the dfs
popular_movies_df = pd.merge(movies, movie_stats, on='MovieID')
print(popular_movies_df.head())

   MovieID                               Title                        Genres  \
0        1                    Toy Story (1995)   Animation|Children's|Comedy   
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy   
2        3             Grumpier Old Men (1995)                Comedy|Romance   
3        4            Waiting to Exhale (1995)                  Comedy|Drama   
4        5  Father of the Bride Part II (1995)                        Comedy   

   AvgRating  RatingCount  
0   4.146846         2077  
1   3.201141          701  
2   3.016736          478  
3   2.729412          170  
4   3.006757          296  


In [34]:
#get top ten most popular
min_ratings_threshold = 1000

filtered_popular_movies = popular_movies_df.loc[popular_movies_df['RatingCount'] >= min_ratings_threshold].copy()
top_10_popular = filtered_popular_movies.sort_values(by='AvgRating', ascending=False).head(10)

In [35]:
print("--- Part A: Top 10 'Most Popular' Movies ---")
print(f"Definition of \"popular\": Highest average rating with at least {min_ratings_threshold} ratings.\n")
top_10_display = top_10_popular[['Title', 'AvgRating', 'RatingCount']].copy()
top_10_display['AvgRating'] = top_10_display['AvgRating'].round(3)
top_10_display.index = range(1, 11)
print(top_10_display)

print("\n--- Full Movie Titles (in order) ---")
for i, title in enumerate(top_10_display['Title']):
    print(f"{i+1}. {title}")

print("\nNote: some titles end in ', The' because of how the data is stored.\nThe leading article is moved to the end so that titles sort alphabetically instead of all being listed under 'T'.")

--- Part A: Top 10 'Most Popular' Movies ---
Definition of "popular": Highest average rating with at least 1000 ratings.

                                                Title  AvgRating  RatingCount
1                    Shawshank Redemption, The (1994)      4.555         2227
2                               Godfather, The (1972)      4.525         2223
3                          Usual Suspects, The (1995)      4.517         1783
4                             Schindler's List (1993)      4.510         2304
5                      Raiders of the Lost Ark (1981)      4.478         2514
6                                  Rear Window (1954)      4.476         1050
7           Star Wars: Episode IV - A New Hope (1977)      4.454         2991
8   Dr. Strangelove or: How I Learned to Stop Worr...      4.450         1367
9                                   Casablanca (1942)      4.413         1669
10                            Sixth Sense, The (1999)      4.406         2459

--- Full Movie Titl

### Solution to Part A

Based on our definition of popular, here are the top 10 most popular movies, their average rating (out of 5), and the number of ratings they had:

| Title | AvgRating | RatingCount |
| :--- | ---: | ---: |
| The Shawshank Redemption (1994) | 4.555 | 2227 |
| The Godfather (1972) | 4.525 | 2223 |
| The Usual Suspects (1995) | 4.517 | 1783 |
| Schindler's List (1993) | 4.510 | 2304 |
| Raiders of the Lost Ark (1981) | 4.478 | 2514 |
| Rear Window (1954) | 4.476 | 1050 |
| Star Wars: Episode IV - A New Hope (1977) | 4.454 | 2991 |
| Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.450 | 1367 |
| Casablanca (1942) | 4.413 | 1669 |
| The Sixth Sense (1999) | 4.406 | 2459 |

### Problem (B) - Recommendation based on Item-Based Collaborative Filtering 

In [36]:
import numpy as np
from scipy.sparse import csr_matrix

#user/movie IDs to indices
unique_users = ratings['UserID'].unique()
unique_movies = movies['MovieID'].unique()
user_id_to_idx = {u: i for i, u in enumerate(np.sort(unique_users))}
movie_id_to_idx = {m: i for i, m in enumerate(np.sort(unique_movies))}
idx_to_movie_id = np.array(sorted(unique_movies))

#sparse matrix
rows = ratings['UserID'].map(user_id_to_idx).to_numpy()
cols = ratings['MovieID'].map(movie_id_to_idx).to_numpy()
vals = ratings['Rating'].astype(np.float32).to_numpy()
R = csr_matrix((vals, (rows, cols)), shape=(len(unique_users), len(unique_movies)))

#row means
row_nnz = np.diff(R.indptr)
row_sums = np.bincount(np.repeat(np.arange(len(unique_users)), row_nnz), weights=R.data, minlength=len(unique_users))
user_means = np.divide(row_sums, row_nnz, out=np.zeros_like(row_sums), where=row_nnz > 0)

#center
R_centered = R.copy().tocsr()
for u in range(len(unique_users)):
    start, end = R_centered.indptr[u], R_centered.indptr[u+1]
    if start < end:
        R_centered.data[start:end] -= user_means[u]
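
The per-user centering loop above could also be written without a Python loop. The following is a sketch on a toy matrix, not the notebook's code; the `toy_` names and values are assumptions for illustration. It relies on CSR storing each row's values contiguously in `.data`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3-user x 4-movie rating matrix (0 = unrated, stored implicitly).
toy_R = csr_matrix(np.array([
    [5, 3, 0, 0],
    [0, 4, 4, 0],
    [0, 0, 0, 0],   # a user with no ratings
], dtype=np.float32))

toy_nnz = np.diff(toy_R.indptr)                    # number of ratings per user
toy_sums = np.asarray(toy_R.sum(axis=1)).ravel()   # sum of ratings per user
toy_means = np.divide(toy_sums, toy_nnz,
                      out=np.zeros_like(toy_sums, dtype=np.float64),
                      where=toy_nnz > 0)

toy_centered = toy_R.copy()
# Repeating each user's mean by that user's rating count lines up
# element for element with the CSR .data array.
toy_centered.data -= np.repeat(toy_means, toy_nnz).astype(toy_centered.data.dtype)

print(toy_means)
print(toy_centered.toarray())
```

Users with no ratings contribute zero repeats, so they are skipped automatically, which mirrors the `if start < end` guard in the loop.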


In [37]:
R_bin = R.copy()
R_bin.data[:] = 1.0

counts = (R_bin.T @ R_bin).astype(np.int32)
counts_arr = counts.toarray()
D = (R_centered.T @ R_centered).astype(np.float32)

# Norms of each movie vector
col_norms = np.sqrt(D.diagonal()).astype(np.float32)
col_norms[col_norms == 0] = np.nan

# Cosine similarity
with np.errstate(invalid='ignore', divide='ignore'):
    cosine = D.toarray() / np.outer(col_norms, col_norms)

cosine[counts_arr < 3] = np.nan
np.fill_diagonal(cosine, np.nan)
sim = cosine
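
As a sanity check on the similarity computation above, the same steps can be run on a small dense matrix where the numbers are easy to verify by hand. This is a sketch under assumptions: the matrix `C`, the toy overlap threshold of 2 (standing in for the notebook's 3), and the shortcut of treating a centered value of 0 as "unrated" are all illustrative choices.

```python
import numpy as np

# Toy user-centered ratings (rows = users, cols = movies).
# In this toy only, 0 means "unrated".
C = np.array([
    [ 1.0, -1.0,  0.0],
    [ 0.5,  0.5, -1.0],
    [-1.0,  1.0,  0.0],
])
rated = (C != 0).astype(int)

co_counts = rated.T @ rated          # number of users who rated both movies
dots = C.T @ C                       # dot products between movie columns
norms = np.sqrt(np.diag(dots))       # norm of each movie column
toy_sim = dots / np.outer(norms, norms)

toy_min_overlap = 2                  # toy stand-in for the notebook's 3
toy_sim[co_counts < toy_min_overlap] = np.nan   # too few co-raters
np.fill_diagonal(toy_sim, np.nan)               # ignore self-similarity

print(toy_sim)
```

Movies 0 and 1 are rated by all three users, so their similarity survives; movie 2 shares only one rater with each of the others, so those entries are masked.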


In [38]:
#get top 30
top_k = 30
sim_topk = np.full_like(sim, np.nan, dtype=np.float32)

for j in range(len(unique_movies)):
    col = sim[:, j]
    valid_idx = np.where(~np.isnan(col))[0]
    if valid_idx.size == 0:
        continue
    k = min(top_k, valid_idx.size)
    top_idx = valid_idx[np.argpartition(col[valid_idx], -k)[-k:]]
    sim_topk[top_idx, j] = col[top_idx]
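
The top-k selection inside the loop above can be checked on a single toy column. The values and the `toy_k` of 2 below are made up for illustration.

```python
import numpy as np

# One toy similarity column; NaN marks pairs that were masked out.
col = np.array([0.2, np.nan, 0.9, 0.5, np.nan, 0.7])
toy_k = 2

valid_idx = np.where(~np.isnan(col))[0]
k = min(toy_k, valid_idx.size)
# argpartition puts the k largest values in the last k slots (in no particular order).
top_idx = valid_idx[np.argpartition(col[valid_idx], -k)[-k:]]

kept = np.full_like(col, np.nan)
kept[top_idx] = col[top_idx]
print(kept)
```

`np.argpartition` avoids a full sort of each column, which is why it is a reasonable fit when only the top 30 of several thousand entries are needed.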


In [39]:
#display similarities
id_to_title = dict(zip(movies['MovieID'], movies['Title']))
title_to_id = dict(zip(movies['Title'], movies['MovieID']))

def show_similarities_for_titles(titles, sim_matrix, k_show=10):
    for t in titles:
        mid = title_to_id.get(t)
        if mid is None:
            print(f"[WARN] Title not found: {t}")
            continue
        j = movie_id_to_idx[mid]
        col = sim_matrix[:, j]
        valid = np.where(~np.isnan(col))[0]
        order = valid[np.argsort(-col[valid])]
        top = order[:k_show]
        pairs = [(id_to_title[idx_to_movie_id[i]], float(col[i])) for i in top]
        print(f"\nTop {len(top)} similarities for: {t}")
        display(pd.DataFrame(pairs, columns=["Similar Movie", "Similarity"]))

five_titles = [
    "Toy Story (1995)",
    "GoldenEye (1995)",
    "Liar Liar (1997)",
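    "Lost World: Jurassic Park (1997)",  # not found as written (see the warning in the output); the dataset moves leading articles to the end, so the stored title is likely "Lost World: Jurassic Park, The (1997)"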
    "Sixth Sense, The (1999)"
]
show_similarities_for_titles(five_titles, sim_topk, k_show=10)



Top 10 similarities for: Toy Story (1995)


|  | Similar Movie | Similarity |
| ---: | :--- | ---: |
| 0 | Toy Story 2 (1999) | 0.502959 |
| 1 | Raiders of the Lost Ark (1981) | 0.31079 |
| 2 | Bug's Life, A (1998) | 0.301616 |
| 3 | Shawshank Redemption, The (1994) | 0.287588 |
| 4 | Star Wars: Episode V - The Empire Strikes Back... | 0.285666 |
| 5 | Star Wars: Episode IV - A New Hope (1977) | 0.283317 |
| 6 | Back to the Future (1985) | 0.279306 |
| 7 | Aladdin (1992) | 0.272698 |
| 8 | Princess Bride, The (1987) | 0.257914 |
| 9 | Wrong Trousers, The (1993) | 0.256799 |



Top 10 similarities for: GoldenEye (1995)


|  | Similar Movie | Similarity |
| ---: | :--- | ---: |
| 0 | Tomorrow Never Dies (1997) | 0.36227 |
| 1 | World Is Not Enough, The (1999) | 0.268518 |
| 2 | Man with the Golden Gun, The (1974) | 0.220534 |
| 3 | Thunderball (1965) | 0.200164 |
| 4 | For Your Eyes Only (1981) | 0.186418 |
| 5 | Dr. No (1962) | 0.180461 |
| 6 | Live and Let Die (1973) | 0.179791 |
| 7 | Spy Who Loved Me, The (1977) | 0.177945 |
| 8 | From Russia with Love (1963) | 0.160824 |
| 9 | Licence to Kill (1989) | 0.157537 |



Top 10 similarities for: Liar Liar (1997)


|  | Similar Movie | Similarity |
| ---: | :--- | ---: |
| 0 | Ace Ventura: Pet Detective (1994) | 0.191058 |
| 1 | Dumb & Dumber (1994) | 0.182468 |
| 2 | Tommy Boy (1995) | 0.170277 |
| 3 | Happy Gilmore (1996) | 0.145092 |
| 4 | Christmas Vacation (1989) | 0.137462 |
| 5 | Mask, The (1994) | 0.137106 |
| 6 | Wedding Singer, The (1998) | 0.137081 |
| 7 | Mrs. Doubtfire (1993) | 0.135377 |
| 8 | Billy Madison (1995) | 0.124647 |
| 9 | Lethal Weapon 2 (1989) | 0.116407 |


[WARN] Title not found: Lost World: Jurassic Park (1997)

Top 10 similarities for: Sixth Sense, The (1999)


|  | Similar Movie | Similarity |
| ---: | :--- | ---: |
| 0 | Shawshank Redemption, The (1994) | 0.383948 |
| 1 | Silence of the Lambs, The (1991) | 0.367204 |
| 2 | Usual Suspects, The (1995) | 0.36552 |
| 3 | Raiders of the Lost Ark (1981) | 0.343404 |
| 4 | Matrix, The (1999) | 0.342962 |
| 5 | Star Wars: Episode IV - A New Hope (1977) | 0.330914 |
| 6 | Saving Private Ryan (1998) | 0.323322 |
| 7 | Schindler's List (1993) | 0.307686 |
| 8 | Star Wars: Episode V - The Empire Strikes Back... | 0.305152 |
| 9 | American Beauty (1999) | 0.292499 |


In [40]:
pop_ranking = (filtered_popular_movies.sort_values(by=["AvgRating", "RatingCount"], ascending=[False, False])["MovieID"].tolist())

def ibcf_recommend(newuser_ratings, sim_topk_matrix, k_recs=10):
    num_movies_local = len(unique_movies)

    x = np.full(num_movies_local, np.nan, dtype=np.float32)
    for mid, r in newuser_ratings.dropna().items():
        if mid in movie_id_to_idx:
            x[movie_id_to_idx[mid]] = float(r)

    rated_mask = ~np.isnan(x)

    if rated_mask.sum() == 0:
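        # No ratings at all, so there is nothing to exclude. (Filtering on
        # newuser_ratings.index would also drop unrated movies whose IDs
        # appear in the index with NaN values.)
        take = pop_ranking[:k_recs]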
        return pd.DataFrame({
            "Title": [id_to_title[m] for m in take],
            "PredictedRating": [np.nan] * len(take)
        })

    u_mean = np.nanmean(x[rated_mask])
    x_centered = x.copy()
    x_centered[rated_mask] = x_centered[rated_mask] - u_mean

    preds = np.full(num_movies_local, np.nan, dtype=np.float32)

    for j in np.where(~rated_mask)[0]:
        s = sim_topk_matrix[:, j]

        valid_i = np.where(rated_mask & ~np.isnan(s))[0]
        if valid_i.size == 0:
            continue

        num = np.nansum(s[valid_i] * x_centered[valid_i])
        den = np.nansum(np.abs(s[valid_i]))

        if den == 0 or np.isnan(den) or np.isnan(num):
            continue

        pred = u_mean + num / den
        preds[j] = np.clip(pred, 1.0, 5.0)

    ranked_idx = np.where(~np.isnan(preds))[0]
    ranked_idx = ranked_idx[np.argsort(-preds[ranked_idx])]

    seen_movieids = set(newuser_ratings.dropna().index)
    recs = []
    for j in ranked_idx:
        mid = int(idx_to_movie_id[j])
        if mid in seen_movieids:
            continue
        recs.append((mid, float(preds[j])))
        if len(recs) >= k_recs:
            break

    if len(recs) < k_recs:
        chosen = {m for m, _ in recs}
        for mid in pop_ranking:
            if mid in seen_movieids or mid in chosen:
                continue
            recs.append((mid, np.nan))
            if len(recs) >= k_recs:
                break

    return pd.DataFrame({
        "Title": [id_to_title[m] for m, _ in recs],
        "PredictedRating": [p for _, p in recs]
    })
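
The prediction step inside `ibcf_recommend` is a similarity-weighted average of the user's mean-centered ratings, added back onto the user's mean. Here is a worked sketch for a single target movie; the similarity and rating values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical similarities of one unrated target movie to four other movies
# (NaN = not among the target's retained neighbours).
s = np.array([0.5, np.nan, -0.25, 0.25], dtype=np.float32)
# The user's ratings of those four movies (NaN = not rated).
x = np.array([5.0, 4.0, np.nan, 3.0], dtype=np.float32)

rated = ~np.isnan(x)
u_mean = np.nanmean(x)                            # (5 + 4 + 3) / 3 = 4.0
usable = rated & ~np.isnan(s)                     # movies 0 and 3

num = np.sum(s[usable] * (x[usable] - u_mean))    # 0.5*1 + 0.25*(-1) = 0.25
den = np.sum(np.abs(s[usable]))                   # 0.5 + 0.25 = 0.75
pred = float(np.clip(u_mean + num / den, 1.0, 5.0))

print(round(pred, 4))
```

One consequence of this formula: when only a single rated neighbour is usable, `num / den` reduces to plus or minus that one rating's deviation from the user's mean. For a user with ratings of 5 and 4 (mean 4.5), such a prediction is therefore either 5.0 or 4.0.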


In [41]:
#existing user
row_1500 = 1499
u_start, u_end = R.indptr[row_1500], R.indptr[row_1500+1]
u_cols = R.indices[u_start:u_end]
u_vals = R.data[u_start:u_end]
user1500_series = pd.Series(u_vals, index=idx_to_movie_id[u_cols])

print("\n--- Top 10 for user in row 1500 ---")
display(ibcf_recommend(user1500_series, sim_topk))


#hypothetical user
hypo = pd.Series(index=movies['MovieID'], dtype=np.float32)
hypo[:] = np.nan
hypo.at[title_to_id["Star Wars: Episode IV - A New Hope (1977)"]] = 5.0
hypo.at[title_to_id["Independence Day (ID4) (1996)"]] = 4.0

print("\n--- Top 10 for hypothetical user (SW=5, ID4=4) ---")
display(ibcf_recommend(hypo, sim_topk))



--- Top 10 for user in row 1500 ---


|  | Title | PredictedRating |
| ---: | :--- | ---: |
| 0 | Contender, The (2000) | 5.0 |
| 1 | Hilary and Jackie (1998) | 5.0 |
| 2 | War Room, The (1993) | 5.0 |
| 3 | Thin Red Line, The (1998) | 5.0 |
| 4 | Great Escape, The (1963) | 5.0 |
| 5 | Germinal (1993) | 5.0 |
| 6 | Prince of Egypt, The (1998) | 5.0 |
| 7 | Pink Flamingos (1972) | 5.0 |
| 8 | Unforgiven (1992) | 5.0 |
| 9 | Celebration, The (Festen) (1998) | 5.0 |



--- Top 10 for hypothetical user (SW=5, ID4=4) ---


|  | Title | PredictedRating |
| ---: | :--- | ---: |
| 0 | Toy Story (1995) | 5.0 |
| 1 | Indiana Jones and the Last Crusade (1989) | 5.0 |
| 2 | Forbidden Planet (1956) | 5.0 |
| 3 | Butch Cassidy and the Sundance Kid (1969) | 5.0 |
| 4 | When Harry Met Sally... (1989) | 5.0 |
| 5 | End of the Affair, The (1955) | 5.0 |
| 6 | Toy Story 2 (1999) | 5.0 |
| 7 | Star Trek: The Wrath of Khan (1982) | 5.0 |
| 8 | This Is Spinal Tap (1984) | 5.0 |
| 9 | Almost Famous (2000) | 5.0 |
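
Every prediction shown above sits at the clipped maximum of 5.0, so the order among those ties comes from how `np.argsort` happens to arrange equal values rather than from the predictions themselves. One possible way to make the ranking deterministic is an explicit secondary key. The sketch below breaks ties by rating count; this is an assumption for illustration, not what `ibcf_recommend` does, and the arrays are made up.

```python
import numpy as np

toy_preds = np.array([5.0, 4.2, 5.0, np.nan, 5.0], dtype=np.float32)
toy_counts = np.array([120, 900, 2500, 40, 800])   # hypothetical rating counts

cand = np.where(~np.isnan(toy_preds))[0]
# np.lexsort uses the LAST key as the primary key:
# primary = -prediction (highest first), secondary = -count (most rated first).
order = cand[np.lexsort((-toy_counts[cand], -toy_preds[cand]))]

print(order)
```

With this ordering, the three movies tied at 5.0 are ranked by how many ratings they have, and the 4.2 prediction comes last.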


#### Top 10 movies for user in row 1500

| Title | Predicted Rating |
| :--- | :---: |
| Thin Blue Line, The (1988) | 5.0 |
| To Kill a Mockingbird (1962) | 5.0 |
| Contender, The (2000) | 5.0 |
| Sex, Lies, and Videotape (1989) | 5.0 |
| 12 Angry Men (1957) | 5.0 |
| Clueless (1995) | 5.0 |
| Delicatessen (1991) | 5.0 |
| Richard III (1995) | 5.0 |
| Help! (1965) | 5.0 |
| Ace Ventura: When Nature Calls (1995) | 5.0 |

#### Top 10 movies for hypothetical user (SW=5, ID4=4)
| Title | Predicted Rating |
| :--- | :---: |
| Toy Story (1995) | 5.0 |
| Usual Suspects, The (1995) | 5.0 |
| Almost Famous (2000) | 5.0 |
| Circus (2000) | 5.0 |
| X-Men (2000) | 5.0 |
| Chicken Run (2000) | 5.0 |
| City of the Living Dead (Paura nella città dei... | 5.0 |
| Blazing Saddles (1974) | 5.0 |
| Close Encounters of the Third Kind (1977) | 5.0 |
| High Fidelity (2000) | 5.0 |