In [1]:
import pandas as pd

ratings_clean = pd.read_csv('IS 477 Project/ratings_clean.csv')
movies_clean = pd.read_csv('IS 477 Project/movies_clean.csv')

ratings_clean.head(), movies_clean.head()


(   userId  movieId  rating  timestamp      rating_datetime
 0       1        1     4.0  964982703  2000-07-30 18:45:03
 1       1        3     4.0  964981247  2000-07-30 18:20:47
 2       1        6     4.0  964982224  2000-07-30 18:37:04
 3       1       47     5.0  964983815  2000-07-30 19:03:35
 4       1       50     5.0  964982931  2000-07-30 18:48:51,
    movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  release_year  \
 0  Adventure|Animation|Children|Comedy|Fantasy        1995.0   
 1                   Adventure|Children|Fantasy        1995.0   
 2                               Comedy|Romance        1995.0   
 3                         Comedy|Drama|Romance        1995.0   
 4    

This cell loads the cleaned datasets saved at the end of Week 3, so Week 4 starts from data that is already standardized and filtered. Calling head() on both tables confirms that the files were read correctly and that key columns such as movieId, rating, title, and release_year are present.
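One detail worth checking after loading: read_csv returns rating_datetime as plain text unless it is asked to parse it. The sketch below shows the difference on a hypothetical two-row stand-in for ratings_clean.csv (the inline CSV text is an illustration, not the project file).

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for ratings_clean.csv, using the same column names.
csv_text = """userId,movieId,rating,timestamp,rating_datetime
1,1,4.0,964982703,2000-07-30 18:45:03
1,3,4.0,964981247,2000-07-30 18:20:47
"""

# Default load: rating_datetime comes back as text, not as a datetime column.
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named column to datetime64 while loading.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["rating_datetime"])

print(as_text["rating_datetime"].dtype)
print(as_dates["rating_datetime"].dtype)
```

If later weeks need to group ratings by year or month, loading with parse_dates avoids a separate conversion step.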

In [2]:
print("Ratings_clean shape:", ratings_clean.shape)
print("Movies_clean shape:", movies_clean.shape)

print("\nUnique movieIds in ratings_clean:", ratings_clean['movieId'].nunique())
print("Unique movieIds in movies_clean:", movies_clean['movieId'].nunique())


Ratings_clean shape: (100836, 5)
Movies_clean shape: (9729, 5)

Unique movieIds in ratings_clean: 9724
Unique movieIds in movies_clean: 9729


Here we look at the number of rows and unique movieId values in each cleaned dataset. ratings_clean references 9724 distinct movies and movies_clean describes 9729, so at most 9724 movies can match. This baseline lets us measure how many movies are lost or left unmatched after the merge.

In [4]:
movies_for_merge = movies_clean[['movieId', 'title', 'release_year', 'genres', 'genres_list']]

merged = ratings_clean.merge(movies_for_merge, on='movieId', how='inner')

merged.head()


Unnamed: 0,userId,movieId,rating,timestamp,rating_datetime,title,release_year,genres,genres_list
0,1,1,4.0,964982703,2000-07-30 18:45:03,Toy Story (1995),1995.0,Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy..."
1,5,1,4.0,847434962,1996-11-08 06:36:02,Toy Story (1995),1995.0,Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy..."
2,7,1,4.5,1106635946,2005-01-25 06:52:26,Toy Story (1995),1995.0,Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy..."
3,15,1,2.5,1510577970,2017-11-13 12:59:30,Toy Story (1995),1995.0,Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy..."
4,17,1,4.5,1305696483,2011-05-18 05:28:03,Toy Story (1995),1995.0,Adventure|Animation|Children|Comedy|Fantasy,"['Adventure', 'Animation', 'Children', 'Comedy..."


This cell performs the main Week 4 integration step: joining user ratings with movie metadata on the shared movieId column. We use an inner join, which keeps only rows whose movieId exists in both datasets, so every rating in the result is linked to its movie’s title, release year, and genres.
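To see what an inner join would discard before committing to it, one option is an outer join with pandas' indicator and validate arguments. This is a minimal sketch on hypothetical toy frames (the rows below are made up for illustration), not the project's actual merge.

```python
import pandas as pd

# Hypothetical toy data: movieId 99 has no metadata, movieId 2 has no ratings.
ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 3, 99], "rating": [4.0, 4.0, 3.5]}
)
movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    }
)

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
# validate="many_to_one" raises MergeError if movieId repeats on the movies side.
checked = ratings.merge(
    movies, on="movieId", how="outer", indicator=True, validate="many_to_one"
)
counts = checked["_merge"].value_counts()
print(counts)

# Keeping only the 'both' rows reproduces what how='inner' would return.
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(len(inner))
```

The validate check guards against a duplicated movieId in the metadata, which would otherwise silently multiply rating rows.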

In [5]:
print("Merged shape:", merged.shape)
print("\nUnique movieIds in merged:", merged['movieId'].nunique())

print("\nUnique movieIds in ratings_clean:", ratings_clean['movieId'].nunique())
print("Unique movieIds in movies_clean:", movies_clean['movieId'].nunique())


Merged shape: (100818, 9)

Unique movieIds in merged: 9711

Unique movieIds in ratings_clean: 9724
Unique movieIds in movies_clean: 9729


This cell compares the row count and number of unique movieIds in the merged dataset with the original cleaned tables. The merge keeps 100818 of 100836 ratings (18 rows dropped) and 9711 of the 9724 movieIds that appear in ratings_clean (13 dropped). These numbers document the loss of coverage that results from combining the two sources.
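The counts printed above can be turned into dropped totals and retention rates with a few lines of arithmetic. The sketch below hard-codes the printed numbers rather than recomputing them from the DataFrames; the variable names are illustrative.

```python
# Counts copied from the printed output above.
ratings_rows, merged_rows = 100836, 100818
ratings_ids, merged_ids = 9724, 9711

rows_dropped = ratings_rows - merged_rows  # 18 rating rows lost in the merge
ids_dropped = ratings_ids - merged_ids     # 13 movieIds lost in the merge

# Share of the original ratings and movieIds that survive the inner join.
row_retention = merged_rows / ratings_rows
id_retention = merged_ids / ratings_ids

print(f"Rows dropped: {rows_dropped} ({row_retention:.2%} retained)")
print(f"movieIds dropped: {ids_dropped} ({id_retention:.2%} retained)")
```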

In [7]:
ratings_ids = set(ratings_clean['movieId'])
movies_ids = set(movies_clean['movieId'])

unmatched_in_movies = ratings_ids - movies_ids

print("Number of movieIds in ratings with no metadata match:", len(unmatched_in_movies))
list(unmatched_in_movies)[:10]


Number of movieIds in ratings with no metadata match: 13


[171749,
 171495,
 176601,
 162414,
 171631,
 143410,
 171891,
 147250,
 167570,
 149334]

This cell looks for movieIds that appear in the ratings data but have no corresponding entry in the movies metadata. It finds 13 such IDs (ten of them are listed), which accounts for the drop from 9724 to 9711 unique movieIds in the merge. Documenting these IDs is part of “resolving mismatches,” because it shows exactly where metadata is missing and explains why some rating rows are dropped by the inner join.
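The set difference gives the unmatched IDs but not how many rating rows they affect. A possible follow-up, sketched here on hypothetical toy data, filters the ratings with isin and counts rows per unmatched ID.

```python
import pandas as pd

# Hypothetical toy data: movieIds 99 and 100 are rated but have no metadata.
ratings = pd.DataFrame(
    {
        "userId": [1, 2, 3, 4],
        "movieId": [1, 99, 99, 100],
        "rating": [4.0, 3.0, 5.0, 2.5],
    }
)
movies = pd.DataFrame({"movieId": [1, 2]})

# Same set difference as in the cell above.
unmatched_ids = set(ratings["movieId"]) - set(movies["movieId"])

# Rating rows that an inner join on movieId would drop.
dropped_rows = ratings[ratings["movieId"].isin(unmatched_ids)]

# How many ratings each unmatched movieId accounts for.
ratings_per_unmatched_id = dropped_rows["movieId"].value_counts()

print(len(dropped_rows))
print(ratings_per_unmatched_id)
```

Applied to the real tables, the length of dropped_rows should equal the difference between the ratings_clean and merged row counts.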

In [None]:
unmatched_in_ratings = movies_ids - ratings_ids

print("Number of movieIds with metadata but no ratings:", len(unmatched_in_ratings))
list(unmatched_in_ratings)[:10]


Number of movieIds with metadata but no ratings: 18


[3456, 6849, 4194, 85565, 32160, 26085, 3338, 6668, 7020, 30892]

This cell checks the opposite case: movies that exist in the metadata table but never appear in the ratings. There are 18 of them (9729 - 9711). Because the merge is an inner join, they contribute no rows to the merged dataset, which therefore covers only titles that have at least one user rating.
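Raw IDs are hard to interpret on their own. One way to see which titles have no ratings, assuming the metadata table carries a title column as movies_clean does, is to filter the metadata directly. The frames below are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data: movieId 2 has metadata but no ratings.
movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    }
)
ratings = pd.DataFrame({"userId": [1, 2, 1], "movieId": [1, 1, 3]})

# Keep metadata rows whose movieId never appears in the ratings table.
unrated = movies[~movies["movieId"].isin(ratings["movieId"])]

print(unrated[["movieId", "title"]])
```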

In [None]:
merged.to_csv('IS 477 Project/ratings_with_metadata.csv', index=False)


This cell writes the integrated dataset to a new CSV file, ratings_with_metadata.csv, inside the IS 477 Project folder. Saving this output lets later weeks (and project partners) load a single file that already contains both user ratings and movie attributes for exploratory analysis and visualization.
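One caveat when reloading the saved file: CSV has no list type, so a column like genres_list comes back as text such as "['Adventure', 'Animation']". The sketch below round-trips a hypothetical one-row frame through an in-memory buffer and restores the lists with ast.literal_eval; it illustrates the idea and does not touch the project file.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row stand-in for the merged table.
merged_demo = pd.DataFrame(
    {
        "userId": [1],
        "movieId": [1],
        "rating": [4.0],
        "genres_list": [["Adventure", "Animation"]],
    }
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
merged_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the list is stored as a string.
print(type(reloaded.loc[0, "genres_list"]))

# literal_eval safely parses the string back into a Python list.
reloaded["genres_list"] = reloaded["genres_list"].apply(ast.literal_eval)
print(reloaded.loc[0, "genres_list"])
```

If per-genre analysis is planned, an alternative is to split the pipe-delimited genres column after loading instead of relying on genres_list.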