## Movie Recommendation System Using Clustering

I don't know about you, but I love movies, maybe even more than TV shows. Movies have long been a cornerstone of entertainment, with people expressing diverse preferences regarding genres, themes, and styles. To dive deeper into these preferences, I will be using a comprehensive movie dataset, https://www.kaggle.com/code/ekim01/explore-movie-ratings/report, that features user ratings for films across various genres. This dataset was adapted from the MovieLens research, so I would also like to acknowledge: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.

Key attributes of this dataset include movie names, genres, release years (embedded in the title), and viewer ratings, which together provide a solid foundation for analysis. By applying clustering techniques, this project will explore patterns in user behavior and movie characteristics, aiming to answer questions such as:
* Can we identify clusters of users with similar tastes? 
* How can these clusters be utilized to deliver personalized movie recommendations? 
* Are there specific groups of movies that are universally loved or disliked?

Through these questions, this project seeks to uncover actionable insights into user preferences and enhance the movie recommendation experience.

### What Is Clustering? 
Clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarity, which lets patterns and structure in the data emerge without predefined labels. In this project, clustering will help analyze user ratings and movie characteristics to uncover groups of users with similar tastes and clusters of movies with comparable appeal. I plan to use K-Means clustering, which partitions the data into a predefined number of clusters by minimizing the variance within each cluster, to identify distinct user groups based on their rating behavior. This approach will enable personalized movie recommendations by associating users with clusters that reflect their preferences. I will also use agglomerative hierarchical clustering, which starts with every point in its own cluster and repeatedly merges the closest pair, to explore relationships between users or movies at varying levels of granularity and to reveal nested patterns in the data. Together, these techniques will address key questions such as identifying user groups with shared preferences and uncovering clusters of universally liked or disliked movies.
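
To make the two techniques concrete, here is a minimal sketch on made-up data rather than the MovieLens ratings. The `user_genre_ratings` matrix, its three genre columns, and the choice of two clusters are all assumptions for illustration only; the real feature matrix for this project would have to be built from the ratings data.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Hypothetical toy data: one row per user, one column per genre
# (action, comedy, drama). Values are invented mean ratings.
user_genre_ratings = np.array([
    [4.8, 2.0, 1.5],
    [4.5, 2.2, 1.8],
    [4.9, 1.7, 2.0],
    [1.5, 4.6, 4.2],
    [1.8, 4.4, 4.7],
    [2.0, 4.9, 4.5],
])

# K-Means: the number of clusters is chosen up front (2 is an assumption here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(user_genre_ratings)

# Agglomerative clustering: merges the closest clusters bottom-up.
# Ward linkage merges the pair that increases within-cluster variance least.
agglo = AgglomerativeClustering(n_clusters=2, linkage="ward")
agglo_labels = agglo.fit_predict(user_genre_ratings)

print("K-Means labels:      ", kmeans_labels)
print("Agglomerative labels:", agglo_labels)
# inertia_ is the within-cluster sum of squared distances that K-Means minimizes.
print("K-Means inertia:     ", kmeans.inertia_)
```

On this toy matrix, both methods should separate the three action fans from the three comedy and drama fans, although the numeric label assigned to each group is arbitrary. A new user could then be matched to a group with `kmeans.predict`, which is the idea behind cluster-based recommendations.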

In [91]:
%pip install scikit-learn
%pip install matplotlib
%pip install seaborn
%pip install pandas
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


Let's start off by importing our datasets.

In [93]:
import pandas as pd

In [94]:
# Read the two CSV files we are interested in: movie metadata and user ratings
df1 = pd.read_csv("ml-32m/movies.csv")
df2 = pd.read_csv("ml-32m/ratings.csv")

In [103]:
df1.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [106]:
df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858


In [107]:
combined_df = pd.merge(df1, df2, on='movieId', how='inner')

In [114]:
combined_df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,2.5,1169265231
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,3.0,850085076
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.0,1027305751
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,3.0,974704488
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,20,5.0,1553184230


In [116]:
aggregated_df = combined_df.groupby(['movieId', 'title', 'genres']).agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count')
).reset_index()



In [117]:
aggregated_df.head()

Unnamed: 0,movieId,title,genres,avg_rating,num_ratings
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.897438,68997
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.275758,28904
2,3,Grumpier Old Men (1995),Comedy|Romance,3.139447,13134
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.845331,2806
4,5,Father of the Bride Part II (1995),Comedy,3.059602,13154
