# Project 3: Movie Recommendations
## Authors: Manas Gandhi (manaspg2), Neeya Devanagondi (neeyati2), Rahul Kasibhatla (rahulk8)


## Part A - The 10 most popular movies

We define a movie's "popularity" by its average rating, and we only consider movies that have received at least a certain number (a threshold) of ratings.

We set that threshold at 1000 ratings. This filters out niche movies whose average rating may be high only because very few people rated them.
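
As a minimal sketch of why the threshold matters, the toy example below ranks two movies with and without a minimum rating count. The ratings and the threshold of 3 are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy ratings, invented for illustration only (not from the dataset).
toy_ratings = pd.DataFrame({
    "MovieID": [1, 1, 1, 1, 2],
    "Rating":  [5, 4, 4, 5, 5],
})

toy_stats = toy_ratings.groupby("MovieID")["Rating"].agg(["mean", "count"])
toy_stats.columns = ["AvgRating", "RatingCount"]

# Without a threshold, the movie with a single 5-star rating ranks first.
unfiltered_top = toy_stats.sort_values("AvgRating", ascending=False).index[0]

# With a hypothetical threshold of 3 ratings, that movie is excluded.
toy_threshold = 3
filtered = toy_stats[toy_stats["RatingCount"] >= toy_threshold]
filtered_top = filtered.sort_values("AvgRating", ascending=False).index[0]

print("Top movie without threshold:", unfiltered_top)
print("Top movie with threshold:", filtered_top)
```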

In [1]:
import pandas as pd

In [2]:
#load data

data_folder = './ml-1m/'
ratings_file = f'{data_folder}ratings.dat'
movies_file = f'{data_folder}movies.dat'

# column names: the .dat files have no header row, so we supply the names ourselves
r_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']
m_cols = ['MovieID', 'Title', 'Genres']

# fields are separated by '::'; a multi-character separator requires engine='python'
ratings = pd.read_csv(ratings_file, sep='::', engine='python', names=r_cols)
movies = pd.read_csv(movies_file, sep='::', engine='python', names=m_cols, encoding='ISO-8859-1')
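
The parsing step can be tried without the data files. The sketch below reads two made-up lines in the same `UserID::MovieID::Rating::Timestamp` layout from an in-memory buffer; the sample values are hypothetical. pandas treats a separator longer than one character as a regular expression, which is why the Python engine is requested explicitly.

```python
import io

import pandas as pd

# Two invented lines in the '::'-separated layout used by ratings.dat.
sample = io.StringIO("1::10::5::978300000\n2::10::3::978300001\n")

r_cols = ["UserID", "MovieID", "Rating", "Timestamp"]

# No header row in the input, so column names are passed via `names`.
sample_ratings = pd.read_csv(sample, sep="::", engine="python", names=r_cols)

print(sample_ratings)
print(sample_ratings.dtypes)
```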

In [None]:
# calculate per-movie statistics: average rating and number of ratings
movie_stats = ratings.groupby('MovieID')['Rating'].agg(AvgRating='mean', RatingCount='count')

In [10]:
#merge the dfs
popular_movies_df = pd.merge(movies, movie_stats, on='MovieID')
print(popular_movies_df.head())

   MovieID                               Title                        Genres  \
0        1                    Toy Story (1995)   Animation|Children's|Comedy   
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy   
2        3             Grumpier Old Men (1995)                Comedy|Romance   
3        4            Waiting to Exhale (1995)                  Comedy|Drama   
4        5  Father of the Bride Part II (1995)                        Comedy   

   AvgRating  RatingCount  
0   4.146846         2077  
1   3.201141          701  
2   3.016736          478  
3   2.729412          170  
4   3.006757          296  
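
One detail of the merge worth keeping in mind: `pd.merge` performs an inner join by default, so any movie listed in `movies` that has no ratings at all would be dropped from the result. That does not affect the answer here, since such movies could never reach the 1000-rating threshold. The sketch below illustrates the difference on invented toy frames (the titles and numbers are hypothetical).

```python
import pandas as pd

# Toy frames, invented for illustration: movie 3 has no ratings.
toy_movies = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "Title": ["Movie A", "Movie B", "Movie C"],
})
toy_stats = pd.DataFrame(
    {"AvgRating": [4.0, 3.5], "RatingCount": [10, 4]},
    index=pd.Index([1, 2], name="MovieID"),
)

# Default inner join: only movies present in both frames survive.
inner = pd.merge(toy_movies, toy_stats, on="MovieID")

# Left join: every movie is kept; unrated ones get NaN statistics.
left = pd.merge(toy_movies, toy_stats, on="MovieID", how="left")

print(len(inner), len(left))
print(left)
```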


In [7]:
#get top ten most popular
min_ratings_threshold = 1000

filtered_popular_movies = popular_movies_df.loc[popular_movies_df['RatingCount'] >= min_ratings_threshold].copy()
top_10_popular = filtered_popular_movies.sort_values(by='AvgRating', ascending=False).head(10)

In [21]:
print("--- Part A: Top 10 'Most Popular' Movies ---")
print(f"Definition of \"popular\": Highest average rating with at least {min_ratings_threshold} ratings.\n")
top_10_display = top_10_popular[['Title', 'AvgRating', 'RatingCount']].copy()
top_10_display['AvgRating'] = top_10_display['AvgRating'].round(3)
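# rank from 1; use the actual row count in case fewer than 10 movies pass the threshold
top_10_display.index = range(1, len(top_10_display) + 1)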
print(top_10_display)

print("\n--- Full Movie Titles (in order) ---")
for rank, title in enumerate(top_10_display['Title'], start=1):
    print(f"{rank}. {title}")

print("\nNote: some titles end in ', The' because of how the dataset stores them. \nThe article is moved to the end to preserve alphabetical order, so those movies are not all listed under 'T'.")

--- Part A: Top 10 'Most Popular' Movies ---
Definition of "popular": Highest average rating with at least 1000 ratings.

                                                Title  AvgRating  RatingCount
1                    Shawshank Redemption, The (1994)      4.555         2227
2                               Godfather, The (1972)      4.525         2223
3                          Usual Suspects, The (1995)      4.517         1783
4                             Schindler's List (1993)      4.510         2304
5                      Raiders of the Lost Ark (1981)      4.478         2514
6                                  Rear Window (1954)      4.476         1050
7           Star Wars: Episode IV - A New Hope (1977)      4.454         2991
8   Dr. Strangelove or: How I Learned to Stop Worr...      4.450         1367
9                                   Casablanca (1942)      4.413         1669
10                            Sixth Sense, The (1999)      4.406         2459

--- Full Movie Titl

### Solution to Part A

Based on our definition of popular, these are the 10 most popular movies, with each movie's average rating (out of 5) and the number of ratings it received:

| Title | AvgRating | RatingCount |
| :--- | ---: | ---: |
| The Shawshank Redemption (1994) | 4.555 | 2227 |
| The Godfather (1972) | 4.525 | 2223 |
| The Usual Suspects (1995) | 4.517 | 1783 |
| Schindler's List (1993) | 4.510 | 2304 |
| Raiders of the Lost Ark (1981) | 4.478 | 2514 |
| Rear Window (1954) | 4.476 | 1050 |
| Star Wars: Episode IV - A New Hope (1977) | 4.454 | 2991 |
| Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.450 | 1367 |
| Casablanca (1942) | 4.413 | 1669 |
| The Sixth Sense (1999) | 4.406 | 2459 |

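The table above shows titles with the leading article restored (for example "The Godfather" rather than "Godfather, The"). A minimal sketch of one way to do that conversion is given below. The helper name `restore_leading_article` is hypothetical, and the pattern assumes the title ends with a four-digit year in parentheses and that only the English articles "The", "A", and "An" need handling; other title formats are returned unchanged.

```python
import re

# Matches titles such as "Godfather, The (1972)": name, trailing article, year.
_ARTICLE_RE = re.compile(
    r"^(?P<name>.+), (?P<article>The|A|An) (?P<year>\(\d{4}\))$"
)


def restore_leading_article(title: str) -> str:
    """Move a trailing ', The' / ', A' / ', An' back to the front of a title."""
    match = _ARTICLE_RE.match(title)
    if match is None:
        # No trailing article in the expected position: leave the title as-is.
        return title
    return f"{match['article']} {match['name']} {match['year']}"


print(restore_leading_article("Shawshank Redemption, The (1994)"))
print(restore_leading_article("Casablanca (1942)"))
```

With a DataFrame, this could be applied as `top_10_display['Title'].map(restore_leading_article)`. Truncated display strings would not match the pattern, so the full titles should be used as input.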
## Part B - Make a recommendation based on Item-Based Collaborative Filtering