# Business Understanding

This project aims to generate the top five movie recommendations for each user of a streaming service.


# Data Understanding

The datasets can be found at https://grouplens.org/datasets/movielens/latest/. They contain movie ratings ranging from 0.5 to 5 stars. The data includes user IDs (users selected at random), movie IDs (movies with one rating or more), and ratings data files (5-star scale).

In [1]:
import os
import pandas as pd

# Define the folder path for the "movierec" folder from Desktop
folder_path = os.path.join(os.path.expanduser("~"), "Desktop", "movierec")

# Define file paths for each CSV file within the "movierec" folder
file_paths = {
    'movies': os.path.join(folder_path, 'movies.csv'),
    'links': os.path.join(folder_path, 'links.csv'),
    'ratings': os.path.join(folder_path, 'ratings.csv'),
    'tags': os.path.join(folder_path, 'tags.csv')
}

# Create an empty dictionary to store DataFrames
dfs = {}

# Read each CSV file into a DataFrame and store it in the dictionary
for key, path in file_paths.items():
    dfs[key] = pd.read_csv(path)

# Check the number of rows in each DataFrame and display the first few rows with column names
for key, df in dfs.items():
    print(f"DataFrame: {key}, Number of Rows: {df.shape[0]}")
    print(df.head())  # Display the first few rows of the DataFrame with column names
    print("\n")



DataFrame: movies, Number of Rows: 9742
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


DataFrame: links, Number of Rows: 9742
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


DataFrame: ratings, Number of Rows: 100836
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  9649812

In [2]:
# Merge 'movies' DataFrame with 'ratings' DataFrame on 'movieId'
merged_df = pd.merge(dfs['movies'], dfs['ratings'], on='movieId', how='inner')

# Merge 'merged_df' with 'tags' DataFrame on 'movieId'
merged_df = pd.merge(merged_df, dfs['tags'], on='movieId', how='inner')

# Merge 'merged_df' with 'links' DataFrame on 'movieId'
merged_df = pd.merge(merged_df, dfs['links'], on='movieId', how='inner')

# Display the first few rows of the merged DataFrame
print(merged_df.head())


   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   userId_x  rating  timestamp_x  userId_y    tag  timestamp_y  imdbId  tmdbId  
0         1     4.0    964982703       336  pixar   1139045764  114709   862.0  
1         1     4.0    964982703       474  pixar   1137206825  114709   862.0  
2         1     4.0    964982703       567    fun   1525286013  114709   862.0  
3         5     4.0    847434962       336  pixar   1139045764  114709   862.0  
4         5     4.0    847434962       474  pixar   1137206825  114709   862.0  
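
The repeated Toy Story rows above come from joining ratings and tags on `movieId` alone: every rating of a movie is paired with every tag of that movie, and movies without tags drop out of an inner join. The following is a minimal sketch of that effect on made-up frames (the toy values and titles are assumptions, not rows from the MovieLens files):

```python
import pandas as pd

# Toy stand-ins for the movies, ratings and tags files (values are made up)
movies = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings = pd.DataFrame(
    {"userId": [1, 5, 7], "movieId": [1, 1, 2], "rating": [4.0, 4.0, 3.0]}
)
tags = pd.DataFrame(
    {"userId": [336, 474, 567], "movieId": [1, 1, 1], "tag": ["pixar", "pixar", "fun"]}
)

# Same merge pattern as in the notebook: inner joins keyed on movieId only
merged = movies.merge(ratings, on="movieId").merge(tags, on="movieId")

# Movie 1 has 2 ratings and 3 tags, so it contributes 2 * 3 rows;
# movie 2 has no tags and disappears from the inner join.
print(len(merged))
print(merged[["movieId", "userId_x", "rating", "userId_y", "tag"]])
```

Row counts in the merged frame therefore reflect rating-tag pairs rather than individual ratings, which matters for any later counting or train/test splitting.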


### Description Table

The table below lists each column of the merged DataFrame and its description.

In [3]:
# Print all the columns of the merged DataFrame
print(merged_df.columns)


Index(['movieId', 'title', 'genres', 'userId_x', 'rating', 'timestamp_x',
       'userId_y', 'tag', 'timestamp_y', 'imdbId', 'tmdbId'],
      dtype='object')


| Column      | Description                                                  |
|-------------|--------------------------------------------------------------|
| movieId     | Unique identifier for each movie                             |
| title       | Title of the movie along with the release year               |
| genres      | Genres associated with the movie, separated by a pipe character |
| userId_x    | User ID of the user who provided the rating                  |
| rating      | Rating given to the movie by the user                        |
| timestamp_x | Timestamp when the rating was given by the user              |
| userId_y    | User ID of the user who applied the tag                      |
| tag         | Tag applied to the movie by the user                         |
| timestamp_y | Timestamp when the tag was applied by the user               |
| imdbId      | IMDb ID of the movie                                         |
| tmdbId      | TMDb ID of the movie                                         |


# Data Preparation

## Searching for Outliers and Missing Data

The following check shows that the merged DataFrame has no missing values.

In [4]:
missing_values = merged_df.isnull().sum()
# Display missing values count for each column
print("Missing values count for each column:")
print(missing_values)

Missing values count for each column:
movieId        0
title          0
genres         0
userId_x       0
rating         0
timestamp_x    0
userId_y       0
tag            0
timestamp_y    0
imdbId         0
tmdbId         0
dtype: int64
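
Because the merged frame is built with inner joins, a clean result here does not by itself show that the source files are complete: rows without a match are silently dropped before the check runs. A hedged sketch of how this could be inspected, using made-up frames and pandas' `indicator=True` option (the toy data is an assumption):

```python
import pandas as pd

# Toy stand-ins for the ratings, tags and links files (values are made up)
ratings = pd.DataFrame(
    {"userId": [1, 5, 7], "movieId": [1, 1, 2], "rating": [4.0, 4.0, 3.0]}
)
tags = pd.DataFrame({"userId": [336, 474], "movieId": [1, 1], "tag": ["pixar", "fun"]})
links = pd.DataFrame(
    {"movieId": [1, 2], "imdbId": [114709, 113497], "tmdbId": [862.0, None]}
)

# A left join with indicator=True marks ratings whose movie has no tag;
# these are the rows an inner join would discard.
tagged_ids = tags[["movieId"]].drop_duplicates()
check = ratings.merge(tagged_ids, on="movieId", how="left", indicator=True)
dropped_by_inner = check[check["_merge"] == "left_only"]

# Missing values can also be counted on each source frame before merging
missing_tmdb = links["tmdbId"].isna().sum()

print(len(dropped_by_inner), missing_tmdb)
```

Running the same two checks on the real frames in `dfs` would show how many ratings the tag join removes and whether any source column has gaps.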


In [5]:
title = merged_df['title']

# Display all values in the specific column
print("All values in the title column:")
print(title)
top_200_titles = merged_df['title'].value_counts().head(200)

print("Top 200 values of the 'title' column:")
print(top_200_titles)

All values in the title column:
0                       Toy Story (1995)
1                       Toy Story (1995)
2                       Toy Story (1995)
3                       Toy Story (1995)
4                       Toy Story (1995)
                       ...              
233208    Solo: A Star Wars Story (2018)
233209         Gintama: The Movie (2010)
233210         Gintama: The Movie (2010)
233211         Gintama: The Movie (2010)
233212         Gintama: The Movie (2010)
Name: title, Length: 233213, dtype: object
Top 200 values of the 'title' column:
Pulp Fiction (1994)                                               55567
Fight Club (1999)                                                 11772
Star Wars: Episode IV - A New Hope (1977)                          6526
Léon: The Professional (a.k.a. The Professional) (Léon) (1994)     4655
2001: A Space Odyssey (1968)                                       4469
                                                                  ...  
High

## Dummy Variables Created for Genre

To make the data easier to work with, dummy variables were created for the genres: 19 genre indicators plus one `(no genres listed)` indicator.

In [6]:
# Extract the 'genres' column
genres = merged_df['genres']

# Display all unique values in the 'genres' column
print("All unique genres:")
print(genres.unique())


All unique genres:
['Adventure|Animation|Children|Comedy|Fantasy'
 'Adventure|Children|Fantasy' 'Comedy|Romance' 'Comedy'
 'Comedy|Drama|Romance' 'Drama' 'Crime|Drama' 'Drama|Romance'
 'Comedy|Crime|Thriller' 'Crime|Drama|Horror|Mystery|Thriller'
 'Adventure|Drama|Fantasy|Mystery|Sci-Fi' 'Mystery|Sci-Fi|Thriller'
 'Children|Drama' 'Children|Comedy' 'Drama|War' 'Comedy|Drama|Thriller'
 'Mystery|Thriller' 'Crime|Mystery|Thriller' 'Drama|Horror|Thriller'
 'Comedy|Drama' 'Adventure|Comedy|Crime|Romance'
 'Adventure|Children|Comedy|Musical' 'Action|Drama|War'
 'Crime|Drama|Thriller' 'Documentary' 'Adventure|Drama|IMAX'
 'Action|Adventure|Comedy|Crime' 'Action|Adventure|Mystery|Sci-Fi'
 'Drama|Thriller|War' 'Action|Crime|Thriller' 'Drama|Musical|Romance'
 'Drama|Thriller' 'Action|Adventure|Sci-Fi' 'Drama|Horror|Sci-Fi'
 'Action|Crime|Drama|Thriller' 'Comedy|Crime|Drama|Thriller'
 'Comedy|Drama|Fantasy' 'Adventure|Drama|Sci-Fi' 'Action|Sci-Fi|Thriller'
 'Drama|Mystery|Thriller' 'Comedy|Drama|

In [7]:
# Split the 'genres' column into separate genre columns
genres_split = merged_df['genres'].str.get_dummies(sep='|')

# Concatenate the dummy genre columns with the original DataFrame
merged_df_with_dummies = pd.concat([merged_df, genres_split], axis=1)

# Drop the original 'genres' column
merged_df_with_dummies.drop(columns=['genres'], inplace=True)

# Print the first few rows of the DataFrame with dummy variables
print("DataFrame with dummy variables for genres:")
print(merged_df_with_dummies.head())


DataFrame with dummy variables for genres:
   movieId             title  userId_x  rating  timestamp_x  userId_y    tag  \
0        1  Toy Story (1995)         1     4.0    964982703       336  pixar   
1        1  Toy Story (1995)         1     4.0    964982703       474  pixar   
2        1  Toy Story (1995)         1     4.0    964982703       567    fun   
3        1  Toy Story (1995)         5     4.0    847434962       336  pixar   
4        1  Toy Story (1995)         5     4.0    847434962       474  pixar   

   timestamp_y  imdbId  tmdbId  ...  Film-Noir  Horror  IMAX  Musical  \
0   1139045764  114709   862.0  ...          0       0     0        0   
1   1137206825  114709   862.0  ...          0       0     0        0   
2   1525286013  114709   862.0  ...          0       0     0        0   
3   1139045764  114709   862.0  ...          0       0     0        0   
4   1137206825  114709   862.0  ...          0       0     0        0   

   Mystery  Romance  Sci-Fi  Thriller

In [8]:
# Print all column names in merged_df_with_dummies
print("Column names in merged_df_with_dummies:")
print(merged_df_with_dummies.columns)


Column names in merged_df_with_dummies:
Index(['movieId', 'title', 'userId_x', 'rating', 'timestamp_x', 'userId_y',
       'tag', 'timestamp_y', 'imdbId', 'tmdbId', '(no genres listed)',
       'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')


# Exploratory Data Analysis

In [9]:
# To get basic statistics for all numeric columns in the DataFrame
all_numeric_stats = merged_df_with_dummies.describe()

# Printing the summary statistics for your review
print(all_numeric_stats)


             movieId       userId_x         rating   timestamp_x  \
count  233213.000000  233213.000000  233213.000000  2.332130e+05   
mean    12319.999443     309.688191       3.966535  1.213524e+09   
std     28243.919401     178.206387       0.968637  2.250448e+08   
min         1.000000       1.000000       0.500000  8.281246e+08   
25%       296.000000     156.000000       3.500000  1.017365e+09   
50%      1198.000000     309.000000       4.000000  1.217325e+09   
75%      4638.000000     460.000000       5.000000  1.443201e+09   
max    193565.000000     610.000000       5.000000  1.537799e+09   

            userId_y   timestamp_y        imdbId         tmdbId  \
count  233213.000000  2.332130e+05  2.332130e+05  233213.000000   
mean      470.683564  1.384774e+09  2.610632e+05    9378.277742   
std       153.329632  1.534621e+08  4.414411e+05   36943.139800   
min         2.000000  1.137179e+09  1.234900e+04      11.000000   
25%       424.000000  1.242494e+09  1.103570e+05    

In [10]:
# Select the rows whose movie has no genres listed
no_genre_movies = merged_df_with_dummies[merged_df_with_dummies['(no genres listed)'] == 1]

# Count those rows (one per rating/tag combination, not one per distinct movie)
no_genre_count = no_genre_movies.shape[0]

# Display the count and a preview of the affected rows
print(f"Number of movies with no genres listed: {no_genre_count}")
print(no_genre_movies[['movieId', 'title']].head())



Number of movies with no genres listed: 3
        movieId     title
232424   156605  Paterson
232425   156605  Paterson
232426   156605  Paterson


In [11]:
# Count instances of "Paterson" by title
paterson_count_by_title = merged_df_with_dummies[merged_df_with_dummies['title'] == 'Paterson'].shape[0]
print(f"Number of instances of 'Paterson' by title: {paterson_count_by_title}")


Number of instances of 'Paterson' by title: 3


### Removal of 'Paterson'

'Paterson' was removed because it has no genres listed, and dropping its 3 rows has minimal impact on the dataset.

In [12]:
# Removing any movie with 'Paterson'
filtered_df = merged_df_with_dummies[merged_df_with_dummies['title'] != "Paterson"]

# Confirming removal
print(filtered_df.shape)
print("Paterson" in filtered_df['title'].values)


(233210, 30)
False


In [13]:
exact_duplicates = filtered_df.duplicated(keep=False)
print(f"Number of exact duplicate rows: {exact_duplicates.sum()}")


Number of exact duplicate rows: 0


In [14]:
user_ratings_count = filtered_df['userId_x'].value_counts()
user_tags_count = filtered_df['userId_y'].value_counts()

print(user_ratings_count.describe())


count     610.000000
mean      382.311475
std       365.388007
min         5.000000
25%       106.500000
50%       280.000000
75%       525.000000
max      2455.000000
Name: userId_x, dtype: float64


In [15]:
# Checking for missing values in the 'rating' column
missing_ratings = filtered_df['rating'].isnull().sum()

print(f"Number of missing ratings: {missing_ratings}")


Number of missing ratings: 0


In [16]:
# Inspect the 'tag' column in the DataFrame filtered_df
tag_column = filtered_df['tag']

# Print some sample entries from the 'tag' column
print("Sample entries from the 'tag' column:")
print(tag_column.head())


Sample entries from the 'tag' column:
0    pixar
1    pixar
2      fun
3    pixar
4    pixar
Name: tag, dtype: object


In [17]:
# Count the number of unique tags
total_tags = filtered_df['tag'].nunique()

# Display the total number of unique tags
print("Total number of unique tags:", total_tags)


Total number of unique tags: 1584


In [18]:
# Count the occurrences of each tag
tag_counts = filtered_df['tag'].value_counts()

# Filter tags with over 300 occurrences
tags_over_300 = tag_counts[tag_counts > 300]

# Count the number of tags with over 300 occurrences
num_tags_over_300 = len(tags_over_300)

# Display the number of tags and the tags themselves
print("Number of tags with over 300 occurrences:", num_tags_over_300)
print("Tags with over 300 occurrences:")
print(tags_over_300)



Number of tags with over 300 occurrences: 269
Tags with over 300 occurrences:
sci-fi                 2527
thought-provoking      2487
twist ending           2434
atmospheric            2227
dark comedy            2056
                       ... 
organised crime         307
drug overdose           307
biblical references     307
drugs & music           307
alternate universe      303
Name: tag, Length: 269, dtype: int64


In [19]:

tag_counts = filtered_df['tag'].value_counts()

# Filtering tags with over 300 occurrences 
tags_over_300 = tag_counts[tag_counts > 300]

# Create a boolean series where each tag in 'filtered_df' is checked against 'tags_over_300'
tags_over_300_filter = filtered_df['tag'].isin(tags_over_300.index)

# Create 'analysis_df' by filtering 'filtered_df' using 'tags_over_300_filter'
analysis_df = filtered_df[tags_over_300_filter].copy()



In [20]:


# Normalize the 'tag' column
analysis_df['tag'] = analysis_df['tag'].str.lower().str.replace('-', '').str.replace(r'\W+', '', regex=True)

# count the occurrences of each normalized tag
tag_counts = analysis_df['tag'].value_counts()

print(tag_counts)


atmospheric               2534
scifi                     2527
thoughtprovoking          2487
twistending               2434
darkcomedy                2056
                          ... 
bloodsplatters             307
badlanguage                307
intertwiningstorylines     307
killerasprotagonist        307
alternateuniverse          303
Name: tag, Length: 259, dtype: int64
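
The drop from 269 to 259 distinct tags suggests that some raw tags differ only in case or punctuation and merge once normalized. Since the 300-occurrence filter above runs on the raw tags, one option is to normalize first and count afterwards. A minimal sketch with made-up tags (the helper `normalize_tag` is a hypothetical name that mirrors the string operations in this cell):

```python
import re
from collections import Counter


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and remove every non-word character."""
    return re.sub(r"\W+", "", tag.lower())


# Made-up raw tags: several spellings of the same idea
raw_tags = ["Sci-Fi", "sci-fi", "scifi", "Atmospheric", "atmospheric", "fun"]

raw_counts = Counter(raw_tags)
normalized_counts = Counter(normalize_tag(t) for t in raw_tags)

print(len(raw_counts), len(normalized_counts))
print(normalized_counts.most_common())
```

With this order of operations, a frequency threshold applies to the combined count of all spellings rather than to each spelling separately.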


In [21]:
# Normalize all tags to lowercase for consistent mapping
analysis_df['normalized_tag'] = analysis_df['tag'].str.lower().str.strip()
combined_tags_to_category = {
    'disney': 'family', 'pixar': 'family', 'animation': 'family', 'children': 'family',
    'superhero': 'superhero', 'comicbook': 'superhero', 'powerfulending': 'twist' ,
    'twistending': 'twist', 'clever': 'twist', 'mystery': 'twist', 'twist': 'twist', 'unpredictable': 'twist' , 'enigmatic': 'twist' ,
    'drugs': 'drugfilms', 'hallucinatory': 'drugfilms', 'coke': 'drugfilms', 'drugoverdose': 'drugfilms', 'drugsmusic': 'drugfilms' ,'drugmusic': 'drugfilms', 'heroin': 'drugfilms' ,
    'classic': 'classicmovies', 'classicmovie': 'classicmovies', 'genius': 'classicmovies' , 'shrimp': 'classicmovies', 'dinosaur': 'classicmovies' , 'indianajones': 'classicmovies' , 'masterpiece': 'classicmovies' ,
    'bubbagumpshrimp': 'classicmovies', 'imdbtop250': 'classicmovies', 'lieutenantdan': 'classicmovies', 'classicmovie': 'classicmovies' ,
    'crime': 'orgcrime', 'hitmen': 'orgcrime', 'mafia': 'orgcrime', 'organizedcrime': 'orgcrime', 'oldiebutgoodie': 'classicmovies' ,
    'gangsters': 'orgcrime', 'gangster': 'orgcrime', 'mobsters': 'orgcrime', 'mobster': 'orgcrime', 'hitman': 'orgcrime' , 'revenge': 'action' ,
    'organisedcrime': 'orgcrime', 'innetflixqueue': 'classicmovies', 'fastpaced': 'action' , 'gritty': 'action' , 'challenging': 'psythriller' ,
    'surreal': 'graphics', 'visually appealing': 'graphics', 'visuallyappealing': 'graphics', 'dreamlike': 'graphics', 'oscarbestcinematography': 'graphics' , 'cinematography': 'graphics' ,'beautifulscenery':'graphics', 'beautiful': 'graphics', 'oscarbestcinematography': 'graphics',
    'violence': 'action', 'action': 'action', 'guns': 'action', 'brucewillis': 'action', 'bloody': 'action' , 'actionpacked': 'action' , 'heist': 'action' ,
    'brad pitt': 'action', 'violent': 'action', 'bigboyswithguns': 'action', 'arnoldschwarzenegger': 'action' , 'martialarts': 'action' , 'fighting': 'action',
    'thought-provoking': 'psythriller',  'complicated': 'psythriller' ,'mindfuck': 'psythriller', 'psychological': 'psythriller', 'thoughtprovoking': 'psythriller', 'suspense': 'psythriller' , 'thriller': 'psythriller',
    'psychology': 'psythriller', 'philosophy': 'psythriller', 'philosophical': 'psythriller', 'psychologicalthriller': 'psythriller', 'interesting' : 'psythriller' , 
    'mental illness': 'psythriller', 'intellectual': 'psythriller', 'cerebral': 'psythriller', 'classicscifi': 'scifi' , 'nerd': 'scifi' ,
    'intelligent': 'psythriller', 'sci-fi': 'scifi', 'timetravel': 'scifi', 'space': 'scifi', 'spaceadventure': 'scifi' , 'spaceepic': 'scifi' ,
    'atmospheric': 'scifi', 'aliens': 'scifi', 'robots': 'scifi', 'classicscscfi': 'scifi', 'futuristic' : 'scifi' , 'alternateuniverse': 'scifi' ,
    'spaceopera': 'scifi', 'starwars': 'scifi', 'spaceaction': 'scifi', 'lukeskywalker': 'scifi', 'iamyourfather': 'scifi', 'gerorgelucas': 'scifi' ,
    'darthvader': 'scifi', 'robotsandandroids': 'scifi', 'space adventure': 'scifi',
    'space epic': 'scifi', 'moon': 'scifi', 'nasa': 'scifi', 'scifimasterpiece': 'scifi',
    'artificialintelligence': 'scifi', 'future': 'scifi', 'alternatereality': 'scifi',
    'teen': 'teenfilm' , 'highschool': 'teenfilm' , 'stanleykubrick': 'noteabledirector', 'rogeravary': 'noteabledirector' ,
    'davidfincher': 'noteabledirector' , 'georgelucas': 'noteabledirector', 'quentintarantino': 'noteabledirector' , 'alfredhitchcock': 'noteabledirector' , 'tarantino': 'noteabledirector' ,'tarintino': 'noteabledirector' , 'alfredhitchcock ': 'noteabledirector' , 'coenbrothers': 'noteabledirector' , 'stevenspielberg': 'noteabledirector' , 'martinscorsese': 'noteabledirector' ,
    'quirky': 'comedy', 'funny': 'comedy', 'crudehumor': 'comedy', 'comedy': 'comedy', 'humour': 'comedy' , 'hilarious': 'comedy' ,
    'satire': 'comedy', 'sarcasm': 'comedy', 'witty': 'comedy', 'parody': 'comedy', 'adamsandler': 'comedy' , 'britishcomedy': 'comedy' ,
    'humorous': 'comedy', 'satirical': 'comedy', 'spook': 'comedy', 'stevebuscemi': 'comedy', 'spoof': 'comedy', 'willferrell': 'comedy' ,
    'stupidisasstupiddoes': 'comedy', 'veryfunny': 'comedy', 'benstiller': 'comedy', 'blackhumour': 'darkcomedy' ,'blackhumor': 'darkcomedy' , 'darkhumor': 'darkcomedy', 'blackcomedy': 'darkcomedy', 'goodmusic': 'soundtrack' ,'notablesoundtrack': 'soundtrack', 'soundtrack': 'soundtrack', 'greatsoundtrack': 'soundtrack', 'music': 'soundtrack'
}

combined_tags_to_category.update({
    'leonardodicaprio': 'oscarwinningactors',
    'harveykeitel': 'oscarwinningactors',
    'bradpitt': 'oscarwinningactors',
    'samuelljackson': 'oscarwinningactors',
    'morganfreeman': 'oscarwinningactors',
    'alpacino': 'oscarwinningactors',
    'tomhanks': 'oscarwinningactors',
    'helenabonhamcarter': 'oscarwinningactors',
    'travolta': 'oscarwinningactors',
    'umathurman': 'oscarwinningactors',
    'johntravolta': 'oscarwinningactors',
    'edwardnorton': 'oscarwinningactors',
    'harrisonford': 'oscarwinningactors',
    'jacknicholson': 'oscarwinningactors',
    'apocalypse': 'apocalypse' ,
    'worldwarii' : 'apocalypse',
    'postmodern' : 'apocalypse',
    'nuclearwar' : 'apocalypse',
    'postapocalyptic': 'apocalypse',
    'unique': 'original' ,
    'originalplot': 'original',
    'mindblowing': 'original' ,
    'different': 'original',
    'innovative': 'original',
    'original': 'original'
    
    
})

combined_tags_to_category.update({
    'ghost': 'monsters',
    'ghosts': 'monsters',
    'gothic': 'monsters',
    'zombies': 'monsters',
    'horror': 'monsters' ,
    'creepy': 'monsters',
    'hanniballector': 'monsters' ,
    'stephenking': 'famousauthors',
    'tolkein': 'famousauthors',
    'tolkien': 'famousauthors' ,
    'basedonabook': 'famousauthors',
    'chuckpalahniuk': 'famousauthors' ,
    'palahnuik': 'famousauthors',
    'janeausten': 'famousauthors',
    'amazingdialogues': 'dialogue',
    'gooddialogue': 'dialogue',
    'greatdialogue': 'dialogue',
    'dialogue': 'dialogue',
    'entirelydialogue': 'dialogue',
    'smartwriting': 'dialogue',
    'storytelling': 'dialogue',
    'conversation' : 'dialogue'
})

    


combined_tags_to_category.update({
    'cult': 'cultclassic',
    'cultfilm': 'cultclassic',
    'cultclassic': 'cultclassic',
    'diner': 'pulpfiction' ,
    'milkshake': 'pulpfiction',
    'pulp': 'pulpfiction',
    'quotable': 'cultclassic',
    'dance': 'pulpfiction' ,
    'motherfucker': 'pulpfiction',
    'royalwithcheese': 'pulpfiction' ,
    'blood': 'pulpfiction' ,
    'noir': 'cultclassic' ,
    'cool': 'pulpfiction' ,
    'splatter': 'pulpfiction' ,
    'awesome': 'pulpfiction' ,
    'amazing': 'pulpfiction' ,
    'exciting' : 'pulpfiction', 
    'entertaining': 'pulpfiction' ,
    'iconic' : 'pulpfiction' ,
    'gore': 'pulpfiction' ,
    'bible': 'pulpfiction' ,
    'homosexuality' : 'pulpfiction' ,
    'neonoir' : 'cultclassic',
    'badass': 'pulpfiction',
    'unusual': 'pulpfiction' ,
    'foullanguage': 'pulpfiction' ,
    'highlyquotable': 'pulpfiction',
    'sexy': 'pulpfiction',
    'goldenwatch': 'pulpfiction',
    'anthology': 'pulpfiction', 
    'ironic' : 'pulpfiction' ,
    'monologue' : 'pulpfiction' ,
    'sophisticated' : 'pulpfiction' ,
    'random': 'pulpfiction' ,
    '1990s': 'pulpfiction',
    'palmedor': 'pulpfiction' ,
    'dancing': 'pulpfiction' ,
    'irony': 'pulpfiction' ,
    'losangeles': 'pulpfiction',
    'popculturerefrences' : 'pulpfiction',
    'filmnoir' : 'pulpfiction'
    
})

combined_tags_to_category.update({
    'fantasy': 'fantasy',
    'magic': 'fantasy',
    'wizards': 'fantasy',
    'fantasyworld': 'fantasy' ,
    'vietnam': 'war',
    'scotland': 'war',
    'medieval': 'war',
    'swordfight': 'war' ,
    'military': 'war' ,
    'adventure':'adventure',
    'engrossingadventure': 'adventure' ,
    'treasurehunt': 'adventure' ,
    'epic': 'adventure' ,
    'fun': 'adventure',
    'archaeology' : 'nonlinear',
    'notchrinological': 'nonlinear',
    'intertwiningstorylines': 'nonlinear' ,
    'interwovenstorylines': 'nonlinear' ,
    'multiplestories': 'nonlinear' ,
    'multiplestorylines': 'nonlinear' ,
    'nonlinearnarrative': 'nonlinear',
    'achronological': 'nonlinear',
    'disjointedtimeline': 'nonlinear',
    'episodic': 'nonlinear' ,
    'nonlineartimeline': 'nonlinear',
    'nonlinearstoryline': 'nonlinear' ,
    'nonlinear': 'nonlinear',
    'wedding': 'lovestory',
    'lovestory': 'lovestory',
    'romance': 'lovestory' ,
    'histoical': 'nonfiction',
    'biography': 'nonfiction',
    'basedontruestory': 'nonfiction',
    'truestory': 'nonfiction',
    'historical': 'nonfiction' ,
    'basedonatruestory' : 'nonfiction',
    'outoforder': 'pulpfiction',
    'coolstyle': 'pulpfiction',
    'crimescenescrubbing': 'pulpfiction',
    'biblicalreferences': 'pulpfiction',
    'popculturereferences' : 'pulpfiction'
})


combined_tags_to_category.update({
    'disturbing': 'maturemovie',
    'dark': 'maturemovie',
    'holocaust': 'maturemovie',
    'assassination': 'maturemovie',
    'rape': 'maturemovie',
    'aggressive': 'maturemovie',
    'offensive': 'maturemovie',
    'rdisturbingviolentcontentincludingrape': 'maturemovie',
    'nuditytopless': 'maturemovie',
    'murder': 'maturemovie',
    'prostitution': 'maturemovie',
    'strongbloodyviolence': 'maturemovie',
    'stronglanguage': 'maturemovie',
    'rsustainedstrongstylizedviolence': 'maturemovie',
    'rgraphicsexuality': 'maturemovie',
    'casualviolence': 'maturemovie',
    'rsomeviolence': 'maturemovie',
    'bloodsplatters': 'maturemovie',
    'rdisturbingviolentimages': 'maturemovie',
    'rviolence': 'maturemovie',
    'brutality': 'maturemovie' ,
    'meaninglessviolence' : 'maturemovie',
    'rstronglanguage' : 'maturemovie',
    'rstrongbloodyviolence' : 'maturemovie',
    'assassin': 'maturemovie',
    'existentialism': 'maturemovie',
    'menindrag': 'maturemovie' ,
    'serialkiller' : 'maturemovie' ,
    'badlanguage': 'maturemovie',
    'kidnapping': 'maturemovie' ,
    'terrorism': 'maturemovie'
})

combined_tags_to_category.update({
    'drama': 'drama',
    'emotional': 'drama',
    'bittersweet': 'drama',
    'racism': 'drama',
    'loneliness': 'drama',
    'feelgood': 'drama',
    'inspirational': 'drama',
    'heartwarming': 'drama',
    'touching': 'drama',
    'moving': 'drama',
    'tense': 'drama' ,
    'intense': 'drama',
    'divorce': 'drama' ,
    'journalism': 'media',
    'tv': 'media' ,
    'moviebusiness': 'media',
    'basedontvshow': 'media' ,
    'basedonatvshow': 'media' ,
    'ensemblecast': 'acting',
    'bignameactors': 'acting',
    'greatacting': 'acting' ,
    'characters': 'acting' ,
    'killerasprotagonist': 'acting' ,
    'characterdevelopment': 'acting' ,
    'jeanreno': 'acting' ,
    'melgibson': 'acting',
    'police': 'legalsystem' ,
    'court': 'legalsystem',
    'prison': 'legalsystem',
    'wrongfulimprisonment': 'legalsystem',
    'socialcommentary': 'socialcommentary',
    'politics': 'socialcommentary' ,
    'corruption' : 'socialcommentary' ,
    'societalcriticism': 'socialcommentary' ,
    'consumerism': 'socialcommentary' ,
    'controversial': 'socialcommentary',
    'stylish': 'stylish',
    'stylized':'stylish',
    'mentalillness': 'mentalwellbeing',
    'schizophrenia': 'mentalwellbeing',
    'imaginaryfriend':'mentalwellbeing',
    'memory': 'mentalwellbeing',
    
    
    
})



# Apply the combined mapping to create a 'category' column
analysis_df['category'] = analysis_df['normalized_tag'].map(combined_tags_to_category)

# Fill NaN values in 'category' with the normalized tags if they don't match any key in the dictionary
analysis_df['category'] = analysis_df['category'].fillna(analysis_df['normalized_tag'])

# Now you can count the occurrences of each category
category_counts = analysis_df['category'].value_counts()

# Print the category counts to see if the mappings are correctly reflected
print(category_counts)


scifi                 15235
psythriller           13505
pulpfiction           12795
maturemovie            9140
comedy                 8650
action                 8634
classicmovies          6436
orgcrime               5851
drama                  5769
nonlinear              5342
graphics               4193
oscarwinningactors     4089
twist                  4042
darkcomedy             4019
soundtrack             3615
dialogue               3079
family                 2985
cultclassic            2828
drugfilms              2689
noteabledirector       2553
acting                 2527
superhero              2191
original               2082
famousauthors          1786
stylish                1604
adventure              1600
fantasy                1582
monsters               1065
socialcommentary       1012
mentalwellbeing         823
legalsystem             705
war                     694
religion                535
sequel                  478
remake                  464
media               
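
Some keys in `combined_tags_to_category` contain spaces or hyphens (for example `'brad pitt'` and `'sci-fi'`), while the tags they are matched against have already had those characters removed, so such keys can never match. The sketch below shows one way to list them; `normalize_tag` and `unreachable_keys` are hypothetical helper names, and the small mapping is a hand-picked excerpt rather than the full dictionary:

```python
import re


def normalize_tag(tag: str) -> str:
    """Apply the same normalization as the tag column: lowercase, drop non-word characters."""
    return re.sub(r"\W+", "", tag.lower().replace("-", ""))


def unreachable_keys(mapping: dict) -> list:
    """Return keys that differ from their normalized form and so cannot match a normalized tag."""
    return sorted(key for key in mapping if normalize_tag(key) != key)


# Excerpt of the mapping used above, for illustration only
sample_mapping = {
    "sci-fi": "scifi",
    "timetravel": "scifi",
    "brad pitt": "action",
    "bradpitt": "oscarwinningactors",
    "thought-provoking": "psythriller",
    "thoughtprovoking": "psythriller",
}

print(unreachable_keys(sample_mapping))
```

Calling `unreachable_keys(combined_tags_to_category)` on the full dictionary would flag every entry that could be deleted or rewritten in normalized form.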

In [22]:


# Count the occurrences of each category
category_counts = analysis_df['category'].value_counts()

# Filter for categories with at least 300 occurrences
categories_over_300 = category_counts[category_counts >= 300]

# Print the filtered categories and their counts
print("Categories with over 300 occurrences:")
for category, count in categories_over_300.items():
    print(f"{category}: {count}")



Categories with over 300 occurrences:
scifi: 15235
psythriller: 13505
pulpfiction: 12795
maturemovie: 9140
comedy: 8650
action: 8634
classicmovies: 6436
orgcrime: 5851
drama: 5769
nonlinear: 5342
graphics: 4193
oscarwinningactors: 4089
twist: 4042
darkcomedy: 4019
soundtrack: 3615
dialogue: 3079
family: 2985
cultclassic: 2828
drugfilms: 2689
noteabledirector: 2553
acting: 2527
superhero: 2191
original: 2082
famousauthors: 1786
stylish: 1604
adventure: 1600
fantasy: 1582
monsters: 1065
socialcommentary: 1012
mentalwellbeing: 823
legalsystem: 705
war: 694
religion: 535
sequel: 478
remake: 464
media: 432
nonfiction: 429
teenfilm: 420
lovestory: 403
england: 361
apocalypse: 356
anime: 342
christmas: 334
narrated: 320
retro: 317


# Modeling

In [23]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(filtered_df, test_size=0.2, random_state=42)
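
Note that `filtered_df` holds one row per rating-tag pair, so the same (user, movie, rating) triple can land in both `train` and `test`, which would make the test error look better than it is. A minimal sketch of one way to avoid this, assuming only the three rating columns are needed for the model: deduplicate first, then split. The toy frame is made up, and `DataFrame.sample` stands in for `train_test_split` to keep the example dependency-free.

```python
import pandas as pd

# Toy stand-in for filtered_df: each rating is repeated once per tag of the movie
toy = pd.DataFrame(
    {
        "userId_x": [1, 1, 2, 2, 2, 3],
        "movieId": [1, 1, 1, 1, 2, 2],
        "rating": [4.0, 4.0, 3.5, 3.5, 5.0, 2.0],
        "tag": ["pixar", "fun", "pixar", "fun", "fantasy", "fantasy"],
    }
)

# Keep one row per (user, movie, rating) before splitting
ratings_only = toy[["userId_x", "movieId", "rating"]].drop_duplicates()

# Hold out 25% of the unique ratings; the rest is used for training
test_part = ratings_only.sample(frac=0.25, random_state=42)
train_part = ratings_only.drop(test_part.index)

# No (user, movie) pair should appear on both sides of the split
overlap = train_part.merge(test_part, on=["userId_x", "movieId"])
print(len(toy), len(ratings_only), len(overlap))
```

The same `drop_duplicates` step could be applied to `filtered_df` before calling `train_test_split`; the reported metrics would then need to be recomputed.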


In [24]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(train[['userId_x', 'movieId', 'rating']], reader)


In [25]:
from surprise import SVD

algo = SVD()
algo.fit(data.build_full_trainset())


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fa5f17844f0>

In [26]:
title = filtered_df[['movieId', 'title']].drop_duplicates()

# Display the column names of the 'title' DataFrame
print(title.columns)

Index(['movieId', 'title'], dtype='object')


In [27]:
top_5_recommendations = {}
for user_id in train['userId_x'].unique():
    # Predict ratings for all items for the current user
    user_ratings = []
    for movie_id in train['movieId'].unique():
        user_ratings.append((movie_id, algo.predict(user_id, movie_id).est))
    
    # Sort the predicted ratings in descending order
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    
    # Select the top 5 recommendations
    top_5_recommendations[user_id] = user_ratings[:5]
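
The loop above scores every movie in the training data, including movies the user has already rated. If the goal is to suggest something new, those can be excluded before ranking. A hedged sketch follows; `top_n_unseen` and its `predict` argument are hypothetical, with `predict` standing in for something like `lambda u, m: algo.predict(u, m).est`:

```python
import heapq


def top_n_unseen(user_id, all_movie_ids, seen_by_user, predict, n=5):
    """Return the n highest-scoring (movie_id, score) pairs the user has not rated yet."""
    seen = seen_by_user.get(user_id, set())
    scored = ((m, predict(user_id, m)) for m in all_movie_ids if m not in seen)
    # heapq.nlargest avoids sorting the full candidate list
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Made-up predicted scores for one user
toy_scores = {(1, 10): 4.5, (1, 20): 4.9, (1, 30): 3.0, (1, 40): 4.7}
seen_by_user = {1: {20}}  # user 1 has already rated movie 20

recommendations = top_n_unseen(
    1, [10, 20, 30, 40], seen_by_user, lambda u, m: toy_scores[(u, m)], n=2
)
print(recommendations)
```

In the notebook, `seen_by_user` could be built from the training ratings, for example by grouping `movieId` values by `userId_x`.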

In [28]:
# Top 5 Movie Recs
# Reuse the recommendations computed in the previous cell instead of
# recomputing every prediction and overwriting the dictionary.
title_by_id = title.set_index('movieId')['title'].to_dict()

for user_id, recommendations in top_5_recommendations.items():
    print(f"User {user_id}:")
    for movie_id, rating in recommendations:
        movie_title = title_by_id[movie_id]
        print(f"  - {movie_title} (Movie ID: {movie_id}, Predicted Rating: {rating})")


User 599:
  - Star Wars: Episode V - The Empire Strikes Back (1980) (Movie ID: 1196, Predicted Rating: 4.721013181429916)
  - Big Lebowski, The (1998) (Movie ID: 1732, Predicted Rating: 4.706658650264314)
  - Reservoir Dogs (1992) (Movie ID: 1089, Predicted Rating: 4.679025405555715)
  - Aliens (1986) (Movie ID: 1200, Predicted Rating: 4.671457667340153)
  - 2001: A Space Odyssey (1968) (Movie ID: 924, Predicted Rating: 4.643014510401436)
User 404:
  - Inside Job (2010) (Movie ID: 80906, Predicted Rating: 4.199743334319191)
  - Great Escape, The (1963) (Movie ID: 1262, Predicted Rating: 4.109683262945384)
  - Guess Who's Coming to Dinner (1967) (Movie ID: 3451, Predicted Rating: 4.078205306375252)
  - Captain Fantastic (2016) (Movie ID: 158966, Predicted Rating: 4.073400739341058)
  - Rosemary's Baby (1968) (Movie ID: 2160, Predicted Rating: 4.072730202388988)
User 577:
  - Shawshank Redemption, The (1994) (Movie ID: 318, Predicted Rating: 4.989908884703873)
  - Schindler's List (1993)

In [29]:
from surprise import Dataset, accuracy

# Load the held-out ratings into a Surprise dataset
test_data = Dataset.load_from_df(test[['userId_x', 'movieId', 'rating']], reader)

# Convert the held-out ratings into a list of (user, item, rating) tuples
testset = test_data.build_full_trainset().build_testset()

# Predict a rating for every held-out (user, item) pair
predictions = algo.test(testset)

# Compute RMSE (accuracy.rmse also prints the value itself)
rmse = accuracy.rmse(predictions)
print("Test RMSE:", rmse)

# Compute MAE (accuracy.mae also prints the value itself)
mae = accuracy.mae(predictions)
print("Test MAE:", mae)


RMSE: 0.3996
Test RMSE: 0.3996166250540951
MAE:  0.2633
Test MAE: 0.26330807367497944
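
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, and MAE is the mean absolute difference. A small self-contained sketch with made-up numbers (the functions below are illustrative re-implementations, not the Surprise internals):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up ratings: errors are 0.5, 0.0 and 1.0
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]

print(rmse(actual, predicted), mae(actual, predicted))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is why it is the larger of the two values here.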



# Conclusions

# Recommendations