# Recommender Systems - Mini Challenge HS25

In this mini-challenge we will explore a MovieLens dataset and implement several recommender systems and evaluation methods. Subsequently, we will optimize these methods and compare the results.

**Submission deadline:** Sunday of SW11 at 18:00

## Guidelines for Implementation and Submission
- Code must be written in Python. The versions of all packages used must be stated for reproducibility.
- You may respond in English or German.
- We develop numerous algorithms ourselves. Unless explicitly stated otherwise, only the following libraries may be used in Python: numpy, matplotlib, seaborn, pandas. 
- Follow good coding practices and write modular, reusable code.
- The submitted solution must contain all code and all results. No code may be outsourced.
- All paths must be relative, and the repo must be executable right after download, without modifications.
- Only fully running code is graded. The notebook must run sequentially from start to end.
- During development, if computation time is too long for productive prototyping and debugging work, it is recommended to reduce the dataset to a fraction of its original. However, final results must be calculated on the full dataset. 
- All plots must be fully labeled (title, axes, labels, colorbar, etc.) so that the plot can be easily understood.
- Each plot must be accompanied by a brief discussion, which explains the plot and captures the key insights that become visible.
- Only fully labeled plots with an accompanying discussion will be assessed.
- The last commit in your fork of the repo before the submission deadline counts as the submission.
- Points will be deducted if you do not write concisely (denial of service will be punished) or if I read text that was written not for me but for the user of ChatGPT or a similar tool.
- If you would like to submit and have the mini-challenge assessed, please send a short email to the subject expert (moritz.kirschmann@fhnw.ch) within 2 days after submission.
- Please do not delete, duplicate, or move the existing cells. This leads to problems during the correction. However, you may add as many additional cells as you like.
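
The reproducibility requirement above can be met by printing the installed package versions in an early cell. The following is a minimal sketch, not a prescribed solution: the helper name `package_versions` is hypothetical, and it relies only on the standard-library `importlib.metadata` module.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Libraries permitted by the guidelines above.
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn"]


def package_versions(names):
    """Return {package: installed version, or 'not installed'}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


versions = package_versions(PACKAGES)
print(f"python: {sys.version.split()[0]}")
for name, ver in versions.items():
    print(f"{name}: {ver}")
```

Printing the versions (rather than hard-coding them in a text cell) keeps the statement in sync with the environment that actually produced the results.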

## Exercises

### Exercise 1 - A deep exploration of the dataset (17 points)
We will work with a subset of the MovieLens dataset. This subset is located under ``data/ml-latest-small``. Read the ``README.txt`` carefully.
Open the files.

a) Describe the available data.

b) Find and fix bad data (e.g. duplicates, missing values, etc.).

Generate lists of

c) - Top 20 movies by average rating

d) - Top 20 movies by number of views

e) What is the range of the ratings? 

f) Which genre has been rated how many times?

g) How sparse is the User Rating Matrix?

Plot the following:

h) How many users have rated how many movies

i) Which rating is given how often over time, at a monthly resolution

j) Which rating is given how often per genre

k) The rating distributions of 10 random movies

l) The rating distributions of 3 movies that you have watched

m) How many users give which average rating

n) How often a movie was rated as a function of average rating

o) A heatmap of the User Item Matrix

p) A heatmap of the User Item Matrix for the 100 most rated movies for the 50 users with most ratings
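
For item g), sparsity is commonly taken as the share of empty cells in the user-item matrix, i.e. 1 − (observed ratings) / (users × movies). The sketch below works this through on a hypothetical toy frame (the values are made up and not taken from MovieLens) and cross-checks the result against the pivoted matrix.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 4 movies, 5 observed ratings.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()
n_observed = len(toy.drop_duplicates(["userId", "movieId"]))

toy_density = n_observed / (n_users * n_movies)
toy_sparsity = 1.0 - toy_density
print(f"{n_observed} of {n_users * n_movies} cells filled, sparsity = {toy_sparsity:.3f}")

# Cross-check: share of NaN cells in the pivoted user-item matrix.
toy_matrix = toy.pivot(index="userId", columns="movieId", values="rating")
sparsity_from_matrix = toy_matrix.isna().to_numpy().mean()
```

Here 5 of 12 cells are filled, so the sparsity is 7/12. Counting rows in the long-format table avoids building the full matrix, which matters once the dataset grows.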


In [None]:
# Exercise 1 - A deep exploration of the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# Load the datasets
print("Loading datasets...")
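# The subset lives under data/ml-latest-small (see the exercise text)
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
movies = pd.read_csv('data/ml-latest-small/movies.csv')
links = pd.read_csv('data/ml-latest-small/links.csv')
tags = pd.read_csv('data/ml-latest-small/tags.csv')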

print("Data loaded successfully!")
print("\nDataset shapes:")
print(f"Ratings: {ratings.shape}")
print(f"Movies: {movies.shape}")
print(f"Links: {links.shape}")
print(f"Tags: {tags.shape}")


In [None]:
# a) Describe the available data

print("=== DATASET DESCRIPTION ===")
print("\n1. RATINGS DATASET:")
print(ratings.head())
print(f"\nColumns: {list(ratings.columns)}")
print(f"Data types:\n{ratings.dtypes}")
print(f"Basic statistics:\n{ratings.describe()}")

print("\n2. MOVIES DATASET:")
print(movies.head())
print(f"\nColumns: {list(movies.columns)}")
print(f"Data types:\n{movies.dtypes}")

print("\n3. LINKS DATASET:")
print(links.head())
print(f"\nColumns: {list(links.columns)}")
print(f"Data types:\n{links.dtypes}")

print("\n4. TAGS DATASET:")
print(tags.head())
print(f"\nColumns: {list(tags.columns)}")
print(f"Data types:\n{tags.dtypes}")

print("\n=== DATA OVERVIEW ===")
print(f"• Total ratings: {len(ratings):,}")
print(f"• Unique users: {ratings['userId'].nunique():,}")
print(f"• Unique movies: {ratings['movieId'].nunique():,}")
print(f"• Total movies in dataset: {len(movies):,}")
print(f"• Total tags: {len(tags):,}")
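# Convert with pandas (UTC) so the dates do not depend on the local time zone
rating_start = pd.to_datetime(ratings['timestamp'].min(), unit='s')
rating_end = pd.to_datetime(ratings['timestamp'].max(), unit='s')
print(f"• Rating period: {rating_start:%Y-%m-%d} to {rating_end:%Y-%m-%d} (UTC)")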


In [None]:
# b) Find and fix bad data (duplicates, missing values, etc.)

print("=== DATA QUALITY CHECK ===")

# Check for missing values
print("\n1. MISSING VALUES:")
print("Ratings dataset:")
print(ratings.isnull().sum())
print("\nMovies dataset:")
print(movies.isnull().sum())
print("\nLinks dataset:")
print(links.isnull().sum())
print("\nTags dataset:")
print(tags.isnull().sum())

# Check for duplicates
print("\n2. DUPLICATES:")
print(f"Duplicate ratings: {ratings.duplicated().sum()}")
print(f"Duplicate movies: {movies.duplicated().sum()}")
print(f"Duplicate links: {links.duplicated().sum()}")
print(f"Duplicate tags: {tags.duplicated().sum()}")

# Check for rating duplicates (same user rating same movie multiple times)
print(f"\nDuplicate user-movie ratings: {ratings.duplicated(subset=['userId', 'movieId']).sum()}")

# Check data consistency
print("\n3. DATA CONSISTENCY:")
print(f"Movies in ratings but not in movies: {set(ratings['movieId']) - set(movies['movieId'])}")
print(f"Movies in movies but not in ratings: {len(set(movies['movieId']) - set(ratings['movieId']))}")

# Check for invalid ratings
print("\n4. INVALID RATINGS:")
invalid_ratings = ratings[(ratings['rating'] < 0.5) | (ratings['rating'] > 5.0)]
print(f"Invalid ratings (outside 0.5-5.0 range): {len(invalid_ratings)}")

# Check for movies with no genres
print(f"\nMovies with no genres: {movies[movies['genres'] == '(no genres listed)'].shape[0]}")

print("\n=== DATA CLEANING ===")
# Remove duplicates if any
if ratings.duplicated(subset=['userId', 'movieId']).sum() > 0:
    print("Removing duplicate user-movie ratings...")
    ratings = ratings.drop_duplicates(subset=['userId', 'movieId'])
    print(f"Ratings after removing duplicates: {len(ratings)}")

# Convert timestamp to datetime for easier analysis
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
tags['datetime'] = pd.to_datetime(tags['timestamp'], unit='s')

print("Data cleaning completed!")


In [None]:
# c) Top 20 movies by average rating

# Merge ratings with movies to get movie titles
movie_stats = ratings.groupby('movieId').agg({
    'rating': ['mean', 'count'],
    'userId': 'count'
}).round(3)

movie_stats.columns = ['avg_rating', 'rating_count', 'user_count']
movie_stats = movie_stats.reset_index()

# Merge with movie information
movie_stats = movie_stats.merge(movies[['movieId', 'title', 'genres']], on='movieId')

# Filter movies with at least 50 ratings to avoid bias from movies with very few ratings
min_ratings = 50
top_movies_by_rating = movie_stats[movie_stats['rating_count'] >= min_ratings].sort_values('avg_rating', ascending=False).head(20)

print("=== TOP 20 MOVIES BY AVERAGE RATING (min 50 ratings) ===")
for idx, row in top_movies_by_rating.iterrows():
    print(f"{row['avg_rating']:.3f} - {row['title']} ({row['rating_count']} ratings)")

print(f"\nNote: Only movies with at least {min_ratings} ratings are included to avoid bias from movies with very few ratings.")


In [None]:
# d) Top 20 movies by number of views (ratings)

top_movies_by_views = movie_stats.sort_values('rating_count', ascending=False).head(20)

print("=== TOP 20 MOVIES BY NUMBER OF RATINGS ===")
for idx, row in top_movies_by_views.iterrows():
    print(f"{row['rating_count']} ratings - {row['title']} (avg: {row['avg_rating']:.3f})")


In [None]:
# e) What is the range of the ratings?

print("=== RATING RANGE ANALYSIS ===")
print(f"Minimum rating: {ratings['rating'].min()}")
print(f"Maximum rating: {ratings['rating'].max()}")
print(f"Rating range: {ratings['rating'].min()} to {ratings['rating'].max()}")
print(f"Unique rating values: {sorted(ratings['rating'].unique())}")
print(f"Number of unique rating values: {ratings['rating'].nunique()}")

# Rating distribution
print("\nRating distribution:")
rating_dist = ratings['rating'].value_counts().sort_index()
for rating, count in rating_dist.items():
    percentage = (count / len(ratings)) * 100
    print(f"Rating {rating}: {count:,} ratings ({percentage:.1f}%)")


In [None]:
# f) Which genre has been rated how many times?

# First, let's expand the genres (pipe-separated) into individual rows
movie_genres = movies.copy()
movie_genres['genres_list'] = movie_genres['genres'].str.split('|')
movie_genres = movie_genres.explode('genres_list')
movie_genres['genre'] = movie_genres['genres_list']

# Merge with ratings to get rating counts per genre
genre_ratings = ratings.merge(movie_genres[['movieId', 'genre']], on='movieId')

# Count ratings per genre
genre_stats = genre_ratings.groupby('genre').agg({
    'rating': ['count', 'mean'],
    'userId': 'nunique'
}).round(3)

genre_stats.columns = ['total_ratings', 'avg_rating', 'unique_users']
genre_stats = genre_stats.sort_values('total_ratings', ascending=False)

print("=== GENRE RATING STATISTICS ===")
print(f"{'Genre':<20} {'Total Ratings':<15} {'Avg Rating':<12} {'Unique Users':<15}")
print("-" * 65)
for genre, row in genre_stats.iterrows():
    # iterrows() upcasts the integer counts to float, so cast back before formatting
    print(f"{genre:<20} {int(row['total_ratings']):<15,} {row['avg_rating']:<12.3f} {int(row['unique_users']):<15,}")

print(f"\nTotal ratings analyzed: {genre_stats['total_ratings'].sum():,}")
print(f"Note: Some ratings may be counted multiple times if a movie has multiple genres.")


In [None]:
# g) How sparse is the User Rating Matrix?

# Create user-item matrix to analyze sparsity
user_item_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')

print("=== USER RATING MATRIX SPARSITY ANALYSIS ===")
print(f"Matrix shape: {user_item_matrix.shape} (users x movies)")
print(f"Total possible ratings: {user_item_matrix.shape[0] * user_item_matrix.shape[1]:,}")
print(f"Actual ratings: {ratings.shape[0]:,}")
print(f"Missing ratings: {(user_item_matrix.shape[0] * user_item_matrix.shape[1]) - ratings.shape[0]:,}")

# Calculate sparsity
sparsity = 1 - (ratings.shape[0] / (user_item_matrix.shape[0] * user_item_matrix.shape[1]))
print(f"Sparsity: {sparsity:.4f} ({sparsity*100:.2f}%)")
print(f"Density: {1-sparsity:.4f} ({(1-sparsity)*100:.2f}%)")

# Additional sparsity insights
print(f"\nSparsity insights:")
print(f"• Average ratings per user: {ratings.shape[0] / user_item_matrix.shape[0]:.1f}")
print(f"• Average ratings per movie: {ratings.shape[0] / user_item_matrix.shape[1]:.1f}")
print(f"• Users with most ratings: {ratings.groupby('userId').size().max()}")
print(f"• Movies with most ratings: {ratings.groupby('movieId').size().max()}")
print(f"• Users with fewest ratings: {ratings.groupby('userId').size().min()}")
print(f"• Movies with fewest ratings: {ratings.groupby('movieId').size().min()}")


In [None]:
# h) Plot: How many users have rated how many movies

# Calculate ratings per user
user_rating_counts = ratings.groupby('userId').size().sort_values(ascending=False)

plt.figure(figsize=(12, 6))

# Plot 1: Histogram of ratings per user
plt.subplot(1, 2, 1)
plt.hist(user_rating_counts, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
plt.xlabel('Number of Movies Rated')
plt.ylabel('Number of Users')
plt.title('Distribution of Ratings per User')
plt.grid(True, alpha=0.3)

# Plot 2: Number of users with at least a given number of ratings
plt.subplot(1, 2, 2)
sorted_counts = user_rating_counts.sort_values(ascending=False)
users_at_least = np.arange(1, len(sorted_counts) + 1)
plt.plot(sorted_counts.values, users_at_least, linewidth=2, color='red')
plt.xlabel('Number of Movies Rated (x)')
plt.ylabel('Number of Users with at Least x Ratings')
plt.title('Users by Minimum Number of Ratings')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"User rating statistics:")
print(f"• Min ratings per user: {user_rating_counts.min()}")
print(f"• Max ratings per user: {user_rating_counts.max()}")
print(f"• Mean ratings per user: {user_rating_counts.mean():.1f}")
print(f"• Median ratings per user: {user_rating_counts.median():.1f}")
print(f"• Std ratings per user: {user_rating_counts.std():.1f}")

print(f"\nTop 10 users by number of ratings:")
for i, (user_id, count) in enumerate(user_rating_counts.head(10).items(), 1):
    print(f"{i:2d}. User {user_id}: {count} ratings")


In [None]:
# i) Plot: Which rating is given how often over time with monthly resolution

# Add year-month column for time analysis
ratings['year_month'] = ratings['datetime'].dt.to_period('M')

# Count ratings by month and rating value
monthly_ratings = ratings.groupby(['year_month', 'rating']).size().unstack(fill_value=0)

fig, (ax_area, ax_line) = plt.subplots(2, 1, figsize=(15, 10))

# Plot 1: Stacked area chart of ratings over time
# Pass ax explicitly; DataFrame.plot would otherwise draw into a new figure
colors = plt.cm.viridis(np.linspace(0, 1, len(monthly_ratings.columns)))
monthly_ratings.plot(kind='area', stacked=True, color=colors, alpha=0.7, ax=ax_area)
ax_area.set_title('Rating Distribution Over Time (Monthly)', fontsize=14, fontweight='bold')
ax_area.set_xlabel('Time (Year-Month)')
ax_area.set_ylabel('Number of Ratings')
ax_area.legend(title='Rating Value', bbox_to_anchor=(1.05, 1), loc='upper left')
ax_area.grid(True, alpha=0.3)

# Plot 2: Line plot showing trends for each rating
# Real timestamps on the x-axis let matplotlib choose readable date ticks
month_starts = monthly_ratings.index.to_timestamp()
for rating_val in sorted(monthly_ratings.columns):
    ax_line.plot(month_starts, monthly_ratings[rating_val],
                 marker='o', label=f'Rating {rating_val}', linewidth=2, markersize=4)
ax_line.set_title('Rating Trends Over Time', fontsize=14, fontweight='bold')
ax_line.set_xlabel('Time (Year-Month)')
ax_line.set_ylabel('Number of Ratings')
ax_line.legend(title='Rating Value', bbox_to_anchor=(1.05, 1), loc='upper left')
ax_line.grid(True, alpha=0.3)
ax_line.tick_params(axis='x', labelrotation=45)

plt.tight_layout()
plt.show()

print("Monthly rating trends analysis:")
print(f"• Dataset spans from {ratings['year_month'].min()} to {ratings['year_month'].max()}")
print(f"• Total months with data: {ratings['year_month'].nunique()}")
print(f"• Average ratings per month: {len(ratings) / ratings['year_month'].nunique():.1f}")

# Show some monthly statistics
monthly_totals = monthly_ratings.sum(axis=1)
print(f"\nTop 5 months with most ratings:")
for month, count in monthly_totals.nlargest(5).items():
    print(f"• {month}: {count:,} ratings")

print(f"\nTop 5 months with fewest ratings:")
for month, count in monthly_totals.nsmallest(5).items():
    print(f"• {month}: {count:,} ratings")


In [None]:
# j) Plot: Which rating is given how often per genre

# Use the genre_ratings data we created earlier
genre_rating_dist = genre_ratings.groupby(['genre', 'rating']).size().unstack(fill_value=0)

# Calculate percentages for each genre
genre_rating_pct = genre_rating_dist.div(genre_rating_dist.sum(axis=1), axis=0) * 100

fig, (ax_abs, ax_pct) = plt.subplots(2, 1, figsize=(16, 12))
rating_colors = plt.cm.viridis(np.linspace(0, 1, len(genre_rating_dist.columns)))

# Plot 1: Stacked bar chart of absolute counts
# Pass ax explicitly; DataFrame.plot would otherwise draw into a new figure
genre_rating_dist.plot(kind='bar', stacked=True, color=rating_colors, ax=ax_abs)
ax_abs.set_title('Rating Distribution by Genre (Absolute Counts)', fontsize=14, fontweight='bold')
ax_abs.set_xlabel('Genre')
ax_abs.set_ylabel('Number of Ratings')
ax_abs.legend(title='Rating Value', bbox_to_anchor=(1.05, 1), loc='upper left')
ax_abs.tick_params(axis='x', labelrotation=45)
ax_abs.grid(True, alpha=0.3)

# Plot 2: Stacked bar chart of percentages
genre_rating_pct.plot(kind='bar', stacked=True, color=rating_colors, ax=ax_pct)
ax_pct.set_title('Rating Distribution by Genre (Percentages)', fontsize=14, fontweight='bold')
ax_pct.set_xlabel('Genre')
ax_pct.set_ylabel('Percentage of Ratings')
ax_pct.legend(title='Rating Value', bbox_to_anchor=(1.05, 1), loc='upper left')
ax_pct.tick_params(axis='x', labelrotation=45)
ax_pct.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Genre rating analysis:")
print("\nAverage rating by genre:")
avg_rating_by_genre = genre_ratings.groupby('genre')['rating'].mean().sort_values(ascending=False)
for genre, avg_rating in avg_rating_by_genre.items():
    total_ratings = genre_rating_dist.loc[genre].sum()
    print(f"• {genre:<20}: {avg_rating:.3f} (from {total_ratings:,} ratings)")

print(f"\nMost common rating by genre:")
for genre in genre_rating_dist.index:
    most_common_rating = genre_rating_dist.loc[genre].idxmax()
    count = genre_rating_dist.loc[genre].max()
    percentage = (count / genre_rating_dist.loc[genre].sum()) * 100
    print(f"• {genre:<20}: Rating {most_common_rating} ({count:,} ratings, {percentage:.1f}%)")


In [None]:
# k) Plot: Rating distributions of 10 random movies

# Select 10 random movies that have at least 20 ratings for better visualization
# A fixed seed keeps the selection (and the discussion of the plot) reproducible
movies_with_sufficient_ratings = movie_stats[movie_stats['rating_count'] >= 20]['movieId'].tolist()
rng = np.random.default_rng(42)
random_movies = rng.choice(movies_with_sufficient_ratings, size=10, replace=False)

plt.figure(figsize=(15, 10))

for i, movie_id in enumerate(random_movies, 1):
    plt.subplot(2, 5, i)
    
    # Get ratings for this movie
    movie_ratings = ratings[ratings['movieId'] == movie_id]['rating']
    
    # Create histogram
    plt.hist(movie_ratings, bins=np.arange(0.5, 6, 0.5), alpha=0.7, color='skyblue', edgecolor='black')
    
    # Get movie title
    movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
    avg_rating = movie_ratings.mean()
    rating_count = len(movie_ratings)
    
    plt.title(f'{movie_title[:25]}...\nAvg: {avg_rating:.2f}, Count: {rating_count}',
              fontsize=8, fontweight='bold')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.xticks(np.arange(0.5, 6, 0.5))
    plt.grid(True, alpha=0.3)

plt.suptitle('Rating Distributions of 10 Random Movies', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("Random movies selected for analysis:")
for i, movie_id in enumerate(random_movies, 1):
    movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
    movie_ratings = ratings[ratings['movieId'] == movie_id]['rating']
    avg_rating = movie_ratings.mean()
    rating_count = len(movie_ratings)
    print(f"{i:2d}. {movie_title} (ID: {movie_id}) - Avg: {avg_rating:.3f}, Count: {rating_count}")


In [None]:
# l) Plot: Rating distributions of 3 movies that you have watched

# I'll select 3 well-known popular movies that I'm familiar with
# Let's find some popular movies by searching for well-known titles
familiar_movies = []

# Search for some well-known movies
# MovieLens moves leading articles to the end of the title, e.g. "Matrix, The (1999)"
movie_search_terms = ['Toy Story', 'Forrest Gump', 'Matrix, The']
for term in movie_search_terms:
    matching_movies = movies[movies['title'].str.contains(term, case=False, na=False)]
    if not matching_movies.empty:
        # Get the first match and check if it has sufficient ratings
        movie_id = matching_movies.iloc[0]['movieId']
        if movie_id in movie_stats[movie_stats['rating_count'] >= 50]['movieId'].values:
            familiar_movies.append(movie_id)

# If we don't have enough movies, add some popular ones
if len(familiar_movies) < 3:
    # Get some of the most rated movies as familiar ones
    popular_movies = movie_stats.sort_values('rating_count', ascending=False).head(10)['movieId'].tolist()
    for movie_id in popular_movies:
        if movie_id not in familiar_movies:
            familiar_movies.append(movie_id)
        if len(familiar_movies) >= 3:
            break

plt.figure(figsize=(15, 5))

for i, movie_id in enumerate(familiar_movies[:3], 1):
    plt.subplot(1, 3, i)
    
    # Get ratings for this movie
    movie_ratings = ratings[ratings['movieId'] == movie_id]['rating']
    
    # Create histogram
    plt.hist(movie_ratings, bins=np.arange(0.5, 6, 0.5), alpha=0.7, color='lightcoral', edgecolor='black')
    
    # Get movie title and stats
    movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
    movie_genres = movies[movies['movieId'] == movie_id]['genres'].iloc[0]
    avg_rating = movie_ratings.mean()
    rating_count = len(movie_ratings)
    
    plt.title(f'{movie_title}\nGenres: {movie_genres}\nAvg: {avg_rating:.2f}, Count: {rating_count}',
              fontsize=10, fontweight='bold')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.xticks(np.arange(0.5, 6, 0.5))
    plt.grid(True, alpha=0.3)

plt.suptitle('Rating Distributions of 3 Popular Movies', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Movies selected for analysis:")
for i, movie_id in enumerate(familiar_movies[:3], 1):
    movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
    movie_genres = movies[movies['movieId'] == movie_id]['genres'].iloc[0]
    movie_ratings = ratings[ratings['movieId'] == movie_id]['rating']
    avg_rating = movie_ratings.mean()
    rating_count = len(movie_ratings)
    std_rating = movie_ratings.std()
    print(f"{i}. {movie_title}")
    print(f"   Genres: {movie_genres}")
    print(f"   Average Rating: {avg_rating:.3f}")
    print(f"   Rating Count: {rating_count}")
    print(f"   Standard Deviation: {std_rating:.3f}")
    print()


In [None]:
# m) Plot: How many users give which average rating

# Calculate average rating per user
user_avg_ratings = ratings.groupby('userId')['rating'].agg(['mean', 'count']).reset_index()
user_avg_ratings.columns = ['userId', 'avg_rating', 'rating_count']

plt.figure(figsize=(12, 6))

# Plot 1: Histogram of average ratings per user
plt.subplot(1, 2, 1)
plt.hist(user_avg_ratings['avg_rating'], bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
plt.xlabel('Average Rating Given by User')
plt.ylabel('Number of Users')
plt.title('Distribution of Average Ratings per User')
plt.grid(True, alpha=0.3)

# Plot 2: Scatter plot of average rating vs number of ratings
plt.subplot(1, 2, 2)
plt.scatter(user_avg_ratings['rating_count'], user_avg_ratings['avg_rating'], 
           alpha=0.6, color='purple', s=20)
plt.xlabel('Number of Ratings Given')
plt.ylabel('Average Rating Given')
plt.title('Average Rating vs Number of Ratings per User')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("User average rating statistics:")
print(f"• Min average rating: {user_avg_ratings['avg_rating'].min():.3f}")
print(f"• Max average rating: {user_avg_ratings['avg_rating'].max():.3f}")
print(f"• Mean average rating: {user_avg_ratings['avg_rating'].mean():.3f}")
print(f"• Median average rating: {user_avg_ratings['avg_rating'].median():.3f}")
print(f"• Std average rating: {user_avg_ratings['avg_rating'].std():.3f}")

# iterrows() upcasts the integer columns to float, so cast IDs and counts back to int
print(f"\nUsers with highest average ratings:")
top_raters = user_avg_ratings.nlargest(10, 'avg_rating')
for idx, row in top_raters.iterrows():
    print(f"• User {int(row['userId'])}: {row['avg_rating']:.3f} (from {int(row['rating_count'])} ratings)")

print(f"\nUsers with lowest average ratings:")
low_raters = user_avg_ratings.nsmallest(10, 'avg_rating')
for idx, row in low_raters.iterrows():
    print(f"• User {int(row['userId'])}: {row['avg_rating']:.3f} (from {int(row['rating_count'])} ratings)")


In [None]:
# n) Plot: How often a movie was rated as a function of average rating

plt.figure(figsize=(12, 8))

# Plot 1: Scatter plot
plt.subplot(2, 1, 1)
plt.scatter(movie_stats['avg_rating'], movie_stats['rating_count'], 
           alpha=0.6, color='orange', s=20)
plt.xlabel('Average Rating of Movie')
plt.ylabel('Number of Ratings Received')
plt.title('Number of Ratings vs Average Rating per Movie')
plt.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(movie_stats['avg_rating'], movie_stats['rating_count'], 1)
p = np.poly1d(z)
plt.plot(movie_stats['avg_rating'], p(movie_stats['avg_rating']), "r--", alpha=0.8, linewidth=2)

# Plot 2: Box plot by rating bins
plt.subplot(2, 1, 2)

# Create rating bins
movie_stats['rating_bin'] = pd.cut(movie_stats['avg_rating'], 
                                   bins=[0, 1.5, 2.5, 3.5, 4.5, 5.0], 
                                   labels=['0-1.5', '1.5-2.5', '2.5-3.5', '3.5-4.5', '4.5-5.0'])

# Create box plot
rating_bins = []
rating_counts_by_bin = []
for bin_label in movie_stats['rating_bin'].cat.categories:
    bin_data = movie_stats[movie_stats['rating_bin'] == bin_label]['rating_count']
    if len(bin_data) > 0:
        rating_bins.append(bin_label)
        rating_counts_by_bin.append(bin_data)

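# Set tick labels via xticks; the `labels` keyword of boxplot is deprecated in newer matplotlib
plt.boxplot(rating_counts_by_bin)
plt.xticks(range(1, len(rating_bins) + 1), rating_bins)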
plt.xlabel('Average Rating Range')
plt.ylabel('Number of Ratings')
plt.title('Distribution of Rating Counts by Average Rating Range')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Movie rating analysis:")
print(f"• Correlation between average rating and number of ratings: {movie_stats['avg_rating'].corr(movie_stats['rating_count']):.3f}")

print(f"\nStatistics by rating range:")
for bin_label in movie_stats['rating_bin'].cat.categories:
    bin_data = movie_stats[movie_stats['rating_bin'] == bin_label]
    if len(bin_data) > 0:
        print(f"• {bin_label}: {len(bin_data)} movies, avg ratings count: {bin_data['rating_count'].mean():.1f}")

print(f"\nTop 5 movies by average rating:")
top_rated = movie_stats.nlargest(5, 'avg_rating')
for idx, row in top_rated.iterrows():
    print(f"• {row['title']}: {row['avg_rating']:.3f} (from {row['rating_count']} ratings)")

print(f"\nMost rated movies:")
most_rated = movie_stats.nlargest(5, 'rating_count')
for idx, row in most_rated.iterrows():
    print(f"• {row['title']}: {row['rating_count']} ratings (avg: {row['avg_rating']:.3f})")


In [None]:
# o) Plot: A heatmap of the User Item Matrix

# For visualization purposes, we'll create a smaller subset of the matrix
# Let's take the first 50 users and first 100 movies for better visualization
subset_users = sorted(ratings['userId'].unique())[:50]
subset_movies = sorted(ratings['movieId'].unique())[:100]

# Create subset of ratings
subset_ratings = ratings[
    (ratings['userId'].isin(subset_users)) & 
    (ratings['movieId'].isin(subset_movies))
]

# Create user-item matrix for the subset
subset_matrix = subset_ratings.pivot_table(index='userId', columns='movieId', values='rating')

plt.figure(figsize=(15, 8))
sns.heatmap(subset_matrix, cmap='viridis', cbar=True, 
            xticklabels=False, yticklabels=False, 
            cbar_kws={'label': 'Rating'})
plt.title('User-Item Matrix Heatmap (First 50 Users, First 100 Movies)', 
          fontsize=14, fontweight='bold')
plt.xlabel('Movies (Movie IDs)')
plt.ylabel('Users (User IDs)')
plt.tight_layout()
plt.show()

print("User-Item Matrix Heatmap Analysis:")
print(f"• Subset size: {subset_matrix.shape[0]} users × {subset_matrix.shape[1]} movies")
print(f"• Total possible ratings in subset: {subset_matrix.shape[0] * subset_matrix.shape[1]:,}")
print(f"• Actual ratings in subset: {subset_matrix.notna().sum().sum():,}")
print(f"• Sparsity of subset: {1 - (subset_matrix.notna().sum().sum() / (subset_matrix.shape[0] * subset_matrix.shape[1])):.3f}")

# Show some statistics about the subset
print(f"\nSubset statistics:")
# Average over all observed cells (a mean of column means would weight movies unequally)
print(f"• Average rating: {np.nanmean(subset_matrix.to_numpy()):.3f}")
print(f"• Rating range: {subset_matrix.min().min():.1f} - {subset_matrix.max().max():.1f}")
print(f"• Users with most ratings in subset: {subset_matrix.notna().sum(axis=1).max()}")
print(f"• Movies with most ratings in subset: {subset_matrix.notna().sum(axis=0).max()}")


In [None]:
# p) Plot: A heatmap of the User Item Matrix for the 100 most rated movies for the 50 users with most ratings

# Get top 50 users with most ratings
top_users = ratings.groupby('userId').size().nlargest(50).index.tolist()

# Get top 100 movies with most ratings
top_movies = ratings.groupby('movieId').size().nlargest(100).index.tolist()

# Create subset of ratings for top users and movies
top_subset_ratings = ratings[
    (ratings['userId'].isin(top_users)) & 
    (ratings['movieId'].isin(top_movies))
]

# Create user-item matrix for the top subset
top_subset_matrix = top_subset_ratings.pivot_table(index='userId', columns='movieId', values='rating')

plt.figure(figsize=(15, 8))
sns.heatmap(top_subset_matrix, cmap='plasma', cbar=True, 
            xticklabels=False, yticklabels=False, 
            cbar_kws={'label': 'Rating'})
plt.title('User-Item Matrix Heatmap (Top 50 Users by Rating Count, Top 100 Movies by Rating Count)', 
          fontsize=14, fontweight='bold')
plt.xlabel('Top 100 Movies by Rating Count')
plt.ylabel('Top 50 Users by Rating Count')
plt.tight_layout()
plt.show()

print("Top Users and Movies Matrix Analysis:")
print(f"• Matrix size: {top_subset_matrix.shape[0]} users × {top_subset_matrix.shape[1]} movies")
print(f"• Total possible ratings in matrix: {top_subset_matrix.shape[0] * top_subset_matrix.shape[1]:,}")
print(f"• Actual ratings in matrix: {top_subset_matrix.notna().sum().sum():,}")
print(f"• Density of matrix: {(top_subset_matrix.notna().sum().sum() / (top_subset_matrix.shape[0] * top_subset_matrix.shape[1])):.3f}")

# Show some statistics about the top subset
print(f"\nTop subset statistics:")
# Average over all observed cells (a mean of column means would weight movies unequally)
print(f"• Average rating: {np.nanmean(top_subset_matrix.to_numpy()):.3f}")
print(f"• Rating range: {top_subset_matrix.min().min():.1f} - {top_subset_matrix.max().max():.1f}")

print(f"\nTop 10 users by rating count:")
user_counts = ratings.groupby('userId').size().nlargest(10)
for user_id, count in user_counts.items():
    print(f"• User {user_id}: {count} ratings")

print(f"\nTop 10 movies by rating count:")
movie_counts = ratings.groupby('movieId').size().nlargest(10)
for movie_id, count in movie_counts.items():
    movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
    print(f"• Movie {movie_id} ({movie_title}): {count} ratings")

print(f"\nComparison with full matrix:")
top_density = top_subset_matrix.notna().sum().sum() / (top_subset_matrix.shape[0] * top_subset_matrix.shape[1])
print(f"• Full matrix sparsity: {sparsity:.4f}")
print(f"• Top subset sparsity: {1 - top_density:.4f}")
# "x denser" must compare densities, not sparsities
print(f"• The top users and movies matrix is {top_density / (1 - sparsity):.2f}x denser than the full matrix")


### Exercise 2 - Building a baseline RS (7 points)
In this exercise we will build a baseline RS and functions to calculate fundamental performance metrics. 

Build the following baseline RS to predict Top-N (default N=20):
1. In reference to the book *Collaborative Filtering Recommender Systems by Michael D. Ekstrand, John T. Riedl and Joseph A. Konstan* (p. 91ff) implement the baseline predictor $$ b_{u,i}= \mu +b_u +b_i $$ with the regularized user and item average offsets: $$ b_u = \frac{1}{|I_u| + \beta_u} \sum_{i \in I_u} (r_{u,i} - \mu) $$ and $$ b_i = \frac{1}{|U_i| + \beta_i} \sum_{u \in U_i} (r_{u,i} - b_u - \mu) . $$ Build a recommender system upon this baseline predictor. Set the default damping factors $\beta_u$ and $\beta_i$ both to 20.
2. Build a RS that makes *random* recommendations.
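
The formulas in item 1 can be checked by hand on a tiny example before running them on the full dataset. The following is a vectorized sketch under stated assumptions: the function names `fit_baseline` and `predict` and the toy ratings are hypothetical, and the damping is set to 2 here (instead of the default 20) only to keep the hand calculation short.

```python
import pandas as pd


def fit_baseline(ratings, beta_u=20.0, beta_i=20.0):
    """Return (mu, b_u, b_i) for a frame with userId, movieId, rating columns."""
    mu = ratings["rating"].mean()

    # b_u: damped mean of (r - mu) over the items each user rated.
    user_dev = (ratings["rating"] - mu).groupby(ratings["userId"])
    b_u = user_dev.sum() / (user_dev.count() + beta_u)

    # b_i: damped mean of (r - b_u - mu) over the users who rated each item.
    residual = ratings["rating"] - mu - ratings["userId"].map(b_u)
    item_dev = residual.groupby(ratings["movieId"])
    b_i = item_dev.sum() / (item_dev.count() + beta_i)
    return mu, b_u, b_i


def predict(mu, b_u, b_i, user_id, movie_id):
    """b_{u,i} = mu + b_u + b_i; unseen users or items get a zero offset."""
    return mu + b_u.get(user_id, 0.0) + b_i.get(movie_id, 0.0)


# Hypothetical toy ratings with global mean 3.0.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 20],
    "rating":  [5.0, 3.0, 4.0, 2.0, 1.0],
})

mu, b_u, b_i = fit_baseline(toy, beta_u=2.0, beta_i=2.0)
score = predict(mu, b_u, b_i, user_id=3, movie_id=10)
print(f"mu={mu:.3f}, b_u[3]={b_u[3]:.3f}, b_i[10]={b_i[10]:.3f}, score={score:.3f}")
```

By hand: user 1 has deviations 2 and 0, so b_u = 2 / (2 + 2) = 0.5; movie 10 has residuals 1.5 and 1.0, so b_i = 2.5 / (2 + 2) = 0.625. A Top-N list then follows from scoring every movie the user has not rated and keeping the N highest scores.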

Output the recommendations for three example users (IDs 1, 3 and 7) using the default parameters. Give the titles of the recommended movies and their predicted scores, not just their IDs.
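
A random recommender has no real predicted rating, so one possible convention is to draw a placeholder score from the rating scale. The sketch below assumes exactly that; the function name `recommend_random`, the `seed` parameter, and the toy frames are hypothetical. It also shows one way to report titles alongside IDs, as requested above.

```python
import numpy as np
import pandas as pd


def recommend_random(ratings, movies, user_id, n=20, seed=0):
    """Pick up to n unseen movies uniformly at random and attach a random score."""
    rng = np.random.default_rng(seed)
    seen = ratings.loc[ratings["userId"] == user_id, "movieId"].tolist()
    candidates = movies.loc[~movies["movieId"].isin(seen), ["movieId", "title"]]
    k = min(n, len(candidates))
    rows = rng.choice(len(candidates), size=k, replace=False)
    picked = candidates.iloc[rows].copy()
    # Assumption: the "score" is a uniform draw from the rating scale, not a prediction.
    picked["score"] = rng.uniform(0.5, 5.0, size=k)
    return picked.sort_values("score", ascending=False).reset_index(drop=True)


# Hypothetical toy data; titles are placeholders, not MovieLens entries.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 3],
    "rating": [4.0, 3.0, 5.0],
})

recs = recommend_random(toy_ratings, toy_movies, user_id=1, n=2)
for row in recs.itertuples(index=False):
    print(f"{row.title} (ID {row.movieId}): {row.score:.2f}")
```

Fixing the seed keeps the random baseline reproducible across notebook runs, which makes later comparisons against the baseline predictor repeatable.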

In [None]:
# Exercise 2 - Building a baseline RS

import numpy as np
from collections import defaultdict

class BaselineRecommender:
    """
    Baseline recommender system implementing the regularized baseline predictor:
    b_{u,i} = μ + b_u + b_i
    
    Where:
    b_u = (1 / (|I_u| + β_u)) * Σ(r_{u,i} - μ)
    b_i = (1 / (|U_i| + β_i)) * Σ(r_{u,i} - b_u - μ)
    """
    
    def __init__(self, beta_u=20, beta_i=20):
        self.beta_u = beta_u
        self.beta_i = beta_i
        self.mu = 0  # Global average rating
        self.b_u = {}  # User biases
        self.b_i = {}  # Item biases
        self.trained = False
        
    def fit(self, ratings_df):
        """Train the baseline predictor on the ratings data"""
        print("Training baseline predictor...")
        
        # Calculate global average
        self.mu = ratings_df['rating'].mean()
        print(f"Global average rating (μ): {self.mu:.3f}")
        
        # Group ratings by user and by item.
        # Iterating over the columns directly avoids the slow DataFrame.iterrows().
        user_ratings = defaultdict(list)
        item_ratings = defaultdict(list)
        for user_id, item_id, rating in zip(
            ratings_df['userId'], ratings_df['movieId'], ratings_df['rating']
        ):
            user_ratings[user_id].append(rating)
            item_ratings[item_id].append((user_id, rating))
        
        # Calculate user biases
        print("Calculating user biases...")
        for user_id, ratings_list in user_ratings.items():
            numerator = sum(r - self.mu for r in ratings_list)
            denominator = len(ratings_list) + self.beta_u
            self.b_u[user_id] = numerator / denominator
        
        # Calculate item biases from the residuals after removing μ and the user bias.
        # A single pass is sufficient: b_i depends only on μ and the already computed b_u.
        print("Calculating item biases...")
        for item_id, rating_pairs in item_ratings.items():
            numerator = sum(r - self.b_u.get(user_id, 0) - self.mu for user_id, r in rating_pairs)
            denominator = len(rating_pairs) + self.beta_i
            self.b_i[item_id] = numerator / denominator
        
        self.trained = True
        print(f"Training completed. Calculated biases for {len(self.b_u)} users and {len(self.b_i)} items.")
        
    def predict(self, user_id, item_id):
        """Predict rating for a user-item pair"""
        if not self.trained:
            raise ValueError("Model must be trained before making predictions")
        
        b_u = self.b_u.get(user_id, 0)
        b_i = self.b_i.get(item_id, 0)
        
        prediction = self.mu + b_u + b_i
        
        # Clip to valid rating range
        return max(0.5, min(5.0, prediction))
    
    def recommend_top_n(self, user_id, n=20, exclude_rated=True):
        """Recommend top N items for a user"""
        if not self.trained:
            raise ValueError("Model must be trained before making recommendations")
        
        # Get all items
        all_items = set(self.b_i.keys())
        
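        # Exclude items already rated by user if requested.
        # Note: this reads the notebook-level `ratings` DataFrame, not the data passed
        # to fit(); with a train/validation split, held-out ratings would be excluded too.
        if exclude_rated:
            rated_items = set(ratings[ratings['userId'] == user_id]['movieId'])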
            candidate_items = all_items - rated_items
        else:
            candidate_items = all_items
        
        # Calculate predictions for all candidate items
        predictions = []
        for item_id in candidate_items:
            pred_rating = self.predict(user_id, item_id)
            predictions.append((item_id, pred_rating))
        
        # Sort by predicted rating and return top N
        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]
    
    def get_bias_stats(self):
        """Get statistics about the calculated biases"""
        if not self.trained:
            return {}
        
        user_biases = list(self.b_u.values())
        item_biases = list(self.b_i.values())
        
        return {
            'global_avg': self.mu,
            'user_bias_stats': {
                'mean': np.mean(user_biases),
                'std': np.std(user_biases),
                'min': np.min(user_biases),
                'max': np.max(user_biases)
            },
            'item_bias_stats': {
                'mean': np.mean(item_biases),
                'std': np.std(item_biases),
                'min': np.min(item_biases),
                'max': np.max(item_biases)
            }
        }

# Initialize and train the baseline recommender
baseline_rec = BaselineRecommender(beta_u=20, beta_i=20)
baseline_rec.fit(ratings)

# Display bias statistics
bias_stats = baseline_rec.get_bias_stats()
print("\n=== BASELINE PREDICTOR STATISTICS ===")
print(f"Global average rating: {bias_stats['global_avg']:.3f}")
print(f"User bias statistics:")
for stat, value in bias_stats['user_bias_stats'].items():
    print(f"  {stat}: {value:.3f}")
print(f"Item bias statistics:")
for stat, value in bias_stats['item_bias_stats'].items():
    print(f"  {stat}: {value:.3f}")


In [None]:
# 2. Build a RS that recommends based on random recommendations

class RandomRecommender:
    """
    Random recommender system that recommends items randomly
    """
    
    def __init__(self, random_seed=42):
        self.random_seed = random_seed
        np.random.seed(random_seed)
        self.trained = False
        
    def fit(self, ratings_df, movies_df):
        """Initialize the random recommender with available items"""
        print("Initializing random recommender...")
        self.movies_df = movies_df
        self.available_items = set(movies_df['movieId'].unique())
        self.trained = True
        print(f"Random recommender initialized with {len(self.available_items)} items.")
        
    def recommend_top_n(self, user_id, n=20, exclude_rated=True):
        """Recommend N random items for a user"""
        if not self.trained:
            raise ValueError("Model must be trained before making recommendations")
        
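        # Exclude items already rated by user if requested.
        # Note: this reads the notebook-level `ratings` DataFrame, not the `ratings_df`
        # argument of fit(), which is currently unused.
        if exclude_rated:
            rated_items = set(ratings[ratings['userId'] == user_id]['movieId'])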
            candidate_items = list(self.available_items - rated_items)
        else:
            candidate_items = list(self.available_items)
        
        # Randomly sample N items
        if len(candidate_items) >= n:
            selected_items = np.random.choice(candidate_items, size=n, replace=False)
        else:
            selected_items = candidate_items
        
        # Assign random scores (for consistency with other recommenders)
        random_scores = np.random.uniform(1.0, 5.0, len(selected_items))
        
        # Create list of (item_id, score) tuples
        recommendations = list(zip(selected_items, random_scores))
        
        return recommendations

# Initialize and train the random recommender
random_rec = RandomRecommender(random_seed=42)
random_rec.fit(ratings, movies)


In [None]:
# Output recommendations for three example users (IDs 1, 3 and 7)

def display_recommendations(recommender, user_id, n=20, recommender_name="Recommender"):
    """Display recommendations for a user with movie titles and scores"""
    print(f"\n=== {recommender_name.upper()} - TOP {n} RECOMMENDATIONS FOR USER {user_id} ===")
    
    # Get recommendations
    recommendations = recommender.recommend_top_n(user_id, n=n, exclude_rated=True)
    
    if not recommendations:
        print("No recommendations available (user may have rated all items)")
        return
    
    # Display recommendations with movie titles
    for i, (movie_id, score) in enumerate(recommendations, 1):
        movie_title = movies[movies['movieId'] == movie_id]['title'].iloc[0]
        movie_genres = movies[movies['movieId'] == movie_id]['genres'].iloc[0]
        print(f"{i:2d}. {movie_title}")
        print(f"    Movie ID: {movie_id}, Predicted Score: {score:.3f}")
        print(f"    Genres: {movie_genres}")
        print()
    
    print(f"Total recommendations: {len(recommendations)}")

# Test users
test_users = [1, 3, 7]

# Display recommendations for each user with both recommenders
for user_id in test_users:
    print(f"\n{'='*80}")
    print(f"RECOMMENDATIONS FOR USER {user_id}")
    print(f"{'='*80}")
    
    # Check if user exists in the dataset
    if user_id not in ratings['userId'].unique():
        print(f"User {user_id} not found in the dataset!")
        continue
    
    user_rating_count = len(ratings[ratings['userId'] == user_id])
    user_avg_rating = ratings[ratings['userId'] == user_id]['rating'].mean()
    print(f"User {user_id} has rated {user_rating_count} movies with average rating: {user_avg_rating:.3f}")
    
    # Baseline recommender recommendations
    display_recommendations(baseline_rec, user_id, n=20, recommender_name="Baseline")
    
    # Random recommender recommendations
    display_recommendations(random_rec, user_id, n=20, recommender_name="Random")


## Exercise 2 Summary

**Exercise 2 - Building a baseline RS** has been completed successfully!

### Implemented Components:

1. **Baseline Predictor with Regularized Offsets**: 
   - Implemented the formula: `b_{u,i} = μ + b_u + b_i`
   - User bias: `b_u = (1/(|I_u| + β_u)) * Σ(r_{u,i} - μ)`
   - Item bias: `b_i = (1/(|U_i| + β_i)) * Σ(r_{u,i} - b_u - μ)`
   - Default regularization parameters: β_u = 20, β_i = 20

2. **Baseline Recommender System**:
   - Trained on the full ratings dataset
   - Calculates user and item biases with regularization
   - Provides Top-N recommendations (default N=20)
   - Excludes already rated items from recommendations

3. **Random Recommender System**:
   - Recommends items randomly from the available movie catalog
   - Assigns random scores for consistency
   - Uses reproducible random seed (42)

4. **Recommendation Output**:
   - Generated Top-20 recommendations for users 1, 3, and 7
   - Displayed movie titles, IDs, predicted scores, and genres
   - Compared baseline vs random recommendations

### Key Features:
- **Modular Design**: Class-based implementation for both recommenders with a shared `recommend_top_n` interface
- **Readable Output**: Movie titles, predicted scores, and genres for all recommendations
- **Score Summary**: Descriptive statistics of the predicted scores of both recommenders (no accuracy metrics yet; these follow in exercise 3)
- **Regularization**: The damping terms β_u and β_i shrink the biases of users and items with few ratings towards zero

The baseline predictor adjusts the global mean by a user offset and an item offset. Because `b_u` shifts all predictions of a user by the same constant, the Top-N ranking is determined by `b_i` alone and is therefore the same for every user, apart from the exclusion of already rated items (clipping to the rating range can additionally create ties at 5.0). The baseline is thus a non-personalized reference point, and the random recommender a lower bound, for the collaborative filtering methods in the subsequent exercises.


In [None]:
# Additional analysis: Compare baseline and random recommenders

print("=== COMPARISON ANALYSIS ===")

# Calculate some basic statistics for comparison
def analyze_recommender_performance(recommender, recommender_name, sample_users=None):
    """Analyze basic performance metrics for a recommender"""
    if sample_users is None:
        sample_users = [1, 3, 7]
    
    total_recommendations = 0
    avg_scores = []
    
    for user_id in sample_users:
        if user_id in ratings['userId'].unique():
            recommendations = recommender.recommend_top_n(user_id, n=20, exclude_rated=True)
            total_recommendations += len(recommendations)
            if recommendations:
                scores = [score for _, score in recommendations]
                avg_scores.extend(scores)
    
    if avg_scores:
        return {
            'name': recommender_name,
            'total_recommendations': total_recommendations,
            'avg_predicted_score': np.mean(avg_scores),
            'std_predicted_score': np.std(avg_scores),
            'min_predicted_score': np.min(avg_scores),
            'max_predicted_score': np.max(avg_scores)
        }
    else:
        return {
            'name': recommender_name,
            'total_recommendations': 0,
            'avg_predicted_score': 0,
            'std_predicted_score': 0,
            'min_predicted_score': 0,
            'max_predicted_score': 0
        }

# Analyze both recommenders
baseline_stats = analyze_recommender_performance(baseline_rec, "Baseline")
random_stats = analyze_recommender_performance(random_rec, "Random")

print("Performance Comparison:")
print(f"{'Metric':<25} {'Baseline':<15} {'Random':<15}")
print("-" * 55)
print(f"{'Total Recommendations':<25} {baseline_stats['total_recommendations']:<15} {random_stats['total_recommendations']:<15}")
print(f"{'Avg Predicted Score':<25} {baseline_stats['avg_predicted_score']:.3f}{'':<11} {random_stats['avg_predicted_score']:.3f}{'':<11}")
print(f"{'Std Predicted Score':<25} {baseline_stats['std_predicted_score']:.3f}{'':<11} {random_stats['std_predicted_score']:.3f}{'':<11}")
print(f"{'Min Predicted Score':<25} {baseline_stats['min_predicted_score']:.3f}{'':<11} {random_stats['min_predicted_score']:.3f}{'':<11}")
print(f"{'Max Predicted Score':<25} {baseline_stats['max_predicted_score']:.3f}{'':<11} {random_stats['max_predicted_score']:.3f}{'':<11}")

print(f"\n=== BASELINE PREDICTOR DETAILS ===")
print(f"Global average rating (μ): {baseline_rec.mu:.3f}")
print(f"Regularization parameters: β_u = {baseline_rec.beta_u}, β_i = {baseline_rec.beta_i}")
print(f"Number of users with biases: {len(baseline_rec.b_u)}")
print(f"Number of items with biases: {len(baseline_rec.b_i)}")

print(f"\n=== RANDOM RECOMMENDER DETAILS ===")
print(f"Random seed: {random_rec.random_seed}")
print(f"Available items for recommendation: {len(random_rec.available_items)}")

print(f"\nNote: The baseline recommender uses the regularized baseline predictor formula:")
print(f"b_{{u,i}} = μ + b_u + b_i")
print(f"where b_u and b_i are regularized user and item biases respectively.")
print(f"This provides a more sophisticated baseline than random recommendations.")


### Exercise 3 - Evaluation methods (15 points)
Split the data into a train/validation set and a separate test set. This test set shall contain the first 20% of the users and shall not be used at all before exercise 10. With the remaining 80% do the following: 
Implement a function to partition your dataset for an offline evaluation based on holding out random users, using 5-fold cross validation with an 80/20 train/validation split. Within the validation set, implement masking with the *all but n* approach. 
See page 2942 of https://jmlr.csail.mit.edu/papers/volume10/gunawardana09a/gunawardana09a.pdf for details on this approach. 

Choose the number of masked items n reasonably and explain your considerations.
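
The partitioning described above could be sketched as follows. This is a minimal sketch under assumptions, not a reference solution: the function names `user_kfold` and `mask_all_but_n`, the parameter `n_hidden`, and the toy data are all hypothetical, and users with at most `n_hidden` ratings are simply left unmasked here, which is only one of several possible choices.

```python
import numpy as np
import pandas as pd


def user_kfold(user_ids, n_splits=5, seed=42):
    """Yield (train_users, val_users); every user lands in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    users = rng.permutation(np.unique(user_ids))
    folds = np.array_split(users, n_splits)
    for i in range(n_splits):
        val_users = folds[i]
        train_users = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_users, val_users


def mask_all_but_n(val_ratings, n_hidden=3, seed=42):
    """Hide n_hidden random ratings per validation user; all other ratings stay observed."""
    rng = np.random.default_rng(seed)
    hidden_idx = []
    for _, group in val_ratings.groupby("userId"):
        if len(group) > n_hidden:  # assumption: users with too few ratings are not masked
            hidden_idx.extend(rng.choice(group.index.to_numpy(), size=n_hidden, replace=False))
    hidden = val_ratings.loc[hidden_idx]
    observed = val_ratings.drop(index=hidden_idx)
    return observed, hidden


# Toy data (hypothetical): 10 users with 8 ratings each
toy = pd.DataFrame({
    "userId": np.repeat(np.arange(1, 11), 8),
    "movieId": np.tile(np.arange(100, 108), 10),
    "rating": np.random.default_rng(0).integers(1, 6, size=80).astype(float),
})

folds = list(user_kfold(toy["userId"]))
train_users, val_users = folds[0]
val = toy[toy["userId"].isin(val_users)]
observed, hidden = mask_all_but_n(val, n_hidden=3)
print(len(train_users), len(val_users), len(observed), len(hidden))  # 8 2 10 6
```

The observed validation ratings can be shown to the recommender as the user's profile, while the hidden ratings serve as ground truth for the metrics.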

Implement functions to calculate the following metrics:
- *Mean Absolute Error (MAE)* 
- *Root Mean Square Error (RMSE)*
- *Precision@N* with default $N=20$ and relevance threshold 4.0 stars.
- *Recall@N* with default $N=20$ and relevance threshold 4.0 stars.
- *One metric of the following: Novelty, Diversity, Unexpectedness, Serendipity, Coverage*
Explain each of these. How does the relevance threshold influence the metrics? How would you choose this parameter?

Note: For *Precision@N* and *Recall@N* use the definitions from https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54 with one exception: if the denominator is zero, set the metric to 0. 

For *Novelty*, *Diversity*, *Unexpectedness*, *Serendipity*, *Coverage* you may use definitions from Silveira et al. https://link.springer.com/article/10.1007/s13042-017-0762-9 
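
The metrics listed above might be sketched like this for a single user's masked items. The sketch rests on assumptions: it follows one reading of the linked article, in which an item counts as *recommended* if it is in the top N and its predicted rating reaches the threshold, and as *relevant* if its true rating reaches the threshold. The function names and the simple catalog-coverage variant are hypothetical and not taken from Silveira et al.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean square error; penalizes large errors more strongly than MAE."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def precision_recall_at_n(true_ratings, pred_ratings, n=20, threshold=4.0):
    """Precision@N and Recall@N for one user; a zero denominator yields 0."""
    true_r = np.asarray(true_ratings, float)
    pred_r = np.asarray(pred_ratings, float)
    top = np.argsort(-pred_r, kind="stable")[:n]  # indices of the N highest predictions
    n_relevant = int(np.sum(true_r >= threshold))
    n_recommended = int(np.sum(pred_r[top] >= threshold))
    n_hits = int(np.sum((true_r[top] >= threshold) & (pred_r[top] >= threshold)))
    precision = n_hits / n_recommended if n_recommended > 0 else 0.0
    recall = n_hits / n_relevant if n_relevant > 0 else 0.0
    return precision, recall


def catalog_coverage(recommendation_lists, n_catalog_items):
    """Share of catalog items that appear in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended) / n_catalog_items if n_catalog_items > 0 else 0.0


# Toy example (hypothetical ratings of one user)
true_r = [5.0, 4.0, 3.0, 2.0, 4.5]
pred_r = [4.5, 3.0, 4.2, 1.0, 4.1]
print(mae(true_r, pred_r), rmse(true_r, pred_r))
print(precision_recall_at_n(true_r, pred_r, n=3, threshold=4.0))
```

Raising the threshold shrinks both the set of relevant items and the set of recommended items, so the zero-denominator case becomes more frequent.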

### Exercise 4 - Optimize hyperparameters of baseline RS (6 points)
Optimize the hyperparameters $\beta_u$ and $\beta_i$ for the baseline RS from exercise 2 based on the RMSE metric. To save computation time find a reasonable maximum value for the betas. Explain your approach and your solution.
Plot the MAE, RMSE, Precision@N, Recall@N as functions of the betas.

Which metric would you use for hyperparameter tuning? Explain your decision.
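
A grid search over the damping factors could look roughly like the sketch below. It is only an illustration under assumptions: `fit_baseline` and `predict_baseline` are hypothetical vectorized helpers that follow the formulas from exercise 2, the toy ratings are random, and the random row split merely stands in for the user-based cross validation from exercise 3.

```python
import itertools

import numpy as np
import pandas as pd


def fit_baseline(train, beta_u, beta_i):
    """Return (mu, b_u, b_i) for the regularized baseline predictor."""
    mu = train["rating"].mean()
    dev_u = (train["rating"] - mu).groupby(train["userId"])
    b_u = dev_u.sum() / (dev_u.count() + beta_u)
    # Item offsets are computed on the residuals after removing mu and b_u
    resid = train["rating"] - mu - train["userId"].map(b_u)
    dev_i = resid.groupby(train["movieId"])
    b_i = dev_i.sum() / (dev_i.count() + beta_i)
    return mu, b_u, b_i


def predict_baseline(model, pairs):
    """Predict ratings for user-item pairs; unknown users or items get offset 0."""
    mu, b_u, b_i = model
    pred = mu + pairs["userId"].map(b_u).fillna(0.0) + pairs["movieId"].map(b_i).fillna(0.0)
    return pred.clip(0.5, 5.0).to_numpy()


# Toy data (hypothetical) with a random 80/20 row split
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "userId": rng.integers(1, 31, size=600),
    "movieId": rng.integers(1, 41, size=600),
    "rating": rng.integers(1, 11, size=600) / 2.0,
}).drop_duplicates(["userId", "movieId"])
is_val = rng.random(len(toy)) < 0.2
train, val = toy[~is_val], toy[is_val]

rows = []
for beta_u, beta_i in itertools.product([0, 5, 20, 50], repeat=2):
    model = fit_baseline(train, beta_u, beta_i)
    err = val["rating"].to_numpy() - predict_baseline(model, val)
    rows.append({"beta_u": beta_u, "beta_i": beta_i, "rmse": float(np.sqrt(np.mean(err ** 2)))})

results = pd.DataFrame(rows)
print(results.loc[results["rmse"].idxmin()])
```

The `results` table can be pivoted on `beta_u` and `beta_i` to draw the requested plots, for example as a heatmap.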

### Exercise 5 - Collaborative filtering; item-based and user-based (12 points)
In this exercise we will build several different collaborative-filtering RS based on the nearest-neighbour technique, both item-based and user-based. 

Implement:
1. a RS based on the $K$ most similar items (K nearest neighbours). Similarity shall be calculated based on *cosine similarity*. 
2. a RS based on the $K$ most similar items (K nearest neighbours). Similarity shall be calculated based on *Pearson Correlation Coefficient*. 
3. a RS based on the $K$ most similar users (K nearest neighbours). Similarity shall be calculated based on *cosine similarity*. 
4. a RS based on the $K$ most similar users (K nearest neighbours). Similarity shall be calculated based on *Pearson Correlation Coefficient*. 

Each should have a default $K$ of 30.

Explain how you handle NaN values in the user rating matrix when computing similarities. What other preparations, such as normalization and mean centering, are useful?

Describe the two similarity metrics.

Show the top 20 recommended items for user ids 3, 5 and 7.
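
One possible way to handle NaN values is to compute each similarity only over the entries that both vectors have rated. The sketch below illustrates this choice; the function names and the `min_overlap` parameter are hypothetical, and other strategies (for example filling missing values with 0 after mean centering) are equally legitimate.

```python
import numpy as np


def _co_rated(a, b):
    """Return the entries of a and b at positions where both are not NaN."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = ~np.isnan(a) & ~np.isnan(b)
    return a[mask], b[mask]


def cosine_similarity(a, b, min_overlap=2):
    """Cosine of the angle between the co-rated parts of two rating vectors."""
    x, y = _co_rated(a, b)
    if x.size < min_overlap:
        return 0.0
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0


def pearson_similarity(a, b, min_overlap=2):
    """Pearson correlation: cosine similarity after mean centering the co-rated parts."""
    x, y = _co_rated(a, b)
    if x.size < min_overlap:
        return 0.0
    x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0


# Toy rating vectors (hypothetical); NaN marks a missing rating
u = [5.0, 3.0, np.nan, 1.0]
v = [4.0, 2.0, 5.0, np.nan]
print(cosine_similarity(u, v), pearson_similarity(u, v))
```

The same functions work for item-based and user-based filtering: pass two columns of the user-item matrix for items, or two rows for users.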


### Exercise 6 - Optimize hyperparameter $K$ (6 points)
Optimize the hyperparameter $K$ for all RS from the prior exercise optimizing for minimal RMSE. 
For each RS plot RMSE, Precision@N and Recall@N as a function of $K$. 

Compare the results of these four RS on the 3 example users. Do the results match your expectation? Describe.

### Exercise 7 - Model-based RS: SVD (10 points)
In this exercise we will use the unsupervised method *singular value decomposition (SVD)* from the Python package *surprise* (https://surpriselib.com, documentation https://surprise.readthedocs.io/en/stable/matrix_factorization.html). SVD can compress much of the information of a matrix into a few components.  

a) Run the SVD RS and show the results on the three example users from exercise 2. Explain how this algorithm works.

Note: A very good general introduction to SVD is the YouTube video series starting with https://www.youtube.com/watch?v=gXbThCXjZFM&t=337s . See also *Collaborative filtering recommender systems* by Ekstrand et al., *Mining of massive datasets* by Leskovec, chapter 11 (2020), and *Recommender systems: The textbook* by Aggarwal, chapter 3.

b) We now explore which latent factors SVD has learned. Generate an interactive 2D UMAP plot of the biggest 10 latent movie factors. Explore the resulting plot. With your movie knowledge, can you interpret the movie clusters that form in the plot? Can you give names to (some) clusters?

Background: UMAP is a method for dimensionality reduction. Dimensionality reduction is typically used to represent high-dimensional data sets in fewer dimensions with the goal of allowing an interpretable 2D/3D visualization. See the documentation of the Python package at https://umap-learn.readthedocs.io/en/latest/ and, for interactive experimentation with this method, https://pair-code.github.io/understanding-umap/ to gain an intuition of the two important parameters of this method: *n_neighbors* and *min_dist*.

**At the MSP defense I do not expect a mathematical explanation of how UMAP works. However, you should have an intuition of what the method does and how the two parameters mentioned above influence the results.**


### Exercise 8 - Optimize hyperparameter $k$ or `n_factors` (4 points)
Optimize the hyperparameter, representing the number of greatest SVD components used for the truncated reconstruction of the user item matrix, to minimize RMSE.
Plot RMSE, Precision@N and Recall@N as a function of this hyperparameter. Finally output all performance metrics from exercise 3 for the optimal $k$ value.
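
The idea of a truncated reconstruction can be illustrated with plain NumPy on a small dense matrix. This is a sketch of the concept only, under assumptions: the helper name `truncated_svd_reconstruction` and the random toy matrix are hypothetical, and the *surprise* `SVD` class learns its `n_factors` latent factors by stochastic gradient descent on the observed ratings rather than by an exact decomposition as shown here.

```python
import numpy as np


def truncated_svd_reconstruction(matrix, k):
    """Rebuild a dense matrix from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Toy dense matrix (hypothetical): 6 users x 5 items without missing values
rng = np.random.default_rng(0)
dense = rng.normal(size=(6, 5))

# Reconstruction error on the matrix itself for k = 1..5
errors = [
    float(np.sqrt(np.mean((dense - truncated_svd_reconstruction(dense, k)) ** 2)))
    for k in range(1, 6)
]
print(errors)
```

The reconstruction error on the decomposed matrix itself can only decrease as $k$ grows, which is why the hyperparameter has to be chosen on held-out validation ratings, where too many factors can overfit.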

### Exercise 9 - Everything goes (30 points)
In this exercise you can explore different methods of RS. You are not limited in which methods you apply. You can try to improve the methods from the earlier exercises by modifying them or by generating ensemble or hybrid RS. You could also train deep neural networks, use NLP methods, use the links to IMDb available in the dataset to further enrich the dataset, or find an obscure method by someone else on GitHub. 
Document what your inspirations and sources are and describe the method conceptually. 

**Build and optimize in total *three* different methods. The last one has the additional requirement that it should increase the diversity of the recommendations in order to minimize filter bubbles.**

**Important: If you use the work of someone else you must be able to explain the method conceptually during the defense MSP.** 

Output the performance metrics of exercise 3. 
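
For the diversity requirement, one commonly used option is a greedy re-ranking in the style of maximal marginal relevance (MMR), which trades predicted relevance against similarity to the items already selected. The sketch below is only one possible approach and rests on assumptions: the function name `mmr_rerank`, the trade-off parameter `lam`, and the toy similarity matrix are hypothetical, and the scores are assumed to be scaled to a range comparable to the similarities.

```python
import numpy as np


def mmr_rerank(scores, similarity, n=20, lam=0.7):
    """Greedily pick items maximizing lam * score - (1 - lam) * max similarity to picked items."""
    scores = np.asarray(scores, float)
    similarity = np.asarray(similarity, float)
    candidates = list(range(len(scores)))
    selected = []
    while candidates and len(selected) < n:
        best, best_value = None, -np.inf
        for c in candidates:
            max_sim = max((similarity[c, s] for s in selected), default=0.0)
            value = lam * scores[c] - (1 - lam) * max_sim
            if value > best_value:
                best, best_value = c, value
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example (hypothetical): items 0 and 1 are near-duplicates, item 2 is different
scores = [1.0, 0.95, 0.5]
similarity = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(mmr_rerank(scores, similarity, n=3, lam=0.5))  # [0, 2, 1]
print(mmr_rerank(scores, similarity, n=3, lam=1.0))  # [0, 1, 2]
```

With `lam=1.0` the original ranking is kept, while smaller values push near-duplicates of already selected items further down the list.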

### Exercise 10 - Compare all RS that you build in this challenge (8 points)
a) Compile a table with the performance metrics of exercise 3 for all RS from this MC (Make sure to include the baseline RS and random RS) on the test set defined in exercise 3. Also generate comparative plots. Discuss.

b) Why is it important to keep a test set separate until the end of a benchmark?

**Read the Guidelines for Implementation and Submission one more time.**