# MOVIE RECOMMENDER SYSTEM USING RESTRICTED BOLTZMANN MACHINES (RBM)

This notebook implements a movie recommendation system using Restricted Boltzmann Machines (RBM).
RBMs are generative stochastic neural networks that can learn probability distributions over their inputs.

Key Concepts:
- RBM learns latent factors from user-movie interactions
- Uses Contrastive Divergence for training
- Converts explicit 1-5 ratings to binary liked / not-liked labels
- Handles missing data by marking unrated movies as -1 and excluding them from the reconstruction and the loss

Datasets: MovieLens 1M (explored in Step 3) and MovieLens 100K (used for training and evaluation from Step 4 onward)
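
For reference, the textbook formulation of a binary RBM is sketched below. It is background rather than something stated in the notebook, but it uses the names that appear in the code later on: a weight matrix $W$ of shape $n_h \times n_v$, a hidden bias $a$, and a visible bias $b$. Here $\sigma$ is the logistic sigmoid.

```latex
\begin{aligned}
E(v, h) &= -\, b^{\top} v \;-\; a^{\top} h \;-\; h^{\top} W v, \\
p(h_j = 1 \mid v) &= \sigma\!\Big( a_j + \sum_{i} W_{ji}\, v_i \Big), \\
p(v_i = 1 \mid h) &= \sigma\!\Big( b_i + \sum_{j} W_{ji}\, h_j \Big).
\end{aligned}
```

The two conditionals correspond to what `sample_h` and `sample_v` compute in Step 7.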

# STEP 1: IMPORTING NECESSARY LIBRARIES

In [1]:
# Data manipulation and downloading
import pandas as pd
import zipfile
import urllib.request

# Numerical operations and deep learning
import numpy as np
import torch

# STEP 2: DOWNLOADING THE DATASETS

In [2]:
# Download MovieLens 100K dataset
url_100k = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
urllib.request.urlretrieve(url_100k, "ml-100k.zip")

# Download MovieLens 1M dataset
url_1m = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
urllib.request.urlretrieve(url_1m, "ml-1m.zip")

# Extract the datasets
with zipfile.ZipFile("ml-100k.zip", 'r') as zip_ref:
    zip_ref.extractall("ml-100k")

with zipfile.ZipFile("ml-1m.zip", 'r') as zip_ref:
    zip_ref.extractall("ml-1m")
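
The cell above downloads and extracts both archives every time it runs. A minimal sketch of a re-runnable variant is shown below. The helper name `fetch_and_extract` and its `downloader` parameter are hypothetical (they are not part of the notebook), and it only checks whether the archive and the target directory already exist, not whether they are complete.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, archive_path, target_dir,
                      downloader=urllib.request.urlretrieve):
    """Download `url` to `archive_path` and unzip it into `target_dir`.

    Each step is skipped when its result is already on disk, so the cell
    can be re-run without hitting the network again.
    """
    if not os.path.exists(archive_path):
        downloader(url, archive_path)
    if not os.path.isdir(target_dir):
        with zipfile.ZipFile(archive_path, "r") as zip_ref:
            zip_ref.extractall(target_dir)
    return target_dir


# Usage with the notebook's own URLs (left commented out to avoid a download here):
# fetch_and_extract(url_100k, "ml-100k.zip", "ml-100k")
# fetch_and_extract(url_1m, "ml-1m.zip", "ml-1m")
```

Passing the downloader as an argument is a design choice that makes the helper easy to exercise offline with a stand-in function.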

# STEP 3: LOADING AND EXPLORING THE DATA

In [3]:
# Load movies data
movies = pd.read_csv('ml-1m/ml-1m/movies.dat', sep='::', header=None,
                     engine='python', encoding='latin-1')
print(f"✅ Movies data loaded: {movies.shape[0]} movies, {movies.shape[1]} columns")

# Display movie data structure
print("\n📽️ Movie Data Structure:")
print("Column 0: Movie ID")
print("Column 1: Movie Title")
print("Column 2: Movie Genre")
print(f"\nSample movies:")
print(movies.head())

✅ Movies data loaded: 3883 movies, 3 columns

📽️ Movie Data Structure:
Column 0: Movie ID
Column 1: Movie Title
Column 2: Movie Genre

Sample movies:
   0                                   1                             2
0  1                    Toy Story (1995)   Animation|Children's|Comedy
1  2                      Jumanji (1995)  Adventure|Children's|Fantasy
2  3             Grumpier Old Men (1995)                Comedy|Romance
3  4            Waiting to Exhale (1995)                  Comedy|Drama
4  5  Father of the Bride Part II (1995)                        Comedy


In [4]:
users = pd.read_csv('ml-1m/ml-1m/users.dat', sep='::', header=None,
                    engine='python', encoding='latin-1')
print(f"✅ Users data loaded: {users.shape[0]} users, {users.shape[1]} columns")

# Display user data structure
print("\n👤 User Data Structure:")
print("Column 0: User ID")
print("Column 1: Gender (M/F)")
print("Column 2: Age Group")
print("Column 3: Occupation Code")
print("Column 4: Zip Code")
print(f"\nSample users:")
print(users.head())

✅ Users data loaded: 6040 users, 5 columns

👤 User Data Structure:
Column 0: User ID
Column 1: Gender (M/F)
Column 2: Age Group
Column 3: Occupation Code
Column 4: Zip Code

Sample users:
   0  1   2   3      4
0  1  F   1  10  48067
1  2  M  56  16  70072
2  3  M  25  15  55117
3  4  M  45   7  02460
4  5  M  25  20  55455


In [5]:
# Load ratings data
ratings = pd.read_csv('ml-1m/ml-1m/ratings.dat', sep='::', header=None,
                      engine='python', encoding='latin-1')
print(f"✅ Ratings data loaded: {ratings.shape[0]} ratings, {ratings.shape[1]} columns")

# Display ratings data structure
print("\n⭐ Ratings Data Structure:")
print("Column 0: User ID")
print("Column 1: Movie ID")
print("Column 2: Rating (1-5)")
print("Column 3: Timestamp")
print(f"\nSample ratings:")
print(ratings.head())

✅ Ratings data loaded: 1000209 ratings, 4 columns

⭐ Ratings Data Structure:
Column 0: User ID
Column 1: Movie ID
Column 2: Rating (1-5)
Column 3: Timestamp

Sample ratings:
   0     1  2          3
0  1  1193  5  978300760
1  1   661  3  978302109
2  1   914  3  978301968
3  1  3408  4  978300275
4  1  2355  5  978824291


In [6]:
# Dataset statistics
print(f"\n📈 Dataset Statistics:")
print(f"Total movies: {movies.shape[0]}")
print(f"Total users: {users.shape[0]}")
print(f"Total ratings: {ratings.shape[0]:,}")
print(f"Average ratings per user: {ratings.shape[0]/users.shape[0]:.1f}")
print(f"Average ratings per movie: {ratings.shape[0]/movies.shape[0]:.1f}")


📈 Dataset Statistics:
Total movies: 3883
Total users: 6040
Total ratings: 1,000,209
Average ratings per user: 165.6
Average ratings per movie: 257.6
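
The same counts also show how sparse the user-movie matrix is, which is what motivates the special handling of unrated movies later on. The short sketch below reuses the numbers printed above. It assumes the matrix spans every movie listed in `movies.dat`, whether or not that movie has any ratings.

```python
# Counts taken from the recorded output above (MovieLens 1M)
n_users = 6040
n_movies = 3883
n_ratings = 1_000_209

# Fraction of user-movie pairs that actually have a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1.0 - density

print(f"Density:  {density:.2%}")   # Density:  4.26%
print(f"Sparsity: {sparsity:.2%}")  # Sparsity: 95.74%
```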


In [7]:
# Example: Find movie by ID
movie_id = 238
movie_name = movies[movies[0] == movie_id][1].iloc[0]
print(f"\n🎬 Example: Movie ID {movie_id} is '{movie_name}'")


🎬 Example: Movie ID 238 is 'Far From Home: The Adventures of Yellow Dog (1995)'


# STEP 4: PREPARING TRAINING AND TEST SETS

In [8]:
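# Load training and test sets.
# u1.base and u1.test are tab-separated files without a header row, so pass
# header=None; otherwise pandas consumes the first rating of each file as column names.
# The recorded output below shows 79,999 and 19,999 because that run omitted header=None.
training_set = pd.read_csv('ml-100k/ml-100k/u1.base', delimiter='\t', header=None)
training_set = np.array(training_set, dtype='int')
print(f"✅ Training set loaded: {training_set.shape[0]:,} ratings")

test_set = pd.read_csv('ml-100k/ml-100k/u1.test', delimiter='\t', header=None)
test_set = np.array(test_set, dtype='int')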
print(f"✅ Test set loaded: {test_set.shape[0]:,} ratings")

print(f"Training set: {training_set.shape[0]:,} ratings ({training_set.shape[0]/(training_set.shape[0]+test_set.shape[0])*100:.1f}%)")
print(f"Test set: {test_set.shape[0]:,} ratings ({test_set.shape[0]/(training_set.shape[0]+test_set.shape[0])*100:.1f}%)")

✅ Training set loaded: 79,999 ratings
✅ Test set loaded: 19,999 ratings
Training set: 79,999 ratings (80.0%)
Test set: 19,999 ratings (20.0%)


In [9]:
# Determine number of users and movies.
# IDs in MovieLens 100K are 1-based and contiguous, so the largest ID equals the count.
nb_users = int(max(training_set[:, 0].max(), test_set[:, 0].max()))
nb_movies = int(max(training_set[:, 1].max(), test_set[:, 1].max()))

print(f"\n👥 Dataset Dimensions:")
print(f"Total unique users: {nb_users}")
print(f"Total unique movies: {nb_movies}")


👥 Dataset Dimensions:
Total unique users: 943
Total unique movies: 1682


# STEP 5: DATA PREPROCESSING


In [10]:
def convert(data):
    """
    Convert rating data to user-movie matrix format.
    Each row represents a user, each column represents a movie.
    Values are ratings (0 for unrated movies).
    """
    new_data = []
    for id_users in range(1, nb_users + 1):
        # Get movies and ratings for current user
        id_movies = data[:, 1][data[:, 0] == id_users]
        id_ratings = data[:, 2][data[:, 0] == id_users]

        # Create ratings vector for all movies (0 for unrated)
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings  # -1 because movie IDs start from 1

        new_data.append(list(ratings))
    return new_data

# Convert datasets
training_set = convert(training_set)
test_set = convert(test_set)

# Convert to PyTorch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

print(f"Training set shape: {training_set.shape}")
print(f"Test set shape: {test_set.shape}")

Training set shape: torch.Size([943, 1682])
Test set shape: torch.Size([943, 1682])
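
`convert` loops over every user and filters the full rating array each time. The same matrix can be built in one step with NumPy integer-array indexing. The sketch below illustrates that alternative on toy data. The name `convert_vectorized` is hypothetical, and it assumes each (user, movie) pair occurs at most once.

```python
import numpy as np


def convert_vectorized(data, nb_users, nb_movies):
    """Build a (nb_users, nb_movies) rating matrix; 0 marks an unrated movie."""
    matrix = np.zeros((nb_users, nb_movies), dtype=np.float32)
    user_idx = data[:, 0] - 1    # user IDs start from 1
    movie_idx = data[:, 1] - 1   # movie IDs start from 1
    matrix[user_idx, movie_idx] = data[:, 2]
    return matrix


# Toy data in the same column order as u1.base: user, movie, rating, timestamp
toy = np.array([[1, 1, 5, 0],
                [1, 3, 2, 0],
                [2, 2, 4, 0]])
toy_matrix = convert_vectorized(toy, nb_users=2, nb_movies=3)
print(toy_matrix)
```

The result can be turned into a tensor the same way the cell above does, for example with `torch.FloatTensor(toy_matrix)`.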


# STEP 6: RATING CONVERSION TO BINARY FEEDBACK

In [11]:
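# Convert explicit ratings (1-5) to binary labels: 1-2 -> 0 (not liked), 3-5 -> 1 (liked).
# Unrated entries (0) are mapped to -1 first so they are not confused with "not liked".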

# Convert training set
training_set[training_set == 0] = -1  # Unrated movies
training_set[training_set == 1] = 0   # Not liked
training_set[training_set == 2] = 0   # Not liked
training_set[training_set >= 3] = 1   # Liked

# Convert test set
test_set[test_set == 0] = -1  # Unrated movies
test_set[test_set == 1] = 0   # Not liked
test_set[test_set == 2] = 0   # Not liked
test_set[test_set >= 3] = 1   # Liked

print("✅ Rating conversion: ")
print(f"Training set - Liked movies: {(training_set == 1).sum().item():,}")
print(f"Training set - Not liked: {(training_set == 0).sum().item():,}")
print(f"Training set - Unrated: {(training_set == -1).sum().item():,}")

✅ Rating conversion: 
Training set - Liked movies: 66,102
Training set - Not liked: 13,897
Training set - Unrated: 1,506,127
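
The in-place masks above depend on the order in which they run, and the like threshold of 3 is hard-coded. The sketch below is one way to express the same mapping as a function with an adjustable threshold. It uses NumPy rather than torch to stay dependency-light, and the name `binarize` and the parameter `like_threshold` are hypothetical.

```python
import numpy as np


def binarize(ratings, like_threshold=3):
    """Map 0 -> -1 (unrated), ratings below the threshold -> 0, others -> 1."""
    ratings = np.asarray(ratings)
    out = np.where(ratings >= like_threshold, 1.0, 0.0)
    out[ratings == 0] = -1.0   # mark unrated entries last, from the original values
    return out


print(binarize([0, 1, 2, 3, 4, 5]))
print(binarize([0, 3, 4], like_threshold=4))
```

Because every output value is derived from the original ratings, the result does not depend on the order of the two assignments.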


# STEP 7: RBM ARCHITECTURE IMPLEMENTATION

In [12]:
class RBM():
    """
    Restricted Boltzmann Machine for collaborative filtering.

    Architecture:
    - Visible layer: Movies (input/output)
    - Hidden layer: Latent factors (learned features)
    - No connections within the same layer (restricted)
    """

    def __init__(self, nv, nh):
        """
        Initialize RBM with visible and hidden units.

        Args:
            nv (int): Number of visible units (movies)
            nh (int): Number of hidden units (latent factors)
        """
        # Initialize weights and biases with random values
        self.W = torch.randn(nh, nv)  # Weights between visible and hidden
        self.a = torch.randn(1, nh)   # Bias for hidden units
        self.b = torch.randn(1, nv)   # Bias for visible units

        print(f"✅ RBM initialized with {nv} visible units and {nh} hidden units")

    def sample_h(self, x):
        """
        Sample hidden units given visible units (forward pass).

        Args:
            x: Visible units (user ratings)

        Returns:
            tuple: (probabilities, sampled states)
        """
        wx = torch.mm(x, self.W.t())  # Weighted sum
        activation = wx + self.a.expand_as(wx)  # Add bias
        p_h_given_v = torch.sigmoid(activation)  # Probability
        return p_h_given_v, torch.bernoulli(p_h_given_v)  # Sample

    def sample_v(self, y):
        """
        Sample visible units given hidden units (backward pass).

        Args:
            y: Hidden units (latent factors)

        Returns:
            tuple: (probabilities, sampled states)
        """
        wy = torch.mm(y, self.W)  # Weighted sum
        activation = wy + self.b.expand_as(wy)  # Add bias
        p_v_given_h = torch.sigmoid(activation)  # Probability
        return p_v_given_h, torch.bernoulli(p_v_given_h)  # Sample

    def train(self, v0, vk, ph0, phk):
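        """
        Update RBM parameters using Contrastive Divergence.

        The update is the raw difference between the positive and negative
        statistics, summed over the batch, with no learning rate (effectively 1).

        Args:
            v0: Original visible states
            vk: Visible states after k Gibbs steps (reconstruction)
            ph0: Hidden probabilities given v0
            phk: Hidden probabilities given vk
        """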
        # Update weights
        self.W += (torch.mm(v0.t(), ph0) - torch.mm(vk.t(), phk)).t()
        # Update biases
        self.b += torch.sum((v0 - vk), 0)
        self.a += torch.sum((ph0 - phk), 0)

# Initialize RBM
nv = len(training_set[0])  # Number of visible units (movies)
nh = 100                   # Number of hidden units (latent factors)
batch_size = 100           # Batch size for training

print(f"\n🔧 RBM Configuration:")
print(f"Visible units (movies): {nv}")
print(f"Hidden units (latent factors): {nh}")
print(f"Batch size: {batch_size}")

rbm = RBM(nv, nh)


🔧 RBM Configuration:
Visible units (movies): 1682
Hidden units (latent factors): 100
Batch size: 100
✅ RBM initialized with 1682 visible units and 100 hidden units
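
Written as matrix equations, the update performed by `train` reads as follows. The notation is introduced here for illustration: $V_0$ and $V_k$ are the batch matrices `v0` and `vk` (one row per user), $P_0$ and $P_k$ are `ph0` and `phk`, and $\mathbf{1}$ is a column of ones that sums over the batch.

```latex
\begin{aligned}
W &\leftarrow W + \big( P_0^{\top} V_0 - P_k^{\top} V_k \big), \\
b &\leftarrow b + \mathbf{1}^{\top} \big( V_0 - V_k \big), \\
a &\leftarrow a + \mathbf{1}^{\top} \big( P_0 - P_k \big).
\end{aligned}
```

The first term of each difference is the positive phase (statistics under the data) and the second is the negative phase (statistics after $k$ Gibbs steps).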


# STEP 8: TRAINING THE RBM

In [13]:
# Training method: Contrastive Divergence (CD-k)
# Number of Gibbs sampling steps: 10

nb_epoch = 10
print(f"Number of epochs: {nb_epoch}")

print("\n📈 Training Progress:")
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.

    # Process data in batches.
    # Note: with 943 users and batch_size = 100 this range yields starts 0..800,
    # so the last 43 users (rows 900-942) are never used for training.
    for id_user in range(0, nb_users - batch_size, batch_size):
        # Get batch of users: v0 is the fixed target, vk is the start of the Gibbs chain
        vk = training_set[id_user:id_user + batch_size]
        v0 = training_set[id_user:id_user + batch_size]

        # Positive phase: hidden probabilities given the original ratings
        ph0, _ = rbm.sample_h(v0)

        # Gibbs sampling (Contrastive Divergence with k = 10 steps)
        for k in range(10):
            _, hk = rbm.sample_h(vk)
            _, vk = rbm.sample_v(hk)
            # Reset unrated entries to -1 so only rated movies are reconstructed
            vk[v0 < 0] = v0[v0 < 0]

        # Negative phase: hidden probabilities given the reconstructed ratings
        phk, _ = rbm.sample_h(vk)

        # Update RBM parameters
        rbm.train(v0, vk, ph0, phk)

        # Calculate loss (only for rated movies)
        train_loss += torch.mean(torch.abs(v0[v0 >= 0] - vk[v0 >= 0]))
        s += 1.

    avg_loss = train_loss / s
    print(f"Epoch {epoch:2d}/{nb_epoch}: Loss = {avg_loss:.4f}")

Number of epochs: 10

📈 Training Progress:
Epoch  1/10: Loss = 0.3457
Epoch  2/10: Loss = 0.2147
Epoch  3/10: Loss = 0.2447
Epoch  4/10: Loss = 0.2506
Epoch  5/10: Loss = 0.2482
Epoch  6/10: Loss = 0.2470
Epoch  7/10: Loss = 0.2494
Epoch  8/10: Loss = 0.2468
Epoch  9/10: Loss = 0.2486
Epoch 10/10: Loss = 0.2470
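
The batch loop above uses `range(0, nb_users - batch_size, batch_size)`. The small sketch below compares that range with one that also visits the final, smaller batch, using the notebook's own values of 943 users and a batch size of 100.

```python
nb_users, batch_size = 943, 100

# Batch start indices as written in the training loop
notebook_starts = list(range(0, nb_users - batch_size, batch_size))
# Alternative that also includes the last partial batch
full_starts = list(range(0, nb_users, batch_size))

covered = min(notebook_starts[-1] + batch_size, nb_users)
last_batch_size = nb_users - full_starts[-1]

print(f"Notebook loop: {len(notebook_starts)} batches, {covered} users covered")
print(f"Full loop:     {len(full_starts)} batches, last batch has {last_batch_size} users")
```

Slicing past the end of a tensor or array is safe in both NumPy and PyTorch, so `training_set[900:1000]` would simply return the remaining 43 rows.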


# STEP 9: EVALUATING THE RBM MODEL

In [14]:
# Evaluating RBM model on test set
# Evaluation metric: Mean Absolute Error (MAE)

test_loss = 0
s = 0.

# Evaluate on each user
for id_user in range(nb_users):
    v = training_set[id_user:id_user+1]   # User's training data
    vt = test_set[id_user:id_user+1]      # User's test data

    # Only evaluate if user has test ratings
    if len(vt[vt >= 0]) > 0:
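        # Generate predictions with one Gibbs step: sample the hidden units from the
        # user's training ratings, then sample a binary reconstruction of all movies
        _, h = rbm.sample_h(v)
        _, v = rbm.sample_v(h)

        # MAE on the movies rated in the test set. Because both values are binary,
        # this is the fraction of mismatches, and it varies slightly between runs.
        test_loss += torch.mean(torch.abs(vt[vt >= 0] - v[vt >= 0]))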
        s += 1.

avg_test_loss = test_loss / s
print(f"✅ Test MAE: {avg_test_loss:.4f}")

✅ Test MAE: 0.2544
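
On binary labels, the error reported here is the fraction of mismatched predictions, so a trivial predictor that always answers "liked" is a useful reference point. The sketch below derives that baseline from the training-set counts printed in Step 6. The test-set counts are not printed in the notebook, so the figure is only an approximate yardstick for the test MAE.

```python
# Training-set counts from the Step 6 output
liked = 66_102
not_liked = 13_897

rated = liked + not_liked
# Always predicting "liked" is wrong exactly on the not-liked ratings
always_liked_error = not_liked / rated

print(f"Rated entries: {rated:,}")
print(f"Always-'liked' baseline error: {always_liked_error:.4f}")  # 0.1737
```

Both the final training loss (0.2470) and the test MAE (0.2544) are above this value. The comparison is not exact, since the notebook averages its loss per batch or per user rather than per rating, but it suggests the model should be checked against such a baseline before it is judged to perform well.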


# STEP 10: INTERPRETATION AND CONCLUSIONS

In [15]:
print("\n🎯 Model Performance Analysis:")
print(f"Training MAE: {avg_loss:.4f}")
print(f"Test MAE: {avg_test_loss:.4f}")

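# Note: these cut-offs are ad hoc, not established benchmarks. On binary labels this
# "MAE" is the fraction of mismatched predictions, so it should also be compared
# with a trivial baseline such as always predicting "liked".
if avg_test_loss < 0.3: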
    print("🌟 Excellent performance! Model generalizes well.")
elif avg_test_loss < 0.4:
    print("👍 Good performance! Model is learning effectively.")
elif avg_test_loss < 0.5:
    print("⚠️  Moderate performance. Consider tuning hyperparameters.")
else:
    print("❌ Poor performance. Model may need more training or different architecture.")

print("\n💡 Key Insights:")
print("• RBM successfully learned latent factors from user-movie interactions")
print("• Binary conversion simplified the learning problem")
print("• Contrastive Divergence enabled efficient training")
print("• Model can handle missing data (unrated movies)")

print("\n🚀 Potential Improvements:")
print("• Try different numbers of hidden units")
print("• Experiment with different learning rates")
print("• Use different rating conversion thresholds")
print("• Implement early stopping based on validation loss")
print("• Add regularization to prevent overfitting")


🎯 Model Performance Analysis:
Training MAE: 0.2470
Test MAE: 0.2544
🌟 Excellent performance! Model generalizes well.

💡 Key Insights:
• RBM successfully learned latent factors from user-movie interactions
• Binary conversion simplified the learning problem
• Contrastive Divergence enabled efficient training
• Model can handle missing data (unrated movies)

🚀 Potential Improvements:
• Try different numbers of hidden units
• Experiment with different learning rates
• Use different rating conversion thresholds
• Implement early stopping based on validation loss
• Add regularization to prevent overfitting
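
The suggestion to experiment with learning rates needs a small change first, because the `train` method in Step 7 has no learning rate at all. The sketch below shows one way such an update could look. It is written with NumPy for brevity, the function name `cd_update` and the `lr` parameter are hypothetical, and it divides by the batch size so that `lr` does not depend on the batch. It does not handle the -1 masking of unrated movies.

```python
import numpy as np


def cd_update(W, a, b, v0, vk, ph0, phk, lr=0.01):
    """Return updated (W, a, b) after one Contrastive Divergence step.

    Shapes follow the notebook: W is (nh, nv), a is (1, nh), b is (1, nv),
    v0 and vk are (batch, nv), ph0 and phk are (batch, nh).
    """
    batch = v0.shape[0]
    grad_W = (ph0.T @ v0 - phk.T @ vk) / batch                # positive minus negative phase
    grad_b = (v0 - vk).sum(axis=0, keepdims=True) / batch     # visible bias
    grad_a = (ph0 - phk).sum(axis=0, keepdims=True) / batch   # hidden bias
    return W + lr * grad_W, a + lr * grad_a, b + lr * grad_b


# Tiny random demo (shapes only; the values are meaningless)
rng = np.random.default_rng(0)
batch, nv, nh = 4, 6, 3
W = rng.normal(size=(nh, nv))
a = np.zeros((1, nh))
b = np.zeros((1, nv))
v0 = rng.integers(0, 2, size=(batch, nv)).astype(float)
vk = rng.integers(0, 2, size=(batch, nv)).astype(float)
ph0 = rng.random((batch, nh))
phk = rng.random((batch, nh))

W_new, a_new, b_new = cd_update(W, a, b, v0, vk, ph0, phk, lr=0.01)
print(W_new.shape, a_new.shape, b_new.shape)
```

Setting `lr` equal to the batch size reproduces the raw summed update used by the notebook's `train` method, which shows how large that implicit step is.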
