# Research Paper Implementation: Neural Collaborative Filtering (NCF)

In this post, we will implement the Neural Collaborative Filtering (NCF) model proposed in the research paper "Neural Collaborative Filtering" by Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua. The paper was published in 2017 and can be accessed at the following link: [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031).

## Contributions of the Paper

The paper makes the following contributions:
- Proposes a general framework for collaborative filtering using neural networks, enabling the learning of non-linear user-item interaction functions.
- Combines the **linearity** of Generalized Matrix Factorization (GMF) with the **non-linearity** of Multi-Layer Perceptron (MLP) in a unified model called *NeuMF*.
- Demonstrates the effectiveness of this hybrid approach on widely used recommendation datasets, outperforming traditional methods.

## Background
### Collaborative Filtering (CF) 

Collaborative Filtering (CF) is a technique used in recommender systems to predict a user's preferences based on patterns in historical user-item interactions. It can be broadly classified into:
- **User-based CF**: Finds users similar to the target user and recommends items based on their preferences.
- **Item-based CF**: Finds items similar to those the user has interacted with and recommends them.

### Matrix Factorization (MF)

Matrix Factorization is a popular model-based CF approach that represents users and items as latent factors in a shared vector space. It predicts user-item interactions by computing the dot product of their corresponding latent factors. While effective, the dot-product operation limits the ability to model complex relationships between users and items.
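
As a minimal sketch of the dot-product prediction described above, the snippet below scores two items for one user. The function name `mf_score` and the latent vectors are hypothetical values invented for illustration; in a real model the factors are learned from data.

```python
def mf_score(user_factors, item_factors):
    """Predicted interaction score: dot product of two latent vectors."""
    return sum(p * q for p, q in zip(user_factors, item_factors))


# Hypothetical 3-dimensional latent factors (illustration only, not learned values)
p_u = [0.5, 1.0, -0.5]   # user u
q_i = [1.0, 2.0, 0.0]    # item i
q_j = [-1.0, 0.5, 1.0]   # item j

score_i = mf_score(p_u, q_i)  # 0.5*1.0 + 1.0*2.0 + (-0.5)*0.0 = 2.5
score_j = mf_score(p_u, q_j)  # -0.5 + 0.5 - 0.5 = -0.5

# MF would rank item i above item j for this user
print(score_i, score_j)
```

Every latent dimension contributes independently and with equal weight to the score, which is the rigidity that NCF aims to relax.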

### Neural Collaborative Filtering (NCF)

Previous work on collaborative filtering often used neural networks to encode auxiliary information (e.g., textual descriptions of items) while relying on traditional MF to model latent user-item interactions. The NCF model proposed in this paper replaces the rigid dot-product operation with a neural network, allowing it to learn complex, non-linear interaction functions. This approach is particularly suited for implicit feedback tasks, where the goal is to predict binary interactions (e.g., whether a user interacted with an item).
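
To make the contrast with the dot product concrete, here is a small pure-Python sketch of the two kinds of interaction function the paper combines. The names `gmf_score` and `mlp_score`, the single hidden layer, and all weights are assumptions chosen for illustration, not the paper's trained configuration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gmf_score(p_u, q_i, h):
    """GMF-style score: weighted element-wise product passed through a sigmoid."""
    return sigmoid(sum(w * p * q for w, p, q in zip(h, p_u, q_i)))


def mlp_score(p_u, q_i, hidden_weights, hidden_bias, h):
    """MLP-style score: concatenate embeddings, one ReLU layer, sigmoid output."""
    z = p_u + q_i  # list concatenation of the user and item embeddings
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, z)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return sigmoid(sum(w * a for w, a in zip(h, hidden)))


# Hypothetical embeddings and weights (illustration only)
p_u, q_i = [0.5, 1.0], [1.0, 2.0]

# With uniform output weights, the GMF logit equals the plain dot product (2.5)
gmf = gmf_score(p_u, q_i, h=[1.0, 1.0])

# One hidden layer with two ReLU units over the 4-dimensional concatenation
mlp = mlp_score(
    p_u, q_i,
    hidden_weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]],
    hidden_bias=[0.0, 0.0],
    h=[2.0, 1.0],
)
print(gmf, mlp)
```

NeuMF concatenates the last hidden representation of both branches before the final sigmoid layer, so the model can use the linear and non-linear signals together.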

## Implementation

To reproduce and implement NCF, we will:

1. Implement a baseline Matrix Factorization (MF) model.  
2. Implement the Neural Collaborative Filtering (NCF) model.
3. Train and evaluate the models on the MovieLens 100K dataset.
4. Analyze the results and compare the performance of the models.

## Metrics

To compare the models, we will use the following metrics:

- **Hit Ratio (HR)**: Measures whether the held-out ground-truth item appears in the top-k recommendations for a user. Averaged over users, HR@k is the fraction of users whose held-out item is ranked in the top k, so a higher HR indicates better recall.
- **Normalized Discounted Cumulative Gain (NDCG)**: Measures the ranking quality of the recommended items. It assigns higher scores to the items that are ranked higher in the list of recommendations.
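
The two metrics can be sketched for a single user as follows, assuming exactly one held-out item per user. The function names and the ranked list are hypothetical; with a single relevant item, the ideal DCG is 1, so NDCG reduces to the discounted gain of that one item.

```python
import math


def hit_ratio_at_k(ranked_items, target_item, k=10):
    """1.0 if the held-out item is among the top-k ranked items, else 0.0."""
    return 1.0 if target_item in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target_item, k=10):
    """NDCG with one relevant item: 1 / log2(position + 1), position is 1-based."""
    top_k = ranked_items[:k]
    if target_item not in top_k:
        return 0.0
    position = top_k.index(target_item) + 1
    return 1.0 / math.log2(position + 1)


# Hypothetical ranking of item ids for one user, best first
ranked = [42, 7, 13, 99, 5]

hr = hit_ratio_at_k(ranked, target_item=13, k=3)   # item 13 is at position 3
ndcg = ndcg_at_k(ranked, target_item=13, k=3)      # 1 / log2(4)
print(hr, ndcg)
```

The reported HR@k and NDCG@k are the averages of these per-user values over all test users.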

We will use leave-one-out evaluation, where one positive interaction per user is held out for testing, and the remaining interactions are used for training.
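
A minimal sketch of this split, assuming the held-out interaction is each user's most recent one by timestamp and that every (user, item) pair appears only once. The function name and the toy interactions are hypothetical.

```python
def leave_one_out_split(interactions):
    """Hold out each user's most recent interaction; keep the rest for training.

    interactions: list of (user_id, item_id, timestamp) tuples.
    Returns (train, test): a list of (user_id, item_id) pairs and a
    dict mapping each user_id to its held-out item_id.
    """
    latest = {}
    for user, item, ts in interactions:
        if user not in latest or ts > latest[user][1]:
            latest[user] = (item, ts)
    test = {user: item for user, (item, _) in latest.items()}
    train = [(u, i) for u, i, _ in interactions if test[u] != i]
    return train, test


# Hypothetical toy interactions: (user_id, item_id, timestamp)
toy = [(1, 10, 100), (1, 11, 300), (1, 12, 200), (2, 10, 50), (2, 20, 60)]
train, test = leave_one_out_split(toy)
print(train)
print(test)
```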

## Dataset

We will use the MovieLens 100K dataset, which can be downloaded from the following link: [MovieLens 100K](https://grouplens.org/datasets/movielens/100k/). The dataset has:
- 100,000 ratings (1-5) from 943 users on 1,682 movies.
- At least 20 rated movies per user.
- Basic demographic information for each user (age, gender, occupation, zip code).

## Learning from Implicit Data

In the context of implicit feedback, we treat any observed interaction (e.g., a user watching a movie) as a positive instance and unobserved interactions as negative instances. The goal is to predict whether a user will interact with an item. This has a limitation: it assumes that the absence of an interaction implies a negative preference, which is not always true, since the user may simply be unaware of the item. However, it is a common approach in implicit feedback settings.

Let's start by loading the dataset and preparing the data for training the models.

### Load the dataset

In [None]:
import pandas as pd
import numpy as np

# read in the data (movielens), data is in u.data and u.item files
path = 'data/movielens/'
user_data = pd.read_csv(path + 'u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
movie_data = pd.read_csv(path + 'u.item', sep='|', names=['item_id', 'title'], usecols=range(2), encoding='latin-1')

# merge the data on item_id
data = pd.merge(user_data, movie_data, on='item_id')

data.head()

|   | user_id | item_id | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |
| 2 | 22 | 377 | 1 | 878887116 | Heavyweights (1994) |
| 3 | 244 | 51 | 2 | 880606923 | Legends of the Fall (1994) |
| 4 | 166 | 346 | 1 | 886397596 | Jackie Brown (1997) |


### Build the labels using Negative Sampling

We will treat the user-item interactions as binary data, where any observed interaction is labeled as 1 (positive instance) and unobserved interactions are labeled as 0 (negative instance). For every positive interaction, a fixed number of items that the user hasn't interacted with are randomly selected as negatives. This is known as negative sampling.

In [6]:
# For each positive (user, item) interaction, sample 4 negative items
NUM_NEGATIVES = 4
rng = np.random.default_rng(42)  # fixed seed for reproducibility

# unique item ids and a lookup from item id to title
items = data['item_id'].unique()
title_by_item = movie_data.set_index('item_id')['title']

# collect the negative rows in a list first;
# DataFrame.append was removed in pandas 2.0, so build the frame in one go
negative_rows = []
for user, group in data.groupby('user_id'):
    # get the items that the user has not interacted with
    not_interacted_items = np.setdiff1d(items, group['item_id'].values)
    # draw 4 negatives per positive interaction; sample with replacement because
    # heavy raters need more draws than they have unseen items
    n_samples = NUM_NEGATIVES * len(group)
    sampled = rng.choice(not_interacted_items, size=n_samples, replace=True)
    for item in sampled:
        negative_rows.append({'user_id': user, 'item_id': item, 'rating': 0,
                              'timestamp': 0, 'title': title_by_item.loc[item]})

negative_data = pd.DataFrame(negative_rows, columns=data.columns)

# concatenate the positive and negative samples
data = pd.concat([data, negative_data], ignore_index=True)

data.head()

In [20]:
import pandas as pd
import random
from collections import defaultdict
from torch.utils.data import Dataset


class MovieLensDataset(Dataset):
    """
    A PyTorch Dataset for MovieLens data, based on the specifications in the NCF paper.
    Handles data loading, train-test split, and negative sampling.
    """
    def __init__(self, path, num_negatives=4, is_training=True, random_seed=42):
        """
        Initialize the MovieLens dataset.
        
        Parameters:
        - path: Path to the MovieLens dataset (expects u.data and u.item files).
        - num_negatives: Number of negative samples per positive sample.
          - 4 for training.
          - 99 for testing.
        - is_training: Whether this dataset is for training or testing.
        - random_seed: Seed for reproducibility.
        """
        random.seed(random_seed)

        # Load and preprocess data
        self.data = self._load_data(path)
        self.user_item_dict = self._build_interaction_dict(self.data)
        self.all_items = set(self.data["item_id"].unique())
        self.num_negatives = num_negatives
        self.is_training = is_training

        # Build datasets
        self.dataset = self._prepare_dataset()

    def _load_data(self, path):
        """
        Load and preprocess the MovieLens dataset.
        """
        user_df = pd.read_csv(path + 'u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
        movie_df = pd.read_csv(path + 'u.item', sep='|', names=['item_id', 'title'], usecols=range(2), encoding='latin-1')
        data = pd.merge(user_df, movie_df, on='item_id')
        data['interaction'] = 1  # Treat all ratings as positive interactions
        return data

    def _build_interaction_dict(self, data):
        """
        Build a dictionary of user-item interactions.
        """
        user_item_dict = defaultdict(set)
        for _, row in data.iterrows():
            user_item_dict[row["user_id"]].add(row["item_id"])
        return user_item_dict

    def _split_train_test(self):
        """
        Split the data into training and testing sets. The latest interaction per user
        (by timestamp) is assigned to the test set.
        """
        train_items = {}
        test_items = {}

        # Sort by timestamp so the last row in each user's group is the latest interaction
        # (sorting item ids would pick the largest id, not the most recent item)
        ordered = self.data.sort_values("timestamp", kind="stable")
        for user, group in ordered.groupby("user_id"):
            items = group["item_id"].tolist()
            test_items[user] = items[-1]  # Latest interaction for testing
            train_items[user] = set(items[:-1])  # Remaining interactions for training

        return train_items, test_items

    def _prepare_dataset(self):
        """
        Prepare the dataset for training or testing with appropriate negative sampling.
        """
        dataset = []
        train_items, test_items = self._split_train_test()

        if self.is_training:
            # Prepare training dataset
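            for user, positive_items in train_items.items():
                # Exclude every item the user interacted with (including the held-out
                # test item) so a known positive is never sampled as a negative
                candidates = list(self.all_items - self.user_item_dict[user])
                for pos_item in positive_items:
                    dataset.append((user, pos_item, 1))  # Positive sample
                    # Add negative samples
                    negative_items = random.sample(candidates, self.num_negatives)
                    dataset.extend((user, neg_item, 0) for neg_item in negative_items)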
        else:
            # Prepare testing dataset
            for user, test_item in test_items.items():
                dataset.append((user, test_item, 1))  # Positive sample
                # Add 99 negative samples for ranking
                negative_items = random.sample(
                    list(self.all_items - train_items[user] - {test_item}), self.num_negatives
                )
                dataset.extend((user, neg_item, 0) for neg_item in negative_items)

        return dataset

    def __len__(self):
        """
        Returns the size of the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Fetch a single sample from the dataset.
        """
        return self.dataset[idx]

In [21]:
from torch.utils.data import DataLoader

# Set dataset path
path = "data/movielens/"  # Replace with the correct path to your dataset

# Create training dataset
train_data = MovieLensDataset(path=path, num_negatives=4, is_training=True)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Create testing dataset
test_data = MovieLensDataset(path=path, num_negatives=99, is_training=False)
test_loader = DataLoader(test_data, batch_size=1, shuffle=False)

# Example: Iterate through the training DataLoader
for batch in train_loader:
    # The default collate function already groups each field into its own tensor,
    # so the batch unpacks directly into users, items and labels
    users, items, labels = batch
    print("Users:", users[:5])
    print("Items:", items[:5])
    print("Labels:", labels[:5])
    break

print("Train and test datasets are ready.")