### Splitting the data into a training set and a test set for collaborative filtering

#### Preparations

Importing required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Loading data

In [2]:
ratings = pd.read_csv('../data/ratings.csv')
print("Number of ratings in the set: ", ratings.shape[0])
print("Sample ratings:")
print(ratings.head(2))

Number of ratings in the set:  32000204
Sample ratings:
   userId  movieId  rating  timestamp
0       1       17     4.0  944249077
1       1       25     1.0  944250228


#### Data division

The data is divided according to the following rules:
- 20% of the data is assigned to the test set
- 80% of the data is assigned to the training set
- the split is stratified by user, so each user's ratings are divided in roughly the same 80/20 proportion and each user is represented in both sets
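
Stratification by user only works if every user has enough ratings to land on both sides of the split. A minimal sketch of a pre-check on a small made-up frame standing in for `ratings`; the helper name `ratings_per_user` and the toy values are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for `ratings` (hypothetical values, not the real data).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
    "movieId": [10, 11, 12, 13, 14, 10, 11, 12, 13, 15, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 1.0, 3.5, 4.0, 2.5, 5.0, 3.0],
})

def ratings_per_user(df: pd.DataFrame) -> pd.Series:
    """Return the number of ratings for each userId."""
    return df.groupby("userId").size()

counts = ratings_per_user(toy)
# A user with a single rating cannot be placed in both sets.
too_few = counts[counts < 2].index.tolist()
print("Users with fewer than 2 ratings:", too_few)
```

`train_test_split` raises a `ValueError` when a stratification class has only one member, so users flagged by such a check would have to be handled separately before the split.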

In [3]:
train_data, test_data = train_test_split(
    ratings,
    test_size=0.2,
    stratify=ratings["userId"],
    random_state=264,
)

The seed (`random_state=264`) is fixed so that the data is always divided in the same way.
<br> The assumptions behind the split are checked below.
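
The effect of a fixed seed can be illustrated on a small made-up frame. This is only a sketch: the toy data and the wrapper function `split` are assumptions for illustration, while the `train_test_split` arguments mirror the call used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for `ratings`: 3 users with 10 ratings each (hypothetical values).
toy = pd.DataFrame({
    "userId": [u for u in (1, 2, 3) for _ in range(10)],
    "movieId": list(range(10)) * 3,
})

def split(df, seed):
    """Stratified 80/20 split with the same arguments as in the notebook."""
    return train_test_split(
        df, test_size=0.2, stratify=df["userId"], random_state=seed
    )

train_a, test_a = split(toy, seed=264)
train_b, test_b = split(toy, seed=264)

# Same seed: the same rows land in the test set on both runs.
same_split = test_a.index.equals(test_b.index)
# Stratification: count how many ratings each user contributes to the test set.
per_user_test = test_a["userId"].value_counts().sort_index().tolist()
print(same_split, per_user_test)
```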

In [4]:
print(f"\nTrain: {len(train_data)} | Test: {len(test_data)}")
print(f"Percentage of users in the test set: {test_data['userId'].nunique()/ratings['userId'].nunique():.1%}")


Train: 25600163 | Test: 6400041
Percentage of users in the test set: 100.0%


Collaborative filtering cannot make meaningful predictions for movies that never appear in its training data, so the test set is restricted to movies that are present in the training set.

In [5]:
train_movies = set(train_data["movieId"].unique())
test_data_clean = test_data[test_data["movieId"].isin(train_movies)]

Next, the ratings removed from the test set are moved to the training set.

In [6]:
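# Test ratings for movies that do not occur in the training set
missing_in_train = test_data[~test_data["movieId"].isin(train_movies)]
# Move them to the training set so that no rating is discarded
final_train = pd.concat([train_data, missing_in_train])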
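
The filter-and-move step can be sketched on toy data, together with a check that no rating is lost along the way. The frames and the helper `move_unseen_movies` below are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Toy train/test frames (hypothetical values); movie 99 occurs only in the test set.
toy_train = pd.DataFrame({"userId": [1, 1, 2, 2], "movieId": [10, 11, 10, 12]})
toy_test = pd.DataFrame({"userId": [1, 2, 2], "movieId": [12, 11, 99]})

def move_unseen_movies(train, test):
    """Move test ratings for movies absent from train into train."""
    seen = test["movieId"].isin(train["movieId"])
    return pd.concat([train, test[~seen]]), test[seen]

new_train, new_test = move_unseen_movies(toy_train, toy_test)

# No rating is lost or duplicated by the correction.
assert len(new_train) + len(new_test) == len(toy_train) + len(toy_test)
# Every movie left in the test set is now known to the training set.
assert new_test["movieId"].isin(new_train["movieId"]).all()
print(len(new_train), len(new_test))
```

Moving rows out of the test set could in principle leave a user with no test ratings, which is why the number of users in both sets is checked later in this notebook.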

The sizes of both sets after this correction are shown below.

In [7]:
print("\nAfter correction:")
print(f"Final Train: {len(final_train)}")
print(f"Test Clean: {len(test_data_clean)}")
print(f"Deleted from test: {len(test_data)-len(test_data_clean)}")
print(f"Percentage of deleted: {(len(test_data)-len(test_data_clean))/len(test_data):.3%}")


After correction:
Final Train: 25604902
Test Clean: 6395302
Deleted from test: 4739
Percentage of deleted: 0.074%


Next, the split is verified.

1. Are all movies in the test set also in the training set?

In [8]:
# Every movie in the cleaned test set must also occur in the final training set
assert test_data_clean["movieId"].isin(final_train["movieId"]).all(), "Test set contains movies missing from the training set"

2. Are all users present in both sets?

In [9]:
print(f"\nUsers in the training set: {final_train['userId'].nunique()}")
print(f"Users in the test set:    {test_data_clean['userId'].nunique()}")


Users in the training set: 200948
Users in the test set:    200948
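
Equal counts compare only the sizes of the two user sets. A stricter check compares the sets themselves; the helper `same_users` and the toy frames below are assumptions for illustration:

```python
import pandas as pd

def same_users(train, test):
    """True if both frames contain exactly the same set of userId values."""
    return set(train["userId"]) == set(test["userId"])

# Toy frames (hypothetical values).
toy_train = pd.DataFrame({"userId": [1, 1, 2, 2]})
ok_test = pd.DataFrame({"userId": [1, 2]})
bad_test = pd.DataFrame({"userId": [1, 3]})  # same user count, different users

print(same_users(toy_train, ok_test), same_users(toy_train, bad_test))
```

In the notebook this would amount to calling the helper on `final_train` and `test_data_clean`.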


3. Did about 20% of each user's ratings go to the test set (a test-to-training ratio of $1/4$)?

In [10]:
sample_user1 = ratings['userId'].iloc[0]
sample_user2 = ratings['userId'].iloc[1000]
print(f"\nFirst sample user: {sample_user1}")
print(f"Training: {len(final_train[final_train['userId'] == sample_user1])} ratings")
print(f"Test:     {len(test_data_clean[test_data_clean['userId'] == sample_user1])} ratings")
print(f"\nSecond sample user: {sample_user2}")
print(f"Training: {len(final_train[final_train['userId'] == sample_user2])} ratings")
print(f"Test:     {len(test_data_clean[test_data_clean['userId'] == sample_user2])} ratings")


First sample user: 1
Training: 113 ratings
Test:     28 ratings

Second sample user: 10
Training: 528 ratings
Test:     132 ratings
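
Two sample users give only a spot check. The same question can be asked for all users at once; the sketch below uses made-up frames and a hypothetical helper `per_user_test_fraction`:

```python
import pandas as pd

def per_user_test_fraction(train, test):
    """Share of each user's ratings that ended up in the test set."""
    n_train = train.groupby("userId").size()
    n_test = test.groupby("userId").size()
    # Users missing from the test set get a count of 0 instead of NaN.
    n_test = n_test.reindex(n_train.index, fill_value=0)
    return n_test / (n_train + n_test)

# Toy frames (hypothetical values): user 1 has 8 + 2 ratings, user 2 has 4 + 1.
toy_train = pd.DataFrame({"userId": [1] * 8 + [2] * 4})
toy_test = pd.DataFrame({"userId": [1] * 2 + [2] * 1})

fractions = per_user_test_fraction(toy_train, toy_test)
print(fractions.describe())
```

Applied to `final_train` and `test_data_clean`, the minimum and maximum of the resulting series would show how far individual users deviate from the 20% target.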


4. How sparse are the resulting user-item matrices?

In [11]:
print(f"\nSparsity Training: {(1 - len(final_train)/(final_train['userId'].nunique() * final_train['movieId'].nunique())):.2%}")
print(f"Sparsity Test: {(1 - len(test_data_clean)/(test_data_clean['userId'].nunique() * test_data_clean['movieId'].nunique())):.2%}")


Sparsity Training: 99.85%
Sparsity Test: 99.94%
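
The sparsity printed above is one minus the share of filled cells in the user-item matrix. A small sketch that wraps the same formula in a helper; the function name `sparsity` and the toy frame are assumptions, and the formula presumes at most one rating per user-movie pair:

```python
import pandas as pd

def sparsity(df):
    """1 - (observed ratings) / (users * movies) for a ratings frame."""
    n_users = df["userId"].nunique()
    n_movies = df["movieId"].nunique()
    return 1 - len(df) / (n_users * n_movies)

# Toy frame (hypothetical values): 4 ratings in a 3 x 3 user-item matrix.
toy = pd.DataFrame({"userId": [1, 1, 2, 3], "movieId": [10, 11, 10, 12]})
print(f"{sparsity(toy):.2%}")
```

Each set is measured against its own number of distinct users and movies, so the two values are not computed over the same matrix shape.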


Saving the data

In [12]:
final_train.to_csv('../data/cf/train_rating.csv', index=False)
test_data_clean.to_csv('../data/cf/test_rating.csv', index=False)