<a href="https://colab.research.google.com/github/lizzzb/MovieLens-Data-Analysis-in-Python/blob/main/DataPrepLinkdInCourse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install recommenders
import pandas as pd
from datetime import datetime, timedelta

from recommenders.datasets.download_utils import maybe_download
from recommenders.datasets.python_splitters import (
    python_random_split,
    python_chrono_split,
    python_stratified_split
)


The next cell defines constants for the URL of the MovieLens 100k dataset, the local file path, and the names of the columns for users, items, ratings, predictions, and timestamps. Later cells use these names when loading and processing the data.

Defining the names once makes the code easier to read and maintain: to rename a column, you change a single line.

`COL_USER = "UserId"` is an ordinary user-defined variable, not a Python keyword or a library setting. It is written in upper case by convention to signal that it is a constant, and it holds the name of the column that contains the user ID.

In [2]:
DATA_URL = "https://files.grouplens.org/datasets/movielens/ml-100k/u.data"
DATA_PATH = "ml-100k.data"

COL_USER = "UserId"
COL_ITEM = "MovieId"
COL_RATING = "Rating"
COL_PREDICTION = "Rating"
COL_TIMESTAMP = "Timestamp"

# 1. Data Preparation

## 1.1 Data Understanding

The code `filepath = maybe_download(DATA_URL, DATA_PATH)` downloads the MovieLens 100k dataset if it has not been downloaded already. `maybe_download` likely checks whether a file exists at `DATA_PATH` and, if not, downloads it from `DATA_URL`. It returns the path to the file, which is assigned to the variable `filepath`.

In [3]:
filepath = maybe_download(DATA_URL, DATA_PATH)

100%|██████████| 1.93k/1.93k [00:00<00:00, 2.01kKB/s]
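
The internals of `maybe_download` are not shown in this notebook. The download-if-missing pattern it appears to follow can be sketched as below. `download_if_missing` and its `fetch` parameter are hypothetical names chosen for illustration; this is not the `recommenders` implementation.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, path, fetch=urlretrieve):
    """Download `url` to `path` unless a file already exists there.

    Hypothetical helper for illustration only.
    """
    if not os.path.exists(path):
        # Only touch the network when there is no local copy yet
        fetch(url, path)
    return path
```

Passing the download function as a parameter is a design choice for this sketch only: it lets the helper be exercised without network access.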


This line reads the downloaded file into a pandas DataFrame. Although the function is called `read_csv`, the file here is tab-separated and has no header row.

* `data = ...`: assigns the resulting DataFrame to the variable `data`.
* `pd.read_csv(...)`: the pandas function for reading delimited text files (comma-separated by default).
* `filepath`: the location of the file to read, returned earlier by `maybe_download`.
* `sep="\t"`: specifies that the values are separated by tabs (`\t`) rather than commas.
* `names=[COL_USER, COL_ITEM, COL_RATING, COL_TIMESTAMP]`: supplies the column names for the DataFrame, using the constants defined in the cell above.

In [4]:
data = pd.read_csv(filepath, sep="\t", names=[COL_USER, COL_ITEM, COL_RATING, COL_TIMESTAMP])
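
The effect of `sep` and `names` can be checked on a small in-memory sample. This is a sketch under the assumption that the two lines below follow the same tab-separated, header-less layout as `u.data`; `raw` and `sample` are hypothetical names.

```python
import io

import pandas as pd

# Two tab-separated rows with no header line, in the layout of u.data
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=["UserId", "MovieId", "Rating", "Timestamp"],
)
print(sample)
```

Because `names` is passed, pandas treats the first line as data rather than as a header, so no rating is lost.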

In [5]:
data.head() #Timestamp is in Unix notation

Unnamed: 0,UserId,MovieId,Rating,Timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [6]:
data.describe()

Unnamed: 0,UserId,MovieId,Rating,Timestamp
count,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0
std,266.61442,330.798356,1.125674,5343856.0
min,1.0,1.0,1.0,874724700.0
25%,254.0,175.0,3.0,879448700.0
50%,447.0,322.0,4.0,882826900.0
75%,682.0,631.0,4.0,888260000.0
max,943.0,1682.0,5.0,893286600.0


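A standard deviation of about 1.13 on the 1 to 5 rating scale indicates moderate dispersion around the mean rating of about 3.53. The band of one standard deviation around the mean spans roughly 2.4 to 4.7, which on this integer scale covers the ratings 3 and 4. The quartiles in the table (25% = 3, 75% = 4) show that at least half of all ratings take one of these two values.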
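
To make the idea of "within one standard deviation" concrete, the share of values inside that band can be computed directly. The sketch below uses made-up ratings, not the MovieLens data, and `share_within_one_std` is a hypothetical helper.

```python
from statistics import mean, stdev


def share_within_one_std(values):
    """Fraction of values lying within one sample standard deviation of the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, as reported by DataFrame.describe()
    inside = [v for v in values if mu - sigma <= v <= mu + sigma]
    return len(inside) / len(values)


toy_ratings = [1, 2, 3, 3, 3, 4, 4, 4, 5, 5]  # made-up ratings for illustration
print(share_within_one_std(toy_ratings))  # 0.6
```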

In [7]:
print(
    "Total number of ratings are:\t{}".format(data.shape[0]),
    "Total number of users are:\t{}".format(data[COL_USER].nunique()),
    "Total number of items are:\t{}".format(data[COL_ITEM].nunique()),
    sep="\n"
)

Total number of ratings are:	100000
Total number of users are:	943
Total number of items are:	1682


In this code, `format` is a string method. It replaces each pair of curly braces `{}` in the string with the corresponding argument. Here it inserts the values of `data.shape[0]`, `data[COL_USER].nunique()`, and `data[COL_ITEM].nunique()` into the three strings, and `sep="\n"` makes `print` put each string on its own line.
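
The same strings can also be written as f-strings (available since Python 3.6). In this small sketch the count is hard-coded from the output above so that the snippet runs on its own.

```python
n_users = 943  # value copied from the printed output, for illustration

with_format = "Total number of users are:\t{}".format(n_users)
with_fstring = f"Total number of users are:\t{n_users}"

print(with_fstring)
```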

## 1.2 Data Transformation

We need to transform the timestamps into something we can understand.

The timestamps in the `COL_TIMESTAMP` column of the `data` DataFrame are stored as Unix epoch time (seconds since January 1, 1970, UTC) and need to be converted to a human-readable date and time. A row-by-row conversion with `data.apply` breaks down as follows:

1. `data[COL_TIMESTAMP] = ...`: assigns the result of the operation back to the `COL_TIMESTAMP` column of the `data` DataFrame.
2. `data.apply(..., axis=1)`: applies a function to each row of the DataFrame.
3. `lambda x: ...`: a lambda function, that is, a small anonymous function defined inline. It receives one row (`x`) as input.
4. `datetime.strftime(...)`: formats a `datetime` object as a string.
5. `datetime(1970, 1, 1, 0, 0, 0) + timedelta(seconds=...)`: builds the `datetime` object that corresponds to the Unix timestamp.
   * `datetime(1970, 1, 1, 0, 0, 0)` is the Unix epoch (January 1, 1970, at 00:00:00 Coordinated Universal Time).
   * `timedelta(seconds=...)` is the offset from the epoch, in seconds.
   * `x[COL_TIMESTAMP].item()` reads the timestamp from the current row (`x`) and converts it to a native Python integer.
6. `"%Y-%m-%d %H:%M:%S"`: the format string used to render the `datetime` object as text (YYYY-MM-DD HH:MM:SS).
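
The steps above can be sketched on a toy DataFrame. The cell that performs this row-wise conversion is not shown in this notebook, so the code below is a reconstruction from the breakdown rather than the author's exact cell. `toy` is a hypothetical two-row frame that reuses timestamps from the `data.head()` output.

```python
from datetime import datetime, timedelta

import pandas as pd

COL_TIMESTAMP = "Timestamp"

# Hypothetical toy frame with two Unix timestamps taken from data.head()
toy = pd.DataFrame({
    "UserId": [196, 186],
    "Timestamp": [881250949, 891717742],
})

toy[COL_TIMESTAMP] = toy.apply(
    lambda x: datetime.strftime(
        # Unix epoch plus the row's offset in seconds
        datetime(1970, 1, 1, 0, 0, 0) + timedelta(seconds=x[COL_TIMESTAMP].item()),
        "%Y-%m-%d %H:%M:%S",
    ),
    axis=1,
)
print(toy)
```

On large frames, `pd.to_datetime(series, unit="s")` performs the same conversion without a Python-level loop and yields datetime values rather than strings.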

In [20]:
# Interpret the integer column as Unix epoch seconds and convert it to pandas datetime values
data[COL_TIMESTAMP] = pd.to_datetime(data[COL_TIMESTAMP], unit="s")
data.head()

Unnamed: 0,UserId,MovieId,Rating,Timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


An experimentation protocol is usually set up to give a fair evaluation for a specific recommendation scenario. For example,
* *Recommender-A* recommends movies to people based on their **collaborative rating similarities**. To keep the evaluation statistically sound, **the same set of users should be used for both model building and testing** (to avoid cold-start users), and a **stratified splitting strategy** should be used.
* *Recommender-B* recommends fashion products to customers. Its evaluation should account for the **time dependency of customer purchases**, because **customers' tastes in fashion items may drift over time**. In this case, a **chronological split** should be used.

# 3. Data Split

## 3.1 Random Split
Random split simply takes in a data set and outputs the splits of the data, given the split ratios.

In [22]:
data_train, data_test = python_random_split(data, ratio=0.7)
data_train.shape[0], data_test.shape[0]

(70000, 30000)

In [23]:
# Sometimes a multi-split is needed
data_train, data_validate, data_test = python_random_split(data, ratio=[0.6, 0.2, 0.2])
data_train.shape[0], data_validate.shape[0], data_test.shape[0]

(60000, 20000, 20000)

In [24]:
# Ratios can be integers as well
data_train, data_validate, data_test = python_random_split(data, ratio=[3, 1, 1])
data_train.shape[0], data_validate.shape[0], data_test.shape[0]


(60000, 20000, 20000)
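
How a ratio list turns into split sizes can be sketched in plain Python. This is an illustration of the idea, not the implementation of `python_random_split`; `random_split` is a hypothetical helper that normalizes the ratios by their sum, so `[3, 1, 1]` behaves like `[0.6, 0.2, 0.2]`.

```python
import random


def random_split(rows, ratios, seed=42):
    """Shuffle `rows` and cut them into consecutive parts sized by `ratios`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    total = sum(ratios)
    n = len(shuffled)
    parts, start, cumulative = [], 0, 0
    for ratio in ratios[:-1]:
        cumulative += ratio
        end = round(n * cumulative / total)  # cut point for this part
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes whatever remains
    return parts


train_part, valid_part, test_part = random_split(range(100), [3, 1, 1])
print(len(train_part), len(valid_part), len(test_part))  # 60 20 20
```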

## 3.2 Chronological Split
The chronological splitting method takes a dataset and splits it by timestamp.
### 3.2.1 "Filter by"
Chronological splitting can be done by "user" or by "item". For example, with `filter_by="user"` and a ratio of 0.7, each user's ratings are sorted by timestamp, and **the first 70% of each user's ratings go into one split while the remaining 30% go into the other**. A chronological split is therefore not random: which split a rating lands in depends on its timestamp.

In the cell below, the earlier 70% of each user's ratings are assigned to `data_train` and the later 30% to `data_test`.

In [26]:
data_train, data_test = python_chrono_split(
    data, ratio=0.7, filter_by="user",
    col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)

# Take a look at the results for one particular user:
data_train[data_train[COL_USER] == 1].tail(10) #The last 10 rows of the train data:

Unnamed: 0,UserId,MovieId,Rating,Timestamp
1989,1,90,4,1997-11-03 07:31:40
11807,1,219,1,1997-11-03 07:32:07
50026,1,167,2,1997-11-03 07:33:03
202,1,61,4,1997-11-03 07:33:40
16314,1,230,4,1997-11-03 07:33:40
43280,1,162,4,1997-11-03 07:33:40
51295,1,35,1,1997-11-03 07:33:40
820,1,265,4,1997-11-03 07:34:01
11154,1,112,1,1997-11-03 07:34:01
45732,1,57,5,1997-11-03 07:34:19


In [28]:
data_test[data_test[COL_USER] == 1].head(10) # The first 10 rows of the test data

Unnamed: 0,UserId,MovieId,Rating,Timestamp
5682,1,49,3,1997-11-03 07:34:38
24493,1,30,3,1997-11-03 07:35:15
6234,1,233,2,1997-11-03 07:35:52
39865,1,131,1,1997-11-03 07:35:52
4280,1,82,5,1997-11-03 07:36:29
96699,1,152,5,1997-11-03 07:36:29
25721,1,141,3,1997-11-03 07:36:48
5842,1,72,4,1997-11-03 07:37:58
333,1,33,4,1997-11-03 07:38:19
37810,1,158,3,1997-11-03 07:38:19


With `filter_by="user"`, `python_chrono_split` splits each user's ratings by timestamp. For any given user, no rating in `data_train` is later than that user's ratings in `data_test`. In the output above, user 1's last training rating (07:34:19) precedes the first test rating (07:34:38).

This matters when evaluating a recommender system because it simulates the real-world situation of using past data to predict future behavior.
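
The per-user logic can be sketched in plain Python on made-up data. This illustrates the idea only: `chrono_split_by_user` is a hypothetical helper, and its rounding at the cut point may differ from what `python_chrono_split` does.

```python
from collections import defaultdict


def chrono_split_by_user(ratings, ratio=0.7):
    """Per user, put the earliest `ratio` share of ratings in train, the rest in test.

    `ratings` is a list of (user, item, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[2])  # oldest rating first
        cut = round(len(rows) * ratio)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# Made-up data: user 1 has ten ratings, user 2 has four
toy = [(1, item, 100 + item) for item in range(10)]
toy += [(2, item, 50 - item) for item in range(4)]
train_rows, test_rows = chrono_split_by_user(toy)
```

The split happens inside each user's own timeline, so a rating in one user's test set may still be older than a rating in another user's training set.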

### 3.2.2 Min-rating filter

A min-rating filter is applied to the data before the chronological splitter runs. The reason is that, especially for a multi-split, each user or item needs a sufficient number of ratings to be divided across the splits.

For example, the following applies the split only to users who have at least 10 ratings.


In [29]:
data_train, data_test = python_chrono_split(
    data, filter_by="user", min_rating=10, ratio=0.7,
    col_user=COL_USER, col_item=COL_ITEM, col_timestamp=COL_TIMESTAMP
)

The `min_rating=10` parameter adds a filter to the split: only users with at least 10 ratings are included.

This helps ensure that each user has enough data in both the training and testing sets, which can improve the reliability of the evaluation.

The number of rows in the resulting splits may not sum to the number of rows in the original data, because users with fewer than 10 ratings are filtered out. In MovieLens 100k every user has at least 20 ratings, so nothing is dropped here and the totals below match.

In [30]:
data_train.shape[0] + data_test.shape[0], data.shape[0]


(100000, 100000)
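
On a dataset that does contain sparse users, the filter would shrink the data. The sketch below shows the idea on made-up ratings. `filter_min_ratings` is a hypothetical helper, not a function from `recommenders`.

```python
from collections import Counter


def filter_min_ratings(ratings, min_rating=10):
    """Keep only rows whose user has at least `min_rating` ratings.

    `ratings` is a list of (user, item, rating) tuples.
    """
    counts = Counter(user for user, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] >= min_rating]


# Made-up data: user 1 has 12 ratings, user 2 has only 3
toy = [(1, item, 4) for item in range(12)] + [(2, item, 5) for item in range(3)]
kept = filter_min_ratings(toy, min_rating=10)
print(len(toy), len(kept))  # 15 12
```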

## 3.3 Stratified Split

A stratified split divides the ratings of each user (or item, depending on `filter_by`) according to the split ratio, so every user is represented in both the training and testing sets in similar proportion. This is particularly useful when the number of ratings per user or item is highly imbalanced.

For example, in a dataset where some users have rated hundreds of movies and others only a few, a stratified split with a ratio of 0.7 puts about 70% of each user's ratings in training and about 30% in testing, regardless of how active the user is. Unlike the chronological split, it does not use timestamps to decide which ratings go where.

In [32]:
data_train, data_test = python_stratified_split(
    data, filter_by="user", min_rating=10, ratio=0.7,
    col_user=COL_USER, col_item=COL_ITEM
)
data_train.shape[0] + data_test.shape[0], data.shape[0]

(100000, 100000)
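
The per-user sampling behind a stratified split can be sketched in plain Python on made-up data. `stratified_split_by_user` is a hypothetical helper that assumes each user's ratings are shuffled and then cut by the ratio; the internals of `python_stratified_split` may differ.

```python
import random
from collections import defaultdict


def stratified_split_by_user(ratings, ratio=0.7, seed=42):
    """Randomly send about `ratio` of each user's ratings to train, the rest to test.

    `ratings` is a list of (user, item, rating) tuples.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rng.shuffle(rows)  # random order instead of time order
        cut = round(len(rows) * ratio)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# Made-up data: a heavy user (20 ratings) and a light user (10 ratings)
toy = [(1, item, 3) for item in range(20)] + [(2, item, 4) for item in range(10)]
train_rows, test_rows = stratified_split_by_user(toy)
```

Both users end up in both sets with the same 70/30 proportion, which is the property that distinguishes this from a plain random split over all rows.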