# **rectorch**: tutorial on data loading and processing

This tutorial shows how data can be loaded and processed using **rectorch**. We will also explore the options offered by the library for splitting the dataset (i.e., training, validation and test sets).

## Preliminaries

### Dataset download
For the purposes of this tutorial we download the *movielens 1M* dataset. As the name suggests, this dataset contains roughly one million movie ratings, each on a scale from 1 to 5 stars. For more details, please refer to the official web page https://grouplens.org/datasets/movielens/1m/.

In [None]:
%cd /content/
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m.zip
!rm ml-1m.zip

/content
--2020-09-13 19:35:09--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip’


2020-09-13 19:35:10 (6.70 MB/s) - ‘ml-1m.zip’ saved [5917549/5917549]

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


Let's have a look at the first lines of the dataset file...

In [None]:
!head ./ml-1m/ratings.dat

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368


The format is: `user_id`::`item_id`::`rating`::`timestamp`.
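
As a minimal sketch, one such line can be parsed in plain Python as follows. The names `Rating` and `parse_rating_line` are hypothetical helpers introduced only for illustration; they are not part of **rectorch**, which reads the file on its own using the separator given in the configuration.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of ratings.dat (hypothetical helper type)."""
    user_id: int
    item_id: int
    rating: int
    timestamp: int


def parse_rating_line(line: str, separator: str = "::") -> Rating:
    """Split one line of ratings.dat into its four integer fields."""
    user_id, item_id, rating, timestamp = line.strip().split(separator)
    return Rating(int(user_id), int(item_id), int(rating), int(timestamp))


# First line of the file shown above.
first = parse_rating_line("1::1193::5::978300760")
print(first)
```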

### **rectorch** installation

NOTE: in this version of the tutorial we install the *dev* version from [GitHub](https://github.com/makgyver/rectorch).

In [None]:
%cd /content/
!git clone -b dev https://github.com/makgyver/rectorch.git
%cd rectorch
!pip install -r requirements.txt

/content
Cloning into 'rectorch'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 1698 (delta 11), reused 30 (delta 11), pack-reused 1650
Receiving objects: 100% (1698/1698), 3.22 MiB | 5.83 MiB/s, done.
Resolving deltas: 100% (1127/1127), done.
/content/rectorch
Collecting munch>=2.5.0
  Downloading https://files.pythonhosted.org/packages/cc/ab/85d8da5c9a45e072301beb37ad7f833cd344e04c817d97e0cc75681d248f/munch-2.5.0-py2.py3-none-any.whl
Collecting pandas>=1.1.0
  Downloading https://files.pythonhosted.org/packages/1c/11/e1f53db0614f2721027aab297c8afd2eaf58d33d566441a97ea454541c5e/pandas-1.1.2-cp36-cp36m-manylinux1_x86_64.whl (10.5MB)
     |████████████████████████████████| 10.5MB 10.4MB/s 
ERROR: google-colab 1.0.0 has requirement pandas~=1.0.0; python_version >= "3.0", but you'll have pandas 1.1.2 which is incompatible.
Installing collected packages: mun

**WARNING: because of a compatibility issue with `pandas`, the runtime must be restarted after running the previous cells!**

In [None]:
%cd rectorch

/content/rectorch


## Configuration

To load and process the dataset using **rectorch**, it is necessary to define the data configuration dictionary (or JSON file). The configuration must contain all the information for reading, processing and splitting the dataset.

The structure of this configuration dictionary is the following:

```
{
    "processing" : {
        "data_path" : [str] the path where the dataset (csv) file is stored,
        "threshold" : [float] the (minimal) threshold to apply to the ratings to binarize them (0 if no threshold),
        "separator" : [str] the columns separator used in the csv file,
        "header" : [int] the header line/s (None if not present),
        "u_min" : [int] the minimum number of ratings that a user must have to be taken,
        "i_min" : [int] the minimum number of ratings that an item must have to be taken
    },
    "splitting" : {
        "split_type" : [str] 'vertical'/'horizontal' depending on the needs,
        "sort_by" : [str] the column used to sort the ratings (None if no ordering required),
        "seed" : [int] random seed,
        "shuffle" : [bool] whether the ratings must be shuffled before splitting,
        "valid_size" : [int,float] number of users or proportion of users/ratings to be held for validation,
        "test_size" : [int,float] number of users or proportion of users/ratings to be held for test,
        "test_prop" : [float] proportion of ratings used in the test (used only in case of vertical splitting)
        "cv" : [int] number of times this splitting has to be repeated. If omitted only one splitting will be performed.
    }
}
```

In the following example the dataset (i.e., `ml-1m`) is processed as follows:

* ratings are binarized according to the threshold 3.5, i.e., if $r_{ui} > 3.5$ then the rating is considered positive feedback, otherwise negative feedback;
* users with fewer than 2 ratings are discarded;
* items are not filtered by number of ratings, since `i_min` is 0.
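
The processing steps listed above can be sketched on a toy example in plain Python. This is only an illustration, not the code **rectorch** runs: the toy ratings are invented, and both the order of the steps (threshold first, then filter) and the choice of keeping only the positive feedback are assumptions of this sketch.

```python
from collections import Counter

# Toy (user_id, item_id, rating) triples; invented, not taken from ml-1m.
ratings = [
    (1, 10, 5), (1, 11, 4), (1, 12, 2),
    (2, 10, 4), (2, 11, 3),
    (3, 12, 5), (3, 10, 4),
]

threshold = 3.5  # ratings strictly above this value count as positive feedback
u_min = 2        # minimum number of remaining ratings a user must have

# Step 1: keep only the positive feedback (assumption of this sketch).
positive = [(u, i) for (u, i, r) in ratings if r > threshold]

# Step 2: drop users left with fewer than u_min ratings.
per_user = Counter(u for (u, _) in positive)
filtered = [(u, i) for (u, i) in positive if per_user[u] >= u_min]

print(filtered)
```

Here user 2 is left with a single positive rating after thresholding, so that user is removed by the `u_min` filter.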

Then, it is split as follows:

* vertically, i.e., users that appear in the validation/test set are not included in the training;
* the validation set contains 100 users;
* the test set contains 100 users;
* both the test and the validation set use 80% of each user's ratings as the "training part" of that user and the rest as the "test part".

In [None]:
cfg_data = {
    "processing": {
        "data_path": "../ml-1m/ratings.dat",
        "threshold": 3.5,
        "separator": "::",
        "header": None,
        "u_min": 2,
        "i_min": 0
    },
    "splitting": {
        "split_type": "vertical",
        "sort_by": None,
        "seed": 98765,
        "shuffle": True,
        "valid_size": 100,
        "test_size": 100,
        "test_prop": 0.2
    }
}
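
Since the configuration may also be given as a JSON file, the sketch below writes an equivalent dictionary to disk with the standard `json` module and reads it back. The variable names and the file name `ml1m_data_cfg.json` are hypothetical, and how **rectorch** consumes such a file is not covered here.

```python
import json
import os
import tempfile

# Same content as cfg_data above, repeated so the example is self-contained.
cfg_example = {
    "processing": {
        "data_path": "../ml-1m/ratings.dat", "threshold": 3.5,
        "separator": "::", "header": None, "u_min": 2, "i_min": 0,
    },
    "splitting": {
        "split_type": "vertical", "sort_by": None, "seed": 98765,
        "shuffle": True, "valid_size": 100, "test_size": 100, "test_prop": 0.2,
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    cfg_path = os.path.join(tmp_dir, "ml1m_data_cfg.json")  # hypothetical name
    with open(cfg_path, "w") as f:
        json.dump(cfg_example, f, indent=4)  # None is written as null
    with open(cfg_path) as f:
        cfg_loaded = json.load(f)            # null is read back as None

print(cfg_loaded == cfg_example)
```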

### Horizontal vs. Vertical split

Splitting a recommendation dataset into training and test sets can be done in two main ways. Given a dataset **D** (e.g., read from a csv file):

* **[Horizontal]** X% of the ratings are randomly taken from **D** to form the training set, and the rest (100-X)% for the test set. This is called *horizontal* splitting because, if we think of the rating matrix **R** (users on the rows), we are taking part of the rows as training and the rest as test.

* **[Vertical]** A fixed set of users is kept for training and the remaining users for test. This is called *vertical* splitting because we are vertically cutting the rating matrix in two parts. However, for the test users, part of their ratings (a proportion of `1 - test_prop`) can be used as "known" ratings (to avoid cold-start).

Clearly, the same concept applies in the case of training-validation-test split.
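
The two strategies described above can be sketched on a toy dataset using only the standard library. The functions `horizontal_split` and `vertical_split` are hypothetical and written for this illustration; they follow the descriptions in this section and are not the implementation used by **rectorch**.

```python
import random

# Toy dataset: 10 users with 5 ratings each, as (user_id, item_id) pairs.
ratings = [(u, i) for u in range(10) for i in range(5)]
rng = random.Random(98765)


def horizontal_split(ratings, test_size, rng):
    """Hold out a random proportion of all the ratings for the test set."""
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def vertical_split(ratings, n_test_users, test_prop, rng):
    """Hold out whole users, then split each held-out user's ratings in two."""
    users = sorted({u for u, _ in ratings})
    test_users = set(rng.sample(users, n_test_users))
    train = [(u, i) for u, i in ratings if u not in test_users]
    known, unknown = [], []
    for user in sorted(test_users):
        items = [i for u, i in ratings if u == user]
        rng.shuffle(items)
        # test_prop of the user's ratings are hidden, the rest are "known".
        n_unknown = max(1, int(len(items) * test_prop))
        unknown += [(user, i) for i in items[:n_unknown]]
        known += [(user, i) for i in items[n_unknown:]]
    return train, (known, unknown)


h_train, h_test = horizontal_split(ratings, test_size=0.2, rng=rng)
v_train, (v_known, v_unknown) = vertical_split(
    ratings, n_test_users=2, test_prop=0.2, rng=rng)

print(len(h_train), len(h_test))
print(len(v_train), len(v_known), len(v_unknown))
```

In the vertical case no test user appears in the training set, while in the horizontal case the same user can contribute ratings to both sets.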

### Using/Starting from default configurations

**rectorch** offers some default configurations to start with. For example, to get the standard configuration for the *movielens 1M* dataset, use the `get_data_cfg` function from the `rectorch.utils` module, passing `"ml1m"` as the parameter.

In [None]:
from rectorch.utils import get_data_cfg
cfg = get_data_cfg("ml1m")
cfg

{'processing': {'data_path': './ml-1m/ratings.dat',
  'header': None,
  'i_min': 0,
  'separator': '::',
  'threshold': 3.5,
  'u_min': 5},
 'splitting': {'seed': 98765,
  'shuffle': 1,
  'sort_by': None,
  'split_type': 'vertical',
  'test_prop': 0.2,
  'test_size': 750,
  'valid_size': 750}}

The default configurations can be used "as is", or they can serve as a starting point for defining different configurations, since the dataset-specific settings are already correct.
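
As a sketch of the "starting point" use case, a default configuration can be copied and then selectively overridden. To keep the example self-contained, the dictionary below is a literal copy of the `"ml1m"` output printed above rather than a call to `get_data_cfg`; the names `default_cfg` and `custom_cfg` are hypothetical. Using `copy.deepcopy` is a cautious choice that leaves the nested default dictionaries untouched.

```python
import copy

# Literal copy of the default "ml1m" configuration shown above.
default_cfg = {
    "processing": {"data_path": "./ml-1m/ratings.dat", "header": None,
                   "i_min": 0, "separator": "::", "threshold": 3.5, "u_min": 5},
    "splitting": {"seed": 98765, "shuffle": 1, "sort_by": None,
                  "split_type": "vertical", "test_prop": 0.2,
                  "test_size": 750, "valid_size": 750},
}

# Deep copy so that changes to nested dictionaries do not affect the default.
custom_cfg = copy.deepcopy(default_cfg)
custom_cfg["processing"]["data_path"] = "../ml-1m/ratings.dat"
custom_cfg["splitting"]["valid_size"] = 100
custom_cfg["splitting"]["test_size"] = 100

print(custom_cfg["splitting"])
```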

## Creating the dataset

Once the configuration is ready, the dataset creation is simply performed using the `rectorch.data.DataProcessing` class.

In [None]:
from rectorch.data import DataProcessing
dataset = DataProcessing(cfg_data).process_and_split()
dataset

[19:36:02-130920]  Reading raw data file ../ml-1m/ratings.dat.
[19:36:06-130920]  NumExpr defaulting to 2 threads.
[19:36:06-130920]  Thresholded 424928 ratings.
[19:36:06-130920]  Applying filtering.
[19:36:06-130920]  Filtered 1 ratings.
[19:36:06-130920]  Shuffling data.
[19:36:06-130920]  Calculating splits.
[19:36:06-130920]  Creating validation and test set.
[19:36:06-130920]  Skipped 2 ratings in validation set.
[19:36:06-130920]  Skipped 3 ratings in test set.


Dataset(n_users=6037, n_items=3528, n_ratings=575275)

General information about the dataset can be retrieved using its own attributes, e.g., 

In [None]:
print("# of users: %d" %dataset.n_users)
print("# of items: %d" %dataset.n_items)
print("# of ratings: %d" %dataset.n_ratings)

The training, validation and test sets are stored (as `pandas.DataFrame` objects) inside the `dataset` object and can be retrieved using the attributes `train_set`, `valid_set` and `test_set`.

In [None]:
# The training set
dataset.train_set

          uid  iid  rating          3
0        1103    0       5  978300760
3        1103    1       4  978300275
4        1103    2       5  978824291
6        1103    3       5  978302039
7        1103    4       5  978300719
...       ...  ...     ...        ...
1000202  3149  188       4  956704996
1000205  3149  741       5  956704887
1000206  3149  203       5  956704746
1000207  3149  100       4  956715648


Since the dataset is split vertically, both the validation and the test set are composed of two parts (known and unknown ratings, as explained above). For this reason, `valid_set` and `test_set` each return a pair (`list`) of DataFrames.

In [None]:
dataset.test_set

[         uid  iid  rating          3
 233     5971   51       5  978294008
 236     5971  493       4  978294260
 238     5971   36       5  978294199
 240     5971   77       4  978294008
 242     5971   86       5  978294199
 ...      ...  ...     ...        ...
 996816  5953  983       4  956763510
 996818  5953  577       5  956763317
 996820  5953  827       4  956763657
 996821  5953  615       4  956763639
 996824  5953   30       5  956763445
 
 [8043 rows x 4 columns],          uid   iid  rating          3
 235     5971   733       4  978294282
 237     5971   764       4  978294282
 239     5971    40       5  978294230
 13348   5985   579       4  977546807
 13354   5985  1118       4  977546875
 ...      ...   ...     ...        ...
 988561  5976   580       5  956975355
 988563  5976  1120       5  956975467
 996793  5953   190       4  956763473
 996794  5953  1761       5  956763510
 996796  5953   729       4  956763318
 
 [1961 rows x 4 columns]]

The `Dataset` class also offers methods for converting the dataset into different standard formats. For example:

In [None]:
array_tr, array_val, array_te = dataset.to_array()
array_tr

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

`Dataset.to_array()` converts the dataset to `numpy.ndarray`.

Similarly:
* `to_sparse()` converts to `scipy.sparse.csr_matrix`;
* `to_tensor()` converts to `torch.FloatTensor`;
* `to_dict()` converts to `dict` with users as keys and list of items as values.
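
To make the target formats concrete, the sketch below builds two of these representations by hand from a toy list of interactions. It uses only plain Python lists and dictionaries (no `numpy`, `scipy` or `torch`), all names and data are invented for illustration, and it does not reproduce the conversion code of the `Dataset` class.

```python
from collections import defaultdict

# Toy binarized interactions as (uid, iid) pairs, with ids in 0..n-1.
interactions = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1), (2, 3)]
n_users, n_items = 3, 4

# Dictionary form: user id -> list of item ids.
grouped = defaultdict(list)
for uid, iid in interactions:
    grouped[uid].append(iid)
as_dict = dict(grouped)

# Dense form: one row per user, one column per item, 1.0 where a rating exists.
as_rows = [[0.0] * n_items for _ in range(n_users)]
for uid, iid in interactions:
    as_rows[uid][iid] = 1.0

print(as_dict)
print(as_rows)
```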

### Save/load the dataset to/from file

Once the dataset is created, it can be saved to file(s) for later usage.

In [None]:
dataset.save("ml-1m/processed")

[19:36:12-130920]  Saving unique_iid.txt.
[19:36:12-130920]  Saving unique_uid.txt.
[19:36:12-130920]  Saving all the files.
[19:36:13-130920]  Dataset saved successfully!


Loading is just as easy:

In [None]:
from rectorch.data import Dataset
dataset2 = Dataset.load("ml-1m/processed")

Let's check if the two datasets are actually the same.

In [None]:
import numpy as np
print(np.all(dataset.train_set.values    == dataset2.train_set.values) &
      np.all(dataset.valid_set[0].values == dataset2.valid_set[0].values) &
      np.all(dataset.valid_set[1].values == dataset2.valid_set[1].values) &
      np.all(dataset.test_set[0].values  == dataset2.test_set[0].values) &
      np.all(dataset.test_set[1].values  == dataset2.test_set[1].values))

True


Yes, they are!

## ... and much more

More details about the `rectorch.data` module can be found in the [official documentation](https://makgyver.github.io/rectorch/data.html).