In [1]:
import pandas as pd
from pandas_profiling import ProfileReport

# My remarks

- Explicit/Implicit feedback
- Interactions mapping (interaction_type): {0: views, 1: detail, 2: ratings, 3: purchases}
- Items mapping (item_type): {0: movies, 1: movies and clips in series, 2: TV movies or shows, 3: episodes of TV series}
- A value of -1 means null in the interactions table
- The data was collected for over four months, between 2018 and 2019
- ContentWise Impressions is comprised of three different information layers. First, interactions of users with items of the service, containing user-item pairs. Second, impressions with a direct link to interactions, containing those recommendation lists that generated interactions. Third, impressions without a direct link to interactions, containing those recommendation lists that did not generate interactions.
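
The mappings and the -1 sentinel from the remarks above can be applied when decoding the interactions table. Below is a minimal sketch on a toy frame. The column names follow the interactions table shown in this notebook; the helper name `decode_interactions` is hypothetical, and the choice of which columns carry the -1 sentinel is an assumption based on the sample rows.

```python
import pandas as pd

# Mappings copied from the remarks above.
INTERACTION_TYPES = {0: "views", 1: "detail", 2: "ratings", 3: "purchases"}
ITEM_TYPES = {
    0: "movies",
    1: "movies and clips in series",
    2: "TV movies or shows",
    3: "episodes of TV series",
}

# Assumption: these are the columns where -1 stands for "null".
SENTINEL_COLUMNS = ["recommendation_id", "vision_factor", "explicit_rating"]


def decode_interactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with -1 turned into NaN and type codes mapped to labels."""
    out = df.copy()
    for col in SENTINEL_COLUMNS:
        # mask() writes NaN wherever the condition holds.
        out[col] = out[col].mask(out[col] == -1)
    out["interaction_label"] = out["interaction_type"].map(INTERACTION_TYPES)
    out["item_label"] = out["item_type"].map(ITEM_TYPES)
    return out


# Toy rows shaped like the interactions table (not a real extract).
toy = pd.DataFrame(
    {
        "user_id": [7285, 412],
        "item_type": [3, 3],
        "recommendation_id": [56402, -1],
        "interaction_type": [0, 1],
        "vision_factor": [1.00, -1.00],
        "explicit_rating": [-1.0, -1.0],
    }
)
decoded = decode_interactions(toy)
```

Integer columns such as `recommendation_id` become floats once NaN is introduced; pandas' nullable `Int64` dtype is an alternative if integer ids are needed downstream.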

# ContentWise Impressions

Sources:
- https://github.com/ContentWise/contentwise-impressions
- https://arxiv.org/abs/2008.01212 **(detailed description of dataset!)**

ContentWise impressions is a dataset collected from an Over-the-Top media service that contains interactions and impressions.

If you use this work in any form, please cite our article:

```bibtex
@Article{ContentWiseImpressions-CIKM-2020,
author={Pérez Maurera, Fernando Benjamín
and Ferrari Dacrema, Maurizio
and Saule, Lorenzo
and Scriminaci, Mario
and Cremonesi, Paolo},
title={ContentWise Impressions: An industrial dataset with impressions included},
journal={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020)},
year={2020},
doi={},
Eprint={arXiv},
note={Source: \url{https://github.com/ContentWise/contentwise-impressions}},
}
```

## The dataset

All the data is located under the `data/ContentWiseImpressions/CW10M` folder. All the files are stored in Parquet format. These are the files we provide:

- `interactions`: DataFrame that contains the interactions of users with items.
- `impressions-direct-link`: DataFrame that contains the impressions with direct links to interactions.
- `impressions-non-direct-link`: DataFrame that contains the impressions without direct links to interactions.
- `metadata.json`: JSON file containing the number of users, items, series and recommendations of the dataset.

Inside the `data/ContentWiseImpressions/CW10M` folder you'll also find the `splits` folder, where we saved the data splits used in our experiments. It contains:

- `urm_items.train.npz`: Training split.
- `urm_items.validation.npz`: Validation split for hyperparameter tuning.
- `urm_items.test.npz`: Testing split for final evaluation of best hyperparameters.
- `urm_items.impressions.npz`: Sum of `urm_items.impressions_direct_link.npz` and `urm_items.impressions_non_direct_link.npz`.
- `urm_items.impressions_direct_link.npz`: URM version of `impressions-direct-link`. Rows are users and columns are items. Each cell contains the number of times the item has been recommended to the user.
- `urm_items.impressions_non_direct_link.npz`: URM version of `impressions-non-direct-link`. Rows are users and columns are items. Each cell contains the number of times the item has been recommended to the user.
- `urm_series.train.npz`: Equivalent to `urm_items.train.npz` but columns are series.
- `urm_series.validation.npz`: Equivalent to `urm_items.validation.npz` but columns are series.
- `urm_series.test.npz`: Equivalent to `urm_items.test.npz` but columns are series.
- `urm_series.impressions.npz`: Equivalent to `urm_items.impressions.npz` but columns are series.
- `urm_series.impressions_direct_link.npz`: Equivalent to `urm_items.impressions_direct_link.npz` but columns are series.
- `urm_series.impressions_non_direct_link.npz`: Equivalent to `urm_items.impressions_non_direct_link.npz` but columns are series.
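
The relation between the three impression matrices listed above can be sketched with SciPy. This assumes the `.npz` files are SciPy sparse matrices written with `scipy.sparse.save_npz`, which the source does not state explicitly. The example writes two toy matrices to a temporary folder under file names mirroring the real ones, reloads them, and sums them.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Toy stand-ins for the two impression URMs (2 users x 3 items).
direct = sp.csr_matrix(np.array([[1, 0, 2], [0, 0, 1]]))
non_direct = sp.csr_matrix(np.array([[0, 3, 0], [1, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp:
    # File names mirror the split files; the folder is a throwaway temp dir.
    dl_path = os.path.join(tmp, "urm_items.impressions_direct_link.npz")
    ndl_path = os.path.join(tmp, "urm_items.impressions_non_direct_link.npz")
    sp.save_npz(dl_path, direct)
    sp.save_npz(ndl_path, non_direct)

    urm_dl = sp.load_npz(dl_path)
    urm_ndl = sp.load_npz(ndl_path)

# Element-wise sum: how often each item was recommended to each user overall.
urm_impressions = (urm_dl + urm_ndl).tocsr()
```

With the real files, the same sum should reproduce `urm_items.impressions.npz` if the description above holds.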

## Human-readable version of the dataset

We provide GZIP-compressed versions of the dataset in CSV (located inside the `CW10M-CSV` folder). This way, you can inspect the dataset and load it with tools that do not support Parquet files. Be careful: the uncompressed version of the dataset takes approximately `2.4 GiB` of disk space.

If you plan to use the dataset, we highly encourage you to use the Parquet version. Parquet is an open-source data format. Readers and writers for Parquet exist in several languages, and they tend to work orders of magnitude faster than their CSV counterparts.

If you want to see an example of how to load the Parquet version of the dataset, go to the [official repo](https://github.com/ContentWise/contentwise-impressions), where we provide examples of loading and using the dataset.

## Authors

- Fernando Benjamín Pérez Maurera - Politecnico di Milano / ContentWise - ([fernando.perez@contentwise.com](mailto:fernando.perez@contentwise.com) or [fernandobenjamin.perez@polimi.it](mailto:fernandobenjamin.perez@polimi.it)).
- Maurizio Ferrari Dacrema - Politecnico di Milano - ([maurizio.ferrari@polimi.it](mailto:maurizio.ferrari@polimi.it)).
- Lorenzo Saule - Politecnico di Milano / ContentWise ([lorenzo.saule@gmail.com](mailto:lorenzo.saule@gmail.com)).
- Mario Scriminaci - ContentWise - ([mario.scriminaci@contentwise.com](mailto:mario.scriminaci@contentwise.com)).
- Paolo Cremonesi - Politecnico di Milano - ([paolo.cremonesi@polimi.it](mailto:paolo.cremonesi@polimi.it)).

## Disclaimer

This is not an official ContentWise product.

## License

ContentWise Impressions by F. B. Pérez Maurera, Maurizio Ferrari, Lorenzo Saule, Mario Scriminaci, and Paolo Cremonesi is licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0

# Parquets

## metadata

In [2]:
meta_data_path = "../../data/contentwise/data/contentwise/CW10M/metadata.json"
meta_data = pd.read_json(meta_data_path, typ="series")
meta_data

num_items              145074
num_recommendations    307454
num_series              28881
num_users               42153
dtype: int64

## interactions

In [3]:
# First of the 4 parquet part files (part.0)
interactions_path = "../../data/contentwise/data/contentwise/CW10M/interactions/part.0.parquet"
interactions = pd.read_parquet(interactions_path, engine="pyarrow")
interactions

utc_ts_milliseconds,user_id,item_id,series_id,episode_number,series_length,item_type,recommendation_id,interaction_type,vision_factor,explicit_rating
1546851602000,7285,116485,434,5,6,3,56402,0,1.00,-1.0
1546852592000,412,116485,434,5,6,3,-1,1,-1.00,-1.0
1546853242000,10811,116485,434,5,6,3,-1,1,-1.00,-1.0
1546853253000,10811,116485,434,5,6,3,-1,0,0.50,-1.0
1546856158000,10811,116485,434,5,6,3,-1,0,0.99,-1.0
...,...,...,...,...,...,...,...,...,...,...
1549021133000,40763,90910,19532,1,1,0,132437,1,-1.00,-1.0
1549022895000,40763,90910,19532,1,1,0,132437,0,0.26,-1.0
1549021537000,7183,66535,1428,1,1,0,158480,1,-1.00,-1.0
1549021513000,33495,111990,23969,1,1,1,67304,1,-1.00,-1.0
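
The cell above reads only one part file. A possible way to read every part of a table is sketched below. It assumes all parts follow the `part.N.parquet` naming seen in the path above; the helper names `part_number` and `load_all_parts` are hypothetical.

```python
import re
from pathlib import Path

import pandas as pd


def part_number(path: Path) -> int:
    """Extract N from a file name such as part.N.parquet."""
    match = re.fullmatch(r"part\.(\d+)\.parquet", path.name)
    if match is None:
        raise ValueError(f"unexpected part file name: {path.name}")
    return int(match.group(1))


def load_all_parts(folder: str) -> pd.DataFrame:
    """Read every part.N.parquet file in `folder` and concatenate them in order."""
    parts = sorted(Path(folder).glob("part.*.parquet"), key=part_number)
    if not parts:
        raise FileNotFoundError(f"no part files found in {folder}")
    return pd.concat([pd.read_parquet(p, engine="pyarrow") for p in parts])


# Numeric sorting matters once a table has ten or more parts:
names = [Path("part.10.parquet"), Path("part.2.parquet"), Path("part.0.parquet")]
ordered = [p.name for p in sorted(names, key=part_number)]
```

Usage would look like `load_all_parts("../../data/contentwise/data/contentwise/CW10M/interactions")`. Sorting by the numeric part index avoids the lexicographic order in which `part.10` comes before `part.2`.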


## impressions-direct-link

In [4]:
impressions_dl_path = "../../data/contentwise/data/contentwise/CW10M/impressions-direct-link/part.0.parquet"
impressions_dl = pd.read_parquet(impressions_dl_path, engine="pyarrow")
impressions_dl

recommendation_id,row_position,recommendation_list_length,recommended_series_list
0,0,10,"[20128, 6674, 4625, 19462, 19041, 23229, 5914,..."
1,0,10,"[7906, 1240, 1712, 8348, 3227, 7607, 24175, 15..."
2,0,10,"[13673, 15810, 16821, 3826, 26860, 22223, 1847..."
3,1,10,"[13673, 1272, 2293, 23996, 15810, 16821, 13737..."
4,0,6,"[21885, 22288, 7493, 17042, 18483, 9330]"
...,...,...,...
307449,0,12,"[21261, 26515, 5544, 1393, 5678, 22552, 9101, ..."
307450,1,10,"[20128, 4862, 6674, 28598, 27215, 4625, 19041,..."
307451,0,30,"[9969, 17425, 9101, 14797, 5743, 4172, 17953, ..."
307452,0,10,"[21079, 23099, 28598, 25404, 19462, 26304, 152..."
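
Since this table is indexed by `recommendation_id` and the interactions table carries a `recommendation_id` column, the two can be joined to recover the list that was on screen for each linked interaction. The sketch below uses toy frames with made-up values. Treating `recommendation_id == -1` as "no linked impression" follows the remark that -1 means null.

```python
import pandas as pd

# Toy interactions (made-up values, same column names as the real table).
interactions_toy = pd.DataFrame(
    {
        "user_id": [7285, 412, 40763],
        "series_id": [434, 434, 19532],
        "recommendation_id": [0, -1, 4],
    }
)

# Toy impressions indexed by recommendation_id, like impressions-direct-link.
impressions_toy = pd.DataFrame(
    {
        "row_position": [0, 0],
        "recommendation_list_length": [3, 3],
        "recommended_series_list": [[20128, 6674, 434], [21885, 22288, 7493]],
    },
    index=pd.Index([0, 4], name="recommendation_id"),
)

# Keep only interactions that point at a recommendation list.
linked = interactions_toy[interactions_toy["recommendation_id"] != -1]

# Attach the impression row that each interaction refers to.
joined = linked.merge(
    impressions_toy, left_on="recommendation_id", right_index=True, how="left"
)

# Check whether the interacted series was part of the recommended list.
joined["series_in_list"] = [
    series in rec_list
    for series, rec_list in zip(joined["series_id"], joined["recommended_series_list"])
]
```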


## impressions-non-direct-link

In [5]:
# First of the 45 parquet part files (part.0)
impressions_ndl_path = "../../data/contentwise/data/contentwise/CW10M/impressions-non-direct-link/part.0.parquet"
impressions_ndl = pd.read_parquet(impressions_ndl_path, engine="pyarrow")
impressions_ndl

user_id,row_position,recommendation_list_length,recommended_series_list
0,1,10,"[21079, 20128, 6674, 28598, 19462, 19041, 7677..."
0,0,5,"[7411, 27339, 28669, 9948, 14988]"
0,2,10,"[23099, 28082, 27641, 9701, 21833, 5654, 23393..."
0,0,10,"[5303, 27643, 19462, 26887, 15810, 3826, 7411,..."
0,1,10,"[5303, 21079, 20128, 6674, 28598, 4625, 19462,..."
...,...,...,...
921,0,10,"[13673, 2293, 23996, 15810, 28772, 16821, 1598..."
921,2,10,"[13673, 25454, 23996, 15810, 16821, 15983, 124..."
921,0,6,"[11791, 26232, 18830, 7430, 13673, 18697]"
921,2,10,"[23099, 2269, 14903, 25063, 27641, 9701, 25445..."


# CSV

## interactions

In [2]:
interactions_path = "../../data/contentwise/data/contentwise/CW10M-CSV/interactions.csv.gz"
interactions = pd.read_csv(interactions_path)
interactions

Unnamed: 0,utc_ts_milliseconds,user_id,item_id,series_id,episode_number,series_length,item_type,recommendation_id,interaction_type,vision_factor,explicit_rating
0,1546851602000,7285,116485,434,5,6,3,56402,0,1.00,-1.0
1,1546852592000,412,116485,434,5,6,3,-1,1,-1.00,-1.0
2,1546853242000,10811,116485,434,5,6,3,-1,1,-1.00,-1.0
3,1546853253000,10811,116485,434,5,6,3,-1,0,0.50,-1.0
4,1546856158000,10811,116485,434,5,6,3,-1,0,0.99,-1.0
...,...,...,...,...,...,...,...,...,...,...,...
10457805,1555317103000,20765,68405,14006,2,27,3,275661,0,1.00,-1.0
10457806,1555314593000,2878,135402,19376,25,25,3,276058,1,-1.00,-1.0
10457807,1555314596000,2878,98755,19376,2,25,3,276058,1,-1.00,-1.0
10457808,1555314597000,2878,73510,19376,3,25,3,276058,1,-1.00,-1.0
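
Going by its name, `utc_ts_milliseconds` holds Unix timestamps in milliseconds (UTC). Under that assumption, the values can be converted to timezone-aware datetimes as sketched here, using the first and one of the last timestamps shown above.

```python
import pandas as pd

# Two raw timestamps copied from the table above.
ts = pd.Series([1546851602000, 1555317103000], name="utc_ts_milliseconds")

# unit="ms" interprets the integers as milliseconds since the Unix epoch.
dt = pd.to_datetime(ts, unit="ms", utc=True)

# Calendar dates only, for a quick look at the covered period.
dates = dt.dt.strftime("%Y-%m-%d").tolist()
```

On the full table the same call would be `pd.to_datetime(interactions["utc_ts_milliseconds"], unit="ms", utc=True)`.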


In [3]:
profile = ProfileReport(interactions, minimal=True)
profile


## impressions-direct-link

In [8]:
impressions_dl_path = "../../data/contentwise/data/contentwise/CW10M-CSV/impressions-direct-link.csv.gz"
impressions_dl = pd.read_csv(impressions_dl_path)
impressions_dl

Unnamed: 0,recommendation_id,row_position,recommendation_list_length,recommended_series_list
0,0,0,10,[20128 6674 4625 19462 19041 23229 5914 76...
1,1,0,10,[ 7906 1240 1712 8348 3227 7607 24175 152...
2,2,0,10,[13673 15810 16821 3826 26860 22223 18470 284...
3,3,1,10,[13673 1272 2293 23996 15810 16821 13737 124...
4,4,0,6,[21885 22288 7493 17042 18483 9330]
...,...,...,...,...
307448,307449,0,12,[21261 26515 5544 1393 5678 22552 9101 226...
307449,307450,1,10,[20128 4862 6674 28598 27215 4625 19041 232...
307450,307451,0,30,[ 9969 17425 9101 14797 5743 4172 17953 104...
307451,307452,0,10,[21079 23099 28598 25404 19462 26304 15256 158...
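
In the CSV version, `recommended_series_list` comes back as a string of whitespace-separated ids in square brackets rather than as an array. A minimal parsing sketch follows; it assumes every cell uses that bracketed format in full (the display above truncates long cells). The helper name `parse_series_list` is hypothetical.

```python
import numpy as np


def parse_series_list(text: str) -> np.ndarray:
    """Turn a string like '[21885 22288  7493]' into an integer array."""
    # split() with no argument handles repeated spaces and newlines.
    tokens = text.strip().strip("[]").split()
    return np.array([int(tok) for tok in tokens], dtype=np.int64)


# Row 4 of the table above, the only short list shown in full.
example = "[21885 22288  7493 17042 18483  9330]"
parsed = parse_series_list(example)
```

Applied to the whole column, this would be `impressions_dl["recommended_series_list"].map(parse_series_list)`.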


## impressions-non-direct-link

In [9]:
impressions_ndl_path = "../../data/contentwise/data/contentwise/CW10M-CSV/impressions-non-direct-link.csv.gz"
impressions_ndl = pd.read_csv(impressions_ndl_path)
impressions_ndl

Unnamed: 0,user_id,row_position,recommendation_list_length,recommended_series_list
0,0,1,10,[21079 20128 6674 28598 19462 19041 7677 57...
1,0,0,5,[ 7411 27339 28669 9948 14988]
2,0,2,10,[23099 28082 27641 9701 21833 5654 23393 111...
3,0,0,10,[ 5303 27643 19462 26887 15810 3826 7411 184...
4,0,1,10,[ 5303 21079 20128 6674 28598 4625 19462 76...
...,...,...,...,...
23342612,42152,0,15,[ 7727 21261 3502 12158 982 1496 8268 225...
23342613,42152,1,10,[21079 20128 4862 6674 4625 19041 17421 224...
23342614,42152,3,8,[22159 3891 15514 11321 11844 16610 16760 422]
23342615,42152,7,15,[ 3601 21098 25872 24773 23316 25882 17861 139...
