In [None]:
import pandas as pd
from pandas_profiling import ProfileReport

# My remarks

- Explicit/Implicit feedback
- Interactions mapping (interaction_type): {0: views, 1: detail, 2: ratings, 3: purchases}
- Items mapping (item_type): {0: movies, 1: movies and clips in series, 2: TV movies or shows, 3: episodes of TV series}
- A value of `-1` means null in the interactions table
- The data was collected over more than four months between 2018 and 2019
- ContentWise Impressions comprises three different information layers. First, interactions of users with items of the service, containing user-item pairs. Second, impressions with a direct link to interactions, containing those recommendation lists that generated interactions. Third, impressions without a direct link to interactions, containing those recommendation lists that did not generate interactions.
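
The mappings and the `-1` convention in the remarks above can be applied with plain pandas. The following is a minimal sketch on a toy frame with made-up values; the column name `explicit_rating` is an assumption used only for illustration, while `interaction_type` and `item_type` come from the remarks.

```python
import numpy as np
import pandas as pd

# Code-to-label mappings copied from the remarks above.
INTERACTION_TYPES = {0: "views", 1: "detail", 2: "ratings", 3: "purchases"}
ITEM_TYPES = {
    0: "movies",
    1: "movies and clips in series",
    2: "TV movies or shows",
    3: "episodes of TV series",
}

# Toy data with made-up values; `explicit_rating` is an assumed column name.
toy = pd.DataFrame(
    {
        "interaction_type": [0, 2, 3, 1],
        "item_type": [0, 3, 1, 2],
        "explicit_rating": [-1.0, 4.5, -1.0, -1.0],
    }
)

decoded = toy.assign(
    # Translate the integer codes into readable labels.
    interaction_label=toy["interaction_type"].map(INTERACTION_TYPES),
    item_label=toy["item_type"].map(ITEM_TYPES),
    # Turn the -1 sentinel into a real missing value.
    explicit_rating=toy["explicit_rating"].replace(-1, np.nan),
)
print(decoded)
```

Replacing the sentinel only on columns known to use it avoids turning a legitimate `-1` elsewhere into a missing value.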

# ContentWise Impressions

Sources:
- https://github.com/ContentWise/contentwise-impressions
- https://arxiv.org/abs/2008.01212 **(detailed description of dataset!)**

ContentWise Impressions is a dataset collected from an Over-the-Top media service that contains interactions and impressions.

If you use this work in any form, please cite our article:

```bibtex
@Article{ContentWiseImpressions-CIKM-2020,
author={Pérez Maurera, Fernando Benjamín
and Ferrari Dacrema, Maurizio
and Saule, Lorenzo
and Scriminaci, Mario
and Cremonesi, Paolo},
title={ContentWise Impressions: An industrial dataset with impressions included},
journal={Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020)},
year={2020},
doi={},
Eprint={arXiv},
note={Source: \url{https://github.com/ContentWise/contentwise-impressions}},
}
```

## The dataset

All the data can be located under the `data/ContentWiseImpressions/CW10M` folder. All the files are stored in `parquet`. These are the files we provide:

- `interactions`: DataFrame that contains the interactions of users with items.
- `impressions-direct-link`: DataFrame that contains the impressions with direct links to interactions.
- `impressions-non-direct-link`: DataFrame that contains the impressions without direct links to interactions.
- `metadata.json`: JSON file containing the number of users, items, series and recommendations of the dataset.

Inside the `data/ContentWiseImpressions/CW10M` folder you'll also locate the `splits` folder. In there, we saved the data splits used in our experiments. You'll find in there:

- `urm_items.train.npz`: Training split.
- `urm_items.validation.npz`: Validation split for hyperparameter tuning.
- `urm_items.test.npz`: Testing split for final evaluation of best hyperparameters.
- `urm_items.impressions.npz`: Sum of `urm_items.impressions_direct_link.npz` and `urm_items.impressions_non_direct_link.npz`.
- `urm_items.impressions_direct_link.npz`: URM version of `impressions-direct-link`. Rows are users and columns are items. Each cell contains the number of times the item has been recommended to the user.
- `urm_items.impressions_non_direct_link.npz`: URM version of `impressions-non-direct-link`. Rows are users and columns are items. Each cell contains the number of times the item has been recommended to the user.
- `urm_series.train.npz`: Equivalent to `urm_items.train.npz` but columns are series.
- `urm_series.validation.npz`: Equivalent to `urm_items.validation.npz` but columns are series.
- `urm_series.test.npz`: Equivalent to `urm_items.test.npz` but columns are series.
- `urm_series.impressions.npz`: Equivalent to `urm_items.impressions.npz` but columns are series.
- `urm_series.impressions_direct_link.npz`: Equivalent to `urm_items.impressions_direct_link.npz` but columns are series.
- `urm_series.impressions_non_direct_link.npz`: Equivalent to `urm_items.impressions_non_direct_link.npz` but columns are series.
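
As a rough illustration of how the impression matrices relate, the sketch below builds two toy user-by-item count matrices, saves them, reloads them, and sums them. It assumes the `.npz` files are SciPy sparse matrices readable with `scipy.sparse.load_npz`; the README does not state this explicitly, and the counts are made up.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Toy impression counts (2 users x 3 items); values are made up.
direct = sp.csr_matrix(np.array([[1, 0, 2], [0, 0, 1]]))
non_direct = sp.csr_matrix(np.array([[0, 3, 1], [4, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp:
    direct_path = os.path.join(tmp, "urm_items.impressions_direct_link.npz")
    non_direct_path = os.path.join(tmp, "urm_items.impressions_non_direct_link.npz")
    sp.save_npz(direct_path, direct)
    sp.save_npz(non_direct_path, non_direct)

    # Assumption: the real split files load the same way.
    loaded_direct = sp.load_npz(direct_path)
    loaded_non_direct = sp.load_npz(non_direct_path)

# Element-wise sum: total number of times each item was shown to each user.
combined = (loaded_direct + loaded_non_direct).tocsr()
print(combined.toarray())
```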

## Human-readable version of the dataset

We provide GZIP-compressed versions of the dataset in CSV (located inside the `CW10M-CSV` folder). This way, you can inspect the dataset and load it with tools that do not support Parquet files. Be careful: the uncompressed version of the dataset takes approximately `2.4 GiB` of disk space.

If you plan to use the dataset, we highly encourage you to use the Parquet version of it. Parquet is an open source format for data. Readers and writers for Parquet exist in several languages, and they tend to work orders of magnitude faster than their CSV counterparts.

If you want to see an example of how to load the parquet version of the dataset, then go to the [official repo](https://github.com/ContentWise/contentwise-impressions) as we provided examples to load and use the dataset.
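
Given the size of the uncompressed CSV files, one option is to stream them in chunks instead of loading everything at once. The sketch below is only an illustration on a tiny, made-up gzip file; the chunk size and the column names are assumptions, not values taken from the dataset.

```python
import gzip
import os
import tempfile
from collections import Counter

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny stand-in for `interactions.csv.gz` (made-up rows).
    path = os.path.join(tmp, "interactions.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("user_id,item_id,interaction_type\n")
        for i in range(10):
            handle.write(f"{i % 3},{i},{i % 4}\n")

    # pandas infers gzip compression from the `.gz` suffix.
    counts = Counter()
    with pd.read_csv(path, chunksize=4) as reader:
        for chunk in reader:
            # Aggregate per chunk so only one chunk is in memory at a time.
            counts.update(chunk["interaction_type"].value_counts().to_dict())

print(dict(counts))
```

For the real files a much larger `chunksize` would be more practical; the small value here only forces several chunks on ten rows.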

## Authors

- Fernando Benjamín Pérez Maurera - Politecnico di Milano / ContentWise - ([fernando.perez@contentwise.com](mailto:fernando.perez@contentwise.com) or [fernandobenjamin.perez@polimi.it](mailto:fernandobenjamin.perez@polimi.it)).
- Maurizio Ferrari Dacrema - Politecnico di Milano - ([maurizio.ferrari@polimi.it](mailto:maurizio.ferrari@polimi.it)).
- Lorenzo Saule - Politecnico di Milano / ContentWise ([lorenzo.saule@gmail.com](mailto:lorenzo.saule@gmail.com)).
- Mario Scriminaci - ContentWise - ([mario.scriminaci@contentwise.com](mailto:mario.scriminaci@contentwise.com)).
- Paolo Cremonesi - Politecnico di Milano - ([paolo.cremonesi@polimi.it](mailto:paolo.cremonesi@polimi.it)).

## Disclaimer

This is not an official ContentWise product.

## License

ContentWise Impressions by F. B. Pérez Maurera, Maurizio Ferrari, Lorenzo Saule, Mario Scriminaci, and Paolo Cremonesi is licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0

# Parquets

## metadata

In [None]:
meta_data_path = "../../data/contentwise/data/contentwise/CW10M/metadata.json"
meta_data = pd.read_json(meta_data_path, typ="series")
meta_data

## interactions

In [None]:
# Loads only the first of the 4 Parquet part files
interactions_path = "../../data/contentwise/data/contentwise/CW10M/interactions/part.0.parquet"
interactions = pd.read_parquet(interactions_path, engine="pyarrow")
interactions

## impressions-direct-link

In [None]:
impressions_dl_path = "../../data/contentwise/data/contentwise/CW10M/impressions-direct-link/part.0.parquet"
impressions_dl = pd.read_parquet(impressions_dl_path, engine="pyarrow")
impressions_dl

## impressions-non-direct-link

In [None]:
# Loads only the first of the 45 Parquet part files
impressions_ndl_path = "../../data/contentwise/data/contentwise/CW10M/impressions-non-direct-link/part.0.parquet"
impressions_ndl = pd.read_parquet(impressions_ndl_path, engine="pyarrow")
impressions_ndl

# CSV

## interactions

In [None]:
interactions_path = "../../data/contentwise/data/contentwise/CW10M-CSV/interactions.csv.gz"
interactions = pd.read_csv(interactions_path)
interactions

In [None]:
profile = ProfileReport(interactions, minimal=True)
profile

## impressions-direct-link

In [None]:
impressions_dl_path = "../../data/contentwise/data/contentwise/CW10M-CSV/impressions-direct-link.csv.gz"
impressions_dl = pd.read_csv(impressions_dl_path)
impressions_dl

## impressions-non-direct-link

In [None]:
impressions_ndl_path = "../../data/contentwise/data/contentwise/CW10M-CSV/impressions-non-direct-link.csv.gz"
impressions_ndl = pd.read_csv(impressions_ndl_path)
impressions_ndl