This notebook contains the data preparation pipeline for the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/), which was originally collected for the paper [Improving Recommendation Lists Through Topic Diversification](http://www2.informatik.uni-freiburg.de/~cziegler/BX/WWW-2005-Preprint.pdf). The authors crawled the [BookCrossing](https://www.bookcrossing.com/) website and collected data on 278,858 members and 1,157,112 ratings, both implicit and explicit, referring to 271,379 distinct ISBNs. Invalid ISBNs were excluded from the outset.

# Setup



## Packages

Installing and importing packages. We will work with [isbnlib](https://github.com/xlcnd/isbnlib), a Python package that can validate, clean, transform, hyphenate, and fetch metadata for ISBN strings.
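To make "validating" an ISBN concrete, here is a minimal pure-Python sketch of the ISBN-10 check-digit rule. It is only an illustration: the function name `is_valid_isbn10` is hypothetical, it is not how isbnlib is implemented, and isbnlib covers far more (ISBN-13, cleaning, hyphenation, metadata).

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 checksum (hypothetical helper, not part of isbnlib)."""
    # Drop the separators that commonly appear in printed ISBNs.
    s = isbn.replace('-', '').replace(' ', '').upper()
    if len(s) != 10:
        return False
    # The first nine characters must be digits; the last may also be 'X' (= 10).
    if any(ch not in '0123456789' for ch in s[:9]):
        return False
    if s[9] not in '0123456789X':
        return False
    # Weights run from 10 down to 1; the weighted sum must be divisible by 11.
    total = sum((10 - i) * (10 if ch == 'X' else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


print(is_valid_isbn10('0-306-40615-2'))
print(is_valid_isbn10('0306406153'))
```

In the rest of the notebook, the library itself should be preferred over a hand-written check like this one.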

In [1]:
!pip install isbnlib

Collecting isbnlib
  Downloading isbnlib-3.10.9-py2.py3-none-any.whl (65 kB)
Installing collected packages: isbnlib
Successfully installed isbnlib-3.10.9


In [2]:
import pandas as pd
import isbnlib
from tqdm import tqdm
import matplotlib.pyplot as plt

## Data

We download the Book-Crossing dataset in CSV format, unzip it, and load it into pandas DataFrames. When loading, we need to set the `encoding` parameter because the data is encoded in `ISO-8859-1`, not the default `UTF-8`. In addition, one book title contains double quotes, which must be escaped because the double quote is also used to enclose fields in the CSV files. The troublesome title looks like this:

> `Peterman Rides Again: Adventures Continue with the Real \"J. Peterman\" Through Life &amp; the Catalog Business`

We therefore need to explicitly set the backslash (`\`) as the escape character.
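The effect of the escape character can be seen on a small in-memory sample before touching the real files. This is a sketch under assumptions: the one-row sample mimics the semicolon-separated, quoted layout of `BX-Books.csv`, the ISBN value is a made-up placeholder, and the title is a shortened version of the one quoted above.

```python
import io

import pandas as pd

# One-row sample in the dataset's layout; the ISBN is a made-up placeholder.
sample = (
    '"ISBN";"Book-Title"\n'
    '"0000000000";"Adventures Continue with the Real \\"J. Peterman\\" Through Life"\n'
)

# With escapechar set, a backslash-escaped quote is read as a literal quote
# instead of being treated as the end of the quoted field.
demo = pd.read_csv(io.StringIO(sample), sep=';', escapechar='\\', dtype=str)
title = demo.loc[0, 'Book-Title']
print(title)
```

The backslashes are consumed during parsing, so the loaded title contains plain double quotes around `J. Peterman`.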

In [3]:
!wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip

--2021-11-17 17:56:30--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘BX-CSV-Dump.zip’


2021-11-17 17:56:33 (16.4 MB/s) - ‘BX-CSV-Dump.zip’ saved [26085508/26085508]



In [4]:
!unzip BX-CSV-Dump.zip

Archive:  BX-CSV-Dump.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [5]:
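# All three files are semicolon-separated and encoded in ISO-8859-1 (Latin-1)
users = pd.read_csv('BX-Users.csv', sep=';', encoding='ISO-8859-1')
# Backslash is the escape character for quotes embedded in book titles
books = pd.read_csv('BX-Books.csv', sep=';', encoding='ISO-8859-1', escapechar='\\')
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='ISO-8859-1')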