This notebook contains a data preparation pipeline for the [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/), which was originally collected for the paper [Improving Recommendation Lists Through Topic Diversification](http://www2.informatik.uni-freiburg.de/~cziegler/BX/WWW-2005-Preprint.pdf). The authors crawled the [BookCrossing](https://www.bookcrossing.com/) website and collected data on 278 858 members and 1 157 112 ratings, both implicit and explicit, referring to 271 379 distinct ISBNs. Invalid ISBNs were excluded from the outset.

# Setup



## Packages

Installing and importing packages. We will work with the Python package [isbnlib](https://github.com/xlcnd/isbnlib), which can be used to validate, clean, transform, hyphenate, and fetch metadata for ISBN strings.

In [None]:
!pip install isbnlib

Collecting isbnlib
  Downloading isbnlib-3.10.9-py2.py3-none-any.whl (65 kB)
Installing collected packages: isbnlib
Successfully installed isbnlib-3.10.9


In [None]:
import pandas as pd
import isbnlib
from tqdm import tqdm
import matplotlib.pyplot as plt

## Data

We download the Book-Crossing dataset in CSV format, unzip it, and load it into pandas DataFrames. When loading, we need to set the `encoding` parameter, because the data are encoded in `ISO-8859-1`, not the default `UTF-8`. Also, one book title contains double quotes, which must be escaped, because double quotes are also used to enclose fields in the CSV file. The troublesome title looks like this:

> `Peterman Rides Again: Adventures Continue with the Real \"J. Peterman\" Through Life &amp; the Catalog Business`

We need to explicitly set the backslash (`\`) as the escape character.
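The effect of the escape character can be sketched without downloading the file. The snippet below uses Python's standard `csv` module as a stand-in, since its `escapechar` parameter plays the same role as the one passed to `pd.read_csv`. The one-row sample (including the `ID` column and its value) is invented for illustration and only mimics the quoting style of `BX-Books.csv`.

```python
import csv
import io

# Invented sample: semicolon-separated, fields wrapped in double quotes,
# inner quotes escaped with a backslash (as in the troublesome title).
sample = '"ID";"Book-Title"\n"1";"Adventures Continue with the Real \\"J. Peterman\\" Through Life"\n'

# escapechar tells the parser that a backslash makes the next character literal.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar='"', escapechar="\\")
header, row = list(reader)

print(header)
print(row[1])
```

With the escape character set, the row is split into two fields and the inner quotes survive as part of the title.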

In [None]:
!wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip

--2021-11-17 17:56:30--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘BX-CSV-Dump.zip’


2021-11-17 17:56:33 (16.4 MB/s) - ‘BX-CSV-Dump.zip’ saved [26085508/26085508]



In [None]:
!unzip BX-CSV-Dump.zip

Archive:  BX-CSV-Dump.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [None]:
users = pd.read_csv('BX-Users.csv', sep=';', encoding = "ISO-8859-1")
books = pd.read_csv('BX-Books.csv', sep=';', encoding = "ISO-8859-1", escapechar = "\\")
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding = "ISO-8859-1")
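To illustrate why the `encoding` argument matters, here is a minimal sketch using only built-in string methods. The sample string `Müller` is invented for illustration and is not taken from the dataset.

```python
# Encode an invented sample string the way the dataset files are encoded.
raw = "Müller".encode("iso-8859-1")  # the "ü" becomes the single byte 0xFC

# Decoding those bytes as UTF-8 fails, because 0xFC is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the matching encoding restores the original text.
decoded = raw.decode("iso-8859-1")

print(utf8_ok, decoded)
```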

# Data cleaning



## Useless columns

Let's look at the columns in the `books` table to see which of them contain useful information.

In [None]:
books.head()

,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


We might use the images in a final application to show the user previews of suggested books, or feed them into a multi-modal model as one of the inputs for computing similarity between books. For now we do not need them, since we are building only a simple proof-of-concept recommendation system.

In [None]:
books = books.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'])

In [None]:
books.head()

,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


## Renaming columns

To make our lives easier, we rename the columns to shorter names.

In [None]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication',
       'Publisher'],
      dtype='object')

In [None]:
books.columns = ['ISBN', 'Title', 'Author', 'Year', 'Publisher']

In [None]:
books.columns

Index(['ISBN', 'Title', 'Author', 'Year', 'Publisher'], dtype='object')

In [None]:
ratings.columns

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

In [None]:
ratings.columns = ['User-ID', 'ISBN', 'Rating']

In [None]:
ratings.columns

Index(['User-ID', 'ISBN', 'Rating'], dtype='object')

In [18]:
users.columns

Index(['User-ID', 'Location', 'Age'], dtype='object')

The columns in the `users` table are fine, so we keep them as they are.
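Assigning a list to `.columns` relies on the column order staying the same. As a possible alternative, `DataFrame.rename` maps old names to new ones explicitly. The sketch below uses a tiny stand-in frame, `ratings_demo`, whose single row of values is invented for illustration.

```python
import pandas as pd

# Tiny stand-in frame with the original Book-Crossing column names (values are invented).
ratings_demo = pd.DataFrame({"User-ID": [1], "ISBN": ["0195153448"], "Book-Rating": [0]})

# Map old names to new ones explicitly; columns not listed keep their names.
ratings_demo = ratings_demo.rename(columns={"Book-Rating": "Rating"})

print(list(ratings_demo.columns))
```

Because the mapping is keyed by name, a reordered or missing column cannot silently receive the wrong label.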

## Transforming ISBN to canonical form

Reasons for bothering with ISBNs:

- transforming ISBNs into a standard form prevents duplicate entries
- a unique and valid ISBN identifies a book, so we can link it to other resources where books are identified by ISBN.

ISBN is the key that links the `ratings` table to the `books` table, so any cleaning of ISBNs must be applied to both tables.
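To make the idea of a standard form concrete, here is a minimal hand-rolled sketch. It is not how isbnlib works internally: `normalize_isbn` and `is_valid_isbn10` are hypothetical helpers written for illustration, and the hyphenated and padded variants are invented spellings of an ISBN from the table above. The validity check uses the standard ISBN-10 rule that the weighted digit sum (weights 10 down to 1, with `X` counting as 10) must be divisible by 11.

```python
DIGITS = "0123456789"


def normalize_isbn(raw: str) -> str:
    """Keep only digits and the letter X (upper-cased); drop hyphens, spaces, etc."""
    return "".join(ch for ch in raw.upper() if ch in DIGITS or ch == "X")


def is_valid_isbn10(isbn: str) -> bool:
    """Return True if the string is 10 characters long and passes the ISBN-10 checksum."""
    if len(isbn) != 10:
        return False
    if any(ch not in DIGITS for ch in isbn[:9]) or isbn[9] not in DIGITS + "X":
        return False
    values = [int(ch) for ch in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
    weighted_sum = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return weighted_sum % 11 == 0


# Three spellings of the same ISBN collapse to one normalized string.
variants = ["0195153448", "0-19-515344-8", " 0195153448 "]
normalized = {normalize_isbn(v) for v in variants}

print(normalized)
print(is_valid_isbn10("0195153448"))
```

Normalizing before comparing is what lets differently written ISBNs be recognized as the same book.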

First, we create a helper function that prints the number of entries and the number of unique ISBNs in each table, so we can see how these counts change during the ISBN transformation.

In [19]:
def book_stats():
  """Print entry counts and unique-ISBN counts for the global `books` and `ratings` tables.

  From the second call on, also print how much each count dropped since the
  previous call. The previous counts are stored as attributes of this function.
  """
  unique_books = len(books['ISBN'].unique())
  unique_ratings = len(ratings['ISBN'].unique())

  # this is not the first run of this function: report changes since the last call
  if hasattr(book_stats, "books_count"):
    print("Removed books: ", book_stats.books_count - len(books))
    print("Removed unique books: ", book_stats.unique_books - unique_books)
    print("Removed books in ratings: ", book_stats.ratings_count - len(ratings))
    print("Removed unique books in ratings: ", book_stats.unique_ratings - unique_ratings)
    print()

  # update counts in each run
  book_stats.books_count = len(books)
  book_stats.ratings_count = len(ratings)
  book_stats.unique_books = unique_books
  book_stats.unique_ratings = unique_ratings

  print("Current number of all books in books: ", book_stats.books_count)
  print("Current number of entries in books with unique ISBN: ", book_stats.unique_books)
  print("Current number of all books in ratings: ", book_stats.ratings_count)
  print("Current number of entries in ratings with unique ISBN: ", book_stats.unique_ratings)

In [20]:
book_stats()

Current number of all books in books:  271379
Current number of entries in books with unique ISBN:  271379
Current number of all books in ratings:  1149780
Current number of entries in ratings with unique ISBN:  340556


In [21]:
books['ISBN'] = books['ISBN'].apply(lambda x: isbnlib.canonical(isbnlib.clean(x)))
ratings['ISBN'] = ratings['ISBN'].apply(lambda x: isbnlib.canonical(isbnlib.clean(x)))

In [22]:
book_stats()

Removed books:  0
Removed unique books:  431
Removed books in ratings:  0
Removed unique books in ratings:  8134

Current number of all books in books:  271379
Current number of entries in books with unique ISBN:  270948
Current number of all books in ratings:  1149780
Current number of entries in ratings with unique ISBN:  332422


As we can see, the number of entries in the `books` and `ratings` tables is unchanged, but there are fewer unique ISBNs: 431 fewer in `books` and 8134 fewer in `ratings`. Multiple differently written ISBNs were transformed into the same ISBN.
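One way to inspect which raw ISBNs ended up sharing a transformed value is to group the raw strings by their transformed form. The sketch below is a simplified illustration, not the notebook's actual procedure: `canonical_sketch` is a hypothetical stand-in for the isbnlib calls (it only strips non-ISBN characters and upper-cases `X`), and the raw spellings in `raw_isbns` are invented.

```python
from collections import defaultdict


def canonical_sketch(raw: str) -> str:
    """Hypothetical stand-in for the isbnlib transformation: keep digits and upper-cased X."""
    return "".join(ch for ch in raw.upper() if ch in "0123456789X")


# Invented raw spellings; some of them refer to the same ISBN.
raw_isbns = ["0195153448", "0-19-515344-8", "080442957x", "080442957X", "0002005018"]

# Group every raw spelling under its transformed form.
groups = defaultdict(set)
for raw in raw_isbns:
    groups[canonical_sketch(raw)].add(raw)

# Keep only the transformed forms that more than one raw spelling mapped to.
collisions = {canon: sorted(raws) for canon, raws in groups.items() if len(raws) > 1}

for canon, raws in collisions.items():
    print(canon, "<-", raws)
```

In this toy input, five raw strings reduce to three unique values, which mirrors the drop in unique ISBNs reported by `book_stats()` while the row count stays the same.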