# Book Recommender Exploration
We'll take a quick look at the Goodreads dataset (the mystery, thriller, and crime subset) and prepare the data for further analysis.

### Data Sourced From
https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

### Citations
* Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in RecSys'18.
* Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", in ACL'19.

In [17]:
import gzip
import json

import pandas as pd


MYSTERY_BOOKS_PATH = '../data/goodreads_books_mystery_thriller_crime.json.gz'
MYSTERY_REVIEWS_PATH = '../data/goodreads_reviews_mystery_thriller_crime.json.gz'

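def unpack_to_df_from_json_gz(file_ref, cnt=1000):
    """
    Read up to `cnt` lines of a gzipped JSON-lines file into a DataFrame.

    Collective import via json/pandas throws decode exceptions;
    instead we decode each line separately into a list of Python objects,
    discarding any line we fail to decode.
    """
    out = []
    with gzip.open(file_ref) as f:
        while cnt > 0:
            line = f.readline()
            if not line:
                # End of file reached before cnt lines were read.
                break
            try:
                out.append(json.loads(line))
            except json.JSONDecodeError:
                # Malformed line: skip it (it still counts toward cnt).
                pass
            cnt -= 1
    return pd.DataFrame(out)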
    
reviews = unpack_to_df_from_json_gz(MYSTERY_REVIEWS_PATH)
reviews.head()

,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
0,8842281e1d1347389f2ab93d60773d4d,6392944,5e212a62bced17b4dbe41150e5bb9037,3,I haven't read a fun mystery book in a while a...,Mon Jul 24 02:48:17 -0700 2017,Sun Jul 30 09:28:03 -0700 2017,Tue Jul 25 00:00:00 -0700 2017,Mon Jul 24 00:00:00 -0700 2017,6,0
1,8842281e1d1347389f2ab93d60773d4d,28684704,2ede853b14dc4583f96cf5d120af636f,3,"A fun, fast paced science fiction thriller. I ...",Tue Nov 15 11:29:22 -0800 2016,Mon Mar 20 23:40:27 -0700 2017,Sat Mar 18 23:22:42 -0700 2017,Fri Mar 17 23:45:40 -0700 2017,22,0
2,8842281e1d1347389f2ab93d60773d4d,32283133,8e4d61801907e591018bdc3442a9cf2b,0,http://www.telegraph.co.uk/culture/10...,Tue Nov 01 11:09:18 -0700 2016,Tue Nov 01 11:09:44 -0700 2016,,,9,0
3,8842281e1d1347389f2ab93d60773d4d,17860739,022bb6daffa49adc27f6b20b6ebeb37d,4,An amazing and unique creation: JJ Abrams and ...,Wed Mar 26 13:51:30 -0700 2014,Tue Sep 23 01:44:36 -0700 2014,Sun Sep 21 00:00:00 -0700 2014,Sat Jul 26 00:00:00 -0700 2014,7,0
4,8842281e1d1347389f2ab93d60773d4d,8694005,0e317947e1fd341f573192111bb2921d,3,The Name of the Rose is a thrilling Dan Brown-...,Wed Sep 08 01:22:27 -0700 2010,Wed Dec 14 12:30:43 -0800 2016,Mon Aug 10 00:00:00 -0700 2015,Mon Jul 20 00:00:00 -0700 2015,17,6
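
To see the skip-on-failure behaviour in isolation, here is a minimal sketch on a made-up file. The helper name `read_json_lines_gz`, the toy records, and the temporary file are assumptions for illustration and are not part of the Goodreads data.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines_gz(path, limit=1000):
    """Decode up to `limit` lines of a gzipped JSON-lines file, skipping bad ones."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # malformed line: discard it and keep going
    return pd.DataFrame(records)


# Toy file (made-up data): two valid records around one truncated line.
lines = [
    '{"user_id": "u1", "book_id": "b1", "rating": 3}',
    '{"user_id": "u2", "book_id": ',
    '{"user_id": "u2", "book_id": "b1", "rating": 5}',
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_reviews.json.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    toy_reviews = read_json_lines_gz(toy_path)

print(toy_reviews)
```

The truncated middle line is dropped silently, so the resulting frame has two rows. On real data it may be worth counting the skipped lines to know how much was lost.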


Now that we've extracted the basic features from a reduced sample, we transform the data so that each row is a user and each column is a book.

If we wanted to add more book features, we'd probably be better off placing books as rows, but that's for another time.

In [18]:
matrix = pd.pivot_table(reviews, values=['rating'], index=['user_id'], columns=['book_id'])
matrix.head()

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
book_id,10031164,1007501,10147,10148,10229001,10230496,102318,10236954,10244512,102504,...,9693927,9694893,9736930,9761998,9775904,9815629,9867599,9918053,9922624,9947386
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
012515e5802b2e0f42915118c90fa04b,,,,,,,,,,,...,,,,,,,,,,
012aa353140af13109d00ca36cdc0637,,,,,,,,,,,...,,,,,,,,,,
01d02898170634e6e7232650ebbf2e43,,,,,,,,,,,...,,,,,,,,,,
01ec1a320ffded6b2dd47833f2c8e4fb,,,,,,,,,,,...,,,,,,,,,,
020684c1b3dab4137230fc3cc309c107,,,,,,,,,,,...,,,,,,,,,,
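
The user-by-book layout, and the books-as-rows alternative mentioned earlier, can be illustrated on a tiny made-up frame. This is only a sketch: the `toy` data and the variable names are assumptions, and `values` is passed as a plain string here, which keeps the columns as a flat index of book ids rather than the two-level header shown above.

```python
import pandas as pd

# Made-up ratings in the same long format as `reviews`.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "book_id": ["b1", "b2", "b1", "b3"],
    "rating": [3, 5, 4, 2],
})

# One row per user, one column per book; unrated pairs become NaN.
user_by_book = toy.pivot_table(values="rating", index="user_id", columns="book_id")

# Transposing puts books on the rows instead.
book_by_user = user_by_book.T

print(user_by_book)
```

Note that `pivot_table` aggregates with the mean by default, so if a user had several ratings for the same book they would be averaged into one cell.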


After pivoting the data, we want only high-quality recommendations, so we drop weak connections: users and books with fewer than two interactions.

In [50]:
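# Drop books (columns) with fewer than two ratings.
# Series.iteritems() was removed in pandas 2.0; .items() is the replacement.
n = matrix.drop([col for col, cnt in matrix.count().items() if cnt < 2], axis=1)
# Drop users (rows) with fewer than two ratings, counted on the original matrix.
n2 = n.drop([row for row, cnt in matrix.count(axis="columns").items() if cnt < 2])
n2.fillna(0).head()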

(87, 926)


Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
book_id,104507,105992,12368985,12859425,13129925,13145,14740588,153025,15776309,16031620,...,6218281,6526,66559,6853,6892870,8442457,84921,89724,960,968
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
012aa353140af13109d00ca36cdc0637,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
01ec1a320ffded6b2dd47833f2c8e4fb,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
06316bec7a49286f1f98d5acce24f923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0757e6c8076682b47d9d4dcebb6db776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
0d8d07544717e84149df654caae803d0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
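
The same filtering can also be written with boolean masks instead of lists of labels to drop. The sketch below uses a made-up matrix; `toy_matrix`, `MIN_INTERACTIONS`, and the other names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

MIN_INTERACTIONS = 2  # same threshold as in the cell above
NA = float("nan")     # marks "not rated"

# Made-up user-by-book ratings.
toy_matrix = pd.DataFrame(
    {
        "b1": [3.0, 4.0, NA],
        "b2": [5.0, NA, NA],
        "b3": [NA, 1.0, 2.0],
    },
    index=["u1", "u2", "u3"],
)

ratings_per_book = toy_matrix.count()                # non-missing values per column
ratings_per_user = toy_matrix.count(axis="columns")  # non-missing values per row

# Keep only the users and books that meet the threshold.
filtered = toy_matrix.loc[
    ratings_per_user >= MIN_INTERACTIONS,
    ratings_per_book >= MIN_INTERACTIONS,
]
print(filtered)
```

Both counts are taken on the unfiltered matrix, as in the cell above, so a single pass does not guarantee that every remaining user still has two ratings among the remaining books. Here `u1` is kept but has only one rating left once `b2` is dropped.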
