# Uploading tables to Neo4j

Welcome to **Part 2** of my project. In this notebook I will take the tables I cleaned and created in `csv_clean.pynb` and use them to build a non-relational graph database in Neo4j.

## Importing Libraries

The py2neo library is a community-created client library for working with Neo4j. It allows for simple and intuitive interactions with Neo4j, in a way that is much simpler than the company's official API.

In [1]:
from py2neo import Graph, Node, Relationship, NodeMatcher, NodeMatch, RelationshipMatcher

import pandas as pd
import numpy as np
import os
import re

## Instantiating Graph and Matchers/loading files

Before I can do anything, I need to connect to a graph that my datasets can be added to. I also create matchers in case I need them, and I'll take this moment to read my previously created tables into dataframes.

In [2]:
"""
Two separate graph connections. For a local instance, use the first.

For a remote instance, use the second. The remote instance takes several hours to upload everything, so be careful with this one.

BE SURE TO COMMENT OUT WHICHEVER ONE YOU ARE NOT USING.
"""
# local instance
graph = Graph('neo4j://localhost:7687', user='neo4j',
              password='PnUPiYIuPmlUTQOXPMNs_i66Bws05VY73hJyOmvQ9SI')

# # remote instance
# graph = Graph('neo4j+s://937f5f62.databases.neo4j.io',user='neo4j',
#                password='c4X7H5lguOVKM3vJzx8zi_JUUTtfu9iZdjhoufcVvlY')

node_matcher = NodeMatcher(graph)
rel_matcher = RelationshipMatcher(graph)
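
Hard-coded passwords are easy to leak when a notebook gets shared. Here is a minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_settings` are my own placeholders, not something py2neo requires.

```python
import os


def neo4j_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    password = env.get('NEO4J_PASSWORD')
    if not password:
        raise RuntimeError('Set NEO4J_PASSWORD before connecting.')
    return {
        'uri': env.get('NEO4J_URI', 'neo4j://localhost:7687'),
        'user': env.get('NEO4J_USER', 'neo4j'),
        'password': password,
    }


# Usage with the same Graph call as above (commented out so the sketch
# runs without a database):
# settings = neo4j_settings()
# graph = Graph(settings['uri'], user=settings['user'],
#               password=settings['password'])
```

Switching between the local and remote instance then becomes a matter of changing the environment rather than commenting code in and out.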

In [3]:
book_df = pd.read_csv('data/books_cleaned.csv')
author_df = pd.read_csv('data/authors.csv')
publisher_df = pd.read_csv('data/publishers.csv')

book_df.info()
print()
author_df.info()
print()
publisher_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11127 entries, 0 to 11126
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   book_id             11127 non-null  int64  
 1   title               11127 non-null  object 
 2   isbn_10             11127 non-null  object 
 3   isbn_13             11127 non-null  int64  
 4   language_code       11127 non-null  object 
 5   audio_book          11127 non-null  bool   
 6   num_pages           11127 non-null  int64  
 7   ratings_count       11127 non-null  int64  
 8   average_rating      11127 non-null  float64
 9   text_reviews_count  11127 non-null  int64  
dtypes: bool(1), float64(1), int64(5), object(3)
memory usage: 793.4+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19244 entries, 0 to 19243
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   19244 non-null  int64

## **ADDENDUM**

I found another [iteration of the dataset on Kaggle](https://www.kaggle.com/datasets/middlelight/goodreadsbookswithgenres) that enriches the semantic content of the nodes significantly, which I thought was worth including.

If I were starting from scratch, I would probably begin with this dataset. But in the spirit of Saint Provenance, I will instead join the extra column onto my cleaned data and use it to add extra information to my nodes.

By doing this we will be able to ask questions like:
- What publisher produces the most content in the sci-fi genre?
- Which authors wrote books that are in both the fantasy and young-adult genres, and also worked with a publisher that has published books in Spanish?

Queries like these might start to sound like run-on sentences, but they are easy to write in Cypher and easy for Neo4j to interpret.
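
As a rough sketch, the first question could be written like this. It assumes the schema built later in this notebook (`Publisher` and `Book` nodes, a `PUBLISHED` relationship, a `publisherName` property, and genre labels with the spaces removed, such as `ScienceFiction`). The helper `top_publisher_query` is a hypothetical name of mine, and the query only returns something useful if that label actually exists in the data.

```python
import re

QUERY_TEMPLATE = """
MATCH (p:Publisher)-[:PUBLISHED]->(b:Book:{label})
RETURN p.publisherName AS publisher, count(b) AS books
ORDER BY books DESC
LIMIT 1
"""


def top_publisher_query(genre_label):
    """Build a Cypher query for the publisher with the most books in a genre."""
    # Labels cannot be passed as Cypher parameters, so the label is
    # interpolated into the text; restrict it to letters and digits first.
    if not re.fullmatch(r'[A-Za-z][A-Za-z0-9]*', genre_label):
        raise ValueError(f'Unexpected label: {genre_label!r}')
    return QUERY_TEMPLATE.format(label=genre_label).strip()


query = top_publisher_query('ScienceFiction')
print(query)
# With a live connection: graph.run(query).data()
```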

In [4]:
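genre_df = pd.read_csv('data/books_with_genres.csv')
genre_column = genre_df['genres']

# Preview the first ten rows: genres are separated by ';',
# and a 'Genre,Subgenre' pair is separated by ','.
for genres in genre_column[:10]:
    for genre in genres.split(';'):
        if ',' in genre:
            # maxsplit=1 keeps the unpacking safe if an entry has extra commas
            genre, subgenre = genre.split(',', 1)
            print(genre)
            print(subgenre)
        else:
            print(genre)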

Fantasy
Young Adult
Fiction
Fantasy
Magic
Childrens
Adventure
Audiobook
Childrens
Middle Grade
Classics
Science Fiction Fantasy
Fantasy
Young Adult
Fiction
Fantasy
Magic
Childrens
Adventure
Audiobook
Childrens
Middle Grade
Classics
Science Fiction Fantasy
Fantasy
Fiction
Young Adult
Fantasy
Magic
Childrens
Childrens
Middle Grade
Audiobook
Adventure
Classics
Science Fiction Fantasy
Fantasy
Fiction
Young Adult
Fantasy
Magic
Childrens
Childrens
Middle Grade
Adventure
Audiobook
Classics
Science Fiction Fantasy
Fantasy
Young Adult
Fiction
Fantasy
Magic
Adventure
Fantasy
Supernatural
Mystery
Childrens
Fantasy
Paranormal
Childrens
Middle Grade
Fiction
Fantasy
Fiction
Young Adult
Fantasy
Magic
Childrens
Classics
Adventure
Science Fiction Fantasy
Novels
Paranormal
Wizards
Science Fiction
Fiction
Humor
Fantasy
Classics
Humor
Comedy
Science Fiction Fantasy
Adventure
Novels
European Literature
British Literature
Science Fiction
Fiction
Humor
Fantasy
Classics
Humor
Comedy
Science Fiction Fantasy
Ad

With the above approach, I should now be able to add these labels to my nodes. To match the genres up with my cleaned dataframe, I will also pull the `isbn13` column from `genre_df` and use it to join the tables together.

In [5]:
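# Keep only the columns needed for the join, renamed to match book_df
genre_mask = genre_df[['genres', 'isbn13']]
genre_mask = genre_mask.rename(columns={'isbn13':'isbn_13'})

# Left join so every cleaned book is kept, with or without genres
book_df = pd.merge(book_df, genre_mask, on='isbn_13', how='left')

book_df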

| | book_id | title | isbn_10 | isbn_13 | language_code | audio_book | num_pages | ratings_count | average_rating | text_reviews_count | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harry Potter and the Half-Blood Prince (Harry ... | 0439785960 | 9780439785969 | eng | False | 652 | 2095690 | 4.57 | 27591 | Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil... |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | 0439358078 | 9780439358071 | eng | False | 870 | 2153167 | 4.49 | 29221 | Fantasy;Young Adult;Fiction;Fantasy,Magic;Chil... |
| 2 | 3 | Harry Potter and the Chamber of Secrets (Harry... | 0439554896 | 9780439554893 | eng | False | 352 | 6333 | 4.42 | 244 | Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil... |
| 3 | 4 | Harry Potter and the Prisoner of Azkaban (Harr... | 043965548X | 9780439655484 | eng | False | 435 | 2339585 | 4.56 | 36325 | Fantasy;Fiction;Young Adult;Fantasy,Magic;Chil... |
| 4 | 5 | Harry Potter Boxed Set Books 1-5 (Harry Potter... | 0439682584 | 9780439682589 | eng | False | 2690 | 41428 | 4.78 | 164 | Fantasy;Young Adult;Fiction;Fantasy,Magic;Adve... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11122 | 11123 | Expelled from Eden: A William T. Vollmann Reader | 1560254416 | 9781560254416 | eng | False | 512 | 156 | 4.06 | 20 | Fiction;Writing,Essays;Literature,American;The... |
| 11123 | 11124 | You Bright and Risen Angels | 0140110879 | 9780140110876 | eng | False | 635 | 783 | 4.08 | 56 | Fiction;Science Fiction;Literature;Novels;Lite... |
| 11124 | 11125 | The Ice-Shirt (Seven Dreams #1) | 0140131965 | 9780140131963 | eng | False | 415 | 820 | 3.96 | 95 | Historical,Historical Fiction;Fiction;Novels;F... |
| 11125 | 11126 | Poor People | 0060878827 | 9780060878825 | eng | False | 434 | 769 | 3.72 | 139 | Nonfiction;Sociology;Social Issues,Poverty;His... |
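
One thing worth checking with a left join like this: if the genre table lists the same ISBN more than once, the merged frame gains extra rows. Below is a small sketch of how that could be guarded against, using tiny made-up frames in place of `book_df` and `genre_mask`. Keeping the first duplicate is an arbitrary choice on my part, not something the data dictates.

```python
import pandas as pd

# Tiny made-up frames standing in for book_df and genre_mask
books = pd.DataFrame({'isbn_13': [111, 222, 333], 'title': ['A', 'B', 'C']})
genres = pd.DataFrame({'isbn_13': [111, 222, 222],
                       'genres': ['Fantasy', 'Fiction', 'Fiction;Humor']})

# Keep one genre row per ISBN so the left join cannot add rows
genres_unique = genres.drop_duplicates(subset='isbn_13', keep='first')

merged = pd.merge(books, genres_unique, on='isbn_13', how='left',
                  validate='many_to_one',  # raises MergeError on duplicate keys
                  indicator=True)          # adds a `_merge` column

# Rows marked 'left_only' are books that found no genre entry
unmatched = int((merged['_merge'] == 'left_only').sum())
print(f'{len(merged)} rows, {unmatched} without genres')
```

The `_merge` column also gives a quick count of how many books ended up without any genre information.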


I should finally be ready to upload to my database.

## Dumpster

Be wary of duplicates! It's very easy to create redundant data in Neo4j, so it's imperative that I make sure the graph is empty before construction.

In [6]:
graph.delete_all()

## Creating Nodes

There are three (or four depending on how you look at it) kinds of nodes I want to create:
### 1. Books
- Print Books: basically the same as they are in the cleaned dataset, except there will no longer be an `audio_book` property.
- Audio Books: largely the same, except that they don't have "pages". So while `num_pages` was helpful for identifying audiobooks, it isn't necessary in the node. This is one of the beautiful things about graph databases: you can have two things that are both "Books" with completely different properties (although in this case the properties are mostly the same).

### **ADDENDUM** 
I will also be including genres as node labels. Because there is actually an `Audiobook` entry in the genres column, I will use it instead of the `audio_book` column I initially created, as I assume it will be a more reliable indicator.

In [7]:

# Create Book Nodes
for index, row in book_df.iterrows():
    try:
        # Rebuild the genre list for every row. A missing (NaN) value yields
        # an empty list, so a book never inherits the previous book's genres.
        raw_genres = row['genres']
        if isinstance(raw_genres, str):
            genres = [g.strip() for g in raw_genres.title().split(';') if g.strip()]
        else:
            genres = []

        # Properties shared by print and audio books
        properties = dict(
            bookID=f'b{row["book_id"]}',
            title=row['title'],
            isbn10=row['isbn_10'],
            isbn13=row['isbn_13'],
            languageCode=row['language_code'],
            ratingsCount=row['ratings_count'],
            averageRating=row['average_rating'],
            textReviewsCount=row['text_reviews_count'],
        )

        if 'Audiobook' in genres:
            book_node = Node('Book', 'Audio', **properties)
        else:
            # Only print books keep a page count
            book_node = Node('Book', 'Print', numPages=row['num_pages'], **properties)

        for genre in genres:
            if genre == 'Audiobook':
                continue  # already captured by the Audio label
            genre = genre.replace(' ', '')
            if ',' in genre:
                # 'Fantasy,Magic' -> label Fantasy, property fantasySubGenre
                genre, subgenre = genre.split(',', 1)
                book_node.update({f'{genre[0].lower() + genre[1:]}SubGenre': subgenre})
            book_node.add_label(genre)

        graph.create(book_node)

    except Exception as e:
        print(f'Error: {e} occurred on row {index}')
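
The genre handling in the cell above can be pulled into a small function and checked without a database. This is only a sketch of the same idea; `parse_genres` is a hypothetical helper name, and it returns plain Python values in place of a py2neo `Node`.

```python
def parse_genres(raw):
    """Split a raw genres string into node labels and sub-genre properties."""
    labels, subgenres = [], {}
    if not isinstance(raw, str):  # NaN from the left merge means no genres
        return labels, subgenres
    for entry in raw.title().split(';'):
        entry = entry.strip().replace(' ', '')
        if not entry or entry == 'Audiobook':
            continue  # audiobooks are handled by the Audio label
        if ',' in entry:
            entry, sub = entry.split(',', 1)
            if not entry:
                continue
            # e.g. 'Fantasy,Magic' -> fantasySubGenre = 'Magic'
            subgenres[f'{entry[0].lower() + entry[1:]}SubGenre'] = sub
        if entry not in labels:
            labels.append(entry)
    return labels, subgenres


labels, props = parse_genres('Fantasy;Young Adult;Fiction;Fantasy,Magic;Audiobook')
print(labels)
print(props)
```

One limitation of this naming scheme: when a book lists two sub-genres under the same parent (for example `Fantasy,Magic` and `Fantasy,Supernatural`), the later one overwrites the earlier one, because the property holds a single value. Storing a list of sub-genres would avoid that.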

### 2. Authors

There is not much to say about these nodes: they will mainly hold an ID and a name. Most of the information in the `authors.csv` table will be used to create the relationships between books and authors, but we're getting ahead of ourselves.

In [8]:
authors = author_df[['author_id', 'author_name']].drop_duplicates()

for index, row in authors.iterrows():
    
    author_node = Node('Author',
                       authorID=f'a{row["author_id"]}',
                       authorName=row.author_name)
    graph.create(author_node)

### 3. Publishers

As with the authors, we need the ID and name. The other columns will be useful for creating relationships.

In [9]:
publishers = publisher_df[['publisher_id', 'publisher_name']].drop_duplicates()

for index, row in publishers.iterrows():
    
    publisher_node = Node('Publisher',
                       publisherID=f'p{row["publisher_id"]}',
                       publisherName=row.publisher_name)
    graph.create(publisher_node)

## Creating Relationships

Now that all the nodes have been added to the database, I can start drawing connections between them. I want a relationship from authors to books called `AUTHORED`, and one from publishers to books called `PUBLISHED`. Eventually I might want to create more, but I'll leave it at these for now.

### AUTHORED Relationship

In [10]:
author_df.columns

Index(['Unnamed: 0', 'author_id', 'author_name', 'book_id'], dtype='object')

In [11]:
for ind in author_df.index:
    try:
        auth_id = f'a{author_df["author_id"][ind]}'
        book_id = f'b{author_df["book_id"][ind]}'

        # Look up both endpoints by the IDs assigned when the nodes were created
        author_node = node_matcher.match('Author', authorID=auth_id).first()
        book_node = node_matcher.match('Book', bookID=book_id).first()

        # .first() returns None when nothing matches; skip those rows
        if author_node is None or book_node is None:
            print(f'Skipped: no node found for {auth_id} or {book_id}')
            continue

        relationship = Relationship(author_node, 'AUTHORED', book_node)
        graph.create(relationship)

    except Exception as e:
        print(f'Error: {e}')

### PUBLISHED Relationship

In [12]:
publisher_df.columns

Index(['Unnamed: 0', 'publisher_id', 'publisher_name', 'publication_date',
       'book_id'],
      dtype='object')

In [13]:
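for ind in publisher_df.index:
    try:
        pub_id = f'p{publisher_df["publisher_id"][ind]}'
        pub_date = publisher_df['publication_date'][ind]
        book_id = f'b{publisher_df["book_id"][ind]}'

        # Look up both endpoints by the IDs assigned when the nodes were created
        publisher_node = node_matcher.match('Publisher', publisherID=pub_id).first()
        book_node = node_matcher.match('Book', bookID=book_id).first()

        # .first() returns None when nothing matches; skip those rows
        if publisher_node is None or book_node is None:
            print(f'Skipped: no node found for {pub_id} or {book_id}')
            continue

        # The publication date lives on the relationship, not on either node
        relationship = Relationship(publisher_node, 'PUBLISHED', book_node,
                                    publicationDate=pub_date)
        graph.create(relationship)
    except Exception as e:
        print(f'Error: {e}')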
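
Creating relationships one row at a time means several round trips to the database per row, which is likely a large part of why the remote upload takes hours. A possible alternative is to send rows in batches and let a single Cypher `UNWIND` query do the matching. The sketch below is untested against my actual graph: `batches` and `upload_in_batches` are hypothetical helpers, the batch size of 1000 is a guess, and a stand-in function replaces `graph.run` so the code runs without a database.

```python
AUTHORED_QUERY = """
UNWIND $rows AS row
MATCH (a:Author {authorID: row.authorID})
MATCH (b:Book {bookID: row.bookID})
MERGE (a)-[:AUTHORED]->(b)
"""


def batches(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def upload_in_batches(run, query, rows, size=1000):
    """Call `run(query, rows=batch)` once per batch; return the batch count."""
    count = 0
    for batch in batches(rows, size):
        run(query, rows=batch)
        count += 1
    return count


# Made-up ID pairs in the same 'a<id>' / 'b<id>' format used above
rows = [{'authorID': f'a{i}', 'bookID': f'b{i}'} for i in range(2500)]

# A stand-in for graph.run that only records each batch size
sent = []
n_batches = upload_in_batches(lambda q, rows: sent.append(len(rows)),
                              AUTHORED_QUERY, rows)
print(n_batches, sent)
```

With a live connection, `graph.run` could be passed in place of the stand-in. Using `MERGE` here also means re-running the cell should not create duplicate relationships.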

## Future Works

This was an interesting exercise! I'm happy that I was able to add the genre node labels without any major issues. That being said, if I had more time I would probably comb through them and make sure there aren't any redundant labels. For example, "SciFiFantasy" and "ScienceFictionFantasy" could be replaced with a single label.

Additionally, it would be interesting to add country data based on the ISBNs, possibly as a property on the `PUBLISHED` relationship. The leading digits of an ISBN-10, or the digits after the `978` prefix of an ISBN-13, correspond to a registration group, which usually identifies a country or language region. For example, `4` denotes Japan and `81` denotes India. However, this would require [another webscrape](https://en.wikipedia.org/wiki/List_of_ISBN_registration_groups) and a lot of fine-tuning that I'm unwilling to implement at this time.
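
To give a feel for what that lookup might involve, here is a minimal sketch. It only knows the two registration groups mentioned above, so a real version would need the full scraped table. The function name `registration_group` is hypothetical, and two of the sample ISBN strings are made up purely for illustration.

```python
# Only the two groups mentioned above; the real table is much longer.
REGISTRATION_GROUPS = {'4': 'Japan', '81': 'India'}


def registration_group(isbn, groups=REGISTRATION_GROUPS):
    """Return the region for an ISBN-10 or a 978-prefixed ISBN-13, else None."""
    digits = str(isbn).replace('-', '')
    if len(digits) == 13:
        if not digits.startswith('978'):
            return None  # the 979 prefix uses a separate group table
        digits = digits[3:]
    elif len(digits) != 10:
        return None
    # Try longer prefixes first so a two-digit group is checked before a shorter key
    for prefix in sorted(groups, key=len, reverse=True):
        if digits.startswith(prefix):
            return groups[prefix]
    return None


# The first two strings are made up; the third is the first book in my table
print(registration_group('9784000000000'))
print(registration_group('8100000000'))
print(registration_group('9780439785969'))
```

The third call returns `None` only because group `0` is missing from this tiny table, which shows how much of the work is in building the lookup itself.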