## UCSD METHOD CONTINUED

**Now we extract the reviews.** We currently have a draft of our final dataset, which contains the one-hot encoded genre columns for each book we need. It is stored in final_books_dataset.csv.

In [3]:
import pandas as pd
import json 
import gzip
from tqdm import tqdm
import re  # the standard-library module is enough for the patterns used below
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor, as_completed
from sentiment_worker import process_book_sentiment # I put this function in a separate .py file 

In [4]:
# file path to the reviews JSON
file_path_to_reviews = "goodreads_reviews_dedup.json.gz"

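# try opening the file and reading the first record
try: 
    with gzip.open(file_path_to_reviews, 'rt', encoding='utf-8') as f:
        first_line = f.readline()
        print(first_line)
except (EOFError, gzip.BadGzipFile):
    # EOFError: the archive is truncated; BadGzipFile: the data is not valid gzip
    print("this file is corrupted or incomplete")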

{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "24375664", "review_id": "5cd416f3efc3f944fce4ce2db2290d5e", "rating": 5, "review_text": "Mind blowingly cool. Best science fiction I've read in some time. I just loved all the descriptions of the society of the future - how they lived in trees, the notion of owning property or even getting married was gone. How every surface was a screen. \n The undulations of how society responds to the Trisolaran threat seem surprising to me. Maybe its more the Chinese perspective, but I wouldn't have thought the ETO would exist in book 1, and I wouldn't have thought people would get so over-confident in our primitive fleet's chances given you have to think that with superior science they would have weapons - and defenses - that would just be as rifles to arrows once were. \n But the moment when Luo Ji won as a wallfacer was just too cool. I may have actually done a fist pump. Though by the way, if the Dark Forest theory is right - and I see

In [5]:
# this file is huge. we need to process it in chunks to prevent crashes 
# my computer crashed the first time I tried to read the json file

chunk_size = 10000

with gzip.open(file_path_to_reviews, 'rt') as f:
    reader = pd.read_json(f, lines=True, chunksize = chunk_size)

    for i, chunk in enumerate(reader):
        print(f"Processing chunk {i}...")
        print(chunk.head()) # show the first few rows of each chunk
        break # stop after the first chunk to test

Processing chunk 0...
                            user_id   book_id  \
0  8842281e1d1347389f2ab93d60773d4d  24375664   
1  8842281e1d1347389f2ab93d60773d4d  18245960   
2  8842281e1d1347389f2ab93d60773d4d   6392944   
3  8842281e1d1347389f2ab93d60773d4d  22078596   
4  8842281e1d1347389f2ab93d60773d4d   6644782   

                          review_id  rating  \
0  5cd416f3efc3f944fce4ce2db2290d5e       5   
1  dfdbb7b0eb5a7e4c26d59a937e2e5feb       5   
2  5e212a62bced17b4dbe41150e5bb9037       3   
3  fdd13cad0695656be99828cd75d6eb73       4   
4  bd0df91c9d918c0e433b9ab3a9a5c451       4   

                                         review_text  \
0  Mind blowingly cool. Best science fiction I've...   
1  This is a special book. It started slow for ab...   
2  I haven't read a fun mystery book in a while a...   
3  Fun, fast paced, and disturbing tale of murder...   
4  A fun book that gives you a sense of living in...   

                       date_added                    date_upd
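The chunked reader above only previews the first chunk. A minimal sketch of how each chunk could be filtered down to the books of interest is shown below. The three inline records and the `wanted_ids` set are made up for illustration; the real file is read through `gzip.open` as in the cell above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the reviews file: one JSON object per line
json_lines = "\n".join([
    '{"book_id": "1", "rating": 5, "review_text": "gripping"}',
    '{"book_id": "2", "rating": 3, "review_text": "slow"}',
    '{"book_id": "3", "rating": 4, "review_text": "cozy"}',
])
wanted_ids = {"1", "3"}  # assumption: the IDs we care about, stored as strings

kept = []
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=2)
for chunk in reader:
    # read_json may infer book_id as an integer, so cast before comparing with string IDs
    chunk["book_id"] = chunk["book_id"].astype(str)
    kept.append(chunk[chunk["book_id"].isin(wanted_ids)])

filtered = pd.concat(kept, ignore_index=True)
print(filtered)
```

Only the filtered rows of each chunk stay in memory, so peak memory is bounded by the chunk size plus the rows kept so far.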

## Define Sentiment Words

In [6]:
# moved this to sentiment_worker.py
sentiment_words = {
    # pacing of the book
    "fast-paced" : ["intense", "page turner", "fast", "fast paced", "quick", "thrilling", "gripping"],
    "slow-paced" : ["slow", "slow paced",  "slow pacing", "gradual", "steady", "builds slowly", "patient"],
    "suspenseful" : ["suspense", "suspenseful", "nail biting", "edge of your seat", "tension", "unpredictable"],
    "relaxing" : ["relaxing", "relaxed", "comforting", "cozy", "lighthearted", "gentle", "easygoing"],

    # themes and mood

    "romance" : ["romantic", "romance", "love", "emotional", "sweet", "dreamy", "steamy", "passionate", "chemistry"],
    "mysterious" : ["mysterious", "mystery", "intriguing", "unraveling", "puzzling", "confusing", "detective"],
    "philosophical" : ["philosophical", "deep", "existentialism", "helpful"],
    "magical" : ["magical", "magic", "enchanting", "charming", "whimsical", "fairytale", "wondrous", "fantastical"],
    "realistic" : ["believable", "gritty", "grounded", "realistic", "authentic", "genuine", "slice of life"],
    "nostalgic" : ["nostalgic", "reminiscent", "bittersweet", "memories", "childhood", "wistful", "sentimental"],
    "dark" : ["dark", "gloomy", "disturbing", "ominous", "gritty", "chilling", "sinister"],
    "angry" : ["angry", "rage", "fiery", "furious", "frustrating", "heated", "aggressive"],
    "sad" : ["sad", "depression", "emotional", "tear-jerker", "cried"],
    "funny" : ["funny", "witty", "hilarious", "laughing", "sarcastic", "light-hearted", "humorous", "entertaining"],

    # emotional impact
    "heartwarming" : ["heartwarming", "sweet", "uplifting", "touching", "moving", "feel-good", "comforting", "joyful"],
    "heartbreaking" : ["painful", "heartbreaking", "tearjerking", "sad", "aching", "bittersweet", "poignant"],
    "depressing" : ["depressing", "sad", "dark", "depression", "somber", "tragic", "dystopian", "crushing", "heavy"],
    "hopeful" : ["hope", "hopeful", "optimistic", "encourage", "encouraging", "faith", "bright", "positive"],
    "inspiring" : ["inspiring", "powerful", "thought-provoking", "transformative", "stirring", "soulful", "meaningful"],
    "moving" : ["inspiring", "moving", "powerful", "resonant", "profound", "touching", "stirring"],

    # story depth and characters
    "character-driven" : ["character development", "emotional depth", "well written", "relatable", "personal", "introspective"],
    "plot-driven" : ["action-packed", "plot driven", "adventure", "packed with surprises", "suspenseful"],

    # writing style and readability
    "descriptive" : ["descriptive", "vivid", "detailed", "atmospheric", "scenic", "evocative"],
    "clearly-written" : ["clear", "clearly written", "straightforward", "concise", "easy to read", "smooth"],
    "dense" : ["complex", "wordy", "intricate", "highly detailed", "heavy"],
    "poetic" : ["lyrical", "poetry", "elegant", "artistic", "expressive", "soulful"],
    
}

In [7]:
# load in the final_books_dataset to get the relevant book IDs
df_books = pd.read_csv("final_books_dataset.csv", dtype=str, low_memory = False)  # Read as string first

# Remove NaN values and convert properly
df_books = df_books[df_books["book_id"].notna()]  # Drop rows where book_id is NaN
df_books["book_id"] = df_books["book_id"].astype(float).astype(int).astype(str)  # Remove decimal

book_ids = set(df_books["book_id"])  # Convert to set

print(f"Loaded {len(book_ids)} book IDs from final_books_dataset.csv")

Loaded 50000 book IDs from final_books_dataset.csv
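The float, int, string conversion chain above matters because IDs that passed through a float column are written out as `"24375664.0"`, which would never match the plain string IDs in the reviews file. A small sketch of the effect, using a made-up four-row frame rather than the real CSV:

```python
import pandas as pd

# Hypothetical sample: IDs as they might look after a CSV round trip
raw = pd.DataFrame({"book_id": ["24375664.0", "69510", None, "374388.0"]})

clean = raw[raw["book_id"].notna()].copy()  # drop rows without an ID
# "24375664.0" -> 24375664.0 -> 24375664 -> "24375664"
clean["book_id"] = clean["book_id"].astype(float).astype(int).astype(str)

normalized_ids = set(clean["book_id"])
print(sorted(normalized_ids))
```

Storing the IDs in a set keeps the later `book_id in book_ids` check at constant time per review, which matters over roughly 15.7 million lines.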


In [8]:
# create a function to extract sentiment count
# moved this to sentiment_worker.py
def extract_sentiment_count(text):
    # if the text is not a string, return a dictionary with every sentiment category set to 0
    if not isinstance(text, str):
        return {sentiment: 0 for sentiment in sentiment_words}

    # split the lowercased text into single-word tokens, ignoring punctuation
    words = re.findall(r'\b\w+\b', text.lower())

    # for each sentiment, count how many tokens appear in its keyword list
    # note: tokens are single words, so multi-word or hyphenated keywords
    # (e.g. "page turner", "tear-jerker") never match here
    sentiment_counts = {sentiment: sum(1 for word in words if word in keywords) 
                        for sentiment, keywords in sentiment_words.items()}

    return sentiment_counts

In [9]:
test_text = "This book was fast, exciting and heartwarming!"
print(extract_sentiment_count(test_text))

{'fast-paced': 1, 'slow-paced': 0, 'suspenseful': 0, 'relaxing': 0, 'romance': 0, 'mysterious': 0, 'philosophical': 0, 'magical': 0, 'realistic': 0, 'nostalgic': 0, 'dark': 0, 'angry': 0, 'sad': 0, 'funny': 0, 'heartwarming': 1, 'heartbreaking': 0, 'depressing': 0, 'hopeful': 0, 'inspiring': 0, 'moving': 0, 'character-driven': 0, 'plot-driven': 0, 'descriptive': 0, 'clearly-written': 0, 'dense': 0, 'poetic': 0}
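Because the function compares single-word tokens against the keyword lists, entries such as "page turner" or "tear-jerker" cannot match. One possible alternative, sketched below, normalizes both the review and the keywords and then searches for whole phrases. The names `normalize`, `count_phrases`, and `sample_sentiment_words` are hypothetical and are not part of sentiment_worker.py.

```python
import re

# Small hypothetical subset of the keyword dictionary, for illustration only
sample_sentiment_words = {
    "fast-paced": ["fast", "fast paced", "page turner"],
    "sad": ["sad", "tear-jerker", "cried"],
}

def normalize(text):
    # lowercase, treat hyphens and punctuation as separators, collapse whitespace
    return " ".join(re.findall(r"\w+", text.lower()))

def count_phrases(text, sentiment_words):
    if not isinstance(text, str):
        return {sentiment: 0 for sentiment in sentiment_words}
    clean = normalize(text)
    return {
        sentiment: sum(
            # lookarounds keep "fast" from matching inside "breakfast"
            len(re.findall(rf"(?<!\w){re.escape(normalize(keyword))}(?!\w)", clean))
            for keyword in keywords
        )
        for sentiment, keywords in sentiment_words.items()
    }

phrase_counts = count_phrases(
    "A fast paced page turner. I cried, what a tear-jerker!",
    sample_sentiment_words,
)
print(phrase_counts)
```

Note that overlapping keywords are counted separately in this sketch: "fast paced" contributes once for "fast" and once for "fast paced". Whether that double counting is desirable depends on how the scores are used downstream.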


In [10]:
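# stream the reviews line by line and keep up to 50 review texts per book
# (the sentiment scores themselves are computed in a later step)

book_sentiment = {}
# count the lines first so tqdm can show a total; the with-block closes the file afterwards
with gzip.open("goodreads_reviews_dedup.json.gz", "rt", encoding="utf-8") as f:
    total_lines = sum(1 for _ in f)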

# open the compressed file
with gzip.open("goodreads_reviews_dedup.json.gz", "rt", encoding="utf-8") as f, tqdm(total=total_lines, desc="Processing Reviews") as pbar:
    for line in f:
        review = json.loads(line)  # Load each review
        book_id = str(review.get("book_id", ""))  # make sure that book_id is a string
        text = review.get("review_text", "")

        if book_id in book_ids:  # process books in final_books_dataset.csv
            if book_id not in book_sentiment:
                book_sentiment[book_id] = []
            if len(book_sentiment[book_id]) < 50:
                book_sentiment[book_id].append(text)

        pbar.update(1) # update the progress bar

Processing Reviews: 100%|██████████| 15739967/15739967 [03:34<00:00, 73298.71it/s]
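The collection loop above can be exercised on a tiny in-memory archive before committing to the multi-minute pass over the full file. The sketch below builds a hypothetical four-record gzip stream and applies the same filter-and-cap logic; the cap is lowered from 50 to 2 so its effect is visible, and `defaultdict` replaces the explicit key check.

```python
import gzip
import io
import json
from collections import defaultdict

MAX_REVIEWS_PER_BOOK = 2  # the notebook uses 50; lowered here for the demo

# Hypothetical miniature of the reviews file: one JSON object per line, gzip-compressed
records = [
    {"book_id": "1", "review_text": "slow but cozy"},
    {"book_id": "2", "review_text": "not in our dataset"},
    {"book_id": "1", "review_text": "gripping"},
    {"book_id": "1", "review_text": "dropped by the cap"},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
buffer.seek(0)

wanted_ids = {"1"}  # assumption: stands in for book_ids
collected = defaultdict(list)
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)
        book_id = str(review.get("book_id", ""))
        # the membership test runs first, so no entries are created for unwanted books
        if book_id in wanted_ids and len(collected[book_id]) < MAX_REVIEWS_PER_BOOK:
            collected[book_id].append(review.get("review_text", ""))

print(dict(collected))
```

Keeping the first reviews encountered is simple but not random: if the file is ordered by user or date, the retained reviews inherit that ordering.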


In [11]:
print("Total books in dataset:", len(book_ids))  # Should be greater than 0
print("Sample book IDs from final_books_dataset.csv:", list(book_ids)[:5])

Total books in dataset: 50000
Sample book IDs from final_books_dataset.csv: ['5884082', '25660025', '20645624', '69510', '374388']


In [12]:
print(len(book_sentiment))

49951


In [13]:
# compute the sentiment scores (this takes a long time)
# we will use multiprocessing for parallel processing
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

# Parallel processing with tqdm progress bar
def compute_sentiment_parallel(book_sentiment):
    book_sentiment_list = list(book_sentiment.items())  
    num_workers = max(2, min(8, len(book_sentiment_list)))  # Use between 2 and 8 workers

    results = {}
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        with tqdm(total=len(book_sentiment_list), desc="Processing Books") as pbar:
            futures = {executor.submit(process_book_sentiment, book): book for book in book_sentiment_list}
            for future in as_completed(futures):
                book_id, sentiment = future.result()
                results[book_id] = sentiment
                pbar.update(1)  # Update progress bar
    return results

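# Process a sample of 5000 books
# (assumed here to be the first 5000 books collected above)
from itertools import islice
book_sentiment_sample = dict(islice(book_sentiment.items(), 5000))
book_sentiment_scores = compute_sentiment_parallel(book_sentiment_sample)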
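The worker function lives in sentiment_worker.py and is not shown in this notebook. From the way it is called, it receives one `(book_id, reviews)` pair and returns `(book_id, sentiment)`. The sketch below is a guess at a function with that shape, summing keyword counts over a book's reviews; `SENTIMENT_WORDS`, `count_keywords`, and `process_book_sentiment_sketch` are hypothetical names, and the real implementation may differ.

```python
import re
from collections import Counter

# Hypothetical stand-in for the dictionary defined in sentiment_worker.py
SENTIMENT_WORDS = {
    "fast-paced": ["fast", "gripping"],
    "dark": ["dark", "gloomy"],
}

def count_keywords(text):
    # non-string reviews (e.g. None) contribute nothing
    if not isinstance(text, str):
        return Counter()
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter({
        sentiment: sum(1 for word in words if word in keywords)
        for sentiment, keywords in SENTIMENT_WORDS.items()
    })

def process_book_sentiment_sketch(book):
    # book is a (book_id, list_of_review_texts) pair, as produced by dict.items()
    book_id, reviews = book
    totals = Counter({sentiment: 0 for sentiment in SENTIMENT_WORDS})
    for text in reviews:
        totals.update(count_keywords(text))  # Counter.update adds counts
    return book_id, dict(totals)

example = process_book_sentiment_sketch(
    ("42", ["Fast and gripping.", "Dark, dark, gloomy.", None])
)
print(example)
```

Defining the worker in an importable module rather than in a notebook cell is what lets `ProcessPoolExecutor` pickle it for the child processes on platforms that spawn workers.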

In [None]:
# Convert sentiment scores to DataFrame
df_sentiment = pd.DataFrame.from_dict(book_sentiment_scores, orient="index").reset_index()
df_sentiment.rename(columns={"index": "book_id"}, inplace=True)

# merge this with the book dataset
df_final = df_books.merge(df_sentiment, on="book_id", how="left")
columns_to_drop = ["isbn", "text_reviews_count", "series", "country_code", "language_code", 
                   "popular_shelves", "asin", "is_ebook", "average_rating", "kindle_asin", 
                   "similar_books", "description", "format", "link", "authors", "publisher", 
                   "num_pages", "publication_day", "isbn13", "publication_month", "edition_information", 
                   "publication_year", "image_url", "ratings_count", "work_id", "title_without_series"]
df_final = df_final.drop(columns=columns_to_drop, errors="ignore") # we want to ignore the errors if the columns don't exist

# save the final dataset
df_final.to_csv("final_book_dataset_with_reviews.csv", index=False)

print("Sentiment analysis completed and dataset saved!")
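Since reviews were collected for 49951 of the 50000 books, and only a sample is scored, the left merge leaves NaN in the sentiment columns of every book without scores. The sketch below, on a made-up three-book frame, shows one way to list those books and fill the gaps with 0. Whether 0 or NaN is the better value for "no reviews" is a modeling decision this sketch does not settle.

```python
import pandas as pd

# Hypothetical miniature of df_books and book_sentiment_scores
books = pd.DataFrame({"book_id": ["1", "2", "3"], "title": ["A", "B", "C"]})
scores = {"1": {"dark": 2, "funny": 0}, "3": {"dark": 0, "funny": 5}}

sentiment = pd.DataFrame.from_dict(scores, orient="index").reset_index()
sentiment = sentiment.rename(columns={"index": "book_id"})

# indicator=True adds a _merge column telling which side each row came from
merged = books.merge(sentiment, on="book_id", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "book_id"].tolist()
print("Books without sentiment scores:", missing)

sentiment_cols = ["dark", "funny"]
merged[sentiment_cols] = merged[sentiment_cols].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
print(merged)
```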

In [None]:
df_final

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming book_sentiment_scores is a dictionary {book_id: {sentiment: count, ...}}
df = pd.DataFrame.from_dict(book_sentiment_scores, orient='index')

# Sum sentiment occurrences across all books
sentiment_totals = df.sum().sort_values(ascending=False)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=sentiment_totals.index, y=sentiment_totals.values, palette="viridis")
plt.xticks(rotation=45, ha="right")
plt.xlabel("Sentiment")
plt.ylabel("Total Count")
plt.title("Sentiment Distribution Across Books")
plt.show()
