We now want to analyze the emotion context of the books.
For example, if you want an exciting read -- we want to use the descriptions to determine suspensefulness

We will have 7 categories:
Anger, disgust, fear, joy, sadness, surprise, neutral

We will be fine-tuning our model for sentiment analysis on LLMS

# What is Fine-Tuning?

1. We take a pre-trained LLM model. 
2. Instead of taking a pre-trained LLM, we train it further on a dataset. This helps refine the weights and parameters, allowing the model to "specialize" in a task

We are going to use a model from hugging face that has been finetuned

In [5]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = "j-hartmann/emotion-english-distilroberta-base"
bart_model = AutoModelForSequenceClassification.from_pretrained(model_name)
bart_tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline(
    "text-classification",
    model = model_name,
    top_k = None,
    device = "mps"
)
classifier("I love this")



[[{'label': 'joy', 'score': 0.9845667481422424},
  {'label': 'surprise', 'score': 0.004927205853164196},
  {'label': 'sadness', 'score': 0.004531427752226591},
  {'label': 'neutral', 'score': 0.003475314239040017},
  {'label': 'anger', 'score': 0.0013875315198674798},
  {'label': 'disgust', 'score': 0.0007134051411412656},
  {'label': 'fear', 'score': 0.0003984911891166121}]]

In [6]:
import pandas as pd
books = pd.read_csv("/Users/naomigong/Coding/Book_Recommender/final_data_set.csv")

In [10]:
import textwrap

# Get the first book description
desc = books["description"][0]

# Wrap the text at 80 characters per line
wrapped_text = textwrap.fill(desc, width=80)

# Print it so it displays over multiple lines
print(wrapped_text)

A NOVEL THAT READERS and critics have been eagerly anticipating for over a
decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames
is a preacher, the son of a preacher and the grandson (both maternal and
paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the
Reverend Ames’s life, and he is absorbed in recording his family’s story, a
legacy for the young son he will never see grow up. Haunted by his grandfather’s
presence, John tells of the rift between his grandfather and his father: the
elder, an angry visionary who fought for the abolitionist cause, and his son, an
ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames)
Boughton, his best friend’s lost son who returns to Gilead searching for
forgiveness and redemption. Told in John Ames’s joyous, rambling voice that
finds beauty, humour and truth in the smallest of life’s details, Gilead is a
song of celebration and acceptance of the best and the worst the world has

In [11]:
classifier(desc)

[[{'label': 'fear', 'score': 0.6548418402671814},
  {'label': 'neutral', 'score': 0.1698518544435501},
  {'label': 'sadness', 'score': 0.11640852689743042},
  {'label': 'surprise', 'score': 0.020700614899396896},
  {'label': 'disgust', 'score': 0.01910070702433586},
  {'label': 'joy', 'score': 0.01516131404787302},
  {'label': 'anger', 'score': 0.003935142420232296}]]

Here, we get that it is 65% chance it is fear, but from we are reading it seems to be more of joy. We don't really get fear from it. 

We want to capture the different emotions and themes

In [12]:
classifier(books["description"][0].split("."))

[[{'label': 'surprise', 'score': 0.7296029925346375},
  {'label': 'neutral', 'score': 0.1403854638338089},
  {'label': 'fear', 'score': 0.06816218048334122},
  {'label': 'joy', 'score': 0.04794241860508919},
  {'label': 'anger', 'score': 0.009156342595815659},
  {'label': 'disgust', 'score': 0.002628472400829196},
  {'label': 'sadness', 'score': 0.002122160280123353}],
 [{'label': 'neutral', 'score': 0.4493720233440399},
  {'label': 'disgust', 'score': 0.27359044551849365},
  {'label': 'joy', 'score': 0.10908288508653641},
  {'label': 'sadness', 'score': 0.09362718462944031},
  {'label': 'anger', 'score': 0.0404781699180603},
  {'label': 'surprise', 'score': 0.026970194652676582},
  {'label': 'fear', 'score': 0.006879037246108055}],
 [{'label': 'neutral', 'score': 0.6462165713310242},
  {'label': 'sadness', 'score': 0.2427329570055008},
  {'label': 'disgust', 'score': 0.04342255741357803},
  {'label': 'surprise', 'score': 0.028300538659095764},
  {'label': 'joy', 'score': 0.01421143393

Proposal: For each book, have a different column for each emotion classes. Instead of giving one emotion, take the highest probability for each emotion across the entire description.

In [None]:
#extract the max prediction for each emotion for each description
emotion_labels = ["anger", "disgust", "joy", "sadness", "surprise", "neutral"]
isbn = []
import numpy as np
#this will help create a dataframe, where the label (emotion) corresponds with the max prediction of the label among all sentences in a descrption
emotions_score = {label : [] for label in emotion_labels}
def calculate_max_emotion_score(predictions):
    per_emotion_score = {label : [] for label in emotion_labels}
    for prediction in predictions:
        sorted_predictions = sorted(prediction, key=lambda x: x["label"])
        for index, label in enumerate(emotion_labels):
            #sorted_predictions[index] indexes into the array, which holds hashmaps
            #Then get the value assorted with key score
            per_emotion_score[label].append(sorted_predictions[index]["score"])
    return {label: np.max(score) for label, score in per_emotion_score.items()}

In [22]:
pred = classifier(books["description"][0].split("."))
calculate_max_emotion_score(pred)

{'anger': 0.06413352489471436,
 'disgust': 0.27359044551849365,
 'joy': 0.9281687140464783,
 'sadness': 0.932797908782959,
 'surprise': 0.6462165713310242,
 'neutral': 0.9671575427055359}

In [None]:
for i in range(10):
    isbn.append(books["isbn13"][i])
    #split up the sentence for the current description
    sentences = books["description"][i].split(".")
    predictions = classifier(sentences)
    max_scores = calculate_max_emotion_score(predictions)
    for label in emotion_labels:
        emotions_score[label].append(max_scores[label]) 


In [25]:
emotions_score

{'anger': [0.06413352489471436,
  0.6126187443733215,
  0.06413352489471436,
  0.3514840602874756,
  0.08141219615936279,
  0.23222504556179047,
  0.5381841659545898,
  0.06413352489471436,
  0.30067068338394165,
  0.06413352489471436],
 'disgust': [0.27359044551849365,
  0.3482857048511505,
  0.1040065661072731,
  0.15072256326675415,
  0.18449510633945465,
  0.7271746397018433,
  0.1558549553155899,
  0.1040065661072731,
  0.2794807553291321,
  0.17792753875255585],
 'joy': [0.9281687140464783,
  0.9425276517868042,
  0.9723208546638489,
  0.36070626974105835,
  0.09504330158233643,
  0.051362715661525726,
  0.7474279999732971,
  0.4044966399669647,
  0.9155241250991821,
  0.051362715661525726],
 'sadness': [0.932797908782959,
  0.7044218182563782,
  0.7672365307807922,
  0.251880943775177,
  0.04056432843208313,
  0.04337584972381592,
  0.8725653886795044,
  0.04056432843208313,
  0.04056432843208313,
  0.04056432843208313],
 'surprise': [0.6462165713310242,
  0.8879396319389343,
  

In [26]:
from tqdm import tqdm

emotion_labels = ["anger", "disgust", "joy", "sadness", "surprise", "neutral"]
isbn = []
import numpy as np
#this will help create a dataframe, where the label (emotion) corresponds with the max prediction of the label among all sentences in a descrption
emotions_score = {label : [] for label in emotion_labels}

for i in tqdm(range(len(books))):
    isbn.append(books["isbn13"][i])
    #split up the sentence for the current description
    sentences = books["description"][i].split(".")
    predictions = classifier(sentences)
    max_scores = calculate_max_emotion_score(predictions)
    for label in emotion_labels:
        emotions_score[label].append(max_scores[label]) 

100%|██████████| 5693/5693 [05:18<00:00, 17.85it/s]


Now, create new dataframe with emotions as columns

In [66]:
books = pd.read_csv("final_data_set.csv")

In [68]:
emotions_df = pd.DataFrame(emotions_score) #emotions is a dictionary with label as the emotion name
emotions_df["isbn13"] = isbn
books = pd.merge(books, emotions_df, on="isbn13")
books.drop(columns="Unnamed: 0")

Unnamed: 0,isbn13,isbn10,title,authors,categories,thumbnail,description,published_year,average_rating,num_pages,...,joy_x,sadness_x,surprise_x,neutral_x,anger_y,disgust_y,joy_y,sadness_y,surprise_y,neutral_y
0,9780002005883,0002005883,Gilead,Marilynne Robinson,Fiction,http://books.google.com/books/content?id=KQZCP...,A NOVEL THAT READERS and critics have been eag...,2004.0,3.85,247.0,...,0.928169,0.932798,0.646217,0.967158,0.064134,0.273590,0.928169,0.932798,0.646217,0.967158
1,9780002261982,0002261987,Spider's Web,Charles Osborne;Agatha Christie,Detective and mystery stories,http://books.google.com/books/content?id=gA5GP...,A new 'Christie for Christmas' -- a full-lengt...,2000.0,3.83,241.0,...,0.942528,0.704422,0.887940,0.111690,0.612619,0.348286,0.942528,0.704422,0.887940,0.111690
2,9780006178736,0006178731,Rage of angels,Sidney Sheldon,Fiction,http://books.google.com/books/content?id=FKo2T...,"A memorable, mesmerizing heroine Jennifer -- b...",1993.0,3.93,512.0,...,0.972321,0.767237,0.549477,0.111690,0.064134,0.104007,0.972321,0.767237,0.549477,0.111690
3,9780006280897,0006280897,The Four Loves,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=XhQ5X...,Lewis' work on the nature of love divides love...,2002.0,4.15,170.0,...,0.360706,0.251881,0.732686,0.111690,0.351484,0.150723,0.360706,0.251881,0.732686,0.111690
4,9780006280934,0006280935,The Problem of Pain,Clive Staples Lewis,Christian life,http://books.google.com/books/content?id=Kk-uV...,"""In The Problem of Pain, C.S. Lewis, one of th...",2002.0,4.09,176.0,...,0.095043,0.040564,0.884389,0.475881,0.081412,0.184495,0.095043,0.040564,0.884389,0.475881
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5688,9788173031014,8173031010,Journey to the East,Hermann Hesse,Adventure stories,http://books.google.com/books/content?id=rq6JP...,This book tells the tale of a man who goes on ...,2002.0,3.70,175.0,...,0.051363,0.400263,0.883198,0.111690,0.064134,0.114383,0.051363,0.400263,0.883198,0.111690
5689,9788179921623,817992162X,The Monk Who Sold His Ferrari: A Fable About F...,Robin Sharma,Health & Fitness,http://books.google.com/books/content?id=c_7mf...,"Wisdom to Create a Life of Passion, Purpose, a...",2003.0,3.82,198.0,...,0.339217,0.947779,0.375756,0.066685,0.009997,0.009929,0.339217,0.947779,0.375756,0.066685
5690,9788185300535,8185300534,I Am that,Sri Nisargadatta Maharaj;Sudhakar S. Dikshit,Philosophy,http://books.google.com/books/content?id=Fv_JP...,This collection of the timeless teachings of o...,1999.0,4.51,531.0,...,0.459269,0.759456,0.951104,0.368110,0.064134,0.104007,0.459269,0.759456,0.951104,0.368110
5691,9789027712059,9027712050,The Berlin Phenomenology,Georg Wilhelm Friedrich Hegel,History,http://books.google.com/books/content?id=Vy7Sk...,Since the three volume edition ofHegel's Philo...,1981.0,0.00,210.0,...,0.051363,0.958549,0.915193,0.111690,0.064134,0.104007,0.051363,0.958549,0.915193,0.111690


In [70]:
books.to_csv("books_with_emotion.csv", index = False)

In [71]:
books.columns

Index(['Unnamed: 0', 'isbn13', 'isbn10', 'title', 'authors', 'categories',
       'thumbnail', 'description', 'published_year', 'average_rating',
       'num_pages', 'ratings_count', 'title_and_subtitle',
       'tagged_description', 'simplifed_categories', 'missing_cat_preds_x',
       'missing_cat_preds_y', 'anger_x', 'disgust_x', 'joy_x', 'sadness_x',
       'surprise_x', 'neutral_x', 'anger_y', 'disgust_y', 'joy_y', 'sadness_y',
       'surprise_y', 'neutral_y'],
      dtype='object')