### Introduction

In this notebook, I classify the emotional tone of each book from its description, scoring it against seven labels: sadness, joy, anger, surprise, fear, disgust, and neutral.

Instead of fine-tuning a model myself, I use an open-source model from Hugging Face that has already been fine-tuned on emotion data.

An evaluation of the model is available on [dataloop](https://dataloop.ai/library/model/dennisjooo_emotion_classification/#:~:text=The%20model%20was%20trained%20on,better%20on%20this%20specific%20task.).

In [43]:
import numpy as np
import pandas as pd

from transformers import pipeline

from tqdm import tqdm

### Load the Dataset

In [3]:
books = pd.read_csv("books_with_categories.csv")

### Fine-Tuned Emotion Classifier Model

In [4]:
# top_k=None returns the scores for all seven labels, not just the top one
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base",
                      top_k=None,
                      device="cpu")

Device set to use cpu


In [5]:
# example inference
classifier("I love this!")

[[{'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'surprise', 'score': 0.008528684265911579},
  {'label': 'neutral', 'score': 0.005764586851000786},
  {'label': 'anger', 'score': 0.004419783595949411},
  {'label': 'sadness', 'score': 0.002092392183840275},
  {'label': 'disgust', 'score': 0.0016119900392368436},
  {'label': 'fear', 'score': 0.0004138521908316761}]]

#### Inference on Book Descriptions

In [20]:
print(books.description[0])
classifier(books.description[0])

A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world has

[[{'label': 'fear', 'score': 0.6548399925231934},
  {'label': 'neutral', 'score': 0.1698525995016098},
  {'label': 'sadness', 'score': 0.11640939861536026},
  {'label': 'surprise', 'score': 0.02070068009197712},
  {'label': 'disgust', 'score': 0.019100721925497055},
  {'label': 'joy', 'score': 0.015161462128162384},
  {'label': 'anger', 'score': 0.003935154061764479}]]
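
The dominant label can also be read off programmatically instead of by eye. A minimal sketch, assuming the scores shown above (copied here and rounded to four decimals) and the list-of-dicts shape the pipeline returns with `top_k=None`:

```python
# Scores copied from the output above, rounded to four decimals.
scores = [
    {"label": "fear", "score": 0.6548},
    {"label": "neutral", "score": 0.1699},
    {"label": "sadness", "score": 0.1164},
    {"label": "surprise", "score": 0.0207},
    {"label": "disgust", "score": 0.0191},
    {"label": "joy", "score": 0.0152},
    {"label": "anger", "score": 0.0039},
]

# Pick the entry with the highest score; this does not rely on the list being sorted.
top = max(scores, key=lambda s: s["score"])
print(top["label"], top["score"])  # prints: fear 0.6548
```

Using `max` rather than taking the first element keeps the lookup correct even if the list has been re-sorted, for example by label.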

The model labels the full description as fear (0.65), yet a human reader also finds joy in it ("joyous, rambling voice", "a song of celebration"). A single prediction over the whole description hides such mixed emotions. Since the model can also classify individual sentences, I split the description into sentences and predict the emotion of each one separately:

In [16]:
# predict the sentiment of each sentence in the first book
sentences = books.description[0].split(".")
predictions = classifier(sentences)

# first sentence and its prediction
print(sentences[0], "\n\n")
predictions[0]

A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives 




[{'label': 'surprise', 'score': 0.729602038860321},
 {'label': 'neutral', 'score': 0.14038607478141785},
 {'label': 'fear', 'score': 0.06816229224205017},
 {'label': 'joy', 'score': 0.04794258251786232},
 {'label': 'anger', 'score': 0.009156374260783195},
 {'label': 'disgust', 'score': 0.002628477755934},
 {'label': 'sadness', 'score': 0.0021221644710749388}]
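
One caveat with `split(".")` is that it strips the period and leaves an empty string after the final sentence, which then gets classified like any other input. A minimal sketch of a stricter splitter follows; `split_sentences` is a hypothetical helper, not part of the notebook, and it still splits wrongly on abbreviations such as "Mr.":

```python
import re

def split_sentences(description):
    """Split on '.', '!' or '?' followed by whitespace and drop empty fragments."""
    # Missing descriptions (None or NaN) are not strings; return no sentences for them.
    if not isinstance(description, str):
        return []
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    return [p.strip() for p in parts if p.strip()]

example = "It's 1956 in Gilead, Iowa. John Ames is a preacher.  "
print(split_sentences(example))
print("Hello.".split("."))  # the plain split leaves a trailing empty string
```

Whether the cleaner input changes the scores noticeably is an open question; the outputs shown in this notebook were all produced with the plain `split(".")`.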

In [21]:
# print all predictions
predictions

[[{'label': 'surprise', 'score': 0.729602038860321},
  {'label': 'neutral', 'score': 0.14038607478141785},
  {'label': 'fear', 'score': 0.06816229224205017},
  {'label': 'joy', 'score': 0.04794258251786232},
  {'label': 'anger', 'score': 0.009156374260783195},
  {'label': 'disgust', 'score': 0.002628477755934},
  {'label': 'sadness', 'score': 0.0021221644710749388}],
 [{'label': 'neutral', 'score': 0.44937077164649963},
  {'label': 'disgust', 'score': 0.27359139919281006},
  {'label': 'joy', 'score': 0.10908306390047073},
  {'label': 'sadness', 'score': 0.09362738579511642},
  {'label': 'anger', 'score': 0.040478240698575974},
  {'label': 'surprise', 'score': 0.02697017788887024},
  {'label': 'fear', 'score': 0.0068790484219789505}],
 [{'label': 'neutral', 'score': 0.6462157964706421},
  {'label': 'sadness', 'score': 0.24273352324962616},
  {'label': 'disgust', 'score': 0.04342266544699669},
  {'label': 'surprise', 'score': 0.028300534933805466},
  {'label': 'joy', 'score': 0.014211485

In [18]:
# sort the predictions for the first sentence alphabetically by label
sorted(predictions[0], key=lambda x: x["label"])

[{'label': 'anger', 'score': 0.009156374260783195},
 {'label': 'disgust', 'score': 0.002628477755934},
 {'label': 'fear', 'score': 0.06816229224205017},
 {'label': 'joy', 'score': 0.04794258251786232},
 {'label': 'neutral', 'score': 0.14038607478141785},
 {'label': 'sadness', 'score': 0.0021221644710749388},
 {'label': 'surprise', 'score': 0.729602038860321}]

Next, I create one new feature per emotion: the highest probability that emotion reaches across all sentences of a book's description.

In [22]:
emotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def calculate_max_emotion_scores(predictions):
    max_scores = {label: -np.inf for label in emotion_labels}
    
    for prediction in predictions:
        for p in prediction:  
            label, score = p["label"], p["score"]
            if score > max_scores[label]:
                max_scores[label] = score

    return max_scores

In [23]:
calculate_max_emotion_scores(predictions)

{'anger': 0.06413355469703674,
 'disgust': 0.27359139919281006,
 'fear': 0.9281681180000305,
 'joy': 0.9327982664108276,
 'sadness': 0.9671575427055359,
 'surprise': 0.729602038860321,
 'neutral': 0.6462157964706421}
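
Because the maximum is taken per label and independently of the other labels, one description can score high on several emotions at once, as fear, joy, and sadness all do above. A minimal sketch of the same aggregation on made-up data; the two-sentence `toy_predictions` and the helper name `max_scores_per_label` are hypothetical and only there for illustration:

```python
def max_scores_per_label(predictions, labels):
    """Return, for each label, its highest score across all sentences."""
    best = {label: float("-inf") for label in labels}
    for sentence_scores in predictions:
        for entry in sentence_scores:
            label, score = entry["label"], entry["score"]
            if score > best[label]:
                best[label] = score
    return best

# Hypothetical scores for two sentences: the first is joyful, the second fearful.
toy_predictions = [
    [{"label": "joy", "score": 0.9}, {"label": "fear", "score": 0.1}],
    [{"label": "joy", "score": 0.2}, {"label": "fear", "score": 0.8}],
]

result = max_scores_per_label(toy_predictions, ["fear", "joy"])
print(result)
```

Both labels end up with a high score even though no single sentence is both joyful and fearful, which is the behaviour the features are meant to capture.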

In [24]:
# run inference on the first 10 books in the dataset

isbn = []
emotion_scores = {label: [] for label in emotion_labels}

for i in range(10):
    isbn.append(books.loc[i, "isbn13"])   
    sentences = books.loc[i, "description"].split(".")
    predictions = classifier(sentences)  
    max_scores = calculate_max_emotion_scores(predictions)  # max score per emotion for this description

    for label in emotion_labels:
        emotion_scores[label].append(max_scores[label])

In [26]:
emotion_scores

{'anger': [0.06413355469703674,
  0.6126194000244141,
  0.06413355469703674,
  0.35148441791534424,
  0.08141238987445831,
  0.23222465813159943,
  0.5381842255592346,
  0.06413355469703674,
  0.3006700873374939,
  0.06413355469703674],
 'disgust': [0.27359139919281006,
  0.3482847809791565,
  0.10400658845901489,
  0.1507224589586258,
  0.18449527025222778,
  0.7271749377250671,
  0.155854731798172,
  0.10400658845901489,
  0.279481440782547,
  0.17792704701423645],
 'fear': [0.9281681180000305,
  0.9425276517868042,
  0.9723208546638489,
  0.36070606112480164,
  0.09504339843988419,
  0.05136274918913841,
  0.7474274635314941,
  0.40449756383895874,
  0.9155240654945374,
  0.05136274918913841],
 'joy': [0.9327982664108276,
  0.7044221758842468,
  0.7672382593154907,
  0.2518813908100128,
  0.040564365684986115,
  0.043375786393880844,
  0.8725655674934387,
  0.040564365684986115,
  0.040564365684986115,
  0.040564365684986115],
 'sadness': [0.9671575427055359,
  0.11169009655714035,


Putting it all together for all books:

In [27]:
books.shape

(5693, 14)

In [None]:
emotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]
isbn = []
emotion_scores = {label: [] for label in emotion_labels}

chunk_size = 300

# outer loop walks the dataset in chunks; inner loop scores each book in the chunk
for chunk_start in tqdm(range(0, len(books), chunk_size)):
    # clamp the last chunk so the index never runs past the final row
    chunk_end = min(chunk_start + chunk_size, len(books))
    for i in range(chunk_start, chunk_end):
        isbn.append(books.loc[i, "isbn13"])
        sentences = books.loc[i, "description"].split(".")
        predictions = classifier(sentences)
        max_scores = calculate_max_emotion_scores(predictions)  # max score per emotion for this description

        for label in emotion_labels:
            emotion_scores[label].append(max_scores[label])

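
A chunked loop has to stop its last chunk at the final row: with 5,693 books and a chunk size of 300, the nineteenth chunk covers only rows 5400 to 5692. A minimal sketch of computing such bounds, using a hypothetical helper `chunk_bounds` that is not part of the notebook:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) index pairs covering range(n_rows); end is exclusive."""
    for start in range(0, n_rows, chunk_size):
        # min() clamps the final chunk to the number of rows
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(5693, 300))
print(len(bounds), bounds[0], bounds[-1])  # prints: 19 (0, 300) (5400, 5693)
```

Each pair can be fed straight into `range(start, end)`, so no index past the last row is ever requested.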

In [None]:
emotions_df = pd.DataFrame(emotion_scores)
emotions_df["isbn13"] = isbn
emotions_df.head(2)

Merge the new features into the `books` dataframe (`pd.merge` defaults to an inner join, so only books that have emotion scores are kept):

In [None]:
books = pd.merge(books, emotions_df, on = "isbn13")
books.head(2)