# Sentiment Data Analysis

I want to perform sentiment analysis on a dataset of book reviews. The first step is to load the data and inspect it.


## Initial Data Load

In [1]:
import pandas as pd
import numpy as np

In [2]:
books_rating = pd.read_csv("../data/raw/Books_rating.csv")

In [3]:
books_rating.head()

| | Id | Title | Price | User_id | profileName | review/helpfulness | review/score | review/time | review/summary | review/text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1882931173 | Its Only Art If Its Well Hung! | | AVCGYZL8FQQTD | Jim of Oz "jim-of-oz" | 7/7 | 4.0 | 940636800 | Nice collection of Julie Strain images | This is only for Julie Strain fans. It's a col... |
| 1 | 826414346 | Dr. Seuss: American Icon | | A30TK6U7DNS82R | Kevin Killian | 10/10 | 5.0 | 1095724800 | Really Enjoyed It | I don't care much for Dr. Seuss but after read... |
| 2 | 826414346 | Dr. Seuss: American Icon | | A3UH4UZ4RSVO82 | John Granger | 10/11 | 5.0 | 1078790400 | Essential for every personal and Public Library | If people become the books they read and if "t... |
| 3 | 826414346 | Dr. Seuss: American Icon | | A2MVUWT453QH61 | Roy E. Perry "amateur philosopher" | 7/7 | 4.0 | 1090713600 | Phlip Nel gives silly Seuss a serious treatment | Theodore Seuss Geisel (1904-1991), aka &quot;D... |
| 4 | 826414346 | Dr. Seuss: American Icon | | A22X4XUPKF66MR | D. H. Richards "ninthwavestore" | 3/3 | 4.0 | 1107993600 | Good academic overview | Philip Nel - Dr. Seuss: American IconThis is b... |


The data source I am dealing with is massive (3,000,000 rows), so I will use Ray for parallel processing.
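Before any parallel framework is involved, the rows have to be split into batches that workers can process independently. The sketch below shows one way to do that in plain Python. The helper name `batch_ranges` and the batch size of 250,000 are illustrative assumptions, and Ray itself is not imported here; each yielded range is simply the kind of unit that could be handed to a Ray remote task.

```python
from typing import Iterator


def batch_ranges(n_rows: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive row-index ranges that together cover 0..n_rows-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        # The last batch is clipped so it never runs past the final row.
        yield range(start, min(start + batch_size, n_rows))


# 3,000,000 rows (the size reported by the dataset) in batches of 250,000 (assumed size).
batches = list(batch_ranges(3_000_000, 250_000))
print(len(batches))   # 12
print(batches[-1])    # range(2750000, 3000000)
```

Each range can be used with `DataFrame.iloc` to pull out the slice of reviews a worker should handle.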

In [4]:
books_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 10 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Id                  object 
 1   Title               object 
 2   Price               float64
 3   User_id             object 
 4   profileName         object 
 5   review/helpfulness  object 
 6   review/score        float64
 7   review/time         int64  
 8   review/summary      object 
 9   review/text         object 
dtypes: float64(2), int64(1), object(7)
memory usage: 228.9+ MB


I want to see how long the review text is, because deployable sentiment analysis models often cannot accept more than 512 tokens.

In [5]:
books_rating['review/text'][0]

"This is only for Julie Strain fans. It's a collection of her photos -- about 80 pages worth with a nice section of paintings by Olivia.If you're looking for heavy literary content, this isn't the place to find it -- there's only about 2 pages with text and everything else is photos.Bottom line: if you only want one book, the Six Foot One ... is probably a better choice, however, if you like Julie like I like Julie, you won't go wrong on this one either."

This particular review is short and fits comfortably within 512 tokens, but other reviews in the dataset are far longer. To handle those I will need a summarising model.
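A quick way to sanity-check review length before loading a real tokenizer is to count words and punctuation marks. This is only a rough proxy, assuming a BERT-style tokenizer that first splits on whitespace and punctuation: the true token count is usually higher because of subword splits and special tokens. The function names `rough_token_count` and `may_need_summary` are hypothetical helpers, not part of any library.

```python
import re

MAX_TOKENS = 512  # the limit quoted above for deployable sentiment models


def rough_token_count(text: str) -> int:
    """Count words and punctuation marks as a crude stand-in for tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def may_need_summary(text: str, limit: int = MAX_TOKENS) -> bool:
    """True when even the rough count already exceeds the limit."""
    return rough_token_count(text) > limit


print(rough_token_count("Hello, world!"))   # 4
print(may_need_summary("word " * 600))      # True
```

A text flagged here is very likely over the limit; a text that passes should still be checked with the model's own tokenizer before relying on the result.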

## Summary Large Language Model

The model I will use is facebook/bart-large-cnn. It can summarise dense text down to a size that can then be passed to sentiment analysis.

In [1]:
from transformers import pipeline
import pandas as pd

books_rating = pd.read_csv("../data/raw/Books_rating.csv")

Let's take a really large chunk of text:

In [2]:
books_rating['review/text'][1423]

'The book Economics In One Lesson is about economic policies, and how these policies should take into account the effect they will have on everyone involved, not just one group. It also discusses and illustrates how economics should be focused on long term effects, rather than ignoring or overlooking the long term and only focusing on the short term. The author states that the most common fallacy in economics is to concentrate on the short-run effects of policies on special groups and to ignore or belittle the long-run effects on the community as a whole. Throughout the book, the chapters give lessons and examples that bring this one overriding lesson into focus. The author sums up economics in a single sentence: "The art of economics consists in looking not merely at the immediate but at the longer effects of any act or policy; it consists in tracing the consequences of that policy not merely for one group but for all groups."The author discusses the enormous faith many people have in

This looks long. Let's see if we can summarise the text.

In [3]:
# Load a pre-trained summarization model (e.g., BART)
summarizer = pipeline("summarization", model="../models/bart-large-cnn")

# Sample long text
long_text = books_rating['review/text'][1423]

summary = summarizer(long_text, max_length=512, min_length=100, do_sample=False)[0]['summary_text']

print("Summarized text:", summary)



Summarized text: Economics In One Lesson is about economic policies, and how these policies should take into account the effect they will have on everyone involved, not just one group. The author states that the most common fallacy in economics is to concentrate on the short-run effects of policies on special groups. In a recent speech given by a politician, he suggested that "we go back to the days of living without ATM machines, because they have destroyed the jobs of many bank tellers" This illustrates the fallacy of ignoring the long run positive effects on the community.


## Sentiment Analysis

In order to perform sentiment analysis we will use a DistilBERT base model fine-tuned on SST-2 (distilbert-base-uncased-finetuned-sst-2-english). It provides a positive/negative binary classification with a confidence score.

In [4]:

from transformers import pipeline

# Load the pre-trained DistilBERT model and tokenizer for sentiment analysis
model_name = "../models/distilbert-base-uncased-finetuned-sst-2-english"
sentiment_analyzer = pipeline("sentiment-analysis", model=model_name)

In [6]:
# Run sentiment analysis on the summarised text
result_1 = sentiment_analyzer(summary)

print(result_1)

[{'label': 'NEGATIVE', 'score': 0.9757010340690613}]
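The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. For aggregating over many reviews it can be handy to fold the two into a single signed number. The sketch below is one possible convention, not something the pipeline provides; `signed_score` is a hypothetical helper and `example_result` is copied from the output above.

```python
from typing import Dict, List


def signed_score(result: List[Dict]) -> float:
    """Map the first result to [-1, 1]: POSITIVE stays positive, anything else is negated."""
    top = result[0]
    sign = 1.0 if top["label"] == "POSITIVE" else -1.0
    return sign * top["score"]


# Values copied from the classifier output shown above.
example_result = [{"label": "NEGATIVE", "score": 0.9757010340690613}]
print(signed_score(example_result))  # -0.9757010340690613
```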


Let's see if it can handle the full, unsummarised review:

In [7]:
# Run sentiment analysis on the full, unsummarised review
result_2 = sentiment_analyzer(books_rating['review/text'][1423])

print(result_2)

Token indices sequence length is longer than the specified maximum sequence length for this model (978 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The size of tensor a (978) must match the size of tensor b (512) at non-singleton dimension 1

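In this example the raw review tokenises to 978 tokens, which is above the model's 512-token limit, so the classifier fails. This confirms that long reviews need to be summarised first, so our analysis is a success!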
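Putting the two steps together, one possible flow is to summarise only the reviews that exceed the token limit and classify everything else directly. The sketch below is a minimal illustration under stated assumptions: `analyse_review` is a hypothetical function, and the three callables are passed in so that the real tokenizer, summarizer and classifier can be swapped for the stand-ins used here, which exist only so the example runs without any model files.

```python
from typing import Callable, Dict, List


def analyse_review(
    text: str,
    count_tokens: Callable[[str], int],
    summarise: Callable[[str], str],
    classify: Callable[[str], Dict],
    max_tokens: int = 512,
) -> Dict:
    """Summarise only when the text exceeds the token limit, then classify."""
    if count_tokens(text) > max_tokens:
        text = summarise(text)
    return classify(text)


# Stand-in callables (assumptions for illustration, not real models).
calls: List[str] = []


def stub_count(text: str) -> int:
    return len(text.split())


def stub_summarise(text: str) -> str:
    calls.append("summarise")
    return " ".join(text.split()[:100])  # keep the first 100 words


def stub_classify(text: str) -> Dict:
    return {"label": "POSITIVE", "score": 1.0, "n_words": len(text.split())}


short = analyse_review("good book", stub_count, stub_summarise, stub_classify)
long = analyse_review("word " * 978, stub_count, stub_summarise, stub_classify)
print(short["n_words"], long["n_words"], calls)  # 2 100 ['summarise']
```

One caveat: the summarizer's `max_length` is measured in the summarizer's own tokens, which may not match the classifier's tokenization, so re-checking the summary length before classifying may be worthwhile.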