# Long Text Sentiment

So far, we have restricted the length of the text being fed into our models. BERT in particular is restricted to consuming 512 tokens per sample. For many use cases this is not a problem, but in some cases it can be.

If we take the example of Reddit posts on the */r/investing* subreddit, many of the more important posts are **DD** (due diligence) posts, which often consist of deep dives into why the author thinks a stock is or is not a good investment. On these longer pieces of text, the author's actual sentiment may not be clear from the first 512 tokens. We need to consider the full post.

Before working through the logic that allows us to consider the full post, let's import and define everything we need to make a prediction on a single chunk of text (using much of what we covered in the last section).

In [1]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# initialize our model and tokenizer
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

# and we will place the processing of our input text into a function for easier prediction later
def sentiment(tokens):
    # get output logits from the model
    output = model(**tokens)
    # convert to probabilities
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    # we will return the probability tensor (we will not need argmax until later)
    return probs

Now let's get to how we apply sentiment to longer pieces of text. There are two approaches that we cover in these notebooks:

* Using neural text summarization to shorten the text to below 512 tokens.

* Iterating through the text using a *window* and calculating the average article sentiment.

In this notebook we will be using the second approach. The window in question will be a subsection of our tokenized text, of length `512`. First, let's define an example and tokenize it.

In [2]:
txt = """
I would like to get your all  thoughts on the bond yield increase this week.  I am not worried about the market downturn but the sudden increase in yields. On 2/16 the 10 year bonds yields increased by almost  9 percent and on 2/19 the yield increased by almost 5 percent.

Key Points from the CNBC Article:

* **The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.**
* **Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic.**
* **However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.**

The recent rise in bond yields and U.S. inflation expectations has some investors wary that a repeat of the 2013 “taper tantrum” could be on the horizon.

The benchmark U.S. 10-year Treasury note climbed above 1.3% for the first time since February 2020 earlier this week, while the 30-year bond also hit its highest level for a year. Yields move inversely to bond prices.

Yields tend to rise in lockstep with inflation expectations, which have reached their highest levels in a decade in the U.S., powered by increased prospects of a large fiscal stimulus package, progress on vaccine rollouts and pent-up consumer demand.

The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.

Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic. The Fed and others have maintained supportive tones in recent policy meetings, vowing to keep financial conditions loose as the global economy looks to emerge from the Covid-19 pandemic.

However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.

With central bank support removed, bonds usually fall in price which sends yields higher. This can also spill over into stock markets as higher interest rates means more debt servicing for firms, causing traders to reassess the investing environment.

“The supportive stance from policymakers will likely remain in place until the vaccines have paved a way to some return to normality,” said Shane Balkham, chief investment officer at Beaufort Investment, in a research note this week.

“However, there will be a risk of another ‘taper tantrum’ similar to the one we witnessed in 2013, and this is our main focus for 2021,” Balkham projected, should policymakers begin to unwind this stimulus.

Long-term bond yields in Japan and Europe followed U.S. Treasurys higher toward the end of the week as bondholders shifted their portfolios.

“The fear is that these assets are priced to perfection when the ECB and Fed might eventually taper,” said Sebastien Galy, senior macro strategist at Nordea Asset Management, in a research note entitled “Little taper tantrum.”

“The odds of tapering are helped in the United States by better retail sales after four months of disappointment and the expectation of large issuance from the $1.9 trillion fiscal package.”

Galy suggested the Fed would likely extend the duration on its asset purchases, moderating the upward momentum in inflation.

“Equity markets have reacted negatively to higher yield as it offers an alternative to the dividend yield and a higher discount to long-term cash flows, making them focus more on medium-term growth such as cyclicals” he said. Cyclicals are stocks whose performance tends to align with economic cycles.

Galy expects this process to be more marked in the second half of the year when economic growth picks up, increasing the potential for tapering.

## Tapering in the U.S., but not Europe

Allianz CEO Oliver Bäte told CNBC on Friday that there was a geographical divergence in how the German insurer is thinking about the prospect of interest rate hikes.

“One is Europe, where we continue to have financial repression, where the ECB continues to buy up to the max in order to minimize spreads between the north and the south — the strong balance sheets and the weak ones — and at some point somebody will have to pay the price for that, but in the short term I don’t see any spike in interest rates,” Bäte said, adding that the situation is different stateside.

“Because of the massive programs that have happened, the stimulus that is happening, the dollar being the world’s reserve currency, there is clearly a trend to stoke inflation and it is going to come. Again, I don’t know when and how, but the interest rates have been steepening and they should be steepening further.”

## Rising yields a ‘normal feature’

However, not all analysts are convinced that the rise in bond yields is material for markets. In a note Friday, Barclays Head of European Equity Strategy Emmanuel Cau suggested that rising bond yields were overdue, as they had been lagging the improving macroeconomic outlook for the second half of 2021, and said they were a “normal feature” of economic recovery.

“With the key drivers of inflation pointing up, the prospect of even more fiscal stimulus in the U.S. and pent up demand propelled by high excess savings, it seems right for bond yields to catch-up with other more advanced reflation trades,” Cau said, adding that central banks remain “firmly on hold” given the balance of risks.

He argued that the steepening yield curve is “typical at the early stages of the cycle,” and that so long as vaccine rollouts are successful, growth continues to tick upward and central banks remain cautious, reflationary moves across asset classes look “justified” and equities should be able to withstand higher rates.

“Of course, after the strong move of the last few weeks, equities could mark a pause as many sectors that have rallied with yields look overbought, like commodities and banks,” Cau said.

“But at this stage, we think rising yields are more a confirmation of the equity bull market than a threat, so dips should continue to be bought.”
"""
print(f"The size of your text: {len(txt)}")
tokens = tokenizer.encode_plus(txt, add_special_tokens=False)
print(f"Tokens: {tokens.input_ids}")
len(tokens['input_ids'])

Token indices sequence length is longer than the specified maximum sequence length for this model (1345 > 512). Running this sequence through the model will result in indexing errors


The size of your text: 6426
Tokens: [1045, 2052, 2066, 2000, 2131, 2115, 2035, 4301, 2006, 1996, 5416, 10750, 3623, 2023, 2733, 1012, 1045, 2572, 2025, 5191, 2055, 1996, 3006, 2091, 22299, 2021, 1996, 5573, 3623, 1999, 16189, 1012, 2006, 1016, 1013, 2385, 1996, 2184, 2095, 9547, 16189, 3445, 2011, 2471, 1023, 3867, 1998, 2006, 1016, 1013, 2539, 1996, 10750, 3445, 2011, 2471, 1019, 3867, 1012, 3145, 2685, 2013, 1996, 27166, 9818, 3720, 1024, 1008, 1008, 1008, 1996, 1523, 6823, 2099, 9092, 24456, 1524, 1999, 2286, 2001, 1037, 5573, 9997, 1999, 9837, 16189, 2349, 2000, 3006, 6634, 2044, 1996, 2976, 3914, 2623, 2008, 2009, 2052, 4088, 6823, 4892, 2049, 20155, 24070, 2565, 1012, 1008, 1008, 1008, 1008, 1008, 2350, 2430, 5085, 2105, 1996, 2088, 2031, 3013, 3037, 6165, 2000, 3181, 2659, 2015, 1998, 3390, 15741, 12450, 1997, 11412, 17402, 1999, 1037, 7226, 2000, 5370, 2039, 1996, 4610, 2802, 1996, 6090, 3207, 7712, 1012, 1008, 1008, 1008, 1008, 1008, 2174, 1010, 1996, 3522, 4125, 1999, 16189, 

1345

In [3]:
test_tokens_ids = tokens["input_ids"][0:8]
print(f"Tokens ids: {test_tokens_ids}")
tokenizer.convert_ids_to_tokens(test_tokens_ids, skip_special_tokens=False)

Tokens ids: [1045, 2052, 2066, 2000, 2131, 2115, 2035, 4301]


['i', 'would', 'like', 'to', 'get', 'your', 'all', 'thoughts']

If we tokenize this longer piece of text we get a total of **1345** tokens, far too many to fit into our BERT model, which has a maximum limit of 512 tokens. We will need to split this text into chunks of at most 512 tokens, and calculate our sentiment probabilities for each chunk separately.

Because we are taking this slightly different approach, we have encoded our tokens using a different set of parameters to what we have used before. This time, we:

* Avoided adding special tokens by setting `add_special_tokens=False`, because otherwise *[CLS]* and *[SEP]* tokens would be added to the start and end of the full tokenized sequence of length **1345**. We will instead add them manually to each chunk later.

* Did not specify the `max_length`, `truncation`, or `padding` parameters (as we do not use any of them here).

* Returned standard Python *lists* rather than tensors by not specifying `return_tensors` (lists are returned by default). This makes the following steps easier to follow, but we will rewrite them using PyTorch code in the next section.

In [27]:
type(tokens['input_ids'])

list

First, we break our tokenized dictionary into `input_ids` and `attention_mask` variables.

In [28]:
input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']

We can now access slices of these lists like so:

In [29]:
input_ids[16:32]

[1045,
 2572,
 2025,
 5191,
 2055,
 1996,
 3006,
 2091,
 22299,
 2021,
 1996,
 5573,
 3623,
 1999,
 16189,
 1012]

We will be using this to break our lists into smaller sections. Let's test it in a simple loop.

In [30]:
# define our starting position (0) and window size (number of tokens in each chunk)
start = 0
window_size = 512

# get the total length of our tokens
total_len = len(input_ids)

# initialize condition for our while loop to run
loop = True

# loop through and print out start/end positions
while loop:
    # the end position is simply the start + window_size
    end = start + window_size
    # if the end position is greater than or equal to the total length, make this our final iteration
    if end >= total_len:
        loop = False
        # and change our endpoint to the final token position
        end = total_len
    print(f"{start=}\n{end=}")
    # we need to move the window to the next 512 tokens
    start = end

start=0
end=512
start=512
end=1024
start=1024
end=1345
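
The same window boundaries can also be computed without a `while` loop. The following is a minimal sketch, assuming a hypothetical helper named `window_bounds` that is not part of the original notebook; it steps through the token positions with `range` and clips the final window to the total length.

```python
def window_bounds(total_len, window_size):
    """Return (start, end) index pairs covering total_len in steps of window_size."""
    return [
        (start, min(start + window_size, total_len))
        for start in range(0, total_len, window_size)
    ]

# same total length and window size as in the loop above
bounds = window_bounds(1345, 512)
print(bounds)
```

Each pair can be used directly as a slice, for example `input_ids[start:end]`.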


This logic works for shifting our window across the full length of input IDs, so now we can modify it to iteratively predict sentiment for each window. There will be a few added steps for us to get this to work:

1. Extract the window from `input_ids` and `attention_mask`.

2. Add the start-of-sequence token `[CLS]`/`101` and separator token `[SEP]`/`102`.

3. Add padding (only applicable to the final chunk).

4. Format into dictionary containing PyTorch tensors.

5. Make logits predictions with the model.

6. Calculate softmax and append softmax vector to a list `probs_list`.
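
Steps 1 to 3 involve only list operations, so they can be checked in isolation before the model is involved. The sketch below assumes a hypothetical helper `build_chunk`, dummy token IDs in place of the real tokenizer output, and a window of 510 tokens so that the two special tokens bring each chunk up to 512. The IDs `101`, `102`, and `0` for `[CLS]`, `[SEP]`, and padding are the ones used in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def build_chunk(input_ids, attention_mask, start, window_size=510):
    """Slice one window, wrap it in [CLS]/[SEP], and pad to window_size + 2."""
    end = min(start + window_size, len(input_ids))
    # (1) extract the window, (2) add the special tokens
    ids = [CLS_ID] + input_ids[start:end] + [SEP_ID]
    mask = [1] + attention_mask[start:end] + [1]
    # (3) pad both lists; padding positions get attention 0
    pad_len = window_size + 2 - len(ids)
    ids += [PAD_ID] * pad_len
    mask += [0] * pad_len
    return ids, mask

# dummy stand-ins for the 1345 real token IDs (assumption for illustration)
dummy_ids = [1000] * 1345
dummy_mask = [1] * 1345

# the final window starts at 1020 and holds the remaining 325 tokens
ids, mask = build_chunk(dummy_ids, dummy_mask, start=1020)
print(len(ids), sum(mask), ids.count(PAD_ID))
```

For the final window this gives 325 real tokens plus 2 special tokens, followed by 185 padding positions that the attention mask marks with zeros.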

In [41]:
import torch.nn.functional as F



# initialize probabilities list
probs_list = []

start = 0
window_size = 510  # we take 2 off here so that we can fit in our [CLS] and [SEP] tokens

loop = True

while loop:
    end = start + window_size
    if end >= total_len:
        loop = False
        end = total_len
    # (1) extract window from input_ids and attention_mask
    input_ids_chunk = input_ids[start:end]
    attention_mask_chunk = attention_mask[start:end]
   
    # (2) add [CLS] and [SEP]
    input_ids_chunk = [101] + input_ids_chunk + [102]
    attention_mask_chunk = [1] + attention_mask_chunk + [1]
    
    # (3) add padding up to window_size + 2 (512) tokens
    # both lists have the same length here, so they need the same amount of padding
    pad_len = window_size + 2 - len(input_ids_chunk)
    input_ids_chunk += [0] * pad_len
    print(f"input_ids_chunk: {input_ids_chunk}")
    attention_mask_chunk += [0] * pad_len
    
    # (4) format into PyTorch tensors dictionary
    input_dict = {
        'input_ids': torch.tensor([input_ids_chunk], dtype=torch.long),
        'attention_mask': torch.tensor([attention_mask_chunk], dtype=torch.int)
    }
    # (5) make logits prediction
    outputs = model(**input_dict)
    
    # (6) calculate softmax and append to list
    probs = F.softmax(outputs[0], dim=-1)
    print(f"probs: {probs}")
    pred = torch.argmax(probs)
    print(f"pred ---  indices of the maximum value: {pred}")
    probs_list.append(probs)

    start = end
    
# let's view the probabilities given
probs_list

input_ids_chunk: [101, 1045, 2052, 2066, 2000, 2131, 2115, 2035, 4301, 2006, 1996, 5416, 10750, 3623, 2023, 2733, 1012, 1045, 2572, 2025, 5191, 2055, 1996, 3006, 2091, 22299, 2021, 1996, 5573, 3623, 1999, 16189, 1012, 2006, 1016, 1013, 2385, 1996, 2184, 2095, 9547, 16189, 3445, 2011, 2471, 1023, 3867, 1998, 2006, 1016, 1013, 2539, 1996, 10750, 3445, 2011, 2471, 1019, 3867, 1012, 3145, 2685, 2013, 1996, 27166, 9818, 3720, 1024, 1008, 1008, 1008, 1996, 1523, 6823, 2099, 9092, 24456, 1524, 1999, 2286, 2001, 1037, 5573, 9997, 1999, 9837, 16189, 2349, 2000, 3006, 6634, 2044, 1996, 2976, 3914, 2623, 2008, 2009, 2052, 4088, 6823, 4892, 2049, 20155, 24070, 2565, 1012, 1008, 1008, 1008, 1008, 1008, 2350, 2430, 5085, 2105, 1996, 2088, 2031, 3013, 3037, 6165, 2000, 3181, 2659, 2015, 1998, 3390, 15741, 12450, 1997, 11412, 17402, 1999, 1037, 7226, 2000, 5370, 2039, 1996, 4610, 2802, 1996, 6090, 3207, 7712, 1012, 1008, 1008, 1008, 1008, 1008, 2174, 1010, 1996, 3522, 4125, 1999, 16189, 6083, 2008, 20

[tensor([[0.1384, 0.8145, 0.0471]], grad_fn=<SoftmaxBackward0>),
 tensor([[0.3757, 0.4670, 0.1574]], grad_fn=<SoftmaxBackward0>),
 tensor([[0.7290, 0.2006, 0.0704]], grad_fn=<SoftmaxBackward0>)]

Each section has been assigned a different sentiment distribution. The first and second sections both score *negatively* (index *1*) and the final section scores *positively* (index *0*). To calculate the average sentiment across the full text, we will merge these tensors using the `stack` method:

In [43]:
stacks = torch.stack(probs_list)
stacks

tensor([[[0.1384, 0.8145, 0.0471]],

        [[0.3757, 0.4670, 0.1574]],

        [[0.7290, 0.2006, 0.0704]]], grad_fn=<StackBackward0>)

From here we will calculate the mean score of each column (positive, negative, and neutral sentiment respectively) using `mean(dim=0)`. But before we do that we must reshape our tensor into a *3x3* shape - it is currently a 3x1x3:

In [44]:
shape = stacks.shape
shape

torch.Size([3, 1, 3])

We can reshape our tensor dimensions using the `resize_` method, and use dimensions `0` and `2` of our current tensor shape:

In [45]:
shape[0], shape[2]

(3, 3)

In [48]:
stacks.resize_(shape[0], shape[2])

RuntimeError: cannot resize variables that require grad

When we try to resize our tensor, we receive this `RuntimeError` telling us that we cannot resize variables that require *grad*. This refers to the *gradient tracking* that PyTorch performs so that model weights can be updated during training. Our probabilities were produced with gradient tracking enabled, and the in-place `resize_` method is not allowed on such tensors. Fortunately, we don't want to use this tensor for training, so we can use the `torch.no_grad()` context manager to tell PyTorch that we do **not** want to track any gradients.

In [49]:
with torch.no_grad():
    # we must include our stacks operation in here too
    stacks = torch.stack(probs_list)
    # now resize
    stacks = stacks.resize_(stacks.shape[0], stacks.shape[2])
    print(f"Resized stack: {stacks}")
    # finally, we can calculate the mean value for each sentiment class
    mean = stacks.mean(dim=0)
    
mean

Resized stack: tensor([[0.1384, 0.8145, 0.0471],
        [0.3757, 0.4670, 0.1574],
        [0.7290, 0.2006, 0.0704]])


tensor([0.4144, 0.4940, 0.0916])
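
As an alternative to `resize_`, the middle dimension can be dropped with a non-in-place operation such as `squeeze`, which is permitted on tensors that track gradients. This is a sketch rather than the approach used above: the probabilities are hard-coded from the output shown earlier, and `requires_grad=True` is set only to mimic tensors that came out of the model.

```python
import torch

# hard-coded stand-ins for the per-chunk probabilities (copied from the output above)
probs_list = [
    torch.tensor([[0.1384, 0.8145, 0.0471]], requires_grad=True),
    torch.tensor([[0.3757, 0.4670, 0.1574]], requires_grad=True),
    torch.tensor([[0.7290, 0.2006, 0.0704]], requires_grad=True),
]

stacks = torch.stack(probs_list)   # shape (3, 1, 3)
flat = stacks.squeeze(1)           # shape (3, 3), no in-place resize needed
mean = flat.mean(dim=0).detach()   # detach because we do not need gradients

print(flat.shape)
print(mean)
```

Because `squeeze` returns a new view instead of modifying the tensor in place, it does not trigger the `RuntimeError` seen with `resize_`.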

Our final sentiment prediction shows a reasonably balanced mix of positive and negative sentiment, with a slightly stronger negative score overall. We can take the `argmax` too to specify our winning class.

In [50]:
torch.argmax(mean).item()

1
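
To turn the winning index into a readable label, we can map it onto the column order given above (positive, negative, neutral). This is a minimal sketch with a hard-coded `labels` list, which is an assumption based on that ordering; the mean values are copied from the output above.

```python
# assumed column order, taken from the description of the probability columns above
labels = ["positive", "negative", "neutral"]

# mean probabilities copied from the notebook output
mean_probs = [0.4144, 0.4940, 0.0916]

# index of the largest mean probability
best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
print(labels[best], mean_probs[best])
```

With the loaded model, the `id2label` attribute of `model.config` may provide the same mapping, which would avoid hard-coding the order.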