# Analyze Texts

This notebook combines code by Vi Mai and Moacir P. de Sá Pereira with default TDMStudio code provided by ProQuest/Clarivate.
Mai wrote the Huggingface interface and Sá Pereira put everything together and wrote the general workflow.

This notebook assumes the existence of two different datasets: 

1. A set of corpora stored in directories like `./data/{corpus_name}`, each of which contains $n$ XML files named `{goid}.xml`, where `goid` is a global ID that ProQuest uses for its articles.

2. A set of parquet files in `./full_parquets`, each named `{corpus}.parquet`. Each file concatenates the chunked CSVs in `./dataframe_files`, keeping only some of their columns and omitting rows (articles) that are overly long or were published on a weekend. These files were created in the `concatenate-corpora` notebook and have these columns:

- `index`: Int. A consecutive index.
- `goid`: Int. ProQuest's global ID for its articles.
- `date`: DT. The publication date, in datetime format.
- `tokens`: Int. A naive word count, derived from splitting the full text on whitespace.
- `corpus`: Str. The corpus name. This is used in the next notebook.
- `daily_article_count`: Int. The number of articles in the corpus for that day.
- `daily_token_sum`: Int. The sum of naive tokens in the corpus for that day.
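
The derived columns above can be illustrated with a small standard-library sketch. It is not the code from the `concatenate-corpora` notebook: the sample articles are made up, and it assumes that the consecutive index and the daily aggregates are computed after weekend articles are dropped. The length filter is left out because its threshold is not stated here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical articles; goids, dates, and texts are made up for illustration.
articles = [
    {"goid": 1, "date": date(2024, 1, 5), "text": "Shares rose after strong earnings"},
    {"goid": 2, "date": date(2024, 1, 5), "text": "Retailer cuts forecast"},
    {"goid": 3, "date": date(2024, 1, 6), "text": "Weekend feature"},
    {"goid": 4, "date": date(2024, 1, 8), "text": "Stores reopen Monday"},
]


def build_rows(articles, corpus):
    # Drop weekend publication dates: weekday() is 5 for Saturday, 6 for Sunday.
    kept = [a for a in articles if a["date"].weekday() < 5]

    # Naive word count: split the full text on whitespace.
    rows = [
        {
            "index": i,
            "goid": a["goid"],
            "date": a["date"],
            "tokens": len(a["text"].split()),
            "corpus": corpus,
        }
        for i, a in enumerate(kept)
    ]

    # Per-day aggregates, repeated on every row published that day.
    daily_count = defaultdict(int)
    daily_sum = defaultdict(int)
    for row in rows:
        daily_count[row["date"]] += 1
        daily_sum[row["date"]] += row["tokens"]
    for row in rows:
        row["daily_article_count"] = daily_count[row["date"]]
        row["daily_token_sum"] = daily_sum[row["date"]]
    return rows


rows = build_rows(articles, "walmart")
for row in rows:
    print(row)
```

With pandas, the same aggregates would typically come from a `groupby("date")` followed by `transform`.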

In this notebook, we iterate over the corpora and split each one back into batches of 1,000 articles. We analyze the sentiment of each article's full text using the [distilroberta-finetuned-financial-news-sentiment-analysis](https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis) model by Manuel Romero, downloaded from Huggingface.

We derive three new columns:

- `text_sentiment`: Float. A weighted average sentiment for the article. The score range is (-1, 1) to account for negative, neutral, and positive sentiment (positive numbers indicate positive sentiment).
- `text_error`: Float. A weighted inverse squared error that indicates the model’s confidence in its labeling. The higher this value is, the more confident the model is of its analysis.
- `text_input_tokens`: Int. The number of input tokens used by the model. The model uses the RoBERTa tokenizer, which uses byte pair encoding for subword tokenization. As a result, this number is typically larger than the `tokens` column but gives a sense of how many chunks the model split the article text into (the model can take only 512 tokens at a time).
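
For a single 512-token chunk, the sentiment score and its spread can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's torch code: the logits are made up, and the label order (negative, neutral, positive) is assumed to match the values -1, 0, and 1.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_score(logits):
    """Return the expected sentiment and its standard deviation for one chunk."""
    values = [-1.0, 0.0, 1.0]  # assumed label order: negative, neutral, positive
    probs = softmax(logits)
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, math.sqrt(variance)


# Hypothetical logits, not real model output.
print(chunk_score([0.0, 0.0, 0.0]))    # undecided: mean 0, large spread
print(chunk_score([-5.0, -5.0, 5.0]))  # confidently positive: mean near 1, small spread
```

A small spread means the probability mass sits on one label, which is why the inverse of the squared spread can serve as a confidence weight.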

Each new batch dataframe with the sentiment scores is saved to a new parquet file named `./analyzed_files/{corpus}_nnn.parquet`, where `nnn` is the zero-padded batch number.
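
The file naming can be sketched as below. The helper name `batch_file_name` is hypothetical; the arithmetic assumes batches are numbered from 1 and that each batch starts at a multiple of the batch size.

```python
def batch_file_name(corpus, starting_index, batch_size):
    # Batch numbers start at 1; three digits of zero padding keep
    # lexical order equal to numeric order for up to 999 batches.
    batch_number = starting_index // batch_size + 1
    return f"{corpus}_{batch_number:03d}.parquet"


names = [batch_file_name("walmart", start, 1000) for start in (0, 1000, 11000)]
print(names)
```

Because of the padding, `sorted()` on the file names returns the batches in the order they were written.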

A final function will concatenate a single corpus’s analyzed files into a single parquet file for analysis in the rest of our project.    

## Imports

In [None]:
%conda update -n base -c conda-forge conda

In [None]:
%conda install pyarrow=15.0.0

In [None]:
%conda install pandas=2.2.3

In [None]:
%conda install pytorch=2.5

In [None]:
%conda install transformers

In [None]:
%conda install lxml

In [6]:
import os
import pandas as pd
import numpy as np
from lxml import etree
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

## Constants

In [23]:
corpora = [
    "dollar-tree",
    "lululemon",
    "ulta",
    "walgreens",
    "walmart"
]

root_path = "/home/ec2-user/SageMaker"
full_parquets_path = f"{root_path}/full_fixed_parquet_files"
analyzed_files_path = f"{root_path}/analyzed_batch_parquet_files"
analyzed_full_parquet_path = f"{root_path}/analyzed_full_parquet_files"

## Text Analysis Functions

In [24]:
# Function to read a single text from a single xml file.
def get_text(corpus, goid):
    text = ""
    try: 
        tree = etree.parse(f"{root_path}/data/{corpus}/{goid}.xml")
        root = tree.getroot()
        if root.find('.//FullText') is not None:
            text = root.find('.//FullText').text
        elif root.find('.//HiddenText') is not None:
            text = root.find('.//HiddenText').text
        elif root.find('.//Text') is not None:
            text = root.find('.//Text').text
        
        # An element can exist but be empty, in which case .text is None.
        text = BeautifulSoup(text or "", "lxml").get_text().replace('\n', ' ').replace('\\', '').strip()

    except Exception as e:
        print(f"Error while parsing file {corpus}/{goid}.xml: {e}")
        
    return text
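
The tag fallback in `get_text` can be tried without any files on disk. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and the sample records (including the `Record` and `GOID` element names) are made up. Unlike `get_text`, it also skips an element that exists but is empty.

```python
import xml.etree.ElementTree as ET


def first_text(xml_string, tags=("FullText", "HiddenText", "Text")):
    """Return the text of the first non-empty element among `tags`, else ''."""
    root = ET.fromstring(xml_string)
    for tag in tags:
        node = root.find(f".//{tag}")
        if node is not None and node.text:
            return node.text.strip()
    return ""


# Hypothetical record with no FullText element, so HiddenText is used.
sample = "<Record><GOID>1</GOID><HiddenText> Sales rose. </HiddenText></Record>"
print(first_text(sample))
```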

In [25]:
class SentimentAnalysisEngine:
    def __init__(self, huggingface_model = "distilroberta-finetuned-financial-news-sentiment-analysis"):
        self.model_path = f"models/{huggingface_model}"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path, local_files_only = True)
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_path, local_files_only = True)
        self.max_length = 512 # Huggingface maximum
        
    def analyze(self, text):
        # Adapted from https://github.com/huggingface/transformers/issues/9321
        averages = []
        errors = []
        chunk_sizes = []
        
        inputs = self.tokenizer(text, return_tensors="pt")
        input_ids = inputs["input_ids"][0]
        length = len(input_ids)
        
        # Chunk input tensor into max_length-sized pieces.
        chunks = [
            input_ids[i:i + self.max_length]
            for i in range(0, length, self.max_length)
        ]
        
        for chunk in chunks:
            chunk_inputs={k: v[:, :len(chunk)].reshape(1, -1) for k, v in inputs.items()}
            chunk_inputs["input_ids"] = chunk.reshape(1, -1)
        
            with torch.no_grad():  # inference only, so skip gradient tracking
                preds = self.model(**chunk_inputs)
            weights = torch.softmax(preds.logits, dim=1) # could move to cuda in theory
            # print(torch.allclose(torch.sum(weights, dim=-1), torch.tensor(1.0)))  # Returns True if the sum is 1
            values = torch.linspace(-1, 1, steps=3)
            average = torch.dot(weights[0], values) # Don't need to divide. Weights sum to 1 in softmax
            averages.append(average.item())
            
            deviations = (values - average) ** 2
            variance = torch.dot(weights[0], deviations)
            error = torch.sqrt(variance)
            errors.append(error.item())
            
            chunk_sizes.append(len(chunk))
        
        chunk_sizes = torch.tensor(chunk_sizes, dtype=torch.float32)
        chunk_weighted_averages = torch.tensor(
            averages, dtype=torch.float32) * chunk_sizes
        errors = torch.tensor(errors, dtype=torch.float32)
        inverse_squared_errors = 1 / errors**2
        chunk_weighted_errors = inverse_squared_errors * chunk_sizes
        weighted_average = torch.sum(
            chunk_weighted_averages * inverse_squared_errors
            ) / torch.sum(chunk_weighted_errors)
        weighted_error = torch.sqrt(1.0 / torch.sum(1.0 / chunk_weighted_errors))
            
        # Return the final weighted average, final weighted error (higher = more confident),
        # and total number of input tokens.
        return weighted_average.item(), weighted_error.item(), length # total number of tokens
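
The way `analyze` combines per-chunk results can be followed with plain numbers. The sketch below mirrors the arithmetic above in standard-library Python; the function name `combine_chunks` and the sample values are hypothetical. Each chunk is weighted by its size divided by its squared error, so long and confident chunks dominate the average.

```python
import math


def combine_chunks(means, errors, sizes):
    """Combine per-chunk sentiment means into one score and one confidence value."""
    # Weight of a chunk: number of tokens times the inverse squared error.
    weights = [n / (e ** 2) for n, e in zip(sizes, errors)]
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    # Higher values mean more confidence; for one chunk this is sqrt(size) / error.
    confidence = math.sqrt(1.0 / sum(1.0 / w for w in weights))
    return weighted_mean, confidence


# Two hypothetical chunks: a full, fairly confident one and a short, uncertain one.
score, confidence = combine_chunks(
    means=[0.5, -0.5], errors=[0.5, 1.0], sizes=[512, 128]
)
print(score, confidence)
```

An error of exactly zero would make the weight undefined, so a production version would probably need a small floor on the error; that guard is omitted here.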


### Sanity Check

In [26]:
sentiment_analyzer = SentimentAnalysisEngine()

In [27]:
sentiment_analyzer.analyze("This stock is going to crash very soon")
# -0.9898354917397236

(-0.9898356199264526, 30.163711547851562, 10)

In [28]:
sentiment_analyzer.analyze("This stock is going to soar very soon")
# 0.9984612356163491

(0.9984613060951233, 69.96183013916016, 10)

In [29]:
sentiment_analyzer.analyze("This stock is going to blah")
# -3.935034447049273e-05

(-3.9350346924038604e-05, 237.02548217773438, 8)

## Iterate

In [152]:
sentiment_analyzer = SentimentAnalysisEngine()

def analyze_row(row):
    text = get_text(row.corpus, row.goid) # Why we added the `corpus` column.
    text_sentiment, text_error, text_input_tokens = sentiment_analyzer.analyze(text)
    return text_sentiment, text_error, text_input_tokens

In [153]:
# This function returns the last index analyzed 
# so we know where to begin our search in the full corpus
def get_last_index(corpus):
    analyzed_files = sorted([file for file in os.listdir(analyzed_files_path) if corpus in file])
    path = f"{analyzed_files_path}/{corpus}_{str(len(analyzed_files)).zfill(3)}.parquet"
    try:
        last_df = pd.read_parquet(path)
        last_index = int(last_df.tail(1)["index"].iloc[0])
    except FileNotFoundError:
        print(f"Couldn’t find {path}. This is probably OK.")
        last_index = -1
    
    return last_index

In [154]:
def iterate_on_corpus(corpus, batch_size):
    starting_index = get_last_index(corpus) + 1
    stopping_index = starting_index + batch_size
    df = pd.read_parquet(f"{full_parquets_path}/{corpus}.parquet")
    df_length = len(df)
    # Slices are half-open, so the last valid stop position is df_length itself.
    if starting_index >= df_length:
        print(f"No remaining articles for {corpus}")
        return True
    if stopping_index >= df_length:
        print(f"Final run for {corpus}")
        stopping_index = df_length

    batch_df = df[starting_index:stopping_index].copy()
    batch_df[[
        "text_sentiment",
        "text_error", 
        "text_input_tokens",
    ]] = batch_df.progress_apply(lambda row: analyze_row(row), axis=1, result_type="expand")
    path = f"{analyzed_files_path}/{corpus}_{str(starting_index // batch_size + 1).zfill(3)}.parquet"
    batch_df.to_parquet(path)
    print(f"Wrote {path}")
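
The batch boundaries can be checked in isolation. This is a minimal sketch with a hypothetical helper, `batch_bounds`; it relies on Python slices being half-open, so a stop value equal to the row count still includes the last row.

```python
def batch_bounds(total_rows, batch_size):
    """Yield half-open (start, stop) row ranges that cover every row exactly once."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


# Nine rows split evenly; ten rows leave a final batch of one.
print(list(batch_bounds(9, 3)))
print(list(batch_bounds(10, 3)))

# Every row position is visited once.
rows = list(range(10))
covered = [r for start, stop in batch_bounds(len(rows), 3) for r in rows[start:stop]]
print(covered == rows)
```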
    

In [156]:
tqdm.pandas(desc="Progress")

batch_size = 3

for i in range(3):
    for corpus in corpora:
        iterate_on_corpus(corpus, batch_size)
        

Couldn’t find /home/ec2-user/SageMaker/analyzed_batch_parquet_files/dollar-tree_000.parquet. This is probably OK.
Token indices sequence length is longer than the specified maximum sequence length for this model (1352 > 512). Running this sequence through the model will result in indexing errors
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/dollar-tree_001.parquet
Couldn’t find /home/ec2-user/SageMaker/analyzed_batch_parquet_files/lululemon_000.parquet. This is probably OK.
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/lululemon_001.parquet
Couldn’t find /home/ec2-user/SageMaker/analyzed_batch_parquet_files/ulta_000.parquet. This is probably OK.
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/ulta_001.parquet
Couldn’t find /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walgreens_000.parquet. This is probably OK.
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walgreens_001.parquet
Couldn’t find /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walmart_000.parquet. This is probably OK.
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walmart_001.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/dollar-tree_002.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/lululemon_002.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/ulta_002.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walgreens_002.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walmart_002.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/dollar-tree_003.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/lululemon_003.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/ulta_003.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walgreens_003.parquet
Wrote /home/ec2-user/SageMaker/analyzed_batch_parquet_files/walmart_003.parquet


In [157]:
df = pd.read_parquet(f"{analyzed_files_path}/walmart_003.parquet")
df

| | index | goid | date | tokens | corpus | daily_article_count | daily_token_sum | text_sentiment | text_error | text_input_tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 2547227213 | 2019-08-02 | 579 | walmart | 229 | 259962 | -1e-05 | 1226.034058 | 839.0 |
| 7 | 7 | 2847916030 | 2023-08-09 | 3683 | walmart | 225 | 401453 | 0.049922 | 261.916534 | 7821.0 |
| 8 | 8 | 2736604734 | 2022-11-16 | 1305 | walmart | 3852 | 3734991 | 0.11462 | 22.709007 | 1905.0 |


In [None]:
sample_count = 3

tqdm.pandas(desc="Progress")

for corpus in ["lululemon"]:
#for corpus in corpora:
    df = pd.read_parquet(f"{full_parquets_path}/{corpus}.parquet")  # concatenated corpus (see the introduction)
    sample_df = df.sample(n=sample_count, random_state=42)
    sample_df[[
        "text_sentiment",
        "text_error", 
        "text_input_tokens",
    ]] = sample_df.progress_apply(lambda row: analyze_row(row), axis=1, result_type="expand")
#     file_path = f"{analyzed_csv_path}/{corpus}.csv"
#     sample_df.to_csv(file_path)
#     print(f"Wrote {file_path}")

sample_df.head()

In [None]:
# Iterate


for corpus in corpora:
    for file in output_files:
        if corpus in file:
            df = pd.read_csv(f"{csv_path}/{file}")
            # Add new columns
            
            #token_breakpoint[corpus]
    
            file_path = f"{analyzed_csv_path}/{file}"
            df.to_csv(file_path)
            print(f"Wrote {file_path}")

In [None]:
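# Concatenate each corpus's analyzed batch files into a single parquet file.
for corpus in corpora:
    batch_files = sorted(
        file for file in os.listdir(analyzed_files_path)
        if file.startswith(f"{corpus}_") and file.endswith(".parquet")
    )
    if not batch_files:
        print(f"No analyzed files for {corpus}")
        continue
    df = pd.concat(
        [pd.read_parquet(f"{analyzed_files_path}/{file}") for file in batch_files],
        ignore_index=True,
    )
    file_path = f"{analyzed_full_parquet_path}/{corpus}.parquet"
    df.to_parquet(file_path)
    print(f"Wrote {file_path}")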


In [None]:
n = np.linspace(1, -1, 3)

In [None]:
np.average(n)

In [None]:
np.__version__

In [None]:
torch.__version__

In [None]:
!aws s3 cp ./analyze-texts.ipynb s3://pq-tdm-studio-results/tdm-ale-data/1876/results/

In [59]:
foo_orig = pd.DataFrame({
    "index": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "foo": ["bar0", "bar1", "bar2", "bar3", "bar4", "bar5", "bar6", "bar7", "bar8"]
})

foo_orig.to_parquet(f"{full_parquets_path}/foo.parquet")