<a href="https://colab.research.google.com/github/mm0097/bert_sentiment_progressive_analysis/blob/main/bert_sentiment_progressives.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlation analysis between sentiment and use of the progressive
This Jupyter notebook compiles a corpus from the ConvoKit Reddit corpus and preprocesses part of the data to fine-tune a BERT model for sentiment analysis, using VADER scores as pre-annotation. For better results, the VADER scores can be edited manually, for instance where a classification is ambiguous. The fine-tuned model then predicts sentiment intensity for a separate analysis dataset. An algorithm identifies progressive constructions and classifies them along with their linguistic features. The resulting data is then used for correlation testing with logistic regression models.

## Installing initial dependencies/packages


In [None]:
!pip install convokit

Note: other packages are loaded throughout the code for specific segments. The following block pins scipy and numpy to versions that avoid conflicts with convokit. The runtime must be restarted after running it.

In [None]:
!pip install --upgrade --force-reinstall "scipy==1.10.1"
!pip install --upgrade --force-reinstall "numpy==1.23.5"

In [None]:
from convokit import Corpus, download
import pandas as pd
import csv
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## Corpus construction
This segment obtains textual data from the ConvoKit Reddit corpus and creates separate dataframes for the analysis and for fine-tuning the BERT model. The size of the analysis dataset is preset to 20,000 utterances from the Reddit corpus and can be adjusted by changing the `analysis_size` variable.
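
The split relies on drawing one random sample, removing it from the pool, and drawing the second sample from what remains, so the two datasets cannot share utterances. A minimal sketch of that idea on plain Python lists follows. The helper `split_disjoint` and the utterance IDs are hypothetical and only illustrate the logic. The notebook itself does this with `DataFrame.sample` and `DataFrame.drop`.

```python
import random

def split_disjoint(ids, n_analysis, n_finetuning, seed=42):
    """Draw two non-overlapping random samples from a list of utterance IDs."""
    rng = random.Random(seed)
    analysis = rng.sample(ids, n_analysis)
    chosen = set(analysis)
    # Remove the analysis sample first (same role as DataFrame.drop(index))
    remaining = [i for i in ids if i not in chosen]
    finetuning = rng.sample(remaining, n_finetuning)
    return analysis, finetuning

ids = [f"utt_{i}" for i in range(1000)]          # hypothetical utterance IDs
analysis_ids, finetuning_ids = split_disjoint(ids, 200, 250)
print(len(analysis_ids), len(finetuning_ids))
print(set(analysis_ids).isdisjoint(finetuning_ids))
```

Because the second sample is drawn only from the remaining IDs, no fine-tuning text can leak into the analysis set.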

In [None]:
analysis_size = 20000           # Number of posts for the linguistic analysis

In [None]:
c1 = Corpus(filename=download("subreddit-IdiotsInCars"))                      # Downloading the "IdiotsInCars" subreddit corpus and creating a dataframe from the content
df_base_c1 = c1.get_utterances_dataframe()
print("subreddit: IdiotsInCars - "+str(len(df_base_c1))+" Entries")

c2 = Corpus(filename=download("subreddit-AmItheAsshole"))                     # Downloading the "AmItheAsshole" subreddit corpus and creating a dataframe from the content
df_base_c2 = c2.get_utterances_dataframe()
print("subreddit: AmItheAsshole - "+str(len(df_base_c2))+" Entries")

c3 = Corpus(filename=download("subreddit-confessions"))                       # Downloading the "Confessions" subreddit corpus and creating a dataframe from the content
df_base_c3 = c3.get_utterances_dataframe()
print("subreddit: confessions - "+str(len(df_base_c3))+" Entries")

#combining the datasets and removing deleted and empty posts
df_full=pd.concat([df_base_c1,df_base_c2,df_base_c3])
df_full['text'] = df_full['text'].replace('', pd.NA)
df_full = df_full.dropna(subset=['text']).copy()                              #dropping empty strings
df_full['text'] = df_full['text'].replace('[deleted]', pd.NA)
df_full = df_full.dropna(subset=['text']).copy()                              #dropping [deleted] posts
df_full['text'] = df_full['text'].replace('[removed]', pd.NA)
df_full = df_full.dropna(subset=['text']).copy()                              #dropping [removed] posts
df_full['speaker'] = df_full['speaker'].replace('AutoModerator', pd.NA)
df_full = df_full.dropna(subset=['speaker']).copy()                           #dropping posts from AutoModerators
df_full['text'] = df_full['text'].str.replace('\n', ' ')
print("full set: "+str(len(df_full))+" entries")

#splitting into a dataset for analysis and one for ML fine-tuning
analysis_df = df_full.sample(n=200000, random_state=42)                       # Randomly sample 200k rows
finetuning_df = df_full.drop(analysis_df.index)
finetuning_df = finetuning_df.sample(n=250000, random_state=42)               # From the remaining posts, create a dataset for fine-tuning the BERT model
analysis_df = analysis_df.sample(n=analysis_size, random_state=42)            # Reduce the dataset for analysis to the variable set above

print("analysis: "+str(len(analysis_df))+" entries")
print("training: "+str(len(finetuning_df))+" entries")

## Fine-tuning *BERT* for sentiment valence analysis


### Preprocessing data for fine-tuning

#### Removing stopwords and storing the result in an additional column

In [None]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Function to remove stopwords from text
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)


finetuning_df['text_without_stopwords'] = finetuning_df['text'].apply(remove_stopwords)     # Adding an additional column for the texts without stopwords

# Save the updated DataFrame
# finetuning_df.to_csv('finetuning_with_stopwords_removed.csv', sep="\t", index=False)

#### Using VADER to analyse the stopword-stripped text for sentiment valence
Negative, positive and neutral valences, as well as an additional compound score are predicted using VADER.
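
The four scores live on different scales: `neg`, `neu` and `pos` are proportions between 0 and 1, while `compound` is a normalized sum of word valences between -1 and 1. The sketch below reproduces only the normalization step as it is implemented in the VADER reference code (with its default `alpha` of 15). The function name `normalize_compound` is made up for illustration, and the raw sums passed in are arbitrary inputs, not values from the corpus.

```python
import math

def normalize_compound(raw_sum, alpha=15.0):
    """Squash an unbounded valence sum into [-1, 1], as VADER does for 'compound'."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clip to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))

for raw in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(raw, round(normalize_compound(raw), 4))
```

This is why a regression target built from these columns mixes a [0, 1] range with a [-1, 1] range.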

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_scores(text):
    scores = analyzer.polarity_scores(text)
    return scores

finetuning_df['sentiment'] = finetuning_df['text_without_stopwords'].apply(get_sentiment_scores)

finetuning_df['neg'] = finetuning_df['sentiment'].apply(lambda x: x['neg'])
finetuning_df['pos'] = finetuning_df['sentiment'].apply(lambda x: x['pos'])
finetuning_df['neu'] = finetuning_df['sentiment'].apply(lambda x: x['neu'])
finetuning_df['compound'] = finetuning_df['sentiment'].apply(lambda x: x['compound'])

### Fine-tuning the BERT model for text sequence classification
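
The training code in this section minimizes `torch.nn.HuberLoss` between the three predicted scores and the VADER scores. As a rough illustration of what that loss computes with its default `delta` of 1.0 and mean reduction, here is a dependency-free sketch. The function `huber_loss` is a hypothetical stand-in, not the PyTorch implementation.

```python
def huber_loss(preds, targets, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(preds, targets):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err              # behaves like half the squared error
        else:
            total += delta * (err - 0.5 * delta)  # grows linearly, less sensitive to outliers
    return total / len(preds)

print(huber_loss([0.2, -0.5], [0.0, 0.5]))
print(huber_loss([3.0], [0.0]))
```

Since the VADER targets stay within [-1, 1], most errors are likely to fall in the quadratic regime, where the loss equals half the mean squared error.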

In [None]:
# using regression head
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import StepLR
from tqdm.notebook import tqdm
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import torch
import numpy as np
from scipy.stats import pearsonr

# Load pre-annotated finetuning dataset
df = finetuning_df

# Split the dataset into training, validation, and test sets // NOTE: the test set is not used anymore in this version of the code
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Fill NaN values with an empty string in the 'text' column
# (assigning the result back is more reliable than inplace=True on a column selection)
train_df = train_df.assign(text=train_df['text'].fillna(''))
val_df = val_df.assign(text=val_df['text'].fillna(''))
test_df = test_df.assign(text=test_df['text'].fillna(''))

# Tokenize and encode the text data
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')
test_encodings = tokenizer(list(test_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')

# Create PyTorch datasets
class RedditDataset(Dataset):
    def __init__(self, encodings, df):
        self.encodings = encodings
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        item['labels'] = torch.tensor([
            self.df.iloc[idx]['neg'],
            self.df.iloc[idx]['pos'],
            self.df.iloc[idx]['compound'],
        ], dtype=torch.float32)
        return item

# Create PyTorch data loaders
train_dataset = RedditDataset(train_encodings, train_df)
val_dataset = RedditDataset(val_encodings, val_df)
test_dataset = RedditDataset(test_encodings, test_df)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Define the regression head
class BertRegressionHead(torch.nn.Module):
    def __init__(self, bert_model):
        super(BertRegressionHead, self).__init__()
        self.bert = bert_model
        self.regressor = torch.nn.Linear(self.bert.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask, token_type_ids=None): # Include token_type_ids with default value
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids) # Pass token_type_ids to bert_model
        pooled_output = outputs.pooler_output
        return self.regressor(pooled_output)

# Initialize the model with the regression head
model = BertRegressionHead(bert_model)

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Setting RMSprop optimizer
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)

# Set up learning rate scheduler
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

# Set up training parameters
epochs = 3
total_steps = len(train_loader) * epochs

# Early stopping parameters
early_stop_count = 2
early_stop_patience = 1
best_val_loss = float('inf')
no_improvement_count = 0

# Fine-tune the BERT model
log_interval = 100
for epoch in range(epochs):
    model.train()  # re-enable dropout each epoch; the validation step below switches to eval mode
    running_loss = 0.0  # loss accumulated since the last log message
    epoch_loss = 0.0    # loss accumulated over the whole epoch
    for batch_idx, batch in enumerate(tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}')):
        # Separate inputs and labels
        inputs = {key: tensor.to(device) for key, tensor in batch.items() if key != 'labels'}  # Exclude 'labels' from inputs
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        outputs = model(**inputs) # Pass only the expected inputs
        loss = torch.nn.HuberLoss()(outputs, labels) # Calculate loss
        loss.backward()

        optimizer.step()

        # Accumulate loss
        running_loss += loss.item()
        epoch_loss += loss.item()

        # Log the average loss of the last log_interval batches
        if (batch_idx + 1) % log_interval == 0:
            avg_loss = running_loss / log_interval
            print(f'Batch {batch_idx + 1}/{len(train_loader)} - Loss: {avg_loss:.4f}')
            running_loss = 0.0

    # Log the average loss over the whole epoch
    avg_loss = epoch_loss / len(train_loader)
    print(f'Epoch {epoch + 1}/{epochs} - Average Loss: {avg_loss:.4f}')

    # Learning rate schedule step
    scheduler.step()
    # Evaluate the model on the validation set
    model.eval()
    val_loss = 0.0
    all_preds = np.array([], dtype=float).reshape(0, 3)  # Ensures 2D shape (0 rows, 3 columns)
    all_labels = np.array([], dtype=float).reshape(0, 3)

    criterion = torch.nn.HuberLoss()

    with torch.no_grad():
        for batch in tqdm(val_loader, desc='Validation'):
            inputs = {key: tensor.to(device) for key, tensor in batch.items() if key!= 'labels'}
            labels = batch['labels'].to(device)

            outputs = model(**inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

            # Collect predictions and labels for metrics
            all_preds = np.concatenate((all_preds, outputs.cpu().numpy()))
            all_labels = np.concatenate((all_labels, labels.cpu().numpy()))

    # Compute average validation loss
    avg_val_loss = val_loss / len(val_loader)
    print(f'Validation Loss: {avg_val_loss:.4f}')

    # Calculate regression metrics by dimension
    for i in range(3):  # Iterate over 'neg', 'pos', 'compound'
        mae = sklearn.metrics.mean_absolute_error(all_labels[:, i], all_preds[:, i])
        rmse = np.sqrt(sklearn.metrics.mean_squared_error(all_labels[:, i], all_preds[:, i]))
        # Accuracy, precision, recall and F1 are classification metrics and raise an error on continuous scores
        print(f'Dimension {i} - MAE: {mae:.4f}, RMSE: {rmse:.4f}')

    # Early stopping (based on avg_val_loss)
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        no_improvement_count = 0
    else:
        no_improvement_count += 1
        if no_improvement_count >= early_stop_patience:
            print(f'Early stopping after {epoch + 1} epochs without improvement.')
            break

# Save the trained model
torch.save(model.state_dict(), 'ft-bert-reddit-sentiment.pth')

## Print KDE plot for visualization of sentiment distributions (prediction vs. 'actual' sentiment)

# Separate predicted sentiment scores
all_preds_neg = all_preds[:, 0]
all_preds_pos = all_preds[:, 1]
all_preds_compound = all_preds[:, 2]

# Separate actual labels as well
all_labels_neg = all_labels[:, 0]
all_labels_pos = all_labels[:, 1]
all_labels_compound = all_labels[:, 2]

# Create the distribution plot
plt.figure(figsize=(12, 6))

# Use kernel density estimation (KDE) plots for better visualization of overlapping distributions
sns.kdeplot(all_labels_neg, label='Actual Negative')
sns.kdeplot(all_preds_neg, label='Predicted Negative')

sns.kdeplot(all_labels_pos, label='Actual Positive')
sns.kdeplot(all_preds_pos, label='Predicted Positive')

sns.kdeplot(all_labels_compound, label='Actual Compound')
sns.kdeplot(all_preds_compound, label='Predicted Compound')

plt.title('Distribution of Sentiment Scores')
plt.legend(title='Sentiment')
plt.show()


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import StepLR
from tqdm.notebook import tqdm
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import torch
import numpy as np
from scipy.stats import pearsonr

# Load pre-annotated finetuning dataset
df = finetuning_df

# Split the dataset into training, validation, and test sets // NOTE: the test set is not used anymore in this version of the code
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Fill NaN values with an empty string in the 'text' column
# (assigning the result back is more reliable than inplace=True on a column selection)
train_df = train_df.assign(text=train_df['text'].fillna(''))
val_df = val_df.assign(text=val_df['text'].fillna(''))
test_df = test_df.assign(text=test_df['text'].fillna(''))

# Tokenize and encode the text data
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')
test_encodings = tokenizer(list(test_df['text']), truncation=True, padding=True, max_length=128, return_tensors='pt')

# Create PyTorch datasets
class RedditDataset(Dataset):
    def __init__(self, encodings, df):
        self.encodings = encodings
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        item['labels'] = torch.tensor([
            self.df.iloc[idx]['neg'],
            self.df.iloc[idx]['pos'],
            self.df.iloc[idx]['compound'],
        ], dtype=torch.float32)
        return item

# Create PyTorch data loaders
train_dataset = RedditDataset(train_encodings, train_df)
val_dataset = RedditDataset(val_encodings, val_df)
test_dataset = RedditDataset(test_encodings, test_df)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Setting RMSprop optimizer
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)

# Set up learning rate scheduler
scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

# Set up training parameters
epochs = 3
total_steps = len(train_loader) * epochs

# Early stopping parameters
early_stop_count = 2
early_stop_patience = 1
best_val_loss = float('inf')
no_improvement_count = 0

# Fine-tune the BERT model
log_interval = 100
for epoch in range(epochs):
    model.train()  # re-enable dropout each epoch; the validation step below switches to eval mode
    running_loss = 0.0  # loss accumulated since the last log message
    epoch_loss = 0.0    # loss accumulated over the whole epoch
    for batch_idx, batch in enumerate(tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}')):
        # Separate inputs and labels so the model does not compute its own built-in loss
        inputs = {key: tensor.to(device) for key, tensor in batch.items() if key != 'labels'}
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        outputs = model(**inputs)
        loss = torch.nn.HuberLoss()(outputs.logits, labels)
        loss.backward()

        optimizer.step()

        # Accumulate loss
        running_loss += loss.item()
        epoch_loss += loss.item()

        # Log the average loss of the last log_interval batches
        if (batch_idx + 1) % log_interval == 0:
            avg_loss = running_loss / log_interval
            print(f'Batch {batch_idx + 1}/{len(train_loader)} - Loss: {avg_loss:.4f}')
            running_loss = 0.0

    # Log the average loss over the whole epoch
    avg_loss = epoch_loss / len(train_loader)
    print(f'Epoch {epoch + 1}/{epochs} - Average Loss: {avg_loss:.4f}')

    # Learning rate schedule step
    scheduler.step()
    # Evaluate the model on the validation set
    model.eval()
    val_loss = 0.0
    all_preds = np.array([], dtype=float).reshape(0, 3)  # Ensures 2D shape (0 rows, 3 columns)
    all_labels = np.array([], dtype=float).reshape(0, 3)

    criterion = torch.nn.HuberLoss()

    with torch.no_grad():
        for batch in tqdm(val_loader, desc='Validation'):
            inputs = {key: tensor.to(device) for key, tensor in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)

            outputs = model(**inputs)
            loss = criterion(outputs.logits, labels)
            val_loss += loss.item()

            # Collect predictions and labels for metrics
            all_preds = np.concatenate((all_preds, outputs.logits.cpu().numpy()))
            all_labels = np.concatenate((all_labels, labels.cpu().numpy()))

    # Compute average validation loss
    avg_val_loss = val_loss / len(val_loader)
    print(f'Validation Loss: {avg_val_loss:.4f}')

    # Calculate regression metrics by dimension
    for i in range(3):  # Iterate over 'neg', 'pos', 'compound'
        mae = sklearn.metrics.mean_absolute_error(all_labels[:, i], all_preds[:, i])
        rmse = np.sqrt(sklearn.metrics.mean_squared_error(all_labels[:, i], all_preds[:, i]))
        print(f'Dimension {i} - MAE: {mae:.4f}, RMSE: {rmse:.4f}')

    # Early stopping (based on avg_val_loss)
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        no_improvement_count = 0
    else:
        no_improvement_count += 1
        if no_improvement_count >= early_stop_patience:
            print(f'Early stopping after {epoch + 1} epochs without improvement.')
            break

# Save the trained model
torch.save(model.state_dict(), 'ft-bert-reddit-sentiment.pth')



## Print KDE plot for visualization of sentiment distributions (prediction vs. 'actual' sentiment)

# Separate predicted sentiment scores
all_preds_neg = all_preds[:, 0]
all_preds_pos = all_preds[:, 1]
all_preds_compound = all_preds[:, 2]

# Separate actual labels as well
all_labels_neg = all_labels[:, 0]
all_labels_pos = all_labels[:, 1]
all_labels_compound = all_labels[:, 2]

# Create the distribution plot
plt.figure(figsize=(12, 6))

# Use kernel density estimation (KDE) plots for better visualization of overlapping distributions
sns.kdeplot(all_labels_neg, label='Actual Negative')
sns.kdeplot(all_preds_neg, label='Predicted Negative')

sns.kdeplot(all_labels_pos, label='Actual Positive')
sns.kdeplot(all_preds_pos, label='Predicted Positive')

sns.kdeplot(all_labels_compound, label='Actual Compound')
sns.kdeplot(all_preds_compound, label='Predicted Compound')

plt.title('Distribution of Sentiment Scores')
plt.legend(title='Sentiment')
plt.show()

## Sentiment valence classification for analysis
This segment uses the fine-tuned BERT model to predict the sentiment valence of the Reddit posts as three separate scores: 'pos', 'neg', and 'compound'. After the sentiment analysis, the dataset is split into sentences, and each sentence inherits the scores of the utterance it came from.
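
To make the sentence-splitting step concrete, the following sketch copies utterance-level scores onto each sentence of an utterance. It is a simplified stand-in: the regular-expression splitter replaces NLTK's `sent_tokenize` only to keep the example dependency-free, and the function name, utterance ID, text and scores are all invented for illustration.

```python
import re

def split_with_scores(utterances):
    """Split each utterance into sentences and copy its scores onto every sentence."""
    rows = []
    for utt_id, (text, scores) in utterances.items():
        # Naive splitter on ., ! or ? followed by whitespace (not nltk's sent_tokenize)
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        for sent_idx, sentence in enumerate(sentences):
            rows.append({'sentence_id': f'{utt_id}_{sent_idx}',
                         'utterance_id': utt_id,
                         'sentence': sentence,
                         **scores})  # utterance-level scores, repeated per sentence
    return rows

utterances = {  # hypothetical ID, text and scores
    'a1': ("He was driving too fast. Nobody got hurt!",
           {'neg': 0.31, 'pos': 0.05, 'compound': -0.42}),
}
rows = split_with_scores(utterances)
for row in rows:
    print(row['sentence_id'], row['compound'], row['sentence'])
```

One consequence of this design is that sentences from the same utterance share identical scores, so they are not independent observations in later statistical tests.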

### Sentiment analysis

In [None]:
# path to the fine-tuned model weights

ft_model_path = 'ft-bert-reddit-sentiment.pth' #change for the actual path of the fine-tuned model weights

In [None]:
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained model
# Create an instance of the BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Move the model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Load fine-tuned model weights // NOTE: the path needs to be adapted, as it accesses the model directly from my Google Drive
model.load_state_dict(torch.load(ft_model_path, map_location=device))  # map_location lets GPU-trained weights load on a CPU-only runtime
model.to(device)
model.eval()

# Load Reddit dataset
df = analysis_df

# Preprocessing
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def preprocess_text(text):
    """Applies tokenization and any other preprocessing steps used during training"""
    tokenized_input = tokenizer(text, truncation=True, padding=True, max_length=128, return_tensors='pt')
    return tokenized_input

# Prediction function
def predict_sentiment(df):
    predictions = []
    for utterance in df['text']:
        inputs = preprocess_text(utterance)
        inputs = {key: tensor.to(device) for key, tensor in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)

        predictions.append(outputs.logits.cpu().numpy())

    # Add predictions as new columns directly to the DataFrame
    df['neg'] = [pred[0][0] for pred in predictions]
    df['pos'] = [pred[0][1] for pred in predictions]
    df['compound'] = [pred[0][2] for pred in predictions]
    return df

# Apply the prediction function to your DataFrame
df_with_predictions = predict_sentiment(df.copy())

# Save or use the predictions as needed
df_with_predictions.head(50)
#df_with_predictions.to_csv('reddit_sentiment_predictions.csv', index=False, sep="\t") # this line is commented out as it forwards the df to the next part without saving as csv

### Split into sentences

In [None]:
df_with_predictions.columns=['timestamp',
 'text',
 'speaker',
 'reply_to',
 'conversation_id',
 'meta_score',
 'meta_top_level_comment',
 'meta_retrieved_on',
 'meta_gilded',
 'meta_gildings',
 'meta_subreddit',
 'meta_stickied',
 'meta_permalink',
 'meta_author_flair_text',
 'vectors',
 'neg',
 'pos',
 'compound']
list(df_with_predictions.columns)

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd

def split_sentences_with_sentiment(df):
    sentences = []
    sentence_ids = []
    utterance_ids = []
    neg_scores = []
    pos_scores = []
    compound_scores = []
    subreddits = []

    for index, row in df.iterrows():
        text = row['text']
        for sent_idx, sentence in enumerate(sent_tokenize(text)):
            sentences.append(sentence)
            sentence_ids.append(f"{index}_{sent_idx}")
            utterance_ids.append(index)
            neg_scores.append(row['neg'])  # Copy 'neg' score
            pos_scores.append(row['pos'])  # Copy 'pos' score
            compound_scores.append(row['compound'])  # Copy 'compound' score
            subreddits.append(row['meta_subreddit']) # Copy subreddit

    new_df = pd.DataFrame({'sentence': sentences,
                           'sentence_id': sentence_ids,
                           'utterance_id': utterance_ids,
                           'subreddit': subreddits,
                           'neg_score': neg_scores,
                           'pos_score': pos_scores,
                           'compound_score': compound_scores})
    return new_df

df_sentences_sentiment = split_sentences_with_sentiment(df_with_predictions)
df_sentences_sentiment.to_csv('analysis_sentences_sentiment.csv', sep='\t')
print(len(df_sentences_sentiment))

## Extraction of Linguistic Features
This part of the code iterates through the sentence dataset and identifies progressive constructions. For each one it also extracts linguistic features, such as the verb category (action, stative, private, or other in the code below), the clause type, and the tense of the auxiliary, and stores them in the dataframe for later analysis.
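
The core of the identification is a pattern over part-of-speech tags: a form of *to be*, up to three following tokens, and a present participle (`VBG`), with the *going to* future excluded. The sketch below shows just that pattern on hand-tagged input. It assumes Penn Treebank tags supplied by the caller and deliberately leaves out the same-clause check and the filter on the word preceding the auxiliary that the full algorithm applies. The name `find_progressive` is hypothetical.

```python
BE_FORMS = {"is", "am", "are", "was", "were", "being", "been"}
ADVERB_TAGS = {"RB", "RBR", "RBS", "WRB", "MD"}

def find_progressive(tagged, window=3):
    """Return (be_form, intervening_words, participle) for the first match, else None.

    `tagged` is a list of (token, Penn Treebank tag) pairs.
    """
    for i, (token, tag) in enumerate(tagged):
        if not (tag.startswith("VB") and token.lower() in BE_FORMS):
            continue
        intervening = []
        # Look at up to `window` tokens after the auxiliary
        for j in range(i + 1, min(i + 1 + window, len(tagged))):
            next_token, next_tag = tagged[j]
            if next_tag in ADVERB_TAGS:
                intervening.append(next_token)
            elif next_tag == "VBG":
                is_going_to = (next_token.lower() == "going"
                               and j + 2 < len(tagged)
                               and tagged[j + 1][0].lower() == "to"
                               and tagged[j + 2][1].startswith("VB"))
                if is_going_to:
                    break  # "going to" + verb marks the future, not a progressive
                return token, intervening, next_token
    return None

progressive = find_progressive(
    [("She", "PRP"), ("was", "VBD"), ("really", "RB"), ("enjoying", "VBG"), ("it", "PRP")])
future = find_progressive(
    [("I", "PRP"), ("am", "VBP"), ("going", "VBG"), ("to", "TO"), ("leave", "VB")])
print(progressive)
print(future)
```

Keeping the tag pattern separate from the clause analysis makes it easier to test the window and the *going to* exclusion in isolation.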



In [None]:
import spacy
import pandas as pd
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
import nltk

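# nltk dependencies
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name that newer NLTK releases look up for pos_tag
nltk.download('wordnet')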

# Load the SpaCy English model
nlp = spacy.load('en_core_web_sm', disable=["ner", "lemmatizer"])

# using WordNet lexnames as described in the methodology section to classify present participles
def categorize_verb(verb):
    lemma = WordNetLemmatizer().lemmatize(verb, pos='v')
    synsets = wn.synsets(lemma, pos=wn.VERB)
    if synsets:
        for synset in synsets:
            category = synset.lexname().split('.')[1]
            if category in ['motion', 'competition', 'change', 'communication']:
                return "action"
            elif category == 'stative':
                return "stative"
            elif category in ['cognition', 'perception']:
                return "private"
    return "other"

# using SpaCy dependency tree to distinguish main and subordinate clauses
def identify_clause(sentence, index):
    doc = nlp(sentence)
    token = doc[index]
    clause_type = "main_clause"
    while token.head != token:
        if token.dep_ in ('ccomp', 'xcomp', 'advcl', 'relcl', 'conj'):
            clause_type = "subclause"
            break
        token = token.head
    clause = ' '.join([tok.text_with_ws for tok in token.subtree]).strip()
    return clause_type, clause

# using a dictionary to distinguish present and past tense for the auxiliary verb
def get_tense(verb):
    tenses = {
        "is": "present", "am": "present", "are": "present",
        "was": "past", "were": "past",
        "being": "present",
        "been": "past"
    }
    return tenses.get(verb.lower(), "unknown")

# progressive identification algorithm
def identify_progressives(df):
    # initialising variables for results
    df['has_progressive'] = 0
    df['form_of_to_be'] = ''
    df['intervening_words'] = ''
    df['progressive_verb'] = ''
    df['progressive_category'] = ''
    df['clause_type'] = ''
    df['tense'] = ''

    for index, row in df.iterrows():        # iterating through data frame, "read" and tokenize text into words and performing POS tagging
        text = str(row['sentence'])
        doc = nlp(text)
        tokens = [token.text for token in doc]
        tagged_tokens = pos_tag(tokens)

        progressive_count = 0               # initialising count variable (helper) for binary classification (has progressive or not)
        for i, (token, tag) in enumerate(tagged_tokens):
            if tag.startswith('VB') and token.lower() in ["is", "am", "are", "was", "were", "being", "been"]:     # identifying auxiliary form of to be
                has_neighbor = False                                                                              # helper in case this is end of sequence
                intervening_words = []                                                                            # initialising list for intervening words
                aux_token = doc[i]                                                                                # get the corresponding SpaCy token
                tense = get_tense(token)                                                                          # calling function to get tense of (to) be
                aux_clause_type, aux_clause = identify_clause(text, i)                                            # identify the clause for the auxiliary verb

                for j in range(i + 1, min(i + 4, len(tagged_tokens))):                                            # setting window for intervening words
                    next_token, next_tag = tagged_tokens[j]
                    if next_tag in ["RB", "RBR", "RBS", "WRB", "MD"]:                                             # storing adverbs and modals
                        intervening_words.append(next_token)
                    elif next_tag == "VBG":
                        if i > 0 and tagged_tokens[i - 1][0].lower() != "to":
                            if next_token.lower() == "going" and j + 2 < len(tagged_tokens) and tagged_tokens[j + 1][0].lower() == "to" and tagged_tokens[j + 2][1].startswith("VB"):
                                has_neighbor = False
                                break                                                                              # excluding instances of going to future (VBG followed by "to"+VB)
                            else:
                                vbg_clause_type, vbg_clause = identify_clause(text, j)
                                if vbg_clause == aux_clause:                                                       # if not going-to future: check whether components are in same clause
                                    prev_word, prev_tag = tagged_tokens[i - 1]
                                    if prev_tag not in ["IN", "TO", "MD"]:                                         # if "proper" progressive: store data
                                        df.at[index, 'form_of_to_be'] = token
                                        df.at[index, 'intervening_words'] = ' '.join(intervening_words)
                                        df.at[index, 'progressive_verb'] = next_token
                                        df.at[index, 'progressive_category'] = categorize_verb(next_token)
                                        df.at[index, 'clause_type'] = aux_clause_type
                                        df.at[index, 'tense'] = tense
                                        progressive_count += 1
                                        print(f"Progressive found: {token} {' '.join(intervening_words)} {next_token}; "
                                              f"Category: {categorize_verb(next_token)}; "
                                              f"Clause Type: {aux_clause_type}; Tense: {tense}")
                                        has_neighbor = True
                                        break

        df.at[index, 'has_progressive'] = 1 if progressive_count > 0 else 0                                       # turn count into binary var for logistic regression

    return df                                                                                                     # return updated data frame

df_sentences_sentiment = identify_progressives(df_sentences_sentiment.copy())
#df_sentences_sentiment.to_csv('all_progressive_forms.csv')

## Create Visualizations
The following code plots histograms of the negative, positive, and compound sentiment scores, first for all sentences (upper row) and then for the subset of sentences that contain a progressive (lower row).
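
Raw counts can make the two groups hard to compare, because the subset with progressives is smaller than the full dataset. One option, sketched here on synthetic scores, is to plot both groups with identical bin edges and `density=True`, so that each histogram integrates to 1. The data and the names `all_scores` and `prog_scores` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the notebook's score columns (assumption, not real data)
rng = np.random.default_rng(0)
all_scores = rng.uniform(-1, 1, size=1000)
prog_scores = all_scores[:200]          # pretend these rows contain a progressive

bins = np.linspace(-1, 1, 21)           # identical bin edges for both groups

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True, figsize=(6, 6))
# density=True rescales the bar heights so that each histogram has total area 1
counts_all, edges, _ = axes[0].hist(all_scores, bins=bins, density=True)
counts_prog, _, _ = axes[1].hist(prog_scores, bins=bins, density=True)
axes[0].set_title('All sentences')
axes[1].set_title('Sentences with progressives')
axes[1].set_xlabel('compound score')
fig.tight_layout()                      # in a notebook the figure renders at the end of the cell
```

With `sharey=True` both panels use the same y-axis, so differences in shape rather than in sample size stand out.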

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 8))

# Data for sentences with progressives
df_prog = df_sentences_sentiment.query('has_progressive > 0')

# Data for all sentences
df_all = df_sentences_sentiment

# --- Upper Row: All Sentences ---
axes[0, 0].hist(df_all['neg_score'])
axes[0, 0].set_title('Negative Sentiment Distribution')
yaxis_limits = axes[0, 0].get_ylim()  # Store limits

axes[0, 1].hist(df_all['pos_score'])
axes[0, 1].set_title('Positive Sentiment Distribution')
axes[0, 1].set_ylim(yaxis_limits)  # Enforce common limits

axes[0, 2].hist(df_all['compound_score'])
axes[0, 2].set_title('Compound Sentiment Distribution')
axes[0, 2].set_ylim(yaxis_limits)  # Enforce common limits

fig.suptitle('Distribution of Sentiment Scores', fontsize=16)

# --- Lower Row: Sentences with Progressives ---
# Repeat the process with new limits
axes[1, 0].hist(df_prog['neg_score'])
axes[1, 0].set_title('Negative Sentiment Distribution')
yaxis_limits_prog = axes[1, 0].get_ylim()

axes[1, 1].hist(df_prog['pos_score'])
axes[1, 1].set_title('Positive Sentiment Distribution')
axes[1, 1].set_ylim(yaxis_limits_prog)

axes[1, 2].hist(df_prog['compound_score'])
axes[1, 2].set_title('Compound Sentiment Distribution')
axes[1, 2].set_ylim(yaxis_limits_prog)

fig.text(-0.001, 0.75, 'All Sentences', ha='center', va='center', rotation='vertical', fontsize=14)
fig.text(-0.001, 0.25, 'Sentences with Progressives', ha='center', va='center', rotation='vertical', fontsize=14)

plt.tight_layout()
plt.show()

## Statistical testing
The following code uses the prepared dataset to fit logistic regression models that relate the sentiment scores to the use of the progressive, and to the verb category, tense, and clause type of the progressive constructions.
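
To read the `Logit` summaries, recall the model being fitted with a single predictor: `log(p / (1 - p)) = b0 + b1 * score`, where `p` is the probability of the outcome (for example, that a sentence contains a progressive). The sketch below shows how a coefficient translates into an odds ratio and into predicted probabilities. The coefficient values and the helper `predicted_probability` are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
import math

def predicted_probability(intercept, slope, score):
    """Probability of the outcome under a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))

# Hypothetical coefficients, for illustration only (not results from this notebook)
intercept, slope = -2.0, 0.5

# Multiplicative change in the odds for a one-unit increase in the score
odds_ratio = math.exp(slope)
p_low = predicted_probability(intercept, slope, 0.0)
p_high = predicted_probability(intercept, slope, 1.0)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"P(outcome | score=0): {p_low:.3f}")
print(f"P(outcome | score=1): {p_high:.3f}")
```

For a fitted statsmodels result, the same odds ratios can be obtained with `np.exp(result.params)`.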

### General correlation of sentiment and use of the progressive

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Work on the sentence-level dataframe returned by identify_progressives()
df = df_sentences_sentiment

binary_has_progressive = df['has_progressive'].astype(int)  # Ensure the 0/1 indicator has an integer dtype
neg_score = df['neg_score']
pos_score = df['pos_score']
compound_score = df['compound_score']

scores = [neg_score, pos_score, compound_score]

for score in scores:
  X = sm.add_constant(score)  # Add a constant for the intercept term
  model = sm.Logit(binary_has_progressive, X)
  result = model.fit()
  print(result.summary())

### Creation of dummy variables and testing for correlations of sentiment and verb category
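
Because the functions in this part pass `drop_first=True`, `pd.get_dummies` omits the column for the first category level in sorted order, and rows with a missing value get 0 in every dummy column. The toy example below illustrates both effects. Its labels are hypothetical: the real ones come from `categorize_verb`, so which columns survive in the notebook depends on the labels actually present in the data.

```python
import pandas as pd

# Hypothetical labels; None stands for a sentence without a stored category
toy = pd.DataFrame(
    {"progressive_category": ["action", "stative", None, "private", "action"]}
)

# One column per observed level (missing values produce no column of their own)
full = pd.get_dummies(toy["progressive_category"], prefix="category")

# drop_first=True removes the first level in sorted order, here "action"
reduced = pd.get_dummies(toy["progressive_category"], prefix="category", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(full.iloc[2].tolist())   # the row with the missing label is all zeros/False
```

If a level that is needed later would be the one dropped, calling `get_dummies` without `drop_first=True` keeps all columns.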

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

def create_progressive_category_dummies(df):
    # Create a copy of the DataFrame to avoid modifying the original
    df_copy = df.copy()

    # Create dummy variables for the 'progressive_category' column
    progressive_category_dummies = pd.get_dummies(df_copy['progressive_category'], prefix='category', drop_first=True)

    # Concatenate the dummy variables with the original DataFrame
    df_with_dummies = pd.concat([df_copy, progressive_category_dummies], axis=1)

    # Replace remaining NaN values with 0 (applies to every column, not only the dummies)
    df_with_dummies = df_with_dummies.fillna(0)

    return df_with_dummies


def create_progressive_clause_dummies(df):
    # Create a copy of the DataFrame to avoid modifying the original
    df_copy = df.copy()

    # Create dummy variables for the 'clause_type' column
    clause_dummies = pd.get_dummies(df_copy['clause_type'], prefix='cl', drop_first=True)

    # Concatenate the dummy variables with the original DataFrame
    df_with_dummies = pd.concat([df_copy, clause_dummies], axis=1)

    # Replace remaining NaN values with 0 (applies to every column, not only the dummies)
    df_with_dummies = df_with_dummies.fillna(0)

    return df_with_dummies

def create_tense_dummies(df):
    # Create a copy of the DataFrame to avoid modifying the original
    df_copy = df.copy()

    # Create dummy variables for the 'tense' column
    tense_dummies = pd.get_dummies(df_copy['tense'], prefix='t', drop_first=True)

    # Concatenate the dummy variables with the original DataFrame
    df_with_dummies = pd.concat([df_copy, tense_dummies], axis=1)

    # Replace remaining NaN values with 0 (applies to every column, not only the dummies)
    df_with_dummies = df_with_dummies.fillna(0)

    return df_with_dummies


# Add dummy variables for verb category, clause type, and tense
df = create_progressive_category_dummies(df_sentences_sentiment)
dfc = create_progressive_clause_dummies(df)
dfc = create_tense_dummies(dfc)

In [None]:
binary_action = dfc['category_action'].astype(int)
binary_private = dfc['category_private'].astype(int)
binary_stative = dfc['category_stative'].astype(int)
neg_score = dfc['neg_score']
pos_score = dfc['pos_score']
compound_score = dfc['compound_score']

scores = [neg_score, pos_score, compound_score]
deps = [binary_action, binary_private, binary_stative]

for score in scores:
  for dep in deps:
    X = sm.add_constant(score)  # Add a constant for the intercept term
    model = sm.Logit(dep, X)
    result = model.fit()
    print(result.summary())

In [None]:
dfcf = dfc.query('has_progressive > 0')             # restrict to sentences that contain a progressive
binary_presentf = dfcf['t_present'].astype(int)
binary_pastf = dfcf['t_past'].astype(int)
neg_scoref = dfcf['neg_score']
pos_scoref = dfcf['pos_score']
compound_scoref = dfcf['compound_score']

scores = [neg_scoref, pos_scoref, compound_scoref]  # use the filtered scores so rows match the filtered outcomes
deps = [binary_presentf, binary_pastf]

for score in scores:
  for dep in deps:
    X = sm.add_constant(score)  # Add a constant for the intercept term
    model = sm.Logit(dep, X)
    result = model.fit()
    print(result.summary())


In [None]:
dfcf = dfc.query('has_progressive > 0')             # restrict to sentences that contain a progressive
binary_mainf = dfcf['cl_main_clause'].astype(int)
binary_subf = dfcf['cl_subclause'].astype(int)
neg_scoref = dfcf['neg_score']
pos_scoref = dfcf['pos_score']
compound_scoref = dfcf['compound_score']

scores = [neg_scoref, pos_scoref, compound_scoref]  # use the filtered scores so rows match the filtered outcomes
deps = [binary_mainf, binary_subf]

for score in scores:
  for dep in deps:
    X = sm.add_constant(score)  # Add a constant for the intercept term
    model = sm.Logit(dep, X)
    result = model.fit()
    print(result.summary())


### Additional statistical info
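
The per-category counts, mean scores, and shares that this part computes cell by cell can also be obtained in one pass with `groupby`. This is a minimal sketch on a toy frame, assuming a label column like `progressive_category`; the rows and scores are made up for illustration.

```python
import pandas as pd

# Toy data with hypothetical labels and scores (not taken from the corpus)
toy = pd.DataFrame({
    "progressive_category": ["action", "action", "stative", "private"],
    "compound_score": [0.2, -0.4, 0.6, 0.0],
})

# Count and mean compound score per verb category
summary = toy.groupby("progressive_category")["compound_score"].agg(["count", "mean"])

# Share of each category among all rows with a category
summary["share"] = summary["count"] / summary["count"].sum()

print(summary)
```

The shares here are proportions between 0 and 1; multiply by 100 for percentages.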

In [None]:
counts = dfc[['category_action', 'category_stative', 'category_private', 'category_other']].sum()
df_filt = dfc.query('has_progressive > 0')
print("Sentences with progressives:")
print(len(df_filt))
print("Counts for each column:")
print(counts)

In [None]:
df_action=dfc.query('category_action>0')
df_states=dfc.query('category_stative>0')
df_mental=dfc.query('category_private>0')
df_uncat=dfc.query('category_other>0')

print("action mean \n", df_action['compound_score'].mean())
print("stative mean\n", df_states['compound_score'].mean())
print("private mean\n", df_mental['compound_score'].mean())
print("uncat mean\n", df_uncat['compound_score'].mean())
print("reference\n", df['compound_score'].mean())

In [None]:
pc_action=len(df_action)/len(df_filt)
pc_states=len(df_states)/len(df_filt)
pc_mental=len(df_mental)/len(df_filt)
pc_others=len(df_uncat)/len(df_filt)

print("proportions of sentences with progressives\n action: ",pc_action,"\n stative: ",pc_states,"\n private: ",pc_mental,"\n others: ", pc_others)

In [None]:
df_main=dfc.query('cl_main_clause>0')
df_sub=dfc.query('cl_subclause>0')

pc_main=len(df_main)/len(df_filt)
pc_sub=len(df_sub)/len(df_filt)

print("main clause: ",pc_main)
print("sub clause: ",pc_sub)