# Labelling of English Comments
In this file, we use five state-of-the-art sentiment models from HuggingFace to label our English comments by sentiment (negative, neutral or positive). We apply majority voting: a comment is kept only if at least 3 of the 5 models assign it the same label. To reduce noise (comments found on YouTube are highly variable), a model's prediction only counts if its confidence score is above 80%. Furthermore, we keep track of the original comments at all times, because we later translate and augment those rather than the fully processed ones. We found no comments that were confidently classified as neutral, so we omitted that class from the analysis.
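
The voting rule described above can be sketched for a single comment as follows. This is a minimal illustration, not the code used further down: the function name `majority_vote` and the toy votes are hypothetical, and it assumes a model's vote is discarded when its score is not above the threshold, as in the filtering cells later in this file.

```python
from collections import Counter

def majority_vote(votes, threshold=0.80, min_agree=3):
    """Return (label, mean_score) if enough confident models agree, else None.

    votes: list of (label, score) pairs, one pair per model.
    """
    # Discard votes whose confidence is not above the threshold
    confident = [(label, score) for label, score in votes if score > threshold]
    if not confident:
        return None
    # Find the most frequent label among the confident votes
    label, count = Counter(lab for lab, _ in confident).most_common(1)[0]
    if count < min_agree:
        return None
    # Average the scores of the models that voted for the winning label
    agreeing = [score for lab, score in confident if lab == label]
    return label, sum(agreeing) / len(agreeing)

# Toy example: three confident votes for label 2 (positive)
votes = [(2, 0.97), (2, 0.91), (2, 0.85), (0, 0.60), (1, 0.83)]
print(majority_vote(votes))
```

With five models and `min_agree=3`, at most one label can reach the required count, so the result is unambiguous.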

In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from google.colab import files
from google.colab import drive
import glob
import zipfile
from collections import Counter, defaultdict

In [2]:
# Open a file upload dialog and select all files to upload.
# If the files are already uploaded, just press 'Cancel Upload'.
# Note that we upload the English comments both after and before the
# filtering phase. We need to keep the original ones, because we later
# translate them to other languages and apply language-specific preprocessing
# (such as stopword removal) to prepare them for fine-tuning the models.
uploaded = files.upload()

Saving comments_original_split_0.csv to comments_original_split_0.csv
Saving comments_processed_split_0.csv to comments_processed_split_0.csv


In [34]:
# Set the path to the data
# On local machine use the relative path, for example
# path = 'NLP labelled data preview/english set/'
# On Google Colab use this path
# '/content/'
path = '/content/'

In [35]:
# Denote the split you are processing
SPLIT = 0

In [36]:
# Load the dataset
# For Mac users : do english_data/english_data/*.csv
# For Windows users : do english_data/*.csv
all_english_comments = pd.read_csv(path + 'comments_processed_split_{}.csv'.format(SPLIT))
all_english_comments_original = pd.read_csv(path + 'comments_original_split_{}.csv'.format(SPLIT))

In [37]:
# Now we prepare for the labelling phase using a pre-trained state-of-the-art model

# Turn dataframe into a list
comments = all_english_comments['Comment'].tolist()
comments_original = all_english_comments_original['Comment'].tolist()

# Turn all comments into strings
comments = [str(comment) for comment in comments]
comments_original = [str(comment) for comment in comments_original]

In [38]:
# Ensure that both lists contain the same number of comments
assert len(comments) == len(comments_original), 'The number of comments is not equal!'
print("We have : {} many comments".format(len(comments)))

We have : 14450 many comments


In [39]:
# Load the different models, trained on different datasets
tokenizer_1 = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")
model_1 = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

tokenizer_2 = AutoTokenizer.from_pretrained("aychang/roberta-base-imdb")
model_2 = AutoModelForSequenceClassification.from_pretrained("aychang/roberta-base-imdb")

tokenizer_3 = AutoTokenizer.from_pretrained("siebert/sentiment-roberta-large-english")
model_3 = AutoModelForSequenceClassification.from_pretrained("siebert/sentiment-roberta-large-english")

tokenizer_4 = AutoTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")
model_4 = AutoModelForSequenceClassification.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")

tokenizer_5 = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual")
model_5 = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual")



Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [40]:
# Move the models to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_1.to(device)
model_2.to(device)
model_3.to(device)
model_4.to(device)
model_5.to(device)

XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768,

In [41]:
# Initialize the pipelines
# Passing the device explicitly keeps inference on the device chosen above
classifier_1 = pipeline('sentiment-analysis', model=model_1, tokenizer=tokenizer_1, device=device)
classifier_2 = pipeline('sentiment-analysis', model=model_2, tokenizer=tokenizer_2, device=device)
classifier_3 = pipeline('sentiment-analysis', model=model_3, tokenizer=tokenizer_3, device=device)
classifier_4 = pipeline('sentiment-analysis', model=model_4, tokenizer=tokenizer_4, device=device)
classifier_5 = pipeline('sentiment-analysis', model=model_5, tokenizer=tokenizer_5, device=device)


In [None]:
# Predict sentiment labels for each classifier
predictions_1 = classifier_1(comments)
predictions_2 = classifier_2(comments)
predictions_3 = classifier_3(comments)
predictions_4 = classifier_4(comments)
predictions_5 = classifier_5(comments)
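
Running each classifier over all comments in one call works, but processing the list in chunks makes it easier to report progress or to resume after an interruption. The sketch below is one possible way to do that. The helper `classify_in_chunks` and the stand-in `dummy_classifier` are hypothetical names used only for illustration; a real pipeline object such as `classifier_1` would be passed in their place. Extra keyword arguments are forwarded unchanged, so a caller could supply tokenizer options (for example a truncation flag for very long comments), assuming the installed transformers version accepts them.

```python
def classify_in_chunks(classifier, texts, chunk_size=256, **kwargs):
    """Call `classifier` on successive slices of `texts` and concatenate the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # Each call is expected to return one result dict per input text
        results.extend(classifier(chunk, **kwargs))
    return results

# Stand-in for a real pipeline, used here only so the sketch runs on its own
def dummy_classifier(batch, **kwargs):
    return [{'label': 'positive', 'score': 0.9} for _ in batch]

example_predictions = classify_in_chunks(dummy_classifier, ['a', 'b', 'c'], chunk_size=2)
print(len(example_predictions))
```

The output order matches the input order, so the results stay aligned with `comments` and `comments_original`.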

In [None]:
# Extract the scores from the predictions
scores_1 = [prediction['score'] for prediction in predictions_1]
scores_2 = [prediction['score'] for prediction in predictions_2]
scores_3 = [prediction['score'] for prediction in predictions_3]
scores_4 = [prediction['score'] for prediction in predictions_4]
scores_5 = [prediction['score'] for prediction in predictions_5]
# Extract the labels from the predictions
labels_1 = [prediction['label'] for prediction in predictions_1]
labels_2 = [prediction['label'] for prediction in predictions_2]
labels_3 = [prediction['label'] for prediction in predictions_3]
labels_4 = [prediction['label'] for prediction in predictions_4]
labels_5 = [prediction['label'] for prediction in predictions_5]



In [None]:
# Extract the unique labels
unique_labels_1 = set(labels_1)
print(unique_labels_1)
unique_labels_2 = set(labels_2)
print(unique_labels_2)
unique_labels_3 = set(labels_3)
print(unique_labels_3)
unique_labels_4 = set(labels_4)
print(unique_labels_4)
unique_labels_5 = set(labels_5)
print(unique_labels_5)


In [None]:
# Set up the right labels for the different models
# We want all labels in the same numeric format: 0 is negative, 1 is neutral and 2 is positive
# Model 1 gives negative, neutral, positive as labels, so we will transform them to 0, 1, 2
labels_1 = [0 if label == 'negative' else 1 if label == 'neutral' else 2 for label in labels_1]
# Model 2 gives neg and pos as labels, so we will transform them to 0, 2
labels_2 = [0 if label == 'neg' else 2 for label in labels_2]
# Model 3 gives only POSITIVE, NEGATIVE as labels, so we will transform them to 0, 2
labels_3 = [0 if label == 'NEGATIVE' else 2 for label in labels_3]
# Model 4 gives negative, neutral, positive as labels, so we will transform them to 0,1,2
labels_4 = [0 if label == 'negative' else 1 if label == 'neutral' else 2 for label in labels_4]
# Model 5 gives negative, neutral, positive as labels, so we will transform them to 0,1,2
labels_5 = [0 if label == 'negative' else 1 if label == 'neutral' else 2 for label in labels_5]
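
The conditional expressions above map every label that is not explicitly matched to 2 (positive), so a misspelled or unexpected label would pass silently. A lookup table that raises on unknown labels is one possible alternative. This is a sketch under assumptions: `LABEL_MAP` and `normalize_labels` are hypothetical names, and the table only covers the label strings mentioned in the comments above.

```python
# Map the label strings of all five models to a shared numeric format
LABEL_MAP = {
    'negative': 0, 'neg': 0,
    'neutral': 1,
    'positive': 2, 'pos': 2,
}

def normalize_labels(raw_labels):
    """Convert raw label strings to 0 (negative), 1 (neutral) or 2 (positive)."""
    normalized = []
    for label in raw_labels:
        key = str(label).lower()  # 'NEGATIVE' and 'negative' are treated alike
        if key not in LABEL_MAP:
            raise ValueError('Unexpected label: {!r}'.format(label))
        normalized.append(LABEL_MAP[key])
    return normalized

print(normalize_labels(['NEGATIVE', 'pos', 'neutral']))
```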




In [None]:
# Only keep comments with a confidence score above 0.80
conf_score = 0.80

high_confidence_comments_1 = []
high_confidence_comments_2 = []
high_confidence_comments_3 = []
high_confidence_comments_4 = []
high_confidence_comments_5 = []

high_confidence_comments_1_original = []
high_confidence_comments_2_original = []
high_confidence_comments_3_original = []
high_confidence_comments_4_original = []
high_confidence_comments_5_original = []

high_confidence_predictions_1 = []
high_confidence_predictions_2 = []
high_confidence_predictions_3 = []
high_confidence_predictions_4 = []
high_confidence_predictions_5 = []

high_confidence_scores_1 = []
high_confidence_scores_2 = []
high_confidence_scores_3 = []
high_confidence_scores_4 = []
high_confidence_scores_5 = []

In [None]:
# For model 1 :
for i in range(len(scores_1)):
    if scores_1[i] > conf_score:
        high_confidence_predictions_1.append(labels_1[i])
        high_confidence_comments_1.append(comments[i])
        high_confidence_comments_1_original.append(comments_original[i])
        high_confidence_scores_1.append(scores_1[i])


In [None]:
# For model 2 :
for i in range(len(scores_2)):
    if scores_2[i] > conf_score:
        high_confidence_predictions_2.append(labels_2[i])
        high_confidence_comments_2.append(comments[i])
        high_confidence_comments_2_original.append(comments_original[i])
        high_confidence_scores_2.append(scores_2[i])

In [None]:
# For model 3 :
for i in range(len(scores_3)):
    if scores_3[i] > conf_score:
        high_confidence_predictions_3.append(labels_3[i])
        high_confidence_comments_3.append(comments[i])
        high_confidence_comments_3_original.append(comments_original[i])
        high_confidence_scores_3.append(scores_3[i])

In [None]:
# For model 4 :
for i in range(len(scores_4)):
    if scores_4[i] > conf_score:
        high_confidence_predictions_4.append(labels_4[i])
        high_confidence_comments_4.append(comments[i])
        high_confidence_comments_4_original.append(comments_original[i])
        high_confidence_scores_4.append(scores_4[i])

In [None]:
# For model 5 :
for i in range(len(scores_5)):
    if scores_5[i] > conf_score:
        high_confidence_predictions_5.append(labels_5[i])
        high_confidence_comments_5.append(comments[i])
        high_confidence_comments_5_original.append(comments_original[i])
        high_confidence_scores_5.append(scores_5[i])
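
The five loops above differ only in the lists they read and write, so the same filtering step could be expressed once as a function. The following is a minimal sketch, with `filter_by_confidence` as a hypothetical helper name and toy inputs. It keeps the strict `>` comparison used in the loops.

```python
def filter_by_confidence(comments, comments_original, labels, scores, threshold=0.80):
    """Keep only the entries whose score is strictly above `threshold`.

    Returns four aligned lists: processed comments, original comments,
    labels and scores.
    """
    kept_comments, kept_originals, kept_labels, kept_scores = [], [], [], []
    for comment, original, label, score in zip(comments, comments_original, labels, scores):
        if score > threshold:
            kept_comments.append(comment)
            kept_originals.append(original)
            kept_labels.append(label)
            kept_scores.append(score)
    return kept_comments, kept_originals, kept_labels, kept_scores

# Toy example: the second entry sits exactly on the threshold and is dropped
result = filter_by_confidence(['a', 'b', 'c'], ['A', 'B', 'C'], [0, 2, 1], [0.9, 0.8, 0.95])
print(result)
```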


In [None]:
# Combining all comments and their respective labels
combined_comments = high_confidence_comments_1 + high_confidence_comments_2 + high_confidence_comments_3 + high_confidence_comments_4 + high_confidence_comments_5
combined_comments_original = high_confidence_comments_1_original + high_confidence_comments_2_original + high_confidence_comments_3_original + high_confidence_comments_4_original + high_confidence_comments_5_original
combined_predictions = high_confidence_predictions_1 + high_confidence_predictions_2 + high_confidence_predictions_3 + high_confidence_predictions_4 + high_confidence_predictions_5
combined_scores = high_confidence_scores_1 + high_confidence_scores_2 + high_confidence_scores_3 + high_confidence_scores_4 + high_confidence_scores_5

# Counting the occurrence of each comment
comment_counter = Counter(combined_comments)
comment_original_counter = Counter(combined_comments_original)



# Filtering comments that appear in at least three models with high confidence score
filtered_comments = {comment for comment, count in comment_counter.items() if count >= 3}
filtered_comments_original = {comment for comment, count in comment_original_counter.items() if count >= 3}



# Dictionary to keep track of labels for each comment
comment_labels = defaultdict(list)
comment_original_labels = defaultdict(list)

# Dictionary to keep track of scores for each comment
comment_scores = defaultdict(list)
comment_original_scores = defaultdict(list)

# Populate the dictionary with labels and scores for each comment
for comment, score, label in zip(combined_comments, combined_scores, combined_predictions):
    if comment in filtered_comments:
        comment_labels[comment].append(label)
        comment_scores[comment].append((score, label))

# Also for the original ones
for comment, score, label in zip(combined_comments_original, combined_scores, combined_predictions):
    if comment in filtered_comments_original:
        comment_original_labels[comment].append(label)
        comment_original_scores[comment].append((score, label))


# Keep only labels that appear at least three times for each comment
final_comments = []
final_labels = []
final_scores = []
for comment, labels in comment_labels.items():
    label_count = Counter(labels)
    filtered_labels = [label for label, count in label_count.items() if count >= 3]
    if filtered_labels:
        final_comments.append(comment)
        final_labels.append(filtered_labels)
        # Also store the mean score of the models that provided the label gathered in filtered labels
        final_scores.append(np.mean([x[0] for x in comment_scores[comment] if x[1] in filtered_labels]))



final_comments_original = []
final_labels_original = []
final_scores_original = []
for comment, labels in comment_original_labels.items():
    label_count = Counter(labels)
    filtered_labels = [label for label, count in label_count.items() if count >= 3]
    if filtered_labels:
        final_comments_original.append(comment)
        final_labels_original.append(filtered_labels)
        final_scores_original.append(np.mean([x[0] for x in comment_original_scores[comment] if x[1] in filtered_labels]))



# The labels are all stored in lists, so we need to flatten them
final_labels = [label for labels in final_labels for label in labels]
final_labels_original = [label for labels in final_labels_original for label in labels]

print("Filtered Comments:", final_comments)
print("Respective Labels:", final_labels)
print("Respective mean scores:" , final_scores)


assert len(final_comments) == len(final_labels) == len(final_scores), "Comments, labels and scores have different lengths!"



# Check how many predictions we have in the respective classes
print("We have ", final_labels.count(0), " negative predictions.")
print("We have ", final_labels.count(1), " neutral predictions.")
print("We have ", final_labels.count(2), " positive predictions.")

# Seeing the lack of neutral predictions, we will omit this class for further analysis.
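
One caveat of counting by comment text: if the same text occurs several times in the dataset, each occurrence adds to the count, so a comment could reach three "votes" even when only one model was confident about it. A possible variant is to vote per position in the list instead of per text. The sketch below assumes that all label and score lists are aligned with the comment list; `vote_by_position` and the toy data are hypothetical.

```python
from collections import Counter

def vote_by_position(labels_per_model, scores_per_model, threshold=0.80, min_agree=3):
    """Return {index: (label, mean_score)} for comments with enough agreeing models.

    labels_per_model / scores_per_model: one list per model, all the same length.
    """
    n_comments = len(labels_per_model[0])
    result = {}
    for i in range(n_comments):
        # Collect the confident votes of all models for comment i
        confident = [
            (labels[i], scores[i])
            for labels, scores in zip(labels_per_model, scores_per_model)
            if scores[i] > threshold
        ]
        counts = Counter(label for label, _ in confident)
        for label, count in counts.items():
            if count >= min_agree:
                agreeing = [score for lab, score in confident if lab == label]
                result[i] = (label, sum(agreeing) / len(agreeing))
    return result

# Three identical comments; only the first model is confident about them
toy_comments = ['nice', 'nice', 'nice']
toy_labels = [[2, 2, 2], [2, 2, 2], [0, 0, 0], [1, 1, 1], [2, 2, 2]]
toy_scores = [[0.95] * 3, [0.5] * 3, [0.5] * 3, [0.5] * 3, [0.5] * 3]

# Counting by text sees 'nice' three times in the first model's confident list
text_count = Counter(toy_comments)['nice']
print(text_count >= 3)
# Voting by position finds no comment with three confident, agreeing models
print(vote_by_position(toy_labels, toy_scores))
```

The returned indices can be used to look up both `comments` and `comments_original`, which also keeps the processed and original outputs aligned with each other.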


In [None]:
# Check how many comments are left after filtering by confidence score
print("We have ", len(final_comments), " comments left after filtering by confidence score " , conf_score , " .")


In [None]:
# Save to csv the comments and their label
high_confidence_comments_df = pd.DataFrame(final_comments, columns=['Comment'])
high_confidence_comments_df['Label'] = final_labels
high_confidence_comments_df['Score'] = final_scores
high_confidence_comments_df.to_csv(path + "High_Confidence_Comments_English_Split_{}.csv".format(SPLIT))
# Save the original ones as well
high_confidence_comments_original_df = pd.DataFrame(final_comments_original, columns=['Comment'])
high_confidence_comments_original_df['Label'] = final_labels_original
high_confidence_comments_original_df['Score'] = final_scores_original
high_confidence_comments_original_df.to_csv(path + "High_Confidence_Comments_Original_English_Split_{}.csv".format(SPLIT))

In [None]:
# Download the file to your local machine (from google colab)

files.download(path + "High_Confidence_Comments_English_Split_{}.csv".format(SPLIT))
files.download(path + "High_Confidence_Comments_Original_English_Split_{}.csv".format(SPLIT))