This script was run in a Google Colab notebook (https://colab.research.google.com/), which lets you use one of Google's GPUs for free
and thereby significantly shortens the runtime of the classification.

Additionally, the script reads and stores files via Google Drive, which makes file handling a bit easier (especially if you have multiple files that need to be classified one after another). If you don't want to use Google Drive, you can instead import and export files directly within Google Colab via the Files folder.

In this script, we will do the following things:
First, we mount Google Drive to access and store files, and import the Classifications_Input file (which contains all the data needed to train the classifier and to compare its performance against the gold standard).

Please note that the training and gold-standard datasets provided in this repository underwent several pre-processing steps, which are explained in more detail in the article "Beyond sentiment: an algorithmic strategy to classify evaluations in large text corpora" (https://www.tandfonline.com/doi/full/10.1080/19312458.2023.2285783) (Section: Workflow).

If you want to use the projections classifier for your own purposes, the data to be classified should first be pre-processed following these steps.



In [None]:
# -*- coding: utf-8 -*-

# Let's get started. In the first step, we mount Google Drive and access the classifications input data. Make sure to store the classifications
# data in the folder specified in the "Load file" command below.

!pip install simpletransformers

import os
import pandas as pd
import numpy as np
from simpletransformers.classification import ClassificationModel
from sklearn.utils.class_weight import compute_class_weight

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Next, we import the training data, which is stored in the "Classifications_Input.xlsx" file.
## This file contains both the training data (N=10,000; IS_GOLD_STANDARD == NO) and the gold-standard data (N=1,633; IS_GOLD_STANDARD == YES),
## which we will use afterwards to evaluate the performance of the classifier.

# Load file (change the location according to wherever you store the file)
df = pd.read_excel('/content/drive/My Drive/.../Classifications_Input.xlsx')

# Keep only rows where IS_GOLD_STANDARD == "NO", so that the model is trained on the training set and not on the
# gold-standard data. .copy() avoids pandas' SettingWithCopyWarning when columns are assigned below.

df = df[df['IS_GOLD_STANDARD'] == "NO"].copy()
df.shape

# Data preprocessing
df["Q1"] = df["Q1"].replace(np.nan, 0)
df["Q1"] = df["Q1"].replace(2, 1)
df["Q1"] = df["Q1"].replace(3, 0)
df["Q1"] = df["Q1"].replace(4, 0)

df["text"] = df["SNIPPET"]
df["labels"] = df["Q1"]

df["text"] = df["text"].str.replace("[+]", "", regex=False)
df["text"] = df["text"].str.replace("[-]", "", regex=False)
df["text"] = df["text"].str.replace("[[", "", regex=False)
df["text"] = df["text"].str.replace("]]", "", regex=False)



Mounted at /content/drive
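
The four `replace` calls in the cell above collapse the raw Q1 codes into a binary label: missing values and the codes 3 and 4 become 0, while code 2 is merged into 1. Below is a minimal pure-Python sketch of that mapping. The helper name `recode_q1` and the sample values are hypothetical and not part of the original script; the sketch only assumes that every other value is left unchanged, as in the `replace` calls.

```python
import math


def recode_q1(value):
    """Map a raw Q1 code to the binary label used for training (hypothetical helper)."""
    # Missing values (None or NaN) are treated as class 0.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    # Code 2 is merged into class 1; codes 3 and 4 are merged into class 0.
    mapping = {2: 1, 3: 0, 4: 0}
    # Any other value (for example 0 or 1) is left unchanged.
    return mapping.get(value, value)


raw = [1, 2, 3, 4, float("nan"), None]  # made-up example codes
labels = [recode_q1(v) for v in raw]
print(labels)  # [1, 1, 0, 0, 0, 0]
```

After this recoding only the labels 0 and 1 remain, which is what the `set(df["labels"])` check in the next cell confirms.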


In [None]:
set(df["labels"])

{0.0, 1.0}

In [None]:
df['Q1'].value_counts()

Q1,count
0.0,6984
1.0,3020


In [None]:
# take a look at the first text after the above changes
df["text"].iloc[1]

'In addition to vowing that more bombs would be used to strike the Big Apple, the caller warned that attackers would also open fire with guns, a source said. >>> Police Commissioner James O Neill said Sunday, I know we re going to find out who did this and they ll be brought to justice. <<< The two witnesses who saw a possible person of interest in the case and who later spoke with the NYPD sketch artist were dining at the Krush bar and grill on West 32nd Street when they noticed the man, sources said. Passers by in Chelsea described feeling under siege as a result of the attack and the threats of more to come. '
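
The text above no longer contains the bracket markers, while the `>>>` and `<<<` markers are kept. The `str.replace(..., regex=False)` calls in the first cell treat `[+]`, `[-]`, `[[` and `]]` as literal strings; as a regular expression, `[+]` would instead be read as a character class matching any `+`. The following is a minimal pure-Python sketch of the same literal removal. The helper name `strip_markers` and the sample sentence are made up for illustration.

```python
# The literal markers removed in the preprocessing step.
MARKERS = ["[+]", "[-]", "[[", "]]"]


def strip_markers(text):
    """Remove the literal annotation markers from a snippet (hypothetical helper)."""
    for marker in MARKERS:
        # str.replace on a plain Python string is always literal (no regex).
        text = text.replace(marker, "")
    return text


sample = "He said [[the plan]] was [+]good[+] but >>> risky <<<."  # made-up text
cleaned = strip_markers(sample)
print(cleaned)  # He said the plan was good but >>> risky <<<.
```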

In [None]:
df_new = df[["text","labels"]]
df_new

,text,labels
0,In addition to vowing that more bombs would be...,1.0
1,In addition to vowing that more bombs would be...,1.0
2,In addition to vowing that more bombs would be...,1.0
3,In addition to vowing that more bombs would be...,1.0
4,In addition to vowing that more bombs would be...,1.0
...,...,...
9999,>>> But they want everyone to forget about the...,1.0
10000,>>> But they want everyone to forget about the...,1.0
10001,>>> But they want everyone to forget about the...,1.0
10002,>>> But they want everyone to forget about the...,1.0


In [None]:
# Train the model on the entire set of 10,000 training snippets and save it; the evaluation against the gold standard follows in the next cell.
# If a saved model already exists in model_dir, it is loaded instead of being retrained.

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(df["labels"]), y=df["labels"])

# Define the model directory in Google Drive
model_dir = "/content/drive/My Drive/.../roberta_model"

# Check if the model is already saved
if not os.path.exists(model_dir):
    # Initialise a roberta-base model for binary classification (num_labels=2), using the class weights calculated above.
    model = ClassificationModel('roberta', 'roberta-base', num_labels=2, weight=class_weights.tolist(),
                                use_cuda=True, args={'reprocess_input_data': True, 'overwrite_output_dir': True,
                                                     "num_train_epochs": 10, 'output_dir': model_dir})

    # Train the model on the entire dataframe
    model.train_model(df)

    # Save the model
    model.save_model(model_dir)
else:
    # Load the saved model
    model = ClassificationModel('roberta', model_dir, num_labels=2, use_cuda=True)
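
The `compute_class_weight('balanced', ...)` call in the cell above weights each class by `n_samples / (n_classes * class_count)`, so the rarer class 1 counts more during training. The following pure-Python sketch reproduces that formula using the class counts shown in the `value_counts()` output (6984 and 3020). The helper name `balanced_class_weights` is hypothetical; the script itself uses scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Class counts taken from the value_counts() output above.
labels = [0] * 6984 + [1] * 3020
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})  # {0: 0.72, 1: 1.66}
```

With these weights, an error on a class-1 snippet costs roughly 2.3 times as much as an error on a class-0 snippet, which compensates for the imbalance in the training data.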



In [None]:
# Test against the gold-standard dataset
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Mount Google Drive (has no effect if it is already mounted)
from google.colab import drive
drive.mount('/content/drive')

# Load file after initial preprocessing steps in R (done to Q1 and Q2 only)
df_gold_standard = pd.read_excel('/content/drive/My Drive/.../Classifications_Input.xlsx')


# Keep only rows where IS_GOLD_STANDARD == "YES" (.copy() avoids SettingWithCopyWarning on the assignments below)
df_gold_standard = df_gold_standard[df_gold_standard['IS_GOLD_STANDARD'] == "YES"].copy()
df_gold_standard.shape

# Data preprocessing for the gold standard dataset
df_gold_standard["Q1"] = df_gold_standard["Q1"].replace(np.nan, 0)
df_gold_standard["Q1"] = df_gold_standard["Q1"].replace(2, 1)
df_gold_standard["Q1"] = df_gold_standard["Q1"].replace(3, 0)
df_gold_standard["Q1"] = df_gold_standard["Q1"].replace(4, 0)

df_gold_standard["text"] = df_gold_standard["SNIPPET"]
df_gold_standard["labels"] = df_gold_standard["Q1"]

df_gold_standard["text"] = df_gold_standard["text"].str.replace("[+]", "", regex=False)
df_gold_standard["text"] = df_gold_standard["text"].str.replace("[-]", "", regex=False)
df_gold_standard["text"] = df_gold_standard["text"].str.replace("[[", "", regex=False)
df_gold_standard["text"] = df_gold_standard["text"].str.replace("]]", "", regex=False)

# Evaluate the model on the gold standard dataset
result, model_outputs, wrong_predictions = model.eval_model(df_gold_standard)

# Get the true and predicted labels
y_true = df_gold_standard["labels"]
y_pred = model_outputs.argmax(axis=1)

# Calculate and print the classification report
report = classification_report(y_true, y_pred, output_dict=True)
print("Classification Report:")
print(classification_report(y_true, y_pred))

# Calculate and print the confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



Classification Report:
              precision    recall  f1-score   support

         0.0       0.85      0.88      0.87       959
         1.0       0.82      0.78      0.80       674

    accuracy                           0.84      1633
   macro avg       0.84      0.83      0.83      1633
weighted avg       0.84      0.84      0.84      1633

Confusion Matrix:
[[843 116]
 [146 528]]
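
The figures in the classification report can be derived directly from the confusion matrix, in which rows are the true labels and columns are the predicted labels. The sketch below recomputes precision, recall, F1 and accuracy from the matrix printed above. The helper name `per_class_metrics` is hypothetical; the script itself relies on scikit-learn for these numbers.

```python
# Confusion matrix from the output above (rows: true label, columns: predicted label).
conf_matrix = [[843, 116],
               [146, 528]]


def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_positives = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum
    actually_cls = sum(matrix[cls])                     # row sum (support)
    precision = true_positives / predicted_as_cls
    recall = true_positives / actually_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(len(conf_matrix))) / total

for cls in range(len(conf_matrix)):
    p, r, f = per_class_metrics(conf_matrix, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For class 1, for example, 528 of the 644 snippets predicted as 1 are correct (precision 0.82), and 528 of the 674 true class-1 snippets are found (recall 0.78).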


Now that we have trained our classifier, stored the model, and evaluated its performance, we want to apply the model to other corpora. This is done in the following step.

In [None]:
### Predict labels on new datasets
!pip install simpletransformers

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Define the model directory in Google Drive
model_dir = "/content/drive/My Drive/.../roberta_model"

# Load the data without column titles and add column names
column_names = ["document ID", "date", "position", "SNIPPET"]
df_pred = pd.read_csv('/content/drive/My Drive/.../...csv', delimiter='\t', header=None, names=column_names)
# Display the shape of the dataframe
print(df_pred.shape)

# The four columns are already named via column_names above, so no further renaming is needed.

# Load the saved model from model_dir (same call as in the training cell), so that this cell also works after a runtime restart
model = ClassificationModel('roberta', model_dir, num_labels=2, use_cuda=True)

# Predict Q1 values on the entire dataframe; predict() returns (predictions, raw model outputs)
predictions, raw_outputs = model.predict(df_pred["SNIPPET"].to_list())
df_pred['Q1_roBERTa'] = predictions

# Save the modified dataframe to an Excel file
df_pred.to_excel("/content/drive/My Drive/.../...xlsx", index=False)

df_pred.head()



FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/PROFECI/kwics_HE/kwics_54_1337_profeci_sr_HE_50505_2000-01-01_2024-12-31_tr'
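
The `FileNotFoundError` above is raised when the input path does not exist on the mounted drive, for example because Google Drive is not mounted or the folder or file name is misspelled. One way to fail early with a clearer message is to check the path before loading it. This is only a sketch: the helper name `require_file` and the file names below are hypothetical, and a temporary directory stands in for the Drive folder.

```python
import tempfile
from pathlib import Path


def require_file(path):
    """Return the path as a Path object, or raise with a more explicit message."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"Input file not found: {p}. Check that Google Drive is mounted and "
            "that the folder and file name (including the extension) are correct."
        )
    return p


# Demonstration with a temporary directory instead of Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "snippets.tsv"  # hypothetical file name
    existing.write_text("doc1\t2020-01-01\t1\tSome snippet\n", encoding="utf-8")
    print(require_file(existing).name)

    try:
        require_file(Path(tmp) / "missing.tsv")
    except FileNotFoundError as err:
        print("caught:", type(err).__name__)
```

In the prediction cell, the result of `require_file(...)` could then be passed to `pd.read_csv` in place of the raw path string.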