## SetFit ABSA Training


In [12]:
# load packages
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [None]:
%pip install spacy "setfit[absa]"

In [None]:
!spacy download en_core_web_sm

CUDA is required to run the SetFit ABSA model. Run the code block below to check whether CUDA is available.

In [None]:
# check if CUDA is available
import torch
torch.cuda.is_available()

The dataset we prepared for training our own SetFit ABSA model is available on Hugging Face:
https://huggingface.co/datasets/ginkgogo/ca_restaurants_random_sample
After installing the required setfit[absa] packages, we can load the dataset directly from Hugging Face.

In [1]:
from datasets import load_dataset

dataset = load_dataset("ginkgogo/ca_restaurants_random_sample", split="train")
# split the dataset into two parts: the first 50 rows for training, the remaining 52 for evaluation
train_dataset = dataset.select(range(50))
eval_dataset = dataset.select(range(50, 102))

Downloading data: 100%|██████████| 62.9k/62.9k [00:01<00:00, 56.1kB/s]


Generating train split: 0 examples [00:00, ? examples/s]

In [2]:
# quickly take a look at our training data
train_dataset

Dataset({
    features: ['text', 'span', 'label', 'ordinal'],
    num_rows: 50
})

In [3]:
# also take a quick look at our evaluation data
eval_dataset

Dataset({
    features: ['text', 'span', 'label', 'ordinal'],
    num_rows: 52
})

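Prepare a new AbsaModel instance with the selected sentence-transformers models (the first for aspect filtering, the second for polarity classification) and the small spaCy model (en_core_web_sm).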

In [4]:
from setfit import AbsaModel

model = AbsaModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
    spacy_model="en_core_web_sm",
)

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


### Training the SetFitABSA model
Prepare the training arguments for the ABSA model and pass the training and evaluation datasets to the trainer. We ran the training on Google Colab, where it took about 1 hour on an A100 GPU runtime. We therefore saved the trained model to Hugging Face so that it can be reused without rerunning the training. See "Using SetFitABSA model" below for details.

In [None]:
from setfit import AbsaTrainer, TrainingArguments
from transformers import EarlyStoppingCallback

args = TrainingArguments(
    output_dir="models",
    num_epochs=5,
    use_amp=True,
    batch_size=50,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
)

trainer = AbsaTrainer(
    model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()

To inspect the model, we use the trainer's built-in `evaluate` method from the setfit[absa] package to check the model's accuracy on the evaluation dataset.

In [None]:
metrics = trainer.evaluate(eval_dataset)
print(metrics)

In [None]:
# pip install -U "huggingface_hub[cli]"

### Saving the SetFitABSA model to huggingface

In [19]:
# uncomment below to login to huggingface
# !huggingface-cli login

In [20]:
# uncomment below to save the model to huggingface
# model.push_to_hub("ginkgogo/setfit-absa-bge-small-en-v1.5-restaurants")

### Using SetFitABSA model

In [None]:
from setfit import AbsaModel

# Download from the 🤗 Hub
model = AbsaModel.from_pretrained(
    "ginkgogo/setfit-absa-bge-small-en-v1.5-restaurants-aspect",
    "ginkgogo/setfit-absa-bge-small-en-v1.5-restaurants-polarity",
    spacy_model="en_core_web_sm",
)
# Run inference
preds = model("The food was great, but the venue is just way too busy.")
print(preds)

In [None]:
df = pd.read_csv(
  '/content/drive/MyDrive/699/ca_restaurants.csv'
)
# this is the list of business ids that we used in training the SetFit ABSA model;
# we need to omit them from the random sample to avoid bias
bus_used_in_train = [234152, 88955, 174286, 228338, 203671, 151156, 88166, 64932, 142804, 210180, 35159, 90839, 137484, 85880, 128479, 92603, 20842, 200330, 175440, 8844, 61777, 3815, 123379, 125840, 180129, 206443, 219869, 101729, 107887, 188230, 244420, 49208, 139902, 242337, 35581, 228649, 44946, 32763, 69556, 152494, 5069963, 3915492, 4486491]

random_df_2000 = df.sample(2000)

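# drop the training rows in one call; as in a per-row loop with `in` and `drop`,
# the ids are matched against the DataFrame index, and absent ids are skipped
random_df_2000 = random_df_2000.drop(index=bus_used_in_train, errors="ignore")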
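
One pandas detail matters for the filter above: the `in` operator on a Series tests index labels, not stored values. The following is a minimal sketch on made-up data (the frame `toy` and all ids in it are hypothetical) that shows the difference and how `isin` filters on column values instead.

```python
import pandas as pd

# Hypothetical toy frame: integer row labels, string business ids
toy = pd.DataFrame(
    {"business_id": ["a1", "b2", "c3"], "text": ["good", "bad", "ok"]},
    index=[10, 20, 30],
)

# `in` on a Series checks the index labels, not the stored values
label_hit = 20 in toy["business_id"]    # 20 is an index label
value_hit = "b2" in toy["business_id"]  # "b2" is a value, not a label

# Drop rows by index label; errors="ignore" skips labels that are absent
by_label = toy.drop(index=[20, 99], errors="ignore")

# Filter on column values with isin instead
by_value = toy[~toy["business_id"].isin(["b2"])]

print(label_hit, value_hit)
print(list(by_label.index), list(by_value.index))
```

If the ids to exclude live in a column rather than in the index, the `isin` form is the one to use.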

In [None]:
# run inference on the random sample of 2k rows from the California restaurant dataset
sentences = list(random_df_2000['text'].str.lower())
preds = model(sentences)

In [None]:
# quickly inspect model predictions
print(preds)
print(len(preds))

In [None]:
# if no aspects were extracted for a review, store the placeholder string '{}' as the column value
aspects_sentiment = []
for i in preds:
    if len(i) > 0:
        aspects_sentiment.append(i)
    else:
        aspects_sentiment.append('{}')

random_df_2000['aspects_sentiment'] = aspects_sentiment

In [None]:
def extract(aspect_list):
    # placeholder values (no aspects extracted) are not lists: return None
    if not isinstance(aspect_list, list):
        return None
    # map each extracted span to its predicted polarity
    return {aspect['span']: aspect['polarity'] for aspect in aspect_list}
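
As an illustration of what this conversion step does, here is a minimal sketch on a hand-written input. The helper name `to_span_polarity` and the example values are hypothetical; only the `span` and `polarity` keys come from the notebook.

```python
def to_span_polarity(aspect_list):
    """Hypothetical helper: map a list of {'span', 'polarity'} dicts to {span: polarity}."""
    if not isinstance(aspect_list, list):
        return None  # nothing was extracted for this review
    return {a["span"]: a["polarity"] for a in aspect_list}


# Hand-written example in the shape the conversion expects (not real model output)
example = [
    {"span": "food", "polarity": "positive"},
    {"span": "venue", "polarity": "negative"},
]

flat = to_span_polarity(example)
missing = to_span_polarity("{}")  # the placeholder used for reviews without aspects

print(flat)
print(missing)
```

Note that if the same span appears twice in one review, the later polarity overwrites the earlier one in the resulting dictionary.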


In [None]:
# for a in aspects_sentiment:
#   if len(a) > 1:
#     print(a)

In [None]:
random_df_2000['aspects_sentiment'] = random_df_2000['aspects_sentiment'].apply(extract)

In [None]:
# random_df_2000[random_df_2000['aspects_sentiment'] != None]


In [None]:
random_df_2000.head()

In [None]:
with_aspect_df = random_df_2000.dropna(subset=['aspects_sentiment'])
print(with_aspect_df.shape)

In [None]:
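# use an empty dict (not the string '{}') for missing values so that json_normalize receives only dicts
random_df_2000['aspects_sentiment'] = random_df_2000['aspects_sentiment'].apply(lambda d: d if isinstance(d, dict) else {})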

In [None]:
random_df_2000.head()

In [None]:
flatten_asepct = pd.json_normalize(random_df_2000['aspects_sentiment'])
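
To see what this flattening step produces, here is a minimal sketch on hand-written dictionaries (the names `toy_aspects` and `toy_flat` and all values are hypothetical, not model output). Each distinct span becomes a column, and a review that lacks a span gets NaN in that column.

```python
import pandas as pd

# Hypothetical per-review {span: polarity} dictionaries; the second review has no aspects
toy_aspects = [
    {"food": "positive"},
    {},
    {"service": "negative", "food": "neutral"},
]

# One row per review, one column per distinct span
toy_flat = pd.json_normalize(toy_aspects)
toy_columns = list(toy_flat.columns)

print(toy_flat)
```

Because the result has one row per input dictionary, it can be concatenated column-wise with the review frame as long as both share the same row order and index.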

In [None]:
aspects = list(flatten_asepct.columns)

In [None]:
flatten_asepct

In [None]:
random_df_2000.reset_index(inplace=True)
# drop=True avoids adding a second 'index' column to the concatenated frame
flatten_asepct.reset_index(drop=True, inplace=True)
final_df = pd.concat([random_df_2000, flatten_asepct], axis=1)
final_df.head()

In [None]:
print(random_df_2000.shape)
print(flatten_asepct.shape)

In [None]:
final_df.shape

In [None]:
# len(final_df.business_id.unique())

In [None]:
# reset the index first so that the rows line up with the flattened frame
with_aspect_df = with_aspect_df.reset_index()
with_aspect_flatten = pd.json_normalize(with_aspect_df['aspects_sentiment'])
with_aspect_final_df = pd.concat([with_aspect_df, with_aspect_flatten], axis=1)

In [None]:
with_aspect_flatten.shape

In [None]:
with_aspect_df.shape

In [None]:
with_aspect_final_df.shape

In [None]:
with_aspect_final_df.head()

In [None]:
# with_aspect_final_df.to_csv('with_aspect_from_random_2k.csv')
# !cp with_aspect_from_random_2k.csv '/content/drive/MyDrive/699/'

In [None]:
# final_df.to_csv('random_2k.csv')
# !cp random_2k.csv '/content/drive/MyDrive/699/'

In [None]:
import pandas as pd
out = pd.read_csv('/content/drive/MyDrive/699/with_aspect_from_random_2k.csv')
out.head()

In [None]:
out_random = pd.read_csv('/content/drive/MyDrive/699/random_2k.csv')
out_random.shape

In [16]:
# load manual evaluation results 
import pandas as pd
setfit_absa_eval_df = pd.read_csv('../data/results/SetFit_ABSA_manual_eval.csv')
setfit_absa_eval_df.head()

Unnamed: 0,review_id,user_id,business_id,has_aspects_model_label,aspects_extracted_manual_label,aspect,Model Label,Manual Label,is_actual_restaurant
0,b_mLN6YOXK50s9id9vA6og,YtDiXgpiP0d5zDmtMEUOow,KC8_Rx4Orlsz8LIonCYXsA,Y,Y,food,positive,positive,Y
1,xobTDm7QNP0RU2CvzUCFBg,B5s_DCLVrBLrL8U6TEVlwA,SsHMgOW3TT48Z7jeV5beqQ,Y,Y,service,positive,negative,Y
2,xobTDm7QNP0RU2CvzUCFBg,B5s_DCLVrBLrL8U6TEVlwA,SsHMgOW3TT48Z7jeV5beqQ,Y,N,food,not mentioned,neutral,Y
3,yGwx4jEh9E3XzP-H6fnL-g,UqKi8B6uct0E6WttZI11KA,skY6r8WAkYqpV7_TxNm23w,Y,Y,food,positive,positive,Y
4,yGwx4jEh9E3XzP-H6fnL-g,UqKi8B6uct0E6WttZI11KA,skY6r8WAkYqpV7_TxNm23w,Y,N,service,not mentioned,positive,Y


In [13]:
setfit_absa_eval_df.shape

(111, 9)

In [18]:
# for rows where both the model and the manual label found the aspect, calculate the accuracy of the predicted sentiment
cal_df = setfit_absa_eval_df[(setfit_absa_eval_df['has_aspects_model_label'] == 'Y') 
                             & (setfit_absa_eval_df['aspects_extracted_manual_label'] == 'Y')]
sentiment_correctness = len(cal_df[cal_df['Model Label'] == cal_df['Manual Label']])/len(cal_df)
print('Accuracy of predicting sentiment is :', format(sentiment_correctness, ".1%"))

Accuracy of predicting sentiment is : 64.3%
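
To make the metric concrete, the sketch below repeats the same calculation in plain Python on only the five preview rows displayed above. The shortened key names are hypothetical, and the result covers just these five rows, so it differs from the figure for the full evaluation file.

```python
# The five preview rows, reduced to the columns the metric uses
# (keys are shortened, hypothetical names for the original columns)
rows = [
    {"model_has": "Y", "manual_has": "Y", "model": "positive", "manual": "positive"},
    {"model_has": "Y", "manual_has": "Y", "model": "positive", "manual": "negative"},
    {"model_has": "Y", "manual_has": "N", "model": "not mentioned", "manual": "neutral"},
    {"model_has": "Y", "manual_has": "Y", "model": "positive", "manual": "positive"},
    {"model_has": "Y", "manual_has": "N", "model": "not mentioned", "manual": "positive"},
]

# Keep rows where both the model and the manual label found the aspect
both = [r for r in rows if r["model_has"] == "Y" and r["manual_has"] == "Y"]

# Count rows where the predicted sentiment matches the manual label
correct = sum(r["model"] == r["manual"] for r in both)
accuracy = correct / len(both)

print(f"{correct}/{len(both)} = {accuracy:.1%}")  # prints "2/3 = 66.7%"
```

Rows where the manual label marks the aspect as not extracted are excluded from the denominator, so this metric measures sentiment accuracy only, not how well aspects are detected.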
