# ABSA Case Study
You are a data scientist who has been tasked with developing a versatile ABSA model for extracting aspects, determining aspect polarity, and detecting aspect categories from textual data. The goal is to create a robust model/pipeline ensuring flexibility and accuracy across different contexts.

Aspect-Based Sentiment Analysis (ABSA) is a natural language processing (NLP) technique that involves extracting and analysing sentiment or emotion associated with specific aspects or features of a given target entity, such as a product, service, or topic.  

Data Description:
Your toolkit includes a dataset of about 3000 customer reviews of a restaurant, all in English, enriched with human-authored annotations. These annotations contain the mentioned aspects of the target entities and the sentiment polarity of each aspect.


Tasks:

•	ABSA Model Development: Create an ABSA model that can extract aspect terms, determine aspect polarity, and identify aspect categories within a given text. Ensure that the model is generic and can be applied to different domains with ease.
•	Results Analysis: Analyze the results generated by the ABSA model. Use matplotlib/plotly to show the overall as well as the category-wise sentiment distribution.
•	Actionable insights: Based on the above analysis, draw conclusions and identify action items for the restaurant to work on.
•	Report: Discuss the pros and cons of the approach you selected and the further improvements that could be made given more time.


In particular, Task 1 consists of the following subtasks:
 
Definition 1: Aspect term extraction 
For the given entity (restaurant), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity (restaurant).
  
For example: “I liked the service and the staff, but not the food”, “The food was nothing much, but I loved the staff”. Here service, staff and food are the aspect terms.

Note: Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g., in “The hard disk is very noisy” the only aspect term is “hard disk”).
 
Definition 2: Aspect term polarity 
For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative). 
  
For example: 
 
“I loved their fajitas” → {fajitas: positive} 
“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive} 
“The fajitas are their first plate” → {fajitas: neutral} 
“The fajitas were great to taste, but not to see” → {fajitas: conflict} 
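
The four labels can be read as a small aggregation rule over the individual opinions voiced about one term. The following is a minimal sketch of that rule, assuming each opinion has already been labelled positive, negative or neutral; `merge_polarities` is a hypothetical helper used for illustration, not part of the dataset tooling.

```python
def merge_polarities(opinions):
    """Collapse the opinions voiced about one aspect term into a single label."""
    found = set(opinions)
    # Both directions present for the same term means "conflict".
    if "positive" in found and "negative" in found:
        return "conflict"
    if "positive" in found:
        return "positive"
    if "negative" in found:
        return "negative"
    # No sentiment-bearing opinion at all.
    return "neutral"


# "The fajitas were great to taste, but not to see"
print(merge_polarities(["positive", "negative"]))  # conflict
```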
 
Definition 3: Aspect category detection 
Decide on a predefined set of aspect categories (e.g., price, food, service) for the restaurant, then identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence.
  
For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}: 
“The restaurant was too expensive” → {aspect: expensive, category: price}
“The restaurant was expensive, but the menu was great” → {aspect: expensive, category: price; aspect: menu, category: food}
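
Because a sentence can belong to several categories at once, one common way to represent the target is a multi-hot vector over the fixed category set. This is a sketch under that assumption; the ordering of `CATEGORIES` and the helper name `to_multi_hot` are illustrative choices.

```python
# Fixed category order; the position in this list is the position in the vector.
CATEGORIES = ["food", "service", "price", "ambience", "anecdotes/miscellaneous"]


def to_multi_hot(sentence_categories, categories=CATEGORIES):
    """Return one 0/1 flag per known category for a single sentence."""
    present = set(sentence_categories)
    return [1 if category in present else 0 for category in categories]


# "The restaurant was expensive, but the menu was great"
print(to_multi_hot(["price", "food"]))  # [1, 0, 1, 0, 0]
```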

Definition 4:
Action Items: Areas which can be improved to improve the overall sentiment of the customer.
 


#### Packages we need to install
- xmltodict
- pandas
- nltk
- matplotlib
- plotly
- torch
- transformers
- datasets

In [1]:
# !pip install xmltodict


## Loading the XML Data and Pre-processing

In [58]:
import xmltodict

# Open and read the XML file
with open('Restaurants_Train_v2.xml', 'r') as f:
    xml_content = f.read()

# Convert the XML to a dictionary
xml_dict = xmltodict.parse(xml_content)



#### Let's first see how the data looks

In [68]:
print(xml_dict['sentences']['sentence'][0:10])

[{'@id': '3121', 'text': 'But the staff was so horrible to us.', 'aspectTerms': {'aspectTerm': {'@term': 'staff', '@polarity': 'negative', '@from': '8', '@to': '13'}}, 'aspectCategories': {'aspectCategory': {'@category': 'service', '@polarity': 'negative'}}}, {'@id': '2777', 'text': "To be completely fair, the only redeeming factor was the food, which was above average, but couldn't make up for all the other deficiencies of Teodora.", 'aspectTerms': {'aspectTerm': {'@term': 'food', '@polarity': 'positive', '@from': '57', '@to': '61'}}, 'aspectCategories': {'aspectCategory': [{'@category': 'food', '@polarity': 'positive'}, {'@category': 'anecdotes/miscellaneous', '@polarity': 'negative'}]}}, {'@id': '1634', 'text': "The food is uniformly exceptional, with a very capable kitchen which will proudly whip up whatever you feel like eating, whether it's on the menu or not.", 'aspectTerms': {'aspectTerm': [{'@term': 'food', '@polarity': 'positive', '@from': '4', '@to': '8'}, {'@term': 'kitchen

In [67]:
print(len(xml_dict['sentences']['sentence']))

3041


## First observations from the data
- It contains 3041 sentences in total, each with a text and an ID.
- Aspect Term: some sentences have several entries, some have a single entry, and some have none.
    - Each term comes with a polarity label: positive, negative or neutral (the task definition also lists conflict).
- Aspect Category: this too holds either a single entry or several entries per sentence.
    - As with terms, each category has a polarity label: positive, negative or neutral (neutral is not visible in the sample above, so this is an assumption for now).
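
The single-versus-multiple observation matters for parsing: the sample output shows a plain dict when a sentence has one entry and a list of dicts when it has several. A small normalizer can hide that difference. This is a sketch; `as_list` and `aspect_terms` are hypothetical helper names, and the two sample sentences are trimmed copies of records shown in the output above.

```python
def as_list(node):
    """Return a list whether the parser produced nothing, one dict, or a list."""
    if node is None:
        return []
    if isinstance(node, list):
        return node
    return [node]


def aspect_terms(sentence):
    """Aspect-term entries of one parsed sentence, always as a list."""
    container = sentence.get("aspectTerms") or {}
    return as_list(container.get("aspectTerm"))


single = {"text": "But the staff was so horrible to us.",
          "aspectTerms": {"aspectTerm": {"@term": "staff", "@polarity": "negative"}}}
missing = {"text": "A sentence without annotated terms."}

print([t["@term"] for t in aspect_terms(single)])  # ['staff']
print(aspect_terms(missing))                       # []
```

`xmltodict.parse` also accepts a `force_list` argument that makes the named elements always parse as lists, which would be an alternative to normalizing afterwards.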


## First Impression and Understanding

After reading the dataset, my first impressions for the ABSA Model Development task are:
    - For terms, the natural framing is NER: extract the words that matter to the restaurant business.
    - For terms, we also need sentence inference between each extracted term and the text to find the polarity, i.e. positive, negative or neutral.
    - For categories, we need multi-label sentence classification over the given labels: food, service, price, ambience, anecdotes/miscellaneous.
    - For categories, we also need sentence inference between each category and the text to find the polarity, i.e. positive, negative or neutral. The sample data shown above contains no neutral category, so the three-label set is still an assumption; EDA will confirm it.

**Given the above, we need to train at least the NER and multi-label sentence classification models; for sentence inference, we will see whether training is needed.**
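
To make the sentence-inference framing concrete, each sentence can be expanded into one example per target (an aspect term or a category), with the polarity as the label. The following is only a sketch of that data shape; the field names `premise`, `target` and `label` and the helper `build_pairs` are assumptions made for illustration.

```python
def build_pairs(text, targets, polarities):
    """Pair the sentence with each target (aspect term or category) and its label."""
    return [
        {"premise": text, "target": target, "label": polarity}
        for target, polarity in zip(targets, polarities)
    ]


pairs = build_pairs(
    "I hated their fajitas, but their salads were great",
    ["fajitas", "salads"],
    ["negative", "positive"],
)
for pair in pairs:
    print(pair["target"], "->", pair["label"])
```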


**I will check whether any NER and multi-label sentence classification models are available on the Hugging Face Hub to act as a baseline.**


**The data will have to be split into train and test sets, as no separate test data was shared with the package.**

## We can't use the XML data as given, so we do some pre-processing. We convert the data for NER, multi-label sentence classification and sentence inference

##### **get_clean_text:** This function is used for text clean-up. Some corner cases in the text were causing exceptions, so the text is pre-processed first.

In [59]:
import re

def get_clean_text(text):
    # Step 1: Remove single or double quotes surrounding words
    cleaned_string = re.sub(r"(['\"])(\w+)\1", r"\2", text)

    # Step 2: Remove any remaining special characters except apostrophes
    cleaned_string = re.sub(r"[^A-Za-z0-9\s']", '', cleaned_string)
    return cleaned_string
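
A few sample tokens show what the two substitutions do in practice. The function is repeated here only so that the example runs on its own; the inputs are taken from or modelled on the review sentences.

```python
import re


def get_clean_text(text):
    # Same two steps as the cell above, copied so this example is self-contained.
    cleaned_string = re.sub(r"(['\"])(\w+)\1", r"\2", text)
    cleaned_string = re.sub(r"[^A-Za-z0-9\s']", '', cleaned_string)
    return cleaned_string


print(get_clean_text("'enough'"))      # enough   (surrounding quotes removed)
print(get_clean_text("us."))           # us       (punctuation removed)
print(get_clean_text("isn't"))         # isn't    (apostrophe kept)
print(get_clean_text("alone---even"))  # aloneeven (hyphens removed, words merge)
```

Note the last case: because separators such as hyphens are deleted rather than replaced by a space, the words on either side are joined.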



##### **get_ner:** The purpose of this function is to create NER labels in a ***CoNLL***-style layout (one label per token). Its input parameters are the split text, the term, the start (`from`) character position of the term in the text, and the current NER label list, which is updated for the term and returned. The NER labels are {0: Others, 1: Begin, 2: Intermediate}. The output parameters are the split text and the NER labels.

In [60]:
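def get_ner(split_text,term,from_to,ner_label):
  # split_text: whitespace-split tokens of the sentence
  # from_to: character offset at which the term starts in the original text
  split_term = term.split()

  offset = 0          # character offset of the current token (avoids shadowing built-in next)
  word_range = []     # tokens exactly as given
  t_word_range = []   # cleaned tokens, used for the fallback matching below
  index_range = []    # start offset recorded for each token

  for v in split_text:
    word_range.append(v)
    t_word_range.append(get_clean_text(v))
    # A token that opens with a quote: the annotated term starts one character later
    if v[0] not in ["'", '"']:
      index_range.append(offset)
    else:
      index_range.append(offset + 1)
    # Assumes tokens are separated by exactly one space
    offset += len(v) + 1

  t_index = 0
  try:
    # Preferred: a token starts exactly at the annotated offset
    t_index = index_range.index(from_to)
  except ValueError:
    try:
      # Fallback 1: a cleaned token equals the first word of the term
      t_index = t_word_range.index(split_term[0])
    except ValueError:
      # Fallback 2: substring match; the last matching token wins
      for k, i in enumerate(t_word_range):
        if split_term[0] in i:
          t_index = k

  end = t_index + len(split_term)

  # First token of the term gets Begin (1), the remaining ones Intermediate (2)
  for k, v in enumerate(range(t_index, end)):
    if k == 0:
      ner_label[v] = 1
    else:
      ner_label[v] = 2
  return word_range, ner_label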
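
As an alternative to reconstructing offsets by summing token lengths, the character span of every token can be read directly from a regular-expression match and compared with the annotated `from`/`to` span. This is a sketch of that different technique, not the approach used in this notebook; `label_tokens_by_span` is a hypothetical name, and it uses the same 0/1/2 label scheme.

```python
import re


def label_tokens_by_span(text, spans):
    """Label whitespace tokens as 0 (Others), 1 (Begin) or 2 (Intermediate).

    spans is a list of (start, end) character offsets, end exclusive.
    """
    matches = list(re.finditer(r"\S+", text))
    tokens = [m.group() for m in matches]
    labels = [0] * len(tokens)
    for start, end in spans:
        first = True
        for i, m in enumerate(matches):
            # A token belongs to the term if its character range overlaps the span.
            if m.start() < end and m.end() > start:
                labels[i] = 1 if first else 2
                first = False
    return tokens, labels


tokens, labels = label_tokens_by_span("But the staff was so horrible to us.", [(8, 13)])
print(labels)  # [0, 0, 1, 0, 0, 0, 0, 0]

tokens, labels = label_tokens_by_span("The hard disk is very noisy", [(4, 13)])
print(labels)  # [0, 1, 2, 0, 0, 0]
```

Because the comparison is an overlap test, a token with attached punctuation or quotes (for example `'enough'`) is still matched, and repeated spaces in the text do not shift the offsets.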

##### **get_dataset:** This is the main function called to pre-process the XML data. Its only input parameter is the parsed XML data (`xml_dict`); it returns a list of dictionaries (one per sentence) together with the set of all category labels found.

In [61]:
def get_dataset(xml_dict):
  dataset_aspect = []
  all_lb = set()
  for k,v in enumerate(xml_dict['sentences']['sentence']):
    ds = {}
    # if k == 5:
    #   break
    dict_keys = v.keys()
    ds['ID'] = v['@id']
    ds['text'] = v['text']
    ds['NER_INFERENCE'] ={}
    ds['NER_INFERENCE']['term']=[]
    ds['NER_INFERENCE']['polarity']=[]
    ds['NER_INFERENCE']['from_to']=[]
    ds['NER_INFERENCE']['text']= v['text'].split()
    ds['NER_INFERENCE']['ner_label']= [0]*len(v['text'].split())
    ds['categories']={}
    ds['categories']['category']=[]
    ds['categories']['polarity']=[]
    # print( v['aspectCategories'])
    # print(dict_keys)
    if 'aspectTerms' in dict_keys:
      if type(v['aspectTerms']['aspectTerm'])==dict:
        # tmp_dict={}
        ds['NER_INFERENCE']['term'].append(v['aspectTerms']['aspectTerm']['@term'])
        ds['NER_INFERENCE']['polarity'].append(v['aspectTerms']['aspectTerm']['@polarity'])
        ds['NER_INFERENCE']['from_to'].append((int(v['aspectTerms']['aspectTerm']['@from']),int(v['aspectTerms']['aspectTerm']['@to'])))
        ds['NER_INFERENCE']['text'],ds['NER_INFERENCE']['ner_label']= get_ner(
            ds['NER_INFERENCE']['text'],
            v['aspectTerms']['aspectTerm']['@term'],
            int(v['aspectTerms']['aspectTerm']['@from']),
            ds['NER_INFERENCE']['ner_label']
        )

      else:
        for key,value in enumerate(v['aspectTerms']['aspectTerm']):
          # tmp_dict={}
          ds['NER_INFERENCE']['term'].append(value['@term'])
          ds['NER_INFERENCE']['polarity'].append(value['@polarity'])
          ds['NER_INFERENCE']['from_to'].append((int(value['@from']),int(value['@to'])))
          # try:
          ds['NER_INFERENCE']['text'],ds['NER_INFERENCE']['ner_label']= get_ner(
              ds['NER_INFERENCE']['text'],
              value['@term'],
              int(value['@from']),
              ds['NER_INFERENCE']['ner_label']
          )
          # except Exception as e:
          #   print(e)
          #   print('Text: ',ds['text'],' Term: ',value['@term']," from: ",int(value['@from']))
          #   return "error"

    if 'aspectCategories' in dict_keys :
      # print( v['aspectCategories'])
      if type(v['aspectCategories']['aspectCategory'])==dict:
        # print( v['aspectCategories']['aspectCategory']) #['@category']
        ds['categories']['category'].append(v['aspectCategories']['aspectCategory']['@category'])
        ds['categories']['polarity'].append(v['aspectCategories']['aspectCategory']['@polarity'])
        all_lb.add(v['aspectCategories']['aspectCategory']['@category'])
      else:
        # print(v['aspectCategories']['aspectCategory'])
        for key,value in enumerate(v['aspectCategories']['aspectCategory']):
          # print(value['@category'])
          ds['categories']['category'].append(value['@category'])
          ds['categories']['polarity'].append(value['@polarity'])
          all_lb.add(value['@category'])
      # pass
    dataset_aspect.append(ds)
  return dataset_aspect,all_lb

In [62]:
dataset_apect,all_lb= get_dataset(xml_dict)

##### List of categories

In [63]:
all_lb

{'ambience', 'anecdotes/miscellaneous', 'food', 'price', 'service'}

##### Viewing the data after pre-processing

In [70]:
dataset_apect[:1]

[{'ID': '3121',
  'text': 'But the staff was so horrible to us.',
  'NER_INFERENCE': {'term': ['staff'],
   'polarity': ['negative'],
   'from_to': [(8, 13)],
   'text': ['But', 'the', 'staff', 'was', 'so', 'horrible', 'to', 'us.'],
   'ner_label': [0, 0, 1, 0, 0, 0, 0, 0]},
  'categories': {'category': ['service'], 'polarity': ['negative']}}]
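
With records in this shape, the open question about which polarity labels actually occur can be checked by counting them. A minimal sketch, assuming the record layout shown above; `polarity_counts` is a hypothetical helper, and the one-record `sample` is just the record printed above, so the counts are illustrative rather than dataset statistics.

```python
from collections import Counter


def polarity_counts(records):
    """Count polarity labels separately for aspect terms and aspect categories."""
    term_counts, category_counts = Counter(), Counter()
    for record in records:
        term_counts.update(record["NER_INFERENCE"]["polarity"])
        category_counts.update(record["categories"]["polarity"])
    return term_counts, category_counts


sample = [{
    "ID": "3121",
    "text": "But the staff was so horrible to us.",
    "NER_INFERENCE": {"term": ["staff"], "polarity": ["negative"]},
    "categories": {"category": ["service"], "polarity": ["negative"]},
}]

term_counts, category_counts = polarity_counts(sample)
print(term_counts)      # Counter({'negative': 1})
print(category_counts)  # Counter({'negative': 1})
```

Run on the full list returned by `get_dataset`, the keys of the two counters would show whether neutral and conflict appear for terms and for categories.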

In [73]:
import json

with open("artifacts/data/process_data.jsonl", "w") as final:
    json.dump(dataset_apect, final)
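
One detail worth noting: `json.dump` on a list writes a single JSON array, so the file is not line-delimited despite its `.jsonl` extension (the loader evidently accepted it, as the output further down shows). If a true JSON Lines file is wanted, one record can be written per line. The sketch below assumes that goal; `write_jsonl` and `read_jsonl` are hypothetical helpers, and a temporary directory stands in for `artifacts/data`.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"ID": "3121", "text": "But the staff was so horrible to us."}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "sample.jsonl")
    write_jsonl(records, path)
    restored = read_jsonl(path)

print(restored == records)  # True
```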

## Now loading the dataset into Hugging Face Datasets

In [74]:
from datasets import load_dataset
dataset = load_dataset("json", data_files="artifacts/data/process_data.jsonl")
dataset

Generating train split: 3041 examples [00:00, 77302.29 examples/s]


DatasetDict({
    train: Dataset({
        features: ['ID', 'text', 'NER_INFERENCE', 'categories'],
        num_rows: 3041
    })
})

##### Doing a 90/10 train/test split and saving the dataset

In [90]:
dataset = dataset['train'].train_test_split(test_size=0.1,seed=199)

dataset.save_to_disk('artifacts/dataset')

Saving the dataset (1/1 shards): 100%|██████████| 2736/2736 [00:00<00:00, 67544.15 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 305/305 [00:00<00:00, 50104.29 examples/s]


In [46]:
# text = """Don't go alone---even two people isn't 'enough' for the whole experience, with pickles and a selection of meats and seafoods."""
# from nltk.tokenize import TweetTokenizer
# tokenizer = TweetTokenizer()

# tokenizer.tokenize(text)


In [47]:
# text = "Don't go alone---even two people isn't enough for the whole experience, with pickles and a selection of meats and seafoods."
# term="selection of meats and seafoods"
# from_to =(91,122)

# split_text = text.split()
# ner_label = [0]*len(split_text)
# # print(split_text)
# def get_ner(split_text,term,from_to,ner_label):
#   # split_text = text.split()
#   split_term = term.split()

#   next=0
#   last=0
#   word_range=[]
#   index_range=[]
#   for k,v in enumerate(split_text):
#     word_range.append(v)
#     index_range.append(next)
#     if k == 0:
#       last = len(v)+1
#     else:
#       last=last+len(v)+1
#     next = last
#   t_index = index_range.index(from_to[0])
#   end = t_index + len(split_term)
#   for k,v in enumerate(range(t_index,end)):
#     if k==0:
#       ner_label[v]= 1
#     else:
#       ner_label[v]= 2
#   # print(ner_label)
#   return word_range, ner_label
#   # print(split_text)
# word_range, ner_label=get_ner(split_text,term,from_to,ner_label)

# word_range, ner_label