# Action-Object Extraction

This notebook builds on the results of the [prior notebook](01_extract_individual_sentences.ipynb), in which spaCy was used to split the extracted text into individual sentences.

These sentences are first pre-filtered: only sentences containing certain words or phrases, for example 'meeting', 'review', 'send' or 'call', are kept. Such words suggest an intent, and the matching sentences are saved as 'targets'.

spaCy is then used again to extract action-object pairs (also called verb-object pairs) from these target sentences.
Finally, the DataFrame is exploded to provide a better base for the further experiments conducted in the [next notebook](03_filter_action_object_pairs.ipynb).

In [1]:
import pandas as pd
from collections import Counter
import spacy
from spacy.matcher import Matcher
from tqdm import tqdm
tqdm.pandas()


# Load the English model
nlp = spacy.load("en_core_web_lg")

In [2]:
df = pd.read_parquet('../../data/processed/avocado_train_individual_sentences.parquet')

### Create list of words for filtering requests

First create a list of words that imply certain intents.

In [3]:
intent_words = [
    'dinner', 'lunch', 'breakfast', 'meeting', 'appointment', 'reminder', 'review', 'send me', 
    'need', 'how', 'schedule', 'please', 'send', 'sent', 'join', 'make sure', 
    'discuss', 'email', 'attend', 'call', 'provide', 'help', 'are there', 'are you', 'available', 
    'can i', 'can you', 'can we', 'can he', 'can she', 'can they', 'could you', 'could we', 'could i', 
    'did you', 'did i', 'did we', 'did he', 'did she', 'did they', 'do you', 'do they', 'does he', 'does she', 'do we',
    'do not', "don't", 'want', 'does that', 'does this', 'give', 'go ahead',
    'have you', 'have there', 'mail', 'is it', 'possible', 
]

In [4]:

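def filter_sentences(sentences, intent_words):
    """Keep the sentences that contain at least one intent word or phrase."""
    if sentences is None:
        return []  # Return an empty list for None values
    # Case-insensitive substring match, so 'how' also matches 'show'
    return [
        sentence for sentence in sentences
        if any(keyword in sentence.lower() for keyword in intent_words)
    ]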

df['targets'] = df['sentences'].progress_apply(lambda sentences: filter_sentences(sentences, intent_words))


100%|██████████| 503917/503917 [00:38<00:00, 13034.66it/s]
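Because the filter above checks for substrings, short keywords also match inside longer words ('how' in 'show', 'need' in 'needle'). If whole-word matching is preferred, one option is a compiled regular expression with word boundaries. The following is only a sketch of that alternative, not the method used in this notebook; `build_intent_pattern`, `filter_sentences_by_word` and the sample data are hypothetical names introduced for illustration.

```python
import re


def build_intent_pattern(intent_words):
    """Compile one case-insensitive, whole-word pattern from all keywords."""
    # Longest first, so multi-word phrases are tried before shorter keywords
    ordered = sorted(intent_words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)


def filter_sentences_by_word(sentences, pattern):
    """Keep the sentences in which the pattern matches at least once."""
    if sentences is None:
        return []
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Small hypothetical sample, not taken from the dataset
sample_words = ["how", "send", "can you", "don't"]
sample = ["Show me the numbers.", "How are you?", "Can you join us?", "I don't know."]

pattern = build_intent_pattern(sample_words)
print(filter_sentences_by_word(sample, pattern))
# ['How are you?', 'Can you join us?', "I don't know."]
```

Whole-word matching is stricter in both directions: 'meeting' would no longer match 'meetings', so inflected forms would have to be added to the list or handled by lemmatizing first.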


### Extract Action-Object Pairs

In [5]:
# Build the matcher once instead of on every call
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "VERB"}]
matcher.add("VERB", [pattern])


def filter_action_object(targets):
    """Return one list of 'lemma_object' strings per target sentence."""
    pairs = []
    for target in targets:
        doc = nlp(target)
        match_list = []
        for _match_id, start, _end in matcher(doc):
            verb = doc[start]
            for child in verb.children:
                # Keep alphabetic children of the verb labelled dobj or pobj.
                # Note: spaCy's English models usually attach a pobj to its
                # preposition rather than to the verb, so in practice this
                # mostly yields direct objects.
                if child.dep_ in ("dobj", "pobj") and child.is_alpha:
                    match_list.append(verb.lemma_ + "_" + child.text)
        pairs.append(match_list)
    return pairs

df['action_object_pairs'] = df['targets'].progress_apply(filter_action_object)

100%|██████████| 503917/503917 [2:02:40<00:00, 68.46it/s]   
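The extraction above only inspects the direct children of each verb. In spaCy's English dependency scheme, a prepositional object (`pobj`) is normally a child of the preposition (`prep`), which in turn is a child of the verb. If prepositional objects should be captured too, one possible approach is to descend one level through `prep` children. The sketch below assumes spaCy v3 and uses a hand-annotated parse, so it runs without a trained model; the annotations are an illustrative assumption rather than parser output, and `verb_object_pairs` is a hypothetical helper.

```python
import spacy
from spacy.tokens import Doc

# Tokenizer-only English pipeline: provides a vocab without a model download
nlp_blank = spacy.blank("en")

# Hand-annotated parse of "Send the report to John" (assumed for illustration)
words = ["Send", "the", "report", "to", "John"]
pos = ["VERB", "DET", "NOUN", "ADP", "PROPN"]
lemmas = ["send", "the", "report", "to", "John"]
heads = [0, 2, 0, 0, 3]  # absolute index of each token's head
deps = ["ROOT", "det", "dobj", "prep", "pobj"]
doc = Doc(nlp_blank.vocab, words=words, pos=pos, lemmas=lemmas, heads=heads, deps=deps)


def verb_object_pairs(doc):
    """Collect lemma_object strings for direct and prepositional objects."""
    pairs = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        for child in token.children:
            if child.dep_ == "dobj" and child.is_alpha:
                pairs.append(f"{token.lemma_}_{child.text}")
            elif child.dep_ == "prep":
                # The prepositional object hangs off the preposition
                for grandchild in child.children:
                    if grandchild.dep_ == "pobj" and grandchild.is_alpha:
                        pairs.append(f"{token.lemma_}_{grandchild.text}")
    return pairs


print(verb_object_pairs(doc))
```

Including prepositional objects would change the set of extracted pairs, so any downstream filtering would need to be re-checked against the new output.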


### Explode Dataframe

In [7]:
# Explode action_object_pairs and targets together so that each target sentence,
# which presumably contains an intent, gets its own row next to the
# action-object pairs extracted from that sentence
df_exploded = df.explode(['action_object_pairs','targets'], ignore_index=True)
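Exploding both columns at once works because `targets` and `action_object_pairs` hold lists of equal length in every row (one list of pairs per target sentence). A toy example of the same call, assuming pandas 1.3 or later (where `explode` accepts a list of columns); the two sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical data: row 0 has two target sentences, row 1 has none
toy = pd.DataFrame({
    "targets": [["Please send the report.", "Can you call me?"], []],
    "action_object_pairs": [[["send_report"], []], []],
})

# Same call as above: the lists in both columns are unpacked in parallel
exploded = toy.explode(["action_object_pairs", "targets"], ignore_index=True)
print(exploded)
```

Each target sentence ends up in its own row together with its (possibly empty) list of pairs, while a row whose lists were empty is kept as a single row with missing values in both columns.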

In [11]:
# Save the exploded DF
df_exploded.to_parquet("../../data/processed/targets/avocado_train_targets_exploded.parquet", engine="pyarrow")