## Filter Action-Object Pairs

In [1]:
import pandas as pd
from collections import Counter
import spacy
from spacy.matcher import Matcher
from tqdm import tqdm
tqdm.pandas()


# Load the English model
nlp = spacy.load("en_core_web_lg")

In [2]:
df = pd.read_parquet('../../data/processed/targets/avocado_train_targets_exploded.parquet', engine='fastparquet')
df = df.rename(columns={'targets': 'target'})


In [None]:
df.shape

In [4]:
# Remove empty entries

df_filtered = df[~df['action_object_pairs'].apply(lambda x: x is None or (isinstance(x, list) and len(x) == 0))]

In [None]:
df_filtered.shape

### Inspect Action-Object-Pairs results

In [6]:
# Flatten the per-row lists into one list of action-object pairs
all_words = [word for sublist in df_filtered['action_object_pairs'] for word in sublist]

# Use Counter to count occurrences of each action-object pair
word_counts = Counter(all_words)

# Convert to DataFrame (optional, if you want to keep it in tabular form)
distinct_words_df = pd.DataFrame(word_counts.items(), columns=['Word', 'Count'])
distinct_words_df = distinct_words_df.sort_values(by='Count', ascending=False)

In [None]:
distinct_words_df.head(20)

### Inspect Messages


| Action-Object-Pair | Count |
|-------------|-------------|
| start_Server | 12831 |
| send_it | 12052 |
| have_questions | 11024 |
| send_message | 10704 |
| send_email | 8186 |
| fail_Message | 8065 |
| start_Failures | 6983 |
| start_Occurrences | 6428 |
| call_me | 6177 |
| thank_you | 4472 |
| contact_me | 3995 |
| post_message | 3790 |
| miss_UNIVERSE | 3742 |
| do_what | 3584 |
| send_mail | 3490 |
| start_occurrence | 3152 |
| do_it | 2816 |
| give_call | 2642 |
| need_help | 2541 |
| unsubscribe_mailto | 2468 |

After inspecting the most frequent action-object pairs, the following ones are excluded from further analysis:

| Action-Object-Pair    | Reason |
|-------------|-------------|
|fail_Message|belongs to an error message and thus carries no human intent|
|start_Failures|belongs to an error message and thus carries no human intent|
|start_Occurrences|belongs to an error message and thus carries no human intent|
|post_message|belongs to an automated message|
|unsubscribe_mailto|belongs to an automated message|
|miss_UNIVERSE|belongs to an automated message|
|start_occurrence|belongs to an automated message|

In addition, the extremeprogramming unsubscribe messages and the Java messages are filtered out of the original dataframe.
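The pair-level exclusions from the table above could be applied roughly as follows. This is a minimal sketch, not the notebook's actual filtering code: the helper name `drop_excluded_pairs` and the constant `EXCLUDED_PAIRS` are hypothetical, and it assumes each entry of `action_object_pairs` is an iterable of strings such as `send_email`. Filtering at the message level (the extremeprogramming unsubscribe and Java messages) is not covered here.

```python
# Pairs excluded in the table above (copied from this notebook).
EXCLUDED_PAIRS = {
    "fail_Message",
    "start_Failures",
    "start_Occurrences",
    "post_message",
    "unsubscribe_mailto",
    "miss_UNIVERSE",
    "start_occurrence",
}


def drop_excluded_pairs(pairs, excluded=EXCLUDED_PAIRS):
    """Return the pairs that are not in the exclusion set, preserving order.

    Hypothetical helper: a missing entry (None) is treated as an empty list.
    """
    if pairs is None:
        return []
    return [pair for pair in pairs if pair not in excluded]


example = ["send_email", "fail_Message", "need_help"]
print(drop_excluded_pairs(example))  # ['send_email', 'need_help']
```

On the dataframe, this could presumably be used as `df_filtered['action_object_pairs'].apply(drop_excluded_pairs)`, followed by dropping the rows whose list ends up empty, in the same way as the empty-entry filter earlier in the notebook.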

In [None]:
df_filtered.iloc[1]

In [None]:
# Indexing with a column of lists raises an error; inspect the column directly instead
df_filtered['action_object_pairs'].head()