## SentiSystem

This notebook combines two components: ORL, which labels the heads of the holder and target arguments, and predicate extraction, where the predicate is assumed to be the subjective expression.

1. Tag the sentence with the *ORL model*.
2. Extract the predicate with the *spaCy model*.
3. Feed the arguments and the predicate into the *SVC classifier*.

In [1]:
# CONFIG-VARIABLES
SKLEARN_MODEL_PATH="./data/svc_relation_prediction.sav"
TRAINING_DATA_PATH="./data/training_data.csv"
FASTTEXT_MODEL_BIN_PATH="../../stancer_setup/models/cc.de.300.bin"
ORL_MODEL_PATH="../ORL/data/trained_model_german_bert"

In [2]:
!pip install spacy



In [4]:
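# `spacy download` takes one model per call; en_core_web_sm is only needed for the commented-out English example further down
!python -m spacy download de_core_news_sm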

Collecting de-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.4.0/de_core_news_sm-3.4.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


In [2]:
# Load the ORL model.
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_path = ORL_MODEL_PATH

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

pipe = pipeline(task="token-classification",
                # model=trainer.model, -- in case freshly trained
                model=model,
                tokenizer=tokenizer,
                aggregation_strategy='simple')

classified = pipe("Er sagt, dass der Präsident dem Volk etwas vorgemacht hat .")

print(classified)

[{'entity_group': 'LABEL_0', 'score': 0.9997595, 'word': 'Er sagt, dass der', 'start': 0, 'end': 17}, {'entity_group': 'LABEL_1', 'score': 0.98157835, 'word': 'Präsident', 'start': 18, 'end': 27}, {'entity_group': 'LABEL_0', 'score': 0.99997735, 'word': 'dem', 'start': 28, 'end': 31}, {'entity_group': 'LABEL_2', 'score': 0.54729056, 'word': 'Volk', 'start': 32, 'end': 36}, {'entity_group': 'LABEL_0', 'score': 0.99583834, 'word': 'etwas vorgemacht hat.', 'start': 37, 'end': 59}]


**Extract Target-Holder-Pairs**

In [3]:
from typing import NamedTuple

class PAS(NamedTuple):
    arg1: str
    arg2: str
    vLemma: str

In [4]:
label_mapper = {
    "LABEL_0": "NEUTRAL",
    "LABEL_1": "HOLDER",
    "LABEL_2": "TARGET"
}

def extract_args(bert_output):
    """Obtain arguments from a dict-list of a BERT model."""
    # the holder is the first argument, the target is the second argument
    arg1 = None
    arg2 = None
    # iterate over the chunks that were passed in
    for c in bert_output:
        label = label_mapper[c["entity_group"]]
        if label == "HOLDER":
            arg1 = c["word"]
        elif label == "TARGET":
            arg2 = c["word"]
    return arg1, arg2
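
To see what `extract_args` returns, it can be applied to the pipeline output recorded above. The sketch below is self-contained: it repeats the label mapping and the function, and copies the recorded output into a variable called `recorded_output`, a name introduced only for this illustration.

```python
# Mapping from the ORL model's raw labels to roles (as defined in this notebook).
label_mapper = {"LABEL_0": "NEUTRAL", "LABEL_1": "HOLDER", "LABEL_2": "TARGET"}

def extract_args(bert_output):
    """Return (holder, target) from aggregated token-classification output."""
    arg1 = arg2 = None
    for chunk in bert_output:
        role = label_mapper[chunk["entity_group"]]
        if role == "HOLDER":
            arg1 = chunk["word"]
        elif role == "TARGET":
            arg2 = chunk["word"]
    return arg1, arg2

# Output recorded above for "Er sagt, dass der Präsident dem Volk etwas vorgemacht hat ."
# (`recorded_output` is a name used only in this sketch).
recorded_output = [
    {"entity_group": "LABEL_0", "score": 0.9997595, "word": "Er sagt, dass der", "start": 0, "end": 17},
    {"entity_group": "LABEL_1", "score": 0.98157835, "word": "Präsident", "start": 18, "end": 27},
    {"entity_group": "LABEL_0", "score": 0.99997735, "word": "dem", "start": 28, "end": 31},
    {"entity_group": "LABEL_2", "score": 0.54729056, "word": "Volk", "start": 32, "end": 36},
    {"entity_group": "LABEL_0", "score": 0.99583834, "word": "etwas vorgemacht hat.", "start": 37, "end": 59},
]

holder, target = extract_args(recorded_output)
print(holder, target)
```

If a sentence contains several HOLDER or TARGET spans, the last one of each kind wins, which fits the one-relation-per-sentence assumption used later in this notebook.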

**Initialize dependency parsing with spacy.**

Labels are described [here](https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/mitarbeiter-innen/hagen/STTS_Tagset_Tiger).

In [5]:
import spacy
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe("conll_formatter", last=True)

# text = ('Der Minister prangert das Urteil an.')
text = ('Der Minister prangert die missliche Lage an!')

doc = nlp(text)

**Install spacy_conll and textacy to output the stanced text in CoNLL format**

In [6]:
!pip install spacy_conll textacy



In [7]:
# generate conll

from spacy_conll import init_parser

nlp = init_parser("de_core_news_sm",
                  "spacy",
                  ext_names={"conll_pd": "pandas"},
                  conversion_maps={"deprel": {"ROOT": "root"}})

doc = nlp('Sie mag ihn nicht!')

print(doc._.pandas)

   id   form  lemma upostag xpostag  \
0   1    Sie    sie    PRON    PPER   
1   2    mag  mögen     AUX   VMFIN   
2   3    ihn    ihn    PRON    PPER   
3   4  nicht  nicht    PART  PTKNEG   
4   5      !     --   PUNCT      $.   

                                               feats  head deprel deps  \
0  Case=Nom|Gender=Fem|Number=Sing|Person=3|PronT...     2     sb    _   
1  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...     0   root    _   
2  Case=Acc|Gender=Masc|Number=Sing|Person=3|Pron...     2     oa    _   
3                                                  _     2     ng    _   
4                                                  _     2  punct    _   

            misc  
0              _  
1              _  
2              _  
3  SpaceAfter=No  
4  SpaceAfter=No  


In [8]:
import textacy
# extract svo-triples for classifier

# nlp = spacy.load("en_core_web_sm")

# doc = nlp('He does not like Peter.')

text_ext = textacy.extract.subject_verb_object_triples(doc)

print([t for t in text_ext])

[]


In [9]:
print([token.text for token in doc])

['Sie', 'mag', 'ihn', 'nicht', '!']


In [10]:
for token in doc:
    print(token.text, token.tag_, token.head.text, token.dep_, token.lemma_)

Sie PPER mag sb sie
mag VMFIN mag ROOT mögen
ihn PPER mag oa ihn
nicht PTKNEG mag ng nicht
! $. mag punct --


**Approach 1: Search for separated verbs.**

Pattern syntax is described in the spaCy documentation for the `DependencyMatcher`.

In [8]:
from spacy.matcher import DependencyMatcher

pattern_separated_verbs = [
    {
        "RIGHT_ID": "main",
        "RIGHT_ATTRS": {"TAG" : {"IN": ["VVFIN", "VVINF", "VVPP"]}}
    },
    {
        "LEFT_ID": "main",
        "REL_OP": ">",
        "RIGHT_ID": "zusatz",
        "RIGHT_ATTRS": {"TAG" : "PTKVZ"}
    }
]

simple_pattern = [
    {
        "RIGHT_ID": "main",
        "RIGHT_ATTRS": {"TAG" : {"IN": ["VVFIN", "VVINF", "VVPP"]}}
    },
]


separable_matcher = DependencyMatcher(nlp.vocab)
separable_matcher.add("SEPARATED_VERBS", [pattern_separated_verbs])

simple_matcher = DependencyMatcher(nlp.vocab)
simple_matcher.add("SIMPLE_VERB", [simple_pattern])

def obtain_predicate(sent_string):
    """Obtain the predicate from a sentence string."""
    pred = ""

    text = (sent_string)
    doc = nlp(text)
    
    for token in doc:
        print(token.text, token.tag_, token.head.text, token.dep_, token.lemma_)
    
    matches = separable_matcher(doc)
    print(matches)

    # only focus on non-nested statements
    if len(matches) == 1:
        # extract and put together predicate
        for e in matches[0][1][::-1]:
            pred += doc[e].lemma_

    matches = simple_matcher(doc)
    print(matches)

    if pred == "" and len(matches) == 1:
        # standard predicate
        pred = f"{doc[matches[0][1][0]].lemma_}"
    return pred
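
The idea behind the separated-verb pattern can be illustrated without loading a spaCy model. The sketch below works on hand-written tokens rather than parser output: the `Token` type, the function name `join_separated_verb`, and the tags, heads, and lemmas assigned to the example sentences are all assumptions made for illustration. It looks for a verb particle (`PTKVZ`) attached to a full verb and prepends it to the verb lemma, as `obtain_predicate` does with the matcher result.

```python
from typing import NamedTuple, Optional, Sequence

class Token(NamedTuple):
    text: str
    tag: str    # STTS part-of-speech tag
    head: int   # index of the syntactic head within the sentence
    lemma: str

FULL_VERB_TAGS = {"VVFIN", "VVINF", "VVPP"}

def join_separated_verb(tokens: Sequence[Token]) -> Optional[str]:
    """Return particle + verb lemma if exactly one full verb has a PTKVZ child."""
    matches = [
        (tok.lemma, tokens[tok.head].lemma)
        for tok in tokens
        if tok.tag == "PTKVZ" and tokens[tok.head].tag in FULL_VERB_TAGS
    ]
    # Mirror the restriction to non-nested statements: exactly one match.
    if len(matches) != 1:
        return None
    particle, verb = matches[0]
    return particle + verb

# Hand-annotated tokens (not parser output) for
# "Der Minister prangert die missliche Lage an!"
separated = [
    Token("Der", "ART", 1, "der"),
    Token("Minister", "NN", 2, "Minister"),
    Token("prangert", "VVFIN", 2, "prangern"),
    Token("die", "ART", 5, "der"),
    Token("missliche", "ADJA", 5, "misslich"),
    Token("Lage", "NN", 2, "Lage"),
    Token("an", "PTKVZ", 2, "an"),
    Token("!", "$.", 2, "--"),
]

# Tokens for "Sie mag ihn nicht!", following the tags printed earlier.
plain = [
    Token("Sie", "PPER", 1, "sie"),
    Token("mag", "VMFIN", 1, "mögen"),
    Token("ihn", "PPER", 1, "ihn"),
    Token("nicht", "PTKNEG", 1, "nicht"),
    Token("!", "$.", 1, "--"),
]

print(join_separated_verb(separated))
print(join_separated_verb(plain))
```

For the second sentence there is no particle, so the sketch returns `None`; in the notebook this is the case where the simple-verb pattern is tried instead.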

In [13]:
# Function doing the heavy lifting, here we assume one relation per sentence (no compositionality).
def extract_pas(sentence_string, verb):
    # sentence with holder / target labels
    labelled_sentence = pipe(sentence_string)
    # obtain arguments from labels
    arg1, arg2 = extract_args(labelled_sentence)
    # obtain predicate
    # pred = obtain_predicate(sentence_string)
    pred = verb
    return PAS(arg1, arg2, pred)


**Load the SVC classifier with the pro/con relations**

In [11]:
import fasttext
import numpy as np

def load_fasttext_embeddings_from_pas(filepath, pas_list):
    # verb embeddings
    model = fasttext.load_model(filepath)
    
    vEmbs = [model[pas.vLemma] for pas in pas_list]
    # get the embeddings for the NP heads
    args1_np_head = [model[pas.arg1] for pas in pas_list]
    args2_np_head = [model[pas.arg2] for pas in pas_list]
    # clear from memory
    del model
    return args1_np_head, args2_np_head, vEmbs

def make_features(args1, args2, vEmbs):
    """Horizontally concatenate the embeddings for each training instance."""
    # one row per instance; the result is always 2-D, even for a single instance
    return np.array([np.concatenate(row) for row in zip(args1, args2, vEmbs)])

In [125]:
import pandas as pd

# --- CLEAN ---

training_data = pd.read_csv(TRAINING_DATA_PATH)

print(training_data.rel_type.value_counts())

# count duplicates (assumes multiple PAS in a given sentence)
counts_per_sent = training_data.groupby(['full_sentence_text']).size().reset_index(name='counts').sort_values(by='counts', ascending=False)

# print(counts_per_sent)

# drop duplicates by sentence
training_data_dedup = training_data.drop_duplicates(subset=['full_sentence_text'])

# counts_per_sent.info()
# training_data_dedup.info()

# join new with counts
joined_training_data = pd.merge(training_data_dedup, counts_per_sent, on='full_sentence_text')

# print(joined_training_data)

# select only sentences with a count of 1 PAS.
simplified_training_data = joined_training_data[joined_training_data["counts"] == 1]

# simplified_training_data = simplified_training_data[simplified_training_data["rel_type"] == "pro"]

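# drop rows whose argument head is just a full stop; copy so that columns can be added later without a SettingWithCopyWarning
simplified_training_data = simplified_training_data[(simplified_training_data.arg1_head != ".") & (simplified_training_data.arg2_head != ".")].copy()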

simplified_training_data

neutral    21252
pro         3320
con         2427
Name: rel_type, dtype: int64


Unnamed: 0,doc_num,verb_form,verb_lemma,arg1,arg1_pos,arg1_head,arg2,arg2_pos,arg2_head,rel_type,pred_serial,full_sentence_text,counts
0,0,verbessern,verbessern,Ein Inkrafttreten des Gegenentwurfs,N,Inkrafttreten,die Situation,N,Situation,pro,"Predicate(type='pro', args=(Head(sentence=23, ...",Ein Inkrafttreten des Gegenentwurfs wird die S...,1
1,0,durchsetzen,durchsetzen,Die Initiative,N,Initiative,ein vollständiges Werbeverbot für Tabak,N,Werbeverbot,pro,"Predicate(type='pro', args=(Head(sentence=9, t...",Die Initiative will jedoch auf versteckte Weis...,1
2,3,verbieten,verbieten,Der Bundesrat,N,Bundesrat,die Tabakwerbung,N,Tabakwerbung,con,"Predicate(type='con', args=(Head(sentence=2, t...",Der Bundesrat will nun die Tabakwerbung in Kin...,1
3,3,verbieten,verbieten,der Bundesrat,N,Bundesrat,das Verteilen von Gratismustern,N,Verteilen,con,"Predicate(type='con', args=(Head(sentence=13, ...",Allerdings will der Bundesrat das Verteilen vo...,1
4,3,unterstütze,unterstützen,die Wirtschaft,N,Wirtschaft,ein nationales Verkaufsverbot,N,Verkaufsverbot,pro,"Predicate(type='pro', args=(Head(sentence=9, t...",Ebenso unterstütze die Wirtschaft zusätzlich e...,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20630,8536,sank,sinken,der Anteil der über 65-Jährigen,N,Anteil,22 Prozent,N,Prozent,neutral,"Predicate(type='neutral', args=(Head(sentence=...",Beispielsweise bei der Studie in England sank ...,1
20631,8536,bleiben,bleiben,Demenz,N,Demenz,eine grosse Herausforderung für die alternden ...,N,Herausforderung,neutral,"Predicate(type='neutral', args=(Head(sentence=...","Das wäre ein grosser Fehler , findet ­ Monique...",1
20632,8536,ist,sein,Die Evidenz,N,Evidenz,ziemlich schwach,ADV,schwach,neutral,"Predicate(type='neutral', args=(Head(sentence=...",Die Evidenz ist immer noch ziemlich schwach » ...,1
20633,8536,schützen,schützen,Bessere Lebensbedingungen und mehr Bildung,N,Lebensbedingungen,Bessere Lebensbedingungen und mehr Bildung,N,Lebensbedingungen,neutral,"Predicate(type='neutral', args=(Head(sentence=...",Bessere Lebensbedingungen und mehr Bildung sch...,1
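
The cleaning step above boils down to one rule: keep a row only if its sentence occurs exactly once in the data. A minimal sketch of that rule in plain Python follows; the function name `keep_single_pas_sentences` and the toy rows are invented for illustration and are not taken from the training data.

```python
from collections import Counter

def keep_single_pas_sentences(rows):
    """Keep only rows whose sentence text occurs exactly once in `rows`."""
    counts = Counter(row["full_sentence_text"] for row in rows)
    return [row for row in rows if counts[row["full_sentence_text"]] == 1]

# Toy rows, invented for this sketch.
rows = [
    {"full_sentence_text": "Satz A", "verb_lemma": "verbieten"},
    {"full_sentence_text": "Satz B", "verb_lemma": "sinken"},
    {"full_sentence_text": "Satz A", "verb_lemma": "schützen"},
]

kept = keep_single_pas_sentences(rows)
print(kept)
```

In pandas, `training_data.drop_duplicates(subset=['full_sentence_text'], keep=False)` should select the same rows in one call, since `keep=False` drops every row that has a duplicate; it does not add the `counts` column, though.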


**Prepare data for experiment #1. Details can be found here: [R-BERT](https://github.com/monologg/R-BERT)**

In [121]:
!pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.6.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.6 MB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.3.0-cp310-cp310-manylinux_2_28_aarch64.whl (3.1 MB)
Collecting cycler>=0.10
  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.0.6-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (278 kB)

In [123]:
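# simplified_training_data
import re

def mark_entity(sent, head, tag):
    """Wrap the first occurrence of `head` in `sent` with <tag> ... </tag>.

    Returns None if either value is not a string or `head` does not occur in `sent`.
    """
    if not isinstance(sent, str) or not isinstance(head, str):
        return None
    # escape the head so that regex metacharacters are matched literally
    res = re.search(re.escape(head), sent)
    if res is None:
        return None
    start, end = res.span()
    # drop the character before the head (usually a space) and keep the one after it,
    # which yields e.g. "Der <e1> Bundesrat </e1> will ..."
    before = sent[:max(0, start - 1)]
    inside = sent[start:min(len(sent), end + 1)]
    after = sent[min(len(sent), end + 1):]
    return f"{before} <{tag}> {inside}</{tag}> {after}"

def generate_head_based_entity_sentences(df):
    entified_list = []
    for t in df.itertuples():
        # mark the first argument head, then the second one in the already marked sentence
        entified = mark_entity(t.full_sentence_text, t.arg1_head, "e1")
        if entified is not None:
            entified = mark_entity(entified, t.arg2_head, "e2")
        # None marks sentences in which one of the heads could not be found
        entified_list.append(entified)
    return entified_list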

entified_list = generate_head_based_entity_sentences(simplified_training_data)

simplified_training_data["entified_sents"] = entified_list

# simplified_training_data["rel_type"] = simplified_training_data["rel_type"].map({
#    "pro": "pro(e1, e2)",
#    "con": "con(e1, e2)"
# })

reduced_training_data = simplified_training_data[["rel_type", "entified_sents", "arg1_head", "arg2_head", "full_sentence_text"]].copy(deep=True)

reduced_training_data.rel_type = simplified_training_data.rel_type.replace(to_replace=dict(pro="pro(e1, e2)", con="con(e1, e2)", neutral="neu(e1, e2)"))

reduced_training_data.rel_type.value_counts()

# reduced_training_data.to_csv("./data/experiment_1_training_data.tsv", sep="\t", header=False, index=False)

neu(e1, e2)    10006
pro(e1, e2)     2295
con(e1, e2)     1560
Name: rel_type, dtype: int64
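
To make the marked-up sentence format concrete, here is a simplified, self-contained sketch of the entity marking. It is not the notebook's function: `mark_head` is a hypothetical helper that inserts the tags around the match without touching the neighbouring characters, and the example sentence is invented, using two heads that appear in the table above.

```python
import re

def mark_head(sentence, head, tag):
    """Wrap the first literal occurrence of `head` in <tag> ... </tag>."""
    match = re.search(re.escape(head), sentence)
    if match is None:
        return None  # head not found in the sentence
    start, end = match.span()
    return f"{sentence[:start]}<{tag}> {head} </{tag}>{sentence[end:]}"

# Invented example sentence; the heads "Bundesrat" and "Tabakwerbung"
# occur in the training data shown above.
sentence = "Der Bundesrat will die Tabakwerbung verbieten ."

marked = mark_head(sentence, "Bundesrat", "e1")
marked = mark_head(marked, "Tabakwerbung", "e2")
print(marked)
```

Because the second search runs on the already marked sentence, a second head that is identical to the first one would be matched inside the `<e1>` span; rows of that kind may need separate handling.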

**Experiment 0: Run the basic classifier**

TODO: This still requires a method to extract the verb lemma, e.g. via SVO triples.

In [None]:
# --- CLASSIFY ---

pas_structures = [extract_pas(p, v) for p, v in zip(training_data.full_sentence_text.to_list(), training_data.verb_lemma.to_list())]

args1_np_head, args2_np_head, vEmbs = load_fasttext_embeddings_from_pas(FASTTEXT_MODEL_BIN_PATH, pas_structures)

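# NOTE: the verb embeddings are passed as the second block, so each row is [arg1 | verb | arg2];
# this order has to match the one used when the SVC was trained.
X_np_args = make_features(args1_np_head, vEmbs, args2_np_head)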

In [None]:
# load sklearn label encoder
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.classes_ = np.load('classes.npy')

In [56]:
import pickle

# load the model from disk (only unpickle files from a trusted source)
with open(SKLEARN_MODEL_PATH, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

# result = loaded_model.score(X_test, Y_test)

preds = loaded_model.predict(X_np_args)

print(preds)

print(training_data.rel_type.to_list())

[0 0 0 2 0 0 1 1 1 1]
['pro', 'pro', 'con', 'pro', 'con', 'pro', 'pro', 'pro', 'pro', 'pro']
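
The two printed lists cannot be compared directly, because the predictions are integer class ids while the gold labels are strings. The sketch below decodes the ids and computes a simple accuracy. It rests on a loud assumption: the class order `["con", "neutral", "pro"]` is only a guess based on the alphabetical order that scikit-learn's `LabelEncoder` uses when fitted on string labels; the real order is whatever is stored in `classes.npy`. The helper names `decode` and `accuracy` are introduced for illustration.

```python
# ASSUMPTION: alphabetical class order, as LabelEncoder produces when fitted
# on string labels. The actual order is the one stored in classes.npy.
assumed_classes = ["con", "neutral", "pro"]

def decode(pred_ids, classes):
    """Map integer class ids back to their string labels."""
    return [classes[i] for i in pred_ids]

def accuracy(predicted, gold):
    """Share of positions where the predicted and gold labels agree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Values copied from the output above.
preds = [0, 0, 0, 2, 0, 0, 1, 1, 1, 1]
gold = ['pro', 'pro', 'con', 'pro', 'con', 'pro', 'pro', 'pro', 'pro', 'pro']

decoded = decode(preds, assumed_classes)
print(decoded)
print(accuracy(decoded, gold))
```

With the real encoder loaded, `encoder.inverse_transform(preds)` performs the same decoding and removes the need for the assumed class order.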


On the security and maintainability limitations of pickled scikit-learn models, see https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
