This notebook was used to generate graphics for our report.

In [1]:
# !conda install -c conda-forge textacy

# Triplet Generation
To export triplets for each file into a new folder in data. 
- 1 triplet = 1 line 
- filename to match source name 
- filename to 
- deliminator to be | as indicated by kshitj
- for any tags, to use [] 

Why: 
- to be pushed into a file for triplet to text OR for reading, etc. 

example: 
001.txt in folder `docs\triplets_LO`
```
profits | were, buoyed | gains, users
economy | tanked | 10000%
greenspan | boasts | record growth 
...

```

In [2]:
import os 
import spacy 
import textacy  # this is built on top of spacy. it has some out of the box functions like YAKE, but documentation is kinda sucky
from pprint import pprint
from tqdm import tqdm

# Load articles to memory 

In [3]:
# FOLDER = "./data/BBC/News Articles/business"
FOLDER = "./data/BBC/News Articles_w_CovarianceRes/business" # Coref

# Extract Keywords from Title
def load_articles(PATH):
    news_files = sorted([f for f in os.listdir(PATH) if f.endswith('.txt')])

    news_list = []
    for file in news_files:
        with open(os.path.join(PATH, file), 'r') as f:
            text = f.read()
            news_list.append(text)
    
    return news_list

articles = load_articles(FOLDER)
print(articles[0])

Ad sales boost Time Warner profit

Quarterly profits at Time Warner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

Time Warner benefited from sales of high-speed internet connections and higher advert sales. Time Warner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Time Warner profit

Quarterly profits at US media giant TimeWarner were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that Time Warner now owns 8% of search-engine Google. But AOL had has mixed fortunes. AOL lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, Time Warner said AOLunderlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. Time Warner hopes to increase subscribers by offering the online service free to Time Warner internet customers and will try to sign up AOLexisting customers f

# Generate Keywords from Title


In [4]:
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

def seperate_title_and_body(article, VERBOSE = False): 
    """
    Input: a string file
    Returns: list of format (title, [para1, para2, para3, ...])
    """

    paragraphs = article.split("\n\n")

    title = paragraphs[0]
    body = ' '.join(paragraphs[1:])
    if VERBOSE: 
        print("title:", title)
        print("body:", body)
    
    return (title, body)


# Get Keywords from title 
def get_title_keywords_v1(tup_article, nlp, formatting = "compound", VERBOSE = False):
    """
    =================
    Input
    =================
    tup_article: 
    - a Tuple pair  of format (title, body)
    
    nlp:
    - a spacy nlp pipeline. basic one will suffice 

    formatting: 
    - "atomic" returns lemmatized single words for each compound phrase
    - "compound" returns compound phrase in lowercasing  (default)

    verbose:
    - if True, will show renders and step-wise outputs

    =================
    Returns
    =================
    Returns a tuple pair of string keywords: [[entities],[other keywords]]
    
    =================
    Notes
    =================
    - Entity Keywords are NOT lower cased because this is relevant information
    - Other Keywords are lowercased and lemmatized 

    formatting

    """ 
    # CUSTOMIZE HERE
    title = tup_article[0]
    doc_title = nlp(title)
    tags = ["NOUN", "ADJ", "VERB"]
    entities = ["PERSON", "ORG", "PRODUCT", "EVENT", "GPE", "FAC", "NORP"]
    keywords_ents = []
    keywords_others = []
    processed = []

    if VERBOSE:
        spacy.displacy.render(doc_title, style = "dep")
        spacy.displacy.render(doc_title, style = "ent")
        

    keywords_ents.extend([ent.text for ent in doc_title.ents if ent.label_ in entities])
    
    if VERBOSE:
        print("Keywords (entites):", keywords_ents)

    for tok in doc_title:
        if ((tok.pos_ in tags) 
            and (tok.lemma_ not in processed) 
            and (tok.text not in keywords_others) 
            and (tok.text not in keywords_ents)):
            if tok.dep_ == "compound":
                if formatting == "atomic": 
                    a = tok.lemma_
                    b = tok.head.lemma_
                    keywords_others.append(a)
                    keywords_others.append(b)
                    processed.extend([tok.lemma_, tok.head.lemma_])
                if formatting == "compound": 
                    a = tok.text.lower()
                    b = tok.head.text.lower()
                    keywords_others.append(f"{a} {b}")
                    processed.extend([tok.lemma_, tok.head.lemma_])
                if formatting == "hybrid":  
                    a = tok.text.lower()
                    b = tok.head.text.lower()
                    keywords_others.append(tok.lemma_)
                    keywords_others.append(tok.head.lemma_)
                    keywords_others.append(f"{a} {b}")
                    processed.extend([tok.lemma_, tok.head.lemma_])
            else:
                keywords_others.append(tok.text.lower())

    if VERBOSE:
        print("Keywords (entities + others):", keywords_others)
    return (keywords_ents, keywords_others)
    


##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Init pipeline
nlp = spacy.load("en_core_web_lg")
# nlp.add_pipe("merge_entities")  # removed because it doesn't make a noticable difference in ouput

article_keywords = []
for i in range(len(articles)):
    # print(f'\n=========ARTICLE: {i+1}.txt=========')
    article = seperate_title_and_body(articles[i], VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = article, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    # print(keywords)

print(f"articles processed: {len(article_keywords)}")


articles processed: 510


# Tokenize & Tag body according to Keywords

In [5]:
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

def tag_and_tokenize_keywords(doc):
    # Create a lemmatized version of the input document

    matcher_ent = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LEMMA")  
    matcher_other = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns_ent = [nlp(keyword) for keyword in keywords[0]]
    patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
    matcher_ent.add("KEYWORD_ENT", patterns_ent)
    matcher_other.add("KEYWORD_OTHER", patterns_other)

    lemmatized_doc = nlp(" ".join([token.lemma_ for token in doc]))

    # Retokenize the document to merge multi-token keyword spans
    with doc.retokenize() as retokenizer:
        # Iterate through the matchers (matcher_ent, matcher_other) and their target documents (lemmatized_doc, doc)
        for matcher, target_doc in zip([matcher_ent, matcher_other], [lemmatized_doc, doc]):
            # Find matches in the target document using the current matcher
            for match_id, start, end in matcher(target_doc):
                # Create a span in the original document corresponding to the matched keywords
                span = doc[start:end]
                # Merge the span into a single token
                retokenizer.merge(span)

                # Set the 'is_keyword' and 'custom_tag' attributes for the merged token
                span[0].set_extension("is_keyword", default=False, force=True)
                span[0]._.is_keyword = True
                span[0].set_extension("custom_tag", default=None, force=True)
                span[0]._.custom_tag = nlp.vocab.strings[match_id]

    return doc

def add_keywords_to_ents(doc):
    new_ents = []
    for token in doc:
        if token._.is_keyword:
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label=token._.custom_tag)
            new_ents.append(new_ent)

    # Filter out existing entities that overlap with the new keyword entities
    filtered_ents = []
    for ent in doc.ents:
        overlaps = [ent.start <= new_ent.start < ent.end or new_ent.start <= ent.start < new_ent.end for new_ent in new_ents]
        if not any(overlaps):
            filtered_ents.append(ent)

    doc.set_ents(filtered_ents + new_ents)
    return doc

def convert_entities_to_tokens(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            # Create a span for the current entity
            span = doc[ent.start:ent.end]
            # Merge the span into a single token
            retokenizer.merge(span)

    return doc

def merge_noun_chunks(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            retokenizer.merge(np)
    return doc

def merge_symb2num(doc):
    """
    Merge adjacent currency symbol and number tokens.
    """
    i = 1
    while i < len(doc):
        if doc[i].is_digit and doc[i - 1].is_currency:
            span = doc[doc[i - 1].i: doc[i].i + 1]
            with doc.retokenize() as retokenizer:
                retokenizer.merge(span)
        else:
            i += 1
    return doc


# Displacy Formatting 
col_highlight1 = "magenta"
col_highlight2 = "yellow"
col_others = "lightblue"
options_ent = {
    "ents": ["KEYWORD_ENT", "KEYWORD_OTHER",
             "ORG", "PRODUCT", "GPE", "LOC", "PERSON", "FAC", "NORP", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "LANGUAGE", "EVENT", "LAW", "WORK_OF_ART"],
    "colors": {"KEYWORD_ENT": col_highlight1, 
               "KEYWORD_OTHER": col_highlight2, 
               "ORG": col_others, 
               "PRODUCT": col_others, 
               "GPE": col_others, 
               "LOC": col_others, 
               "PERSON": col_others, 
               "FAC": col_others, 
               "NORP": col_others, 
               "DATE": col_others, 
               "TIME": col_others, 
               "PERCENT": col_others, 
               "MONEY": col_others, 
               "QUANTITY": col_others, 
               "ORDINAL": col_others, 
               "CARDINAL": col_others, 
               "LANGUAGE": col_others, 
               "EVENT": col_others, 
               "LAW": col_others, 
               "WORK_OF_ART": col_others}  
}

options_dep = {'compact': True, 'color': 'black', 'bg': 'white', 'offset': 100, 'distance': 100, 'font': 'Arial'}

##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Generate Sample
doc_idx = 368
title = seperate_title_and_body(articles[doc_idx])[0]
body = seperate_title_and_body(articles[doc_idx])[1]
keywords = article_keywords[doc_idx]
print("title: ", title)
print("title keywords: ", keywords)
print("body: ", body)

# Create new NLP instances, incl matchers
nlp = spacy.load("en_core_web_lg")

matcher_ent = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LEMMA")  
matcher_other = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
patterns_ent = [nlp(keyword) for keyword in keywords[0]]
patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
matcher_ent.add("KEYWORD_ENT", patterns_ent)
matcher_other.add("KEYWORD_OTHER", patterns_other)

# Pipeline the inputs
doc = nlp(body)
doc = tag_and_tokenize_keywords(doc)
doc = add_keywords_to_ents(doc)
# doc = convert_entities_to_tokens(doc)
# doc = merge_noun_chunks(doc)
# doc = merge_symb2num(doc)

# spacy.displacy.render(doc, style = "dep")
spacy.displacy.render(doc, style = "dep", options = options_dep)
spacy.displacy.render(doc, style = "ent", options = options_ent)


title:  Beer giant swallows Russian firm
title keywords:  (['Russian'], ['beer', 'giant', 'swallows', 'firm'])
body:  Alfa-Eco'shas Alfa-Eco's stake in Sun Interbrew, Russia's second-largest brewer . agreed gives Beer giant, near-total control over Russian firm . Beer giant bought out another partner in August 2004. Beer giant brands include Bass, Stella Artois, Hoegaarden and Staropramen. Beer giant employs 77,000 people, running operations in over 30 countries across the Americas, Europe and Asia Pacific. Beer giant said Beer giant would own 97.3% of the voting shares and 98.8% of the non-voting shares of Russian firm . agreed is expected to be completed in the first quarter of 2005. Beer giant was formed in August 2004 when Belgium's Interbrew bought Brazilian brewer Ambev. Russian firm owns breweries in eight Russian cities - Klin, Ivanovo, Saransk, Kursk, Volzhsky, Omsk, Perm and Novocheboksarsk. There are also three breweries in Ukraine, in the cities of Chernigov, Nikolaev and K

# Generate Tagged Triplets based on keywords

**Note on SVO function**

The subject_verb_object_triples function in Textacy is designed to extract Subject-Verb-Object (SVO) triples from a given text. It follows a set of heuristics based on dependency parsing and linguistic rules. Here's an explanation of the key steps:

1. Iterate over sentences: The function works on a sentence level, so it first iterates over the sentences in the input text.
2. Initialize verb_sos dictionary: This dictionary stores the subjects and objects associated with each verb in the sentence. The keys in the dictionary are verbs, andthe values are dictionaries containing sets of subjects and objects.
3. Iterate over tokens: For each sentence, the function iterates over its tokens. Each token is analyzed based on its dependency label and part-of-speech tag. The primary purpose of this iteration is to identify the subjects and objects associated with each verb.
4. Identify subjects and objects: The function checks if a token is a subject or an object based on its dependency label. If a token is a subject or object, it is added to the corresponding set in the verb_sos dictionary for the associated verb. Subjects can be nominal (nouns) or clausal (subordinate clauses). Objects can be nominal (nouns), prepositional (introduced by a preposition), or clausal (subordinate clauses).
5. Expand subjects and objects: The function expands subjects and objects to include related tokens, such as compound nouns, conjuncts, or tokens within a clause. This is done using helper functions like expand_noun, expand_verb, and the .subtree attribute of tokens.
6. Handle verb conjuncts: The function addresses cases where multiple verbs are connected by a conjunction (e.g., "I read and wrote"). In such cases, the subjects and objects of the conjunct verbs are shared. The function updates the verb_sos dictionary to ensure that all conjunct verbs have the correct subjects and objects.
7. Generate SVO triples: Finally, the function iterates over the verb_sos dictionary and, for each verb with associated subjects and objects, creates an SVO triple. Each triple is a tuple containing three lists: subjects, verbs, and objects. The subjects, verbs, and objects are sorted by their position in the text (using the .i attribute).

In summary, the function generates SVO triples by analyzing the input text's dependency structure and applying heuristics based on linguistic rules. This approach is effective in many cases but may not always produce perfect results, as it relies on the accuracy of the dependency parser and the underlying linguistic assumptions.

Source code: 
`https://textacy.readthedocs.io/en/latest/_modules/textacy/extract/triples.html#subject_verb_object_triples`

In [9]:

##########################################################################################################################################
# Helper Functions
##########################################################################################################################################
# The following generates triples regardless of keywords. 
# See ref: https://textacy.readthedocs.io/en/latest/api_reference/extract.html#triples


def generate_triplet2sentence_pairs(doc, arr_keyword, return_type = 'std', keyword_filter = False, VERBOSE = False):
    '''
    Returns a list of tuple-pairs: 
    [(triple, sentence), (triple, sentence), ...]

    Parameters:
    -----------
    doc : spacy.tokens.Doc
        A SpaCy document containing the text to extract SVO triplets from.
    arr_keyword : List[List[str]]
        A nested list of keywords (strings), where each sublist represents a set of related keywords.
    keyword_filter : bool, optional (default=False)
        If True, only include triplets containing at least one keyword from the given keyword list.
    VERBOSE : bool, optional (default=False)
        If True, print additional information, such as the list of flattened keywords.

    Returns:
    --------
    pairs : List[Tuple[str, spacy.tokens.Span]]
        A list of tuple pairs, where each tuple contains a formatted triplet (as a string) and
        the sentence.

    Example usage:
    --------------
    keywords = ['Time Warner', 'sales', 'boost', 'profit']
    svo_pairs = generate_triplet2sentence_pairs(doc, arr_keyword=keywords, keyword_filter=True,
    
    '''

    pairs = []
    combined_data = []
    
    # filtering, if it happens
    flat_keywords = [item for sublist in arr_keyword for item in sublist]
    if VERBOSE: print("flat keywords: ", flat_keywords)

    # Iterate through sentences in the doc
    for sent in doc.sents:
        # Extract SVO triples for each sentence
        triples = list(textacy.extract.subject_verb_object_triples(sent))
        formatted_triples = []
        
        for triple in triples:


            ###########################
            # visualizer 
            ###########################
            if VERBOSE:
                options = {'compact': True, 'color': 'black', 'bg': 'white', 'offset': 100, 'distance': 100, 'font': 'Arial'}
                spacy.displacy.render(sent, style='dep', jupyter=True, options=options)

                sent_start = sent[0].idx
                entities_s = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'S'} for t in triple[0]]
                entities_v = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'V'} for t in triple[1]]
                entities_o = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'O'} for t in triple[2]]
                all_entities = entities_s + entities_v + entities_o
                displacy_ex = [{'text': sent.text, 'ents': all_entities, 'title': None}]

                spacy.displacy.render(displacy_ex, style='ent', jupyter=True, manual=True, options={'colors': {'S': 'yellow', 'V': 'magenta', 'O': 'orange'}})
            

            ###########################
            # modify triplets 
            ###########################
            # for s, v, o, print their entity types 
            if VERBOSE: 
                for svo in triple:
                    for tok in svo:
                        if hasattr(tok, "ent_id"):
                            print(tok, " : ", tok.ent_type_)
            
            for svo in triple:
                    
                for idx, tok in enumerate(svo[:-1]): 

                    # drop if its AUX 
                    if tok.pos_ == "AUX":
                        # drop from v_toks
                        svo.remove(tok)
                    
                    # drop if its PART
                    if tok.pos_ == "PART":
                        # drop from v_toks
                        svo.remove(tok)
                    
                    # if entity = money, include the prceeding token if its a symbol. new token should be put in front of the token
                    if tok.ent_type_ == "MONEY" and tok.nbor(-1).pos_ == "SYM":
                        # check if the token to the left of tok is already in the list
                        if idx > 0 and svo[idx-1] == tok.nbor(-1):
                            continue
                        svo.insert(idx - 1, tok.nbor(-1))
                    
                    
                    # FIXME - if token's child is linked to it by a possesive modifier, replace the token with the child
                    # has_poss_child = any(child.dep_ == "poss" for child in tok.children)
                    # if has_poss_child:
                    #     print("POSS FOUND: ", tok, " : ", tok.children[0])
                    #     svo[idx] = tok.children[0]


            ###########################
            # reject triples
            ###########################
            
            # Final check: only proceed if subject, verb or object have less than 5 tokens each
            if len(triple[0]) > 5 or len(triple[1]) > 5 or len(triple[2]) > 5:
                if VERBOSE: print("REJECT: Triple is too long. Passing")
                continue

            # reject subject is it is not an entity or noun 
            if (not any([tok.ent_type_ for tok in triple[0]]) and not any([tok.pos_ == "NOUN" for tok in triple[0]])):
                if VERBOSE: print("REJECT: Subject does not have entity type or is not a noun. Passing")
                continue
            
            # reject object is it is not an entity or noun 
            # if (not any([tok.ent_type_ for tok in triple[2]]) and not any([tok.pos_ == "NOUN" for tok in triple[2]])):
            #     if VERBOSE: print("REJECT: Object does not have entity type or is not a noun. Passing")
            #     continue
            
            
            if VERBOSE: 
                print(f"processed: {[tok.text for tok in triple[0]], [tok.text for tok in triple[1]], [tok.text for tok in triple[2]]}")
            ###########################
            # final formatting 
            ###########################

            subject = []
            verb = []
            obj = []

            for tok in triple[0]:
                if tok.ent_type_ and tok.ent_type_ != "KEYWORD_OTHER":
                    subject.append(tok.text)
                else:
                    subject.append(tok.lemma_)
            
            for tok in triple[1]:
                if tok.ent_type_:
                    verb.append(tok.text)
                else:
                    verb.append(tok.lemma_)
            
            for tok in triple[2]:
                if tok.ent_type_ and tok.ent_type_ != "KEYWORD_OTHER":
                    obj.append(tok.text)
                else:
                    obj.append(tok.lemma_)

            if keyword_filter:
                # Check if any keyword is present in the subject, verb, or object
                if not any(keyword in subject + verb + obj for keyword in flat_keywords):
                    continue
            
            formatted_triples = f"{'_'.join(subject)} | {'_'.join(verb)} | {'_'.join(obj)}"

            # store to main lists 
            pairs.append((formatted_triples, sent.text))
            combined_data.append(formatted_triples + " <==> " + sent.text)
            
            if VERBOSE: print("added: ", formatted_triples)

    if return_type == "std":
        return pairs
    if return_type == "llm":
        return combined_data

##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

keywords = article_keywords[doc_idx]
# print(keywords)
pairs = generate_triplet2sentence_pairs(doc, arr_keyword = keywords, keyword_filter = False, VERBOSE= True)
print ([a for a,b in pairs])


IndexError: list index out of range

# Generate TXT files

In [7]:
##########################################################################################################################################
# Helper Functions
##########################################################################################################################################

def save_tuple_pairs_to_files(pairs, file1_path, file2_path):
    with open(file1_path, 'w', encoding='utf-8') as file1, open(file2_path, 'w', encoding='utf-8') as file2:
        for a, b in pairs:
            file1.write(f"{a}\n")
            file2.write(f"{b}\n")


## Generic Triplets for Team Use

In [None]:
stop

In [8]:
##########################################################################################################################################
# Generate Triples and Sentences, orgainzed by filename
# USECASE: for general usage by team
##########################################################################################################################################

# Generate Triples and their Reference Sentences. Save it to a file 
PATH_TRIPLE = './data/BBC/Training_strict/business_triples/'
PATH_SENTENCE = './data/BBC/Training_strict/business_sentences/'

article_keywords = []
article_pairs = [] 
for i, article in tqdm(enumerate(articles)):
    # NOTE: error in article 409 , 476. replaced with simple triplets
    if i < 476:
        continue
    
    ####################################
    ## Generate Keywords
    nlp = spacy.load("en_core_web_lg")

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    # print(keywords)
    
    ####################################
    ## Retokenize Body 

    # Create new NLP instances, incl matchers
    body = seperate_title_and_body(article)[1]
    nlp_retok = spacy.load("en_core_web_lg")
    doc = nlp_retok(title_body[1])
    doc = tag_and_tokenize_keywords(doc)
    doc = add_keywords_to_ents(doc)

    ####################################
    ## Generate Triples
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, keyword_filter = False, VERBOSE= False)
    article_pairs.append(pairs)
    filename =f"{i+1:03d}.txt"
    FILE_TRIPLE = PATH_TRIPLE + filename
    FILE_SENTENCE = PATH_SENTENCE + filename

    save_tuple_pairs_to_files(pairs, FILE_TRIPLE, FILE_SENTENCE)


print(f"articles processed: {len(article_keywords)}")


486it [00:40, 11.95it/s] 


KeyboardInterrupt: 

## Triplets for TextGen

In [None]:
##########################################################################################################################################
# Generate FILTERED Triples
# USECASE: for generating summaries
# NOTES: triples are filtered for presence of title keywords in sub, verb, obj. If none, the triple is discarded.
##########################################################################################################################################

# Generate Triples and their Reference Sentences. Save it to a file 
PATH_TRIPLE = './data/BBC/Generation/Business/'
PATH_SENTENCE = './data/BBC/Training_strict/triples/'

article_keywords = []
article_pairs = [] 
for i, article in tqdm(enumerate(articles)):

    # print(f'\n=========ARTICLE: {i+1}.txt=========')
    
    ####################################
    ## Generate Keywords
    nlp = spacy.load("en_core_web_lg")

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    
    ####################################
    ## Retokenize Body 

    # Create new NLP instances, incl matchers
    nlp_retok = spacy.load("en_core_web_lg")
    doc = nlp_retok(title_body[1])
    doc = tag_and_tokenize_keywords(doc, matcher_ent, matcher_other)
    doc = add_keywords_to_ents(doc)

    ####################################
    ## Generate Triples
    
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, keyword_filter = False, VERBOSE= False)
    article_pairs.append(pairs)
    filename =f"{i+1:03d}.txt"
    FILE_TRIPLE = PATH_TRIPLE + filename
    FILE_SENTENCE = PATH_SENTENCE + filename

    save_tuple_pairs_to_files(pairs, FILE_TRIPLE, FILE_SENTENCE)


print(f"articles processed: {len(article_keywords)}")


## Triplets for LLM Tuning


In [None]:
##########################################################################################################################################
# Generate per-line triples and their reference sentences
# USECASE: for tuning a LLM
# NOTES: for triplet 2 sentence, we return type = "llm". This returns a list of strings, where each string is a triplet + sentence pair
##########################################################################################################################################

# Generate Triples and their Reference Sentences. Save it to a file 

FOLDER = './data/BBC/Training_strict/business_gpt2/'
FILENAME = "business_gpt2.txt"
PATH = FOLDER + FILENAME

article_keywords = []
article_pairs = [] 
training_pairs = [] 
for i, article in tqdm(enumerate(articles)):

    ####################################
    ## Generate Keywords

    nlp = spacy.load("en_core_web_lg")

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    print(keywords)
    
    ####################################
    ## Retokenize Body 
    # Create new NLP instances, incl matchers

    nlp_retok = spacy.load("en_core_web_lg")
    doc = nlp_retok(title_body[1])
    doc = tag_and_tokenize_keywords(doc, matcher_ent, matcher_other)
    doc = add_keywords_to_ents(doc)

    ####################################
    ## Generate Triples
    
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, return_type = "llm", keyword_filter = False, VERBOSE= False)
    training_pairs.extend(pairs)
    

with open(PATH, 'w', encoding= "utf-8") as file:
    for line in training_pairs:
        file.write(f"{line}\n")
    

print(f"articles processed: {len(article_keywords)}")


# OLD 

In [None]:
title = title
body = body 
keywords = keywords
doc = doc

def extract_triplets(doc_body, VERBOSE = False):
    triplets = []


    # Split the text into sentences
    sentences = list(doc_body.sents)

    for sentence in sentences:
        # Find the keywords in the sentence
        keywords = [token for token in sentence if token._.is_keyword]
        if VERBOSE: 
            print(f"keywords: {keywords}")
            print(f"sentence: {sentence.text}")
            print(f"""sentence root: {sentence.root.text}""")

        # Iterate through the keywords and extract the triplets
        for keyword in keywords:
            subject, verb, obj = None, None, None

            # Traverse the dependency tree to find the subject, verb, and object related to the keyword
            for token in sentence:
                if VERBOSE: 
                    print(f"""token: {token.text}, {token.dep_}""")
                    print(f"""head: {token.head.text}""")
                     
                if token.dep_ in ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"] and token.head == keyword:
                    subject = token
                    verb = token.head
                elif token.dep_ in ["dobj", "attr", "prep", "dative"] and token.head == keyword:
                    obj = token
                    verb = token.head
            # if VERBOSE: print(f"sub, verb, obj: {subject}, {verb}, {obj}")

            # Check if a complete triplet was found
            if subject and verb and obj:
                # Simplify the triplet (e.g., lemmatize the verb)
                simplified_subject = subject.text
                simplified_verb = verb.lemma_
                simplified_object = obj.text

                # Add the triplet to the list
                triplets.append((simplified_subject, simplified_verb, simplified_object))

    return triplets

extract_triplets(doc, VERBOSE= True)

In [None]:
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

# def apply_custom_tags(doc, matcher):
#     matches = matcher(doc)
#     for match_id, start, end in matches:
#         tag = nlp.vocab.strings[match_id]
#         for token in doc[start:end]:
#             token.set_extension("custom_tag", default=None, force=True)
#             token._.custom_tag = tag
#     return doc

# Helper functions
def tag_and_tokenize_keywords(doc):
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            span = doc[start:end]
            retokenizer.merge(span)
            span[0].set_extension("is_keyword", default=False, force=True)
            span[0]._.is_keyword = True
            span[0].set_extension("custom_tag", default=None, force=True)
            span[0]._.custom_tag = nlp.vocab.strings[match_id]
    return doc


def add_keywords_to_ents(doc):
    new_ents = []
    for token in doc:
        if token._.is_keyword:
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label=token._.custom_tag)
            new_ents.append(new_ent)

    # Filter out existing entities that overlap with the new keyword entities
    filtered_ents = []
    for ent in doc.ents:
        overlaps = [ent.start <= new_ent.start < ent.end or new_ent.start <= ent.start < new_ent.end for new_ent in new_ents]
        if not any(overlaps):
            filtered_ents.append(ent)

    doc.set_ents(filtered_ents + new_ents)
    return doc


##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Generate input variables
doc_idx = 1
body = seperate_title_and_body(articles[doc_idx])[1]
keywords = article_keywords[doc_idx]
print("body: ", body)
print("keywords: ", keywords)


# Create new NLP pipeline
nlp = spacy.load("en_core_web_lg")

# # create object matcher for phrases 
matcher = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
patterns_ent = [nlp.make_doc(keyword) for keyword in keywords[0]]
patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
matcher.add("KEYWORD_ENT", patterns_ent)
matcher.add("KEYWORD_OTHER", patterns_other)


doc = nlp(body)
doc = tag_and_tokenize_keywords(doc)
doc = add_keywords_to_ents(doc)

for token in doc:
    if token._.is_keyword:
        print(token.text)

# spacy.displacy.render(doc, style = "dep")

col_highlight1 = "magenta"
col_highlight2 = "yellow"
col_others = "lightblue"
options = {
    "ents": ["KEYWORD_ENT", "KEYWORD_OTHER",
             "ORG", "PRODUCT", "GPE", "LOC", "PERSON", "FAC", "NORP", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "LANGUAGE", "EVENT", "LAW", "WORK_OF_ART"],
    "colors": {"KEYWORD_ENT": col_highlight1, 
               "KEYWORD_OTHER": col_highlight2, 
               "ORG": col_others, 
               "PRODUCT": col_others, 
               "GPE": col_others, 
               "LOC": col_others, 
               "PERSON": col_others, 
               "FAC": col_others, 
               "NORP": col_others, 
               "DATE": col_others, 
               "TIME": col_others, 
               "PERCENT": col_others, 
               "MONEY": col_others, 
               "QUANTITY": col_others, 
               "ORDINAL": col_others, 
               "CARDINAL": col_others, 
               "LANGUAGE": col_others, 
               "EVENT": col_others, 
               "LAW": col_others, 
               "WORK_OF_ART": col_others}
}

spacy.displacy.render(doc, style = "ent", options = options)

# Next Steps 
# for all keyword entities, find their equivalents 
# for all entities, tokenzie
# for matching, ensure it matches on lemmatize words 

In [None]:
OLD 
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

# def apply_custom_tags(doc, matcher):
#     matches = matcher(doc)
#     for match_id, start, end in matches:
#         tag = nlp.vocab.strings[match_id]
#         for token in doc[start:end]:
#             token.set_extension("custom_tag", default=None, force=True)
#             token._.custom_tag = tag
#     return doc

def tag_and_tokenize_keywords(doc):
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            span = doc[start:end]
            retokenizer.merge(span)
            span[0].set_extension("is_keyword", default=False, force=True)
            span[0]._.is_keyword = True
    return doc


def add_keywords_to_ents(doc):
    '''
    returns a doc with entity tags based on keywords.
    does not override existing ones  
    '''
    new_ents = []
    for token in doc:
        if token._.is_keyword:
            # Check if the token is already part of an existing entity
            if not any(token.i >= ent.start and token.i < ent.end for ent in doc.ents):
                new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label="KEYWORD")
                new_ents.append(new_ent)
    doc.set_ents(list(doc.ents) + new_ents)
    return doc

def add_keywords_to_ents(doc):
    new_ents = []
    for token in doc:
        if token._.is_keyword:
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label="KEYWORD_ENT")
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label="KEYWORD_OTHER")
            new_ents.append(new_ent)

    # Filter out existing entities that overlap with the new keyword entities
    filtered_ents = []
    for ent in doc.ents:
        overlaps = [ent.start <= new_ent.start < ent.end or new_ent.start <= ent.start < new_ent.end for new_ent in new_ents]
        if not any(overlaps):
            filtered_ents.append(ent)

    doc.set_ents(filtered_ents + new_ents)
    return doc


##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Generate input variables
doc_idx = 0
body = seperate_title_and_body(articles[doc_idx])[1]
keywords = article_keywords[doc_idx]
print("body: ", body)
print("keywords: ", keywords)



# Create new NLP pipeline
nlp = spacy.load("en_core_web_lg")

# create object matcher for phrases 
matcher = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(keyword) for keyword in keywords[0]]
matcher.add("KEYWORD_ENT", patterns)
patterns = [nlp.make_doc(keyword) for keyword in keywords[1]]
matcher.add("KEYWORD_OTHER", patterns)

doc = nlp(body)
doc = tag_and_tokenize_keywords(doc)
doc = add_keywords_to_ents(doc)

for token in doc:
    if token._.is_keyword:
        print(token.text)

# spacy.displacy.render(doc, style = "dep")

col_highlight1 = "red"
col_highlight2 = "orange"
col_others = "lightblue"
options = {
    "ents": ["KEYWORD_ENT", "KEYWORD_OTHER",
             "ORG", "PRODUCT", "GPE", "LOC", "PERSON", "FAC", "NORP", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "LANGUAGE", "EVENT", "LAW", "WORK_OF_ART"],
    "colors": {"KEYWORD_ENT": col_highlight1, 
               "KEYWORD_OTHER": col_highlight2, 
               "ORG": col_others, 
               "PRODUCT": col_others, 
               "GPE": col_others, 
               "LOC": col_others, 
               "PERSON": col_others, 
               "FAC": col_others, 
               "NORP": col_others, 
               "DATE": col_others, 
               "TIME": col_others, 
               "PERCENT": col_others, 
               "MONEY": col_others, 
               "QUANTITY": col_others, 
               "ORDINAL": col_others, 
               "CARDINAL": col_others, 
               "LANGUAGE": col_others, 
               "EVENT": col_others, 
               "LAW": col_others, 
               "WORK_OF_ART": col_others}
}

spacy.displacy.render(doc, style = "ent", options = options)

# Next Steps 
# for all keyword entities, find their equivalents 
# for all entities, tokenzie

In [None]:
ERR
# GET TITLE KEYWORDS 
nlp = spacy.load("en_core_web_sm")
doc = nlp(articles[0])

# title = doc.sents[0].text
# print(title)
# Get 

def seperate_title_and_body(article): 
    """
    Input: a string file
    Returns: list of format (title, [para1, para2, para3, ...])
    """

    paragraphs = article.split("\n\n")

    title = paragraphs[0]
    body = paragraphs[1:]
    
    return (title, body)



print(seperate_title_and_body(articles[0]))
# print(seperate_title_and_body(articles[1])[0])
# print(seperate_title_and_body(articles[2])[0])

def separate_noun_chunks_and_entities(doc):
    new_doc = spacy.tokens.Doc(doc.vocab)
    for old_token in doc:
        if old_token.ent_type_ == '':
            # If token is not an entity, add it to the new_doc
            new_doc += old_token
        else:
            # If token is an entity, split it into individual tokens and add them to the new_doc
            for ent_token in old_token:
                new_doc += ent_token
        if old_token.dep_ == 'compound':
            # If token is part of a compound noun, split it into individual tokens and add them to the new_doc
            for chunk_token in old_token.subtree:
                new_doc += chunk_token
        elif old_token.n_lefts + old_token.n_rights > 0:
            # If token has children, add it to the new_doc
            new_doc += old_token
    return new_doc

# Add custom component to the pipeline before merging noun chunks

# Get Keywords from title 
def get_title_keywords(tup_article, nlp, VERBOSE = False):
    """
    ## Input
    a Tuple pair  of format (title, [para1, para2, para3, ...])
    
    # Details
    
    First it creates 2 custom tags from the title: 
    1. title_ent: this is any entity in the title (e.g. "Alan Greenspan", "")
    2. title_details: this is any noun, propn or adj related to the title_ent

    Then, it goes through the body and tags the text accordingly


    # Returns

    Returns a list of string keywords

    """ 
    title, paragraphs = tup_article 
    
    doc_title = nlp(title)
    
    # Process article body
    body_text = " ".join(paragraphs)
    doc_body = nlp(body_text)

    if VERBOSE:
        spacy.displacy.render(doc_title, style = "dep")
        spacy.displacy.render(doc_title, style = "ent")
        # spacy.displacy.render(doc_body, style = "dep")
        # spacy.displacy.render(doc_body, style = "ent")

    keywords = [ent.text for ent in doc_title.ents]
    # Entities in title
    # title_entities = 
    if VERBOSE:
        print("Entities in title:", keywords)

    # Keywords in title
    tags = ["NOUN", "ADJ"]
    processed = []

    for tok in doc_title:
        if (tok.pos_ in tags) and (tok.text not in processed) and (tok.text not in keywords):
            if tok.dep_ == "compound":
                compound_word = f"{tok.text} {tok.head.text}"
                keywords.append(compound_word)
                processed.extend([tok.text, tok.head.text])
            else:
                keywords.append(tok.text)

    
    if VERBOSE:
        print("Keywords in title:", keywords)
    
            
    # keywords = [tok.text for tok in doc if tok.pos_ in ["NOUN", "PROPN", "ADJ"]]


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")
# nlp.add_pipe("merge_noun_chunks")

for i in range(5):
    print(f'\n ARTICLE: {i+1}.txt')
    article = seperate_title_and_body(articles[i])

    print(f"title: {article[0]}")
    print(get_title_keywords(article, nlp, VERBOSE = True))
