This notebook exports the triplets for each file into a new folder in data. 
- 1 triplet = 1 line 
- filename to match source name 
- filename to 
- delimiter to be | as indicated by kshitj
- for any tags, to use [] 

Why: 
- to be pushed into a file for triplet to text OR for reading, etc. 

example: 
001.txt in folder `docs\triplets_LO`
```
profits | were, buoyed | gains, users
economy | tanked | 10000%
greenspan | boasts | record growth 
...

```
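
A minimal sketch of the line format described above, assuming each triplet is held as three lists of tokens. The helper names `format_triplet` and `parse_triplet` are hypothetical and not part of this notebook; joining the tokens inside a field with a single space is also an assumption.

```python
DELIMITER = " | "  # field separator between subject, verb and object


def format_triplet(subject, verb, obj):
    """Join the tokens of each field with spaces and the fields with DELIMITER."""
    return DELIMITER.join(" ".join(tokens) for tokens in (subject, verb, obj))


def parse_triplet(line):
    """Split one line back into (subject, verb, object) strings."""
    subject, verb, obj = (field.strip() for field in line.split("|"))
    return subject, verb, obj


line = format_triplet(["greenspan"], ["boasts"], ["record", "growth"])
print(line)                 # greenspan | boasts | record growth
print(parse_triplet(line))  # ('greenspan', 'boasts', 'record growth')
```

Parsing this way only works while no token contains the `|` character itself.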

In [None]:
# !conda install -c conda-forge textacy

In [None]:
import os 
import spacy 
import textacy  # this is built on top of spacy. it has some out of the box functions like YAKE, but documentation is kinda sucky
from pprint import pprint
from tqdm import tqdm

# Load articles to memory 

In [None]:
# FOLDER = "./data/BBC/News Articles/business"
FOLDER = "./data/BBC/News Articles_w_CovarianceRes/business" # Coref

# Read every .txt article in PATH (sorted by filename) into a list of strings
def load_articles(PATH):
    news_files = sorted([f for f in os.listdir(PATH) if f.endswith('.txt')])

    news_list = []
    for file in news_files:
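        # Read as UTF-8 explicitly so characters such as "£" are not garbled by the platform default encoding
        with open(os.path.join(PATH, file), 'r', encoding='utf-8') as f: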
            text = f.read()
            news_list.append(text)
    
    return news_list

articles = load_articles(FOLDER)
print(articles[0])

Ad sales boost Time Warner profit

Quarterly profits at Time Warner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

Time Warner benefited from sales of high-speed internet connections and higher advert sales. Time Warner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Time Warner profit

Quarterly profits at US media giant TimeWarner were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that Time Warner now owns 8% of search-engine Google. But AOL had has mixed fortunes. AOL lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, Time Warner said AOLunderlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. Time Warner hopes to increase subscribers by offering the online service free to Time Warner internet customers and will try to sign up AOLexisting customers 

# Generate Keywords from Title


In [None]:
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

def seperate_title_and_body(article, VERBOSE = False): 
    """
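    Input: the full text of one article as a string (title and paragraphs separated by blank lines)
    Returns: a tuple (title, body), where body is the remaining paragraphs joined by spaces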
    """

    paragraphs = article.split("\n\n")

    title = paragraphs[0]
    body = ' '.join(paragraphs[1:])
    if VERBOSE: 
        print("title:", title)
        print("body:", body)
    
    return (title, body)


# Get Keywords from title 
def get_title_keywords_v1(tup_article, nlp, formatting = "compound", VERBOSE = False):
    """
    =================
    Input
    =================
    tup_article: 
    - a Tuple pair  of format (title, body)
    
    nlp:
    - a spacy nlp pipeline. basic one will suffice 

    formatting: 
    - "atomic" returns lemmatized single words for each compound phrase
    - "compound" returns compound phrase in lowercasing  (default)
    - "hybrid" returns both the lemmatized single words and the lowercased compound phrase

    VERBOSE:
    - if True, will show renders and step-wise outputs

    =================
    Returns
    =================
    Returns a tuple pair of keyword lists: ([entities], [other keywords])
    
    =================
    Notes
    =================
    - Entity keywords are NOT lowercased because casing is relevant information
    - Other keywords are lowercased, except compound parts, which are lemmatized in "atomic" and "hybrid" mode

    """ 
    # CUSTOMIZE HERE
    title = tup_article[0]
    doc_title = nlp(title)
    tags = ["NOUN", "ADJ", "VERB"]
    entities = ["PERSON", "ORG", "PRODUCT", "EVENT", "GPE", "FAC", "NORP"]
    keywords_ents = []
    keywords_others = []
    processed = []

    if VERBOSE:
        spacy.displacy.render(doc_title, style = "dep")
        spacy.displacy.render(doc_title, style = "ent")
        

    keywords_ents.extend([ent.text for ent in doc_title.ents if ent.label_ in entities])
    
    if VERBOSE:
        print("Keywords (entities):", keywords_ents)

    for tok in doc_title:
        if ((tok.pos_ in tags) 
            and (tok.lemma_ not in processed) 
            and (tok.text not in keywords_others) 
            and (tok.text not in keywords_ents)):
            if tok.dep_ == "compound":
                if formatting == "atomic": 
                    a = tok.lemma_
                    b = tok.head.lemma_
                    keywords_others.append(a)
                    keywords_others.append(b)
                    processed.extend([tok.lemma_, tok.head.lemma_])
                if formatting == "compound": 
                    a = tok.text.lower()
                    b = tok.head.text.lower()
                    keywords_others.append(f"{a} {b}")
                    processed.extend([tok.lemma_, tok.head.lemma_])
                if formatting == "hybrid":  
                    a = tok.text.lower()
                    b = tok.head.text.lower()
                    keywords_others.append(tok.lemma_)
                    keywords_others.append(tok.head.lemma_)
                    keywords_others.append(f"{a} {b}")
                    processed.extend([tok.lemma_, tok.head.lemma_])
            else:
                keywords_others.append(tok.text.lower())

    if VERBOSE:
        print("Keywords (others):", keywords_others)
    return (keywords_ents, keywords_others)
    


##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Init pipeline
nlp = spacy.load("en_core_web_lg")
# nlp.add_pipe("merge_entities")  # removed because it doesn't make a noticeable difference in output

article_keywords = []
for i in range(0,20):
    # print(f'\n=========ARTICLE: {i+1}.txt=========')
    article = seperate_title_and_body(articles[i], VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = article, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    print(keywords)

print(f"articles processed: {len(article_keywords)}")


(['Time Warner'], ['ad', 'sale', 'boost', 'profit'])
(['Greenspan'], ['dollar', 'gain', 'speech'])
(['Yukos'], ['unit', 'buyer', 'faces', 'loan', 'claim'])
(['BA'], ['high', 'fuel', 'price', 'hit', 'profits'])
(['Pernod', 'Domecq'], ['takeover', 'talk', 'lifts'])
(['Japan'], ['escapes', 'recession'])
(['US'], ['jobs', 'growth', 'slow'])
(['India'], ['calls', 'fair', 'trade', 'rule'])
(['Ethiopia'], ['crop', 'production', '%'])
(['Court'], ['rejects', '280bn', 'tobacco', 'case'])
([], ['ask', 'tips', 'online', 'ad', 'revival'])
(['Indonesians'], ['face', 'fuel', 'price', 'rise'])
(['Peugeot', 'Mitsubishi'], ['deal', 'boosts'])
(['Telegraph'], ['newspapers', 'axe', 'jobs'])
(['EU'], ['air', 'passenger', 'win', 'new', 'rights'])
(['China'], ['keeps', 'tight', 'rein', 'credit'])
(['Parmalat'], ['boasts', 'doubled', 'profits'])
(['India'], ['rupee', 'hits', 'year', 'high'])
(['India'], ['widens', 'access'])
([], ['call', 'user', 'centre', 'user', "'", 'lose', 'patience'])
articles processed: 20

# Tokenize & Tag body according to Keywords

In [None]:
##########################################################################################################################################
# HELPER FUNCTIONS
##########################################################################################################################################

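def tag_and_tokenize_keywords(doc):
    """
    Merge keyword matches in doc into single tokens and flag them via the
    custom extensions `is_keyword` and `custom_tag`.
    Note: relies on the module-level `nlp` and `keywords` variables.
    """
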
    matcher_ent = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LEMMA")  
    matcher_other = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns_ent = [nlp(keyword) for keyword in keywords[0]]
    patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
    matcher_ent.add("KEYWORD_ENT", patterns_ent)
    matcher_other.add("KEYWORD_OTHER", patterns_other)
    # Create a lemmatized version of the input document
    lemmatized_doc = nlp(" ".join([token.lemma_ for token in doc]))

    # Retokenize the document to merge multi-token keyword spans
    with doc.retokenize() as retokenizer:
        # Iterate through the matchers (matcher_ent, matcher_other) and their target documents (lemmatized_doc, doc)
        for matcher, target_doc in zip([matcher_ent, matcher_other], [lemmatized_doc, doc]):
            # Find matches in the target document using the current matcher
            for match_id, start, end in matcher(target_doc):
                # Create a span in the original document corresponding to the matched keywords
                span = doc[start:end]
                # Merge the span into a single token
                retokenizer.merge(span)

                # Set the 'is_keyword' and 'custom_tag' attributes for the merged token
                span[0].set_extension("is_keyword", default=False, force=True)
                span[0]._.is_keyword = True
                span[0].set_extension("custom_tag", default=None, force=True)
                span[0]._.custom_tag = nlp.vocab.strings[match_id]

    return doc

def add_keywords_to_ents(doc):
    new_ents = []
    for token in doc:
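        # The extension is only registered once tag_and_tokenize_keywords has found at least one match
        if spacy.tokens.Token.has_extension("is_keyword") and token._.is_keyword: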
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label=token._.custom_tag)
            new_ents.append(new_ent)

    # Filter out existing entities that overlap with the new keyword entities
    filtered_ents = []
    for ent in doc.ents:
        overlaps = [ent.start <= new_ent.start < ent.end or new_ent.start <= ent.start < new_ent.end for new_ent in new_ents]
        if not any(overlaps):
            filtered_ents.append(ent)

    doc.set_ents(filtered_ents + new_ents)
    return doc

def convert_entities_to_tokens(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            # Create a span for the current entity
            span = doc[ent.start:ent.end]
            # Merge the span into a single token
            retokenizer.merge(span)

    return doc

# Displacy Formatting 
col_highlight1 = "magenta"
col_highlight2 = "yellow"
col_others = "lightblue"
options = {
    "ents": ["KEYWORD_ENT", "KEYWORD_OTHER",
             "ORG", "PRODUCT", "GPE", "LOC", "PERSON", "FAC", "NORP", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "LANGUAGE", "EVENT", "LAW", "WORK_OF_ART"],
    "colors": {"KEYWORD_ENT": col_highlight1, 
               "KEYWORD_OTHER": col_highlight2, 
               "ORG": col_others, 
               "PRODUCT": col_others, 
               "GPE": col_others, 
               "LOC": col_others, 
               "PERSON": col_others, 
               "FAC": col_others, 
               "NORP": col_others, 
               "DATE": col_others, 
               "TIME": col_others, 
               "PERCENT": col_others, 
               "MONEY": col_others, 
               "QUANTITY": col_others, 
               "ORDINAL": col_others, 
               "CARDINAL": col_others, 
               "LANGUAGE": col_others, 
               "EVENT": col_others, 
               "LAW": col_others, 
               "WORK_OF_ART": col_others}  
}


##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Generate Sample
doc_idx = 1
title = seperate_title_and_body(articles[doc_idx])[0]
body = seperate_title_and_body(articles[doc_idx])[1]
keywords = article_keywords[doc_idx]
print("title: ", title)
print("title keywords: ", keywords)
print("body: ", body)

# Create new NLP instances, incl matchers
nlp = spacy.load("en_core_web_lg")

# Pipeline the inputs
doc = nlp(body)
doc = tag_and_tokenize_keywords(doc)
doc = add_keywords_to_ents(doc)
doc = convert_entities_to_tokens(doc)

# spacy.displacy.render(doc, style = "dep")
# spacy.displacy.render(doc, style = "dep", options = options)
spacy.displacy.render(doc, style = "ent", options = options)


title:  Dollar gains on Greenspan speech
title keywords:  (['Greenspan'], ['dollar', 'gain', 'speech'])
body:  Dollar has hit Dollar highest level against the euro in almost three months after Greenspan said the US trade deficit is set to stabilise. And Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce the US trade deficit. In late trading in New York, Dollar reached $1.2871 against the euro , from $1.2974 on Thursday. Market concerns about the US trade deficit has hit Dollar in recent months. On Friday, Greenspan speech sent Dollar higher after Dollar had earlier tumbled on the back of worse-than-expected US jobs data. "I think Greenspan 's taking a much more sanguine view on the US trade deficit than Greenspan's taken for some time," said I New York . "Greenspan's taking a longer-term view, laying out a set of conditions under which the US trade deficit can improve this year and next." Worries about

# Generate Tagged Triplets based on keywords

**Note on SVO function**

The subject_verb_object_triples function in Textacy is designed to extract Subject-Verb-Object (SVO) triples from a given text. It follows a set of heuristics based on dependency parsing and linguistic rules. Here's an explanation of the key steps:

1. Iterate over sentences: The function works on a sentence level, so it first iterates over the sentences in the input text.
2. Initialize verb_sos dictionary: This dictionary stores the subjects and objects associated with each verb in the sentence. The keys in the dictionary are verbs, and the values are dictionaries containing sets of subjects and objects.
3. Iterate over tokens: For each sentence, the function iterates over its tokens. Each token is analyzed based on its dependency label and part-of-speech tag. The primary purpose of this iteration is to identify the subjects and objects associated with each verb.
4. Identify subjects and objects: The function checks if a token is a subject or an object based on its dependency label. If a token is a subject or object, it is added to the corresponding set in the verb_sos dictionary for the associated verb. Subjects can be nominal (nouns) or clausal (subordinate clauses). Objects can be nominal (nouns), prepositional (introduced by a preposition), or clausal (subordinate clauses).
5. Expand subjects and objects: The function expands subjects and objects to include related tokens, such as compound nouns, conjuncts, or tokens within a clause. This is done using helper functions like expand_noun, expand_verb, and the .subtree attribute of tokens.
6. Handle verb conjuncts: The function addresses cases where multiple verbs are connected by a conjunction (e.g., "I read and wrote"). In such cases, the subjects and objects of the conjunct verbs are shared. The function updates the verb_sos dictionary to ensure that all conjunct verbs have the correct subjects and objects.
7. Generate SVO triples: Finally, the function iterates over the verb_sos dictionary and, for each verb with associated subjects and objects, creates an SVO triple. Each triple is a tuple containing three lists: subjects, verbs, and objects. The subjects, verbs, and objects are sorted by their position in the text (using the .i attribute).

In summary, the function generates SVO triples by analyzing the input text's dependency structure and applying heuristics based on linguistic rules. This approach is effective in many cases but may not always produce perfect results, as it relies on the accuracy of the dependency parser and the underlying linguistic assumptions.

Source code: 
`https://textacy.readthedocs.io/en/latest/_modules/textacy/extract/triples.html#subject_verb_object_triples`
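
The accumulation at the heart of steps 2 to 4 and 7 can be illustrated without spaCy. The following is a toy sketch, not textacy's implementation: `Tok`, `SUBJ_DEPS`, `OBJ_DEPS` and `toy_svo_triples` are hypothetical names, the tokens are annotated by hand, and the expansion and conjunct handling of steps 5 and 6 are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    i: int     # position in the sentence
    text: str
    pos: str   # coarse part-of-speech tag
    dep: str   # dependency label
    head: int  # index of the syntactic head (the root points to itself)


SUBJ_DEPS = {"nsubj", "nsubjpass"}  # assumed subject labels for this toy example
OBJ_DEPS = {"dobj"}                 # assumed object labels for this toy example


def toy_svo_triples(sent):
    # verb index -> {"subjects": set of token indices, "objects": set of token indices}
    verb_sos = defaultdict(lambda: defaultdict(set))
    for tok in sent:
        head = sent[tok.head]
        if head.pos != "VERB":
            continue
        if tok.dep in SUBJ_DEPS:
            verb_sos[head.i]["subjects"].add(tok.i)
        elif tok.dep in OBJ_DEPS:
            verb_sos[head.i]["objects"].add(tok.i)

    triples = []
    for verb_i in sorted(verb_sos):
        so = verb_sos[verb_i]
        # Only verbs with at least one subject and one object yield a triple
        if so["subjects"] and so["objects"]:
            triples.append((
                [sent[i].text for i in sorted(so["subjects"])],
                [sent[verb_i].text],
                [sent[i].text for i in sorted(so["objects"])],
            ))
    return triples


# Hand-annotated parse of "Greenspan speech sent Dollar"
sent = [
    Tok(0, "Greenspan", "PROPN", "compound", 1),
    Tok(1, "speech", "NOUN", "nsubj", 2),
    Tok(2, "sent", "VERB", "ROOT", 2),
    Tok(3, "Dollar", "PROPN", "dobj", 2),
]
print(toy_svo_triples(sent))  # [(['speech'], ['sent'], ['Dollar'])]
```

Because expansion is skipped, the subject is only `speech` rather than `Greenspan speech`; this is the gap that step 5 fills.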

In [None]:

##########################################################################################################################################
# Helper Functions
##########################################################################################################################################
# The following generates triples regardless of keywords. 
# See ref: https://textacy.readthedocs.io/en/latest/api_reference/extract.html#triples

def generate_triplet2sentence_pairs(doc, arr_keyword, return_type = 'std', keyword_filter = False, VERBOSE = False):
    '''
    Returns a list of tuple-pairs: 
    [(triple, sentence), (triple, sentence), ...]

    Parameters:
    -----------
    doc : spacy.tokens.Doc
        A SpaCy document containing the text to extract SVO triplets from.
    arr_keyword : List[List[str]]
        A nested list of keywords (strings), where each sublist represents a set of related keywords.
    return_type : str, optional (default='std')
        'std' returns (triplet, sentence) tuples; 'llm' returns "triplet <==> sentence" strings.
    keyword_filter : bool, optional (default=False)
        If True, only include triplets containing at least one keyword from the given keyword list.
    VERBOSE : bool, optional (default=False)
        If True, print additional information, such as the list of flattened keywords.

    Returns:
    --------
    pairs : List[Tuple[str, str]]
        A list of tuple pairs, where each tuple contains a formatted triplet (as a string) and
        the text of the sentence it was extracted from.

    Example usage:
    --------------
    keywords = (['Time Warner'], ['ad', 'sale', 'boost', 'profit'])
    svo_pairs = generate_triplet2sentence_pairs(doc, arr_keyword=keywords, keyword_filter=True)
    
    '''
    pairs = []
    combined_data = []
    
    flat_keywords = [item for sublist in arr_keyword for item in sublist]
    if VERBOSE: print("flat keywords: ", flat_keywords)

    # Iterate through sentences in the doc
    for sent in doc.sents:
        # Extract SVO triples for each sentence
        triples = list(textacy.extract.subject_verb_object_triples(sent))
        formatted_triples = []
        for s, v, o in triples:
            # if VERBOSE: 
            #     print("text: ", t[0])
            #     print("type: ", type(t[0]))
            #     print("type: ", type(t[0][0]))
            #     print("lemma: ", [tok.lemma_ for tok in t[0]])
                
            subject = [tok.text.lower() for tok in s]
            verb = [tok.text.lower() for tok in v]
            obj = [tok.text.lower() for tok in o]

            if keyword_filter:
                # Check if any keyword is present in the subject, verb, or object
                if not any(keyword in subject + verb + obj for keyword in flat_keywords):
                    continue
        
            
            formatted_triples = f"{' '.join(subject)} | {' '.join(verb)} | {' '.join(obj)}"
            pairs.append((formatted_triples, sent.text))
            combined_data.append(formatted_triples + " <==> " + sent.text)
            if VERBOSE: print("added: ", formatted_triples)

    if return_type == "std":
        return pairs
    if return_type == "llm":
        return combined_data

##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

keywords = article_keywords[doc_idx]
print(keywords)
pairs = generate_triplet2sentence_pairs(doc, arr_keyword = keywords, keyword_filter = True, VERBOSE= True)
pprint(pairs)

(['Greenspan'], ['dollar', 'gain', 'speech'])
flat keywords:  ['Greenspan', 'dollar', 'gain', 'speech']
added:  dollar | has hit | level
added:  dollar | reached | 1.2871
added:  market concerns | has hit | dollar
added:  greenspan speech | sent | dollar
[('dollar | has hit | level',
  'Dollar has hit Dollar highest level against the euro in almost three months '
  'after Greenspan said the US trade deficit is set to stabilise.'),
 ('dollar | reached | 1.2871',
  'In late trading in New York, Dollar reached $1.2871 against the euro , from '
  '$1.2974 on Thursday.'),
 ('market concerns | has hit | dollar',
  'Market concerns about the US trade deficit has hit Dollar in recent '
  'months.'),
 ('greenspan speech | sent | dollar',
  'On Friday, Greenspan speech sent Dollar higher after Dollar had earlier '
  'tumbled on the back of worse-than-expected US jobs data.')]
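
The keyword filter in `generate_triplet2sentence_pairs` tests each keyword for exact membership in the lower-cased token list, so a capitalized title keyword such as 'Greenspan' never matches on its own. A minimal sketch of a case-insensitive variant follows; the helper name `triple_has_keyword` is hypothetical and not part of this notebook.

```python
def triple_has_keyword(subject, verb, obj, arr_keyword):
    """Return True if any keyword equals a token of the triple, ignoring case."""
    tokens = {tok.lower() for tok in subject + verb + obj}
    keywords = {kw.lower() for group in arr_keyword for kw in group}
    return not tokens.isdisjoint(keywords)


keywords = (["Greenspan"], ["dollar", "gain", "speech"])
print(triple_has_keyword(["greenspan"], ["said"], ["deficit"], keywords))               # True
print(triple_has_keyword(["market", "concerns"], ["has", "hit"], ["euro"], keywords))   # False
```

Like the original filter, this still compares whole tokens, so a multi-word keyword only matches a token that was merged into the same phrase.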


# Generate TXT files

In [None]:
##########################################################################################################################################
# Helper Functions
##########################################################################################################################################

def save_tuple_pairs_to_files(pairs, file1_path, file2_path):
    with open(file1_path, 'w', encoding='utf-8') as file1, open(file2_path, 'w', encoding='utf-8') as file2:
        for a, b in pairs:
            file1.write(f"{a}\n")
            file2.write(f"{b}\n")


## Generic Triplets for Team Use

In [None]:
##########################################################################################################################################
# Generate Triples and Sentences, organized by filename
# USECASE: for general usage by team
##########################################################################################################################################

# Generate Triples and their Reference Sentences. Save it to a file 
PATH_TRIPLE = 'data/BBC/Generation/Business'
PATH_SENTENCE = './data/BBC/Training/business_sentences/'

article_keywords = []
article_pairs = [] 
# Load the pipeline once; reloading it for every article is slow and unnecessary
nlp = spacy.load("en_core_web_lg")
for i, article in tqdm(enumerate(articles)):
    
    ####################################
    ## Generate Keywords

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    print(keywords)
    
    ####################################
    ## Parse Body 

    doc = nlp(title_body[1])

    ####################################
    ## Generate Triples
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, keyword_filter = True, VERBOSE= False)
    article_pairs.append(pairs)
    filename =f"{i+1:03d}.txt"
    # os.path.join inserts the separator that PATH_TRIPLE lacks
    FILE_TRIPLE = os.path.join(PATH_TRIPLE, filename)
    FILE_SENTENCE = os.path.join(PATH_SENTENCE, filename)

    save_tuple_pairs_to_files(pairs, FILE_TRIPLE, FILE_SENTENCE)


print(f"articles processed: {len(article_keywords)}")


0it [00:00, ?it/s]

(['Time Warner'], ['ad', 'sale', 'boost', 'profit'])


1it [00:03,  3.41s/it]

(['Greenspan'], ['dollar', 'gain', 'speech'])


2it [00:06,  3.15s/it]

(['Yukos'], ['unit', 'buyer', 'faces', 'loan', 'claim'])


3it [00:10,  3.63s/it]

(['BA'], ['high', 'fuel', 'price', 'hit', 'profits'])


4it [00:15,  4.28s/it]

(['Pernod', 'Domecq'], ['takeover', 'talk', 'lifts'])


5it [00:18,  3.63s/it]

(['Japan'], ['escapes', 'recession'])


6it [00:21,  3.59s/it]

(['US'], ['jobs', 'growth', 'slow'])


7it [00:24,  3.19s/it]

(['India'], ['calls', 'fair', 'trade', 'rule'])


8it [00:26,  2.93s/it]

(['Ethiopia'], ['crop', 'production', '%'])


9it [00:32,  3.87s/it]

(['Court'], ['rejects', '280bn', 'tobacco', 'case'])


10it [00:34,  3.42s/it]

([], ['ask', 'tips', 'online', 'ad', 'revival'])


11it [00:38,  3.58s/it]

(['Indonesians'], ['face', 'fuel', 'price', 'rise'])


12it [00:42,  3.57s/it]

(['Peugeot', 'Mitsubishi'], ['deal', 'boosts'])


13it [00:44,  3.24s/it]

(['Telegraph'], ['newspapers', 'axe', 'jobs'])


14it [00:48,  3.34s/it]

(['EU'], ['air', 'passenger', 'win', 'new', 'rights'])


15it [00:50,  3.09s/it]

(['China'], ['keeps', 'tight', 'rein', 'credit'])


16it [00:53,  2.88s/it]

(['Parmalat'], ['boasts', 'doubled', 'profits'])


17it [00:56,  2.91s/it]

(['India'], ['rupee', 'hits', 'year', 'high'])


18it [00:58,  2.73s/it]

(['India'], ['widens', 'access'])


19it [01:03,  3.50s/it]

([], ['call', 'user', 'centre', 'user', "'", 'lose', 'patience'])


20it [01:06,  3.24s/it]

([], ['rank', 'set', 'sell', 'film', 'unit'])


21it [01:08,  2.97s/it]

(['German'], ['sluggish', 'economy', 'hits', 'jobs'])


22it [01:17,  4.52s/it]

(['French'], ['mixed', 'signals', 'economy'])


23it [01:19,  3.94s/it]

(['US'], ['trade', 'gap', 'hits', 'record'])


24it [01:22,  3.47s/it]

(['Yukos', 'US'], ['loses', 'bankruptcy', 'battle'])


25it [01:25,  3.42s/it]

(['GM'], ['safety', 'alert', 'recalls', 'cars'])


26it [01:27,  3.09s/it]

([], ['steel', 'firm', 'cut', 'jobs'])


27it [01:29,  2.85s/it]

([], ['strong', 'demand', 'triggers', 'oil', 'rally'])


28it [01:32,  2.88s/it]

(['UK', 'Venezuelan'], ['firm', 'faces', 'land', 'row'])


29it [01:35,  2.73s/it]

([], ['soaring', 'oil', "'", 'hits', 'world', 'economy'])


30it [01:38,  2.79s/it]

(['Irish'], ['markets', 'reach', 'time', 'high'])


31it [01:40,  2.66s/it]

(['Japanese'], ['banking', 'battle', 'end'])


32it [01:42,  2.58s/it]

(['Colombia'], ['grab', 'poor', 'fund'])


33it [01:45,  2.68s/it]

(['Rover'], ['deal', 'cost', 'jobs'])


34it [01:48,  2.57s/it]

(['WPP'], ['ad', 'firm', 'profits', 'surge', '%'])


35it [01:51,  2.68s/it]

(['US'], ['gives', 'foreign', 'firms', 'extra', 'time'])


36it [01:53,  2.59s/it]

(['Japanese'], ['mogul', 'arrested', 'fraud'])


37it [01:55,  2.53s/it]

(['Deutsche Telekom'], ['sees', 'mobile', 'gain'])


38it [01:58,  2.70s/it]

(['Chinese', 'Italy'], ['wine', 'tempts', 'illva'])


39it [02:01,  2.58s/it]

(['Umbro'], ['profits', 'lifted'])


40it [02:03,  2.50s/it]



41it [02:06,  2.62s/it]

(['UK'], ['plunges', 'deeper', 'loss'])


42it [02:08,  2.57s/it]

(['Saudi'], ['ministry', 'employ', 'women'])


43it [02:11,  2.53s/it]

(['Japan'], ['economy', 'slides', 'recession'])


44it [02:14,  2.76s/it]

(['US'], ['crude', 'price', 'surge'])


45it [02:18,  2.93s/it]

(['Japan'], ['industrial', 'output', 'falls'])


46it [02:26,  4.70s/it]

(['Ryanair', 'Boeing'], ['4bn', 'plane', 'deal'])


47it [02:29,  4.17s/it]

([], ['return', 'stockmarket'])


48it [02:32,  3.66s/it]

(['BMW', 'Mini'], ['cash', 'fuel', 'production'])


49it [02:36,  3.93s/it]

(['Nestle'], ['bad', 'weather', 'hits', 'sales'])


50it [02:40,  3.71s/it]

(['Fiat', 'Ferrari'], ['mulls', 'market', 'listing'])


51it [02:43,  3.59s/it]

(['Italy'], ['get', 'economic', 'action', 'plan'])


52it [02:47,  3.79s/it]

(['Reuters'], ['weak', 'dollar', 'hits'])


53it [02:50,  3.55s/it]

(['Hyundai', 'India'], ['build', 'new', 'plant'])


54it [02:53,  3.45s/it]

(['SA'], ['unveils', 'more', 'budget'])


55it [02:57,  3.58s/it]

(['BMW'], ['drives', 'record', 'sales'])


56it [03:00,  3.43s/it]

([], ['economy', "'", 'stronger', 'forecast'])


57it [03:04,  3.61s/it]

([], ['electrolux', 'export', 'jobs'])


58it [03:08,  3.62s/it]

(['Worldcom'], ['-', 'boss', 'launches', 'defence'])


59it [03:11,  3.59s/it]

([], ['insurance', 'boss', 'plead', 'guilty'])


60it [03:15,  3.72s/it]

(['UK'], ['rise', 'jobless', 'total'])


61it [03:18,  3.51s/it]

(['Wembley'], ['firm', 'make', 'profit'])


62it [03:22,  3.60s/it]

(['US'], ['cars', 'pull', 'retail', 'figures'])


63it [03:26,  3.50s/it]

(['Lufthansa', 'Bush'], ['sue', 'visit'])


64it [03:29,  3.36s/it]

(['EU'], ['aiming', 'fuel', 'aid', 'development', 'aid'])


65it [03:33,  3.59s/it]

(['German'], ['business', 'confidence', 'slides'])


66it [03:36,  3.40s/it]

(['FAO'], ['warns', 'impact', 'subsidies'])


67it [03:39,  3.31s/it]

(['India'], ['seeks', 'boost', 'construction'])


68it [03:43,  3.50s/it]

(['Boeing', '777'], ['unveils', 'new', 'aircraft'])


69it [03:46,  3.37s/it]

(['Circuit City'], ['gets', 'takeover', 'offer'])


70it [03:50,  3.56s/it]

(['Japan'], ['turns', 'beer', 'alternative'])


71it [03:53,  3.51s/it]

(['S Korean'], ['s', 'korean', 'consumers', 'spending'])


72it [03:57,  3.49s/it]

(['German'], ['growth', 'goes', 'reverse'])


73it [04:00,  3.50s/it]

(['Turkey', 'Iran'], ['mobile', 'deal', 'risk'])


74it [04:03,  3.36s/it]

(['EU'], ['ministers', 'mull', 'jet', 'fuel', 'tax'])


75it [04:08,  3.91s/it]

(['Palestinian'], ['economy', 'decline'])


76it [04:11,  3.66s/it]

(['China', 'Yukos'], ['had', 'role', 'split', 'up'])


77it [04:15,  3.49s/it]

(['Fiat'], ['deadline', 'nears', 'deal'])


78it [04:19,  3.72s/it]

(['US'], ['id', 'surge', 'theft', 'surge', 'hits', 'consumers'])


79it [04:22,  3.55s/it]

(['Argentina', 'Venezuela'], ['oil', 'deal'])


80it [04:25,  3.43s/it]

(['BMW'], ['recall', 'faulty', 'diesel', 'car'])


81it [04:29,  3.54s/it]

(['Exel'], ['takeover', 'rumour', 'lifts', 'shares'])


82it [04:32,  3.37s/it]

(['Yukos'], ['accused', 'lying', 'court'])


83it [04:36,  3.51s/it]

(['Mexican', 'US'], ['send', '16bn', 'home'])


84it [04:39,  3.34s/it]

([], ['home', 'loan', 'approvals', 'rising'])


85it [04:42,  3.23s/it]

([], ['manufacturing', 'recovery', 'slowing'])


86it [04:46,  3.49s/it]

(['Worldcom'], ['boss', 'left', 'books'])


87it [04:49,  3.33s/it]

(['Metlife', 'Citigroup'], ['buys', 'insurer'])


88it [04:52,  3.23s/it]

(['US'], ['blames', 'weak', 'dollar'])


89it [04:56,  3.42s/it]

(['French'], ['wine', 'gets', 'euro', 'top', 'up'])


90it [04:59,  3.31s/it]

(['Russia'], ['gets', 'investment', 'blessing'])


91it [05:03,  3.51s/it]

(['Iranian'], ['mps', 'threaten', 'mobile', 'deal'])


92it [05:06,  3.36s/it]

(['Argentina'], ['closes', 'debt', 'swap'])


93it [05:09,  3.24s/it]

(['US'], ['economy', 'shows', 'solid', 'gdp', 'growth'])


94it [05:13,  3.49s/it]

(['Profitsfell'], ['profits', 'slide', '%', 'research', 'cost', 'rose', 'sales', 'flagged'])


95it [05:16,  3.35s/it]

(['Burren', 'Egyptian'], ['awarded', 'contracts'])


96it [05:19,  3.50s/it]

([], ['strong', 'dollar', 'call', 'halt', 'slide'])


97it [05:23,  3.40s/it]

(['German'], ['cuts', 'growth', 'estimate'])


98it [05:26,  3.31s/it]

(['GM', 'Ford'], ['cut', 'output', 'sales', 'fall'])


99it [05:30,  3.65s/it]

(['WorldCom'], ['ebbers', 'denies'])


100it [05:33,  3.52s/it]

(['Australia'], ['rates', 'year', 'high'])


101it [05:37,  3.45s/it]

(['US', 'Benin'], ['company', 'admits', 'bribery'])


102it [05:41,  3.58s/it]

(['US', 'Marsh'], ['insurer', 'Marsh', 'cuts', 'jobs'])


103it [05:43,  3.23s/it]

(['US', '280bn'], ['seeks', 'new', 'smoker', 'ruling'])


104it [05:45,  2.98s/it]

(['Budget Aston', 'Porsche'], ['takes'])


105it [05:49,  3.04s/it]

(['Gordon Brown'], ['golden', 'rule', 'intact', 'says', 'exwill', 'meet', 'margin', 'spare', 'according', 'former', 'chief', 'economic', 'adviser'])


106it [05:51,  2.84s/it]

(['Liberian'], ['economy', 'starts', 'grow'])


107it [05:54,  2.86s/it]

(['US'], ['slowdown', 'hits', 'factory', 'growth'])


108it [05:56,  2.73s/it]

(['Lufthansa'], ['flies', 'profit'])


109it [05:59,  2.64s/it]

(['Japanese'], ['growth', 'grinds', 'halt'])


110it [06:02,  2.82s/it]

(['Unilever'], ['shake', 'profit', 'slips'])


111it [06:04,  2.68s/it]

(['France Telecom'], ['gets', 'boost'])


112it [06:07,  2.73s/it]

(['Brussels'], ['raps', 'mobile', 'call', 'charge'])


113it [06:11,  3.09s/it]

(['WorldCom'], ['director', 'admits', 'lying'])


114it [06:14,  3.02s/it]

(['Glaxo'], ['aims', 'profit', 'fall'])


115it [06:17,  3.00s/it]

(['Japan'], ['bank', 'shares', 'link', 'talk'])


116it [06:21,  3.28s/it]

(['Mercedes'], ['car', 'giant', 'hit', 'slump'])


117it [06:24,  3.18s/it]

(['Ericsson'], ['sees', 'earnings', 'improve'])


118it [06:27,  3.34s/it]

(['Bank'], ['opts', 'leave', 'rates', 'hold'])


119it [06:31,  3.25s/it]

(['Nigeria'], ['boost', 'cocoa', 'production'])


120it [06:33,  3.14s/it]

(['US'], ['interest', 'rate', 'increased', '%'])


121it [06:37,  3.40s/it]

(['US', 'SEC'], ['bank', 'settlement'])


122it [06:40,  3.25s/it]

(['Aires', 'Buenos Aires'], ['train', 'strike', 'grips', 'caused', 'traffic', 'chaos', 'large', 'queues', 'bus', 'stop'])


123it [06:43,  3.16s/it]

(['Softbank'], ['bargain', 'call', 'widen', 'loss'])


124it [06:47,  3.35s/it]

([], ['profits', 'bid', 'criticism'])


125it [06:50,  3.20s/it]

(['Barclays'], ['profits', 'hit', 'record', 'level'])


126it [06:53,  3.14s/it]

(['Yukos', 'Russia'], ['owner', 'sues'])


127it [06:57,  3.38s/it]

(['Qantas'], ['sees', 'profits', 'fly', 'record'])


128it [07:00,  3.24s/it]

(['Iraq'], ['invite', 'phone', 'licence', 'bids'])


129it [07:04,  3.40s/it]

(['India'], ['aviation', 'firm', 'eye', 'booming'])


130it [07:06,  3.26s/it]

(['Russian', 'Yukos'], ['oil', 'merger', 'excludes'])


131it [07:09,  3.16s/it]

(['Brazil', 'Belgium', 'Inbev'], ['buy', 'boosts'])


132it [07:13,  3.44s/it]

(['Cameroon'], ['salary', 'scandal'])


133it [07:16,  3.29s/it]

(['US'], ['adds', 'more', 'jobs', 'expected'])


134it [07:19,  3.20s/it]

([], ['feta', 'cheese', 'battle', 'reaches', 'court'])


135it [07:23,  3.36s/it]

(['Ukraine'], ['revisits', 'state', 'off', 'sell', 'off'])


136it [07:26,  3.24s/it]

([], ['set', 'leave', 'rates', 'hold'])


137it [07:29,  3.15s/it]

([], ['winter', 'freeze', 'keeps', 'oil'])


138it [07:33,  3.40s/it]

(['German'], ['jobless', 'rate', 'new', 'record'])


139it [07:36,  3.24s/it]

([], ['ore', 'cost', 'hit', 'global', 'steel', 'firm'])


140it [07:40,  3.42s/it]

(['BMW'], ['reveals', 'new', 'model', 'pipeline'])


141it [07:43,  3.31s/it]

(['Asian'], ['banks', 'halt', 'dollar', 'slide'])


142it [07:46,  3.20s/it]

(['Cadbury'], ['weak', 'dollar', 'trims', 'profits'])


143it [07:50,  3.44s/it]

([], ['oil', 'price', 'fall', 'highs'])


144it [07:53,  3.30s/it]

([], ['winn', 'file', 'bankruptcy'])


145it [07:56,  3.19s/it]

([], ['few', 'targets', 'better', 'many'])


146it [08:00,  3.41s/it]

(['Malaysia', 'Islamic'], ['lifts', 'bank', 'limit'])


147it [08:03,  3.30s/it]

(["Alfa Romeos '", 'GM'], ['get', 'engines'])


148it [08:06,  3.42s/it]

(['Saab', 'Cadillacs', 'Sweden'], ['build'])


149it [08:09,  3.31s/it]

([], ['shares', 'hit', 'drug', 'suspension'])


150it [08:12,  3.21s/it]

(['Bank'], ['voted', 'rate', 'change'])


151it [08:16,  3.45s/it]

([], ['crude', 'oil', 'price'])


152it [08:19,  3.29s/it]

(['House'], ['prices', 'show', 'slight', 'increase'])


153it [08:22,  3.22s/it]

([], ['golden', 'rule', 'boost'])


154it [08:26,  3.40s/it]

(['Macy', 'Macy'], ['owner', 'buys'])


155it [08:29,  3.25s/it]

(['Japan'], ['industrial', 'revival', 'hope'])


156it [08:33,  3.42s/it]

(['Khodorkovsky'], ['ally', 'denies', 'charges'])


157it [08:36,  3.27s/it]

(['Qatar', 'Shell'], ['6bn', 'deal', 'gas', 'deal'])


158it [08:39,  3.17s/it]

(['India'], ['unveils', 'anti', '-', 'poverty', 'budget'])


159it [08:43,  3.43s/it]

(['GM', 'Fiat'], ['pays', '2bn', 'evade', 'buyout'])


160it [08:46,  3.26s/it]

(['Ex-Boeing'], ['ex', '-', 'boeing', 'director', 'gets', 'jail', 'term'])


161it [08:49,  3.39s/it]

(['Verizon', 'MCI'], ['seal', 'takeover'])


162it [08:53,  3.37s/it]

(['US'], ['data', 'sparks', 'inflation', 'worry'])


163it [08:56,  3.41s/it]

(['Yukos'], ['sues', 'firms'])


164it [09:01,  3.71s/it]

(['US'], ['consumer', 'spending', 'lifts', 'growth'])


165it [09:04,  3.51s/it]

([], ['crossrail', 'link', 'get', 'go'])


166it [09:07,  3.36s/it]

(['Hariri', 'Beirut'], ['killing', 'hits', 'shares'])


167it [09:10,  3.50s/it]

([], ['small', 'firms', "'", 'hit', 'rising', 'costs'])


168it [09:13,  3.32s/it]

(["Jet Airways'"], ['buyers', 'snap', 'shares'])


169it [09:17,  3.47s/it]

(['House'], ['prices', 'suffer', 'festive', 'fall'])


170it [09:20,  3.33s/it]

(['Deutsche Boerse'], ['boosts', 'dividend'])


171it [09:23,  3.24s/it]

(['EU'], ['newest', 'members', 'underpin', 'growth'])


172it [09:27,  3.47s/it]

(['Brewers', 'Brewers'], ['profits', 'lose', 'fizz'])


173it [09:30,  3.30s/it]

(["Yangtze Electric's"], ['profits', 'double'])


174it [09:33,  3.20s/it]

(['French'], ['consumer', 'spending', 'rising'])


175it [09:37,  3.34s/it]

(['GSK'], ['aims', 'stop', 'profiteers'])


176it [09:40,  3.20s/it]

(['UK'], ['optimism', 'remains', 'housing'])


177it [09:43,  3.34s/it]

(['Russia', 'WTO'], ['talks', 'make', 'progress'])


178it [09:46,  3.22s/it]

(['Irish', 'Man Utd'], ['duo', 'block', 'bid'])


179it [09:49,  3.17s/it]

([], ['dollar', 'drops', 'reserve', 'concern'])


180it [09:53,  3.40s/it]

(['India', 'Russia'], ['energy', 'talk'])


181it [09:56,  3.23s/it]

(['French'], ['weak', 'datum', 'buffet', 'economy'])


182it [09:59,  3.11s/it]

(['EU'], ['business', 'fears', 'sluggish', 'economy'])


183it [10:03,  3.31s/it]

(['M&S'], ['cuts', 'prices', 'average', '%'])


184it [10:06,  3.18s/it]

(['US'], ['bank', 'loses', 'customer', 'detail'])


185it [10:08,  3.08s/it]

(['Jet Airways'], ['huge', 'rush', 'shares'])


186it [10:12,  3.37s/it]

(['Pinochet'], ['bank', 'payout', 'victims'])


187it [10:15,  3.21s/it]

(['MCI'], ['spark', 'bidding', 'war'])


188it [10:19,  3.37s/it]

(['Fiat'], ['chief', 'takes', 'steering', 'wheel'])


189it [10:22,  3.21s/it]

(['French'], ['consumers', 'drive', 'economy'])


190it [10:25,  3.09s/it]

(['US'], ['regulator', 'rule', 'pain', 'drug'])


191it [10:29,  3.36s/it]

(['Yukos', 'US'], ['bankruptcy', 'matter'])


192it [10:31,  3.21s/it]

(['Borussia', 'Dortmund'], ['bust'])


193it [10:34,  3.13s/it]

([], ['-', 'lull', 'lending'])


194it [10:38,  3.32s/it]

(['UK'], ['risks', 'breaking', 'golden', 'rule'])


195it [10:41,  3.24s/it]

(['Worldcom'], ['director', 'ends', 'evidence'])


196it [10:45,  3.47s/it]

(['Ukraine'], ['steel', 'sell', 'illegal', 'court', 'ruled'])


197it [10:48,  3.33s/it]

(['Cairn'], ['shares', 'new', 'oil', 'find'])


198it [10:51,  3.26s/it]

(['Georgia'], ['plans', 'hidden', 'asset', 'pardon'])


199it [10:55,  3.37s/it]

(['Cuba'], ['winds', 'economic', 'clock'])


200it [10:58,  3.26s/it]

(['Novartis'], ['hits', 'acquisition', 'trail'])


201it [11:02,  3.38s/it]

(['MCI'], ['shareholder', 'sues', 'stop', 'bid'])


202it [11:05,  3.23s/it]

(['Deutsche', 'LSE'], ['bid'])


203it [11:07,  3.11s/it]

(['Bush'], ['outline', 'toughest', 'budget'])


204it [11:11,  3.37s/it]

([], ['orange', 'colour', 'clash', 'set', 'court'])


205it [11:14,  3.22s/it]

(['Standard Life'], ['cuts', 'policy', 'bonus'])


206it [11:17,  3.12s/it]



207it [11:21,  3.33s/it]

(['China', 'Shanda', 'Sina'], ['buys', 'stake'])


208it [11:24,  3.19s/it]

([], ['mixed', 'reaction', 'offer'])


209it [11:27,  3.12s/it]

([], ['gold', 'falls', 'sale', 'concern'])


210it [11:30,  3.30s/it]

([], ['electronic', 'firm', 'eye', 'deal', 'plasma', 'deal'])


211it [11:33,  3.19s/it]

(['MG Rover China'], ['tie', 'delayed'])


212it [11:37,  3.36s/it]

(['US', 'Smith'], ['bank', 'boss', 'hails', 'genius'])


213it [11:40,  3.22s/it]

([], ['economy', 'strong', 'election', 'year'])


214it [11:43,  3.11s/it]

(['SEC', 'post-Enron'], ['rethink', 'post', '-', 'rules'])


215it [11:47,  3.32s/it]

(['Nissan', 'Ghosn'], ['name', 'successor'])


216it [11:50,  3.19s/it]

(['Ukraine'], ['trims', 'privatisation', 'check'])


217it [11:53,  3.14s/it]

(['Absa', 'Barclays'], ['talks', 'continue'])


218it [11:56,  3.30s/it]

(['Borussia', 'Dortmund'], ['rescue', 'hope'])


219it [11:59,  3.18s/it]

(['Standard Life', 'LSE'], ['concern', 'bid'])


220it [12:03,  3.36s/it]

(['BP'], ['surges', 'high', 'oil', 'price'])


221it [12:06,  3.22s/it]

(['Russian'], ['oil', 'company', 'get', 'setback'])


222it [12:09,  3.11s/it]

(['UK'], ['gaming', 'firm', 'sell', 'dog', 'tracks'])


223it [12:13,  3.37s/it]

(['Glazer'], ['open', 'books'])


224it [12:16,  3.26s/it]

([], ['sales', "'", 'fail', 'boost'])


225it [12:19,  3.40s/it]

(["McDonald's", 'MTV'], ['sponsor', 'show'])


226it [12:22,  3.20s/it]

([], ['call', 'save', 'manufacturing', 'job'])


227it [12:25,  3.02s/it]

(['Sri Lanka'], ["'", 'hit', 'banks'])


228it [12:28,  3.18s/it]

(['Man Utd'], ['shares', 'rise', 'new', 'offer'])


229it [12:31,  3.01s/it]

(['Yukos'], ['drops', 'banks', 'court', 'bid'])


230it [12:34,  3.08s/it]

(['Venezuela'], ['reviews', 'foreign', 'deals'])


231it [12:37,  2.99s/it]

(["Lloyd's of London", 'FSA'], ['head', 'chides'])


232it [12:40,  2.95s/it]

([], ['bat', 'firm', 'spit', 'firm', 'drug', 'firm', 'goes', 'market'])


233it [12:44,  3.45s/it]

(['Vodafone', 'Japan'], ['appoints', 'new', 'boss'])


234it [12:47,  3.25s/it]

([], ['pension', 'hitch', 'living', 'men'])


235it [12:50,  3.18s/it]

([], ['card', 'fraudster', 'targeting', 'web'])


236it [12:54,  3.25s/it]

(['Britannia'], ['members', 'windfall'])


237it [12:56,  3.00s/it]

([], ['firms', 'pump', 'billions', 'pensions'])


238it [12:59,  2.88s/it]

(['UK', '£3.3 trillion'], ['homes', 'hit', 'total'])


239it [13:02,  3.13s/it]

([], ['economy', 'strong', 'election', 'year'])


240it [13:06,  3.16s/it]

(['G7'], ['backs', 'debt', 'relief', 'plan'])


241it [13:12,  4.20s/it]

(['Malcolm Glazer', 'Man Utd'], ['q&a'])


242it [13:16,  3.93s/it]

([], ['making', 'office', 'work'])


243it [13:19,  3.74s/it]

(['Aurora'], ['market', 'unfazed', 'setback'])


244it [13:23,  3.83s/it]

(['US'], ['ticking', 'budget', 'facing'])


245it [13:26,  3.66s/it]

(['WorldCom'], ['ebbers', 'aware'])


246it [13:29,  3.54s/it]

(['Renault'], ['boss', 'hails', 'great', 'year'])


247it [13:34,  3.72s/it]

([], ['survey', 'confirms', 'property', 'slowdown'])


248it [13:37,  3.58s/it]

(['Bush'], ['budget', 'seeks', 'deep', 'cutbacks'])


249it [13:41,  3.77s/it]

(['China', 'Lenovo'], ['profit', 'stall'])


250it [13:44,  3.60s/it]

(['MCI'], ['shares', 'climb', 'takeover', 'bid'])


251it [13:47,  3.49s/it]

(['BT'], ['offers', 'equal', 'access', 'rivals'])


252it [13:51,  3.65s/it]

(['US'], ['jobs', 'growth', 'slow'])


253it [13:55,  3.55s/it]

(['News Corp'], ['eye', 'market', 'video', 'game'])


254it [13:58,  3.46s/it]

(['UK'], ['call', 'overhaul', 'state', 'pension'])


255it [14:02,  3.69s/it]

(['Singapore'], ['growth', '%'])


256it [14:05,  3.55s/it]

(['Turkey'], ['knocks', 'zeros', 'lira'])


257it [14:10,  3.80s/it]

(['S Korea'], ['spending', 'boost', 'economy'])


258it [14:13,  3.63s/it]

([], ['sees', 'strong', 'growth'])


259it [14:16,  3.53s/it]

([], ['asia', 'share', 'defy', 'post', '-', 'quake', 'gloom'])


260it [14:21,  3.72s/it]

([], ['booming', 'markets', 'shed', 'few', 'tears'])


261it [14:24,  3.57s/it]

(['Asian', 'European'], ['quake', 'hits', 'shares'])


262it [14:28,  3.76s/it]

(['£194m'], ['split', 'cap', 'pay', 'compensation'])


263it [14:32,  3.93s/it]

(['French', 'LSE'], ['suitor', 'holds', 'meeting'])


264it [14:36,  3.73s/it]

(['Marsh', 'SEC'], ['troubled', 'scrutiny'])


265it [14:39,  3.68s/it]

(['Yukos'], ['blessing', 'disguise'])


266it [14:42,  3.34s/it]

(['Nasdaq'], ['planning', 'share', 'sale'])


267it [14:44,  3.18s/it]

([], ['giving', 'financial', 'gifts', 'children'])


268it [14:50,  4.01s/it]

(['Air China', 'London'], ['listing'])


269it [14:54,  3.87s/it]

([], ['oil', 'price', 'reach', 'month', 'low'])


270it [14:57,  3.71s/it]

([], ['seek', 'full', 'share', 'listing'])


271it [15:02,  3.90s/it]

(['Brazilian'], ['markets', 'signal', 'recovery'])


272it [15:05,  3.60s/it]

([], ['markets', 'fall', 'weak', 'dollar', 'fear'])


273it [15:08,  3.43s/it]

(['Google'], ['shares', 'fall', 'staff', 'sell'])


274it [15:11,  3.53s/it]

(['Germans'], ['work'])


275it [15:14,  3.36s/it]

(['India'], ['power', 'share', 'jump', 'debut'])


276it [15:18,  3.49s/it]

(['Turkey'], ['turns', 'economic', 'charm'])


277it [15:21,  3.37s/it]

(['SBC'], ['plans', 'post', '-', 'takeover', 'job', 'cut'])


278it [15:24,  3.26s/it]

(['German', 'LSE'], ['bidder', 'talks'])


279it [15:28,  3.49s/it]

(['Amex'], ['shares', 'spin', 'news'])


280it [15:31,  3.38s/it]

(['Axa Sun Life'], ['cuts', 'bonus', 'payment'])


281it [15:34,  3.27s/it]

(['Chinese', 'Beijing'], ['dam', 'firm', 'defies'])


282it [15:38,  3.37s/it]

(['Japan'], ['stock', 'market', 'eye', 'Japan', 'recovery'])


283it [15:41,  3.22s/it]

(['Iraqi'], ['voters', 'turn', 'economic', 'issues'])


284it [15:44,  3.17s/it]

(['EU'], ['slow', 'economic', 'reforms'])


285it [15:48,  3.34s/it]

(['China'], ['continues', 'breakneck', 'growth'])


286it [15:51,  3.21s/it]

(['Japan'], ['ageing', 'workforce', 'built', 'last'])


287it [15:54,  3.32s/it]

(['GE'], ['sees', 'excellent', 'world', 'economy'])


288it [15:57,  3.21s/it]

(['UK'], ['economy', 'facing', 'major', 'risks'])


289it [16:00,  3.13s/it]

(['Bank'], ['holds', 'interest', 'rate', '%'])


290it [16:04,  3.34s/it]

([], ['tobacco', 'giant', 'ruling'])


291it [16:07,  3.24s/it]

(['US'], ['steady', 'job', 'growth', 'continues'])


292it [16:10,  3.15s/it]

(['Glazer', 'Man Utd'], ['makes', 'new', 'approach'])


293it [16:14,  3.32s/it]

(['Bush'], ['cheers', 'victory'])


294it [16:16,  3.16s/it]

(['Japan'], ['business', 'confidence', 'dips'])


295it [16:19,  3.14s/it]

([], ['millions', "'", 'lose', 'textile', 'job'])


296it [16:22,  3.08s/it]

(['Dutch'], ['bank', 'lay', 'staff'])


297it [16:25,  3.06s/it]

(["Fannie Mae '"], ['restate', 'books'])


298it [16:29,  3.32s/it]

(['US', 'Yukos'], ['rule', 'refuge', 'call'])


299it [16:32,  3.24s/it]

(['J&J'], ['agrees', '25bn', 'guidant', 'deal'])


300it [16:35,  3.15s/it]

(['Libya'], ['takes', 'unfrozen', 'funds'])


301it [16:39,  3.35s/it]

(['Phytopharm'], ['cactus', 'deal', 'diet', 'deal'])


302it [16:42,  3.23s/it]

(['Brazil', 'Varig'], ['plays', 'rescue'])


303it [16:46,  3.44s/it]

(['Bombardier', 'Bombardier'], ['chief', 'leave'])


304it [16:49,  3.28s/it]

(['Brazil'], ['approves', 'bankruptcy', 'reform'])


305it [16:52,  3.19s/it]

([], ['retail', 'sales', 'show', 'festive', 'fervour'])


306it [16:55,  3.24s/it]

(['Cairn'], ['shares', 'slump', 'oil', 'setback'])


307it [16:58,  3.01s/it]

(['French', 'EADS'], ['boss', 'leave'])


308it [17:00,  2.86s/it]

(['AstraZeneca'], ['hit', 'drug', 'failure'])


309it [17:03,  2.94s/it]

(['Nike'], ['strong', 'quarterly', 'growth'])


310it [17:06,  2.80s/it]

(['Stormy year'], ['stormy', 'year', 'property', 'insurer'])


311it [17:09,  2.88s/it]

(['Parmalat'], ['sues', 'banks', 'crash'])


312it [17:11,  2.75s/it]

(['Irish', 'Iraqi'], ['company', 'hit', 'report'])


313it [17:14,  2.69s/it]

(['Yukos'], ['unit', 'fetches', '9bn', 'auction'])


314it [17:17,  2.85s/it]

(['S&N', 'Indian'], ['extends', 'beer', 'venture'])


315it [17:20,  2.73s/it]

(['Euronext', 'LSE'], ['poised', 'make', 'bid'])


316it [17:22,  2.66s/it]

([], ['shoppers', 'flock', 'tills'])


317it [17:25,  2.84s/it]

(['Yukos'], ['mystery', 'surrounds', 'new', 'owner'])


318it [17:28,  2.73s/it]

(['Euronext', 'LSE'], ['joins', 'bid', 'battle'])


319it [17:31,  2.84s/it]

(['Iraq', 'Afghanistan', 'WTO'], ['talks'])


320it [17:35,  3.36s/it]

(['Diageo', 'US'], ['buy', 'wine', 'firm'])


321it [17:39,  3.35s/it]

(['Tokyo'], ['says', 'deflation', 'controlled'])


322it [17:43,  3.48s/it]

([], ['seasonal', 'lift', 'house', 'market'])


323it [17:46,  3.44s/it]

(['Yukos'], ['seeks', 'court', 'action', 'sale'])


324it [17:49,  3.32s/it]

(['Indy', 'India'], ['buys', 'paper'])


325it [17:57,  4.86s/it]

(['Fannie Mae'], ['bosses', 'resign'])


326it [18:01,  4.59s/it]

([], ['cannabis', 'hopes', 'drug', 'firm'])


327it [18:05,  4.22s/it]

(['Bush'], ['get', 'tough', 'deficit'])


328it [18:09,  4.26s/it]

(['House'], ['prices', 'drop', 'sales', 'slow'])


329it [18:13,  4.04s/it]

(['Argentine'], ['fresh', 'hope', 'crisis'])


330it [18:17,  4.08s/it]

(['Disney'], ['settles', 'disclosure', 'charge'])


331it [18:22,  4.57s/it]

(['Putin', 'Yukos'], ['backs', 'state', 'grab'])


332it [18:26,  4.26s/it]

(['Marsh', 'SEC'], ['troubled', 'scrutiny'])


333it [18:29,  3.97s/it]

(['US', 'Iraq'], ['firm', 'pulls'])


334it [18:34,  4.19s/it]

(['Boeing', 'Japan'], ['secures', 'giant', 'order'])


335it [18:38,  4.05s/it]

([], ['banker', 'loses', 'sexism', 'claim'])


336it [18:43,  4.28s/it]

([], ['building', 'giant', 'asbestos', 'payout'])


337it [18:46,  3.92s/it]

(['Chinese'], ['police', 'detain', 'milk', 'boss'])


338it [18:49,  3.63s/it]

(['India', 'Deccan'], ['seals', 'deal'])


339it [18:53,  3.76s/it]

(['Venezuela', 'China'], ['sign', 'oil', 'deal'])


340it [18:56,  3.52s/it]

(['Jarvis', 'Tube', 'Spain'], ['sells', 'stake'])


341it [18:59,  3.36s/it]

(['Honda', 'China'], ['wins', 'copyright', 'ruling'])


342it [19:02,  3.49s/it]

(['Air Jamaica'], ['state', 'control'])


343it [19:05,  3.32s/it]

([], ['battered', 'dollar', 'hits', 'low'])


344it [19:08,  3.27s/it]

([], ['economic', 'costs', 'emerging'])


345it [19:13,  3.59s/it]

([], ['disaster', 'claims', 'less'])


346it [19:16,  3.43s/it]

(['India', 'Pakistan'], ['peace', 'boosts', 'trade'])


347it [19:19,  3.32s/it]

(['US'], ['probe', 'airline', 'chaos', 'travel', 'chaos'])


348it [19:23,  3.54s/it]

(['S Korean'], ['s', 'korean', 'lender', 'faces', 'liquidation'])


349it [19:26,  3.38s/it]

([], ['dollar', 'hits', 'new', 'low'])


350it [19:30,  3.51s/it]

(['US'], ['mild', 'winter', 'drives', 'oil', '%'])


351it [19:33,  3.34s/it]

(['Reliance'], ['share', 'boost', 'feud', 'hit'])


352it [19:36,  3.23s/it]

([], ['giant', 'wave', 'economy', 'damage', 'economy'])


353it [19:40,  3.50s/it]

([], ['asia', 'share', 'defy', 'post', '-', 'quake', 'gloom'])


354it [19:43,  3.34s/it]

(['Israeli'], ['economy', 'picking', 'pace'])


355it [19:46,  3.23s/it]

(['S Korea'], ['spending', 'boost', 'economy'])


356it [19:50,  3.40s/it]

(['Soros', 'Kazakh'], ['group', 'warns'])


357it [19:53,  3.28s/it]

(['Deutsche', 'Yukos'], ['attacks', 'case'])


358it [19:56,  3.21s/it]

(['GM', 'Fiat'], ['crunch', 'talk', 'future'])


359it [20:00,  3.44s/it]

(['Chilean'], ['record', 'year', 'copper'])


360it [20:02,  3.28s/it]

(['US'], ['consumer', 'confidence'])


361it [20:06,  3.46s/it]

([], ['cash', 'gives', 'way', 'flexible', 'friend'])


362it [20:09,  3.32s/it]

(['Balkan'], ['go', 'ahead', 'oil', 'pipeline'])


363it [20:12,  3.24s/it]

([], ['durex', 'maker', 'ssl', 'awaits', 'firm', 'bid'])


364it [20:16,  3.39s/it]

(['Nasdaq', '100m'], ['planning', 'share', 'sale'])


365it [20:19,  3.25s/it]

(['WMC', 'Xstrata'], ['says', 'bid', 'low'])


366it [20:23,  3.40s/it]

(['Sunderland FC'], ['takeover', 'offer'])


367it [20:26,  3.26s/it]



368it [20:29,  3.18s/it]

(['Russian'], ['beer', 'giant', 'swallows', 'firm'])


369it [20:33,  3.41s/it]

(['US'], ['manufacturing', 'expands'])


370it [20:36,  3.28s/it]

(['Singapore'], ['growth', '%'])


371it [20:39,  3.15s/it]

(['Madagascar'], ['completes', 'currency', 'switch'])


372it [20:42,  3.40s/it]

(['Rossignol'], ['moves'])


373it [20:46,  3.28s/it]

([], ['dollar', 'hovers', 'record', 'low'])


374it [20:49,  3.20s/it]

(['S Korean'], ['s', 'korean', 'credit', 'card', 'firm', 'rescued'])


375it [20:52,  3.40s/it]

(['New Year'], ['dollar', 'slides'])


376it [20:55,  3.28s/it]

(['VW', 'Indian'], ['considers', 'opening', 'plant'])


376it [20:59,  3.35s/it]


KeyboardInterrupt: 

## Triplets for TextGen

In [None]:
##########################################################################################################################################
# Generate FILTERED triples
# USE CASE: generating summaries
# NOTES: a triple is kept only if a title keyword appears in its subject, verb, or object; otherwise it is discarded.
##########################################################################################################################################

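# Generate triples and their reference sentences, then save them to files
PATH_TRIPLE = './data/BBC/Generation/Business/'
PATH_SENTENCE = './data/BBC/Generation/Temp/'
os.makedirs(PATH_TRIPLE, exist_ok=True)    # make sure both output folders exist before writing
os.makedirs(PATH_SENTENCE, exist_ok=True)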

article_keywords = []
article_pairs = [] 
for i, article in tqdm(enumerate(articles)):

    # print(f'\n=========ARTICLE: {i+1}.txt=========')
    
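    ####################################
    ## Generate Keywords
    # NOTE: the model is reloaded on every iteration, which is slow. If the helpers below
    # do not modify the pipeline, this load (and nlp_retok) can be moved above the loop.
    nlp = spacy.load("en_core_web_lg")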

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    
    ####################################
    ## Retokenize Body 

    # Create new NLP instances, incl matchers
    nlp_retok = spacy.load("en_core_web_lg")
    doc = nlp_retok(title_body[1])

    ####################################
    ## Generate Triples
    
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, keyword_filter = True, VERBOSE= False)
    article_pairs.append(pairs)
    filename = f"{i+1:03d}.txt"  # 001.txt, 002.txt, ... in article order
    FILE_TRIPLE = os.path.join(PATH_TRIPLE, filename)
    FILE_SENTENCE = os.path.join(PATH_SENTENCE, filename)

    save_tuple_pairs_to_files(pairs, FILE_TRIPLE, FILE_SENTENCE)


print(f"articles processed: {len(article_keywords)}")
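
The keyword filter described in the NOTES of the cell above can be sketched as follows. This is only an illustration, not the code behind `generate_triplet2sentence_pairs`: the helper name `filter_triples_by_keywords`, the plain-string triple layout, and the case-insensitive substring match are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Assumed layout: (subject, verb, object) as plain strings.
Triple = Tuple[str, str, str]


def filter_triples_by_keywords(triples: Iterable[Triple], keywords: Iterable[str]) -> List[Triple]:
    """Keep a triple only if at least one keyword occurs in its subject, verb, or object."""
    lowered = [k.lower() for k in keywords if k.strip()]
    kept = []
    for triple in triples:
        text = " ".join(triple).lower()
        # Hypothetical matching rule: case-insensitive substring search.
        if any(k in text for k in lowered):
            kept.append(triple)
    return kept


# Keywords in the shape printed by the notebook: (entities, other title words).
entities, words = (['Time Warner'], ['ad', 'sale', 'boost', 'profit'])
triples = [
    ("profits", "were buoyed", "gains"),
    ("AOL", "lost", "464,000 subscribers"),
]
print(filter_triples_by_keywords(triples, entities + words))
# [('profits', 'were buoyed', 'gains')]
```

Substring matching is crude: a short keyword such as `ad` would also match inside `had`. Comparing lemmas or whole tokens would avoid that, at the cost of needing the spaCy `doc`.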


## Triplets for LLM Tuning


In [None]:
##########################################################################################################################################
# Generate per-line triples and their reference sentences
# USE CASE: tuning an LLM
# NOTES: generate_triplet2sentence_pairs is called with return_type = "llm", which returns a list of strings; each string is one triplet + sentence pair
##########################################################################################################################################

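# Generate triples and their reference sentences, then save them to a single file

FOLDER = './data/BBC/Training/business_gpt2/'
FILENAME = "business_gpt2.txt"
PATH = os.path.join(FOLDER, FILENAME)
os.makedirs(FOLDER, exist_ok=True)  # open(PATH, 'w') below fails if the folder is missing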

article_keywords = []
article_pairs = [] 
training_pairs = [] 
for i, article in tqdm(enumerate(articles)):

    ####################################
    ## Generate Keywords

    nlp = spacy.load("en_core_web_lg")

    title_body = seperate_title_and_body(article, VERBOSE = False)
    keywords = get_title_keywords_v1(tup_article = title_body, nlp = nlp, formatting = "atomic", VERBOSE = False)
    article_keywords.append(keywords)
    print(keywords)
    
    ####################################
    ## Retokenize Body 
    # Create new NLP instances, incl matchers

    nlp_retok = spacy.load("en_core_web_lg")
    doc = nlp_retok(title_body[1])

    ####################################
    ## Generate Triples
    
    pairs = generate_triplet2sentence_pairs(doc = doc, arr_keyword = keywords, return_type = "llm", keyword_filter = False, VERBOSE= False)
    training_pairs.extend(pairs)
    

with open(PATH, 'w', encoding= "utf-8") as file:
    for line in training_pairs:
        file.write(f"{line}\n")
    

print(f"articles processed: {len(article_keywords)}")


0it [00:00, ?it/s]

(['Time Warner'], ['ad', 'sale', 'boost', 'profit'])


1it [00:03,  3.30s/it]

(['Greenspan'], ['dollar', 'gain', 'speech'])


2it [00:06,  3.10s/it]

(['Yukos'], ['unit', 'buyer', 'faces', 'loan', 'claim'])


3it [00:09,  3.07s/it]

(['BA'], ['high', 'fuel', 'price', 'hit', 'profits'])


4it [00:13,  3.47s/it]

(['Pernod', 'Domecq'], ['takeover', 'talk', 'lifts'])


5it [00:17,  3.45s/it]


KeyboardInterrupt: 
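
The one-pair-per-line training file written by the cell above can be illustrated with a small, self-contained sketch. The exact line layout produced by `return_type = "llm"` is not shown in this notebook, so the format here is an assumption: triple fields joined with ` | ` (the delimiter named at the top of the notebook) and a hypothetical ` => ` separator before the sentence. `format_training_line` and `write_training_file` are illustrative names, not notebook functions.

```python
import os
import tempfile
from typing import List, Tuple


def format_training_line(triple: Tuple[str, str, str], sentence: str, sep: str = " => ") -> str:
    """Join a triple with ' | ' and append its reference sentence on the same line."""
    # Collapse newlines and repeated whitespace so each pair stays on exactly one line.
    clean_sentence = " ".join(sentence.split())
    return " | ".join(part.strip() for part in triple) + sep + clean_sentence


def write_training_file(lines: List[str], path: str) -> None:
    """Write one training example per line, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(f"{line}\n")


pairs = [
    (("profits", "were buoyed", "gains"),
     "Quarterly profits at US media giant TimeWarner were buoyed\nby one-off gains."),
]
lines = [format_training_line(t, s) for t, s in pairs]

# Write to a temporary folder so the sketch does not touch ./data.
out_path = os.path.join(tempfile.mkdtemp(), "business_gpt2", "business_gpt2.txt")
write_training_file(lines, out_path)
print(lines[0])
```

Collapsing whitespace matters because the file is read back line by line: a sentence containing a newline would otherwise be split into two malformed training examples.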