In [1]:
%load_ext autoreload
%autoreload 2

import summarizer_utils as sutils
import story_converter as sconv
import pickle
import nltk.tokenize as tokenize
import os
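# Note: nltk.tokenize.moses was removed in NLTK 3.3; on newer versions use
# `from sacremoses import MosesDetokenizer` instead.
from nltk.tokenize.moses import MosesDetokenizer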
import pandas as pd
import re

## Get the test data:

In [84]:
article_df = pd.read_csv('test_roman.csv')

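def remove_date_title_and_id(s):
    # Drop the first three lines of each record (date, title and id).
    _s = s.split('\n')
    # The remaining lines are joined without a separator, so sentences can
    # end up glued together ("...$112m.As a result..."); fix_periods repairs that.
    return ''.join(_s[3:])


# Replace a period directly followed by a capital letter with
# period + space + capital letter.
def fix_periods(s):
    fixed = re.sub(r'\.([A-Z])', r'. \g<1>', s)
    return fixed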

articles = article_df['full_text'].apply(remove_date_title_and_id)
articles = articles.apply(fix_periods)

#date_and_id_regex = '^[0-9]{8}[\n][A-Z0-9]{32}[\n]'
#articles = article_df['full_text'].apply(lambda x: re.sub(date_and_id_regex, '', x))
#ids = article_df['docno_x']
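
To see what the two cleaning steps do, here is a small self-contained sketch. The sample record is made up (it is not taken from `test_roman.csv`) and `strip_header` is a hypothetical stand-in for `remove_date_title_and_id`. Note that the period fix also splits abbreviations such as "U.S.", and it does not touch a period followed by an opening quote, which is why `”.“Looking` stays glued together in `articles[0]`.

```python
import re

def strip_header(text, n_header_lines=3):
    # Drop the first n_header_lines lines (assumed here: date, title, id)
    # and join the rest without a separator, as in the cell above.
    return ''.join(text.split('\n')[n_header_lines:])

def fix_periods(text):
    # Insert a space between a period and an immediately following capital.
    return re.sub(r'\.([A-Z])', r'. \g<1>', text)

# Made-up sample record, for illustration only.
sample = ("20170301\nSome title\nABC123\n"
          "Fees fell by 9pc.\nShares dropped.\nU.S. markets were flat.")

cleaned = fix_periods(strip_header(sample))
print(cleaned)
# Fees fell by 9pc. Shares dropped. U. S. markets were flat.
```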

In [83]:
articles[0]

'Net management fees fell by 9pc over the course of the year to $691m while performance fees dived by almost two-thirds to $112m. As a result the firm’s shares dropped by 10pc in early trading, but recovered later in the day to a more modest 1.8pc loss. In part that was due to a rise in overall assets under management which increased by 3pc to $80.9bn. The company prefers to focus on adjusted pre-tax profits, which came in at $205m, down from 2015’s profit of $400m. Chief executive Luke Ellis, who took the reins in the middle of 2016, said Man Group has “made real progress in repositioning the firm for the future” and “continued to control our cost base”.“Looking forward to 2017, we have started the year with a good pipeline of interest from clients and encouraging performance across most of our strategies as the new global political environment has created many alpha opportunities, but it remains early days in an uncertain market,” he said.'

## Convert articles to the format read by the TensorFlow model:

In [58]:
sconv.process_and_save_to_disk(articles[:10], "test.bin")
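
The layout that `sconv.process_and_save_to_disk` writes is not shown in this notebook. Purely as an illustration of what a binary file of examples can look like, the sketch below writes length-prefixed records (an 8-byte length followed by the payload) and reads them back. The record layout and the helper names `write_records` and `read_records` are assumptions, not the actual `story_converter` format.

```python
import io
import struct

def write_records(stream, payloads):
    # Assumed layout: 8-byte little-endian length, then that many payload bytes.
    for payload in payloads:
        stream.write(struct.pack('<q', len(payload)))
        stream.write(payload)

def read_records(stream):
    # Read records until the stream is exhausted.
    records = []
    while True:
        header = stream.read(8)
        if not header:
            break
        (length,) = struct.unpack('<q', header)
        records.append(stream.read(length))
    return records

# Round trip through an in-memory buffer instead of a real .bin file.
buffer = io.BytesIO()
texts = ["first article", "second article"]
write_records(buffer, [t.encode('utf-8') for t in texts])
buffer.seek(0)
decoded = [r.decode('utf-8') for r in read_records(buffer)]
print(decoded)
```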

## Run TensorFlow model in decoder mode:

In [59]:
summarizer_internal_pickle = "pickles/decoded_stories.pickle"
#sutils.run_summarization_model_decoder(summarizer_internal_pickle, "pretrained_model_tf1.2.1_original")
#sutils.run_summarization_model_decoder(summarizer_internal_pickle, "coverage_trained")
sutils.run_summarization_model_decoder(summarizer_internal_pickle, "more_coverage")

Starting TensorFlow Decoder...
INFO:tensorflow:Starting seq2seq_attention in asdfdsaf decode mode...
INFO:tensorflow:Current folder /Users/arturs/gpu-projects/Zentropy/components/summarizer

max_size of vocab was specified as 50000; we now have 50000 words. Stopping reading.
Finished constructing vocabulary of 50000 total words. Last word added: farina
INFO:tensorflow:Building graph...
example_generator completed reading all datafiles. No more data.
INFO:tensorflow:The example generator for this example queue filling thread has exhausted data.
INFO:tensorflow:single_pass mode is on, so we've finished reading dataset. This thread is stopping.
INFO:tensorflow:Adding attention_decoder timestep 0 of 1
INFO:tensorflow:Time to build graph: 0 seconds
INFO:tensorflow:Loading checkpoint experiments/more_coverage/train/model.ckpt-363378
INFO:tensorflow:Restoring parameters from experiments/more_coverage/train/model.ckpt-363378
INFO:tensorflow:Finished reading dataset in single_pass mode.


## Look at results:

In [60]:
with open(summarizer_internal_pickle, "rb") as f:
    summarization_output = pickle.load(f)

In [61]:
for s in summarization_output['summaries']:
    print(s+"\n\n")


net management fees fell by 9pc over the course of the year to $ 691m while performance fees dived by almost two-thirds to $ 112m . chief executive luke ellis , who took the reins in the middle of 2016 , said man group has `` made real progress in repositioning the firm for the future '' he said man group has `` made real progress in repositioning the firm for the future ''


british american tobacco took the plastic film off another industry consolidation effort . british american tobacco took the plastic film off another industry consolidation effort . the figures imply a total current value of $ 59.64 bn for the 57.8 % of reynolds not already owned by bat.it .


glencore plc rode a wave of surging commodity prices in 2016 to return to a profit of $ 1.4 billion . chief executive ivan glasenberg is known as one of the mining industry 's most voracious deal-makers . the company earlier said it plans to pay out $ 1 billion in dividends in 2017 .


intel will buy israel 's mobileye in a 

## The summaries are all lower case, so the named entity detector will complain! Restore the capitalization:

In [43]:
tokenized_summaries = sutils.try_fix_upper_case_for_summaries(articles[:10], summarization_output['summaries_tokens'])

detokenizer = MosesDetokenizer()

detokenized_summaries = []

for s in tokenized_summaries:
    s_detok = detokenizer.detokenize(s, return_str=True)
    detokenized_summaries.append(s_detok)
    print(s_detok+"\n\n")

Hedge fund giant Man Group stumbles to loss in choppy markets. Net management fees fell by 9pc over the year to $691m while performance fees dived by almost two-thirds to $112m. Chief executive Luke Ellis, who took the reins in the middle of 2016, said Man Group has ``made real progress ''


British American tobacco took the plastic film off another industry consolidation effort. Reynolds shareholders will receive $29.44 in cash and 0.5260 BAT ordinary shares. the BAT also represents a premium of 26% over the closing price of Reynolds common stock on 20 October 2016.


Glencore PLC rode a wave of surging commodity prices in 2016 to return to a profit of $1.4 billion. Chief executive Ivan Glasenberg now faces a tough decision: start splurging on new mergers or acquisitions, or return the rewards to shareholders in the form of dividends. Shares in the company were up more than 4% in early-afternoon trading in london.


Intel will buy Israel's Mobileye in a deal valued at about $15 billio
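
The implementation of `sutils.try_fix_upper_case_for_summaries` is not shown here. One simple way to get a similar effect, sketched below under that caveat, is to look each lower-cased summary token up in the source article and reuse the casing found there. The function name `restore_case` and the sample sentences are hypothetical.

```python
def restore_case(article_tokens, summary_tokens):
    # Map each lower-cased article token to the first cased form seen.
    cased = {}
    for tok in article_tokens:
        cased.setdefault(tok.lower(), tok)
    # Tokens that never occur in the article are left unchanged.
    return [cased.get(tok, tok) for tok in summary_tokens]

article = "Chief executive Luke Ellis said Man Group has made real progress .".split()
summary = "luke ellis said man group has made progress .".split()

restored = restore_case(article, summary)
print(' '.join(restored))
# Luke Ellis said Man Group has made progress .
```

A lookup like this only copies casing from the article, so a word that is capitalized there just because it starts a sentence keeps that capital everywhere, and a summary sentence that starts with a word the article only has in lower case stays lower case.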

## Much better! Next let's look at our baseline summaries:

In [85]:
# Baseline: use the first 3 sentences of each article as its summary
from nltk.tokenize import sent_tokenize

def get_3_sentence_summaries(articles):
    summaries = []
    for article in articles:
        sent_tokens = sent_tokenize(article)
        summaries.append(' '.join(sent_tokens[:3]))
    return summaries

summaries = get_3_sentence_summaries(articles)


In [87]:
sent_tokenize(articles[3])

["SANTA CLARA, Calif. (AP) -- Intel will buy Israel's Mobileye in a deal valued at about $15 billion, instantly propelling the computer chip and technology giant to the forefront of autonomous vehicle technology.",
 "The deal announced Monday combines Mobileye's market-leading software that processes information from cameras and other sensors with Intel's hardware, data centers and its own software, giving automakers a one-stop place to shop for fully autonomous car systems.",
 '"This acquisition essentially merges the intelligent eyes of the autonomous car with the intelligent brain that actually drives the car," Intel CEO Brian Krzanich wrote in a note to employees about the acquisition.',
 "The combination, expected to close by year's end, will allow the companies to bring components to market faster at a lower cost, solidifying Mobileye's leadership position, officials from the companies said.",
 'Automakers and some technology companies are testing autonomous vehicles in Californi

In [70]:
print(summaries[0])
print(articles[0])

Net management fees fell by 9pc over the course of the year to $691m while performance fees dived by almost two-thirds to $112m.As a result the firm’s shares dropped by 10pc in early trading, but recovered later in the day to a more modest 1.8pc loss.In part that was due to a rise in overall assets under management which increased by 3pc to $80.9bn.The company prefers to focus on adjusted pre-tax profits, which came in at $205m, down from 2015’s profit of $400m.Chief executive Luke Ellis, who took the reins in the middle of 2016, said Man Group has “made real progress in repositioning the firm for the future” and “continued to control our cost base”.“Looking forward to 2017, we have started the year with a good pipeline of interest from clients and encouraging performance across most of our strategies as the new global political environment has created many alpha opportunities, but it remains early days in an uncertain market,” he said.
Net management fees fell by 9pc over the course o

In [44]:
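# NOTE: story_data is not defined anywhere in this notebook, which is why this
# cell raises the NameError shown below. It must be created or loaded first.
print("Extractive summaries:\n")
for s1 in story_data['summaries_extractive']: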
    print(s1+"\n\n")

print("3 sentence summaries:\n")
for s2 in story_data['summaries_3sent']:
    print(s2+"\n\n")

Extractive summaries:



NameError: name 'story_data' is not defined

## Send data to named entity extractor:

In [105]:
summarizer_output_pickle = "pickles/summarizer_output.pickle"

summarizer_output = {
    'urls': story_data['urls'],
    'titles': story_data['titles'],
    'stories': story_data['stories'],
    'summaries_extractive': story_data['summaries_extractive'],
    'summaries_model': tokenized_summaries,
    'summaries_3sent': story_data['summaries_3sent']
}

with open(summarizer_output_pickle, "wb") as f:
    pickle.dump(summarizer_output, f)

## Load NER library:

In [106]:
import sys
sys.path.append("../ner")

import NERutils as ner


## Showtime:

In [112]:
all_orgs = []

for story in detokenized_summaries:
    storyCombined = story.replace('\n', ' ')

    print('RUNNING TOKENIZER')
    storyTokenized = tokenize.word_tokenize(storyCombined)

    print('SPLITTING SENTENCES LINE BY LINE')
    split = ner.sentenceSplitter(storyTokenized)

    with open('../ner/input.txt', 'w') as inputFile:
        ner.writeArticle(split, inputFile)

    print('RUNNING MODEL')
    os.system('python2.7 ../ner/tagger-master/tagger.py --model ../ner/tagger-master/models/english/ --input ../ner/input.txt --output ../ner/output.txt')

    with open('../ner/output.txt', 'r') as outputFile:
        namedStory = outputFile.read().replace('\n', ' ')

    print('NAMED ENTITIES:')
    orgs  = ner.findNamedEntities(namedStory.split(' '))
    all_orgs.append(orgs)
    print(orgs)


RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
['Philips']
RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
['Musk', 'Elon Musk']
RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
[]
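
The loop above calls the tagger through `os.system`, which silently ignores a failing exit status. As a hedged alternative, the sketch below builds the same command as an argument list and runs it with `subprocess.run(..., check=True)`, so a tagger failure raises an exception instead of leaving a stale `output.txt` behind. The helper names `build_tagger_command` and `run_tagger` are hypothetical; the paths and flags are the ones used in the cell above.

```python
import subprocess

def build_tagger_command(model_dir, input_path, output_path,
                         interpreter="python2.7",
                         script="../ner/tagger-master/tagger.py"):
    # Same command as the os.system call above, as a list of arguments.
    return [interpreter, script,
            "--model", model_dir,
            "--input", input_path,
            "--output", output_path]

def run_tagger(cmd):
    # check=True raises subprocess.CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)

cmd = build_tagger_command("../ner/tagger-master/models/english/",
                           "../ner/input.txt",
                           "../ner/output.txt")
print(' '.join(cmd))
```

Inside the loop, `run_tagger(cmd)` would take the place of the `os.system(...)` line.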


In [113]:
all_orgs

[['Philips'], ['Musk', 'Elon Musk'], []]

In [114]:
detokenized_summaries

["Lenders will take control of the Nashville-based company, which was founded in 1902. it made an ill-fated acquisition of Philips' consumer audio division four years ago for $135m. the firm makes its electric guitars in Nashville and Memphis, while its acoustic guitars are manufactured in Bozeman, Montana.",
 "analysts: Elon Musk got testy with analysts amid concerns over company's future. Tesla investors gave a rare rebuke to Musk after he cut off analysts asking about future profit potential. Tesla investors gave a rare rebuke to Musk after he cut off analysts asking about future profit potential.",
 'irreparable Aibo robotic dogs are marked in much the same way as that of humans. the demise of irreparable Aibo robotic dogs is marked in much the same way as that of humans. the firm stopped repairing malfunctioning Aibo in 2014, leaving owners whose pets were beyond repair unsure.']