In [30]:
%load_ext autoreload
%autoreload 2

import summarizer_utils as sutils
import story_converter as sconv
import pickle
import nltk.tokenize as tokenize
import os
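# Note: nltk.tokenize.moses was removed in NLTK 3.3. On newer versions the
# same class is available via `from sacremoses import MosesDetokenizer`.
from nltk.tokenize.moses import MosesDetokenizer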

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Get some articles to summarize!

In [3]:
articles = [
    "https://yle.fi/uutiset/osasto/news/finnish_capital_planning_rules_for_airbnb_market/10139225",
    "http://metropolitan.fi/entry/finnish-amer-sports-acquires-peak-performance-athletic-apparel-brand",
    "http://metropolitan.fi/entry/koskenkorva-factory-finland-12-hour-shift"
]

print("Downloading articles...")
story_data = sutils.fetch_and_pickle_stories(articles, True)
print("Downloading articles DONE")

Downloading articles...
Downloading articles DONE


## Convert articles to the format read by the TensorFlow model:

In [4]:
sconv.process_and_save_to_disk(story_data['stories'], "test.bin")

## Run TensorFlow model in decoder mode:

In [5]:
summarizer_internal_pickle = "pickles/decoded_stories.pickle"
sutils.run_summarization_model_decoder(summarizer_internal_pickle, "more_coverage")

Starting TensorFlow Decoder...
INFO:tensorflow:Starting seq2seq_attention in asdfdsaf decode mode...
INFO:tensorflow:Current folder /Users/arturs/gpu-projects/Zentropy/components/summarizer

max_size of vocab was specified as 50000; we now have 50000 words. Stopping reading.
Finished constructing vocabulary of 50000 total words. Last word added: farina
INFO:tensorflow:Building graph...
example_generator completed reading all datafiles. No more data.
INFO:tensorflow:The example generator for this example queue filling thread has exhausted data.
INFO:tensorflow:single_pass mode is on, so we've finished reading dataset. This thread is stopping.
INFO:tensorflow:Adding attention_decoder timestep 0 of 1
INFO:tensorflow:Time to build graph: 0 seconds
INFO:tensorflow:Loading checkpoint experiments/more_coverage/train/model.ckpt-363378
INFO:tensorflow:Restoring parameters from experiments/more_coverage/train/model.ckpt-363378
INFO:tensorflow:Finished reading dataset in single_pass mode.


## Look at results:

In [13]:
# Use a context manager so the pickle file is closed after loading.
with open(summarizer_internal_pickle, "rb") as f:
    summarization_output = pickle.load(f)

In [17]:
for s in summarization_output['summaries']:
    print(s+"\n\n")


researcher pekka mustonen from helsinki city administration says there were 2,300 airbnb-accommodations available in helsinki last year . anna varakas has been renting her property in ullanlinna through airbnb for two years . anna varakas has been renting her property in ullanlinna through airbnb for two years .


finnish amer sports acquires peak performance athletic apparel brand peak performance for 255 million euros . the company known for it 's premium sports gear was sold by danish fashion company ic group . the total price is 255 million euros , and it will be financed with cash assets as well as bank financing .


altia 's koskenkorva factory in south pohjanmaa uses abnormally long twelve hour shifts . the facility produces grain alcohol , starch , livestock feed and carbon dioxide . the facility produces grain alcohol , starch , livestock feed and carbon dioxide . the facility produces grain alcohol , starch , livestock feed and carbon dioxide .
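
The model output above repeats whole sentences. A minimal post-processing sketch that drops repeats is shown here. It is not part of `summarizer_utils`; the function name `drop_repeated_sentences` is hypothetical, and it assumes the tokenized format printed above, where every sentence ends with a space-separated period.

```python
def drop_repeated_sentences(summary):
    """Remove sentences that already appeared earlier in the same summary.

    Assumption: sentences are separated by " . " and the summary ends
    with " .", as in the tokenized model output.
    """
    seen = set()
    kept = []
    for sentence in summary.split(" . "):
        key = sentence.strip()
        # The last piece still carries its trailing period; strip it.
        if key == "." or key.endswith(" ."):
            key = key[:-1].strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(key)
    # Re-join in the same tokenized style, or return "" for empty input.
    return " . ".join(kept) + " ." if kept else ""


if __name__ == "__main__":
    demo = "the plant makes starch . the plant makes starch . it runs long shifts ."
    print(drop_repeated_sentences(demo))
```

This only catches exact repeats; near-duplicate sentences would need a similarity measure instead of a set lookup.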




## The summaries are all lower case, which the named entity detector will complain about. Restore capitalization:

In [43]:
tokenized_summaries = sutils.try_fix_upper_case_for_summaries(story_data['stories'], summarization_output['summaries_tokens'])

detokenizer = MosesDetokenizer()

detokenized_summaries = []

for s in tokenized_summaries:
    s_detok = detokenizer.detokenize(s, return_str=True)
    detokenized_summaries.append(s_detok)
    print(s_detok+"\n\n")

Researcher Pekka Mustonen from Helsinki city administration says there were 2,300 Airbnb-accommodations available in Helsinki last year. Anna Varakas has been renting her property in Ullanlinna through Airbnb for two years. Anna Varakas has been renting her property in Ullanlinna through Airbnb for two years.


Finnish Amer sports acquires Peak Performance athletic apparel brand Peak Performance for 255 Million euros. the company known for it's premium sports gear was sold by Danish fashion company IC Group. the total price is 255 Million euros, and it will be financed with cash assets as well as bank financing.


Altia's Koskenkorva factory in South Pohjanmaa uses abnormally long twelve hour shifts. the facility produces grain alcohol, starch, livestock feed and carbon dioxide. the facility produces grain alcohol, starch, livestock feed and carbon dioxide. the facility produces grain alcohol, starch, livestock feed and carbon dioxide.
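
One way such a capitalization fix could work is to look up each lower-cased summary token in the source story and reuse the casing found there. The sketch below is an illustration under that assumption, not the actual body of `sutils.try_fix_upper_case_for_summaries`; `restore_case` is a hypothetical name. It would explain why sentence-initial words such as "the" stay lower case in the output above, but that output does not confirm the method.

```python
from collections import Counter, defaultdict


def restore_case(story_tokens, summary_tokens):
    """Re-case summary tokens using the casing observed in the story."""
    # For every lower-cased word, count each surface form seen in the story.
    forms = defaultdict(Counter)
    for tok in story_tokens:
        forms[tok.lower()][tok] += 1

    restored = []
    for tok in summary_tokens:
        key = tok.lower()
        if key in forms:
            # Use the most frequent casing of this word in the story.
            restored.append(forms[key].most_common(1)[0][0])
        else:
            # Token never occurs in the story: leave it unchanged.
            restored.append(tok)
    return restored


if __name__ == "__main__":
    story = "Amer Sports bought Peak Performance . Peak Performance makes gear .".split()
    summary = "amer sports bought peak performance .".split()
    print(restore_case(story, summary))
```

A lookup like this does not capitalize the first word of a sentence; that would need a separate pass.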




## Much better! Next let's look at our baseline summaries:

In [28]:
print("Extractive summaries:\n")
for s1 in story_data['summaries_extractive']:
    print(s1+"\n\n")

print("3 sentence summaries:\n")
for s2 in story_data['summaries_3sent']:
    print(s2+"\n\n")

Extractive summaries:

Researcher Pekka Mustonen from Helsinki city administration says that statistics compiled by AirDNA show that there were 2,300 Airbnb-accommodations available in Helsinki last year.
Regulation of Airbnb plannedThe figures of Helsinki's Airbnb beds became as a surprise to Mustonen and Laura Aalto, the chief executive at Helsinki Marketing.
Nonetheless the city of Helsinki plans to introduce rules for residents who want to rent their property through Airbnb.
According to Lappi, Airbnb used to be part of the sharing economy, but it has now turned into a professional business.
Airbnb is changing the market in a way that hotels do not like, she says.


Finnish Amer Sports acquires Peak Performance athletic apparel brandFinnish company acquires athletics apparel brand Peak Performance for 255 Million euros.
Finnish sports equipment and apparel company Amer Sports has acquired all the shares in Peak Performance.
The total price is 255 Million euros, and it will be finan
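
For reference, a "first three sentences" baseline can be sketched in a few lines. This assumes `summaries_3sent` means the leading three sentences of each story, which the notebook does not state; `lead_sentences` is a hypothetical helper, and the regex split is a dependency-free stand-in for a real sentence tokenizer such as `nltk.tokenize.sent_tokenize`.

```python
import re


def lead_sentences(text, n=3):
    """Return the first n sentences of text as a naive baseline summary."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


if __name__ == "__main__":
    article = "First point. Second point! Third point? Fourth point."
    print(lead_sentences(article))
```

The regex will split incorrectly on abbreviations such as "e.g. this", so it is only suitable as a rough baseline.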

## Send data to named entity extractor:

In [29]:
summarizer_output_pickle = "pickles/summarizer_output.pickle"

summarizer_output = {
    'urls': story_data['urls'],
    'titles': story_data['titles'],
    'stories': story_data['stories'],
    'summaries_extractive': story_data['summaries_extractive'],
    'summaries_model': tokenized_summaries,
    'summaries_3sent': story_data['summaries_3sent']
}

# Use a context manager so the file is flushed and closed after writing.
with open(summarizer_output_pickle, "wb") as f:
    pickle.dump(summarizer_output, f)

## Load NER library:

In [36]:
import sys
sys.path.append("../ner")

import NERutils as ner


## Showtime:

In [41]:
import subprocess

all_orgs = []

for story in summarizer_output['stories']:
    storyCombined = story.replace('\n', ' ')

    print('RUNNING TOKENIZER')
    storyTokenized = tokenize.word_tokenize(storyCombined)

    print('SPLITTING SENTENCES LINE BY LINE')
    split = ner.sentenceSplitter(storyTokenized)

    # Write the sentence-split article to the file passed to the tagger as --input.
    with open('../ner/input.txt', 'w') as inputFile:
        ner.writeArticle(split, inputFile)

    print('RUNNING MODEL')
    # The tagger runs under python2.7 in a separate process.
    # check=True raises an error if it exits with a non-zero status.
    subprocess.run(['python2.7', '../ner/tagger-master/tagger.py',
                    '--model', '../ner/tagger-master/models/english/',
                    '--input', '../ner/input.txt',
                    '--output', '../ner/output.txt'], check=True)

    # Read the tagged text back as a single line.
    with open('../ner/output.txt', 'r') as outputFile:
        namedStory = outputFile.read().replace('\n', ' ')

    print('NAMED ENTITIES:')
    orgs = ner.findNamedEntities(namedStory.split(' '))
    all_orgs.append(orgs)
    print(orgs)


RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
['AirDNA', 'Helsinki Marketing', 'Airbnb', 'Finnish Hospitality Association', 'Airbnb guests. ” Hotels']
RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
['Marlboro', 'IC Group', 'Finnish Amer Sports', 'Peak Performance', 'Peak Perofmrance', 'Amer Sport', 'Amer', 'Amer Sports', 'Salomon']
RUNNING TOKENIZER
SPLITTING SENTENCES LINE BY LINE
RUNNING MODEL
NAMED ENTITIES:
['AM', 'Altia']


In [42]:
all_orgs

[['AirDNA',
  'Helsinki Marketing',
  'Airbnb',
  'Finnish Hospitality Association',
  'Airbnb guests. ” Hotels'],
 ['Marlboro',
  'IC Group',
  'Finnish Amer Sports',
  'Peak Performance',
  'Peak Perofmrance',
  'Amer Sport',
  'Amer',
  'Amer Sports',
  'Salomon'],
 ['AM', 'Altia']]
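
The entity lists above contain near-duplicates such as 'Amer Sport' and 'Amer Sports'. A rough way to group them, using only the standard library, is sketched here. `group_similar_names` and the 0.85 threshold are illustrative assumptions, not part of `NERutils`.

```python
from difflib import SequenceMatcher


def group_similar_names(names, threshold=0.85):
    """Greedily group names whose similarity to a group's first member is high."""
    groups = []
    for name in names:
        for group in groups:
            # Compare case-insensitively against the group's first name.
            ratio = SequenceMatcher(None, name.lower(), group[0].lower()).ratio()
            if ratio >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([name])
    return groups


if __name__ == "__main__":
    print(group_similar_names(["Amer Sports", "Amer Sport", "Salomon"]))
```

The grouping is order-dependent and purely string-based, so the threshold needs tuning on real output before the groups are trusted.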

In [44]:
detokenized_summaries

['Researcher Pekka Mustonen from Helsinki city administration says there were 2,300 Airbnb-accommodations available in Helsinki last year. Anna Varakas has been renting her property in Ullanlinna through Airbnb for two years. Anna Varakas has been renting her property in Ullanlinna through Airbnb for two years.',
 "Finnish Amer sports acquires Peak Performance athletic apparel brand Peak Performance for 255 Million euros. the company known for it's premium sports gear was sold by Danish fashion company IC Group. the total price is 255 Million euros, and it will be financed with cash assets as well as bank financing.",
 "Altia's Koskenkorva factory in South Pohjanmaa uses abnormally long twelve hour shifts. the facility produces grain alcohol, starch, livestock feed and carbon dioxide. the facility produces grain alcohol, starch, livestock feed and carbon dioxide. the facility produces grain alcohol, starch, livestock feed and carbon dioxide."]