# Abstractive summarization

Abstractive summarization is when the algorithm does not simply select sentences from the original text but synthesizes a new summary, which may include words or phrases that do not appear in the original.

There are several GitHub projects that implement abstractive summarizers using RNNs, so I'll try to use them here.

GitHub abstractive summarizers:
* [PyTorch implementation](https://github.com/Iwontbecreative/Abstractive-summarization-OpenNMT)
* [Another PyTorch implementation, with better documentation](https://github.com/alesee/abstractive-text-summarization)
* [TensorFlow implementation (summarizes Amazon food reviews)](https://github.com/JRC1995/Abstractive-Summarization)


We'll start by trying to run the first PyTorch implementation, by 'Iwontbecreative', which is built on top of OpenNMT.

It expects four text files: source and target training files, plus source and target validation files. In this case, the source will be the full CNN story, and the target will be the highlights of that story. I therefore need to process 100 of these CNN stories and then split them 80/20 into training and validation sets.

In [48]:
import re

In [4]:
%%bash
ls ../data/Abstractive-summarization-OpenNMT

CNN_100_stories.txt
CONTRIBUTORS.md
Dockerfile
LICENSE.md
README.md
data
docs
mkdocs.yml
onmt
opts.py
preprocess.py
requirements.txt
setup.py
test
tools
train.py
train_nb.ipynb
translate.py


In [7]:
## We'll open the CNN_100_stories.txt file, put it into an array, and then use that to process the CNN stories
## into source and target.

with open("../data/Abstractive-summarization-OpenNMT/CNN_100_stories.txt", "r") as f:
    CNN_100_stories = f.read()

In [10]:
CNN_100_stories_filenames = CNN_100_stories.strip().split("\n")

In [17]:
%%bash
ls ../../../resources

1509.00685.pdf
1609.07034.pdf
1706.06681.pdf
1708.00625.pdf
1711.09357.pdf
1807.00199.pdf
2744372.pdf
80_2_6_0.pdf
Feature-Based-Time-Series-Analysis_Rob-Hyndman.pdf
GoogleNews-vectors-negative300.bin
Hands-On-Natural-Language-Processing-with-Python
Simons-Institute-Manning-2017.pdf
W17-2307.pdf
cnn
eaao5580.full.pdf
ed3book.pdf
glove.6B
glove.6B.zip


In [23]:
PATH_TO_CNN_STORIES = "../../../resources/cnn/stories/"

In [41]:
src_train = ""
tgt_train = ""
src_val = ""
tgt_val = ""

In [71]:
src_lines = []
tgt_lines = []

In [77]:
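for filename in CNN_100_stories_filenames:
    # Read each story as UTF-8 text.
    with open(PATH_TO_CNN_STORIES + filename, "r", encoding="utf-8") as f:
        text = f.read()

    # Everything before the first "@highlight" marker is the article body.
    temp = text.split("@highlight")
    # Collapse whitespace so that each article occupies exactly one line.
    article = re.sub(r'\s+', ' ', temp[0].strip())
    src_lines.append(article)

    # Each remaining chunk is one highlight; join them into a single target line.
    highlights = [re.sub(r'\n', '', highlight) for highlight in temp[1:]]
    highlights = ". ".join(highlights)
    tgt_lines.append(highlights)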
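
To check the parsing on its own, the loop above can be pulled into a small function and run on a toy story. This is only a sketch: `split_story` is a hypothetical helper name, the example text is made up, and unlike the cell above it also collapses whitespace inside each highlight and skips empty ones.

```python
import re


def split_story(text):
    """Split one raw story into (article, summary), each on a single line."""
    # Everything before the first "@highlight" marker is the article body.
    parts = text.split("@highlight")
    # Collapse all whitespace runs so the article occupies one line.
    article = re.sub(r"\s+", " ", parts[0]).strip()
    # Each remaining part is one highlight; drop empty ones and join with ". ".
    highlights = [re.sub(r"\s+", " ", part).strip() for part in parts[1:]]
    summary = ". ".join(h for h in highlights if h)
    return article, summary


# Made-up text in the same layout as the CNN story files (illustration only).
example = (
    "First sentence.\n\nSecond sentence.\n\n"
    "@highlight\n\nPoint one\n\n"
    "@highlight\n\nPoint two\n"
)
article, summary = split_story(example)
print(article)  # First sentence. Second sentence.
print(summary)  # Point one. Point two
```

Keeping each article and each summary on exactly one line matters here, because the training files written later pair source and target by line.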

Let's split the source and target lines into training and validation sets.

In [78]:
src_train_lines = src_lines[0:80]
tgt_train_lines = tgt_lines[0:80]

src_val_lines = src_lines[80:]
tgt_val_lines = tgt_lines[80:]
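
The hard-coded index of 80 works because there are exactly 100 stories. A slightly more general sketch, assuming the two lists are parallel, is shown here; `train_val_split` and its `train_fraction` parameter are hypothetical names, not part of the notebook or of OpenNMT.

```python
def train_val_split(src, tgt, train_fraction=0.8):
    """Split two parallel lists at the same index, keeping pairs aligned."""
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same length")
    # Index of the first validation example.
    cut = int(len(src) * train_fraction)
    return (src[:cut], tgt[:cut]), (src[cut:], tgt[cut:])


# Toy stand-ins for src_lines and tgt_lines.
toy_src = ["article %d" % i for i in range(10)]
toy_tgt = ["summary %d" % i for i in range(10)]
(train_src, train_tgt), (val_src, val_tgt) = train_val_split(toy_src, toy_tgt)
print(len(train_src), len(val_src))  # 8 2
```

Like the slices above, this keeps the original order and does not shuffle, so the validation set is simply the last 20% of the stories.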

Now, let's join the lines together for each list, and then write these to text files.

In [82]:
src_train_CNN = "\n".join(src_train_lines)
tgt_train_CNN = "\n".join(tgt_train_lines)

src_val_CNN = "\n".join(src_val_lines)
tgt_val_CNN = "\n".join(tgt_val_lines)

In [64]:
%%bash
ls ../data/Abstractive-summarization-OpenNMT/data

README.md
morph
src-test.txt
src-train.txt
src-val.txt
test_model2.src
test_model2.tgt
tgt-train.txt
tgt-val.txt


In [83]:
with open("../data/Abstractive-summarization-OpenNMT/data/src-train-CNN.txt", "w") as f:
    f.write(src_train_CNN)

In [84]:
with open("../data/Abstractive-summarization-OpenNMT/data/tgt-train-CNN.txt", "w") as f:
    f.write(tgt_train_CNN)

In [85]:
with open("../data/Abstractive-summarization-OpenNMT/data/src-val-CNN.txt", "w") as f:
    f.write(src_val_CNN)

In [86]:
with open("../data/Abstractive-summarization-OpenNMT/data/tgt-val-CNN.txt", "w") as f:
    f.write(tgt_val_CNN)
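
Since line i of a source file is paired with line i of the matching target file, it is worth checking that the files stay aligned. The following is a minimal sketch, assuming one example per line; `write_lines` and `count_lines` are hypothetical helpers, and the demo writes toy data to a temporary directory rather than to the real data folder.

```python
import os
import tempfile


def write_lines(path, lines):
    """Write one example per line and return the number of examples written."""
    for line in lines:
        # A stray newline inside an example would shift every later pair.
        if "\n" in line:
            raise ValueError("an example contains a newline")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Toy data only; the file names here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src-demo.txt")
    tgt_path = os.path.join(tmp, "tgt-demo.txt")
    write_lines(src_path, ["article one", "article two"])
    write_lines(tgt_path, ["summary one", "summary two"])
    aligned = count_lines(src_path) == count_lines(tgt_path)

print(aligned)  # True
```

The same `count_lines` check could be run on each pair of the four CNN files written above before starting preprocessing.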

Okay, so overall I think that I have three viable options:

1. Instead of the OpenNMT implementation in Iwontbecreative's GitHub project, I think I had better clone OpenNMT directly and follow their [example](http://opennmt.net/OpenNMT-py/Summarization.html) of how to do abstractive summarization with the most up-to-date version of OpenNMT.
2. I can try the model implemented in 'Hands-On Natural Language Processing with Python', which uses TensorFlow, and see how that works. I would still want to see whether I could use the biomedical word2vec vectors that I downloaded from bio.nlplab.org, and whether that makes any difference.
    * First, I could check whether I can feed in an article's text and get a summary out at all without spending copious amounts of time on it.
    * Once I can get it to take an article, I can see whether I can substitute word2vec vectors for the GloVe vectors.
3. The third model I can try is alesee's project [here](https://github.com/alesee/abstractive-text-summarization) on GitHub. Honestly, the main reason I think this one might work is that he updated it within the last month, and his repository also includes Jupyter notebooks, which presumably actually run!