In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

In [2]:
import spacy
nlp = spacy.load('en_core_web_md')

In [3]:
from gensim.summarization import summarize, keywords

Using TensorFlow backend.


## TL;DR

Gensim's summarizer is extractive: it only picks important sentences out of the source text,
apparently ranking them by score and returning the top ones as the result.
It is therefore not well suited to multi-document summarization, since there is no correlation or coherence
between the documents (tweets, Reddit posts, etc.).
I think abstractive summarization would run into the same problem.
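
To make the "rank sentences by score and return the top ones" behaviour concrete, here is a minimal sketch of an extractive summarizer. It is not gensim's implementation (gensim uses a TextRank variant). The word-frequency scoring and the name `naive_extractive_summary` are assumptions made purely for illustration.

```python
import re
from collections import Counter


def naive_extractive_summary(text, ratio=0.2):
    """Toy extractive summarizer: score sentences, keep the best, restore order.

    Not gensim's algorithm; word-frequency scoring is an assumption
    made for illustration only.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []

    # Count how often each lowercase word appears in the whole text.
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    n_keep = max(1, int(len(sentences) * ratio))
    # Rank sentence indices by score, then put the winners back in document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_keep])]
```

The output is always a subset of the input sentences, so feeding this kind of summarizer a bag of unrelated tweets can only give back some of those tweets.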

## Summarize Long Text

The summarizer only makes sense when summarizing a single document.

In [5]:
text = \
"""Photographic style transfer is a long-standing problem that seeks to transfer the style of a reference style photo onto another input picture.
For instance, by appropriately choosing the reference style photo, one can make the input picture look like it has been taken under a different illumination, time of day, or weather, or that it has been artistically retouched with a different intent.
So far, existing techniques are either limited in the diversity of scenes or transfers that they can handle or in the faithfulness of the stylistic match they achieve.
In this paper, we introduce a deep-learning approach to photographic style transfer that is at the same
time broad and faithful, i.e., it handles a large variety of
image content while accurately transferring the reference
style. Our approach builds upon the recent work on Neural
Style transfer by Gatys et al. [5]. However, as shown in
Figure 1, even when the input and reference style images
are photographs, the output still looks like a painting, e.g.,
straight edges become wiggly and regular textures wavy.
One of our contributions is to remove these painting-like effects by preventing spatial distortion and constraining the
transfer operation to happen only in color space. We achieve
this goal with a transformation model that is locally affine in
colorspace, which we express as a custom fully differentiable
energy term inspired by the Matting Laplacian [9]. We show
that this approach successfully suppresses distortion while
having a minimal impact on the transfer faithfulness. Our
other key contribution is a solution to the challenge posed
by the difference in content between the input and reference
images, which could result in undesirable transfers between
unrelated content. For example, consider an image with less
sky visible in the input image; a transfer that ignores the
difference in context between style and input may cause the
style of the sky to “spill over” the rest of the picture. We
show how to address this issue using semantic segmentation
[3] of the input and reference images. We demonstrate
the effectiveness of our approach with satisfying photorealistic
style transfers for a broad variety of scenarios including
transfer of the time of day, weather, season, and artistic edits.
From a practical perspective, our contribution is an effective
algorithm for photographic style transfer suitable
for many applications such as altering the time of day or
weather of a picture, or transferring artistic edits from a
photo to another. To achieve this result, we had to address
two fundamental challenges.
There is an inherent tension in
our objectives. On the one hand, we aim to achieve very
local drastic effects, e.g., to turn on the lights on individual
skyscraper windows (Fig. 1). On the other hand, these effects
should not distort edges and regular patterns, e.g., so that the
windows remain aligned on a grid. Formally, we seek a transformation
that can strongly affect image colors while having
no geometric effect, i.e., nothing moves or distorts. Reinhard
et al. [12] originally addressed this challenge with a global
color transform. However, by definition, such a transform
cannot model spatially varying effects and thus is limited
in its ability to match the desired style. More expressivity
requires spatially varying effects, further adding to the challenge
of preventing spatial distortion. A few techniques exist
for specific scenarios [8, 15] but the general case remains
unaddressed. Our work directly takes on this challenge and
provides a first solution to restricting the solution space to
photorealistic images, thereby touching on the fundamental
task of differentiating photos from paintings.
Semantic accuracy and transfer faithfulness. The complexity
of real-world scenes raises another challenge: the
transfer should respect the semantics of the scene. For instance,
in a cityscape, the appearance of buildings should be
matched to buildings, and sky to sky; it is not acceptable to make the sky look like a building. One plausible approach is
to match each input neural patch with the most similar patch
in the style image to minimize the chances of an inaccurate
transfer. This strategy is essentially the one employed by
the CNNMRF method [10]. While plausible, we find that it
often leads to results where many input patches get paired
with the same style patch, and/or that entire regions of the
style image are ignored, which generates outputs that poorly
match the desired style.
One solution to this problem is to transfer the complete
“style distribution” of the reference style photo as captured
by the Gram matrix of the neural responses [5]. This approach
successfully prevents any region from being ignored.
However, there may be some scene elements more (or less)
represented in the input than in the reference image. In such
cases, the style of the large elements in the reference style
image “spills over” into mismatching elements of the input
image, generating artifacts like building texture in the sky. A
contribution of our work is to incorporate a semantic labeling
of the input and style images into the transfer procedure so
that the transfer happens between semantically equivalent
subregions and within each of them, the mapping is close
to uniform. As we shall see, this algorithm preserves the
richness of the desired style and prevents spillovers. These
issues are demonstrated in Figure 2.
"""

In [6]:
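import re
# Collapse hard line breaks and runs of whitespace left over from the copy-paste.
text = re.sub(r'\s+', ' ', text).strip()
# Re-segment with spaCy and rejoin, so every sentence is separated by a single space.
text = ' '.join(sent.text for sent in nlp(text).sents)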

In [8]:
import wikipedia

In [10]:
summarize(wikipedia.summary("dog"), ratio=0.2, split=True)

['The domestic dog (Canis lupus familiaris or Canis familiaris) is a member of genus Canis (canines) that forms part of the wolf-like canids, and is the most widely abundant carnivore.']

In [29]:
keywords(text, ratio=0.05, split=True, scores=True, lemmatize=True)

[('transferring', array([ 0.34156552])),
 ('style', array([ 0.28404195])),
 ('image', array([ 0.23064807])),
 ('semantics', array([ 0.16740352])),
 ('effective', array([ 0.14613938])),
 ('spatially', array([ 0.14069739])),
 ('like', array([ 0.13472327]))]

## Summarize tweets in the same cluster

In [1]:
import sys
sys.path.append('../data_helpers/')
sys.path.append('../cluster/')

from twitter_data_helper import TwitterDataHelper
from cluster import Cluster

data_helper = TwitterDataHelper()
df = data_helper.get_data(['2017-07-17', '2017-07-18', '2017-07-19'])

cluster = Cluster()

df = cluster.cluster(df)

* Loading SpaCy "en_core_web_md" corpus...
* Success.
--------------------------------------------------------------------------------------------------------------------
* Parsing texts...
* Cleaning texts...
--------------------------------------------------------------------------------------------------------------------
* Transforming texts to feature vectors...
* Clustering...
* Done.
--------------------------------------------------------------------------------------------------------------------


In [2]:
len(df)

650

In [3]:
cluster.best_cluster_model.n_clusters

3

In [53]:
# Join every tweet in cluster 2 into one newline-separated "document".
text = '\n'.join(tweet for tweet in df[df['cluster'] == 2].text)

In [54]:
""" Apparently, it just ranks the important tweets and returns the top ones. """

summarize(text, ratio=0.1, split=True)

['What did we learn and what happens next from the Securing Agile Delivery track at CyberUK?',
 "Find out why #machinelearning and #AI are the future of #cybersecurity at Andrew Gardner's #BHUSA talk next week: https://t.co/cCxsWwJSOT",
 'Naked Security | Wait, you didn’t want to clean the toilets?',
 'Help shape the future of #AI &amp; virtual assistants with a #Cortana research internship… https://t.co/B3MVxQGhMQ',
 'I‘m a city boy…So, I was surprised to learn cowboys are using smart watches and wearables to track their cows:… https://t.co/oThyp7MPBL',
 "I opened a VM saved in April (so shouldn't full updated), disable the network, check Office update, guess what I go… https://t.co/PtrojBdmfX",
 "Hmm, while looking at @scriptjunkie1's Office update issue https://t.co/5fmWl7baFe, I think I found another shitty one..",
 "I should clarify: Tesla stock is obviously high based on past &amp; present, but low if you believe in Tesla's future.… https://t.co/XzG3z7jo0A",
 'When I do cyber sec

In [34]:
keywords(text, ratio=0.005, split=True, scores=True, lemmatize=True)

[('https', array([ 0.82361567])),
 ('security', array([ 0.16643479])),
 ('new', array([ 0.12400706])),
 ('network', array([ 0.08043786])),
 ('registered', array([ 0.07908443]))]
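
The top keyword above is `https`, which comes from the `t.co` links rather than from the tweet content. One possible fix is to clean each tweet before joining them. The sketch below is an assumption about what such cleaning could look like, not part of the original pipeline, and `clean_tweet` is a hypothetical helper name.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')


def clean_tweet(tweet):
    """Strip URLs and @mentions, decode HTML entities, keep hashtag words."""
    tweet = html.unescape(tweet)       # '&amp;' -> '&'
    tweet = URL_RE.sub('', tweet)      # drop links such as https://t.co/...
    tweet = MENTION_RE.sub('', tweet)  # drop @user handles
    tweet = tweet.replace('#', '')     # keep the hashtag word, drop the marker
    return re.sub(r'\s+', ' ', tweet).strip()
```

The cleaned tweets could then be joined the same way as before, for example `'\n'.join(clean_tweet(t) for t in df[df['cluster'] == 2].text)`, before calling `summarize` or `keywords`.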

In [42]:
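""" Ignore this cell; it only inspects rows of interest while debugging. """
# Collect the positions of tweets containing the token 'Clustering'. The loop
# variable is named `tweet` so that it does not overwrite `text` from above.
debug_idxs = [i for i, tweet in enumerate(df.text) if 'Clustering' in tweet.split()]

# enumerate() yields positions, so select by position rather than by label.
df.iloc[debug_idxs]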

## Let's try to summarize text extracted from a URL

### Test source: [Your tl;dr by an ai: a deep reinforced model for abstractive summarization](https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization)

In [55]:
from newspaper import Article

In [114]:
url = 'https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization'
article = Article(url)

In [115]:
article.build()

In [116]:
article.title

'Salesforce Research'

In [117]:
article.text

'Your tl;dr by an ai: a deep reinforced model for abstractive summarization\n\nThe last few decades have witnessed a fundamental change in the challenge of taking in new information. The bottleneck is no longer access to information; now it’s our ability to keep up. We all have to read more and more to keep up-to-date with our jobs, the news, and social media. We’ve looked at how AI can improve people’s work by helping with this information deluge and one potential answer is to have algorithms automatically summarize longer texts.\n\nTraining a model that can generate long, coherent, and meaningful summaries remains an open research problem. In fact, generating any kind of longer text is hard for even the most advanced deep learning algorithms. In order to make summarization successful, we introduce two separate improvements: a more contextual word generation model and a new way of training summarization models via reinforcement learning (RL).\n\nThe combination of the two training met

### Summarize using Newspaper3k

In [118]:
""" The newspaper3k's summary looks bad. """
article.summary.split('\n')

['Extractive vs. Abstractive SummarizationAutomatic summarization models can work in one of two ways: by extraction or by abstraction.',
 'In this work, we tackle these issues and design a more robust and coherent abstractive summarization model.',
 'Combined in this way, the joint model is able to read any text and generate a different text from it.',
 'This framework is called an encoder-decoder RNN (or Seq2Seq) and is the basis of our summarization model.',
 'When such a metric is used with our reinforced summarization model summaries may improve even further.']

In [119]:
article.keywords

['summarization',
 'summary',
 'salesforce',
 'model',
 'summaries',
 'input',
 'hidden',
 'research',
 'blair',
 'different',
 'text',
 'models']

### Summarize using Gensim

In [123]:
"""
Try the gensim summarizer. It is much better than `newspaper3k`,
since the results match the ground truth.
"""
summarize(article.text, ratio=0.1, split=True)

['In order to make summarization successful, we introduce two separate improvements: a more contextual word generation model and a new way of training summarization models via reinforcement learning (RL).',
 'The combination of the two training methods enables the system to create relevant and highly readable multi-sentence summaries of long text, such as news articles, significantly improving on previous results.',
 'Recurrent neural networks (RNNs) are deep learning models that can process sequences (e.g. text) of variable length and compute useful representations (or hidden state) for each phrase.',
 'At each step, the RNN hidden state is used to generate a new word that is added to the final output text and fed in as the next input.',
 'To make our model outputs more coherent, we allow the decoder to look back at parts of the input document when generating a new word with a technique called temporal attention.',
 'This attention is then modulated to ensure that the model uses diffe
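
Judging which summary is "better" by eye is subjective. If a reference summary is available, a rough word-overlap score can back up the comparison. The sketch below is a simplified ROUGE-1-style measure on unique words. It is not the official ROUGE implementation, and `unigram_overlap` is a hypothetical helper name.

```python
import re


def unigram_overlap(candidate, reference):
    """Return (recall, precision) of candidate words against reference words.

    A rough ROUGE-1-style measure on unique lowercase words; an
    illustration only, not the official ROUGE implementation.
    """
    cand = set(re.findall(r'[a-z0-9]+', candidate.lower()))
    ref = set(re.findall(r'[a-z0-9]+', reference.lower()))
    if not cand or not ref:
        return 0.0, 0.0
    shared = cand & ref
    # Recall: share of reference words covered; precision: share of candidate words that hit.
    return len(shared) / len(ref), len(shared) / len(cand)
```

Joining each list of summary sentences with `' '.join(...)` and scoring both against the same reference would give comparable numbers for `newspaper3k` and gensim.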

In [121]:
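""" article.keywords looks better than gensim's keywords() here.
    Note: newspaper3k computes article.keywords itself during its NLP step;
    keywords supplied by the page author, if any, are in article.meta_keywords.
"""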
keywords(article.text, ratio=0.05, split=True, scores=True, lemmatize=True)

[('summary', array([ 0.27793276])),
 ('models', array([ 0.22582236])),
 ('different', array([ 0.17744981])),
 ('generated', array([ 0.15655664])),
 ('new', array([ 0.15101865])),
 ('word', array([ 0.13918288])),
 ('rnns', array([ 0.13767702])),
 ('abstractive summarization', array([ 0.1325364])),
 ('training', array([ 0.12755642])),
 ('balls', array([ 0.11759542])),
 ('google', array([ 0.11356443])),
 ('rouge', array([ 0.1115001])),
 ('decoding', array([ 0.11005754])),
 ('evaluation', array([ 0.10528888])),
 ('improve', array([ 0.10460957])),
 ('attention', array([ 0.10140333])),
 ('outputs', array([ 0.10027907])),
 ('sequence', array([ 0.09976675])),
 ('insured', array([ 0.09962752])),
 ('information', array([ 0.09446021]))]