# Appendix A

## ZHAW, CAS Machine Learning, Text Analytics
## Text Summarization with gensim
**Giovanni López, June 2018**

*Notebook available at https://github.com/pandastrail/MAIN/blob/master/Appendix_A.ipynb*

We are going to solve a simple text summarization problem with gensim, a Python library for topic modelling, document indexing and similarity retrieval.

The gensim summarizer is based on the TextRank algorithm combined with the BM25 ranking function. It only works for English text; stopwords are removed and words are stemmed.
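
To give a feel for how a TextRank-style ranking works, here is a minimal sketch in plain Python. It is not gensim's implementation: it uses the simple word-overlap similarity instead of BM25, skips stopword removal and stemming, and the function names (`tokenize`, `overlap_similarity`, `textrank`) are hypothetical. Sentences that share many words with other sentences end up with higher scores.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def overlap_similarity(a, b):
    """Word-overlap similarity between two token sets (not BM25)."""
    # Guard against log(1) + log(1) = 0 in the denominator.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration on a similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, symmetric graph: edge weight = similarity between two sentences.
    weights = [
        [overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences without any edge contribute nothing (a simplification).
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock markets fell sharply today.",
]
scores = textrank(sentences)
# Sentences ordered from most to least central.
ranked = sorted(zip(scores, sentences), reverse=True)
```

An extractive summary would then keep the top-ranked sentences in their original order.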

With gensim already installed, we import the *summarize* and *keywords* functions from its *summarization* module (the module was removed in gensim 4.0, so a 3.x release is needed):

In [11]:
from gensim.summarization import summarize
from gensim.summarization import keywords
import requests
# requests is needed later to download a larger text from a URL

Now, let's copy a few lines from a random [article](https://www.nytimes.com/2018/06/26/business/trump-harley-davidson-tariffs.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news) from the New York Times, something like the following:

In [3]:
text = '''President Trump lashed out at one of his favorite American manufacturers on Tuesday, 
criticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.
In a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe. 
He told Republican lawmakers at a White House meeting that the move amounted to a betrayal, saying, “I’ve been very good to Harley-Davidson.”
“If they move, watch, it will be the beginning of the end — they surrendered, they quit!” the president wrote on Twitter. 
“The Aura will be gone and they will be taxed like never before!”
A day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s 
new 31 percent tariff on its imported bikes, which was imposed in retaliation for Mr. Trump’s steel and aluminum tariffs.
'''

Let's explore the *summarize* function imported from gensim:

In [4]:
help(summarize)

Help on function summarize in module gensim.summarization.summarizer:

summarize(text, ratio=0.2, word_count=None, split=False)
    Get a summarized version of the given text.
    
    The output summary will consist of the most representative sentences
    and will be returned as a string, divided by newlines.
    
    Note
    ----
    The input should be a string, and must be longer than :const:`~gensim.summarization.summarizer.INPUT_MIN_LENGTH`
    sentences for the summary to make sense.
    The text will be split into sentences using the split_sentences method in the :mod:`gensim.summarization.texcleaner`
    module. Note that newlines divide sentences.
    
    
    Parameters
    ----------
    text : str
        Given text.
    ratio : float, optional
        Number between 0 and 1 that determines the proportion of the number of
        sentences of the original text to be chosen for the summary.
    word_count : int or None, optional
        Determines how many words will the

So, let's run a summarization leaving all the optional arguments at their default values, that is, passing only the string:

In [5]:
summarize(text)

'In a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.'

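Note that the output mentions "Trump" only in passing ("Mr. Trump’s trade war") and does not name Harley-Davidson at all, although both should be expected in a summary of this text regardless of the output length.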

The default ratio of 0.2 means that about 20% of the sentences of the original text will be returned. Let's play with this ratio, going from 0.2 to 0.5 and then down to 0.1:

In [6]:
summarize(text, ratio=0.5)

'President Trump lashed out at one of his favorite American manufacturers on Tuesday, \ncriticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.\nIn a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.\nA day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s '

In [7]:
summarize(text, ratio=0.1)

''
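
The empty result above is consistent with simple arithmetic, under the assumption that the summarizer keeps roughly `int(n_sentences * ratio)` sentences and that each of the 8 lines of the sample text counts as one sentence (the help text notes that newlines divide sentences). The helper name `expected_sentence_count` is hypothetical; this is a sketch of the arithmetic, not gensim's code.

```python
def expected_sentence_count(n_sentences, ratio):
    """Assumed rule: keep the integer part of n_sentences * ratio."""
    return int(n_sentences * ratio)


n_sentences = 8  # assumption: one sentence per line of the sample text

counts = {ratio: expected_sentence_count(n_sentences, ratio)
          for ratio in (0.1, 0.2, 0.5)}
print(counts)  # {0.1: 0, 0.2: 1, 0.5: 4}
```

These counts match the outputs shown: nothing for 0.1, one sentence for 0.2, and four line-level sentences for 0.5.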

With a ratio of 0.1, the summarizer does not seem to find any sentence that can represent the text in only 10% of its original length, so it returns an empty string. The default ratio of 0.2 appears to do a good job of producing the summary. Let's now set the word count of the output instead:

In [8]:
summarize(text, word_count=60)

'criticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.\nIn a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.\nA day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s '

Let's now try the *keywords* function:

In [9]:
help(keywords)

Help on function keywords in module gensim.summarization.keywords:

keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=('NN', 'JJ'), lemmatize=False, deacc=True)
    Get most ranked words of provided text and/or its combinations.
    
    Parameters
    ----------
    
    text : str
        Input text.
    ratio : float, optional
        If no "words" option is selected, the number of sentences is reduced by the provided ratio,
        else, the ratio is ignored.
    words : int, optional
        Number of returned words.
    split : bool, optional
        Whether split keywords if True.
    scores : bool, optional
        Whether score of keyword.
    pos_filter : tuple, optional
        Part of speech filters.
    lemmatize : bool, optional
        If True - lemmatize words.
    deacc : bool, optional
        If True - remove accentuation.
    
    Returns
    -------
    result: list of (str, float)
        If `scores`, keywords with scores **OR**
    resul

In [10]:
keywords(text, scores=True)

[('production', 0.27158057325121843),
 ('president', 0.25302201457447515),
 ('harley', 0.22931345586493718),
 ('republican', 0.20655423881078555),
 ('american', 0.20655423881078552),
 ('house', 0.2065542388107855),
 ('taxes', 0.20655423881078536),
 ('taxed', 0.20655423881078536),
 ('tariff', 0.18373798111238868),
 ('tariffs', 0.18373798111238868)]
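
Several entries in the output above share exactly the same score, for example `taxes` and `taxed`. Since the words are stemmed, one plausible reading is that such entries are variants of a single stem. The sketch below groups keywords by identical score; the helper name `group_by_score` is hypothetical, and treating equal scores as "same stem" is an assumption, not documented gensim behaviour.

```python
from collections import defaultdict

# Scores copied from the keywords output above.
scored_keywords = [
    ('production', 0.27158057325121843),
    ('president', 0.25302201457447515),
    ('harley', 0.22931345586493718),
    ('republican', 0.20655423881078555),
    ('american', 0.20655423881078552),
    ('house', 0.2065542388107855),
    ('taxes', 0.20655423881078536),
    ('taxed', 0.20655423881078536),
    ('tariff', 0.18373798111238868),
    ('tariffs', 0.18373798111238868),
]


def group_by_score(pairs):
    """Group words that share exactly the same score, highest score first."""
    groups = defaultdict(list)
    for word, score in pairs:
        groups[score].append(word)
    ordered = sorted(groups.items(), key=lambda item: item[0], reverse=True)
    return [words for _, words in ordered]


keyword_groups = group_by_score(scored_keywords)
for words in keyword_groups:
    print(words)
```

Keeping one representative per group would give a shorter keyword list without near-duplicates.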

Now let's use a larger text. Let's get an ebook from the Gutenberg database, for example *Emma* by Jane Austen:

In [12]:
ebook_url = 'http://www.gutenberg.org/files/158/158-0.txt'
ebook_text = requests.get(ebook_url).text

In [29]:
'Original length of text is: ', len(ebook_text)

('Original length of text is: ', 919021)

The retrieved ebook contains close to 1 million characters. Trying to summarize the full text raised a MemoryError, so we reduce the text to about 1/10 of its original length and summarize that reduced text with a ratio of 1%:

In [30]:
sub = int(len(ebook_text) / 10)
ebook_sub = ebook_text[:sub]

In [31]:
'Reduced length of text is: ', sub

('Reduced length of text is: ', 91902)
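
A plain character slice such as the one above can cut the text in the middle of a word or sentence. As a possible refinement, the sketch below cuts at the last blank line within the character budget instead. The helper name `truncate_at_paragraph` is hypothetical, and it assumes paragraphs are separated by blank lines, as is common in plain-text ebooks.

```python
def truncate_at_paragraph(text, max_chars):
    """Cut `text` at the last blank line found within the first `max_chars` characters."""
    # Normalize Windows line endings so a blank line is always '\n\n'.
    normalized = text.replace('\r\n', '\n')
    if len(normalized) <= max_chars:
        return normalized
    cut = normalized.rfind('\n\n', 0, max_chars)
    # Fall back to a hard cut if no paragraph break is found.
    return normalized[:cut] if cut != -1 else normalized[:max_chars]


sample = "First paragraph.\r\n\r\nSecond paragraph is longer.\r\n\r\nThird."
shortened = truncate_at_paragraph(sample, 30)
print(repr(shortened))  # 'First paragraph.'
```

In this notebook it could be called as `truncate_at_paragraph(ebook_text, len(ebook_text) // 10)` in place of the slice.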

In [32]:
summarize(ebook_sub, ratio=0.01)

"Emma Woodhouse, handsome, clever, and rich, with a comfortable home\r\ndifference between a Mrs. Weston, only half a mile from them, and a Miss\r\n'poor Miss Taylor.' I have a great regard for you and Emma; but when it\r\n“Well,” said Emma, willing to let it pass--“you want to hear about\r\n“Dear Emma bears every thing so well,” said her father.\nsuccess, you know!--Every body said that Mr. Weston would never marry\r\nwould be a very good thing for Miss Taylor if Mr. Weston were to marry\r\nHarriet Smith's intimacy at Hartfield was soon a settled thing.\n“Well done, Mrs. Martin!” thought Emma.\nindoors man, else they do not want for any thing; and Mrs. Martin talks\r\nKnightley, “of this great intimacy between Emma and Harriet Smith, but I\r\nEmma must do Harriet good: and by supplying her with a\r\nnew object of interest, Harriet may be said to do Emma good.\nnothing herself, and looks upon Emma as knowing every thing.\nEmma could not feel a doubt of having given Harriet's fancy a pr

Not only did it take quite a while to process the text (around 5 minutes), but the output is also difficult to accept as an actual summary of the book.
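
One possible contributor to the fragmentary output: the help text of `summarize` notes that newlines divide sentences, and the ebook is hard-wrapped, so each physical line may be treated as a separate sentence. The sketch below joins wrapped lines inside each paragraph before summarizing. The helper name `unwrap_paragraphs` is hypothetical, it assumes paragraphs are separated by blank lines, and whether it improves the summary was not tested here.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines so that each paragraph becomes a single line."""
    normalized = text.replace('\r\n', '\n')
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', normalized)
    unwrapped = [
        ' '.join(line.strip() for line in paragraph.split('\n') if line.strip())
        for paragraph in paragraphs
    ]
    # One paragraph per line, dropping empty paragraphs.
    return '\n'.join(p for p in unwrapped if p)


wrapped = "Emma Woodhouse, handsome,\r\nclever, and rich.\r\n\r\nShe was happy.\r\n"
unwrapped_text = unwrap_paragraphs(wrapped)
print(unwrapped_text)
```

The result could then be passed to `summarize` in place of the raw slice of the ebook.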

This leads us to conclude that extractive automatic summarization is, at this point, getting better for non-fiction text such as news, scientific articles and similar formal writing, but it is still very difficult to extract a meaningful summary from other kinds of text, such as this piece of a novel. An abstractive method based on reinforcement learning may be a better way to summarize this novel.

Let's finally output the first 10 keywords of the ebook we downloaded:

In [34]:
keywords(ebook_sub, scores=True)[:10]

[('emma', 0.34802493396112461),
 ('little', 0.26814972277758409),
 ('_little_', 0.26814972277758409),
 ('harriet', 0.25106911199019366),
 ('good', 0.24990148228268771),
 ('goodness', 0.24990148228268771),
 ('miss', 0.16575848802960919),
 ('missing', 0.16575848802960919),
 ('missed', 0.16575848802960919),
 ('mrs', 0.15705022829834347)]

Overall, the gensim *keywords* function appears to do a good job of providing a starting point for further analysis.