# Appendix A

## ZHAW, CAS Machine Learning, Text Analytics
## Text Summarization with gensim
**Giovanni López, June 2018**

We will solve a simple text summarization problem using gensim, a Python library for topic modelling, document indexing and similarity retrieval.

The gensim summarizer is based on the TextRank algorithm, with the BM25 ranking function used to measure the similarity between sentences. It only works for English: stopwords are removed and words are stemmed.
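
To give an intuition for what TextRank does, here is a minimal sketch in plain Python: sentences become nodes of a graph, edges are weighted by a similarity score, and a PageRank-style iteration ranks the sentences. This is not gensim's implementation. The helper names (`split_sentences`, `similarity`, `rank_sentences`, `toy_summarize`) are hypothetical, the sentence splitter simply takes one sentence per line, and a simple word-overlap similarity is swapped in for BM25 to keep the example short.

```python
import math
import re


def split_sentences(text):
    """Naive splitter for this sketch: one sentence per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tokenize(sentence):
    """Lowercased set of alphabetic words (no stopword removal or stemming)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Word-overlap similarity, used here instead of BM25 (an assumption)."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def toy_summarize(text, count=1):
    """Return the `count` best-ranked sentences in their original order."""
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(top)]


sample = """The cat sat on the mat.
The dog chased the cat around the garden.
Stock markets fell sharply today.
The cat and the dog slept on the mat."""
print(toy_summarize(sample, count=2))
```

A sentence that shares no words with the others receives no incoming weight, so it ends up with the lowest score and is the last one to be selected.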

With gensim already installed, we import two functions, *summarize* and *keywords*, from the *summarization* module. Note that this module was removed in gensim 4.0, so a 3.x release is required:

In [18]:
from gensim.summarization import summarize
from gensim.summarization import keywords

Now, let's copy a few lines from a random [article](https://www.nytimes.com/2018/06/26/business/trump-harley-davidson-tariffs.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news) from the New York Times, something like the following:

In [4]:
text = '''President Trump lashed out at one of his favorite American manufacturers on Tuesday, 
criticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.
In a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe. 
He told Republican lawmakers at a White House meeting that the move amounted to a betrayal, saying, “I’ve been very good to Harley-Davidson.”
“If they move, watch, it will be the beginning of the end — they surrendered, they quit!” the president wrote on Twitter. 
“The Aura will be gone and they will be taxed like never before!”
A day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s 
new 31 percent tariff on its imported bikes, which was imposed in retaliation for Mr. Trump’s steel and aluminum tariffs.
'''

Let's explore the *summarize* function imported from gensim:

In [6]:
help(summarize)

Help on function summarize in module gensim.summarization.summarizer:

summarize(text, ratio=0.2, word_count=None, split=False)
    Get a summarized version of the given text.
    
    The output summary will consist of the most representative sentences
    and will be returned as a string, divided by newlines.
    
    Note
    ----
    The input should be a string, and must be longer than :const:`~gensim.summarization.summarizer.INPUT_MIN_LENGTH`
    sentences for the summary to make sense.
    The text will be split into sentences using the split_sentences method in the :mod:`gensim.summarization.texcleaner`
    module. Note that newlines divide sentences.
    
    
    Parameters
    ----------
    text : str
        Given text.
    ratio : float, optional
        Number between 0 and 1 that determines the proportion of the number of
        sentences of the original text to be chosen for the summary.
    word_count : int or None, optional
        Determines how many words will the

Let's run a summarization with all optional arguments left at their default values, passing only the string:

In [7]:
summarize(text)

'In a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.'

The default ratio of 0.2 means that roughly 20% of the sentences of the original text are returned. Let's play with this ratio, raising it to 0.5 and then lowering it to 0.1:

In [9]:
summarize(text, ratio=0.5)

'President Trump lashed out at one of his favorite American manufacturers on Tuesday, \ncriticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.\nIn a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.\nA day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s '

In [10]:
summarize(text, ratio=0.1)

''
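
The three results above are consistent with simple arithmetic on the sentence count. The sketch below assumes that the text is split into eight sentences (one per line of the string, since newlines divide sentences) and that the fraction is truncated to an integer rather than rounded. Both points are assumptions inferred from the outputs, and `expected_sentence_count` is a hypothetical helper, not a gensim function.

```python
def expected_sentence_count(num_sentences, ratio):
    # Assumption: the fraction is truncated toward zero, not rounded.
    return int(num_sentences * ratio)


num_sentences = 8  # assumed: one sentence per line of the sample text
for ratio in (0.2, 0.5, 0.1):
    print(ratio, expected_sentence_count(num_sentences, ratio))
# prints:
# 0.2 1
# 0.5 4
# 0.1 0
```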

With a ratio of 0.1, the summarizer returns an empty string: 10% of such a short text amounts to less than one sentence. The default ratio of 0.2 does a good job of producing a summary. Let's now limit the output by word count instead:

In [17]:
summarize(text, word_count=60)

'criticizing Harley-Davidson over its plans to move some of its motorcycle production abroad and threatening it with punitive taxes in return.\nIn a series of tweets on Tuesday, the president accused the Wisconsin-based company of having “surrendered” in Mr. Trump’s trade war with Europe.\nA day earlier, Harley-Davidson announced that it would shift some of its production overseas in response to the European Union’s '
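
Some of the summaries above start or end in the middle of a sentence (for example, a summary beginning with "criticizing Harley-Davidson"). As the help text notes, newlines divide sentences, so the hard line breaks in the triple-quoted string cut sentences in two. One possible fix is to rejoin wrapped lines before summarizing. The sketch below is a simple heuristic, not part of gensim: `rejoin_wrapped_lines` is a hypothetical helper that treats a line as complete only if it ends with `.`, `!` or `?`, optionally followed by a closing quote or bracket.

```python
import re

# A line counts as complete if it ends with ., ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?][\"'”’)\]]*$")


def rejoin_wrapped_lines(text):
    """Merge hard-wrapped lines so that each output line ends a sentence."""
    merged, buffer = [], ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        buffer = f"{buffer} {line}" if buffer else line
        if SENTENCE_END.search(buffer):
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep any trailing fragment
        merged.append(buffer)
    return "\n".join(merged)


# Made-up sample with a hard wrap after "Tuesday,"
wrapped = ("The company announced on Tuesday, \n"
           "that it would move production abroad.\n"
           "The president replied on Twitter.\n")
print(rejoin_wrapped_lines(wrapped))
```

The cleaned string can then be passed to `summarize` in place of `text`. Since this changes the number of sentences, the same ratio may select a different number of them.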

Let's now try the *keywords* function:

In [19]:
help(keywords)

Help on function keywords in module gensim.summarization.keywords:

keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=('NN', 'JJ'), lemmatize=False, deacc=True)
    Get most ranked words of provided text and/or its combinations.
    
    Parameters
    ----------
    
    text : str
        Input text.
    ratio : float, optional
        If no "words" option is selected, the number of sentences is reduced by the provided ratio,
        else, the ratio is ignored.
    words : int, optional
        Number of returned words.
    split : bool, optional
        Whether split keywords if True.
    scores : bool, optional
        Whether score of keyword.
    pos_filter : tuple, optional
        Part of speech filters.
    lemmatize : bool, optional
        If True - lemmatize words.
    deacc : bool, optional
        If True - remove accentuation.
    
    Returns
    -------
    result: list of (str, float)
        If `scores`, keywords with scores **OR**
    resul

In [20]:
keywords(text, scores=True)

[('production', 0.27158057325121909),
 ('president', 0.25302201457447548),
 ('harley', 0.22931345586493701),
 ('house', 0.20655423881078544),
 ('taxes', 0.20655423881078536),
 ('taxed', 0.20655423881078536),
 ('american', 0.20655423881078533),
 ('republican', 0.20655423881078516),
 ('tariff', 0.1837379811123882),
 ('tariffs', 0.1837379811123882)]
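
In this output, "taxes" and "taxed" share the same score, as do "tariff" and "tariffs", which is what one would expect if scores are computed on stemmed words. If such near-duplicates are unwanted, the list can be post-processed. The sketch below is one possible approach: `crude_stem` is a hypothetical and very rough suffix stripper (not the stemmer gensim uses), and the scores are rounded values taken from the output above.

```python
def crude_stem(word):
    """Hypothetical, very rough suffix stripper for illustration only."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merge_variants(scored_keywords):
    """Keep one keyword per crude stem (the first among equal scores)."""
    best = {}
    for word, score in scored_keywords:
        stem = crude_stem(word)
        if stem not in best or score > best[stem][1]:
            best[stem] = (word, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)


# Rounded subset of the (keyword, score) pairs shown above
scored = [("production", 0.2716), ("taxes", 0.2066), ("taxed", 0.2066),
          ("tariff", 0.1837), ("tariffs", 0.1837)]
print(merge_variants(scored))
# prints: [('production', 0.2716), ('taxes', 0.2066), ('tariff', 0.1837)]
```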