![Text Summarization - Extractive](https://images.unsplash.com/photo-1484480974693-6ca0a78fb36b?ixlib=rb-0.3.5&ixid=eyJhcHBfaWQiOjEyMDd9&s=612621fd686897b4812287430c8be9db&auto=format&fit=crop&w=1052&q=80)

Source: https://unsplash.com/photos/RLw-UC03Gwc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

# Text Summarization - Extractive

To get a quick idea of a long document, we usually summarize it as we read an article or book. In English, the first sentence (or the first two) of an article has a very high chance of representing the whole article, although the topic sentence is sometimes the last one.

In NLP, there are two approaches to text summarization. The first, the extractive approach, is the simpler one: it extracts key words or sentences from the article. It has some limitations, and its performance has been shown to be not very good. The second, the abstractive approach, generates new sentences based on the given article and needs more advanced techniques.

After reading this article, you will:
- Understand the PageRank algorithm
- Understand the TextRank algorithm
- Know how to use the TextRank algorithm to produce a summary

![PageRank Algorithm](https://cdn-images-1.medium.com/max/800/0*OoVjAZzO8II2Oq4N.jpg)

Source: https://www.youtube.com/watch?v=P8Kt6Abq_rM

The PageRank algorithm was developed by Google to rank the importance of websites so that Google search results are relevant to the query.

PageRank operates on a directed graph. At the beginning, all nodes have an equal score (1 / total number of nodes).

The algorithm:

![PageRank Formula](https://blogs.cornell.edu/info2040/files/2015/10/formula-pagerank-seqlz8.jpg)

Source: https://blogs.cornell.edu/info2040/2015/10/17/will-outbound-links-reduce-the-pagerank/

The first formula is the simplified version of PageRank, and it is the one we will use for the demo. The second is a little more complicated because it involves one more parameter, the damping factor "d". By default, d is 0.85.

Let's take a look at the simplified version. In iteration 1, here is how PageRank is calculated:
- A: (1/4)/3. Only C points to A, so we take C's previous score (from iteration 0) and divide it by the number of nodes that C points to (i.e. 3).
- B: (1/4)/2 + (1/4)/3. Both A and C point to B, so we take A's previous score (from iteration 0) divided by the number of nodes that A points to (i.e. 2). For C, the term is the same as before: (1/4)/3.
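The iteration above can be sketched in a few lines of Python. This is only an illustration of the simplified formula: the out-links of A and C follow the description above, while the out-links of B and D are not stated in the text and are assumed here so that the four-node graph is complete.

```python
from fractions import Fraction

# Directed graph: node -> list of nodes it points to.
# A and C follow the text; B and D are assumptions made for illustration.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def simplified_pagerank_step(scores, links):
    """One iteration: each node splits its previous score equally among its out-links."""
    new_scores = {node: Fraction(0) for node in links}
    for source, targets in links.items():
        share = scores[source] / len(targets)
        for target in targets:
            new_scores[target] += share
    return new_scores

# Iteration 0: every node starts with 1 / (total number of nodes).
iteration_0 = {node: Fraction(1, len(links)) for node in links}
iteration_1 = simplified_pagerank_step(iteration_0, links)

for node in sorted(iteration_1):
    print(node, iteration_1[node])
```

Exact fractions are used instead of floats so that the values for A and B can be compared directly with the hand calculation, (1/4)/3 and (1/4)/2 + (1/4)/3.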

For details, you may check out the [YouTube video](https://www.youtube.com/watch?v=P8Kt6Abq_rM) for a full explanation.

Question: When should we stop iterating? <br>
In theory, the calculation should continue until the scores no longer change significantly from one iteration to the next.
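A minimal sketch of this stopping rule, combined with the damping factor d, could look like the following. It uses one common normalized form of the damped update, (1 - d) / N plus d times the simplified sum, which may differ slightly from the formula in the image. The `tolerance` and `max_iterations` parameters and the example graph are assumptions chosen for illustration, and every node is assumed to have at least one outgoing link.

```python
def pagerank(links, damping=0.85, tolerance=1e-6, max_iterations=100):
    """Iterate damped PageRank until the total score change drops below `tolerance`."""
    nodes = list(links)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Every node receives the same base amount, then shares from its in-links.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for source, targets in links.items():
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        # "No big update": sum of absolute score changes across all nodes.
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tolerance:
            break
    return scores, iterations

# Hypothetical four-node graph (node -> nodes it points to).
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
scores, iterations = pagerank(links)
print("converged after", iterations, "iterations")
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

A smaller tolerance gives more precise scores at the cost of more iterations; `max_iterations` is only a safety cap in case the threshold is never reached.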

# TextRank
Why do we need to introduce PageRank before TextRank? Because the idea of TextRank comes from PageRank, and it uses a similar graph-based algorithm to calculate importance.

Differences:
- The TextRank graph is undirected, meaning that all edges are bidirectional.
- Edge weights vary from edge to edge, whereas every edge has a weight of 1 in PageRank. There are different ways to calculate the weights, such as BM25 and TF-IDF.
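To make these two differences concrete, here is a minimal sketch of a weighted score update on an undirected graph, in the spirit of the weighted formula in the TextRank paper: each neighbour passes on its score in proportion to the weight of the shared edge. The weight matrix is made up for illustration, and the `tolerance` and `max_iterations` parameters are assumptions rather than values taken from any library.

```python
def weighted_textrank(weights, damping=0.85, tolerance=1e-6, max_iterations=100):
    """Score the nodes of an undirected weighted graph.

    `weights[i][j]` is the symmetric edge weight between nodes i and j
    (0 means there is no edge).
    """
    n = len(weights)
    # Total weight of the edges touching each node, ignoring self-loops.
    totals = [sum(w for j, w in enumerate(row) if j != i) for i, row in enumerate(weights)]
    scores = [1.0] * n
    for _ in range(max_iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j contributes its score, scaled by the share of
            # j's total edge weight that belongs to the edge (j, i).
            incoming = sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n)
                if j != i and totals[j] > 0
            )
            new_scores.append((1.0 - damping) + damping * incoming)
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tolerance:
            break
    return scores

# Hypothetical similarity weights between four sentences (symmetric, zero diagonal).
example_weights = [
    [0.0, 0.6, 0.1, 0.0],
    [0.6, 0.0, 0.5, 0.2],
    [0.1, 0.5, 0.0, 0.3],
    [0.0, 0.2, 0.3, 0.0],
]
example_scores = weighted_textrank(example_weights)
ranking = sorted(range(len(example_scores)), key=lambda i: example_scores[i], reverse=True)
print(ranking)
```

In a summarizer, each node would be a sentence, and the highest-ranked sentences would be kept as the summary.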

There are many document similarity measures, such as BM25, cosine similarity, and IDF-modified cosine. You may choose whichever best fits your problem. If you are not familiar with these algorithms, please let us know and we will cover them in a later post.
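As a rough illustration of one of these measures, the sketch below computes plain cosine similarity over raw term counts and uses it to fill a sentence-to-sentence weight matrix. The whitespace tokenization and the three toy sentences are simplifying assumptions; a real pipeline would normally also remove stop words and apply stemming.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between the raw term-frequency vectors of two sentences."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy sentences, made up for illustration.
sentences = [
    "Microsoft held talks to acquire GitHub",
    "GitHub denied reports of talks with Microsoft",
    "The weather is sunny today",
]

# Symmetric weight matrix; the diagonal is left at 0 so a sentence does not vote for itself.
similarity_matrix = [
    [0.0 if i == j else cosine_similarity(a, b) for j, b in enumerate(sentences)]
    for i, a in enumerate(sentences)
]
for row in similarity_matrix:
    print([round(value, 3) for value in row])
```

Sentences that share no words get a weight of 0, which means there is effectively no edge between them in the graph.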

![gensim](https://radimrehurek.com/gensim/_static/images/gensim.png)

Source: https://radimrehurek.com/gensim/

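gensim provides a simple API to calculate TextRank using BM25 (Best Match 25) as the similarity measure. Note that the gensim.summarization module used below was removed in gensim 4.0, which is why the setup step pins a 3.x release.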

Step 1: Environment Setup

pip install gensim==3.4.0

Step 2: Import library

In [1]:
import gensim 
print('gensim Version: %s' % (gensim.__version__))

gensim Version: 3.4.0


In [2]:
# Capture from https://www.cnbc.com/2018/06/01/microsoft--github-acquisition-talks-resume.html

content = "Microsoft held talks in the past few weeks " + \
    "to acquire software developer platform GitHub, Business " + \
    "Insider reports. One person familiar with the discussions " + \
    "between the companies told CNBC that they had been " + \
    "considering a joint marketing partnership valued around " + \
    "$35 million, and that those discussions had progressed to " + \
    "a possible investment or outright acquisition. It is " + \
    "unclear whether talks are still ongoing, but this " + \
    "person said that GitHub's price for a full acquisition " + \
    "was more than Microsoft currently wanted to pay. GitHub " + \
    "was last valued at $2 billion in its last funding round " + \
    "2015, but the price tag for an acquisition could be $5 " + \
    "billion or more, based on a price that was floated " + \
    "last year. GitHub's tools have become essential to " + \
    "software developers, who use it to store code, " + \
    "keep track of updates and discuss issues. The privately " + \
    "held company has more than 23 million individual users in " + \
    "more than 1.5 million organizations. It was on track to " + \
    "book more than $200 million in subscription revenue, " + \
    "including more than $110 million from companies using its " + \
    "enterprise product, GitHub told CNBC last fall.Microsoft " + \
    "has reportedly flirted with buying GitHub in the past, " + \
    "including in 2016, although GitHub denied those " + \
    "reports. A partnership would give Microsoft another " + \
    "connection point to the developers it needs to court to " + \
    "build applications on its various platforms, including " + \
    "the Azure cloud. Microsoft could also use data from " + \
    "GitHub to improve its artificial intelligence " + \
    "producs. The talks come amid GitHub's struggle to " + \
    "replace CEO and founder Chris Wanstrath, who stepped " + \
    "down 10 months ago. Business Insider reported that " + \
    "Microsoft exec Nat Friedman -- who previously " + \
    "ran Xamarin, a developer tools start-up that Microsoft " + \
    "acquired in 2016 -- may take that CEO role. Google's " + \
    "senior VP of ads and commerce, Sridhar Ramaswamy, has " + \
    "also been in discussions for the job, says the report. " + \
    "Microsoft declined to comment on the report. " + \
    "GitHub did not immediately return a request for comment."

In [3]:
print('Original Content:')
print(content)
for ratio in [0.3, 0.5, 0.7]:
    summarized_content = gensim.summarization.summarize(content, ratio=ratio)
    print()
    print('---> Summarized Content (Ratio is %.1f):' % ratio)
    print(summarized_content)

Original Content:
Microsoft held talks in the past few weeks to acquire software developer platform GitHub, Business Insider reports. One person familiar with the discussions between the companies told CNBC that they had been considering a joint marketing partnership valued around $35 million, and that those discussions had progressed to a possible investment or outright acquisition. It is unclear whether talks are still ongoing, but this person said that GitHub's price for a full acquisition was more than Microsoft currently wanted to pay. GitHub was last valued at $2 billion in its last funding round 2015, but the price tag for an acquisition could be $5 billion or more, based on a price that was floated last year. GitHub's tools have become essential to software developers, who use it to store code, keep track of updates and discuss issues. The privately held company has more than 23 million individual users in more than 1.5 million organizations. It was on track to book more than $

In [4]:
print('Original Content:')
print(content)
for word_count in [10, 30, 50]:
    summarized_content = gensim.summarization.summarize(content, word_count=word_count)
    print()
    print('---> Summarized Content (Word Count is %d):' % word_count)
    print(summarized_content)

Original Content:
Microsoft held talks in the past few weeks to acquire software developer platform GitHub, Business Insider reports. One person familiar with the discussions between the companies told CNBC that they had been considering a joint marketing partnership valued around $35 million, and that those discussions had progressed to a possible investment or outright acquisition. It is unclear whether talks are still ongoing, but this person said that GitHub's price for a full acquisition was more than Microsoft currently wanted to pay. GitHub was last valued at $2 billion in its last funding round 2015, but the price tag for an acquisition could be $5 billion or more, based on a price that was floated last year. GitHub's tools have become essential to software developers, who use it to store code, keep track of updates and discuss issues. The privately held company has more than 23 million individual users in more than 1.5 million organizations. It was on track to book more than $

# Conclusion
For the entire code, you may check it out from GitHub. Let us know if you would also like to learn about the abstractive approach, and we will arrange an article later on.

- According to the gensim source code, at least 10 sentences are recommended for the input.
- No training data or model building is required.
- It works not only for English but also for any other bag of input tokens (symbols, Japanese, etc.). You may also read the [TextRank research paper](http://www.aclweb.org/anthology/W04-3252) for a detailed understanding.