In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

## Overview:

The summarizer should follow this structure:

```py
def main(text):
    clean_text = clean(text)
    tokenized_text = tokenize(clean_text)
    vector_list = vec_encode(tokenized_text)
    clusters = cluster(vector_list)
    summary = extract_most_meaningful(clusters)
    return summary
```


In [16]:
#test = "Is this    a test?\n \t(No, it's only a sample.)\n Testing!,  testing...\n"
#test = "Maria Eliza Rundell (1745 – 16 December 1828) was an English writer. In 1805 when she was over 60, she sent an unedited collection of recipes and household advice to her friend John Murray, of whose family owned a publishing house.\n\nMurray published the work, A New System of Domestic Cookery, in November 1805. It was a huge success and several editions followed; the book sold around half a million copies in Rundell's lifetime. The book was aimed at middle class housewives. In addition to dealing with food preparation, it offers advice on medical remedies and how to set up a home brewery and includes a section entitled \"Directions to Servants\". The book contains an early recipe for tomato sauce—possibly the first—and the first recipe in print for Scotch eggs. Rundell also advises readers on being economical with their food and avoiding waste.\n\nIn 1819 Rundell asked Murray to stop publishing Domestic Cookery, as she was increasingly unhappy with the way the work had declined with each subsequent edition. She wanted to issue a new edition with a new publisher. A court case ensued, and legal wrangling between the two sides continued until 1823, when Rundell accepted Murray's offer of £2,100 for the rights to the work.\n\nRundell wrote a second book, Letters Addressed to Two Absent Daughters, published in 1814. The work contains the advice a mother would give to her daughters on subjects such as death, friendship, how to behave in polite company and the types of books a well-mannered young woman should read. She died in December 1828 while visiting Lausanne, Switzerland."
#test = "HONG KONG — Police fired tear gas against protesters in Hong Kong before meetings Monday between the territory’s leader and Communist Party officials in Beijing, ending a lull in what have become regular clashes between riot squads and demonstrators.\n\nPolice said they fired the choking gas after unrest erupted Sunday night in the Mongkok district of Kowloon.\n\nProtesters threw bricks at officers and tossed traffic cones at a police vehicle, police said. They also set fires, blocked roads and smashed traffic lights with hammers.\n\nVideo footage showed truncheon-wielding riot officers squirting pepper spray at a man in a group of journalists and ganging up to beat and manhandle him.\n\nThe violence and scattered confrontations in shopping malls earlier Sunday, where police also squirted pepper spray and made several arrests, ended what had been a lull of a couple of weeks in clashes between police and protesters.\n\nThe uptick in tension came as Hong Kong leader Carrie Lam was in Beijing on Monday to brief President Xi Jinping on the situation in the semi-autonomous Chinese territory.\n\nHong Kong’s protest movement erupted in June against now-scrapped legislation that would have allowed criminal suspects to be extradited for trial in Communist Party-controlled courts in mainland China.\n\nIt has snowballed into a full-blow challenge to the government and Communist leaders in Beijing, with an array of demands, including that Hong Kong’s leader and legislators all be fully elected."
test = "President Donald Trump's new trade deal with China will further integrate the world's two largest economies if Beijing honors its commitments in areas ranging from intellectual property protection to agriculture, U.S. Trade Representative Robert Lighthizer said Sunday.\n\n\"Ultimately, whether this whole agreement works is going to be determined by who's making the decisions in China, not in the United States,\" Lighthizer said on CBS News' \"Face the Nation.\" \"If the hardliners are making the decisions, we're going to get one outcome. If the reformers are making the decisions, which is what we hope, then we're going to get another outcome.\"\n\nThe \"phase one\" trade deal announced Friday cancels additional duties that were scheduled to go into effect Sunday, reduces duties on about $120 billion of Chinese goods to 7.5 percent, from 15 percent previously, and leaves a 25 percent duty in place on another $250 billion worth of Chinese goods.\n\nChina, in addition to making promises to better protect U.S. intellectual property, has pledged to buy another $200 billion worth of goods and services from the United States over the next two years, including about $40 billion to $50 billion worth of agricultural products each year.\n\n\"You could think of it as $80 to $100 billion in new sales for agriculture over the course of the next two years. Just massive numbers,\" Lighthizer said.\n\nThat has prompted questions about whether U.S. farmers can actually accommodate the increased demand, without siphoning sales away from other export markets they already have.\n\nFor much of the last two years, there has been a debate about whether Trump's true aim by imposing tariffs on hundreds of billions of dollars of Chinese good was to separate, or \"decouple,\" the U.S. 
economy from China, rather reach an actual trade agreement.\n\nOn Sunday, Lighthizer indicated the objective was to tie the two economies closer together.\n\n\"The way to think about this deal, is this is a first step in trying to integrate two very different systems to the benefit of both of us,\" the trade chief said.\n\nThe Trump administration also got another trade win last week when House Democrats and the AFL-CIO endorsed a newly revised North American trade agreement with Mexico and Canada, after changes were made to toughen labor enforcement provisions and weaken intellectual property protections for life-saving biologic medicine.\n\nSome of the tweaks made to shore up Democratic support have annoyed Republicans, who have different views of both issues. But that's not expected to block congressional approval. The House is expected to vote on the bill this week and the Senate to follow suit in early 2020, after it finishes Trump's impeachment trial.\n\nLighthizer conceded weakening the biologics provision made the trade deal worse on that point. But he said the overall package was \"better\" as a result of the changes demanded by Democrats.\n\n\"There's nothing about being against labor enforcement that's Republican,\" Lighthizer said. \"The president wants Mexico to enforce its labor laws. He doesn't want American manufacturing workers to have to compete with people who are in very difficult conditions.\""

### clean(text)

`clean(text)` collapses newline characters and runs of whitespace into single spaces and trims both ends, returning the whole text as a one-line string.

In later versions, this function should take into account that different texts have different structures.
For example, it should remove signatures from emails, as well as section headers and symbols that separate sections of text.
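
As a rough illustration of that direction, the sketch below strips an email signature and separator lines before collapsing whitespace. It is only a minimal sketch under stated assumptions: `clean_structured` is a hypothetical name, the signature is assumed to follow the conventional `-- ` delimiter line, and separators are assumed to be lines made up only of repeated symbols. Section headers are not handled here.

```python
import re

def clean_structured(text):
    # Assumption: an email signature starts at a line containing only "-- " (or "--").
    body = re.split(r"(?m)^-- ?$", text, maxsplit=1)[0]
    # Assumption: separator lines consist only of 3+ repeated symbols (----, ====, ****).
    lines = [
        line for line in body.splitlines()
        if not re.fullmatch(r"\s*[-=*_#~]{3,}\s*", line)
    ]
    # Same whitespace normalisation as clean(): one line, single spaces.
    return re.sub(r"\s+", " ", " ".join(lines)).strip()

sample_email = "Hello   team,\n=====\nThe report is ready.\n-- \nJane Doe\nACME"
print("|" + clean_structured(sample_email) + "|")
```

Real inputs vary a lot, so rules like these would probably need to be selected per document type rather than applied to every text.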


In [10]:
def clean(text):
    # Raw string so that \s is passed to the regex engine unchanged
    return re.sub(r"\s+", " ", text).strip()

test2 = clean(test)
print("|" + test2 + "|")

|President Donald Trump's new trade deal with China will further integrate the world's two largest economies if Beijing honors its commitments in areas ranging from intellectual property protection to agriculture, U.S. Trade Representative Robert Lighthizer said Sunday. "Ultimately, whether this whole agreement works is going to be determined by who's making the decisions in China, not in the United States," Lighthizer said on CBS News' "Face the Nation." "If the hardliners are making the decisions, we're going to get one outcome. If the reformers are making the decisions, which is what we hope, then we're going to get another outcome." The "phase one" trade deal announced Friday cancels additional duties that were scheduled to go into effect Sunday, reduces duties on about $120 billion of Chinese goods to 7.5 percent, from 15 percent previously, and leaves a 25 percent duty in place on another $250 billion worth of Chinese goods. China, in addition to making promises to better protect

### tokenize(clean_text)

`tokenize(clean_text)` splits the text into a list of sentences using NLTK's pre-trained Punkt sentence tokenizer.

In later versions, this tokenizer model should be customised and refined by training it on the input with `PunktSentenceTokenizer(text)`.
It might also be expanded to handle languages other than English.
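
One possible way to customise Punkt, shown here only as a sketch, is to build the tokenizer from a `PunktParameters` object with a hand-picked abbreviation list instead of loading the pre-trained pickle. The abbreviation set and the sample sentence below are illustrative assumptions, not values used elsewhere in this notebook.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Assumption: these abbreviations are just examples. Punkt stores them
# lowercased and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"u.s", "mr", "dr"}

# Passing parameters (rather than raw training text) skips the training step.
custom_tokenizer = PunktSentenceTokenizer(params)

sample_text = "The U.S. Trade Representative spoke on Sunday. He praised the deal."
custom_sentences = custom_tokenizer.tokenize(sample_text)
print(custom_sentences)
```

Passing a plain string to `PunktSentenceTokenizer` instead trains the parameters on that text, which is the other customisation route mentioned above.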

In [11]:
def tokenize(text):
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    return sent_detector.tokenize(text.strip())

test3 = tokenize(test2)
print(test3)

["President Donald Trump's new trade deal with China will further integrate the world's two largest economies if Beijing honors its commitments in areas ranging from intellectual property protection to agriculture, U.S. Trade Representative Robert Lighthizer said Sunday.", '"Ultimately, whether this whole agreement works is going to be determined by who\'s making the decisions in China, not in the United States," Lighthizer said on CBS News\' "Face the Nation."', '"If the hardliners are making the decisions, we\'re going to get one outcome.', 'If the reformers are making the decisions, which is what we hope, then we\'re going to get another outcome."', 'The "phase one" trade deal announced Friday cancels additional duties that were scheduled to go into effect Sunday, reduces duties on about $120 billion of Chinese goods to 7.5 percent, from 15 percent previously, and leaves a 25 percent duty in place on another $250 billion worth of Chinese goods.', 'China, in addition to making promis

### vec_encode(tokenized_text)

`vec_encode(tokenized_text)` uses a Skip-Thought encoder to encode the tokenized sentences into NumPy arrays, one vector per sentence.

For this version, I use the pre-trained model from the open-source code that the author of the skip-thoughts paper made available. The original is a Python 2.7 script, which I converted to Python 3.
In future versions, I would train my own model. I would also look into a Quick-Thought vectors solution.

In [13]:
def vec_encode(tokenized_text, encoder):
    return encoder.encode(tokenized_text, verbose=False)

test4 = vec_encode(test3, codec)
print(test4)

[[ 0.0074926   0.0033992   0.00188786 ... -0.08552077  0.01064042
  -0.00445188]
 [-0.00032187 -0.00718423  0.00456905 ... -0.017874    0.02079725
   0.00088872]
 [ 0.00200476  0.01619893  0.0036355  ... -0.00549309  0.00734067
  -0.0009319 ]
 ...
 [ 0.00389003  0.01036633 -0.00318061 ... -0.01905366 -0.04961745
   0.00242122]
 [-0.00231767 -0.00231783  0.00931677 ... -0.03071932  0.00890862
  -0.00982254]
 [-0.01030919  0.0134872   0.01820052 ... -0.00563525  0.01157318
   0.00058838]]


### cluster(vector_list)

`cluster(vector_list)` fits a KMeans model to the given sentence vectors, with the number of clusters set to 30% of the number of sentences, and returns the fitted model.
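
To make the shape of the result concrete, here is a small sketch that clusters random stand-in vectors rather than real sentence encodings. The array size, the `max(1, ...)` guard for very short texts, and the explicit `n_init=10` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for sentence encodings: 20 "sentences", 8 dimensions each (made-up data).
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(20, 8))

# 30% of the sentences, guarded so that short texts still get one cluster.
demo_k = max(1, int(demo_vectors.shape[0] * 0.3))
demo_model = KMeans(n_clusters=demo_k, random_state=0, n_init=10).fit(demo_vectors)

print(demo_model.cluster_centers_.shape)  # one centre per cluster
print(demo_model.labels_)                 # cluster id assigned to each sentence
```

The fitted model exposes `cluster_centers_` and `labels_`, which are the two attributes the extraction step relies on.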

In [14]:
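def cluster(vector_list):
    # 30% of the sentences of the original text, but never fewer than one cluster
    n_clusters = max(1, int(vector_list.shape[0] * 0.3))
    # Note: n_jobs was removed from KMeans in scikit-learn 1.0; drop it on newer versions
    model = KMeans(n_clusters=n_clusters, random_state=0, n_jobs=-1).fit(vector_list)
    return model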

test5 = cluster(test4)
print(test5)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=6, n_init=10, n_jobs=-1, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)


### extract_most_meaningful(clusters)

`extract_most_meaningful(clusters)` selects and returns the most meaningful sentence in each cluster (i.e. the one closest to the cluster centre). The selected sentences are then ordered by the mean position, in the original text, of the sentences belonging to their cluster.
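
The selection and ordering rule can be traced on a toy example. The sketch below uses only NumPy with hand-made 2-D vectors and labels, and computes each centre as the mean of its members. `pick_representatives` and all of the data are hypothetical and exist only for this illustration.

```python
import numpy as np

def pick_representatives(sentences, vectors, labels):
    picked = []
    for k in np.unique(labels):
        members = np.where(labels == k)[0]          # sentence indices in cluster k
        centre = vectors[members].mean(axis=0)      # centre taken as the member mean
        dists = np.linalg.norm(vectors[members] - centre, axis=1)
        best = members[np.argmin(dists)]            # member closest to the centre
        picked.append((members.mean(), sentences[best]))
    # Order the chosen sentences by the mean position of their cluster.
    picked.sort(key=lambda pair: pair[0])
    return [sentence for _, sentence in picked]

toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_vectors = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [0, 0.4], [10, 10.4]])
toy_labels = np.array([1, 1, 0, 0, 1, 0])
toy_summary = pick_representatives(toy_sentences, toy_vectors, toy_labels)
print(toy_summary)
```

Cluster 0 covers sentences 2, 3 and 5 (mean position about 3.33) and cluster 1 covers sentences 0, 1 and 4 (mean position about 1.67), so the representative of cluster 1 comes first even though it has the higher cluster id.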

In [15]:
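def extract_most_meaningful(tokenized_text, vector_list, model):
    extracted = []
    for k in range(model.cluster_centers_.shape[0]):
        # Indices of the sentences assigned to cluster k
        index = np.where(model.labels_ == k)[0]
        # Search only among the cluster's own members for the one closest to the centre
        centre = model.cluster_centers_[k].reshape(1, -1)
        min_i, _ = pairwise_distances_argmin_min(centre, vector_list[index])
        p_snt = tokenized_text[index[min_i[0]]]
        # Keep the cluster's mean sentence position to order the summary
        extracted.append((np.mean(index), p_snt))

    extracted.sort(key=lambda x: x[0])
    return " ".join([s[1] for s in extracted])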

test6 = extract_most_meaningful(test3, test4, test5)
print(test6)

President Donald Trump's new trade deal with China will further integrate the world's two largest economies if Beijing honors its commitments in areas ranging from intellectual property protection to agriculture, U.S. Trade Representative Robert Lighthizer said Sunday. If the reformers are making the decisions, which is what we hope, then we're going to get another outcome." "The president wants Mexico to enforce its labor laws. Some of the tweaks made to shore up Democratic support have annoyed Republicans, who have different views of both issues. "There's nothing about being against labor enforcement that's Republican," Lighthizer said. But that's not expected to block congressional approval.


## Summarizer

In [12]:
%run "../pretrained-skipthought-encoder/skipthoughts.py"

In [21]:
# Encoder and load_model are defined by skipthoughts.py, loaded with %run above
def init_encoder():
    return Encoder(load_model())

try:  # has codec been initialized?
    codec
except NameError:
    codec = init_encoder()

In [28]:
def main(text):
    clean_text = clean(text)
    tokenized_text = tokenize(clean_text)
    vector_list = vec_encode(tokenized_text, codec)
    model = cluster(vector_list)
    summary = extract_most_meaningful(tokenized_text, vector_list, model)
    return summary

In [30]:
main(test)

'President Donald Trump\'s new trade deal with China will further integrate the world\'s two largest economies if Beijing honors its commitments in areas ranging from intellectual property protection to agriculture, U.S. Trade Representative Robert Lighthizer said Sunday. If the reformers are making the decisions, which is what we hope, then we\'re going to get another outcome." "The president wants Mexico to enforce its labor laws. Some of the tweaks made to shore up Democratic support have annoyed Republicans, who have different views of both issues. "There\'s nothing about being against labor enforcement that\'s Republican," Lighthizer said. But that\'s not expected to block congressional approval.'