### Text Summarization
**Author** - Kushal 
* An extractive text summarization technique has been implemented. The procedure is as follows:
    * First, we obtain a vector representation of every sentence in the document or paragraph. These sentence embeddings come from neural network architectures (transformers, LSTMs, etc.) trained on a large corpus (e.g. a Wikipedia dump) in an unsupervised way. I have used BERT for this (explained later).
    * These sentence embeddings are then clustered into k clusters, where k is the number of sentences desired in the summary. k can be the square root of the number of sentences in the text, or 30% of it. I did this using K-means from sklearn.
    * Once the embeddings are clustered, the cluster centers can be accessed. Each cluster center represents the overall meaning of the sentences that fall in its cluster.
    * The representative sentence for each cluster is the sentence most similar to its cluster center. The intended measure is cosine similarity; the code below computes a plain dot product, which gives the same ranking as cosine similarity when all sentence embeddings have the same norm, but not in general.
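
The four steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's exact code: the function name `summarize`, the `seed` parameter, and the toy 2-D vectors are assumptions made for illustration, and the toy vectors stand in for real BERT sentence embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans


def summarize(sentences, embeddings, k, seed=0):
    """Return up to k sentences, one per cluster, in their original order."""
    embeddings = np.asarray(embeddings, dtype=float)

    # Cluster the sentence embeddings into k groups.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    centroids = kmeans.cluster_centers_

    # Cosine similarity is the dot product of L2-normalized vectors.
    unit_emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_cen = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarity = unit_cen @ unit_emb.T  # shape: (k, n_sentences)

    # Closest sentence per centroid, duplicates dropped, text order kept.
    chosen = sorted(set(similarity.argmax(axis=1).tolist()))
    return [sentences[i] for i in chosen]


# Toy stand-in for BERT output: two tight groups of 2-D "embeddings".
sentences = ["A1", "A2", "A3", "B1", "B2", "B3"]
embeddings = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1],
              [0.1, 1.0], [0.0, 1.0], [-0.1, 1.0]]
summary = summarize(sentences, embeddings, k=2)
print(summary)
```

With two well-separated groups, the sketch returns the middle sentence of each group, since those lie closest in angle to the two cluster centers.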

* **BERT**
  (Bidirectional Encoder Representations from Transformers) was a breakthrough paper published by Google in 2018. It outperformed the then state-of-the-art architectures on a number of NLP tasks. It uses the transformer's encoder structure to encode sentences. The representations generated by BERT are contextualized, i.e. they take into account the context in which a word is used in the sentence. Earlier pre-trained embeddings like word2vec and GloVe were context-free.
* I have used the following library to generate sentence embeddings: https://github.com/hanxiao/bert-as-service
  
    

In [1]:
import pickle
import numpy as np
import pandas as pd
from langdetect import detect,detect_langs
from nltk import sent_tokenize
from bert_serving.client import BertClient
from sklearn.cluster import KMeans

In [2]:
df = pd.read_pickle('./eng_df')

In [3]:
df.head()

Unnamed: 0,Cleaned Emails,lang
0,"Merci pour votre message! Malheureusement, mon...",en
2,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,en
3,Grazie mille per aver trovato il tempo per met...,en
4,Thank you so much for reaching out and taking ...,en
5,Thank you so much for reaching out and taking ...,en


In [81]:
bert = BertClient()
def get_embeddings(row):
    '''Generates BERT sentence embeddings. Requires a bert-as-service server that is already running (started from a terminal).'''
    
    text = row['Cleaned Emails']
    sents = sent_tokenize(text)
    embeddings = bert.encode(sents)
    return embeddings

df['Embeddings'] = df.apply(get_embeddings,axis=1)

here is what you can do:
- or, start a new server with a larger "max_seq_len"


Unnamed: 0,Cleaned Emails,lang,Embeddings
0,"Merci pour votre message! Malheureusement, mon...",en,"[[-0.17367607, 0.06847995, 0.1780457, -0.28497..."
2,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,en,"[[-0.14160722, -0.5160878, 0.37131366, 0.02110..."
3,Grazie mille per aver trovato il tempo per met...,en,"[[-0.7721293, -0.2312476, 1.1882508, -0.083789..."
4,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887..."
5,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887..."
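
The warning printed during encoding suggests that some sentences are longer than the server's `max_seq_len`, in which case their embeddings may be computed from truncated input. A rough way to see which sentences are affected is sketched below. The helper name `long_sentences` and the limit of 25 are placeholders chosen for illustration, and counting whitespace-separated words only approximates the number of WordPiece tokens BERT actually sees.

```python
def long_sentences(sentences, max_seq_len=25):
    """Return (index, word_count) pairs for sentences that likely exceed max_seq_len.

    Word counts are a lower bound on BERT's token counts, so any sentence
    flagged here is very likely too long for the given limit.
    """
    flagged = []
    for i, sent in enumerate(sentences):
        n_words = len(sent.split())
        if n_words > max_seq_len:
            flagged.append((i, n_words))
    return flagged


example = [
    "Please excuse the delayed response.",
    " ".join(["word"] * 30),  # an artificially long sentence of 30 words
]
flagged = long_sentences(example, max_seq_len=25)
print(flagged)
```

If many sentences are flagged, restarting the server with a larger `max_seq_len`, as the warning proposes, is the more reliable fix.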


In [83]:
def get_cluster_centers(row):
    '''
    Clusters the sentence embeddings of the text with K-means. The number of clusters, i.e. the number of
    sentences in the summary, is the square root of the total number of sentences, rounded up.
    Returns the cluster centers.
    '''
    
    text = row['Cleaned Emails']
    sents = sent_tokenize(text)
    clusters = int(np.ceil(len(sents)**0.5))
    embeddings = row['Embeddings']
    kmeans = KMeans(n_clusters=clusters).fit(embeddings)
    
    return kmeans.cluster_centers_

df['Cluster Centers'] = df.apply(get_cluster_centers,axis=1)
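
Both rules for choosing k mentioned earlier (square root of the sentence count, or 30% of it) can be written as one small helper. This is a sketch under assumptions: the name `choose_k`, the `strategy` and `ratio` parameters, and rounding up in the 30% case are illustrative choices, not part of the notebook.

```python
import math


def choose_k(n_sentences, strategy="sqrt", ratio=0.3):
    """Pick the number of summary sentences for a text with n_sentences sentences."""
    if n_sentences < 1:
        raise ValueError("need at least one sentence")
    if strategy == "sqrt":
        k = math.ceil(math.sqrt(n_sentences))
    elif strategy == "ratio":
        k = math.ceil(ratio * n_sentences)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # K-means cannot form more clusters than there are sentences.
    return max(1, min(k, n_sentences))


for n in (1, 7, 20, 100):
    print(n, choose_k(n, "sqrt"), choose_k(n, "ratio"))
```

For short emails the two rules give similar values; for long texts the square-root rule grows much more slowly, which keeps summaries short.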

In [84]:
df.head()

Unnamed: 0,Cleaned Emails,lang,Embeddings,Cluster Centers
0,"Merci pour votre message! Malheureusement, mon...",en,"[[-0.17367607, 0.06847995, 0.1780457, -0.28497...","[[-0.42146394, 0.30830446, 0.1823968, -0.21394..."
2,この問題についてご連絡いただきありがとうございます。 申し訳ありませんが私は日本語語が話せま...,en,"[[-0.14160722, -0.5160878, 0.37131366, 0.02110...","[[-0.14160722, -0.5160878, 0.37131366, 0.02110..."
3,Grazie mille per aver trovato il tempo per met...,en,"[[-0.7721293, -0.2312476, 1.1882508, -0.083789...","[[0.12635724, -0.29532984, 0.37011546, 0.21694..."
4,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887...","[[0.16755143, -0.06252824, 0.43795574, 0.03444..."
5,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887...","[[-0.007928453, -0.075849526, 0.17494668, -0.0..."


In [94]:
def get_summary(row):
    '''
    Generates a summary by choosing, for each cluster centroid, the sentence in the text
    whose embedding is most similar to that centroid.
    '''
    text = row['Cleaned Emails']
    sents = sent_tokenize(text)
    centroids = row['Cluster Centers']
    embeddings = row['Embeddings']
    clusters = centroids.shape[0]
    sents_len = len(sents)
    summary = []
    for i in range(clusters):
        select = -1
        m = -np.inf
        for j in range(sents_len):
            # Unnormalized dot product; this equals cosine similarity only for unit-length vectors.
            similarity = np.dot(centroids[i],embeddings[j])
            if similarity > m:
                m = similarity
                select = j
        summary.append(select)
    # Restore the original sentence order.
    summary.sort()
    # Sentences are joined without a separator, so they run together in the printed summary.
    summary = ''.join([sents[i] for i in summary])
    return summary
    
df['Summary'] = df.apply(get_summary,axis=1)  
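
A dot product and cosine similarity can select different sentences, because the dot product favors embeddings with a large norm. The sketch below shows this on toy 2-D vectors and gives a cosine-based variant of the selection step. The helper `pick_sentences` and the toy data are assumptions for illustration; unlike `get_summary`, the variant also drops repeated selections and joins sentences with a space.

```python
import numpy as np


def pick_sentences(sents, embeddings, centroids):
    """Join the sentences with the highest cosine similarity to each centroid."""
    emb = np.asarray(embeddings, dtype=float)
    cen = np.asarray(centroids, dtype=float)
    # Normalize rows to unit length so the dot product becomes cosine similarity.
    emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cen_unit = cen / np.linalg.norm(cen, axis=1, keepdims=True)
    best = (cen_unit @ emb_unit.T).argmax(axis=1)
    # Drop duplicates and restore the original sentence order.
    chosen = sorted(set(best.tolist()))
    return " ".join(sents[i] for i in chosen)


# Toy data: the second vector is long but points away from the centroid.
sents = ["Short but on topic.", "Long and off topic."]
embeddings = np.array([[1.0, 0.0], [3.0, 3.0]])
centroid = np.array([[2.0, 0.2]])

dot_choice = int((centroid @ embeddings.T).argmax())  # raw dot product
cosine_summary = pick_sentences(sents, embeddings, centroid)
print(sents[dot_choice])
print(cosine_summary)
```

Here the raw dot product picks the long vector, while the cosine-based variant picks the sentence whose direction matches the centroid.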

In [104]:
df.head()

Unnamed: 0,Cleaned Emails,lang,Embeddings,Cluster Centers,Summary
0,"Merci pour votre message! Malheureusement, mon...",en,"[[-0.17367607, 0.06847995, 0.1780457, -0.28497...","[[-0.42146394, 0.30830446, 0.1823968, -0.21394...","J'espère que ça ne vous dérange pas, mais je v..."
3,Grazie mille per aver trovato il tempo per met...,en,"[[-0.7721293, -0.2312476, 1.1882508, -0.083789...","[[0.12635724, -0.29532984, 0.37011546, 0.21694...",Grazie mille per aver trovato il tempo per met...
4,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887...","[[0.16755143, -0.06252824, 0.43795574, 0.03444...",Thank you so much for reaching out and taking ...
5,Thank you so much for reaching out and taking ...,en,"[[0.114546, -0.41043538, 0.4535624, 0.09567887...","[[-0.007928453, -0.075849526, 0.17494668, -0.0...",If you can access via Touch ID you can simpl...
8,Thank you so much for reaching out and taking ...,en,"[[0.33351135, -0.31490558, 0.3397996, 0.299524...","[[-0.08749674, -0.29593503, 0.05410941, 0.1850...",We really appreciate it and are happy to hear ...


In [105]:
df.to_pickle('./finaldf')

In [115]:
print(df['Cleaned Emails'][4])
print('-'*120)
print(df['Summary'][4])

Thank you so much for reaching out and taking the time to contact us about this issue! Please excuse the delayed response. I'm happy to inform you that you can already enlarge the front and back pictures of your cards simply by tapping on it once. Your card pictures will then get enlarged as well as rotated. However, I will also suggest to our developers to make zooming already in the "Notes" tab possible for future versions of  . I hope I was able to help you. If you have any further questions, suggestions for improvements or general feedback, please don't hesitate to contact me again.
------------------------------------------------------------------------------------------------------------------------
Thank you so much for reaching out and taking the time to contact us about this issue!However, I will also suggest to our developers to make zooming already in the "Notes" tab possible for future versions of  .If you have any further questions, suggestions for improvements or general 

In [118]:
print(df['Cleaned Emails'][8])
print('-'*120)
print(df['Summary'][8])

Thank you so much for reaching out and taking the time to send us feedback! We really appreciate it and are happy to hear that you like our app!We are aware that the current loading times of our app have increased over the course of the latest updates and I can assure you that our developers are already working on improving the speed and overall performance again for future releases. Until then, you could try keeping   opened in the background while shopping as a temporary workaround. This way, the app doesn't have to reload all your information (e.g. card pictures, points balances, etc.) completely each time it is opened and the loading times will be decreased significantly. In the meantime, I sincerely apologize for the inconvenience this causes and hope that you can use   in its full capacity again soon.If you have any further questions, suggestions for improvements or general feedback, please don't hesitate to contact me again.
------------------------------------------------------