# Summarizer
The objective is to produce a summary from a list of documents (texts).
- We are dealing with short documents (headlines, tweets), so there are two options: single-document summarization, treating each headline as a sentence in one large document, or multi-document summarization, where each document is just a headline. Note that single-document summarizers will likely use the position of a sentence in the document as a signal when choosing summary sentences, which may not be meaningful when the sentences are independent headlines.
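
As a rough sketch of the two framings, the snippet below builds both inputs from three headlines taken from the sample data. The variable names are assumptions for illustration, and the shuffle is only one hypothetical way to check how much a position-aware summarizer depends on headline order.

```python
import random

headlines = [
    "wa opp says police will be taken off the beat",
    "police defend aboriginal tent embassy raid",
    "iran military plane crash kills 302",
]

# Single-document framing: each headline becomes one sentence of a pseudo-document
pseudo_document = ". ".join(headlines) + "."

# Multi-document framing: each headline is its own one-sentence document
documents = [[h] for h in headlines]

# Hypothetical sensitivity check: reorder the headlines with a fixed seed
shuffled = headlines.copy()
random.Random(0).shuffle(shuffled)
shuffled_document = ". ".join(shuffled) + "."
```

If a single-document summarizer returns different results for `pseudo_document` and `shuffled_document`, it is probably relying on sentence position.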

Parts of the study
- We can also use some form of evaluation when applying the summarizer across documents

# Quick start
The easiest approach is to embed each headline in the list with sentence transformers and pick the headline most similar to the rest of the cluster as its representative.
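
To make the idea concrete before loading a model, here is a minimal sketch with made-up 2-D vectors standing in for sentence embeddings. The helper name `rank_by_centroid` and the toy values are assumptions for illustration only. The sketch uses similarity to the mean embedding as the notion of "most similar within the cluster".

```python
import numpy as np

def rank_by_centroid(embeddings):
    """Return (indices sorted by similarity to the mean embedding, similarities)."""
    emb = np.asarray(embeddings, dtype=float)
    centroid = emb.mean(axis=0)
    # Cosine similarity of every row to the centroid
    sims = emb @ centroid / (np.linalg.norm(emb, axis=1) * np.linalg.norm(centroid))
    return np.argsort(-sims), sims

# Hypothetical embeddings: three items pointing the same way and one outlier
toy_emb = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-0.2, 1.0]]
order, sims = rank_by_centroid(toy_emb)
representative = int(order[0])  # index of the item closest to the centroid
outlier = int(order[-1])        # index of the item furthest from it
```

Here the second vector (index 1) comes out as the representative and the last vector as the outlier.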

In [1]:
from sentence_transformers import SentenceTransformer
model1 = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('../et/data/abcnews-date-text_sample.csv',index_col=0)

In [9]:
clust_txt=list(df.headline_text.values)

In [10]:
%%time
clust_emb = model1.encode(clust_txt, batch_size=16,
                          show_progress_bar=False, convert_to_numpy=True)

CPU times: user 24.9 s, sys: 241 ms, total: 25.2 s
Wall time: 4.22 s


In [49]:
from sklearn.metrics.pairwise import paired_cosine_distances, cosine_similarity, \
paired_euclidean_distances, paired_manhattan_distances, cosine_distances


In [36]:
clust_avg=np.mean(clust_emb, axis=0, keepdims=True)

In [39]:
np.shape(cosine_distances(clust_emb,clust_avg))

(1000, 1)

In [50]:
import numpy as np
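# Sum of each headline's cosine similarity to every headline in the cluster
df['inner_similarity']=np.sum(1 - (cosine_distances(clust_emb)),axis=1)
# Cosine similarity to the cluster centroid, rescaled from [-1, 1] to [0, 1]
df['cluster_rank']=0.5*(1 + cosine_similarity(clust_emb,clust_avg)).ravel()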

In [24]:
df.sort_values('inner_similarity',ascending=False)

Unnamed: 0,publish_date,headline_text,inner_similarity
819,20030222,wa opp says police will be taken off the beat,144.007599
660,20030221,stolen wage report looks to improve indigenous,143.629669
133,20030219,police defend aboriginal tent embassy raid,143.467728
395,20030220,saff to gauge feelings on planned crown lands ...,142.455933
222,20030220,call for ambos help in wake of funding changes,142.176910
...,...,...,...
98,20030219,more than 40 pc of young men drink alcohol at,18.962337
300,20030220,iran military plane crash kills 302,11.452745
35,20030219,death toll continues to climb in south korean ...,8.362158
301,20030220,iran plane crashes with at least 250 aboard tv,2.499015


In [47]:
df.sort_values('cluster_rank',ascending=False)

Unnamed: 0,publish_date,headline_text,inner_similarity,cluster_rank
819,20030222,wa opp says police will be taken off the beat,144.007599,0.496646
660,20030221,stolen wage report looks to improve indigenous,143.629669,0.496389
133,20030219,police defend aboriginal tent embassy raid,143.467728,0.494510
222,20030220,call for ambos help in wake of funding changes,142.176910,0.492250
395,20030220,saff to gauge feelings on planned crown lands ...,142.455933,0.488922
...,...,...,...,...
260,20030220,families confront korean president elect over,19.677736,0.067093
300,20030220,iran military plane crash kills 302,11.452745,0.040750
35,20030219,death toll continues to climb in south korean ...,8.362158,0.031516
301,20030220,iran plane crashes with at least 250 aboard tv,2.499015,0.007392


In [51]:
df.sort_values('cluster_rank',ascending=False)

Unnamed: 0,publish_date,headline_text,inner_similarity,cluster_rank
819,20030222,wa opp says police will be taken off the beat,144.007599,0.748323
660,20030221,stolen wage report looks to improve indigenous,143.629669,0.748195
133,20030219,police defend aboriginal tent embassy raid,143.467728,0.747255
222,20030220,call for ambos help in wake of funding changes,142.176910,0.746125
395,20030220,saff to gauge feelings on planned crown lands ...,142.455933,0.744461
...,...,...,...,...
260,20030220,families confront korean president elect over,19.677736,0.533547
300,20030220,iran military plane crash kills 302,11.452745,0.520375
35,20030219,death toll continues to climb in south korean ...,8.362158,0.515758
301,20030220,iran plane crashes with at least 250 aboard tv,2.499015,0.503696
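
The two scores in the tables above rank headlines almost, but not exactly, the same way (rows 222 and 395 swap places). A short check with random vectors, used here as hypothetical stand-ins for embeddings, suggests why. Summing a headline's cosine similarity to every headline is proportional to its cosine similarity to the centroid of the unit-normalized embeddings. `cluster_rank` uses the centroid of the raw embeddings instead, which is a likely reason for small differences in order.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))  # hypothetical embeddings, not real model output

# Normalize rows so dot products are cosine similarities
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Same quantity as np.sum(1 - cosine_distances(emb), axis=1)
inner_similarity = (unit @ unit.T).sum(axis=1)

# Cosine similarity of each row to the centroid of the normalized rows
unit_centroid = unit.mean(axis=0)
centroid_sim = unit @ unit_centroid / np.linalg.norm(unit_centroid)

# The two scores differ only by a positive constant factor
scale = len(emb) * np.linalg.norm(unit_centroid)
proportional = np.allclose(inner_similarity, scale * centroid_sim)
```

Because the factor is positive, both scores give the same ordering. The centroid version needs only one similarity per headline rather than the full pairwise matrix.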


In [52]:
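def cluster_summary_simple(clust_txt, model, top_n=1):
    """Return the top_n headlines closest to the cluster's mean embedding."""
    df = pd.DataFrame()
    df['titles'] = clust_txt
    # Use the model passed in, not the global model1
    clust_emb = model.encode(clust_txt, batch_size=16,
                             show_progress_bar=False, convert_to_numpy=True)
    # Centroid of the cluster in embedding space
    clust_avg = np.mean(clust_emb, axis=0, keepdims=True)
    # Cosine similarity to the centroid, rescaled from [-1, 1] to [0, 1]
    df['cluster_rank'] = 0.5 * (1 + cosine_similarity(clust_emb, clust_avg)).ravel()
    # Sort so the most representative headline comes first
    df = df.sort_values('cluster_rank', ascending=False)
    if top_n == 1:
        return df.titles.iloc[0]
    else:
        return list(df.head(top_n).titles.values)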