# Summarizer
The objective is to produce a summary from a list of documents (texts).
- The documents here are short (headlines, tweets), so there are two options: single-document summarization, treating each headline as a sentence of one large document, or multi-document summarization, where each headline is its own document. Note that single-document summarizers often use a sentence's position in the document as a signal of its importance, which is unlikely to be meaningful when the "sentences" are independent headlines.

Parts of the study
- Some form of evaluation can also be used when applying the summarizer across documents

# Quick start
The easiest approach is to embed each headline in a list with sentence transformers and take the headline most similar to the rest of its cluster as the representative.
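
To make the idea concrete before loading a real model, here is a minimal NumPy-only sketch of picking representatives by cosine similarity to the cluster centroid. The function name `top_representatives` and the toy vectors are hypothetical and purely illustrative; this is not the implementation of `cluster_summary_simple` used in the cells that follow, and "similarity to the centroid" is one assumed reading of "most similar within the cluster".

```python
import numpy as np

def top_representatives(embeddings, top_n=1):
    """Return indices of the top_n rows closest (by cosine) to the centroid.

    Hypothetical helper for illustration only.
    """
    emb = np.asarray(embeddings, dtype=float)
    centroid = emb.mean(axis=0)
    # Cosine similarity of every row to the centroid
    sims = emb @ centroid / (np.linalg.norm(emb, axis=1) * np.linalg.norm(centroid))
    # Highest similarity first
    return np.argsort(-sims)[:top_n]

# Toy 2-D "embeddings": three headlines pointing roughly the same way, one outlier
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.2]])
toy_txt = ["headline a", "headline b", "headline c", "outlier"]

best = top_representatives(toy_emb, top_n=1)
summary = [toy_txt[i] for i in best]
print(summary)
```

With real data, the toy array would be replaced by the output of the sentence-transformer encoder and the toy strings by the headlines themselves.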

In [2]:
from sentence_transformers import SentenceTransformer
model1 = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

In [3]:
import pandas as pd
df = pd.read_csv('../et/data/abcnews-date-text_sample.csv',index_col=0)

In [4]:
clust_txt=list(df.headline_text.values)

In [5]:
%%time
clust_emb = model1.encode(clust_txt, batch_size=16,
                          show_progress_bar=False, convert_to_numpy=True)

CPU times: user 25.4 s, sys: 258 ms, total: 25.6 s
Wall time: 4.3 s


In [6]:
%load_ext autoreload

In [7]:
%autoreload 2
from picture_text_summary import unroll_tree_map, cluster_summary_simple
from hac_tools import HAC
from treemap import build_tree_map

In [8]:
df_res = unroll_tree_map(clust_emb, max_extension=3)

In [44]:
import numpy as np
# Summarize each cluster from its member headlines and their precomputed embeddings
df_res['summary'] = df_res.apply(
    lambda x: cluster_summary_simple(
        [np.array(clust_txt[m]) for m in x['cluster_members']],
        model1,
        [np.array(clust_emb[m]) for m in x['cluster_members']]),
    axis=1)


In [45]:
df_res['summary_parent']=df_res.parent.apply(lambda x: '' if x == 'all' else df_res.summary[x])



In [46]:
df_res1=df_res.drop(['id','parent'], axis=1).rename(columns={'summary':'id','summary_parent':'parent'})
df_res1['id']=df_res1['id'].apply(lambda x: x[:3])
df_res1['parent']=df_res1['parent'].apply(lambda x: x[:3])


In [47]:
df_res1


Unnamed: 0,cluster_members,value,cluster_table,color,id,parent
-1,[],1000,{},1.0,all,
260,[260],1,"{260: [260, '', '', 0, 1]}",0.001,fam,all
1990,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",992,"{0: [0, '', '', 0, 1], 1: [1, '', '', 0, 1], 2...",0.992,sto,all
505,[505],1,"{505: [505, '', '', 0, 1]}",0.001,cou,all
959,[959],1,"{959: [959, '', '', 0, 1]}",0.001,yem,all
254,[254],1,"{254: [254, '', '', 0, 1]}",0.001,eng,all
433,[433],1,"{433: [433, '', '', 0, 1]}",0.001,uzb,all
719,[719],1,"{719: [719, '', '', 0, 1]}",0.001,cro,all
986,[986],1,"{986: [986, '', '', 0, 1]}",0.001,cla,all
973,[973],1,"{973: [973, '', '', 0, 1]}",0.001,blo,all


In [50]:
build_tree_map(df_res1.iloc[::-1])

In [49]:
from sklearn.metrics.pairwise import (
    paired_cosine_distances, cosine_similarity,
    paired_euclidean_distances, paired_manhattan_distances, cosine_distances)


In [36]:
clust_avg=np.mean(clust_emb, axis=0, keepdims=True)

In [39]:
np.shape(cosine_distances(clust_emb,clust_avg))

(1000, 1)

In [50]:
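import numpy as np
# Sum of cosine similarities between each headline and all headlines (itself included);
# needed by the sort on 'inner_similarity' below
df['inner_similarity'] = np.sum(1 - cosine_distances(clust_emb), axis=1)
# Cosine similarity to the mean embedding, rescaled from [-1, 1] to [0, 1];
# ravel() flattens the (n, 1) result into a 1-D column
df['cluster_rank'] = 0.5 * (1 + cosine_similarity(clust_emb, clust_avg).ravel())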
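
The two scores, `inner_similarity` and `cluster_rank`, are closely related, which may explain why they order the headlines almost identically. For unit-normalized embeddings $u_i$, the summed cosine similarity is $\sum_j u_i \cdot u_j = n \, (u_i \cdot \bar{u})$, where $\bar{u}$ is the mean of the normalized vectors. Ranking by summed similarity is therefore the same as ranking by similarity to that mean. The sketch below checks the identity on random toy data; the array `emb` is a stand-in, not the notebook's `clust_emb`. Note that `clust_avg` in this notebook averages the raw (unnormalized) embeddings, so small differences in ordering between the two scores are to be expected.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))  # toy stand-in for an embedding matrix

# Normalize rows so that dot products are cosine similarities
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Score 1: sum of cosine similarities to every row (including itself)
inner_similarity = (unit @ unit.T).sum(axis=1)

# Score 2: n times the dot product with the mean of the normalized rows
centroid_unit = unit.mean(axis=0)
via_centroid = len(unit) * (unit @ centroid_unit)

print(np.allclose(inner_similarity, via_centroid))
```

A practical consequence, under these assumptions, is that the centroid form needs only one pass over the embeddings rather than the full pairwise similarity matrix.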

In [24]:
df.sort_values('inner_similarity',ascending=False)

Unnamed: 0,publish_date,headline_text,inner_similarity
819,20030222,wa opp says police will be taken off the beat,144.007599
660,20030221,stolen wage report looks to improve indigenous,143.629669
133,20030219,police defend aboriginal tent embassy raid,143.467728
395,20030220,saff to gauge feelings on planned crown lands ...,142.455933
222,20030220,call for ambos help in wake of funding changes,142.176910
...,...,...,...
98,20030219,more than 40 pc of young men drink alcohol at,18.962337
300,20030220,iran military plane crash kills 302,11.452745
35,20030219,death toll continues to climb in south korean ...,8.362158
301,20030220,iran plane crashes with at least 250 aboard tv,2.499015


In [47]:
df.sort_values('cluster_rank',ascending=False)

Unnamed: 0,publish_date,headline_text,inner_similarity,cluster_rank
819,20030222,wa opp says police will be taken off the beat,144.007599,0.496646
660,20030221,stolen wage report looks to improve indigenous,143.629669,0.496389
133,20030219,police defend aboriginal tent embassy raid,143.467728,0.494510
222,20030220,call for ambos help in wake of funding changes,142.176910,0.492250
395,20030220,saff to gauge feelings on planned crown lands ...,142.455933,0.488922
...,...,...,...,...
260,20030220,families confront korean president elect over,19.677736,0.067093
300,20030220,iran military plane crash kills 302,11.452745,0.040750
35,20030219,death toll continues to climb in south korean ...,8.362158,0.031516
301,20030220,iran plane crashes with at least 250 aboard tv,2.499015,0.007392


In [62]:
df.sort_values('cluster_rank',ascending=False).head(2).headline_text.values

array(['wa opp says police will be taken off the beat',
       'stolen wage report looks to improve indigenous'], dtype=object)

In [74]:
cluster_summary_simple(clust_txt[:10], model1, top_n=1)

'air nz staff in aust strike for pay rise'