# **Automatic Summarization**

Simple algorithm:
1. Identify the topic of the text as a list of Nasari vectors (term1, score1, term2, score2, ...)
2. Build the context by collecting the vectors of the terms found in the previous step. If the title is too short (and therefore not very informative), I can also look up the vectors of terms that appear in the definitions of the terms found in step 1
3. Weight each paragraph by summing the weights of its terms that belong to the context, using the Weighted Overlap. Keep only the paragraphs whose weight exceeds a threshold

Evaluation:
- BLEU
- ROUGE
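
As a rough illustration of the second metric, ROUGE-N recall can be computed by hand on word tokens. This is a minimal sketch, not the notebook's evaluation code: the helper names `ngrams` and `rouge_n_recall` and the toy sentences are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams found in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched reference n-grams divided by all reference n-grams."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs in the candidate
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matches / total

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```

For a summary, `reference` would be the tokens of the full document (or of a gold summary) and `candidate` the tokens of the selected paragraphs.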

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import random
import math

In [2]:
doc_paths = ['/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Andy-Warhol.txt',
'/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt',
'/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt',
'/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Napoleon-wiki.txt', 
'/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Trump-wall.txt']

### Utility functions for preprocessing

In [3]:
# Useful to remove punctuation from the first or last char of a token
# - Example: without this function "It's" becomes "It" and "'s"
# - "It" is removed because it is a stopword, while "'s" is kept because it is neither a stopword nor punctuation
# - This function strips the ' from 's and then removes any stopwords again.
def remove_first_last(tokens, punct, stop):
    for i in range(len(tokens)):
        for p in punct:
            if tokens[i].startswith(p):
                tokens[i] = tokens[i][1:]
            if tokens[i].endswith(p):
                tokens[i] = tokens[i][:-1]
    tokens = [t for t in tokens if t not in stop]
    return tokens

# Remove stopwords and punctuation from the text, tokenize it and lemmatize it
def preprocess(text):
    text = text.lower()
    stop = []
    with open('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/stop_words_FULL.txt', 'r') as f:
        stop = f.read().splitlines()
    stop = set(stop)
    punct = ['.', ',', '!', '?', ':', ';', '(', ')', '[', ']', '{', '}', '"', "'", '``', "''", '...', '’', '“', '”', '‘']
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop and t not in punct]
    lemmatizer = WordNetLemmatizer()
    tokens = list(set([lemmatizer.lemmatize(t) for t in tokens]))
    tokens = remove_first_last(tokens, punct, stop)
    return tokens

# try preprocessing
preprocess("This is a test. It's a test of the pre-processing system. Millions of people are using it.")

['pre-processing', 'people', 'system', 'test']

### Parsing the Nasari input file and building the dictionary of Nasari vectors

In [4]:
# Parsing the Nasari file and creating a dictionary with:
# - key: word
# - value: dictionary with:
#          - key: lemma
#          - value: score
nasari = {}
with open('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/dd-small-nasari-15.txt', 'r') as f:
    lines = [line.rstrip('\n') for line in f]
    for line in lines:
        line = line.split(';')
        tmp = {}
        for lemma in line[2:]:
            lemma = lemma.split('_')
            if len(lemma) > 1:
                tmp[lemma[0]] = lemma[1]
        nasari[line[1].lower()] = tmp

# nasari
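
To make the expected line layout concrete, here is a small sketch that parses a single line in the same `id;word;lemma_score;...` shape the cell above assumes. The sample line is made up for illustration, and `parse_nasari_line` is a hypothetical helper, not part of the notebook. Unlike the cell above, it converts scores to `float` and splits on the last underscore, so lemmas that themselves contain underscores are kept whole.

```python
def parse_nasari_line(line):
    """Parse one 'id;word;lemma_score;...' line into (word, {lemma: score})."""
    fields = line.rstrip('\n').split(';')
    word = fields[1].lower()
    vector = {}
    for item in fields[2:]:
        # Split on the LAST underscore: the score is always the final part
        lemma, sep, score = item.rpartition('_')
        if sep and lemma:
            vector[lemma] = float(score)
    return word, vector

# Made-up sample line (hypothetical id, lemmas and scores)
sample = "bn:00000000n;Example;alpha_120.5;beta_80.25;gamma_10.0"
word, vector = parse_nasari_line(sample)
print(word)    # example
print(vector)  # {'alpha': 120.5, 'beta': 80.25, 'gamma': 10.0}
```

Because Python dictionaries preserve insertion order, the position of a lemma in `vector` can be used as its rank, which is what the Weighted Overlap relies on.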

### Saving the input documents

Each document becomes a list of paragraphs

In [5]:
# Save document
def save_doc(filename):
    doc = []
    with open(filename, 'r') as f:
        lines = [line.rstrip('\n') for line in f]
        for line in lines:
            if '#' not in line and line != '': # skip empty lines and any line containing '#' (such as the first line with the link)
                doc.append(line)
    return doc

# try save_doc
# save_doc('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')

## 1. Identify the topic

In [6]:
# Get title from document, considering the first line
def get_title(filename):
    doc = save_doc(filename)
    return doc[0]
    # return preprocess(doc[0])

# Get topic words from the text checking if they are in the Nasari dictionary
def get_topic_words(text):
    tokens = preprocess(text)
    topic_words = [t for t in tokens if t in nasari.keys()]
    return topic_words

# So far used only for testing
# Get a random paragraph (raw text) from the document, excluding the title
def get_random_paragraph(filename):
    doc = save_doc(filename)
    paragraph = random.choice(doc[1:])
    return paragraph

# print(get_title('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt'))
# print(get_topic_words(get_title('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt')))
# get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')

## 2. Create the context

In [7]:
# Create the context for a document title
# - It returns a list of the Nasari vectors (dictionaries) of the title's topic words that are in the Nasari dictionary
def create_context(title):
    topic_words = get_topic_words(title)
    context_vector = [nasari[word] for word in topic_words]
    return context_vector

# Create the context for a paragraph
# - It returns a list of dictionaries associated to the topic words of the paragraph if they are in the Nasari dictionary
def create_paragraph_context(paragraph):
    topic = [w for w in paragraph if w in nasari.keys()]
    context_vector = [nasari[word] for word in topic]
    return context_vector

# try create_paragraph_context
# for i in range(10):
#     paragraph = preprocess(get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt'))
#     print(paragraph)
#     print(create_paragraph_context(paragraph))
# create_paragraph_context(preprocess(get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')))
# create_context(get_title('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt'))


## 3. Retain paragraphs whose sentences contain the most salient terms, based on the Weighted Overlap

### Weighted Overlap implementation

In [8]:
# Get the overlap between the context vectors and the paragraph tokens:
# the paragraph tokens that appear as a dimension in at least one context vector
def get_overlap(context, paragraph):
    overlap = set()
    for w in paragraph:
        for vector in context:
            if w in vector:
                overlap.add(w)
    return overlap

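# Get the rank of a lemma as its best (lowest) 1-based position among a list of vectors
# Returns math.inf if the lemma is in none of them
def get_rank(lemma, vectors):
    best = math.inf
    for vector in vectors:
        for position, key in enumerate(vector, start=1):
            if key == lemma:
                best = min(best, position)
    return best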

# Compute a weighted-overlap score between the title context and a paragraph
# Note: the Weighted Overlap is usually defined as the sum over the overlap of (rank1 + rank2)^-1,
# divided by the sum for i = 1..|overlap| of (2i)^-1.
# This function instead returns sum(2i) / sum(rank1 + rank2): a ratio of sums, which is not
# equivalent to a ratio of sums of reciprocals, so the score can exceed 1.
# If an overlapping lemma appears in no vector of par_context, its rank is math.inf and the score is 0.0.
def weighted_overlap(context, paragraph, par_context):
    overlap = get_overlap(context, paragraph)
    # print(f'overlap: {overlap}')
    # print(f'length: {len(overlap)}')
    if overlap:
        i = 1
        num = 0
        den = 0
        for lemma in overlap:
            den += get_rank(lemma, context) + get_rank(lemma, par_context) # sum of the rank pairs
            num += 2 * i # best possible rank sum for the i-th overlapping lemma
            i += 1
            # print(f'lemma: {lemma}, title rank: {get_rank(lemma, context)}, paragraph rank: {get_rank(lemma, par_context)}')
            # print(f'total rank: {get_rank(lemma, context) + get_rank(lemma, par_context)}')
            # print(f'num: {num}, den: {den}')
        return num / den
    return 0

# for i in range(4):
#     par_processed = get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt')
#     title = get_title('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Life-indoors.txt')
#     context = create_context(title)
#     par_processed = preprocess(par_processed)
#     par_context = create_paragraph_context(par_processed)
#     print(weighted_overlap(context, par_processed, par_context))
#     print('-' * 70)
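
For comparison, the Weighted Overlap between two ranked vectors is usually written as a ratio of sums of reciprocals, WO(v1, v2) = Σ_{q∈O} (r1(q) + r2(q))^-1 / Σ_{i=1..|O|} (2i)^-1, where O is the set of shared dimensions and r1, r2 are 1-based ranks. The sketch below follows that form for two single vectors. It is not the function used in this notebook (which compares a list of context vectors with a paragraph); `weighted_overlap_ranked` and the toy vectors are assumptions made for the example.

```python
def weighted_overlap_ranked(v1, v2):
    """Weighted Overlap of two vectors given as dicts ordered from best to worst dimension."""
    rank1 = {key: i for i, key in enumerate(v1, start=1)}
    rank2 = {key: i for i, key in enumerate(v2, start=1)}
    overlap = rank1.keys() & rank2.keys()
    if not overlap:
        return 0.0
    # Numerator: reciprocal of the summed ranks of each shared dimension
    num = sum(1 / (rank1[q] + rank2[q]) for q in overlap)
    # Denominator: the best achievable value, when shared dimensions hold ranks 1..|O| in both vectors
    den = sum(1 / (2 * i) for i in range(1, len(overlap) + 1))
    return num / den

# Toy vectors (made-up scores); dict order encodes the rank
a = {'virus': 3.0, 'disease': 2.0, 'fever': 1.0}
b = {'disease': 5.0, 'virus': 4.0, 'outbreak': 1.0}
print(weighted_overlap_ranked(a, b))  # (1/3 + 1/3) / (1/2 + 1/4) = 8/9
print(weighted_overlap_ranked(a, a))  # identical rankings give the maximum score
```

With this form the score stays in the range [0, 1], and it reaches 1 only when the shared dimensions occupy the top ranks in the same order in both vectors.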


## Computing the Weighted Overlap for every paragraph of a document

In [13]:
# Compute the weighted overlap between a document title and all the paragraphs
def weight_doc(filename):
    title = get_title(filename)
    context = create_context(title)
    doc = save_doc(filename)
    paragraphs = [preprocess(par) for par in doc[1:]]
    par_context = [create_paragraph_context(par) for par in paragraphs]
    weighted_overlap_list = [weighted_overlap(context, paragraphs[i], par_context[i]) for i in range(len(paragraphs))]
    return weighted_overlap_list

for path in doc_paths:
    print(get_title(path))
    print(weight_doc(path))
    print('-' * 70)

Andy Warhol: Why the great Pop artist thought ‘Trump is sort of cheap’
[1.0, 0, 1.0, 1.0, 0, 0.5, 1.0, 0.0, 0.3333333333333333, 0, 1.0, 0.35294117647058826, 0, 0.0, 0, 0, 0, 0, 1.0]
----------------------------------------------------------------------
Ebola virus disease
[1.5384615384615385, 0.75, 0.75, 1.2, 0.4, 0, 0.8571428571428571, 0.4, 0, 0.4, 0, 0.4, 0, 0.5, 0, 1.5, 0, 0.48, 1.0891089108910892, 0, 1.5384615384615385, 1.0, 1.0909090909090908, 0.5, 1.3333333333333333, 0.0, 1.5]
----------------------------------------------------------------------
How people around the world are coping with life indoors
[0.6, 0.6, 0.6, 0, 0, 0.18181818181818182, 0.4, 0.3333333333333333, 0, 0, 0.3333333333333333, 0.3333333333333333]
----------------------------------------------------------------------
Napoleone Bonaparte.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
----------------------------------------------------------------------
The Trump wall, commonly referred to as "The Wall", was

## Selecting the best paragraphs

The top 70, 80, or 90% of the paragraphs by weight are selected, depending on the requested percentage

In [22]:
# Calculate how many paragraphs to keep given the fraction of the document to retain
def get_threshold(doc, percentage):
    total = len(doc[1:])
    threshold = math.ceil(total * percentage)
    return threshold

# Select the best paragraphs: the title plus the top-weighted paragraphs, in decreasing order of weight
def select_best_paragraphs(filename, percentage):
    weighted_overlap_list = weight_doc(filename)
    doc = save_doc(filename)
    paragraphs = doc[1:]
    best_paragraphs = [doc[0]]
    threshold = get_threshold(doc, percentage)
    print(f'threshold with percentage {percentage*100}%: {threshold} out of {len(doc[1:])} paragraphs')
    # Rank paragraph indices by weight (stable sort: ties keep their document order)
    ranked = sorted(range(len(paragraphs)), key=lambda i: weighted_overlap_list[i], reverse=True)
    for i in ranked[:threshold]:
        best_paragraphs.append(paragraphs[i])
    return best_paragraphs

# threshold = get_threshold(save_doc('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt'), 0.9)
# select_best_paragraphs('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt', 0.7)
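
`select_best_paragraphs` returns the paragraphs in decreasing order of weight, so the summary does not necessarily follow the order of the original text. A possible variant, sketched here under the assumption that reading order matters for the summary, picks the same top fraction and then restores the document order. `top_fraction_in_order` is a hypothetical helper that works on plain lists rather than on files.

```python
import math

def top_fraction_in_order(paragraphs, weights, fraction):
    """Keep the top `fraction` of paragraphs by weight, returned in document order."""
    keep = math.ceil(len(paragraphs) * fraction)
    # Indices sorted by decreasing weight; the sort is stable, so ties keep document order
    ranked = sorted(range(len(paragraphs)), key=lambda i: weights[i], reverse=True)
    # Re-sort the chosen indices so the summary reads in the original order
    chosen = sorted(ranked[:keep])
    return [paragraphs[i] for i in chosen]

# Toy data (made up for the example)
paragraphs = ['p0', 'p1', 'p2', 'p3']
weights = [0.2, 1.0, 0.0, 0.6]
print(top_fraction_in_order(paragraphs, weights, 0.5))  # ['p1', 'p3']
print(top_fraction_in_order(paragraphs, weights, 0.7))  # ['p0', 'p1', 'p3']
```

Since order-sensitive metrics depend on where each paragraph lands, switching to this variant would change any n-gram score computed over the selected paragraphs.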

### Computing BLEU and ROUGE
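
The `bleu` function in this section passes lists of whole paragraphs to `sentence_bleu`, so every paragraph counts as one token. As a word-level point of comparison, the core of BLEU is the modified (clipped) n-gram precision, which can be sketched without any library. `modified_precision` is a hypothetical helper written for this example; a full BLEU score also combines several n-gram orders and applies a brevity penalty, which are left out here.

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: candidate n-grams found in the reference, capped by reference counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clipping stops a candidate from being rewarded for repeating a matching word
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(modified_precision(candidate, reference, n=1))  # 2/7: "the" occurs only twice in the reference
```

For the summaries produced here, `candidate` would be the word tokens of the selected paragraphs and `reference` the word tokens of the full document.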

In [29]:
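# I don't fully understand how this works
# Compute a BLEU score for a document, comparing the selected paragraphs against the full document
# As written, each paragraph is passed as a single token, so the n-grams are sequences of whole paragraphs, not words;
# weights=(0.5, 0.5) gives equal weight to unigram and bigram precision
def bleu(filename, percentage):
    best_paragraphs = select_best_paragraphs(filename, percentage)
    doc = save_doc(filename)
    score = nltk.translate.bleu_score.sentence_bleu([doc], best_paragraphs, weights=(0.5, 0.5))
    return score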

# Compute ROUGE score for a document
# def rouge(filename, percentage):
#     best_paragraphs = select_best_paragraphs(filename, percentage)
#     doc = save_doc(filename)
#     rouge = rouge_score.rouge_n(best_paragraphs, doc, n=2)
#     return rouge

for path in doc_paths:
    print(get_title(path))
    percents = [0.7, 0.8, 0.9, 1]
    for p in percents:
        print(f'BLEU score with {p*100}%: {bleu(path, p)}')
        print('_' * 40)
    print('-' * 100)

Andy Warhol: Why the great Pop artist thought ‘Trump is sort of cheap’
threshold with percentage 70.0%: 14 out of 19 paragraphs
BLEU score with 70.0%: 0.27082337919563315
________________________________________
threshold with percentage 80.0%: 16 out of 19 paragraphs
BLEU score with 80.0%: 0.41911171621149995
________________________________________
threshold with percentage 90.0%: 18 out of 19 paragraphs
BLEU score with 90.0%: 0.5477492206756237
________________________________________
threshold with percentage 100%: 19 out of 19 paragraphs
BLEU score with 100%: 0.6069769786668839
________________________________________
----------------------------------------------------------------------------------------------------
Ebola virus disease
threshold with percentage 70.0%: 19 out of 27 paragraphs
BLEU score with 70.0%: 0.21748054096067912
________________________________________
threshold with percentage 80.0%: 22 out of 27 paragraphs
BLEU score with 80.0%: 0.24260056810319283
_______