# **Automatic Summarization**

Simple algorithm:
1. Identify the topic of the text as a list of Nasari vectors (term1, score1, term2, score2, ...).
2. Build the context by collecting the vectors of the terms found in the previous step. If the title is too short (and therefore not very informative), also look up the vectors of the words that appear in the definitions of the terms found in step 1.
3. Weight each paragraph by the sum of the weights of its terms that belong to the context, computed with Weighted Overlap. Keep only the paragraphs whose weight is above a threshold.

Evaluation:
- BLEU
- ROUGE
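
As a rough illustration of what the two metrics measure, the sketch below computes unigram-only versions: ROUGE-1 recall (the share of reference unigrams found in the candidate summary) and a BLEU-style clipped unigram precision, without the brevity penalty or higher-order n-grams. The function names, the whitespace tokenization, and the two sample sentences are assumptions made for this example; they are not part of the notebook.

```python
from collections import Counter

def clipped_overlap(candidate_tokens, reference_tokens):
    # Count each shared token at most as often as it appears in both texts
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    return sum((cand & ref).values())

def rouge1_recall(candidate, reference):
    # Fraction of reference unigrams that also appear in the candidate
    cand, ref = candidate.lower().split(), reference.lower().split()
    return clipped_overlap(cand, ref) / len(ref) if ref else 0.0

def bleu1_precision(candidate, reference):
    # Fraction of candidate unigrams that also appear in the reference
    cand, ref = candidate.lower().split(), reference.lower().split()
    return clipped_overlap(cand, ref) / len(cand) if cand else 0.0

# Hypothetical reference and candidate summaries
reference = "ebola is a viral disease of humans"
candidate = "ebola is a disease"
print(rouge1_recall(candidate, reference))
print(bleu1_precision(candidate, reference))
```

Recall rewards summaries that cover the reference, precision rewards summaries that do not add unrelated words, so the two are usually read together.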

In [104]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import random

### Utility functions for preprocessing

In [61]:
# Strip punctuation from the first or last character of a token
# - Example: without this function, "It's" is left as "It" and "'s"
# - "It" is dropped because it is a stopword, while "'s" is kept because it is neither a stopword nor a punctuation mark
# - This function strips the ' from 's and then filters out stopwords again.
def remove_first_last(tokens, punct, stop):
    for i in range(len(tokens)):
        for p in punct:
            if tokens[i].startswith(p):
                tokens[i] = tokens[i][1:]
            if tokens[i].endswith(p):
                tokens[i] = tokens[i][:-1]
    tokens = [t for t in tokens if t not in stop]
    return tokens

# Remove stopwords and punctuation from the text, tokenize it and lemmatize it
def preprocess(text):
    text = text.lower()
    stop = []
    with open('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/stop_words_FULL.txt', 'r') as f:
        stop = f.read().splitlines()
    stop = set(stop)
    punct = ['.', ',', '!', '?', ':', ';', '(', ')', '[', ']', '{', '}', '"', "'", '``', "''", '...', '’', '“', '”', '‘']
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop and t not in punct]
    lemmatizer = WordNetLemmatizer()
    tokens = list(set([lemmatizer.lemmatize(t) for t in tokens]))
    tokens = remove_first_last(tokens, punct, stop)
    return tokens

# try preprocessing
preprocess("This is a test. It's a test of the pre-processing system.")

['system', 'test', 'pre-processing']

### Parsing the Nasari input file and building the dictionary of Nasari vectors

In [98]:
# Parsing the Nasari file and creating a dictionary with:
# - key: word
# - value: dictionary with:
#          - key: lemma
#          - value: score
nasari = {}
with open('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/dd-small-nasari-15.txt', 'r') as f:
    lines = [line.rstrip('\n') for line in f]
    for line in lines:
        line = line.split(';')
        tmp = {}
        for lemma in line[2:]:
            lemma, sep, score = lemma.rpartition('_')
            if sep:  # skip fields that carry no score
                tmp[lemma] = float(score)  # store the score as a number, not a string
        nasari[line[1].lower()] = tmp

# nasari
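
To make the resulting data structure concrete, here is a minimal sketch that parses a single line in the `id;title;lemma_score;...` layout the cell above relies on. The sample line and the `parse_nasari_line` helper are hypothetical and only illustrate the shape of one dictionary entry.

```python
def parse_nasari_line(line):
    # Fields: id ; title ; lemma_score ; lemma_score ; ...
    fields = line.rstrip('\n').split(';')
    vector = {}
    for item in fields[2:]:
        # Split on the last underscore so the score is always the final part
        lemma, sep, score = item.rpartition('_')
        if sep:  # skip items without a score
            vector[lemma] = float(score)
    return fields[1].lower(), vector

# Hypothetical line, not copied from the Nasari file
sample = "bn:00000000n;Ebola;virus_120.5;disease_98.2;outbreak_45.0"
word, vector = parse_nasari_line(sample)
print(word, vector)
```

Because Python dictionaries preserve insertion order, the position of a lemma in `vector` matches its position in the line, which is what a rank-based measure needs.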

### Loading the input documents

Each document is turned into a list of paragraphs.

In [121]:
# Save document
def save_doc(filename):
    doc = []
    with open(filename, 'r') as f:
        lines = [line.rstrip('\n') for line in f]
        for line in lines:
            if '#' not in line and line != '': # remove empty lines and the first line with the link
                doc.append(line)
    return doc

# try save_doc
# save_doc('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')

## 1. Identify the topic

In [122]:
# Get title from document, considering the first line
def get_title(filename):
    doc = save_doc(filename)
    return doc[0]

# Get topic words from the text checking if they are in the Nasari dictionary
def get_topic_words(text):
    tokens = preprocess(text)
    topic_words = [t for t in tokens if t in nasari]
    return topic_words

# Get random paragraph topic words from the document (not the title)
def get_random_paragraph(filename):
    doc = save_doc(filename)
    paragraph = random.choice(doc[1:])
    topic_words = get_topic_words(paragraph)
    return topic_words, paragraph

# get_topic_words(get_title('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt'))
# get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')

## 2. Create the context

In [1]:
# Create the context for a document title
def create_title_context(filename):
    title = get_title(filename)
    topic_words = get_topic_words(title)
    context_vector = []
    for word in topic_words:
        context_vector.append(nasari[word])
    return context_vector

# try create_title_context
# create_title_context('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')

# Create the context for a paragraph
def create_paragraph_context(paragraph):
    topic_words = get_topic_words(paragraph)
    context_vector = []
    for word in topic_words:
        context_vector.append(nasari[word])
    return context_vector

# try create_paragraph_context
for i in range(1):
    topic_words, paragraph = get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')
    # print(topic_words)
    # print(paragraph)
    print(create_paragraph_context(paragraph))
# create_paragraph_context(get_random_paragraph('/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt')[1])

NameError: name 'get_random_paragraph' is not defined

## 3. Retain paragraphs whose sentences contain the most salient terms, based on the Weighted Overlap
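
For reference, Weighted Overlap between two vectors $v_1$ and $v_2$ is commonly written as below, where $O$ is the set of dimensions (lemmas) shared by the two vectors and $\mathrm{rank}(q, v)$ is the 1-based position of $q$ in $v$ when its dimensions are sorted by decreasing score. The notation is an assumption of this note, and some formulations take the square root of this value as the final similarity.

$$
WO(v_1, v_2) = \frac{\sum_{q \in O} \left(\mathrm{rank}(q, v_1) + \mathrm{rank}(q, v_2)\right)^{-1}}{\sum_{i=1}^{|O|} (2i)^{-1}}
$$

The denominator is the largest value the numerator can reach, obtained when the shared lemmas occupy ranks $1, \dots, |O|$ in the same order in both vectors, so the score lies in $(0, 1]$ whenever $O$ is not empty.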

In [106]:
# Get the overlap between two Nasari vectors: the set of lemmas they share
def get_overlap(v1, v2):
    return set(v1) & set(v2)

# Get rank as the 1-based position of a lemma in a Nasari vector
# (assumes the vector's lemmas are stored in decreasing score order, as read from the file)
def get_rank(lemma, vector):
    for i, key in enumerate(vector):
        if key == lemma:
            return i + 1
    return None  # the lemma is not in the vector

# Compute weighted overlap between two Nasari vectors (dicts lemma -> score)
def weighted_overlap(v1, v2):
    overlap = get_overlap(v1, v2)
    if not overlap:
        return 0
    # Numerator: sum of the inverse combined ranks. Denominator: sum of (2i)^-1 for i = 1..|overlap|.
    # The inverses must be summed term by term; inverting the summed ranks gives a different value.
    num = sum(1 / (get_rank(term, v1) + get_rank(term, v2)) for term in overlap)
    den = sum(1 / (2 * i) for i in range(1, len(overlap) + 1))
    return num / den

# try weighted_overlap on the first Nasari vector of the title and of a random paragraph
path = '/Users/jak/Documents/Uni/TLN/TLN/Radicioni/data/docs/Ebola-virus-disease.txt'
title_context = create_title_context(path)
paragraph_context = create_paragraph_context(get_random_paragraph(path)[1])
if title_context and paragraph_context:
    print(weighted_overlap(title_context[0], paragraph_context[0]))
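
The last part of step 3, scoring each paragraph against the title context and keeping those above a threshold, could be sketched as follows. This is a minimal, self-contained illustration on toy data: the vectors, the `score_paragraph` and `summarize` helpers, the choice of taking the best score over all vector pairs, and the threshold value are all assumptions, not the notebook's final implementation.

```python
def weighted_overlap(v1, v2):
    # v1, v2: dicts lemma -> score, assumed ordered by decreasing score
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0
    rank1 = {lemma: i + 1 for i, lemma in enumerate(v1)}
    rank2 = {lemma: i + 1 for i, lemma in enumerate(v2)}
    num = sum(1 / (rank1[q] + rank2[q]) for q in overlap)
    den = sum(1 / (2 * i) for i in range(1, len(overlap) + 1))
    return num / den

def score_paragraph(title_context, paragraph_context):
    # Best Weighted Overlap over all (title vector, paragraph vector) pairs
    scores = [weighted_overlap(t, p) for t in title_context for p in paragraph_context]
    return max(scores, default=0.0)

def summarize(title_context, paragraphs, threshold):
    # paragraphs: list of (text, context) pairs; the original order is kept
    return [text for text, ctx in paragraphs
            if score_paragraph(title_context, ctx) > threshold]

# Toy data (hypothetical vectors, not taken from the Nasari file)
title_context = [{'virus': 3.0, 'disease': 2.0, 'fever': 1.0}]
paragraphs = [
    ('Paragraph about the virus.', [{'virus': 5.0, 'fever': 4.0, 'host': 1.0}]),
    ('Paragraph about funding.', [{'money': 2.0, 'budget': 1.0}]),
]
print(summarize(title_context, paragraphs, threshold=0.5))
```

Taking the maximum over vector pairs is only one possible aggregation; summing or averaging the pairwise scores would match the "sum of the weights" wording of step 3 more literally, at the cost of favoring longer paragraphs.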