# Exercise 4 - Segmentation

- Implement a simple text segmentation algorithm
- Test it on an input of k paragraphs taken from different topics (e.g. Wikipedia pages)
- Is your system able to find the right “cuts”?

## Idea:

- I strip stopwords and punctuation from the text file, then tokenize and lemmatize it, obtaining a list of relevant words
- I split the input file into a list of lists: each line of the input file becomes a list of relevant words
- I compute the cosine similarity between each line of the input file and the line that follows it
- I place the cuts at the minima of the cosine similarity
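
The last step of the list above, placing cuts at the minima, can be sketched as follows. This is only a minimal illustration, not the notebook's own code: the helper name `find_cuts`, the choice to keep only the `n_cuts` deepest local minima, and the toy similarity values are all assumptions made for the example.

```python
def find_cuts(similarities, n_cuts):
    """Return the gap indices of the n_cuts deepest local minima, in document order.

    similarities[i] is the similarity between paragraph i and paragraph i + 1,
    so a cut at index i falls between those two paragraphs.
    """
    # A local minimum is strictly lower than both of its neighbours
    # (the first and last gaps are skipped because they have only one neighbour).
    minima = [
        i for i in range(1, len(similarities) - 1)
        if similarities[i] < similarities[i - 1] and similarities[i] < similarities[i + 1]
    ]
    # Keep the n_cuts minima with the lowest similarity.
    deepest = sorted(minima, key=lambda i: similarities[i])[:n_cuts]
    return sorted(deepest)


# Toy values, invented for illustration only
sims = [0.8, 0.7, 0.1, 0.6, 0.9, 0.05, 0.75]
print(find_cuts(sims, n_cuts=2))  # [2, 5]
```

With k topics one would typically ask for k - 1 cuts. Adjacent-line similarities tend to be noisy, so keeping only the deepest minima is one simple way to avoid cutting at every small dip.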


## Input file

As input I use a text file containing several paragraphs taken from Wikipedia on 4 different topics:
- Lebanon
- Racing bike
- Labrador retriever
- Indie rock

The cuts my algorithm should find are between lines 27-28, lines 58-59, and lines 97-98.
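
One possible way to check predicted cuts against the expected ones listed above is sketched below. The helpers `gap_to_lines` and `score_cuts` are hypothetical names introduced here, and the predicted gap indices in the usage line are made up for illustration; they are not results produced by this notebook.

```python
def gap_to_lines(gap_index):
    """Map a 0-based gap index to the 1-based pair of lines it separates."""
    return (gap_index + 1, gap_index + 2)


def score_cuts(predicted_gaps, expected_pairs):
    """Return (precision, recall) of predicted cuts against expected line pairs."""
    predicted = {gap_to_lines(g) for g in predicted_gaps}
    hits = predicted & expected_pairs
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(expected_pairs) if expected_pairs else 0.0
    return precision, recall


# Expected cuts stated in the text, as (line before, line after) pairs
expected = {(27, 28), (58, 59), (97, 98)}

# Hypothetical prediction: two cuts in the right place and one misplaced
precision, recall = score_cuts([26, 57, 80], expected)
print(precision, recall)  # both are 2 out of 3
```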


In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

### Preprocessing the input file

In [5]:
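# Read the input file, one entry per line
data = []
with open('../data/wiki.txt', 'r') as f:
    for line in f:
        data.append(line.strip())

lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')  # keeps runs of word characters, dropping punctuation
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
paragraphs = []
for line in data:
    # Lowercase each token, drop stopwords, then lemmatize what remains
    words = [lemmatizer.lemmatize(token.lower()) for token in tokenizer.tokenize(line) if token.lower() not in stop_words]
    paragraphs.append(words)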

# paragraphs[0]

### Build a list of all relevant words in the input file, removing duplicates while preserving order

In [6]:
# Create a list of all words with duplicates
all_words = []
for paragraph in paragraphs:
    for word in paragraph:
        all_words.append(word)

# Remove duplicates but keep the order
all_words = list(dict.fromkeys(all_words))
# all_words

# all_words = list(all_words)
# all_words_dict = {word: i for i, word in enumerate(all_words)}
# all_words_dict

### Build a dictionary for each paragraph of the input file, where the key is the word and the value is its frequency

In [10]:
# Build a word -> count dictionary for one paragraph,
# with an entry (possibly 0) for every word in all_words
def paragraph_dict(p1):
    counts = {}
    for word in all_words:
        counts[word] = 0
    for word in p1:
        counts[word] += 1
    return counts

# # Create a dictionary of all words in two paragraphs
# def paragraph_dict(p1, p2):
#     paragraph_dict = {}
#     for word in all_words:
#         paragraph_dict[word] = 0
#     for word in p1:
#         paragraph_dict[word] += 1
#     for word in p2:
#         paragraph_dict[word] += 1
#     return paragraph_dict

# From the dictionary create a list of word counts, ordered as in all_words
def paragraph_list(counts):
    vector = []
    for word in all_words:
        vector.append(counts[word])
    return vector

# par1 = paragraphs[0]
# par2 = paragraphs[1]
# # par_dict = paragraph_dict(par1, par2)
# par_dict = paragraph_dict(par1)
# par_list = paragraph_list(par_dict)
# print(par_dict)
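
The two helpers above can also be collapsed into a single step with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the notebook's code: `count_vector` is a hypothetical name, the vocabulary is passed in explicitly rather than read from the global `all_words`, and the toy paragraphs are invented.

```python
from collections import Counter


def count_vector(paragraph, vocabulary):
    """Return the word counts of one paragraph, ordered as in vocabulary."""
    counts = Counter(paragraph)  # missing words count as 0
    return [counts[word] for word in vocabulary]


# Toy data, invented for illustration only
toy_paragraphs = [["bike", "race", "bike"], ["dog", "retriever"]]

# Same order-preserving deduplication used earlier in the notebook
vocabulary = list(dict.fromkeys(w for p in toy_paragraphs for w in p))
vectors = [count_vector(p, vocabulary) for p in toy_paragraphs]

print(vocabulary)  # ['bike', 'race', 'dog', 'retriever']
print(vectors)     # [[2, 1, 0, 0], [0, 0, 1, 1]]
```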

### Compute the cosine similarity between each line of the input file and the next
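
For reference, the cosine similarity of two count vectors is their dot product divided by the product of their Euclidean norms. The sketch below spells this out in plain Python. It is only an illustration: the notebook itself relies on scikit-learn's `cosine_similarity`, and returning 0 when one vector is all zeros is a convention chosen here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption for this sketch: a zero vector is treated as similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1, 1, 0], [1, 0, 1]))  # one shared word out of two each: about 0.5
print(cosine([0, 0, 0], [1, 2, 3]))  # zero vector: 0.0 by the convention above
```

Two paragraphs that share no relevant words, or a line that yields no relevant words at all, therefore score 0.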

In [14]:
# Compute cosine similarity between two paragraphs
# def cosine_similarity(p1, p2):
#     return np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))


# Compute cosine similarity between all paragraphs two by two
# for i in range(0, len(paragraphs) - 1):
#     for j in range(i + 2, len(paragraphs) - 1):
#         par1 = paragraphs[i]
#         par2 = paragraphs[i + 1]
#         par3 = paragraphs[j]
#         par4 = paragraphs[j + 1]
#         dict1 = paragraph_dict(par1, par2)
#         dict2 = paragraph_dict(par3, par4)
#         list1 = paragraph_list(dict1)
#         list2 = paragraph_list(dict2)
#         cos_sim = cosine_similarity([list1], [list2])
#         print(f'Paragraphs {i} and {i + 1} are similar to paragraphs {j} and {j + 1} with a cosine similarity of {cos_sim}')

# Compute cosine similarity between all paragraphs one by one
for i in range(0, len(paragraphs) - 1):
    par1 = paragraphs[i]
    par2 = paragraphs[i + 1]
    dict1 = paragraph_dict(par1)
    dict2 = paragraph_dict(par2)
    list1 = paragraph_list(dict1)
    list2 = paragraph_list(dict2)
    cos_sim = cosine_similarity([list1], [list2])
    print(f'Paragraphs {i + 1} and {i + 2} are similar with a cosine similarity of {cos_sim}')
    

Paragraphs 1 and 2 are similar with a cosine similarity of [[0.18761766]]
Paragraphs 2 and 3 are similar with a cosine similarity of [[0.31999903]]
Paragraphs 3 and 4 are similar with a cosine similarity of [[0.24628353]]
Paragraphs 4 and 5 are similar with a cosine similarity of [[0.06278421]]
Paragraphs 5 and 6 are similar with a cosine similarity of [[0.20894948]]
Paragraphs 6 and 7 are similar with a cosine similarity of [[0.315353]]
Paragraphs 7 and 8 are similar with a cosine similarity of [[0.63245553]]
Paragraphs 8 and 9 are similar with a cosine similarity of [[0.]]
Paragraphs 9 and 10 are similar with a cosine similarity of [[0.]]
Paragraphs 10 and 11 are similar with a cosine similarity of [[0.11899932]]
Paragraphs 11 and 12 are similar with a cosine similarity of [[0.17242311]]
Paragraphs 12 and 13 are similar with a cosine similarity of [[0.11624764]]
Paragraphs 13 and 14 are similar with a cosine similarity of [[0.21213203]]
Paragraphs 14 and 15 are similar with a cosine 