# HO05: Text Similarity
Given the document collection `headlines.txt`, which contains one document per line, compute heatmaps (create_heatmap) showing the similarity of each document to every other document, using the 5 similarity metrics (Jaccard, Manhattan, Euclidean, Minkowski with p=3, and Cosine Similarity) and representing the documents with the 6 vectorization schemes:

1. One-Hot Encoding
2. Count Vectors
3. TF-IDF
4. n-grams (2-grams)
5. Co-occurrence Vectors (Context Window = 1)
6. Word2Vec

Publish the source code and the 30 heatmaps on your personal branch of the git repository, inside the HO05 folder.
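As a quick reference for the five metrics named above, here is a minimal sketch that evaluates each of them on two toy vectors. The vectors `a` and `b` are made up for illustration and are not taken from `headlines.txt`; the functions come from `scipy.spatial.distance`, which returns distances, so smaller values mean more similar documents.

```python
import numpy as np
from scipy.spatial import distance

# Two toy binary document vectors over a 4-word vocabulary (illustrative only)
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B|, defined on boolean vectors
jaccard = distance.jaccard(a.astype(bool), b.astype(bool))  # 2/3
# Manhattan (L1) distance: sum of absolute differences
manhattan = distance.cityblock(a, b)                        # 2
# Euclidean (L2) distance
euclidean = distance.euclidean(a, b)                        # sqrt(2)
# Minkowski distance with p=3
minkowski = distance.minkowski(a, b, p=3)                   # 2 ** (1/3)
# Cosine distance: 1 - cosine similarity
cosine = distance.cosine(a, b)                              # 0.5

print(jaccard, manhattan, euclidean, minkowski, cosine)
```

Manhattan, Euclidean, and Minkowski with p=3 are the same family of distances with p=1, p=2, and p=3 respectively.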

## Defining heatmaps for similarity visualization

In [12]:
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import pairwise_distances

def make_heatmap(name, data):
    # Create an output folder named after `name`
    os.makedirs(name, exist_ok=True)

    # Metric name -> extra keyword arguments for pairwise_distances
    metrics = {
        "jaccard": {},
        "manhattan": {},
        "euclidean": {},
        "minkowski": {"p": 3},
        "cosine": {},
    }

    for metric, kwargs in metrics.items():
        # Jaccard is defined on boolean vectors, so binarize the input for it
        X = data.astype(bool) if metric == "jaccard" else data
        # pairwise_distances returns distances: 0 means identical documents
        distances = pairwise_distances(X, metric=metric, **kwargs)
        # Build the title as a single string (a tuple would be rendered literally)
        sns.heatmap(distances, cmap="Purples").set(title=f"{name} - {metric.capitalize()} Distance")
        plt.savefig(os.path.join(name, f"{metric}.png"))
        plt.clf()
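Because `pairwise_distances` yields distances rather than similarities, one possible way to plot similarity instead is to convert the matrix first. The sketch below uses a hypothetical helper, `to_similarity`, which is not part of the original notebook: for cosine and Jaccard, `1 - distance` recovers the usual similarity, while for the unbounded metrics `1 / (1 + distance)` is one common but arbitrary choice.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def to_similarity(distances, complement=False):
    """Hypothetical helper: map a distance matrix to similarity scores.

    complement=True  -> 1 - d        (suits cosine and Jaccard distances)
    complement=False -> 1 / (1 + d)  (an assumed choice for unbounded distances)
    """
    distances = np.asarray(distances, dtype=float)
    if complement:
        return 1.0 - distances
    return 1.0 / (1.0 + distances)

# Toy count vectors (illustrative only); rows 0 and 2 are identical
X = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 0]])

cos_sim = to_similarity(pairwise_distances(X, metric="cosine"), complement=True)
euc_sim = to_similarity(pairwise_distances(X, metric="euclidean"))

print(cos_sim.round(3))
print(euc_sim.round(3))
```

With this conversion, identical documents score 1 under both schemes, and larger values always mean more similar documents.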


## Reading data

In [13]:
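with open("../datasets/headlines.txt", "r") as file:
    # One document per line; skip blank lines, which would become all-zero vectors
    headlines = [line.strip() for line in file if line.strip()]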

#### One-Hot Encoding

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
data = vectorizer.fit_transform(headlines).toarray()

make_heatmap("One-Hot Encoding", data)




#### Count Vectors

In [15]:
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(headlines).toarray()

make_heatmap("Count Vectors", data)




#### TF-IDF

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
data = tfidf.fit_transform(headlines).toarray()

make_heatmap("TF-IDF", data)
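One property worth keeping in mind when reading the TF-IDF heatmaps: `TfidfVectorizer` L2-normalizes each row by default, so the squared Euclidean distance equals twice the cosine distance, and the two heatmaps order document pairs identically. The sketch below checks this on a made-up corpus (`toy_docs` is an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy corpus (illustrative only)
toy_docs = [
    "stocks rally on rate cut",
    "rate cut lifts stocks",
    "storm closes coastal roads",
]

X = TfidfVectorizer().fit_transform(toy_docs).toarray()

# Each row has unit L2 norm because the default setting is norm="l2"
row_norms = np.linalg.norm(X, axis=1)

euclidean = pairwise_distances(X, metric="euclidean")
cosine = pairwise_distances(X, metric="cosine")

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = np.allclose(euclidean ** 2, 2 * cosine, atol=1e-6)
print(row_norms, identity_holds)
```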




#### n-grams (2-grams)

In [17]:
two_grams = CountVectorizer(ngram_range=(2, 2))
data = two_grams.fit_transform(headlines).toarray()

make_heatmap("Two-Grams", data)




#### Tokenizing

In [18]:
import nltk
from nltk.tokenize import word_tokenize

def tokenize():
    # word_tokenize relies on NLTK's Punkt tokenizer data (fetch it once with nltk.download if missing)
    data = [headline.strip() for headline in headlines]
    data_tokenized = [word_tokenize(text.lower()) for text in data]
    return data_tokenized, data

#### Co-occurrence

In [19]:
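data_tokenized, data = tokenize()
# Note: this builds unigram count vectors over the re-joined tokens;
# it does not apply a context window of 1 between neighboring words
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
counts = vectorizer.fit_transform([' '.join(tokens) for tokens in data_tokenized])

make_heatmap("Co_occurrence", counts.toarray())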
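The task asks for co-occurrence vectors with a context window of 1. A minimal sketch of one way to build them is shown below; the helper `cooccurrence_doc_vectors` is hypothetical, and representing a document as the sum of the co-occurrence rows of its tokens is an assumption, since the task statement does not say how word-level co-occurrence should be pooled into a document vector.

```python
import numpy as np

def cooccurrence_doc_vectors(tokenized_docs, window=1):
    """Hypothetical helper: word-word co-occurrence counts within `window`
    positions, pooled into one vector per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}

    # cooc[i, j] counts how often word j appears within `window` positions of word i
    cooc = np.zeros((len(vocab), len(vocab)))
    for doc in tokenized_docs:
        for pos, tok in enumerate(doc):
            lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    cooc[index[tok], index[doc[ctx_pos]]] += 1

    # Assumed pooling: a document vector is the sum of its tokens' co-occurrence rows
    doc_vectors = np.array([
        sum((cooc[index[tok]] for tok in doc), np.zeros(len(vocab)))
        for doc in tokenized_docs
    ])
    return doc_vectors, cooc, vocab

# Toy tokenized corpus (illustrative only)
toy_tokens = [["fed", "cuts", "rate"], ["markets", "rally"]]
doc_vectors, cooc, vocab = cooccurrence_doc_vectors(toy_tokens, window=1)

print(vocab)
print(doc_vectors)
```

The resulting `doc_vectors` array has one row per document, so it could be passed to `make_heatmap` in the same way as the other representations.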




#### Word2Vec

In [20]:
import gensim
import numpy as np

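data_tokenized, data = tokenize()

vector_size = 100
model = gensim.models.Word2Vec(sentences=data_tokenized, min_count=1, vector_size=vector_size, window=5)

# Represent each headline as the sum of its word vectors
doc_vectors = np.zeros((len(data_tokenized), vector_size))
for i, tokens in enumerate(data_tokenized):
    for word in tokens:
        doc_vectors[i] += model.wv[word]

# Scale each row to unit length, guarding against headlines that produced no tokens
norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
doc_vectors /= np.where(norms == 0, 1.0, norms)

make_heatmap("Word2Vec", doc_vectors)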



