# HO06: Text Clustering
Cluster the [20 News Group Dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) (see HO04). Vectorize the dataset with both TF-IDF and Word2Vec, and apply each of the approaches below to each representation:

1. K-Means (K=4)
2. Spectral Clustering (K=6)
3. Gaussian Mixture
4. Agglomerative Clustering
5. DBSCAN
6. HDBSCAN

Make the source code (Python notebook) available on your personal branch of the git repository, inside the HO06 folder.


https://www.kaggle.com/code/mrbisht/extracting-features-from-text-variables
https://www.kaggle.com/code/aybukehamideak/clustering-text-documents-using-k-means

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data[:100]

In [2]:
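import re

# Remove punctuation (anything that is not a word character or whitespace), then lowercase.
# \w already matches digits, so a separate \d in the character class is not needed.
for i in range(len(data)):
    data[i] = re.sub(r"[^\w\s]", '', data[i])
    data[i] = data[i].lower()

print(data)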
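To make the effect of the cleaning step easier to see, here is a minimal sketch applied to a single made-up string. The helper name `clean` and the sample text are assumptions for illustration only; they are not part of the notebook.

```python
import re


def clean(text: str) -> str:
    """Drop punctuation and lowercase, mirroring the loop in the cell above."""
    return re.sub(r"[^\w\s]", "", text).lower()


sample = "Subject: Re: WHAT car is this!?"
cleaned = clean(sample)
print(cleaned)  # subject re what car is this

# The underscore counts as a word character, so it survives the cleaning.
print(clean("a_b-c"))  # a_bc
```

Note that newlines and header lines are kept, since only punctuation is removed.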



# TF-IDF and Word2Vec

## Vectorization

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

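# Fit TF-IDF with English stop words removed; .toarray() turns the sparse result into a dense array
tfidf = TfidfVectorizer(stop_words='english')
data_tfidf = tfidf.fit_transform(data).toarray()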

print(data_tfidf)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [4]:
import numpy as np
from gensim.models import Word2Vec

# Tokenize each document on whitespace and train Word2Vec on the corpus
words = [line.split() for line in data]
data2vec = Word2Vec(sentences=words, vector_size=100, window=5, min_count=1, workers=4)

# data2vec.wv.vectors holds one row per vocabulary word, not per document.
# Average the word vectors of each document to get one row per document
# (assumes no document is empty after cleaning).
data_vec = np.array([data2vec.wv[doc].mean(axis=0) for doc in words])
print(data_vec.shape)  # (n_documents, vector_size)


## K-means

In [5]:
from sklearn.cluster import KMeans

kmeans_tf_idf = KMeans(n_clusters=4, random_state=0, n_init="auto").fit(data_tfidf)
kmeans_word2vec = KMeans(n_clusters=4, random_state=0, n_init="auto").fit(data_vec)
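The cell above only fits the models. A minimal sketch of how the result could be inspected follows, using synthetic blobs as a stand-in for `data_tfidf` or `data_vec` (the data and the variable names `km`, `sizes`, and `score` are assumptions). It uses `n_init=10` instead of `"auto"` only so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the document vectors: 4 well-separated blobs
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

km = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)

labels = km.labels_            # cluster index assigned to each row
sizes = np.bincount(labels)    # number of rows per cluster
score = silhouette_score(X, labels)  # in [-1, 1], higher means better separated

print(sizes)
print(round(score, 3))
```

On the notebook's data, the same two checks would be applied to `kmeans_tf_idf.labels_` and `kmeans_word2vec.labels_`.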

## Spectral Clustering

In [6]:
from sklearn.cluster import SpectralClustering

spectral_tf_idf = SpectralClustering(n_clusters=6, random_state=0).fit(data_tfidf)
spectral_word2vec = SpectralClustering(n_clusters=6, random_state=0).fit(data_vec)
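By default `SpectralClustering` builds its affinity matrix with an RBF kernel on Euclidean distances. For text vectors, one alternative that may be worth trying is a precomputed cosine-similarity affinity. The sketch below illustrates this on synthetic non-negative vectors (an assumption standing in for TF-IDF rows); it is not what the notebook cell runs.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Three groups of non-negative vectors, each pointing roughly along one axis
X = np.vstack([
    np.eye(3)[i] + np.abs(rng.normal(scale=0.05, size=(10, 3)))
    for i in range(3)
])

# Cosine similarity is non-negative here, so it can serve as an affinity matrix
affinity = cosine_similarity(X)

sc = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
labels = sc.fit_predict(affinity)

print(labels)
```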

## Gaussian Mixture

In [7]:
from sklearn.mixture import GaussianMixture

gaussian_tf_idf = GaussianMixture(n_components=2, random_state=0).fit(data_tfidf)
gaussian_word2vec = GaussianMixture(n_components=2, random_state=0).fit(data_vec)
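Unlike the estimators in `sklearn.cluster`, `GaussianMixture` has no `labels_` attribute; hard assignments come from `predict` and soft assignments from `predict_proba`. The default `covariance_type="full"` estimates an `n_features × n_features` matrix per component, which can be expensive for wide TF-IDF matrices, so this sketch uses `"diag"` as one possible alternative. The synthetic data and that parameter choice are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two well-separated Gaussian groups in 5 dimensions
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 5)),
    rng.normal(loc=8.0, scale=1.0, size=(50, 5)),
])

gm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

labels = gm.predict(X)        # hard assignment: most likely component per row
probs = gm.predict_proba(X)   # soft assignment: one probability per component

print(np.bincount(labels))
print(probs[:3].round(3))
```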
     

## Agglomerative Clustering

In [8]:
from sklearn.cluster import AgglomerativeClustering

agglomerative_tf_idf = AgglomerativeClustering().fit(data_tfidf)
agglomerative_word2vec = AgglomerativeClustering().fit(data_vec)
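Called without arguments, `AgglomerativeClustering` uses `n_clusters=2` with Ward linkage, so the cell above always yields two clusters. The following sketch contrasts the default with an explicit `n_clusters` on synthetic blobs; the data and the value 3 are assumptions chosen only to illustrate the parameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Three well-separated blobs as a stand-in for the document vectors
centers = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

default_model = AgglomerativeClustering().fit(X)           # n_clusters=2, Ward linkage
three_model = AgglomerativeClustering(n_clusters=3).fit(X)

print(np.bincount(default_model.labels_))
print(np.bincount(three_model.labels_))
```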

## DBSCAN

In [9]:
from sklearn.cluster import DBSCAN

dbscan_tf_idf = DBSCAN(eps=4, min_samples=2).fit(data_tfidf)
dbscan_word2vec = DBSCAN(eps=4, min_samples=2).fit(data_vec)
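A note on `eps` for the TF-IDF case: rows produced by `TfidfVectorizer` are non-negative and L2-normalized by default, so the Euclidean distance between any two rows is at most √2 ≈ 1.41. With `eps=4`, every point is a neighbor of every other point and DBSCAN returns a single cluster. The sketch below demonstrates this on synthetic unit-length vectors and then shows a tighter threshold with a cosine metric; the data, `eps=0.1`, and `metric="cosine"` are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Non-negative, unit-length rows, similar in geometry to TF-IDF output
X = normalize(np.abs(rng.normal(size=(30, 50))))

max_dist = pairwise_distances(X).max()   # bounded by sqrt(2) for such rows
labels_wide = DBSCAN(eps=4, min_samples=2).fit(X).labels_
print(max_dist, set(labels_wide.tolist()))

# Three groups pointing along different axes, clustered with a cosine threshold
groups = normalize(np.vstack([
    np.eye(3)[i] + np.abs(rng.normal(scale=0.05, size=(10, 3)))
    for i in range(3)
]))
labels_cos = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit(groups).labels_
print(labels_cos)
```

This bound applies to the TF-IDF vectors only; Word2Vec-based vectors are not normalized, so their distance scale has to be checked separately.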

## HDBSCAN

In [10]:
import hdbscan

hdbscan_tf_idf = hdbscan.HDBSCAN().fit(data_tfidf)
hdbscan_word2vec = hdbscan.HDBSCAN().fit(data_vec)
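Both DBSCAN and HDBSCAN mark points they consider noise with the label `-1`, so a quick summary of a fitted model's `labels_` helps judge whether the parameters produced anything useful. The helper below, `summarize_labels`, is a hypothetical name introduced here for illustration and depends only on NumPy.

```python
import numpy as np


def summarize_labels(labels):
    """Count clusters, noise points (label -1), and cluster sizes."""
    labels = np.asarray(labels)
    n_noise = int((labels == -1).sum())
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    return {
        "n_clusters": len(ids),
        "n_noise": n_noise,
        "sizes": dict(zip(ids.tolist(), counts.tolist())),
    }


# Made-up label vector in the format found in a fitted model's labels_
example = [0, 0, 1, -1, 1, 1, -1]
summary = summarize_labels(example)
print(summary)  # {'n_clusters': 2, 'n_noise': 2, 'sizes': {0: 2, 1: 3}}
```

In the notebook, the same helper could be called on, for example, `hdbscan_tf_idf.labels_` or `dbscan_tf_idf.labels_`.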