# Document Clustering

Document clustering follows the process shown in the diagram below:  
1. Normalize the documents, if required or desired.  
2. Vectorize the documents using a representation such as:  
    * Bag of Words (BoW)  
    * Term Frequency-Inverse Document Frequency (TF-IDF)  
    * Word embeddings (GloVe, Word2Vec, Transformers such as BERT, etc.) with AverageEmbeddings  
3. Reduce dimensionality, if required or desired, using:  
    * Latent Semantic Analysis (LSA)  
    * Word collapsing (WordNet)  
4. Cluster the documents using K-means or Gaussian Mixture Models.
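
The steps above can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the implementation behind `DocumentClustering`: the toy corpus, the choice of `TfidfVectorizer`, `TruncatedSVD` (used here as LSA) and `KMeans`, and all parameter values are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus (illustrative only, not the data used in this notebook).
docs = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate electricity from wind",
    "solar and wind power are renewable energy sources",
    "neural networks learn patterns from data",
    "deep learning trains neural networks with many layers",
    "machine learning models learn from training data",
]

# Step 2: TF-IDF vectorization (English stopwords removed).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Step 3: LSA, i.e. truncated SVD on the TF-IDF matrix,
# followed by L2 normalization of the reduced vectors.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(X)

# Step 4: K-means on the reduced representation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_lsa)

print(X.shape, X_lsa.shape)
print(labels)
```

Swapping `KMeans` for `sklearn.mixture.GaussianMixture` would give the Gaussian Mixture variant of step 4.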

![](..//diagrams/document_clustering_diagram.JPG)

## Document normalization

In [1]:
import sys
import os

sys.path.append(os.path.abspath('..//../techminer/'))
from docs_normalizer import DocNormalizer
from document_clustering import DocumentClustering
import pandas as pd
from techminer import RecordsDataFrame

In [2]:
rdf = RecordsDataFrame(
    pd.read_json(
        os.path.join('..', '..', 'data', 'cleaned.json'),
        orient='records',
        lines=True))

In [3]:
doc_normalizer = DocNormalizer()

In [None]:
doc_normalizer.fit(rdf['Abstract'])
rdf['Abstract_cleaned'] = doc_normalizer.transform(rdf['Abstract'])

Loaded 326 stopwords
Normalizing documents
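
For readers unfamiliar with text normalization, the following is a rough sketch of what such a step typically involves. It is not the `DocNormalizer` implementation, whose internals are not shown here: the `normalize` function and the tiny `STOPWORDS` set are hypothetical and exist only for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list (the notebook loads 326).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def normalize(text):
    """Lowercase, keep only letters, and drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(normalize("The Clustering of 3 Documents, in Python!"))
# → clustering documents python
```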


In [None]:
vocabulary = {word for doc in rdf['Abstract_cleaned'] for word in doc.split()}
print(f"Vocabulary size: {len(vocabulary)}")

In [None]:
document_clustering = DocumentClustering(vectorize=True, 
                                         min_count=5,
                                         max_count=1.0, 
                                         use_tfidf=True, 
                                         reduce_dimensions=True, 
                                         n_components=100,
                                         n_clusters=6, 
                                         random_state=42)
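
The effect of frequency thresholds on the vocabulary can be illustrated with scikit-learn's `CountVectorizer`. This sketch assumes that `min_count` and `max_count` play a role similar to scikit-learn's `min_df` and `max_df`; that mapping is an assumption and is not confirmed by the `DocumentClustering` code shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only).
docs = ["apple banana", "apple cherry", "apple banana date"]

# min_df=2: keep terms that appear in at least 2 documents.
# max_df=1.0: no upper limit, terms present in every document are kept.
vectorizer = CountVectorizer(min_df=2, max_df=1.0)
dfm = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)   # → ['apple', 'banana']
print(dfm.shape)    # → (3, 2)
```

Rare terms ("cherry", "date") are dropped, which shrinks the feature space before TF-IDF weighting and dimensionality reduction.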

In [None]:
document_clustering.fit(rdf.loc[:,'Abstract_cleaned'])

#### Vectorized text

In [None]:
document_clustering.dfm

In [None]:
print(f'Dimensions of the vectorized documents: {document_clustering.dfm.shape[0]} rows (documents) x {document_clustering.dfm.shape[1]} columns (features)')

In [None]:
print(f'Information kept after dimensionality reduction (explained variance): {round(document_clustering.explained_var_*100,2)}%')
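
As a point of reference, the share of variance retained by a truncated SVD can be computed as below. This is a sketch on synthetic data, assuming that `explained_var_` corresponds to the summed `explained_variance_ratio_` of a `TruncatedSVD`; the notebook does not confirm how the attribute is actually computed.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for a document-feature matrix (50 docs x 20 features).
rng = np.random.default_rng(42)
X = rng.random((50, 20))

svd = TruncatedSVD(n_components=5, random_state=42)
X_reduced = svd.fit_transform(X)

# Fraction of the total variance captured by the 5 retained components.
explained_var = svd.explained_variance_ratio_.sum()
print(f"Variance retained: {explained_var * 100:.2f}%")
```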

In [None]:
print(f'Clustering performance (silhouette score): {round(document_clustering.silhouette_score_,3)}')
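
The silhouette score ranges from -1 to 1, with higher values indicating compact, well-separated clusters. The sketch below shows how it can be computed with scikit-learn on synthetic data; it assumes that `silhouette_score_` is the mean silhouette coefficient over all documents, which is not confirmed by the code shown here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all samples.
score = silhouette_score(X, labels)
print(round(score, 3))
```

On real text data the score is usually much lower than on clean synthetic blobs, so it is most useful for comparing different values of `n_clusters` on the same corpus.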

In [None]:
rdf['Cluster_labels'] = document_clustering.cluster_labels_

In [None]:
rdf['Cluster_labels'].value_counts()