# Document Clustering

The diagram below summarizes the document clustering process, which consists of the following steps:

1. Normalize the documents, if required or desired.
2. Vectorize the documents using a representation such as:
    * Bag of Words (BoW)
    * Term Frequency-Inverse Document Frequency (TF-IDF)
    * Word embeddings (GloVe, Word2Vec, Transformer-based models such as BERT, etc.) with AverageEmbeddings
3. Reduce dimensionality, if required or desired, using:
    * Latent Semantic Analysis (LSA)
    * Word collapsing (WordNet)
4. Cluster the documents using K-means or Gaussian Mixture Models.

![](..//diagrams/document_clustering_diagram.JPG)
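
The vectorization step can be illustrated with a small, library-free sketch. It builds a Bag of Words matrix and a plain TF-IDF weighting for a hypothetical three-document corpus. The corpus and variable names are assumptions made for illustration, and the weighting shown (raw count times `log(N / df)`) is the textbook form, not necessarily what `DocumentClustering` applies internally.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is assumed to be already normalized.
docs = [
    "solar energy forecast model",
    "wind energy forecast",
    "neural network model training",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Bag of Words: one row per document, one column per vocabulary term.
counts = [Counter(tokens) for tokens in tokenized]
bow = [[c[term] for term in vocabulary] for c in counts]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Textbook TF-IDF: raw count times log(N / df).
# Libraries often add smoothing and row normalization on top of this.
tfidf = [
    [count * math.log(n_docs / doc_freq[term]) for term, count in zip(vocabulary, row)]
    for row in bow
]

for doc, row in zip(docs, tfidf):
    print(doc, [round(weight, 2) for weight in row])
```

Terms that occur in fewer documents (such as `solar`) receive a higher weight than terms shared by several documents (such as `energy`).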

## Document normalization

In [1]:
import sys
import os

sys.path.append(os.path.abspath('..//../techminer/'))
from docs_normalizer import DocNormalizer
from document_clustering import DocumentClustering
import pandas as pd
from techminer import RecordsDataFrame

In [2]:
rdf = RecordsDataFrame(
    pd.read_json(
        'step-07.json', 
        orient='records', 
        lines=True))

In [3]:
doc_normalizer = DocNormalizer()

In [4]:
doc_normalizer.fit(rdf['Abstract'])
rdf['Abstract_cleaned'] = doc_normalizer.transform(rdf['Abstract'])

Loaded 326 stopwords
Normalizing documents


In [5]:
# Collect the distinct tokens across all cleaned abstracts.
vocabulary = {word for doc in rdf.loc[:,'Abstract_cleaned'] for word in doc.split()}
print(f"Vocabulary size: {len(vocabulary)}")

Vocabulary size: 2746


In [6]:
document_clustering = DocumentClustering(vectorize=True, 
                                         min_count=5,
                                         max_count=1.0, 
                                         use_tfidf=True, 
                                         reduce_dimensions=True, 
                                         n_components=100,
                                         n_clusters=6, 
                                         random_state=42)

In [7]:
document_clustering.fit(rdf.loc[:,'Abstract_cleaned'])

DocumentClustering(max_count=1.0, min_count=5, n_clusters=6, n_components=100,
                   random_state=42, reduce_dimensions=True, use_tfidf=True,
                   vectorize=True)
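
The notebook does not show what `min_count` and `max_count` do. One plausible reading, stated here as an assumption, is that they bound the document frequency of each term: an integer is an absolute number of documents and a float is a fraction of the corpus. The helper below, `prune_vocabulary`, is hypothetical and only sketches that interpretation on a toy corpus.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_count=1, max_count=1.0):
    """Keep terms whose document frequency lies within [min_count, max_count].

    Assumption: an int bound is an absolute number of documents and a float
    bound is a fraction of the corpus. DocumentClustering may behave differently.
    """
    n_docs = len(tokenized_docs)
    # Count each term at most once per document.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))

    def to_absolute(bound):
        return bound * n_docs if isinstance(bound, float) else bound

    low, high = to_absolute(min_count), to_absolute(max_count)
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


# Hypothetical toy corpus, already tokenized.
docs = [
    "energy forecast model".split(),
    "energy forecast".split(),
    "energy storage".split(),
]
print(prune_vocabulary(docs, min_count=2, max_count=1.0))  # drops rare terms only
print(prune_vocabulary(docs, min_count=2, max_count=0.7))  # also drops ubiquitous terms
```

Under this reading, `min_count=5` with `max_count=1.0` would discard terms found in fewer than five abstracts and keep even those that appear in every abstract.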

#### Vectorized text

In [8]:
document_clustering.dfm

array([[ 7.13428736e-01, -4.76060165e-02,  5.29940552e-02, ...,
        -4.01021078e-02,  5.87939975e-02,  3.22097256e-03],
       [ 4.71860375e-01,  1.45710210e-01, -2.84670255e-01, ...,
         2.28289060e-02, -9.56853825e-02,  5.13728047e-02],
       [ 6.03413082e-01, -2.90265662e-01,  4.36284296e-01, ...,
         1.40226363e-02,  1.05095781e-02,  1.55980547e-02],
       ...,
       [ 4.46258507e-01,  7.62377289e-01,  2.30886605e-01, ...,
         2.96174282e-04,  6.35138520e-03, -2.24497242e-02],
       [ 4.32616139e-01,  1.74846432e-01,  3.16626155e-01, ...,
         2.82303028e-02, -9.16002390e-02, -3.84914296e-02],
       [ 5.23728023e-01,  4.36412038e-01,  1.75262218e-01, ...,
         4.89726965e-02,  1.56163155e-01,  9.69715171e-02]])

In [9]:
n_docs, n_features = document_clustering.dfm.shape
print(f'Shape of the vectorized documents: {n_docs} rows (documents) x {n_features} columns (features)')

Shape of the vectorized documents: 144 rows (documents) x 100 columns (features)


In [10]:
print(f'Explained variance retained after dimensionality reduction: {round(document_clustering.explained_var_*100,2)}%')

Explained variance retained after dimensionality reduction: 90.21%
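
The notebook does not show how `explained_var_` is computed. Assuming the reduction is LSA implemented as a truncated SVD, the retained share is commonly defined as the variance of the reduced components divided by the total variance of the original features:

$$
\text{retained share} = \frac{\sum_{j=1}^{k} \operatorname{Var}(z_j)}{\sum_{f=1}^{p} \operatorname{Var}(x_f)}
$$

Here $z_j$ is the $j$-th reduced component across documents, $x_f$ is the $f$-th column of the original document-term matrix, $k$ is `n_components` (100 in this run), and $p$ is the number of original features. Under that reading, 90.21% would mean the 100 components keep roughly nine tenths of the variance of the vectorized abstracts.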


In [11]:
print(f'Silhouette score of the clustering: {round(document_clustering.silhouette_score_,3)}')

Silhouette score of the clustering: 0.055
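
The silhouette coefficient ranges from -1 to 1. Values near 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping clusters, so 0.055 suggests the six clusters are only weakly separated. The sketch below computes the mean silhouette from scratch on hypothetical 2-D points. It assumes Euclidean distance, which may differ from the metric behind `silhouette_score_`.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    For each point: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The point itself contributes distance 0, so divide by len(own) - 1.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_score(points, [0, 0, 1, 1]))  # labels match the groups: high score
print(silhouette_score(points, [0, 1, 0, 1]))  # labels cut across the groups: negative score
```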


In [12]:
rdf['Cluster_labels'] = document_clustering.cluster_labels_

In [13]:
rdf['Cluster_labels'].value_counts()

1    97
0    25
5    10
3     6
4     3
2     3
Name: Cluster_labels, dtype: int64
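
To make the imbalance between clusters easier to read, the counts can be turned into shares of the corpus. The sketch below copies the sizes from the output above into a `Counter`, so it runs without the original data.

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output above.
cluster_sizes = Counter({1: 97, 0: 25, 5: 10, 3: 6, 4: 3, 2: 3})
total = sum(cluster_sizes.values())  # 144 documents in total

for label, size in cluster_sizes.most_common():
    print(f"cluster {label}: {size:3d} documents ({size / total:.1%})")
```

In the notebook itself, `rdf['Cluster_labels'].value_counts(normalize=True)` returns the same shares directly. Cluster 1 holds about two thirds of the documents, which, together with the low silhouette score, may indicate that the current settings do not separate the abstracts well.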