<h1 style="text-align:center">Clustering Text Documents Using KMeans in Scikit-Learn</h1>

This assignment follows the scikit-learn tutorial [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#).

## 1. Load Data

We load data from the 20 newsgroups text dataset, which comprises around 18,000 newsgroup posts on 20 topics. We select a subset of only 4 topics, which accounts for around 3,400 documents. See the example [Classification of text documents using sparse features](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) to gain intuition on the overlap of such topics.

In [2]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

labels = dataset.target
unique_labels, category_sizes = np.unique(labels, return_counts=True)
true_k = unique_labels.shape[0]

print(f"{len(dataset.data)} documents - {true_k} categories")

3387 documents - 4 categories


## 2. Quantifying the quality of clustering results

In this section, we define a function that scores clustering results with several evaluation metrics.

Clustering is fundamentally an unsupervised learning method. However, because we have the class labels for this dataset, we can use metrics that leverage this "supervised" ground truth to quantify the quality of the resulting clusters. Examples of such metrics are:
* homogeneity, which quantifies how much clusters contain only members of a single class;
* completeness, which quantifies how much members of a given class are assigned to the same clusters;
* V-measure, the harmonic mean of completeness and homogeneity;
* Rand-Index, which measures how frequently pairs of data points are grouped consistently according to the result of the clustering algorithm and the ground truth class assignment;
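* Adjusted Rand-Index, a chance-adjusted Rand-Index such that random cluster assignments have an ARI of 0.0 in expectation.

The `fit_and_evaluate` function defined in this section also reports the Silhouette Coefficient, which is computed from the data and the predicted clusters alone and needs no ground truth labels.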
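
To make the label-based metrics listed above more concrete, here is a minimal sketch on hand-made toy labels. The lists `true_labels`, `split_clusters`, and `permuted_clusters` are illustrative assumptions and have nothing to do with the newsgroups data; only the `sklearn.metrics` functions themselves come from the library.

```python
from sklearn import metrics

# Toy ground truth (assumption for illustration): two classes, two points each.
true_labels = [0, 0, 1, 1]

# Every cluster is pure, but class 1 is split across clusters 1 and 2.
split_clusters = [0, 0, 1, 2]
h = metrics.homogeneity_score(true_labels, split_clusters)   # pure clusters -> 1.0
c = metrics.completeness_score(true_labels, split_clusters)  # class 1 is split -> below 1.0
v = metrics.v_measure_score(true_labels, split_clusters)     # harmonic mean of h and c

# Same partition as the ground truth, with the cluster ids swapped.
permuted_clusters = [1, 1, 0, 0]
ri = metrics.rand_score(true_labels, permuted_clusters)
ari = metrics.adjusted_rand_score(true_labels, permuted_clusters)

print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
print(f"rand_index={ri:.3f} adjusted_rand_index={ari:.3f}")
```

All of these scores depend only on how points are grouped, not on the numeric ids given to the clusters, which is why the permuted labelling still counts as a perfect match.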

In [7]:
from collections import defaultdict
from time import time

from sklearn import metrics

evaluations = []
evaluations_std = []


def fit_and_evaluate(km, X, name=None, n_runs=5):
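    """Fit a clustering estimator several times and record timings and scores.

    km: KMeans instance (any estimator with set_params, fit and labels_ works)
    X: feature matrix to cluster
    name: label for this estimator in the results; defaults to the class name
    n_runs: number of training runs; run i uses random_state=i, for i in 0 .. n_runs - 1

    Ground truth comes from the module-level `labels` array.
    """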
    name = km.__class__.__name__ if name is None else name
    
    train_times = []               # training time of each run (one per seed)
    scores = defaultdict(list)     # maps metric name -> list of scores, one per run,
                                   # e.g. {"Homogeneity": [run_0, run_1, ...], ...}
    for seed in range(n_runs):
        # Set random_state parameter
        km.set_params(random_state=seed) 
        
        # Fit and store training time
        t0 = time()
        km.fit(X)
        train_times.append(time() - t0)
        
        # Evaluation metrics
        scores["Homogeneity"].append(metrics.homogeneity_score(labels, km.labels_))
        scores["Completeness"].append(metrics.completeness_score(labels, km.labels_))
        scores["V-measure"].append(metrics.v_measure_score(labels, km.labels_))
        scores["Adjusted Rand-Index"].append(
            metrics.adjusted_rand_score(labels, km.labels_)
        )
        scores["Silhouette Coefficient"].append(
            metrics.silhouette_score(X, km.labels_, sample_size=2000)
        )
    train_times = np.asarray(train_times)

    print(f"clustering done in {train_times.mean():.2f} ± {train_times.std():.2f} s ")
    evaluation = {
        "estimator": name,
        "train_time": train_times.mean(),
    }
    evaluation_std = {
        "estimator": name,
        "train_time": train_times.std(),
    }
    for score_name, score_values in scores.items():
        mean_score, std_score = np.mean(score_values), np.std(score_values)
        print(f"{score_name}: {mean_score:.3f} ± {std_score:.3f}")
        evaluation[score_name] = mean_score
        evaluation_std[score_name] = std_score
    evaluations.append(evaluation)
    evaluations_std.append(evaluation_std)
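
As a rough illustration of the repeated-seed pattern used in `fit_and_evaluate`, the sketch below runs KMeans five times on synthetic data and reports each metric as mean ± standard deviation. The synthetic data from `make_blobs`, the `n_init=1` setting, and the names `X_toy`, `y_toy`, `toy_scores`, and `summary` are assumptions made for this example; it is not the newsgroups pipeline.

```python
from collections import defaultdict

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 300 points drawn around 3 centers.
X_toy, y_toy = make_blobs(n_samples=300, centers=3, random_state=0)

toy_scores = defaultdict(list)  # metric name -> list of scores, one per seed
for seed in range(5):
    # n_init=1 so that each seed corresponds to a single initialization.
    km_toy = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X_toy)
    toy_scores["V-measure"].append(
        metrics.v_measure_score(y_toy, km_toy.labels_)
    )
    toy_scores["Silhouette Coefficient"].append(
        metrics.silhouette_score(X_toy, km_toy.labels_)
    )

# Aggregate across seeds: (mean, standard deviation) for each metric.
summary = {
    score_name: (float(np.mean(values)), float(np.std(values)))
    for score_name, values in toy_scores.items()
}
for score_name, (mean_score, std_score) in summary.items():
    print(f"{score_name}: {mean_score:.3f} ± {std_score:.3f}")
```

A large standard deviation across seeds suggests that the result depends strongly on the random initialization, which is the reason for averaging over several runs rather than trusting a single fit.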