# Text Mining of BBC News Data

## Part 3: Document Clustering in Reduced TF-IDF space


## Document Clustering

In [None]:
from pathlib import Path

text_filepaths = sorted(Path("bbc").glob("*/*.txt"))
categories = [p.parent.name for p in text_filepaths]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(
    input="filename", encoding="utf-8", decode_error="ignore",
    min_df=5, max_df=0.7)

tfidf_docs = tfidf_vectorizer.fit_transform(text_filepaths)

In [None]:
%%time
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=1)
kmeans_predictions = kmeans.fit_predict(tfidf_docs)

In [None]:
kmeans.cluster_centers_.shape

**Questions**:

- Run the previous clustering a second time. What do you observe?
- Can you suggest why this is the case?

In [None]:
kmeans_predictions[:10]

In [None]:
kmeans_predictions[-10:]

In [None]:
categories[:10]

In [None]:
categories[-10:]

In [None]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(kmeans_predictions, categories)

In [None]:
adjusted_rand_score([0, 1, 1, 0], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([2, 0, 0, 2], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([1, 1, 0, 0], ["a", "b", "b", "a"])

In [None]:
adjusted_rand_score([1, 0, 0, 2], ["a", "b", "b", "a"])

Some (supervised) clustering metrics:
    
- Adjusted Rand Index
- Adjusted / Normalized Mutual Information
- V-measure (homogeneity and completeness)
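
As a minimal sketch of how two of these metrics behave next to the Adjusted Rand Index, the cell below scores two toy labelings against a 4-sample ground truth. The names `toy_true`, `toy_permuted` and `toy_imperfect` are made up for this illustration and are not part of the BBC dataset.

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

# Hypothetical toy labels, used only for illustration.
toy_true = ["a", "a", "b", "b"]
toy_permuted = [1, 1, 0, 0]   # same partition, different cluster ids
toy_imperfect = [0, 1, 1, 1]  # one sample of "a" grouped with the "b" samples

toy_scores = {}
for name, predicted in [("permuted", toy_permuted), ("imperfect", toy_imperfect)]:
    toy_scores[name] = {
        "ARI": adjusted_rand_score(toy_true, predicted),
        "AMI": adjusted_mutual_info_score(toy_true, predicted),
        "NMI": normalized_mutual_info_score(toy_true, predicted),
    }

for name, values in toy_scores.items():
    print(name, values)
```

All three metrics ignore the actual cluster ids: the permuted labeling gets the maximum score, while the imperfect one scores lower.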

When we don't have ground truth labels (which is most often the case; otherwise, why not use a supervised classifier?), there is no single best way to quantify cluster quality. One could use the following metrics, but each of them makes different assumptions about what a "good" clustering result is:

- Measure inter- or intra-cluster average / min / max distances.
- Measure clustering stability across resamplings of the dataset and when adding small perturbations to the data.
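
One possible instance of the first idea is the silhouette coefficient, which compares each sample's mean intra-cluster distance to its mean distance to the nearest other cluster. The sketch below is an illustration on synthetic blobs rather than on the BBC data; the blob centers, the candidate values of `k` and the names `X_toy`, `silhouette_by_k` and `best_k` are all assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated 2D blobs (assumed data, not the BBC corpus).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Score several candidate numbers of clusters without using any labels.
silhouette_by_k = {}
for k in (2, 3, 4, 5):
    toy_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    silhouette_by_k[k] = silhouette_score(X_toy, toy_labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, best_k)
```

The silhouette coefficient favors compact, well-separated, roughly convex clusters, so a high score reflects that particular assumption about what a "good" clustering is.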

**Exercises**

- Find the documentation of clustering metrics on scikit-learn.org;
- What is the meaning of homogeneity and completeness?
- On a toy dataset with only 4 samples and 2 "true" clustering classes, find a clustering that is homogeneous but not complete, and the converse;
- Compute the homogeneity, completeness and V-measure scores for the results of the KMeans algorithm above.

In [None]:
# %load notebook_solutions/homogeneity_vs_completeness.py

## Faster Clustering with Dimensionality Reduction

In [None]:
%%time
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

rp_kmeans = make_pipeline(GaussianRandomProjection(n_components=500),
                          KMeans(n_clusters=5, n_init=10))
rp_kmeans_predictions = rp_kmeans.fit_predict(tfidf_docs)

In [None]:
adjusted_rand_score(rp_kmeans_predictions, categories)

**Questions**:
    
- Try to reduce the dimension further. What do you observe?
- What is the number of tunable parameters of the KMeans model in this case?

In [None]:
%%time
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD

svd_kmeans = make_pipeline(TruncatedSVD(n_components=50),
                           KMeans(n_clusters=5, n_init=10))
svd_kmeans_predictions = svd_kmeans.fit_predict(tfidf_docs)

In [None]:
adjusted_rand_score(svd_kmeans_predictions, categories)

**Exercise**:

- Compute the homogeneity and completeness scores for this pipeline;
- Change the parameter `n_clusters`. How can you explain the results?

**Questions**:

- How can text clustering help data scientists?
- What are some "real world" applications of unsupervised text clustering?
- What is the main limitation of clustering when trying to organize documents by topic?