# Simple adjacent text clustering

The purpose of this notebook is to demonstrate a text clustering approach that favours simplicity and speed over accuracy.

In [1]:
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.html import partition_html
from unstructured.documents.elements import Element
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from itertools import groupby
from typing import Iterable

We begin by partitioning and chunking a document in the usual way.

In [2]:
elements = partition_html(url="https://en.wikipedia.org/wiki/Cabinet_Office")
chunks = chunk_by_title(elements=elements)

We then use TF-IDF to build a sparse vector representation of each chunk.

In [3]:
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform([chunk.text for chunk in chunks])
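
To make the shape of this representation concrete, here is a minimal sketch on three made-up sentences. The names `toy_texts`, `toy_vectorizer` and `toy_vectors` are hypothetical stand-ins, not part of the notebook's data. Each row corresponds to one text and each column to one term in the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for chunk texts (not taken from the real document)
toy_texts = [
    "the cabinet office supports the prime minister",
    "the prime minister chairs the cabinet",
    "download as pdf printable version",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)

# One row per text, one column per vocabulary term
n_texts, n_terms = toy_vectors.shape
print(n_texts, n_terms)

# The result is a SciPy sparse matrix; .toarray() gives a dense array
dense = toy_vectors.toarray()
print(dense.shape)
```

The dense conversion matters for the next step, where the vectors are passed to the clusterer as a dense array.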

We then cluster the chunks, which assigns a label to each one.

In [4]:
clusterer = AgglomerativeClustering(n_clusters=3)
clustering = clusterer.fit(sparse_vectors.toarray())

Note that this clustering does not respect the one-dimensional, top-to-bottom order of our text: the set of chunks with label `1`, for example, is interrupted by chunks with other labels.

In [5]:
clustering.labels_

array([0, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2,
       0, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 2,
       0, 0, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0])
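
One way to quantify this fragmentation is to count how many separate runs each label forms. The sketch below uses a short, hypothetical label sequence (`toy_labels`) rather than the real output above. A label with more than one run is split across non-adjacent parts of the text.

```python
from collections import Counter
from itertools import groupby

# Hypothetical label sequence, much shorter than the real clustering output
toy_labels = [0, 2, 2, 1, 0, 0, 2, 1, 1]

# groupby collapses consecutive equal labels into a single run,
# so counting the run keys gives the number of runs per label
runs_per_label = Counter(label for label, _ in groupby(toy_labels))

# Labels that appear in more than one place in the sequence
fragmented = sorted(label for label, n in runs_per_label.items() if n > 1)
print(dict(runs_per_label), fragmented)
```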

We need clusters of chunks without holes in them, so we define the following helper function, `get_slices`, which iterates over the cluster labels and yields one slice for each run of adjacent chunks that share a label. These slices tell us which adjacent chunks to merge.

In [6]:
def get_slices(labels: Iterable) -> Iterable[slice]:
    """Yields the start and end of each run of equal adjacent labels as a slice.

    Example:
        >>> clusters = ["a", "a", "b", "a", "c", "c"]
        >>> list(get_slices(clusters))
        [slice(0, 2, None), slice(2, 3, None), slice(3, 4, None), slice(4, 6, None)]
    """
    start = 0
    for _, group in groupby(labels):
        stop = start + len(list(group))
        yield slice(start, stop)
        start = stop
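
As a quick check of how the slices are meant to be used, the sketch below repeats `get_slices` so that it runs on its own and applies it to a hypothetical list of items (`toy_items`). Because the slices cover the sequence from start to end without gaps or overlaps, the grouped items add up to the original list in the original order.

```python
from itertools import groupby
from typing import Iterable, Iterator


def get_slices(labels: Iterable) -> Iterator[slice]:
    """Same logic as the helper above, repeated so this cell is self-contained."""
    start = 0
    for _, group in groupby(labels):
        stop = start + sum(1 for _ in group)
        yield slice(start, stop)
        start = stop


# Hypothetical labels and items, for illustration only
toy_labels = ["a", "a", "b", "a", "c", "c"]
toy_items = ["t0", "t1", "t2", "t3", "t4", "t5"]

grouped = [toy_items[s] for s in get_slices(toy_labels)]
print(grouped)

# Flattening the groups gives back the original sequence
flattened = [item for group in grouped for item in group]
print(flattened == toy_items)
```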

We then apply this to our clustering.

In [7]:
slices = get_slices(clustering.labels_)
merged_chunks = [chunks[s] for s in slices]
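
Note that `get_slices` discards the cluster label of each run. If the label is needed downstream, one option is a variant that yields it alongside the slice. The helper `get_labelled_slices` below is a hypothetical addition, not part of the original notebook, and the texts are made up.

```python
from itertools import groupby
from typing import Hashable, Iterable, Iterator, Tuple


def get_labelled_slices(labels: Iterable[Hashable]) -> Iterator[Tuple[Hashable, slice]]:
    """Hypothetical variant of get_slices that also yields each run's label."""
    start = 0
    for label, group in groupby(labels):
        stop = start + sum(1 for _ in group)
        yield label, slice(start, stop)
        start = stop


# Made-up labels and chunk texts, for illustration only
toy_labels = [0, 0, 2, 1, 1]
toy_texts = ["intro", "history", "nav links", "ministers", "committees"]

# Join the texts of each run and keep the label it came from
merged = [
    (label, " ".join(toy_texts[s])) for label, s in get_labelled_slices(toy_labels)
]
print(merged)
```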

In [8]:
print(f"originally {len(chunks)} chunks, now {len(merged_chunks)}")

originally 69 chunks, now 22


In [9]:
for merged_chunk in merged_chunks:
    print(" ".join(chunk.text for chunk in merged_chunk))
    print("\n------------------------------------\n")

Toggle the table of contents

Cabinet Office

16 languages

Dansk

Deutsch

Eesti

Español

فارسی

Français

Bahasa Indonesia

Italiano

עברית

日本語

Norsk bokmål

Português

Simple English

Svenska

Türkçe

中文

Edit links

Article

Talk

English

Read

Edit

View history

Tools

Tools

Actions

Read

Edit

View history

General

What links here

Related changes

Upload file

Special pages

Permanent link

Page information

Cite this page

Get shortened URL

Download QR code

Wikidata item

------------------------------------

Print/export

Download as PDF

Printable version

In other projects

Wikimedia Commons

Coordinates: 51°30′13″N 0°7′36″W﻿ / ﻿51.50361°N 0.12667°W﻿ / 51.50361; -0.12667

From Wikipedia, the free encyclopedia

Ministerial department of the UK Government

This article is about the Cabinet Office in the United Kingdom. For other Cabinet Offices, see 

Cabinet Office (disambiguation). 70 Whitehall , Westminster Department overview Formed December 1916 Preceding Depart