text-clustering

Text clustering algorithms implemented with Hugging Face models and frameworks.

Apply k-means clustering to the top-layer representations that a pre-trained transformer model produces when the texts are passed through it.
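A minimal sketch of this approach, not the repository's actual code. The function names, the `bert-base-uncased` default, and the mean pooling over tokens are assumptions made for illustration. The k-means step is a plain Lloyd's iteration written out so the sketch needs nothing beyond `transformers` and `torch`; `sklearn.cluster.KMeans` would be the more usual choice in practice.

```python
import random


def embed_texts(texts, model_name="bert-base-uncased"):
    """Mean-pool the top-layer hidden states of a pre-trained transformer.

    The model name and the pooling strategy are illustrative assumptions.
    """
    # Imported lazily so kmeans() below stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)
    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return pooled.tolist()


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Return one cluster label per vector using Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialise the centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With a list of strings `texts`, a call such as `kmeans(embed_texts(texts), k=3)` would return one cluster label per text.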
Compute a "broad support" metric for each sentence. For each sentence $s$ in a collection $T$ of texts, estimate how much support it has among all of the texts as follows. For each text $t \in T$, let $u(s,t)$ be the entailment probability computed by an NLI model with $\text{premise} = t$ and $\text{hypothesis} = s$. Then set $S(s, T) = \sum_{t \in T,\ u(s,t) > k} u(s,t)$, where $k$ is a threshold (say, $0.5$). So $S(s,T)$ sums how strongly each text in $T$ implies $s$, restricted to the texts that imply it above the threshold $k$.
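The thresholded sum can be sketched as follows. This is an illustration, not the repository's code. The name `broad_support` and the `u` callable are hypothetical: `u(s, t)` stands for the entailment probability defined above, and in practice it would wrap a call to an NLI model.

```python
from typing import Callable, Sequence


def broad_support(
    sentence: str,
    texts: Sequence[str],
    u: Callable[[str, str], float],
    threshold: float = 0.5,
) -> float:
    """Compute S(s, T): the sum of u(s, t) over texts t with u(s, t) > threshold."""
    scores = (u(sentence, text) for text in texts)
    # Scores at or below the threshold contribute nothing to the sum.
    return sum(score for score in scores if score > threshold)
```

Passing the scorer in as a callable keeps the metric independent of any particular NLI model, so it can be checked with a stub before a real model is plugged in.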
Define a set of semantic dimensions, run NLI-based zero-shot classification of each text against those dimensions, then cluster the resulting softmax vectors.
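One way to build the softmax vectors is the Hugging Face `zero-shot-classification` pipeline, sketched below. The helper names and the `facebook/bart-large-mnli` default are assumptions, and the repository may do this differently. The pipeline returns labels sorted by score, so each result is reordered into a fixed dimension order before clustering.

```python
def dimension_vector(result, dimensions):
    """Reorder one pipeline result into the fixed order given by `dimensions`."""
    score_by_label = dict(zip(result["labels"], result["scores"]))
    return [score_by_label[d] for d in dimensions]


def zero_shot_vectors(texts, dimensions, model_name="facebook/bart-large-mnli"):
    """Score each text against every dimension with an NLI-based classifier.

    The model name is an illustrative assumption, not the repository's choice.
    """
    # Imported lazily so dimension_vector() works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    # With the default multi_label=False, scores are a softmax over the labels.
    results = classifier(list(texts), candidate_labels=list(dimensions))
    # A single input may come back as one dict rather than a list of dicts.
    if isinstance(results, dict):
        results = [results]
    return [dimension_vector(r, dimensions) for r in results]
```

With hypothetical dimensions such as `["politics", "sports", "technology"]`, each text becomes a three-element vector that sums to roughly 1, and those vectors can then be passed to any k-means implementation.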
