Skip to content

This package defines a collection of functions for clustering phrases such as names and titles into equivalence/similarity classes.

License

Notifications You must be signed in to change notification settings

llee454/cluster

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

README

The cluster module defines functions that allow you to cluster phrases, names, and titles by similarity or equivalence. It's functions can either accept a set of cluster categories and then group a set of names, phrases, or titles, into those categories, or it can accept a set of names, phrases, and titles, and partition them into a set of inferred clusters. This library is useful whenever you have data that you want to categorize.

Initializing the Build Environment

Note: the following are a hack solution to fix the fact that brew and opam are out of sync and opam's owl-plplot library requires library versions that can no longer be installed with brew.

opam switch create . ocaml-variants.4.10.0+flambda --no-install
opam update
opam install --deps-only .
dune build
dune exec analysis/main.exe

Usage in Other Projects

To use this PPX library in another project include the cluster library in your Dune file's libraries section. For example:

(executable
 (name main)
 (libraries
  core_kernel
  cluster
  )
 (preprocess
  (pps ppx_jane))
  (modes exe)
)

You must also update your OPAM package configuration file. Add the following line to your "pin-depends" section: ["cluster.1.0.0" "git+https://github.com/llee454/cluster.git#main"]. Add the following to your "depends" section: "cluster" { = "1.0.0"}.

Example

Imagine that you have a set of conceptual categories distinguished by the following keywords:

let clusters = [
  {
    label = "mental health issue";
    word_set = String.Set.of_list ["anxiety"; "depression"; "ptsd"; "adhd"; "attention"; "deficit"];
    entries = []
  };
  {
    label = "physical health issue";
    word_set = String.Set.of_list ["nutrition"; "dental"; "diabetes"];
    entries = []
  }
]

and you have a set of diagnoses such as:

let diagnoses = [
  "patient with anxiety disorder and panic attacks";
  "patient with dental carries and plaque";
  "attention deficit disorder with behavioral misconduct"
]

and you want to group the diagnoses into the two categories: "mental health issues," and "physical health issues." Then you can use the cluster function provided by this module to do so:

module Diagnoses_clusters = Cluster.Make ((
  struct
    type t = string [@@deriving sexp, compare]
    let get_name = Fn.id
  end : Cluster.Make_arg with type t = string
))

let categorize_diagnoses diagnoses =
  let open Diagnoses_clusters in
  cluster ~k_shared:0.5 ~k_not_shared:0.4 ~ignore:String.Set.empty
    ~clusters:(Diagnoses_clusters.Set.of_list clusters)
    diagnoses

Note: the cluster algorithm accepts two arguments: k_shared and k_not_shared. k_shared specifies how heavily the algorithm will weight words that appear in both the cluster's word set and the phrase being analyzed, while k_not_shared specifies how heavily the algorithm will weight words that are "missing." In general k_shared should be greater than k_not_shared.

About

This package defines a collection of functions for clustering phrases such as names and titles into equivalence/similarity classes.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages