goodcleanfun

Fun little NLP building blocks for the public good, free from capitalist interests, clean as in small/focused and low climate impact

Popular repositories

  1. tokenizer Public

    Jinja

  2. vector Public

    Generic vector functions for any numeric type in C

    C

  3. tokens Public

    Arrays of tokens stored as string offsets and lengths, plus a tokenized string that stores the tokens as a single contiguous array of NUL-terminated strings using cstring_array

    C

  4. token_types Public

    Global enum of token types and associated grouping functions.

    C

  5. utf8 Public

    UTF-8 strings to Unicode code points using utf8proc in C

    C

  6. khash Public

    Header-only clib package for khash.h

Repositories

Showing 10 of 47 repositories
  • sartorial Public

    Pydantic model base classes and custom type handling, JSON schema generation, etc. covering a variety of common scenarios without much config

    Python 0 MIT 0 0 1 Updated May 10, 2024
  • atypical Public

    Custom types for things like phone numbers, emails, etc. with normalization, Pydantic handling and JSON Schema serialization

    Python 0 MIT 0 0 2 Updated May 10, 2024
  • communal Public

    A library of common Python utilities and functions

    Python 0 MIT 0 0 2 Updated May 10, 2024
  • spelling Public

    A generative Bayesian noisy channel model of typos that focuses on the likelihood only, i.e. is agnostic of language model probabilities

    C 0 MIT 0 0 0 Updated Apr 4, 2024
  • pypi-template Public

    PyPI package template using copier

    Jinja 0 MIT 0 0 0 Updated Apr 4, 2024
  • emoji_sequences Public

    Emoji Sequence Data and regexes from Unicode

    C 0 MIT 0 0 0 Updated Mar 17, 2024
  • unicode_categories Public

    Unicode category regexes for tokenizers, built from the latest Unicode data

    C 0 MIT 0 0 0 Updated Mar 17, 2024
  • word_breaks Public

    Unicode word breaks for TR-29 segmentation

    C 0 MIT 0 0 0 Updated Mar 17, 2024
  • array Public

    Generic, dynamic arrays in C using simple includes and defines instead of macros

    C 0 MIT 0 0 0 Updated Feb 19, 2024
  • hirschberg Public

    Hirschberg's algorithm to recover full sequence alignments in linear space. Generic header-only implementation that takes a cost function, e.g. Longest Common Subsequence (LCS), Needleman-Wunsch, Damerau-Levenshtein, or any similar dynamic programming algorithm for sequences.

    C 0 MIT 0 0 0 Updated Feb 18, 2024

People

This organization has no public members. You must be a member to see who’s a part of this organization.