Merge pull request #1 from intercontinental-dictionary-series/bibiko
initial commit
xrotwang committed Feb 8, 2021
2 parents dadd284 + bb925d0 commit d7e7789
Showing 24 changed files with 4,666 additions and 1 deletion.
4 changes: 4 additions & 0 deletions .gitignore
@@ -127,3 +127,7 @@ dmypy.json

# Pyre type checker
.pyre/

# macOS
.DS_Store
__MACOSX
30 changes: 30 additions & 0 deletions .zenodo.json
@@ -0,0 +1,30 @@
{
    "title": "Kalamang IDS wordlist by Eline Visser",
    "access_right": "open",
    "keywords": [
        "cldf:Wordlist",
        "linguistics"
    ],
    "creators": [
        {
            "name": "Eline Visser"
        },
        {
            "name": "Bernard Comrie"
        },
        {
            "name": "Hans-J\u00f6rg Bibiko"
        }
    ],
    "contributors": [],
    "communities": [
        {
            "identifier": "lexibank"
        }
    ],
    "upload_type": "dataset",
    "description": "<p>Cite the source of the dataset as:</p>\n\n<blockquote>\n<p>Eline Visser. 2021. Kalamang dictionary. In: Key, Mary Ritchie &amp; Comrie, Bernard (eds.) The Intercontinental Dictionary Series. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://ids.clld.org/)</p>\n</blockquote>",
    "license": {
        "id": "CC-BY-4.0"
    }
}
7 changes: 7 additions & 0 deletions CONTRIBUTORS.md
@@ -0,0 +1,7 @@
# Contributors

Name | GitHub user | Description | Role
--- | --- | --- | ---
Eline Visser | | author, data entry | Author
Bernard Comrie | | consultant | Creator
Hans-Jörg Bibiko | @Bibiko | patron, code | Maintainer
27 changes: 27 additions & 0 deletions FORMS.md
@@ -0,0 +1,27 @@
## Specification of form manipulation


Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.split`: Splits a string into individual form chunks.
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
- `separators`: `;,/~`
Iterable of single-character tokens that should be recognized as word separators
- `missing_data`: `('?', '∅', '-', '--', '- -', '––', '???', '', '-666', '666', '—', 'ʼ')`
Iterable of strings that are used to mark missing data
- `strip_inside_brackets`: `True`
Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
- `replacements`: `[('[', ''), (']', ''), ('<', ''), ('>', '')]`
List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
- `first_form_only`: `False`
Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
- `normalize_whitespace`: `True`
Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
- `normalize_unicode`: `NFD`
Unicode normalization form to apply to the input of `split` (`None`, `'NFD'` or `'NFC'`)
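
The two-step processing described above can be illustrated with a minimal, standalone sketch. It does not import or reproduce the actual `FormSpec` implementation: the function names `split_value`, `clean_chunk` and `split_outside_brackets` are hypothetical, nested or unbalanced brackets are not handled, and the order of operations is an assumption based only on the attribute descriptions listed here.

```python
import re
import unicodedata

# Attribute values copied from the specification above.
BRACKETS = {"(": ")"}
SEPARATORS = ";,/~"
MISSING_DATA = ("?", "∅", "-", "--", "- -", "––", "???", "", "-666", "666", "—", "ʼ")
REPLACEMENTS = [("[", ""), ("]", ""), ("<", ""), (">", "")]
NORMALIZE_UNICODE = "NFD"


def split_outside_brackets(text):
    """Split text on separator characters, ignoring separators inside brackets."""
    chunks, current, closers = [], [], []
    for char in text:
        if closers and char == closers[-1]:
            closers.pop()  # leaving a bracketed span
        elif char in BRACKETS:
            closers.append(BRACKETS[char])  # entering a bracketed span
        if char in SEPARATORS and not closers:
            chunks.append("".join(current))
            current = []
        else:
            current.append(char)
    chunks.append("".join(current))
    return chunks


def clean_chunk(chunk):
    """Normalize one form chunk (hypothetical stand-in for the `clean` step)."""
    # Replacements are applied before bracketed content is stripped.
    for source, target in REPLACEMENTS:
        chunk = chunk.replace(source, target)
    # strip_inside_brackets: drop the brackets together with their content.
    for opening, closing in BRACKETS.items():
        chunk = re.sub(re.escape(opening) + r".*?" + re.escape(closing), "", chunk)
    # normalize_whitespace: trim and collapse runs of whitespace.
    return " ".join(chunk.split())


def split_value(value, first_form_only=False):
    """Turn a raw value into a list of forms (hypothetical stand-in for `split`)."""
    value = unicodedata.normalize(NORMALIZE_UNICODE, value)
    if value.strip() in MISSING_DATA:
        return []
    forms = [clean_chunk(chunk) for chunk in split_outside_brackets(value)]
    forms = [form for form in forms if form and form not in MISSING_DATA]
    return forms[:1] if first_form_only else forms


if __name__ == "__main__":
    print(split_value("  kalamang ; foo (bar, baz) / <qux>"))
```

With the defaults listed in this file, a value is first Unicode-normalized, then checked against `missing_data`, then split on the separators, and each chunk is cleaned. Setting `first_form_only=True` in the sketch keeps only the first resulting form.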