Merge pull request #1 from intercontinental-dictionary-series/bibiko

initial commit
intercontinental-dictionary-series · Feb 8, 2021 · d7e7789
2 parents dadd284 + bb925d0
Showing 24 changed files with 4,666 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
@@ -127,3 +127,7 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# macOS
+.DS_Store
+__MACOSX
diff --git a/.zenodo.json b/.zenodo.json
@@ -0,0 +1,30 @@
+{
+    "title": "Kalamang IDS wordlist by Eline Visser",
+    "access_right": "open",
+    "keywords": [
+        "cldf:Wordlist",
+        "linguistics"
+    ],
+    "creators": [
+        {
+            "name": "Eline Visser"
+        },
+        {
+            "name": "Bernard Comrie"
+        },
+        {
+            "name": "Hans-J\u00f6rg Bibiko"
+        }
+    ],
+    "contributors": [],
+    "communities": [
+        {
+            "identifier": "lexibank"
+        }
+    ],
+    "upload_type": "dataset",
+    "description": "<p>Cite the source of the dataset as:</p>\n\n<blockquote>\n<p>Eline Visser. 2021. Kalamang dictionary. In: Key, Mary Ritchie &amp; Comrie, Bernard (eds.) The Intercontinental Dictionary Series. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://ids.clld.org/)</p>\n</blockquote>",
+    "license": {
+        "id": "CC-BY-4.0"
+    }
+}
diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -0,0 +1,7 @@
+# Contributors
+
+Name               | GitHub user     | Description                          | Role
+---                | ---             | ---                                  | ---
+Eline Visser |  | author, data entry | Author
+Bernard Comrie |  | consultant | Creator
+Hans-Jörg Bibiko | @Bibiko | patron, code | Maintainer
diff --git a/FORMS.md b/FORMS.md
@@ -0,0 +1,27 @@
+## Specification of form manipulation
+
+
+Specification of the value-to-form processing in Lexibank datasets:
+
+The value-to-form processing is divided into two steps, implemented as methods:
+- `FormSpec.split`: Splits a string into individual form chunks.
+- `FormSpec.clean`: Normalizes a form chunk.
+
+These methods use the attributes of a `FormSpec` instance to configure their behaviour.
+
+- `brackets`: `{'(': ')'}`
+  Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
+- `separators`: `;,/~`
+  Iterable of single-character tokens that should be recognized as word separators
+- `missing_data`: `('?', '∅', '-', '--', '- -', '––', '???', '', '-666', '666', '—', 'ʼ')`
+  Iterable of strings that are used to mark missing data
+- `strip_inside_brackets`: `True`
+  Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
+- `replacements`: `[('[', ''), (']', ''), ('<', ''), ('>', '')]`
+  List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
+- `first_form_only`: `False`
+  Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
+- `normalize_whitespace`: `True`
+  Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
+- `normalize_unicode`: `NFD`
+  Unicode normalization form to use for input of `split` (`None`, `'NFD'` or `'NFC'`)
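
The two-step value-to-form processing described in FORMS.md can be illustrated with a small standalone sketch. This is not the pylexibank implementation: the functions `split_outside_brackets`, `strip_brackets`, `clean` and `split` below are hypothetical stand-ins written only from the attribute descriptions above, and the point at which missing-data markers are filtered (after cleaning each chunk) is an assumption. Only the configuration values are taken from the listing.

```python
import re
import unicodedata

# Configuration values copied from the FORMS.md listing.
BRACKETS = {"(": ")"}
SEPARATORS = ";,/~"
MISSING_DATA = ("?", "∅", "-", "--", "- -", "––", "???", "", "-666", "666", "—", "ʼ")
REPLACEMENTS = [("[", ""), ("]", ""), ("<", ""), (">", "")]


def split_outside_brackets(value):
    """Split on separator characters, ignoring separators inside brackets."""
    chunks, current, closers = [], [], []
    for char in value:
        if closers and char == closers[-1]:
            closers.pop()                      # leaving a bracketed span
        elif char in BRACKETS:
            closers.append(BRACKETS[char])     # entering a bracketed span
        if char in SEPARATORS and not closers:
            chunks.append("".join(current))
            current = []
        else:
            current.append(char)
    chunks.append("".join(current))
    return chunks


def strip_brackets(text):
    """Remove bracketed content (non-nested brackets only, a simplification)."""
    for opening, closing in BRACKETS.items():
        text = re.sub(re.escape(opening) + ".*?" + re.escape(closing), "", text)
    return text


def clean(chunk):
    """Hypothetical stand-in for the clean step: normalize one form chunk."""
    for source, target in REPLACEMENTS:        # replacements come first
        chunk = chunk.replace(source, target)
    chunk = strip_brackets(chunk).strip()      # strip_inside_brackets = True
    return " ".join(chunk.split())             # normalize_whitespace = True


def split(value, first_form_only=False):
    """Hypothetical stand-in for the split step: raw value -> list of forms."""
    value = unicodedata.normalize("NFD", value)  # normalize_unicode = 'NFD'
    forms = [clean(chunk) for chunk in split_outside_brackets(value)]
    # Assumption: chunks that equal a missing-data marker are dropped here.
    forms = [form for form in forms if form not in MISSING_DATA]
    return forms[:1] if first_form_only else forms


print(split("kalamang (n), [kala] ~ ?"))
print(split("kalamang (n), [kala] ~ ?", first_form_only=True))
```

With the default `first_form_only = False`, every surviving variant is kept as its own form; switching the flag on keeps only the first one. Because the input is normalized to NFD, a precomposed character such as `é` comes out as a base letter plus a combining accent.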