Merge pull request #383 from Anaphory/fix-sources-and-comments
Fix sources and comments
nataliacp committed Feb 4, 2023
2 parents 3cad3ae + dd70756 commit 76397f1
Showing 16 changed files with 183 additions and 131 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/python-tox.yml
@@ -12,7 +12,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        python: ["3.8", "3.9", "3.10"]
+        python: ["3.8", "3.9", "3.10", "3.11"]
        pycldf: ["1.24.0", "1.21.0", "NEWEST"]
    steps:
      - uses: actions/checkout@v2
@@ -36,6 +36,7 @@ jobs:
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
      - name: Install catalogs
        run: |
+         git config --global init.defaultBranch main
          mkdir -p ~/.config/cldf/
          (cd ~/.config/cldf/ && [ -d glottolog ] || git clone --depth 1 https://github.com/glottolog/glottolog.git)
          (cd ~/.config/cldf/ && [ -d concepticon-data ] || git clone --depth 1 https://github.com/concepticon/concepticon-data.git concepticon)
91 changes: 47 additions & 44 deletions docs/tour.rst
@@ -11,10 +11,11 @@ exporting the dataset as a phylogenetic alignment.

(To prevent this tutorial from becoming obsolete, our continuous integration
testing system ‘follows’ this tutorial when we update the software. So if it
-appears overly verbose or rigid to you at times, it is because it has a secondary
-function as test case. This is also the reason we use the command line where we
-can, even in places where a GUI tool would be handy: Our continuous integration
-tester cannot use the GUI.) ::
+appears overly verbose or rigid to you at times, it is because it has a
+secondary function as test case. This is also the reason we use the command line
+where we can, even in places where a GUI tool would be handy: Our continuous
+integration tester cannot use the GUI. We need to prepare this a bit to avoid
+confusing output later.) ::

$ python -m lexedata.importer.excel_interleaved --help
[...]
@@ -46,7 +47,7 @@ For this tutorial, we will be using lexical data from the Bantu family,
collected by Hilde Gunnink. The dataset is a subset of an earlier version
(deliberately, so this tour can show some steps in the cleaning process) of her
lexical dataset. The data is stored in an Excel file which you can download from
-https://github.com/Anaphory/lexedata/blob/master/src/lexeadata/data/example-bantu.xlsx
+https://github.com/Anaphory/lexedata/blob/master/src/lexedata/data/example-bantu.xlsx
in the lexedata repository. (We will use the most recent version here, which
comes shipped with lexedata. Sorry this looks a bit cryptic, but as we said, this
way the testing system also knows where to find the file.) ::
@@ -78,7 +79,7 @@ contains cognacy judgements for those forms.
This is one of several formats supported by lexedata for import. The
corresponding importer is called ``excel_interleaved`` and it works like this::

-$ python -m lexedata.importer.excel_interleaved --help
+$ python -m lexedata.importer.excel_interleaved --help # doctest: +NORMALIZE_WHITESPACE
usage: python -m lexedata.importer.excel_interleaved [-h]
[--sheets SHEET [SHEET ...]]
[--directory DIRECTORY]
@@ -155,9 +156,8 @@ A well-structured ``forms.csv`` is a valid, `“metadata-free”
this case, the data contains a column that CLDF does not know out-of-the-box,
but otherwise the dataset is fine. ::

-$ cldf validate forms.csv
-[...] UserWarning: Unspecified column "Cognateset_ID" in table forms.csv
-warnings.warn(
+$ cldf validate forms.csv
+[...]
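
The same check can also be scripted with pycldf instead of the ``cldf`` command
line tool. This is an illustrative sketch, not part of the tour or the commit:
``Dataset.from_data`` builds a metadata-free dataset from the single CSV file,
and ``validate()`` runs the same structural checks. ::

    import pycldf

    # Metadata-free CLDF: the module and schema are inferred from forms.csv itself.
    dataset = pycldf.Dataset.from_data("forms.csv")
    dataset.validate()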

Working with git
================
@@ -350,13 +350,17 @@ scaffold for metadata about languages etc. with another tool. ::

$ python -m lexedata.edit.add_table LanguageTable
INFO:lexedata:Found 14 different entries for your new LanguageTable.
-$ python -m lexedata.edit.add_table ParameterTable
-INFO:lexedata:Found 100 different entries for your new ParameterTable.
+$ python -m lexedata.edit.add_table ParameterTable --but-not-column ColumnSpec
+[...]
+WARNING:lexedata:Some of your reference values are not valid as IDs: ['go to', 'rain (v)', 'sick, be', 'sleep (v)']. You can transform them into valid ids by running lexedata.edit.simplify_ids

“Parameter” is CLDF speak for the things sampled per-language. In a
StructureDataset this might be typological features, in a Wordlist the
-ParameterTable contains the concepts. We will ignore the warning about IDs for now.
+ParameterTable contains the concepts. We will ignore the warning about IDs for
+now. Because ‘parameters’ can have very different data types, the ParameterTable
+by default has a column that allows the specification of a distinct data type
+for each parameter. We don't need this for a word list, so we told the table
+adding tool to not add that column.

Every form belongs to one language, and every language has multiple forms. This
is a simple 1:n relationship. Every form has one or more concepts associated
Expand All @@ -368,7 +372,7 @@ good. ::
$ git add languages.csv parameters.csv
$ git commit -am "Add language and concept tables"
[main [...]] Add language and concept tables
-3 files changed, 246 insertions(+), 4 deletions(-)
+3 files changed, [...] insertions(+), 4 deletions(-)
create mode 100644 languages.csv
create mode 100644 parameters.csv
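
To see what ``--but-not-column`` changed, one can inspect the resulting table
schema with pycldf. A small sketch, not part of the tour; the metadata filename
``Wordlist-metadata.json`` is an assumption about how the scaffold names its
files::

    import pycldf

    # List the ParameterTable's columns after adding it without ColumnSpec.
    dataset = pycldf.Dataset.from_metadata("Wordlist-metadata.json")
    columns = [column.name for column in dataset["ParameterTable"].tableSchema.columns]
    print(columns)  # expected to contain ID, Name, etc., but no "ColumnSpec"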

Expand Down Expand Up @@ -673,14 +677,14 @@ lexedata toolbox::
This was however not the only issue with the data. ::

$ python -m lexedata.report.extended_cldf_validate -q
-WARNING:lexedata:In cognates.csv, row 110: Alignment has length 4, other alignments of cognateset big_1 have length(s) {6}
-WARNING:lexedata:In cognates.csv, row 114: Alignment has length 6, other alignments of cognateset blood_11 have length(s) {4}
-WARNING:lexedata:In cognates.csv, row 122: Alignment has length 1, other alignments of cognateset die_1 have length(s) {2}
-WARNING:lexedata:In cognates.csv, row 127: Alignment has length 4, other alignments of cognateset eat_1 have length(s) {2}
-WARNING:lexedata:In cognates.csv, row 133: Alignment has length 6, other alignments of cognateset feather_19 have length(s) {4}
-WARNING:lexedata:In cognates.csv, row 138: Alignment has length 4, other alignments of cognateset full_8 have length(s) {6}
-WARNING:lexedata:In cognates.csv, row 151: Alignment has length 6, other alignments of cognateset knee_13 have length(s) {7}
-WARNING:lexedata:In cognates.csv, row 166: Alignment has length 3, other alignments of cognateset name_1 have length(s) {4}
+WARNING:lexedata:In cognates.csv, row 110: Alignment has length 4, other alignments of cognateset big-1 have length(s) {6}
+WARNING:lexedata:In cognates.csv, row 114: Alignment has length 6, other alignments of cognateset blood-11 have length(s) {4}
+WARNING:lexedata:In cognates.csv, row 122: Alignment has length 1, other alignments of cognateset die-1 have length(s) {2}
+WARNING:lexedata:In cognates.csv, row 127: Alignment has length 4, other alignments of cognateset eat-1 have length(s) {2}
+WARNING:lexedata:In cognates.csv, row 133: Alignment has length 6, other alignments of cognateset feather-19 have length(s) {4}
+WARNING:lexedata:In cognates.csv, row 138: Alignment has length 4, other alignments of cognateset full-8 have length(s) {6}
+WARNING:lexedata:In cognates.csv, row 151: Alignment has length 6, other alignments of cognateset knee-13 have length(s) {7}
+WARNING:lexedata:In cognates.csv, row 166: Alignment has length 3, other alignments of cognateset name-1 have length(s) {4}
[...]

The alignment column of the cognate table is empty, so there is no form for which
@@ -739,8 +743,8 @@ script has already done that for us::

$ head -n3 cognates.csv
ID,Form_ID,Cognateset_ID,Segment_Slice,Alignment,Source,Status_Column
-duala_all-all_1,duala_all,all_1,1:4,ɓ ɛ́ s ɛ̃ - -,,automatically aligned
-duala_arm-arm_7,duala_arm,arm_7,1:3,d i a,,automatically aligned
+duala_all-all-1,duala_all,all-1,1:4,ɓ ɛ́ s ɛ̃ - -,,automatically aligned
+duala_arm-arm-7,duala_arm,arm-7,1:3,d i a,,automatically aligned

Most scripts do not add a status column if there is none. To make use of this
functionality, we therefore add status columns to all tables. ::
@@ -871,16 +875,16 @@ polysemous forms connected to multiple concepts. ::
$ grep 'kikuyu_\(white\|new\)' forms.csv cognates.csv
forms.csv:kikuyu_new,Kikuyu,new,erũ,,e r ũ,,
forms.csv:kikuyu_white,Kikuyu,white,erũ,,e r ũ,,
-cognates.csv:kikuyu_new-new_3,kikuyu_new,new_3,1:3,e r ũ,,automatically aligned
-cognates.csv:kikuyu_white-white_2,kikuyu_white,white_2,1:3,e r ũ,,automatically aligned
+cognates.csv:kikuyu_new-new-3,kikuyu_new,new-3,1:3,e r ũ,,automatically aligned
+cognates.csv:kikuyu_white-white-2,kikuyu_white,white-2,1:3,e r ũ,,automatically aligned
$ python -m lexedata.edit.merge_homophones polysemies.txt
WARNING:lexedata:I had to set a separator for your forms' concepts. I set it to ';'.
INFO:lexedata:Going through forms and merging
100%|██████████| 1592/1592 [...]
$ grep 'kikuyu_\(white\|new\)' forms.csv cognates.csv
forms.csv:kikuyu_new,Kikuyu,new;white,erũ,,e r ũ,,
-cognates.csv:kikuyu_new-new_3,kikuyu_new,new_3,1:3,e r ũ,,automatically aligned
-cognates.csv:kikuyu_white-white_2,kikuyu_new,white_2,1:3,e r ũ,,automatically aligned
+cognates.csv:kikuyu_new-new-3,kikuyu_new,new-3,1:3,e r ũ,,automatically aligned
+cognates.csv:kikuyu_white-white-2,kikuyu_new,white-2,1:3,e r ũ,,automatically aligned
$ git commit -am "Annotate polysemies"
[main [...]] Annotate polysemies
4 files changed, 3302 insertions(+), 3288 deletions(-)
Expand All @@ -900,33 +904,32 @@ not represent disjoint, consecutive groups of segments also occur when morpheme
boundaries have been eroded or when a language has non-concatenative morphemes.
There is a script that reports such cases. ::

-$ python -m lexedata.report.nonconcatenative_morphemes > overlapping_cogsets
+$ python -m lexedata.report.nonconcatenative_morphemes > overlapping_cogsets # doctest: +NORMALIZE_WHITESPACE
[...]
WARNING:lexedata:In form ntomba_skin, segments are associated with multiple cognate sets.
-INFO:lexedata:In form ntomba_skin, segments 1:6 (l o p o h o) are in both cognate sets bark_22 and skin_27.
+INFO:lexedata:In form ntomba_skin, segments 1:6 (l o p o h o) are in both cognate sets bark-22 and skin-27.
WARNING:lexedata:In form ngombe_big, segments are associated with multiple cognate sets.
-INFO:lexedata:In form ngombe_big, segments 1:4 (n ɛ́ n ɛ) are in both cognate sets big_1 and many_12.
+INFO:lexedata:In form ngombe_big, segments 1:4 (n ɛ́ n ɛ) are in both cognate sets big-1 and many-12.
WARNING:lexedata:In form bushoong_go_to, segments are associated with multiple cognate sets.
-INFO:lexedata:In form bushoong_go_to, segments 1:4 (y ɛ ɛ n) are in both cognate sets go_to_1 and walk_1.
+INFO:lexedata:In form bushoong_go_to, segments 1:4 (y ɛ ɛ n) are in both cognate sets go_to-1 and walk-1.
WARNING:lexedata:In form lega_go_to, segments are associated with multiple cognate sets.
-INFO:lexedata:In form lega_go_to, segments 1:4 (ɛ n d a) are in both cognate sets go_to_2 and walk_1.
+INFO:lexedata:In form lega_go_to, segments 1:4 (ɛ n d a) are in both cognate sets go_to-2 and walk-1.
WARNING:lexedata:In form kikuyu_new, segments are associated with multiple cognate sets.
-INFO:lexedata:In form kikuyu_new, segments 1:3 (e r ũ) are in both cognate sets new_3 and white_2.
-$ cat overlapping_cogsets # doctest: +NORMALIZE_WHITESPACE
+INFO:lexedata:In form kikuyu_new, segments 1:3 (e r ũ) are in both cognate sets new-3 and white-2.
+$ cat overlapping_cogsets # doctest: +NORMALIZE_WHITESPACE
Cluster of overlapping cognate sets:
-    bark_22
-    skin_27
+    bark-22
+    skin-27
Cluster of overlapping cognate sets:
-    big_1
-    many_12
+    big-1
+    many-12
Cluster of overlapping cognate sets:
-    go_to_1
-    go_to_2
-    walk_1
+    go_to-1
+    go_to-2
+    walk-1
Cluster of overlapping cognate sets:
-    new_3
-    white_2
+    new-3
+    white-2

There are other ways to merge cognate sets, which we will see in a moment, but
this kind of structured report is suitable for automatic merging, in the same
4 changes: 2 additions & 2 deletions src/lexedata/edit/add_singleton_cognatesets.py
@@ -125,10 +125,10 @@ def create_singletons(
    )
    for form, slice in forms_and_segments:
        i = 1
-        singleton_id = f"X_{form}_{i:d}"
+        singleton_id = f"x_{form}_{i:d}"
        while singleton_id in all_cognatesets:
            i += 1
-            singleton_id = f"X_{form}_{i:d}"
+            singleton_id = f"x_{form}_{i:d}"
        all_cognatesets[singleton_id] = types.CogSet({})
        properties = {
            c_s_name: util.ensure_list(forms[form]["parameterReference"])[0],
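
The change above only lowercases the generated ID prefix, presumably so that
singleton cognateset IDs match the lowercase convention used elsewhere; the
surrounding loop is the usual “find the first free ID” pattern. A
self-contained sketch with invented data (not from the dataset) shows how it
behaves::

    # Invented example: one singleton ID is already taken.
    all_cognatesets = {"x_duala_all_1"}
    form = "duala_all"

    i = 1
    singleton_id = f"x_{form}_{i:d}"
    while singleton_id in all_cognatesets:
        i += 1
        singleton_id = f"x_{form}_{i:d}"

    print(singleton_id)  # -> x_duala_all_2, the first ID not yet in use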
18 changes: 18 additions & 0 deletions src/lexedata/edit/add_table.py
@@ -24,6 +24,13 @@
column.""",
metavar="TABLE",
)
parser.add_argument(
"--but-not-column",
type=str,
action="append",
default=[],
help="""Add the table, but without this column. (Can be specified multiple times.)""",
)
args = parser.parse_args()
logger = cli.setup_logging(args)

@@ -50,6 +57,17 @@
f"I don't know how to add a {args.table:s}. Is it a well-defined CLDF component, according to https://cldf.clld.org/v1.0/terms.rdf#components ?"
)

for skip_column in args.but_not_column:
try:
column_name = ds[args.table, skip_column].name
ds.remove_columns(args.table, skip_column)
except KeyError:
logger.warning(
"Table %s has no column %s to be not added, so it didn't get added anyway.",
args.table,
skip_column,
)

invalid_ids = []

def new_row(item: str):
44 changes: 41 additions & 3 deletions src/lexedata/edit/merge_cognate_sets.py
@@ -8,6 +8,7 @@
*Optionally*, merge cognate sets that get merged by this procedure.
"""

+import re
import argparse
import typing as t
from collections import defaultdict
@@ -24,11 +25,46 @@
    first,
    format_mergers,
    must_be_equal,
    parse_homophones_report,
    parse_merge_override,
)
from lexedata.util.simplify_ids import update_ids


+def parse_cognatesets_report(
+    report: t.TextIO,
+    logger: cli.logging.Logger = cli.logger,
+) -> t.List[t.List[types.Cognateset_ID]]:
+    r"""Parse cognateset merge instructions
+
+    The format of the input file is the same as the output of the homophones report
+
+    >>> from io import StringIO
+    >>> file = StringIO("Cluster of overlapping cognate sets:\n"
+    ...     " bark-22\n"
+    ...     " skin-27")
+    >>> parse_cognatesets_report(file)
+    [['bark-22', 'skin-27']]
+    """
+    cognateset_groups: t.List[t.List] = []
+    next_group = []
+    for line in report:
+        line = line.rstrip()
+        match = re.match(r"\s+?([\w-]+?)( \(.*\))?$", line)
+        if match:
+            next_group.append(match.group(1))
+        else:
+            if "luster" not in line:
+                logger.warning(
+                    "I assume '%s' is the header of a cluster of cognatesets to be merged, but it does not say ‘cluster’ so I am not sure.",
+                    line,
+                )
+            if next_group:
+                cognateset_groups.append(next_group)
+            next_group = []
+    if next_group:
+        cognateset_groups.append(next_group)
+    return cognateset_groups


# TODO: Options given on the command line should have preference over defaults,
# no matter whether they are given in terms of names ("Parameter_ID") or
# property URLs ("parameterReference")
@@ -221,10 +257,12 @@ def merge_cogsets(
"The cognatet set merger was initialized as follows\n Column : merger function\n"
+ "\n".join("{}: {}".format(k, m.__name__) for k, m in mergers.items())
)
# Parse the homophones instructions!
cogset_groups = parse_homophones_report(
# Parse the cognatesets instructions!
report: t.List[t.List[str]] = parse_cognatesets_report(
args.merge_file.open("r", encoding="utf8"),
)
print(report)
cogset_groups = {variants[0]: variants for variants in report}
if cogset_groups == defaultdict(list):
cli.Exit.INVALID_INPUT(
f"The provided report {args.merge_file} is empty or does not have the correct format."
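
Taken together, the new parser and the dictionary comprehension turn a
nonconcatenative-morphemes report into merge instructions, with the first
cognateset of each cluster as the merge target. A rough sketch of that data
flow, assuming the function can be imported as below::

    from io import StringIO

    from lexedata.edit.merge_cognate_sets import parse_cognatesets_report

    report = StringIO(
        "Cluster of overlapping cognate sets:\n"
        "    bark-22\n"
        "    skin-27\n"
        "Cluster of overlapping cognate sets:\n"
        "    new-3\n"
        "    white-2\n"
    )
    groups = parse_cognatesets_report(report)
    # groups == [['bark-22', 'skin-27'], ['new-3', 'white-2']]

    # As in merge_cogsets: the first ID of each cluster is the merge target.
    cogset_groups = {variants[0]: variants for variants in groups}
    # cogset_groups == {'bark-22': ['bark-22', 'skin-27'],
    #                   'new-3': ['new-3', 'white-2']}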
2 changes: 1 addition & 1 deletion src/lexedata/edit/merge_homophones.py
@@ -554,7 +554,7 @@ def parse_homophones_report(
    )
    target_id: t.Optional[types.Form_ID] = None
    for line in report:
-        match = re.match(r"\s+?(\w+?)( \(.*\))?$", line)
+        match = re.match(r"\s+?([\w-]+?)( \(.*\))?$", line)
        if match:
            id = match.group(1)
            if target_id is None:
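
The only change here widens the character class from ``\w`` to ``[\w-]``, so
that the hyphenated IDs introduced above are recognized by the homophones
report parser as well. A standalone comparison of the two patterns::

    import re

    old_pattern = re.compile(r"\s+?(\w+?)( \(.*\))?$")
    new_pattern = re.compile(r"\s+?([\w-]+?)( \(.*\))?$")

    line = "    bark-22"
    print(old_pattern.match(line))           # None: "-" is not part of \w
    print(new_pattern.match(line).group(1))  # bark-22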
1 change: 0 additions & 1 deletion src/lexedata/exporter/edictor.py
@@ -326,7 +326,6 @@ def write_edictor_file(
delimiter="\t",
)
out.writerow({column: rename(column) for column in tsv_header})
out_cognatesets: t.List[t.Optional[str]]
for f, (id, form) in enumerate(forms.items(), 1):
# store original form id in other field and get cogset integer id
this_form = dict(form)
4 changes: 2 additions & 2 deletions src/lexedata/exporter/phylogenetics.py
@@ -368,7 +368,7 @@ def apply_heuristics(
    ...     name="Central_Concept",
    ...     propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference"))
    >>> ds.auto_constraints(cst)
-    >>> ds.write(CognatesetTable=[
+    >>> _= ds.write(CognatesetTable=[
    ...     {"ID": "cognateset1", "Central_Concept": "concept1"}
    ...     ])
    >>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {'cognateset1': {'concept1'}}
@@ -384,7 +384,7 @@ def apply_heuristics(
    ...     propertyUrl="http://cldf.clld.org/v1.0/terms.rdf#parameterReference",
    ...     separator=","))
    >>> ds.auto_constraints(cst)
-    >>> ds.write(CognatesetTable=[
+    >>> _ = ds.write(CognatesetTable=[
    ...     {"ID": "cognateset1", "Central_Concepts": ["concept1", "concept2"]}
    ...     ])
    >>> apply_heuristics(ds, heuristic=AbsenceHeuristic.CENTRALCONCEPT) == {
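
Both hunks make the same cosmetic doctest fix: the value of a bare expression
in a doctest is printed and compared against the expected output, so binding
the return value of ``ds.write(...)`` to ``_`` keeps these examples quiet. The
mechanism in isolation, as a generic sketch::

    import doctest

    def demo():
        """
        >>> _ = sorted([2, 1])  # binding to _ discards the value: no output expected
        >>> sorted([2, 1])      # a bare expression must state its value
        [1, 2]
        """

    if __name__ == "__main__":
        doctest.testmod()  # passes silently when both examples behave as written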
12 changes: 0 additions & 12 deletions src/lexedata/importer/excel_matrix.py
@@ -47,13 +47,11 @@ class DB:
"""

cache: t.Dict[str, t.Dict[t.Hashable, t.Dict[str, t.Any]]]
source_ids: t.Set[str]

def __init__(self, output_dataset: pycldf.Wordlist):
"""Create a new *empty* cache associated with a dataset."""
self.dataset = output_dataset
self.cache = {}
self.source_ids = set()

@classmethod
def from_dataset(k, dataset, logger: cli.logging.Logger = cli.logger):
@@ -84,18 +82,12 @@ def cache_dataset(self, logger: cli.logging.Logger = cli.logger):
            except FileNotFoundError:
                self.cache[table_type] = {}

-        for source in self.dataset.sources:
-            self.source_ids.add(source.id)
-
    def drop_from_cache(self, table: str):
        self.cache[table] = {}

    def retrieve(self, table_type: str):
        return self.cache[table_type].values()

-    def add_source(self, source_id):
-        self.source_ids.add(source_id)
-
    def empty_cache(self):
        self.cache = {
            # TODO: Is there a simpler way to get the list of all tables?
@@ -112,10 +104,6 @@ def write_dataset_from_cache(self, tables: t.Optional[t.Iterable[str]] = None):
                table_type
            ].write(self.retrieve(table_type))
        self.dataset.write_metadata()
-        # TODO: Write BIB file, without pycldf
-        with self.dataset.bibpath.open("w", encoding="utf-8") as bibfile:
-            for source in self.source_ids:
-                print("@misc{" + source + ", title={" + source + "} }", file=bibfile)

    def associate(
        self, form_id: str, row: RowObject, comment: t.Optional[str] = None
Expand Down