mismatch between length of input and output #6

martino-vic · 2022-01-13T03:20:35Z

While working on streitberggothic I encountered the following issue:

# mismatch in col len
import pandas as pd
from pysem.glosses import to_concepticon

PATH = "Streitberg-1910-3659.tsv"

def main():
    dfgot = pd.read_csv(PATH, sep="\t").fillna("")
    glosses = [{"gloss": str(g), "pos": str(p)}
                for g, p in zip(dfgot.sense, dfgot.pos)]

    print(len(glosses),
          len(to_concepticon(glosses, language="de", pos_ref="pos",
                             max_matches=1)))

if __name__ == "__main__":
    main()

prints: 3645 3274

So there are somehow fewer output matches than input glosses.

I thought this might be valuable information for the developers, even though eventually I found a workaround, like so:

def main():
    dfgot = pd.read_csv(PATH, sep="\t")

    conid, conglo = [], []
    for g, p in zip(dfgot.sense, dfgot.pos):
        gloss = [{"gloss": g, "pos": p}]
        out = list(to_concepticon(gloss, language="de",
                                  pos_ref="pos", max_matches=1).values())[0]
        if out:
            conid.append(out[0][0])
            conglo.append(out[0][1])
        else:
            conid.append(None)
            conglo.append(None)

    dfgot["CONCEPTICON_ID"], dfgot["CONCEPTICON_GLOSS"] = conid, conglo
    del dfgot["form"]
    dfgot.to_csv("concepts.tsv", index=False, encoding="utf-8", sep="\t")

if __name__ == "__main__":
    main()


LinguList · 2022-01-13T07:06:46Z

The problem is that the function expects a list of dicts but returns a dict, so identical glosses presumably collapse into a single key. We may discuss whether this is the desired behavior. I recommend, however, disambiguating or collecting identical elicitation glosses beforehand, as the algorithm will map them in the same way anyway.
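To illustrate why a list-in, dict-out interface can shrink the result, here is a minimal sketch. `toy_mapper` is a hypothetical stand-in, not pysem's implementation, and it assumes the returned dict is keyed by the gloss string; the sample glosses are made up.

```python
# Hypothetical stand-in for a mapper that takes a list of dicts and
# returns a dict keyed by gloss. This is NOT pysem's implementation.
def toy_mapper(glosses):
    result = {}
    for entry in glosses:
        # A repeated gloss overwrites the existing key instead of adding one.
        result[entry["gloss"]] = []
    return result


glosses = [
    {"gloss": "Haus", "pos": "noun"},
    {"gloss": "gehen", "pos": "verb"},
    {"gloss": "Haus", "pos": "noun"},  # duplicate of the first entry
]
mapped = toy_mapper(glosses)
print(len(glosses), len(mapped))  # 3 2
```

Under that assumption, every repeated gloss in the input costs one entry in the output, which would account for a gap like 3645 vs. 3274.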

LinguList · 2022-01-13T07:10:51Z

So the recommended solution would be:

from collections import defaultdict

# Group row indices by identical (gloss, pos) pairs.
G = defaultdict(list)
for i, (g, p) in enumerate(zip(dfgot.sense, dfgot.pos)):
    G[g, p] += [i]
for (g, p), idxs in G.items():
    to_concepticon ...
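The grouping idea can be sketched end to end as follows: each distinct (gloss, pos) pair is looked up once and the result is copied back to every row that shares it, so the output has the same length as the input. The helper names `group_rows` and `scatter` and the `lookup` callable are hypothetical; in practice `lookup` would wrap a single-gloss `to_concepticon` call like the one in the workaround above.

```python
from collections import defaultdict


def group_rows(senses, poses):
    """Map each distinct (gloss, pos) pair to the row indices where it occurs."""
    groups = defaultdict(list)
    for i, (g, p) in enumerate(zip(senses, poses)):
        groups[g, p].append(i)
    return groups


def scatter(groups, lookup, n_rows):
    """Call `lookup` once per distinct pair and copy the result to every row."""
    out = [None] * n_rows
    for (g, p), rows in groups.items():
        match = lookup(g, p)
        for i in rows:
            out[i] = match
    return out


# Made-up sample data; `lookup` here is a placeholder that only joins strings.
senses = ["Haus", "gehen", "Haus"]
poses = ["noun", "verb", "noun"]
groups = group_rows(senses, poses)
result = scatter(groups, lambda g, p: f"{g}/{p}", len(senses))
print(result)  # ['Haus/noun', 'gehen/verb', 'Haus/noun']
```

Keeping the row indices alongside each pair is what allows the results to be written back into the original DataFrame columns in order.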

martino-vic · 2022-01-13T19:22:27Z

Ah okay, thanks, now it makes sense. So next time I should just disambiguate my word list as part of the preprocessing, or otherwise make sure that there are no duplicate senses.

martino-vic closed this as completed Jan 13, 2022
