mismatch between length of input and output #6

Closed
martino-vic opened this issue Jan 13, 2022 · 3 comments

Comments

@martino-vic

martino-vic commented Jan 13, 2022

While working on streitberggothic, I encountered the following issue:

# mismatch in col len
import pandas as pd
from pysem.glosses import to_concepticon

PATH = "Streitberg-1910-3659.tsv"

def main():
    dfgot = pd.read_csv(PATH, sep="\t").fillna("")
    glosses = [{"gloss": str(g), "pos": str(p)}
                for g, p in zip(dfgot.sense, dfgot.pos)]

    print(len(glosses),
          len(to_concepticon(glosses, language="de", pos_ref="pos",
                             max_matches=1)))

if __name__ == "__main__":
    main()

prints: 3645 3274

So the output somehow contains fewer entries than the input provided.

I thought this might be valuable information for the developers, even though I eventually found a workaround that maps one gloss at a time:

def main():
    dfgot = pd.read_csv(PATH, sep="\t")

    conid, conglo = [], []
    for g, p in zip(dfgot.sense, dfgot.pos):
        gloss = [{"gloss": g, "pos": p}]
        out = list(to_concepticon(gloss, language="de",
                                  pos_ref="pos", max_matches=1).values())[0]
        if out:
            conid.append(out[0][0])
            conglo.append(out[0][1])
        else:
            conid.append(None)
            conglo.append(None)

    dfgot["CONCEPTICON_ID"], dfgot["CONCEPTICON_GLOSS"] = conid, conglo
    del dfgot["form"]
    dfgot.to_csv("concepts.tsv", index=False, encoding="utf-8", sep="\t")

if __name__ == "__main__":
    main()
@LinguList
Contributor

The problem is that the function expects a list of dicts but returns a dict, so identical glosses end up under a single key and the output is shorter than the input. We can discuss whether this is the desired behavior. I recommend, however, deduplicating or grouping identical elicitation glosses beforehand, as the algorithm will map them in the same way anyway.
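
To illustrate the length mismatch, here is a minimal sketch that does not call pysem at all. `map_to_dict` is a hypothetical stand-in that returns a dict keyed by the gloss string; that keying is an assumption based on this thread, not a confirmed detail of `to_concepticon`.

```python
# Hypothetical stand-in for a mapper that takes a list of dicts and
# returns a dict keyed by gloss (assumed behavior, not pysem's real code).
def map_to_dict(glosses):
    return {entry["gloss"]: [] for entry in glosses}


glosses = [
    {"gloss": "Haus", "pos": "noun"},
    {"gloss": "gehen", "pos": "verb"},
    {"gloss": "Haus", "pos": "noun"},  # duplicate of the first entry
]

result = map_to_dict(glosses)

# Duplicate glosses share one key, so the dict is shorter than the list.
n_in, n_out = len(glosses), len(result)
print(n_in, n_out)
```

With three input entries and one duplicate, the dict holds only two keys, which is the same kind of shortfall as 3645 versus 3274.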

@LinguList
Contributor

So the recommended solution would be:

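from collections import defaultdict

# Collect the row indices of every distinct (gloss, pos) pair,
# so that each pair only needs to be mapped once.
G = defaultdict(list)
for i, (g, p) in enumerate(zip(dfgot.sense, dfgot.pos)):
    G[g, p] += [i]
# Keys are (gloss, pos) tuples, values are the matching row indices.
for (g, p), idxs in G.items():
    to_concepticon ...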
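
The grouping idea could be carried through to the end roughly as follows. This is only a sketch: `map_unique` and `toy_mapper` are hypothetical names, `toy_mapper` stands in for the actual `to_concepticon` call, and the ID it returns is made up. Plain lists are used in place of the DataFrame columns.

```python
from collections import defaultdict


def map_unique(senses, pos_tags, mapper):
    """Call `mapper` once per distinct (sense, pos) pair and
    return one result per input row, in the original order."""
    groups = defaultdict(list)
    for i, (g, p) in enumerate(zip(senses, pos_tags)):
        groups[g, p].append(i)

    aligned = [None] * len(senses)
    for (g, p), rows in groups.items():
        match = mapper(g, p)  # one lookup per distinct pair
        for i in rows:
            aligned[i] = match  # copy the result to every matching row
    return aligned


# Hypothetical mapper with a made-up ID; replace with the real lookup.
def toy_mapper(gloss, pos):
    table = {("Haus", "noun"): ("id-1", "HOUSE")}
    return table.get((gloss, pos))


aligned = map_unique(["Haus", "gehen", "Haus"],
                     ["noun", "verb", "noun"],
                     toy_mapper)
print(len(aligned), aligned)
```

Because the result list has exactly one entry per input row, it can be assigned back to a DataFrame column without a length mismatch.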

@martino-vic
Author

Ah okay, thanks, now it makes sense. Next time I should deduplicate my word list as part of the preprocessing, or otherwise make sure that there are no duplicate senses.
