Setting analysis widget #14

Open
kodymoodley opened this issue Jan 4, 2024 · 4 comments · May be fixed by #33
Comments

@kodymoodley
Contributor

Implement one feature for analysing the setting of a story:

  • One approach could be to obtain a list of keywords / uniquely identifying words from the story, say 'kw'.
  • Thereafter, we could find the 'closest' N words to each word in 'kw' within a pretrained embedding space for Dutch.
  • The cluster(s) of these words (projected to 2D with t-SNE) could be rendered to the screen to inform the setting.
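
A minimal sketch of the nearest-neighbour step described above, assuming "closest" means highest cosine similarity. The vectors in `toy_embeddings` are invented for illustration and stand in for a pretrained Dutch embedding model, which has not been chosen yet; the function names are hypothetical as well. The t-SNE rendering step is not covered here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(keyword, embeddings, n=2):
    """Return the n words most similar to `keyword`, excluding the keyword itself."""
    target = embeddings[keyword]
    scored = [
        (word, cosine_similarity(target, vector))
        for word, vector in embeddings.items()
        if word != keyword
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:n]]


# Toy 3-dimensional vectors (assumption: made up, not taken from a real model).
toy_embeddings = {
    "school": [0.9, 0.1, 0.0],
    "klas": [0.8, 0.2, 0.1],
    "leraar": [0.7, 0.3, 0.0],
    "strand": [0.0, 0.1, 0.9],
    "zee": [0.1, 0.0, 0.8],
}

kw = ["school", "strand"]
neighbours = {word: nearest_words(word, toy_embeddings, n=2) for word in kw}
```

With a real model, `toy_embeddings` would be replaced by a lookup into the pretrained vectors, and the combined keyword and neighbour vectors would then be passed to a t-SNE implementation for plotting.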
@kodymoodley kodymoodley added the enhancement New feature or request label Jan 4, 2024
@kodymoodley kodymoodley self-assigned this Jan 4, 2024
@f-hafner
Collaborator

f-hafner commented Jan 29, 2024

We defined the following subtasks:

  • start from corpus of stories
  • remove stopwords
  • lemmatize
  • put into a dataframe together with storyid and segment id
  • prepare embeddings: @kodymoodley finds out which model to use
  • extract similar words in embedding space
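
The preprocessing subtasks above could be sketched roughly as follows. This is an illustration only: `STOPWORDS` and `LEMMAS` are tiny hand-written stand-ins for the stopword list and lemmatizer of a real Dutch NLP model, the column names are assumptions, and the rows are kept as plain dictionaries rather than an actual dataframe.

```python
import re

# Hypothetical stand-ins for a real Dutch stopword list and lemmatizer.
STOPWORDS = {"de", "het", "een", "en", "in", "op", "mijn"}
LEMMAS = {"liepen": "lopen", "huizen": "huis", "kinderen": "kind"}


def preprocess(stories):
    """Turn {story_id: [segment_text, ...]} into one row per kept lemma."""
    rows = []
    for story_id, segments in stories.items():
        for segment_id, text in enumerate(segments):
            for token in re.findall(r"\w+", text.lower()):
                if token in STOPWORDS:
                    continue  # remove stopwords
                rows.append(
                    {
                        "storyid": story_id,
                        "segmentid": segment_id,
                        "lemma": LEMMAS.get(token, token),  # lemmatize
                    }
                )
    return rows


stories = {"s1": ["De kinderen liepen in het bos.", "Mijn huizen en de zee."]}
rows = preprocess(stories)
```

The resulting list of dictionaries can be handed to `pandas.DataFrame(rows)` to obtain the dataframe with story and segment ids.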

@f-hafner f-hafner mentioned this issue Jan 29, 2024
3 tasks
@f-hafner
Collaborator

f-hafner commented Jan 29, 2024

Questions to discuss

  • I am reusing the spacy model loaded for other tasks. is this ok here?
    • for instance, the "merge_noun_chunks" is added to the nlp model. Then, "Mijn eerste vriendje" becomes ["mijn een vriendje"]; if this is not added, we have ["mijn", "een", "vriendje"]
  • refactoring
    • the structures of the tagger and the setting analyzer are now quite similar; maybe we can think of combining them?
    • test the function util.is_valid_token(); reuse in tagging.py

@f-hafner f-hafner linked a pull request Jan 29, 2024 that will close this issue
3 tasks
@kodymoodley
Contributor Author

We defined the following subtasks:

  • start from corpus of stories
  • remove stopwords
  • lemmatize
  • put into a dataframe together with storyid and segment id
  • prepare embeddings: @kodymoodley finds out which model to use
  • extract similar words in embedding space

Thanks very much @f-hafner! Having the preprocessing completed is already super helpful. The lead applicants have recently informed me that they would like to pause work on the Setting widget until after the workshop, so this feature is no longer required for the workshop in April. But I / we could resume where you left off after the workshop.

@kodymoodley
Contributor Author

Questions to discuss

  • I am reusing the spacy model loaded for other tasks. is this ok here?

    • for instance, the "merge_noun_chunks" is added to the nlp model. Then, "Mijn eerste vriendje" becomes ["mijn een vriendje"]; if this is not added, we have ["mijn", "een", "vriendje"]
  • refactoring

    • the structures of the tagger and the setting analyzer are now quite similar; maybe we can think of combining them?
    • test the function util.is_valid_token(); reuse in tagging.py

@f-hafner, I will revisit this comment in April / May. Right now, I suspect that merging the noun chunks would not be necessary for what we want to do.
