tidy.STM with content variable in stm #209

Merged: 4 commits, Apr 27, 2022

jonathanvoelkle
Contributor

I realized that when a document-level content covariate is set in stm, the beta list contains the interaction coefficients in addition to the topic-term probabilities, and it provides a separate beta matrix for each level of the covariate. In that case tidy.STM returns a tibble in which some topics and terms are NA and each topic-term combination has multiple values (exactly as many as the covariate has levels). Reprex below.

So I restricted tidy.STM to return only the beta values and added a column to the beta tibble that indicates the covariate level.

library(stm)
#> stm v1.3.6 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(tidytext)
topic_model_content <- stm(poliblog5k.docs, poliblog5k.voc, K = 20,
                           prevalence = ~ rating + s(day), content = ~ rating,
                           max.em.its = 3, data = poliblog5k.meta,
                           init.type = "Spectral", verbose = FALSE)
td_beta_content <- tidy(topic_model_content)
td_beta_content
#> # A tibble: 271,097 × 3
#>    topic term      beta
#>    <int> <chr>    <dbl>
#>  1    NA <NA>  0.000133
#>  2    NA <NA>  0.000292
#>  3    NA <NA>  0.000269
#>  4    NA <NA>  0.000536
#>  5    NA <NA>  0.000376
#>  6    NA <NA>  0.000109
#>  7    NA <NA>  0.000294
#>  8    NA <NA>  0.000101
#>  9    NA <NA>  0.000130
#> 10    NA <NA>  0.000246
#> # … with 271,087 more rows

Created on 2022-04-20 by the reprex package (v2.0.1)
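
To illustrate the idea behind the fix, here is a minimal base-R sketch, not the code from this PR. It uses a small mock object instead of a fitted model, and it assumes that a content-covariate fit stores one log-probability matrix per covariate level in `beta$logbeta`, other entries such as `beta$kappa` alongside it, and the level names in `settings$covariates$yvarlevels`. The function name `tidy_beta_sketch` and the column name `y.level` are hypothetical placeholders.

```r
# Mock of the relevant parts of a fitted model.
# The field layout below is an assumption, not taken from a real stm fit.
K <- 2
vocab <- c("tax", "vote", "war")
y_levels <- c("Conservative", "Liberal")

# Build a K x V matrix of log probabilities whose rows sum to 1 on the
# probability scale.
make_logbeta <- function(seed) {
  set.seed(seed)
  p <- matrix(runif(K * length(vocab)), nrow = K)
  log(p / rowSums(p))
}

mock_model <- list(
  vocab = vocab,
  beta = list(
    logbeta = list(make_logbeta(1), make_logbeta(2)),  # one matrix per level
    kappa = list("interaction coefficients would live here")
  ),
  settings = list(covariates = list(yvarlevels = y_levels))
)

# Hypothetical helper: keep only the topic-term probabilities and label
# each block of rows with the covariate level it belongs to.
tidy_beta_sketch <- function(model) {
  logbeta <- model$beta$logbeta
  lv <- model$settings$covariates$yvarlevels
  pieces <- lapply(seq_along(logbeta), function(i) {
    m <- exp(logbeta[[i]])
    # as.vector() walks the matrix column by column, so topics cycle
    # fastest and each term is repeated once per topic.
    out <- data.frame(
      topic = rep(seq_len(nrow(m)), times = ncol(m)),
      term = rep(model$vocab, each = nrow(m)),
      beta = as.vector(m),
      stringsAsFactors = FALSE
    )
    if (!is.null(lv)) out$y.level <- lv[i]
    out
  })
  do.call(rbind, pieces)
}

td <- tidy_beta_sketch(mock_model)
head(td)
```

With this layout, every topic-term pair appears once per covariate level, no row has a missing topic or term, and the probabilities within each topic and level still sum to 1.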

@juliasilge
Owner

Nice! I have not used that content argument before. We should try to use a column name in the output that aligns with some existing tidy() output. Maybe y.level?

Would you be up for adding a test in this file, probably using gadarian since we already use that in tests?

@juliasilge juliasilge left a comment

Thank you so much @jonathanvoelkle! 🙌

@juliasilge juliasilge merged commit 5af1c66 into juliasilge:main Apr 27, 2022
@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 12, 2022