LDA and tf-idf document term matrix #1

Closed
TheOne000 opened this issue Aug 8, 2017 · 3 comments

TheOne000 commented Aug 8, 2017

Dear Ted

Question: Can a tf-idf weighted document-term matrix be passed to Latent Dirichlet Allocation (LDA)? If yes, how?

It does not work in my case: the LDA function requires a document-term matrix with term-frequency weighting.

Thank you
(I have kept the question as concise as possible. If you need more details, I can add them.)

##########################################################################
                           TF-IDF Document matrix construction
##########################################################################    

> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+   function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i       : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j       : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v       : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow    : int 64
$ ncol    : int 297
$ dimnames:List of 2
  ..$ Docs : chr [1:64] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document frequency" "tf-idf"

##########################################################################
                           LDA section
##########################################################################

> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs",
+                    control = list(nstart = nstart, seed = seed, best = best,
+                                   burnin = burnin, iter = iter, thin = thin))

##########################################################################
                           Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,  : 
  The DocumentTermMatrix needs to have a term frequency weighting

kwartler commented Aug 9, 2017

Try this Stack Overflow explanation, which includes a workaround. I have never done it myself.
Apparently, LDA requires TF rather than tf-idf because it models distributions over word counts.
I wouldn't recommend using LDA this way. I suppose you could do some data wrangling to get the matrix into a usable format for LDA, but the authors of LDA clearly want TF.
What exactly are you trying to accomplish?

TheOne000 commented Aug 9, 2017 via email

kwartler commented Nov 9, 2017

I gave this some thought, and I think you could build a tf-idf weighted DTM, then apply a heuristic to identify the low-quality terms.

library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                           function(x)
                                             weightTfIdf(x, normalize =
                                                           FALSE),
                                         stopwords = TRUE))

dtmM <- as.matrix(dtm)
# Sum each term's tf-idf weight across all documents
tfScores <- colSums(dtmM)
tfScores <- data.frame(term = names(tfScores), tfScoring = tfScores)
tfScores <- tfScores[order(tfScores$tfScoring), ]

# Then perform a subset based on deciling, or another heuristic, for example
drops <- subset(tfScores$term, tfScores$tfScoring <= 5) # or change to 0 etc.
drops <- as.character(drops)

drops is a vector of terms that can be concatenated to the stop word list. The example above has no corpus cleaning functions applied, so you would have to do that first. Rebuilding the DTM with the extended stop word list and the default term-frequency weighting then gives you a tf-idf-filtered input for LDA.
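
Putting the steps together, here is a minimal sketch of how that could look end to end. It assumes the topicmodels package for LDA (the error in the original question matches its input check). The cutoff of 5, k = 3, and the Gibbs settings are arbitrary illustrative values, and I have not tuned or validated any of this on real data.

```r
library(tm)
library(topicmodels)  # assumption: LDA() comes from this package

data("crude")

# Step 1: tf-idf weighted DTM, used only to score terms
dtm_tfidf <- DocumentTermMatrix(crude, control = list(
  weighting = function(x) weightTfIdf(x, normalize = FALSE),
  stopwords = TRUE))

# Sum each term's tf-idf weight over all documents; flag the low scorers
term_scores <- colSums(as.matrix(dtm_tfidf))
drops <- names(term_scores)[term_scores <= 5]  # illustrative cutoff

# Step 2: rebuild the DTM with the default term-frequency weighting,
# adding the flagged terms to the stop word list
dtm_tf <- DocumentTermMatrix(crude, control = list(
  stopwords = c(stopwords("english"), drops)))

# The weighting attribute should now read "term frequency" / "tf"
attr(dtm_tf, "weighting")

# LDA rejects documents with no remaining terms, so drop empty rows
dtm_tf <- dtm_tf[rowSums(as.matrix(dtm_tf)) > 0, ]

# Step 3: fit LDA on the filtered term-frequency matrix
lda_fit <- LDA(dtm_tf, k = 3, method = "Gibbs",
               control = list(seed = 1234, burnin = 100, iter = 500))

# Top five terms per topic
terms(lda_fit, 5)
```

The tf-idf matrix never reaches LDA here. It only decides which terms to discard, so the model still sees raw counts.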

kwartler closed this as completed Nov 9, 2017