LDA and tf-idf document term matrix #1

TheOne000 · 2017-08-08T08:43:11Z

Dear Ted,

Question: Can a tf-idf document-term matrix be used as input to Latent Dirichlet Allocation (LDA)? If yes, how?

It does not work in my case: the LDA function requires a document-term matrix with 'term frequency' weighting.

Thank you.
(I have kept the question as concise as possible. If you need more details, I can add them.)

##########################################################################
                           TF-IDF Document matrix construction
##########################################################################    

> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+   function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i       : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j       : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v       : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow    : int 64
$ ncol    : int 297
$ dimnames:List of 2
  ..$ Docs : chr [1:64] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document frequency" "tf-idf"

##########################################################################
                           LDA section
##########################################################################

> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,
+                    seed = seed, best = best,
+                    burnin = burnin, iter = iter, thin = thin))

##########################################################################
                           Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,  :
  The DocumentTermMatrix needs to have a term frequency weighting

kwartler · 2017-08-09T15:19:37Z

Try this stackoverflow explanation with a workaround. I have never done it myself.
Apparently, LDA requires TF, not tf-idf, because it models distributions over word counts.
I wouldn't recommend using LDA this way. I suppose you could do some data wrangling to get it into a usable format for LDA, but the authors of LDA clearly want TF.
What exactly are you trying to accomplish?
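
To make the point about counts concrete, here is a minimal base R sketch. The toy matrix and the helper `is_count` are made up for illustration; the weighting is tf times log2(N / document frequency), which is assumed to match what `weightTfIdf(x, normalize = FALSE)` computes.

```r
# Toy document-term counts (hypothetical data, not from the thread).
counts <- rbind(doc1 = c(oil = 2, price = 1, opec = 0),
                doc2 = c(oil = 1, price = 0, opec = 3),
                doc3 = c(oil = 1, price = 1, opec = 0))

# Unnormalized tf-idf: term count times log2(number of docs / document frequency).
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf      <- log2(n_docs / doc_freq)
tfidf    <- sweep(counts, 2, idf, `*`)

# LDA models how many times each word occurs, so its input should be
# non-negative whole numbers. This hypothetical helper checks that.
is_count <- function(m) all(m >= 0) && all(m == round(m))

is_count(counts)  # raw term frequencies are counts
is_count(tfidf)   # tf-idf weights are fractional
tfidf[, "oil"]    # a term present in every document gets weight 0
```

The tf-idf weights are no longer word counts, which is presumably why `LDA()` checks the weighting attribute and stops.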

TheOne000 · 2017-08-09T19:32:38Z

Hi Ted,

My plan is to use TF-IDF as a tool to remove some terms from the corpus after pre-processing. As you know, high-frequency words that are not on a stop word list do not always contribute meaningful information to a document. Term frequency shows only how often a term appears in a document, whereas TF-IDF also weights each term by its rarity across documents. I would like to clean the corpus in this fashion before applying LDA to it.

Thank you for your elaboration.
Sapphasak

2017-08-09 16:19 GMT+01:00 kwartler <notifications@github.com>:

Try this stackoverflow <https://stackoverflow.com/questions/33770287/documenttermmatrix-needs-to-have-a-term-frequency-weighting-error> explanation with a workaround. I have never done it myself. Apparently, LDA requires TF, not tf-idf, because it models distributions over word counts. I wouldn't recommend using LDA this way. I suppose you could do some data wrangling to get it into a usable format for LDA, but the authors of LDA clearly want TF. What exactly are you trying to accomplish?

kwartler · 2017-11-09T04:41:05Z

I was giving this some thought, and I think you could build a tf-idf weighted DTM, then apply a heuristic to identify the low-quality terms.

library(tm)
data("crude")

# Build a tf-idf weighted DTM; it is used only to score terms, not as LDA input
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                           function(x)
                                             weightTfIdf(x, normalize = FALSE),
                                         stopwords = TRUE))

# Sum each term's tf-idf weight over all documents and sort ascending
dtmM <- as.matrix(dtm)
tfScores <- colSums(dtmM)
tfScores <- data.frame(term = names(tfScores), tfScoring = tfScores)
tfScores <- tfScores[order(tfScores$tfScoring), ]

# Then perform a subset based on deciling, or other heuristic, for example:
drops <- subset(tfScores$term, tfScores$tfScoring <= 5) # or change to 0 etc.
drops <- as.character(drops) # term may be a factor in R < 4.0

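drops is a character vector of terms that can be appended to the stop word list. The example above applies no corpus cleaning functions, so you would have to do that first. You would then rebuild the DTM with the default term frequency weighting and the extended stop word list, which gives LDA a term-frequency matrix whose vocabulary has been filtered by tf-idf.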
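
A rough sketch of that last step, assuming the tm and topicmodels packages are installed. The threshold of 5, k = 3, and the Gibbs settings are placeholders chosen for illustration, not recommendations, and I have not tuned any of this.

```r
library(tm)
library(topicmodels)

data("crude")

# Step 1: score terms with an unnormalized tf-idf DTM.
dtm_tfidf <- DocumentTermMatrix(
  crude,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                 stopwords = TRUE))
scores <- colSums(as.matrix(dtm_tfidf))
drops  <- names(scores)[scores <= 5]   # placeholder threshold

# Step 2: rebuild the DTM with the default term frequency weighting,
# treating the low-scoring terms as extra stop words.
dtm_tf <- DocumentTermMatrix(
  crude,
  control = list(stopwords = c(stopwords("en"), drops)))

# Step 3: LDA cannot handle documents with no remaining terms, so drop them.
dtm_tf <- dtm_tf[rowSums(as.matrix(dtm_tf)) > 0, ]

# Step 4: fit LDA on the term-frequency matrix (k and control values are assumptions).
lda_fit <- LDA(dtm_tf, k = 3, method = "Gibbs",
               control = list(seed = 1234, burnin = 100, iter = 500))
terms(lda_fit, 5)   # top 5 terms per topic
```

On a real corpus the cleaning steps (lowercasing, stemming and so on) would need to be identical in both DocumentTermMatrix calls, otherwise the terms in drops will not match the tokens in the second matrix.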

kwartler closed this as completed Nov 9, 2017
