Error in validObject(r) : invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 33411 #168
Comments
I am getting a similar error message. I am using quanteda v0.9.6-1 for the capstone project of the Data Science certificate offered by Coursera. I have created training datasets based on text files. Three weeks ago, everything was working fine. I started working on the final version of the project and now get that error for two of the three datasets. I did upgrade to R 3.3.0 and had to reinstall several packages, including quanteda, as well as their dependencies. Is it possible that something "broke" with the latest R update? Example follows: |
The email dataset works fine for me, I just tested it. Could be that your version of the Matrix package needs updating.
Rerun the code after:
My session info:
|
This still doesn't work for me, even after updating, as long as I have this code before creating the dfm:
So it looks like the sub() created the bug (maybe by creating some empty documents, as you mentioned above). Here is my sessionInfo():
|
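The empty-document hypothesis above can be checked directly before building the corpus. This is a minimal sketch in base R, assuming the cleaned text is a character vector. The `body` vector is hypothetical sample data standing in for `emails$body`, and whether empty documents actually trigger the dgTMatrix error is only the conjecture raised in this thread, not a confirmed cause.

```r
# Hypothetical sample standing in for emails$body after the sub() cleaning step
body <- c("Call me tomorrow", "", "   \n", NA, "Please see the attached memo")

# Flag documents that are NA or contain nothing but whitespace
is_empty <- is.na(body) | !nzchar(trimws(body))

# Number of empty documents produced by the cleaning step
n_empty <- sum(is_empty)
print(n_empty)

# Keep only the non-empty documents for corpus construction
body_clean <- body[!is_empty]
print(length(body_clean))
```

If `n_empty` is greater than zero on the real data, passing `body_clean` instead of the raw column to `corpus()` is one way to test whether the error goes away.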
Should I wait for the next stable release or download the developer version perhaps? |
Please reinstall quanteda from GitHub; you need 0.9.6-5. |
Updated quanteda from GitHub using sessionInfo(): R version 3.3.0 (2016-05-03) locale: attached base packages: other attached packages: loaded via a namespace (and not attached): |
Great. |
This is still an issue for me...
with 500,000 documents it fails |
Yes this is a bug in v0.9.9-3. I just today submitted a fixed version to CRAN, which can be installed from the GitHub repository now if you don't wish to wait. |
yep done and solved thx |
Hi, I am having the same issue as the title on this post. When I try to create a sparse matrix it gives me the following error: Error in validObject(r) : I tried to install quanteda from devtools but it's not helping. Any suggestions? |
@cesarmolea Without any information on your versions or what you have tried that produces this error, it's impossible to help. Please output I would note that this error was resolved three years ago so I suspect you are using a very old version. Try updating everything first. |
@kbenoit Thank you. I updated and it still doesn't work. Here is the command review_matrix <- new_data %>% Basically, I am filtering by the type of document. If I eliminate the name of the document ("name") it works, but I need that. And this is the sessionInfo: R version 4.0.2 (2020-06-22) Matrix products: default locale: attached base packages: other attached packages: loaded via a namespace (and not attached): Thanks in advance for your help. |
Looks like a tidytext problem to me, not a quanteda issue. |
Hi again,
When I create a dfm from my emails corpus, I get the error message:
Below is my code. The data comes from https://www.kaggle.com/kaggle/hillary-clinton-emails/ again.
Also, interestingly, when I write LIMIT 100 at the end of the SQL query, with exactly the same code for everything else, I don't have any problem. So it might be linked with the size of the corpus.
# Connect to db
db <- dbConnect(dbDriver("SQLite"), "output/database.sqlite")
# Get all emails
emails <- dbGetQuery(db, "
SELECT ExtractedBodyText body,MetadataSubject subject, MetadataDateSent date
FROM Emails e
INNER JOIN Persons p
ON e.SenderPersonId=P.Id
WHERE p.Name='Hillary Clinton'
AND e.ExtractedBodyText != ''
ORDER BY RANDOM()")
# Create new column with weekdays, and column weekend
emails = emails %>% separate(date, "date", sep = "T") %>% mutate(
  # use the piped date column rather than the outer emails$date
  weekday = weekdays(as.Date(date)),
  weekend = ifelse(weekday %in% c('Saturday','Sunday'),1,0)
)
# Clean some of the email bodies that still contain part of the header (manual inspection of emails)
emails = emails %>% mutate(body = sub("H <.Re:.\n", "", body)) %>%
mutate(body = sub("H <._Re:", "", body)) %>%
mutate(body = sub("H <._RELEASE IN.B6", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6\n", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6", "", body)) %>%
mutate(body = sub("RELEASE IN PART\nB6", "", body)) %>%
mutate(body = sub("RELEASE IN PART.\B1", "", body)) %>%
mutate(body = sub("H <._Fw:", "", body)) %>%
mutate(body = sub('Declassify on: 04/23/2035', "", body)) %>%
mutate(body = sub("H <._B6\nB6\n", "", body)) %>%
mutate(body = sub("H <._PM\n", "", body)) %>%
mutate(body = sub("H <._AM\n", "", body))
# Create corpus
email_corpus = corpus(emails$body)
# Create dfm, after stemming and removing stopwords
email_dfm <- dfm(email_corpus,ignoredFeatures = stopwords("english"), stem = TRUE)