Error in validObject(r) : invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 33411 #168

Closed
7804j opened this issue May 17, 2016 · 14 comments


@7804j

7804j commented May 17, 2016

Hi again,

When I create a dfm from my emails corpus, I get the error message:

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 1,882 documents
... indexing features: 4,596 feature types
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 33411

Below is my code. The data comes from https://www.kaggle.com/kaggle/hillary-clinton-emails/ again.

Also, interestingly, when I write LIMIT 100 at the end of the SQL query, with exactly the same code for everything else, I don't have any problem. So it might be linked with the size of the corpus.
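For context, the message comes from the validity check in the Matrix package: a sparse matrix must carry exactly one column name per column and one row name per row. A minimal sketch of that invariant follows. The toy vectors are made up for illustration and are not the email corpus, and this is not how quanteda builds its dfm internally.

```r
library(Matrix)

# Toy long-format counts (hypothetical data, not the email corpus)
docs  <- factor(c("d1", "d1", "d2", "d3"))
feats <- factor(c("hello", "world", "hello", "again"))
n     <- c(2, 1, 1, 3)

# Dims and dimnames are both derived from the same factor levels,
# so the two cannot disagree.
m <- sparseMatrix(
  i = as.integer(docs),
  j = as.integer(feats),
  x = n,
  dims = c(nlevels(docs), nlevels(feats)),
  dimnames = list(levels(docs), levels(feats))
)

# The invariant the error message refers to
stopifnot(length(rownames(m)) == nrow(m),
          length(colnames(m)) == ncol(m))
```

When the number of names and the declared dimension get out of step, as in the 4,596 feature types versus Dim[2] of 33411 above, the object fails validation.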


library(DBI)
library(RSQLite)
library(dplyr)
library(tidyr)
library(quanteda)

# Connect to db

db <- dbConnect(dbDriver("SQLite"), "output/database.sqlite")

# Get all emails

emails <- dbGetQuery(db, "
SELECT ExtractedBodyText body, MetadataSubject subject, MetadataDateSent date
FROM Emails e
INNER JOIN Persons p
ON e.SenderPersonId = p.Id
WHERE p.Name = 'Hillary Clinton'
AND e.ExtractedBodyText != ''
ORDER BY RANDOM()")

# Create new column with weekdays, and column weekend
# (weekday is computed from the separated date column, not from emails$date)

emails = emails %>% separate(date, "date", sep = "T") %>% mutate(
weekday = weekdays(as.Date(date)),
weekend = ifelse(weekday %in% c('Saturday','Sunday'), 1, 0)
)

# Clean some of the email bodies that still contain part of the header (manual inspection of emails)

emails = emails %>% mutate(body = sub("H <.*Re:.*\n", "", body)) %>%
mutate(body = sub("H <.*Re:", "", body)) %>%
mutate(body = sub("H <.*RELEASE IN.*B6", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6\n", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6", "", body)) %>%
mutate(body = sub("RELEASE IN PART\nB6", "", body)) %>%
mutate(body = sub("RELEASE IN PART.\nB1", "", body)) %>%
mutate(body = sub("H <.*Fw:", "", body)) %>%
mutate(body = sub('Declassify on: 04/23/2035', "", body)) %>%
mutate(body = sub("H <.*B6\nB6\n", "", body)) %>%
mutate(body = sub("H <.*PM\n", "", body)) %>%
mutate(body = sub("H <.*AM\n", "", body))

# Create corpus

email_corpus = corpus(emails$body)

# Create dfm, after stemming and removing stopwords

email_dfm <- dfm(email_corpus, ignoredFeatures = stopwords("english"), stem = TRUE)

@richard-ian-carpenter

richard-ian-carpenter commented May 18, 2016

I am getting a similar error message. I am using quanteda v0.9.6-1.

I am doing the capstone project for the Data Science certificate offered by Coursera. I have created training datasets from text files. Three weeks ago, everything was working fine. I started working on the final version of the project and I'm now getting that error for two of the three datasets.

I did upgrade to R 3.3.0 and had to reinstall several packages, including quanteda, as well as dependencies.

Is it possible that something "broke" with the latest R update?

Example follows:
blogDfm <- dfm(blogCorpus) #, ignoredFeatures = stopwords("english"), stem = FALSE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 134,818 documents
... indexing features: 130,551 feature types
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 5526105

@kbenoit
Collaborator

kbenoit commented May 18, 2016

The email dataset works fine for me, I just tested it. Could be that your version of the Matrix package needs updating.

> email_dfm <- dfm(email_corpus, ignoredFeatures = stopwords("english"), stem = TRUE)
Creating a dfm from a corpus ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 1,882 documents
   ... indexing features: 4,765 feature types
   ... removed 155 features, from 174 supplied (glob) feature types
   ... stemming features (English), trimmed 872 feature variants
   ... created a 1882 x 3739 sparse dfm
   ... complete. 
Elapsed time: 0.133 seconds.

Rerun the code after:

update.packages(ask = FALSE)
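To confirm which versions are actually in use after updating, something like the following may help. It is a sketch using only base R; the package names are simply the two mentioned in this thread.

```r
# Report the installed version of each package discussed in this thread.
# system.file() returns "" when a package is not installed.
for (pkg in c("quanteda", "Matrix")) {
  if (nzchar(system.file(package = pkg))) {
    message(pkg, " ", as.character(packageVersion(pkg)))
  } else {
    message(pkg, " is not installed")
  }
}

# Version strings compare component by component,
# so 0.9.6-5 counts as newer than 0.9.6-1.
is_newer <- package_version("0.9.6-5") > package_version("0.9.6-1")
is_newer
```

Comparing the printed versions against a working sessionInfo() is usually quicker than reading the full output by eye.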

My session info:

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_0.9.6-5 tidyr_0.4.1      magrittr_1.5     dplyr_0.4.3      RSQLite_1.0.0    DBI_0.3.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-33  XML_3.98-1.4     assertthat_0.1   SnowballC_0.5.1  chron_2.3-47     grid_3.2.4       R6_2.1.2         stringi_1.0-1    lazyeval_0.1.10 
[11] data.table_1.9.6 ca_0.64          Matrix_1.2-4     tools_3.2.4      parallel_3.2.

@7804j
Author

7804j commented May 18, 2016

This still doesn't work for me, even after updating, as long as I have this code before creating the dfm:

emails = emails %>% mutate(body = sub("H <.*Re:.*\n", "", body)) %>%
mutate(body = sub("H <.*Re:", "", body))

So it looks like the sub() created the bug (maybe by creating some empty documents, as you mentioned above).
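One way to test the empty-document hypothesis is to count the bodies that end up with no usable text after the sub() calls and drop them before building the corpus. This is a minimal base R sketch; the `body` vector below is a hypothetical stand-in for `emails$body`.

```r
# Hypothetical stand-in for emails$body after the sub() calls
body <- c("Call me at 5", "", "   \n", NA, "Thanks")

# TRUE where a document has no usable text (NA, empty, or whitespace only)
is_empty <- is.na(body) | !nzchar(trimws(body))

# How many documents would be empty
sum(is_empty)

# Keep only documents that still contain text
body_clean <- body[!is_empty]
body_clean
```

If the count is greater than zero on the real data, building the corpus from the filtered vector would show whether empty documents are what triggers the error.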

Here is my sessionInfo():

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gmodels_2.16.2   quanteda_0.9.6-1 tidyr_0.4.1      ggplot2_2.1.0    dplyr_0.4.3      RSQLite_1.0.0    DBI_0.4-1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5      lattice_0.20-33  gtools_3.5.0     assertthat_0.1   MASS_7.3-45      chron_2.3-47     grid_3.3.0       R6_2.1.2        
 [9] plyr_1.8.3       gtable_0.2.0     magrittr_1.5     scales_0.4.0     stringi_1.0-1    lazyeval_0.1.10  gdata_2.17.0     data.table_1.9.6
[17] ca_0.64          Matrix_1.2-6     tools_3.3.0      munsell_0.4.3    parallel_3.3.0   colorspace_1.2-6

@7804j
Author

7804j commented May 18, 2016

Should I wait for the next stable release or download the developer version perhaps?

@kbenoit
Collaborator

kbenoit commented May 18, 2016

Please reinstall quanteda from GitHub; you need 0.9.6-5.

@richard-ian-carpenter

richard-ian-carpenter commented May 19, 2016

Updated quanteda from GitHub using devtools. Everything is working now. Thank you!

sessionInfo():

R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] quanteda_0.9.6-7 Matrix_1.2-6 data.table_1.9.6 stringi_1.0-1 lattice_0.20-33 igraph_1.0.1 ggplot2_2.1.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 digest_0.6.9 withr_1.0.1 R6_2.1.2 chron_2.3-47 grid_3.3.0 plyr_1.8.3 gtable_0.2.0 git2r_0.15.0
[10] magrittr_1.5 scales_0.4.0 httr_1.1.0 curl_0.9.7 ca_0.64 devtools_1.11.1 tools_3.3.0 munsell_0.4.3 parallel_3.3.0
[19] colorspace_1.2-6 memoise_1.0.0 knitr_1.13

@kbenoit
Collaborator

kbenoit commented May 19, 2016

Great.

@hugokoopmans

This is still an issue for me...
R 3.3.2
quanteda 0.9.9-3

> myDfm <- dfm(myCorpus, removeNumbers = TRUE)
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 2027

With 500,000 documents it fails.
With 100,000 documents it also fails.
With 10,000 documents everything works fine.

@kbenoit
Collaborator

kbenoit commented Jan 27, 2017

Yes, this is a bug in v0.9.9-3. I submitted a fixed version to CRAN today; it can be installed from the GitHub repository now if you don't wish to wait.

@hugokoopmans

yep done and solved thx

@cesarmolea

Hi,

I am having the same issue as the title on this post. When I try to create a sparse matrix it gives me the following error:

Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 66

I tried to install quanteda from devtools but it's not helping. Any suggestions?

@kbenoit
Collaborator

kbenoit commented Oct 6, 2020

@cesarmolea Without any information on your versions or what you have tried that produces this error, it's impossible to help. Please output sessionInfo() as well as the command(s) that we can follow to try to reproduce the error.

I would note that this error was resolved three years ago so I suspect you are using a very old version. Try updating everything first.

@cesarmolea

@kbenoit Thank you. I updated and it still doesn't work. Here is the command:

review_matrix <- new_data %>%
filter(!word %in% c("sistemas", "dimensión", "recomendación",
"recomienda", "ejecución", "gestión", "resultados",
"indicador", "indicadores", "objetivo", "componente",
"país", "línea", "impacto", "sector", "nivel", "operación",
"objetivo", "objetivos", "diseño", "evaluación",
"apoyo", "productos", "calidad", "medida", "estrategia",
"nacional", "institucional", "matriz", "aprovación",
"número", "sistema", "desarrollo", "valor", "final",
"plan", "riesgo", "áreas", "informe", "aprobación",
"financiera", "crédito", "mejora", "mayor", "millones",
"aprobación", "sostenibilidad", "pública", "marco",
"acceso", "servicios", "costos", "préstamo", "actividades",
"medio", "saneamiento", "financiamiento", "general",
"respecto", "ministerio", "servicio", "agua", "ley",
"eficiencia", "plazo", "ex", "elegibilidad", "gobierno",
"vial", "tasa", "política", "energía", "salud", "público",
"relación", "incremento", "mercado", "aumento", "social",
"aumento", "inversión", "promedio", "reducción", "inversiones",
"potable", "infraestructura", "días", "además", "cambio",
"menores", "específico", "embargo", "seguridad", "condiciones",
"tiempo", "acuerdo", "lógica", "momento", "muestra", "control",
"mismo", "impactos", "técnica", "empresas", "población",
"efectividad", "alcantarillado", "financieros", "cuenta",
"alcance", "cumplimiento", "fortalecer", "cambios",
"mejoras", "desastres", "intervención", "cartera",
"redes", "índice", "modelo", "proceso", "políticas",
"eop", "vertical", "eléctrico", "rurales", "desempeño",
"fecha", "preinversión", "monitoreo", "post", "económica",
"públicas", "decreto", "capacidad", "beneficios",
"cuales", "económico", "fuente", "grado", "accidentes",
"estimación", "pdfor", "zona", "estimación", "gasto", "ep",
"tramo", "consecuencia", "díasaño", "tramo", "etc",
"d", "correctivas", "pesos", "ambiental", "vías", "pod",
"mismos", "planes", "coparticipación", "ro", "'ultimos",
"similares", "presupuesto", "anterior", "bañado",
"anteriormente", "argentino", "realizadas", "n", "implicó",
"ensanche", "rutas", "concluir", "calzada", "caf", "camión",
"alcantarillas", "anual", "ahorros")) %>%
filter(dummy_exitoso_OVE == "Not successful") %>%
group_by(word) %>%
filter(n >= 2) %>%
cast_sparse(name, word, n)

Basically, I am filtering by the type of document. If I eliminate the name of document ("name") it works, but I need that.
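Since the failure depends on the `name` column, a few checks on that column before casting might narrow things down. The sketch below is only a guess at where to look, not a diagnosis; the thread does not establish the cause. The small data frame is a hypothetical stand-in for the filtered data, using the same column names as the command above.

```r
# Hypothetical stand-in for the filtered data passed to cast_sparse()
new_data <- data.frame(
  name = c("doc_a", "doc_a", NA, "doc_b"),
  word = c("agua", "salud", "agua", "agua"),
  n    = c(3, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Missing document names cannot become row names
n_missing <- sum(is.na(new_data$name))

# Number of rows the resulting matrix would be expected to have
n_docs <- length(unique(new_data$name[!is.na(new_data$name)]))

# Whether any (name, word) pair occurs more than once
has_dupes <- any(duplicated(new_data[c("name", "word")]))

c(missing = n_missing, documents = n_docs, duplicates = has_dupes)
```

If `n_docs` on the real data differs from the 66 reported in the error, the row names and the row count are being derived from different sets of values.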

And this is the sessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_2.3.2 usethis_1.6.3 stm_1.3.6 drlib_0.1.1
[5] ggraph_2.0.3 tidylo_0.1.0 tokenizers_0.2.1 igraph_1.2.5
[9] janeaustenr_0.1.5 wordcloud_2.6 RColorBrewer_1.1-2 quanteda_2.1.2
[13] stringi_1.5.3 tabulizer_0.2.2 tm_0.7-7 NLP_0.2-0
[17] doParallel_1.0.15 iterators_1.0.12 foreach_1.5.0 readxl_1.3.1
[21] cleanNLP_3.0.2 openNLP_0.2-7 pdftools_2.3.1 lubridate_1.7.9
[25] tidygraph_1.2.0 widyr_0.1.3 stopwords_2.0 magrittr_1.5
[29] SnowballC_0.7.0 topicmodels_0.2-11 ggrepel_0.8.2 reshape2_1.4.4
[33] forcats_0.5.0 stringr_1.4.0 readr_1.3.1 tidyverse_1.3.0
[37] yardstick_0.0.7 workflows_0.2.0 tune_0.1.1 tidyr_1.1.2
[41] tibble_3.0.3 rsample_0.0.8 recipes_0.1.13 purrr_0.3.4
[45] parsnip_0.1.3 modeldata_0.0.2 infer_0.5.3 ggplot2_3.3.2
[49] dplyr_1.0.2 dials_0.0.9 scales_1.1.1 broom_0.7.1
[53] tidymodels_0.1.1 tidytext_0.2.6

loaded via a namespace (and not attached):
[1] backports_1.1.10 fastmatch_1.1-0 plyr_1.8.6 splines_4.0.2
[5] listenv_0.8.0 digest_0.6.25 viridis_0.5.1 fansi_0.4.1
[9] memoise_1.1.0 lda_1.4.2 remotes_2.2.0 graphlayouts_0.7.0
[13] globals_0.13.0 modelr_0.1.8 gower_0.2.2 matrixStats_0.57.0
[17] RcppParallel_5.0.2 askpass_1.1 prettyunits_1.1.1 colorspace_1.4-1
[21] blob_1.2.1 rvest_0.3.6 haven_2.3.1 xfun_0.17
[25] callr_3.4.4 crayon_1.3.4 jsonlite_1.7.1 survival_3.1-12
[29] glue_1.4.2 polyclip_1.10-0 gtable_0.3.0 ipred_0.9-9
[33] pkgbuild_1.1.0 qpdf_1.1 DBI_1.1.0 Rcpp_1.0.5
[37] viridisLite_0.3.0 GPfit_1.0-8 stats4_4.0.2 lava_1.6.8
[41] prodlim_2019.11.13 httr_1.4.2 modeltools_0.2-23 ellipsis_0.3.1
[45] farver_2.0.3 pkgconfig_2.0.3 rJava_0.9-13 openNLPdata_1.5.3-4
[49] nnet_7.3-14 dbplyr_1.4.4 labeling_0.3 tidyselect_1.1.0
[53] rlang_0.4.7 DiceDesign_1.8-1 munsell_0.5.0 cellranger_1.1.0
[57] tools_4.0.2 cli_2.0.2 generics_0.0.2 yaml_2.2.1
[61] processx_3.4.4 knitr_1.30 fs_1.5.0 future_1.19.1
[65] slam_0.1-47 xml2_1.3.2 compiler_4.0.2 rstudioapi_0.11
[69] curl_4.3 png_0.1-7 testthat_2.3.2 reprex_0.3.0
[73] tweenr_1.0.1 lhs_1.0.2 ps_1.3.4 desc_1.2.0
[77] lattice_0.20-41 Matrix_1.2-18 tabulizerjars_1.0.1 vctrs_0.3.4
[81] pillar_1.4.6 lifecycle_0.2.0 furrr_0.1.0 data.table_1.13.0
[85] R6_2.4.1 gridExtra_2.3 sessioninfo_1.1.1 codetools_0.2-16
[89] pkgload_1.1.0 MASS_7.3-51.6 assertthat_0.2.1 rprojroot_1.3-2
[93] withr_2.3.0 hms_0.5.3 ISOcodes_2020.03.16 grid_4.0.2
[97] rpart_4.1-15 timeDate_3043.102 class_7.3-17 ggforce_0.3.2
[101] pROC_1.16.2

Thanks in advance for your help.

@kbenoit
Collaborator

kbenoit commented Oct 6, 2020

Looks like a tidytext problem to me, not a quanteda issue.
