Error in validObject(r) : invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 33411 #168

7804j · 2016-05-17T21:04:49Z

Hi again,

When I create a dfm from my emails corpus, I get the error message:

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 1,882 documents
... indexing features: 4,596 feature types
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 33411

Below is my code. The data comes from https://www.kaggle.com/kaggle/hillary-clinton-emails/ again.

Also, interestingly, when I write LIMIT 100 at the end of the SQL query, with exactly the same code for everything else, I don't have any problem. So it might be linked with the size of the corpus.

Connect to db

db <- dbConnect(dbDriver("SQLite"), "output/database.sqlite")

get all emails

emails <- dbGetQuery(db, "
SELECT ExtractedBodyText body,MetadataSubject subject, MetadataDateSent date
FROM Emails e
INNER JOIN Persons p
ON e.SenderPersonId=P.Id
WHERE p.Name='Hillary Clinton'
AND e.ExtractedBodyText != ''
ORDER BY RANDOM()")

Create new column with weekdays, and column weekend

emails = emails %>% separate(date,"date", sep = "T") %>% mutate(
weekday = weekdays(as.Date(date)),
weekend = ifelse(weekday %in% c('Saturday','Sunday'),1,0)
)

Clean some of the email bodies that still contain part of the header (manual inspection of emails)

emails = emails %>% mutate(body = sub("H <.Re:.\n", "", body)) %>%
mutate(body = sub("H <._Re:", "", body)) %>%
mutate(body = sub("H <._RELEASE IN.B6", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6\n", "", body)) %>%
mutate(body = sub("RELEASE\nIN PART B6", "", body)) %>%
mutate(body = sub("RELEASE IN PART\nB6", "", body)) %>%
mutate(body = sub("RELEASE IN PART.\B1", "", body)) %>%
mutate(body = sub("H <._Fw:", "", body)) %>%
mutate(body = sub('Declassify on: 04/23/2035', "", body)) %>%
mutate(body = sub("H <._B6\nB6\n", "", body)) %>%
mutate(body = sub("H <._PM\n", "", body)) %>%
mutate(body = sub("H <._AM\n", "", body))

Create corpus

email_corpus = corpus(emails$body)

Create dfm, after stemming and removing stopwords

email_dfm <- dfm(email_corpus,ignoredFeatures = stopwords("english"), stem = TRUE)

richard-ian-carpenter · 2016-05-18T01:50:34Z

I am getting a similar error message. I am using quanteda v0.9.6-1.

I am doing the capstone project for the Data Science certificate offered by Coursera. I have created to training datasets based on text files. Three weeks ago, everything was working fine. I started working on the final version of the project and I'm getting that error for two of three datasets.

I did upgrade to R 3.3.0 and had to reinstall several packages, including quanteda, as well as dependencies.

Is it possible that something "broke" with the latest R update?

Example follows:
blogDfm <- dfm(blogCorpus) #, ignoredFeatures = stopwords("english"), stem = FALSE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 134,818 documents
... indexing features: 130,551 feature types
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[2]) differs from Dim[2] which is 5526105
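
For readers unfamiliar with this message: it comes from the Matrix package's validity check, which requires the number of row or column names to match the matrix dimensions. The sketch below only illustrates that check on a tiny hand-built matrix; it is not quanteda's internal code, the name `only_one_name` is made up, and the exact wording of the message may differ between Matrix versions.

```r
library(Matrix)

# Build a valid 1 x 2 triplet-form sparse matrix with a single non-zero entry.
m <- new("dgTMatrix", i = 0L, j = 0L, x = 1, Dim = c(1L, 2L))

# Deliberately attach only one column name to a matrix with two columns.
# Slot assignment does not re-run the validity check, so this succeeds.
m@Dimnames <- list(NULL, "only_one_name")

# validObject() is where the mismatch is detected.
result <- tryCatch(validObject(m), error = function(e) e)

if (inherits(result, "error")) {
  message("Validation failed: ", conditionMessage(result))
}
```

In other words, the error indicates that the code building the matrix produced a different number of feature (or document) names than the dimensions it declared.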

kbenoit · 2016-05-18T06:40:46Z

The email dataset works fine for me; I just tested it. It could be that your version of the Matrix package needs updating.

> email_dfm <- dfm(email_corpus, ignoredFeatures = stopwords("english"), stem = TRUE)
Creating a dfm from a corpus ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 1,882 documents
   ... indexing features: 4,765 feature types
   ... removed 155 features, from 174 supplied (glob) feature types
   ... stemming features (English), trimmed 872 feature variants
   ... created a 1882 x 3739 sparse dfm
   ... complete. 
Elapsed time: 0.133 seconds.

Rerun the code after:

update.packages(ask = FALSE)

My session info:

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_0.9.6-5 tidyr_0.4.1      magrittr_1.5     dplyr_0.4.3      RSQLite_1.0.0    DBI_0.3.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-33  XML_3.98-1.4     assertthat_0.1   SnowballC_0.5.1  chron_2.3-47     grid_3.2.4       R6_2.1.2         stringi_1.0-1    lazyeval_0.1.10 
[11] data.table_1.9.6 ca_0.64          Matrix_1.2-4     tools_3.2.4      parallel_3.2.

7804j · 2016-05-18T09:53:30Z

This still doesn't work for me, even after updating, as long as I have this code before creating the dfm:

emails = emails %>% mutate(body = sub("H <.Re:.\n", "", body)) %>%
mutate(body = sub("H <.Re:", "", body))

So it looks like the sub() created the bug (maybe by creating some empty documents, as you mentioned above).
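
A quick way to test the empty-document hypothesis is to count the bodies that are blank after the sub() calls. This is a minimal base-R sketch; the `body` vector and the fixed pattern are hypothetical stand-ins for `emails$body` and the patterns above, and the thread does not confirm that empty documents are actually the cause.

```r
# Hypothetical stand-in for emails$body; the first entry is nothing but header text.
body <- c("RELEASE IN PART B6",
          "Please call me tomorrow.",
          "Thanks, will do.")

# Same kind of clean-up as above, using a made-up fixed pattern.
cleaned <- sub("RELEASE IN PART B6", "", body, fixed = TRUE)

# Flag documents that are empty (or whitespace only) after cleaning.
is_empty <- !nzchar(trimws(cleaned))
which(is_empty)

# Keep only the non-empty documents before building the corpus.
non_empty <- cleaned[!is_empty]
length(non_empty)
```

If `which(is_empty)` returns any indices on the real data, building the corpus from `non_empty` instead would show whether the error depends on those documents.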

Here is my sessionInfo():

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gmodels_2.16.2   quanteda_0.9.6-1 tidyr_0.4.1      ggplot2_2.1.0    dplyr_0.4.3      RSQLite_1.0.0    DBI_0.4-1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5      lattice_0.20-33  gtools_3.5.0     assertthat_0.1   MASS_7.3-45      chron_2.3-47     grid_3.3.0       R6_2.1.2        
 [9] plyr_1.8.3       gtable_0.2.0     magrittr_1.5     scales_0.4.0     stringi_1.0-1    lazyeval_0.1.10  gdata_2.17.0     data.table_1.9.6
[17] ca_0.64          Matrix_1.2-6     tools_3.3.0      munsell_0.4.3    parallel_3.3.0   colorspace_1.2-6

7804j · 2016-05-18T09:58:21Z

Should I wait for the next stable release or download the developer version perhaps?

kbenoit · 2016-05-18T10:15:52Z

Please reinstall quanteda from GitHub; you need 0.9.6-5.
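
To check whether an installed version meets that minimum, base R's version comparison can be used. This is a small sketch; `is_at_least` is a hypothetical helper, and the version strings are the ones quoted in the sessionInfo() outputs in this thread.

```r
# Compare a version string against a required minimum, component by component.
is_at_least <- function(installed, required) {
  package_version(installed) >= package_version(required)
}

is_at_least("0.9.6-1", "0.9.6-5")  # the version reporting the error
is_at_least("0.9.6-7", "0.9.6-5")  # the version reported as working

# For the live session (requires quanteda to be installed):
# packageVersion("quanteda") >= "0.9.6-5"
```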

richard-ian-carpenter · 2016-05-19T14:59:13Z

Updated quanteda from GitHub using devtools. Everything is working now. Thank you!

sessionInfo():

R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] quanteda_0.9.6-7 Matrix_1.2-6 data.table_1.9.6 stringi_1.0-1 lattice_0.20-33 igraph_1.0.1 ggplot2_2.1.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 digest_0.6.9 withr_1.0.1 R6_2.1.2 chron_2.3-47 grid_3.3.0 plyr_1.8.3 gtable_0.2.0 git2r_0.15.0
[10] magrittr_1.5 scales_0.4.0 httr_1.1.0 curl_0.9.7 ca_0.64 devtools_1.11.1 tools_3.3.0 munsell_0.4.3 parallel_3.3.0
[19] colorspace_1.2-6 memoise_1.0.0 knitr_1.13

kbenoit · 2016-05-19T16:12:32Z

Great.

hugokoopmans · 2017-01-27T17:27:50Z

This is still an issue for me...
R 3.3.2
quanteda 0.9.9-3

> myDfm <- dfm(myCorpus, removeNumbers = TRUE)
Error in validObject(r) : invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 2027

With 500,000 documents it fails.
With 100,000, the same.
With 10k, everything works fine.

kbenoit · 2017-01-27T17:29:23Z

Yes this is a bug in v0.9.9-3. I just today submitted a fixed version to CRAN, which can be installed from the GitHub repository now if you don't wish to wait.

hugokoopmans · 2017-01-27T17:56:10Z

yep done and solved thx

cesarmolea · 2020-10-05T20:05:48Z

Hi,

I am having the same issue as the title on this post. When I try to create a sparse matrix it gives me the following error:

Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 66

I tried installing quanteda with devtools, but it didn't help. Any suggestions?

kbenoit · 2020-10-06T07:53:27Z

@cesarmolea Without any information on your versions or what you have tried that produces this error, it's impossible to help. Please output sessionInfo() as well as the command(s) that we can follow to try to reproduce the error.

I would note that this error was resolved three years ago so I suspect you are using a very old version. Try updating everything first.

cesarmolea · 2020-10-06T14:18:30Z

@kbenoit Thank you. I updated and it still doesn't work. Here is the command:

review_matrix <- new_data %>%
filter(!word %in% c("sistemas", "dimensión", "recomendación",
"recomienda", "ejecución", "gestión", "resultados",
"indicador", "indicadores", "objetivo", "componente",
"país", "línea", "impacto", "sector", "nivel", "operación",
"objetivo", "objetivos", "diseño", "evaluación",
"apoyo", "productos", "calidad", "medida", "estrategia",
"nacional", "institucional", "matriz", "aprovación",
"número", "sistema", "desarrollo", "valor", "final",
"plan", "riesgo", "áreas", "informe", "aprobación",
"financiera", "crédito", "mejora", "mayor", "millones",
"aprobación", "sostenibilidad", "pública", "marco",
"acceso", "servicios", "costos", "préstamo", "actividades",
"medio", "saneamiento", "financiamiento", "general",
"respecto", "ministerio", "servicio", "agua", "ley",
"eficiencia", "plazo", "ex", "elegibilidad", "gobierno",
"vial", "tasa", "política", "energía", "salud", "público",
"relación", "incremento", "mercado", "aumento", "social",
"aumento", "inversión", "promedio", "reducción", "inversiones",
"potable", "infraestructura", "días", "además", "cambio",
"menores", "específico", "embargo", "seguridad", "condiciones",
"tiempo", "acuerdo", "lógica", "momento", "muestra", "control",
"mismo", "impactos", "técnica", "empresas", "población",
"efectividad", "alcantarillado", "financieros", "cuenta",
"alcance", "cumplimiento", "fortalecer", "cambios",
"mejoras", "desastres", "intervención", "cartera",
"redes", "índice", "modelo", "proceso", "políticas",
"eop", "vertical", "eléctrico", "rurales", "desempeño",
"fecha", "preinversión", "monitoreo", "post", "económica",
"públicas", "decreto", "capacidad", "beneficios",
"cuales", "económico", "fuente", "grado", "accidentes",
"estimación", "pdfor", "zona", "estimación", "gasto", "ep",
"tramo", "consecuencia", "díasaño", "tramo", "etc",
"d", "correctivas", "pesos", "ambiental", "vías", "pod",
"mismos", "planes", "coparticipación", "ro", "'ultimos",
"similares", "presupuesto", "anterior", "bañado",
"anteriormente", "argentino", "realizadas", "n", "implicó",
"ensanche", "rutas", "concluir", "calzada", "caf", "camión",
"alcantarillas", "anual", "ahorros")) %>%
filter(dummy_exitoso_OVE == "Not successful") %>%
group_by(word) %>%
filter(n >= 2) %>%
cast_sparse(name, word, n)

Basically, I am filtering by the type of document. If I eliminate the name of document ("name") it works, but I need that.

And this is the sessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_2.3.2 usethis_1.6.3 stm_1.3.6 drlib_0.1.1
[5] ggraph_2.0.3 tidylo_0.1.0 tokenizers_0.2.1 igraph_1.2.5
[9] janeaustenr_0.1.5 wordcloud_2.6 RColorBrewer_1.1-2 quanteda_2.1.2
[13] stringi_1.5.3 tabulizer_0.2.2 tm_0.7-7 NLP_0.2-0
[17] doParallel_1.0.15 iterators_1.0.12 foreach_1.5.0 readxl_1.3.1
[21] cleanNLP_3.0.2 openNLP_0.2-7 pdftools_2.3.1 lubridate_1.7.9
[25] tidygraph_1.2.0 widyr_0.1.3 stopwords_2.0 magrittr_1.5
[29] SnowballC_0.7.0 topicmodels_0.2-11 ggrepel_0.8.2 reshape2_1.4.4
[33] forcats_0.5.0 stringr_1.4.0 readr_1.3.1 tidyverse_1.3.0
[37] yardstick_0.0.7 workflows_0.2.0 tune_0.1.1 tidyr_1.1.2
[41] tibble_3.0.3 rsample_0.0.8 recipes_0.1.13 purrr_0.3.4
[45] parsnip_0.1.3 modeldata_0.0.2 infer_0.5.3 ggplot2_3.3.2
[49] dplyr_1.0.2 dials_0.0.9 scales_1.1.1 broom_0.7.1
[53] tidymodels_0.1.1 tidytext_0.2.6

loaded via a namespace (and not attached):
[1] backports_1.1.10 fastmatch_1.1-0 plyr_1.8.6 splines_4.0.2
[5] listenv_0.8.0 digest_0.6.25 viridis_0.5.1 fansi_0.4.1
[9] memoise_1.1.0 lda_1.4.2 remotes_2.2.0 graphlayouts_0.7.0
[13] globals_0.13.0 modelr_0.1.8 gower_0.2.2 matrixStats_0.57.0
[17] RcppParallel_5.0.2 askpass_1.1 prettyunits_1.1.1 colorspace_1.4-1
[21] blob_1.2.1 rvest_0.3.6 haven_2.3.1 xfun_0.17
[25] callr_3.4.4 crayon_1.3.4 jsonlite_1.7.1 survival_3.1-12
[29] glue_1.4.2 polyclip_1.10-0 gtable_0.3.0 ipred_0.9-9
[33] pkgbuild_1.1.0 qpdf_1.1 DBI_1.1.0 Rcpp_1.0.5
[37] viridisLite_0.3.0 GPfit_1.0-8 stats4_4.0.2 lava_1.6.8
[41] prodlim_2019.11.13 httr_1.4.2 modeltools_0.2-23 ellipsis_0.3.1
[45] farver_2.0.3 pkgconfig_2.0.3 rJava_0.9-13 openNLPdata_1.5.3-4
[49] nnet_7.3-14 dbplyr_1.4.4 labeling_0.3 tidyselect_1.1.0
[53] rlang_0.4.7 DiceDesign_1.8-1 munsell_0.5.0 cellranger_1.1.0
[57] tools_4.0.2 cli_2.0.2 generics_0.0.2 yaml_2.2.1
[61] processx_3.4.4 knitr_1.30 fs_1.5.0 future_1.19.1
[65] slam_0.1-47 xml2_1.3.2 compiler_4.0.2 rstudioapi_0.11
[69] curl_4.3 png_0.1-7 testthat_2.3.2 reprex_0.3.0
[73] tweenr_1.0.1 lhs_1.0.2 ps_1.3.4 desc_1.2.0
[77] lattice_0.20-41 Matrix_1.2-18 tabulizerjars_1.0.1 vctrs_0.3.4
[81] pillar_1.4.6 lifecycle_0.2.0 furrr_0.1.0 data.table_1.13.0
[85] R6_2.4.1 gridExtra_2.3 sessioninfo_1.1.1 codetools_0.2-16
[89] pkgload_1.1.0 MASS_7.3-51.6 assertthat_0.2.1 rprojroot_1.3-2
[93] withr_2.3.0 hms_0.5.3 ISOcodes_2020.03.16 grid_4.0.2
[97] rpart_4.1-15 timeDate_3043.102 class_7.3-17 ggforce_0.3.2
[101] pROC_1.16.2

Thanks in advance for your help.

kbenoit · 2020-10-06T15:04:06Z

Looks like a tidytext problem to me, not a quanteda issue.
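
The thread does not establish what goes wrong inside cast_sparse() here. One way to sidestep it while investigating is to build the matrix directly with Matrix::sparseMatrix(), taking row and column names from the values actually present in the filtered data. The sketch below assumes long-format counts with columns `name`, `word`, and `n` as in the command above; the `counts` data frame, including its unused factor level, is made up for illustration.

```r
library(Matrix)

# Hypothetical long-format counts standing in for the filtered new_data.
# "doc_c" is a factor level with no remaining rows, as can happen after filter().
counts <- data.frame(
  name = factor(c("doc_a", "doc_a", "doc_b"),
                levels = c("doc_a", "doc_b", "doc_c")),
  word = c("agua", "salud", "agua"),
  n = c(3, 2, 5),
  stringsAsFactors = FALSE
)

# Work from the values actually present, not from factor levels.
row_ids <- as.character(counts$name)
col_ids <- as.character(counts$word)
row_names <- unique(row_ids)
col_names <- unique(col_ids)

# Dimensions and dimnames are derived from the same vectors, so they agree.
m <- sparseMatrix(
  i = match(row_ids, row_names),
  j = match(col_ids, col_names),
  x = counts$n,
  dims = c(length(row_names), length(col_names)),
  dimnames = list(row_names, col_names)
)

dim(m)
```

If this works on the real data while cast_sparse() fails, comparing `length(unique(new_data$name))` with `nlevels(new_data$name)` (when `name` is a factor) may help narrow down a report to the tidytext maintainers.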

kbenoit closed this as completed in 2add29e May 18, 2016

kbenoit reopened this May 19, 2016

kbenoit closed this as completed May 19, 2016

jofrerocabert mentioned this issue Jan 26, 2017

invalid class “dgTMatrix” object: length(Dimnames[1]) differs from Dim[1] #520

Closed
