parse_input does not detect invalid metadata value types #24

rrydbirk · 2020-05-28T08:37:40Z

I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the preprocessing step, the ANOVA gene filtering removes all my genes. Of course, as you describe in the workflow, I could just omit this step (ANOVA=False), however, it seems relevant to include.

What would you recommend I do? Would some initial gene filtering for low-expressed genes or similar help?

Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 0 min 56 sec
Preprocessing - normalizing data ... data normalized in 2 min 18 sec
Preprocessing - running ANOVA ... excluded 28621 / 28621 genes in 2 min 14 sec

tstannius · 2020-05-28T10:02:11Z

Hi Rasmus!

I will need a little more to go on. Would you kindly share the code (a minimal example) needed to reproduce this output?
Please also include the output of calling

data.head()
metadata.head()

And the versions of the modules you are using.

rrydbirk · 2020-05-28T10:26:07Z

Hi Tobias

Thanks for getting back to me.
This is my code. Please note, since I'm coming from R and my count matrix is too big to write to .csv, I used HDF5 instead. It's a bit dirty since I couldn't get it to transfer row and column names directly (and I'm not that familiar with Python in general):

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']))
data.columns = pd.read_csv("./cells.csv").values
data.index = pd.read_csv("./genes.csv").values
data = data.astype(dtype="int64")

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

The requested output. I omitted sensitive data:

data.head()
               (Cell1_AAACCCAAGATCCGAG-1,)  ...  (Celln_TTTGTTGTCTAACGGT-1,)
(AL627309.1,)                                  0  ...                                  0
(AC114498.1,)                                  0  ...                                  0
(AL669831.2,)                                  0  ...                                  0
(AL669831.5,)                                  1  ...                                  0
(FAM87B,)                                      0  ...                                  0

[5 rows x 118345 columns]

>>> metadata.head()
                               cell_type
cell_id
Cell1_ACTACGAGTATGCGGA-1    Type1
Cell2_AAACCCAAGTGCGCTC-1    Type1
Cell3_ACTGATGGTAACACGG-1    Type1
Cell4_CACTGAATCACAATGC-1    Type1
Cell5_CAGGGCTCACCAACAT-1    Type1

And the versions:
➜ Cellex python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux

➜ Cellex pip list

Click to show

Package Version ---------------------------------- --------- adjustText 0.7.3 alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.7 anaconda-project 0.8.3 aniso8601 8.0.0 annoy 1.16.3 asn1crypto 1.0.1 astroid 2.3.1 astropy 3.2.2 atomicwrites 1.3.0 attrs 19.2.0 Babel 2.7.0 backcall 0.1.0 backports.functools-lru-cache 1.5 backports.os 0.1.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 beautifulsoup4 4.8.0 bitarray 1.0.1 bkcharts 0.2 bleach 3.1.0 bokeh 1.3.4 boto 2.49.0 boto3 1.7.84 botocore 1.10.84 Bottleneck 1.2.1 cellex 1.1.1 CellPhoneDB 2.1.2 certifi 2019.9.11 cffi 1.12.3 chardet 3.0.4 click 6.7 cloudpickle 1.2.2 clyent 1.2.2 colorama 0.4.1 conda 4.8.2 conda-build 3.18.9 conda-package-handling 1.6.0 conda-verify 3.4.2 contextlib2 0.6.0 cryptography 2.7 cycler 0.10.0 Cython 0.29.13 cytoolz 0.10.0 dask 2.5.2 decorator 4.4.0 defusedxml 0.6.0 descartes 1.1.0 distributed 2.5.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 fbpca 1.0 filelock 3.0.12 Flask 1.0.4 Flask-RESTful 0.3.8 Flask-Testing 0.7.1 fsspec 0.5.2 future 0.17.1 geosketch 0.3 gevent 1.4.0 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.15 h5py 2.9.0 HeapDict 1.0.1 html5lib 1.0.1 idna 2.7 imageio 2.6.0 imagesize 1.1.0 importlib-metadata 0.23 ipykernel 5.1.2 ipython 7.8.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.15.1 jeepney 0.4.1 Jinja2 2.10.3 jmespath 0.9.5 joblib 0.13.2 json5 0.8.5 jsonschema 3.0.2 jupyter 1.0.0 jupyter-client 5.3.3 jupyter-console 6.0.0 jupyter-core 4.5.0 jupyterlab 1.1.4 jupyterlab-server 1.0.6 keyring 18.0.0 kiwisolver 1.1.0 lazy-object-proxy 1.4.2 libarchive-c 2.8 lief 0.9.0 llvmlite 0.29.0 locket 0.2.0 loompy 3.0.6 lxml 4.4.1 MarkupSafe 1.1.1 matplotlib 3.1.1 mccabe 0.6.1 mistune 0.8.4 mizani 0.6.0 mkl-fft 1.0.14 mkl-random 1.1.0 mkl-service 2.3.0 mock 3.0.5 more-itertools 7.2.0 mpmath 1.1.0 msgpack 0.6.1 multipledispatch 0.6.0 navigator-updater 0.2.1 nbconvert 5.6.0 nbformat 4.4.0 networkx 2.3 nltk 3.4.5 nose 1.3.7 notebook 6.0.1 numba 0.45.1 numexpr 2.7.0 numpy 1.17.2 numpy-groupies 0+unknown numpydoc 0.9.1 olefile 0.46 openpyxl 3.0.0 packaging 19.2 palettable 3.3.0 pandas 1.0.3 pandocfilters 1.4.2 parso 0.5.1 partd 1.0.0 path.py 12.0.1 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.7.0 pickleshare 0.7.5 pika 0.12.0 Pillow 6.2.0 pip 19.2.3 pkginfo 1.5.0.1 plotnine 0.6.0 pluggy 0.13.0 ply 3.11 prometheus-client 0.7.1 prompt-toolkit 2.0.10 psutil 5.6.3 ptyprocess 0.6.0 py 1.8.0 pyarrow 0.17.1 pycodestyle 2.5.0 pycosat 0.6.3 pycparser 2.19 pycrypto 2.6.1 pycurl 7.43.0.3 pyflakes 2.1.1 Pygments 2.4.2 pylint 2.4.2 pyodbc 4.0.27 pyOpenSSL 19.0.0 pyparsing 2.4.2 pyreadr 0.2.9 pyrsistent 0.15.4 pysam 0.15.4 PySocks 1.7.1 pytest 5.2.1 pytest-arraydiff 0.3 pytest-astropy 0.5.0 pytest-doctestplus 0.4.0 pytest-openfiles 0.4.0 pytest-remotedata 0.3.2 python-dateutil 2.8.0 pytz 2019.3 PyWavelets 1.0.3 PyYAML 5.1.2 pyzmq 18.1.0 QtAwesome 0.6.0 qtconsole 4.5.5 QtPy 1.9.0 requests 2.19.1 rope 0.14.0 rpy2 3.0.5 ruamel-yaml 0.15.46 s3transfer 0.1.13 scikit-image 0.15.0 scikit-learn 0.21.3 scipy 1.3.1 scrublet 0.2.1 seaborn 0.9.0 SecretStorage 3.1.1 Send2Trash 1.5.0 setuptools 41.4.0 setuptools-scm 4.1.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 six 1.12.0 snowballstemmer 2.0.0 sortedcollections 1.1.2 sortedcontainers 2.1.0 soupsieve 1.9.3 Sphinx 2.2.0 sphinxcontrib-applehelp 1.0.1 sphinxcontrib-devhelp 1.0.1 sphinxcontrib-htmlhelp 1.0.2 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.2 sphinxcontrib-serializinghtml 1.1.3 sphinxcontrib-websupport 1.1.2 spyder 3.3.6 spyder-kernels 0.5.2 SQLAlchemy 1.3.9 statsmodels 0.10.1 sympy 1.4 tables 3.5.2 tblib 1.4.0 terminado 0.8.2 testpath 0.4.2 toolz 0.10.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 tzlocal 2.1 umap-learn 0.3.10 unicodecsv 0.14.1 urllib3 1.23 velocyto 0.17.17 wcwidth 0.1.7 webencodings 0.5.1 Werkzeug 0.16.0 wheel 0.33.6 widgetsnbextension 3.5.1 wrapt 1.11.2 wurlitzer 1.0.3 xlrd 1.2.0 XlsxWriter 1.2.1 xlwt 1.3.0 zict 1.0.0 zipp 0.6.0

If you need the actual data, I can anonymize it and share it with you. Let me know if you need anything else.

tstannius · 2020-05-28T11:16:54Z

Thanks!

It makes good sense to store your data as hdf5 and we will also transition to this format in the next release :-)

It seems that the type (tuple) of the row and column names are causing the error. They should be strings instead.

Do try something along the lines of the following and let me know how it goes.

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
# I think float32 should be sufficient
cols = pd.read_csv("./cells.csv").values
cols = [c[0] for c in cols if type(c) is tuple]
# may give an error, if not all colnames were tuples and you will have to do some more work,
# but rather that than a silent error!
data.columns = cols


# same approach for idx
idx = pd.read_csv("./genes.csv").values
# do stuff ...
data.index = idx

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

rrydbirk · 2020-05-28T13:13:03Z

Unfortunately, your code erased everything in cols:

>>> cols = pd.read_csv("./cells.csv").values
>>> cols.size
118345
>>> cols = [c[0] for c in cols if type(c) is tuple]
>>> cols.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'size'
>>> cols
[]

Instead, this did the trick:

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
data.columns = pd.read_csv("./cells.csv").values
data.columns = [v[0] for v in data.columns]
data.index = pd.read_csv("./genes.csv").values
data.index = [v[0] for v in data.index]

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 1 min 8 sec
Preprocessing - normalizing data ... data normalized in 2 min 24 sec
Preprocessing - running ANOVA ... excluded 4172 / 28621 genes in 2 min 44 sec

Also, I now see that I'm running on an old version of HDF5Array (1.10.1) since I'm still on R 3.5.1. In newer versions (at least 1.16.0), writeHDF5Array includes the option "with.dimnames" which should circumvent my problems.
So, for future reference, going from R to Python with large sparsematrices, this should suffice:
writeHDF5Array(matrix, "./data.h5", "data", verbose = T, with.dimnames=T)

tstannius · 2020-05-28T14:32:49Z

Glad to hear that!

I will keep the issue open until the input check function has been updated to handle this kind of issue.

rrydbirk mentioned this issue May 28, 2020

IndexError: tuple index out of range #25

Closed

tstannius self-assigned this May 28, 2020

tstannius changed the title ~~Gene filtering removes all genes~~ parse_input does not detect invalid metadata value types May 28, 2020

tstannius added the check&catch error checks and catching of errors. If very serious issue, report as bug label May 28, 2020

rrydbirk closed this as completed May 28, 2020

tstannius reopened this May 28, 2020

tstannius mentioned this issue Jun 3, 2020

ValueError: Index gene invalid perslab/CELLECT#54

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parse_input does not detect invalid metadata value types #24

parse_input does not detect invalid metadata value types #24

rrydbirk commented May 28, 2020

tstannius commented May 28, 2020 •

edited

Loading

rrydbirk commented May 28, 2020 •

edited by tstannius

Loading

tstannius commented May 28, 2020 •

edited

Loading

rrydbirk commented May 28, 2020 •

edited by tstannius

Loading

tstannius commented May 28, 2020

parse_input does not detect invalid metadata value types #24

parse_input does not detect invalid metadata value types #24

Comments

rrydbirk commented May 28, 2020

tstannius commented May 28, 2020 • edited Loading

rrydbirk commented May 28, 2020 • edited by tstannius Loading

tstannius commented May 28, 2020 • edited Loading

rrydbirk commented May 28, 2020 • edited by tstannius Loading

tstannius commented May 28, 2020

tstannius commented May 28, 2020 •

edited

Loading

rrydbirk commented May 28, 2020 •

edited by tstannius

Loading

tstannius commented May 28, 2020 •

edited

Loading

rrydbirk commented May 28, 2020 •

edited by tstannius

Loading