Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse_input does not detect invalid metadata value types #24

Open
rrydbirk opened this issue May 28, 2020 · 5 comments
Open

parse_input does not detect invalid metadata value types #24

rrydbirk opened this issue May 28, 2020 · 5 comments
Assignees
Labels
check&catch error checks and catching of errors. If very serious issue, report as bug

Comments

@rrydbirk
Copy link

I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the preprocessing step, the ANOVA gene filtering removes all my genes. Of course, as you describe in the workflow, I could just omit this step (ANOVA=False), however, it seems relevant to include.

What would you recommend I do? Would some initial gene filtering for low-expressed genes or similar help?

Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 0 min 56 sec
Preprocessing - normalizing data ... data normalized in 2 min 18 sec
Preprocessing - running ANOVA ... excluded 28621 / 28621 genes in 2 min 14 sec

@tstannius
Copy link
Member

tstannius commented May 28, 2020

Hi Rasmus!

I will need a little more to go on. Would you kindly share the code (a minimal example) needed to reproduce this output?
Please also include the output of calling

data.head()
metadata.head()

And the versions of the modules you are using.

@rrydbirk
Copy link
Author

rrydbirk commented May 28, 2020

Hi Tobias

Thanks for getting back to me.
This is my code. Please note, since I'm coming from R and my count matrix is too big to write to .csv, I used HDF5 instead. It's a bit dirty since I couldn't get it to transfer row and column names directly (and I'm not that familiar with Python in general):

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']))
data.columns = pd.read_csv("./cells.csv").values
data.index = pd.read_csv("./genes.csv").values
data = data.astype(dtype="int64")

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

The requested output. I omitted sensitive data:

data.head()
               (Cell1_AAACCCAAGATCCGAG-1,)  ...  (Celln_TTTGTTGTCTAACGGT-1,)
(AL627309.1,)                                  0  ...                                  0
(AC114498.1,)                                  0  ...                                  0
(AL669831.2,)                                  0  ...                                  0
(AL669831.5,)                                  1  ...                                  0
(FAM87B,)                                      0  ...                                  0

[5 rows x 118345 columns]
>>> metadata.head()
                               cell_type
cell_id
Cell1_ACTACGAGTATGCGGA-1    Type1
Cell2_AAACCCAAGTGCGCTC-1    Type1
Cell3_ACTGATGGTAACACGG-1    Type1
Cell4_CACTGAATCACAATGC-1    Type1
Cell5_CAGGGCTCACCAACAT-1    Type1

And the versions:
➜ Cellex python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux

➜ Cellex pip list

Click to show

Package Version ---------------------------------- --------- adjustText 0.7.3 alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.7 anaconda-project 0.8.3 aniso8601 8.0.0 annoy 1.16.3 asn1crypto 1.0.1 astroid 2.3.1 astropy 3.2.2 atomicwrites 1.3.0 attrs 19.2.0 Babel 2.7.0 backcall 0.1.0 backports.functools-lru-cache 1.5 backports.os 0.1.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 beautifulsoup4 4.8.0 bitarray 1.0.1 bkcharts 0.2 bleach 3.1.0 bokeh 1.3.4 boto 2.49.0 boto3 1.7.84 botocore 1.10.84 Bottleneck 1.2.1 cellex 1.1.1 CellPhoneDB 2.1.2 certifi 2019.9.11 cffi 1.12.3 chardet 3.0.4 click 6.7 cloudpickle 1.2.2 clyent 1.2.2 colorama 0.4.1 conda 4.8.2 conda-build 3.18.9 conda-package-handling 1.6.0 conda-verify 3.4.2 contextlib2 0.6.0 cryptography 2.7 cycler 0.10.0 Cython 0.29.13 cytoolz 0.10.0 dask 2.5.2 decorator 4.4.0 defusedxml 0.6.0 descartes 1.1.0 distributed 2.5.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 fbpca 1.0 filelock 3.0.12 Flask 1.0.4 Flask-RESTful 0.3.8 Flask-Testing 0.7.1 fsspec 0.5.2 future 0.17.1 geosketch 0.3 gevent 1.4.0 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.15 h5py 2.9.0 HeapDict 1.0.1 html5lib 1.0.1 idna 2.7 imageio 2.6.0 imagesize 1.1.0 importlib-metadata 0.23 ipykernel 5.1.2 ipython 7.8.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.15.1 jeepney 0.4.1 Jinja2 2.10.3 jmespath 0.9.5 joblib 0.13.2 json5 0.8.5 jsonschema 3.0.2 jupyter 1.0.0 jupyter-client 5.3.3 jupyter-console 6.0.0 jupyter-core 4.5.0 jupyterlab 1.1.4 jupyterlab-server 1.0.6 keyring 18.0.0 kiwisolver 1.1.0 lazy-object-proxy 1.4.2 libarchive-c 2.8 lief 0.9.0 llvmlite 0.29.0 locket 0.2.0 loompy 3.0.6 lxml 4.4.1 MarkupSafe 1.1.1 matplotlib 3.1.1 mccabe 0.6.1 mistune 0.8.4 mizani 0.6.0 mkl-fft 1.0.14 mkl-random 1.1.0 mkl-service 2.3.0 mock 3.0.5 more-itertools 7.2.0 mpmath 1.1.0 msgpack 0.6.1 multipledispatch 0.6.0 navigator-updater 0.2.1 nbconvert 5.6.0 nbformat 4.4.0 networkx 2.3 nltk 3.4.5 nose 1.3.7 notebook 6.0.1 numba 0.45.1 numexpr 2.7.0 numpy 1.17.2 numpy-groupies 0+unknown numpydoc 0.9.1 olefile 0.46 openpyxl 3.0.0 packaging 19.2 palettable 3.3.0 pandas 1.0.3 pandocfilters 1.4.2 parso 0.5.1 partd 1.0.0 path.py 12.0.1 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.7.0 pickleshare 0.7.5 pika 0.12.0 Pillow 6.2.0 pip 19.2.3 pkginfo 1.5.0.1 plotnine 0.6.0 pluggy 0.13.0 ply 3.11 prometheus-client 0.7.1 prompt-toolkit 2.0.10 psutil 5.6.3 ptyprocess 0.6.0 py 1.8.0 pyarrow 0.17.1 pycodestyle 2.5.0 pycosat 0.6.3 pycparser 2.19 pycrypto 2.6.1 pycurl 7.43.0.3 pyflakes 2.1.1 Pygments 2.4.2 pylint 2.4.2 pyodbc 4.0.27 pyOpenSSL 19.0.0 pyparsing 2.4.2 pyreadr 0.2.9 pyrsistent 0.15.4 pysam 0.15.4 PySocks 1.7.1 pytest 5.2.1 pytest-arraydiff 0.3 pytest-astropy 0.5.0 pytest-doctestplus 0.4.0 pytest-openfiles 0.4.0 pytest-remotedata 0.3.2 python-dateutil 2.8.0 pytz 2019.3 PyWavelets 1.0.3 PyYAML 5.1.2 pyzmq 18.1.0 QtAwesome 0.6.0 qtconsole 4.5.5 QtPy 1.9.0 requests 2.19.1 rope 0.14.0 rpy2 3.0.5 ruamel-yaml 0.15.46 s3transfer 0.1.13 scikit-image 0.15.0 scikit-learn 0.21.3 scipy 1.3.1 scrublet 0.2.1 seaborn 0.9.0 SecretStorage 3.1.1 Send2Trash 1.5.0 setuptools 41.4.0 setuptools-scm 4.1.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 six 1.12.0 snowballstemmer 2.0.0 sortedcollections 1.1.2 sortedcontainers 2.1.0 soupsieve 1.9.3 Sphinx 2.2.0 sphinxcontrib-applehelp 1.0.1 sphinxcontrib-devhelp 1.0.1 sphinxcontrib-htmlhelp 1.0.2 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.2 sphinxcontrib-serializinghtml 1.1.3 sphinxcontrib-websupport 1.1.2 spyder 3.3.6 spyder-kernels 0.5.2 SQLAlchemy 1.3.9 statsmodels 0.10.1 sympy 1.4 tables 3.5.2 tblib 1.4.0 terminado 0.8.2 testpath 0.4.2 toolz 0.10.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 tzlocal 2.1 umap-learn 0.3.10 unicodecsv 0.14.1 urllib3 1.23 velocyto 0.17.17 wcwidth 0.1.7 webencodings 0.5.1 Werkzeug 0.16.0 wheel 0.33.6 widgetsnbextension 3.5.1 wrapt 1.11.2 wurlitzer 1.0.3 xlrd 1.2.0 XlsxWriter 1.2.1 xlwt 1.3.0 zict 1.0.0 zipp 0.6.0

If you need the actual data, I can anonymize it and share it with you. Let me know if you need anything else.

@tstannius tstannius changed the title Gene filtering removes all genes parse_input does not detect invalid metadata value types May 28, 2020
@tstannius tstannius added the check&catch error checks and catching of errors. If very serious issue, report as bug label May 28, 2020
@tstannius
Copy link
Member

tstannius commented May 28, 2020

Thanks!

It makes good sense to store your data as hdf5 and we will also transition to this format in the next release :-)

It seems that the type (tuple) of the row and column names are causing the error. They should be strings instead.

Do try something along the lines of the following and let me know how it goes.

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
# I think float32 should be sufficient
cols = pd.read_csv("./cells.csv").values
cols = [c[0] for c in cols if type(c) is tuple]
# may give an error, if not all colnames were tuples and you will have to do some more work,
# but rather that than a silent error!
data.columns = cols


# same approach for idx
idx = pd.read_csv("./genes.csv").values
# do stuff ...
data.index = idx

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

@rrydbirk
Copy link
Author

rrydbirk commented May 28, 2020

Unfortunately, your code erased everything in cols:

>>> cols = pd.read_csv("./cells.csv").values
>>> cols.size
118345
>>> cols = [c[0] for c in cols if type(c) is tuple]
>>> cols.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'size'
>>> cols
[]

Instead, this did the trick:

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
data.columns = pd.read_csv("./cells.csv").values
data.columns = [v[0] for v in data.columns]
data.index = pd.read_csv("./genes.csv").values
data.index = [v[0] for v in data.index]

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 1 min 8 sec
Preprocessing - normalizing data ... data normalized in 2 min 24 sec
Preprocessing - running ANOVA ... excluded 4172 / 28621 genes in 2 min 44 sec

Also, I now see that I'm running on an old version of HDF5Array (1.10.1) since I'm still on R 3.5.1. In newer versions (at least 1.16.0), writeHDF5Array includes the option "with.dimnames" which should circumvent my problems.
So, for future reference, going from R to Python with large sparsematrices, this should suffice:
writeHDF5Array(matrix, "./data.h5", "data", verbose = T, with.dimnames=T)

@tstannius
Copy link
Member

Glad to hear that!

I will keep the issue open until the input check function has been updated to handle this kind of issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
check&catch error checks and catching of errors. If very serious issue, report as bug
Projects
None yet
Development

No branches or pull requests

2 participants