Different shape/ length error #3

Closed
lianyunhuang opened this issue Feb 17, 2021 · 14 comments

Comments

@lianyunhuang

Hey There,

Thanks for reading. I'm using this package and getting an error:

[WARNING] 8634629 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
100%|██████████| 22/22 [16:14<00:00, 44.29s/it]
/S-PCGC/pcgc_main.py:397: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself.
Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
gencov_arr = np.empty((len(pcgc_data_list), len(pcgc_data_list)), dtype=np.object)
Traceback (most recent call last):
  File "/S-PCGC/pcgc_main.py", line 857, in <module>
    pcgc_obj = SPCGC(args)
  File "/S-PCGC/pcgc_main.py", line 402, in __init__
    cov_ij = self.create_cov_obj(args, oi, oj,
  File "/pcgc_main.py", line 628, in create_cov_obj
    self.compute_taus(args, oi, oj,
  File "/S-PCGC/pcgc_main.py", line 753, in compute_taus
    z1_anno = df_annotations_sumstats_noneg.values * sumstats1[:, np.newaxis] * np.sqrt(trace_ratios1)
ValueError: operands could not be broadcast together with shapes (8634629,97) (8636723,1)

With the same data and the same code, another person gets a different error message:

[WARNING] 8636723 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
[INFO] reading annot files...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [21:54<00:00, 59.77s/it]
Traceback (most recent call last):
  File "pcgc_main.py", line 857, in <module>
    pcgc_obj = SPCGC(args)
  File "pcgc_main.py", line 394, in __init__
    self.load_annotations_data(args, df_prodr2, index_intersect)
  File "pcgc_main.py", line 488, in load_annotations_data
    is_same = (df.index == index_intersect).all()
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 123, in cmp_method
    raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare

Do you have any idea what causes this error and how to solve it? Thanks a lot, and looking forward to your reply :))
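
For reference, the first traceback is an ordinary NumPy broadcasting failure: the annotation matrix and the sumstats vector have different numbers of rows (8634629 vs. 8636723). A minimal sketch that reproduces the same kind of error with small made-up shapes follows; the variable names and the helper `can_multiply_rowwise` are illustrative assumptions, not code from pcgc_main.py.

```python
import numpy as np

def can_multiply_rowwise(matrix, vector):
    """Return True if `vector` has exactly one entry per row of `matrix`."""
    return matrix.shape[0] == vector.shape[0]

# Small stand-ins for the (8634629, 97) annotation matrix and the
# 8636723-long sumstats vector from the traceback (hypothetical data).
annotations = np.ones((5, 3))
sumstats = np.ones(7)

try:
    # Same pattern as the failing line: matrix * vector[:, np.newaxis]
    annotations * sumstats[:, np.newaxis]
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print(err)
```

Checking the row counts before the multiplication (as `can_multiply_rowwise` does) makes the mismatch visible earlier, but it does not explain why the two inputs ended up with different SNP sets.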

@omerwe
Owner

omerwe commented Feb 17, 2021

Hi,

Can you please let me know if you manage to run the tl;dr example from the main GitHub page? If you can, we need to figure out what's the difference between my example data and the data that you're using. Can you please send me the exact command that you used to generate this output?

Thanks,

Omer

@lianyunhuang
Author

Hey Omer,

Thanks a lot for your reply. The toy example from the main page runs nicely. How can I show you my data structure? Should I send the head of each file? If so, which files do you think I need to include?

As for the command, the error occurs in the step that calculates h2 and genetic correlation (between Case/Control of phenotype A2 in my case). The main command looks like this:

python $dir_softpcgc/pcgc_main.py \
--annot-chr $dir_data/baselineLD. \
--sync $dir_data/baselineLD. \
--sumstats-chr $dir_data/Case_A2.chr,$dir_data/Control_A2.chr \
--prodr2-chr $dir_data/baselineLD.goodSNPs. \
--out $wdir/pcgc

Thanks!
Lianyun

@omerwe
Owner

omerwe commented Feb 18, 2021

Hi Lianyun,

Since the example data works well, there must be something off in your input files. Could you send me a small sample (just the first few lines) of each of these files, so that I can try to figure out what's wrong? I'll also update the code to give a more meaningful error message if this happens in the future. If that's OK, please send these to oweissbrod@hsph.harvard.edu

Thanks,

Omer

@lianyunhuang
Author

Hey Omer,

I've sent you the email. Thanks! :))

Lianyun

@omerwe
Owner

omerwe commented Feb 19, 2021

Hi Lianyun,

Thanks for sending me the files. It looks like there's a problem in the .prodr2 files: some of the annotations are missing from the header line (e.g. FetalDHS_Trynka). I also see some annotations that appear only in the .prodr2 files (e.g. FetalDHS_TrynkaFetalDHS).

Do you have any idea how this happened? Maybe you used slightly different annotation files in different parts of the pipeline? If you're sure you haven't, can you please send me a small reproducible example that I can run from scratch (using e.g. small/fake files)?
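
A header mismatch like the one described above can be checked mechanically. The sketch below compares annotation names from two header lines; it assumes both headers are whitespace-delimited and that the annotation header also carries non-annotation columns such as CHR, BP, SNP and CM. The function name `compare_annotation_names` and the example headers are hypothetical, and the real .prodr2 layout may differ.

```python
def compare_annotation_names(annot_header, prodr2_header,
                             meta_columns=("CHR", "BP", "SNP", "CM")):
    """Return (names only in annot header, names only in prodr2 header)."""
    # Drop the non-annotation columns (assumed names) from the annot header.
    annot_names = [c for c in annot_header.split() if c not in meta_columns]
    prodr2_names = prodr2_header.split()
    only_in_annot = sorted(set(annot_names) - set(prodr2_names))
    only_in_prodr2 = sorted(set(prodr2_names) - set(annot_names))
    return only_in_annot, only_in_prodr2

# Hypothetical header lines illustrating the mismatch discussed above.
annot_header = "CHR BP SNP CM base FetalDHS_Trynka"
prodr2_header = "base FetalDHS_TrynkaFetalDHS"

only_in_annot, only_in_prodr2 = compare_annotation_names(annot_header, prodr2_header)
print("only in annot:", only_in_annot)
print("only in prodr2:", only_in_prodr2)
```

Two empty lists would mean the two headers name the same set of annotations.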

Thanks,

Omer

@lianyunhuang
Author

Hey Omer,

I see, that is quite interesting. I will check the whole procedure and maybe re-run it before sending you an example, which might take a while. I will let you know how it goes.

Thanks!

Best,
Lianyun

@lianyunhuang
Author

Hey Omer,

  1. I checked the annotations and they look fine. The prodr2 file has dimensions 97 x 97. The annotations are the same in the annot file and the prodr2 file, except that the annot file has 4 extra columns: CHR, BP, SNP and CM. The difference you saw may be due to imperfect formatting of the files I sent.
  2. I re-ran step 2 to generate the prodr2 file on another cluster and got the same prodr2 file as before.
  3. I'm now re-running step 3 to generate the sumstats files, which might take a long time.
  4. If I send you a small example to run, how should I subset the data to make sure it includes all the necessary information?

Thanks!

Best,
Lianyun

@omerwe
Owner

omerwe commented Feb 21, 2021

Hi Lianyun,

Thanks for the update. For my understanding, can you please say which of these annotations appeared in the original annotation files: (1) FetalDHS_Trynka; (2) FetalDHS_TrynkaFetalDHS; or (3) both?

I think the simplest option is to subset a small number of SNPs (e.g. 5000) and run the pipeline on only these SNPs. If you can reproduce the problem, I can work on files derived from these small files.
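
One way to build such a subset for tabular inputs is sketched below with pandas. It assumes each table has a SNP identifier column named `SNP`; the helper names `pick_snp_ids` and `keep_snps` and the toy tables are hypothetical, and genotype files would need to be subset with whatever tool produced them.

```python
import pandas as pd

def pick_snp_ids(df, n_snps=5000, snp_column="SNP"):
    """Return the first `n_snps` distinct SNP identifiers in `df`."""
    return df[snp_column].drop_duplicates().head(n_snps).tolist()

def keep_snps(df, snp_ids, snp_column="SNP"):
    """Keep only rows whose SNP identifier is in `snp_ids`."""
    return df[df[snp_column].isin(snp_ids)].reset_index(drop=True)

# Toy tables standing in for an annotation file and another input file.
annot = pd.DataFrame({"SNP": ["rs1", "rs2", "rs3", "rs4"], "base": [1, 1, 1, 1]})
other = pd.DataFrame({"SNP": ["rs4", "rs2", "rs9"], "value": [0.1, 0.2, 0.3]})

# Choose the SNP list once, then apply the same list to every table.
snp_ids = pick_snp_ids(annot, n_snps=2)
small_annot = keep_snps(annot, snp_ids)
small_other = keep_snps(other, snp_ids)
```

Choosing the list once and reusing it keeps every subsetted file restricted to the same SNPs.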

Thanks,

Omer

@lianyunhuang
Author

Hey Omer,

I've emailed you the details of the annotations as well as a link to the data. Please check. Thanks!

Best,
Lianyun

@lianyunhuang
Author

lianyunhuang commented Feb 22, 2021

Hey Omer,

Also, I just got a new error in step 3 (creating the sumstats files):

Traceback (most recent call last):
  File "/softwares/spcgc/pcgc_sumstats_creator.py", line 590, in <module>
    sumstats_creator.compute_all_sumstats(args.chunk_size)
  File "/softwares/spcgc/pcgc_sumstats_creator.py", line 271, in compute_all_sumstats
    self.set_locus(snp1, snp2)
  File "/softwares/spcgc/pcgc_sumstats_creator.py", line 318, in set_locus
    snp_maf = self.mafs[snp1+j]
  File "/anaconda3/envs/xyb/lib/python3.8/site-packages/pandas/core/series.py", line 821, in __getitem__
    return self._values[key]
IndexError: index 116914 is out of bounds for axis 0 with size 116914

Best,
Lianyun

@omerwe
Owner

omerwe commented Feb 28, 2021

Hi,

Apparently the problem was due to duplicate rsids in the input files. I modified the code to allow better handling of this situation. Can you please git pull the latest code and try again?
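
Duplicate rsids can also be spotted before running the pipeline. The following is a minimal pre-check sketch with pandas, assuming the identifiers live in a column named `SNP`; the helper names and the toy table are hypothetical, and this is not the fix that was made in the package itself.

```python
import pandas as pd

def find_duplicate_rsids(df, snp_column="SNP"):
    """Return the sorted list of identifiers that occur more than once."""
    dup_mask = df[snp_column].duplicated(keep=False)
    return sorted(df.loc[dup_mask, snp_column].unique().tolist())

def drop_duplicate_rsids(df, snp_column="SNP"):
    """Keep the first row for each identifier and discard the rest."""
    return df.drop_duplicates(subset=snp_column, keep="first").reset_index(drop=True)

# Toy table with one duplicated identifier (rs2).
snps = pd.DataFrame({"SNP": ["rs1", "rs2", "rs2", "rs3"],
                     "BP": [100, 200, 205, 300]})

duplicates = find_duplicate_rsids(snps)
deduplicated = drop_duplicate_rsids(snps)
print(duplicates)
```

Keeping the first occurrence is only one possible policy; dropping every copy of a duplicated rsid is another, and which one is appropriate depends on the data.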

@lianyunhuang
Author

Hi Omer,

Thanks a lot! I will try and let you know. :))

Lianyun

@lianyunhuang
Author

Hi Omer,

A quick update: the new code seems to be working well. I get the final result files, although with a lot of warning messages. I'm now re-running everything from scratch on the data excluding duplicated rsids. I will let you know if there is any news.

Thanks a lot!

Lianyun

@lianyunhuang
Author

Hi Omer,

I've finished a new run on the same data, but I still get some weird results. I've emailed you the details. Please check.
Thanks a lot for your help!

Lianyun

@omerwe omerwe closed this as completed Apr 18, 2021