Error with index.genotype command. Error in file.exists(file) : invalid 'file' argument #1

Closed
TobyGurran opened this issue Jun 5, 2016 · 1 comment

@TobyGurran

Dear Dr Monlong,

Many thanks for publishing and making available such an interesting and useful package! I find splice QTLs very interesting and would very much like to identify and study some from my cancer dataset.

I have encountered an error with the index.genotype command which I hope you will be able to help me with.

As per the instructions on your sQTLseekeR GitHub page, https://github.com/jmonlong/sQTLseekeR, I have prepared my genotype information as described: chromosome, SNP start, SNP end, snpId, then my samples with genotypes coded 0, 1, 2 (-1 for missing):

output_Reference_file_Transpose_a2Version_CHR_22.traw[1:3,1:7]
  chr    start      end    snpId sample1 sample2 sample3
1  22 16054311 16054311 rs102459       2       2       2
2  22 16054713 16054713 rs230493       2       2       2
3  22 16066757 16066757 rs356385       2       2       2

However, when I try to run the index.genotype command as per the "run-example" page, https://github.com/jmonlong/sQTLseekeR/blob/master/scripts/run-example.R, I get the following error:

CHR_22.indexed <- index.genotype(output_Reference_file_Transpose_a2Version_CHR_22.traw)
Error in file.exists(file) : invalid 'file' argument

Do you have any suggestions as to why this could be?

Admittedly, the file I am running the command on contains only chromosome 22 (I decided to run on a small subset first). Could the absence of the other chromosomes be confusing the programme?

I am sure that I have created the structure of the file correctly, because if I use a file which does not have the input columns stipulated in the instructions, I get a different error telling me that those columns are missing.

genotype.indexed.f <- index.genotype(incorrect.table)
Error in index.genotype(genotype.f) :
Missing column or in incorrect order. The first 4 columns must be 'chr', 'start', 'end' and 'snpId'.

Could it also be that my data is not in the correct format? Is the data required to be in .tsv format? Because the run-example page reads:

genotype.f="snps-012coded.tsv"
#1) Index the genotype file (if not done externally before)

genotype.indexed.f = index.genotype(genotype.f)

My data is not in .tsv, but it is already read into R. I would guess that this is unlikely to be the issue, because whatever format the data is in before being read into R, it becomes a data frame once it is read in.
However, I cannot see a line in the run-example where the .tsv is read in. read.table is used to read in the transcript expression in step 2 and the bed file in step 3, but there is no line that specifically reads in the .tsv, which makes me wonder whether the data is required to be in that format.

#2) Prepare transcript expression

te.df= read.table(trans.exp.f,as.is=TRUE,header=TRUE,sep="\t")
#3) Test gene/SNP associations

gene.bed= read.table(gene.bed.f,as.is=TRUE,sep="\t")

As a potential solution to this problem, your example says "Index the genotype file (if not done externally before)", which implies that this step can be achieved another way. If I am unable to get this command to work, is there an alternative method I can use to compress and index the genotypes, as the index.genotype command is supposed to do? Would you be able to point me in the direction of a suitable package with which to do that?

I sincerely appreciate your time and I would be extremely grateful for any assistance you are able to give! I look forward to referencing your package when I have found some novel splice QTLs.

I am using R version 3.2.4 on a Linux server, in case that is important.

Many thanks!

@jmonlong
Owner

jmonlong commented Jun 6, 2016

Dear Toby,

Thanks for your enthusiasm and detailed message!

As you said, the data seems to be formatted correctly. The problem is the one you hinted at: index.genotype expects the name of a file as input, not an R object. The reason for this is to avoid loading the entire file into R (as these genotype files can be quite large). Under the hood, the file won't actually be loaded into a data.frame but will be directly compressed and indexed using Rsamtools functions.

I'll try to clarify the documentation and error messages, thanks for the feedback.
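
To illustrate where the message comes from, here is a minimal sketch in base R. It assumes, as the error text suggests, that index.genotype hands its argument to file.exists(); the data frame `geno` is a hypothetical stand-in for a genotype table already loaded into R.

```r
# Hypothetical stand-in for a genotype table that was read into R.
geno <- data.frame(chr = 22L, start = 16054311L, end = 16054311L,
                   snpId = "rs102459", sample1 = 2L,
                   stringsAsFactors = FALSE)

# file.exists() expects a character vector of paths. A data frame is a list,
# not a character vector, so the call signals an error.
res <- tryCatch(file.exists(geno), error = function(e) e)
print(inherits(res, "error"))
print(conditionMessage(res))

# A character path is the kind of input file.exists() accepts.
path.ok <- file.exists(tempdir())
print(path.ok)
```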

(As you mentioned the other solution would be to compress/index the file outside of R, using the tabix program. But anyway, now it should work within R when you use the file name instead of the R object.)
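
A minimal sketch of that fix, for the case where the genotypes only exist as an R object. The data frame `geno`, the output file name and the temporary directory are assumptions made for illustration; the index.genotype call is guarded so the snippet also runs where sQTLseekeR is not installed.

```r
# Hypothetical stand-in for the genotype data frame already loaded in R.
geno <- data.frame(
  chr     = c(22L, 22L, 22L),
  start   = c(16054311L, 16054713L, 16066757L),
  end     = c(16054311L, 16054713L, 16066757L),
  snpId   = c("rs102459", "rs230493", "rs356385"),
  sample1 = c(2L, 2L, 2L),
  sample2 = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Write it to disk as a tab-separated file with a header row.
# The file name is an arbitrary choice for this example.
genotype.f <- file.path(tempdir(), "snps-012coded-chr22.tsv")
write.table(geno, file = genotype.f, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Pass the file NAME (a character string), not the data frame itself.
if (requireNamespace("sQTLseekeR", quietly = TRUE)) {
  genotype.indexed.f <- sQTLseekeR::index.genotype(genotype.f)
}
```

If the genotypes already sit in a tab-separated file on disk, the write.table step is unnecessary and the path to that file can be passed directly.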

Don't hesitate to get in touch if you have any other problems.
