isoem2/isoDE2 README
1. Installation:
------------
   1. Create an isoem2 directory and download the git repository using:
          git clone https://github.com/mandricigor/isoem2.git
   2. Run the Linux shell script 'install' provided in the git repository.
   3. [Optional] On Windows you might want to add the isoem2/isoDE2
      installation directory to the path, so that you can invoke
      isoem2 from any location. On Linux, you can obtain a similar
      effect by creating symbolic links to the isoem2 and isoDE2
      executables in /usr/local/bin.
2. Testing your installation:
--------------------------
To test the installation of isoem2 and isoDE2, download and unzip the following
compressed archive and follow the instructions in the README file included in the archive.
http://dna.engr.uconn.edu/~software/IsoEM/testdata/IsoEM2IsoDE2-MAQC-Sample.zip
3. Running isoem2:
-----------------
isoem2 takes as input a set of known isoforms in GTF format, and a
file with aligned reads in SAM format. The aligned reads MUST be
sorted by read name. If you are not sure, run this command to sort the
file:
    sort -k 1,1 aligned_reads.sam > aligned_reads_sorted.sam
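A plain 'sort' also reorders any SAM header lines (those starting with '@')
together with the alignments. The sketch below is not part of the isoem2
suite. It keeps the header lines on top and sorts only the alignment records
by read name. The helper name 'sort_sam_by_name' and the choice of the C
locale for byte-wise ordering are assumptions made for illustration.

```shell
# Sketch only: sort a SAM file by read name while keeping header lines first.
# sort_sam_by_name is a hypothetical helper, not an isoem2 tool.
sort_sam_by_name() {
    sam_in=$1
    sam_out=$2
    tab=$(printf '\t')
    {
        # Header lines start with '@'; copy them through unchanged.
        grep '^@' "$sam_in" || true
        # Sort the remaining alignment records on the first (read name) column.
        grep -v '^@' "$sam_in" | LC_ALL=C sort -t "$tab" -k 1,1
    } > "$sam_out"
}
```

Usage: sort_sam_by_name aligned_reads.sam aligned_reads_sorted.sam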
You can run isoem2 from the command line as follows:
    isoem2 [global options]* [library options]* <aligned_reads.sam>
Or, if you provide the read alignments on the standard input:
    cat <aligned_reads.sam> | isoem2 [global options]* [library options]*
Mandatory global options:
------------------------
  -G, --GTF <GTF file>                Known genes and isoforms in GTF format
Mandatory library options: either -a or both -m and -d must be present:
-------------------------
  -m, --fragment-mean <Double>        Fragment length mean
  -d, --fragment-std-dev <Double>     Fragment length standard deviation
  -a, --auto-fragment-distrib         Automatically detect fragment length
                                      distribution from uniquely mapping
                                      paired reads (DOES NOT WORK FOR
                                      SINGLE READS)
Optional global options:
-----------------------
  -c, --gene-clusters <Cluster file>  Override isoform to gene mapping
                                      defined in the GTF file with a
                                      mapping taken from the given file.
                                      The format of each line in the file
                                      is "isoform gene"
  -g <genome fasta file>              Genome reference sequence (needed by
                                      some library options)
  -b                                  Perform hexamer bias correction
  -h, --help                          Show help
  -r <Repeats GTF>                    Drop alignments falling within
                                      annotated repeats
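As an illustration of the "isoform gene" format expected by -c, the sketch
below writes out the mapping already present in a GTF file, which can serve
as a starting point for a hand-edited cluster file. It is not an isoem2 tool.
It assumes the usual GTF attribute syntax (gene_id "..."; transcript_id "...";)
in the ninth column and a single space between the two names. The helper name
'gtf_to_clusters' is hypothetical.

```shell
# Sketch only: print one "isoform gene" line per transcript found in a GTF.
# gtf_to_clusters is a hypothetical helper, not part of the isoem2 suite.
gtf_to_clusters() {
    awk -F '\t' '
        /^#/ { next }    # skip comment lines
        {
            tid = ""; gid = ""
            # Pull the quoted values out of the attribute column.
            if (match($9, /transcript_id "[^"]+"/))
                tid = substr($9, RSTART + 15, RLENGTH - 16)
            if (match($9, /gene_id "[^"]+"/))
                gid = substr($9, RSTART + 9, RLENGTH - 10)
            # Report each transcript once, in order of first appearance.
            if (tid != "" && gid != "" && !(tid in seen)) {
                seen[tid] = 1
                print tid, gid
            }
        }' "$1"
}
```

Usage: gtf_to_clusters known_isoforms.gtf > clusters.txt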
Optional library options:
------------------------
  -s, --directional                   Dataset obtained by directed RNA-Seq
                                      (the strand of each read is
                                      deterministically chosen: for single
                                      reads, the read always comes from
                                      the coding strand; for paired reads,
                                      the first read always comes from the
                                      coding strand, the second from the
                                      opposite strand)
      --antisense                     Directional sequencing, but the reads
                                      come from the antisense strand
      --mate-pairs                    Paired reads come from the same strand
                                      (as opposed to the default behavior
                                      where the two reads in a pair are
                                      assumed to come from opposite
                                      strands)
      --max-mismatches <Integer>      Maximum number of mismatches allowed
                                      for a read. This requires the genome
                                      sequence to be specified (see -g).
  -q, --quality-scores                Weigh the reads based on their quality
                                      scores. This requires the genome
                                      sequence to be specified (see -g).
      --repeat-threshold <nbases>     Drop all reads that have more than
                                      this many bases inside annotated
                                      repeats. Default: 20.
      --polyA <nbases>                Reads have been generated from mRNAs
                                      with polyA tails of approximately
                                      the given number of bases
  -o <file prefix>                    Output files prefix. It can include
                                      a path. Default: same as the SAM
                                      file name
  -O <directory prefix>               Output directory prefix. If read
                                      alignments are read from stdin,
                                      the default value is stdinSample
  -C <confidence interval (%)>        Compute expression of genes/isoforms
                                      with specified confidence intervals.
                                      Provide an integer (default: 95,
                                      bootstraps: 200)
      --endseq                        Disable length normalization for data
                                      generated using 5' or 3'
                                      end-sequencing protocols, which
                                      generate a single fragment per cDNA
                                      molecule
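To show how the options fit together, here is a small sketch of two common
invocations wrapped in shell functions. It uses only flags documented in this
README. The function names and argument order are hypothetical, and the
fragment length mean and standard deviation must come from your own library
preparation.

```shell
# Sketch only: thin wrappers around isoem2 using flags documented above.
# The function names are hypothetical and not part of the isoem2 suite.

# Paired reads: infer the fragment length distribution (-a) and request
# 95% confidence intervals (-C 95).
# Arguments: <GTF file> <name-sorted SAM file>
quantify_paired() {
    isoem2 -G "$1" -a -C 95 "$2"
}

# Single reads: -a does not apply, so pass the fragment length mean (-m)
# and standard deviation (-d) explicitly.
# Arguments: <GTF file> <name-sorted SAM file> <mean> <std dev>
quantify_single() {
    isoem2 -G "$1" -m "$3" -d "$4" "$2"
}
```

Usage: quantify_paired known_isoforms.gtf aligned_reads_sorted.sam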
Output
------
isoem2 generates the following output file structure under a directory
with the same name as the SAM file, unless the -o option is used:
    <output_directory>
    |
    - output
    |    |
    |    - Isoforms
    |    |    |
    |    |    - iso_fpkm_estimates
    |    |    - iso_tpm_estimates
    |    - Genes
    |         |
    |         - iso_fpkm_estimates
    |         - iso_tpm_estimates
    - ConfidenceIntervals (Only if -C option is used)
    |    |
    |    - iso_fpkm_ci
    |    - iso_tpm_ci
    |    - gene_fpkm_ci
    |    - gene_tpm_ci
    - boostrap.tar.gz
Files under output/Isoforms and output/Genes are tab-delimited files with the following two fields:
    1- Isoform/Gene ID
    2- Isoform/Gene FPKM (Fragments Per Kilobase per Million reads) or TPM (Transcripts Per Million)
Files under output/ConfidenceIntervals are tab-delimited files with the following three fields:
    1- Isoform/Gene ID
    2- Lower bound of the confidence interval (95% by default, see -C) of the Isoform/Gene FPKM/TPM estimate, determined by bootstrapping
    3- Upper bound of the confidence interval (95% by default, see -C) of the Isoform/Gene FPKM/TPM estimate, determined by bootstrapping
boostrap.tar.gz is a compressed tar archive containing the bootstrap samples used to determine the confidence intervals.
This archive can be used as input to the isoDE2 tool for computing differentially expressed isoforms/genes.
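Because the estimate files and the confidence interval files share the ID in
their first column, they can be combined into one table. The sketch below
does this with awk. It assumes the files have no header line, and the helper
name 'join_estimates_with_ci' is hypothetical. IDs without a confidence
interval are reported with NA bounds.

```shell
# Sketch only: print "ID  estimate  lower  upper" for every ID in an
# estimates file, looking up the bounds in the matching CI file.
# join_estimates_with_ci is a hypothetical helper, not an isoem2 tool.
# Arguments: <estimates file> <confidence interval file>
join_estimates_with_ci() {
    awk -F '\t' -v OFS='\t' '
        # First file on the command line: remember the bounds per ID.
        FILENAME == ARGV[1] { lo[$1] = $2; hi[$1] = $3; next }
        # Second file: append the bounds to each estimate.
        {
            print $1, $2, ($1 in lo ? lo[$1] : "NA"), ($1 in hi ? hi[$1] : "NA")
        }' "$2" "$1"
}
```

Usage: join_estimates_with_ci iso_fpkm_estimates iso_fpkm_ci > iso_fpkm_with_ci.tsv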
Note: Read Alignment:
---------------------
To align the reads you have two options:
    1) Use spliced alignment directly on the genome.
    2) Use unspliced alignment to the transcriptome.
If you have a transcriptome reference and no GTF (needed to run isoem2),
you can use the fastaToGTF tool, included with the isoem2 suite, to generate a GTF.
If you want to generate a transcriptome reference from a GTF, you can use the
extract-isoform-sequences-from-genome tool, included with the isoem2 suite.
4. Running isoDE2
----------------
isoDE2 makes differential expression (DE) calls for gene/isoform FPKM and TPM estimates using the bootstrapping output generated by isoem2:
    isoDE2 -c1 <list of bootstrapping paths for condition 1> -c2 <list of bootstrapping paths for condition 2> -pval <desired p-value> -out <output files prefix>
Mandatory parameters
--------------------
  -c1     List of bootstrapping compressed archives for condition 1
  -c2     List of bootstrapping compressed archives for condition 2
  -pval   Desired p-value (the significance level used for the DE calls)
  -out    Prefix for the generated output files
Output
------
Four files with the prefix specified as input (-out) and the following suffixes:
    geneFPKM
    geneTPM
    isoFPKM
    isoTPM
All four output files have the same structure, described below
Description of isoDE2 Output file:
---------------------------------
1- Gene/isoform ID
2- Confident log2(FC): the base 2 logarithm of the largest condition 2 vs condition 1
   fold change of gene/isoform FPKM/TPM estimates supported by the
   bootstrap samples at a significance level of 'pval'. Positive values
   represent over-expression in condition 2, negative values represent
   over-expression in condition 1, and zero values indicate that no
   significant change was detected.
3- Single run log2(FC): the base 2 logarithm of the ratio between expression levels estimated
   by isoem2 for condition 2 and condition 1 (or the mean estimates in case
   replicates are provided for the two conditions).
4- Condition 1 FPKM (or TPM) based on an isoem2 run without bootstrapping (mean value in case of replicates)
5- Condition 2 FPKM (or TPM) based on an isoem2 run without bootstrapping (mean value in case of replicates)
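Since a zero in the second column means that no significant change was
detected, the DE calls can be pulled out with a short filter. The sketch
below assumes that the columns are whitespace-separated, that they appear in
the order listed above, and that IDs contain no spaces. The helper name
'significant_de' and the direction labels are hypothetical.

```shell
# Sketch only: list genes/isoforms with a non-zero confident log2(FC) and
# say in which condition they are over-expressed.
# significant_de is a hypothetical helper, not part of the isoem2 suite.
significant_de() {
    awk -v OFS='\t' '
        # Skip lines whose second field is zero or not numeric (e.g. a header).
        $2 + 0 != 0 {
            direction = ($2 + 0 > 0) ? "condition2" : "condition1"
            print $1, $2, direction
        }' "$1"
}
```

Usage: significant_de <prefix>geneFPKM, where <prefix> is the value passed to -out.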
Example
-------
isoDE2 -c1 /data1/BRAIN_UHR_Test/BRAIN_Genome_DIR/ /DataSet1/Test1_DIR/ -c2 /data1/BRAIN_UHR_Test/UHR_Genome_DIR/ /DataSet1/Test2_DIR/ -pval 0.05 -out "output1.txt"
isoDE2 -c1 ./BRAIN_Genome_DIR/ ./Test1_DIR/ -c2 ./UHR_Genome_DIR/ ./DataSet1/Test2_DIR/ -pval 0.05 -out "output2.txt"
Source Code:
------------
The source code can be found in the src directory under the
installation path.
Revision history
----------------
Version 2.0.0 (1/20/16)  - added TPM estimates for genes and isoforms
                         - added option to compute confidence intervals (bootstrapping)
                         - added option for reading alignments from standard input
                         - integrated IsoDE with IsoEM
                         - added DE for isoform FPKMs and gene and isoform TPMs
                         - removed the isoviz visualization tool. To be added to the isoem2 suite in the future
Version 1.1.4 (12/18/15) - added --counts option to generate expected read counts and --endseq to
                           handle data from end-sequencing protocols
Version 1.1.3 (10/11/15) - bug fix in handling CIGAR with indels in convert-iso-to-genome-coords
                         - bug fix related to hisat/hisat2 alignments
Version 1.1.1 (11/5/12)  - bug fix related to clipped read alignments (CIGAR with S field)
Version 1.1.0 (4/24/12)  - added support for alignments with insertions and deletions
Version 1.0.6 (8/12/11)  - extract-isoform-sequences-from-genome (see
                           http://dna.engr.uconn.edu/software/IsoEM/README-SAMPLE.TXT)
                           generates transcripts in a randomized order
                         - isoviz generates a gtf with fpkm values
                         - added output file name option
Version 1.0.5 (5/08/11)  - bug fix related to paired read data
Version 1.0.4 (2/22/11)  - added polyATail option
                         - further memory and speed improvements
Version 1.0.3 (8/30/10)  - correct for annotated repeats
Version 1.0.2 (8/05/10)  - improved memory requirements for storing genome sequence
                         - added hexamer bias correction option
                         - added isoviz visualization tool
Version 1.0.1 (6/25/10)  - added support for mate pairs
                         - added support for max number of mismatches
                         - performance improvements
Version 1.0.0 (6/16/10)  - first public release
Contact
-------
For questions or suggestions regarding IsoEM2/IsoDE2 you can contact:
Igor Mandric (imandric1@student.gsu.edu)
Sahar Al Seesi (sahar@engr.uconn.edu)
Ion Mandoiu (ion@engr.uconn.edu)
Alex Zelikovsky (alexz@cs.gsu.edu)