# Step 1: Alignment with HISAT2

- **-p 4** tells HISAT2 to use 4 CPU threads for alignment.
- **--rna-strandness RF** specifies the strandness of the RNA-seq library. We specify RF because these libraries were made with the TruSeq strand-specific kit. See the HISAT2 manual for the other options.
- **--rg-id ID** specifies a read group ID, which must be a unique identifier.
- **--rg SM:SAMPLE_NAME** specifies a read group sample name. Together with --rg-id, this lets you determine which reads came from which sample in the merged BAM later on.
- **--rg LB:LIBRARY_NAME** specifies a read group library name. Together with --rg-id, this lets you determine which reads came from which library in the merged BAM later on.
- **--rg PL:ILLUMINA** specifies a read group sequencing platform.
- **--rg PU:PLATFORM_UNIT** specifies a read group sequencing platform unit. Typically this consists of FLOWCELL-BARCODE.LANE.
- **--dta** reports alignments tailored for transcript assemblers.
- **-x /path/to/hisat2/index** The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
- **-1 /path/to/read1.fastq.gz** The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
- **-2 /path/to/read2.fastq.gz** The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
- **-S /path/to/output.sam** The output SAM format text file of alignments.
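
As a rough sketch of how these options fit together, the snippet below collects them in a Bash array and prints the resulting command without running it. The read group values are copied from the UHR_Rep1 command in this notebook; `INDEX`, `DATA`, and the FASTQ file names are placeholder assumptions, not real paths.

```bash
# Placeholder paths (assumptions for illustration; substitute your own).
INDEX="/path/to/hisat2/index"
DATA="/path/to/data"

# The platform unit is typically FLOWCELL-BARCODE.LANE.
FLOWCELL="CXX1234"; BARCODE="ACTGAC"; LANE="1"
PU="${FLOWCELL}-${BARCODE}.${LANE}"

# Storing each word as its own array element keeps quoting safe.
HISAT2_ARGS=(
  -p 4
  --rg-id=UHR_Rep1
  --rg SM:UHR
  --rg LB:UHR_Rep1_ERCC-Mix1
  --rg PL:ILLUMINA
  --rg "PU:${PU}"
  -x "$INDEX"
  --dta
  --rna-strandness RF
  -1 "$DATA/read1.fastq.gz"
  -2 "$DATA/read2.fastq.gz"
  -S ./UHR_Rep1.sam
)

# Dry run: print the command instead of executing it.
echo hisat2 "${HISAT2_ARGS[@]}"
```

Dropping the `echo` would run the alignment itself, provided the placeholder paths have been replaced with real ones.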

In [2]:
echo $RNA_ALIGN_DIR #Note: RNA_ALIGN_DIR=/home/ubuntu/workspace/rnaseq/alignments/hisat2
mkdir -p $RNA_ALIGN_DIR
cd $RNA_ALIGN_DIR
pwd

/home/ubuntu/workspace/rnaseq/alignments/hisat2
/home/ubuntu/workspace/rnaseq/alignments/hisat2


## This is the course's code to align reads with HISAT2:

In [None]:
hisat2 -p 4 --rg-id=UHR_Rep1 --rg SM:UHR --rg LB:UHR_Rep1_ERCC-Mix1 --rg PL:ILLUMINA --rg PU:CXX1234-ACTGAC.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./UHR_Rep1.sam
hisat2 -p 4 --rg-id=UHR_Rep2 --rg SM:UHR --rg LB:UHR_Rep2_ERCC-Mix1 --rg PL:ILLUMINA --rg PU:CXX1234-TGACAC.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./UHR_Rep2.sam
hisat2 -p 4 --rg-id=UHR_Rep3 --rg SM:UHR --rg LB:UHR_Rep3_ERCC-Mix1 --rg PL:ILLUMINA --rg PU:CXX1234-CTGACA.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./UHR_Rep3.sam

hisat2 -p 4 --rg-id=HBR_Rep1 --rg SM:HBR --rg LB:HBR_Rep1_ERCC-Mix2 --rg PL:ILLUMINA --rg PU:CXX1234-TGACAC.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./HBR_Rep1.sam
hisat2 -p 4 --rg-id=HBR_Rep2 --rg SM:HBR --rg LB:HBR_Rep2_ERCC-Mix2 --rg PL:ILLUMINA --rg PU:CXX1234-GACACT.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./HBR_Rep2.sam
hisat2 -p 4 --rg-id=HBR_Rep3 --rg SM:HBR --rg LB:HBR_Rep3_ERCC-Mix2 --rg PL:ILLUMINA --rg PU:CXX1234-ACACTG.1 -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_DATA_DIR/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -2 $RNA_DATA_DIR/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -S ./HBR_Rep3.sam

## Alternatively: A cleaner and scalable method from ChatGPT to align reads with HISAT2:
### 1. First, create a samples.tsv
### 2. Then use a Bash loop

### To create samples.tsv on WSL terminal

In [None]:
nano samples.tsv # This will create and open an empty file
# Paste the following to that empty file
SampleID	SM	Mix	PU
UHR_Rep1	UHR	1	CXX1234-ACTGAC.1
UHR_Rep2	UHR	1	CXX1234-TGACAC.1
UHR_Rep3	UHR	1	CXX1234-CTGACA.1
HBR_Rep1	HBR	2	CXX1234-TGACAC.1
HBR_Rep2	HBR	2	CXX1234-GACACT.1
HBR_Rep3	HBR	2	CXX1234-ACACTG.1

# Then:
# 1. Press Ctrl+O to write (save) the file.
# 2. Press Enter to confirm the filename (samples.tsv).
# 3. Press Ctrl+X to exit Nano.

### To create samples.tsv on Jupyter Notebook bash kernel

In [6]:
printf "SampleID\tSM\tMix\tPU\n\
UHR_Rep1\tUHR\t1\tCXX1234-ACTGAC.1\n\
UHR_Rep2\tUHR\t1\tCXX1234-TGACAC.1\n\
UHR_Rep3\tUHR\t1\tCXX1234-CTGACA.1\n\
HBR_Rep1\tHBR\t2\tCXX1234-TGACAC.1\n\
HBR_Rep2\tHBR\t2\tCXX1234-GACACT.1\n\
HBR_Rep3\tHBR\t2\tCXX1234-ACACTG.1\n" > samples.tsv

In [7]:
# To confirm the file was created
cat samples.tsv

SampleID	SM	Mix	PU
UHR_Rep1	UHR	1	CXX1234-ACTGAC.1
UHR_Rep2	UHR	1	CXX1234-TGACAC.1
UHR_Rep3	UHR	1	CXX1234-CTGACA.1
HBR_Rep1	HBR	2	CXX1234-TGACAC.1
HBR_Rep2	HBR	2	CXX1234-GACACT.1
HBR_Rep3	HBR	2	CXX1234-ACACTG.1


In [8]:
# to double-check it's tab-separated:
column -t -s $'\t' samples.tsv # will display samples.tsv nicely as a table if the tabs are correct

SampleID  SM   Mix  PU
UHR_Rep1  UHR  1    CXX1234-ACTGAC.1
UHR_Rep2  UHR  1    CXX1234-TGACAC.1
UHR_Rep3  UHR  1    CXX1234-CTGACA.1
HBR_Rep1  HBR  2    CXX1234-TGACAC.1
HBR_Rep2  HBR  2    CXX1234-GACACT.1
HBR_Rep3  HBR  2    CXX1234-ACACTG.1


### Bash loop for HISAT2 alignment
- **while ...; do ... done** is a standard Bash while loop that reads the file line-by-line.
- **IFS=...** sets the Internal Field Separator to a tab character (\t).
- **read** reads one line from the file and assigns the tab-separated fields into the variables:
    - SAMPLE → sample ID
    - SM → sample name (HBR or UHR)
    - MIX → 1 or 2
    - PU → platform unit (e.g. flowcell barcode)
- **The -r flag** tells read not to interpret backslashes as escape characters.
- **[[...]] && continue** checks whether the first column ($SAMPLE) is the string "SampleID"; if so, **continue** skips the header line and moves to the next iteration of the loop.
- **< samples.tsv** redirects the contents of samples.tsv into the loop so each line is processed
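
Before launching real alignments, it can help to check that the sample sheet parses as expected. The following is a minimal dry-run sketch of the same `while read` pattern; `preview_samples` is a hypothetical helper name introduced here, and it only prints the sample ID and the read 1 FASTQ name that would be derived from each row.

```bash
# Parse a tab-separated sample sheet from stdin and print, for each sample,
# its ID and the read 1 FASTQ file name built from the SampleID and Mix columns.
preview_samples() {
  local SAMPLE SM MIX PU
  while IFS=$'\t' read -r SAMPLE SM MIX PU; do
    [[ $SAMPLE == "SampleID" ]] && continue  # skip header
    printf '%s\t%s\n' "$SAMPLE" \
      "${SAMPLE}_ERCC-Mix${MIX}_Build37-ErccTranscripts-chr22.read1.fastq.gz"
  done
}

# Example with two rows typed inline; printf expands \t into real tabs.
printf 'SampleID\tSM\tMix\tPU\nUHR_Rep1\tUHR\t1\tCXX1234-ACTGAC.1\nHBR_Rep1\tHBR\t2\tCXX1234-TGACAC.1\n' \
  | preview_samples
```

With a real sample sheet, the equivalent call would be `preview_samples < samples.tsv`.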

In [9]:
while IFS=$'\t' read -r SAMPLE SM MIX PU; do
  [[ $SAMPLE == "SampleID" ]] && continue  # skip header

  READ1="$RNA_DATA_DIR/${SAMPLE}_ERCC-Mix${MIX}_Build37-ErccTranscripts-chr22.read1.fastq.gz"
  READ2="$RNA_DATA_DIR/${SAMPLE}_ERCC-Mix${MIX}_Build37-ErccTranscripts-chr22.read2.fastq.gz"

  hisat2 -p 4 \
    --rg-id="$SAMPLE" \
    --rg "SM:$SM" \
    --rg "LB:${SAMPLE}_ERCC-Mix${MIX}" \
    --rg PL:ILLUMINA \
    --rg "PU:$PU" \
    -x "$RNA_REF_INDEX" \
    --dta \
    --rna-strandness RF \
    -1 "$READ1" \
    -2 "$READ2" \
    -S "./${SAMPLE}.sam"
done < samples.tsv

227392 reads; of these:
  227392 (100.00%) were paired; of these:
    1155 (0.51%) aligned concordantly 0 times
    222491 (97.84%) aligned concordantly exactly 1 time
    3746 (1.65%) aligned concordantly >1 times
    ----
    1155 pairs aligned concordantly 0 times; of these:
      526 (45.54%) aligned discordantly 1 time
    ----
    629 pairs aligned 0 times concordantly or discordantly; of these:
      1258 mates make up the pairs; of these:
        510 (40.54%) aligned 0 times
        646 (51.35%) aligned exactly 1 time
        102 (8.11%) aligned >1 times
99.89% overall alignment rate
162373 reads; of these:
  162373 (100.00%) were paired; of these:
    965 (0.59%) aligned concordantly 0 times
    158728 (97.76%) aligned concordantly exactly 1 time
    2680 (1.65%) aligned concordantly >1 times
    ----
    965 pairs aligned concordantly 0 times; of these:
      580 (60.10%) aligned discordantly 1 time
    ----
    385 pairs aligned 0 times concordantly or discordantly; of these

In [11]:
# Now we can see the output SAM files 
ls

HBR_Rep1.sam  HBR_Rep3.sam  UHR_Rep2.sam  samples.tsv
HBR_Rep2.sam  UHR_Rep1.sam  UHR_Rep3.sam


# Step 2 (SAM to BAM Conversion): Convert HISAT2 sam files to bam files and sort by aligned position

In [12]:
samtools sort -@ 4 -o UHR_Rep1.bam UHR_Rep1.sam # -@ 4 means use 4 threads
samtools sort -@ 4 -o UHR_Rep2.bam UHR_Rep2.sam
samtools sort -@ 4 -o UHR_Rep3.bam UHR_Rep3.sam
samtools sort -@ 4 -o HBR_Rep1.bam HBR_Rep1.sam
samtools sort -@ 4 -o HBR_Rep2.bam HBR_Rep2.sam
samtools sort -@ 4 -o HBR_Rep3.bam HBR_Rep3.sam

[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
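
The six sort commands differ only in the file name, so they could also be generated in a loop. This is a sketch, not the course's method: `print_sort_cmds` is a hypothetical helper that prints one `samtools sort` command per SAM file in a directory rather than running it, and it assumes file names without spaces.

```bash
# Print one `samtools sort` command per SAM file found in a directory.
# Arguments: directory (default: current), thread count (default: 4).
print_sort_cmds() {
  local dir="${1:-.}" threads="${2:-4}" sam
  for sam in "$dir"/*.sam; do
    [[ -e $sam ]] || continue   # no SAM files: the glob stayed literal
    echo "samtools sort -@ $threads -o ${sam%.sam}.bam $sam"
  done
}

# Dry run for the current directory with 4 threads.
print_sort_cmds . 4
```

Once the printed commands look right, they can be executed with `print_sort_cmds . 4 | bash`.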


In [13]:
# Now we should also see BAM files in the current directory
ls

HBR_Rep1.bam  HBR_Rep2.sam  UHR_Rep1.bam  UHR_Rep2.sam  samples.tsv
HBR_Rep1.sam  HBR_Rep3.bam  UHR_Rep1.sam  UHR_Rep3.bam
HBR_Rep2.bam  HBR_Rep3.sam  UHR_Rep2.bam  UHR_Rep3.sam


# Step 3: Merge HISAT2 BAM files
Make a single BAM file combining all UHR data and another for all HBR data. Note: This could be done in several ways, such as `samtools merge`, `bamtools merge`, or **picard-tools** (see below). We chose the third method because it **did the best job of merging the BAM header information**.

In [14]:
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT UHR.bam -INPUT UHR_Rep1.bam -INPUT UHR_Rep2.bam -INPUT UHR_Rep3.bam
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT HBR.bam -INPUT HBR_Rep1.bam -INPUT HBR_Rep2.bam -INPUT HBR_Rep3.bam

23:40:48.037 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ubuntu/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri May 02 23:40:48 EDT 2025] MergeSamFiles --INPUT UHR_Rep1.bam --INPUT UHR_Rep2.bam --INPUT UHR_Rep3.bam --OUTPUT UHR.bam --SORT_ORDER coordinate --ASSUME_SORTED false --MERGE_SEQUENCE_DICTIONARIES false --USE_THREADING false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri May 02 23:40:48 EDT 2025] Executing as ubuntu@ip-172-31-4-43 on Linux 6.8.0-1017-aws amd64; OpenJDK 64-Bit Server VM 11.0.26+4-post-Ubuntu-1ubuntu122.04; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.26.4
INFO	2025-05-02 23:40:48	MergeSamFiles	Input files are in sam

In [15]:
# Now we have 2 additional BAM files, so 8 BAM files in total:
ls -l *.bam | wc -l
ls -l *.bam

8
-rw-rw-r-- 1 ubuntu ubuntu 55533414 May  2 23:41 HBR.bam
-rw-rw-r-- 1 ubuntu ubuntu 17080246 May  2 23:37 HBR_Rep1.bam
-rw-rw-r-- 1 ubuntu ubuntu 20690572 May  2 23:37 HBR_Rep2.bam
-rw-rw-r-- 1 ubuntu ubuntu 18624961 May  2 23:38 HBR_Rep3.bam
-rw-rw-r-- 1 ubuntu ubuntu 88676399 May  2 23:40 UHR.bam
-rw-rw-r-- 1 ubuntu ubuntu 34947098 May  2 23:37 UHR_Rep1.bam
-rw-rw-r-- 1 ubuntu ubuntu 26010264 May  2 23:37 UHR_Rep2.bam
-rw-rw-r-- 1 ubuntu ubuntu 28637844 May  2 23:37 UHR_Rep3.bam
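
Since the read group tags are what tie each read back to its replicate after merging, it is worth confirming that they survived in the merged header. The sketch below defines a hypothetical helper, `list_read_groups`, that reads SAM header text on stdin and prints the ID and SM of every `@RG` line. The header lines fed to it here are hand-written for illustration and are not real output.

```bash
# Read SAM header text on stdin and print "ID<TAB>SM" for every @RG line.
list_read_groups() {
  awk -F'\t' '$1 == "@RG" {
    id = ""; sm = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^ID:/) id = substr($i, 4)
      if ($i ~ /^SM:/) sm = substr($i, 4)
    }
    print id "\t" sm
  }'
}

# Illustrative header lines (hand-written, not real output).
printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:UHR_Rep1\tSM:UHR\tLB:UHR_Rep1_ERCC-Mix1\n@RG\tID:UHR_Rep2\tSM:UHR\tLB:UHR_Rep2_ERCC-Mix1\n' \
  | list_read_groups
```

On the merged file, the call would be `samtools view -H UHR.bam | list_read_groups`.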


# PRACTICAL EXERCISE 6
- Align the reads with HISAT2 aligner. Also practice converting SAM to BAM files, and merging BAM files.
- Do this analysis on the ‘practice’ data

1. If you sorted the resulting BAM file as we did above, is the result sorted by read name or by position? -> Ans: By position.
2. Which columns of the BAM file can be viewed to determine the style of sorting? -> Ans: The first, third, and fourth columns contain the read name, chromosome, and position. Try `samtools view HCC1395_normal.bam | head | cut -f 1,3,4` to confirm the sorting style.
3. What command can you use to view only the BAM header? -> Ans: `samtools view -H HCC1395_normal.bam`
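
The header also records the sort order directly, in the `SO:` tag of the `@HD` line. As a small sketch, the hypothetical helper `header_sort_order` below pulls that value out of header text read from stdin. The example input reuses the `@HD` and first `@SQ` lines that appear in the header output later in this notebook.

```bash
# Read SAM header text on stdin and print the value of the SO: tag
# from the @HD line (for example "coordinate" or "queryname").
header_sort_order() {
  awk -F'\t' '$1 == "@HD" {
    for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) print substr($i, 4)
  }'
}

printf '@HD\tVN:1.6\tGO:none\tSO:coordinate\n@SQ\tSN:22\tLN:50818468\n' \
  | header_sort_order
# prints: coordinate
```

With samtools available, the same check would be `samtools view -H HCC1395_normal.bam | header_sort_order`.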

In [17]:
export RNA_PRACTICE_DATA_DIR=$RNA_HOME/practice/data
# RNA_HOME=/home/ubuntu/workspace/rnaseq
cd $RNA_HOME/practice/

mkdir -p alignments/hisat2
cd alignments/hisat2

In [18]:
hisat2 -p 8 --rg-id=HCC1395_normal_rep1 --rg SM:HCC1395_normal_rep1 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep1_r2.fastq.gz -S ./HCC1395_normal_rep1.sam
hisat2 -p 8 --rg-id=HCC1395_normal_rep2 --rg SM:HCC1395_normal_rep2 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep2_r2.fastq.gz -S ./HCC1395_normal_rep2.sam
hisat2 -p 8 --rg-id=HCC1395_normal_rep3 --rg SM:HCC1395_normal_rep3 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_normal_rep3_r2.fastq.gz -S ./HCC1395_normal_rep3.sam

hisat2 -p 8 --rg-id=HCC1395_tumor_rep1 --rg SM:HCC1395_tumor_rep1 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep1_r2.fastq.gz -S ./HCC1395_tumor_rep1.sam
hisat2 -p 8 --rg-id=HCC1395_tumor_rep2 --rg SM:HCC1395_tumor_rep2 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep2_r2.fastq.gz -S ./HCC1395_tumor_rep2.sam
hisat2 -p 8 --rg-id=HCC1395_tumor_rep3 --rg SM:HCC1395_tumor_rep3 --rg PL:ILLUMINA -x $RNA_REF_INDEX --dta --rna-strandness RF -1 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r1.fastq.gz -2 $RNA_PRACTICE_DATA_DIR/hcc1395_tumor_rep3_r2.fastq.gz -S ./HCC1395_tumor_rep3.sam

331958 reads; of these:
  331958 (100.00%) were paired; of these:
    81374 (24.51%) aligned concordantly 0 times
    245861 (74.06%) aligned concordantly exactly 1 time
    4723 (1.42%) aligned concordantly >1 times
    ----
    81374 pairs aligned concordantly 0 times; of these:
      20118 (24.72%) aligned discordantly 1 time
    ----
    61256 pairs aligned 0 times concordantly or discordantly; of these:
      122512 mates make up the pairs; of these:
        79756 (65.10%) aligned 0 times
        42033 (34.31%) aligned exactly 1 time
        723 (0.59%) aligned >1 times
87.99% overall alignment rate
331958 reads; of these:
  331958 (100.00%) were paired; of these:
    79808 (24.04%) aligned concordantly 0 times
    247478 (74.55%) aligned concordantly exactly 1 time
    4672 (1.41%) aligned concordantly >1 times
    ----
    79808 pairs aligned concordantly 0 times; of these:
      20751 (26.00%) aligned discordantly 1 time
    ----
    59057 pairs aligned 0 times concordantly or 

In [19]:
samtools sort -@ 8 -o HCC1395_normal_rep1.bam HCC1395_normal_rep1.sam
samtools sort -@ 8 -o HCC1395_normal_rep2.bam HCC1395_normal_rep2.sam
samtools sort -@ 8 -o HCC1395_normal_rep3.bam HCC1395_normal_rep3.sam
samtools sort -@ 8 -o HCC1395_tumor_rep1.bam HCC1395_tumor_rep1.sam
samtools sort -@ 8 -o HCC1395_tumor_rep2.bam HCC1395_tumor_rep2.sam
samtools sort -@ 8 -o HCC1395_tumor_rep3.bam HCC1395_tumor_rep3.sam

[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...


In [20]:
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT HCC1395_normal.bam -INPUT HCC1395_normal_rep1.bam -INPUT HCC1395_normal_rep2.bam -INPUT HCC1395_normal_rep3.bam
java -Xmx2g -jar $PICARD MergeSamFiles -OUTPUT HCC1395_tumor.bam -INPUT HCC1395_tumor_rep1.bam -INPUT HCC1395_tumor_rep2.bam -INPUT HCC1395_tumor_rep3.bam

00:13:28.663 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ubuntu/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat May 03 00:13:28 EDT 2025] MergeSamFiles --INPUT HCC1395_normal_rep1.bam --INPUT HCC1395_normal_rep2.bam --INPUT HCC1395_normal_rep3.bam --OUTPUT HCC1395_normal.bam --SORT_ORDER coordinate --ASSUME_SORTED false --MERGE_SEQUENCE_DICTIONARIES false --USE_THREADING false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sat May 03 00:13:28 EDT 2025] Executing as ubuntu@ip-172-31-4-43 on Linux 6.8.0-1017-aws amd64; OpenJDK 64-Bit Server VM 11.0.26+4-post-Ubuntu-1ubuntu122.04; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.26.4
INFO	2025-05-03 0

In [21]:
ls

HCC1395_normal.bam       HCC1395_normal_rep3.bam  HCC1395_tumor_rep2.bam
HCC1395_normal_rep1.bam  HCC1395_normal_rep3.sam  HCC1395_tumor_rep2.sam
HCC1395_normal_rep1.sam  HCC1395_tumor.bam        HCC1395_tumor_rep3.bam
HCC1395_normal_rep2.bam  HCC1395_tumor_rep1.bam   HCC1395_tumor_rep3.sam
HCC1395_normal_rep2.sam  HCC1395_tumor_rep1.sam


In [22]:
samtools view -H HCC1395_normal.bam

@HD	VN:1.6	GO:none	SO:coordinate
@SQ	SN:22	LN:50818468
@SQ	SN:ERCC-00002	LN:1061
@SQ	SN:ERCC-00003	LN:1023
@SQ	SN:ERCC-00004	LN:523
@SQ	SN:ERCC-00009	LN:984
@SQ	SN:ERCC-00012	LN:994
@SQ	SN:ERCC-00013	LN:808
@SQ	SN:ERCC-00014	LN:1957
@SQ	SN:ERCC-00016	LN:844
@SQ	SN:ERCC-00017	LN:1136
@SQ	SN:ERCC-00019	LN:644
@SQ	SN:ERCC-00022	LN:751
@SQ	SN:ERCC-00024	LN:536
@SQ	SN:ERCC-00025	LN:1994
@SQ	SN:ERCC-00028	LN:1130
@SQ	SN:ERCC-00031	LN:1138
@SQ	SN:ERCC-00033	LN:2022
@SQ	SN:ERCC-00034	LN:1019
@SQ	SN:ERCC-00035	LN:1130
@SQ	SN:ERCC-00039	LN:740
@SQ	SN:ERCC-00040	LN:744
@SQ	SN:ERCC-00041	LN:1122
@SQ	SN:ERCC-00042	LN:1023
@SQ	SN:ERCC-00043	LN:1023
@SQ	SN:ERCC-00044	LN:1156
@SQ	SN:ERCC-00046	LN:522
@SQ	SN:ERCC-00048	LN:992
@SQ	SN:ERCC-00051	LN:274
@SQ	SN:ERCC-00053	LN:1023
@SQ	SN:ERCC-00054	LN:274
@SQ	SN:ERCC-00057	LN:1021
@SQ	SN:ERCC-00058	LN:1136
@SQ	SN:ERCC-00059	LN:525
@SQ	SN:ERCC-00060	LN:523
@SQ	SN:ERCC-00061	LN:1136
@SQ	SN:ERCC-00062	LN:1023
@SQ	SN:ERCC-00067	LN:644
@SQ	SN:ERCC-00069	LN:1137


In [23]:
samtools view HCC1395_normal.bam | head | cut -f 1,3,4

K00193:38:H3MYFBBXX:4:1218:30962:22325	22	10564457
K00193:38:H3MYFBBXX:4:1218:30962:22325	22	10564457
K00193:38:H3MYFBBXX:4:1203:22579:10757	22	10617977
K00193:38:H3MYFBBXX:5:1214:3568:13939	22	10617977
K00193:38:H3MYFBBXX:4:2205:11383:15873	22	10643913
K00193:38:H3MYFBBXX:4:2215:4400:39009	22	10643915
K00193:38:H3MYFBBXX:4:2215:4400:39009	22	10643927
K00193:38:H3MYFBBXX:4:2211:19067:14027	22	10673954
K00193:38:H3MYFBBXX:4:2211:19067:14027	22	10673954
K00193:38:H3MYFBBXX:4:1227:20061:25384	22	10684438
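
Rather than checking the positions by eye, the three-column output can be scanned automatically. This is a simplified sketch: the hypothetical helper `check_coordinate_order` only verifies that positions never decrease between consecutive lines on the same chromosome, and it does not check that chromosomes follow the header order.

```bash
# Read "name<TAB>chrom<TAB>pos" lines on stdin and report whether positions
# never decrease between consecutive lines on the same chromosome.
check_coordinate_order() {
  awk -F'\t' '
    $2 == chrom && $3 + 0 < prev { bad = 1 }
    { chrom = $2; prev = $3 + 0 }
    END { print (bad ? "not coordinate-sorted" : "consistent with coordinate sorting") }'
}

# Three of the lines shown above, typed inline.
printf 'K00193:38:H3MYFBBXX:4:1218:30962:22325\t22\t10564457\nK00193:38:H3MYFBBXX:4:1203:22579:10757\t22\t10617977\nK00193:38:H3MYFBBXX:4:2205:11383:15873\t22\t10643913\n' \
  | check_coordinate_order
# prints: consistent with coordinate sorting
```

For a whole file, the input could come from `samtools view HCC1395_normal.bam | cut -f 1,3,4`.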


In [26]:
# To print Field Descriptions and view the first alignment in BAM file
echo -e "QueryNAME\tFLAG\tRefNAME\tPOS\tMAPQ\tCIGAR\tRNEXT\tPNEXT\tTLEN\tSEQ\tQUAL"
samtools view HCC1395_normal.bam | head -n 1

QueryNAME	FLAG	RefNAME	POS	MAPQ	CIGAR	RNEXT	PNEXT	TLEN	SEQ	QUAL
K00193:38:H3MYFBBXX:4:1218:30962:22325	133	22	10564457	0	*	=	10564457	0	CATGCACATGTTTATTTTTTGAGCACCTATGTTTTGTAAGATTAACAGCTGACTTAAGAGAAAAAATGGAAGGAAAGAGGCAGTAGAATAATATATTCAAAAGATGCAAAGGAAAAAAAAACTTTAGGCCACAAATTACTTATTCAGTAAT	<AFFFKKKKKKKKKKKKKKKAKAFKKKKKK,FKKKAAFK,FKKKKFKAFKFF,FFKKAKFFKK,FFA7FFKA,,7F,,<,,A,<FFFFKKKKFFFKKAFAAAA7A<,<,A,,<FKF7AFAF,<,,,,,,,7,,<AFF,,,7AA,,7,,<A,	YT:Z:UP	RG:Z:HCC1395_normal_rep1


| Field Name   | Value in the first alignment             | Description                                                          |
| ------------ | ---------------------------------------- | -------------------------------------------------------------------- |
| **QNAME**    | `K00193:38:H3MYFBBXX:4:1218:30962:22325` | Query template/pair name (read ID)                                   |
| **FLAG**     | `133`                                    | Bitwise flag: 1 (paired) + 4 (read unmapped) + 128 (second in pair)  |
| **RNAME**    | `22`                                     | Reference sequence name (chromosome 22)                              |
| **POS**      | `10564457`                               | 1-based leftmost mapping position                                    |
| **MAPQ**     | `0`                                      | Mapping quality (0 means unmapped or low confidence)                 |
| **CIGAR**    | `*`                                      | No alignment (i.e. unmapped or skipped)                              |
| **RNEXT**    | `=`                                      | Mate reference name ("=" means same as RNAME)                        |
| **PNEXT**    | `10564457`                               | Position of the mate read                                            |
| **TLEN**     | `0`                                      | Template length                                                      |
| **SEQ**      | `CATGCACATGTT...`                        | The read sequence                                                    |
| **QUAL**     | `<AFFFKKKK...`                           | ASCII-encoded base quality values                                    |
| **Optional** | `YT:Z:UP` and `RG:Z:HCC1395_normal_rep1` | Optional fields (e.g. `Read Group`)                                  |
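
The FLAG value is a sum of bit values defined in the SAM specification, so it can be decoded with Bash arithmetic. The sketch below uses a hypothetical helper, `decode_flag`, whose labels are short paraphrases of the specification's bit meanings rather than official names.

```bash
# Print one line for every bit that is set in a SAM FLAG value.
decode_flag() {
  local flag=$1 i
  # Index i in this array corresponds to bit value 2^i (1, 2, 4, ..., 2048).
  local names=(
    "paired" "proper pair" "read unmapped" "mate unmapped"
    "read reverse strand" "mate reverse strand"
    "first in pair" "second in pair"
    "secondary alignment" "fails quality checks"
    "duplicate" "supplementary alignment"
  )
  for i in "${!names[@]}"; do
    if (( flag & (1 << i) )); then
      echo "${names[i]}"
    fi
  done
}

decode_flag 133
# prints:
# paired
# read unmapped
# second in pair
```

This agrees with the `*` CIGAR in the table: the read is the unmapped second mate of a pair. samtools also ships a `flags` subcommand (`samtools flags 133`) that performs a similar decoding.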