# Transect data processing

```
# file path to the template script
source_path: "/proj/b2010008/nobackup/projects/crispr/sergiu/src/"
sample_names: ['P1994_116', 'P1994_111', 'P1994_117', 'P1994_128', 'P1994_120', 'P1994_113', 'P1994_108', 'P1994_130', 'P1994_101', 'P1994_102', 'P1994_106', 'P1994_115', 'P1994_121', 'P1994_125', 'P1994_105', 'P1994_110', 'P1994_118', 'P1994_122', 'P1994_112', 'P1994_104', 'P1994_103', 'P1994_126', 'P1994_123', 'P1994_129', 'P1994_107', 'P1994_119', 'P1994_109', 'P1994_124', 'P1994_127', 'P1994_114']
run_dir: "/proj/b2010008/nobackup/projects/crispr/data/transect/"
crass_location: "/proj/b2010008/nobackup/projects/crispr/sergiu/bin/bin/"
# location of the raw files
raw_files_dir: "/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/"
```


I will start by obtaining a file list:

```
$ ls /pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_*.gz > files.txt
```

These reads have the spike-ins removed. The reads containing reference sequences are at:

/pica/v9/b2014214_nobackup/BARM/only_for_mapping/with_internal_standards/dna/P1994_1*fq.gz

P1994_123 seems to have a lot of reference sequences. We decided to run Crass on that sample alone to see whether we retrieve CRISPR clusters, but we may need to run it on all samples if the result is inconclusive.



The reads without reference are here:
/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_*.gz

The reads with reference sequences included are here:
/pica/v9/b2014214_nobackup/BARM/only_for_mapping/with_internal_standards/dna/P1994_1*fq.gz


> P1994_123 seems to have a lot of reference sequences if that's what you'd like to have.


#### Metadata:

At: data/transect/transect_meta.txt

The samples with a 2 in the "Sampling_depth" column are the surface samples.

Here they are:
[P1994_101, P1994_104, P1994_107, P1994_110, P1994_113, P1994_116, P1994_119, P1994_122, P1994_125, P1994_128]
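The surface-sample selection could be reproduced from the metadata file with something like the sketch below. Only the "Sampling_depth" column name comes from the notes above; the tab delimiter, the `Sample` ID column and the toy rows are assumptions made for illustration, so adjust them to the real layout of `transect_meta.txt`.

```python
import csv
import io


def surface_samples(handle, depth_column="Sampling_depth",
                    id_column="Sample", surface_value="2"):
    """Return the IDs of all rows whose depth column equals surface_value.

    id_column and the tab delimiter are assumptions about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[id_column] for row in reader
            if row[depth_column].strip() == surface_value]


# Made-up three-row table standing in for data/transect/transect_meta.txt
toy_meta = io.StringIO(
    "Sample\tSampling_depth\n"
    "sample_a\t2\n"
    "sample_b\t20\n"
    "sample_c\t2\n"
)
print(surface_samples(toy_meta))  # ['sample_a', 'sample_c']
```

On the real file, the same function would be called with `open("data/transect/transect_meta.txt")` instead of the toy table.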


Just to double check on the transect data as well: I interpreted your lines below sent in late August to mean that reads at:

/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_*.gz

do NOT contain internal standards (and are trimmed for adapters), and at the time I did not bother to check. However, looking at the filenames now, they suggest that the spike-ins are still present. So do they need trimming or not, and if they do, what is the reference sequence? More generally, are you using "adapter sequence", "internal standards" and "reference sequences" to mean the same thing?

$ ls /pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_*.gz
/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_101_with_i_s_R1.fq.gz
/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_101_with_i_s_R2.fq.gz
/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_102_with_i_s_R1.fq.gz
/pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_102_with_i_s_R2.fq.gz
...

Grepping for the main Illumina adapter sequence in one sample gives about the same number of matching lines in both locations (7066 vs 7083), so either there was no trimming or a different sequence was trimmed.

```
$ gunzip -c /pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_101_with_i_s_R1.fq.gz | grep AGATCGGAAGA | wc -l
7066
$ gunzip -c /pica/v9/b2014214_nobackup/BARM/only_for_mapping/with_internal_standards/dna/P1994_101_with_i_s_R1.fq.gz | grep AGATCGGAAGA | wc -l
7083
```
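`grep | wc -l` counts every matching line, whichever part of the record it belongs to. A slightly stricter check, sketched below, looks only at the sequence line of each record. It assumes plain 4-line FASTQ records (no wrapped sequences); the function name and the toy reads are made up for illustration.

```python
import io

ADAPTER = "AGATCGGAAGA"  # same motif as in the grep commands above


def count_reads_with_motif(lines, motif=ADAPTER):
    """Count FASTQ records whose sequence line contains the motif.

    `lines` is any iterable of text lines, e.g. gzip.open(path, "rt").
    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    hits = 0
    for index, line in enumerate(lines):
        # The sequence is the second line (index 1) of every 4-line record.
        if index % 4 == 1 and motif in line:
            hits += 1
    return hits


# Two made-up reads; only the first one carries the motif.
toy_fastq = io.StringIO(
    "@r1\nACGTAGATCGGAAGAGC\n+\nIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
)
print(count_reads_with_motif(toy_fastq))  # 1

# On a real file:
#   import gzip
#   with gzip.open(path, "rt") as handle:
#       print(count_reads_with_motif(handle))
```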

In [9]:
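# collect the unique sample names (e.g. P1994_101) from the file list
samples = set()
with open("/home/sergiu/data/data/work/andersson/src/andersson/src/tmp/files.txt") as f:
    for line in f:
        # keep only the file name, then its first two "_"-separated fields
        filename = line.strip('\n').split('/')[-1]
        samples.add('_'.join(filename.split('_')[:2]))
print(samples)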

{'P1994_116', 'P1994_111', 'P1994_117', 'P1994_128', 'P1994_120', 'P1994_113', 'P1994_108', 'P1994_130', 'P1994_101', 'P1994_102', 'P1994_106', 'P1994_115', 'P1994_121', 'P1994_125', 'P1994_105', 'P1994_110', 'P1994_118', 'P1994_122', 'P1994_112', 'P1994_104', 'P1994_103', 'P1994_126', 'P1994_123', 'P1994_129', 'P1994_107', 'P1994_119', 'P1994_109', 'P1994_124', 'P1994_127', 'P1994_114'}


I run Snakemake inside a named GNU screen session, so the pipeline keeps running after I detach:

```
# create the screen
screen -S andersson-tsect
# then ctrl a, ctrl d
# bring the screen forward
screen -rd andersson-tsect
# see named screens
screen -ls

cd /proj/b2010008/nobackup/projects/crispr/data/transect/ && \
module use /proj/b2013006/sw/modules && \
module load miniconda3 && \
source activate andersson

# run the pipeline on the login node in detached mode
snakemake -s /proj/b2010008/nobackup/projects/crispr/sergiu/src/snakemake/Snakefile -j 99 --cluster-config /proj/b2010008/nobackup/projects/crispr/sergiu/src/snakemake/cluster.yaml --cluster "sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"

source deactivate andersson
```

```
$ ls -lah
53G 25 aug 12.55 transect_merged.fastq.gz
```


In [4]:
import gzip

#test_file = "/home/sergiu/data/local/andersson/asko/merged.fastq.gz"
test_file = "/proj/b2010008/nobackup/projects/crispr/data/transect/transect_merged.fastq.gz"

samples = ['P1994_116', 'P1994_111', 'P1994_117', 'P1994_128', 'P1994_120', 'P1994_113', 
           'P1994_108', 'P1994_130', 'P1994_101', 'P1994_102', 'P1994_106', 'P1994_115', 
           'P1994_121', 'P1994_125', 'P1994_105', 'P1994_110', 'P1994_118', 'P1994_122', 
           'P1994_112', 'P1994_104', 'P1994_103', 'P1994_126', 'P1994_123', 'P1994_129', 
           'P1994_107', 'P1994_119', 'P1994_109', 'P1994_124', 'P1994_127', 'P1994_114']

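## this failed, most likely because the file is gzip-compressed and plain open()
## does not decompress it; gzip.open(test_file, "rt") should work instead
#from Bio import SeqIO
#with gzip.open(test_file, "rt") as handle:
#    for record in SeqIO.parse(handle, "fastq"):
#        print(record.id)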


with gzip.open(test_file, 'rt') as fin:
    for line in fin:
        if line[0] == '@':
            print(line)
            break


@SRR3745603.1 1 length=150
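Checking `line[0] == '@'` is safe for the very first line, but it cannot be used to list all headers, because a quality line may also start with `@`. A minimal sketch that instead takes every fourth line is shown below; it assumes strict 4-line FASTQ records, and the function name and toy records are made up for illustration.

```python
import io
from itertools import islice


def fastq_headers(lines, limit=None):
    """Yield header lines (without the leading '@') from 4-line FASTQ text.

    `lines` is any iterable of text lines, e.g. gzip.open(path, "rt").
    `limit` caps the number of headers returned (None means all).
    """
    # Lines 0, 4, 8, ... are the headers of consecutive records.
    headers = islice(lines, 0, None, 4)
    for header in islice(headers, limit):
        yield header.rstrip("\n")[1:]


# The quality line of the first toy record starts with '@' on purpose.
toy_fastq = io.StringIO(
    "@r1 desc\nACGT\n+\n@III\n"
    "@r2\nGGCC\n+\nIIII\n"
)
print(list(fastq_headers(toy_fastq)))  # ['r1 desc', 'r2']
```

A filter on the first character would have reported three "headers" for the same input.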



In [None]:
^@(\w*).*
>>> import re
>>> s = """
... 
... /dev/sda:
... 
... ATA device, with non-removable media
...     Model Number:       ST500DM002-1BD142                       
...     Serial Number:      W2AQHKME
...     Firmware Revision:  KC45    
...     Transport:          Serial, SATA Rev 3.0"""
>>> m = re.search(r'Model Number:\s*([^\n]+)', s)
>>> m.group(1)
'ST500DM002-1BD142'
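Applied to the header printed earlier, the pattern `^@(\w*).*` noted at the top of this cell would behave as in the sketch below. `\w` does not match `.`, so the captured group stops before the read number. The constant name is made up for illustration.

```python
import re

# Pattern from the note above: capture the word characters after '@'.
HEADER_ID = re.compile(r"^@(\w*).*")

header = "@SRR3745603.1 1 length=150"
match = HEADER_ID.match(header)
read_prefix = match.group(1) if match else None
print(read_prefix)  # SRR3745603
```

If the read number is needed as well, the character class would have to include the dot, for example `^@([\w.]*)`.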

In [None]:
interactive -A b2010008 -c 1 --qos=short -t 15
gunzip -c /pica/v9/b2014214_nobackup/BARM/internal_standards/unmapped_reads/P1994_101_with_i_s_R1.fq.gz | grep AGATCGGAAGA | wc -l >> ./amount.txt
gunzip -c /pica/v9/b2014214_nobackup/BARM/only_for_mapping/with_internal_standards/dna/P1994_101_with_i_s_R1.fq.gz | grep AGATCGGAAGA | wc -l >> ./amount.txt

I moved the main chunk into "temp_transect_merged.fastq.gz" and I am continuing from sample 105_R2 onwards, but I will have to remove the 105_R2 reads from this file before merging.

```
Waiting bonus jobs:
11626949    305      core snakejob.merge_samples.2.  sergiun       b2010008 PD                 N/A  9-00:00:00    50000    2       (Priority)       (null)
```