
# Midterm 2018

## Before you start
**Please write one single python script.**

**IMPORTANT: Add your name and ID (matricola) on top of the .py file!**

## The Problem

The dataset we are going to use is a tab-separated version of the Sequence Alignment/Map (SAM) format, which contains, among other things (that have been removed), the following fields:

```
readID	Chr	Pos	Sequence	MismatchingPositions
DJB77P1:536:H880AADXX:1:1116:16537:22664	Chr02	41	GGAACTCGAACA...TAATCAGGAACT	26A124
DJB77P1:536:H880AADXX:1:1111:5187:26828	Chr02	61	TCACCCAAACTCAA...ACATGATGC	24G3T40G81
DJB77P1:536:H880AADXX:1:1111:20900:20916	Chr02	9	TGAACTCGAG...AACTTCAA	151
```

As the header (first line) describes, the first column is the identifier of the read, the second and third are the chromosome and the starting position to which the read is aligned, the fourth is the sequence, and the fifth is a string that reports any mismatches (the real format is more complex, but here we can ignore insertions and deletions).
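As an illustration of the column layout, the following is a minimal sketch that splits one data line into typed fields. The helper name `parse_line` and the dictionary layout are assumptions made for this example only; they are not part of the exercise.

```python
# Hypothetical helper (not required by the exercise): split one
# tab-separated data line into its five fields.
def parse_line(line):
    read_id, chrom, pos, seq, mism = line.rstrip("\n").split("\t")
    return {
        "readID": read_id,
        "Chr": chrom,
        "Pos": int(pos),  # the position is stored as an integer
        "Sequence": seq,
        "MismatchingPositions": mism,
    }

# Row copied from the excerpt above (its sequence is abbreviated with "...").
row = ("DJB77P1:536:H880AADXX:1:1116:16537:22664\tChr02\t41\t"
       "GGAACTCGAACA...TAATCAGGAACT\t26A124")
record = parse_line(row)
print(record["Chr"], record["Pos"], record["MismatchingPositions"])
```

Converting `Pos` to `int` at parsing time avoids repeated conversions later on, when positions are compared with the interval bounds.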

For example, the MismatchingPositions string:

- ```26A124``` means that the read aligned with 26 matches, one mismatch (A in the reference) and 124 more matches (for a total of 151 bases, which is the length of the read);
- ```24G3T40G81``` means 24 matches, one G (in the reference), 3 matches, T, 40 matches, G, 81 matches (total 151 bases);
- ```151``` means 151 matches and no mismatch.

The alignment:
```
reference: ATACACG
read:      ATCCACG
```
would be represented as ```2A4```
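
The encoding described above can be decoded with a short sketch like the one below. The function names `tokenize_mismatches` and `aligned_length` are assumptions introduced for illustration, and the sketch ignores insertions and deletions, as the text does.

```python
import re

def tokenize_mismatches(mis_str):
    """Split a string such as '24G3T40G81' into integers (runs of matches)
    and single letters (reference bases at mismatching positions)."""
    tokens = re.findall(r"\d+|[A-Za-z]", mis_str)
    return [int(t) if t.isdigit() else t.upper() for t in tokens]

def aligned_length(mis_str):
    # Each number contributes that many bases, each letter exactly one.
    return sum(t if isinstance(t, int) else 1
               for t in tokenize_mismatches(mis_str))

print(tokenize_mismatches("2A4"))    # [2, 'A', 4]
print(aligned_length("26A124"))      # 151
print(aligned_length("24G3T40G81"))  # 151
```

Checking that the decoded length equals the length of the sequence is a cheap sanity check on each entry.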




Implement the following Python functions:

1. ```countMismatches(mis_str)```: gets a mismatch string and returns the number of mismatches. Please note that reads might contain "N" as bases. Hint: count characters that represent nucleotides.

Calling:
```
mismatches = ["9C15T125", 
              "9G4C136", 
              "151", 
              "9G7T2C24A24C11A0G5G6G1A0G51", 
              "9T13A16T1G5A6A6A6A81"]

for m in mismatches:
    print("{} has {} mismatches".format(m, countMismatches(m)))
```
Should give:
```
9C15T125 has 2 mismatches
9G4C136 has 2 mismatches
151 has 0 mismatches
9G7T2C24A24C11A0G5G6G1A0G51 has 11 mismatches
9T13A16T1G5A6A6A6A81 has 8 mismatches
```


2. ```getReadsInInterval(filename, chr, startPos, endPos, max_diff)```: reads all the entries in the file ```filename``` and returns the reads that are aligned to chromosome ```chr```, have at least one base in the closed interval ```[startPos,endPos]``` and contain no more than ```max_diff``` mismatches. The function should return a dictionary with readIDs as keys and all the other elements (in a list) as values. The function should also print some information on the selected reads: their number and, for each read, its readID and starting position (see below).

Given the entry:
```
DJB77P1:536:H880AADXX:1:1116:16537:22664	Chr02	41	GGAACTCGAACA...TAATCAGGAACT	26A124
```
the returned dictionary should be:
```
{"DJB77P1:536:H880AADXX:1:1116:16537:22664" : ['Chr02',41,'GGAACTCGAACA...TAATCAGGAACT', '26A124']}
```

Calling:
```
fn = "./test_data.tsv"
chromosome = "Chr01"
reads = getReadsInInterval(fn, chromosome, 100,600, 2)
```
should give:
```
4 reads cover the interval Chr01: 100-600
They are:
	DJB77P1:536:H880AADXX:1:1204:4499:50848 in Chr01 starts at 174
	DJB77P1:536:H880AADXX:1:1111:13063:11094 in Chr01 starts at 270
	DJB77P1:536:H880AADXX:1:1101:3693:67653 in Chr01 starts at 183
	DJB77P1:536:H880AADXX:1:1106:18413:36334 in Chr01 starts at 250
```

3.  ```getCoverage(reads, chr, startP, endP)```: based on the ```reads``` selected with ```getReadsInInterval```, this function computes the coverage (i.e. the number of reads covering the position) of each position of chromosome ```chr``` from ```startP``` to ```endP```, both included. For each position, the function should print the chromosome, the position and its coverage (see below).

Calling:
```
getCoverage(reads, chromosome, 172,400)
```
should print:
```
Chr01 172 has coverage 0
Chr01 173 has coverage 0
Chr01 174 has coverage 1
Chr01 175 has coverage 1
Chr01 176 has coverage 1
Chr01 177 has coverage 1
...
Chr01 382 has coverage 2
Chr01 383 has coverage 2
Chr01 384 has coverage 2
...
Chr01 396 has coverage 2
Chr01 397 has coverage 2
Chr01 398 has coverage 2
Chr01 399 has coverage 2
Chr01 400 has coverage 2
```

**Note: "..." are just missing lines, but the output should go from startP (172) to endP (400) included.**
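
Coverage can be computed by checking every read at every position, which is enough for this exercise. As an alternative, the sketch below uses a difference array: each read adds +1 where it enters the window and -1 just after it leaves, and a running sum gives the coverage. The function name `coverage_by_position` and the toy intervals are assumptions made for illustration, not data from the exam file.

```python
def coverage_by_position(intervals, start_p, end_p):
    """intervals: list of (first, last) positions covered by each read,
    both included. Returns the coverage of start_p..end_p as a list."""
    n = end_p - start_p + 1
    diff = [0] * (n + 1)
    for first, last in intervals:
        lo = max(first, start_p)  # clip the read to the window
        hi = min(last, end_p)
        if lo > hi:
            continue  # the read lies entirely outside the window
        diff[lo - start_p] += 1
        diff[hi - start_p + 1] -= 1
    coverage = []
    running = 0
    for i in range(n):
        running += diff[i]
        coverage.append(running)
    return coverage

# Two hypothetical reads covering positions 3..6 and 5..9.
toy_intervals = [(3, 6), (5, 9)]
print(coverage_by_position(toy_intervals, 2, 8))  # [0, 1, 1, 2, 2, 1, 1]
```

The cost grows with the number of reads plus the number of positions, rather than with their product.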

### Download the data

Create a "qcbsciprolab-2018-11-13-NAME-SURNAME-ID" folder on the desktop. Download the following data to the folder that you just created.

The .tsv alignment file is: [test_data_unique10k.tsv](2018mt/test_data_unique10k.tsv).

### A possible solution: 



#gets the MismatchingPositions string and counts how many mismatches are there.
def countMismatches(mis_str):
    t_str = mis_str.upper()
    res = [t_str.count(x) for x in "ATCGN"]
    return sum(res)

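def getReadsInInterval(filename, chr, startPos, endPos, max_diff = -1):
    reads = {}
    # the with statement closes the file even if an error occurs
    with open(filename, "r") as inF:
        for line in inF:
            line = line.strip()
            # skip the header and any empty line
            if(line == "" or line.startswith("readID")):
                continue
            els = line.split('\t')
            L = len(els[3])
            readStart = int(els[2])
            # the read covers [readStart, readStart + L - 1]: it overlaps
            # [startPos, endPos] if it starts no later than endPos and
            # ends no earlier than startPos
            if(els[1] == chr and readStart <= endPos and readStart + L > startPos):
                # max_diff == -1 means no filter on the number of mismatches
                if(max_diff == -1 or max_diff >= countMismatches(els[4])):
                    if(els[0] not in reads):
                        # store the position as an integer, as in the example
                        reads[els[0]] = [els[1], readStart, els[3], els[4]]
    # use the parameters here, not global variables or fixed values
    print("{} reads cover the interval {}: {}-{}\nThey are:".format(len(reads), chr, startPos, endPos))
    for r in reads:
        print("\t{} in {} starts at {}".format(r,reads[r][0],reads[r][1]))

    return reads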



            
def getCoverage(reads, chr,startP, endP):
    # one counter for each position in [startP, endP]
    coverage = [0 for x in range(startP, endP+1)]
    # half-open intervals [start, end) covered by the reads aligned to chr
    intervals = []
    for r in reads:
        read = reads[r]
        if(read[0] == chr):
            pos = int(read[1])
            L = len(read[2])
            intervals.append([pos, pos+L])
    # count, for each position, the reads whose interval contains it
    for i in range(startP, endP + 1):
        for s,e in intervals:
            if(i >= s and i < e):
                coverage[i-startP] += 1
    for ind in range(len(coverage)):
        print("{} {} has coverage {}".format(chr, ind + startP, coverage[ind]))
        
#    return [ (i + startP, coverage[i]) for i in range(len(coverage))]        
    
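import numpy as np

# returns the average coverage and the positions whose coverage is at least
# twice the average; cov is a list of (position, coverage) tuples
def getRepeats(cov):
    values = [x[1] for x in cov]
    A = np.mean(values)
    repeats = [x for x in cov if x[1] >= 2*A]
    return [A, repeats]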
    

mismatches = ["9C15T125", 
              "9G4C136", 
              "151", 
              "9G7T2C24A24C11A0G5G6G1A0G51", 
              "9T13A16T1G5A6A6A6A81"]

for m in mismatches:
    print("{} has {} mismatches".format(m, countMismatches(m)))
print("\n")
fn = "./2018mt/test_data_unique10k.tsv"
chromosome = "Chr01"
reads = getReadsInInterval(fn, chromosome, 100,600, 2)
#print(reads)
print("\n")
getCoverage(reads, chromosome, 172,400)




9C15T125 has 2 mismatches
9G4C136 has 2 mismatches
151 has 0 mismatches
9G7T2C24A24C11A0G5G6G1A0G51 has 11 mismatches
9T13A16T1G5A6A6A6A81 has 8 mismatches


4 reads cover the interval Chr01: 100-600
They are:
	DJB77P1:536:H880AADXX:1:1111:13063:11094 in Chr01 starts at 270
	DJB77P1:536:H880AADXX:1:1204:4499:50848 in Chr01 starts at 174
	DJB77P1:536:H880AADXX:1:1101:3693:67653 in Chr01 starts at 183
	DJB77P1:536:H880AADXX:1:1106:18413:36334 in Chr01 starts at 250


Chr01 172 has coverage 0
Chr01 173 has coverage 0
Chr01 174 has coverage 1
Chr01 175 has coverage 1
Chr01 176 has coverage 1
Chr01 177 has coverage 1
Chr01 178 has coverage 1
Chr01 179 has coverage 1
Chr01 180 has coverage 1
Chr01 181 has coverage 1
Chr01 182 has coverage 1
Chr01 183 has coverage 2
Chr01 184 has coverage 2
Chr01 185 has coverage 2
Chr01 186 has coverage 2
Chr01 187 has coverage 2
Chr01 188 has coverage 2
Chr01 189 has coverage 2
Chr01 190 has coverage 2
Chr01 191 has coverage 2
Chr01 192 has coverage 2
Chr01