# Drilling into alternative splicing in MD

To see whether our data supports previous evidence of alternative splicing in MD, I decided on the following plan:

1. Check if my CEL file parser gives probe-level results comparable to those from oligo.
2. Take the list of genes from Nakamori13 ("Splicing biomarkers of disease severity in myotonic dystrophy."), the paper Darren passed on to me.
3. Check if the genes identified by Nakamori13 come up as significant in our analysis as well.
4. Build a hierarchical model to check how significant they are.

Unfortunately, I got pretty bogged down on the first step. After a few days of debugging, it turned out that there was a bug in the way I was translating coordinates from the annotation file to coordinates on the chip.

Luckily, this doesn't affect the previous work on ISEs (we didn't use annotations there). It does invalidate the preliminary result I showed you about alternative splicing of the fifth exon in TNNT2 (I would have been reading the wrong coordinates).

This is now fixed, and I will carry on with the rest of the plan this week.
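As an illustration of how this kind of coordinate mix-up can arise, here is a minimal sketch. It assumes the annotation file gives each probe as an (x, y) pair and that the intensity matrix is indexed as [row][column], i.e. [y][x]; that assumption is in line with the `extractIntensity` helper later in this notebook, which reverses the pair before indexing. The toy `grid` and the `intensity_at` helper are hypothetical and only stand in for the real record.

```python
# Toy 3x3 "chip": the outer list is rows (y), the inner lists are columns (x).
# The values are made up purely for illustration.
grid = [
    [10.0, 11.0, 12.0],
    [20.0, 21.0, 22.0],
    [30.0, 31.0, 32.0],
]


def intensity_at(grid, x, y):
    """Look up the cell at column x, row y (assumes [row][column] storage)."""
    return grid[y][x]


x, y = 2, 1
print(intensity_at(grid, x, y))  # correct lookup: row 1, column 2
print(grid[x][y])                # swapped lookup: silently reads another cell
```

On a square chip the swapped lookup raises no error, which may be why this sort of bug is slow to find.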


# New datasets for working with ISEs

John and I had a look at the doping dataset. While it is interesting, we concluded that it may not be the best dataset to use: the microarrays used for that research were of a completely different type, and supporting them might require quite a few changes to the analysis pipeline.

However, a large number of microarray datasets appear to be available online at GEO and ArrayExpress, and we expect some of them to both:

- have many changing genes
- use HuEx ST1, which we can support out of the box.

The plan is to take as many of these datasets as feasible and run the pipeline on them.
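A first step for running the pipeline over many downloaded datasets could be collecting the CEL files. The sketch below assumes each dataset is unpacked somewhere under a common root directory; the `find_cel_files` helper is hypothetical and not part of the existing pipeline.

```python
from pathlib import Path


def find_cel_files(root):
    """Return every file under `root` with a .CEL extension (any case), sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".cel"
    )


# Example usage (the directory name is an assumption):
# for cel_path in find_cel_files("datasets"):
#     print(cel_path)
```

Matching the extension case-insensitively is a deliberate choice, since archives from different submitters may not agree on `.CEL` versus `.cel`.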

In [26]:
import Bio.Affy.CelFile as CELFile
import os.path
import sys, re

In [2]:
with open(os.path.join("CEL_files", "111747589_B.CEL"), "rb") as f:
    array = CELFile.read_v4(f)

In [3]:
contextDict = {}
CEL_PATH = "CEL_files"
with open(os.path.join(CEL_PATH, "HuEx-1_0-st-v2.text.cdf")) as f:
    lines = f.readlines()
    context = None
    for line in lines:
        line = line.rstrip()
        if line:
            c1 = line[0] == "["
            c2 = "Unit" in line
            c3 = "_Block1]" in line
            checkConds = [c1, c2, c3]
            if all(checkConds):
                context = line[len("Unit") + 1: -1 * len("_Block1") - 1]
                if context in contextDict:
                    print(line)
                    raise ValueError
                else:
                    contextDict[context] = {}
            elif context is not None:
                eqPos = line.find("=")
                key = line[:eqPos]
                # Keep only probe entries such as "Cell1", "Cell10", ...
                matchedList = re.findall(r"^Cell[0-9]+$", key)
                if matchedList:
                    value = line[eqPos + 1:].split("\t")
                    contextDict[context][key] = value[:3]
        else:
            context = None

In [None]:
intensityBySeqDict = {}

In [20]:
def intensityBySeq(record, contextDict, seq):
    if not intensityBySeqDict:
        for _, innerDict in contextDict.items():
            for _, triple in innerDict.items():
                try:
                    intensityBySeqDict[triple[2]] = (int(triple[0]), int(triple[1]))
                except (ValueError, IndexError):
                    # Skip entries with non-numeric coordinates or missing fields.
                    pass
    return extractIntensity(record, intensityBySeqDict[seq])

In [21]:
def extractIntensity(record, coord):
    # Reverse the pair stored from the annotation before indexing the intensity matrix.
    coord = coord[::-1]
    return record.intensities[coord]

## These values should match those in "rawProbesR"

In [27]:
import os
sequences = [
    "ACCTTATACCAGTAGCAGTCGTACC",
    "CGGCCTACGACATAGGTCCGAGACA",
    "AGAAACTAATAATACACCTGGTGTT",
    "CAGTACGGGCAGCTACAAAACCCAT",
    "AGTCCCACGTGTCGGCGTTGCCGTT",
    "TGCACAGCCTACTGCCACTCGAGTT",
]
for seq in sequences:
    print(intensityBySeq(array, contextDict, seq))

103.0
188.0
42.0
189.0
1072.0
886.0
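Comparing these numbers with the R output by eye does not scale beyond a handful of probes. Below is a minimal sketch of an automated check, assuming both sets of intensities have been loaded into dicts keyed by probe sequence. The `find_mismatches` helper, the probe names and all the values are hypothetical placeholders, not real measurements.

```python
import math


def find_mismatches(parsed, reference, rel_tol=1e-6):
    """Return keys whose values differ or that are missing on one side."""
    bad = []
    for key in sorted(set(parsed) | set(reference)):
        if key not in parsed or key not in reference:
            bad.append(key)
        elif not math.isclose(parsed[key], reference[key], rel_tol=rel_tol):
            bad.append(key)
    return bad


# Made-up placeholder values for illustration only.
parsed = {"PROBE_A": 100.0, "PROBE_B": 250.0, "PROBE_C": 40.0}
reference = {"PROBE_A": 100.0, "PROBE_B": 251.0, "PROBE_C": 40.0}
print(find_mismatches(parsed, reference))
```

An empty list would mean the two parsers agree on every probe within the chosen tolerance.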
