## Regression Analysis of Sabin data 

This analysis was done to determine how the sequence context around a site affects its substitution probability.
The input data is a set of CSV files listing the different k-mers and the $\log(\Pr(A\to B \mid S))$ substitution rates for each of the passages.

It includes 4 stages:

1. Feature extraction: creating dummy variables for the categorical feature of context at a given position
2. Feature selection: selecting the most important features using the randomized lasso regression method
3. Significance and effect test: after constructing the regression formula, we fit the regression to test whether the effect of each motif is significant and to estimate its $\beta$ coefficient
4. Summary: summarizing the results across the different mutation types and passages


In [113]:
import pandas as pd
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti
from scipy.stats import kendalltau,spearmanr
import statsmodels.formula.api as sm
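# RandomizedLasso was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import requires an older scikit-learn release
from sklearn.linear_model import RandomizedLasso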
import os
import glob
import warnings
warnings.filterwarnings("ignore")
# All 12 single-nucleotide substitution types, written as <original><new>
MUTATIONS=['AC', 'GT', 'AG', 'CA', 'CG', 'GC', 'AT', 'GA', 'TG', 'CT', 'TC', 'TA']


Sample of a data file from passage 7

In [114]:
data=pd.read_csv(r'C:\Users\Guyling1\ContextProject\regressionCSVOneFileSABINK5P7.csv')
data.head(3)

Unnamed: 0,kmer,mutationType,P1,P2,P3,P4,P5,prob,createCpG
0,ACTAA,CT,A,C,C,A,A,-7.819754,False
1,TTATC,TA,T,T,T,T,C,-20.723266,False
2,ATAGG,GA,A,T,G,G,G,-20.723266,False


## 1. Feature Extraction

The functions below add dummy variables for context dependence up to third order, i.e. for every combination of up to three of the flanking positions 1, 2, 4 and 5.


In [115]:
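def generateFeature(k):
    # Yield every combination of up to k of the flanking positions 1, 2, 4 and 5
    # as a string ('1', '12', '245', ...). Position 3 is the mutated site itself.
    for j in range(1,k+1):
        # the built-in map replaces itertools.imap, which exists only in Python 2
        for feature in map(''.join,itertools.combinations('1245',j)):
            yield feature

def functionGenerator(positions,x):
    # Concatenate the nucleotides of row x at the given positions, e.g. [1,2] -> 'AC'
    res=''
    for position in positions:
        res+=x['P{}'.format(position)]
    return res

def createFeaturePrefix(featureList):
    # Build the column prefix for a position combination, e.g. [2,4,5] -> 'P2_P4_P5_'
    res=''
    for feature in featureList:
        res+="P{}_".format(feature)
    return res

def getDummyVarPositionForK(data,k):
    # Append one 0/1 column per (position combination, nucleotide combination),
    # named like 'P1__A' or 'P2_P4_P5__TCG'
    for feature in generateFeature(k):
        featureList=[int(x) for x in list(feature)]
        concatFeature=data.apply(lambda x: functionGenerator(featureList,x),axis=1)
        dummies=pd.get_dummies(concatFeature,prefix=createFeaturePrefix(featureList))
        data=pd.concat([data,dummies],axis=1)
    return data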
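
To make the size of the feature space concrete, the following sketch enumerates the position combinations that `generateFeature(3)` iterates over and bounds the number of dummy columns they can produce. It is a standalone Python 3 illustration; the helper name `position_combinations` is hypothetical and not part of the notebook.

```python
import itertools

def position_combinations(k, positions="1245"):
    """List every combination of up to k flanking positions as a string."""
    return ["".join(c)
            for j in range(1, k + 1)
            for c in itertools.combinations(positions, j)]

combos = position_combinations(3)
print(len(combos))   # 4 singles + 6 pairs + 4 triples = 14
print(combos[:6])    # ['1', '2', '4', '5', '12', '14']

# Each combination of j positions expands to at most 4**j dummy columns,
# one per nucleotide string, so the upper bound is 4*4 + 6*16 + 4*64.
max_columns = sum(4 ** len(c) for c in combos)
print(max_columns)   # 368
```

The bound is only reached when every nucleotide string is observed, since `pd.get_dummies` creates columns only for values present in the data.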



Sample of the data after adding the dummy-variable features

In [116]:
data=getDummyVarPositionForK(data,3)
data.head(3)


Unnamed: 0,kmer,mutationType,P1,P2,P3,P4,P5,prob,createCpG,P1__A,...,P2_P4_P5__TCG,P2_P4_P5__TCT,P2_P4_P5__TGA,P2_P4_P5__TGC,P2_P4_P5__TGG,P2_P4_P5__TGT,P2_P4_P5__TTA,P2_P4_P5__TTC,P2_P4_P5__TTG,P2_P4_P5__TTT
0,ACTAA,CT,A,C,C,A,A,-7.819754,False,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,TTATC,TA,T,T,T,T,C,-20.723266,False,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,ATAGG,GA,A,T,G,G,G,-20.723266,False,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


After adding the dummy variables we build a feature matrix X and a response vector y so that they can be passed directly to the regression.

The `withNull` parameter controls whether unobserved mutations are included. These carry the default value of $10^{-9}$, i.e. a log-probability of about -20.72, which is why the code filters on `prob > -20`.
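
The relation between the default value and the `prob > -20` cut-off can be checked directly. This is a small sketch assuming `prob` holds natural logarithms, which matches the value -20.723266 in the sample rows above; the variable names are illustrative.

```python
import math

NULL_PROBABILITY = 1e-9          # default value given to unobserved mutations
null_log_prob = math.log(NULL_PROBABILITY)
print(round(null_log_prob, 6))   # -20.723266

# Log-probabilities taken from the sample rows shown earlier
sample_probs = [-7.819754, -20.723266, -20.723266]
observed = [p for p in sample_probs if p > -20]
print(observed)                  # [-7.819754]
```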


In [117]:
def cleanDataMatrix(data):
    # Drop, in place, the raw columns that must not enter the regression as covariates
    for i in range(1,6):
        data.drop('P{}'.format(i),axis=1,inplace=True)
    data.drop('createCpG',axis=1,inplace=True)
    # 'Unnamed: 0' is the row index saved in the CSV, not a context feature
    if 'Unnamed: 0' in data.columns:
        data.drop('Unnamed: 0',axis=1,inplace=True)

def createCovarsAndProbVectorForMutationType(data,mutType,withNull=False):
    # Restrict to one mutation type and split into covariates X and response y
    data=data[data['mutationType']==mutType]
    if not withNull:
        # unobserved mutations have prob = log(10^-9), about -20.72; filter them out
        data=data[data['prob']>-20]
    probVector=data.prob
    covars=data.drop(['prob','kmer','mutationType'],axis=1)
    return covars,probVector,data

## 2. Feature Selection

Because we have a very large number of features, and these features are also dependent (for example, there is an obvious dependency between P1=A and P1=A,P2=C), we use <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RandomizedLasso.html">randomized lasso regression</a>, a randomized L1-norm-based regression method. We only take into account features whose selection score exceeds ratio=0.2, i.e. features that were selected in more than 20% of the randomized fits, and only among the 20 highest-scoring features. This was done to work around problems with forward feature selection methods, which did not perform well.
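
For readers without access to `RandomizedLasso`, the idea behind it can be illustrated with a NumPy-only sketch: fit a lasso many times on random subsamples with randomly rescaled columns, and record how often each feature receives a non-zero coefficient. This is not the scikit-learn implementation; the functions `lasso_cd` and `selection_frequencies`, their parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate-descent lasso for (1/2n)*||y - X b||^2 + alpha*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()          # equals y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]      # partial residual without feature j
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

def selection_frequencies(X, y, alpha=0.1, n_resampling=50,
                          sample_fraction=0.75, scaling=0.5, seed=0):
    """Fraction of randomized fits in which each feature gets a non-zero weight."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resampling):
        rows = rng.choice(n, size=int(sample_fraction * n), replace=False)
        # randomly shrink about half of the columns, which penalises them more
        col_scale = np.where(rng.random(p) < 0.5, scaling, 1.0)
        Xs = X[rows] * col_scale
        Xs = Xs - Xs.mean(axis=0)
        ys = y[rows] - y[rows].mean()
        counts += lasso_cd(Xs, ys, alpha) != 0
    return counts / n_resampling

# Toy data: only the first of six features drives the response
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=80)
freq = selection_frequencies(X, y)
selected = [j for j, f in enumerate(freq) if f > 0.2]
print(freq, selected)
```

The final list comprehension plays the role of the `ratio` threshold: a feature is kept only if it survives in more than 20% of the randomized fits.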

In [118]:
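def randomLassoFeatureExtraction(covars,probVector,ratio=0.2):
    # Return the names of the (at most 20) top-scoring features whose
    # randomized-lasso selection score exceeds ratio
    featureList=[]
    names=covars.columns
    rlasso = RandomizedLasso()
    rlasso.fit(covars,probVector)
    sortedFeatures=sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 names), reverse=True)
    # slicing avoids an IndexError when there are fewer than 20 features
    for score,name in sortedFeatures[:20]:
        if score>ratio:
            featureList.append(name)
    return featureList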


## 3. Significance and effect test
Because every mutation context can only be compared to mutations of the same class (we might want to change this to transitions/transversions to get more power), the data is divided by mutation type, and we then perform a regression in which the covariates are only the features obtained in the randomized lasso step.

To further reduce the effect of dependencies between the features, I used the <a  href="http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.WLS.html">WLS method</a> instead of OLS. Note that WLS weights the observations rather than the explanatory variables, and since the code below passes no `weights` argument, every observation gets the default weight of 1 and the fit coincides with OLS.
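
WLS minimises a weighted sum of squared residuals, with one weight per observation. The sketch below uses plain NumPy rather than statsmodels to show that with all weights equal to 1 (the statsmodels default when no `weights` argument is given) the estimate is identical to OLS. The helper `weighted_least_squares` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Solve min_b sum_i w_i * (y_i - x_i . b)^2 via the normal equations."""
    W = np.diag(weights)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])  # intercept + 2 covariates
y = X @ np.array([-8.0, 1.5, -2.0]) + rng.normal(scale=0.3, size=30)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_unit = weighted_least_squares(X, y, np.ones(30))
print(np.allclose(beta_ols, beta_unit))   # True: unit weights reproduce OLS

# Non-uniform weights change the estimate, e.g. down-weighting the first ten rows
beta_weighted = weighted_least_squares(X, y, np.r_[np.full(10, 0.1), np.ones(20)])
print(beta_ols, beta_weighted)
```

To obtain a fit that actually differs from OLS, per-observation weights would have to be supplied explicitly.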

In [144]:
def featureToMotiff(feature,mt,k=5):
    # Turn a feature name into a k-mer motif, e.g. 'P2_P4__TC' with mt='AG' -> 'XTACX':
    # 'X' marks an unconstrained position and the centre holds the original nucleotide
    kmer=['X' for i in range(k)]
    oldnuc=mt[0]
    kmer[k//2]=oldnuc
    positions,nucs=feature.split('__')
    positions=[int(x[1]) for x in positions.split("_")]
    for position,nuc in zip(positions,nucs):
        kmer[position-1]=nuc
    return "".join(kmer)

def arrowBySign(value):
    # Up arrow for a positive coefficient, down arrow otherwise
    # (the u prefix makes these unicode strings on both Python 2 and 3)
    if value>0:
        return u"\u2191"
    else:
        return u"\u2193"

def regressOverDataFile(data,pvalCutOff=10**-4,justName=False):
    # For each mutation type: select features, fit the regression and keep the
    # motifs whose p-value is below pvalCutOff
    mutationTypes=set(data['mutationType'].values)
    motifMap={}
    for mt in mutationTypes:
        motifMap[mt]=[]
        covars,probVector,mutData=createCovarsAndProbVectorForMutationType(data,mt)
        features=randomLassoFeatureExtraction(covars,probVector)
        if len(features)==0:
            # no feature passed the randomized lasso threshold
            continue
        predictors=" + ".join(features)
        # no weights are passed, so wls uses unit weights here
        result = sm.wls(formula="prob~ {} +1".format(predictors), data=mutData).fit()
        # [1:5] skips the smallest p-value, which is assumed to belong to the intercept
        sigFeatures=result.pvalues.sort_values()[1:5].keys()
        for f in sigFeatures:
            if result.pvalues[f]<pvalCutOff:
                if justName:
                    motifMap[mt].append(featureToMotiff(f,mt)+arrowBySign(result.params[f]))
                else:
                    ci=result.conf_int().loc[f]
                    motifMap[mt].append((featureToMotiff(f,mt),result.pvalues[f],result.params[f],ci[0],ci[1]))
    return motifMap
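
As a quick illustration of how a dummy-variable name maps back to a motif, here is a standalone Python 3 sketch of the same decoding. The names `feature_to_motif` and `arrow` are hypothetical stand-ins for the notebook's functions; `P1__A` and `P2_P4_P5__TCG` are column names from the sample output above, while `P4_P5__CA` and the coefficient -2.49 are made-up inputs that follow the same naming scheme.

```python
def feature_to_motif(feature, mutation_type, k=5):
    """Decode a dummy-variable name into a k-mer motif ('X' = any nucleotide)."""
    kmer = ["X"] * k
    kmer[k // 2] = mutation_type[0]            # centre holds the original nucleotide
    positions, nucs = feature.split("__")
    for pos, nuc in zip(positions.split("_"), nucs):
        kmer[int(pos[1:]) - 1] = nuc           # 'P4' -> index 3
    return "".join(kmer)

def arrow(beta):
    """Up arrow for a positive coefficient, down arrow otherwise."""
    return "\u2191" if beta > 0 else "\u2193"

print(feature_to_motif("P1__A", "CT"))                      # AXCXX
print(feature_to_motif("P2_P4_P5__TCG", "CT"))              # XTCCG
print(feature_to_motif("P4_P5__CA", "CT") + arrow(-2.49))   # XXCCA↓
```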

In [120]:
cleanDataMatrix(data)
motiffMap=regressOverDataFile(data)

## 4. Summarize different passages and mutation types
I created another function that collects the effects across the different passages.

In [137]:
def fileToPassege(fileName):
    # The passage label is the last two characters before the extension,
    # e.g. 'regressionCSVOneFileSABINK5P7.csv' -> 'P7'
    return fileName.split(".")[0][-2:]

def regressOverFolder(folder,justNameVar=True):
    # Run the full pipeline on every CSV in folder; rows are passages, columns are mutation types
    os.chdir(folder)
    files=glob.glob("*.csv")
    passeges=[fileToPassege(fileName) for fileName in files]
    motifTable=pd.DataFrame('',index=passeges,columns=MUTATIONS)
    for filee in files:
        data=pd.read_csv(filee)
        data=getDummyVarPositionForK(data,3)
        cleanDataMatrix(data)
        fileMotifMap=regressOverDataFile(data,justName=justNameVar)
        for mut in fileMotifMap.keys():
            if len(fileMotifMap[mut])>0:
                # .at writes the list into a single cell; chained .loc[...][...] may assign to a copy
                motifTable.at[fileToPassege(filee),mut]=fileMotifMap[mut]
    return motifTable





This is a summary of all significant motifs across passages; the arrows indicate whether the motif raises or lowers the substitution rate.

In [145]:
regressOverFolder(r'C:\Users\Guyling1\ContextProject')


Unnamed: 0,AC,GT,AG,CA,CG,GC,AT,GA,TG,CT,TC,TA
P1,,,[XTAGX↑],,,,,[CTGXT↓],,[CXCCA↓],[XXTAA↑],
P2,[XCAXX↑],,,,,[XGGTG↓],,,,[XXCCA↓],[XXTAX↑],
P3,,,[XTAGX↑],,,,"[XXATG↓, XTACT↓]",,,[XXCCX↓],[XXTAA↑],[CXTAG↓]
P4,[XCAXX↑],,,,,,,,,[XXCCA↓],[XXTAX↑],
P5,[XCAXX↑],,"[CAATX↓, XTAGX↑]",,,,,,,[CXCCA↓],[XXTAA↑],[CXTAC↓]
P6,,,,,,[CTGXT↓],,,,[XXCCX↓],[XXTAA↑],
P7,[AXATA↑],,,,[GCCXX↑],,,,,"[XXCCX↓, CXCCA↓]",,


In [138]:
summary=regressOverFolder(r'C:\Users\Guyling1\ContextProject',justNameVar=False)

You can see more details about the different motifs (p-value, $\beta$ coefficient, $CI_\alpha[low]$, $CI_\alpha[high]$) for $\alpha=0.05$.

Just change the passage and mutation type to see the motif you want.

In [141]:
summary.loc['P2']['CA']

[('CXCCA',
  9.1982797801072077e-06,
  -2.4924195347435991,
  -3.5521976503178889,
  -1.4326414191693093)]
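
Because `prob` is a log-probability, a coefficient $\beta$ acts multiplicatively on the substitution probability: the presence of the motif multiplies it by $e^{\beta}$. The sketch below applies this to the tuple shown above, assuming natural logarithms (consistent with $\log(10^{-9}) \approx -20.72$ in the data); the variable names are illustrative.

```python
import math

motif, p_value, beta, ci_low, ci_high = (
    "CXCCA", 9.1982797801072077e-06, -2.4924195347435991,
    -3.5521976503178889, -1.4326414191693093)

fold_change = math.exp(beta)                      # multiplicative effect of the motif
fold_ci = (math.exp(ci_low), math.exp(ci_high))   # same transform applied to the CI bounds

print("{}: x{:.3f} (95% CI x{:.3f} to x{:.3f}), p={:.1e}".format(
    motif, fold_change, fold_ci[0], fold_ci[1], p_value))
```

Under this reading, the motif multiplies the substitution probability by about 0.083, i.e. roughly a twelve-fold reduction.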