# Overview of `kvector` features

In [1]:
import kvector



## Read HOMER Motifs

Read a HOMER motif file into a pandas Series keyed by motif name, where each value is a pandas DataFrame holding that motif's position weight matrix (PWM).

In [2]:
motifs = kvector.read_motifs('kvector/tests/data/example_rbps.motif', residues='ACGT')
motifs.head()

M001_0.6_A1CF_ENSG00000148584_Homo_sapiens\tM001_0.6_A1CF_ENSG00000148584_Homo_sapiens\t5.0                                          A         C         G         T
0  0...
M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\tM002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\tM003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\t5.0              A         C         G         T
0  0...
M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
dtype: object
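
For readers unfamiliar with the file format, here is a minimal sketch of how a HOMER-style motif file can be parsed with the standard library alone. It assumes each motif starts with a `>` header line followed by one tab-separated row of A, C, G, T probabilities per position. `parse_homer_motifs` is a hypothetical helper written for illustration, not `kvector`'s implementation, and the `example_motif` header is invented.

```python
import io


def parse_homer_motifs(handle, residues="ACGT"):
    """Parse HOMER-style motif text into {header: [row_dict, ...]}.

    Hypothetical helper for illustration; not kvector's implementation.
    """
    motifs = {}
    current = None
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            # Everything after ">" is used as the motif's key
            current = line[1:]
            motifs[current] = []
        elif current is not None:
            # One row per motif position, one column per residue
            values = [float(x) for x in line.split("\t")]
            motifs[current].append(dict(zip(residues, values)))
    return motifs


# Invented header; the two rows are the first two positions of the
# BRUNOL4 matrix displayed in this document.
example = (
    ">example_motif\texample_motif\t5.0\n"
    "0.085063\t0.085063\t0.175952\t0.653921\n"
    "0.013046\t0.013046\t0.776577\t0.19733\n"
)
parsed = parse_homer_motifs(io.StringIO(example))
for header, rows in parsed.items():
    print(repr(header), len(rows), rows[0]["T"])
```

Keeping the whole header line as the key mirrors the tab-containing motif names seen in the output above, which is why those names must be typed with `\t` when selecting a motif by name.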

You can access individual motifs with the usual pandas indexing:

In [3]:
# the motif at position 3 (counting from 0), i.e. the fourth motif
motifs.iloc[3]

Unnamed: 0,A,C,G,T
0,0.085063,0.085063,0.175952,0.653921
1,0.013046,0.013046,0.776577,0.19733
2,0.013046,0.013046,0.013046,0.960861
3,0.013046,0.013046,0.764576,0.209331
4,0.013046,0.013046,0.104634,0.869273
5,0.013046,0.013046,0.666799,0.307108
6,0.083101,0.083101,0.264548,0.56925


In [4]:
# Specific motif name
motifs['M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0']

Unnamed: 0,A,C,G,T
0,0.085063,0.085063,0.175952,0.653921
1,0.013046,0.013046,0.776577,0.19733
2,0.013046,0.013046,0.013046,0.960861
3,0.013046,0.013046,0.764576,0.209331
4,0.013046,0.013046,0.104634,0.869273
5,0.013046,0.013046,0.666799,0.307108
6,0.083101,0.083101,0.264548,0.56925


## Convert motifs to kmer vectors

Comparing motifs as position-specific weight matrices requires aligning them first. Instead, you can convert each motif to a vector of kmers, where the value for each kmer is that kmer's score in the motif.

Citation: [Xu and Su, *PLoS ONE* (2010)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008797)


In [6]:
motif_kmer_vectors = kvector.motifs_to_kmer_vectors(motifs, residues='ACGT', 
    kmer_lengths=(3, 4))
motif_kmer_vectors

Unnamed: 0,M001_0.6_A1CF_ENSG00000148584_Homo_sapiens	M001_0.6_A1CF_ENSG00000148584_Homo_sapiens	5.0,M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens	M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens	5.0,M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster	M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster	5.0,M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens	M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens	5.0
AAA,0.442114,0.310285,0.012428,0.022518
AAC,0.301068,0.278607,0.012428,0.022518
AAG,0.323372,0.273450,0.133575,0.134406
AAT,0.424485,0.271529,0.212117,0.207887
ACA,0.312890,0.301837,0.012428,0.022518
ACC,0.171844,0.270159,0.012428,0.022518
ACG,0.194148,0.265001,0.133575,0.134406
ACT,0.295261,0.263081,0.212117,0.207887
AGA,0.312890,0.360837,0.174895,0.173211
AGC,0.171844,0.329159,0.174895,0.173211
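
To make the kmer scores more concrete, here is a small sketch of one scoring scheme that reproduces the 3-mer values displayed for the BRUNOL4 motif: slide the kmer across every window of the PWM, average the per-position probabilities within each window, then average over all windows. This scheme is inferred from the numbers shown in this document, not from `kvector`'s source, so treat it as an assumption; it may not hold for other kmer lengths or reflect how the library actually computes scores. `kmer_score` is a hypothetical name.

```python
from itertools import product

# BRUNOL4 PWM as displayed earlier in this document (columns A, C, G, T)
pwm = [
    {"A": 0.085063, "C": 0.085063, "G": 0.175952, "T": 0.653921},
    {"A": 0.013046, "C": 0.013046, "G": 0.776577, "T": 0.19733},
    {"A": 0.013046, "C": 0.013046, "G": 0.013046, "T": 0.960861},
    {"A": 0.013046, "C": 0.013046, "G": 0.764576, "T": 0.209331},
    {"A": 0.013046, "C": 0.013046, "G": 0.104634, "T": 0.869273},
    {"A": 0.013046, "C": 0.013046, "G": 0.666799, "T": 0.307108},
    {"A": 0.083101, "C": 0.083101, "G": 0.264548, "T": 0.56925},
]


def kmer_score(kmer, pwm):
    """Average per-position probability of `kmer` over all PWM windows.

    Assumed scoring scheme, inferred from the displayed output.
    """
    k = len(kmer)
    n_windows = len(pwm) - k + 1
    total = 0.0
    for start in range(n_windows):
        window = pwm[start:start + k]
        # Mean probability of the kmer's residues at this alignment
        total += sum(row[base] for row, base in zip(window, kmer)) / k
    return total / n_windows


# Score every possible 3-mer over the ACGT alphabet
scores = {
    "".join(p): kmer_score("".join(p), pwm)
    for p in product("ACGT", repeat=3)
}
for kmer in ("AAA", "AAG", "AAT"):
    print(kmer, round(scores[kmer], 6))
```

Because the PWM entries shown here are rounded to six decimals, the results agree with the BRUNOL4 column of the table above only to about five decimal places.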


## Count kmers in fasta files


You may also want to count the number of occurrences of each DNA word (kmer) in a FASTA file. `count_kmers` does just that, returning a pandas DataFrame.

In [9]:
kmer_vector = kvector.count_kmers('kvector/tests/data/example.fasta', kmer_lengths=(3, 4))
kmer_vector.head()

Unnamed: 0,AAA,AAC,AAG,AAT,ACA,ACC,ACG,ACT,AGA,AGC,...,TTCG,TTCT,TTGA,TTGC,TTGG,TTGT,TTTA,TTTC,TTTG,TTTT
0,2,3,1,0,0,2,1,2,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,3,1,0,0,2,1,2,1,0,...,0,0,0,0,0,0,0,0,0,0
2,6,5,6,0,0,3,2,4,3,5,...,2,0,0,0,0,0,0,0,0,0
3,6,6,4,1,12,2,0,5,3,9,...,0,0,0,1,2,0,0,1,0,1
4,19,9,1,7,7,8,0,8,4,1,...,0,2,1,1,1,3,1,4,2,11
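
As an illustration of what kmer counting involves, here is a minimal standard-library sketch. It assumes kmers are counted as overlapping windows and that each FASTA record yields its own set of counts; whether `count_kmers` behaves exactly this way is an assumption, not something confirmed by this document. `read_fasta` and `count_overlapping_kmers` are hypothetical helpers, and the sequences are made up.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            # Sequences may span several lines; collect and join them
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def count_overlapping_kmers(sequence, kmer_lengths=(3, 4)):
    """Count every overlapping window of each requested length."""
    counts = Counter()
    for k in kmer_lengths:
        for i in range(len(sequence) - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


# Made-up FASTA content for illustration
fasta_lines = [">seq1", "AAAAC", "GT", ">seq2", "ACGT"]
per_sequence = {
    name: count_overlapping_kmers(seq)
    for name, seq in read_fasta(fasta_lines)
}
print(per_sequence["seq1"]["AAA"], per_sequence["seq2"]["ACGT"])
```

A `Counter` returns 0 for kmers that never occur, which corresponds to the zero entries in a dense table like the one above.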


Because this is a pandas DataFrame, you can easily compute summary statistics such as the per-kmer mean and standard deviation.

In [10]:
kmer_vector.mean()

AAA      7.0
AAC      5.2
AAG      2.6
AAT      1.6
ACA      3.8
ACC      3.4
ACG      0.8
ACT      4.2
AGA      2.4
AGC      3.0
AGG      2.4
AGT      1.4
ATA      0.6
ATC      1.6
ATG      2.0
ATT      1.2
CAA      3.6
CAC      4.8
CAG      3.2
CAT      2.0
CCA      2.6
CCC     10.8
CCG      2.8
CCT      6.2
CGA      0.6
CGC      1.4
CGG      3.2
CGT      1.2
CTA      2.2
CTC      6.8
        ... 
TGAG     0.6
TGAT     0.2
TGCA     2.2
TGCC     0.4
TGCG     0.8
TGCT     0.6
TGGA     1.2
TGGC     0.4
TGGG     0.4
TGGT     1.0
TGTA     0.6
TGTC     0.8
TGTG     0.4
TGTT     0.4
TTAA     0.4
TTAC     0.4
TTAG     0.2
TTAT     0.2
TTCA     0.6
TTCC     0.6
TTCG     0.4
TTCT     0.4
TTGA     0.2
TTGC     0.4
TTGG     0.6
TTGT     0.6
TTTA     0.2
TTTC     1.0
TTTG     0.4
TTTT     2.4
dtype: float64
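
As a sanity check on these statistics, the sketch below recomputes the mean and standard deviation for two kmers from the five rows shown by `kmer_vector.head()`, using only the standard library. It reproduces the means of 7.0 and 5.2 reported above. Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), which corresponds to `statistics.stdev` rather than `statistics.pstdev`.

```python
import statistics

# Counts for AAA and AAC across the five rows displayed by
# kmer_vector.head() in this document
aaa = [2, 2, 6, 6, 19]
aac = [3, 3, 5, 6, 9]

for name, counts in (("AAA", aaa), ("AAC", aac)):
    mean = statistics.mean(counts)
    sample_std = statistics.stdev(counts)  # divides by n - 1, like pandas
    print(name, mean, round(sample_std, 6))
```

The sample standard deviations computed this way (7.0 for AAA and roughly 2.48998 for AAC) match the values reported by `kmer_vector.std()` for those kmers.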

In [11]:
kmer_vector.std()

AAA     7.000000
AAC     2.489980
AAG     2.302173
AAT     3.049590
ACA     5.495453
ACC     2.607681
ACG     0.836660
ACT     2.489980
AGA     1.341641
AGC     3.937004
AGG     0.894427
AGT     1.516575
ATA     0.894427
ATC     2.607681
ATG     2.738613
ATT     2.683282
CAA     3.435113
CAC     4.969909
CAG     2.949576
CAT     2.828427
CCA     1.516575
CCC     4.381780
CCG     2.588436
CCT     1.303840
CGA     1.341641
CGC     1.140175
CGG     1.643168
CGT     1.643168
CTA     0.447214
CTC     1.643168
          ...   
TGAG    0.894427
TGAT    0.447214
TGCA    3.033150
TGCC    0.894427
TGCG    0.836660
TGCT    0.894427
TGGA    1.095445
TGGC    0.547723
TGGG    0.547723
TGGT    1.414214
TGTA    0.894427
TGTC    0.447214
TGTG    0.894427
TGTT    0.894427
TTAA    0.894427
TTAC    0.894427
TTAG    0.447214
TTAT    0.447214
TTCA    1.341641
TTCC    0.894427
TTCG    0.894427
TTCT    0.894427
TTGA    0.447214
TTGC    0.547723
TTGG    0.894427
TTGT    1.341641
TTTA    0.447214
TTTC    1.7320