## Setting The Stage With A Pretend Scenario

For the demo, I wanted a non-trivial example that simulates a real coding situation: one where plain CPython isn't fast enough and you have to start looking at alternatives to it.
 
Scenario: Your favorite band, 2 Pegasus Tasks, just released a new single. However, in their usual style, they did it with a twist.

Instead of distributing the song on Spotify or some other distribution platform, for its initial release they segmented the audio file binary into tiny substrings and then inserted them according to a secret pattern into a bunch of longer strings to create an internet scavenger hunt.

Any group that is able to piece the song together gets a cash prize, on the condition that they don't share it, or how they got it, with others. They have provided a preliminary list of 300 strings, 100 of which they've confirmed do contain a chunk of the song, and 200 of which don't.

You and a group of friends have managed to scrape all known instances of these string sequences and tried some linear approaches, but you can't make heads or tails of what they mean. When you analyze the sequences by hand, sometimes you think you've found a pattern, but then another example breaks it. It definitely looks like the function for the pattern will be non-linear in nature.

So you have the idea to use a Support Vector Machine with a string kernel to learn a boundary, in a higher-dimensional space, between what is part of the pattern and what is not. You find a whitepaper about string subsequence kernels that looks like just what you need. You search through libraries like scikit-learn and NLTK, but in this imaginary scenario you can't find a decent implementation of this kernel function anywhere, so you'll have to write your own.
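
To make the idea concrete before the real implementation, here is a minimal brute-force sketch of what a string subsequence kernel measures. It is not the code used in this post: the names `subsequence_weights` and `ssk_brute_force` are made up for illustration, and enumerating every subsequence like this is only feasible for very short strings. Each length-`n` subsequence is weighted by `lbda` raised to the span it covers in the string, so subsequences with gaps count for less, and the kernel is the sum over shared subsequences of the product of their weights.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_weights(s, n, lbda):
    """Map each length-n subsequence of s to its total gap-penalised weight."""
    weights = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # distance covered in s, gaps included
        weights[u] += lbda ** span
    return weights


def ssk_brute_force(s, t, n, lbda):
    """Sum the weight products of the subsequences both strings share."""
    ws = subsequence_weights(s, n, lbda)
    wt = subsequence_weights(t, n, lbda)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)


lbda = 0.5  # illustrative decay factor, not the value used later in the post
k_cat_car = ssk_brute_force("cat", "car", 2, lbda)
k_cat_cat = ssk_brute_force("cat", "cat", 2, lbda)
k_car_car = ssk_brute_force("car", "car", 2, lbda)

# Normalising by the self-similarities keeps the result between 0 and 1.
normalized = k_cat_car / (k_cat_cat * k_car_car) ** 0.5
print(k_cat_car, normalized)
```

For "cat" and "car" with `n = 2`, the only shared subsequence is "ca", which is contiguous in both strings, so the raw kernel value is `lbda ** 4`. A practical implementation replaces the enumeration with a recursion plus memoization.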


### Example 1: Just CPython

After poring over the whitepaper, and maybe some supplemental material to make sure you understand the concepts, you are able to create an implementation of the string subsequence kernel. You are ready to run it. Let's simulate that by going to an older revision of our example code, which is plain Python running on CPython, with some profiling lines to show us how fast it is.

import numpy as np
import timeit

"""
This code is derived from https://github.com/helq/python-ssk/commit/6acee597ff37f7e7e12dd8651421a4d34c5dad70
by https://github.com/helq, which is licensed under Creative Commons Zero v1.0 Universal.
Changes: removed the Lodhi assertions because we know it works, turned the script handling into a function,
added profiling code, and changed the inputs to exaggerate the performance problem.
Do check that repo out if you actually want to learn about SSK or get a fast implementation for practical purposes.
"""
# Kernel defined by Lodhi et al. (2002)
def ssk(s, t, n, lbda, accum=False):
    dynamic = {}

    def k_prim(s, t, i):
        # print( "k_prim({},{},{})".format(s, t, i) )
        if i == 0:
            # print( "k_prim({},{},{}) => 1".format(s, t, i)  )
            return 1.
        if min(len(s), len(t)) < i:
            # print( "k_prim({},{},{}) => 0".format(s, t, i)  )
            return 0.
        if (s,t,i) in dynamic:
            return dynamic[(s,t,i)]

        x = s[-1]
        s_ = s[:-1]
        indices = [i for i, e in enumerate(t) if e == x]
        toret = lbda * k_prim(s_, t, i) \
              + sum( k_prim(s_, t[:j], i-1) * (lbda**(len(t)-j+1)) for j in indices )
        # print( "k_prim({},{},{}) => {}".format(s, t, i, toret) )
        dynamic[(s,t,i)] = toret
        return toret

    def k(s, t, n):
        # print( "k({},{},{})".format(s, t, n) )
        if n <= 0:
            raise ValueError("n must be bigger than zero")
        if min(len(s), len(t)) < n:
            # print( "k({},{},{}) => 0".format(s, t, n) )
            return 0.
        x = s[-1]
        s_ = s[:-1]
        indices = [i for i, e in enumerate(t) if e == x]
        toret = k(s_, t, n) \
              + lbda**2 * sum( k_prim(s_, t[:j], n-1) for j in indices )
        # print( "k({},{},{}) => {}".format(s, t, n, toret) )
        return toret

    if accum:
        toret = sum( k(s, t, i) for i in range(1, min(n,len(s),len(t))+1) )
    else:
        toret = k(s, t, n)

    # print( len(dynamic) )
    return toret

def string_kernel(xs, ys, n, lbda):
    if len(xs.shape) != 2 or len(ys.shape) != 2 or xs.shape[1] != 1 or ys.shape[1] != 1:
        raise ValueError("The shape of the features is wrong, it must be (n,1)")

    lenxs, lenys = xs.shape[0], ys.shape[0]

    mat = np.zeros( (lenxs, lenys) )
    for i in range(lenxs):
        for j in range(lenys):
            mat[i,j] = ssk(xs[i,0], ys[j,0], n, lbda, accum=True)

    mat_xs = np.zeros( (lenxs, 1) )
    mat_ys = np.zeros( (lenys, 1) )

    for i in range(lenxs):
        mat_xs[i] = ssk(xs[i,0], xs[i,0], n, lbda, accum=True)
    for j in range(lenys):
        mat_ys[j] = ssk(ys[j,0], ys[j,0], n, lbda, accum=True)

    return np.divide(mat, np.sqrt(mat_ys.T * mat_xs))

def evaluate_ssk():
    print("Testing...")
    
    ## code for the pretend scenario, long binary sequences
    s1 = np.array(["101110010010111110101011100000101000010010111100101011010011011010110111", \
                   "101000010010111111111111110000010100001001011110010101101001101101011011", \
                   "101000010010111110101011100011111111101001011110010101101001101101011011", \
                   "10111111001011111010101110000010100001001011110010101101001101101011011"]).reshape((4,1))
    s2 = np.array(["10100001001011111111111110000010100001001011110010101101001101101011011", \
                   "10100001001011111010101110000010100001001011110010111111111101101011011", \
                   "10100001001011111010101110000010100011101011110010101101001101101011011", \
                   "10100001001011111010101110110010100001001011110010101101001111111011011"]).reshape((4, 1))

    # code for pretend scenario, we are looking for common substrings up to 8 chars in length, because that could be a byte
    print( string_kernel(s1, s2, 11, 1.) )
    
print(f"Running String Kernel on the strings took: \
{timeit.timeit('evaluate_ssk()', setup='from __main__ import evaluate_ssk', number=1)} seconds", )

Testing...
[[0.8451728  0.78146246 0.88319993 0.86031583]
 [0.99737074 0.72270543 0.86173297 0.86350414]
 [0.78265891 0.66158364 0.83420016 0.80619043]
 [0.74976104 0.61763655 0.71480048 0.69580217]]
Running String Kernel on the strings took: 17.162577100039925 seconds


### CPython Is Too Slow. . . 

Oh, no! I don't know what it shows on your machine, but mine took about 17 seconds, and that is no bueno. The real strings are much longer than that, and the SVM is going to need to compare thousands of them pairwise, maybe at even longer subsequence lengths. If we want to be able to iterate and tune the SVM model quickly, we need to build a better mousetrap.
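
To put a rough number on that, here is a back-of-the-envelope sketch that scales the timing from the run above. The measured time and the 4x4 input come from that run; everything else is an assumption: the helper `ssk_calls` is made up for this estimate, and it pretends every `ssk` call costs the same, which only holds if the strings stay around the same length and `n` stays the same. Longer strings would make it worse.

```python
# Figures taken from the run above; every other number here is an assumption.
measured_seconds = 17.162577100039925
rows, cols = 4, 4


def ssk_calls(n_rows, n_cols):
    """Calls made by string_kernel: the full matrix plus one self-comparison per string."""
    return n_rows * n_cols + n_rows + n_cols


# Average cost of one ssk(..., accum=True) call in the measured run.
seconds_per_call = measured_seconds / ssk_calls(rows, cols)

n_strings = 300  # the preliminary list from the scenario, compared against itself
estimated_seconds = ssk_calls(n_strings, n_strings) * seconds_per_call
print(f"{ssk_calls(n_strings, n_strings)} calls, roughly {estimated_seconds / 3600:.1f} hours")
```

Under those assumptions, even the 300-string preliminary list works out to about 18 hours for a single kernel matrix, before any tuning. Exploiting the symmetry of the matrix would roughly halve that, which still isn't close to interactive.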

What have we got? Well, PyPy is supposed to be a faster drop-in replacement for CPython, right? And even though we know it won't help us with the already optimized NumPy operations, it looks like a lot of the bottleneck is good old-fashioned Python for loops. PyPy should be able to figure out how to compile those down to machine code for us.

## Example 2: PyPy To The Rescue?

So for this example, we are going to run the same code, but swap out our Python 3 kernel in Jupyter for a PyPy one.
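
After switching kernels, it is worth confirming which interpreter the notebook is actually running on before trusting any timings. A minimal check using only the standard library might look like this (the variable name `implementation` is just for illustration):

```python
import platform
import sys

# Reports the interpreter behind the current kernel, e.g. "CPython" or "PyPy".
implementation = platform.python_implementation()
print(f"Running on {implementation} {platform.python_version()}")

# The full version string also names the interpreter build.
print(sys.version)
```

If this still prints `CPython` after the swap, the notebook is probably still attached to the old kernel.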