faster ftrl fit and predict based on many c-style optimization #22

Merged

Conversation

@stegben (Contributor) commented Dec 26, 2016

I worked on some Cython optimizations of the fit and predict functions. The speedup is 2x~5x when interaction=False:

before

       3 function calls in 4.597 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    4.597    4.597    4.597    4.597 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

after

         3 function calls in 0.981 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.981    0.981    0.981    0.981 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}

Note that when interaction=True, the main overhead will be:

abs(hash('{}_{}'.format(x[i], x[j])))

I've provided a possible solution and commented it out in the _indices function. I've tried it out and found that it really removes the overhead. Please take a look and see whether it's general enough.
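
(For reference, a minimal standalone sketch, not part of the PR, that compares the two indexing expressions discussed above; n and the example indices are made-up stand-ins for self.n and x[i], x[j], and exact timings will vary by machine.)

import timeit

n = 2 ** 20          # stand-in for self.n (number of weight buckets)
i, j = 12345, 67890  # stand-in values for x[i], x[j]

# time the string-format-plus-hash expression used for interaction features
t_hash = timeit.timeit("abs(hash('{}_{}'.format(i, j))) % n",
                       globals=globals(), number=1000000)
# time the simple product-modulo alternative
t_prod = timeit.timeit("(i * j) % n", globals=globals(), number=1000000)

print('string hash: %.3fs  product mod: %.3fs' % (t_hash, t_prod))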

@stegben (Author, Contributor) commented Dec 26, 2016

By the way, here's my profiling code:

import cProfile

import numpy as np
np.random.seed(1234)
import scipy.sparse as sps

from kaggler.online_model import FTRL


DATA_NUM = int(4e6)  # number of non-zero entries; must be an int for np.random and np.ones


def main():
    print('create y...')
    # note: randint(0, 1) yields all zeros; that's fine for profiling fit()
    y = np.random.randint(0, 1, DATA_NUM)
    print('create x...')
    row = np.random.randint(0, 1000000, DATA_NUM)
    col = np.random.randint(0, 100, DATA_NUM)
    data = np.ones(DATA_NUM)
    x = sps.csr_matrix((data, (row, col)), dtype=np.int8)

    print('train...')
    profiler = cProfile.Profile(subcalls=True, builtins=True, timeunit=0.001)
    clf = FTRL(interaction=False)
    profiler.enable()
    clf.fit(x, y)
    profiler.disable()
    profiler.print_stats()
    clf.predict(x)


if __name__ == '__main__':
    main()
for j in range(i + 1, x_len):
    indices.append(abs(hash('{}_{}'.format(x[i], x[j]))) % self.n)
    # a much faster and also reasonable way:
    # indices.append((x[i] * x[j]) % self.n)

@jeongyoonlee (Owner) commented Dec 27, 2016

With this, there are collisions for low indices. e.g. when x[i] == 1 and x[j] == 2, (x[i] * x[j]) % self.n == 2. Alternatively, we may look for a faster hash function.
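
(A small illustrative sketch of the collision concern, using an assumed bucket count n; any distinct pairs with the same product land in the same bucket.)

n = 2 ** 20
pairs = [(1, 2), (2, 1), (1, 6), (2, 3), (3, 2), (6, 1)]
for i, j in pairs:
    print((i, j), '->', (i * j) % n)
# (1, 2) and (2, 1) both map to 2; (1, 6), (2, 3), (3, 2), (6, 1) all map to 6.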

@stegben (Author, Contributor) commented Dec 27, 2016

Well, you're right. Randomness for interaction features is needed. Let's deal with this issue in another PR.

@stegben (Author, Contributor) commented Dec 27, 2016

Maybe try some simple hash function like Knuth's multiplicative method? There's some interesting discussion here.
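
(A rough, hypothetical sketch of a Knuth-style multiplicative mix for the interaction pair; the constant, the two-round mixing scheme, and the function name are illustrative assumptions, not code from this repo.)

KNUTH = 2654435761  # prime close to 2**32 / golden ratio, commonly used for Knuth's multiplicative method

def interaction_index(i, j, n):
    """Map the feature pair (i, j) to a bucket in [0, n) without string formatting."""
    h = (i * KNUTH) & 0xFFFFFFFF          # mix the first index, keep 32 bits
    h = ((h ^ j) * KNUTH) & 0xFFFFFFFF    # fold in the second index and mix again
    return h % n

print(interaction_index(1, 2, 2 ** 20))   # differs from the swapped pair below
print(interaction_index(2, 1, 2 ** 20))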

@jeongyoonlee (Owner) commented Dec 27, 2016

Looks promising. As you said, let's deal with it in a separate PR. Meanwhile, could you update the comments on lines #96, 97, 103, and 104 to reflect our discussion? Thanks!

@stegben (Author, Contributor) commented Dec 28, 2016

Sure, I'll update them after work!

@stegben (Author, Contributor) commented Dec 28, 2016

Done. #23

@jeongyoonlee merged commit 0f9b6d7 into jeongyoonlee:master on Dec 28, 2016
