Approximate license detection #86

Closed · pombredanne opened this issue Oct 4, 2015 · 3 comments

@pombredanne (Member)

In addition to exact detection we should add support for approximate detection, as planned in the roadmap.
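For context, here is a minimal sketch of what approximate (as opposed to exact) detection means, using plain token-set overlap. This is an illustration under assumed simplifications, not the matching strategy the commits below actually implement:

    # Illustration only: score a scanned text against a license rule text by
    # token overlap, so a lightly edited license text still matches with a
    # lower score instead of being missed entirely.
    def tokenize(text):
        return set(text.lower().split())

    def approximate_score(scanned_text, rule_text):
        """Return a 0-100 score for how much of the rule appears in the scan."""
        scanned, rule = tokenize(scanned_text), tokenize(rule_text)
        if not rule:
            return 0
        return int(100 * len(scanned & rule) / float(len(rule)))

    rule = 'permission is hereby granted free of charge to any person'
    scan = 'permission is hereby granted without charge to any person obtaining'
    assert approximate_score(scan, rule) == 80  # close, but not an exact match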

@pombredanne pombredanne self-assigned this Nov 19, 2015
@pombredanne pombredanne added this to the v2.0 milestone Nov 19, 2015
pombredanne added a commit that referenced this issue Mar 11, 2016
 * any rule can now be a template
pombredanne added a commit that referenced this issue Mar 11, 2016
pombredanne added a commit that referenced this issue Mar 11, 2016
 * simpler merging
 * support for creating spans from integer lists (sketched below)
 * several other refinements
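A hypothetical illustration of the span idea from this commit (names and details assumed; this is not ScanCode's real Span class): a span built from a plain list of integer token positions, with merging as a cheap set union:

    # Assumed, simplified Span: a set of integer token positions built
    # directly from a list of ints; merging two spans is a set union.
    class Span(object):
        def __init__(self, positions):
            self.positions = frozenset(positions)

        def __or__(self, other):
            return Span(self.positions | other.positions)

        @property
        def start(self):
            return min(self.positions)

        @property
        def end(self):
            return max(self.positions)

    merged = Span([1, 2, 3]) | Span([7, 8])
    assert (merged.start, merged.end) == (1, 8)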
pombredanne added a commit that referenced this issue Mar 11, 2016
 * use multiple strategies for matching (bitvectors, frequency counters,
   ngram index, sequence alignments, and an inverted positional index)
 * tokens are converted to numerical ids instead of keeping strings
   around, reducing the memory footprint significantly (sketched after
   this commit message)
 * index is cached on disk after creation, making for a shorter
   startup time after initial indexing
 * new scancode option --license-score <int> to set the lowest score
   to keep an approximate license match
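The token-to-id conversion from this commit can be sketched as follows, reusing the names that appear in the test failure below (dictionary, tokens_by_tid, rules_tokens_ids). The real index keeps more structure, so treat this as an assumed simplification:

    # Assumed sketch: intern each distinct token string once and store each
    # rule as a compact list of small integers instead of repeated strings.
    def build_token_ids(rule_texts):
        dictionary = {}        # token string -> token id
        tokens_by_tid = []     # token id -> token string
        rules_tokens_ids = []  # one list of token ids per rule
        for text in rule_texts:
            ids = []
            for token in text.lower().split():
                tid = dictionary.get(token)
                if tid is None:
                    tid = len(tokens_by_tid)
                    dictionary[token] = tid
                    tokens_by_tid.append(token)
                ids.append(tid)
            rules_tokens_ids.append(ids)
        return dictionary, tokens_by_tid, rules_tokens_ids

With scores in place, the new option can then be used along the lines of scancode --license --license-score 80 <path> to drop approximate matches scoring below 80 (the exact invocation is assumed beyond what the commit message states).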
pombredanne added a commit that referenced this issue Mar 11, 2016
@pombredanne (Member, Author)

There is a weird heisenbug in the Windows tests on AppVeyor... I cannot reproduce it anywhere else:

rules_tokens_ids = [[32, 8, 110, 108, 104, 10, ...]]
dictionary = {'1': 0, '2': 1, '3': 2, '4': 3, ...}
tokens_by_tid = ('1', '2', '3', '4', '5', 'a', ...)
frequencies_by_tid = (1, 1, 1, 1, 1, 3, ...), length = 9, with_checks = True

    def renumber_token_ids(rules_tokens_ids, dictionary, tokens_by_tid, frequencies_by_tid, length=9, with_checks=True):
        """
        Return updated index structures with new token ids such that the most common
        aka. 'junk' tokens have the lowest ids.

        `rules_tokens_ids` is a mapping of rule_id->sequence of token ids

        These common tokens are based on a curated list of frequent words and
        further refined such that:
         - no rule text sequence is composed entirely of these common tokens.
         - none or only a few rule text sub-sequences of `length` tokens (aka
           ngrams) are composed entirely of these common tokens.

        The returned structures are:
        - old_to_new: mapping of (old token id->new token id)
        - len_junk: the highest id of a junk token
        - dictionary (token string->token id)
        - tokens_by_tid (token id->token string)
        - frequencies_by_tid (token id->frequency)
        """
        # keep track of very common junk tokens: digits and single letters
        very_common = set()
        for tid, token in enumerate(tokens_by_tid):
            # DIGIT TOKENS: Treat tokens composed only of digits as common junk
            # SINGLE ASCII LETTER TOKENS: Treat single ASCII letter tokens as common junk

            # TODO: ensure common numbers as strings are always there (one, two, and first, second, etc.)
>           if token.isdigit() or (len(token) == 1 and token in string.lowercase):
E           UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)

@pombredanne (Member, Author)

The issue on Windows at AppVeyor is likely due to the default encoding: https://www.google.com/search?q=sys.getdefaultencoding%20problem%20on%20windows
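One plausible reading of that traceback, sketched with assumed details: in Python 2, string.lowercase is locale-dependent, so on a Windows box with a non-C locale it can grow extra non-ASCII bytes after positions 0-25 ('a'..'z'). Testing a unicode token with `in` against that byte string forces an implicit ASCII decode, which fails exactly at position 26. string.ascii_lowercase is locale-independent and avoids this:

    # Minimal Python 2 sketch of the suspected failure mode (assumed, not a
    # confirmed reproduction). string.lowercase is locale-dependent and may
    # hold non-ASCII bytes past position 25; string.ascii_lowercase never does.
    import string

    token = u'x'  # tokens are unicode strings here

    # On an affected Windows locale, string.lowercase can look like
    # 'abcdefghijklmnopqrstuvwxyz\x83...' (a plain byte string), and
    # `token in string.lowercase` implicitly decodes it as ASCII, raising:
    #   UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26

    # Locale-independent check against the fixed 26 ASCII letters instead:
    if len(token) == 1 and token in string.ascii_lowercase:
        print('single ASCII letter token: treat as common junk')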

pombredanne added a commit that referenced this issue Jun 21, 2016
 * also use matcher instead of _type
 * refine filtering and merging
pombredanne added a commit that referenced this issue Jun 21, 2016
pombredanne added a commit that referenced this issue Jun 21, 2016
pombredanne added a commit that referenced this issue Jun 21, 2016
 * also refine the way rule identifiers are assigned
   when testing rules
pombredanne added a commit that referenced this issue Jun 21, 2016
 * also improve duplicate rules detection
 * use cPickle for dumps. No more recursion shenanigans
 * compute candidates over all rules: regular_rids and small_rids
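A minimal sketch of the cPickle dumps and the on-disk index caching mentioned here and in the March commits (the path and function names are hypothetical):

    # Hypothetical cache-on-disk helper: build the index once, dump it with
    # the C-implemented cPickle, and reload it on later runs so startup can
    # skip re-indexing.
    import os
    import cPickle  # Python 2; the plain pickle module on Python 3

    CACHE_FILE = '.license_index.pickle'  # hypothetical cache location

    def get_index(build_index):
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE, 'rb') as f:
                return cPickle.load(f)
        index = build_index()
        with open(CACHE_FILE, 'wb') as f:
            cPickle.dump(index, f, protocol=2)  # compact binary protocol
        return index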
pombredanne added a commit that referenced this issue Jun 21, 2016
 * we are still missing pre-built wheels for Win & Mac
pombredanne added a commit that referenced this issue Jun 22, 2016
pombredanne added a commit that referenced this issue Jun 22, 2016
 * until we have stable working cross-OS ones
   this commit removes the intbitset and pyahocorasick wheels
pombredanne added a commit that referenced this issue Jun 22, 2016
 * listing actual rules files that seem to be missing
pombredanne added a commit that referenced this issue Jun 22, 2016
 * restoring travis config to normal
pombredanne added a commit that referenced this issue Aug 5, 2016
 * upstream maintainers have fixed all the pending issues we submitted
@pombredanne (Member, Author)

This is all merged in develop now, and I am closing this issue.

pombredanne added a commit that referenced this issue Oct 6, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Oct 6, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>