-
Notifications
You must be signed in to change notification settings - Fork 2
plusbzz/cpe-vendor-match
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Code for an interesting NLP problem. sample_input.csv contains 3 columns: Vendor, Product and Version. These entries have been created by humans, so are not very consistent. dict.csv contains a standard dictionary. This has been created by parsing the XML standard provided by NIST at http://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.2.xml We would like to match each entry in input.csv to an accurate entry in dict.csv. If there is no matching entry in the dictionary, we would like to output NA. match_cpe.py is my attempt at solving this problem using NLP techniques. Usage: ./match_cpe.py <input csv file> <dictionary csv file> Method: I use the product of Levenshtein distance and Jaccard distance as a distance metric. I match each entry in the input to all entries in the dictionary that start with the same letter (as an optimization) and then return the dictionary entry with the closest (smallest) value of the distance metric. Why the product of the two distances? My hunch was that a good match would have both distances very small. Since the Jaccard distance is always between 0 and 1 and the Levenshtein distance is an arbitrary integer, their product would amplify the separation of scores between good and bad matches. It seems to work when you look at the results.
About
Python code for an interesting NLP problem
Resources
Stars
Watchers
Forks
Releases
No releases published