GitHub - plusbzz/cpe-vendor-match: Python code for an interesting NLP problem

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README		README
dict.csv		dict.csv
match_cpe.py		match_cpe.py
sample_input.csv		sample_input.csv
sample_output.csv		sample_output.csv

Repository files navigation

Code for an interesting NLP problem. 

sample_input.csv contains 3 columns: Vendor, Product and Version. 
These entries have been created by humans, so are not very consistent.

dict.csv contains a standard dictionary. This has been created by parsing 
the XML standard provided by NIST at 
http://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.2.xml

We would like to match each entry in input.csv to an accurate entry in 
dict.csv. If there is no matching entry in the dictionary, we would like 
to output NA.

match_cpe.py is my attempt at solving this problem using NLP techniques.

Usage:

./match_cpe.py <input csv file> <dictionary csv file>

Method:

I use the product of Levenshtein distance and Jaccard distance as a distance metric. 
I match each entry in the input to all entries in the dictionary that start 
with the same letter (as an optimization) and then return the dictionary entry with the 
closest (smallest) value of the distance metric.

Why the product of the two distances? My hunch was that a good match would have both
distances very small. Since the Jaccard distance is always between 0 and 1 and the
Levenshtein distance is an arbitrary integer, their product would amplify the separation
of scores between good and bad matches. It seems to work when you look at the results.