Code for an interesting NLP problem.
sample_input.csv contains 3 columns: Vendor, Product and Version.
These entries were created by humans, so they are not very consistent.
dict.csv contains a standard dictionary. It was created by parsing
the official CPE dictionary (XML) provided by NIST at
http://static.nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.2.xml
We would like to match each entry in sample_input.csv to an accurate entry in
dict.csv. If there is no matching entry in the dictionary, we would like
to output NA.
match_cpe.py is my attempt at solving this problem using NLP techniques.

Usage:
    ./match_cpe.py <input csv file> <dictionary csv file>

Method:
I use the product of the Levenshtein distance and the Jaccard distance as the distance metric.
I compare each input entry only against the dictionary entries that start
with the same letter (an optimization that shrinks the search space) and then return the
dictionary entry with the smallest value of the distance metric.
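
A minimal sketch of the method described above, not the actual contents of match_cpe.py. Several details are assumptions: the Jaccard distance is taken over sets of characters (this README does not say whether characters or tokens are used), comparison is case-insensitive, and NA is returned only when no dictionary entry shares the first letter. The function names and the sample dictionary entries are made up for illustration.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over character sets (an assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def combined_distance(a, b):
    """Product of the two distances, as described in the Method section."""
    return levenshtein(a, b) * jaccard_distance(a, b)


def build_index(dictionary):
    """Group dictionary entries by their first letter."""
    index = defaultdict(list)
    for entry in dictionary:
        if entry:
            index[entry[0].lower()].append(entry)
    return index


def best_match(query, index):
    """Return the same-first-letter entry with the smallest distance, or NA."""
    query = query.strip().lower()
    if not query:
        return "NA"
    candidates = index.get(query[0], [])
    if not candidates:
        return "NA"
    return min(candidates, key=lambda e: combined_distance(query, e.lower()))


# Hypothetical dictionary entries, for illustration only.
index = build_index(["adobe reader", "apache tomcat", "mozilla firefox"])
print(best_match("Adobe Readr", index))
print(best_match("zzz", index))
```

Under the character-set assumption, the Jaccard distance is 0 whenever two strings use the same set of characters, so the product is 0 even if the strings differ. A token-based Jaccard distance would behave differently.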
Why the product of the two distances? My hunch was that a good match would have both
distances very small. Since the Jaccard distance is always between 0 and 1 and the
Levenshtein distance is an arbitrary non-negative integer, their product should amplify
the separation of scores between good and bad matches. Judging by the results, it seems to work.
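
A small arithmetic illustration of the amplification argument. The distance values below are hypothetical, chosen only to show the effect, and are not taken from the actual results.

```python
# Hypothetical (Levenshtein, Jaccard) distance pairs, for illustration only.
good = (1, 0.125)  # a near match: both distances small
bad = (8, 0.75)    # a poor match: both distances large


def product(pair):
    lev, jac = pair
    return lev * jac


lev_ratio = bad[0] / good[0]                # separation using Levenshtein alone
jac_ratio = bad[1] / good[1]                # separation using Jaccard alone
prod_ratio = product(bad) / product(good)   # separation using the product

print(lev_ratio, jac_ratio, prod_ratio)     # 8.0 6.0 48.0
```

The ratio of the products is the product of the individual ratios, so the gap between a good and a bad match widens whenever both distances agree.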