Javier Garcia-Bernardo, 2017
CODE AND FIGURES HERE: match_strings.ipynb
TODO:
- Make it more elegant/flexible, this is recycled code from many years ago.
- Big data to avoid comparing all names in database 1 to all names in database 2. This can be achieved neatly with LSH forests (see see here for the current implementation
- Libraries
pip install distance numpy pandas matplotlib sklearn seaborn python-Levenshtein
- Train and test set: Two files with three columns (string1, string2, 0/1 for match)
- Check the notebook
database1 = "./D/database_1.csv"
database2 = "./D/database_2.csv"
train_data_file = "./D/train.csv"
test_data_file = "./D/test.csv"
tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram = createTFIDF(database1,database2)
clf,clf2 = train(train_data_file,tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram,sep="\t")
predict = test(test_data_file,tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram,clf,clf2,sep="\t")
plot(predict)
- You can then use clf (the SVM) to predict matches between any two strings, you can use the plot with ROC curve to set up your threshold (or let the algorithm find it, but that will depend on your training set).
distances = find_distances(st1,st2)
clf.decision_function(np.array(temp,dtype=float))