Because you can run different classifiers on the same unsorted_{searchtag} images to see how they perform, you can end up with duplicate images in your sorted_{timestamp} folders when you run a classifier multiple times on the SAME searchtag images at different times.
For example: download 100 images tagged robotart and classify them, then download 100 more and classify again. The second run looks at the same root unsorted_robotart dir and classifies all 200 images (into a different sorted_{timestamp} dir, since the time will have changed). BUT if you then run a retrain with 'harvest' enabled, it will take ALL the high-confidence images from the as-yet-unharvested sorted_* dirs, and you get dupes in your training_photos dir.
Potential solution: when running a classifier, look through any unharvested basetag/sorted_{timestamp} dirs for the same image name BEGINNING. Since image names get a score appended, the exact image name won't likely exist (e.g. robotart_2_1234.jpg becomes robotart_2_1234_875.jpg under one classifier, for an 87.5% score, and robotart_2_1234_825.jpg under another). So there is certainly a way to check the filename start for dupes; it just hasn't been done yet.
Current (temp) workaround: manually find dupes by checking filenames against the MAX value from a previous classifier run, and delete the overlap BEFORE a harvest run.