Because you can run different classifiers on the same unsorted_{searchtag} images to see how they perform, you can end up with duplicate images in your sorted_{timestamp} folders when you run a classifier multiple times on the SAME searchtag images at different times.
For example: download 100 images tagged robotart and classify them, then download 100 more and classify again. The second run looks at the same root unsorted_robotart dir and classifies all 200 images (into a different sorted_{timestamp} dir, since the time will have changed). BUT if you then run a retrain with 'harvest' enabled, it will take ALL the high-confidence images from the as-yet-unharvested sorted_* dirs, and you get dupes in your training_photos dir.
Potential solution: when running a classifier, look through any unharvested basetag/sorted_{timestamp} dirs for the same image name BEGINNING. Since image names get a score appended, the exact image name won't likely exist (e.g. robotart_2_1234.jpg becomes robotart_2_1234_875.jpg under one classifier, for an 87.5% score, and robotart_2_1234_825.jpg under another). So there is certainly a way to check the filename start for dupes; it just hasn't been done yet.
Current (temp) workaround: manually find dupes by checking filenames against the MAX value from a previous classifier run, and delete the overlap BEFORE a harvest run.