Removing duplicates using dedupe #463
Conversation
Could you add

data = {}
for work in Work.objects.filter(category__slug='manga'):
    data[work.id] = {'title': work.title, 'vo_title': work.vo_title}
for field in ['title', 'vo_title']:
Prefer to use a tuple here: ('title', 'vo_title'). Tuples are immutable and consume less memory than a list.
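A minimal sketch of the suggested change (the loop body is just a placeholder, it is not shown in the diff):

for field in ('title', 'vo_title'):  # tuple instead of list: immutable, slightly less memory
    ...  # placeholder; the real loop body is not shown in the diff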
help = 'Make training of dedupe with mangas or animes'

def add_arguments(self, parser):
    parser.add_argument('myargs', nargs='+', type=str)
Let's make it a single category argument rather than an array of arguments; otherwise, the automatically generated help won't make much sense.
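Something like this, assuming the argument is renamed to category as suggested (not the final code of the PR):

def add_arguments(self, parser):
    # Single positional argument limited to the supported categories,
    # so the auto-generated --help output reads naturally.
    parser.add_argument('category', choices=['mangas', 'animes'])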
@Stolinger So here is what I have found out:
Lastly, if you could remove the GBR modifications from the PR, that would be great too!
Finally, could we refactor away the code duplicated between anime and manga?
requirements/dev.txt
@@ -6,6 +6,7 @@ django-nose
tensorflow
matplotlib
flake8
dedupe
Could you also pin this to a specific version?
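For example (the version number below is only a placeholder, not the release that was actually pinned):

dedupe==1.7.5  # placeholder version, pin whichever release the command was tested with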
print('# duplicate sets', len(clustered_dupes))

input_file = 'dedupe/'+category+'.csv'
When building a path, it's preferable to use os.path.join.
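For instance (the 'dedupe' directory comes from the PR description, everything else is unchanged):

import os

input_file = os.path.join('dedupe', category + '.csv')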
def dedupe_training(category):
    assert (category in ['mangas', 'animes']),"Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
You can just write category[:-1], it'll do the same (negative indices are rewritten as len(container) + index, so -1 becomes len(container) - 1).
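A quick illustration:

>>> 'mangas'[:-1]
'manga'
>>> 'mangas'[:len('mangas') - 1]
'manga'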
with open(settings_file, 'wb') as sf:
    deduper.writeSettings(sf)

threshold = deduper.threshold(data, recall_weight=2)
What is recall_weight? What is its purpose? Why is it fixed to 2? (Add a comment to explain it, and maybe a FIXME to make it a configurable constant.)
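One way to address this, assuming the dedupe 1.x API where threshold() accepts a recall_weight argument:

# FIXME: make this a configurable constant (Django setting or command option).
# recall_weight sets the precision/recall trade-off when dedupe picks the score
# cutoff: 2 means we care about recall twice as much as precision, i.e. we prefer
# catching more duplicates even if a few false positives slip in.
RECALL_WEIGHT = 2

threshold = deduper.threshold(data, recall_weight=RECALL_WEIGHT)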
def dedupe_training(category):
    assert (category in ['mangas', 'animes']),"Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
It seems you're only using title and vo_title, so let's add a values_list after the filter to make the query faster. I'd also recommend querying the category first, then using the category ID rather than the slug.
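A possible sketch of both suggestions combined (the Category model and the import path are assumptions, check the actual models module):

from mangaki.models import Category, Work  # assumed import path

category_obj = Category.objects.get(slug=category[:-1])
rows = (Work.objects.filter(category=category_obj)
                    .values_list('id', 'title', 'vo_title'))
data = {work_id: {'title': title, 'vo_title': vo_title}
        for work_id, title, vo_title in rows}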
It would be nice to create the Suggestion + WorkCluster objects from the clusters at the end:
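A heavily hedged sketch of what that could look like; the WorkCluster fields used below (a works many-to-many and no other required fields) are guesses, not the actual Mangaki schema:

from mangaki.models import Work, WorkCluster  # assumed import path

for cluster_ids, scores in clustered_dupes:
    cluster = WorkCluster.objects.create()  # assumes no required fields beyond defaults
    cluster.works.set(Work.objects.filter(id__in=cluster_ids))  # assumes a `works` M2M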
Codecov Report
@@ Coverage Diff @@
## master #463 +/- ##
==========================================
- Coverage 71.42% 70.47% -0.96%
==========================================
Files 81 82 +1
Lines 5040 5127 +87
==========================================
+ Hits 3600 3613 +13
- Misses 1440 1514 +74
Continue to review full report at Codecov.
Onegai 🙏
* Search for alternative titles and title when searching a work (#530)
* Add CFM: convex factorization machine (#512)
* Add factorization machine
* Add fastFM to requirements
* Add Cython dependency
* Add comment, remove some logs and upgrade pip
* Add python-dev and libopenblas-dev to requirements (yes yes)
* I should sleep
* Fix minor changes
* Add auto-fix, re-open, close buttons to Mangaki Fix (#527)
* Add auto-fix, re-open, close buttons to Mangaki Fix
* Fallback gracefully in case of external modifications of the state
* Mark MAL as source of truth for further MAL imports (#532)
* Add forgotten dep on fastFM in setup.py (#537)
* Removing duplicates using dedupe (#463)
* Adding management command to train dedupe
* Removing debugging and changing saving directory
* Output file now supported to retrieve clusters and scores
* adding dedupe to dev.txt, changing handler and argument sparser
* adding dedupe requirement in the right file, making some changes to dedupe.py
* Add DATA_DIR
* Up numpy
* Fix Ansible config (LOL) (#541)
* Fix stupid syntax error
* Update README
* Rewrite user profile settings (legacy routes killed, replaced by Vue.js + API) (#531) Introduce Vue.js on settings. Kill some legacy routes. Have fun with ugly views.
* Test recommendation endpoint and saving of snapshots (#539)
* Test recommendation endpoint and saving of snapshots
* Fix annoying module problem
* Fix test
* Decode UTF-8 HTTP response
* Test DPP (#538)
* Test DPP
* Actually test DPP
* Upgrade to AniList's API v2 (#535)
* Handle AniList's API v2
* Suppress the now useless AniListRichEntry
* Get a work by ID or title in the same method
* Properly handle seasonal animes from AniList
* Update tests for the new version of the API
* Shortcuts on Mangaki (#389)
* Initial shortcuts ; proof of concept
* Plug into onchange event rather than rolling out our own custom event
* Add keyboard shortcut settings and cheatsheet
* Add migration for new profile setting
* Add new recommendation algorithm FMA (#549)
* Add new recommendation algorithm FMA
* Nit
* Add comments
* Fix hypothesis version and fix error due to upgrade (#558)
* Delete SearchIssue model
* Delete Neighbour model
* Delete Event/Attendee/Location models
* Actually delete events
* 🔫 Google Analytics
* Add new ALS (#544)
* Add new ALS
* Change nature of data
* Speed up ALS3
* Add references to algorithms
* GDPR compliancy — Step 1 (Export features) (#553) Export features for users.
* Unreviewed: Fix setupfile → sendfile typo in setup.py
* Add X-SendFile for NGINX on Ansible playbooks (#563)
* GDPR compliancy — Step 3 (frontend and account deletion) (#564)
* MAL duplication problem test (#565)
* Improve MAL accuracy (#533)
* WIP: GDPR compliancy — Step 4 (frontend polish & copy changes) (#566)
* Frontend factor: split profile into profile-works and profile-preferences
* Fix anonymous profile view
* Prevent Flash of Uncompiled Content with the modal
* Opt-in for newsletter and research by design
* Explain use of personal data on signup page (#560)
* Add explanation on signup page
* Remove toggle from FAQ
* Fix migrations
This adds a management command to train dedupe.io's dedupe library.
The command should be: ./manage.py dedupe {mangas, animes}
You might have to create a 'dedupe' folder in ./mangaki (this is where training data and outputs will be saved).
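For reference, a rough sketch of what the training flow looks like with the dedupe 1.x API; the field definitions and the sample size are illustrative assumptions, not the exact code of this PR:

import dedupe

# `data` maps work IDs to dicts with the compared fields (title, vo_title).
fields = [
    {'field': 'title', 'type': 'String'},
    {'field': 'vo_title', 'type': 'String', 'has missing': True},
]

deduper = dedupe.Dedupe(fields)
deduper.sample(data, 15000)      # draw candidate pairs to label
dedupe.consoleLabel(deduper)     # interactive yes/no labelling in the terminal
deduper.train()

threshold = deduper.threshold(data, recall_weight=2)
clustered_dupes = deduper.match(data, threshold)
print('# duplicate sets', len(clustered_dupes))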