Removing duplicates using dedupe #463

Merged · merged 9 commits into master from cham/dedupe on Jan 6, 2018

Conversation

@Stolinger (Contributor)

Adds a management command to train dedupe.io.
The command is ./manage.py dedupe {mangas, animes}

You might have to create a 'dedupe' folder in ./mangaki (this is where training data and outputs will be saved).

@Stolinger changed the title from Cham/dedupe to Removing duplicates using dedupe on Aug 14, 2017
@RaitoBezarius (Member)

Could you add dedupe to requirements/production.txt?

```python
data = {}
for work in Work.objects.filter(category__slug='manga'):
    data[work.id] = {'title': work.title, 'vo_title': work.vo_title}
    for field in ['title', 'vo_title']:
```

Member:

Prefer a tuple here: ('title', 'vo_title'). Tuples are immutable and consume less memory than a list.
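
A minimal sketch of the suggested change (the loop body is elided):

```python
# Tuple instead of list: immutable, slightly smaller, same iteration behavior.
for field in ('title', 'vo_title'):
    ...
```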

```python
help = 'Make training of dedupe with mangas or animes'

def add_arguments(self, parser):
    parser.add_argument('myargs', nargs='+', type=str)
```

Member:

Let's make it a single category argument rather than an array of arguments; otherwise the automatically generated help won't make much sense.
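
A hedged sketch of that suggestion, inside the command class (the argument name and help text are assumptions):

```python
def add_arguments(self, parser):
    # A single positional argument with explicit choices makes --help readable.
    parser.add_argument('category', type=str, choices=('mangas', 'animes'),
                        help='Category of works to deduplicate')
```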

@RaitoBezarius (Member) commented Aug 16, 2017

@Stolinger So here is what I have found out:

  • It seems like animes_output.csv stores the Python repr of byte strings for the titles; if you take a look, you will see b'something'. I think we should have the file in UTF-8 and store something directly (or som\U00A0thing and so on) — see the sketch after this list.
  • There is no easy way to review the created clusters. On my side, I have a lot of single-item clusters (is that normal?), such as 1318,,1,b'Death Note',,, for instance.
  • Is there a plan to feed these clusters into WorkCluster?
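
A minimal sketch of the UTF-8 point (the file name and helper are illustrative, not the PR's code):

```python
import csv

def to_text(value):
    # Decode raw bytes so the CSV contains readable UTF-8 text, not b'...'.
    return value.decode('utf-8') if isinstance(value, bytes) else value

with open('animes_output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([1318, to_text(b'Death Note')])
```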

Last, if you could also remove the GBR modifications from the PR, that would be great!
(To do so, use git rebase --interactive, drop all unrelated commits, then force-push onto this PR.)

@RaitoBezarius (Member)

Finally, could we refactor the duplicated code between anime and manga?
A lot of it looks the same; could you abstract it into a function and parametrize it where necessary?
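
A rough sketch of that refactor (function name, parameters, and import path are assumptions, not the PR's actual code):

```python
from mangaki.models import Work  # import path assumed

def collect_work_data(category_slug, fields=('title', 'vo_title')):
    # One parametrized helper replaces the near-identical anime/manga branches.
    data = {}
    for work in Work.objects.filter(category__slug=category_slug):
        data[work.id] = {field: getattr(work, field) for field in fields}
    return data

manga_data = collect_work_data('manga')
anime_data = collect_work_data('anime')
```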

```diff
@@ -6,6 +6,7 @@ django-nose
 tensorflow
 matplotlib
 flake8
+dedupe
```

Member:

Could you also pin a version?
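
What a pinned entry could look like (the version shown is a placeholder; the review did not name one):

```
dedupe==1.7.5  # placeholder version; pin to whatever was actually tested
```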


```python
print('# duplicate sets', len(clustered_dupes))

input_file = 'dedupe/' + category + '.csv'
```

Member:

When building paths, it's preferable to use os.path.join.
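
The same path built portably (names taken from the snippet above):

```python
import os

input_file = os.path.join('dedupe', category + '.csv')
```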

```python
def dedupe_training(category):
    assert category in ['mangas', 'animes'], "Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
```

Member:

You can just write category[:-1]; it does the same thing (a negative index is interpreted as len(container) + index, so -1 means len(container) - 1).
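
A quick check of the equivalence:

```python
category = 'mangas'
assert category[:len(category) - 1] == category[:-1] == 'manga'
```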

```python
with open(settings_file, 'wb') as sf:
    deduper.writeSettings(sf)

threshold = deduper.threshold(data, recall_weight=2)
```

Member:

What is recall_weight? What is its purpose? Why is it fixed to 2?
(Add a comment to explain, maybe a FIXME to make it a configurable constant.)
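
One way to address this, sketched under the assumption that dedupe's threshold() weighs recall against precision when choosing a score cutoff:

```python
# FIXME: make this a configurable constant rather than a magic number.
# recall_weight=2 tells dedupe to value recall twice as much as precision
# when choosing the clustering score threshold.
RECALL_WEIGHT = 2
threshold = deduper.threshold(data, recall_weight=RECALL_WEIGHT)
```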

```python
def dedupe_training(category):
    assert category in ['mangas', 'animes'], "Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
```

Member:

It seems you're only using title and vo_title, so let's add a values_list after the filter to make the query faster.
I'd also recommend querying the category first, then filtering on the category ID rather than the slug.
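
A sketch combining both suggestions (the Category model and import paths are inferred from the category__slug lookup above):

```python
from mangaki.models import Category, Work  # import paths assumed

category_obj = Category.objects.get(slug=category[:-1])
rows = (Work.objects
            .filter(category_id=category_obj.id)
            .values_list('id', 'title', 'vo_title'))
data = {pk: {'title': title, 'vo_title': vo_title}
        for pk, title, vo_title in rows}
```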

@jilljenn (Member)

It would be nice to create the Suggestion + WorkCluster objects from the clusters at the end:
https://docs.google.com/document/d/1Irzsu7VSNhUSFCyHyoUTyS2QtjodrwuaOa3aW01Y2_g/edit#

@codecov (bot) commented Aug 25, 2017

Codecov Report

Merging #463 into master will decrease coverage by 0.95%.
The diff coverage is 14.94%.


@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   71.42%   70.47%   -0.96%     
==========================================
  Files          81       82       +1     
  Lines        5040     5127      +87     
==========================================
+ Hits         3600     3613      +13     
- Misses       1440     1514      +74
Impacted Files Coverage Δ
mangaki/mangaki/management/commands/dedupe.py 14.94% <14.94%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jilljenn (Member)

Onegai 🙏 (Japanese for "please")

@jilljenn merged commit 8823ec5 into master on Jan 6, 2018
@jilljenn deleted the cham/dedupe branch on January 6, 2018 at 21:38
jilljenn added a commit that referenced this pull request on May 24, 2018
* Search for alternative titles and title when searching a work (#530)

* Add CFM: convex factorization machine (#512)

* Add factorization machine

* Add fastFM to requirements

* Add Cython dependency

* Add comment, remove some logs and upgrade pip

* Add python-dev and libopenblas-dev to requirements (yes yes)

* I should sleep

* Fix minor changes

* Add auto-fix, re-open, close buttons to Mangaki Fix (#527)

* Add auto-fix, re-open, close buttons to Mangaki Fix

* Fallback gracefully in case of external modifications of the state

* Mark MAL as source of truth for further MAL imports (#532)

* Add forgotten dep on fastFM in setup.py (#537)

* Removing duplicates using dedupe (#463)

* Adding management command to train dedupe

* Removing debugging and changing saving directory

* Output file now supported to retrieve clusters and scores

* adding dedupe to dev.txt, changing handler and argument sparser

* adding dedupe requirement in the right file, making some changes to dedupe.py

* Add DATA_DIR

* Up numpy

* Fix Ansible config (LOL) (#541)

* Fix stupid syntax error

* Update README

* Rewrite user profile settings (legacy routes killed, replaced by Vue.js + API) (#531)

Introduce Vue.js on settings. Kill some legacy routes. Have fun with ugly views.

* Test recommendation endpoint and saving of snapshots (#539)

* Test recommendation endpoint and saving of snapshots

* Fix annoying module problem

* Fix test

* Decode UTF-8 HTTP response

* Test DPP (#538)

* Test DPP

* Actually test DPP

* Upgrade to AniList's API v2 (#535)

* Handle AniList's API v2
* Suppress the now useless AniListRichEntry
* Get a work by ID or title in the same method
* Properly handle seasonal animes from AniList
* Update tests for the new version of the API

* Shortcuts on Mangaki (#389)

* Initial shortcuts ; proof of concept

* Plug into onchange event rather than rolling out our own custom event

* Add keyboard shortcut settings and cheatsheet

* Add migration for new profile setting

* Add new recommendation algorithm FMA (#549)

* Add new recommendation algorithm FMA

* Nit

* Add comments

* Fix hypothesis version and fix error due to upgrade (#558)

* Delete SearchIssue model

* Delete Neighbour model

* Delete Event/Attendee/Location models

* Actually delete events

* 🔫 Google Analytics

* Add new ALS (#544)

* Add new ALS

* Change nature of data

* Speed up ALS3

* Add references to algorithms

* GDPR compliancy — Step 1 (Export features) (#553)

Export features for users.

* Unreviewed: Fix setupfile → sendfile typo in setup.py

* Add X-SendFile for NGINX on Ansible playbooks (#563)

* GDPR compliancy — Step 3 (frontend and account deletion) (#564)

* MAL duplication problem test (#565)

* Improve MAL accuracy (#533)

* WIP: GDPR compliancy — Step 4 (frontend polish & copy changes) (#566)

* Frontend factor: split profile into profile-works and profile-preferences

* Fix anonymous profile view

* Prevent Flash of Uncompiled Content with the modal

* Opt-in for newsletter and research by design

* Explain use of personal data on signup page (#560)

* Add explanation on signup page

* Remove toggle from FAQ

* Fix migrations