Removing duplicates using dedupe #463

Merged · merged 9 commits into master from cham/dedupe on Jan 6, 2018

Conversation

@Stolinger (Contributor)

Adds a management command to train dedupe.io.
The command is ./manage.py dedupe {mangas, animes}

You might have to create a 'dedupe' folder in ./mangaki (this is where training data and outputs will be saved).

@Stolinger changed the title from Cham/dedupe to Removing duplicates using dedupe on Aug 14, 2017
@RaitoBezarius (Member)

Could you add dedupe to requirements/production.txt?

```python
data = {}
for work in Work.objects.filter(category__slug='manga'):
    data[work.id] = {'title': work.title, 'vo_title': work.vo_title}
    for field in ['title', 'vo_title']:
```

Member:

Prefer a tuple here: ('title', 'vo_title'). Tuples are immutable and consume less memory than a list.
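
A minimal sketch of the suggested change (the loop body is elided):

```python
# Tuple instead of list: immutable, slightly smaller, same iteration behavior.
for field in ('title', 'vo_title'):
    ...
```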

```python
help = 'Make training of dedupe with mangas or animes'

def add_arguments(self, parser):
    parser.add_argument('myargs', nargs='+', type=str)
```

Member:

Let's make it a single category argument rather than an array of arguments; otherwise the automatically generated help won't make much sense.
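
A hedged sketch of that suggestion, inside the command class (the argument name and help text are assumptions):

```python
def add_arguments(self, parser):
    # A single positional argument with explicit choices makes --help readable.
    parser.add_argument('category', type=str, choices=('mangas', 'animes'),
                        help='Category of works to deduplicate')
```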

@RaitoBezarius (Member) commented Aug 16, 2017

@Stolinger So here is what I have found out:

  • It seems like animes_output.csv stores the Python repr of byte strings for the titles; if you take a look, you will see b'something'. I think we should have the file in UTF-8 and store something directly (or som\U00A0thing and so on) — see the sketch after this list.
  • There is no easy way to review the created clusters. On my side, I have a lot of single-item clusters (is that normal?), such as 1318,,1,b'Death Note',,, for instance.
  • Is there a plan to feed these clusters into WorkCluster?
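
A minimal sketch of the UTF-8 point (the file name and helper are illustrative, not the PR's code):

```python
import csv

def to_text(value):
    # Decode raw bytes so the CSV contains readable UTF-8 text, not b'...'.
    return value.decode('utf-8') if isinstance(value, bytes) else value

with open('animes_output.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([1318, to_text(b'Death Note')])
```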

Last, if you could also remove the GBR modifications from the PR, that would be great!
(To do so, use git rebase --interactive, drop all unrelated commits, then force-push onto this PR.)

@RaitoBezarius (Member)

Finally, could we refactor the duplicated code between anime and manga?
A lot of it looks the same; could you abstract it into a function and parametrize it where necessary?
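
A rough sketch of that refactor (function name, parameters, and import path are assumptions, not the PR's actual code):

```python
from mangaki.models import Work  # import path assumed

def collect_work_data(category_slug, fields=('title', 'vo_title')):
    # One parametrized helper replaces the near-identical anime/manga branches.
    data = {}
    for work in Work.objects.filter(category__slug=category_slug):
        data[work.id] = {field: getattr(work, field) for field in fields}
    return data

manga_data = collect_work_data('manga')
anime_data = collect_work_data('anime')
```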

```diff
@@ -6,6 +6,7 @@ django-nose
 tensorflow
 matplotlib
 flake8
+dedupe
```

Member:

Could you also pin a version?
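
What a pinned entry could look like (the version shown is a placeholder; the review did not name one):

```
dedupe==1.7.5  # placeholder version; pin to whatever was actually tested
```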


```python
print('# duplicate sets', len(clustered_dupes))

input_file = 'dedupe/' + category + '.csv'
```

Member:

When building paths, it's preferable to use os.path.join.
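
The same path built portably (names taken from the snippet above):

```python
import os

input_file = os.path.join('dedupe', category + '.csv')
```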

```python
def dedupe_training(category):
    assert category in ['mangas', 'animes'], "Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
```

Member:

You can just write category[:-1]; it does the same thing (a negative index is interpreted as len(container) + index, so -1 means len(container) - 1).
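
A quick check of the equivalence:

```python
category = 'mangas'
assert category[:len(category) - 1] == category[:-1] == 'manga'
```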

```python
with open(settings_file, 'wb') as sf:
    deduper.writeSettings(sf)

threshold = deduper.threshold(data, recall_weight=2)
```

Member:

What is recall_weight? What is its purpose? Why is it fixed to 2?
(Add a comment to explain, maybe a FIXME to make it a configurable constant.)
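
One way to address this, sketched under the assumption that dedupe's threshold() weighs recall against precision when choosing a score cutoff:

```python
# FIXME: make this a configurable constant rather than a magic number.
# recall_weight=2 tells dedupe to value recall twice as much as precision
# when choosing the clustering score threshold.
RECALL_WEIGHT = 2
threshold = deduper.threshold(data, recall_weight=RECALL_WEIGHT)
```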

```python
def dedupe_training(category):
    assert category in ['mangas', 'animes'], "Only mangas or animes needs training"
    data = {}
    for work in Work.objects.filter(category__slug=category[:len(category)-1]):
```

Member:

It seems you're only using title and vo_title, so let's add a values_list after the filter to make the query faster.
I'd also recommend querying the category first, then filtering on the category ID rather than the slug.
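
A sketch combining both suggestions (the Category model and import paths are inferred from the category__slug lookup above):

```python
from mangaki.models import Category, Work  # import paths assumed

category_obj = Category.objects.get(slug=category[:-1])
rows = (Work.objects
            .filter(category_id=category_obj.id)
            .values_list('id', 'title', 'vo_title'))
data = {pk: {'title': title, 'vo_title': vo_title}
        for pk, title, vo_title in rows}
```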

@jilljenn (Member)

It would be nice to create the Suggestion + WorkCluster objects from the clusters at the end:
https://docs.google.com/document/d/1Irzsu7VSNhUSFCyHyoUTyS2QtjodrwuaOa3aW01Y2_g/edit#

@codecov (bot) commented Aug 25, 2017

Codecov Report

Merging #463 into master will decrease coverage by 0.95%.
The diff coverage is 14.94%.


@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   71.42%   70.47%   -0.96%     
==========================================
  Files          81       82       +1     
  Lines        5040     5127      +87     
==========================================
+ Hits         3600     3613      +13     
- Misses       1440     1514      +74
Impacted Files Coverage Δ
mangaki/mangaki/management/commands/dedupe.py 14.94% <14.94%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jilljenn (Member)

Onegai 🙏 (Japanese for "please")

@jilljenn merged commit 8823ec5 into master on Jan 6, 2018
@jilljenn deleted the cham/dedupe branch on January 6, 2018 at 21:38
jilljenn added a commit that referenced this pull request on May 24, 2018
* Search for alternative titles and title when searching a work (#530)

* Add CFM: convex factorization machine (#512)

* Add factorization machine

* Add fastFM to requirements

* Add Cython dependency

* Add comment, remove some logs and upgrade pip

* Add python-dev and libopenblas-dev to requirements (yes yes)

* I should sleep

* Fix minor changes

* Add auto-fix, re-open, close buttons to Mangaki Fix (#527)

* Add auto-fix, re-open, close buttons to Mangaki Fix

* Fallback gracefully in case of external modifications of the state

* Mark MAL as source of truth for further MAL imports (#532)

* Add forgotten dep on fastFM in setup.py (#537)

* Removing duplicates using dedupe (#463)

* Adding management command to train dedupe

* Removing debugging and changing saving directory

* Output file now supported to retrieve clusters and scores

* adding dedupe to dev.txt, changing handler and argument sparser

* adding dedupe requirement in the right file, making some changes to dedupe.py

* Add DATA_DIR

* Up numpy

* Fix Ansible config (LOL) (#541)

* Fix stupid syntax error

* Update README

* Rewrite user profile settings (legacy routes killed, replaced by Vue.js + API) (#531)

Introduce Vue.js on settings. Kill some legacy routes. Have fun with ugly views.

* Test recommendation endpoint and saving of snapshots (#539)

* Test recommendation endpoint and saving of snapshots

* Fix annoying module problem

* Fix test

* Decode UTF-8 HTTP response

* Test DPP (#538)

* Test DPP

* Actually test DPP

* Upgrade to AniList's API v2 (#535)

* Handle AniList's API v2
* Suppress the now useless AniListRichEntry
* Get a work by ID or title in the same method
* Properly handle seasonal animes from AniList
* Update tests for the new version of the API

* Shortcuts on Mangaki (#389)

* Initial shortcuts ; proof of concept

* Plug into onchange event rather than rolling out our own custom event

* Add keyboard shortcut settings and cheatsheet

* Add migration for new profile setting

* Add new recommendation algorithm FMA (#549)

* Add new recommendation algorithm FMA

* Nit

* Add comments

* Fix hypothesis version and fix error due to upgrade (#558)

* Delete SearchIssue model

* Delete Neighbour model

* Delete Event/Attendee/Location models

* Actually delete events

* 🔫 Google Analytics

* Add new ALS (#544)

* Add new ALS

* Change nature of data

* Speed up ALS3

* Add references to algorithms

* GDPR compliancy — Step 1 (Export features) (#553)

Export features for users.

* Unreviewed: Fix setupfile → sendfile typo in setup.py

* Add X-SendFile for NGINX on Ansible playbooks (#563)

* GDPR compliancy — Step 3 (frontend and account deletion) (#564)

* MAL duplication problem test (#565)

* Improve MAL accuracy (#533)

* WIP: GDPR compliancy — Step 4 (frontend polish & copy changes) (#566)

* Frontend factor: split profile into profile-works and profile-preferences

* Fix anonymous profile view

* Prevent Flash of Uncompiled Content with the modal

* Opt-in for newsletter and research by design

* Explain use of personal data on signup page (#560)

* Add explanation on signup page

* Remove toggle from FAQ

* Fix migrations