duplicates #4

Closed
GoogleCodeExporter opened this issue Jan 13, 2016 · 4 comments

@GoogleCodeExporter

Duplicate filtering:
We compute a hash from each sentence with its punctuation removed and store it in an extra field.
At indexing time, Lucene should check whether the sentence has already occurred (by searching for the hash),
and if it has, we record in an extra field that it is a duplicate.
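
The idea above can be sketched roughly as follows. This is only an illustration, not the project's code: the function names, the choice of SHA-1, the use of both sentence hashes as the lookup key, and the in-memory set standing in for the Lucene index lookup are all assumptions.

```python
import hashlib


def normalize(sentence: str) -> str:
    # Keep letters, digits and whitespace only, then collapse whitespace.
    kept = "".join(ch for ch in sentence if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def sentence_hash(sentence: str) -> str:
    # SHA-1 is an arbitrary choice for this sketch (assumption).
    return hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()


def mark_duplicates(records):
    # `seen` stands in for "search the index for this hash" (assumption).
    seen = set()
    for record in records:
        key = (sentence_hash(record["hu_sentence"]),
               sentence_hash(record["en_sentence"]))
        record["is_duplicate"] = key in seen
        seen.add(key)
    return records


records = mark_duplicates([
    {"hu_sentence": "- Csodálatos.", "en_sentence": "Beautiful."},
    {"hu_sentence": "- Csodálatos.", "en_sentence": "- Beautiful."},
    {"hu_sentence": "Szia.", "en_sentence": "Hi."},
])
for record in records:
    print(record["en_sentence"], record["is_duplicate"])
```

The first occurrence of a sentence pair stays unflagged; only later occurrences with the same normalized text are marked.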

Original issue reported on code.google.com by bpgergo on 2 Oct 2009 at 5:46

@GoogleCodeExporter

We also need some minimal heuristic for deciding which of several duplicates to show.
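
One possible minimal heuristic, sketched under assumptions: the thread does not say which rule was used, so the ranking here (most upvotes first, then lowest id) and the function name are hypothetical.

```python
from collections import defaultdict


def pick_representatives(rows):
    # Group rows by sentence hash, then pick one row per group.
    groups = defaultdict(list)
    for row in rows:
        groups[row["hash"]].append(row)
    # Assumed rule: prefer more upvotes (None counts as 0), then the lowest id.
    return {
        h: min(group, key=lambda r: (-(r["upvotes"] or 0), r["id"]))
        for h, group in groups.items()
    }


rows = [
    {"id": 1332913, "hash": -1975168042, "upvotes": None},
    {"id": 1291330, "hash": -1975168042, "upvotes": None},
]
print(pick_representatives(rows)[-1975168042]["id"])
```

With no votes on either row, the tie falls through to the id, so the earlier row is the one shown.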

Original comment by bpgergo on 2 Oct 2009 at 5:46

@GoogleCodeExporter

This is done.
It consists of two parts: updateHashCode in Bisen.java,
and flagDuplicates in controll_harness.py.

One bug remains: the "-" character is not filtered out.
For example:
http://kozel.mokk.bme.hu:8080/hunglish/search?huSentence=csod%C3%A1latos&enSentence=beautiful&doc.genre=-10



mysql> select * from bisen where id in (1291330, 1332913);
+---------+---------+-----------+--------------+---------------+------------+-------------+---------+------+------------------+------------------+--------------+---------------------+-----------+----------+
| id      | version | downvotes | en_sentence  | hu_sentence   | is_indexed | line_number | upvotes | doc  | en_sentence_hash | hu_sentence_hash | is_duplicate | indexed_timestamp   | copyright | approved |
+---------+---------+-----------+--------------+---------------+------------+-------------+---------+------+------------------+------------------+--------------+---------------------+-----------+----------+
| 1291330 |       1 |      NULL | Beautiful.   | - Csodálatos. | �          |         977 |    NULL |  317 |       -625700480 |      -1975168042 |              | 2011-01-20 15:12:52 | C         | N        |
| 1332913 |       1 |      NULL | - Beautiful. | - Csodálatos. | �          |          73 |    NULL |  373 |       1272378659 |      -1975168042 |              | 2011-01-20 15:27:32 | C         | N        |
+---------+---------+-----------+--------------+---------------+------------+-------------+---------+------+------------------+------------------+--------------+---------------------+-----------+----------+


Proposed fix:
The stripPunctuation method in Bisen.java needs to be fixed.
It currently uses:
http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isLetterOrDigit(char)
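
For illustration, here is a small Python sketch of a normalization that treats "- Beautiful." and "Beautiful." as the same sentence. It is not the fix applied to Bisen.java, whose code is not shown in this thread; the function name and the whitespace handling are assumptions, and Python's str.isalnum is used as a rough counterpart of a letter-or-digit test.

```python
def strip_punctuation(sentence: str) -> str:
    # Replace every character that is not a letter or digit with a space,
    # then collapse runs of whitespace and trim both ends, so a leading
    # "- " leaves nothing behind.
    kept = "".join(ch if ch.isalnum() else " " for ch in sentence)
    return " ".join(kept.split())


print(strip_punctuation("Beautiful."))
print(strip_punctuation("- Beautiful."))
print(strip_punctuation("- Csodálatos."))
```

Trimming after the replacement matters here: dropping the dash while keeping the space that follows it would still leave the two English sentences with different normalized forms.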


Original comment by bpgergo on 23 Jan 2011 at 6:50

@GoogleCodeExporter

Duplicate filtering is now done entirely by the harness.

Original comment by bpgergo on 1 Mar 2011 at 5:41

  • Changed state: New

@GoogleCodeExporter

Original comment by Varga.Da...@gmail.com on 1 Mar 2011 at 5:54

  • Changed state: Verified
