Allow Wildcards in Searches #441

Closed
GoogleCodeExporter opened this Issue Mar 14, 2015 · 14 comments

Collaborator

GoogleCodeExporter commented Mar 14, 2015

I used to own a Dell Axim running Windows Mobile that had an incredibly
powerful J-E dictionary. What made it so powerful, aside from its massive size,
was the ability to include wildcards in searches: '?' represented exactly one
character, while '*' represented zero or more characters.

A search for 'た?り' would pull up 'たより', 'たまり', 'たもり', etc.
Similarly, searching for 'た*り' pulls up all of the above, plus many others,
including 'たっぷり', 'たり', 'たんまり', 'たちまわり', etc.

I know the "Jishou" app provides limited '*' wildcard searches, and that the
"Akebi" app provides unfortunately inconsistent wildcard searches using both
'*' and '?'.

In Aedict, this type of search would be most useful when conducting "exact"
searches. You could replicate your "begins with" and "ends with" search
functionality by placing an asterisk at the end or beginning respectively,
which could remove some clutter from the interface.

I don't know if your databases could support such a search or how difficult it
would be to code, but if you could pull it off, Aedict would remain
permanently unbeatable.

Original issue reported on code.google.com by jebj...@gmail.com on 8 Mar 2015 at 2:40

Collaborator

GoogleCodeExporter commented Mar 14, 2015

Yes, wildcard searches would be extremely helpful when you are looking for
words where you are missing the middle kanji.

Original comment by wonc...@googlemail.com on 10 Mar 2015 at 11:53

Collaborator

GoogleCodeExporter commented Mar 14, 2015

Thanks for the idea, I think that this actually can be implemented. Let me
check whether this is possible.

Original comment by martin.v...@gmail.com on 13 Mar 2015 at 3:51

Owner

mvysny commented May 8, 2015

Sorry mate, I just tried such searches and Lucene finds nothing, for unknown reasons. E.g. I tried to search for 見? in the hope of finding miru, but Lucene found nothing. So I have to mark this as won't fix :(

@mvysny mvysny closed this May 8, 2015

@mvysny mvysny added the wontfix label May 8, 2015

@mvysny mvysny reopened this May 8, 2015

@mvysny mvysny closed this May 8, 2015

pouncingant commented Sep 15, 2015

I see the wontfix and the reasoning. For the above "見?" example, would it be possible to do a "..見.." search and then filter the results through some code, presenting only those that fit the "見?" pattern, rather than relying on the function being built into Lucene?
I also see there are thousands of open issues, and I guess you are unlikely to have time to work on this, but to be frank, I have yet to find a dictionary that is as useful for technical translation on the go, and this change could be very handy for poorly photocopied text.

@mvysny mvysny removed the wontfix label Sep 16, 2015

Owner

mvysny commented Sep 16, 2015

Hmm, so basically if you put 見* or 見? into the search box, this will override the search setting and will activate the ..見.. search? This is a great idea! Even better:
Any text ending with '*' or '?' will activate the "見.." search; any text starting with '*' or '?' will activate the "..見" search; any text both starting and ending with '*' or '?' will activate the "..見..". What do you think?
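
The rule proposed above could be sketched roughly as follows. This is only an illustration, not Aedict's code: the class `WildcardMode`, the enum `Mode`, and the method `parse` are hypothetical names, and `AS_CONFIGURED` stands for "no leading or trailing wildcard, so keep whatever search setting the user selected".

```java
/** Hypothetical sketch: maps leading/trailing wildcards to a search mode. */
final class WildcardMode {
    enum Mode { AS_CONFIGURED, STARTS_WITH, ENDS_WITH, CONTAINS }

    /** The mode to activate, plus the query with its edge wildcards stripped. */
    static final class Parsed {
        final Mode mode;
        final String term;

        Parsed(Mode mode, String term) {
            this.mode = mode;
            this.term = term;
        }
    }

    private static boolean isWildcard(char c) {
        return c == '*' || c == '?';
    }

    static Parsed parse(String query) {
        int start = 0;
        int end = query.length();
        // Skip wildcards at the beginning and at the end of the query.
        while (start < end && isWildcard(query.charAt(start))) start++;
        while (end > start && isWildcard(query.charAt(end - 1))) end--;

        boolean leading = start > 0;
        boolean trailing = end < query.length();
        Mode mode;
        if (leading && trailing) mode = Mode.CONTAINS;   // "..X.."
        else if (trailing) mode = Mode.STARTS_WITH;      // "X.."
        else if (leading) mode = Mode.ENDS_WITH;         // "..X"
        else mode = Mode.AS_CONFIGURED;                  // no edge wildcard
        return new Parsed(mode, query.substring(start, end));
    }
}
```

With this sketch, `WildcardMode.parse("見*")` gives `STARTS_WITH` with the term 見, `parse("*見")` gives `ENDS_WITH`, and `parse("?見?")` gives `CONTAINS`. Wildcards in the middle of the word are left in the term untouched, since this shortcut alone cannot honour them.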

pouncingant commented Sep 16, 2015

Sounds good to me!

Owner

mvysny commented Sep 16, 2015

Great, thanks, I will implement this. Unfortunately, ? or * in the middle of the word will be ignored :(

pouncingant commented Sep 16, 2015

Hmm. Fair enough. Though I can imagine mid-word wildcards being useful for a poor photocopy or handwriting, those are fairly rare occurrences in my line of work.
Forgive me for backtracking, but I assume you're aware of regular expressions, hence the use of "*" and "?" in the first post?

Certainly, you could make wildcards shortcuts to the "ends-with" and "begins-with" functions, and I'm sure that would be much faster to implement. However, assuming you have some way of filling an array of strings with the results of a "contains X" search, it should be fairly trivial to filter that array by whether each string matches the pattern in the search box. Regular expression matches like this are easy to implement in Java, though I'll admit I've never done so for Japanese text (maybe the reference http://stackoverflow.com/questions/13876955/regex-that-allows-chinese-characters is useful).

To be totally clear, this is how I imagine the code (though I've no idea how this might fit into the existing code; my apologies if this is entirely useless to you):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// the following variable would be the search string at the moment the search was initiated
String example_search = "電??学"; // should match "電磁気学"
//String example_search = "電*学";  // should also match "電磁気学"

String temporary_search = "";

// check if there are wildcards
if (example_search.contains("?") || example_search.contains("*")) {
  // there are wildcards. Get the longest contiguous string of searchable characters
  String[] s = example_search.split("[*?]"); // split the query using * and ? as delimiters
  for (int i = 0; i < s.length; i++) {
    if (s[i].length() > temporary_search.length()) {
      temporary_search = s[i];
    }
  }
  aedictSearchFor(temporary_search); // placeholder: do a normal search on the longest contiguous normal string

  // get a list of the search results (placeholder):
  String[] contains_X_search_results = getAllSearchResultsAsStringArray();

  // the wildcard string is not a valid regex as it stands ('?' and '*' are quantifiers there),
  // so translate it first: '?' becomes '.', '*' becomes '.*', everything else is quoted literally
  String regex = Pattern.quote(example_search)
      .replace("?", "\\E.\\Q")
      .replace("*", "\\E.*\\Q");
  Pattern pattern = Pattern.compile(regex);

  // keep only the results that match the whole pattern
  List<String> results_to_display = new ArrayList<>();
  for (String result : contains_X_search_results) {
    if (pattern.matcher(result).matches()) {
      results_to_display.add(result);
    }
  }
  displayResults(); // placeholder
} else carryOnAsUsual(); // placeholder

Owner

mvysny commented Sep 16, 2015

Hmm, this is a good idea. Instead of searching for the longest string, I can perhaps search for all components in s[], say, 電 AND 学, and then filter out any entries not matching 電??学. Let me play with this a bit.
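
A minimal sketch of that idea, under stated assumptions: the in-memory `entries` list stands in for the real Lucene index, and `WildcardFilter`, `literalParts`, `toPattern`, and `search` are hypothetical names, not Aedict's API. Every literal piece of the query must occur in an entry (the "電 AND 学" step), and the regex then confirms the exact shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: prefilter on all literal pieces, then confirm with a regex. */
final class WildcardFilter {
    /** Splits the query on '*' and '?' and keeps the non-empty literal pieces. */
    static List<String> literalParts(String query) {
        List<String> parts = new ArrayList<>();
        for (String part : query.split("[*?]")) {
            if (!part.isEmpty()) parts.add(part);
        }
        return parts;
    }

    /** Translates the wildcard query: '?' becomes '.', '*' becomes '.*', the rest is literal. */
    static Pattern toPattern(String query) {
        String regex = Pattern.quote(query)
                .replace("?", "\\E.\\Q")
                .replace("*", "\\E.*\\Q");
        return Pattern.compile(regex);
    }

    /** Stand-in for the index lookup: AND over the literal pieces, then a whole-word regex match. */
    static List<String> search(List<String> entries, String query) {
        List<String> parts = literalParts(query);
        Pattern pattern = toPattern(query);
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            boolean containsAll = true;
            for (String part : parts) {
                if (!entry.contains(part)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll && pattern.matcher(entry).matches()) hits.add(entry);
        }
        return hits;
    }
}
```

For the entries 電磁気学, 電気, 大学, 電学, and 電子工学, the query 電??学 keeps 電磁気学 and 電子工学, while 電*学 additionally keeps 電学. In a real index the AND step would be a single query rather than a loop; the loop here only imitates its effect.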

pouncingant commented Sep 16, 2015

Well, the motivation for using the longest string was just to minimize the size of the last for-loop. You could use any non-wildcard component of the query to perform a search, and then only show those that pass the .matches() test; however, I imagine some characters will lead to a very long list of results to filter.

Owner

mvysny commented Sep 16, 2015

The regexp matching is inevitable anyway; to trim down the search results, it is good to include all strings in the query I believe.

pouncingant commented Sep 16, 2015

If you're saying that the aedict query function can accept multiple arguments in the query to narrow down the results list, then that's great, and indeed removes the need for searching for the longest string. Otherwise, what I mean is that (for example) if the query was "あ*とう", then using the longest string "とう" is likely to have fewer results than "あ"; thus, when one does get as far as running .matches("あ*とう"), there will be fewer calls to .matches(). That said, I'm not sure whether the speedup would be noticeable or not.

Owner

mvysny commented Sep 17, 2015

Implemented in Aedict 3.37

@mvysny mvysny closed this Sep 17, 2015

Owner

mvysny commented Dec 19, 2015

Some tips for using the wildcards:

  1. To find words containing 出 (but not starting with 出), search for ?*出 - that is, a question mark, followed by an asterisk and the 出 kanji. This directs Aedict to find only words in which one or more kanji precede 出, because ? matches exactly one kanji and * matches zero or more kanji.
  2. Make sure that there is no space after the '?' character, otherwise Aedict will search for ? AND *出, which is not what we want here :) Android loves adding a space after you type the '?' character, so watch out for this.
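
The two tips can be illustrated with a small sketch. It assumes the pattern is matched against the whole word (the thread does not say how Aedict anchors it), and `WildcardTips`, `matches`, and `removeSpaceAfterWildcard` are hypothetical helpers, not part of Aedict.

```java
import java.util.regex.Pattern;

/** Hypothetical helpers illustrating the two tips; not Aedict's actual code. */
final class WildcardTips {
    /** Drops whitespace that a keyboard inserted right after a wildcard. */
    static String removeSpaceAfterWildcard(String query) {
        return query.replaceAll("([*?])\\s+", "$1");
    }

    /** True if the whole word matches: '?' is one character, '*' is zero or more. */
    static boolean matches(String query, String word) {
        String regex = Pattern.quote(query)
                .replace("?", "\\E.\\Q")
                .replace("*", "\\E.*\\Q");
        return Pattern.matches(regex, word);
    }
}
```

Under the whole-word assumption, `WildcardTips.matches("?*出", "外出")` is true, while 出 on its own and 出口 do not match because nothing precedes 出. The same assumption also makes 出 the last character, so a word like 外出中 would need the query ?*出* instead. `removeSpaceAfterWildcard("? *出")` returns `?*出`, which is one way a client could guard against the stray space from the Android keyboard.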