# Query translation

In [1]:
from googletrans import Translator
import deepl
from sudachipy import tokenizer, dictionary
import pandas as pd
import wikipedia
import stanza

In [2]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')

2024-05-13 21:51:29 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2024-05-13 21:51:30 INFO: Downloaded file to C:\Users\irela\stanza_resources\resources.json
2024-05-13 21:51:30 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2024-05-13 21:51:31 INFO: Using device: cuda
2024-05-13 21:51:31 INFO: Loading: tokenize
2024-05-13 21:51:32 INFO: Loading: mwt
2024-05-13 21:51:32 INFO: Loading: pos
2024-05-13 21:51:33 INFO: Loading: lemma
2024-05-13 21:51:33 INFO: Done loading processors!


### Translation

I tried two translation services so that I could compare them. The first is Google Translate, which is probably the most widely used translator but tends to produce questionable or unnatural results for Japanese. The second is DeepL, which is less popular overall but was by far the most popular among the classmates in my Japanese class. DeepL tends to perform decently, and it has features that make it particularly useful for Japanese, such as the option to adjust the formality of translations.

I also considered Papago, which is the state of the art for Korean translation, but it doesn't have a free API, so I stuck to the above websites and ended up not incorporating Korean into this project. 

In [3]:
key = "" # API key for DeepL
src = 'ja'
dest = 'en'

Next, I test both translators on sentences I wrote myself, mixing more complex grammar with different levels of formality.

In [4]:
translator = Translator() # Google Translate
dl_translator = deepl.Translator(key) # DeepL
translation = translator.translate('公園に行ったらいいのになあって思います', src=src, dest=dest)
dl_translation = dl_translator.translate_text('公園に行ったらいいのになあって思います', target_lang='EN-US')
print("Google: ", translation.text, ", DeepL: ", dl_translation.text)

sentences = ['あのクマにバナナを食べさせられた。', 'こちらをご覧ください。', '先生はもう昼ご飯を召し上がりましたか。', 'AKB48と言えば、推しメンは誰？', 'お金さえあれば嬉しくなると思いますか。', '数学は下手なんだけど、頭が良くないというわけではない']
for sentence in sentences:
    deepl_trans = dl_translator.translate_text(sentence, target_lang='EN-US')
    g_trans = translator.translate(sentence, src=src, dest=dest)
    print("Google: ", g_trans.text, ", DeepL: ", deepl_trans.text)

Google:  I wish I could go to the park , DeepL:  I wish I could go to the park.
Google:  I was forced to eat a banana by that bear. , DeepL:  That bear made me eat a banana.
Google:  Have a look at this. , DeepL:  Please see here.
Google:  Did the teacher already have lunch? , DeepL:  Has the doctor had lunch yet?
Google:  Speaking of AKB48, who is it? , DeepL:  Speaking of AKB48, who are your guesses?
Google:  Do you think you'll be happy if you have the money? , DeepL:  Do you think that as long as you have money you will be happy?
Google:  I'm not good at mathematics, but not smart , DeepL:  I'm terrible at math, but that doesn't mean I'm not smart.


Expected translations: "I wish I could go to the park", "That bear made me eat a banana", "Please look here", "Did the teacher already have lunch?", "Speaking of AKB48, who is your favorite member?", "Do you think that you'll be happy as long as you have money?", "I'm bad at math, but that doesn't mean that I'm not smart."
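Comparing each output with the expected translation by eye works for seven sentences. As a rough, optional aid, character-level similarity from Python's standard `difflib` module can flag outputs that drift far from the reference. This is only a sketch and was not part of the experiments here: the `similarity` helper is a hypothetical name, and surface similarity is not the same as translation quality (a perfectly valid paraphrase can score low).

```python
from difflib import SequenceMatcher


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1], ignoring case and final punctuation."""
    def clean(text: str) -> str:
        return text.strip().rstrip(".?!").lower()

    return SequenceMatcher(None, clean(candidate), clean(reference)).ratio()


# Reference and outputs copied from the banana sentence above.
expected = "That bear made me eat a banana"
outputs = {
    "Google": "I was forced to eat a banana by that bear.",
    "DeepL": "That bear made me eat a banana.",
}

for system, text in outputs.items():
    print(f"{system}: {similarity(text, expected):.2f}")
```

Google's passive paraphrase scores lower here even though it is an acceptable translation, which is why a score like this can only supplement a manual check.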

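Both performed reasonably well, though not perfectly: Google lost the meaning of the last sentence and dropped 推しメン ("favorite member"), while DeepL rendered 先生 as "doctor" and 推しメン as "guesses". Let's try the other direction now.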

In [5]:
eng_sentences = ["I wish I could go to the park", "That bear made me eat a banana", "Please look here", "Did the teacher already have lunch?", "Speaking of AKB48, who is your favorite member?", "Do you think that you'll be happy as long as you have money?", "I'm bad at math, but that doesn't mean that I'm not smart."]
for sentence in eng_sentences:
    deepl_trans = dl_translator.translate_text(sentence, target_lang='JA')
    g_trans = translator.translate(sentence, src=dest, dest=src)
    print("Google: ", g_trans.text, ", DeepL: ", deepl_trans.text)

Google:  公園に行けたらいいのに , DeepL:  公園に行きたいなあ
Google:  そのクマは私にバナナを食べさせました , DeepL:  クマにバナナを食べさせられた
Google:  こちらをご覧ください , DeepL:  こちらをご覧ください
Google:  先生はすでに昼食をとっていましたか？ , DeepL:  先生はもう昼食を食べたのですか？
Google:  AKB48と言えば、あなたの好きなメンバーは誰ですか？ , DeepL:  AKB48といえば、好きなメンバーは？
Google:  あなたがお金を持っている限り、あなたは幸せになると思いますか？ , DeepL:  お金さえあれば幸せだと思いますか？
Google:  私は数学が苦手ですが、それは私が賢くないという意味ではありません。 , DeepL:  数学は苦手だけど、頭が悪いわけじゃない。


DeepL performed far better than Google in this direction. For the reader who doesn't know Japanese, I want to explain the mistakes made by each translator in each sentence.

1. (Park) DeepL missed the "wish" part and translated the sentence as "I want to go to the park." Google retained the original meaning fairly well, although the structure changed a bit.
2. (Banana) This sentence has more complicated grammar, and I wanted to see whether either site would catch it. The sentence is supposed to be causative-passive (ending in させられた), which conveys "I was made to do this, and I feel negatively about it." Google missed this and translated it as plain causative, i.e. "That bear made me eat a banana". DeepL did produce the causative-passive form.
3. (Look, formal) Both sites translated the sentence correctly.
4. (Teacher, formal) Google used a different verb from "eat", closer to "had", and didn't make the sentence as formal as intended. DeepL translated it mostly correctly but also didn't make it as formal as I had intended.
5. (AKB48) Google made the sentence semi-formal, which wasn't what I intended. DeepL kept it casual, as intended. Neither used the slang term for "favorite member".
6. (Money) Google stretched the sentence out so much that it sounds a bit unnatural and robotic. DeepL was correct.
7. (Math) Google made the sentence semi-formal and didn't use the grammar I was looking for. DeepL translated it correctly.

Since the focus of this project is Pokemon, let's try translating some sentences related to Pokemon.

In [6]:
pokemon_text = 'サトシはフシギダネを捕まえたことがあります'
google_tr = translator.translate(pokemon_text, src=src, dest=dest)
dl_tr = dl_translator.translate_text(pokemon_text, target_lang='EN-US')
print("Google: ", google_tr.text, " DeepL: ", dl_tr.text)

Google:  Satoshi has caught Fushigidane  DeepL:  Satoshi once caught a fushigidane


In [7]:
pokemon_text_en = 'Has Ash caught a Bulbasaur before?'
google_en_to_j = translator.translate(pokemon_text_en, src='en', dest='ja')
dl_en_to_j = dl_translator.translate_text(pokemon_text_en, target_lang='JA')
print("Google: ", google_en_to_j.text, " DeepL: ", dl_en_to_j.text)

Google:  アッシュは以前にブルバサウルスを捕まえたことがありますか？  DeepL:  アッシュはブルバサウルスを捕まえたことがある？


Google sounds a bit less natural than DeepL, but neither of them translates the proper nouns into their Japanese equivalents. Pokemon is an interesting topic because some Pokemon have the same name in both languages (e.g. Pikachu), but most don't. Bulbasaur should be フシギダネ (Fushigidane) in Japanese, but both systems just transliterate the English name. Likewise, Ash should be サトシ (Satoshi) in Japanese, but neither system picks up on that.

I'm going to see if I can find a sentence where at least one of the systems recognizes the topic (Pokemon) and properly translates the sentence. 

In [8]:
pokemon_text = 'ポケモンというアニメでは主人公のサトシはフシギダネというポケモンを捕まえたことがありますか。'
google_tr = translator.translate(pokemon_text, src=src, dest=dest)
dl_tr = dl_translator.translate_text(pokemon_text, target_lang='EN-US')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)

pokemon_text_en = 'Has Ash, the protagonist of the Pokemon anime, ever caught the Pokemon called Bulbasaur?'
google_en_to_j = translator.translate(pokemon_text_en, src='en', dest='ja')
dl_en_to_j = dl_translator.translate_text(pokemon_text_en, target_lang='JA')
print("Google: ", google_en_to_j.text, "\n DeepL: ", dl_en_to_j.text)

Google:  In the anime called Pokemon, have you ever caught a Pokemon called Fushigidane? 
 DeepL:  In the anime Pokemon, has the main character Satoshi ever caught a Pokemon called Fushigidane?
Google:  ポケモンアニメの主人公であるアッシュは、ブルバサウルスと呼ばれるポケモンを捕まえたことがありますか？ 
 DeepL:  アニメ『ポケットモンスター』の主人公アッシュは、ブルバサウルというポケモンを捕まえたことがあるのだろうか？


Even with quite a bit of context in the sentence, neither system translated the names correctly.

In [9]:
tokenizer_s = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
print([m.surface() for m in tokenizer_s.tokenize(pokemon_text, mode)])
m = tokenizer_s.tokenize("サトシ", mode)[0]
m.part_of_speech()

['ポケモン', 'と', 'いう', 'アニメ', 'で', 'は', '主人公', 'の', 'サトシ', 'は', 'フシギダネ', 'と', 'いう', 'ポケモン', 'を', '捕まえ', 'た', 'こと', 'が', 'あり', 'ます', 'か', '。']


('名詞', '固有名詞', '人名', '名', '*', '*')

Possible idea: if a word is a proper noun (固有名詞), check if it's in the mapping and use that translation instead
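A minimal sketch of that idea, assuming the tokens have already been reduced to `(surface, part_of_speech)` pairs in the shape Sudachi's `surface()` and `part_of_speech()` return them. The pairs below are hand-written for illustration (only the tuple for サトシ is copied from the output above; the other tags are assumptions), and `name_map` and `replace_proper_nouns` are hypothetical names.

```python
PROPER_NOUN = "固有名詞"

# Hypothetical miniature name mapping (Japanese -> English).
name_map = {"サトシ": "Ash", "フシギダネ": "Bulbasaur"}

# Hand-written (surface, part-of-speech) pairs; no tokenizer is run here.
tokens = [
    ("サトシ", ("名詞", "固有名詞", "人名", "名", "*", "*")),
    ("は", ("助詞", "係助詞", "*", "*", "*", "*")),
    ("フシギダネ", ("名詞", "固有名詞", "一般", "*", "*", "*")),  # assumed tag
    ("を", ("助詞", "格助詞", "*", "*", "*", "*")),
    ("捕まえ", ("動詞", "一般", "*", "*", "*", "*")),
    ("た", ("助動詞", "*", "*", "*", "*", "*")),
]


def replace_proper_nouns(tokens, name_map):
    """Swap a token for its mapped name only if it is tagged as a proper noun."""
    replaced = []
    for surface, pos in tokens:
        if PROPER_NOUN in pos and surface in name_map:
            replaced.append(name_map[surface])
        else:
            replaced.append(surface)
    return replaced


print("".join(replace_proper_nouns(tokens, name_map)))
```

Checking the part-of-speech tag first keeps ordinary words that happen to match a name from being replaced, at the cost of missing names the tokenizer does not tag as proper nouns.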

In [10]:
pokemon_text = 'ポケモンというアニメでは主人公のサトシはBulbasaurというポケモンを捕まえたことがありますか。'
google_tr = translator.translate(pokemon_text, src=src, dest=dest)
dl_tr = dl_translator.translate_text(pokemon_text, target_lang='EN-US')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)

pokemon_text_en = 'Has Ash, the protagonist of the Pokemon anime, ever caught the Pokemon called フシギダネ?'
google_en_to_j = translator.translate(pokemon_text_en, src='en', dest='ja')
dl_en_to_j = dl_translator.translate_text(pokemon_text_en, target_lang='JA')
print("Google: ", google_en_to_j.text, "\n DeepL: ", dl_en_to_j.text)

Google:  In the anime called Pokemon, have you ever caught a Pokemon called Bulbasaur? 
 DeepL:  In the anime Pokemon, has the main character Satoshi ever caught a Pokemon called Bulbasaur?
Google:  ポケモンアニメの主人公であるアッシュは、フシギダネと呼ばれるポケモンを捕まえたことがありますか？ 
 DeepL:  アニメ『ポケットモンスター』の主人公アッシュは、フシギダネというポケモンを捕まえたことがあるのだろうか？


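Both systems pass a name through unchanged when it is already written in the target language, so directly replacing names before translation should work.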

### Game plan for query translation:

1. Get query from user input
2. Tokenize query
3. Run a part-of-speech analyzer to see whether the query contains any proper nouns
4. If there are proper nouns, compare them with the list of names. If a term is in the list, replace it with the version in the other language.
5. Translate query
6. Perform search on documents
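The steps above can be sketched as one function that receives each stage as a callable. This is only an outline under stated assumptions: `run_query` is a hypothetical name, and the tokenizer, translator, and search used in the demo are trivial stand-ins (whitespace splitting, an identity function, and a substring filter over two made-up documents), not Sudachi, DeepL, or Wikipedia.

```python
def run_query(query, tokenize, replace_names, translate, search, joiner=" "):
    """Steps 2-6 of the plan, with every stage injected as a callable."""
    tokens = tokenize(query)                     # step 2: tokenize
    tokens = replace_names(tokens)               # steps 3-4: swap known names
    translated = translate(joiner.join(tokens))  # step 5: translate
    return search(translated)                    # step 6: search


# Stand-in data, made up for this demo.
names = {"Ash": "サトシ", "Pikachu": "ピカチュウ"}
documents = ["サトシ と ピカチュウ", "フシギダネ"]

results = run_query(
    "Ash Pikachu",
    tokenize=str.split,
    replace_names=lambda toks: [names.get(t, t) for t in toks],
    translate=lambda text: text,  # identity stand-in for a real translator
    search=lambda text: [d for d in documents if all(term in d for term in text.split())],
)
print(results)
```

Passing the stages in as arguments makes it easy to swap one translator for another without touching the rest of the pipeline.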

In [11]:
mapping = pd.read_csv('pokemonmap.csv')

In [12]:
mapping

Unnamed: 0,English,Japanese
0,Bulbasaur,フシギダネ
1,Ivysaur,フシギソウ
2,Venusaur,フシギバナ
3,Charmander,ヒトカゲ
4,Charmeleon,リザード
...,...,...
1034,Gary,シゲル
1035,Kukui,ククイ
1036,Dawn,ヒカリ
1037,Jenny,ジュンサー


In [13]:
original_query = 'サトシはピカチュウを捕まえたことがあります'
tokenizer_s = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
tokens = [m.surface() for m in tokenizer_s.tokenize(original_query, mode)]
mapping_eng = list(mapping['English'])
mapping_jpn = list(mapping['Japanese'])
for m in tokenizer_s.tokenize(original_query, mode):
    if '固有名詞' in m.part_of_speech():
        print(m.surface())
        if m.surface() in mapping_jpn:
            name_idx = mapping_jpn.index(m.surface())
            translated_name = mapping_eng[name_idx]
            tokens = list(map(lambda x: x.replace(m.surface(), translated_name), tokens))
tokens

サトシ
ピカチュウ


['Ash', 'は', 'Pikachu', 'を', '捕まえ', 'た', 'こと', 'が', 'あり', 'ます']
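The lookup in the cell above scans `mapping_jpn` with `in` and `index`, which is fine for about a thousand names. A dictionary built once from the same two columns gives the same answer with a single lookup. The sketch below uses only the first five rows shown in the table above, embedded as a string so that it runs without `pokemonmap.csv`; `ja_to_en` and `en_to_ja` are hypothetical names.

```python
import csv
import io

# First rows of the mapping table as displayed above (the real data is in pokemonmap.csv).
sample_csv = """,English,Japanese
0,Bulbasaur,フシギダネ
1,Ivysaur,フシギソウ
2,Venusaur,フシギバナ
3,Charmander,ヒトカゲ
4,Charmeleon,リザード
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# One dictionary per direction, so each lookup is a single hash access.
ja_to_en = {row["Japanese"]: row["English"] for row in rows}
en_to_ja = {row["English"]: row["Japanese"] for row in rows}

print(ja_to_en.get("ヒトカゲ"))
print(en_to_ja.get("Bulbasaur"))
print(ja_to_en.get("ピカチュウ"))  # not among the five sample rows
```

`dict.get` returns `None` for names that are not in the mapping, which replaces the separate membership test.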

In [14]:
query = ''.join(tokens)
google_tr = translator.translate(query, src=src, dest=dest)
dl_tr = dl_translator.translate_text(query, target_lang='EN-US')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)

Google:  Ash has caught Pikachu 
 DeepL:  Ash has caught Pikachu once


So far, I've only tried out translating sentences in natural language. Now, let's try translating strings that are more query-like

In [15]:
query_name_j = 'ピカチュウ英語'
query_name_e = 'Pikachu in Japanese'
query_type_j = 'でんきタイプ弱点'
query_type_e = 'Electric type weakness'
google_tr = translator.translate(query_name_j, src=src, dest=dest)
dl_tr = dl_translator.translate_text(query_name_j, target_lang='EN-US', context='Pokemon')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)
google_tr = translator.translate(query_type_j, src=src, dest=dest)
dl_tr = dl_translator.translate_text(query_type_j, target_lang='EN-US', context='Pokemon')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)
google_tr = translator.translate(query_name_e, src='en', dest='ja')
dl_tr = dl_translator.translate_text(query_name_e, target_lang='JA', context='ポケモン')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)
google_tr = translator.translate(query_type_e, src='en', dest='ja')
dl_tr = dl_translator.translate_text(query_type_e, target_lang='JA', context='ポケモン')
print("Google: ", google_tr.text, "\n DeepL: ", dl_tr.text)

Google:  Pikachu English 
 DeepL:  Pikachu English
Google:  Denki type weakness 
 DeepL:  Electric Type Weakness
Google:  日本語のピカチュ 
 DeepL:  日本語のピカチュウ
Google:  電気タイプの衰弱 
 DeepL:  電気タイプの弱点


DeepL seems to be performing noticeably better than Google Translate when translating queries. 

DeepL also has a 'context' parameter that allows you to provide more context that could impact the translation. Before I added the context, DeepL struggled with the 'electric type weakness' translation, but upon adding context, DeepL returned the same query as I had created on my own for the JP->EN translation testing (with one word converted to its Kanji form).

# Search on Wikipedia articles

In [16]:
print(wikipedia.search("Pokemon"))
wikipedia.set_lang('ja')
print(wikipedia.search("ポケモン"))

['Pokémon', 'Pokémon (TV series)', 'Pokémon (video game series)', 'List of generation III Pokémon', 'Gameplay of Pokémon', 'Pokémon Red, Blue, and Yellow', 'List of generation I Pokémon', 'List of Pokémon characters', 'Pokémon Black and White', 'List of generation IV Pokémon']
['ポケットモンスター', 'ポケモン (企業)', 'ポケモンの一覧 (102-151)', 'ポケモンの一覧 (1-51)', 'ポケモンの一覧 (152-201)', 'ポケモンの一覧 (52-101)', 'ポケモン一覧', 'ポケモン不思議のダンジョン 青の救助隊・赤の救助隊', 'ポケモンの一覧 (441-493)', 'ポケモン☆サンデー']


In [17]:
movie_query_e = "What was the first Pokemon movie?"
movie_tr = dl_translator.translate_text(movie_query_e, target_lang='JA', context='ポケモン')
print(movie_tr)
results = wikipedia.search(movie_tr)
print(results)

最初のポケモン映画は？
['ポケモン (企業)', 'ポケットモンスター', '劇場版ポケットモンスター 幻のポケモン ルギア爆誕', 'ポケモン☆サンデー', 'ポケモンの家あつまる?', 'ポケモンの一覧 (102-151)', 'ポケモンの一覧 (1-51)', 'ポケットモンスター (架空の生物)', 'みんなのポケモン牧場', 'ポケモンファン']


There is only one relevant result above, 劇場版ポケットモンスター 幻のポケモン ルギア爆誕. The page is not the page for the first movie, but it does mention that it is the second movie and the sidebar gives the name of the first movie.

Of 10 results, only the 3rd one was relevant. This yields a precision of 1/10.
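For reference, the same calculation as a small function. This is a sketch: `precision_at_k` is a hypothetical helper, the relevance judgment is my own manual label, and the function divides by `k` (10 here) rather than by the number of results actually returned.

```python
def precision_at_k(results, relevant, k=10):
    """Fraction of the top-k slots that hold a relevant result."""
    hits = sum(1 for title in results[:k] if title in relevant)
    return hits / k


# Results copied from the search output above.
results = [
    'ポケモン (企業)', 'ポケットモンスター',
    '劇場版ポケットモンスター 幻のポケモン ルギア爆誕', 'ポケモン☆サンデー',
    'ポケモンの家あつまる?', 'ポケモンの一覧 (102-151)', 'ポケモンの一覧 (1-51)',
    'ポケットモンスター (架空の生物)', 'みんなのポケモン牧場', 'ポケモンファン',
]
# Manually judged relevant page.
relevant = {'劇場版ポケットモンスター 幻のポケモン ルギア爆誕'}

print(precision_at_k(results, relevant))
```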

However, this might be user error; have you ever typed a natural-language question into Wikipedia before? Let's try again with something more realistic.

In [18]:
movie_query_e_shortened = "First Pokemon movie"
movie_tr_new = dl_translator.translate_text(movie_query_e_shortened, target_lang='JA', context='ポケモン')
print(movie_tr_new)
results_new = wikipedia.search(movie_tr_new)
print(results_new)

ポケモン映画第1作
['ポケットモンスター (劇場版)', 'ポケモン (企業)', 'ポケットモンスター', 'ポケモンの一覧 (1-51)', 'ポケモンの一覧 (102-151)', 'ポケモン☆サンデー', 'ポケットモンスター (アニメ)', 'ポケモンの家あつまる?', 'ポケモン・ザ・ムービーXY 光輪の超魔神 フーパ', 'ポケモン・ザ・ムービーXY 破壊の繭とディアンシー']


The relevant results are ポケットモンスター (劇場版) (1), ポケットモンスター (アニメ) (7), ポケモン・ザ・ムービーXY 破壊の繭とディアンシー (10). These pages all mention the first movie by name, and not just in the external links section. 

Our precision is now 3/10, which is an improvement.

### Lemmatizing and Normalizing

Aside from tokenizing and analyzing the parts of speech of each token, Sudachi can also get the normalized and dictionary forms of tokens. 

In [19]:
def convert_query(query, tokenizer, mode, lemma=False, normalize=False):
    '''
    Preprocesses a Japanese query. Replaces Pokemon names with their English equivalents, and lemmatizes and normalizes the
    words.
    
        Parameters:
            query (string): the Japanese query to be preprocessed
            tokenizer: the sudachi tokenizer
            mode: the mode for the sudachi tokenizer
            lemma (bool): whether or not to lemmatize the words in the query
            normalize (bool): whether or not to normalize the words in the query
            
        Returns: 
            query (string): the preprocessed query
    '''
    # Use the tokenizer passed in as an argument rather than the global tokenizer_s
    tokens = [m.surface() for m in tokenizer.tokenize(query, mode)]
    for m in tokenizer.tokenize(query, mode):
        if '固有名詞' in m.part_of_speech():
            # Proper noun: swap in the English name when the mapping has one
            if m.surface() in mapping_jpn:
                name_idx = mapping_jpn.index(m.surface())
                translated_name = mapping_eng[name_idx]
                tokens = list(map(lambda x: x.replace(m.surface(), translated_name), tokens))
        elif lemma:
            # Replace the surface form with its dictionary form
            tokens = list(map(lambda x: x.replace(m.surface(), m.dictionary_form()), tokens))
        if normalize:
            # Replace the surface form with its normalized form
            tokens = list(map(lambda x: x.replace(m.surface(), m.normalized_form()), tokens))
    return ''.join(tokens)

In [20]:
wikipedia.set_lang('en')
jp_query_l = convert_query('サトシはピカチュウがあります', tokenizer_s, mode, lemma=True)
jp_query_n = convert_query('サトシはピカチュウがあります', tokenizer_s, mode, normalize=True)
jp_query = convert_query('サトシはピカチュウがあります', tokenizer_s, mode)
print("Preprocessed query: ", jp_query, ", Preprocessing + lemmatization: ", jp_query_l, ", Preprocessing + normalization: ", jp_query_n)
jp_tr = dl_translator.translate_text(jp_query, source_lang='JA', target_lang='EN-US', context='Pokemon')
jp_tr_l = dl_translator.translate_text(jp_query_l, source_lang='JA', target_lang='EN-US', context='Pokemon')
jp_tr_n = dl_translator.translate_text(jp_query_n, source_lang='JA', target_lang='EN-US', context='Pokemon')
print("Regular translation: ", jp_tr, ", Lemmatized: ", jp_tr_l, ", Normalized: ", jp_tr_n)
results_jp = wikipedia.search(jp_tr)
results_jp_l = wikipedia.search(jp_tr_l)
results_jp_n = wikipedia.search(jp_tr_n)
print("Regular results: ", results_jp, "\n Lemmatized results: ", results_jp_l, "\n Normalized results: ", results_jp_n)

Preprocessed query:  AshはPikachuがあります , Preprocessing + lemmatization:  AshはPikachuがあるます , Preprocessing + normalization:  AshはPikachuが有るます
Regular translation:  Ash has Pikachu , Lemmatized:  Ash has Pikachu , Normalized:  Ash has Pikachu
Regular results:  ['Pikachu', 'Ash Ketchum', 'Pokémon: The Electric Tale of Pikachu', 'Pokémon 3: The Movie', 'List of Pokémon films', 'Detective Pikachu (film)', 'Pokémon the Movie: I Choose You!', 'Pokémon Horizons: The Series', 'Raichu', 'Pokémon, I Choose You!'] 
 Lemmatized results:  ['Ash Ketchum', 'Pikachu', 'Pokémon: The Electric Tale of Pikachu', 'Pokémon 3: The Movie', 'List of Pokémon films', 'Detective Pikachu (film)', 'Pokémon the Movie: I Choose You!', 'Pokémon Horizons: The Series', 'Pokémon, I Choose You!', 'Pokémon the Movie 2000'] 
 Normalized results:  ['Pikachu', 'Ash Ketchum', 'Pokémon: The Electric Tale of Pikachu', 'Pokémon 3: The Movie', 'List of Pokémon films', 'Detective Pikachu (film)', 'Pokémon the Movie: I Choose You!', '

In [21]:
def convert_query_en(query, lemma=True):
    '''
    Preprocesses an English query. Replaces Pokemon names with their Japanese equivalents, and lemmatizes the words.
        Parameters:
            query (string): the English query to be preprocessed
            lemma (bool): whether or not to lemmatize the words in the query
            
        Returns: 
            query (string): the preprocessed query
    '''
    query_split = query.split()
    # Swap any English name found in the mapping for its Japanese equivalent
    for word in query_split:
        if word in mapping_eng:
            name_idx = mapping_eng.index(word)
            translated_name = mapping_jpn[name_idx]
            query_split = list(map(lambda x: x.replace(word, translated_name), query_split))
            query = ' '.join(query_split)
    if lemma:
        # Lemmatize with stanza, replacing each word's text with its lemma
        doc = nlp(query)
        for sent in doc.sentences:
            for word in sent.words:
                query_split = list(map(lambda x: x.replace(word.text, word.lemma), query_split))
    return ' '.join(query_split)

In [22]:
wikipedia.set_lang('ja')

In [23]:
print(convert_query_en('Ash has a Pikachu', lemma=True))
print(convert_query_en('Ash has a Pikachu'))

サトシ have a ピカチュウ
サトシ have a ピカチュウ


In [24]:
movie_query_e = "What was the first Pokemon movie?"
movie_query_converted = convert_query_en(movie_query_e, lemma=True)
print(movie_query_converted)
movie_tr = dl_translator.translate_text(movie_query_converted, target_lang='JA', context='ポケモン')
print(movie_tr)
results = wikipedia.search(movie_tr)
print(results)

what be the first Pokemon movie?
ポケモン映画第1弾は？
['ポケットモンスター (劇場版)', 'ポケモンの一覧 (1-51)', 'ポケモンの一覧 (102-151)', 'ポケモン☆サンデー', 'ポケモンの家あつまる?', 'ポケモン・ザ・ムービーXY 光輪の超魔神 フーパ', 'ポケモンの一覧 (52-101)', 'ピカチュウとポケモンおんがくたい', 'ポケモンカードゲームの発売タイトル一覧', 'ポケットモンスター (アニメ)']


Recall that the first attempt at this search, where we didn't lemmatize the query first, returned only 1 relevant result (at rank 3). Now the very first result is relevant, along with results 6 and 10, so lemmatizing the query before translating it raised the precision from 1/10 to 3/10.
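Precision counts how many results are relevant but ignores where they appear. Since the rank of the first relevant result also changed (3 before, 1 now), a rank-aware measure such as reciprocal rank captures that difference. This is a supplementary sketch, not a metric used elsewhere in this notebook; the helper names are hypothetical and the lists are truncated to the top three results from the two searches.

```python
def first_relevant_rank(results, relevant):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, title in enumerate(results, start=1):
        if title in relevant:
            return rank
    return None


def reciprocal_rank(results, relevant):
    """1 / rank of the first relevant result; 0.0 when nothing is relevant."""
    rank = first_relevant_rank(results, relevant)
    return 0.0 if rank is None else 1.0 / rank


# Top three results of each search, copied from the outputs above.
first_attempt = ['ポケモン (企業)', 'ポケットモンスター',
                 '劇場版ポケットモンスター 幻のポケモン ルギア爆誕']
lemmatized = ['ポケットモンスター (劇場版)', 'ポケモンの一覧 (1-51)',
              'ポケモンの一覧 (102-151)']
# Manually judged relevant pages among those shown.
relevant = {'劇場版ポケットモンスター 幻のポケモン ルギア爆誕',
            'ポケットモンスター (劇場版)'}

print(first_relevant_rank(first_attempt, relevant), reciprocal_rank(first_attempt, relevant))
print(first_relevant_rank(lemmatized, relevant), reciprocal_rank(lemmatized, relevant))
```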

### Comparing DeepL and Google Translate

#### English Queries

In [25]:
# DeepL
anime_query_e = "When did Ash set his Butterfree free?"
anime_query_converted_lemma = convert_query_en(anime_query_e, lemma=True)
anime_query_converted = convert_query_en(anime_query_e, lemma=False)
print("Preprocessed query: ", anime_query_converted, ", Preprocessing + lemmatization: ", anime_query_converted_lemma)
anime_tr = dl_translator.translate_text(anime_query_converted, source_lang='EN', target_lang='JA', context='ポケモン')
anime_tr_l = dl_translator.translate_text(anime_query_converted_lemma, source_lang='EN', target_lang='JA', context='ポケモン')
print("Regular translation: ", anime_tr, ", Lemmatized: ", anime_tr_l)
results_anime = wikipedia.search(anime_tr)
results_anime_l = wikipedia.search(anime_tr_l)
print("Regular results: ", results_anime, "\n Lemmatized results: ", results_anime_l)


Preprocessed query:  When did サトシ set his バタフリー free? , Preprocessing + lemmatization:  when do サトシ set his バタフリー free?
Regular translation:  サトシはいつバタフリーになったのか？ , Lemmatized:  サトシはいつ、彼のバタフライを自由にするのか？
Regular results:  ['電撃!ピカチュウ', 'カスミ (アニメポケットモンスター)', 'ポケットモンスター (アニメ)', 'エンタの神様', 'ポケットモンスター (ゲームシリーズ)', 'ポケットモンスターSPECIALの登場人物', 'ポケモン☆サンデー', 'M-1グランプリ', '伊集院光', 'イナズマイレブンGOの登場人物'] 
 Lemmatized results:  ['アニメ版ポケットモンスターの登場人物', 'ポケモンの一覧 (102-151)', '林原めぐみ', '山本太郎', 'ワンパンマン', '孤独のグルメ (テレビドラマ)', 'ウルトラ怪獣擬人化計画', "Pop'n musicの登場キャラクター", 'アイシールド21の登場人物', '龍が如くシリーズの登場人物']


Relevant results, no lemmatization: 0

Relevant results, with lemmatization: 1 (first result)

Let's try it again with Google Translate instead of DeepL.

In [26]:
# Google
anime_query_e = "When did Ash set his Butterfree free?"
anime_query_converted_lemma = convert_query_en(anime_query_e, lemma=True)
anime_query_converted = convert_query_en(anime_query_e, lemma=False)
print("Preprocessed query: ", anime_query_converted, ", Preprocessing + lemmatization: ", anime_query_converted_lemma)
anime_tr = translator.translate(anime_query_converted, src='en', dest='ja').text
anime_tr_l = translator.translate(anime_query_converted_lemma, src='en', dest='ja').text
print("Regular translation: ", anime_tr, ", Lemmatized: ", anime_tr_l)
results_anime = wikipedia.search(anime_tr)
results_anime_l = wikipedia.search(anime_tr_l)
print("Regular results: ", results_anime, "\n Lemmatized results: ", results_anime_l)

Preprocessed query:  When did サトシ set his バタフリー free? , Preprocessing + lemmatization:  when do サトシ set his バタフリー free?
Regular translation:  サトシはいつ彼のバタフリーを自由に設定しましたか？ , Lemmatized:  サトシはいつ彼のバタフリーを自由に設定しますか？
Regular results:  ['ポケットモンスターSPECIALの登場人物', 'イナズマイレブンGOの登場人物'] 
 Lemmatized results:  ['ポケットモンスターSPECIALの登場人物', 'イナズマイレブンGOの登場人物']


The Google Translate translation yielded fewer results overall, and both queries returned the same results. Neither result is relevant.

#### Japanese Queries

For these, instead of trying to translate them myself, I'm going to reuse the translations from above. 

In [27]:
wikipedia.set_lang('en')

In [28]:
# DeepL 
anime_query_j = "サトシはいつバタフリーになったのか？"
anime_query_converted_lemma = convert_query(anime_query_j, tokenizer_s, mode, lemma=True)
anime_query_converted_n = convert_query(anime_query_j, tokenizer_s, mode, lemma=False, normalize=True)
anime_query_converted = convert_query(anime_query_j, tokenizer_s, mode)
print("Preprocessed query: ", anime_query_converted, ", Preprocessing + lemmatization: ", anime_query_converted_lemma, ", Preprocessing + normalization: ", anime_query_converted_n)
anime_tr = dl_translator.translate_text(anime_query_converted, source_lang='JA', target_lang='EN-US', context='Pokemon')
anime_tr_l = dl_translator.translate_text(anime_query_converted_lemma, source_lang='JA', target_lang='EN-US', context='Pokemon')
anime_tr_n = dl_translator.translate_text(anime_query_converted_n, source_lang='JA', target_lang='EN-US', context='Pokemon')
print("Regular translation: ", anime_tr, ", Lemmatized: ", anime_tr_l, ", Normalized: ", anime_tr_n)
results_anime = wikipedia.search(anime_tr)
results_anime_l = wikipedia.search(anime_tr_l)
results_anime_n = wikipedia.search(anime_tr_n)
print("Regular results: ", results_anime, "\n Lemmatized results: ", results_anime_l, "\n Normalized results: ", results_anime_n)


Preprocessed query:  Ashはいつバタフリーになったのか？ , Preprocessing + lemmatization:  Ashはいつバタフリーだなるたのか? , Preprocessing + normalization:  Ashはいつバタフリーだ成るたのか?
Regular translation:  When did Ash become butterfree? , Lemmatized:  When did Ash become butterfree? , Normalized:  When did Ash become a butterfly?
Regular results:  ['Pokémon: Indigo League', 'Pokémon the Movie: I Choose You!', 'Pokémon: Black & White: Adventures in Unova and Beyond', 'List of generation I Pokémon', 'List of Pokémon episodes (seasons 1–13)', 'List of Pokémon episodes (seasons 14–present)', 'Pokémon Adventures'] 
 Lemmatized results:  ['Pokémon: Indigo League', 'Pokémon the Movie: I Choose You!', 'Pokémon: Black & White: Adventures in Unova and Beyond', 'List of generation I Pokémon', 'List of Pokémon episodes (seasons 1–13)', 'List of Pokémon episodes (seasons 14–present)', 'Pokémon Adventures'] 
 Normalized results:  ['To Pimp a Butterfly', 'Papilio troilus', 'Seppuku', 'List of Code:Breaker characters', 'Possessor (film)'

In [29]:
# Google
anime_query_j = "サトシはいつバタフリーになったのか？"
anime_query_converted_lemma = convert_query(anime_query_j, tokenizer_s, mode, lemma=True)
anime_query_converted_n = convert_query(anime_query_j, tokenizer_s, mode, lemma=False, normalize=True)
anime_query_converted = convert_query(anime_query_j, tokenizer_s, mode)
print("Preprocessed query: ", anime_query_converted, ", Preprocessing + lemmatization: ", anime_query_converted_lemma, ", Preprocessing + normalization: ", anime_query_converted_n)
anime_tr = translator.translate(anime_query_converted, src ='ja', dest='en').text
anime_tr_l = translator.translate(anime_query_converted_lemma, src='ja', dest='en').text
anime_tr_n = translator.translate(anime_query_converted_n, src='ja', dest='en').text
print("Regular translation: ", anime_tr, ", Lemmatized: ", anime_tr_l, ", Normalized: ", anime_tr_n)
results_anime = wikipedia.search(anime_tr)
results_anime_l = wikipedia.search(anime_tr_l)
results_anime_n = wikipedia.search(anime_tr_n)
print("Regular results: ", results_anime, "\n Lemmatized results: ", results_anime_l, "\n Normalized results: ", results_anime_n)


Preprocessed query:  Ashはいつバタフリーになったのか？ , Preprocessing + lemmatization:  Ashはいつバタフリーだなるたのか? , Preprocessing + normalization:  Ashはいつバタフリーだ成るたのか?
Regular translation:  When did ASH become so -free? , Lemmatized:  When is ASH butterfly? , Normalized:  When did ASH make a butterfly?
Regular results:  ['Ash Ketchum', 'Ferreira family', 'Pokémon, I Choose You!', 'Hochstetter family', 'Callum Kane', 'Hannah Ashworth', 'Pokémon Puzzle League', 'Renewable Heat Incentive scandal', 'Types of volcanic eruptions', 'Ally Gorman'] 
 Lemmatized results:  ['Fraxinus', 'To Pimp a Butterfly', 'Iron Butterfly', 'Fraxinus excelsior', 'Ash Riser', 'Seppuku', 'Papilio glaucus', 'Butterfly (TV series)', 'Papilio polyxenes', 'Papilio troilus'] 
 Normalized results:  ['To Pimp a Butterfly', 'Butterfly (TV series)', 'Papilio troilus', 'Seppuku', 'List of Code:Breaker characters', 'Possessor (film)', 'Shadow Cabinet (comics)', 'List of film director–composer collaborations', 'List of Chopped episodes (season 41

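Relevant results: Pokémon: Indigo League, Pokémon the Movie: I Choose You!, Pokémon: Black & White: Adventures in Unova and Beyond, and List of Pokémon episodes (seasons 1–13). Precision below is reported out of 10, although DeepL's regular and lemmatized queries returned only 7 results each.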

**DeepL Precisions**
* Regular: 4/10
* Lemmatized: 4/10
* Normalized: 0/10

**Google Precisions**
* Regular: 1/10
* Lemmatized: 0/10
* Normalized: 0/10

### Design decisions for final code

* I chose to set lemma=True as the default for English and lemma=False and normalize=False for Japanese. Based on the above experiments, the precision improved when English queries were lemmatized, while lemmatization didn't work very well for Japanese (arimasu becoming aru+masu -> arumasu) and normalization yielded strange results.

* I chose to use only the DeepL translator rather than letting the user pick one. In the experiments above, Google Translate's output returned far fewer results for the English query and fewer relevant results for the Japanese query.
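To make the first design decision concrete, the sketch below shows why joining dictionary or normalized forms produces ungrammatical Japanese. The `(surface, dictionary form, normalized form)` triples for あります are hand-copied from the preprocessing output earlier in this notebook; Sudachi itself is not run here.

```python
# (surface, dictionary form, normalized form), hand-copied from earlier output.
morphemes = [
    ("あり", "ある", "有る"),
    ("ます", "ます", "ます"),
]

surface = "".join(m[0] for m in morphemes)
lemmatized = "".join(m[1] for m in morphemes)
normalized = "".join(m[2] for m in morphemes)

# The polite auxiliary ます attaches to the stem あり, not to the dictionary
# form ある, so the joined lemmas are no longer a valid verb form.
print(surface, lemmatized, normalized)
```

English tolerates this better because lemmas are separated by spaces and still read as a keyword-style query ("what be the first Pokemon movie?").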

In [30]:
def translate_query(query, source_language):
    '''
    Translates a query that has already been preprocessed. Uses the DeepL translator.
        Parameters:
            query (string): the query to be translated
            source_language (string): the language that the query is in. If 'en', translates into Japanese. 
            
        Returns: 
            query (string): the translated query
    '''
    # translate_text returns a TextResult object; .text holds the translated string
    if source_language == 'en':
        return dl_translator.translate_text(query, source_lang='EN', target_lang='JA').text
    else:
        return dl_translator.translate_text(query, source_lang='JA', target_lang='EN-US').text

In [32]:
def search_en(query, lemma=True):
    '''
    Performs the entire search process on an English query. First the query is preprocessed, then translated into Japanese, 
    then a search is performed on Japanese wikipedia.
        Parameters:
            query (string): the query 
            lemma (bool): whether or not to lemmatize the query before translation
            
        Returns: 
            results (list[str]): the titles of the top 10 results from Japanese Wikipedia. 
    '''
    query = convert_query_en(query, lemma)
    query = translate_query(query, 'en')
    wikipedia.set_lang('ja')
    results = wikipedia.search(query)
    return results

def search_ja(query, tokenizer, mode, lemma=False, normalize=False):
    '''
    Performs the entire search process on a Japanese query. First the query is preprocessed, then translated into English, 
    then a search is performed on English Wikipedia.
        Parameters:
            query (string): the query 
            tokenizer: the sudachi tokenizer
            mode: the mode for the sudachi tokenizer
            lemma (bool): whether or not to lemmatize the words in the query
            normalize (bool): whether or not to normalize the words in the query
            
        Returns: 
            results (list[str]): the titles of the top 10 results from English Wikipedia. 
    '''
    query = convert_query(query=query, tokenizer=tokenizer, mode=mode, lemma=lemma, normalize=normalize)
    query = translate_query(query, 'ja')
    wikipedia.set_lang('en')
    results = wikipedia.search(query)
    return results

In [33]:
# Test
print(search_en("When did Ash set his Butterfree free?"))
print(search_ja("サトシはいつバタフリーになったのか？", tokenizer=tokenizer_s, mode=mode))

['アニメ版ポケットモンスターの登場人物', 'ポケモンの一覧 (102-151)', '林原めぐみ', '山本太郎', 'ワンパンマン', '孤独のグルメ (テレビドラマ)', 'ウルトラ怪獣擬人化計画', "Pop'n musicの登場キャラクター", 'アイシールド21の登場人物', '龍が如くシリーズの登場人物']
['Pokémon: Indigo League', 'Pokémon the Movie: I Choose You!', 'Pokémon: Black & White: Adventures in Unova and Beyond', 'List of generation I Pokémon', 'List of Pokémon episodes (seasons 1–13)', 'List of Pokémon episodes (seasons 14–present)', 'Pokémon Adventures']


In [34]:
# Short query
print(search_en("Pokemon Diamond release date"))
print(search_ja("ポケットモンスター ダイヤモンド発売日", tokenizer=tokenizer_s, mode=mode))

['ポケットモンスター ダイヤモンド・パール', 'ポケットモンスター ダイヤモンド&パール', '劇場版ポケットモンスター ダイヤモンド&パール 幻影の覇者 ゾロアーク', 'ポケットモンスター', '劇場版ポケットモンスター ダイヤモンド&パール アルセウス 超克の時空へ', 'ポケットモンスター ブリリアントダイヤモンド・シャイニングパール', 'ポケットモンスターダイヤモンド・パール物語 ポケモンDP', 'ポケットモンスター (アニメ)', 'ポケットモンスター ダイヤモンド・パールの登場人物', 'ポケットモンスター 赤・緑']
['Pokémon Diamond and Pearl', 'Pokémon Brilliant Diamond and Shining Pearl', 'Pokémon Platinum', 'Pokémon (video game series)', 'Pokémon Black and White', 'Pokémon: Diamond and Pearl: Battle Dimension', 'Pokémon: Diamond and Pearl: Galactic Battles', 'Pokémon the Series: Diamond and Pearl', 'List of Pokémon video games', 'List of Pokémon films']
