# Translate and DSL Search: Introduction

September 18, 2019 (updated September 26, 2019)

---

**Original requirements**
1. You enter a search text (e.g. 乳がん).
2. You detect the language using Google Translate (in the example: Japanese).
3. If non-English text was entered:
  * you translate the text into English (in the example: breast cancer)
  * you expand the search text (in the example: 乳がん OR "breast cancer"). To set the quotes, you need to get the tokens from the translation service.
4. You run a search in all available Dimensions data sources and return the number of results without and with expansion.
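
The expansion in step 3 can be sketched as a small helper. This is a minimal illustration of the requirement as stated, not the code used later in this notebook. The name `expand_query` is hypothetical, and the translation is passed in by hand rather than fetched from a translation service.

```python
def expand_query(original, translation):
    """Combine a search term with its English translation using OR.

    Hypothetical helper: the translation is supplied by the caller.
    """
    original = original.strip()
    translation = translation.strip()
    # Nothing to expand if there is no translation or it equals the input
    if not translation or translation.lower() == original.lower():
        return original
    # Quote multi-word translations so they are matched as an exact phrase
    if " " in translation:
        translation = f'"{translation}"'
    return f"{original} OR {translation}"


print(expand_query("乳がん", "breast cancer"))
```

With the example from the requirements, this prints `乳がん OR "breast cancer"`.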

---

**Lessons Learned**

* A separate language detection step is not necessary, as the [googletrans](https://pypi.org/project/googletrans/) library detects the source language automatically.
* Google Translate does not offer a tokenizer service for translations. I implemented a basic tokenizer for nouns and noun phrases using [textblob](https://textblob.readthedocs.io/en/dev/), which is built on top of NLTK.
* The [dimcli](https://pypi.org/project/dimcli/) library for the Dimensions API has been updated so that it loads nicely in Google Colab. There is no need to restart the kernel anymore; just run the cells from top to bottom.

# 1. Install Libraries


In [1]:
!pip install dimcli plotly_express googletrans textblob -U
!python -m textblob.download_corpora

Requirement already up-to-date: dimcli in /Users/michele.pasin/Envs/dslqa/lib/python3.8/site-packages (0.7.3)
Requirement already up-to-date: plotly_express in /Users/michele.pasin/Envs/dslqa/lib/python3.8/site-packages (0.4.1)
Requirement already up-to-date: googletrans in /Users/michele.pasin/Envs/dslqa/lib/python3.8/site-packages (3.0.0)
Requirement already up-to-date: textblob in /Users/michele.pasin/Envs/dslqa/lib/python3.8/site-packages (0.15.3)
You should consider upgrading via the '/Users/michele.pasin/Envs/dslqa/bin/python -m pip install --upgrade pip' command.


[nltk_data] Downloading package brown to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/michele.pasin/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


# 2. Log into the Dimensions API

This cell also imports the libraries used below and defines the helper functions for translation, tokenization and search.

In [2]:
import pandas
import time
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from googletrans import Translator
import dimcli
import plotly_express as px
from textblob import TextBlob
from getpass import getpass

##
# LOG IN 
##

user = ""  #@param {type: "string"}
password = getpass('Enter password here')
print('=> username is', user)
print('=> password is', "*" * len(password))
dimcli.login(user, password)

##
# OBJECTS 
##

dsl = dimcli.Dsl()
translator = Translator()


##
# FUNCTIONS 
##

def nouns_and_noun_phrases(s, dolower=True):
  """Return the nouns and noun phrases found in `s`.

  Nouns that already appear inside a noun phrase are dropped, eg
  > nouns_and_noun_phrases("women with breast cancer and men with breast cancer")
  > => ['women', 'men', 'breast cancer']
  """
  if dolower:  # lowercasing avoids some bad tokenizations
    s = s.lower()
  blob = TextBlob(s)
  noun_phrases = list(set(blob.noun_phrases))
  nouns = [n for n, t in blob.tags if t in ('NN', 'NNS', 'NNP', 'NNPS')]
  # keep only the nouns that are not already part of a noun phrase
  standalone = [n for n in nouns if not any(n in p for p in noun_phrases)]
  return standalone + noun_phrases

def build_keywords(word):
    """
    From a single keyword, return a list with the keyword and its English translation.
    A quoted keyword, eg "乳がん", is kept as an exact phrase:
    ['\\"乳がん\\"', '\\"Breast cancer\\"']
    An unquoted keyword is translated, tokenized, and the tokens are joined with AND.
    If the keyword is already in English, only the keyword is returned.
    """
    word = word.strip()
    quotes_flag = False
    # the length check avoids an IndexError on empty input
    if len(word) > 1 and word[0] == '"' and word[-1] == '"':
        word = word.replace('"', "")
        quotes_flag = True
    t = translate(word)
    if t:
        # the keyword is not in English
        if quotes_flag:
            # keep both versions as exact phrases, with quotes escaped for the DSL
            return [f'\\"{word}\\"', f'\\"{t}\\"']
        else:
            # if not in quotes, try to tokenize the English version
            tokens = nouns_and_noun_phrases(t)
            s = " AND ".join([f'\\"{x}\\"' for x in tokens])
            return [word, s]
    else:
        # no translation: the keyword is already in English
        if quotes_flag:
            word = f'\\"{word}\\"'
        return [word]


def translate(word):
    """ Uses the  https://pypi.org/project/googletrans/ library to translate"""
    # eg translate('乳がん')
    r = translator.translate(word)
    if r.src == "en":
      return False
    else:
      return r.text

# The next function launches the query `search <source> for "<keyword>" return <source> limit 1` on each DSL source.

def multi_search(keywords, sources):
    "Launch multiple keyword searches across multiple DSL sources"
    out = []
    for k in tqdm(keywords, desc='1st loop: keywords'):
        for s in tqdm(sources, desc='2nd loop: sources'):
            res = dsl.query(f"""search {s} for \"{k}\" return {s} limit 1""", show_results=False)
            # count_total is the number of matching records, not just the one returned
            out.append({'source' : s, 'objects' : res.count_total, 'query' : k})
            time.sleep(1)  # short pause between API calls
    return pandas.DataFrame.from_dict(out)

# EG 
# print(translate('乳がん'))
# print(build_keywords("Breast cancer"))
# print(nouns_and_noun_phrases("Breast cancer"))
# print(build_keywords('乳がん'))
# print(build_keywords('乳がんの女性'))  # tokenizer will kick in here

# => 
# >> Breast cancer
# >> ['breast cancer']
# >> Women with breast cancer
# >> ['women', 'breast cancer']
# >> ['乳がんの女性', '\\"women\\" AND \\"breast cancer\\"']

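
The filtering and joining steps above can be tried without TextBlob or a network connection. The sketch below takes already-extracted nouns and noun phrases as plain lists and reproduces the same substring check and the same escaped-quote `AND` join. The function names `merge_nouns_and_phrases` and `and_query` are hypothetical and exist only for this illustration.

```python
def merge_nouns_and_phrases(nouns, noun_phrases):
    """Drop nouns that already occur inside a noun phrase, then append the phrases."""
    # De-duplicate phrases while keeping their original order
    phrases = list(dict.fromkeys(noun_phrases))
    # Substring check, as in the notebook: 'breast' is inside 'breast cancer'
    standalone = [n for n in nouns if not any(n in p for p in phrases)]
    return standalone + phrases


def and_query(tokens):
    """Join tokens with AND, wrapping each in backslash-escaped quotes for the DSL."""
    return " AND ".join(f'\\"{t}\\"' for t in tokens)


tokens = merge_nouns_and_phrases(
    ["women", "breast", "cancer", "men", "breast", "cancer"],
    ["breast cancer", "breast cancer"],
)
print(tokens)
print(and_query(tokens))
```

Because the check is a plain substring test, a short noun can be dropped by accident when it happens to occur inside an unrelated phrase (for instance `art` inside `heart disease`). Comparing against the words of each phrase instead would be a stricter alternative.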

# 3. Run Your Query

You can use quotes for exact matches, for example "乳がん".

In [3]:
#@title Enter a Search Term

search_term = '\u4E73\u304C\u3093\u306E\u5973\u6027'  #@param {type: "string"}

keywords = build_keywords(search_term)
print("Expanded into.. :", keywords)
sources = [x for x in dimcli.G.sources() if x != 'researchers']
print("Searching in.. :", sources)
df = multi_search(keywords, sources)  # reuse the keywords built above instead of translating again
df

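
To preview which DSL queries are sent for a given set of keywords, the query strings can be built without calling the API. This is a sketch only: `build_queries` is a hypothetical helper, and the source names and keywords passed to it here are example inputs rather than the output of `dimcli.G.sources()` or `build_keywords`.

```python
def build_queries(keywords, sources):
    """Return (source, keyword, query) tuples, one per keyword/source pair."""
    return [
        (s, k, f'search {s} for "{k}" return {s} limit 1')
        for k in keywords
        for s in sources
    ]


# Example inputs: one raw keyword and one escaped exact phrase
example_keywords = ["乳がん", '\\"breast cancer\\"']
example_sources = ["publications", "grants"]

for source, keyword, query in build_queries(example_keywords, example_sources):
    print(query)
```

Each keyword is paired with each source, so two keywords and two sources produce four queries. The `limit 1` keeps responses small, since only the total count is used.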

# 4. Plot the results 

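Two simple bar charts display the retrieved counts in alternate ways: the first has one bar per query, colored by source; the second has one bar per source, with a separate panel for each query.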

In [4]:
fig1 = px.bar(df, x="query", y="objects", color="source")
fig2 = px.bar(df, x="source", y="objects", facet_col="query")
fig1.show()
fig2.show()

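
Besides plotting, the same rows can be arranged as a small table with one entry per source, which makes it easy to compare the counts for the original and the translated query side by side. The sketch below uses plain dictionaries; `counts_by_source` is a hypothetical helper and the numbers are placeholders for illustration, not real API results.

```python
def counts_by_source(rows):
    """Group rows shaped like the multi_search records into {source: {query: count}}."""
    table = {}
    for row in rows:
        table.setdefault(row["source"], {})[row["query"]] = row["objects"]
    return table


# Placeholder counts for illustration only, NOT real Dimensions results
rows = [
    {"source": "publications", "query": "乳がん", "objects": 10},
    {"source": "grants", "query": "乳がん", "objects": 2},
    {"source": "publications", "query": "breast cancer", "objects": 25},
    {"source": "grants", "query": "breast cancer", "objects": 7},
]

for source, counts in counts_by_source(rows).items():
    print(source, counts)
```

With the pandas dataframe from the previous step, a similar layout should be obtainable with `df.pivot(index="source", columns="query", values="objects")`.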