2023/05 mjke

Motivated by: https://www.haihai.ai/gpt-gdrive/

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [2]:
DIR_DATA = '../data/'

# https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html
DIR_CHROMA_DB = f'{DIR_DATA}chroma'  # DIR_DATA already ends with '/'

LangChain ships document loaders for many sources (local files, Google Drive, and others); the full list is at: https://python.langchain.com/en/latest/modules/indexes/document_loaders.html

In [3]:
## gdrive version

#from langchain.document_loaders import GoogleDriveLoader

#folder_id = "YOUR_FOLDER_ID"
#loader = GoogleDriveLoader(
#    folder_id=folder_id,
#    recursive=False
#)
# docs = loader.load()

In [4]:
from langchain.document_loaders import DirectoryLoader, UnstructuredWordDocumentLoader

loader = DirectoryLoader(DIR_DATA, glob="*.docx", use_multithreading=True, show_progress=True,
                         loader_cls=UnstructuredWordDocumentLoader)
docs = loader.load()

100%|██████████| 1/1 [00:00<00:00,  1.47it/s]


In [5]:
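# Split documents into chunks of at most 4000 characters with no overlap.
# Separators are tried in list order, so " " is preferred over "," and "\n" here.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000, chunk_overlap=0, separators=[" ", ",", "\n"]
)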
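
To make the effect of `chunk_size` and the order of `separators` concrete, here is a minimal pure-Python sketch of greedy separator-based chunking. It is a simplified illustration, not LangChain's actual implementation; `split_text` and the sample text are hypothetical, and it assumes the splitter uses the first listed separator that occurs in the text and only falls back to later ones for oversized pieces.

```python
def split_text(text, chunk_size, separators):
    """Simplified sketch: split on the first separator found in the text,
    then greedily merge pieces into chunks of at most chunk_size characters."""
    sep = next((s for s in separators if s in text), "")
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Oversized piece: retry with the separators listed after `sep`.
            rest = separators[separators.index(sep) + 1:] if sep else []
            chunks.extend(split_text(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\ngamma delta\nepsilon"
newline_first = split_text(sample, 12, ["\n", " "])
space_first = split_text(sample, 12, [" ", "\n"])
print(newline_first)
print(space_first)
```

With newlines first, each chunk is a whole line; with spaces first (the order used in the cell above), a chunk can straddle a line break. With `chunk_size=4000` the difference is usually small, but it decides where chunk boundaries fall.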

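embedding choice: LangChain's `OpenAIEmbeddings` defaults to `text-embedding-ada-002`. Chroma's own default, used only when no embedding function is passed in, is a Sentence Transformers model (`all-MiniLM-L6-v2`).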
- https://platform.openai.com/docs/guides/embeddings/use-cases
- https://docs.trychroma.com/embeddings
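
The embeddings matter because retrieval ranks chunks by how close their vectors are to the query vector. The sketch below shows that idea with cosine similarity on made-up 3-dimensional vectors. The names `cosine_similarity`, `top_k`, and the toy vectors are illustrative assumptions; real embeddings have far more dimensions, and the distance metric the vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors standing in for chunk embeddings (hypothetical values).
toy_chunks = {
    "methods": [0.9, 0.1, 0.0],
    "results": [0.1, 0.9, 0.1],
    "acknowledgements": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.3, 0.0]
print(top_k(toy_query, toy_chunks, k=2))
```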

In [6]:
texts = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
# Most expensive step: every chunk is sent to the OpenAI embeddings API.
db = Chroma.from_documents(texts, embeddings, persist_directory=DIR_CHROMA_DB)
retriever = db.as_retriever()
db.persist()  # write the index to DIR_CHROMA_DB

In [7]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
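
The `"stuff"` chain type places all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that idea follows; `build_stuff_prompt` and its template wording are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block,
    followed by the question (hypothetical template)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_stuff_prompt(
    "how many zebrafish are there per well?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(example_prompt)
```

Because everything goes into one prompt, the retrieved chunks together with the question have to fit in the model's context window, which is one reason `chunk_size` matters.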

In [8]:
print(qa.run("what kind of deep learning did they use?"))

The study used several deep learning models, including a fully connected "multi-layer perceptron" (Twin-NN), a DenseNet 32 architecture (Twin-DN), and recurrent architectures such as LSTM and GRU. They also used deep metric learning models for identifying drug replicates inducing noticeable but subtle changes in zebrafish behavior.


In [9]:
print(qa.run("what scientific journal would be good for this paper?"))

Based on the context provided, it seems that the paper is related to drug discovery and deep learning models for behavioral screening data. Some potential scientific journals that may be a good fit for this paper could include Journal of Medicinal Chemistry, Nature Methods, or Drug Discovery Today. However, the final decision on which journal to submit to should be based on the specific focus and scope of the paper, as well as the target audience.


In [10]:
print(qa.run("how many zebrafish are there per well in the screen?"))

There are 8 zebrafish per well in the screening platform.


In [11]:
print(qa.run("was the first screen randomized?"))

No, the first screen was not randomized. The unintended outcome of the first metric learning models exploiting "shortcut learning" on the original dataset, which had not used randomized plate layouts, led to the need for a second high-replicate screen of NT-650, but this time with the treatments fully robotically randomized across plates and wells.


In [12]:
print(qa.run("did deep metric learning have any limitations? did it matter whether they used the simple or dense model?"))

Yes, deep metric learning had limitations. One of the limitations was that the models could fall prey to "shortcut learning," where they exploited experimental artifacts in the screening dataset to achieve misleading performance that did not generalize to similar but independent zebrafish screens. This deep mis-learning was nuanced and eluded conventional cross-validation, sanity checks, and exploratory data analysis tests. 

Regarding the simple or dense model, the Twin-DN model was more computationally expressive than the Twin-NN model, but it was found to readily memorize the high-frequency components in the time series, which may come from artifacts such as plate vibrations or high-frequency noise in the imaging sensor. This resulted in the model's exceptional performance relying on shortcut learning or the exploitation of hidden artifactual cues encoded within the data that were invisible to human researchers but perceivable by the deep learning model.


In [14]:
db.persist()