# Pattern Matching
### (Are these allowed to be published? Please let me know.)

https://drive.google.com/file/d/1_vFzsHsZ78Cg7Hy2-V6Fk-hpXifeqsgx/view

https://drive.google.com/file/d/1upNVxYjCRbw5eLbRvkOE2uEHWEj2PrBX/view

These are earlier methods that used pattern matching to classify chemical reactions, and they are what inspired the CheRMiT project back in 2020. You need not read through them in detail, but give them a skim so you can build some context about our project's history!

Essentially, they use patterns in sentence structure to predict whether a sentence contains a reaction. While this is a good start, it's not comprehensive enough to generalize over all scientific literature. Different papers use different types of sentences, so this type of method would need much better patterns to work. That's why we're using ML for our project.


# Snorkel

Paper using Snorkel to extract chemical reaction data from scientific literature (what CheRMiT does):

https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-018-0723-6

Snorkel's website:

https://www.snorkel.org/

We are using Snorkel to help create training data for our machine learning models. Snorkel is a library that allows users to create labeling functions. Each labeling function assigns a label (True, False, or Abstain) to a sentence based on a user-coded heuristic. All of the labeling functions are applied simultaneously to the unlabeled sentences and their results are aggregated to give the sentence a final label. 
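
To make the aggregation step concrete, here is a minimal sketch in plain Python. It assumes three toy heuristics and a simple majority vote that abstains on ties. The function names (`lf_has_reaction_verb`, `lf_mentions_inhibition`, `lf_has_transition_word`, `majority_vote`) and the example sentences are hypothetical and are not part of Snorkel; in the real pipeline Snorkel's own classes do this work for you.

```python
from collections import Counter

ABSTAIN = -1
NOT_REACTION = 0
REACTION = 1

# Three toy heuristics (hypothetical, for illustration only).
def lf_has_reaction_verb(sentence):
    return REACTION if "converted" in sentence or "oxidized" in sentence else ABSTAIN

def lf_mentions_inhibition(sentence):
    return NOT_REACTION if "inhibit" in sentence else ABSTAIN

def lf_has_transition_word(sentence):
    return REACTION if " into " in sentence or " to " in sentence else ABSTAIN

LFS = [lf_has_reaction_verb, lf_mentions_inhibition, lf_has_transition_word]

def majority_vote(votes):
    """Return the most common non-abstain vote; abstain on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

sentences = [
    "glucose is converted into fructose by the isomerase",
    "indomethacin inhibited both enzymes",
    "the samples were stored overnight",
]

for s in sentences:
    votes = [lf(s) for lf in LFS]
    print(votes, "->", majority_vote(votes))
# [1, -1, 1] -> 1
# [-1, 0, -1] -> 0
# [-1, -1, -1] -> -1
```

Each row of votes plays the role of one row in Snorkel's label matrix: one column per labeling function, with `-1` wherever a function abstained.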

Snorkel is important because machine learning models need accurate labeled data to perform well. A machine learning model is only as good as its data. Please check out the resources above for more context around Snorkel! Once you have a good understanding, continue with the assignment below.

-----------

Below, please write a Snorkel labeling function that classifies a given sentence as **True** (contains a reaction). You can use any aspects of the sentence, such as the words, the sentence structure, or the position of words. Get creative!

The rest of the code below will allow you to see your function's accuracy.

In [33]:
!pip install snorkel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [34]:
from snorkel.labeling import labeling_function
import regex as re
import pandas as pd
import numpy as np

ABSTAIN = -1

# Example labeling functions below.

# includes_reaction_words2
# If the sentence contains reaction words, we label True
reaction_terms_specific = "conver|yield|synthesiz|synthesis|oxid|reduc|phosphorylat" + \
                 "|metaboliz|metabolis|generat|hydroly" + \
                 "|methylat|brominat|aminat|dehydrat|condensat|degradat|decompos|carboxylat"
@labeling_function()
def includes_reaction_words2(x):
    structure = "(" + reaction_terms_specific + ")"
    if (re.search(structure, x[0])):
        return True
    return ABSTAIN

# puts chemicals separated by or for regex structures
def helper_sep_chems_with_or(chemicals):
    final = ""
    for chem in chemicals:
        if (final == ""):
            final += re.escape(chem)
        else:
            final += "|" + re.escape(chem)
    return final

# Phil's version: structure_jtsui_pattern_1
# If part of the sentence contains the specific structure
# [trigger1] <0,3> chemical [transition] <0,3> chemical, we label True

TRANS = "from|to|into|by|are|yield"
TRIG1 = "phosphoryl|condens|hydrolys|metabol|reduc|conver|produc|form|oxid|transform|bioconver|synthes|react|interconver"
TRANS_p = "(" + TRANS + ")"
TRIG1_p = "(" + TRIG1 + ")"

@labeling_function()
def structure_jtsui_pattern_1(x):
    chemicals = helper_sep_chems_with_or(x[1])
    chemicals_p = "(" + chemicals + ")"

    structure = r"\b" + r"{}".format(TRIG1_p) + r"\w*(\s\w*){0,3}\s" + r"{}".format(chemicals_p) + r"\s" + r"{}".format(TRANS_p) + r"(\s\w*){0,3}\s" + r"{}".format(chemicals_p)

    if (re.search(structure, x[0])):
        return True
    return ABSTAIN
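
To see what the `[trigger1] <0,3> chemical [transition] <0,3> chemical` structure matches, here is a small standalone sketch using the standard-library `re` module. The helper `build_structure`, the shortened trigger list, and both example sentences are made up for illustration. It also assumes that a `chemicals` value read from a CSV arrives as text such as `"['glucose', 'fructose']"`, so it is parsed into a real list with `ast.literal_eval` before the pattern is built.

```python
import ast
import re

TRANS = "from|to|into|by|are|yield"
TRIG1 = "conver|produc|form|oxid|reduc"  # shortened trigger list for this demo

def build_structure(chemicals):
    """Build the [trigger] <0,3> chemical [transition] <0,3> chemical regex."""
    chems = "(" + "|".join(re.escape(c) for c in chemicals) + ")"
    return (r"\b(" + TRIG1 + r")\w*(\s\w*){0,3}\s" + chems
            + r"\s(" + TRANS + r")(\s\w*){0,3}\s" + chems)

# Assumption: the chemical list is stored as text, so parse it into a list first.
chemicals = ast.literal_eval("['glucose', 'fructose']")

pattern = build_structure(chemicals)
matching = "the isomerase converts glucose into fructose"
non_matching = "glucose and fructose were stored overnight"

print(bool(re.search(pattern, matching)))      # True
print(bool(re.search(pattern, non_matching)))  # False
```

The second sentence mentions both chemicals but has no trigger word, so the pattern does not fire and a labeling function built on it would abstain.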

In [39]:
#TODO: Code your labeling function here and add it to your_lf

@labeling_function()
def your_lf_here(x):
    return False
your_lf = [your_lf_here]

In [40]:
#Ground truth values to test your LF
df_test = pd.read_csv("df_test.csv", index_col=[0])
df_test.head()

Unnamed: 0,sentence,chemicals,truth,substrates,products,text
0,"indomethacin inhibited both hcox-1 and hcox-2,...","['ns-398', 'indomethacin', 'dup-697']",0,,,['indomethacin inhibited both hcox-1 and hcox-...
1,both ns-398 and dup-697 exhibited time-depende...,"['ns-398', 'dup-697', 'indomethacin']",0,,,['both ns-398 and dup-697 exhibited time-depen...
2,to understand the signal transduction pathway ...,"['ceramide', 'c2-ceramide']",0,,,['to understand the signal transduction pathwa...
3,although dopamine does not readily cross the b...,"['dopamine', 'levodopa']",0,,,['although dopamine does not readily cross the...
4,because gastric aadc and comt degrade levodopa...,"['carbidopa', 'benserazide', 'levodopa']",0,levodopa,,['because gastric aadc and comt degrade levodo...


In [41]:
from snorkel.labeling import PandasLFApplier

applier_test = PandasLFApplier(lfs=your_lf)
L_test = applier_test.apply(df=df_test)

from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter()
df_test["label_voter"] = majority_model.predict(L=L_test)
df_test.head()

100%|██████████| 405/405 [00:00<00:00, 68322.13it/s]


Unnamed: 0,sentence,chemicals,truth,substrates,products,text,label_voter
0,"indomethacin inhibited both hcox-1 and hcox-2,...","['ns-398', 'indomethacin', 'dup-697']",0,,,['indomethacin inhibited both hcox-1 and hcox-...,0
1,both ns-398 and dup-697 exhibited time-depende...,"['ns-398', 'dup-697', 'indomethacin']",0,,,['both ns-398 and dup-697 exhibited time-depen...,0
2,to understand the signal transduction pathway ...,"['ceramide', 'c2-ceramide']",0,,,['to understand the signal transduction pathwa...,0
3,although dopamine does not readily cross the b...,"['dopamine', 'levodopa']",0,,,['although dopamine does not readily cross the...,0
4,because gastric aadc and comt degrade levodopa...,"['carbidopa', 'benserazide', 'levodopa']",0,levodopa,,['because gastric aadc and comt degrade levodo...,0


In [42]:
# Number of truly positive sentences that the LF also labeled positive (true positives)
num_correct = len(df_test[(df_test["truth"] == 1) & (df_test["label_voter"] == 1)])

# The number of positive sentences in our ground truth data
num_positive = len(df_test[df_test["truth"] == 1])

# Fraction of the true sentences that the LF found (recall on the positive class)
accuracy = num_correct / num_positive

print("Number of sentences your LF labeled correctly: " + str(num_correct))
print("Number of true sentences: " + str(num_positive))
print("Your LF's accuracy: " + str(accuracy))

Number of sentences your LF labeled correctly: 0
Number of true sentences: 45
Your LF's accuracy: 0.0
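
One thing to keep in mind about the score above: it only counts how many of the true sentences were found, so a labeling function that returns True for every sentence would score 1.0. The sketch below, in plain Python with made-up labels (not taken from `df_test`), shows how you might also check precision. The helper name `precision_recall` is hypothetical.

```python
def precision_recall(truth, predicted):
    """Compute precision and recall for the positive class (label 1)."""
    true_pos = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in predicted if p == 1)
    actual_pos = sum(1 for t in truth if t == 1)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Made-up labels for six sentences.
truth = [1, 0, 1, 0, 0, 1]
always_true = [1, 1, 1, 1, 1, 1]   # labels everything as a reaction
selective = [1, 0, 0, 0, 0, 1]     # labels only two sentences as reactions

print(precision_recall(truth, always_true))  # (0.5, 1.0)
print(precision_recall(truth, selective))
```

The always-true function finds every reaction but is wrong half the time, while the selective one misses a reaction but is never wrong when it fires.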


# Machine Learning

Machine learning is the second integral part of our project. When it comes to automatically classifying reactions, there is only so much that can be achieved using chemical logic (RDKit) and pattern matching (Snorkel).

For example, the Snorkel pipeline that we're currently using has lots of limitations. It only works on single sentences with an explicit substrate and product, it uses only a small number of labeling functions to classify the sentences, it doesn't account for sentences with enzymatic reactions or multiple reactions, and it would probably have trouble with new types of reactions.

With machine learning, however, we aim to address these shortcomings.

-------

The overarching goal of machine learning is to **find patterns in data and use them to make predictions**. That is, given a set of labeled data, predict a new, unlabeled data point's label. The quality of an ML model is determined by how well it can predict the labels of data points it hasn't seen before.
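
As a toy illustration of "learn from labeled data, then judge on data the model hasn't seen", here is a minimal sketch in plain Python. Everything in it is assumed for the example: the single numeric feature (imagine a count of reaction verbs in a sentence), the data values, and the very simple threshold "model" implemented by the hypothetical helpers `fit_threshold` and `predict`.

```python
# Toy (feature, label) pairs; the values are made up.
train = [(0, 0), (1, 0), (0, 0), (3, 1), (4, 1), (2, 1)]
test = [(0, 0), (1, 0), (3, 1), (5, 1)]  # held out, never used for fitting

def fit_threshold(data):
    """Learn a cutoff halfway between the mean feature of each class."""
    neg = [x for x, y in data if y == 0]
    pos = [x for x, y in data if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def predict(x, threshold):
    return 1 if x > threshold else 0

threshold = fit_threshold(train)
correct = sum(predict(x, threshold) == y for x, y in test)
test_accuracy = correct / len(test)

print("learned threshold:", round(threshold, 3))
print("accuracy on unseen data:", test_accuracy)
```

The number that matters is the one computed on `test`, because those points played no part in choosing the threshold.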

Often, the patterns that ML models learn in order to make these predictions aren't obvious to humans. These patterns must be learned by **optimization algorithms**, which essentially find maxima or minima of functions relevant to the data.

The most common optimization algorithm used is **gradient descent**, which calculates the gradient of a function and then updates the model by taking a small step in the direction opposite the gradient. This works because the gradient points in the direction of steepest ascent, so stepping against it brings you closer to a (local) minimum.
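
Here is a minimal sketch of gradient descent on a one-variable function, assuming the toy function f(x) = (x - 3)^2, whose minimum is at x = 3. The function, starting point, learning rate, and step count are all chosen just for illustration.

```python
def f(x):
    return (x - 3.0) ** 2

def grad_f(x):
    # Derivative of (x - 3)^2 with respect to x.
    return 2.0 * (x - 3.0)

x = 0.0               # starting guess
learning_rate = 0.1   # size of each step

for step in range(100):
    # Step against the gradient to move downhill.
    x -= learning_rate * grad_f(x)

print(round(x, 4))  # 3.0
```

Each update shrinks the distance to the minimum by a constant factor here; on real cost functions the path is less tidy, and the learning rate usually needs tuning.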

Why would you want to find the extrema of a function? An example function that an optimization algorithm might be used on is a **cost function**. Cost functions are usually an aggregate of **loss functions**. Most loss functions compare the output of the ML model (computed from the input features and the model's weights) to the ground truth label. So minimizing a cost function essentially minimizes your error.
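
The relationship between loss and cost can be sketched in a few lines of plain Python. This assumes a linear model (a weighted sum of the input features), a squared-error loss, and a cost defined as the mean loss over the dataset; the data and the helper names `model_output`, `squared_loss`, and `cost` are made up for the example.

```python
def model_output(features, weights):
    """Linear model: weighted sum of the input features."""
    return sum(f * w for f, w in zip(features, weights))

def squared_loss(prediction, target):
    """Loss for a single example."""
    return (prediction - target) ** 2

def cost(dataset, weights):
    """Cost: the mean of the per-example losses."""
    losses = [squared_loss(model_output(x, weights), y) for x, y in dataset]
    return sum(losses) / len(losses)

# Made-up (features, target) pairs generated by target = 1*x1 + 2*x2.
dataset = [([1.0, 2.0], 5.0), ([2.0, 0.0], 2.0), ([0.0, 1.0], 2.0)]

print(cost(dataset, [1.0, 2.0]))  # 0.0
print(cost(dataset, [0.0, 0.0]))  # 11.0
```

An optimizer's job is to move the weights from something like `[0.0, 0.0]` toward something like `[1.0, 2.0]` by lowering this cost.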



https://people.eecs.berkeley.edu/~jrs/189/lec/17.pdf

Read the CS 189 lecture notes on neural networks and answer the following questions.

What is the function being optimized for NNs?

What is backpropagation? (You'll see this again in the PyTorch section)

# PyTorch

PyTorch is a common library to use for general machine learning. Most of the ML tools we'll use in CheRMiT will build off of PyTorch or a similar library (e.g. TensorFlow or Keras). We want you to have a general understanding of how PyTorch works, as well as how neural networks in general work.

Please go through this PyTorch tutorial if you don't know the basics of PyTorch. Try out the Colab notebooks! You'll get to train a simple neural network.

https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

Once you've gone through the PyTorch tutorial, answer the following question:

How would you code a training loop in torch? Write your loop code below.

In [None]:
import torch

# TODO: Write your PyTorch training loop code.

# HuggingFace

NLP (natural language processing) is the subset of ML that we will be using for CheRMiT. It is concerned with making predictions about language. One of the most prominent libraries for using and understanding NLP models is HuggingFace's Transformers.

To get a sense of what NLP models are capable of, we'd like you to check out this Colab notebook.

https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter1/section3.ipynb

The associated tutorial for this notebook is on HuggingFace's webpage: https://huggingface.co/course/chapter1/3?fw=pt

These models are super cool! We hope you think so too. To show you've checked out these resources, code a simple pipeline function below and run it on a sentence of your choice.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

#TODO: Call a pipeline function on a sentence of your choice.

# ChemRxnBert

While the general-use NLP models are cool and give us a sense of what types of tasks are possible to achieve with language, we need models trained on more specific data for our use case. Specifically, we need models that are trained on chemical entities and scientific literature.


https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.1c00284

This paper was released last year. It describes a deep learning model that extracts chemical reaction data from scientific literature. Please read through the whole thing closely and annotate it, since we'll be referencing and building off of it heavily this year. In a future team meeting, we will all discuss our annotations.

https://github.com/jiangfeng1124/ChemRxnExtractor

Please use the README on their GitHub to install their package. Then, play around with it! Download the trained models and use them to predict unlabeled sentences that you've found. (There are some in the Snorkel df_test, if you need.)

You can find the correct data format for using their predict function in the inputs.txt file, which is located on their pipeline branch.

https://github.com/asibanez/chemie-turk

You can also use this package to annotate sentences in their specific format if you need to. 

------

I'm thinking that this year, we can use their models along with our additional tools to make an improved pipeline! They seemed to have the most problems with differentiating between substrates and catalysts, but our cheminformatics suite should be able to help with that.

In [9]:
!git clone https://github.com/jiangfeng1124/ChemRxnExtractor
%cd ChemRxnExtractor
!pip install -r requirements.txt
!pip install -e .

Cloning into 'ChemRxnExtractor'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 249 (delta 42), reused 39 (delta 39), pack-reused 178
Receiving objects: 100% (249/249), 1.02 MiB | 13.62 MiB/s, done.
Resolving deltas: 100% (101/101), done.
/content/ChemRxnExtractor/ChemRxnExtractor
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.5.0
  Downloading torch-1.5.0-cp37-cp37m-manylinux1_x86_64.whl (752.0 MB)
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [10]:
from chemrxnextractor import RxnExtractor