# Task: Reading Abstracts + Full Papers from your assigned PMID

**Packages to install:**

`pip install chemlistem`

`pip install tensorflow`

`pip install pdfminer`

ChemDataExtractor installation instructions are [here](http://chemdataextractor.org/docs/install). TL;DR:
```
conda config --add channels conda-forge
conda install chemdataextractor
cde data download
```

Note: Some TensorFlow errors can be fixed by checking versions:
- NumPy must be older than 1.19.0 (1.18.5 works)
- Python 3.7 and earlier work; Python 3.8 does not
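
The version constraints above can be checked before loading the model. This is a minimal sketch: `version_tuple` and `check_versions` are hypothetical helpers, not part of chemlistem or TensorFlow, and the bounds are only the ones listed in the note.

```python
import sys

def version_tuple(version):
    """Turn '1.18.5' into (1, 18, 5), ignoring non-numeric suffixes such as 'rc1'."""
    parts = []
    for piece in version.split('.')[:3]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def check_versions(python_version, numpy_version):
    """Return a list of warnings for the constraints in the note (hypothetical helper)."""
    problems = []
    if version_tuple(python_version) >= (3, 8):
        problems.append('Python {} is 3.8 or newer'.format(python_version))
    if version_tuple(numpy_version) >= (1, 19, 0):
        problems.append('NumPy {} is 1.19.0 or newer'.format(numpy_version))
    return problems

# The NumPy version is hard-coded here so the sketch runs without NumPy installed
current_python = '{}.{}.{}'.format(*sys.version_info[:3])
print(check_versions(current_python, '1.18.5'))
```

In the notebook itself, `np.__version__` can be passed as the second argument once NumPy is imported.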

In [30]:
import pandas as pd
import numpy as np
from chemlistem import get_ensemble_model
import chemdataextractor  

In [31]:
# Initializes the chemlistem ensemble model (takes a while to load)
model = get_ensemble_model() 

## Instructions 
1. Refer to the [Google Doc](https://www.notion.so/notsec/Meeting-Notes-54299fe5d9a14362b0778d7e5bf8db4d#2396dcdaaf684855bef261f29ef6722b) for your PMIDs.
2. Fill in the cell below:
    - PMID: enter it as a number in the form xxxxx.0 (not a string)
    - fulltext: file path to the full-text PDF of the paper corresponding to the PMID
3. Run the rest of the notebook.
4. Compare the abstract results against the full-paper results, and the chemlistem output against the chemdataextractor output.

In [113]:
# Fill in with your paper info!!
pmid: float = 16388583.0  # a float, matching how pubmedId values appear in the CSV output
fulltext: str = '../sample.pdf'
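
Since the PMID has to be a number and the path has to point at a real PDF, both inputs can be validated before running the slower cells. A minimal sketch, assuming the `pubmedId` column holds floats as the displayed output suggests; `normalize_pmid` and `check_inputs` are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path

def normalize_pmid(value):
    """Convert a PMID given as int, float, or digit string to a float (hypothetical helper)."""
    pmid_float = float(value)
    if pmid_float <= 0 or pmid_float != int(pmid_float):
        raise ValueError('PMID must be a positive whole number, got {!r}'.format(value))
    return pmid_float

def check_inputs(pmid_value, pdf_path):
    """Validate both notebook inputs and return them as (float PMID, path string)."""
    path = Path(pdf_path)
    if path.suffix.lower() != '.pdf':
        raise ValueError('Expected a .pdf file, got {}'.format(pdf_path))
    if not path.is_file():
        raise FileNotFoundError('No file at {}'.format(pdf_path))
    return normalize_pmid(pmid_value), str(path)

# A string PMID would never equal a numeric column value, so convert it first
print(normalize_pmid('16388583'))
```

Calling `check_inputs(pmid, fulltext)` right after filling in the cell would surface a wrong path immediately instead of at the PDF conversion step.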

***

## Running Abstracts

In [99]:
# Read the CSV in chunks to keep memory use low
row_abstract = pd.DataFrame()
abstract = None

chunksize = 10**4
for chunk in pd.read_csv("../Data_CSVs/rxns_abstract9_21.csv", chunksize=chunksize):
    matches = chunk.loc[chunk['pubmedId'] == pmid]
    if len(matches) > 0:
        # Keep the matching rows and stop at the first chunk that contains the PMID
        row_abstract = matches
        abstract = str(matches['abstract'].iloc[0])
        break
abstract

'(S)-1-Phenylethanol dehydrogenase (PED) from the denitrifying bacterium strain EbN1 catalyzes the NAD+-dependent, stereospecific oxidation of (S)-1-phenylethanol to acetophenone and the biotechnologically interesting reverse reaction. This novel enzyme belongs to the short-chain alcohol dehydrogenase/aldehyde reductase family. The coding gene (ped) was heterologously expressed in Escherichia coli and the purified protein was crystallized. The X-ray structures of the apo-form and the NAD+-bound form were solved at a resolution of 2.1 and 2.4 A, respectively, revealing that the enzyme is a tetramer with two types of hydrophobic dimerization interfaces, similar to beta-oxoacyl-[acyl carrier protein] reductase (FabG) from E. coli. NAD+-binding is associated with a conformational shift of the substrate binding loop of PED from a crystallographically unordered "open" to a more ordered "closed" form. Modeling the substrate acetophenone into the active site revealed the structural prerequisit

In [80]:
# ChemLSTM
res_abs_lstm = model.process(abstract)
df_abs_lstm = pd.DataFrame(res_abs_lstm).rename(columns={0: 'start_char', 1: 'end_char', 2: 'LSTM_entity', 3: 'confidence?', 4: '??'})
df_abs_lstm;
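
Each result from `model.process` is a tuple. The last two fields, labelled `confidence?` and `??` above, appear to be the model's score and a flag marking entities that are not overlapped by a higher-scoring one. That reading is an assumption based on the chemlistem README and should be verified. Under that assumption, low-scoring hits could be filtered as sketched below; `filter_entities` and the 0.7 cutoff are illustrative only.

```python
# Sample tuples copied from the abstract run shown later in this notebook:
# (start_char, end_char, entity, score, flag)
sample_results = [
    (0, 19, '(S)-1-Phenylethanol', 0.906309, True),
    (142, 161, '(S)-1-phenylethanol', 0.65762, True),
    (165, 177, 'acetophenone', 0.597175, True),
    (280, 287, 'alcohol', 0.759294, True),
]

def filter_entities(results, min_score=0.7, require_flag=True):
    """Keep tuples whose score meets the cutoff and, optionally, whose flag is True."""
    return [
        r for r in results
        if r[3] >= min_score and (r[4] or not require_flag)
    ]

kept = filter_entities(sample_results)
print([entity for _, _, entity, _, _ in kept])
```

With these four sample rows and a 0.7 cutoff, only the first `(S)-1-Phenylethanol` mention and `alcohol` remain, which shows how a fixed threshold can drop correct entities such as `acetophenone`.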

In [79]:
# ChemDataExtractor
doc = chemdataextractor.Document(abstract)
res_abs_cde = doc.cems

# Convert the spans to a DataFrame
df_abs_cde = pd.DataFrame(
    [(span.start, span.end, span.text) for span in res_abs_cde],
    columns=['start_char', 'end_char', 'CDE_entity'],
)

df_abs_cde = df_abs_cde.sort_values(by=['start_char']).reset_index(drop=True)
df_abs_cde;

In [106]:
# Compare:
print(f"Abstract: {pmid}")
display(df_abs_lstm) # what LSTM got
display(df_abs_cde) # what ChemDataExtractor got
display(row_abstract.drop(columns=['Unnamed: 0'])) # reference BRENDA data

Abstract: 16388583.0


Unnamed: 0,start_char,end_char,entity,confidence?,??
0,0,19,(S)-1-Phenylethanol,0.906309,True
1,142,161,(S)-1-phenylethanol,0.65762,True
2,165,177,acetophenone,0.597175,True
3,280,287,alcohol,0.759294,True
4,670,688,beta-oxoacyl-[acyl,0.613858,True
5,930,942,acetophenone,0.814766,True
6,1445,1460,1-phenylethanol,0.862643,True
7,1761,1774,phenylalanine,0.776211,True


Unnamed: 0,start_char,end_char,entity
0,0,19,(S)-1-Phenylethanol
1,98,102,NAD+
2,142,161,(S)-1-phenylethanol
3,165,177,acetophenone
4,280,287,alcohol
5,302,310,aldehyde
6,488,492,NAD+
7,737,741,NAD+
8,930,942,acetophenone
9,1445,1460,1-phenylethanol


Unnamed: 0,literatureSubstrates,ecNumber,substrates,products,pubmedId,abstract
14159,685078.0,1.1.1.311,(S)-1-phenylethanol + NAD+,acetophenone + NADH + H+,16388583.0,(S)-1-Phenylethanol dehydrogenase (PED) from t...
14163,685078.0,1.1.1.311,(S)-1-phenylethanol + NAD+,acetophenone + NADH + H+,16388583.0,(S)-1-Phenylethanol dehydrogenase (PED) from t...
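
To make the chemlistem vs ChemDataExtractor comparison concrete, the entity names from the two tables above can be grouped by which tool reported them. A minimal sketch in plain Python; `compare_entity_sets` is a hypothetical helper, and matching is by lowercased exact string, which is a simplifying assumption.

```python
def compare_entity_sets(lstm_entities, cde_entities):
    """Group entity names by which tool reported them, ignoring case and repeats."""
    lstm = {e.lower() for e in lstm_entities}
    cde = {e.lower() for e in cde_entities}
    return {
        'both': sorted(lstm & cde),
        'lstm_only': sorted(lstm - cde),
        'cde_only': sorted(cde - lstm),
    }

# Entity columns copied from the two abstract tables above
lstm_found = ['(S)-1-Phenylethanol', '(S)-1-phenylethanol', 'acetophenone', 'alcohol',
              'beta-oxoacyl-[acyl', 'acetophenone', '1-phenylethanol', 'phenylalanine']
cde_found = ['(S)-1-Phenylethanol', 'NAD+', '(S)-1-phenylethanol', 'acetophenone', 'alcohol',
             'aldehyde', 'NAD+', 'NAD+', 'acetophenone', '1-phenylethanol']

for group, names in compare_entity_sets(lstm_found, cde_found).items():
    print(group, names)
```

In the notebook, the same function can be called with `df_abs_lstm['LSTM_entity']` and `df_abs_cde['CDE_entity']` instead of the hard-coded lists.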


## Running Full Papers

In [122]:
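# Leftover from an earlier attempt: `f` is not used in later cells, and the TypeError
# shown below appears to come from a previous version of this cell.
f = open(fulltext, 'rb')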


TypeError: '_io.BufferedReader' object is not subscriptable

In [116]:
# Adapted from https://stackoverflow.com/a/56799049

import subprocess

def pdf_to_text(filepath):
    """Convert a PDF to text by calling pdfminer's pdf2txt.py script."""
    print('Getting text content for {}...'.format(filepath))
    # Capture stderr separately so warnings are not mixed into the extracted text
    result = subprocess.run(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    if result.returncode != 0:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(
            filepath, result.returncode, result.stdout, result.stderr))
    print('Done.')
    return result.stdout.decode('utf-8')

fulltext_str = pdf_to_text(fulltext)

Getting text content for ../sample.pdf...
Done.
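
The conversion above depends on the `pdf2txt.py` script being on the PATH of the environment running the notebook. A small sketch for checking that up front with the standard library; `find_command` is a hypothetical helper.

```python
import shutil

def find_command(name):
    """Return the full path of a command-line tool, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            '{} not found on PATH; is pdfminer installed in this environment?'.format(name))
    return path

try:
    print(find_command('pdf2txt.py'))
except FileNotFoundError as err:
    print(err)
```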


In [127]:
doc = chemdataextractor.Document.from_string(fulltext_str.encode())

In [129]:
# ChemLSTM
res_full_lstm = model.process(fulltext_str)
df_full_lstm = pd.DataFrame(res_full_lstm).rename(columns={0: 'start_char', 1: 'end_char', 2: 'LSTM_entity', 3: 'confidence?', 4: '??'})
df_full_lstm;

In [130]:
# ChemDataExtractor
doc_full = chemdataextractor.Document.from_string(fulltext_str.encode())
res_full_cde = doc_full.cems

# Convert the spans to a DataFrame
df_full_cde = pd.DataFrame(
    [(span.start, span.end, span.text) for span in res_full_cde],
    columns=['start_char', 'end_char', 'CDE_entity'],
)

df_full_cde = df_full_cde.sort_values(by=['start_char']).reset_index(drop=True)
df_full_cde;

In [135]:
# Compare:
print(f"Full Text: {pmid}")
display(df_full_lstm) # what LSTM got
display(df_full_cde) # what ChemDataExtractor got
display(row_abstract.drop(columns=['Unnamed: 0'])) # reference BRENDA data

Full Text: 16388583.0


Unnamed: 0,start_char,end_char,LSTM_entity,confidence?,??
0,795,814,(S)-1-Phenylethanol,0.700453,True
1,937,956,(S)-1-phenylethanol,0.640667,True
2,960,972,acetophenone,0.601028,True
3,1077,1084,alcohol,0.547566,True
4,1734,1746,acetophenone,0.838851,True
...,...,...,...,...,...
280,60699,60706,glucose,0.873042,True
281,60993,61004,cis-retinol,0.603578,True
282,61474,61486,ethylbenzene,0.646770,True
283,61690,61707,3R-hydroxysteroid,0.652048,True


Unnamed: 0,start_char,end_char,CDE_entity
0,90,105,1-Phenylethanol
1,785,804,(S)-1-Phenylethanol
2,883,887,NAD+
3,927,946,(S)-1-phenylethanol
4,950,962,acetophenone
...,...,...,...
496,61291,61298,toluene
497,61492,61506,hydroxysteroid
498,61521,61529,carbonyl
499,61853,61867,hydroxysteroid


Unnamed: 0,literatureSubstrates,ecNumber,substrates,products,pubmedId,abstract
14159,685078.0,1.1.1.311,(S)-1-phenylethanol + NAD+,acetophenone + NADH + H+,16388583.0,(S)-1-Phenylethanol dehydrogenase (PED) from t...
14163,685078.0,1.1.1.311,(S)-1-phenylethanol + NAD+,acetophenone + NADH + H+,16388583.0,(S)-1-Phenylethanol dehydrogenase (PED) from t...
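
One way to score either tool against the BRENDA rows is to split the `substrates` and `products` strings into compound names and count how many were recovered. A minimal sketch, assuming compounds are separated by ` + ` as in the rows above and that exact lowercase matching is good enough; `reaction_compounds` and `recall` are hypothetical helpers. The entity lists are copied from the abstract run, because the full-text tables above are truncated.

```python
def reaction_compounds(*sides):
    """Split BRENDA-style strings such as 'A + B' into a set of lowercase compound names."""
    compounds = set()
    for side in sides:
        for name in side.split(' + '):
            compounds.add(name.strip().lower())
    return compounds

def recall(found_entities, expected):
    """Fraction of expected compounds that appear verbatim among the found entities."""
    found = {e.lower() for e in found_entities}
    return len(expected & found) / len(expected)

expected = reaction_compounds('(S)-1-phenylethanol + NAD+', 'acetophenone + NADH + H+')

# Distinct entity names from the abstract run
lstm_found = ['(S)-1-Phenylethanol', 'acetophenone', 'alcohol',
              'beta-oxoacyl-[acyl', '1-phenylethanol', 'phenylalanine']
cde_found = ['(S)-1-Phenylethanol', 'NAD+', 'acetophenone', 'alcohol',
             'aldehyde', '1-phenylethanol']

print('LSTM recall:', recall(lstm_found, expected))
print('CDE recall:', recall(cde_found, expected))
```

Exact matching undercounts partial hits such as `1-phenylethanol` for `(S)-1-phenylethanol`, so a looser matching rule may be worth trying when comparing the full-text results.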
