# Case study example: identifying S-protein binding partners as potential therapeutic targets

### Using our MeDEP toolset, we will show how some relevant drug targets can be identified

1. Using our MeDEP API, we will fetch relevant data that is semantically connected
2. We will use 4 general keywords as the primary query
3. We will display their topics
4. We will explore the results

In [1]:
## the code corresponding to c1.
import json
import requests
#import nglview as nv
# nglview is currently not working
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import nltk
from gensim import corpora
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
from gensim.models.ldamodel import LdaModel

from IPython.display import Image

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

def prepare_text_for_lda(text):
    tokens = text.split(" ")
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens


[nltk_data] Downloading package wordnet to /home/blazs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/blazs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# First receptor-related queries

Query using: "s-protein", "s-protein-binding","receptor", "adhesion"


In [2]:
interesting_keywords = ["s-protein", "s-protein-binding","receptor", "adhesion"]
json_data_all = []
for keyword in interesting_keywords:
    example_query = "http://covid19explorer.ijs.si/gp/api?keyword={}".format(keyword)
    response = requests.get(example_query)
    json_data = json.loads(response.text)
    json_data_all+=json_data

## get scores and titles
top_docs_abstracts = []
top_docs_titles = []
for hit in json_data_all:
    title, abstract = hit['article_title'], hit['article_abstract']
    if len(abstract) > 30:
        top_docs_abstracts.append(abstract)
        top_docs_titles.append(title)

## clean
clean_text = []
for el in top_docs_abstracts:
    tokens = prepare_text_for_lda(el)
    clean_text.append(tokens)
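
The fetch-and-filter steps in the cell above can also be written as two small helper functions. The following is only a sketch: `build_query_url` and `extract_docs` are hypothetical names, not part of MeDEP, and the sketch assumes that the API returns a JSON list of hits with `article_title` and `article_abstract` fields, as the cell above suggests. The keyword is percent-encoded so that keywords containing spaces or special characters still give a valid URL. The demonstration runs offline on made-up hits.

```python
from urllib.parse import urlencode

# Endpoint taken from the cell above.
API_URL = "http://covid19explorer.ijs.si/gp/api"


def build_query_url(keyword, base_url=API_URL):
    """Return the query URL for one keyword, percent-encoding the keyword."""
    return base_url + "?" + urlencode({"keyword": keyword})


def extract_docs(hits, min_abstract_len=30):
    """Split API hits into parallel lists of titles and abstracts.

    Hits whose abstract is missing or not longer than min_abstract_len
    characters are skipped, mirroring the len(abstract) > 30 check above.
    """
    titles, abstracts = [], []
    for hit in hits:
        abstract = hit.get("article_abstract") or ""
        if len(abstract) > min_abstract_len:
            titles.append(hit.get("article_title") or "")
            abstracts.append(abstract)
    return titles, abstracts


# Offline demonstration with two made-up hits (hypothetical data).
sample_hits = [
    {"article_title": "Title A", "article_abstract": "x" * 40},
    {"article_title": "Title B", "article_abstract": "too short"},
]
print(build_query_url("s-protein binding"))
print(extract_docs(sample_hits))
```

With these helpers, the response of `requests.get(build_query_url(keyword))` could be parsed with `response.json()` and passed to `extract_docs`.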


# Identified article topics
We next explored the topics of the retrieved set of articles.

In [3]:
#topic detection
dictionary = corpora.Dictionary(clean_text)
corpus = [dictionary.doc2bow(text) for text in clean_text]
NUM_TOPICS = 20
ldamodel = LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=5)

for enx, topic in enumerate(topics):
    parts = [x[6:].replace("\"","").replace("*","") for x in topic[1].split("+")[0:4]]
    print("TOPIC {} KEYWORDS: ".format(enx+1)+"AND ".join(parts))

TOPIC 1 KEYWORDS: virus AND fiber AND receptor AND viral 
TOPIC 2 KEYWORDS: infection AND carbohydrate AND receptor AND palate 
TOPIC 3 KEYWORDS: neutralize AND SARS-CoV-2 AND entry AND binding 
TOPIC 4 KEYWORDS: receptor AND response AND activation AND different 
TOPIC 5 KEYWORDS: receptor AND epitope AND residue AND control 
TOPIC 6 KEYWORDS: fusion AND protein AND membrane AND without 
TOPIC 7 KEYWORDS: receptor AND piglet AND suppress AND RSV-induced 
TOPIC 8 KEYWORDS: virus AND vaccine AND elicit AND immunize 
TOPIC 9 KEYWORDS: cell AND receptor AND gene AND expression 
TOPIC 10 KEYWORDS: villus AND virus AND piglet AND infection 
TOPIC 11 KEYWORDS: bibliography AND database AND Streptococcus AND review 
TOPIC 12 KEYWORDS: receptor AND Si-Ni-San AND study AND protein 
TOPIC 13 KEYWORDS: signal AND detection AND receptor AND mediate 
TOPIC 14 KEYWORDS: isolate AND COVID-19 AND coronavirus AND probiotic 
TOPIC 15 KEYWORDS: cell AND adhesion AND function AND suggest 
TOPIC 16 KEYWORD
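
The keyword lines above are produced by slicing a fixed number of characters off each term of the topic string, which relies on the weights always being printed with the same width. A more tolerant alternative is to parse the `weight*"term"` pairs with a regular expression. This is a minimal sketch: `parse_topic_terms` and `format_topic` are hypothetical helpers, and the weights in the example string are made up for illustration.

```python
import re

# One weight*"term" pair, e.g. 0.031*"virus"
TERM_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\*"([^"]+)"')


def parse_topic_terms(topic_string, max_terms=4):
    """Return up to max_terms (term, weight) pairs from a topic string."""
    pairs = TERM_PATTERN.findall(topic_string)[:max_terms]
    return [(term, float(weight)) for weight, term in pairs]


def format_topic(index, topic_string, max_terms=4):
    """Build a 'TOPIC n KEYWORDS: a AND b ...' line for one topic."""
    terms = [term for term, _ in parse_topic_terms(topic_string, max_terms)]
    return "TOPIC {} KEYWORDS: {}".format(index, " AND ".join(terms))


# Example string in the weight*"term" + ... layout (weights are made up).
example = '0.031*"virus" + 0.024*"fiber" + 0.019*"receptor" + 0.015*"viral" + 0.011*"cell"'
print(format_topic(1, example))
```

In the notebook, `format_topic(enx + 1, topic[1])` could replace the slicing expression inside the loop.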


# Main observations
The topics identified by Gensim's LDA support our keyword query: virus, protein, receptor, binding and similar.

Let's display the identified articles and their abstracts. We expect roughly 100 articles.
The semantic connection between articles is established on the basis of whole-article text analysis, not just a classic literature keyword search where semantic interpretation is left to the user, which makes the result more useful than a classic literature search.
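
Because the hits of several keyword queries are concatenated, the same article might appear more than once in the combined list. This is an assumption: the source does not state whether the API returns overlapping hits for different keywords. If it does, duplicates could be dropped before display, for example by title. The helper name `deduplicate_hits` is hypothetical, and the demonstration uses made-up hits.

```python
def deduplicate_hits(hits, key="article_title"):
    """Keep the first hit for each title, comparing titles case-insensitively.

    Hits without a title are always kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for hit in hits:
        marker = (hit.get(key) or "").strip().lower()
        if marker and marker in seen:
            continue  # same title already kept
        if marker:
            seen.add(marker)
        unique.append(hit)
    return unique


# Made-up hits, as if two keyword queries returned the same article.
combined = [
    {"article_title": "Spike protein binding", "article_abstract": "first copy"},
    {"article_title": "Receptor adhesion", "article_abstract": "other article"},
    {"article_title": "spike protein binding ", "article_abstract": "second copy"},
]
for hit in deduplicate_hits(combined):
    print(hit["article_title"])
```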


In [5]:
# print each article's title and abstract
for n, (title, abstract) in enumerate(zip(top_docs_titles, top_docs_abstracts), start=1):
    print("top scored article no.: " + str(n))
    print("title:")
    print(title + "\n\n" + "abstract:")
    print(abstract)
    print("\n")

top scored article no.: 1
title:
The SARS coronavirus spike glycoprotein is selectively recognized by lung surfactant protein D and activates macrophages

abstract:
The severe acute respiratory syndrome coronavirus (SARS-CoV) infects host cells with its surface glycosylated spike-protein (S-protein). Here we expressed the SARS-CoV S-protein to investigate its interactions with innate immune mechanisms in the lung. The purified S-protein was detected as a 210 kDa glycosylated protein. It was not secreted in the presence of tunicamycin and was detected as a 130 kDa protein in the cell lysate. The purified S-protein bound to Vero but not 293T cells and was itself recognized by lung surfactant protein D (SP-D), a collectin found in the lung alveoli. The binding required Ca 2+ and was inhibited by maltose. The serum collectin, mannan-binding lectin (MBL), exhibited no detectable binding to the purified S-protein. S-protein binds and activates macrophages but not dendritic cells (DCs). It su

### The search identified angiotensin-related articles and the ACE2 receptor
A lot of related work on the ACE2 receptor and its interactions with the S-protein has recently been reported in the primary COVID-19 literature

Our identified articles: <br>
1- Fast assessment of human receptor-binding capability of 2019 novel coronavirus (2019-nCoV)<br>
2- Shocking effects of endothelial bradykinin B1 receptors<br>
3- Type 1 angiotensin receptor pharmacology: Signaling beyond G proteins<br>
4- Devil and angel in the renin-angiotensin system: ACE-angiotensin II-AT 1 receptor axis vs. ACE2-angiotensin-(1-7)-Mas receptor axis<br>
5- ROLE OF CHANGES IN SARS-COV-2 SPIKE PROTEIN IN THE INTERACTION WITH THE HUMAN ACE2 RECEPTOR: AN IN SILICO ANALYSIS<br>

Can we directly mine the documents that MeDEP offers?

In [7]:
# skip articles whose abstracts mention ACE-related terms
wordlist = ["Angiotensin", "angiotensin", "bradykinin", "ACE2"]
for n, (title, abstract) in enumerate(zip(top_docs_titles, top_docs_abstracts), start=1):
    if any(word in abstract for word in wordlist):
        continue
    print("top scored article no.: " + str(n))
    print("title:")
    print(title + "\n\n" + "abstract:")
    print(abstract)
    print("\n")

top scored article no.: 1
title:
The SARS coronavirus spike glycoprotein is selectively recognized by lung surfactant protein D and activates macrophages

abstract:
The severe acute respiratory syndrome coronavirus (SARS-CoV) infects host cells with its surface glycosylated spike-protein (S-protein). Here we expressed the SARS-CoV S-protein to investigate its interactions with innate immune mechanisms in the lung. The purified S-protein was detected as a 210 kDa glycosylated protein. It was not secreted in the presence of tunicamycin and was detected as a 130 kDa protein in the cell lysate. The purified S-protein bound to Vero but not 293T cells and was itself recognized by lung surfactant protein D (SP-D), a collectin found in the lung alveoli. The binding required Ca 2+ and was inhibited by maltose. The serum collectin, mannan-binding lectin (MBL), exhibited no detectable binding to the purified S-protein. S-protein binds and activates macrophages but not dendritic cells (DCs). It su
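
The keyword filter used in the cell above can be generalised into a reusable function that returns both groups of articles instead of only printing one of them. This is a sketch under assumptions: `split_by_keywords` is a hypothetical helper, and matching is made case-insensitive here, so the word list no longer needs both "Angiotensin" and "angiotensin". The example articles are made up.

```python
def split_by_keywords(titles, abstracts, keywords):
    """Partition (title, abstract) pairs by keyword occurrence in the abstract.

    Returns (matching, remaining); matching is case-insensitive substring search.
    """
    lowered = [keyword.lower() for keyword in keywords]
    matching, remaining = [], []
    for title, abstract in zip(titles, abstracts):
        text = abstract.lower()
        if any(keyword in text for keyword in lowered):
            matching.append((title, abstract))
        else:
            remaining.append((title, abstract))
    return matching, remaining


# Made-up titles and abstracts for illustration.
demo_titles = ["Article on ACE2", "Article on SP-D"]
demo_abstracts = [
    "The spike protein binds the ace2 receptor.",
    "Surfactant protein D recognises the spike protein.",
]
ace_related, others = split_by_keywords(
    demo_titles, demo_abstracts, ["angiotensin", "bradykinin", "ACE2"]
)
print([title for title, _ in ace_related])
print([title for title, _ in others])
```

Returning both lists makes it possible to inspect the ACE-related articles separately rather than discarding them.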

# Results
Lots of semantically connected data on virus-host receptor interactions. Could this be directly useful for COVID-19?
#### 1. SARS coronavirus spike receptor-binding domain - RBD
#### 2. immediate surfactant protein D and activation of macrophages correlation
#### 3. AB-binding studies
#### 4. immediate cytokine production results
#### 5. S1P 1 receptor
#### 6. CCR5 receptor
#### 7. TMPRSS2 protease
#### 8. CD147
 
We further explore CD147.





#### CD147 is currently an "under-researched" viral host receptor

In [14]:
#view = nv.show_pdbid("5x0t")
#view.render_image()
#view._display_image()
print ("CD147")
Image(url= "https://cdn.rcsb.org/images/rutgers/x0/5x0t/5x0t.pdb-500.jpg", width=300, height=300)

CD147


#### Exploration of CD147-related articles

In [16]:
interesting_keywords = ["cd147"]
json_data_all = []
for keyword in interesting_keywords:
    example_query = "http://covid19explorer.ijs.si/gp/api?keyword={}".format(keyword)
    response = requests.get(example_query)
    json_data = json.loads(response.text)
    json_data_all+=json_data

## get scores and titles
top_docs_abstracts = []
top_docs_titles = []
for hit in json_data_all:
    title, abstract = hit['article_title'], hit['article_abstract']
    if len(abstract) > 30:
        top_docs_abstracts.append(abstract)
        top_docs_titles.append(title)

## clean
clean_text = []
for el in top_docs_abstracts:
    tokens = prepare_text_for_lda(el)
    clean_text.append(tokens)
    
    
#topic detection
dictionary = corpora.Dictionary(clean_text)
corpus = [dictionary.doc2bow(text) for text in clean_text]
NUM_TOPICS = 10
ldamodel = LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=5)

In [15]:

for enx, topic in enumerate(topics):
    parts = [x[6:].replace("\"","").replace("*","") for x in topic[1].split("+")[0:4]]
    print("TOPIC {} KEYWORDS: ".format(enx+1)+"AND ".join(parts))

TOPIC 1 KEYWORDS: protein AND mediate AND antiviral AND CD147 
TOPIC 2 KEYWORDS: CD147 AND differentiation AND glycosylated AND inhibit 
TOPIC 3 KEYWORDS: virus, AND infectious AND cyclophilins AND cause 
TOPIC 4 KEYWORDS: protein AND antiviral AND response AND target 
TOPIC 5 KEYWORDS: important AND major AND KSHV, AND severe 
TOPIC 6 KEYWORDS: CD147 AND apoptosis AND inhibit AND However, 
TOPIC 7 KEYWORDS: CD147 AND inhibit AND glycosylated AND differentiation 
TOPIC 8 KEYWORDS: CD147 AND virus AND apoptosis AND synovioblast 
TOPIC 9 KEYWORDS: protein AND believe AND function AND provide 
TOPIC 10 KEYWORDS: virus AND influenza AND replication AND inhibit 
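
Several keywords above carry trailing punctuation ("virus,", "KSHV,", "However,") because `prepare_text_for_lda` splits the text on spaces only, so the comma stays attached to the token and also counts towards its length. One possible refinement is a regular-expression tokenizer that keeps inner hyphens (as in "SARS-CoV-2") but drops surrounding punctuation. This is a sketch, not the notebook's method: `tokenize` is a hypothetical helper, the stop-word set is a tiny stand-in for NLTK's English list, stop words are compared case-insensitively (unlike the original), and lemmatisation is omitted.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"however", "which", "their", "there", "these"}

# A token is a run of ASCII letters/digits, optionally joined by inner hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text, stopwords=DEMO_STOPWORDS, min_len=5):
    """Return tokens of at least min_len characters that are not stop words."""
    tokens = TOKEN_PATTERN.findall(text)
    return [
        token
        for token in tokens
        if len(token) >= min_len and token.lower() not in stopwords
    ]


print(tokenize("However, KSHV, influenza virus, and SARS-CoV-2 inhibit replication."))
```

Note that the pattern only covers ASCII characters, so names containing Greek letters would be split; a Unicode-aware pattern would be needed for those.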


### Topics seem to support our hypothesis

Further exploration of articles is also possible.

In [10]:
# skip articles whose abstracts mention ACE-related terms
wordlist = ["Angiotensin", "angiotensin", "bradykinin", "ACE2"]
for n, (title, abstract) in enumerate(zip(top_docs_titles, top_docs_abstracts), start=1):
    if any(word in abstract for word in wordlist):
        continue
    print("top scored article no.: " + str(n))
    print("title:")
    print(title + "\n\n" + "abstract:")
    print(abstract)
    print("\n")

top scored article no.: 1
title:
SARS-CoV-2 invades host cells via a novel route: CD147-spike protein

abstract:
Currently, COVID-19 caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been widely spread around the world; nevertheless, so far there exist no specific antiviral drugs for treatment of the disease, which poses great challenge to control and contain the virus. Here, we reported a research finding that SARS-CoV-2 invaded host cells via a novel route of CD147-spike protein (SP). SP bound to CD147, a receptor on the host cells, thereby mediating the viral invasion.


top scored article no.: 2
title:
CD147 promotes IKK/IκB/NF-κB pathway to resist TNF-induced apoptosis in rheumatoid arthritis synovial fibroblasts

abstract:
TNF is highly expressed in synovial tissue of rheumatoid arthritis (RA) patients, where it induces proinflammatory cytokine secretion. However, in other cases, TNF will cause cell death. Considering the abnormal proliferation and activa

## Immediate data on additional S-protein binding partners crucial for potential novel drug research (14 hits)
Correlation with previous data and a hint at the protein's relevance in COVID-19

1st hit: SARS-CoV-2 invades host cells via a novel route: CD147-spike protein!



# Are there any molecules reported in this context for the design of potential new drugs?
Compound candidate retrieval from the above hitlist of articles <br>
Small molecules <br>
Antimalarials

(rationale: supported by the hitlist literature; CD147 is an essential receptor on red blood cells for the human malaria parasite)

In [13]:
# small molecules
Image(url= "https://www.cureffi.org/media/2015/06/quinoline-containing-approved-antimalarials.png", width=600, height=600)


## Are there any "big molecules" reported in this context for the design of potential new drugs?

#### Meplazumab
(rationale: studied for the treatment of malaria; hitlist literature support: "Meplazumab treats COVID-19 pneumonia: an open-labelled, concurrent controlled add-on clinical trial") <br>
#### Cyclophilin A
(rationale: recent studies indicate that CyPA can be secreted by cells in response to inflammatory stimuli and is implicated in viral infections; hitlist literature support: "Suppression of Coronavirus Replication by Cyclophilin Inhibitors")

In [11]:
# big molecules
print ("Meplazumab")
Image(url= "https://s3-us-west-2.amazonaws.com/drugbank/protein_structures/full/DB06612.png?1452831503", width=500, height=500)

Meplazumab


In [12]:
print ("CyPA")
Image(url= "https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Cyclophilin_A-cyclosporin_complex_1CWA.png/440px-Cyclophilin_A-cyclosporin_complex_1CWA.png", width=300, height=300)

CyPA


# Our toolset successfully retrieves critical information useful in medicinal chemistry and/or drug research scenarios!

Critical observation: semantically pre-linked data reveals article connections that are not necessarily identifiable via a classic literature search and can provide invaluable data for further drug research.

### Output data can easily be plugged into other search engines and expanded if necessary
