# Setup Python environment

In [18]:
!conda create --name semesterproject python=3.7 -y
!conda activate semesterproject

Collecting package metadata (current_repodata.json): done
Solving environment: unsuccessful attempt using repodata from current_repodata.json, retrying with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done


  current version: 23.7.4
  latest version: 23.11.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.11.0



## Package Plan ##

  environment location: /home/olivier/anaconda3/envs/semesterproject

  added / updated specs:
    - python=3.7


The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main 
  _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 
  ca-certificates    pkgs/main/linux-64::ca-certificates-2023.12.12-h06a4308_0 
  certifi            pkgs/main/linux-64::certifi-2022.12.7-py37h06a4308_0 
  ld_

In [None]:
!conda install -c conda-forge tesseract
!conda install -c conda-forge poppler
!conda install -c anaconda nltk
!conda install -c anaconda pandas
!pip install pytesseract pdf2image zero-shot-re neo4j

done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::tifffile==2023.4.12=py311h06a4308_0
  - defaults/noarch::pure_eval==0.2.2=pyhd3eb1b0_0
  - defaults/linux-64::anyio==3.5.0=py311h06a4308_0
  - defaults/linux-64::markdown-it-py==2.2.0=py311h06a4308_1
  - defaults/linux-64::pycurl==7.45.2=py311hdbd6064_1
  - defaults/linux-64::notebook==6.5.4=py311h06a4308_1
  - defaults/noarch::itsdangerous==2.0.1=pyhd3eb1b0_0
  - defaults/noarch::backports.functools_lru_cache==1.6.4=pyhd3eb1b0_0
  - defaults/linux-64::networkx==3.1=py311h06a4308_0
  - defaults/linux-64::jupyter_console==6.6.3=py311h06a4308_0
  - defaults/linux-64::numexpr==2.8.4=py311h65dcdc2_1
  - defaults/linux-64::idna==3.4=py311h06a4308_0
  - defaults/noarch::jupyterlab_pygments==0.1.2=py_0
  - defaults/linux-64::cytoolz==0.12.0=py311h5eee18b_0
  - defaults/linux-64::y-py==0.5.9=py311h

# Reading a PDF file with OCR
To start you will be using the Tissue Engineering of Skin Regeneration and Hair Growth
paper written by Mohammadreza Ahmadi. The PDF version of the article is available under
the CC0 1.0 license, which means you can easily download it with Python. The pytesseract
library is one of the most popular libraries for OCR.

In [2]:
import requests
import pdf2image
import pytesseract
import os
os.environ['TESSDATA_PREFIX'] = r'/home/olivier/anaconda3/share/tessdata'

pdf = requests.get('https://arxiv.org/pdf/2110.03526.pdf')
doc = pdf2image.convert_from_bytes(pdf.content)

article = []
for page_number, page_data in enumerate(doc):
    text = pytesseract.image_to_string(page_data).encode('utf-8')
    if page_number < 6:
        article.append(text.decode('utf-8'))
article_text = ' '.join(article)
print(article_text)

Mohammadreza Ahmadi

Tissue Engineering and Regeneration of Skin
and Hair Follicle Growth From Stem Cells

INTRODUCTION

Many people with skin diseases such as chronic wounds, non-healing and diabetic
ulcers need reconstruction and regeneration of their skin. In addition, the medical industry also
needed a method of skin rejuvenation and reconstruction for cosmetic purposes, even for
healthy people. Reconstructive medicine used the method to deliver pluripotent stem cells to the
targeted tissue.

33 years after the introduction of bone marrow stem cells, fat-derived stem cells have
become an excellent source for cell therapy. In 1961, two Canadian scientists first introduced
stem cells. These cells, later found to be hematopoietic stem cells, have been used successfully
to treat leukemia and some severe autoimmune diseases called bone marrow transplants. In
1968, another stem cell was introduced into the bone marrow, which has been shown to be
effective due to its high ability to regul

# Text preprocessing
Now that you have the article content available, you will need to remove section titles and
figure descriptions from the text. Here is an example on how to possibly do it in Python:

In [3]:
import nltk
nltk.download('punkt')
def clean_text(text):
    """Remove section titles and figure descriptions from text"""
    clean = "\n".join([row for row in text.split("\n") if (len(row.split(" "))) > 3 and not (row.startswith("(a)")) and not row.startswith("Figure")])
    return clean
text = article_text.split("INTRODUCTION")[1]
ctext = clean_text(text)
sentences = nltk.tokenize.sent_tokenize(ctext)
print(sentences)

['Many people with skin diseases such as chronic wounds, non-healing and diabetic\nulcers need reconstruction and regeneration of their skin.', 'In addition, the medical industry also\nneeded a method of skin rejuvenation and reconstruction for cosmetic purposes, even for\nhealthy people.', 'Reconstructive medicine used the method to deliver pluripotent stem cells to the\n33 years after the introduction of bone marrow stem cells, fat-derived stem cells have\nbecome an excellent source for cell therapy.', 'In 1961, two Canadian scientists first introduced\nstem cells.', 'These cells, later found to be hematopoietic stem cells, have been used successfully\nto treat leukemia and some severe autoimmune diseases called bone marrow transplants.', 'In\n1968, another stem cell was introduced into the bone marrow, which has been shown to be\neffective due to its high ability to regulate immunity in many diseases, including skin, bone, joint\ndiseases, heart, brain, nerves, and kidney.', 'Nevert

[nltk_data] Downloading package punkt to /home/olivier/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Biomedical named entity recognition
Now comes the exciting part. Named entity recognition techniques are used to detect relevant
entities or concepts in the text. For example, in the biomedical domain, we want to identify
various genes, drugs, diseases, and other concepts in the text.

In [4]:
import hashlib

def query_plain(text, url='http://bern2.korea.ac.kr/plain'):
    return requests.post(url, json={'text':str(text)}).json()

entity_list = []
for s in sentences[:-1]:
    entity_list.append(query_plain(s))
    
parsed_entities = []
for entities in entity_list:
    e = []
    if not entities.get('annotations'):
        parsed_entities.append({'text':entities['text'], 'text_sha256':
                    hashlib.sha256(entities ['text'].encode('utf-8')).hexdigest()})
        continue
    for entity in entities['annotations']:
        other_ids = [id for id in entity['id'] if not id.startswith('BERN')]
        entity_type = entity['obj']
        entity_name = entities['text'][entity['span']['begin']:entity['span']['end']]
        try:
            entity_id = [id for id in entity['id'] if id.startswith('BERN')][0]
        except IndexError:
            entity_id = entity_name
e.append({'entity_id': entity_id, 'other_ids': other_ids,'entity_type': entity_type, 'entity': entity_name})
parsed_entities.append({'entities':e, 'text':entities['text'],'text_sha256': hashlib.sha256(entities['text'].encode('utf-8')).hexdigest()})
print(parsed_entities)

[{'text': 'Also, they are very accessible, inexpensive, easy to extract, and reproducible.', 'text_sha256': 'd4dbd1fb9e7e5285e6546710fe83bf7ea44a809fe45c245e25e634c5d33f8653'}, {'text': 'The scaffold has to have specific characteristics that would avoid a foreign body reaction; It could not be antigenic ,allergic, or infectious.', 'text_sha256': '3b7b8620ed5944b531c7700750151aed83c98aecf3bac9bf454198dc36bf0f7b'}, {'text': 'Additionally, it needed to be accessible and cheap to produce.', 'text_sha256': '3cc8fc706ee9b1439fb08ac70c6b895ee0779a02642d0bf123acabdc1658282c'}, {'text': 'The complete restoration of the anatomy and physiology of uninjured skin is highly dependent on the modulation of biological pathways of embryonic and fetal formation.', 'text_sha256': '89d412d0aa5c1ef5b7e39ffbe53997f256583e668e167f9e7bff52754c1eaba6'}, {'text': 'These advancements also resulted in substantial decreases in death, hospitalization, and long-term morbidity.', 'text_sha256': '4356f0a81812d5e028f56e

# Constructing a knowledge graph
Before looking at relation extraction techniques, we will construct a biomedical knowledge
graph using only entities and examine the possible applications. You don’t have to deal with
preparing a local Neo4j environment. Instead, you can use a free Neo4j Sandbox instance

In [5]:
from neo4j import GraphDatabase
import pandas as pd

host = 'bolt://3.222.189.78:7687'
user = 'neo4j'
password = 'river-windows-stake'
driver = GraphDatabase.driver(host, auth=(user, password))

def neo4j_query(query, params=None):
    with driver.session() as session:
        result = session.run(query, params)
        return pd.DataFrame([dict(record) for record in result],
        columns=result.keys())


# Importing author and article in Neo4j
The article node will contain only
the title. If you open the Neo4j Browser, you should see a graph that looks like the one here
below. In this example, Mahammadrez Ahmadi is the author and “Tissue Engineering and
Regeneration of Skin and Hair Follicle Growth” is the title of the article

In [6]:
author = article_text.split('\n')[0]
title = " ".join(article_text.split('\n')[2:4])

neo4j_query("""
    MERGE (a:Author{name:$author})
    MERGE (b:Article{title:$title})
    MERGE (a)-[:WROTE]->(b)
    """, {'title':title, 'author':author})

# Importing sentences and entities in Neo4j

In [7]:
neo4j_query("""
    MATCH (a:Article)
    UNWIND $data as row
    MERGE (s:Sentence{id:row.text_sha256})
    SET s.text = row.text
    MERGE (a)-[:HAS_SENTENCE]->(s)
    WITH s, row.entities as entities
    UNWIND entities as entity
    MERGE (e:Entity{id:entity.entity_id})
    ON CREATE SET e.other_ids = entity.other_ids,
    e.name = entity.entity,
    e.type = entity.entity_type
    MERGE (s)-[m:MENTIONS]->(e)
    ON CREATE SET m.count = 1
    ON MATCH SET m.count = m.count + 1
    """, {'data': parsed_entities})

# Executing queries in Neo4j to investigate the knowledge graph
search engine

In [8]:
neo4j_query("""
MATCH (e:Entity)<-[:MENTIONS]-(s:Sentence)
WHERE e.name = "autoimmune diseases"
RETURN s.text as result
""")

Unnamed: 0,result


Co-occurence

In [9]:
neo4j_query("""
MATCH (e1:Entity)<-[:MENTIONS]-()-[:MENTIONS]->(e2:Entity)
WHERE id(e1) < id(e2)
RETURN e1.name as entity1, e2.name as entity2, count(*) as
cooccurrence
ORDER BY cooccurrence
DESC LIMIT 3
""")

Unnamed: 0,entity1,entity2,cooccurrence


Author expertise

In [10]:
neo4j_query("""
MATCH (a:Author)-[:WROTE]->()-[:HAS_SENTENCE]->()-[:MENTIONS]-
>(e:Entity)
RETURN a.name as author, e.name as entity, count(*) as count
ORDER BY count DESC
LIMIT 5
""")

Unnamed: 0,author,entity,count
0,Mohammadreza Ahmadi,collagen,1


# Relation extraction
Now you will try to extract relations between medical concepts. You can use the zero-shot
relation extractor based on the paper Exploring the zero-shot limit of FewRel[2]. While it
wouldn’t be recommendable to put this model into production, it is good enough for the sake
of this project. The model is available on HuggingFace, so we don’t have to deal with training
or setting up the model.

In [16]:
from transformers import AutoTokenizer
from zero_shot_re import RelTaggerModel, RelationExtractor


ModuleNotFoundError: No module named 'zero_shot_re'