# Ontologies and text mining

> Overview:

*   Exact match with a dictionary
*   Word2Vec





In this tutorial, you will learn how text embeddings can be generated and used to facilitate learning from text.





In [1]:
!pip install gensim==4.0.0
!pip install scikit-learn
import matplotlib.pyplot as plt
from google.colab import drive
import gensim

Collecting gensim==4.0.0
  Downloading gensim-4.0.0.tar.gz (23.1 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for gensim (setup.py) ... error
  ERROR: Failed building wheel for gensim
  Running setup.py clean for gensim
Failed to build gensim
ERROR: Could not build wheels for gensim, which is required to install pyproject.toml-based projects


First, we create a dictionary for our `Family Ontology`:

In [2]:
dictionary={'father': 'http://Father', 'female': 'http://Female', 'male': 'http://Male', 'mother': 'http://Mother', 'parent': 'http://Parent', 'person': 'http://Person'}

Let us look at the following text:


**Can we identify that "parent" and "mother" are both instances of the class "http://Parent"?**

In [3]:
text="Sarah's only parent was her mother"

To use the dictionary, we first split the text into words, using the space as a delimiter. We then look up each word in the dictionary.


In [4]:
tokens = text.split(' ')
print('The split text looks as follows: ', tokens)
for i, token in enumerate(tokens):
  # Exact match: the token must appear verbatim as a dictionary key.
  if token in dictionary:
    print('Identified mention of a class: ', token, 'which is token number', i + 1, 'and is an instance of:', dictionary[token])

The split text looks as follows:  ["Sarah's", 'only', 'parent', 'was', 'her', 'mother']
Identified mention of a class:  parent which is token number 3 and is an instance of: http://Parent
Identified mention of a class:  mother which is token number 6 and is an instance of: http://Mother


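Exact matching detects 'parent' and 'mother' only because both words appear verbatim in the dictionary, and it maps each word to its own class. It cannot tell us that 'mother' is also an instance of http://Parent.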
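One way to make the exact-match lookup slightly more forgiving is to normalize each token before matching. The sketch below is an illustration rather than part of the original notebook: the `normalize` and `find_mentions` helpers and the second example sentence are assumptions, and only case, surrounding punctuation, and the possessive `'s` are handled.

```python
import string

# Same dictionary as in the tutorial.
dictionary = {'father': 'http://Father', 'female': 'http://Female',
              'male': 'http://Male', 'mother': 'http://Mother',
              'parent': 'http://Parent', 'person': 'http://Person'}

def normalize(token):
    """Hypothetical helper: lowercase, strip punctuation and a trailing 's."""
    token = token.lower().strip(string.punctuation)
    if token.endswith("'s"):
        token = token[:-2]
    return token

def find_mentions(text):
    """Return (position, original token, class IRI) for every dictionary hit."""
    mentions = []
    for i, token in enumerate(text.split()):
        key = normalize(token)
        if key in dictionary:
            mentions.append((i + 1, token, dictionary[key]))
    return mentions

# Illustrative sentence (not from the tutorial) with capitalization and punctuation.
print(find_mentions("Her Mother, a single parent."))
```

Even with normalization, the lookup still treats 'mother' and 'parent' as unrelated strings, which is the gap that embeddings are meant to address.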

For this part, you need access to a pretrained model, available at: https://drive.google.com/drive/folders/18qPfA4c8otrphwhTI-DxONojeT2_2-g5?usp=sharing


Let us try Word2Vec embeddings which were trained on a large corpus:




In [None]:
drive.mount('/content/drive')
print(gensim.__version__)
model=gensim.models.Word2Vec.load('/content/drive/MyDrive/w2v_model')

We can visualize the embeddings of the ontology classes that we have:




In [None]:
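from sklearn.manifold import TSNE
import numpy as np

# Collect one embedding per ontology class label.
classes = list(dictionary)
embeddings = np.array([model.wv[class_] for class_ in classes])
# With only 6 points, perplexity must be below the number of samples
# (recent scikit-learn versions reject the default of 30); 5 is an assumed value.
tsne_vectors = TSNE(perplexity=5, early_exaggeration=1, random_state=6, init='random').fit_transform(embeddings)
for i, class_ in enumerate(classes):
  # Plot each class and label it at the same (offset) position.
  plt.scatter(tsne_vectors[i, 0] + 0.4, tsne_vectors[i, 1] + 0.4)
  plt.text(tsne_vectors[i, 0] + 0.4, tsne_vectors[i, 1] + 0.4, class_, fontsize=9)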
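A plot gives only a visual impression of which classes lie close together. The same comparison can be made numerically with cosine similarity, as in the minimal sketch below. The 3-dimensional vectors are hypothetical values made up for illustration (real Word2Vec vectors have many more dimensions), and `most_similar_class` is an assumed helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from the trained model.
toy_vectors = {
    'mother': [0.9, 0.8, 0.1],
    'parent': [0.8, 0.9, 0.2],
    'male':   [0.1, 0.2, 0.9],
}

def most_similar_class(word, candidates):
    """Return the candidate whose vector is closest to `word` by cosine similarity."""
    return max(candidates,
               key=lambda c: cosine_similarity(toy_vectors[word], toy_vectors[c]))

print(most_similar_class('mother', ['parent', 'male']))
```

With the loaded model, `model.wv.similarity('mother', 'parent')` computes the same quantity on the real embeddings.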

# Application to gene-disease association
In this section, we address a real biological problem: finding gene-disease associations.



We can train Word2Vec to generate embeddings for any entity of interest.

For example, consider the following sentence:

In [9]:
sentence='It was found that familial dysalbuminemic hyperthyroxinemia is spreading rapidly.'

The sentence mentions the disease **familial dysalbuminemic hyperthyroxinemia**, whose ID is OMIM_615999.
To learn an embedding for the disease itself, we replace the entire mention with the disease ID:

In [10]:
sentence=sentence.replace('familial dysalbuminemic hyperthyroxinemia','http://OMIM_615999')
print(sentence)

It was found that http://OMIM_615999 is spreading rapidly.
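When a corpus contains several entity names, the same replacement can be driven by a lookup table. The sketch below is one possible way to do this, not the preprocessing used for the provided model: `replace_mentions` is an assumed helper, and the second table entry is a hypothetical placeholder. Keys are expected to be lowercase, and longer names are tried first so that a short name cannot overwrite part of a longer one.

```python
import re

# Mention -> identifier. The first entry is from the tutorial; the second
# is a hypothetical placeholder added only to show several mentions at once.
mention_to_id = {
    'familial dysalbuminemic hyperthyroxinemia': 'http://OMIM_615999',
    'example disease': 'http://EXAMPLE_0001',
}

def replace_mentions(sentence, mapping):
    """Replace every known mention (case-insensitive) with its identifier."""
    # Longest names first, so regex alternation prefers the longest match.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(name) for name in names), re.IGNORECASE)
    # Keys are lowercase, so lowercase the matched text before the lookup.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], sentence)

sentence = 'It was found that familial dysalbuminemic hyperthyroxinemia is spreading rapidly.'
print(replace_mentions(sentence, mention_to_id))
```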


We can then update our previous Word2Vec model as follows:

In [None]:
model.build_vocab([sentence.split()], update=True, min_count=1)
model.train([sentence.split()],total_examples=model.corpus_count,epochs=5)

To save computation time, the provided Word2Vec model has already been trained on such sentences. We will now use it to show how relations between two entities can be inferred from the similarity of their embeddings. Below, we plot a few genes and diseases that are known to be associated.

---
The numerical IDs correspond to genes, and IDs starting with 'OMIM' correspond to diseases. Entities of the same color are associated.


In [12]:
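labels, embeddings = [], []
genes = ['http://3791', 'http://3815', 'http://4233']
diseases = ['http://OMIM_114500', 'http://OMIM_606764', 'http://OMIM_254500']
# Genes and diseases are listed in matching order: genes[k] is associated with diseases[k].
for entity in genes + diseases:
  labels.append(entity)
  embeddings.append(model.wv[entity])
embeddings = np.array(embeddings)
# With only 6 points, perplexity must be below the number of samples; 5 is an assumed value.
tsne_vectors = TSNE(perplexity=5, early_exaggeration=1, random_state=1, init='random').fit_transform(embeddings)
# One color per associated gene-disease pair.
colors = ['r', 'g', 'b', 'r', 'g', 'b']
for i, clr in enumerate(colors):
  plt.scatter(tsne_vectors[i, 0] + 0.4, tsne_vectors[i, 1] + 0.4, color=clr)
  plt.text(tsne_vectors[i, 0] + 0.4, tsne_vectors[i, 1] + 0.4, labels[i], fontsize=9)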


# Fun exercise
Can you find which of the following genes is associated with http://OMIM_601626?

**HINT: You can check the function:** `model.wv.distances(word, list_of_words)`

<div class="alert alert-block alert-info">


<b>Tip:</b> You can use [NCBI](https://www.ncbi.nlm.nih.gov/gene/) to look up the Gene ID, or [Mouse Genome Informatics (MGI)](https://www.informatics.jax.org/quicksearch/summary?queryType=keywords&query).
For example: [Gene: 71743](https://www.ncbi.nlm.nih.gov/gene/?term=71743) in NCBI, and [Gene: 71743 - MGI:1918993](https://www.informatics.jax.org/marker/MGI:1918993) in the MGI database.

</div>


In [None]:
genes=['http://55364', 'http://1674', 'http://6418', 'http://4233', 'http://351', 'http://595']
disease='http://OMIM_601626'