# Exploring the Intuition Behind Doc2Vec for PyPatent

This notebook is meant to help researchers outside the field of **Natural Language Processing** understand the _intuition_ behind the **Doc2Vec** algorithm. We will load our model, which was trained on our patents and abstracts using the Python library `gensim`, and show how it has learned (i.e. been trained) to "understand" biomedical/surgical patents. The notebook consists mostly of `Python` code, with `print` statements designed to show how the model works.

If you would like to understand the fundamentals of **Doc2Vec**, please read [the original paper by Quoc V. Le and Tomas Mikolov (2014)](https://arxiv.org/abs/1405.4053).

In [1]:
# First, import all the libraries/packages we need for the model
import os
import gensim
from gensim import utils
from gensim.models import Doc2Vec
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

In [2]:
#We are working with Gensim version 1.0.1
gensim.__version__

'1.0.1'

In [3]:
#Function to load our saved model
def load_model():
    path = '/Users/hclent/Desktop/PyPatent/train/a2v.d2v'
    model = Doc2Vec.load(path)
    return model

model = load_model()
print("We have loaded our model: " + str(model))

We have loaded our model: Doc2Vec(dm/m,d300,n5,w10,mc5,s0.001,t11)


## Most_Similar

We will begin exploring the intuition of our `model` by finding out which words it considers similar. In the next few cells, you can read the method `model.most_similar(WORD)` as asking "Which words does the `model` think are most closely related to _WORD_?" Each returned word is paired with a cosine similarity score, where a higher score means the word vectors are more alike.
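To make the idea concrete, here is a minimal sketch of what a "most similar" lookup does conceptually: score every other word by the cosine similarity of its vector to the query word's vector, then sort. This is not `gensim`'s implementation. The function names `cosine` and `toy_most_similar` and the 3-dimensional `toy_vectors` are invented for illustration; our real model uses 300-dimensional vectors that it learned from the patents and abstracts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word` (highest first)."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "surgery": [0.9, 0.1, 0.0],
    "operation": [0.8, 0.2, 0.1],
    "stapler": [0.1, 0.9, 0.2],
    "enzyme": [0.0, 0.1, 0.9],
}

ranked = toy_most_similar("surgery", toy_vectors)
print(ranked)
```

In this toy example "operation" ranks first because its vector points in nearly the same direction as the vector for "surgery".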

In [4]:
print(model.most_similar('surgery'))

[('surgeries', 0.3442145586013794), ('procedure', 0.3184318244457245), ('operation', 0.31324779987335205), ('surgical', 0.30597832798957825), ('procedures', 0.28746357560157776), ('arthroplasty', 0.2694127559661865), ('operations', 0.26199889183044434), ('myotomy', 0.2579127550125122), ('surgeons', 0.257735937833786), ('analyte', 0.2530910074710846)]


In [5]:
print(model.most_similar('stapler'))

[('staplers', 0.5225029587745667), ('stapled', 0.4372146725654602), ('cutters', 0.34573912620544434), ('stapling', 0.33906006813049316), ('regression', 0.32182642817497253), ('dichroism', 0.312794029712677), ('reinforcements', 0.2912917733192444), ('endostapler', 0.2803919315338135), ('staples', 0.2744087278842926), ('pores', 0.2686074376106262)]


In [6]:
print(model.most_similar('cardiac'))

[('cardiovascular', 0.27750590443611145), ('cardioverter', 0.26840659976005554), ('heart', 0.26839470863342285), ('ventricular', 0.2524974048137665), ('neurostimulator', 0.25177001953125), ('carotid', 0.24607184529304504), ('atrial', 0.24096183478832245), ('enzyme', 0.2282566875219345), ('aorta', 0.2266642153263092), ('coronary', 0.22369340062141418)]


## Doesnt_Match

Next we will see which words our `model` has determined **NOT** to be similar, by giving it a set of words and seeing which word it thinks does not belong.
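One way to picture this odd-one-out check is sketched below: average the (length-normalized) vectors of the words in the set, then pick the word whose vector points least in the direction of that average. We believe `gensim`'s `doesnt_match` follows roughly this idea, but the code is a simplified illustration rather than the library's actual implementation. The helper names `unit` and `toy_doesnt_match` and the 2-dimensional `toy_vectors` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group's mean vector."""
    normalized = {w: unit(vectors[w]) for w in words}
    dims = len(normalized[words[0]])
    # Mean direction of the whole group, itself scaled to length 1
    mean = unit([sum(normalized[w][i] for w in words) / len(words) for i in range(dims)])
    # For unit vectors, the dot product equals the cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(normalized[w], mean)))


# Hypothetical 2-dimensional vectors, invented for illustration only
toy_vectors = {
    "gastric": [0.9, 0.2],
    "stomach": [0.8, 0.3],
    "abdomen": [0.7, 0.4],
    "spine": [0.1, 0.9],
}

misfit = toy_doesnt_match(["gastric", "stomach", "abdomen", "spine"], toy_vectors)
print(misfit + " does NOT belong")
```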

In [7]:
misfit1 = (model.doesnt_match(['surgeries', 'procedure', 'operation', 'aorta']))
print(str(misfit1) + " does NOT belong")

aorta does NOT belong


In [8]:
misfit2 = (model.doesnt_match(['gastric', 'stomach', 'abdomen', 'spine']))
print(str(misfit2) + " does NOT belong")

spine does NOT belong


## Document Vectors

Now we go beyond word-level intuition checks for the doc2vec algorithm and look at how doc2vec works at the document level. The examples below will reference the patent [US20090082789A1: Insertion Shroud for Surgical Instrument](https://patents.google.com/patent/US20090082789A1/en), which is a patent for a surgical instrument related to stapling.

Below is the vector for the abstract of `US20090082789A1`, learned by our `model`:


In [9]:
#The 100_ prefix for the filename is because US20090082789A1 is the 100th patent in our literature search
docvec100 = model.docvecs['100_US20090082789A1.txt'] #Document Vector for US Patent
print(docvec100)

[ -5.37599146e-01   5.82410872e-01  -9.11289528e-02   1.27289101e-01
  -1.15000904e+00  -3.81910950e-01   3.97254884e-01  -1.42942205e-01
  -8.32893431e-01  -6.54209077e-01   3.52837414e-01   4.51039463e-01
  -2.12634102e-01  -1.12266636e+00   8.87699246e-01  -6.64432943e-01
  -2.74389654e-01   8.37969601e-01  -2.56044865e-01   6.35576427e-01
  -6.46699786e-01   7.95875609e-01   9.35347974e-02  -3.08808595e-01
  -2.27562524e-02  -4.47416693e-01  -1.16460674e-01  -3.45282890e-02
  -2.09293459e-02   3.20549756e-01   4.92052019e-01  -1.36537254e+00
   6.74144208e-01  -1.00432408e+00  -1.25720873e-01  -3.17415327e-01
  -7.39395559e-01  -3.10687333e-01  -2.02358818e+00  -6.64136589e-01
   6.09621704e-01  -5.67504525e-01   4.96429026e-01  -9.91610229e-01
  -1.08694875e+00   5.51266968e-01   1.04492700e+00   2.56416440e-01
  -3.12293768e-01   3.87195081e-01   7.51868367e-01  -7.30785504e-02
  -5.78872442e-01   5.52410424e-01  -9.33602273e-01   1.60721630e-01
   2.73472778e-02  -1.33668318e-01

In [10]:
print("Vectors in our model have the data type: " + str(type(docvec100)))
print("Vectors in our model are size: " + str(docvec100.size))


Vectors in our model have the data type: <class 'numpy.ndarray'>
Vectors in our model are size: 300


### To a person, a vector with 300 numbers doesn't mean much. But in the computer, the vector encodes information about the syntax and semantics of the words used in the document.

## Document Similarity

Now we will use `gensim`'s built-in method to see which documents have vectors most similar to that of `US20090082789A1`.

In [11]:
#Which documents are the most similar to patent US20090082789A1?
sims = model.docvecs.most_similar(['100_US20090082789A1.txt'])
print(sims)

[('9_US20050006432A1.txt', 0.6722902655601501), ('34_47.txt', 0.6552937030792236), ('23_344.txt', 0.6413425207138062), ('41_US20050103819A1.txt', 0.6390607953071594), ('42_US20050103819A1.txt', 0.6362971067428589), ('5_118.txt', 0.6350618004798889), ('12_60.txt', 0.6329587697982788), ('70_12.txt', 0.6313894987106323), ('7_115.txt', 0.6299792528152466), ('7_34.txt', 0.6261162757873535)]


### According to our model, the most similar document to US20090082789A1 is another patent, US20050006432A1, which also describes features of a surgical stapling device.

Our model gives the two documents a cosine similarity of about 0.672, i.e. roughly 67.2% similar.

In [12]:
with open("/Users/hclent/Desktop/PyPatent/train/9_US20050006432A1.txt", "r") as f:
    text = f.read()
    print(text)

A surgical device is disclosed which includes a handle portion, a central body portion and a SULU. The SULU includes a proximal body portion, an intermediate pivot member and a tool assembly. The intermediate pivot member is pivotally secured to the proximal body portion about a first pivot axis and the tool assembly is pivotally secured to the intermediate pivot member about a second pivot axis which is orthogonal to the first pivot axis. The SULU includes a plurality of articulation links which are operably connected to the tool assembly by non-rigid links. The articulation links are adapted to releasably engage articulation links positioned in the central body portion. The body portion articulation links are connected to an articulation actuator which is supported for omni-directional movement to effect articulation of the tool assembly about the first and second axes. The handle portion includes a spindle and barrel assembly drive mechanism for advancing and retracting a drive memb

### Document similarity within literature searches

For our purposes, we only want to compare a patent with an abstract when that abstract was retrieved by the literature search for that particular patent. In other words, for each of our patents we retrieved a dedicated collection of abstracts via keyword search. Now we want to know whether any of those abstracts are actually similar to the patent of interest.

We will do this by taking the **cosine similarity** of the patent vector and all of the abstract vectors: `cos(patent, abstract)`
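As a rough sketch of how this per-patent comparison could be organized, the example below keeps only the documents that share the patent's numeric prefix (for example `100_`) and ranks them by cosine similarity to the patent. The function names `cosine` and `rank_abstracts`, the tag `100_16.txt`, and all of the 3-dimensional vectors are invented for illustration. With the real model, the vectors would instead come from `model.docvecs[tag]`.

```python
import math


def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_abstracts(patent_tag, doc_vectors):
    """Rank documents sharing the patent's prefix by similarity to the patent."""
    prefix = patent_tag.split("_", 1)[0] + "_"  # e.g. "100_"
    patent_vec = doc_vectors[patent_tag]
    scores = [
        (tag, cosine(patent_vec, vec))
        for tag, vec in doc_vectors.items()
        if tag.startswith(prefix) and tag != patent_tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical document vectors, invented for illustration only
toy_doc_vectors = {
    "100_US20090082789A1.txt": [0.9, 0.3, 0.1],
    "100_15.txt": [0.8, 0.4, 0.2],
    "100_16.txt": [0.1, 0.2, 0.9],
    "9_US20050006432A1.txt": [0.9, 0.3, 0.1],  # different search, so it is skipped
}

ranking = rank_abstracts("100_US20090082789A1.txt", toy_doc_vectors)
for tag, score in ranking:
    print(tag, round(score, 3))
```

Note that the document with the `9_` prefix is ignored even though its toy vector is identical to the patent's, because it belongs to a different literature search.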

In [13]:
p_vec = model.docvecs["100_US20090082789A1.txt"]  # Patent vector
P = sparse.csr_matrix(p_vec)  # Patent vector as a 1 x 300 sparse matrix
a_vec = model.docvecs["100_15.txt"]  # An abstract from the 100_ prefix
A = sparse.csr_matrix(a_vec)  # Abstract vector as a 1 x 300 sparse matrix
sim = cosine_similarity(P, A)  # cos(patent, abstract), returned as a 1 x 1 array
percent = str(sim[0][0] * 100) + "%"
print("Patent 100_US20090082789A1 and Abstract 100_15 are " + percent + " similar")

Patent 100_US20090082789A1 and Abstract 100_15 are 53.9578855038% similar


In [14]:
with open("/Users/hclent/Desktop/PyPatent/train/100_15.txt", "r") as f:
    text = f.read()
    print(text)

	Author(s): Giaccaglia, V (Giaccaglia, Valentina); Antonelli, MS (Antonelli, Maria Serena); Chieco, PA (Chieco, Paola Addario); Cocorullo, G (Cocorullo, Gianfranco); Cavallini, M (Cavallini, Marco); Gulotta, G (Gulotta, Gaspare)


	Title: Technical characteristics can make the difference in a surgical linear stapler. Or not?


	Abstract: Background: Anastomotic leak (AL) after gastrointestinal surgery is a severe complication associated with relevant short-and long-term sequelae. Most of the anastomosis are currently performed with a surgical stapler that is required to have appropriate characteristics to guarantee good performances. The aim of our study was to evaluate, in the laboratory, pressure resistance and tensile strength of anastomosis performed with different surgical linear staplers, available in the market. Materials and methods: We have been studying three linear staplers, with diverse cartridges and staple heights, of three different companies, used for gastrointestinal a

### Abstract 100_15 was about 54% similar to the Patent US20090082789A1. Why does this similarity score make sense?

If we read the abstract for `100_15` (i.e. the 15th abstract obtained in the literature search for patent #100), we see supporting evidence that they are related:

* performed with a surgical stapler
* performed with different surgical linear staplers
* We have been studying three linear staplers, with diverse cartridges and staple heights

**Both the patent and the abstract discuss surgical staplers, which is consistent with their similarity score.**

## In the end...

Our literature search contains over 16,000 abstracts for 120 patents. As most people do not have the time or resources to read thousands of biomedical abstracts, we are using **doc2vec** in order to automatically set aside dissimilar (i.e. irrelevant) abstracts, and thus focus only on the most relevant abstracts for our patents. 
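As a minimal sketch of that final filtering step, the example below splits scored abstracts into a "keep" group and a "set aside" group using a similarity cutoff. The cutoff of 0.5 is purely hypothetical (this notebook does not state which threshold we use), the function name `split_by_threshold` is invented, and only the score for `100_15.txt` is taken (rounded) from the cells above; the other tags and scores are made up.

```python
def split_by_threshold(scored_abstracts, threshold):
    """Split (tag, similarity) pairs into those at or above the cutoff and the rest."""
    keep = [(tag, score) for tag, score in scored_abstracts if score >= threshold]
    set_aside = [(tag, score) for tag, score in scored_abstracts if score < threshold]
    return keep, set_aside


# Only the 100_15 score comes from this notebook (rounded); the others are invented.
scores = [("100_15.txt", 0.54), ("100_16.txt", 0.12), ("100_17.txt", 0.31)]

HYPOTHETICAL_THRESHOLD = 0.5  # assumption for illustration, not our actual cutoff

keep, set_aside = split_by_threshold(scores, HYPOTHETICAL_THRESHOLD)
print("Keep for reading:", keep)
print("Set aside:", set_aside)
```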