<a href="https://colab.research.google.com/github/linkvarun/Jupyter_Notebook/blob/master/Information_Retreival.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Information retrieval** (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.  Web search engines are the most visible IR applications.

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.

An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching.

Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.

Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
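The score-and-rank loop described above can be sketched in a few lines. This is a toy illustration only: the scoring rule (counting query-term occurrences) and the names `score`, `rank` and `documents` are assumptions made for this example, not the method used later in this notebook.

```python
from collections import Counter

def score(query, document):
    """Count how many times the query terms occur in the document."""
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[term] for term in query.lower().split())

def rank(query, documents, top_k=3):
    """Return (score, document) pairs sorted from best to worst match."""
    scored = [(score(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical three-document collection
documents = [
    "sleep apnea and sleep quality in the elderly",
    "corneal disease and the anterior segment of the eye",
    "drug targets and pharmacokinetic properties",
]

results = rank("sleep apnea", documents)
for s, doc in results:
    print(s, doc)
```

Real systems replace the counting rule with a weighted score such as tf-idf, but the shape of the process (score every object, sort, show the top results) stays the same.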

Where is Information Retrieval used?

**Use Case 1: Digital Library**

A digital library is a library in which collections of data are stored in digital formats and are accessible by computers. The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system.

**Use Case 2: Search Engine**

A search engine is one of the most practical applications of information retrieval techniques to large-scale text collections.

**Use Case 3**: **Image retrieval**

An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images.

![alt text](https://jamesmccaffrey.files.wordpress.com/2016/10/precisionandrecall_informationretrieval.jpg)

**Finding Similar Documents**: A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?
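One common answer to the "right notion of similarity" question is cosine similarity between word-count vectors. The sketch below is a minimal illustration under that assumption; the function name `cosine_similarity` and the sample strings are made up for this example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("sleep apnea review", "sleep apnea review")
partial = cosine_similarity("sleep apnea review", "sleep quality study")
disjoint = cosine_similarity("sleep apnea review", "corneal disease panel")
print(same, partial, disjoint)
```

Identical texts score close to 1, texts with no words in common score 0, and partial overlap falls in between.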

## Case Study : Retrieving Similar Publications @ UNM College of Pharmacy and School of Medicine

**Goal: Find similar papers using Title and Abstract text**

In this case we will use Pandas, NLTK, Numpy, and SKLearn libraries to find similar articles published in PubMed using k-Nearest Neighbors.

Steps:
* Clean the text:
    * Replace control characters such as \n and \x0c with white space
    * Convert the text to Unicode strings
    * Make everything lower case
* Find the important keywords of each document using tf-idf
* Apply a k-Nearest Neighbors model to the tf-idf weights to find similar papers

In [None]:
! pip install biopython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biopython
  Downloading biopython-1.81-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m38.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: biopython
Successfully installed biopython-1.81


In [None]:
import pandas as pd
import sklearn
import numpy as np
import nltk
import re
from Bio import Medline

**Downloading paper abstracts**

First we must download the text data we want to use. We will use articles indexed in PubMed (pubmed.gov). For this notebook we are interested only in articles published by the University of New Mexico College of Pharmacy and School of Medicine. PubMed allows the use of filters/keywords to restrict a search to certain institutions. Retrieve articles affiliated with UNM CoP and SoM by using the following search string: "university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD])

Steps:

* Navigate to pubmed.gov
* Enter "university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD]) into the search box
* Click 'Send to:' and choose 'File' and 'Format: MEDLINE'
* Click 'Create File'

As of 10 Oct 2019, a total of 8,315 articles matched these search criteria. A file called pubmed_result.txt should have been saved to your computer. This file contains all of the articles matching the search criteria in MEDLINE format.


**Let's import the article data**

In [None]:
# Function that uses the Medline module from
# the Biopython library to parse and read MEDLINE
# formatted files. Results are stored in a Pandas
# DataFrame
def read_medline_data(filename):
    rows = []
    # The with-block closes the file once parsing is done
    with open(filename, 'r') as handle:
        for rec in Medline.parse(handle):
            try:
                rows.append([rec["TI"], rec["AU"], rec["AB"]])
            except KeyError:
                # Skip records missing a title, authors or abstract
                continue
    # DataFrame.append was removed in pandas 2.0, so build the frame once
    return pd.DataFrame(rows, columns=["title", "authors", "abstract"])

In [None]:
# Read in MEDLINE formatted text
papers = read_medline_data("/content/pubmed_result.txt")

# Show the top few papers
papers.head(25)

Unnamed: 0,title,authors,abstract
0,Peer-Centered Versus Standard Physician-Center...,"[Krantz TE, Rogers RG, Petersen TR, Dunivan GC...",OBJECTIVES: Peer counseling may improve upon p...
1,ISOPT Clinical Hot Topic Panel Discussion on C...,"[Asbell PA, Aquavella JV, Hamrah P, Pepose JS,...",The cornea and its adnexa pose a unique situat...
2,Speed and quality goals in procedural skills l...,"[Cook DA, Gas BL, Pankratz VS, Farley DR, Pusi...",Purpose: Compare time (speed) and product qual...
3,Amphotericin B Penetrates into the Central Ner...,"[Petraitis V, Petraitiene R, Valdez JM, Pyrgos...",Hematogenous Candida meningoencephalitis (HCME...
4,A contemporary review of obstructive sleep apnea.,"[Ralls F, Cutchen L]",PURPOSE OF REVIEW: This review provides a cont...
5,[EXPRESS] Sustained Relief of Trigeminal Neuro...,"[Zhang M, Hu M, Montera MA, Westlund KN High]",The blood-brain (BBB) and blood-nerve barriers...
6,MutEx: a multifaceted gateway for exploring in...,"[Ping J, Oyebamiji O, Yu H, Ness S, Chien J, Y...",Somatic mutation and gene expression dysregula...
7,A Resting-State Network Comparison of Combat-R...,"[Vanasse TJ, Franklin C, Salinas FS, Ramage AE...",Resting-state functional connectivity (rsFC) i...
8,Organizational strategies to reduce physician ...,"[Olson K, Marchalik D, Farley H, Dean SM, Lawr...",Burnout is highly prevalent among physicians a...
9,"Early life risk factors of motor, cognitive an...","[Sania A, Sudfeld CR, Danaei G, Fink G, McCoy ...",OBJECTIVE: To determine the magnitude of relat...


In [None]:
print ("Title: ", papers['title'][11])
print ('\n')
print ("Abstract: ", papers['abstract'][11])

Title:  Can BDDCS illuminate targets in drug design?


Abstract:  The fact that pharmacokinetic (PK) properties of drugs influence their interaction with protein targets is a principle known for decades. The same cannot be said for the opposite, namely that targets influence the PK properties of drugs. Evidence confirming this possibility is introduced here for the first time, as we show that certain protein families have a clear preference for drugs with specific PK properties. We investigate this by cross-referencing 'druggable target' annotations for >1000 US Food and Drug Administration (FDA)-approved drugs with their PK profile, as defined by the Biopharmaceutics Drug Disposition Classification System (BDDCS) criteria, and then examine the BDDCS preference for several major target protein families and therapeutic categories. Our findings suggest a novel way to conduct drug discovery by focusing PK profiles at the very early stage of target selection.


In [None]:
# Function that cleans text by replacing '\x0c' and '\n' characters,
# as well as all other non-alphabetic characters, with spaces and
# finally converts everything to lower case
def clean_text(text):
    control_chars = ['\x0c', '\n']
    for char in control_chars:
        # str.replace returns a new string, so the result must be reassigned
        text = text.replace(char, ' ')
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()

# Create a column for cleaned Abstract and cleaned Title
papers['clean_abstract'] = papers['abstract'].apply(clean_text)
papers['clean_title'] = papers['title'].apply(clean_text)

papers.head()

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
0,Peer-Centered Versus Standard Physician-Center...,"[Krantz TE, Rogers RG, Petersen TR, Dunivan GC...",OBJECTIVES: Peer counseling may improve upon p...,objectives peer counseling may improve upon pr...,peer centered versus standard physician center...
1,ISOPT Clinical Hot Topic Panel Discussion on C...,"[Asbell PA, Aquavella JV, Hamrah P, Pepose JS,...",The cornea and its adnexa pose a unique situat...,the cornea and its adnexa pose a unique situat...,isopt clinical hot topic panel discussion on c...
2,Speed and quality goals in procedural skills l...,"[Cook DA, Gas BL, Pankratz VS, Farley DR, Pusi...",Purpose: Compare time (speed) and product qual...,purpose compare time speed and product quality...,speed and quality goals in procedural skills l...
3,Amphotericin B Penetrates into the Central Ner...,"[Petraitis V, Petraitiene R, Valdez JM, Pyrgos...",Hematogenous Candida meningoencephalitis (HCME...,hematogenous candida meningoencephalitis hcme ...,amphotericin b penetrates into the central ner...
4,A contemporary review of obstructive sleep apnea.,"[Ralls F, Cutchen L]",PURPOSE OF REVIEW: This review provides a cont...,purpose of review this review provides a conte...,a contemporary review of obstructive sleep apnea


In [None]:
print ("Title: ", papers['title'][4])
print ('\n')
print ("Abstract: ", papers['abstract'][4])

Title:  A contemporary review of obstructive sleep apnea.


Abstract:  PURPOSE OF REVIEW: This review provides a contemporary review of sleep apnea with emphasis on definitions, epidemiology, and consequences. RECENT FINDINGS: Amyloid beta-42 is one of the main peptides forming amyloid plaques in the brains of Alzheimer patients. Poorer sleep quality and shorter sleep duration have been associated with a higher amyloid burden. Decreased sleep time in the elderly is a precipitating factor in amyloid retention. Studies have shown that the dysregulation of the homeostatic balance of the major inhibitory and excitatory amino acid neurotransmitter systems of gamma-aminobutyric acid (GABA) and glutamate play a role in sleep disordered breathing (SDB). SUMMARY: Untreated sleep disordered breathing (obstructive sleep apnea and/or central sleep apnea) are an important cause of medical mortality and morbidity. OSA is characterized by recurrent episodes of partial or complete collapse of the uppe

In [None]:
print ("Title: ", papers['clean_title'][4])
print ('\n')
print ("Abstract: ", papers['clean_abstract'][4])

Title:  a contemporary review of obstructive sleep apnea 


Abstract:  purpose of review this review provides a contemporary review of sleep apnea with emphasis on definitions epidemiology and consequences recent findings amyloid beta is one of the main peptides forming amyloid plaques in the brains of alzheimer patients poorer sleep quality and shorter sleep duration have been associated with a higher amyloid burden decreased sleep time in the elderly is a precipitating factor in amyloid retention studies have shown that the dysregulation of the homeostatic balance of the major inhibitory and excitatory amino acid neurotransmitter systems of gamma aminobutyric acid gaba and glutamate play a role in sleep disordered breathing sdb summary untreated sleep disordered breathing obstructive sleep apnea and or central sleep apnea are an important cause of medical mortality and morbidity osa is characterized by recurrent episodes of partial or complete collapse of the upper airway during slee

In [None]:
# Build tf-idf matrix based on Abstract and Title.
# Use NLTK word_tokenize() and SnowballStemmer() to tokenize and stem
# document Title and Abstract

# Create the stemmer once instead of on every call
stemmer = nltk.stem.snowball.SnowballStemmer("english")

# Function that takes text, tokenizes it and returns a list of
# stemmed tokens longer than two characters
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stems = [stemmer.stem(t) for t in tokens]
    return [s for s in stems if len(s) > 2]

**Create a tf-idf vectorizer using sklearn TfidfVectorizer**

1. First we create the vectorizer, specifying the parameters
    * max_df is the maximum allowable document frequency for a token. It is set to 0.50 so that terms appearing in more than 50% of documents are ignored.
    * min_df is the minimum allowable document frequency for a token. It is set to 1 to include all terms, even those that appear in only one document.
    * max_features sets the maximum number of features allowed and is set to an arbitrarily large number (100,000 in the code below) so that the vocabulary is not truncated.
    * stop_words specifies the words/tokens to remove from the corpus
    * use_idf enables reweighting each feature by its inverse document frequency when set to True
    * tokenizer specifies which tokenizer to use. We want to tokenize and stem, so we pass it the tokenize_and_stem() function we created above. The default tokenizer would instead split the text into tokens of two or more alphanumeric characters, without stemming.
2. We then fit the vectorizer to our cleaned text using *vectorizer.fit_transform()*
3. The output is an n × m matrix, where n is the number of documents in our corpus and m is the number of features.
4. We can inspect the features using *vectorizer.get_feature_names_out()*
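To make the weighting concrete, here is a small hand computation of tf-idf on a toy corpus. It is a sketch that assumes the smoothed formula scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The three token lists and the helper name `tfidf` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical corpus, already tokenized
docs = [
    ["sleep", "apnea", "sleep"],
    ["sleep", "quality"],
    ["cornea", "tear"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

In the second document, "quality" appears in fewer documents than "sleep", so it receives the larger weight even though both occur once. This is the effect that lets tf-idf surface distinctive keywords.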

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create vectorizer for Abstracts. max_df is set to 0.5 so that terms
# appearing in more than 50% of the documents (very common terms) are ignored
abs_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=100000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Create vectorizer for Title. max_df is set to 0.5 so that terms
# appearing in more than 50% of the documents (very common terms) are ignored
title_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=1, max_features=100000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Compute TF-IDF weights for Abstracts
tfidf_weights_abs = abs_tfidf_vectorizer.fit_transform(papers['clean_abstract'])



In [None]:
# Compute TF-IDF weights for Title
tfidf_weights_title = title_tfidf_vectorizer.fit_transform(papers['clean_title'])

# Get feature names for Abstract and Title models
tfidf_features_title = title_tfidf_vectorizer.get_feature_names_out()
tfidf_features_abs = abs_tfidf_vectorizer.get_feature_names_out()

In [None]:
tfidf_features_abs

array(['aabb', 'aadc', 'aall', ..., 'zwisch', 'zygos', 'zymographi'],
      dtype=object)

**Write function to get the top-k features associated with a document**

In [None]:
# Function for returning the top_k features of an Abstract
# or Title
def get_top_features(rownum, weights, features, top_k=30):
    # Convert only the requested row to a dense vector, not the whole matrix
    weight_vec = weights[rownum].toarray().ravel()
    top_idx = np.argsort(weight_vec)[::-1][:top_k]
    return [features[i] for i in top_idx]

# Top k features of Abstract 1
get_top_features(1, tfidf_weights_abs, tfidf_features_abs)

['cornea',
 'discuss',
 'wet',
 'discomfort',
 'situat',
 'network',
 'neural',
 'seen',
 'maintain',
 'issu',
 'address',
 'tear',
 'moreso',
 'wind',
 'adnexa',
 'perfect',
 'blink',
 'corneal',
 'elabor',
 'engulf',
 'refract',
 'isopt',
 'desper',
 'hypercomplex',
 'util',
 'busi',
 'humid',
 'reader',
 'alik',
 'transpar']

In [None]:
# Top k features of Title 1
get_top_features(1, tfidf_weights_title, tfidf_features_title)

['hot',
 'discuss',
 'cornea',
 'anterior',
 'isopt',
 'segment',
 'topic',
 'panel',
 'diseas',
 'clinic',
 'extraglott',
 'factor',
 'ezh',
 'eye',
 'extrem',
 'zuni',
 'faculti',
 'extracellular',
 'extern',
 'extens',
 'extend',
 'express',
 'exposur',
 'expos',
 'exploratori',
 'extract',
 'failur',
 'faecal',
 'fail',
 'featur']
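A title contains only a handful of distinct terms, so the tail of the Title list above (for example 'extraglott' or 'faecal') appears to consist of zero-weight features that only fill the list up to top_k entries. One possible variant, sketched here with a hypothetical helper `get_nonzero_top_features` and a made-up toy matrix, keeps only terms whose weight is positive.

```python
import numpy as np
from scipy.sparse import csr_matrix

def get_nonzero_top_features(rownum, weights, features, top_k=30):
    """Return up to top_k features of one row, skipping zero-weight terms."""
    row = weights[rownum].toarray().ravel()
    order = np.argsort(row)[::-1][:top_k]
    return [features[i] for i in order if row[i] > 0]

# Toy weights: 2 documents x 4 features (values are illustrative only)
toy_weights = csr_matrix(np.array([[0.0, 0.8, 0.0, 0.6],
                                   [0.5, 0.0, 0.5, 0.0]]))
toy_features = ["apnea", "cornea", "sleep", "tear"]

top_terms = get_nonzero_top_features(0, toy_weights, toy_features)
print(top_terms)
```

With this variant a short title returns only the terms it actually contains, at the cost of a result list whose length varies from document to document.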

**Build Nearest Neighbors model using Abstract and Title TF-IDF matrices**

In [None]:

# Build model to return 5 closest neighbors
from sklearn.neighbors import NearestNeighbors

# Create the k-NN model using k=5
nn_abs = NearestNeighbors(n_neighbors=5, algorithm='auto')
nn_title = NearestNeighbors(n_neighbors=5, algorithm='auto')

# Fit the models to the TF-IDF weights matrix
nn_fitted_abs = nn_abs.fit(tfidf_weights_abs)
nn_fitted_title = nn_title.fit(tfidf_weights_title)

# Function to return the top-k nearest papers. Because the query paper is
# part of the fitted data, it is returned as its own nearest neighbor.

def find_nearest_papers(row, kNNmodel, tfidf_weights, tfidf_features, papers):
    keywords = get_top_features(row, tfidf_weights, tfidf_features)
    dist, idx = kNNmodel.kneighbors(tfidf_weights[row, :])
    idx = list(idx[0])
    return {'papers': papers.iloc[idx], 'keywords': keywords}
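Since the query paper belongs to the fitted data, a neighbor search normally returns the query itself at distance zero. The sketch below shows one way to drop it, using a four-sentence toy corpus; the corpus and the helper name `similar_excluding_self` are assumptions for this example, and the vectorizer uses scikit-learn defaults rather than the settings from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical mini-corpus
toy_docs = [
    "sleep apnea and sleep quality",
    "obstructive sleep apnea in adults",
    "corneal disease of the eye",
    "tear film and the cornea of the eye",
]

toy_weights = TfidfVectorizer().fit_transform(toy_docs)
toy_nn = NearestNeighbors(n_neighbors=3).fit(toy_weights)

def similar_excluding_self(row, model, weights):
    """Return neighbor row indices for one document, without the document itself."""
    dist, idx = model.kneighbors(weights[row])
    return [int(i) for i in idx[0] if i != row]

neighbors = similar_excluding_self(0, toy_nn, toy_weights)
print(neighbors)
```

To still obtain five recommendations after removing the query, one option is to request n_neighbors=6 when building the model.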

**Return papers based on Abstract similarity**

Now that we have a function to return similar papers, we can use it to find papers with similar abstracts. We can return Authors, Title, or Abstract of similar matches

In [None]:
find_nearest_papers(9, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)['papers']

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
9,"Early life risk factors of motor, cognitive an...","[Sania A, Sudfeld CR, Danaei G, Fink G, McCoy ...",OBJECTIVE: To determine the magnitude of relat...,objective to determine the magnitude of relati...,early life risk factors of motor cognitive and...
677,Predictors of Overweight and Obesity in Americ...,"[Adams AK, Tomayko EJ, A Cronin K, J Prince R,...",OBJECTIVE: To describe sociodemographic factor...,objective to describe sociodemographic factors...,predictors of overweight and obesity in americ...
114,Protecting children's health in a calorie-surp...,"[Cunningham SA, Chandrasekar EK, Cartwright K,...",Studies from the social and health sciences ha...,studies from the social and health sciences ha...,protecting children s health in a calorie surp...
396,Exposure to environmental toxicants and young ...,"[Davis AN, Carlo G, Gulseven Z, Palermo F, Lin...",Background Understanding the role of environme...,background understanding the role of environme...,exposure to environmental toxicants and young ...
538,Behavioral problems are associated with cognit...,"[Lowe JR, Fuller JF, Do BT, Vohr BR, Das A, Hi...",OBJECTIVE: To evaluate the relationship of par...,objective to evaluate the relationship of pare...,behavioral problems are associated with cognit...


**Return papers based on Title similarity**

Now that we have a function to return similar papers, we can use it to find papers with similar Titles. We can return Authors, Title, or Abstract of similar matches

In [None]:
find_nearest_papers(1, nn_fitted_title, tfidf_weights_title, tfidf_features_title, papers)['papers']

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
1,ISOPT Clinical Hot Topic Panel Discussion on C...,"[Asbell PA, Aquavella JV, Hamrah P, Pepose JS,...",The cornea and its adnexa pose a unique situat...,the cornea and its adnexa pose a unique situat...,isopt clinical hot topic panel discussion on c...
511,Racial variation in the complexity of coronary...,"[Elbadawi A, Alotaki E, Vazquez C, Barssoum K,...",BACKGROUND: Racial variations in presentation ...,background racial variations in presentation o...,racial variation in the complexity of coronary...
234,Critical developments of 2018: A review of the...,"[Cohn CS, Allen ES, Cushing MM, Dunbar NM, Fri...",BACKGROUND: The AABB compiles an annual synops...,background the aabb compiles an annual synopsi...,critical developments of a review of the liter...
266,Development and Validation of a MicroRNA Panel...,"[Ormseth MJ, Solus JF, Sheng Q, Ye F, Wu Q, Gu...",OBJECTIVE: MicroRNA (miRNA) are short noncodin...,objective microrna mirna are short noncoding r...,development and validation of a microrna panel...
665,A Simple Framework for Weighting Panels Across...,"[Kamnetz S, Trowbridge E, Lochner J, Koslov S,...",BACKGROUND: Health system redesign necessitate...,background health system redesign necessitates...,a simple framework for weighting panels across...


**Let's find similar articles using Abstract similarity**

In [None]:
title = "A contemporary review of obstructive sleep apnea." #provide actual name of a paper
papers[papers['title']==title]

Unnamed: 0,title,authors,abstract,clean_abstract,clean_title
4,A contemporary review of obstructive sleep apnea.,"[Ralls F, Cutchen L]",PURPOSE OF REVIEW: This review provides a cont...,purpose of review this review provides a conte...,a contemporary review of obstructive sleep apnea


In [None]:
nearest_papers = find_nearest_papers(4, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)
for i in nearest_papers['keywords']: print ("Keywords: ", i)

Keywords:  sleep
Keywords:  amyloid
Keywords:  osa
Keywords:  apnea
Keywords:  breath
Keywords:  elder
Keywords:  review
Keywords:  acid
Keywords:  gaba
Keywords:  excitatori
Keywords:  aminobutyr
Keywords:  qualiti
Keywords:  neurotransmitt
Keywords:  arous
Keywords:  hypoxemia
Keywords:  sdb
Keywords:  apneic
Keywords:  contemporari
Keywords:  increas
Keywords:  precipit
Keywords:  glutam
Keywords:  sympathet
Keywords:  alzheim
Keywords:  collaps
Keywords:  homeostat
Keywords:  disord
Keywords:  emphasi
Keywords:  fibril
Keywords:  untreat
Keywords:  puls


In [None]:
# Show the abstracts of similar papers
for i in nearest_papers['papers']['abstract']: print ("Abstract: "+i+"\n")

Abstract: PURPOSE OF REVIEW: This review provides a contemporary review of sleep apnea with emphasis on definitions, epidemiology, and consequences. RECENT FINDINGS: Amyloid beta-42 is one of the main peptides forming amyloid plaques in the brains of Alzheimer patients. Poorer sleep quality and shorter sleep duration have been associated with a higher amyloid burden. Decreased sleep time in the elderly is a precipitating factor in amyloid retention. Studies have shown that the dysregulation of the homeostatic balance of the major inhibitory and excitatory amino acid neurotransmitter systems of gamma-aminobutyric acid (GABA) and glutamate play a role in sleep disordered breathing (SDB). SUMMARY: Untreated sleep disordered breathing (obstructive sleep apnea and/or central sleep apnea) are an important cause of medical mortality and morbidity. OSA is characterized by recurrent episodes of partial or complete collapse of the upper airway during sleep followed by hypoxia and sympathetic act

Applying k-Nearest Neighbors to the TF-IDF weights matrix seems to be fairly effective at returning similar articles. The parameters chosen to build the TF-IDF and k-Nearest Neighbors models were somewhat arbitrary. It would be reasonable to assume that the accuracy of document retrieval could be improved if more time were invested in selecting optimal tuning parameters.

As a next step, try grouping the papers with K-means, agglomerative clustering and DBSCAN.
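As a starting point for that exercise, the following sketch clusters a toy corpus with scikit-learn's KMeans. The four documents, the choice of two clusters and the random seed are assumptions for illustration; on the real data the number of clusters would have to be chosen, for example by inspecting the top tf-idf terms of each cluster.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus with two obvious topics
toy_docs = [
    "sleep apnea airway sleep",
    "sleep apnea breathing",
    "cornea eye tear",
    "cornea eye blink",
]

toy_weights = TfidfVectorizer().fit_transform(toy_docs)

# n_clusters=2 and random_state=0 are arbitrary choices for this sketch
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy_weights)
print(labels)
```

The same tf-idf matrix can be passed to DBSCAN, while AgglomerativeClustering typically needs a dense array, which may not be practical for a large vocabulary.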